1 - Monitoring Addons

How to monitor addons.

1.1 - SLO Dashboards

Development teams are required to co-maintain, in conjunction with the MT-SRE Team, SLO Dashboards for the Addons they develop. This document explains how to bootstrap the dashboard creation and deployment.

First Dashboard

Create the following directory structure in the managed-tenants-slos repository:

.
├── <addon-name>
│   ├── dashboards
│   │   └── <addon-name>-slo-dashboard.configmap.yaml
│   └── OWNERS

Example OWNERS:

approvers:
- akonarde
- asegundo

<addon-name>-slo-dashboard.configmap.yaml contents (replace all occurrences of <addon-name>):

apiVersion: v1
kind: ConfigMap
metadata:
  name: <addon-name>-slo-dashboard
  labels:
    grafana_dashboard: "true"
  annotations:
    grafana-folder: /grafana-dashboard-definitions/Addons
data:
  mtsre-rhods-slos.json: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "4rNsqZfnz"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "custom": {
                "align": "auto",
                "displayMode": "auto",
                "inspect": false
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 16,
            "w": 3,
            "x": 0,
            "y": 0
          },
          "id": 2,
          "options": {
            "footer": {
              "fields": "",
              "reducer": [
                "sum"
              ],
              "show": false
            },
            "showHeader": true
          },
          "pluginVersion": "9.0.1",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "4rNsqZfnz"
              },
              "editorMode": "code",
              "expr": "group by (_id) (subscription_sync_total{name=\"${addon_name}\"})",
              "format": "table",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Clusters",
          "transformations": [
            {
              "id": "groupBy",
              "options": {
                "fields": {
                  "_id": {
                    "aggregations": [],
                    "operation": "groupby"
                  }
                }
              }
            }
          ],
          "type": "table"
        }
      ],
      "schemaVersion": 36,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "hide": 2,
            "name": "addon_name",
            "query": "addon-<addon-name>",
            "skipUrlSync": false,
            "type": "constant"
          }
        ]
      },
      "time": {
        "from": "now-6h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "<addon-name> - SLO Dashboard",
      "version": 0,
      "weekStart": ""
    }    
  • Create a Merge Request adding the files to the managed-tenants-slos git repository.
  • Ping @mt-sre-ic in the #forum-managed-tenants Slack channel for review.
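
Before opening the Merge Request, you can optionally sanity-check the manifest locally. A minimal sketch, assuming yq (v4) and jq are installed:

# Validate that the embedded dashboard is well-formed JSON
yq '.data."mtsre-rhods-slos.json"' \
  <addon-name>/dashboards/<addon-name>-slo-dashboard.configmap.yaml | jq empty \
  && echo "dashboard JSON OK"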

Dashboard Deployment

Merging of the above merge request is a prerequisite for this step.

The dashboard deployment happens through app-interface, using saas-files.

  • For each new Addon, we need to create a new saas-file in app-interface.
  • Give ownership of the saas-file to your team using an app-interface role file.

Example Merge Request content to app-interface:

https://gitlab.cee.redhat.com/service/app-interface/-/commit/9306800aabaca18cd034dfb3933a12d29506fa08

  • Ping @mt-sre-ic in the #forum-managed-tenants Slack channel for approval.
  • Merge Requests to app-interface are constantly reviewed/merged by AppSRE. After the MT-SRE approval, wait until the Merge Request is merged.
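
For orientation only, a saas-file for dashboard ConfigMaps roughly follows the shape sketched below. Treat the field values, paths, and pipelines provider as assumptions and copy the real structure from the example Merge Request linked above:

# Sketch of an app-interface saas-file (illustrative values and paths)
$schema: /app-sre/saas-file-2.yml
name: saas-<addon-name>-dashboards
app:
  $ref: /services/<addon-name>/app.yml
pipelinesProvider:
  $ref: /services/app-sre/pipelines/tekton-provider.yml
managedResourceTypes:
- ConfigMap
resourceTemplates:
- name: <addon-name>-slo-dashboard
  url: https://gitlab.cee.redhat.com/service/managed-tenants-slos
  path: /<addon-name>/dashboards
  targets:
  - namespace:
      $ref: /services/addons/namespaces/app-sre-stage-01.yml
    ref: main
  - namespace:
      $ref: /services/addons/namespaces/app-sre-prod-01.yml
    ref: <git commit sha>   # production is pinned to a specific commit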

Accessing the Dashboards

Once the app-interface merge request is merged, you will see your ConfigMaps being deployed in the #sd-mt-sre-info Slack channel. For example:

[app-sre-stage-01] ConfigMap odf-ms-cluster-status applied
...
[app-sre-prod-01] ConfigMap odf-ms-cluster-status applied

Once the dashboards are deployed, you can see them here:

Development Flow

After all the configuration is in place:

STAGE:

  • Dashboards on the STAGE Grafana instance should not be used by anyone other than the people developing the dashboards.
  • Changes in the managed-tenants-slos repository can be merged by the development team with “/lgtm” comments from those in the OWNERS file.
  • Once merged, changes are automatically delivered to the STAGE Grafana instance.

PRODUCTION:

  • The dashboards on the PRODUCTION Grafana are pinned to a specific git commit from the managed-tenants-slos repository in the corresponding saas-file in app-interface.
  • After patching the git commit in the saas-file, owners of the saas-file can merge the promotion with a “/lgtm” comment in the app-interface Merge Request.

1.2 - Dead Man's Snitch Operator Integration

Overview

Dead Man’s Snitch (DMS) is essentially a constantly firing Prometheus alert and an external receiver (called a snitch) that will alert should the monitoring stack go down and stop sending alerts. The snitch URLs are generated dynamically by the DMS operator, which runs on Hive and is owned by SREP. The snitch URL shows up in a secret.

Usage

The Add-On metadata file (addon.yaml) allows you to provide a deadmanssnitch field (see the deadmanssnitch field in the Add-On metadata file schema documentation for more information). This field allows you to provide the required Dead Man’s Snitch integration configuration. A DeadmansSnitchIntegration resource is then created and applied to Hive alongside the Add-On SelectorSyncSet (SSS).

DeadmansSnitchIntegration Resource

The following default DMS configuration is created if you only specify the bare minimum fields under the deadmanssnitch field in the addon metadata:

- apiVersion: deadmanssnitch.managed.openshift.io/v1alpha1
  kind: DeadmansSnitchIntegration
  metadata:
    name: addon-{{ADDON.metadata['id']}}
    namespace: deadmanssnitch-operator
  spec:
    clusterDeploymentSelector: ## can be overridden by .deadmanssnitch.clusterDeploymentSelector field in addon metadata
      matchExpressions:
      - key: {{ADDON.metadata['label']}}
        operator: In
        values:
        - "true"

    dmsAPIKeySecretRef: ## fixed
      name: deadmanssnitch-api-key
      namespace: deadmanssnitch-operator

    snitchNamePostFix: {{ADDON.metadata['id']}} ## can be overridden by .deadmanssnitch.snitchNamePostFix field in addon metadata

    tags: {{ADDON.metadata['deadmanssnitch']['tags']}} ## Required

    targetSecretRef:
      ## can be overridden by .deadmanssnitch.targetSecretRef.name field in addon metadata
      name: {{ADDON.metadata['id']}}-deadmanssnitch
      ## can be overridden by .deadmanssnitch.targetSecretRef.namespace field in addon metadata
      namespace: {{ADDON.metadata['targetNamespace']}}

Examples of deadmanssnitch field in addon.yaml

id: ocs-converged
....
....
deadmanssnitch:
  tags: ["ocs-converged-stage"]
....
id: managed-odh
....
....
deadmanssnitch:
  snitchNamePostFix: rhods
  tags: ["rhods-integration"]
  targetSecretRef:
    name: redhat-rhods-deadmanssnitch
    namespace: redhat-ods-monitoring
....
id: managed-api-service-internal
....
....
deadmanssnitch:
  clusterDeploymentSelector:
    matchExpressions:
    - key: "api.openshift.com/addon-managed-api-service-internal"
      operator: In
      values:
      - "true"
    - key: "api.openshift.com/addon-managed-api-service-internal-delete"
      operator: NotIn
      values:
      - 'true'
  snitchNamePostFix: rhoam
  tags: ["rhoam-production"]
  targetSecretRef:
    name: redhat-rhoami-deadmanssnitch
    namespace: redhat-rhoami-operator

Generated Secret

A secret will be generated (by default in the same namespace as your addon) containing the SNITCH_URL. Your add-on will need to pick up the generated secret in the cluster and inject it into your Alertmanager configuration. Example of an in-cluster created secret:

kind: Secret
apiVersion: v1
metadata:
  namespace: redhat-myaddon-operator
  labels:
    hive.openshift.io/managed: 'true'
data:
  SNITCH_URL: #url like https://nosnch.in/123123123
type: Opaque
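
How you consume the secret is up to your add-on. One possible sketch (names are illustrative, and the secret name follows the default targetSecretRef described above) is to expose SNITCH_URL as an environment variable to whatever component renders your Alertmanager configuration:

# Illustrative only: expose SNITCH_URL to the component that renders the Alertmanager config
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myaddon-monitoring
  namespace: redhat-myaddon-operator
spec:
  selector:
    matchLabels:
      app: myaddon-monitoring
  template:
    metadata:
      labels:
        app: myaddon-monitoring
    spec:
      containers:
      - name: config-renderer
        image: quay.io/<my-repository>/myaddon-monitoring:latest
        env:
        - name: SNITCH_URL
          valueFrom:
            secretKeyRef:
              name: <addon-id>-deadmanssnitch   # default targetSecretRef name
              key: SNITCH_URL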

Alert

Your Alertmanager configuration will need a constantly firing alert that is routed to DMS. Example of an alert that always fires:

- name: DeadManSnitch
  interval: 1m
  rules:
    - alert: DeadManSnitch
      expr: vector(1)
      labels:
        severity: critical
      annotations:
        description: This is a DeadManSnitch to ensure RHODS monitoring and alerting pipeline is online.
        summary: Alerting DeadManSnitch

Route

Example of a route that forwards the firing-alert to DMS:

- match:
    alertname: DeadManSnitch
  receiver: deadman-snitch
  repeat_interval: 5m

Receiver

Example receiver for DMS:

- name: 'deadman-snitch'
  webhook_configs:
  - url: '<snitch_url>?m=just+checking+in'
    send_resolved: false
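
Putting the pieces together, a minimal Alertmanager configuration sketch could look like the following (the snitch URL is the SNITCH_URL value from the generated secret; the default receiver name is illustrative):

route:
  receiver: default
  routes:
  - match:
      alertname: DeadManSnitch
    receiver: deadman-snitch
    repeat_interval: 5m
receivers:
- name: default
- name: 'deadman-snitch'
  webhook_configs:
  - url: '<snitch_url>?m=just+checking+in'
    send_resolved: false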

Please log a JIRA with your assigned SRE team to have this completed at least one week before going live.

Current Example

1.3 - PagerDuty Integration

The PagerDuty integration is configured in the pagerduty field in the addon.yaml metadata file. Given this configuration, a secret with the specified name is created in the specified namespace by the PagerDuty Operator, which runs on Hive. The secret contains the PAGERDUTY_KEY.
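
As a hedged sketch of what this looks like end to end (the exact pagerduty sub-fields are defined by the Add-On metadata file schema; the field names and values below are illustrative only):

pagerduty:
  escalationPolicy: ABC123            # PagerDuty escalation policy ID (illustrative)
  acknowledgeTimeout: 0
  resolveTimeout: 0
  secretName: myaddon-pagerduty
  secretNamespace: redhat-myaddon-operator

The PagerDuty Operator then creates a secret along these lines in the target namespace:

apiVersion: v1
kind: Secret
metadata:
  name: myaddon-pagerduty
  namespace: redhat-myaddon-operator
type: Opaque
data:
  PAGERDUTY_KEY: # base64-encoded PagerDuty integration key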

1.4 - OCM SendGrid Service Integration

OCM SendGrid Service is an event-driven service that manages SendGrid subuser accounts and credential bundles based on addon cluster logs.

The secret name and namespace are configured in app-interface; see this section of the documentation.

2 - Testing Addons

How to test addons.

2.1 - Installing a specific version of an Addon in a staging environment

Add-on services are typically installed using the OpenShift Cluster Manager web console, by selecting the specific addon from the Add-ons tab and clicking Install. However, only the latest version of an addon service can be installed using the OpenShift Cluster Manager console.

In some cases, you might need to install an older version of an addon, for example, to test the upgrade of an addon from one version to the next. Follow this procedure to install a specific version of an addon service in a staging environment.

IMPORTANT: Installing an addon service using this procedure is only recommended for testing upgrades in a staging environment and is not supported for customer-facing production workloads.

Prerequisites

Procedure

  1. Create a JSON file with the addon service and addon version that you want to install. In this example, the JSON file is install-payload.json, the addon id is reference-addon, and the version we want to install is 0.6.7.

    Example

    {
      "addon": {
        "id": "reference-addon"
      },
      "addon_version": {
        "id": "0.6.7"
      }
    }
    

    NOTE: If the addon that you are installing has a required parameter, ensure that you add it to the JSON file. For instance, the managed-odh addon, which is shown in the example below, requires the parameter notification-email to be included.

    Example

    {
      "addon": {
        "id": "managed-odh"
      },
      "addon_version": {
        "id": "1.23.0"
      },
      "parameters": {
        "items": [
          {
            "id": "notification-email",
            "value": "me@somewhere.com"
          }
        ]
      }
    }
    
  2. Set the CLUSTER_ID environment variable:

    export CLUSTER_ID=<your_cluster_internal_id>
    
  3. Run the following API request to install the addon:

    ocm post /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addons --body install-payload.json
    
  4. Verify the addon installation:

    1. Log into your cluster:

      oc login
      
    2. Run the oc get addons command to view the addon installation status:

      $ oc get addons
      NAME              STATUS   AGE
      reference-addon   Pending  10m
      
    3. Optionally, run the watch command to watch the addon installation status:

      $ watch oc get addons
      NAME                 STATUS    AGE
      reference-addon      Ready     32m
      
  5. If you do not want the addon to automatically upgrade to the latest version after installation, delete the addon upgrade policy before the addon installation completes.

    1. List the upgrade policies:

      Example

      $ ocm get /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies
      {
      "kind": "AddonUpgradePolicyList",
      "page": 1,
      "size": 1,
      "total": 1,
      "items": [
       {
        "kind": "AddonUpgradePolicy",
        "id": "991a69a5-ce33-11ed-9dda-0a580a8308f5",
        "href": "/api/clusters_mgmt/v1/clusters/22ogsfo8kd36bk280b6bqbi7l03micmm/addon_upgrade_policies/991a69a5-ce33-11ed-9dda-0a580a8308f5",
        "schedule": "0,15,30,45 * * * *",
        "schedule_type": "automatic",
        "upgrade_type": "ADDON",
        "version": "",
        "next_run": "2023-03-29T19:30:00Z",
        "cluster_id": "22ogsfo8kd36bk280b6bqbi7l03micmm",
        "addon_id": "reference-addon"
       }
      ]
      }
      
    2. Delete the addon upgrade policy:

      Syntax

      ocm delete /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies/<addon_upgrade_policy_id>
      

      Example

      ocm delete /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies/991a69a5-ce33-11ed-9dda-0a580a8308f5
      
    3. Verify the upgrade policy no longer exists:

      Syntax

      ocm get /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies | grep <addon_upgrade_policy_id>
      

      Example

      ocm get /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies | grep 991a69a5-ce33-11ed-9dda-0a580a8308f5
      
  6. Review the addon installation status and version:

    Example

    $ oc get addons reference-addon -o yaml
    apiVersion: addons.managed.openshift.io/v1alpha1
    kind: Addon
    metadata:
      annotations:
        ...
      creationTimestamp: "2023-03-20T19:07:08Z"
      finalizers:
      - addons.managed.openshift.io/cache
      ...
    spec:
      displayName: Reference Addon
      ...
      pause: false
      version: 0.6.7
    status:
      conditions:
      - lastTransitionTime: "2023-03-20T19:08:10Z"
        message: ""
        observedGeneration: 2
        reason: FullyReconciled
        status: "True"
        type: Available
      - lastTransitionTime: "2023-03-20T19:08:10Z"
        message: Addon has been successfully installed.
        observedGeneration: 2
        reason: AddonInstalled
        status: "True"
        type: Installed
      lastObservedAvailableCSV: redhat-reference-addon/reference-addon.v0.6.7
      observedGeneration: 2
      observedVersion: 0.6.7
      phase: Ready
    

    In this example, you can see the addon version is set to 0.6.7 and AddonInstalled status is True.

  7. (Optional) If needed, recreate the addon upgrade policy manually.

    1. Create a JSON file with the addon upgrade policy information.

      Example of automatic upgrade

      {
        "kind": "AddonUpgradePolicy",
        "addon_id": "reference-addon",
        "cluster_id": "$CLUSTER_ID",
        "schedule_type": "automatic",
        "upgrade_type": "ADDON"
      }
      

      Example of manual upgrade

      {
        "kind": "AddonUpgradePolicy",
        "addon_id": "reference-addon",
        "cluster_id": "$CLUSTER_ID",
        "schedule_type": "manual",
        "upgrade_type": "ADDON",
        "version": "0.7.0"
      }
      

      In the example above, the schedule_type for the reference-addon is set to manual and the version to upgrade to is set to 0.7.0. The upgrade policy will execute once and the addon will upgrade to version 0.7.0.

    2. Run the following API request to create the addon upgrade policy:

      Syntax

      ocm post /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies --body <your_json_filename>
      

      Example

      ocm post /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies --body reference-upgrade-policy.json
      
    3. Verify the upgrade policy exists:

      Syntax

       ocm get /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies | jq '.items[] | select(.addon_id=="<addon_id>")'
      

      Example

      ocm get /api/clusters_mgmt/v1/clusters/$CLUSTER_ID/addon_upgrade_policies | jq '.items[] | select(.addon_id=="reference-addon")'
      

Useful commands

  • Get a list of available addons:

    ocm get /api/clusters_mgmt/v1/addons | jq '.items[].id'
    
  • Get a list of available versions to install for a given addon id:

    Syntax

    ocm get /api/clusters_mgmt/v1/addons/<addon-id>/versions | jq '.items[].id'
    

    Example

    $ ocm get /api/clusters_mgmt/v1/addons/reference-addon/versions | jq '.items[].id'
    "0.0.0"
    "0.1.5"
    "0.1.6"
    "0.2.2"
    "0.3.0"
    "0.3.1"
    "0.3.2"
    "0.4.0"
    "0.4.1"
    "0.5.0"
    "0.5.1"
    "0.6.0"
    "0.6.1"
    "0.6.2"
    "0.6.3"
    "0.6.4"
    "0.6.5"
    "0.6.6"
    "0.6.7"
    "0.7.0"
    

2.2 - Testing With OCP (Without OCM)

Testing Without OCM

During the development process, it might be useful (and cheaper) to run your addon on an OCP cluster.

You can spin up an OCP cluster on your local machine using CRC.
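
For example, assuming CRC is already installed:

$ crc setup
$ crc start
$ eval $(crc oc-env)
$ oc login -u kubeadmin https://api.crc.testing:6443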

OCP and OSD differ in one important aspect: OCP gives you full administrative access, while OSD restricts administrative actions. However, Managed Tenants applies resources to OSD as an unrestricted admin, just like you can with your OCP cluster, so OCP is a good OSD mockup for our use case.

By doing this, you’re skipping:

  • OCM and SKU management
  • Hive

First, you have to build your catalog. Let's take managed-odh as an example:

$ managedtenants --environment=stage --addons-dir addons --dry-run run --debug tasks/deploy/10_build_push_catalog.py:managed-odh
Loading stage...
Loading stage OK
== TASKS =======================================================================
tasks/deploy/10_build_push_catalog.py:BuildCatalog:managed-odh:stage...
 -> creating the temporary directory
 -> /tmp/managed-odh-stage-1bkjtsea
 -> generating the bundle directory
 -> generating the bundle package.yaml
 -> building the docker image
 -> ['docker', 'build', '-f', PosixPath('/home/apahim/git/managed-tenants/Dockerfile.catalog'), '-t', 'quay.io/osd-addons/opendatahub-operator:stage-91918fe', PosixPath('/tmp/managed-odh-stage-1bkjtsea')]
tasks/deploy/10_build_push_catalog.py:BuildCatalog:managed-odh:stage OK
tasks/deploy/10_build_push_catalog.py:PushCatalog:managed-odh:stage...
 -> pushing the docker image
 -> ['docker', '--config', '/home/apahim/.docker', 'push', 'quay.io/osd-addons/opendatahub-operator:stage-91918fe']
tasks/deploy/10_build_push_catalog.py:PushCatalog:managed-odh:stage OK

That command has built the image quay.io/osd-addons/opendatahub-operator:stage-91918fe on your local machine.

You can inspect the image with:

$ docker run --rm -it --entrypoint "bash"  quay.io/osd-addons/opendatahub-operator:stage-91918fe -c "ls manifests/"
0.8.0  1.0.0-experiment  managed-odh.package.yml

$ docker run --rm -it --entrypoint "bash"  quay.io/osd-addons/opendatahub-operator:stage-91918fe -c "cat manifests/managed-odh.package.yml"
channels:
- currentCSV: opendatahub-operator.1.0.0-experiment
  name: beta
defaultChannel: beta
packageName: managed-odh

Next, tag and push that image to a public registry repository of yours:

$ docker tag quay.io/osd-addons/opendatahub-operator:stage-91918fe quay.io/<my-repository>/opendatahub-operator:stage-91918fe
$ docker push quay.io/<my-repository>/opendatahub-operator:stage-91918fe
Getting image source signatures
Copying blob 9fbc4a1ed0b0 done
Copying blob c4d8f7894b7d skipped: already exists
Copying blob 61598d8d1b24 skipped: already exists
Copying blob 38ada4bcd26f skipped: already exists
Copying blob d5fdf1f627c8 skipped: already exists
Copying blob 2bf094d88b12 skipped: already exists
Copying blob 8a6c7bacb5db done
Copying config 3088e48540 done
Writing manifest to image destination
Copying config 3088e48540 [--------------------------------------] 0.0b / 3.6KiB
Writing manifest to image destination
Writing manifest to image destination
Storing signatures

Now we have to apply the OpenShift resources that will install the operator in the OCP cluster. You can use the managedtenants command to generate the stage SelectorSyncSet and look at it for reference:

$ managedtenants --environment=stage --addons-dir addons --dry-run run --debug tasks/generate/99_generate_SelectorSyncSet.py
Loading stage...
Loading stage OK
== POSTTASKS ===================================================================
tasks/generate/99_generate_SelectorSyncSet.py:GenerateSSS:stage...
 -> Generating SSS template /home/apahim/git/managed-tenants/openshift/stage.yaml
tasks/generate/99_generate_SelectorSyncSet.py:GenerateSSS:stage OK

Here’s the SelectorSyncSet snippet we are interested in:

---
- apiVersion: hive.openshift.io/v1
  kind: SelectorSyncSet
  metadata:
    name: addon-managed-odh
  spec:
    clusterDeploymentSelector:
      matchLabels:
        api.openshift.com/addon-managed-odh: "true"
    resourceApplyMode: Sync
    resources:
      - apiVersion: v1
        kind: Namespace
        metadata:
          annotations:
            openshift.io/node-selector: ""
          labels: null
          name: redhat-opendatahub
      - apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        metadata:
          name: addon-managed-odh-catalog
          namespace: openshift-marketplace
        spec:
          displayName: Managed Open Data Hub Operator
          image: quay.io/osd-addons/opendatahub-operator:stage-${IMAGE_TAG}
          publisher: OSD Red Hat Addons
          sourceType: grpc
      - apiVersion: operators.coreos.com/v1alpha2
        kind: OperatorGroup
        metadata:
          name: redhat-layered-product-og
          namespace: redhat-opendatahub
      - apiVersion: operators.coreos.com/v1alpha1
        kind: Subscription
        metadata:
          name: addon-managed-odh
          namespace: redhat-opendatahub
        spec:
          channel: beta
          name: managed-odh
          source: addon-managed-odh-catalog
          sourceNamespace: openshift-marketplace

Our OpenShift manifest to be applied to the OCP cluster looks as follows:

kind: List
metadata: {}
apiVersion: v1
items:
  - apiVersion: v1
    kind: Namespace
    metadata:
      name: redhat-opendatahub
  - apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: addon-managed-odh-catalog
    spec:
      displayName: Managed Open Data Hub Operator
      image: quay.io/<my-repository>/opendatahub-operator:stage-91918fe
      publisher: OSD Red Hat Addons
      sourceType: grpc
  - apiVersion: operators.coreos.com/v1alpha2
    kind: OperatorGroup
    metadata:
      name: redhat-layered-product-og
  - apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: addon-managed-odh
    spec:
      channel: beta
      name: managed-odh
      source: addon-managed-odh-catalog
      sourceNamespace: openshift-marketplace

Finally, apply it to the OCP cluster:

$ oc apply -f manifest.yaml
Namespace/redhat-opendatahub created
CatalogSource/addon-managed-odh-catalog created
Subscription/addon-managed-odh created
OperatorGroup/redhat-layered-product-og created

Your operator should be installed in the cluster.
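
You can verify the installation with the standard OLM resources, for example:

$ oc get catalogsource addon-managed-odh-catalog -n openshift-marketplace
$ oc get subscription addon-managed-odh -n redhat-opendatahub
$ oc get csv -n redhat-opendatahub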

2.3 - Testing With OSD-E2E

Testing With OSD-E2E

All Add-Ons must have a reference to a test harness container in a publicly available repository. The Add-On development team is responsible for creating and maintaining the test harness image. That image is generated by the OSD e2e process.

The test harness will be tested against OCP nightly and OSD next.

Please refer to the OSD-E2E Add-On Documentation for more details on how this test harness will be run and how it is expected to report results.
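
The reference itself lives in the Add-On metadata file. A minimal hedged sketch (the image path is illustrative):

# addon.yaml
id: reference-addon
# ...other addon metadata fields...
testHarness: quay.io/<your-org>/reference-addon-test-harness:latest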

Primer into OSD E2E tests and prow jobs

To validate that an addon can be easily and successfully installed on a customer’s cluster, we have prow jobs set up that run e2e tests (one test suite per addon) every 12 hours. If the e2e tests for an addon fail, automated alerts/notifications are sent to the addon team. Every addon’s e2e tests are packaged in an image called “testHarness”, which is built and pushed to quay.io by the team maintaining the addon. Once the “testHarness” image is built and pushed, the team must register their addon to the testHarness image’s e2e tests by making a PR against this file.

You can access the portal for prow jobs here. The prow jobs follow the steps below to run the e2e tests. For every e2e test defined inside this file:

  • An OSD cluster is created and the addon being tested is installed. The OpenShift API is used to perform these operations via the API definition provided at https://api.openshift.com
  • The e2e prow job definition for the addon is parsed from this file, along with the parameters required to run its e2e tests.
  • The “testHarness” image for the addon is pulled and executed with the parameters fetched in the previous step.
  • If an MT-SRE team member notices those tests failing, they should notify the respective team to take a look at them and fix them.

3 - Top Level Operator

Top Level Operator.

3.1 - Customer Notifications

Status Page

https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/docs/app-sre/statuspage.md

https://service.pages.redhat.com/dev-guidelines/docs/appsre/advanced/statuspage/

Service Logs

https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/MT-SRE/sops/mt-sre-customer-notification.md

Internal Email

There are multiple ways a user or group can get notified of service events (e.g. planned maintenance, outages). There are two fields in the addon metadata file (see Add-On metadata file schema documentation for more information) where email addresses can be provided:

  • addonOwner: REQUIRED Point of contact for communications from Service Delivery to addon owners. Where possible, this should be a development team mailing list (rather than an individual developer).
  • addonNotifications: This is a list of additional email addresses of employees who would like to receive notifications about a service.

There is also a mailing list that receives notifications for all services managed by Service Delivery. Subscribe to the sd-notifications mailing list here.
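
In addon.yaml, these fields could look like the following sketch (addresses are placeholders):

addonOwner: myaddon-dev-team@redhat.com
addonNotifications:
- alice@redhat.com
- bob@redhat.com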

3.2 - Dependencies

This document describes the supported implementation for Addon dependencies, as signed-off by the Managed Tenants SRE Team.

Dependencies Specification

  • Addons must specify dependencies using the OLM dependencies feature, documented here
  • The dependencies must be pinned to a specific version. Ranges are not allowed.
  • The dependencies must come from a Trusted Catalog. See the Trusted Catalogs section for details.
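
For reference, OLM dependencies are declared in the bundle's metadata/dependencies.yaml file. A minimal sketch that pins an exact version, as required above (package name and version are illustrative):

dependencies:
- type: olm.package
  value:
    packageName: some-dependency-operator
    version: "1.2.3"   # exact version only; ranges are not allowed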

Trusted Catalogs

The Addon and its dependencies must come from Trusted Catalogs. Trusted Catalogs are those with content published by the Managed Services Pipelines, implemented by CPaaS, or by the Managed Tenants SRE Team.

Trusted Catalogs List

  • Addon catalog: the catalog created by the Managed Tenants SRE Team, for the purpose of releasing the Addon. Dependency bundles can be shipped in the same catalog of the Addon. The Addon catalog is considered “trusted” for the dependencies it carries.
  • Red Hat Operators catalog: the catalog content goes through the Managed Services Pipelines, the same process used to build some Addons themselves, just with a different release process. This catalog is considered “trusted” and can be used for dependencies.

Including a Catalog in the Trusted List

  • Make sure that the catalog is available on OSD and its content is released through the Managed Services Pipelines, implemented by CPaaS.
  • Create a Jira ticket in the MT-SRE Team backlog, requesting the assessment of the OSD catalog you want to consider as “trusted”.

Issues

There’s a feature request to the OLM Team to allow specifying the CatalogSource used for the dependencies:

3.3 - Environments

Mandatory environments

Add-ons are normally deployed to two environments:

  • ocm stage: development/testing - All add-ons must deploy to this environment before being released to production.
  • ocm production: once the deployment in stage has been reviewed, accepted, and approved, it can be promoted to production via /lgtm by your SRE team.

We recommend the ocm stage and ocm production add-on metadata be as similar as possible.

SLOs

  • ocm stage has no SLO and operates with best-effort support from Add-on SRE, SREP, and App-SRE.
  • osd stage clusters have no SLO and operate with best-effort support from Add-on SRE, SREP, and App-SRE.
  • ocm production environments are subject to App-SRE SLOs.
  • osd production cluster environments are subject to OSD SLOs.

Additional Environments (via duplicate add-ons)

Some add-on providers have had use cases that require additional add-on environments. While we only have ocm stage and ocm prod, managed-tenants may be leveraged to deploy an additional, duplicate add-on (for example, an edge or internal variant). Today we don’t recommend this practice because it requires cloning all add-on metadata, which increases the risk of incorrect metadata going to production/customer clusters.

If you need to do the above, please reach out to your assigned SRE team for guidance first.

3.4 - Plug and Play Addon

Package Operator

Package Operator is a Kubernetes Operator for packaging and managing a collection of arbitrary Kubernetes objects.

Each addon with a packageOperator defined in its spec will have a corresponding ClusterObjectTemplate. The ClusterObjectTemplate is an API defined in Package Operator, enabling users to create an object by templating a manifest and injecting values retrieved from other arbitrary source objects. However, regular users typically do not need to interact with the ClusterObjectTemplate. Instead, they can interact with the generated ClusterPackage manifest.

Example of a ClusterPackage manifest:

apiVersion: package-operator.run/v1alpha1
kind: ClusterPackage
metadata:
  name: <addon_name>
spec:
  image: <addon.spec.packageOperator>
  config:
    addonsv1:
      clusterID: a440b136-b2d6-406b-a884-fca2d62cd170
      deadMansSnitchUrl: https://example.com/test-snitch-url
      ocmClusterID: abc123
      ocmClusterName: asdf
      pagerDutyKey: 1234567890ABCDEF
      parameters:
        foo1: bar
        foo2: baz
      targetNamespace: pko-test-ns-00-req-apy-dsy-pdy
  • The deadMansSnitchUrl and pagerDutyKey are obtained from the ConfigMaps using their default names and locations. IMPORTANT: To successfully inject the deadMansSnitchUrl and pagerDutyKey values into the ClusterPackage manifest, you must keep the default naming scheme and location of the corresponding ConfigMaps. See the addons deadMansSnitch and addons pagerDuty documentation for more information.

  • Additionally, all the values present in .spec.config.addonsv1 can be injected into the objects within your packageImage. See the package operator documentation for more information.
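
As a rough sketch of what that injection can look like inside a packaged object (Package Operator uses Go templating; the exact template paths should be verified against the package operator documentation):

# Illustrative templated manifest inside the packageImage
apiVersion: v1
kind: ConfigMap
metadata:
  name: addon-runtime-config
  namespace: "{{ .config.addonsv1.targetNamespace }}"
data:
  deadMansSnitchUrl: "{{ .config.addonsv1.deadMansSnitchUrl }}"
  pagerDutyKey: "{{ .config.addonsv1.pagerDutyKey }}"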

Tenants Onboarding Steps

Although you can generate the packageImage yourself using the package operator documentation, we recommend you use the Managed Tenants Bundles (MTB) facilities.

The following steps are an example of generating the packageImage for the reference-addon package using the MTB flow:

  1. In the MTB repository, create a package directory and add the manifests.yaml inside it. See the following merge request for an example.

  2. The MTB CI creates the packageImage and the Operator Lifecycle Manager (OLM) Index Image as part of the team’s addon folder.

  3. The MTB CI creates a merge request to the managed-tenants repository and adds a new AddonImageSet with the PackageImage and OLM Index images.

4 - managed-tenants Repository

managed-tenants Repository.

Addons are deployed through GitOps pipelines. Most of the configuration for Addons can be found in the managed-tenants repository. See the create an addon documentation page for a good starting point.

5 - SKU

How to request a SKU for your addon.

NOTE: MT-SRE does not influence SKU creation/priorities. You must work with OCM directly for this.

Requesting a SKU

To request a SKU, please complete the following steps:

  • Determine a unique quota ID for the addon. This should be lowercase with dashes and of the format addon-<addon-name>. For example: addon-prow-operator
  • Create a JIRA Request at Openshift Cluster Manager with the subject Request for new Add-On SKU in OCM and the following information:
    • Add-On name.
    • Add-On owner.
    • Requested Add-On unique quota ID.
    • Additional information that would help qualify the ask, including goals, timelines, etc., you might have in mind.
  • You will need at least your PM and the OCM PMs to sign off before the SKU is created. We expect to resolve these requests within 7 working days.

Requesting SKU Attributes Changes

From time to time you may want to update some SKU fields like supported cloud providers, quota cost, product support, etc. To do this:

  • Create a JIRA Request at Openshift Cluster Manager
  • Ping the ticket in #service-development-b Slack channel (@sd-b-team is the handle)
  • This requires an update to be committed in-code in AMS, then deployed to stage and eventually prod (allow up to 7 working days).

Current Status

To check current SKUs and attributes, see OCM Resource Cost Mappings.