1 - SLO Dashboards

Development teams are required to co-maintain, together with the MT-SRE Team, the SLO Dashboards for the Addons they develop. This document explains how to bootstrap dashboard creation and deployment.

First Dashboard

Create the following directory structure in the managed-tenants-slos git repository:

.
├── <addon-name>
│   ├── dashboards
│   │   └── <addon-name>-slo-dashboard.configmap.yaml
│   └── OWNERS

Example OWNERS:

approvers:
- akonarde
- asegundo

<addon-name>-slo-dashboard.configmap.yaml contents (replace all occurrences of <addon-name>):

apiVersion: v1
kind: ConfigMap
metadata:
  name: <addon-name>-slo-dashboard
  labels:
    grafana_dashboard: "true"
  annotations:
    grafana-folder: /grafana-dashboard-definitions/Addons
data:
  mtsre-rhods-slos.json: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "4rNsqZfnz"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "custom": {
                "align": "auto",
                "displayMode": "auto",
                "inspect": false
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 16,
            "w": 3,
            "x": 0,
            "y": 0
          },
          "id": 2,
          "options": {
            "footer": {
              "fields": "",
              "reducer": [
                "sum"
              ],
              "show": false
            },
            "showHeader": true
          },
          "pluginVersion": "9.0.1",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "4rNsqZfnz"
              },
              "editorMode": "code",
              "expr": "group by (_id) (subscription_sync_total{name=\"${addon_name}\"})",
              "format": "table",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Clusters",
          "transformations": [
            {
              "id": "groupBy",
              "options": {
                "fields": {
                  "_id": {
                    "aggregations": [],
                    "operation": "groupby"
                  }
                }
              }
            }
          ],
          "type": "table"
        }
      ],
      "schemaVersion": 36,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "hide": 2,
            "name": "addon_name",
            "query": "addon-<addon-name>",
            "skipUrlSync": false,
            "type": "constant"
          }
        ]
      },
      "time": {
        "from": "now-6h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "<addon-name> - SLO Dashboard",
      "version": 0,
      "weekStart": ""
    }

  • Create a Merge Request adding the files to the managed-tenants-slos git repository.
  • Ping @mt-sre-ic in the #forum-managed-tenants Slack channel for review.
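Replacing every occurrence of <addon-name> by hand is easy to get wrong, so the substitution can be scripted. The following is a minimal sketch and not part of the managed-tenants-slos tooling: the function name render_dashboard_configmap, the excerpted EXAMPLE_TEMPLATE, and the addon name "my-addon" are all hypothetical.

```python
PLACEHOLDER = "<addon-name>"


def render_dashboard_configmap(template: str, addon_name: str) -> str:
    """Return the template text with every <addon-name> placeholder replaced."""
    if not addon_name or PLACEHOLDER in addon_name:
        raise ValueError("addon_name must be a real addon name")
    return template.replace(PLACEHOLDER, addon_name)


# A few lines excerpted from the ConfigMap template above (hypothetical
# stand-in; in practice you would read the full template from a file).
EXAMPLE_TEMPLATE = """\
  name: <addon-name>-slo-dashboard
            "query": "addon-<addon-name>",
      "title": "<addon-name> - SLO Dashboard",
"""

if __name__ == "__main__":
    print(render_dashboard_configmap(EXAMPLE_TEMPLATE, "my-addon"))
```

Checking that no placeholder survives in the rendered file before opening the Merge Request catches the most common copy-and-paste mistake.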

Dashboard Deployment

The merge request above must be merged before you start this step.

The dashboard deployment happens through app-interface, using saas-files.

  • For each new Addon, we need to create a new saas-file in app-interface.
  • Give ownership of the saas-file to your team using an app-interface role file.

Example Merge Request content to app-interface:

https://gitlab.cee.redhat.com/service/app-interface/-/commit/9306800aabaca18cd034dfb3933a12d29506fa08

  • Ping @mt-sre-ic in the #forum-managed-tenants Slack channel for approval.
  • Merge Requests to app-interface are continuously reviewed and merged by AppSRE. After the MT-SRE approval, wait until the Merge Request is merged.

Accessing the Dashboards

Once the app-interface merge request is merged, you will see your ConfigMaps being deployed in the #sd-mt-sre-info Slack channel. For example:

[app-sre-stage-01] ConfigMap odf-ms-cluster-status applied
...
[app-sre-prod-01] ConfigMap odf-ms-cluster-status applied

Once the dashboards are deployed, you can see them here:

Development Flow

After all the configuration is in place:

STAGE:

  • Dashboards on the STAGE Grafana instance are intended only for the people developing the dashboards and should not be shared with external audiences.
  • Changes in the managed-tenants-slos repository can be merged by the development team with “/lgtm” comments from those in the OWNERS file.
  • Once merged, changes are automatically delivered to the STAGE Grafana instance.

PRODUCTION:

  • The dashboards on the PRODUCTION Grafana are pinned to a specific git commit from the managed-tenants-slos repository in the corresponding saas-file in app-interface.
  • After patching the git commit in the saas-file, owners of the saas-file can merge the promotion with a “/lgtm” comment in the app-interface Merge Request.
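Before opening a promotion Merge Request, it can help to confirm that the reference being written into the saas-file really is a fixed commit and not a moving reference. The sketch below is illustrative only: it assumes a pinned reference is a full 40-character SHA-1 commit ID, and the helper name is_pinned_commit and the sample values are hypothetical.

```python
import re

# A full Git SHA-1 commit ID is exactly 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_commit(ref: str) -> bool:
    """Return True if ref looks like a full commit SHA rather than a branch or tag."""
    return _FULL_SHA.fullmatch(ref) is not None


if __name__ == "__main__":
    # Hypothetical values for illustration.
    for ref in ("0123456789abcdef0123456789abcdef01234567", "main", "abc123"):
        print(ref, is_pinned_commit(ref))
```

Abbreviated SHAs are rejected on purpose, since only a full commit ID is unambiguous.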

2 - Dead Man's Snitch Operator Integration

Overview

Dead Man’s Snitch (DMS) is essentially a constantly firing Prometheus alert paired with an external receiver (called a snitch) that raises an alert if the monitoring stack goes down and stops sending alerts. The snitch URLs are generated dynamically by the DMS operator, which runs on Hive and is owned by SREP. The snitch URL is delivered in a secret.

Usage

The Add-On metadata file (addon.yaml) allows you to provide a deadmanssnitch field (see the deadmanssnitch field in the Add-On metadata file schema documentation for more information). This field holds the required Dead Man’s Snitch integration configuration. A DeadmansSnitchIntegration resource is then created and applied to Hive alongside the Add-On SelectorSyncSet (SSS).

DeadmansSnitchIntegration Resource

If you specify only the bare minimum fields under the ‘deadmanssnitch’ field in the addon metadata, the following default DMS configuration is created:

- apiVersion: deadmanssnitch.managed.openshift.io/v1alpha1
  kind: DeadmansSnitchIntegration
  metadata:
    name: addon-{{ADDON.metadata['id']}}
    namespace: deadmanssnitch-operator
  spec:
    clusterDeploymentSelector: ## can be overridden by .deadmanssnitch.clusterDeploymentSelector field in addon metadata
      matchExpressions:
      - key: {{ADDON.metadata['label']}}
        operator: In
        values:
        - "true"

    dmsAPIKeySecretRef: ## fixed
      name: deadmanssnitch-api-key
      namespace: deadmanssnitch-operator

    snitchNamePostFix: {{ADDON.metadata['id']}} ## can be overridden by .deadmanssnitch.snitchNamePostFix field in addon metadata

    tags: {{ADDON.metadata['deadmanssnitch']['tags']}} ## Required

    targetSecretRef:
      ## can be overridden by .deadmanssnitch.targetSecretRef.name field in addon metadata
      name: {{ADDON.metadata['id']}}-deadmanssnitch
      ## can be overridden by .deadmanssnitch.targetSecretRef.namespace field in addon metadata
      namespace: {{ADDON.metadata['targetNamespace']}}
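The defaulting rules in the template above can be summarized in code. This is a sketch of the override logic as the template comments describe it, not the actual rendering implementation: the function name build_dms_integration is hypothetical, and the label and targetNamespace values in the example are made up for illustration.

```python
from typing import Any


def build_dms_integration(addon: dict[str, Any]) -> dict[str, Any]:
    """Build a DeadmansSnitchIntegration dict from addon metadata, applying defaults."""
    dms = addon["deadmanssnitch"]
    target = dms.get("targetSecretRef", {})
    default_selector = {
        "matchExpressions": [
            {"key": addon["label"], "operator": "In", "values": ["true"]}
        ]
    }
    return {
        "apiVersion": "deadmanssnitch.managed.openshift.io/v1alpha1",
        "kind": "DeadmansSnitchIntegration",
        "metadata": {
            "name": f"addon-{addon['id']}",
            "namespace": "deadmanssnitch-operator",
        },
        "spec": {
            # Overridable via .deadmanssnitch.clusterDeploymentSelector
            "clusterDeploymentSelector": dms.get(
                "clusterDeploymentSelector", default_selector
            ),
            # Fixed, cannot be overridden
            "dmsAPIKeySecretRef": {
                "name": "deadmanssnitch-api-key",
                "namespace": "deadmanssnitch-operator",
            },
            # Overridable via .deadmanssnitch.snitchNamePostFix
            "snitchNamePostFix": dms.get("snitchNamePostFix", addon["id"]),
            # Required: a missing "tags" key raises KeyError
            "tags": dms["tags"],
            "targetSecretRef": {
                "name": target.get("name", f"{addon['id']}-deadmanssnitch"),
                "namespace": target.get("namespace", addon["targetNamespace"]),
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical metadata; only id and tags mirror the documented examples.
    addon = {
        "id": "ocs-converged",
        "label": "api.openshift.com/addon-ocs-converged",
        "targetNamespace": "example-namespace",
        "deadmanssnitch": {"tags": ["ocs-converged-stage"]},
    }
    print(build_dms_integration(addon)["spec"]["targetSecretRef"])
```

With only tags supplied, the secret name falls back to <id>-deadmanssnitch in the addon's target namespace, which matches the default shown in the template.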

Examples of deadmanssnitch field in addon.yaml

id: ocs-converged
....
....
deadmanssnitch:
  tags: ["ocs-converged-stage"]
....
id: managed-odh
....
....
deadmanssnitch:
  snitchNamePostFix: rhods
  tags: ["rhods-integration"]
  targetSecretRef:
    name: redhat-rhods-deadmanssnitch
    namespace: redhat-ods-monitoring
....
id: managed-api-service-internal
....
....
deadmanssnitch:
  clusterDeploymentSelector:
    matchExpressions:
    - key: "api.openshift.com/addon-managed-api-service-internal"
      operator: In
      values:
      - "true"
    - key: "api.openshift.com/addon-managed-api-service-internal-delete"
      operator: NotIn
      values:
      - 'true'
  snitchNamePostFix: rhoam
  tags: ["rhoam-production"]
  targetSecretRef:
    name: redhat-rhoami-deadmanssnitch
    namespace: redhat-rhoami-operator

Generated Secret

A secret containing the SNITCH_URL will be generated (by default in the same namespace as your addon). Your add-on needs to pick up the generated secret in the cluster and inject it into your Alertmanager config. Example of a secret created in-cluster:

kind: Secret
apiVersion: v1
metadata:
  namespace: redhat-myaddon-operator
  labels:
    hive.openshift.io/managed: 'true'
data:
  SNITCH_URL: # base64-encoded URL, for example https://nosnch.in/123123123
type: Opaque
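Values under data in a Kubernetes Secret are base64-encoded, so the add-on has to decode SNITCH_URL before using it. A minimal sketch, assuming the Secret has already been fetched and parsed into a dictionary; the helper name snitch_url_from_secret is hypothetical and how the Secret is retrieved is left out:

```python
import base64


def snitch_url_from_secret(secret: dict) -> str:
    """Decode the base64-encoded SNITCH_URL value of a parsed Secret object."""
    encoded = secret["data"]["SNITCH_URL"]
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Build an example Secret the way the API server would present it.
    example_secret = {
        "kind": "Secret",
        "apiVersion": "v1",
        "data": {
            "SNITCH_URL": base64.b64encode(b"https://nosnch.in/123123123").decode("ascii")
        },
        "type": "Opaque",
    }
    print(snitch_url_from_secret(example_secret))
```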

Alert

Your Alertmanager needs a constantly firing alert that is routed to DMS. Example of a Prometheus alerting rule that always fires:

- name: DeadManSnitch
  interval: 1m
  rules:
    - alert: DeadManSnitch
      expr: vector(1)
      labels:
        severity: critical
      annotations:
        description: This is a DeadManSnitch to ensure RHODS monitoring and alerting pipeline is online.
        summary: Alerting DeadManSnitch

Route

Example of a route that forwards the firing alert to DMS:

- match:
    alertname: DeadManSnitch
  receiver: deadman-snitch
  repeat_interval: 5m

Receiver

Example receiver for DMS:

- name: 'deadman-snitch'
  webhook_configs:
  - url: '<snitch_url>?m=just+checking+in'
    send_resolved: false
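Since the snitch URL is only known once the secret exists in the cluster, the receiver entry has to be generated at runtime. The sketch below shows one way to assemble it from an already decoded URL; the function name build_dms_receiver is hypothetical, and serializing the result into your Alertmanager configuration is left to the add-on.

```python
from urllib.parse import urlencode


def build_dms_receiver(snitch_url: str, message: str = "just checking in") -> dict:
    """Return an Alertmanager receiver entry that posts to the given snitch URL."""
    # urlencode turns spaces into '+', producing "m=just+checking+in".
    query = urlencode({"m": message})
    return {
        "name": "deadman-snitch",
        "webhook_configs": [
            {"url": f"{snitch_url}?{query}", "send_resolved": False}
        ],
    }


if __name__ == "__main__":
    # Placeholder URL taken from the secret example in this document.
    print(build_dms_receiver("https://nosnch.in/123123123"))
```

The receiver name must match the receiver referenced by the route, which is deadman-snitch in the examples here.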

Please log a Jira ticket with your assigned SRE team to have this completed at least one week before going live.

Current Example

3 - PagerDuty Integration

The PagerDuty integration is configured in the pagerduty field in the addon.yaml metadata file. Given this configuration, a secret with the specified name is created in the specified namespace by the PagerDuty Operator, which runs on Hive. The secret contains the PAGERDUTY_KEY.

4 - OCM SendGrid Service Integration

OCM SendGrid Service is an event-driven service that manages SendGrid subuser accounts and credential bundles based on addon cluster logs.

The secret name and namespace are configured in app-interface; see this section in the documentation.