Monitoring Addons
1 - SLO Dashboards
Development teams are required to co-maintain, in conjunction with the MT-SRE Team, SLO Dashboards for the Addons they develop. This document explains how to bootstrap the dashboard creation and deployment.
First Dashboard
- Fork/clone the managed-tenants-slos repository.
- Create the following directory structure:
.
├── <addon-name>
│   ├── dashboards
│   │   └── <addon-name>-slo-dashboard.configmap.yaml
│   └── OWNERS
Example OWNERS:
approvers:
- akonarde
- asegundo
<addon-name>-slo-dashboard.configmap.yaml contents (replace all occurrences of <addon-name>):
apiVersion: v1
kind: ConfigMap
metadata:
  name: <addon-name>-slo-dashboard
  labels:
    grafana_dashboard: "true"
  annotations:
    grafana-folder: /grafana-dashboard-definitions/Addons
data:
  mtsre-rhods-slos.json: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "4rNsqZfnz"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "custom": {
                "align": "auto",
                "displayMode": "auto",
                "inspect": false
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 16,
            "w": 3,
            "x": 0,
            "y": 0
          },
          "id": 2,
          "options": {
            "footer": {
              "fields": "",
              "reducer": [
                "sum"
              ],
              "show": false
            },
            "showHeader": true
          },
          "pluginVersion": "9.0.1",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "4rNsqZfnz"
              },
              "editorMode": "code",
              "expr": "group by (_id) (subscription_sync_total{name=\"${addon_name}\"})",
              "format": "table",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Clusters",
          "transformations": [
            {
              "id": "groupBy",
              "options": {
                "fields": {
                  "_id": {
                    "aggregations": [],
                    "operation": "groupby"
                  }
                }
              }
            }
          ],
          "type": "table"
        }
      ],
      "schemaVersion": 36,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "hide": 2,
            "name": "addon_name",
            "query": "addon-<addon-name>",
            "skipUrlSync": false,
            "type": "constant"
          }
        ]
      },
      "time": {
        "from": "now-6h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "<addon-name> - SLO Dashboard",
      "version": 0,
      "weekStart": ""
    }
- Create a Merge Request adding the files to the managed-tenants-slos git repository.
- Ping @mt-sre-ic in the #forum-managed-tenants Slack channel for review.
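The bootstrap steps above can be scripted. The following is a minimal sketch, assuming the ConfigMap manifest shown above is available as a template string that still contains the <addon-name> placeholder. The function names (render_template, bootstrap_addon) are hypothetical and are not part of the managed-tenants-slos repository.

```python
from __future__ import annotations

from pathlib import Path

PLACEHOLDER = "<addon-name>"


def render_template(template: str, addon_name: str) -> str:
    """Replace every occurrence of the <addon-name> placeholder."""
    return template.replace(PLACEHOLDER, addon_name)


def bootstrap_addon(
    repo_root: Path, addon_name: str, template: str, approvers: list[str]
) -> Path:
    """Create the directory layout, OWNERS file and dashboard ConfigMap.

    Returns the path of the generated ConfigMap file.
    """
    addon_dir = repo_root / addon_name
    dashboards_dir = addon_dir / "dashboards"
    dashboards_dir.mkdir(parents=True, exist_ok=True)

    # OWNERS lists the people whose "/lgtm" comment allows a merge.
    owners = "approvers:\n" + "".join(f"- {name}\n" for name in approvers)
    (addon_dir / "OWNERS").write_text(owners)

    configmap = dashboards_dir / f"{addon_name}-slo-dashboard.configmap.yaml"
    configmap.write_text(render_template(template, addon_name))
    return configmap
```

Running bootstrap_addon against a local clone of the repository produces the files to commit in the Merge Request; review the generated ConfigMap before pushing, since the sketch performs a plain string substitution and nothing else.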
Dashboard Deployment
Merging of the above merge request is a prerequisite for this step.
The dashboard deployment happens through app-interface, using saas-files.
- For each new Addon, we need to create a new saas-file in app-interface.
- Give ownership of the saas-file to your team using an app-interface role file.
Example Merge Request content to app-interface:
- Ping @mt-sre-ic in the #forum-managed-tenants Slack channel for approval.
- Merge Requests to app-interface are constantly reviewed/merged by AppSRE. After the MT-SRE approval, wait until the Merge Request is merged.
Accessing the Dashboards
Once the app-interface merge request is merged, you will see your ConfigMaps being deployed in the #sd-mt-sre-info Slack channel. For example:
[app-sre-stage-01] ConfigMap odf-ms-cluster-status applied
...
[app-sre-prod-01] ConfigMap odf-ms-cluster-status applied
Once the dashboards are deployed, you can see them here:
- STAGE: https://grafana.stage.devshift.net/dashboards/f/aGqy3WB7k/addons
- PRODUCTION: https://grafana.app-sre.devshift.net/dashboards/f/sDiLLtgVz/addons
Development Flow
After all the configuration is in place:
STAGE:
- Dashboards on the STAGE Grafana instance should not be used by external audiences other than the people developing the dashboards.
- Changes in the managed-tenants-slos repository can be merged by the development team with “/lgtm” comments from those in the OWNERS file.
- Once merged, changes are automatically delivered to the STAGE Grafana instance.
PRODUCTION:
- The dashboards on the PRODUCTION Grafana are pinned to a specific git commit from the managed-tenants-slos repository in the corresponding saas-file in app-interface.
- After patching the git commit in the saas-file, owners of the saas-file can merge the promotion with a “/lgtm” comment in the app-interface Merge Request.
2 - Dead Man's Snitch Operator Integration
Overview
Dead Man’s Snitch (DMS) is essentially a constantly firing Prometheus alert paired with an external receiver (called a snitch) that raises an alert if the monitoring stack goes down and stops sending alerts. The snitch URLs are generated dynamically by the DMS operator, which runs on Hive and is owned by SREP. The snitch URL is made available in a secret.
Usage
The Add-On metadata file (addon.yaml) allows you to provide a deadmanssnitch field (see the deadmanssnitch field in the Add-On metadata file schema documentation for more information).
This field allows you to provide the required Dead Man’s Snitch integration configuration.
A DeadmansSnitchIntegration resource is then created and applied to Hive alongside the Add-On SelectorSyncSet (SSS).
DeadmansSnitchIntegration Resource
The following default DMS configuration is created if you specify only the minimum required fields under the deadmanssnitch field in the addon metadata:
- apiVersion: deadmanssnitch.managed.openshift.io/v1alpha1
  kind: DeadmansSnitchIntegration
  metadata:
    name: addon-{{ADDON.metadata['id']}}
    namespace: deadmanssnitch-operator
  spec:
    clusterDeploymentSelector: ## can be overridden by .deadmanssnitch.clusterDeploymentSelector field in addon metadata
      matchExpressions:
        - key: {{ADDON.metadata['label']}}
          operator: In
          values:
            - "true"
    dmsAPIKeySecretRef: ## fixed
      name: deadmanssnitch-api-key
      namespace: deadmanssnitch-operator
    snitchNamePostFix: {{ADDON.metadata['id']}} ## can be overridden by .deadmanssnitch.snitchNamePostFix field in addon metadata
    tags: {{ADDON.metadata['deadmanssnitch']['tags']}} ## Required
    targetSecretRef:
      ## can be overridden by .deadmanssnitch.targetSecretRef.name field in addon metadata
      name: {{ADDON.metadata['id']}}-deadmanssnitch
      ## can be overridden by .deadmanssnitch.targetSecretRef.namespace field in addon metadata
      namespace: {{ADDON.metadata['targetNamespace']}}
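To make the default and override rules in the comments above concrete, here is a minimal Python sketch that mirrors the template. It assumes the addon metadata has been loaded into a dict with the keys the template references (id, label, targetNamespace, deadmanssnitch). The function name render_dms_integration is hypothetical; the real rendering is done by the managed-tenants tooling and may differ in detail.

```python
from __future__ import annotations

from typing import Any


def render_dms_integration(addon: dict[str, Any]) -> dict[str, Any]:
    """Build a DeadmansSnitchIntegration dict, applying defaults and overrides."""
    dms = addon.get("deadmanssnitch") or {}
    if not dms.get("tags"):
        # tags is the only field marked "Required" in the template.
        raise ValueError("deadmanssnitch.tags is required")

    # Default selector: clusters labelled with the addon label set to "true".
    default_selector = {
        "matchExpressions": [
            {"key": addon["label"], "operator": "In", "values": ["true"]}
        ]
    }
    target = dms.get("targetSecretRef") or {}

    return {
        "apiVersion": "deadmanssnitch.managed.openshift.io/v1alpha1",
        "kind": "DeadmansSnitchIntegration",
        "metadata": {
            "name": f"addon-{addon['id']}",
            "namespace": "deadmanssnitch-operator",
        },
        "spec": {
            "clusterDeploymentSelector": dms.get(
                "clusterDeploymentSelector", default_selector
            ),
            # Fixed: not overridable from the addon metadata.
            "dmsAPIKeySecretRef": {
                "name": "deadmanssnitch-api-key",
                "namespace": "deadmanssnitch-operator",
            },
            "snitchNamePostFix": dms.get("snitchNamePostFix", addon["id"]),
            "tags": dms["tags"],
            "targetSecretRef": {
                "name": target.get("name", f"{addon['id']}-deadmanssnitch"),
                "namespace": target.get("namespace", addon["targetNamespace"]),
            },
        },
    }
```

With only tags set, every other value falls back to one derived from the addon id, label and target namespace; each override replaces exactly one of those defaults.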
Examples of the deadmanssnitch field in addon.yaml:
id: ocs-converged
....
....
deadmanssnitch:
  tags: ["ocs-converged-stage"]
....

id: managed-odh
....
....
deadmanssnitch:
  snitchNamePostFix: rhods
  tags: ["rhods-integration"]
  targetSecretRef:
    name: redhat-rhods-deadmanssnitch
    namespace: redhat-ods-monitoring
....

id: managed-api-service-internal
....
....
deadmanssnitch:
  clusterDeploymentSelector:
    matchExpressions:
      - key: "api.openshift.com/addon-managed-api-service-internal"
        operator: In
        values:
          - "true"
      - key: "api.openshift.com/addon-managed-api-service-internal-delete"
        operator: NotIn
        values:
          - 'true'
  snitchNamePostFix: rhoam
  tags: ["rhoam-production"]
  targetSecretRef:
    name: redhat-rhoami-deadmanssnitch
    namespace: redhat-rhoami-operator
Generated Secret
A secret containing the SNITCH_URL will be generated (by default in the same namespace as your addon).
Your add-on will need to pick up the generated secret in the cluster and inject it into your Alertmanager config. Example of a secret created in-cluster:
kind: Secret
apiVersion: v1
metadata:
  namespace: redhat-myaddon-operator
  labels:
    hive.openshift.io/managed: 'true'
data:
  SNITCH_URL: #url like https://nosnch.in/123123123
type: Opaque
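Kubernetes stores the values under a Secret's data key base64-encoded, so the SNITCH_URL has to be decoded before it can be placed in a receiver URL. The following is a minimal sketch, assuming the Secret has already been fetched and parsed into a dict (how it is fetched is left to your operator); the function name snitch_url_from_secret is hypothetical.

```python
import base64


def snitch_url_from_secret(secret: dict) -> str:
    """Return the decoded SNITCH_URL from a parsed Kubernetes Secret."""
    encoded = secret["data"]["SNITCH_URL"]
    # Secret data values are base64-encoded; strip stray whitespace/newlines.
    return base64.b64decode(encoded).decode("utf-8").strip()
```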
Alert
Your Alertmanager will need a constantly firing alert that is routed to DMS. Example of an alert that always fires:
- name: DeadManSnitch
  interval: 1m
  rules:
    - alert: DeadManSnitch
      expr: vector(1)
      labels:
        severity: critical
      annotations:
        description: This is a DeadManSnitch to ensure RHODS monitoring and alerting pipeline is online.
        summary: Alerting DeadManSnitch
Route
Example of a route that forwards the firing-alert to DMS:
- match:
    alertname: DeadManSnitch
  receiver: deadman-snitch
  repeat_interval: 5m
Receiver
Example receiver for DMS:
- name: 'deadman-snitch'
  webhook_configs:
    - url: '<snitch_url>?m=just+checking+in'
      send_resolved: false
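The route and receiver examples can be generated from the snitch URL rather than edited by hand. The sketch below assumes the URL has already been read from the generated secret and simply reproduces the field names and values of the two examples above; the function name dms_alertmanager_fragments is hypothetical, and merging the fragments into your full Alertmanager configuration is left out.

```python
def dms_alertmanager_fragments(snitch_url: str) -> dict:
    """Build the DMS route and receiver entries for an Alertmanager config."""
    route = {
        # Only the always-firing alert is sent to the snitch.
        "match": {"alertname": "DeadManSnitch"},
        "receiver": "deadman-snitch",
        "repeat_interval": "5m",
    }
    receiver = {
        "name": "deadman-snitch",
        "webhook_configs": [
            {
                "url": f"{snitch_url}?m=just+checking+in",
                # The alert never resolves, so resolved notifications are off.
                "send_resolved": False,
            }
        ],
    }
    return {"route": route, "receiver": receiver}
```

The route's receiver value and the receiver's name must match; building both in one place avoids the two drifting apart.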
tags: ["my-addon-production"]
in the Service Delivery DMS account to their PagerDuty service. Please log a JIRA with your assigned SRE team to have this completed at least one week before going live with the SRE team.
Current Example
3 - PagerDuty Integration
The PagerDuty integration is configured in the pagerduty field in the addon.yaml metadata file.
Given this configuration, a secret with the specified name is created in the specified namespace by the PagerDuty Operator, which runs on Hive. The secret contains the PAGERDUTY_KEY.
4 - OCM SendGrid Service Integration
OCM SendGrid Service is an event-driven service that manages SendGrid subuser accounts and credential bundles based on addon cluster logs.
The secret name and namespace are configured in app-interface; see this section in the documentation.