Internal Documentation

Internal Documentation for the SRE teams.

1 - Tutorials

1.1 - Adding ADO alerts

This guide describes how to add alerts for addon-operator in the OSD RHOBS tenant.

Background

The addon-operator doesn’t have a dedicated tenant in RHOBS with its own service account; its metrics are scraped into the OSD tenant, which has its own service account. Because of that, instead of adding rules with the obsctl CLI and syncing them with obsctl-reloader, SRE-P maintains a repo called rhobs-rules-and-dashboards which, based on the tenant, automatically syncs the rules defined in the repo to app-interface.

Overview

We need the following prerequisites in place:

  • If the tenant is not defined in the repo, follow the process defined here, which explains how to register the tenant in app-interface and how to configure obsctl-reloader to sync rules for that tenant.
  • The OSD tenant is already registered and the obsctl-reloader configuration is already defined here (osd should appear in the MANAGED_TENANTS parameter of the observatorium-mst-common named item, and the secrets are defined here).

Steps

We define rules for the addon-operator metrics in the rhobs-rules-and-dashboards repo, in the rules/osd folder (each tenant has its own folder); tests for the Prometheus rules live in the tests/rules/osd folder. The suggested file naming conventions are described here.

TL;DR: name the file with a .prometheusrule.yaml suffix and add the tenant label to the PrometheusRule object.
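
A minimal sketch of such a file, assuming the conventions above (the file name, rule group, and alert below are placeholders, not the actual rules in the repo; check an existing file for the exact tenant label key expected by the sync tooling):

# rules/osd/addon-operator.prometheusrule.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: addon-operator-rules
  labels:
    tenant: osd   # tenant label picked up by the sync tooling
spec:
  groups:
    - name: addon-operator
      rules:
        - alert: ExampleAlert          # placeholder
          expr: vector(1)              # replace with a real expression
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Placeholder alert"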

Adding promrules

For a PoC alert we selected the addon_operator_addon_health_info metric, since it reports the addon health for a particular version and cluster_id. Candidate metrics can be explored and the PromQL queries tested in promlens stage or promlens prod. The metrics are scraped from addon-operator into the observatorium-mst-stage (stage) / observatorium-mst-prod (prod) datasources.

Sample addon_operator_addon_health_info metric data:

addon_operator_addon_health_info{_id="08d94ae0-a943-47ea-ac29-6cf65284aeba", container="metrics-relay-server", endpoint="https", instance="10.129.2.11:8443", job="addon-operator-metrics", name="managed-odh", namespace="openshift-addon-operator", pod="addon-operator-manager-7c9df45684-86mh4", prometheus="openshift-monitoring/k8s", receive="true", service="addon-operator-metrics", tenant_id="770c1124-6ae8-4324-a9d4-9ce08590094b", version="0.0.0"}

This particular metric gives information about version, cluster_id (_id) and addon_name (name) which can be used to create the alert.

  • We should ignore the metrics with version "0.0.0".
  • The addon is "Unhealthy" if no version of that addon reports the value 1.
  • In theory, the latest version of the addon should have the value 1. If not, the addon is "Unhealthy".

The Prometheus rule for the addon_operator_addon_health_info metric is defined here:

expr: (count by (name, _id) (addon_operator_addon_health_info{version!="0.0.0"})) - (count by (name,_id) (addon_operator_addon_health_info{version!="0.0.0"} == 0)) == 0

Explanation:

We count all series grouped by name and _id, then subtract the count of series whose value is 0, which leaves the number of series with a non-zero value. If that result is 0 (no version reports a non-zero value), the alert fires.
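
For intuition, a small worked example (values are illustrative only). Suppose one addon (name="managed-odh", _id="abc") exposes two series with version != "0.0.0":

# addon_operator_addon_health_info{..., version="1.0.0"} = 0
# addon_operator_addon_health_info{..., version="1.1.0"} = 0
#
# count by (name, _id) (...)          => 2  (all series)
# count by (name, _id) (... == 0)     => 2  (series whose value is 0)
# 2 - 2 == 0                          => alert fires: no version reports healthy (1)
#
# If version "1.1.0" reported 1 instead, the result would be 2 - 1 = 1 and no alert would fire.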

Writing tests for the promrules

First, create a <alertname>.promrulestests.yaml file in the tests/rules/osd/ folder. It is advisable to cover the different edge cases, so that the expr defined under the rules/ folder raises an alert if, say, the addon is "Unhealthy". Try out different scenarios by defining different series as input to the test rules, such as here.
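
A minimal sketch of what such a test file might look like, assuming the promtool-style unit-test format (the file names, alert name, and expected labels are placeholders; mirror an existing test in tests/rules/osd for the exact layout):

# tests/rules/osd/addonoperator.promrulestests.yaml (hypothetical file name)
rule_files:
  - ../../../rules/osd/addon-operator.prometheusrule.yaml   # path is illustrative
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # every non-"0.0.0" version reports 0, so the addon is "Unhealthy"
      - series: 'addon_operator_addon_health_info{name="managed-odh", _id="abc", version="1.0.0"}'
        values: '0x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: AddonOperatorAddonUnhealthy   # placeholder alert name
        exp_alerts:
          - exp_labels:
              name: managed-odh
              _id: abc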

The tests can be validated by running: make test-rules in the rhobs-rules-and-dashboards directory. NOTE: Make sure that the tests are defined in the tests/rules/ folder

Since the PagerDuty config is defined on a per-tenant basis (the osd tenant in this case), the alert will be triggered and paged to SRE-P, so a proper runbook should be added that redirects the alert to the lp-sre folks.

The runbook/SOP links can be validated by running: make check-runbooks

1.2 - Run Hive Locally

This guide describes how to deploy a Hive environment in your local machine using kind.

For the managed clusters, this guide covers both kind clusters and CRC clusters.

Preparation

Set up your GOPATH. Add to your ~/.bashrc:

export GOPATH=$HOME/go
export PATH=${PATH}:${GOPATH}/bin

Install kind:

~$ curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.10.0/kind-linux-amd64
~$ chmod +x ./kind
~$ mv ./kind ~/.local/bin/kind

This guide was created using kind version: kind v0.14.0 go1.18.2 linux/amd64

Install the dependencies:

~$ GO111MODULE=on go get sigs.k8s.io/kustomize/kustomize/v3
~$ go get github.com/cloudflare/cfssl/cmd/cfssl
~$ go get github.com/cloudflare/cfssl/cmd/cfssljson
~$ go get -u github.com/openshift/imagebuilder/cmd/imagebuilder

Clone OLM and checkout the version:

~$ git clone git@github.com:operator-framework/operator-lifecycle-manager.git
~$ cd operator-lifecycle-manager
~/operator-lifecycle-manager$ git checkout -b v0.21.2 v0.21.2
~/operator-lifecycle-manager$ cd ..

Clone Hive and checkout the version:

~$ git clone git@github.com:openshift/hive.git
~$ cd hive
~/hive$ git checkout 56adaaacf5f8075e3ad0896dac35243a863ec07b

Edit hack/create-kind-cluster.sh, adding an apiServerAddress that points to your local docker0 bridge IP. This is needed so the Hive cluster, which runs inside one docker container, can reach the managed cluster, which runs inside another:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "172.17.0.1"  # docker0 bridge IP
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:${reg_port}"]
      endpoint = ["http://${reg_name}:${reg_port}"]    

Hive

Export the Hive kubeconfig filename (it will be created later):

~$ export KUBECONFIG=/tmp/hive.conf

Enter the hive directory:

~$ cd hive

Create the Hive cluster:

~/hive$ ./hack/create-kind-cluster.sh hive
Creating cluster "hive" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-hive"
You can now use your cluster with:

kubectl cluster-info --context kind-hive

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/

The /tmp/hive.conf file is created now. Checking the installation:

~/hive$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED         STATUS         PORTS                                       NAMES
901f215229a4   kindest/node:v1.24.0   "/usr/local/bin/entr…"   2 minutes ago   Up 2 minutes   172.17.0.1:41299->6443/tcp                  hive-control-plane
0d4bf61da0a3   registry:2             "/entrypoint.sh /etc…"   3 hours ago     Up 3 hours     0.0.0.0:5000->5000/tcp, :::5000->5000/tcp   kind-registry
~/hive$  kubectl cluster-info
Kubernetes control plane is running at https://172.17.0.1:41299
KubeDNS is running at https://172.17.0.1:41299/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Build Hive and push the image to the local registry:

~/hive$ CGO_ENABLED=0 IMG=localhost:5000/hive:latest make docker-dev-push

Deploy Hive to the hive cluster:

~/hive$ IMG=localhost:5000/hive:latest make deploy

Because we are not running on OpenShift, we must also create a secret with certificates for the hiveadmission webhooks:

~/hive$ ./hack/hiveadmission-dev-cert.sh

If the hive cluster is using node image kindest/node:v1.24.0 or later, you will have to additionally run:

~/hive$ ./hack/create-service-account-secrets.sh

because starting in Kubernetes 1.24.0, secrets are no longer automatically generated for service accounts.
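
For reference, in Kubernetes 1.24+ a service account token Secret can still be requested explicitly by annotating it with the service account name; the helper script presumably creates secrets of roughly this shape (names and namespace here are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: hive-controllers-token        # illustrative name
  namespace: hive
  annotations:
    kubernetes.io/service-account.name: hive-controllers   # SA the token is issued for
type: kubernetes.io/service-account-token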

Tip: if it fails, check kubectl version. The Client and Server versions should be in sync:

~/hive$ kubectl version --short
Client Version: v1.24.0
Kustomize Version: v4.5.4
Server Version: v1.24.0

Checking the Hive pods:

~/hive$ kubectl get pods -n hive
NAME                                READY   STATUS    RESTARTS   AGE
hive-clustersync-0                  1/1     Running   0          26m
hive-controllers-79bbbc7f98-q9pxm   1/1     Running   0          26m
hive-operator-69c4649b96-wmd79      1/1     Running   0          26m
hiveadmission-6697d9df99-jdl4l      1/1     Running   0          26m
hiveadmission-6697d9df99-s9pv9      1/1     Running   0          26m

Managed Cluster - Kind

Open a new terminal.

Export the managed cluster kubeconfig filename (it will be created later):

~$ export KUBECONFIG=/tmp/cluster1.conf

Enter the hive directory:

~$ cd hive
~/hive$

Create the managed cluster:

~/hive$ ./hack/create-kind-cluster.sh cluster1
Creating cluster "cluster1" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-cluster1"
You can now use your cluster with:

kubectl cluster-info --context kind-cluster1

Have a nice day! 👋
 😊

The /tmp/cluster1.conf file is created now. Checking the installation:

~/hive$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED             STATUS             PORTS                                       NAMES
267fa20f4a0f   kindest/node:v1.24.0   "/usr/local/bin/entr…"   2 minutes ago       Up 2 minutes       172.17.0.1:40431->6443/tcp                  cluster1-control-plane
901f215229a4   kindest/node:v1.24.0   "/usr/local/bin/entr…"   About an hour ago   Up About an hour   172.17.0.1:41299->6443/tcp                  hive-control-plane
0d4bf61da0a3   registry:2             "/entrypoint.sh /etc…"   5 hours ago         Up 5 hours         0.0.0.0:5000->5000/tcp, :::5000->5000/tcp   kind-registry
~/hive$ kubectl cluster-info
Kubernetes control plane is running at https://172.17.0.1:40431
KubeDNS is running at https://172.17.0.1:40431/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Before we install OLM, we have to edit the install scripts to use cluster1. Go into scripts/build_local.sh and replace

  if [[ ${#CLUSTERS[@]} == 1 ]]; then
    KIND_FLAGS="--name ${CLUSTERS[0]}"
    echo 'Use cluster ${CLUSTERS[0]}'
  fi

with

  KIND_FLAGS="--name cluster1"

Now enter the OLM directory:

~/hive$ cd ../operator-lifecycle-manager/
~/operator-lifecycle-manager$

Install the CRDs and OLM using:

~/operator-lifecycle-manager$ make run-local

OLM pods should be running now:

~/git/operator-lifecycle-manager$ kubectl get pods -n olm
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-54bbdffc6b-hf8rz   1/1     Running   0          87s
olm-operator-6bfbd74fb8-cjl4b       1/1     Running   0          87s
operatorhubio-catalog-gdk2d         1/1     Running   0          48s
packageserver-5d67bbc56b-6vxqb      1/1     Running   0          46s

With cluster1 installed, let’s create a ClusterDeployment in Hive pointing to it.

Export the Hive kubeconfig filename and enter the hive directory (or just switch to the first terminal, the one used to deploy Hive):

~$ export KUBECONFIG=/tmp/hive.conf
~$ cd hive

Attention: hiveutil expects default AWS credentials in ~/.aws/credentials. You can fake them like this:

~/hive$ cat ~/.aws/credentials
[default]
aws_access_key_id = foo
aws_secret_access_key = bar

Because Hive will not provision that cluster, we can use hiveutil to adopt it:

~/hive$ bin/hiveutil create-cluster \
--base-domain=new-installer.openshift.com kind-cluster1  \
--adopt --adopt-admin-kubeconfig=/tmp/cluster1.conf \
--adopt-infra-id=fakeinfra \
--adopt-cluster-id=fakeid

Checking the ClusterDeployment:

~/hive$ kubectl get clusterdeployment
NAME            PLATFORM   REGION      CLUSTERTYPE   INSTALLED   INFRAID   VERSION   POWERSTATE   AGE
kind-cluster1   aws        us-east-1                 true        infra1                           48s

Checking the ClusterDeployment status:

~/hive$ kubectl get clusterdeployment kind-cluster1 -o json | jq .status.conditions
[
  ...
  {
    "lastProbeTime": "2021-02-01T14:02:42Z",
    "lastTransitionTime": "2021-02-01T14:02:42Z",
    "message": "SyncSet apply is successful",
    "reason": "SyncSetApplySuccess",
    "status": "False",
    "type": "SyncSetFailed"
  },
  {
    "lastProbeTime": "2021-02-01T14:02:41Z",
    "lastTransitionTime": "2021-02-01T14:02:41Z",
    "message": "cluster is reachable",
    "reason": "ClusterReachable",
    "status": "False",
    "type": "Unreachable"
  },
  ...
]

Managed Cluster - CRC

Export the CRC kubeconfig filename to be created:

~$ export KUBECONFIG=/tmp/crc.conf

Log in to your CRC cluster with the kubeadmin user:

~$ oc login -u kubeadmin -p **** https://api.crc.testing:6443

The /tmp/crc.conf file should now contain the kubeconfig for your CRC cluster.

Export the Hive kubeconfig filename and enter the hive directory (or just switch to the first terminal, the one used to deploy Hive):

~$ export KUBECONFIG=/tmp/hive.conf
~$ cd hive

Because Hive will not provision that cluster, we can use hiveutil to adopt it:

~/hive$ bin/hiveutil create-cluster \
--base-domain=crc.openshift.com crc  \
--adopt --adopt-admin-kubeconfig=/tmp/crc.conf \
--adopt-infra-id=fakeinfra \
--adopt-cluster-id=fakeid

Checking the ClusterDeployment status:

~/hive$ kubectl get clusterdeployment crc -o json | jq .status.conditions
[
  {
    "lastProbeTime": "2021-02-02T14:21:02Z",
    "lastTransitionTime": "2021-02-02T14:21:02Z",
    "message": "cluster is reachable",
    "reason": "ClusterReachable",
    "status": "False",
    "type": "Unreachable"
  },
  {
    "lastProbeTime": "2021-02-02T01:45:19Z",
    "lastTransitionTime": "2021-02-02T01:45:19Z",
    "message": "SyncSet apply is successful",
    "reason": "SyncSetApplySuccess",
    "status": "False",
    "type": "SyncSetFailed"
  }
]

Tip: in case the cluster status is “unreachable”, that’s because Hive runs in a Kubernetes cluster deployed inside a container, and it is trying to access the CRC virtual machine controlled by libvirt. You will have to figure out your firewall setup, but this is what worked for me:

~/hive$ firewall-cmd --permanent --zone=trusted --change-interface=docker0
success
~/hive$ firewall-cmd --reload
success

SelectorSyncSet

Export the Hive kubeconfig:

~$ export KUBECONFIG=/tmp/hive.conf

Create a test SelectorSyncSet. Example:

apiVersion: v1
kind: List
metadata: {}
items:
  - apiVersion: hive.openshift.io/v1
    kind: SelectorSyncSet
    metadata:
      name: cso-test
    spec:
      clusterDeploymentSelector:
        matchLabels:
          api.openshift.com/cso-test: 'true'
      resourceApplyMode: Sync
      resources:
        - apiVersion: v1
          kind: Namespace
          metadata:
            annotations: {}
            labels: {}
            name: cso

Apply it:

~$ kubectl apply -f cso-test.yaml
selectorsyncset.hive.openshift.io/cso-test created

Now edit the ClusterDeployment of a cluster:

~$ kubectl edit clusterdeployment kind-cluster1

Add the label api.openshift.com/cso-test: 'true' to it, then save and exit.
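
The same label can also be applied non-interactively; a one-liner like this should have an equivalent effect:

~$ kubectl label clusterdeployment kind-cluster1 api.openshift.com/cso-test=true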

The cso namespace should now be created in the target cluster:

$ export KUBECONFIG=/tmp/cluster1.conf
$ oc get namespace cso
NAME   STATUS   AGE
cso    Active   81s

Cleanup

To clean up, delete the two clusters and the surrounding clutter:

~/hive$ kind delete cluster --name hive
~/hive$ kind delete cluster --name cluster1
~/hive$ docker rm -f kind-registry
~/hive$ docker network rm kind

2 - Getting Access

2.1 - Getting Backplane Access

Backplane

Backplane is the system used to provide access to the fleet of OpenShift clusters. It creates SSH tunnels and modifies your local ~/.kube/config.

Getting access

  1. Install ocm CLI
  2. Follow the instructions here
  3. Make sure your user is part of the sd-mtsre Rover group.
  4. Wait for https://gitlab.cee.redhat.com/service/authorizedkeys-builder to sync your SSH key onto the fleet of OpenShift clusters
  5. Install backplane CLI or use the PKGBUILD.
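
Once the prerequisites above are in place, a typical session looks roughly like this (the cluster ID is a placeholder, and exact flags may differ between ocm/backplane CLI versions):

~$ ocm login --token=<your-ocm-token> --url=https://api.openshift.com
~$ ocm backplane login <cluster-id>
~$ oc whoami    # kubeconfig now points at the target cluster through the backplane tunnel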

2.2 - Getting OCM API Access

3 - Team Ceremonies

Sprint Duration

A Managed Tenants SRE Team sprint lasts 3 full weeks.

Backlog Refinement

Every 3 weeks. 1 hour (max). Held before the sprint ends, planned around half a week before the next sprint starts.

  • Scrum Team comes together to refine issues in our backlog and to ensure that important issues are actionable for the next sprint.
  • Product Owner owns the backlog and can communicate their needs and wishes to the Dev Team
  • This includes:
    • ensuring that the Definition of Ready is met for most of our issues
    • sorting and prioritizing them
    • estimating issue complexity with the team

Sprint Retro

Every 3 weeks. 30 minutes (max). Held before the sprint ends; right now this happens right before the planning meeting for the next sprint:

  • Scrum Team comes together to fine-tune processes and anything else they deem important.
  • The goals of continuous Retro meetings are:
    • to inspect the current way of working
    • and adapt it if necessary
    • in small steps and an iterative fashion
    • incorporate eventual process changes in the next sprint
  • Guide that @jgwosdz used for our first retro
  • Our retro dashboard with retros and derived action items: https://action.parabol.co/team/3heARr2dbz

Sprint Review

Every 3 weeks. 30-minute (max) meeting, hosted on the same day that the Sprint finishes:

  • Scrum Team presents the results of their work to key stakeholders.

Sprint Planning

Every 3 weeks. 30-minute (max) meeting hosted on the same day that the Sprint begins. Sprint Planning addresses the following topics:

  • Why is this Sprint valuable?
  • What can be Done this Sprint?
  • How will the chosen work get done?

The Sprint Goal, the Product Backlog items selected for the Sprint, plus the plan for delivering them are together referred to as the Sprint Backlog.

Scrum Meeting

Every week. 1-hour (max) meeting for the Scrum Team, focused on progress towards the Sprint Goal; it produces an actionable plan for the next week of work.

Each Scrum Team member will describe:

  • Last week’s work.
  • Plans for the next week.
  • Blockers.

Weekly Tenants Sync

Every week. 1-hour (max) meeting for the Scrum Team and the Tenants to make announcements, collect feedback and discuss requirements.

Definition Of Ready/Done

Both definitions have been inlined into our issue template, which lives in a separate ‘eversprint’.

This is our Jira project: https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=8694&projectKey=MTSRE&view=planning

4 - Feature Flags in ADO

Currently, some ADO functionality, for example addons plug and play, is hidden behind a feature flag. Feature flags are specified in the AddonOperator resource under .spec.featureflags. This field is a string containing a comma-separated list of feature flags to enable. For addons plug and play, and thus the addon package, to be enabled, the string ADDONS_PLUG_AND_PLAY has to be included in this field.
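
For illustration, enabling the flag would look roughly like this (the API group/version and singleton resource name follow the addon-operator CRDs; confirm the exact field name and casing against the AddonOperator CRD in your cluster):

apiVersion: addons.managed.openshift.io/v1alpha1
kind: AddonOperator
metadata:
  name: addon-operator                       # the cluster-scoped singleton
spec:
  featureFlags: "ADDONS_PLUG_AND_PLAY"       # comma-separated, e.g. "ADDONS_PLUG_AND_PLAY,OTHER_FLAG"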

5 - Interrupt Catcher

The Interrupt Catcher is the entry-point for any tenant to get help, ask questions and raise issues. It’s our interface with our Tenants.

The MT-SRE Team member with the Interrupt Catcher responsibility can be reached via the @mt-sre-ic Slack handle, in the #forum-managed-tenants channel.

Coverage

Each working day has 15 hours of IC “Follow The Sun” coverage:

  • APAC: From 4:30 to 9:30 UTC
  • EMEA: From 9:30 to 14:30 UTC
  • NASA: From 14:30 to 19:30 UTC

Work items generated outside that time-frame will be picked up in the next FTS shift.

PagerDuty Schedule: https://redhat.pagerduty.com/schedules#PM3YCH1

Response Time

Service Level Indicator (SLI)                          SLO Time      MT-SRE “Goal Time”
MR’s to managed-tenants repositories                   24 FTSH*      4 FTSH*
#forum-managed-tenants Slack messages, misc. support   best effort   4 FTSH*

*FTSH: Follow The Sun working Hours

Responsibilities

  • Review Merge Requests created by the MT-SRE Tenants on the “Surfaces” repositories (listed in the next section)
  • Respond to the alerts in the #sd-mt-sre-alert Slack channel
  • Respond to incidents from PagerDuty
  • Respond to questions and requests from the MT-SRE tenants in the #forum-managed-tenants Slack channel
  • Engage on incidents with Addons that are on-boarding to the MT-SRE
  • Hand over the outstanding work to the next IC

Surfaces

6 - Incident Management

Preparedness for major incidents is crucial. We have established the following Incident Management processes to ensure SREs can follow predetermined procedures:

Coverage

Layered Products SRE (LPSRE) provides 24x7 coverage and support.

If you need to escalate an incident, please refer to the Layered Products SRE Escalation Procedure.

NOTE: Only escalate an incident if the standard manual notification process using an OHSS ticket has failed.