Internal Documentation
1 - Tutorials
1.1 - Adding ADO alerts
This guide describes how to add alerts for addon-operator in the OSD RHOBS tenant.
Background
Since the addon-operator doesn't have a specific tenant in RHOBS with a service account, the addon-operator metrics are scraped into the OSD tenant with its own service account. Instead of adding rules using the obsctl CLI and syncing them using obsctl-reloader, SRE-P has a repo called rhobs-rules-and-dashboards which, based on the tenant, will automatically sync the rules defined in the repo to app-interface.
Overview
We need the following prerequisites:
If the tenant is not defined in the repo, we need to follow the process defined here, which explains how to register the tenant in app-interface and how to configure obsctl-reloader to sync rules for the particular tenant.
The OSD tenant is already registered and the obsctl-reloader configuration is already defined here (we should see osd in the MANAGED_TENANTS parameter of the observatorium-mst-common named item, and the secrets are defined here).
Steps
We can define rules for the addon-operator metrics in the rhobs-rules-and-dashboards repo in the rules/osd folder (the tenants are added as individual folders), and tests for the Prometheus rules are defined in the tests/rules/osd folder. The suggested file naming conventions are described here.
tl;dr: name the file with a .prometheusrule.yaml suffix and add the tenant label to the PrometheusRule object.
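For orientation, a minimal skeleton following those conventions might look like this (the file name, object name and group name are illustrative, not taken from the repo; the tenant label key follows the repo's conventions):
# rules/osd/addon-operator.prometheusrule.yaml (illustrative file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: addon-operator               # illustrative name
  labels:
    tenant: osd                      # tenant label so the rule is synced for the osd tenant
spec:
  groups:
    - name: addon-operator           # illustrative group name
      rules: []                      # alerting rules go here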
Adding promrules
For creating a PoC alert we have selected the addon_operator_addon_health_info metric, since it basically explains the addon health for a particular version and cluster_id.
The metrics for creating alerts can be decided, and the PromQL queries tested out, from promlens stage or promlens prod. The metrics are scraped from addon-operator into the observatorium-mst-stage (stage) / observatorium-mst-prod (prod) datasources.
Sample addon_operator_addon_health_info metric data:
addon_operator_addon_health_info{_id="08d94ae0-a943-47ea-ac29-6cf65284aeba", container="metrics-relay-server", endpoint="https", instance="10.129.2.11:8443", job="addon-operator-metrics", name="managed-odh", namespace="openshift-addon-operator", pod="addon-operator-manager-7c9df45684-86mh4", prometheus="openshift-monitoring/k8s", receive="true", service="addon-operator-metrics", tenant_id="770c1124-6ae8-4324-a9d4-9ce08590094b", version="0.0.0"}
This particular metric gives information about version, cluster_id (_id) and addon_name (name), which can be used to create the alert.
- We should ignore the metrics with version "0.0.0".
- The addon health is "Unhealthy" if no version of the particular addon has the value 1.
- In theory, the latest version of the particular addon should have the value 1. If not, then the addon is "Unhealthy".
The Prometheus rule for the addon_operator_addon_health_info metric is defined here:
expr: (count by (name, _id) (addon_operator_addon_health_info{version!="0.0.0"})) - (count by (name,_id) (addon_operator_addon_health_info{version!="0.0.0"} == 0)) == 0
Explanation:
We aggregate all metrics by name and _id and count the metrics with non-zero values.
If that count is 0 (implying there are no non-zero value metrics for that addon on that cluster), the alert is raised.
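Putting that expression into a complete rule object, a sketch could look like the following (the alert name, for duration, severity and annotation text are illustrative; the actual rule lives in the rhobs-rules-and-dashboards repo):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: addon-operator-addon-health        # illustrative
  labels:
    tenant: osd
spec:
  groups:
    - name: addon-operator-addon-health
      rules:
        - alert: AddonOperatorAddonUnhealthy   # illustrative alert name
          expr: (count by (name, _id) (addon_operator_addon_health_info{version!="0.0.0"})) - (count by (name, _id) (addon_operator_addon_health_info{version!="0.0.0"} == 0)) == 0
          for: 15m                             # illustrative duration
          labels:
            severity: warning                  # illustrative severity
          annotations:
            summary: "Addon {{ $labels.name }} is unhealthy on cluster {{ $labels._id }}"
            runbook_url: https://example.com/sop/addon-unhealthy   # placeholder; point to the real SOP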
Writing tests for the promrules
First, create an <alertname>.promrulestests.yaml file in the tests/rules/osd folder.
Try out different scenarios by defining different series as input to the test rules, such as here.
The tests can be validated by running make test-rules in the rhobs-rules-and-dashboards directory.
NOTE: Make sure that the tests are defined in the tests/rules/osd folder.
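As a sketch, a unit test for the sample rule above could look like this (file name, paths and values are illustrative and assume the example alert shown earlier; the exact rule_files wiring follows the repo's conventions and the upstream promtool unit-test format):
# tests/rules/osd/addonoperatoraddonunhealthy.promrulestests.yaml (illustrative)
rule_files:
  - ../../../rules/osd/addon-operator.prometheusrule.yaml   # illustrative path
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # the only non-"0.0.0" version of the addon reports 0 for 30 minutes -> unhealthy
      - series: 'addon_operator_addon_health_info{name="managed-odh", _id="cluster-a", version="1.0.0"}'
        values: '0x30'
    alert_rule_test:
      - eval_time: 20m
        alertname: AddonOperatorAddonUnhealthy
        exp_alerts:
          - exp_labels:
              name: managed-odh
              _id: cluster-a
              severity: warning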
Since the PagerDuty config is defined on a per-tenant basis (the osd tenant in this case), the alert will be triggered and paged to SRE-P, so a proper runbook should be added redirecting the alert to the LP-SRE folks.
The runbook/SOP links can be validated by running make check-runbooks.
1.2 - Run Hive Locally
This guide describes how to deploy a Hive environment in your local machine using kind.
For the managed clusters, this guide covers both kind clusters and CRC clusters.
Preparation
Set up your GOPATH. Add to your ~/.bashrc:
export GOPATH=$HOME/go
export PATH=${PATH}:${GOPATH}/bin
Install kind:
~$ curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.10.0/kind-linux-amd64
~$ chmod +x ./kind
~$ mv ./kind ~/.local/bin/kind
This guide was created using kind version:
kind v0.14.0 go1.18.2 linux/amd64
Install the dependencies:
~$ GO111MODULE=on go get sigs.k8s.io/kustomize/kustomize/v3
~$ go get github.com/cloudflare/cfssl/cmd/cfssl
~$ go get github.com/cloudflare/cfssl/cmd/cfssljson
~$ go get -u github.com/openshift/imagebuilder/cmd/imagebuilder
Clone OLM and checkout the version:
~$ git clone git@github.com:operator-framework/operator-lifecycle-manager.git
~$ cd operator-lifecycle-manager
~/operator-lifecycle-manager$ git checkout -b v0.21.2 v0.21.2
~/operator-lifecycle-manager$ cd ..
Clone Hive and checkout the version:
~$ git clone git@github.com:openshift/hive.git
~$ cd hive
~/hive$ git checkout 56adaaacf5f8075e3ad0896dac35243a863ec07b
Edit the hack/create-kind-cluster.sh, adding the apiServerAddress pointing to your local docker0 bridge IP. This is needed so the Hive cluster, which runs inside a docker container, can reach the managed cluster, which runs inside another docker container:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "172.17.0.1" # docker0 bridge IP
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:${reg_port}"]
      endpoint = ["http://${reg_name}:${reg_port}"]
Hive
Export the Hive kubeconfig filename (it will be created later):
~$ export KUBECONFIG=/tmp/hive.conf
Enter the hive directory:
~$ cd hive
Create the Hive cluster:
~/hive$ ./hack/create-kind-cluster.sh hive
Creating cluster "hive" ...
Creating cluster "hive" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-hive"
You can now use your cluster with:
kubectl cluster-info --context kind-hive
Not sure what to do next? 😅
Check out https://kind.sigs.k8s.io/docs/user/quick-start/
The /tmp/hive.conf file is now created. Checking the installation:
~/hive$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
901f215229a4 kindest/node:v1.24.0 "/usr/local/bin/entrโฆ" 2 minutes ago Up 2 minutes 172.17.0.1:41299->6443/tcp hive-control-plane
0d4bf61da0a3 registry:2 "/entrypoint.sh /etcโฆ" 3 hours ago Up 3 hours 0.0.0.0:5000->5000/tcp, :::5000->5000/tcp kind-registry
~/hive$ kubectl cluster-info
Kubernetes control plane is running at https://172.17.0.1:41299
KubeDNS is running at https://172.17.0.1:41299/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Build Hive and push the image to the local registry:
~/hive$ CGO_ENABLED=0 IMG=localhost:5000/hive:latest make docker-dev-push
Deploy Hive to the hive cluster:
~/hive$ IMG=localhost:5000/hive:latest make deploy
Because we are not running on OpenShift, we must also create a secret with certificates for the hiveadmission webhooks:
~/hive$ ./hack/hiveadmission-dev-cert.sh
If the hive cluster is using node image kindest/node:v1.24.0 or later, you will additionally have to run:
~/hive$ ./hack/create-service-account-secrets.sh
because starting in Kubernetes 1.24.0, secrets are no longer automatically generated for service accounts.
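For context, starting with Kubernetes 1.24 a service-account token Secret must be created explicitly, which is what the script above does for Hive's service accounts. A manually created one looks roughly like this (the names here are illustrative):
apiVersion: v1
kind: Secret
metadata:
  name: hive-controllers-token                              # illustrative name
  namespace: hive
  annotations:
    kubernetes.io/service-account.name: hive-controllers    # service account the token is issued for
type: kubernetes.io/service-account-token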
Tip: if it fails, check the kubectl version. The Client and Server versions should be in sync:
~/hive$ kubectl version --short
Client Version: v1.24.0
Kustomize Version: v4.5.4
Server Version: v1.24.0
Checking the Hive pods:
~/hive$ kubectl get pods -n hive
NAME READY STATUS RESTARTS AGE
hive-clustersync-0 1/1 Running 0 26m
hive-controllers-79bbbc7f98-q9pxm 1/1 Running 0 26m
hive-operator-69c4649b96-wmd79 1/1 Running 0 26m
hiveadmission-6697d9df99-jdl4l 1/1 Running 0 26m
hiveadmission-6697d9df99-s9pv9 1/1 Running 0 26m
Managed Cluster - Kind
Open a new terminal.
Export the managed cluster kubeconfig filename (it will be created later):
~$ export KUBECONFIG=/tmp/cluster1.conf
Enter the hive directory:
~$ cd hive
~/hive$
Create the managed cluster:
~/hive$ ./hack/create-kind-cluster.sh cluster1
Creating cluster "cluster1" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-cluster1"
You can now use your cluster with:
kubectl cluster-info --context kind-cluster1
Have a nice day! 👋
The /tmp/cluster1.conf file is now created. Checking the installation:
~/hive$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
267fa20f4a0f kindest/node:v1.24.0 "/usr/local/bin/entrโฆ" 2 minutes ago Up 2 minutes 172.17.0.1:40431->6443/tcp cluster1-control-plane
901f215229a4 kindest/node:v1.24.0 "/usr/local/bin/entrโฆ" About an hour ago Up About an hour 172.17.0.1:41299->6443/tcp hive-control-plane
0d4bf61da0a3 registry:2 "/entrypoint.sh /etcโฆ" 5 hours ago Up 5 hours 0.0.0.0:5000->5000/tcp, :::5000->5000/tcp kind-registry
~/hive$ kubectl cluster-info
Kubernetes control plane is running at https://172.17.0.1:40431
KubeDNS is running at https://172.17.0.1:40431/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Before we install OLM, we have to edit the install scripts to use cluster1. Go into scripts/build_local.sh and replace
if [[ ${#CLUSTERS[@]} == 1 ]]; then
KIND_FLAGS="--name ${CLUSTERS[0]}"
echo 'Use cluster ${CLUSTERS[0]}'
fi
with
KIND_FLAGS="--name cluster1"
Now enter the OLM directory:
~/hive$ cd ../operator-lifecycle-manager/
~/operator-lifecycle-manager$
Install the CRDs and OLM using:
~/operator-lifecycle-manager$ make run-local
OLM pods should be running now:
~/git/operator-lifecycle-manager$ kubectl get pods -n olm
NAME READY STATUS RESTARTS AGE
catalog-operator-54bbdffc6b-hf8rz 1/1 Running 0 87s
olm-operator-6bfbd74fb8-cjl4b 1/1 Running 0 87s
operatorhubio-catalog-gdk2d 1/1 Running 0 48s
packageserver-5d67bbc56b-6vxqb 1/1 Running 0 46s
With cluster1 installed, let’s create a ClusterDeployment in Hive pointing to it.
Export the Hive kubeconfig filename and enter the hive directory (or just switch to the first terminal, the one used to deploy Hive):
~$ export KUBECONFIG=/tmp/hive.conf
~$ cd hive
Attention: hiveutil wants you to have default credentials in ~/.aws/credentials. You can fake them like this:
~/hive$ cat ~/.aws/credentials
[default]
aws_access_key_id = foo
aws_secret_access_key = bar
Because Hive will not provision that cluster, we can use hiveutil to adopt the cluster:
~/hive$ bin/hiveutil create-cluster \
--base-domain=new-installer.openshift.com kind-cluster1 \
--adopt --adopt-admin-kubeconfig=/tmp/cluster1.conf \
--adopt-infra-id=fakeinfra \
--adopt-cluster-id=fakeid
Checking the ClusterDeployment:
~/hive$ kubectl get clusterdeployment
NAME PLATFORM REGION CLUSTERTYPE INSTALLED INFRAID VERSION POWERSTATE AGE
kind-cluster1 aws us-east-1 true infra1 48s
Checking the ClusterDeployment status:
~/hive$ kubectl get clusterdeployment kind-cluster1 -o json | jq .status.conditions
[
...
{
"lastProbeTime": "2021-02-01T14:02:42Z",
"lastTransitionTime": "2021-02-01T14:02:42Z",
"message": "SyncSet apply is successful",
"reason": "SyncSetApplySuccess",
"status": "False",
"type": "SyncSetFailed"
},
{
"lastProbeTime": "2021-02-01T14:02:41Z",
"lastTransitionTime": "2021-02-01T14:02:41Z",
"message": "cluster is reachable",
"reason": "ClusterReachable",
"status": "False",
"type": "Unreachable"
},
...
]
Managed Cluster - CRC
Export the CRC kubeconfig filename to be created:
~$ export KUBECONFIG=/tmp/crc.conf
Log in to your CRC cluster with the kubeadmin user:
~$ oc login -u kubeadmin -p **** https://api.crc.testing:6443
The /tmp/crc.conf file should now contain the kubeconfig for your CRC cluster.
Export the Hive kubeconfig filename and enter the hive directory (or just switch to the first terminal, the one used to deploy Hive):
~$ export KUBECONFIG=/tmp/hive.conf
~$ cd hive
Because Hive will not provision that cluster, we can use hiveutil to adopt it:
~/hive$ bin/hiveutil create-cluster \
--base-domain=crc.openshift.com crc \
--adopt --adopt-admin-kubeconfig=/tmp/crc.conf \
--adopt-infra-id=fakeinfra \
--adopt-cluster-id=fakeid
Checking the ClusterDeployment status:
~/hive$ kubectl get clusterdeployment crc -o json | jq .status.conditions
[
{
"lastProbeTime": "2021-02-02T14:21:02Z",
"lastTransitionTime": "2021-02-02T14:21:02Z",
"message": "cluster is reachable",
"reason": "ClusterReachable",
"status": "False",
"type": "Unreachable"
},
{
"lastProbeTime": "2021-02-02T01:45:19Z",
"lastTransitionTime": "2021-02-02T01:45:19Z",
"message": "SyncSet apply is successful",
"reason": "SyncSetApplySuccess",
"status": "False",
"type": "SyncSetFailed"
}
]
Tip: in case the cluster status is “unreachable”, that’s because Hive runs in a Kubernetes cluster deployed inside a container, and it is trying to access the CRC virtual machine that is controlled by libvirt. You will have to figure out your firewall, but this is what worked for me:
~/hive$ firewall-cmd --permanent --zone=trusted --change-interface=docker0
success
~/hive$ firewall-cmd --reload
success
SelectorSyncSet
Export the Hive kubeconfig:
~$ export KUBECONFIG=/tmp/hive.conf
Create a test SelectorSyncSet. Example:
apiVersion: v1
kind: List
metadata: {}
items:
  - apiVersion: hive.openshift.io/v1
    kind: SelectorSyncSet
    metadata:
      name: cso-test
    spec:
      clusterDeploymentSelector:
        matchLabels:
          api.openshift.com/cso-test: 'true'
      resourceApplyMode: Sync
      resources:
        - apiVersion: v1
          kind: Namespace
          metadata:
            annotations: {}
            labels: {}
            name: cso
Apply it:
~$ kubectl apply -f cso-test.yaml
selectorsyncset.hive.openshift.io/cso-test created
Now edit the ClusterDeployment of a cluster:
~$ kubectl edit clusterdeployment kind-cluster1
Add the label api.openshift.com/cso-test: 'true' to it. Save and exit.
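For reference, the label ends up under the ClusterDeployment's metadata.labels, roughly like this (all other fields omitted):
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: kind-cluster1
  labels:
    api.openshift.com/cso-test: 'true'   # matches the clusterDeploymentSelector of the SelectorSyncSet above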
The cso namespace should now be created in the target cluster:
$ export KUBECONFIG=/tmp/cluster1.conf
$ oc get namespace cso
NAME STATUS AGE
cso Active 81s
Cleanup
To clean up, delete the two clusters and the surrounding clutter:
~/hive$ kind delete cluster --name hive
~/hive$ kind delete cluster --name cluster1
~/hive$ docker rm -f kind-registry
~/hive$ docker network rm kind
2 - Getting Access
2.1 - Getting Backplane Access
Backplane
Backplane is the system used to provide access to the fleet of OpenShift clusters. It creates SSH tunnels and modifies your local ~/.kube/config.
Getting access
- Install ocm CLI
- Follow the instructions here
- Make sure your user is part of the sd-mtsre Rover group.
- Wait for https://gitlab.cee.redhat.com/service/authorizedkeys-builder to sync your SSH key onto the fleet of OpenShift clusters.
- Install backplane CLI or use the PKGBUILD.
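Once everything is installed and your SSH key has synced, a typical session looks roughly like this (the cluster ID and token are placeholders; exact subcommands may differ between backplane CLI versions):
~$ ocm login --url=https://api.openshift.com --token=<your-ocm-token>   # placeholder token
~$ ocm backplane login <cluster-id>                                     # placeholder cluster ID
~$ oc whoami                                                            # confirm you are on the target cluster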
2.2 - Getting OCM API Access
3 - Team Ceremonies
Sprint Duration
The Managed Tenants SRE Team's sprint lasts 3 full weeks.
Backlog Refinement
Every 3 weeks. 1h (max). Before the sprint ends; planned around half a week before the next sprint.
- Scrum Team comes together to refine issues in our backlog and to ensure that important issues are actionable for the next sprint.
- Product Owner owns the backlog and can communicate their needs and wishes to the Dev Team
- This includes:
- ensuring that the Definition of Ready is met for most of our issues
- sorting and prioritizing them
- estimating issue complexity with the team
Sprint Retro
Every 3 weeks. 30-minute (max). Before the sprint ends. Right now this happens right before the planning meeting for the next sprint:
- Scrum Team comes together to fine tune processes and other stuff that they deem important.
- The goals of continuous Retro meetings are:
- to inspect the current way of working
- and adapt it if necessary
- in small steps and an iterative fashion
- incorporate eventual process changes in the next sprint
- Guide that @jgwosdz used for our first retro
- Our retro dashboard with retros and derived action items: https://action.parabol.co/team/3heARr2dbz
Sprint Review
Every 3 weeks. 30-minute (max) meeting, hosted on the same day that the Sprint finishes:
- Scrum Team presents the results of their work to key stakeholders.
Sprint Planning
Every 3 weeks. 30-minute (max) meeting hosted on the same day that the Sprint begins. Sprint Planning addresses the following topics:
- Why is this Sprint valuable?
- What can be Done this Sprint?
- How will the chosen work get done?
The Sprint Goal, the Product Backlog items selected for the Sprint, plus the plan for delivering them are together referred to as the Sprint Backlog.
Scrum Meeting
Every week. 1-hour (max) meeting for the Scrum Team, focused on progress towards the Sprint Goal; it produces an actionable plan for the next week of work.
Each Scrum Team member will describe:
- Last week’s work.
- Plans for the next week.
- Blockers.
Weekly Tenants Sync
Every week. 1-hour (max) meeting for the Scrum Team and the Tenants to make announcements, collect feedback and discuss requirements.
Definition Of Ready/Done
Both definitions have been inlined into our issue template, which lives in a separate 'eversprint'.
This is our Jira project: https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=8694&projectKey=MTSRE&view=planning
4 - Feature Flags in ADO
Currently, some ADO functionality, for example, addons plug and play, is hidden behind a feature flag.
Feature flags are specified in the AddonOperator resource under .spec.featureflags. This field is a string containing a comma-separated list of feature flags to enable. For addons plug and play, and thus the addon package, to be enabled, the string ADDONS_PLUG_AND_PLAY has to be included in this field.
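As a sketch, enabling addons plug and play would then look roughly like the following AddonOperator resource (the apiVersion, resource name and field casing are my assumptions based on ADO's CRDs, not taken from this doc; check the AddonOperator CRD for the exact spelling):
apiVersion: addons.managed.openshift.io/v1alpha1   # assumed API group/version for ADO
kind: AddonOperator
metadata:
  name: addon-operator                             # the AddonOperator is typically a singleton; name is illustrative
spec:
  featureFlags: "ADDONS_PLUG_AND_PLAY"             # comma-separated list, e.g. "ADDONS_PLUG_AND_PLAY,OTHER_FLAG"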
5 - Interrupt Catcher
The Interrupt Catcher is the entry-point for any tenant to get help, ask questions and raise issues. It’s our interface with our Tenants.
The MT-SRE Team member with the Interrupt Catcher responsibility can be reached out via the @mt-sre-ic Slack handle, in the #forum-managed-tenants channel.
Coverage
Each working day has 15 hours of IC “Follow The Sun” coverage:
- APAC: From 4:30 to 9:30 UTC
- EMEA: From 9:30 to 14:30 UTC
- NASA: From 14:30 to 19:30 UTC
Work items generated outside that time-frame will be picked up in the next FTS shift.
PagerDuty Schedule: https://redhat.pagerduty.com/schedules#PM3YCH1
Response Time
| Service Level Indicator (SLI) | SLO Time | MT-SRE “Goal Time” |
|---|---|---|
| MR’s to managed-tenants repositories | 24 FTSH* | 4 FTSH* |
| #forum-managed-tenants Slack messages, misc. support | best effort | 4 FTSH* |
*FTSH: Follow The Sun working Hours
Responsibilities
- Review Merge Requests created by the MT-SRE Tenants on the “Surfaces” repositories (listed in the next section)
- Respond to the alerts in the #sd-mt-sre-alert Slack channel
- Respond to incidents from PagerDuty
- Respond to questions and requests from the MT-SRE tenants in the #forum-managed-tenants Slack channel
- Engage on incidents with Addons that are on-boarding to the MT-SRE
- Handover the outstanding work to the next IC
Surfaces
- Slack channels:
#sd-mt-sre-info
#mt-cs-sre-teamchat
#mt-cs-sre-teamhandover
#forum-managed-tenants
- Mailing lists:
- Git Repositories:
- https://gitlab.cee.redhat.com/service/managed-tenants-bundles (not automated)
- https://gitlab.cee.redhat.com/service/managed-tenants (tenants-related MRs, partially automated)
- https://gitlab.cee.redhat.com/service/managed-tenants-manifests (fully automated)
- https://gitlab.cee.redhat.com/service/managed-tenants-sops (tenants-related MRs, partially automated)
6 - Incident Management
Preparedness for major incidents is crucial. We have established the following Incident Management processes to ensure SREs can follow predetermined procedures:
Coverage
Layered Products SRE (LPSRE) provides 24x7 coverage and support.
If you need to escalate an incident, please refer to the Layered Products SRE Escalation Procedure.
NOTE: Only escalate an incident if the standard manual notification process using an OHSS ticket has failed.