Initial commit
imaustink committed Jun 28, 2024
0 parents commit a26267b
Showing 26 changed files with 1,760 additions and 0 deletions.
84 changes: 84 additions & 0 deletions .github/workflows/ci.yaml
@@ -0,0 +1,84 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
  release:
    types: [published]

jobs:
  lint:
    name: lint
    runs-on: ubuntu-latest
    steps:
      - name: Set up Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.22.3'
      - name: Check out code
        uses: actions/checkout@v3
      - name: Check formatting
        run: |
          test -z "$(gofmt -l .)"
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22.3'
      - name: Install dependencies
        run: go get ./src
      - name: Build
        run: go build -o ./dist/bin ./src
      - name: Test with the Go CLI
        run: go test ./src

  build-and-publish-image:
    runs-on: ubuntu-latest
    needs:
      - lint
      - build-and-test

    steps:
      - uses: actions/checkout@v3
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Tag the image
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: |
            bitovi/temporal-cloud-metrics-to-k8s
          tags: |
            type=raw,value=latest,enable=${{ github.ref_name == 'main' }}
            type=semver,pattern={{version}},enable=${{ github.event_name == 'release' }}
      -
        name: Login to Docker Hub
        uses: docker/login-action@v2
        if: github.event_name != 'pull_request'
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      -
        name: Build Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          tags: ${{ steps.meta.outputs.tags }}
      -
        name: Push Docker image
        uses: docker/build-push-action@v4
        if: ${{ (github.ref_name == 'main') || (github.event_name == 'release') }}
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          tags: ${{ steps.meta.outputs.tags }}
          push: true
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
certs
config.yaml
.DS_Store
13 changes: 13 additions & 0 deletions Dockerfile
@@ -0,0 +1,13 @@
FROM golang:1.22.3

WORKDIR /app

COPY go.mod go.sum ./

RUN go mod download

COPY src/*.go ./

RUN CGO_ENABLED=0 GOOS=linux go build -o ./temporal-cloud-metrics-adapter

CMD ["./temporal-cloud-metrics-adapter"]
21 changes: 21 additions & 0 deletions LISCENCE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Bitovi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
209 changes: 209 additions & 0 deletions README.md
@@ -0,0 +1,209 @@
# Temporal Cloud Metrics to Kubernetes

Bring Temporal Cloud Metrics into your Kubernetes cluster to inform autoscaling of your workers.

![Metrics Dashboard Demo](./img/metrics-dashboard.jpg)

## Setup

### Prerequisites

1. A [Temporal Cloud account](https://temporal.io/)
    - [An mTLS certificate provisioned](https://docs.temporal.io/cloud/certificates)
    - [The metrics endpoint enabled](https://docs.temporal.io/production-deployment/cloud/metrics/general-setup)
2. A [Kubernetes](https://kubernetes.io/) compliant cluster (also tested on [K3s](https://k3s.io/) and [minikube](https://minikube.sigs.k8s.io/))
3. The [Helm](https://helm.sh/docs/intro/install/) CLI

### Step 1: Copy mTLS Certificate

We need the client mTLS certificate for our Temporal Cloud namespace so that we can load it into our cluster for use in the metrics adapter and worker.

1. Copy the certificate into `./certs/client.crt`
2. Copy the key into `./certs/client.key`

### Step 2: Configuration

A YAML config file is used to define the connection parameters and the specific metrics you'd like to pull into Kubernetes from Temporal Cloud.

There is an example configuration in [`./sample-config.yaml`](./sample-config.yaml). Copy it to `config.yaml` and make your changes there. The Helm chart will use this path by default.

__Considerations__

Autoscaling in Kubernetes is triggered when a target metric, such as CPU usage, memory usage, or request count, rises above a designated threshold. Therefore, it is important that the metrics we calculate are positive numbers that increase when the system is under some kind of stress.

The queries in the included example configuration were derived from queries associated with Temporal best practices, but they have been modified to align with these requirements. Let's see an example.

__Before__

```
sum by(temporal_namespace) (
  rate(
    temporal_cloud_v0_poll_success_sync_count{}[1m]
  )
)
-
sum by(temporal_namespace) (
  rate(
    temporal_cloud_v0_poll_success_count{}[1m]
  )
)
```

__After__

We've made three important changes here: (1) we've swapped the two underlying metrics so that the result is positive and increases as the Sync Match Rate falls, (2) we use `clamp_min` to set a lower bound of zero, and (3) we default the resulting value to zero in the event no data points are available within the specified time window.

```
sum(
  clamp_min(
    (
      sum by(temporal_namespace) (
        rate(
          temporal_cloud_v0_poll_success_count{}[1m]
        )
      )
      -
      sum by(temporal_namespace) (
        rate(
          temporal_cloud_v0_poll_success_sync_count{}[1m]
        )
      )
    ),
    0
  )
) or vector(0)
```
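
To make the arithmetic of the modified query concrete, here is a minimal Go sketch of the same calculation. It is an illustration only, not code from the adapter: the `namespaceRates` type and the `scalingSignal` function are hypothetical names, and the inputs stand in for the per-second rates that `rate(...[1m])` would return for each Temporal namespace.

```go
package main

import (
	"fmt"
	"math"
)

// namespaceRates holds the two per-second rates for one Temporal namespace.
// The type and field names are hypothetical and used only for illustration.
type namespaceRates struct {
	PollSuccess     float64 // stands in for rate(temporal_cloud_v0_poll_success_count[1m])
	PollSuccessSync float64 // stands in for rate(temporal_cloud_v0_poll_success_sync_count[1m])
}

// scalingSignal mirrors the "After" query: subtract the sync rate from the
// total rate, clamp each difference at zero (clamp_min), and sum the results.
// An empty input yields 0, which plays the role of "or vector(0)".
func scalingSignal(rates []namespaceRates) float64 {
	total := 0.0
	for _, r := range rates {
		total += math.Max(r.PollSuccess-r.PollSuccessSync, 0)
	}
	return total
}

func main() {
	// All polls were sync matched: no pressure, so the signal is zero.
	fmt.Println(scalingSignal([]namespaceRates{{PollSuccess: 10, PollSuccessSync: 10}}))
	// Six polls per second were not sync matched: the signal rises.
	fmt.Println(scalingSignal([]namespaceRates{{PollSuccess: 10, PollSuccessSync: 4}}))
	// No data points in the window: the signal defaults to zero.
	fmt.Println(scalingSignal(nil))
}
```

The value only moves away from zero when some polls are not sync matched, which is the property the HPA needs from a scaling metric.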

### Step 3: HPA

The HPA (Horizontal Pod Autoscaler) defines the desired scaling behavior and bounds, and manages our deployment replicas accordingly.

There is a complete example HPA in [`./chart/templates/hpa.yaml`](./chart/templates/hpa.yaml). You may use it as is or adjust it to fit your needs before installing the Helm chart.

### Step 4: Install

__Install with Existing worker__

This allows you to set up autoscaling on an existing deployment.

```bash
helm install temporal-cloud-metrics-adapter ./chart --wait \
--namespace staging \
--set-file=temporal.tls.cert=certs/client.crt \
--set-file=temporal.tls.key=certs/client.key \
--set-file=adapter.config=config.yaml \
--set temporal.namespace=xyz.123 \
--set worker.deployment=temporal-workers
```

__Install with Demo worker__

This is intended for testing and demos and should never be used in a production environment.

```bash
helm install temporal-cloud-metrics-adapter ./chart --wait \
--namespace staging --create-namespace \
--set-file=temporal.tls.cert=certs/client.crt \
--set-file=temporal.tls.key=certs/client.key \
--set-file=adapter.config=config.yaml \
--set temporal.namespace=xyz.123 \
--set temporal.address=xyz.123.tmprl.cloud:7233 \
--set worker.demo=true
```

__Uninstall__

```bash
helm uninstall -n staging temporal-cloud-metrics-adapter
```

__Helm Values__

| Option | Type | Example Value | Description |
|---------------------------|---------|--------------------------------------|-----------------------------------------------------|
| temporal.tls.cert | File | `certs/client.crt` | Path to the client certificate file |
| temporal.tls.key | File | `certs/client.key` | Path to the client key file |
| temporal.namespace | String | `xyz.123` | The target Temporal Cloud namespace |
| temporal.address | String | `xyz.123.tmprl.cloud:7233` | Address of the Temporal Cloud instance |
| adapter.config            | File    | `./config.yaml`                      | The file path for the configuration for the adapter |
| worker.deployment | String | `temporal-worker` | Name of existing Temporal worker deployment |
| worker.demo | Boolean | `true` or `false` | Flag to determine whether to deploy a demo worker |

### Demo

This repo includes a script to create a burst of workflows to simulate load.

```bash
# Start 50 demo workflows
TEMPORAL_ADDRESS=xyz.123.tmprl.cloud:7233 \
TEMPORAL_NAMESPACE=xyz.123 \
./scripts/execute-demo-workflows 50
```
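
While the demo load runs, it can help to read the metric value the HPA sees. The Go sketch below only builds the request path for the Kubernetes external metrics API. It assumes the adapter serves the standard `external.metrics.k8s.io/v1beta1` API registered by the chart, and that your `config.yaml` defines a metric named `temporal_cloud_sync_match_rate` (the name used in the HPA example in this README). The `externalMetricPath` helper is a hypothetical name introduced for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// externalMetricPath builds the external metrics API path for one metric in
// one Kubernetes namespace. The helper name is hypothetical; the path layout
// follows the external.metrics.k8s.io/v1beta1 API.
func externalMetricPath(k8sNamespace, metricName, labelSelector string) string {
	path := fmt.Sprintf(
		"/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s",
		url.PathEscape(k8sNamespace),
		url.PathEscape(metricName),
	)
	if labelSelector == "" {
		return path
	}
	// The selector narrows the result to one Temporal Cloud namespace.
	return path + "?labelSelector=" + url.QueryEscape(labelSelector)
}

func main() {
	// "staging" and "xyz.123" are the placeholder values used elsewhere in this README.
	fmt.Println(externalMetricPath(
		"staging",
		"temporal_cloud_sync_match_rate",
		"temporal_namespace=xyz.123",
	))
}
```

The printed path can be passed to `kubectl get --raw` to inspect the current value, provided the metric name matches one defined in your configuration.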

## Metric Granularity

Temporal Cloud metrics do not include labels that indicate which Workflow they are associated with. Depending on your architecture, you might need to divide your workers across unique namespaces to obtain metrics for specific Workflows.

## Tuning Scaling Behavior

__HPA Polling Interval__

By default, the `HorizontalPodAutoscaler` fetches metrics every 15 seconds. This can be configured by setting the `--horizontal-pod-autoscaler-sync-period` flag on the [kube-controller-manager](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/).

_Note: The `--horizontal-pod-autoscaler-sync-period` flag is not currently supported in K3s._

__Adjust Metrics Time Window__

You can also adjust the timescale used in the query for the Temporal Cloud metrics. To do this, change the time window specified in the queries in the [adapter configuration file](./chart/templates/configuration.yaml).

Currently, the time window is set to `1m` (1 minute). This can be reduced to slightly improve the responsiveness of the scaling behavior. Be cautious about going below `45s` (45 seconds) for systems with relatively low throughput, as it can result in dead zones in the resulting metrics.

__Adjust HPA Behavior__

You can adjust how quickly the cluster scales our workers up and down.

```yaml
metrics:
  - type: External
    external:
      metric:
        # The name of the metric to watch
        name: temporal_cloud_sync_match_rate
        selector:
          matchLabels:
            # Match a particular Temporal Cloud namespace
            temporal_namespace: xyz.123
      target:
        type: Value
        # Scale up when the target metric exceeds 50 milli-units (0.05)
        value: 50m
behavior:
  scaleUp:
    # The highest value in the last 10 seconds will be used to determine the need to scale up
    stabilizationWindowSeconds: 10
    selectPolicy: Max
    policies:
      # Scale up by at most 5 pods every 10 seconds while the target threshold is exceeded
      - type: Pods
        value: 5
        periodSeconds: 10
  scaleDown:
    # The highest value in the last 60 seconds will be used to determine the need to scale down
    stabilizationWindowSeconds: 60
    selectPolicy: Max
    policies:
      # Scale down by at most 3 pods every 30 seconds while the metric is below the target threshold
      - type: Pods
        value: 3
        periodSeconds: 30
```

You can find a complete example in this [manifest](./chart/templates/hpa.yaml). For more detailed information on the HorizontalPodAutoscaler, refer to the official [HPA documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/).
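
To get a feel for how a `target` value translates into replicas, the Go sketch below applies a simplified form of the formula from the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. It is an approximation under stated assumptions: it ignores the HPA's tolerance band, pod readiness, stabilization windows, and the `policies` rate limits shown above. The function name and the replica bounds are hypothetical. Values are in milli-units, so a target of `50m` is passed as `50`.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the simplified HPA formula and clamps the result
// to the [minReplicas, maxReplicas] range. Metric values are in milli-units.
// This is a sketch for building intuition, not the controller's actual code.
func desiredReplicas(currentReplicas int32, metricMilli, targetMilli int64, minReplicas, maxReplicas int32) int32 {
	ratio := float64(metricMilli) / float64(targetMilli)
	desired := int32(math.Ceil(ratio * float64(currentReplicas)))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// Metric at 200m against a 50m target: four times over, so 4 replicas become 16.
	fmt.Println(desiredReplicas(4, 200, 50, 1, 20))
	// Metric at 120m against a 50m target: 2 replicas become ceil(4.8) = 5.
	fmt.Println(desiredReplicas(2, 120, 50, 1, 20))
	// Metric at zero: the result falls to the configured minimum.
	fmt.Println(desiredReplicas(4, 0, 50, 1, 20))
}
```

In practice the `scaleUp` and `scaleDown` policies cap how fast the replica count moves toward this number, so treat the result as the direction and rough size of the change rather than the next observed value.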

## Scaling to Zero

In some use cases, you might want your application to scale completely down to zero. This can be achieved by configuring the [`HorizontalPodAutoscaler`](./chart/templates/hpa.yaml).

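To scale to zero, set `minReplicas` to `0`. Kubernetes only accepts this value when the `HPAScaleToZero` feature gate is enabled and the HPA uses at least one Object or External metric. The cluster will then scale down to zero when the targeted metrics fall below the defined threshold.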

_Note: Scaling to zero may cause a delay in processing new tasks, as it can take time for metrics to propagate to the cluster._
23 changes: 23 additions & 0 deletions chart/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
9 changes: 9 additions & 0 deletions chart/Chart.yaml
@@ -0,0 +1,9 @@
apiVersion: v2
name: temporal-cloud-metrics-to-k8s
description: A Helm chart to enable access to metrics from Temporal Cloud within your cluster.

type: application

version: 0.1.0

appVersion: "0.1.0"
13 changes: 13 additions & 0 deletions chart/templates/api-service.yaml
@@ -0,0 +1,13 @@
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  service:
    name: temporal-cloud-metrics-adapter
    namespace: {{ .Release.Namespace }}
  group: external.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
8 changes: 8 additions & 0 deletions chart/templates/configuration.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-configuration
  namespace: {{ .Release.Namespace }}
data:
  config.yaml: |
{{ .Values.adapter.config | nindent 4 }}