
Completion of a restore triggers secrets refresh #1134

Merged: 26 commits from feature/refresh-secrets into k8ssandra:main, Jan 9, 2024

Conversation

Miles-Garnsey
Member

What this PR does:

Adds an annotation to the user secrets when a DC is restored. Cass operator should then pick this up and cause a refresh of user credentials on the Cassandra side.

Which issue(s) this PR fixes:
Fixes #1080

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner December 11, 2023 01:49

codecov bot commented Dec 11, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison: base (922b88c) 56.84% vs. head (905a912) 57.09%.
Report is 4 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1134      +/-   ##
==========================================
+ Coverage   56.84%   57.09%   +0.24%     
==========================================
  Files         100      101       +1     
  Lines       10352    10401      +49     
==========================================
+ Hits         5885     5938      +53     
- Misses       3949     3950       +1     
+ Partials      518      513       -5     
Files                                               Coverage Δ
controllers/medusa/medusarestorejob_controller.go   58.33% <75.00%> (+0.99%) ⬆️
pkg/medusa/refresh_secrets.go                       73.80% <73.80%> (ø)

... and 6 files with indirect coverage changes

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch 4 times, most recently from 0db7e27 to 480ca8d Compare December 12, 2023 04:53
@Miles-Garnsey
Member Author

Miles-Garnsey commented Dec 12, 2023

The failing integration tests here look like a flake to me. I'm re-running locally, and a different one appears to fail each time.

(It passes locally...)

Contributor

@adejanovski adejanovski left a comment


Todo: We need a changelog entry.

Aside from the e2e test fix, everything works as expected 👍

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch from b0d5853 to 12f6ae0 Compare December 13, 2023 03:05
@Miles-Garnsey
Member Author

Totally stumped on why CreateMultiMedusaJob is failing. I can see a secret exists with the expected name, but it seems that RefreshSecrets isn't getting called (or at least it isn't producing any of the expected log output). The line immediately before does produce "The restore operation is complete", so I'm thinking maybe I've just broken the logging.

I haven't been able to get this to run locally either, so I'm stuck using println statements to get output out of GHA. Not the best developer experience in the world. I'll keep waiting for the test to finish.

@adejanovski
Contributor

> Totally stumped on why CreateMultiMedusaJob is failing. I can see a secret exists with the expected name, but it seems that RefreshSecrets isn't getting called (or at least it isn't producing any of the expected log output). But the line immediately before does produce The restore operation is complete so I'm thinking maybe I've just broken the logging.
>
> I haven't been able to get this to run locally either, so I'm stuck using println statements to get output out of GHA. Not the best developer experience in the world. I'll keep waiting for the test to finish.

Looking at the dumped artifacts, we don't see the annotation being placed on the superuser secrets:

- apiVersion: v1
  data:
    password: RThrTmFuRE5tUFJieXM2c1lrZGo=
    username: dGVzdC1zdXBlcnVzZXI=
  kind: Secret
  metadata:
    annotations:
      k8ssandra.io/resource-hash: dYW2TwnnvXE1bFKIWTCM8HBhJH5cI1CGKqL5AxoGAlU=
    creationTimestamp: "2023-12-13T04:35:27Z"
    labels:
      k8ssandra.io/cluster-name: test
      k8ssandra.io/cluster-namespace: multi-dc-encryption-medusa-rqhscw
      k8ssandra.io/replicated-by: k8ssandracluster-controller
    name: test-superuser
    namespace: multi-dc-encryption-medusa-rqhscw
    resourceVersion: "1516"
    uid: fd7c250e-637b-450b-b56d-fe47edc40755
  type: Opaque

The error seems legit.

@Miles-Garnsey
Member Author

Miles-Garnsey commented Dec 14, 2023

I've tried to replicate the test failure locally in various ways. I wasn't able to get two DCs working in different clusters, but I was able to get two working in two different namespaces. When I run the restores I see the following results in the logs:

Restore complete for DC v1.ObjectMeta{Name:"dc1", GenerateName:"", Namespace:"k8ssandra-operator", SelfLink:"", UID:"2c9800ec-cb67-4500-a6ad-449943a131f6", ResourceVersion:"4388", Generation:4, CreationTimestamp:time.Date(2023, time.December, 14, 1, 58, 33, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/component":"cassandra", "app.kubernetes.io/name":"k8ssandra-operator", "app.kubernetes.io/part-of":"k8ssandra", "k8ssandra.io/cleaned-up-by":"k8ssandracluster-controller", "k8ssandra.io/cluster-name":"test", "k8ssandra.io/cluster-namespace":"k8ssandra-operator"}, Annotations:map[string]string{"k8ssandra.io/resource-hash":"QQjIAisNZPhBCWJ5tMc/DGRrDDXj9dM5psNEcDiK7vU="}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string{"finalizer.cassandra.datastax.com"}, ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"manager", Operation:"Update", APIVersion:"cassandra.datastax.com/v1beta1", Time:time.Date(2023, time.December, 14, 2, 10, 42, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0013a1410), Subresource:""}}}, Refreshing secrets

...

Restore complete for DC v1.ObjectMeta{Name:"dc2", GenerateName:"", Namespace:"dc2", SelfLink:"", UID:"a01aae28-a95a-45ab-92e8-e82edd51b663", ResourceVersion:"4487", Generation:4, CreationTimestamp:time.Date(2023, time.December, 14, 2, 1, 55, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/component":"cassandra", "app.kubernetes.io/name":"k8ssandra-operator", "app.kubernetes.io/part-of":"k8ssandra", "k8ssandra.io/cleaned-up-by":"k8ssandracluster-controller", "k8ssandra.io/cluster-name":"test", "k8ssandra.io/cluster-namespace":"k8ssandra-operator"}, Annotations:map[string]string{"cassandra.datastax.com/skip-user-creation":"true", "k8ssandra.io/resource-hash":"BLRvijufjPloop/zaQwxfvs6LhHSEHXRg3uPyuCKANw="}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string{"finalizer.cassandra.datastax.com"}, ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"manager", Operation:"Update", APIVersion:"cassandra.datastax.com/v1beta1", Time:time.Date(2023, time.December, 14, 2, 10, 42, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0009fe7e0), Subresource:""}}}, Refreshing secrets

The superuser secrets have the refresh tokens in both namespaces.

So I think the problem here is something to do with using the wrong client in my RefreshSecrets function. Right now I'm passing in the client from the MedusaRestoreJobReconciler, but maybe that's wrong. Either way, it doesn't seem to work in the multi-cluster context. I don't understand what the problem is, though, because we should have k8ssandra-operator instances working independently in both clusters, where they should both be completing this reconciliation.

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch 2 times, most recently from cf0755f to 8cfde04 Compare December 18, 2023 04:13
@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch from 8cfde04 to d4c5188 Compare January 4, 2024 00:06
@Miles-Garnsey
Member Author

I've found a small problem with this. When we add the refresh annotation, we add the current time. The way reconciliation works in all other cases is that we:

  1. Create the desired object.
  2. Add the desired object's resource hash.
  3. Check whether the actual object on the server has a matching resource hash.
  4. If not, create/update it, then re-queue. On the requeue the resource hashes match, and the reconciler gets back a Done result at that point.

Normally, all of the requeues are delegated back up to the top-level reconciler function via the returned result type. But because we always add the current time, taking that approach here would produce a constantly changing desired object, and reconciliation would never progress.

I've modified the logic so we have an inner retry loop in the RefreshSecrets function, but it breaks with our usual design, which I'm not particularly overjoyed about.

@Miles-Garnsey
Member Author

I'm a bit blocked on this, @adejanovski. Now that the correct images are being used in the e2e tests, I've gone through and resolved a number of other issues that came up (especially around the way requeues are handled), but I'm still seeing the tests fail.

In particular, the following log line comes up pretty regularly, do you have any ideas off the top of your head what would cause it?

2024-01-04T04:41:14.964Z	ERROR	Reconciler error	{"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"test-restore","namespace":"single-dc-dse-medusa-frz5kn"}, "namespace": "single-dc-dse-medusa-frz5kn", "name": "test-restore", "reconcileID": "046210fe-1882-478e-9cb2-6252d27ce8ff", "error": "Operation cannot be fulfilled on secrets \"firstcluster-medusa\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235

I thought that refreshing the operator's view of the secret on every retry (this bit) would be sufficient to overcome that issue, but it looks like it is still occurring. Any ideas on what I might be missing?

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch 2 times, most recently from ec3db10 to ab5b96e Compare January 8, 2024 03:33
@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch from ab5b96e to 2ba393e Compare January 8, 2024 03:42

sonarqubecloud bot commented Jan 8, 2024

Quality Gate failed

Failed conditions

5.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

Comment on lines +71 to +76
version        = "dev"
commit         = "n/a"
date           = "n/a"
versionMessage = "#######################" +
	fmt.Sprintf("#### version %s commit %s date %s ####", version, commit, date) +
	"#######################"
Contributor

nit: this shouldn't be part of this PR I guess.

@adejanovski adejanovski merged commit b02da26 into k8ssandra:main Jan 9, 2024
60 of 61 checks passed
Successfully merging this pull request may close these issues.

Refresh the users secrets after a remote restore to recreate them in the database