
Completion of a restore triggers secrets refresh #1134

Merged: 26 commits from feature/refresh-secrets into k8ssandra:main, Jan 9, 2024

Conversation

Miles-Garnsey
Member

What this PR does:

Adds an annotation to the user secrets when a DC is restored. Cass operator should then pick this up and cause a refresh of user credentials on the Cassandra side.

Which issue(s) this PR fixes:
Fixes #1080

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner December 11, 2023 01:49

codecov bot commented Dec 11, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison: base (922b88c) 56.84% vs. head (905a912) 57.09%.
Report is 4 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1134      +/-   ##
==========================================
+ Coverage   56.84%   57.09%   +0.24%     
==========================================
  Files         100      101       +1     
  Lines       10352    10401      +49     
==========================================
+ Hits         5885     5938      +53     
- Misses       3949     3950       +1     
+ Partials      518      513       -5     
Files                                               Coverage Δ
controllers/medusa/medusarestorejob_controller.go   58.33% <75.00%> (+0.99%) ⬆️
pkg/medusa/refresh_secrets.go                       73.80% <73.80%> (ø)

... and 6 files with indirect coverage changes

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch 4 times, most recently from 0db7e27 to 480ca8d Compare December 12, 2023 04:53
@Miles-Garnsey
Member Author

Miles-Garnsey commented Dec 12, 2023

The failing integration tests here look like a flake to me. I'm re-running locally, and a different one appears to fail each time.

(It passes locally...)

Contributor

@adejanovski adejanovski left a comment


Todo: We need a changelog entry.

Aside from the e2e test fix, everything works as expected 👍

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch from b0d5853 to 12f6ae0 Compare December 13, 2023 03:05
@Miles-Garnsey
Member Author

Totally stumped on why CreateMultiMedusaJob is failing. I can see a secret exists with the expected name, but it seems that RefreshSecrets isn't getting called (or at least it isn't producing any of the expected log output). The line immediately before does produce "The restore operation is complete", so I'm thinking maybe I've just broken the logging.

I haven't been able to get this to run locally either, so I'm stuck using println statements to get output out of GHA. Not the best developer experience in the world. I'll keep waiting for the test to finish.

@adejanovski
Contributor

> Totally stumped on why CreateMultiMedusaJob is failing. I can see a secret exists with the expected name, but it seems that RefreshSecrets isn't getting called (or at least it isn't producing any of the expected log output). But the line immediately before does produce The restore operation is complete so I'm thinking maybe I've just broken the logging.
>
> I haven't been able to get this to run locally either, so I'm stuck using println statements to get output out of GHA. Not the best developer experience in the world. I'll keep waiting for the test to finish.

Looking at the dumped artifacts, we don't see the annotation being placed on the superuser secrets:

- apiVersion: v1
  data:
    password: RThrTmFuRE5tUFJieXM2c1lrZGo=
    username: dGVzdC1zdXBlcnVzZXI=
  kind: Secret
  metadata:
    annotations:
      k8ssandra.io/resource-hash: dYW2TwnnvXE1bFKIWTCM8HBhJH5cI1CGKqL5AxoGAlU=
    creationTimestamp: "2023-12-13T04:35:27Z"
    labels:
      k8ssandra.io/cluster-name: test
      k8ssandra.io/cluster-namespace: multi-dc-encryption-medusa-rqhscw
      k8ssandra.io/replicated-by: k8ssandracluster-controller
    name: test-superuser
    namespace: multi-dc-encryption-medusa-rqhscw
    resourceVersion: "1516"
    uid: fd7c250e-637b-450b-b56d-fe47edc40755
  type: Opaque

The error seems legit.

@Miles-Garnsey
Member Author

Miles-Garnsey commented Dec 14, 2023

I've tried to replicate the test failure locally in various ways. I wasn't able to get two DCs working in different clusters, but I was able to get two working in two different namespaces. When I run the restores I see the following results in the logs:

Restore complete for DC v1.ObjectMeta{Name:"dc1", GenerateName:"", Namespace:"k8ssandra-operator", SelfLink:"", UID:"2c9800ec-cb67-4500-a6ad-449943a131f6", ResourceVersion:"4388", Generation:4, CreationTimestamp:time.Date(2023, time.December, 14, 1, 58, 33, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/component":"cassandra", "app.kubernetes.io/name":"k8ssandra-operator", "app.kubernetes.io/part-of":"k8ssandra", "k8ssandra.io/cleaned-up-by":"k8ssandracluster-controller", "k8ssandra.io/cluster-name":"test", "k8ssandra.io/cluster-namespace":"k8ssandra-operator"}, Annotations:map[string]string{"k8ssandra.io/resource-hash":"QQjIAisNZPhBCWJ5tMc/DGRrDDXj9dM5psNEcDiK7vU="}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string{"finalizer.cassandra.datastax.com"}, ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"manager", Operation:"Update", APIVersion:"cassandra.datastax.com/v1beta1", Time:time.Date(2023, time.December, 14, 2, 10, 42, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0013a1410), Subresource:""}}}, Refreshing secrets

...

Restore complete for DC v1.ObjectMeta{Name:"dc2", GenerateName:"", Namespace:"dc2", SelfLink:"", UID:"a01aae28-a95a-45ab-92e8-e82edd51b663", ResourceVersion:"4487", Generation:4, CreationTimestamp:time.Date(2023, time.December, 14, 2, 1, 55, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/component":"cassandra", "app.kubernetes.io/name":"k8ssandra-operator", "app.kubernetes.io/part-of":"k8ssandra", "k8ssandra.io/cleaned-up-by":"k8ssandracluster-controller", "k8ssandra.io/cluster-name":"test", "k8ssandra.io/cluster-namespace":"k8ssandra-operator"}, Annotations:map[string]string{"cassandra.datastax.com/skip-user-creation":"true", "k8ssandra.io/resource-hash":"BLRvijufjPloop/zaQwxfvs6LhHSEHXRg3uPyuCKANw="}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string{"finalizer.cassandra.datastax.com"}, ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"manager", Operation:"Update", APIVersion:"cassandra.datastax.com/v1beta1", Time:time.Date(2023, time.December, 14, 2, 10, 42, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0009fe7e0), Subresource:""}}}, Refreshing secrets

The superuser secrets have the refresh tokens in both namespaces.

So I think the problem here is something to do with using the wrong client in my RefreshSecrets function. Right now I'm passing in the client from the MedusaRestoreJobReconciler, but maybe that's wrong. Either way, it doesn't seem to work in the multi-cluster context. I don't understand what the problem is, though, because we should have k8ssandra-operator instances working independently in both clusters, where they should both be completing this reconciliation.

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch 2 times, most recently from cf0755f to 8cfde04 Compare December 18, 2023 04:13
@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch from 8cfde04 to d4c5188 Compare January 4, 2024 00:06
@Miles-Garnsey
Member Author

I've found a small problem with this. When we add the refresh annotation, we add the current time. The way reconciliation works in all other cases is that we:

  1. Create the desired object.
  2. Add the desired object's resource hash.
  3. Check whether the actual object on the server has a matching resource hash.
  4. If not, create/update it, then re-queue. On the requeue the resource hashes match, and the reconciler gets back a Done result at that point.

Normally, all of the requeues are delegated back up to the top-level reconciler function via the returned result type. But because we always add the current time, taking that approach here would produce a constantly changing desired object, and reconciliation would never progress.

I've modified the logic so we have an inner retry loop in the RefreshSecrets function, but it breaks with our usual design, which I'm not particularly overjoyed about.

@Miles-Garnsey
Member Author

I'm a bit blocked on this, @adejanovski. Now that the correct images are being used in the e2e tests, I've gone through and resolved a number of other issues that came up (especially around the way requeues are handled), but I'm still seeing the tests fail.

In particular, the following log line comes up pretty regularly, do you have any ideas off the top of your head what would cause it?

2024-01-04T04:41:14.964Z	ERROR	Reconciler error	{"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"test-restore","namespace":"single-dc-dse-medusa-frz5kn"}, "namespace": "single-dc-dse-medusa-frz5kn", "name": "test-restore", "reconcileID": "046210fe-1882-478e-9cb2-6252d27ce8ff", "error": "Operation cannot be fulfilled on secrets \"firstcluster-medusa\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235

I thought that refreshing the operator's view of the secret on every retry (this bit) would be sufficient to overcome that issue, but it looks like it is still occurring. Any ideas on what I might be missing?

@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch 2 times, most recently from ec3db10 to ab5b96e Compare January 8, 2024 03:33
@Miles-Garnsey Miles-Garnsey force-pushed the feature/refresh-secrets branch from ab5b96e to 2ba393e Compare January 8, 2024 03:42

sonarqubecloud bot commented Jan 8, 2024

Quality Gate failed

Failed conditions

5.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

Comment on lines +71 to +76
version        = "dev"
commit         = "n/a"
date           = "n/a"
versionMessage = "#######################" +
	fmt.Sprintf("#### version %s commit %s date %s ####", version, commit, date) +
	"#######################"
Contributor

nit: this shouldn't be part of this PR I guess.

@adejanovski adejanovski merged commit b02da26 into k8ssandra:main Jan 9, 2024
60 of 61 checks passed
Successfully merging this pull request may close these issues.

Refresh the users secrets after a remote restore to recreate them in the database