Update deployment and behavior in ExecutionInfo #6798

ShahabT · 2024-11-11T07:54:40Z

What changed?

Update the versioning info of workflows based on information received in WF and Activity task events.

Why?

How did you test it?

Potential risks

Documentation

Is hotfix candidate?

carlydf

need to look at the test files still, submitting to save progress / share feedback while I help with a canary issue in the release channel

carlydf · 2024-11-12T21:07:47Z

proto/internal/temporal/server/api/persistence/v1/executions.proto

+        temporal.api.common.v1.WorkerDeployment deployment = 2;
+        // Manual override for execution's versioning behavior. Takes precedence over `behavior`.
+        temporal.api.enums.v1.VersioningBehavior behavior_override = 3;
+        // Used to manually pin the execution to a deployment. Must be set when if and only if


Suggested change

// Used to manually pin the execution to a deployment. Must be set when if and only if

// Used to manually pin the execution to a deployment. Must be set if and only if

carlydf · 2024-11-12T21:30:01Z

proto/internal/temporal/server/api/persistence/v1/executions.proto

+    // If this activity was attempted to start during an ongoing redirect of the workflow, we set
+    // this flag so we remember after completion of the activity to reschedule the dropped task.
+    // History rejects the start of activities when a workflow is redirecting to a different
+    // deployment. Those rejected starts will cause the task to be dropped by Matching.


When you're talking about the "dropped task" could you specify that it's a dropped Activity Task? I feel like it doesn't hurt to be extra clear.

Also, what do you think about calling the field has_dropped_task since it's a boolean and not the task itself?

dropped_task is fine too, I'm just wondering if has_dropped_task would be more clear

I ended up deleting this. We have to reschedule all the non-started activities when the redirect successfully completes anyways. There is not much use to this.

carlydf

Looks good!

The only major comment I had is about how to handle conflict resolution when there is an ongoing redirect due to a manual override, and another redirect comes in because the poller's deployment/behavior is different than the override deployment/behavior. In that case, we need to reject the poller-initiated redirect.

Now that I think about it, I can handle this as part of the move PR instead, since it will hopefully just involve adding an if-statement. It makes more sense to handle this case in the same PR that adds the override.
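
For illustration, the conflict rule described above could amount to a single guard. This is a minimal sketch under assumptions, not code from this PR: the `deployment` and `versioningState` types and the `acceptPollerRedirect` function are hypothetical stand-ins, and the real check would operate on mutable state and the persistence protos.

```go
package main

import "fmt"

// deployment is a hypothetical stand-in for commonpb.WorkerDeployment.
type deployment struct {
	DeploymentName string
	BuildId        string
}

// versioningState is a hypothetical, simplified view of a workflow's versioning info.
type versioningState struct {
	// overrideDeployment is non-nil while a manual override pins the workflow.
	overrideDeployment *deployment
}

// acceptPollerRedirect reports whether a poller-initiated redirect should be accepted.
// Assumption: while a manual override is in effect, only a redirect that targets the
// override deployment is allowed; any other poller-initiated redirect is rejected.
func acceptPollerRedirect(s versioningState, pollerDeployment deployment) bool {
	if s.overrideDeployment == nil {
		return true
	}
	return *s.overrideDeployment == pollerDeployment
}

func main() {
	override := deployment{DeploymentName: "my_app", BuildId: "build_2"}
	s := versioningState{overrideDeployment: &override}

	stale := deployment{DeploymentName: "my_app", BuildId: "build_1"}
	fmt.Println(acceptPollerRedirect(s, stale))    // poller disagrees with the override
	fmt.Println(acceptPollerRedirect(s, override)) // poller matches the override
}
```

Whether a matching redirect should be accepted or simply treated as a no-op is a design choice this sketch does not settle.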

carlydf · 2024-11-14T00:33:16Z

service/history/workflow/mutable_state_impl_test.go

+// creates a mutable state with first WFT completed on deployment "my_app:build_1" and behavior set
+// to the passed value.


Suggested change

// creates a mutable state with first WFT completed on deployment "my_app:build_1" and behavior set

// to the passed value.

// creates a mutable state with first WFT completed on the given deployment and behavior set

// to the given behavior, testing expected output after Add, Start, and Complete Workflow Task.

carlydf · 2024-11-14T05:22:15Z

common/worker_versioning/worker_versioning.go

 func StampForBuildId(buildId string) *commonpb.WorkerVersionStamp {
 	return &commonpb.WorkerVersionStamp{UseVersioning: true, BuildId: buildId}
 }
+func StampForDeployment(deployment *commonpb.WorkerDeployment) *commonpb.WorkerVersionStamp {
+	return &commonpb.WorkerVersionStamp{UseVersioning: true, BuildId: deployment.BuildId, DeploymentName: deployment.DeploymentName}
+}


should these be StampFromBuildId and StampFromDeployment?

carlydf · 2024-11-14T05:37:20Z

service/history/api/recordactivitytaskstarted/api.go

+					// If the redirect was initiated by this activity we must create a workflow task
+					// to ensure the workflow won't be stuck.


Is this to handle the case where the activity poller / activity TQ initiates the deployment change, and we don't know yet whether the workflow poller / WFTQ is also changing deployments?

If so, is this a correct understanding of why we create a WFT here: We create a workflow task to send to the current workflow TQ, so that poller can tell us its updated deployment on wf task completion?

carlydf · 2024-11-14T05:56:33Z

service/history/workflow/mutable_state_impl.go

+		behaviorOverride == enumspb.VERSIONING_BEHAVIOR_UNSPECIFIED {
+		// WF is pinned and the redirect is not from a manual override, so we reject it.
+		// It's possible that a backlogged task in matching from an earlier time that this wf was
+		// unpinned is being dispatched now and wants to redirect the wf. Such task should be dropped.


Super basic thing I should know: If Matching drops a backlogged task, will History eventually re-send it to Matching, because History never got a "WFT Started Event" for that task? Or, does Matching send some type of ack to History when it writes a task to the backlog, meaning it's Matching's responsibility to not drop anything in the backlog?

I think it's the first way but want to confirm
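
For reference, the rejection condition in the quoted diff can be read as the predicate below. This is only a sketch: the diff shows just the second half of the condition, so the first half (the workflow's effective behavior being pinned) is an assumption taken from the code comment, and the `behavior` type, its constants, and `rejectRedirect` are hypothetical names rather than the server's.

```go
package main

import "fmt"

// behavior is a hypothetical stand-in for enumspb.VersioningBehavior.
type behavior int

const (
	behaviorUnspecified behavior = iota
	behaviorPinned
	behaviorAutoUpgrade
)

// rejectRedirect reports whether a redirect should be rejected.
// Assumption: a redirect is rejected when the workflow is currently pinned and the
// redirect does not carry a manual override (the override behavior is unspecified).
func rejectRedirect(effective, behaviorOverride behavior) bool {
	return effective == behaviorPinned && behaviorOverride == behaviorUnspecified
}

func main() {
	// A backlogged task from before the workflow was pinned tries to redirect it.
	fmt.Println(rejectRedirect(behaviorPinned, behaviorUnspecified))
	// A manual override is allowed to move a pinned workflow.
	fmt.Println(rejectRedirect(behaviorPinned, behaviorPinned))
	// An auto-upgrade workflow follows the poller.
	fmt.Println(rejectRedirect(behaviorAutoUpgrade, behaviorUnspecified))
}
```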

carlydf · 2024-11-14T06:37:21Z

proto/internal/temporal/server/api/persistence/v1/executions.proto

+        // When present, indicates the workflow is being redirected to a different deployment.
+        // A redirect can only exist during the lifetime of a pending workflow task.
+        // If the pending workflow task completes (at the next WorkflowTaskCompleted event), the
+        // redirect is considered complete and the workflow's deployment is updated. If the pending
+        // workflow task fails or times out, then the redirect is canceled and workflow remains on
+        // the previous deployment.


why do we want to cancel the redirect if the pending WFT fails or times out? what if it timed out due to reasons totally unrelated to versioning?

is the idea that it's ok to cancel the redirect because if the WF really does need to change deployments or change behavior, and the timeout was due to a transient issue, the next WFT will re-initiate the cancelled redirect, it will succeed, and the redirect will complete?
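
To make the lifecycle in the quoted proto comment concrete, here is a minimal sketch of it as a small state holder. It assumes the reading given in the question above, namely that a later workflow task can simply start the redirect again; the `executionVersioning` type, its methods, and the `wftOutcome` constants are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// wftOutcome is a hypothetical enum for how a pending workflow task ended.
type wftOutcome int

const (
	wftCompleted wftOutcome = iota
	wftFailed
	wftTimedOut
)

// executionVersioning is a hypothetical, simplified view of the versioning info.
type executionVersioning struct {
	deployment string // deployment the workflow is currently on
	redirectTo string // target of an in-progress redirect; empty when none
}

// startRedirect records a redirect that lives only as long as the pending workflow task.
func (e *executionVersioning) startRedirect(target string) {
	if target != e.deployment {
		e.redirectTo = target
	}
}

// onWorkflowTaskClosed applies the redirect if the task completed and cancels it otherwise.
func (e *executionVersioning) onWorkflowTaskClosed(outcome wftOutcome) {
	if e.redirectTo == "" {
		return
	}
	if outcome == wftCompleted {
		e.deployment = e.redirectTo
	}
	e.redirectTo = ""
}

func main() {
	e := &executionVersioning{deployment: "my_app:build_1"}

	// First attempt times out: the redirect is canceled and the deployment is unchanged.
	e.startRedirect("my_app:build_2")
	e.onWorkflowTaskClosed(wftTimedOut)
	fmt.Println(e.deployment, e.redirectTo == "")

	// The next workflow task starts the redirect again and completes it.
	e.startRedirect("my_app:build_2")
	e.onWorkflowTaskClosed(wftCompleted)
	fmt.Println(e.deployment, e.redirectTo == "")
}
```

Under this model a timeout unrelated to versioning costs only one extra attempt, because nothing about the canceled redirect prevents it from being started again.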

carlydf

approved with some clarifying questions

ShahabT added 2 commits November 10, 2024 23:47

unfinished history logic

482331e

Update deployment and behavior in ExecutionInfo

79e8564

ShahabT requested a review from a team as a code owner November 11, 2024 07:54

ShahabT requested a review from carlydf November 11, 2024 07:54

carlydf reviewed Nov 13, 2024

carlydf reviewed Nov 14, 2024

ShahabT added 2 commits November 13, 2024 17:15

Address comments and lint errors.

3a31c33

Add my name to immediate TODOs

893ba9f

carlydf reviewed Nov 14, 2024

carlydf approved these changes Nov 14, 2024
