Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update deployment and behavior in ExecutionInfo #6798

Open
wants to merge 4 commits into
base: versioning-3
Choose a base branch
from

Conversation

ShahabT
Copy link
Contributor

@ShahabT ShahabT commented Nov 11, 2024

What changed?

Update the versioning info of workflows based on information received in WF and Activity task events.

Why?

How did you test it?

Potential risks

Documentation

Is hotfix candidate?

@ShahabT ShahabT requested a review from a team as a code owner November 11, 2024 07:54
@ShahabT ShahabT requested a review from carlydf November 11, 2024 07:54
Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to look at the test files still, submitting to save progress / share feedback while I help with a canary issue in the release channel

temporal.api.common.v1.WorkerDeployment deployment = 2;
// Manual override for execution's versioning behavior. Takes precedence over `behavior`.
temporal.api.enums.v1.VersioningBehavior behavior_override = 3;
// Used to manually pin the execution to a deployment. Must be set when if and only if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Used to manually pin the execution to a deployment. Must be set when if and only if
// Used to manually pin the execution to a deployment. Must be set if and only if

Comment on lines 515 to 518
// If this activity was attempted to start during an ongoing redirect of the workflow, we set
// this flag so we remember after completion of the activity to reschedule the dropped task.
// History rejects the start of activities when a workflow is redirecting to a different
// deployment. Those rejected starts will cause the task to be dropped by Matching.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you're talking about the "dropped task" could you specify that it's a dropped Activity Task? I feel like it doesn't hurt to be extra clear.

Also, what do you think about calling the field has_dropped_task since it's a boolean and not the task itself?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped_task is fine too, I'm just wondering if has_dropped_task would be more clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up deleting this. We have to reschedule all the non-started activities when the redirect successfully completes anyways. There is not much use to this.

service/history/api/recordactivitytaskstarted/api.go Outdated Show resolved Hide resolved
service/history/workflow/mutable_state_impl.go Outdated Show resolved Hide resolved
service/history/api/recordactivitytaskstarted/api.go Outdated Show resolved Hide resolved
service/history/api/recordactivitytaskstarted/api.go Outdated Show resolved Hide resolved
Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

The only major comment I had is about how to handle conflict resolution when there is an ongoing redirect due to a manual override, and another redirect comes in because the poller's deployment/behavior is different than the override deployment/behavior. In that case, we need to reject the poller-initiated redirect.

Now that I think about it, I think I can also handle this as part of the move PR though, since I think it will hopefully just involve adding an if-statement. I think it will make more sense to handle this case in the same PR that adds the override.

service/history/api/respondworkflowtaskcompleted/api.go Outdated Show resolved Hide resolved
service/history/workflow/mutable_state_impl.go Outdated Show resolved Hide resolved
service/history/workflow/mutable_state_impl.go Outdated Show resolved Hide resolved
Comment on lines 554 to 555
// creates a mutable state with first WFT completed on deployment "my_app:build_1" and behavior set
// to the passed value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// creates a mutable state with first WFT completed on deployment "my_app:build_1" and behavior set
// to the passed value.
// creates a mutable state with first WFT completed on the given deployment and behavior set
// to the given behavior, testing expected output after Add, Start, and Complete Workflow Task.

Comment on lines 198 to +203
func StampForBuildId(buildId string) *commonpb.WorkerVersionStamp {
return &commonpb.WorkerVersionStamp{UseVersioning: true, BuildId: buildId}
}
func StampForDeployment(deployment *commonpb.WorkerDeployment) *commonpb.WorkerVersionStamp {
return &commonpb.WorkerVersionStamp{UseVersioning: true, BuildId: deployment.BuildId, DeploymentName: deployment.DeploymentName}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these be StampFromBuildId and StampFromDeployment?

Comment on lines +129 to +130
// If the redirect was initiated by this activity we must create a workflow task
// to ensure the workflow won't be stuck.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is this to handle the case where the activity poller / activity TQ initiates the deployment change, and we don't know yet whether the workflow poller / WFTQ is also changing deployments?
  2. If so, is this a correct understanding of why we create a WFT here: We create a workflow task to send to the current workflow TQ, so that poller can tell us it's updated deployment on wf task completion?

behaviorOverride == enumspb.VERSIONING_BEHAVIOR_UNSPECIFIED {
// WF is pinned and the redirect is not from a manual override, so we reject it.
// It's possible that a backlogged task in matching from an earlier time that this wf was
// unpinned is being dispatched now and wants to redirect the wf. Such task should be dropped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super basic thing I should know: If Matching drops a backlogged task, will History eventually re-send it to Matching, because History never got a "WFT Started Event" for that task? Or, does Matching send some type of ack to History when it writes a task to the backlog, meaning it's Matching's responsibility to not drop anything in the backlog?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the first way but want to confirm

Comment on lines +275 to +280
// When present, indicates the workflow is being redirected to a different deployment.
// A redirect can only exist during the lifetime of a pending workflow task.
// If the pending workflow task completes (at the next WorkflowTaskCompleted event), the
// redirect is considered complete and the workflow's deployment is updated. If the pending
// workflow task fails or times out, then the redirect is canceled and workflow remains on
// the previous deployment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to cancel the redirect if the pending WFT fails or times out? what if it timed out due to reasons totally unrelated to versioning?

is the idea that it's ok to cancel the redirect because if the WF really does need to change deployments or change behavior, and the timeout was due to a transient issue, the next WFT will re-initiate the cancelled redirect, it will succeed, and the redirect will complete?

Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with some clarifying questions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants