Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deployment Workflow Definition + Pollers starting workflow #6807

Draft
wants to merge 25 commits into
base: versioning-3
Choose a base branch
from

Conversation

Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Nov 13, 2024

What changed?

  • Added logic for deployment workflow definition to signalWithStart and update local state
  • Added logic for pollers in matching starting these workflows
  • Added de-dup mechanism since we don't want repeated workflow executions for already seen <TaskQueue, DeploymentGroup, BuildID pollers.

Why?

  • versioning-3

How did you test it?

  • Added unit tests for each addition of code I have made.
  • Functional tests to follow.

Potential risks

  • None, going to a feature.

Documentation

Is hotfix candidate?

  • NA

@@ -163,6 +182,8 @@ type (
namespaceUpdateLockMapLock sync.Mutex
// Stores results of reachability queries to visibility
reachabilityCache reachabilityCache
// De-duping poll requests when starting/signaling deployment workflows
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added to adhere to the constraint:

if same task queue and build ID appear in a poll request but with a different deployment name, the poll will be rejected

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this requirement changed with the recent semantic changes. We allow the poller to hang on. No poller rejection happens anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I had meant to write an additional statement with this requirement by saying "We will not reject the poll but won't make a duplicate workflow execution request".

)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the skeleton as this shall aid quicker iteration of the next PR :)

@Shivs11 Shivs11 changed the title Deployment Workflow Definition + Pollers starting them Deployment Workflow Definition + Pollers starting workflow Nov 13, 2024
@@ -163,6 +182,8 @@ type (
namespaceUpdateLockMapLock sync.Mutex
// Stores results of reachability queries to visibility
reachabilityCache reachabilityCache
// De-duping poll requests when starting/signaling deployment workflows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this requirement changed with the recent semantic changes. We allow the poller to hang on. No poller rejection happens anymore.

@@ -545,6 +569,42 @@ pollLoop:
workerVersionCapabilities: request.WorkerVersionCapabilities,
forwardedFrom: req.GetForwardedSource(),
}
if e.config.EnableDeployments(req.PollRequest.Namespace) && req.PollRequest.WorkerVersionCapabilities.UseVersioning {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deployment workflow is the concern of the physical queue, hence, the best place for this is physicalTaskQueueManager. Basically, physical queue wants to register its deployment when it receives the first poll, before it starts processing that poll or any other polls arriving before the registration is done.

One advantage of that is you don't need to repeat things in different types of polls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about this and had put it inside of matching before we register a poll for the same reason. I am not opposed to putting this inside of physicalTaskQueueManager and in hindsight, should have seen this as a better way to reduce code duplication (to a certain extent).

thanks

@@ -1987,6 +2119,114 @@ func (e *matchingEngineImpl) pollTask(
return pm.PollTask(ctx, pollMetadata)
}

// isValidName checks if each character is a letter/number in the input string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems cleaner to move all these stuff to a different file called deployment_workflow_util.go or something similar.

// isValidName checks if each character is a letter/number in the input string
func isValidName(input string) bool {
for _, char := range input {
if !unicode.IsLetter(char) && !unicode.IsDigit(char) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not be too restrictive here, we want to disallow only the characters that we use as delimiter. That also, means that we should not use common characters as delimiters, something like ":" or "/" seems common. Maybe we should use "|" as delimiter and just disallow that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not though? I find it rather...odd..for a user to come in and have their deployment group named to a non-alphanumerical sequence (something like: "hi!@#"). I know we give liberty to our customers when naming task-queues but I thought we should keep things simpler here while naming deploymentGroups/buildId's

This might be a more product related concern though I fear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is more of a product decision. But since we're not putting same restrictions anywhere else, I think we should avoid that here to be consistent: only the things that can break things should be prevented. In this example, it's very likely that build ID has chars such as "." or "-".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to go ahead and use "?" as the delimiter for our buildID. To the best of my product knowledge, users won't name their deployment group/BuildID with a question mark (but I see a world where they name things with :, -, !) so I'm taking that gamble here and going with it as a potential delimiter.

return nil
}

func (e *matchingEngineImpl) validateWorkflowID(workflowID string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wf ID is something internal that we create, so it does not make sense to return error for something invalid that we make. we should instead validate the user-provided pieces (deployment name and build ID) and make sure they are such that the workflow ID that they'll result into is not going to be invalid.
That means we should check the following for deployment name and build ID:

  • They don't have the invalid chars
  • The as UTF8 strings
  • Their length does not exceed the limit of the part in wf ID that we allocate to them. Because workflow ID will have both these values plus prefix and delimeter, the allowed length for each will be (MaxIDLengthLimit - prefix and delimeter length) / 2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I am not opposed to this. The deployment and buildID checks will be individually done on the matching side (I have added the piece of code currently doing this) and can append the new cases to the existing logic.

service/matching/matching_engine_test.go Outdated Show resolved Hide resolved
NamespaceId: req.NamespaceId,
TaskQueueFamilies: nil,
}
_, err := e.startAndSignalDeploymentWorkflow(ctx, startDeploymentWorkflowArgs, updateDeploymentSignalInput, deploymentName, buildID, req.PollRequest.Identity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down the road, we'll need many different signals/updates doing different things to the wf. Each should be names specifically as what it is meant for and have its own input definition.

In this case, the signal is meant to register the TQ in the deployment. Hence it should be called something like RegisterTaskQueue or something.

Also, the deployment wf might accept the registration or not (say if this registration exceeds some limit) so it should be an undate that returns some result. We can handle the result and failure cases later, but for now it makes sense to start with an Update rather than a Signal.

Copy link
Member Author

@Shivs11 Shivs11 Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two points here:

  1. I am okay to change the naming of the signal although I do think we should have "deployment" in the signal naming for better readability (it's nit, can be ignored I believe)
  2. I had considered using an Update but went ahead with signals because I thought we wanted to lazily initiate our workflow executions. Moreover, using SignalWithStart also provides less latency since it's async as opposed to sync write of an Update (yes, there is guarantee of the registration happening here but I gave latency the upper hand here). Overall, I thought both options have their pros and cons each but I thought having lazy initializations of workflows was something that we desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For item, I think it's better to call the Update name RegisterWorker where "Worker" highlights that we are registering the set of TQ worker in a deployment. As oppose to future Update RegisterBacklog or similar when we see the first task (pinned) task for a versioned TQ.

(RegisterBacklog is a future thing that I anticipate we'd need to be able to build APIs for backlog stats for deployments, etc., no need to think about it right now)

we should have "deployment" in the signal naming for better readability
If this helps with code readability, sure. As far as the wf is concerned all the signal names are scoped within a WF definition.

@@ -545,6 +569,42 @@ pollLoop:
workerVersionCapabilities: request.WorkerVersionCapabilities,
forwardedFrom: req.GetForwardedSource(),
}
if e.config.EnableDeployments(req.PollRequest.Namespace) && req.PollRequest.WorkerVersionCapabilities.UseVersioning {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we factor out more of this code?

}

// adds to map to prevent multiple duplicate requests from starting workflow execution
e.dedupDeployments[dedupDeploymentKey] = req.PollRequest.WorkerVersionCapabilities.DeploymentName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need some synchronization?

@@ -163,6 +182,8 @@ type (
namespaceUpdateLockMapLock sync.Mutex
// Stores results of reachability queries to visibility
reachabilityCache reachabilityCache
// De-duping poll requests when starting/signaling deployment workflows
dedupDeployments map[dedupDeploymentsKey]string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the map ever shrink? maybe we want a cache with a ttl?

@@ -1987,6 +2119,114 @@ func (e *matchingEngineImpl) pollTask(
return pm.PollTask(ctx, pollMetadata)
}

// isValidName checks if each character is a letter/number in the input string
func isValidName(input string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is valid name for what? deployments? then it should be isValidDeploymentName.. this is a package scope

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to alter this function definition by making it a method for taskQueuePartitionManagerImpl and change the name to isValidDeploymentName.

// isValidName checks if each character is a letter/number in the input string
func isValidName(input string) bool {
for _, char := range input {
if !unicode.IsLetter(char) && !unicode.IsDigit(char) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems pretty constraining.. why are we limiting things like this?

return serviceerror.NewInvalidArgument("DeploymentName/BuildID cannot be empty")
}

// Prefix check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong if a deployment name or build id has one of these prefixes? it may be confusing but does anything break?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing breaks (as of now) since we separate the prefixes from the name with a delimeter (:). However, I do wonder if we want users to be allowed to keep their deployment name as a constant we use internally.

now that you say it though, I don't see much problem in it and shall remove it - thanks for making me think again on this

info := workflow.GetInfo(ctx)
workflowID := info.WorkflowExecution.ID

deploymentName, buildID, err := parseDeploymentWorkflowID(workflowID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After seeing this code I feel more strongly: We should not be parsing things out of strings. Just pass the deployment name and build id in the args and don't worry about the name here at all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puts forward the argument of space again since passing those two as args would mean they repeatedly get passed around when doing CAN

I don't have an exact metric right now to tell you if adding two constants would make such a huge difference in size or not but I thought it was best to keep things separate and not repeat themself (hence the parsing)

@ShahabT , I recall you were opposed to the idea of having them passed as args for the same reason right? wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an objection to pass these values in the args, but then we should validate that the passed values does match the wf ID.

Also, regardless, we need to keep deployment -> wf ID mapping deterministic for the DescribeDeployment and ListDeployment APIs. So we can not get rid of the verification and wf ID construction part. (We can get rid of parsing though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an objection to pass these values in the args, but then we should validate that the passed values does match the wf ID.

We build out the workflowID using the two supplied arguments in matching - In the event that the supplied arguments are valid (individual checks), the workflowID build would also be valid since it's a combination of the supplied args + our reserved delimiters. I am not sure why, in the workflow, should we be further validating if the passed values match the wf ID

var signalInput *deployspb.UpdateDeploymentSignalInput
updateDeploymentSignalChannel.Receive(d.ctx, &signalInput)

if d.DeploymentLocalState.TaskQueueFamilies == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need separate checks for d.DeploymentLocalState.TaskQueueFamilies == nil and d.DeploymentLocalState.TaskQueueFamilies[signalInput.Name] == nil, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't d.DeploymentLocalState.TaskQueueFamilies[signalInput.Name] == nil will always be true whenever we have d.DeploymentLocalState.TaskQueueFamilies == nil no?

This line of code will anyways be hit only when we are going to add a key in our map, so I thought that check was repeated (unless I've missed something)

Copy link
Member Author

@Shivs11 Shivs11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dnr - I left some comments as unanswered because they will not be applicable given that I am currently in the process of moving things inside of physicalTaskQueueManager

var signalInput *deployspb.UpdateDeploymentSignalInput
updateDeploymentSignalChannel.Receive(d.ctx, &signalInput)

if d.DeploymentLocalState.TaskQueueFamilies == nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't d.DeploymentLocalState.TaskQueueFamilies[signalInput.Name] == nil will always be true whenever we have d.DeploymentLocalState.TaskQueueFamilies == nil no?

This line of code will anyways be hit only when we are going to add a key in our map, so I thought that check was repeated (unless I've missed something)

info := workflow.GetInfo(ctx)
workflowID := info.WorkflowExecution.ID

deploymentName, buildID, err := parseDeploymentWorkflowID(workflowID)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puts forward the argument of space again since passing those two as args would mean they repeatedly get passed around when doing CAN

I don't have an exact metric right now to tell you if adding two constants would make such a huge difference in size or not but I thought it was best to keep things separate and not repeat themself (hence the parsing)

@ShahabT , I recall you were opposed to the idea of having them passed as args for the same reason right? wdyt?

return serviceerror.NewInvalidArgument("DeploymentName/BuildID cannot be empty")
}

// Prefix check
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing breaks (as of now) since we separate the prefixes from the name with a delimeter (:). However, I do wonder if we want users to be allowed to keep their deployment name as a constant we use internally.

now that you say it though, I don't see much problem in it and shall remove it - thanks for making me think again on this

@@ -1987,6 +2119,114 @@ func (e *matchingEngineImpl) pollTask(
return pm.PollTask(ctx, pollMetadata)
}

// isValidName checks if each character is a letter/number in the input string
func isValidName(input string) bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to alter this function definition by making it a method for taskQueuePartitionManagerImpl and change the name to isValidDeploymentName.

@@ -99,6 +109,9 @@ type (
taskValidator taskValidator
tasksAddedInIntervals *taskTracker
tasksDispatchedInIntervals *taskTracker
// isDeploymentWorkflowStarted keeps track if we have started a deployment workflow for this
// physicalTaskQueue
isDeploymentWorkflowStarted atomic.Bool
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This atomic is present to handle the following condition:
"If a poller arrives with an previously-seen task-queue + buildID combination but with a different DeploymentName (meaning there has already been a Deployment wf started)"

@@ -51,31 +46,28 @@ type (
}
)

var (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this for now since I haven't drafted activities yet. Makes sense to include this as and when I do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants