-
Notifications
You must be signed in to change notification settings - Fork 846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deployment Workflow Definition + Pollers starting workflow #6807
base: versioning-3
Are you sure you want to change the base?
Conversation
… shivam/deployment-entity-workflow-fix-base-branch-latest
… shivam/deployment-entity-workflow-fix-base-branch-latest
service/matching/matching_engine.go
Outdated
@@ -163,6 +182,8 @@ type ( | |||
namespaceUpdateLockMapLock sync.Mutex | |||
// Stores results of reachability queries to visibility | |||
reachabilityCache reachabilityCache | |||
// De-duping poll requests when starting/signaling deployment workflows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is added to adhere to the constraint:
if same task queue and build ID appear in a poll request but with a different deployment name, the poll will be rejected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this requirement changed with the recent semantic changes. We allow the poller to hang on. No poller rejection happens anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I had meant to write an additional statement with this requirement by saying "We will not reject the poll but won't make a duplicate workflow execution request".
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the skeleton as this shall aid quicker iteration of the next PR :)
service/matching/matching_engine.go
Outdated
@@ -163,6 +182,8 @@ type ( | |||
namespaceUpdateLockMapLock sync.Mutex | |||
// Stores results of reachability queries to visibility | |||
reachabilityCache reachabilityCache | |||
// De-duping poll requests when starting/signaling deployment workflows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this requirement changed with the recent semantic changes. We allow the poller to hang on. No poller rejection happens anymore.
service/matching/matching_engine.go
Outdated
@@ -545,6 +569,42 @@ pollLoop: | |||
workerVersionCapabilities: request.WorkerVersionCapabilities, | |||
forwardedFrom: req.GetForwardedSource(), | |||
} | |||
if e.config.EnableDeployments(req.PollRequest.Namespace) && req.PollRequest.WorkerVersionCapabilities.UseVersioning { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deployment workflow is the concern of the physical queue, hence, the best place for this is physicalTaskQueueManager. Basically, physical queue wants to register its deployment when it receives the first poll, before it starts processing that poll or any other polls arriving before the registration is done.
One advantage of that is you don't need to repeat things in different types of polls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought about this and had put it inside of matching before we register a poll for the same reason. I am not opposed to putting this inside of physicalTaskQueueManager
and in hindsight, should have seen this as a better way to reduce code duplication (to a certain extent).
thanks
service/matching/matching_engine.go
Outdated
@@ -1987,6 +2119,114 @@ func (e *matchingEngineImpl) pollTask( | |||
return pm.PollTask(ctx, pollMetadata) | |||
} | |||
|
|||
// isValidName checks if each character is a letter/number in the input string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems cleaner to move all these stuff to a different file called deployment_workflow_util.go
or something similar.
service/matching/matching_engine.go
Outdated
// isValidName checks if each character is a letter/number in the input string | ||
func isValidName(input string) bool { | ||
for _, char := range input { | ||
if !unicode.IsLetter(char) && !unicode.IsDigit(char) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not be too restrictive here, we want to disallow only the characters that we use as delimiter. That also, means that we should not use common characters as delimiters, something like ":" or "/" seems common. Maybe we should use "|" as delimiter and just disallow that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not though? I find it rather...odd..for a user to come in and have their deployment group named to a non-alphanumerical sequence (something like: "hi!@#"). I know we give liberty to our customers when naming task-queues but I thought we should keep things simpler here while naming deploymentGroups/buildId's
This might be a more product related concern though I fear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is more of a product decision. But since we're not putting same restrictions anywhere else, I think we should avoid that here to be consistent: only the things that can break things should be prevented. In this example, it's very likely that build ID has chars such as "." or "-".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to go ahead and use "?" as the delimiter for our buildID. To the best of my product knowledge, users won't name their deployment group/BuildID with a question mark (but I see a world where they name things with :, -, !) so I'm taking that gamble here and going with it as a potential delimiter.
service/matching/matching_engine.go
Outdated
return nil | ||
} | ||
|
||
func (e *matchingEngineImpl) validateWorkflowID(workflowID string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this wf ID is something internal that we create, so it does not make sense to return error for something invalid that we make. we should instead validate the user-provided pieces (deployment name and build ID) and make sure they are such that the workflow ID that they'll result into is not going to be invalid.
That means we should check the following for deployment name and build ID:
- They don't have the invalid chars
- The as UTF8 strings
- Their length does not exceed the limit of the part in wf ID that we allocate to them. Because workflow ID will have both these values plus prefix and delimeter, the allowed length for each will be (MaxIDLengthLimit - prefix and delimeter length) / 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I am not opposed to this. The deployment and buildID checks will be individually done on the matching side (I have added the piece of code currently doing this) and can append the new cases to the existing logic.
service/matching/matching_engine.go
Outdated
NamespaceId: req.NamespaceId, | ||
TaskQueueFamilies: nil, | ||
} | ||
_, err := e.startAndSignalDeploymentWorkflow(ctx, startDeploymentWorkflowArgs, updateDeploymentSignalInput, deploymentName, buildID, req.PollRequest.Identity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down the road, we'll need many different signals/updates doing different things to the wf. Each should be names specifically as what it is meant for and have its own input definition.
In this case, the signal is meant to register the TQ in the deployment. Hence it should be called something like RegisterTaskQueue
or something.
Also, the deployment wf might accept the registration or not (say if this registration exceeds some limit) so it should be an undate that returns some result. We can handle the result and failure cases later, but for now it makes sense to start with an Update rather than a Signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two points here:
- I am okay to change the naming of the signal although I do think we should have "deployment" in the signal naming for better readability (it's nit, can be ignored I believe)
- I had considered using an
Update
but went ahead with signals because I thought we wanted to lazily initiate our workflow executions. Moreover, usingSignalWithStart
also provides less latency since it's async as opposed to sync write of an Update (yes, there is guarantee of the registration happening here but I gave latency the upper hand here). Overall, I thought both options have their pros and cons each but I thought having lazy initializations of workflows was something that we desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For item, I think it's better to call the Update name RegisterWorker
where "Worker" highlights that we are registering the set of TQ worker in a deployment. As oppose to future Update RegisterBacklog
or similar when we see the first task (pinned) task for a versioned TQ.
(RegisterBacklog is a future thing that I anticipate we'd need to be able to build APIs for backlog stats for deployments, etc., no need to think about it right now)
we should have "deployment" in the signal naming for better readability
If this helps with code readability, sure. As far as the wf is concerned all the signal names are scoped within a WF definition.
service/matching/matching_engine.go
Outdated
@@ -545,6 +569,42 @@ pollLoop: | |||
workerVersionCapabilities: request.WorkerVersionCapabilities, | |||
forwardedFrom: req.GetForwardedSource(), | |||
} | |||
if e.config.EnableDeployments(req.PollRequest.Namespace) && req.PollRequest.WorkerVersionCapabilities.UseVersioning { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we factor out more of this code?
service/matching/matching_engine.go
Outdated
} | ||
|
||
// adds to map to prevent multiple duplicate requests from starting workflow execution | ||
e.dedupDeployments[dedupDeploymentKey] = req.PollRequest.WorkerVersionCapabilities.DeploymentName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we need some synchronization?
service/matching/matching_engine.go
Outdated
@@ -163,6 +182,8 @@ type ( | |||
namespaceUpdateLockMapLock sync.Mutex | |||
// Stores results of reachability queries to visibility | |||
reachabilityCache reachabilityCache | |||
// De-duping poll requests when starting/signaling deployment workflows | |||
dedupDeployments map[dedupDeploymentsKey]string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the map ever shrink? maybe we want a cache with a ttl?
service/matching/matching_engine.go
Outdated
@@ -1987,6 +2119,114 @@ func (e *matchingEngineImpl) pollTask( | |||
return pm.PollTask(ctx, pollMetadata) | |||
} | |||
|
|||
// isValidName checks if each character is a letter/number in the input string | |||
func isValidName(input string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is valid name for what? deployments? then it should be isValidDeploymentName
.. this is a package scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to alter this function definition by making it a method for taskQueuePartitionManagerImpl
and change the name to isValidDeploymentName
.
service/matching/matching_engine.go
Outdated
// isValidName checks if each character is a letter/number in the input string | ||
func isValidName(input string) bool { | ||
for _, char := range input { | ||
if !unicode.IsLetter(char) && !unicode.IsDigit(char) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems pretty constraining.. why are we limiting things like this?
service/matching/matching_engine.go
Outdated
return serviceerror.NewInvalidArgument("DeploymentName/BuildID cannot be empty") | ||
} | ||
|
||
// Prefix check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's wrong if a deployment name or build id has one of these prefixes? it may be confusing but does anything break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing breaks (as of now) since we separate the prefixes from the name with a delimeter (:
). However, I do wonder if we want users to be allowed to keep their deployment name as a constant we use internally.
now that you say it though, I don't see much problem in it and shall remove it - thanks for making me think again on this
info := workflow.GetInfo(ctx) | ||
workflowID := info.WorkflowExecution.ID | ||
|
||
deploymentName, buildID, err := parseDeploymentWorkflowID(workflowID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After seeing this code I feel more strongly: We should not be parsing things out of strings. Just pass the deployment name and build id in the args and don't worry about the name here at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This puts forward the argument of space again since passing those two as args would mean they repeatedly get passed around when doing CAN
I don't have an exact metric right now to tell you if adding two constants would make such a huge difference in size or not but I thought it was best to keep things separate and not repeat themself (hence the parsing)
@ShahabT , I recall you were opposed to the idea of having them passed as args for the same reason right? wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have an objection to pass these values in the args, but then we should validate that the passed values does match the wf ID.
Also, regardless, we need to keep deployment -> wf ID mapping deterministic for the DescribeDeployment and ListDeployment APIs. So we can not get rid of the verification and wf ID construction part. (We can get rid of parsing though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have an objection to pass these values in the args, but then we should validate that the passed values does match the wf ID.
We build out the workflowID using the two supplied arguments in matching - In the event that the supplied arguments are valid (individual checks), the workflowID build would also be valid since it's a combination of the supplied args + our reserved delimiters. I am not sure why, in the workflow, should we be further validating if the passed values match the wf ID
var signalInput *deployspb.UpdateDeploymentSignalInput | ||
updateDeploymentSignalChannel.Receive(d.ctx, &signalInput) | ||
|
||
if d.DeploymentLocalState.TaskQueueFamilies == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need separate checks for d.DeploymentLocalState.TaskQueueFamilies == nil
and d.DeploymentLocalState.TaskQueueFamilies[signalInput.Name] == nil
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't d.DeploymentLocalState.TaskQueueFamilies[signalInput.Name] == nil
will always be true whenever we have d.DeploymentLocalState.TaskQueueFamilies == nil
no?
This line of code will anyways be hit only when we are going to add a key in our map, so I thought that check was repeated (unless I've missed something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dnr - I left some comments as unanswered because they will not be applicable given that I am currently in the process of moving things inside of physicalTaskQueueManager
var signalInput *deployspb.UpdateDeploymentSignalInput | ||
updateDeploymentSignalChannel.Receive(d.ctx, &signalInput) | ||
|
||
if d.DeploymentLocalState.TaskQueueFamilies == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't d.DeploymentLocalState.TaskQueueFamilies[signalInput.Name] == nil
will always be true whenever we have d.DeploymentLocalState.TaskQueueFamilies == nil
no?
This line of code will anyways be hit only when we are going to add a key in our map, so I thought that check was repeated (unless I've missed something)
info := workflow.GetInfo(ctx) | ||
workflowID := info.WorkflowExecution.ID | ||
|
||
deploymentName, buildID, err := parseDeploymentWorkflowID(workflowID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This puts forward the argument of space again since passing those two as args would mean they repeatedly get passed around when doing CAN
I don't have an exact metric right now to tell you if adding two constants would make such a huge difference in size or not but I thought it was best to keep things separate and not repeat themself (hence the parsing)
@ShahabT , I recall you were opposed to the idea of having them passed as args for the same reason right? wdyt?
service/matching/matching_engine.go
Outdated
return serviceerror.NewInvalidArgument("DeploymentName/BuildID cannot be empty") | ||
} | ||
|
||
// Prefix check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing breaks (as of now) since we separate the prefixes from the name with a delimeter (:
). However, I do wonder if we want users to be allowed to keep their deployment name as a constant we use internally.
now that you say it though, I don't see much problem in it and shall remove it - thanks for making me think again on this
service/matching/matching_engine.go
Outdated
@@ -1987,6 +2119,114 @@ func (e *matchingEngineImpl) pollTask( | |||
return pm.PollTask(ctx, pollMetadata) | |||
} | |||
|
|||
// isValidName checks if each character is a letter/number in the input string | |||
func isValidName(input string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to alter this function definition by making it a method for taskQueuePartitionManagerImpl
and change the name to isValidDeploymentName
.
…go-routine for cleaner workflow look
@@ -99,6 +109,9 @@ type ( | |||
taskValidator taskValidator | |||
tasksAddedInIntervals *taskTracker | |||
tasksDispatchedInIntervals *taskTracker | |||
// isDeploymentWorkflowStarted keeps track if we have started a deployment workflow for this | |||
// physicalTaskQueue | |||
isDeploymentWorkflowStarted atomic.Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This atomic is present to handle the following condition:
"If a poller arrives with an previously-seen task-queue + buildID combination but with a different DeploymentName (meaning there has already been a Deployment wf started)"
@@ -51,31 +46,28 @@ type ( | |||
} | |||
) | |||
|
|||
var ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this for now since I haven't drafted activities yet. Makes sense to include this as and when I do
What changed?
signalWithStart
and update local state<TaskQueue, DeploymentGroup, BuildID
pollers.Why?
versioning-3
How did you test it?
Potential risks
Documentation
Is hotfix candidate?