Feat skip states for workflows #1075

Open
wants to merge 99 commits into
base: main
99 commits
3909f65
happy path example
Nov 4, 2024
9e08354
progress commit
Nov 21, 2024
3a31c7f
cleanup
Nov 27, 2024
cff050f
cleanup and pass the step run repo around
Nov 27, 2024
a4bd207
happy path example
Nov 4, 2024
2e1444f
progress commit
Nov 21, 2024
26c1aba
cleanup
Nov 27, 2024
6e0c9fb
cleanup and pass the step run repo around
Nov 27, 2024
4d152cb
merge in main
Dec 2, 2024
417a689
merge
Dec 2, 2024
4305611
before refactor
Dec 2, 2024
55daf28
quick refactor
Dec 2, 2024
b9107a0
lets do the hit the step run queue from everywhere else too
Dec 2, 2024
f81f0f4
some more refactoring
Dec 3, 2024
0a0e66d
some more cleanup and refactor
Dec 3, 2024
679b5cf
cleanup the caches when we have quit the step run engine
Dec 3, 2024
a621d25
deal with the cache in the caller to prevent leaks
Dec 3, 2024
78caa84
Merge branch 'main' into feat-skip-states-for-workflows
reillyse Dec 3, 2024
877dc08
Merge branch 'feat-skip-states-for-workflows' of github.com:hatchet-d…
Dec 3, 2024
63f7076
cleanup unused fields in query, no need to update the workflow run - …
Dec 3, 2024
3852d03
crazy-dag with e2e
Dec 5, 2024
cfa4c8e
parallelize short circuiting
Dec 6, 2024
51681f7
fix the simple test and reduce noise
Dec 6, 2024
bdc4470
check for the onfailure job when we create and only then update the w…
Dec 6, 2024
f92c408
clean up
Dec 6, 2024
f8c0234
reduce noise in tests
Dec 6, 2024
607a2a0
Merge branch 'main' of github.com:hatchet-dev/hatchet
Dec 16, 2024
97a7a8e
merge
Dec 16, 2024
6735c03
some cleanup
Dec 17, 2024
ad903fe
crazy dag
Dec 17, 2024
b180036
cleanup
Dec 17, 2024
172aa64
cleanup comments and make func private
Dec 17, 2024
abcfb2c
merge
reillyse Dec 18, 2024
a9f603d
Merge branch 'main' into feat-skip-states-for-workflows
reillyse Dec 18, 2024
d602a47
generate to remove the comment
reillyse Dec 18, 2024
8347614
make the e2e check the state of the workflow run to make sure it was …
reillyse Dec 18, 2024
30aca96
merge
reillyse Dec 18, 2024
3307c74
namespace the load tests and only queue the item once
reillyse Dec 19, 2024
92a1725
not working locally lets see about actions
reillyse Dec 19, 2024
72a26fe
no delay creating workflow runs
reillyse Dec 19, 2024
e0df595
cleanup migrations
reillyse Dec 19, 2024
981f51d
make the concurrency test a bit more robust, explicitly check for the…
reillyse Dec 19, 2024
422acdb
more modifications for the concurrency test
reillyse Dec 20, 2024
4b78b2e
log and return an error
reillyse Dec 20, 2024
e7aa2f1
remove the namespace stuff to debug
reillyse Dec 20, 2024
eccb383
maybe the duplicate code is causing this
reillyse Dec 20, 2024
2942597
tighten up the tests a little
reillyse Dec 20, 2024
f382827
rewrite load tests
reillyse Dec 21, 2024
c8e9fe8
if we don't have a worker we can't register a workflow
reillyse Dec 21, 2024
677bc5b
remove debug
reillyse Dec 21, 2024
8d151e9
clean up context and go funcs
reillyse Dec 21, 2024
dc8ffff
fix the crazy dag timeout
reillyse Dec 21, 2024
761501b
add back in the timeout
reillyse Dec 21, 2024
e6bc3d3
explicitly quit the go funcs
reillyse Dec 21, 2024
99aa4b3
don't wait for the engine
reillyse Dec 21, 2024
ac6974d
wait for the engine to cleanup
reillyse Dec 21, 2024
0df3fb3
clean up the worker off the ctx
reillyse Dec 21, 2024
e08d0bb
tighten up the failure a little
reillyse Dec 21, 2024
6d86d98
add a bunch of logging - not working locally want to see if it works …
reillyse Dec 23, 2024
b973390
revert the change to running the engine
reillyse Dec 23, 2024
e5a913d
see how these work on github actions
reillyse Dec 24, 2024
80c9ee9
fix the execution duration for the single event test
reillyse Dec 24, 2024
c6f8ba7
test: fix the test so we don't time out due to lack of activity
reillyse Dec 24, 2024
767984e
clean up debug log
reillyse Dec 24, 2024
4110399
clean up logging
reillyse Dec 24, 2024
f396231
relax so we don't flake
reillyse Dec 24, 2024
6cc7a1a
cleanup test
reillyse Dec 27, 2024
0ffedf2
add a new test with limits
reillyse Dec 27, 2024
b8afe82
cleanup commits, remove accidentally committed files, patch up schema…
reillyse Dec 27, 2024
7b7b98b
fixing the migrations
reillyse Dec 27, 2024
5a857b7
fix migration change precommit
reillyse Dec 27, 2024
4fe214b
Merge branch 'main' into feat-skip-states-for-workflows
reillyse Dec 27, 2024
709aafb
don't quit test early
reillyse Dec 27, 2024
1781404
fix potential race when cleaning up
reillyse Dec 27, 2024
ef99e40
cleanup
reillyse Dec 27, 2024
4df0ccf
simplify and cleanup the loadtest
reillyse Dec 27, 2024
46ecc2a
turn down the log level
reillyse Dec 27, 2024
a9053fa
fix comment
reillyse Dec 27, 2024
6a57af0
add in utils, incorporate feedback from review
reillyse Jan 13, 2025
ef74b41
Merge branch 'main' into feat-skip-states-for-workflows
reillyse Jan 13, 2025
c709068
update the sql
reillyse Jan 13, 2025
b17c746
configure the client log level as well as the worker
reillyse Jan 14, 2025
462a5b4
Merge branch 'main' into feat-skip-states-for-workflows
reillyse Jan 14, 2025
efbc8e9
warn for client logger
reillyse Jan 14, 2025
a8f6c69
adjust timing for postgres
reillyse Jan 14, 2025
c4a976e
lets see if increasing the timeout helps these pass
reillyse Jan 14, 2025
3c42538
replace Logger
reillyse Jan 14, 2025
3324e04
change it so it's just the timeout that is extended
reillyse Jan 14, 2025
89b8529
seems like the events are taking 15 seconds on postgres mq
reillyse Jan 14, 2025
23ea8e0
fudge to get the tests to pass even though they are sometimes slow
reillyse Jan 14, 2025
1ac1b6f
Merge branch 'main' into feat-skip-states-for-workflows
reillyse Jan 18, 2025
4a6675b
merge in main
reillyse Jan 21, 2025
358a57e
get rid of timers
reillyse Jan 22, 2025
c776c6d
change the timeouts and fix data race:
reillyse Jan 22, 2025
dfa72a5
fix nanosecond overflow
reillyse Jan 22, 2025
eda3247
get rid of the moving timeout too flakey
reillyse Jan 22, 2025
a5edbab
tweak these timings - I think sometimes in github they don't start in…
reillyse Jan 22, 2025
529fe1c
tests passing
reillyse Jan 23, 2025
fa8a681
lets increase the timeout this appears to just be flaky
reillyse Jan 23, 2025
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -6,7 +6,7 @@ repos:
       - id: mixed-line-ending
         args: ["--fix=lf"]
       - id: end-of-file-fixer
-        exclude: prisma/migrations/.*\.sql|sql/migrations/.*\.sql
+        exclude: prisma/migrations/.*\.sql|sql/migrations/.*\.sql|sql/schema/schema.sql
       - id: trailing-whitespace
         exclude: prisma/migrations/.*\.sql|sql/migrations/.*\.sql
       - id: check-yaml
4 changes: 2 additions & 2 deletions Taskfile.yaml
@@ -59,8 +59,8 @@ tasks:
   recreate-db-from-scratch:
     cmds:
       - docker compose down
-      - docker volume rm oss_hatchet_postgres_data
-      - docker volume rm oss_hatchet_rabbitmq_data
+      - docker volume rm oss_hatchet_postgres_data || true
+      - docker volume rm oss_hatchet_rabbitmq_data || true
       - docker compose up -d
       - task: setup
       - task: init-dev-env
41 changes: 29 additions & 12 deletions api/v1/server/handlers/workflows/trigger.go
@@ -12,6 +12,7 @@ import (
 	"github.com/hatchet-dev/hatchet/api/v1/server/oas/transformers"
 	"github.com/hatchet-dev/hatchet/internal/msgqueue"
 	"github.com/hatchet-dev/hatchet/internal/services/shared/tasktypes"
+	wutils "github.com/hatchet-dev/hatchet/internal/workflowutils"
 	"github.com/hatchet-dev/hatchet/pkg/repository"
 	"github.com/hatchet-dev/hatchet/pkg/repository/metered"
 	"github.com/hatchet-dev/hatchet/pkg/repository/prisma/db"
@@ -95,21 +96,37 @@ func (t *WorkflowService) WorkflowRunCreate(ctx echo.Context, request gen.Workfl
 		return nil, fmt.Errorf("trigger.go could not create workflow run: %w", err)
 	}

-	// send to workflow processing queue
-	err = t.config.MessageQueue.AddMessage(
-		ctx.Request().Context(),
-		msgqueue.WORKFLOW_PROCESSING_QUEUE,
-		tasktypes.WorkflowRunQueuedToTask(
-			sqlchelpers.UUIDToStr(createdWorkflowRun.TenantId),
-			sqlchelpers.UUIDToStr(createdWorkflowRun.ID),
-		),
-	)
+	if !wutils.CanShortCircuit(createdWorkflowRun.Row) {
+		// send to workflow processing queue
+		err = t.config.MessageQueue.AddMessage(
+			ctx.Request().Context(),
+			msgqueue.WORKFLOW_PROCESSING_QUEUE,
+			tasktypes.WorkflowRunQueuedToTask(
+				sqlchelpers.UUIDToStr(createdWorkflowRun.Row.WorkflowRun.TenantId),
+				sqlchelpers.UUIDToStr(createdWorkflowRun.Row.WorkflowRun.ID),
+			),
+		)

-	if err != nil {
-		return nil, fmt.Errorf("could not add workflow run to queue: %w", err)
+		if err != nil {
+			return nil, fmt.Errorf("could not add workflow run to queue: %w", err)
+		}
 	}

-	workflowRun, err := t.config.APIRepository.WorkflowRun().GetWorkflowRunById(ctx.Request().Context(), tenant.ID, sqlchelpers.UUIDToStr(createdWorkflowRun.ID))
+	for _, queueName := range createdWorkflowRun.InitialStepRunQueueNames {
+
+		if schedPartitionId, ok := tenant.SchedulerPartitionID(); ok {
+			err = t.config.MessageQueue.AddMessage(
+				ctx.Request().Context(),
+				msgqueue.QueueTypeFromPartitionIDAndController(schedPartitionId, msgqueue.Scheduler),
+				tasktypes.CheckTenantQueueToTask(tenant.ID, queueName, true, false),
+			)
+
+			if err != nil {
+				t.config.Logger.Err(err).Msg("could not add message to scheduler partition queue")
+			}
+		}
+	}
+	workflowRun, err := t.config.APIRepository.WorkflowRun().GetWorkflowRunById(ctx.Request().Context(), tenant.ID, sqlchelpers.UUIDToStr(createdWorkflowRun.Row.WorkflowRun.ID))

 	if err != nil {
 		return nil, fmt.Errorf("could not get workflow run: %w", err)
10 changes: 2 additions & 8 deletions cmd/hatchet-engine/engine/run.go
@@ -113,7 +113,6 @@ func RunWithConfig(ctx context.Context, sc *server.ServerConfig) ([]Teardown, er
 	if isV1 {
 		return runV1Config(ctx, sc)
 	}
-
 	return runV0Config(ctx, sc)
 }

@@ -371,6 +370,7 @@ func runV0Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 		admin.WithRepository(sc.EngineRepository),
 		admin.WithMessageQueue(sc.MessageQueue),
 		admin.WithEntitlementsRepository(sc.EntitlementRepository),
+		admin.WithLogger(sc.Logger),
 	)
 	if err != nil {
 		return nil, fmt.Errorf("could not create admin service: %w", err)
@@ -551,7 +551,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	if err != nil {
 		return nil, fmt.Errorf("could not create events controller: %w", err)
 	}
-
 	cleanup, err := ec.Start()

 	if err != nil {
@@ -574,7 +573,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	if err != nil {
 		return nil, fmt.Errorf("could not create ticker: %w", err)
 	}
-
 	cleanup, err = t.Start()

 	if err != nil {
@@ -599,7 +597,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	if err != nil {
 		return nil, fmt.Errorf("could not create jobs controller: %w", err)
 	}
-
 	cleanupJobs, err := jc.Start()

 	if err != nil {
@@ -672,7 +669,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	})

 	wh := webhooks.New(sc, p)
-
 	cleanup2, err := wh.Start()

 	if err != nil {
@@ -701,7 +697,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	if err != nil {
 		return nil, fmt.Errorf("could not create dispatcher: %w", err)
 	}
-
 	dispatcherCleanup, err := d.Start()

 	if err != nil {
@@ -732,6 +727,7 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 		admin.WithRepository(sc.EngineRepository),
 		admin.WithMessageQueue(sc.MessageQueue),
 		admin.WithEntitlementsRepository(sc.EntitlementRepository),
+		admin.WithLogger(sc.Logger),
 	)

 	if err != nil {
@@ -761,7 +757,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	if err != nil {
 		return nil, fmt.Errorf("could not create grpc server: %w", err)
 	}
-
 	grpcServerCleanup, err := s.Start()
 	if err != nil {
 		return nil, fmt.Errorf("could not start grpc server: %w", err)
@@ -852,7 +847,6 @@ func runV1Config(ctx context.Context, sc *server.ServerConfig) ([]Teardown, erro
 	if healthProbes {
 		h.SetReady(true)
 	}
-
 	<-ctx.Done()

 	if healthProbes {
4 changes: 2 additions & 2 deletions examples/bulk_imports/main.go
@@ -83,9 +83,9 @@ func run() (func() error, error) {

 	var events []client.EventWithAdditionalMetadata

-	// 20000 times to test the bulk push
+	// 999 (max amount) times to test the bulk push

-	for i := 0; i < 20000; i++ {
+	for i := 0; i < 999; i++ {
 		testEvent := userCreateEvent{
 			Username: "echo-test",
 			UserID: "1234 " + fmt.Sprint(i),
88 changes: 75 additions & 13 deletions examples/concurrency/main.go
@@ -30,27 +30,63 @@ func main() {
}

events := make(chan string, 50)
wfrIds := make(chan *client.Workflow, 50)
interrupt := cmdutils.InterruptChan()
c, err := client.New()

cleanup, err := run(events)
if err != nil {
log.Fatalf("error creating client: %v", err)
}
cleanup, err := run(c, events, wfrIds)
if err != nil {
panic(err)
}
selectLoop:
for {
select {

case <-interrupt:
log.Print("Interrupted")
break selectLoop
case wfrId := <-wfrIds:
log.Printf("Workflow run id: %s", wfrId.WorkflowRunId())
wfResult, err := wfrId.Result()
if err != nil {

<-interrupt
if err.Error() == "step output for step-one not found" {
log.Printf("Step output for step-one not found because it was cancelled due to CANCELLED_BY_CONCURRENCY_LIMIT")
continue
}
panic(fmt.Errorf("error getting workflow run result: %w", err))
}

stepOneOutput := &stepOneOutput{}

err = wfResult.StepOutput("step-one", stepOneOutput)

if err != nil {
if err.Error() == "step run failed: this step run was cancelled due to CANCELLED_BY_CONCURRENCY_LIMIT" {
log.Printf("Workflow run was cancelled due to CANCELLED_BY_CONCURRENCY_LIMIT")
continue
}
if err.Error() == "step output for step-one not found" {
log.Printf("Step output for step-one not found because it was cancelled due to CANCELLED_BY_CONCURRENCY_LIMIT")
continue
}
panic(fmt.Errorf("error getting workflow run result: %w", err))
}
case e := <-events:
log.Printf("Event: %s", e)
}
}

if err := cleanup(); err != nil {

panic(fmt.Errorf("error cleaning up: %w", err))
}
}

func run(events chan<- string) (func() error, error) {
c, err := client.New()

if err != nil {
return nil, fmt.Errorf("error creating client: %w", err)
}
func run(c client.Client, events chan<- string, wfrIds chan<- *client.Workflow) (func() error, error) {

w, err := worker.NewWorker(
worker.WithClient(
@@ -74,7 +110,8 @@ func run(events chan<- string) (func() error, error) {
err = ctx.WorkflowInput(input)

// we sleep to simulate a long running task
time.Sleep(10 * time.Second)

time.Sleep(7 * time.Second)

if err != nil {

@@ -98,7 +135,11 @@ func run(events chan<- string) (func() error, error) {
err = ctx.StepOutput("step-one", input)

if err != nil {
return nil, err

if err.Error() == "step run failed: this step run was cancelled due to CANCELLED_BY_CONCURRENCY_LIMIT" {
return nil, nil
}

}

if ctx.Err() != nil {
@@ -125,18 +166,39 @@ func run(events chan<- string) (func() error, error) {
"test": "test",
},
}

// I want some to be in Running and some to be in Pending so we cancel both

go func() {
// do this 10 times to test concurrency
for i := 0; i < 10; i++ {
// do this 7 times to test concurrency
for i := 0; i < 7; i++ {

wfr_id, err := c.Admin().RunWorkflow("simple-concurrency", testEvent)

log.Println("Starting workflow run id: ", wfr_id)
log.Println("Starting workflow run id: ", wfr_id.WorkflowRunId())

if err != nil {
panic(fmt.Errorf("error running workflow: %w", err))
}

wfrIds <- wfr_id
time.Sleep(400 * time.Millisecond)
}
}()
go func() {
// do this 13 times to test concurrency (20 times total)
for i := 0; i < 13; i++ {

wfr_id, err := c.Admin().RunWorkflow("simple-concurrency", testEvent)

log.Println("Starting workflow run id: ", wfr_id.WorkflowRunId())

if err != nil {
panic(fmt.Errorf("error running workflow: %w", err))
}

wfrIds <- wfr_id

}
}()
