
sqlmesh runs seem to always take hours #2109

Open
ryscheng opened this issue Sep 9, 2024 · 6 comments

@ryscheng
Member

ryscheng commented Sep 9, 2024

What is it?

I only wanted to copy 1 column into metrics_v0, and it turned into a multi-hour run.
Is there a bug? Better logging would help us determine this.

@ravenac95
Member

Ok. So I've been spending maybe too much time on this, but the current performance of sqlmesh + ClickHouse is seemingly untenable.

What we're seeing right now are a few things:

  • ClickHouse doesn't handle DELETE very well. It's normally an asynchronous action, but sqlmesh issues a DELETE to do its incremental loads. In asynchronous mode we get unexpected errors because the data isn't ready yet while it's still mutating in ClickHouse (see the sketch after this list).
  • Concurrent requests seem to hit similar problems in ClickHouse - there's more to test here, but it's not looking good.
  • We can get this to run without issue, but only by making the mutations synchronous and the execution serial. The caveat is that it then takes ages: after 5 hours of execution, only 5% of the queued queries had completed (this is if we process all time).
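
For reference, a minimal sketch of the race we're hitting (table and column names here are illustrative, not our actual schema):

```sql
-- By default, ALTER ... DELETE is an asynchronous mutation: it returns
-- immediately while parts are rewritten in the background.
ALTER TABLE metrics_v0 DELETE WHERE bucket_day >= '2024-08-01';

-- A read that races the mutation can still see the rows being deleted,
-- which is what breaks the incremental load.
SELECT count() FROM metrics_v0 WHERE bucket_day >= '2024-08-01';

-- In-flight mutations are visible here:
SELECT mutation_id, is_done FROM system.mutations WHERE table = 'metrics_v0';

-- Making mutations synchronous avoids the race but serializes the work
-- (1 = wait for the current replica, 2 = wait for all replicas):
SET mutations_sync = 2;
ALTER TABLE metrics_v0 DELETE WHERE bucket_day >= '2024-08-01';
```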

I've messed with various scaling/optimization techniques:

  • Partitioning - This seems to have helped with some memory issues we had initially when running some of the sqlmesh queries on the larger rolling time windows. So good on that.
  • Increasing concurrency on sqlmesh - This seemed to help, but then I ran into issues with mutating the same table that is being deleted from. I think there's a way to set concurrency limits on a per-model basis so any given model doesn't do anything concurrently.
  • Instead of using the default MergeTree, I have switched to VersionedCollapsingMergeTree in some cases (see the table sketch after this list). This seems to help significantly: some requests that were taking ~5-10 mins without it now complete within a minute. There are some serious caveats with this approach, but it might be worth using. See the docs.
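
For context, the sketch below shows the table shape this engine requires (names are illustrative, not our actual models): every row carries a Sign and a Version.

```sql
CREATE TABLE metrics_v0_vcmt
(
    metric      String,
    to_artifact String,
    bucket_day  Date,
    amount      Float64,
    Sign        Int8,    -- +1 = state row, -1 = cancels a previous state row
    Version     UInt64   -- higher Version supersedes lower during collapse
)
ENGINE = VersionedCollapsingMergeTree(Sign, Version)
PARTITION BY toYYYYMM(bucket_day)
ORDER BY (metric, to_artifact, bucket_day);
```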

I don't want to abandon ClickHouse, but it's definitely giving me a bunch of headaches. I think, particularly for the way our metrics are calculated, we could do massively parallel processing, but there seems to be limited capacity for that right now.

@davidgasquez
Contributor

Hey! Curious whether you've explored ClickHouse Materialized Views for this, and what issues you hit if so!

Also, ClickHouse is... a beast, but a complicated beast. Have you considered getting some consulting help? The one thing I know about ClickHouse is that the right settings and setup can take a query from never-ending to completing in milliseconds. 😅

@ravenac95
Member

ravenac95 commented Sep 18, 2024

@davidgasquez ya, sadly I did look into materialized views, but many of our queries are based on rolling windows of the last X days/months/years. Previously, with BigQuery, we attempted to calculate this all at once by essentially generating data for all time in a single pass, but it ended up running out of memory. It also limited the amount of recursion one could do with those rolling windows.

Now with sqlmesh we generate metrics across various dimensions by calling a timeseries_metrics function that takes some configuration for metrics and automatically generates models for each of the dimensions we care about and all the specific times/rolling windows (a sketch of the window shape is below). So sadly, I think, except for simple aggregations, rolling windows won't work well as materialized views on ClickHouse, unless I'm misunderstanding. That being said, yeah, we should definitely have a chat with a ClickHouse engineer.
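
To make the shape concrete, here's a hypothetical example of the kind of rolling-window model we generate (daily_events, amount_90d, etc. are illustrative names, not our actual schema):

```sql
-- A trailing-90-day sum computed for every day. Each output row depends on a
-- sliding range of input rows, so it can't be maintained by an insert-driven
-- materialized view the way a plain aggregation can. (Whether RANGE frames
-- over a Date column work in a given ClickHouse version is worth verifying.)
SELECT
    to_artifact_id,
    bucket_day,
    sum(amount) OVER (
        PARTITION BY to_artifact_id
        ORDER BY bucket_day
        RANGE BETWEEN 89 PRECEDING AND CURRENT ROW
    ) AS amount_90d
FROM daily_events;
```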

Funny enough, running this with DuckDB: totally no problems, haha. If only MotherDuck were on GCP.

I think the solution really is using VersionedCollapsingMergeTree; it just means we rewrite all of the metric aggregation functions to be aware of the Sign column present in the table (see the sketch below). In my tests, a query went from a 6 min execution to 5 seconds, and both were looking at something like 90-day periods.
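
Concretely, "sign-aware" means rewriting aggregations like this (illustrative names, matching the table sketch above):

```sql
-- Uncollapsed +1/-1 row pairs may still coexist until parts merge, so plain
-- sum()/count() would double count. Multiplying by Sign cancels them out:
SELECT
    to_artifact,
    sum(amount * Sign) AS amount,  -- instead of sum(amount)
    sum(Sign)          AS events   -- instead of count()
FROM metrics_v0_vcmt
GROUP BY to_artifact
HAVING sum(Sign) > 0;              -- drop fully cancelled groups
```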

All this being said, because of the way our metrics are currently calculated (each as separate tables that are then joined in a view later), I think the workload actually fits really well with horizontally scalable architectures like Trino, StarRocks, Spark, or Druid. ClickHouse doesn't do well with horizontal scale based on all that I've seen, but I think these other solutions might actually perform well here. Sadly, it's a bit of work to get those set up and tested as well.

@ravenac95
Member

ravenac95 commented Sep 18, 2024

Oh, interestingly though, I'm realizing I can use materialized views to improve the performance of some of the upstream models that we use to calculate the metrics. I hadn't thought about this until now, but in theory that should take care of a big part of the initial calculation's issues (a sketch is below). Though I may have to abandon doing any testing on DuckDB.

This could also be used with some of the other metric aggregations that aren't rolling.
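
A rough sketch of what I mean (names are illustrative): a materialized view that folds raw events into daily rows at insert time, so the downstream rolling-window models scan far less data.

```sql
CREATE TABLE daily_events
(
    to_artifact_id   String,
    from_artifact_id String,
    bucket_day       Date,
    amount           Float64
)
ENGINE = SummingMergeTree  -- rows with the same key are summed on merge
ORDER BY (to_artifact_id, from_artifact_id, bucket_day);

-- Populated automatically on every insert into 'events':
CREATE MATERIALIZED VIEW daily_events_mv TO daily_events AS
SELECT
    to_artifact_id,
    from_artifact_id,
    toDate(event_time) AS bucket_day,
    sum(amount)        AS amount
FROM events
GROUP BY to_artifact_id, from_artifact_id, bucket_day;
```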

@ravenac95
Member

Well, hrm, this might be exactly what I need: https://clickhouse.com/docs/en/sql-reference/statements/create/view#window-view-experimental

...but I imagine it isn't supported by sqlmesh. Hrmmmmm.
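
From the linked docs, the feature looks roughly like this (adapted sketch; events/event_time are placeholder names, and the syntax may change since it's experimental):

```sql
-- Window views must be explicitly enabled:
SET allow_experimental_window_view = 1;

-- hop() gives sliding windows, which is what our rolling metrics would want:
-- here, a 90-day window that advances one day at a time.
CREATE WINDOW VIEW events_wv AS
SELECT
    count(*)       AS cnt,
    hopStart(w_id) AS window_start
FROM events
GROUP BY hop(event_time, INTERVAL '1' DAY, INTERVAL '90' DAY) AS w_id;
```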

@ryscheng
Member Author

@ravenac95 CCed you on a thread to ask the ClickHouse folks directly. They've offered engineering support in the past.

Feel free to dump questions in the thread for them to forward to the right person.
