
sqlmesh runs seem to always take hours #2109

Open
ryscheng opened this issue Sep 9, 2024 · 6 comments

@ryscheng
Member

ryscheng commented Sep 9, 2024

What is it?

I only wanted to copy 1 column into metrics_v0, and it turned into a multi-hour run.
Is there a bug? Better logging would help us determine this.

@ravenac95
Member

Ok. So I've been spending maybe too much time on this, but the current performance of sqlmesh + ClickHouse is seemingly untenable.

What we're seeing right now are a few things:

  • ClickHouse doesn't handle DELETE very well. It's normally an asynchronous action, but sqlmesh issues a DELETE to do its incremental loads. In asynchronous mode we get unexpected errors because the data isn't ready yet while it's still mutating in ClickHouse (see the sketch after this list).
  • Concurrent requests seem to hit similar problems in ClickHouse - there's more to test here, but it's not looking good.
  • We can get this to run without issue, but only by making the mutations synchronous and the execution serial. The caveat is that it then takes ages: after 5 hours of execution, only 5% of the queued queries had completed (this is if we process all time).
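
For reference, a minimal sketch of the race we're hitting (table and column names here are illustrative, not our actual schema):

```sql
-- By default, ALTER ... DELETE is an asynchronous mutation: it returns
-- immediately while parts are rewritten in the background.
ALTER TABLE metrics_v0 DELETE WHERE bucket_day >= '2024-08-01';

-- A read that races the mutation can still see the rows being deleted,
-- which is what breaks the incremental load.
SELECT count() FROM metrics_v0 WHERE bucket_day >= '2024-08-01';

-- In-flight mutations are visible here:
SELECT mutation_id, is_done FROM system.mutations WHERE table = 'metrics_v0';

-- Making mutations synchronous avoids the race but serializes the work
-- (1 = wait for the current replica, 2 = wait for all replicas):
SET mutations_sync = 2;
ALTER TABLE metrics_v0 DELETE WHERE bucket_day >= '2024-08-01';
```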

I've messed with various scaling/optimization techniques:

  • Partitioning - This seems to have helped with some memory issues we had initially when running some of the sqlmesh queries on the larger rolling time windows. So good on that.
  • Increasing concurrency on sqlmesh - This seemed to help, but then I ran into issues with mutating the same table that is being deleted from. I think there's a way to set concurrency limits on a per-model basis so any given model doesn't do anything concurrently.
  • Instead of using the default MergeTree, I have switched to VersionedCollapsingMergeTree in some cases (see the table sketch after this list). This seems to help significantly: some requests that were taking ~5-10 mins without it now complete within a minute. There are some serious caveats with this approach, but it might be worth using. See the docs.
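
For context, the sketch below shows the table shape this engine requires (names are illustrative, not our actual models): every row carries a Sign and a Version.

```sql
CREATE TABLE metrics_v0_vcmt
(
    metric      String,
    to_artifact String,
    bucket_day  Date,
    amount      Float64,
    Sign        Int8,    -- +1 = state row, -1 = cancels a previous state row
    Version     UInt64   -- higher Version supersedes lower during collapse
)
ENGINE = VersionedCollapsingMergeTree(Sign, Version)
PARTITION BY toYYYYMM(bucket_day)
ORDER BY (metric, to_artifact, bucket_day);
```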

I don't want to abandon ClickHouse, but it's definitely giving me a bunch of headaches. I think, particularly for the way our metrics are calculated, we could do massively parallel processing, but there seems to be limited capacity for that right now.

@davidgasquez
Contributor

Hey! Curious whether you've explored ClickHouse Materialized Views for this, and what issues you hit if so!

Also, ClickHouse is... a beast, but a complicated beast. Have you considered getting some consulting help? The one thing I know about ClickHouse is that the right settings and setup can take a query from never-ending to completing in milliseconds. 😅

@ravenac95
Member

ravenac95 commented Sep 18, 2024

@davidgasquez ya, sadly I did look into materialized views, but many of our queries are based on rolling windows of the last X days/months/years. Previously, with BigQuery, we attempted to calculate this all at once by essentially generating data for all time in a single pass, but it ended up running out of memory. It also limited the amount of recursion one could do with those rolling windows.

Now with sqlmesh we generate metrics across various dimensions by calling a timeseries_metrics function that takes some configuration for metrics and automatically generates models for each of the dimensions we care about and all the specific times/rolling windows (a sketch of the window shape is below). So sadly, I think, except for simple aggregations, rolling windows won't work well as materialized views on ClickHouse, unless I'm misunderstanding. That being said, yeah, we should definitely have a chat with a ClickHouse engineer.
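
To make the shape concrete, here's a hypothetical example of the kind of rolling-window model we generate (daily_events, amount_90d, etc. are illustrative names, not our actual schema):

```sql
-- A trailing-90-day sum computed for every day. Each output row depends on a
-- sliding range of input rows, so it can't be maintained by an insert-driven
-- materialized view the way a plain aggregation can. (Whether RANGE frames
-- over a Date column work in a given ClickHouse version is worth verifying.)
SELECT
    to_artifact_id,
    bucket_day,
    sum(amount) OVER (
        PARTITION BY to_artifact_id
        ORDER BY bucket_day
        RANGE BETWEEN 89 PRECEDING AND CURRENT ROW
    ) AS amount_90d
FROM daily_events;
```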

Funny enough, running this with DuckDB: totally no problems, haha. If only MotherDuck were on GCP.

I think the solution really is using VersionedCollapsingMergeTree; it just means we rewrite all of the metric aggregation functions to be aware of the Sign column present in the table (see the sketch below). In my tests, a query went from a 6 min execution to 5 seconds, and both were looking at something like 90-day periods.
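
Concretely, "sign-aware" means rewriting aggregations like this (illustrative names, matching the table sketch above):

```sql
-- Uncollapsed +1/-1 row pairs may still coexist until parts merge, so plain
-- sum()/count() would double count. Multiplying by Sign cancels them out:
SELECT
    to_artifact,
    sum(amount * Sign) AS amount,  -- instead of sum(amount)
    sum(Sign)          AS events   -- instead of count()
FROM metrics_v0_vcmt
GROUP BY to_artifact
HAVING sum(Sign) > 0;              -- drop fully cancelled groups
```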

All this being said, because of the way our metrics are currently calculated (each as separate tables that are then joined in a view later), I think the workload actually fits really well with horizontally scalable architectures like Trino, StarRocks, Spark, or Druid. ClickHouse doesn't do well with horizontal scale based on all that I've seen, but I think these other solutions might actually perform well here. Sadly, it's a bit of work to get those set up and tested as well.

@ravenac95
Member

ravenac95 commented Sep 18, 2024

Oh, interestingly though, I'm realizing I can use materialized views to improve the performance of some of the upstream models that we use to calculate the metrics. I hadn't thought about this until now, but in theory that should take care of a big part of the initial calculation's issues (a sketch is below). Though I may have to abandon doing any testing on DuckDB.

This could also be used with some of the other metric aggregations that aren't rolling.
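
A rough sketch of what I mean (names are illustrative): a materialized view that folds raw events into daily rows at insert time, so the downstream rolling-window models scan far less data.

```sql
CREATE TABLE daily_events
(
    to_artifact_id   String,
    from_artifact_id String,
    bucket_day       Date,
    amount           Float64
)
ENGINE = SummingMergeTree  -- rows with the same key are summed on merge
ORDER BY (to_artifact_id, from_artifact_id, bucket_day);

-- Populated automatically on every insert into 'events':
CREATE MATERIALIZED VIEW daily_events_mv TO daily_events AS
SELECT
    to_artifact_id,
    from_artifact_id,
    toDate(event_time) AS bucket_day,
    sum(amount)        AS amount
FROM events
GROUP BY to_artifact_id, from_artifact_id, bucket_day;
```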

@ravenac95
Member

Well, hrm, this might be exactly what I need: https://clickhouse.com/docs/en/sql-reference/statements/create/view#window-view-experimental

...but I imagine it isn't supported by sqlmesh. Hrmmmmm.
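
From the linked docs, the feature looks roughly like this (adapted sketch; events/event_time are placeholder names, and the syntax may change since it's experimental):

```sql
-- Window views must be explicitly enabled:
SET allow_experimental_window_view = 1;

-- hop() gives sliding windows, which is what our rolling metrics would want:
-- here, a 90-day window that advances one day at a time.
CREATE WINDOW VIEW events_wv AS
SELECT
    count(*)       AS cnt,
    hopStart(w_id) AS window_start
FROM events
GROUP BY hop(event_time, INTERVAL '1' DAY, INTERVAL '90' DAY) AS w_id;
```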

@ryscheng
Member Author

@ravenac95 CCed you on a thread to ask the ClickHouse folks directly. They've offered engineering support in the past.

Feel free to dump questions in the thread for them to forward to the right person.
