The timeseries metrics have proven too large for dbt and BigQuery to execute the way we want them to be executed.
Several approaches were attempted to get these models working properly, which unfortunately has led to a larger investigation.
Things tried:
Split the CTEs that compose the int_timeseries_ models into smaller models. This led to the discovery that generating 0 events for all artifacts creates far too large a model (55TB for developers * artifacts * days); see the sketch after this list.
SQLMesh against ClickHouse
Sadly, SQLMesh doesn't yet have a ClickHouse engine, so I attempted to use the MySQL engine instead; because ClickHouse isn't a complete MySQL implementation, that failed.
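For context on why the first attempt blew up, here is a rough back-of-the-envelope sketch in Python. The cardinalities and bytes-per-row figure are hypothetical placeholders (not taken from our warehouse); the point is that a dense developers * artifacts * days spine grows multiplicatively, which is how a 0-filled model reaches tens of terabytes.

```python
# Back-of-the-envelope estimate of a fully dense "0 events for everything" spine.
# All counts below are hypothetical placeholders, chosen only to illustrate how
# quickly developers * artifacts * days multiplies out.

def dense_spine_size_tb(developers: int, artifacts: int, days: int,
                        bytes_per_row: int = 64) -> float:
    """Approximate uncompressed size in TB of a fully dense event spine."""
    rows = developers * artifacts * days
    return rows * bytes_per_row / 1e12


if __name__ == "__main__":
    # Hypothetical cardinalities for illustration only.
    print(f"~{dense_spine_size_tb(50_000, 10_000, 1_800):,.0f} TB before compression")
```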
The remaining methods to try (in order) now are:
Use SQLMesh against BigQuery
Use Dagster with our own SQL execution against ClickHouse (see the sketch after this list)
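To make the Dagster option concrete, here is a minimal sketch of what "Dagster with our own SQL execution against ClickHouse" could look like. It assumes the clickhouse-connect client, and the database, table, and column names are hypothetical placeholders rather than our actual schema.

```python
# A minimal sketch: a Dagster asset that runs hand-written SQL directly against
# ClickHouse. Assumes the `dagster` and `clickhouse-connect` packages; the table
# and column names below are hypothetical placeholders.
import os

import clickhouse_connect
from dagster import asset


@asset
def daily_events_by_artifact() -> None:
    """Materialize a daily rollup table in ClickHouse with our own SQL."""
    client = clickhouse_connect.get_client(
        host=os.environ.get("CLICKHOUSE_HOST", "localhost"),
        username=os.environ.get("CLICKHOUSE_USER", "default"),
        password=os.environ.get("CLICKHOUSE_PASSWORD", ""),
    )
    client.command(
        """
        CREATE TABLE IF NOT EXISTS metrics.daily_events_by_artifact
        ENGINE = MergeTree()
        ORDER BY (artifact_id, bucket_day)
        AS
        SELECT
            artifact_id,
            toDate(event_time) AS bucket_day,
            count() AS event_count
        FROM metrics.timeseries_events
        GROUP BY artifact_id, bucket_day
        """
    )
```

The appeal here is full control over ClickHouse-specific SQL and materialization, at the cost of re-implementing the incremental/state tracking that SQLMesh or dbt would otherwise provide.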
Other things that have been considered and not tried:
Spark
For the queries we'd like to run, having a Spark cluster could make them possible against our BigQuery dataset (see the sketch after this list).
Trino
Similar to Spark; this gives us another lever to run those queries.
Apache Iceberg
Delta Lake
The main reason we didn't pursue most of these is that we want to be able to expose exploring the dimensions of our metrics to users. These options could still be useful in that world.
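To illustrate the Spark option from the list above, here is a minimal sketch of running a heavy aggregation over the BigQuery dataset from a Spark cluster using the spark-bigquery connector. The project, dataset, table, and bucket names are hypothetical placeholders, and the cluster itself is assumed to already exist.

```python
# A minimal sketch of the "Spark against our BigQuery dataset" option.
# Assumes a Spark cluster with the spark-bigquery connector on the classpath;
# all project/dataset/table/bucket names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("timeseries-metrics-sketch").getOrCreate()

# Read the raw events straight out of BigQuery.
events = (
    spark.read.format("bigquery")
    .option("table", "my-project.oso.timeseries_events")
    .load()
)

# The heavy aggregation that is too large for dbt + BigQuery alone.
daily = (
    events
    .withColumn("bucket_day", F.to_date("event_time"))
    .groupBy("artifact_id", "bucket_day")
    .agg(F.count("*").alias("event_count"))
)

daily.write.mode("overwrite").parquet("gs://my-bucket/daily_events_by_artifact/")
```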
SQLMesh was attempted, but the current issue is that it doesn't support running on ClickHouse. Additionally, if we are to run this on BigQuery, we still need to finish writing a dagster-sqlmesh adapter. I have one I've been working on personally that I figured I'd contribute if we wanted it, but it's still missing a few necessary features.
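For reference, the simplest thing a dagster-sqlmesh adapter has to do is invoke SQLMesh's Python API from inside Dagster. A minimal sketch, assuming a SQLMesh project at a hypothetical ./metrics_mesh path with a gateway already configured for the target warehouse; a real adapter would also surface per-model assets, plans, and audit results.

```python
# A minimal sketch of driving SQLMesh from Dagster, roughly what a
# dagster-sqlmesh adapter does under the hood. The project path is a
# hypothetical placeholder.
from dagster import asset
from sqlmesh import Context


@asset
def sqlmesh_run() -> None:
    """Run all models with missing intervals in the SQLMesh project."""
    context = Context(paths="./metrics_mesh")
    context.run()
```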
We are currently testing cube.dev. In general it seems nice, but there is a set of problems I'm currently debugging. Namely, some of our classifications for metrics (dimensions in Cube) happen over a rolling window. Creating a dimension like this isn't supported, so I've tried a few workarounds, and those still don't seem to work. The latest issue appears to be that using a rolling window measure with ClickHouse hits some kind of bug. I'm testing a few more things there, but we may need a hybrid of processing rolling windows with SQLMesh and using Cube for querying some of the preprocessed metrics.
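As a sketch of that hybrid direction, the rolling-window classification could be precomputed in a SQLMesh model and then exposed to Cube as an ordinary column. The example below is a hypothetical SQLMesh Python model that computes a trailing 30-day event count per artifact with pandas; the model name, source table, and column names are placeholders, not our actual schema.

```python
# A hypothetical SQLMesh Python model that precomputes a trailing 30-day event
# count per artifact, so Cube can treat the result as a plain column instead of
# a rolling-window dimension. All names are placeholders.
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "metrics.rolling_30d_events",
    columns={
        "artifact_id": "text",
        "bucket_day": "date",
        "events_30d": "int",
    },
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Fetch daily event counts for the interval being computed (placeholder table).
    # A real model would also need a 30-day lookback before `start`.
    df = context.fetchdf(
        f"""
        SELECT artifact_id, bucket_day, event_count
        FROM metrics.daily_events_by_artifact
        WHERE bucket_day BETWEEN '{start.date()}' AND '{end.date()}'
        """
    )
    df = df.sort_values(["artifact_id", "bucket_day"])
    # Assumes one row per artifact per day, so a 30-row window is a 30-day window.
    df["events_30d"] = (
        df.groupby("artifact_id")["event_count"]
        .transform(lambda s: s.rolling(window=30, min_periods=1).sum())
    )
    return df[["artifact_id", "bucket_day", "events_30d"]]
```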
ravenac95 changed the title from "New timeseries metrics models are too large for current dbt configuration" to "Investigation: New timeseries metrics models are too large for current dbt configuration" on Aug 27, 2024.
I'm going to close this now in favor of breaking down the remaining work. This was a large chunk of the sprint, but what remains is to make some pragmatic choices so this works with a hybrid solution of SQLMesh + Cube + ClickHouse.