Investigation: New timeseries metrics models are too large for current dbt configuration #1970

Closed
ravenac95 opened this issue Aug 21, 2024 · 4 comments

ravenac95 commented Aug 21, 2024

What is it?

The timeseries metrics models have proven too large for dbt and BigQuery to execute in the way we intended.

Several things were attempted to get these models working properly, which has unfortunately led to a larger investigation.

Things tried:

  • Split the CTEs that compose the int_timeseries_ models into smaller models. This led to the discovery that generating zero-count events for all artifacts creates far too large a model (~55TB for developers * artifacts * days; see the sketch after this list)
  • SQLMesh against ClickHouse
    • Sadly, SQLMesh doesn't have a ClickHouse engine yet. Since ClickHouse speaks the MySQL wire protocol, I attempted to use the MySQL engine instead, but because ClickHouse isn't a complete MySQL implementation, that failed.
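
To make the zero-fill problem concrete, here is a minimal sketch of the pattern that explodes; every table and column name is hypothetical rather than taken from our models:

```python
# Hedged sketch of the dense zero-fill pattern (hypothetical names).
# The CROSS JOINs materialize one row per developer * artifact * day,
# even for combinations that never had an event -- which is how an
# intermediate model balloons to an estimate like 55TB.
DENSE_ZERO_FILL_SQL = """
SELECT
    d.day,
    dev.developer_id,
    art.artifact_id,
    COALESCE(e.event_count, 0) AS event_count
FROM calendar_days AS d
CROSS JOIN developers AS dev
CROSS JOIN artifacts AS art
LEFT JOIN daily_events AS e
    ON e.day = d.day
    AND e.developer_id = dev.developer_id
    AND e.artifact_id = art.artifact_id
"""
# Keeping the model sparse (only rows where events exist) and
# zero-filling at query time avoids materializing the empty cells.
```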

The remaining methods to try, in order, are:

  • Use SQLMesh against BigQuery
  • Use Dagster against ClickHouse with our own SQL execution (see the sketch after this list)
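
For the second option, here is a rough sketch of what running our own SQL against ClickHouse from Dagster could look like. The host, credentials, table names, and model SQL are all illustrative assumptions, and clickhouse-connect is just one possible client:

```python
# Hedged sketch: a Dagster asset that executes SQL directly on ClickHouse.
# Connection details and the model SQL are illustrative assumptions.
import dagster as dg
import clickhouse_connect  # pip install clickhouse-connect

TIMESERIES_MODEL_SQL = """
CREATE OR REPLACE TABLE metrics.int_timeseries_events
ENGINE = MergeTree ORDER BY (day, artifact_id) AS
SELECT
    toDate(time) AS day,
    artifact_id,
    count() AS event_count
FROM events
GROUP BY day, artifact_id
"""

@dg.asset
def int_timeseries_events() -> None:
    """Materialize a timeseries model by running SQL on ClickHouse."""
    client = clickhouse_connect.get_client(
        host="clickhouse.internal.example",  # hypothetical host
        username="default",
        password="",
    )
    client.command(TIMESERIES_MODEL_SQL)
```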

Other things that have been considered and not tried:

  • Spark
    • For the queries we'd like to run, a Spark cluster could make them feasible against our BigQuery dataset
  • Trino
    • Similar to Spark. This would give us the same lever to run those queries.
  • Apache Iceberg
  • Delta Lake

The main reason we didn't pursue most of these is that we want users to be able to explore the dimensions of our metrics. These options could still be useful in that world.

@ravenac95

Related to #1981 and #1982


ravenac95 commented Aug 26, 2024

SQLMesh was attempted, but the current blocker is that it doesn't support running on ClickHouse. Additionally, if we are to run this on BigQuery, we still need to finish writing a dagster-sqlmesh adapter. I have one I've been working on personally that I figured I'd contribute if we wanted it, but it's still missing a few necessary features.
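
The core of such an adapter can be approximated by driving SQLMesh's Python API from a Dagster asset. This is only a minimal sketch with a hypothetical project path, not the adapter mentioned above:

```python
# Minimal sketch of bridging Dagster and SQLMesh via SQLMesh's Python API.
# This is NOT the dagster-sqlmesh adapter referenced above; the project
# path is a hypothetical placeholder.
import dagster as dg
from sqlmesh import Context

@dg.asset
def sqlmesh_project_run() -> None:
    """Run all due SQLMesh models from inside Dagster."""
    context = Context(paths="./sqlmesh")  # hypothetical project path
    context.run()  # executes models according to their cadence
```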

@ravenac95

We are currently testing cube.dev. In general it seems nice, but there is a set of problems I'm currently debugging. Namely, some of our classifications for metrics (dimensions in Cube) happen over a rolling window. Creating a dimension like this isn't supported, so I've tried a few workarounds, and those still don't seem to work. The latest issue appears to be that using a rolling-window measure with ClickHouse hits some kind of bug. I'm testing a few more things there, but we may need a hybrid: process the rolling windows with SQLMesh and use Cube to query the preprocessed metrics (a sketch of that hybrid follows below).
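
To make the hybrid concrete, one option is to precompute the rolling window upstream, e.g. in a SQLMesh model running on ClickHouse, so that Cube only ever sees a plain column. The window size and all names below are assumptions:

```python
# Hedged sketch of the hybrid approach: precompute a 30-day rolling
# window in ClickHouse (e.g. as a SQLMesh model) so Cube can expose the
# result as an ordinary measure or dimension. All names and the 30-day
# window are illustrative assumptions.
ROLLING_WINDOW_SQL = """
SELECT
    day,
    artifact_id,
    sum(event_count) OVER (
        PARTITION BY artifact_id
        ORDER BY day
        RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS events_30d
FROM metrics.int_timeseries_events
"""
# Cube then reads events_30d as a plain column, sidestepping its
# rolling-window measures (and the ClickHouse bug noted above).
```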

@ravenac95 ravenac95 changed the title New timeseries metrics models are too large for current dbt configuration Investigation: New timeseries metrics models are too large for current dbt configuration Aug 27, 2024
@ravenac95

I'm going to close this now in favor of breaking down the remaining work. This investigation took a large chunk of the sprint; what remains is to make some pragmatic choices so this works with a hybrid solution of SQLMesh + Cube + ClickHouse.

@ravenac95 ravenac95 self-assigned this Aug 27, 2024