Investigation: New timeseries metrics models are too large for current dbt configuration #1970

Closed
ravenac95 opened this issue Aug 21, 2024 · 4 comments

ravenac95 commented Aug 21, 2024

What is it?

The timeseries metrics models have proven too large for dbt and BigQuery to execute in the way we intended.

Several things were attempted to get these models working properly, which has unfortunately led to a larger investigation.

Things tried:

  • Split the CTEs that compose the int_timeseries_ models into smaller models. This led to the discovery that generating zero-count events for all artifacts creates far too large a model (~55TB for developers * artifacts * days; see the sketch after this list)
  • SQLMesh against ClickHouse
    • Sadly, SQLMesh doesn't have a ClickHouse engine yet. Since ClickHouse speaks the MySQL wire protocol, I attempted to use the MySQL engine instead, but because ClickHouse isn't a complete MySQL implementation, that failed.
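
To make the zero-fill problem concrete, here is a minimal sketch of the pattern that explodes; every table and column name is hypothetical rather than taken from our models:

```python
# Hedged sketch of the dense zero-fill pattern (hypothetical names).
# The CROSS JOINs materialize one row per developer * artifact * day,
# even for combinations that never had an event -- which is how an
# intermediate model balloons to an estimate like 55TB.
DENSE_ZERO_FILL_SQL = """
SELECT
    d.day,
    dev.developer_id,
    art.artifact_id,
    COALESCE(e.event_count, 0) AS event_count
FROM calendar_days AS d
CROSS JOIN developers AS dev
CROSS JOIN artifacts AS art
LEFT JOIN daily_events AS e
    ON e.day = d.day
    AND e.developer_id = dev.developer_id
    AND e.artifact_id = art.artifact_id
"""
# Keeping the model sparse (only rows where events exist) and
# zero-filling at query time avoids materializing the empty cells.
```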

The remaining methods to try, in order, are:

  • Use SQLMesh against BigQuery
  • Use Dagster against ClickHouse with our own SQL execution (see the sketch after this list)
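
For the second option, here is a rough sketch of what running our own SQL against ClickHouse from Dagster could look like. The host, credentials, table names, and model SQL are all illustrative assumptions, and clickhouse-connect is just one possible client:

```python
# Hedged sketch: a Dagster asset that executes SQL directly on ClickHouse.
# Connection details and the model SQL are illustrative assumptions.
import dagster as dg
import clickhouse_connect  # pip install clickhouse-connect

TIMESERIES_MODEL_SQL = """
CREATE OR REPLACE TABLE metrics.int_timeseries_events
ENGINE = MergeTree ORDER BY (day, artifact_id) AS
SELECT
    toDate(time) AS day,
    artifact_id,
    count() AS event_count
FROM events
GROUP BY day, artifact_id
"""

@dg.asset
def int_timeseries_events() -> None:
    """Materialize a timeseries model by running SQL on ClickHouse."""
    client = clickhouse_connect.get_client(
        host="clickhouse.internal.example",  # hypothetical host
        username="default",
        password="",
    )
    client.command(TIMESERIES_MODEL_SQL)
```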

Other things that have been considered and not tried:

  • Spark
    • For the queries we'd like to run, a Spark cluster could make them feasible against our BigQuery dataset
  • Trino
    • Similar to Spark. This would give us the same lever to run those queries.
  • Apache Iceberg
  • Delta Lake

The main reason we didn't pursue most of these is that we want users to be able to explore the dimensions of our metrics. These options could still be useful in that world.

@ravenac95

Related to #1981 and #1982


ravenac95 commented Aug 26, 2024

SQLMesh was attempted, but the current blocker is that it doesn't support running on ClickHouse. Additionally, if we are to run this on BigQuery, we still need to finish writing a dagster-sqlmesh adapter. I have one I've been working on personally that I figured I'd contribute if we wanted it, but it's still missing a few necessary features.
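
The core of such an adapter can be approximated by driving SQLMesh's Python API from a Dagster asset. This is only a minimal sketch with a hypothetical project path, not the adapter mentioned above:

```python
# Minimal sketch of bridging Dagster and SQLMesh via SQLMesh's Python API.
# This is NOT the dagster-sqlmesh adapter referenced above; the project
# path is a hypothetical placeholder.
import dagster as dg
from sqlmesh import Context

@dg.asset
def sqlmesh_project_run() -> None:
    """Run all due SQLMesh models from inside Dagster."""
    context = Context(paths="./sqlmesh")  # hypothetical project path
    context.run()  # executes models according to their cadence
```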

@ravenac95

We are currently testing cube.dev. In general it seems nice, but there is a set of problems I'm currently debugging. Namely, some of our classifications for metrics (dimensions in Cube) happen over a rolling window. Creating a dimension like this isn't supported, so I've tried a few workarounds, and those still don't seem to work. The latest issue appears to be that using a rolling-window measure with ClickHouse hits some kind of bug. I'm testing a few more things there, but we may need a hybrid: process the rolling windows with SQLMesh and use Cube to query the preprocessed metrics (a sketch of that hybrid follows below).
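
To make the hybrid concrete, one option is to precompute the rolling window upstream, e.g. in a SQLMesh model running on ClickHouse, so that Cube only ever sees a plain column. The window size and all names below are assumptions:

```python
# Hedged sketch of the hybrid approach: precompute a 30-day rolling
# window in ClickHouse (e.g. as a SQLMesh model) so Cube can expose the
# result as an ordinary measure or dimension. All names and the 30-day
# window are illustrative assumptions.
ROLLING_WINDOW_SQL = """
SELECT
    day,
    artifact_id,
    sum(event_count) OVER (
        PARTITION BY artifact_id
        ORDER BY day
        RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS events_30d
FROM metrics.int_timeseries_events
"""
# Cube then reads events_30d as a plain column, sidestepping its
# rolling-window measures (and the ClickHouse bug noted above).
```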

@ravenac95 ravenac95 changed the title New timeseries metrics models are too large for current dbt configuration Investigation: New timeseries metrics models are too large for current dbt configuration Aug 27, 2024
@ravenac95

I'm going to close this now in favor of breaking down the remaining work. This investigation took a large chunk of the sprint; what remains is to make some pragmatic choices so this works with a hybrid solution of SQLMesh + Cube + ClickHouse.

@ravenac95 ravenac95 self-assigned this Aug 27, 2024