Activity Stream Utilities #31

bcodell · 2023-04-04T16:46:06Z

Description

This package should contain utilities to help users build Activity Stream models.

Requirements

The following features need to be supported:

multiple streams in a single project
incremental functionality for users to only insert new rows into the Activity Stream table
insert rows into the Activity Stream table one Activity at a time
skip the Activity Stream model build, and query Activity tables directly using the dataset macro
conditionally skip the Activity Stream model build based on environment (dev vs prod)
drop and/or re-insert all rows for a single Activity into an Activity Stream (e.g. when a new feature is added)
only insert new rows for a single Activity into an Activity Stream (e.g. when a data quality bug is resolved)
maintain lineage across models in the dbt project

The following functionality is optional but would be beneficial:

Activity Stream models identify upstream Activity dependencies (probably from the Activity Registry config) so that Activity Stream sql files don't need to be modified when new Activities are added

Implementation

Components

I'm proposing the following updates:

Expanding the dbt_project.yml config to take an optional skip_activity_stream parameter - a boolean that defaults to false - that is used to determine if Activity Stream models should be built in the project, or if Datasets should query Activity models directly.

Expanding the dbt_project.yml config to register the name of each activity stream contained in the project, with nested attributes for each stream to specify which standard Activity Schema columns are included in the stream itself, if that stream should be built or skipped (overrides global setting), and aliases for standard Activity Schema column names specific to that stream (overrides globally defined aliases in column_mappings attribute).

A build_activity_stream macro that does the following:

takes a required argument of a list of one or more dbt models that represent the Activities that should be inserted
validates that each of the models passed in the list have specified the Activity Stream model in the appropriate attribute in their config meta tag
takes an optional input argument insert_strategy that defaults to union but can optionally be insert_by_activity
retrieves all Activities for the Activity stream and asserts dependencies
compiles the appropriate sql based on the following:
- if the model is incremental
- if the full-refresh flag is enabled for the run
- If the Activity Stream is configured to write records for all Activities in one statement or one Activity at a time
- if the Activity Stream is configured to be skipped

A macro called union_activities that does the following:

takes a required argument of a list of one or more dbt models that represent the Activities that should be unioned
takes an optional argument that identifies if the run is a full-refresh run - defaults to false
takes an optional argument that identifies if the materialization type is incremental - defaults to false
returns a sql statement that unions all models together, with appropriate filters applied based on the full-refresh and incremental argument values

A macro called drop_activity that does the following:

takes a required argument of a model that represents an Activity to be dropped from an Activity Stream
takes an optional argument that identifies if the generated SQL should be executed - defaults to false - this allows users to call this macro as a run-operation for one-off updates
if SQL should not be executed, returns a sql statement that will remove records from the Activity Stream
if SQL should be executed, returns a result statement specifying the number of rows removed from the Activity Stream

A macro called insert_activity that does the following:

takes a required argument of a model that represents an Activity to be inserted into an Activity Stream
takes an optional argument that identifies if the run is a full-refresh run - defaults to false
takes an optional argument that identifies if the materialization type is incremental - defaults to false
takes an optional argument that identifies if the generated SQL should be executed - defaults to false - this allows users to call this macro as a run-operation for updates
returns a sql statement that unions all models together, with appropriate filters applied based on the full-refresh and incremental argument values

A refactor of the dataset macro to query Activities models directly in a non-production environment (likely using dbt targets)

A refactor of the dataset macro to query Activities models directly if the skip_activity_stream project config parameter is set to true

Deployment Plan

I'm proposing the following multi-step deployment plan for the above in order to break up the work into reasonable chunks and make useful new features available faster:

Add the build_activity_stream and union_activities macros to support building union-based Activity Stream models
Add the stream-specific configs (excluding skip_activity_stream) to support multiple streams
Add drop_activity and insert_activity macros and update the build_activity_stream macro to support building insert-based Activity Stream models
Add support for the skip_activity_stream config and update the dataset macro as specified

Dependencies

#25 - Activity Registry

Open Questions

The optional feature of not having to update the Activity Stream model when new Activities are added is excluded from this implementation. If the Activity Registration happens in the Activity model config, how (if at all) can the dependencies for the Activity Stream be automatically defined when the DAG is built without manipulating the Activity Stream sql file, given the restrictions dbt has for building the graph?
- To note - a valid workaround is to register the Activity in the Activity Stream model's meta tag instead of in the Activity model's meta tag, but if a developer has to touch a file related to the Activity Stream, then it's better to be the model sql, so that other developers can clearly see the dependencies.
The insert_by_activity insertion strategy violates one of dbt's assumptions that a single query is executed to create the table associated with the model. Is violating this a concern?

The text was updated successfully, but these errors were encountered:

bcodell added the enhancement New feature or request label Apr 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Activity Stream Utilities #31

Activity Stream Utilities #31

bcodell commented Apr 4, 2023 •

edited

Loading

Activity Stream Utilities #31

Activity Stream Utilities #31

Comments

bcodell commented Apr 4, 2023 • edited Loading

Description

Requirements

Implementation

Components

Deployment Plan

Dependencies

Open Questions

bcodell commented Apr 4, 2023 •

edited

Loading