Description
This package should contain utilities to help users build Activity Stream models.
Requirements
The following features need to be supported:
- multiple streams in a single project
- incremental builds, so that users insert only new rows into the Activity Stream table
- inserting rows into the Activity Stream table one Activity at a time
- skipping the Activity Stream model build and querying Activity tables directly using the `dataset` macro
- conditionally skipping the Activity Stream model build based on environment (dev vs. prod)
- dropping and/or re-inserting all rows for a single Activity in an Activity Stream (e.g. when a new feature is added)
- inserting only new rows for a single Activity into an Activity Stream (e.g. when a data quality bug is resolved)
- maintaining lineage across models in the dbt project
The following functionality is optional but would be beneficial:
- Activity Stream models identify their upstream Activity dependencies (probably from the Activity Registry config), so that Activity Stream SQL files don't need to be modified when new Activities are added
Implementation
Components
I'm proposing the following updates:
- Expanding the `dbt_project.yml` config to take an optional `skip_activity_stream` parameter (a boolean that defaults to `false`) that determines whether Activity Stream models are built in the project or Datasets query Activity models directly.
- Expanding the `dbt_project.yml` config to register the name of each Activity Stream contained in the project, with nested attributes for each stream that specify:
  - which standard Activity Schema columns are included in the stream
  - whether that stream should be built or skipped (overrides the global setting)
  - aliases for standard Activity Schema column names specific to that stream (overrides the globally defined aliases in the `column_mappings` attribute)
- A `build_activity_stream` macro that does the following:
  - takes a required argument: a list of one or more dbt models that represent the Activities to be inserted
  - validates that each model in the list specifies the Activity Stream model in the appropriate attribute of its config `meta` tag
  - takes an optional argument `insert_strategy` that defaults to `union` but can be set to `insert_by_activity`
  - retrieves all Activities for the Activity Stream and asserts dependencies
  - compiles the appropriate SQL based on the following:
    - whether the model is incremental
    - whether the `full-refresh` flag is enabled for the run
    - whether the Activity Stream is configured to write records for all Activities in one statement or one Activity at a time
    - whether the Activity Stream is configured to be skipped
- A macro called `union_activities` that does the following:
  - takes a required argument: a list of one or more dbt models that represent the Activities to be unioned
  - takes an optional argument that identifies whether the run is a `full-refresh` run (defaults to `false`)
  - takes an optional argument that identifies whether the materialization type is `incremental` (defaults to `false`)
  - returns a SQL statement that unions all models together, with appropriate filters applied based on the `full-refresh` and `incremental` argument values
- A macro called `drop_activity` that does the following:
  - takes a required argument: a model that represents an Activity to be dropped from an Activity Stream
  - takes an optional argument that identifies whether the generated SQL should be executed (defaults to `false`); this allows users to call the macro as a `run-operation` for one-off updates
  - if the SQL should not be executed, returns a SQL statement that will remove the Activity's records from the Activity Stream
  - if the SQL should be executed, returns a result statement specifying the number of rows removed from the Activity Stream
- A macro called `insert_activity` that does the following:
  - takes a required argument: a model that represents an Activity to be inserted into an Activity Stream
  - takes an optional argument that identifies whether the run is a `full-refresh` run (defaults to `false`)
  - takes an optional argument that identifies whether the materialization type is `incremental` (defaults to `false`)
  - takes an optional argument that identifies whether the generated SQL should be executed (defaults to `false`); this allows users to call the macro as a `run-operation` for one-off updates
  - returns a SQL statement that inserts the Activity's records into the Activity Stream, with appropriate filters applied based on the `full-refresh` and `incremental` argument values
- A refactor of the `dataset` macro to query Activity models directly in a non-production environment (likely using dbt `targets`)
- A refactor of the `dataset` macro to query Activity models directly if the `skip_activity_stream` project config parameter is set to `true`
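To make the intended behaviour of `union_activities` easier to discuss, here is a minimal Python sketch of the SQL it might compile. It is not the macro itself, which would be written in Jinja and receive relations from `ref()`. The `stream` argument, the `ts_column` parameter, the `ts` timestamp column, and the "newer than the stream's current maximum" filter rule are all assumptions made for illustration; the proposal only says that "appropriate filters" are applied.

```python
from typing import Sequence


def union_activities(
    activities: Sequence[str],
    stream: str,
    full_refresh: bool = False,
    incremental: bool = False,
    ts_column: str = "ts",  # assumed timestamp column name
) -> str:
    """Return SQL that unions the given Activity relations.

    A filter is added only on an incremental run that is not a full refresh.
    """
    if not activities:
        raise ValueError("at least one Activity model is required")

    apply_filter = incremental and not full_refresh
    selects = []
    for activity in activities:
        select = f"select * from {activity}"
        if apply_filter:
            # Assumed rule: keep rows newer than the latest row already in
            # the stream. This presumes the stream already holds rows.
            select += (
                f"\nwhere {ts_column} > "
                f"(select max({ts_column}) from {stream})"
            )
        selects.append(select)
    return "\nunion all\n".join(selects)


if __name__ == "__main__":
    print(
        union_activities(
            ["signed_up", "placed_order"],  # hypothetical Activity models
            "customer_stream",              # hypothetical stream name
            incremental=True,
        )
    )
```

A single high-water mark for the whole stream is the simplest possible rule; a per-Activity mark may fit the "only insert new rows for a single Activity" requirement better.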
Deployment Plan
I'm proposing the following multi-step deployment plan, to break the work into reasonable chunks and make useful new features available sooner:
1. Add the `build_activity_stream` and `union_activities` macros to support building union-based Activity Stream models
2. Add the stream-specific configs (excluding `skip_activity_stream`) to support multiple streams
3. Add the `drop_activity` and `insert_activity` macros and update the `build_activity_stream` macro to support building insert-based Activity Stream models
4. Add support for the `skip_activity_stream` config and update the `dataset` macro as specified
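The override rules for the stream-specific configs and `skip_activity_stream` can be sketched in Python as follows. This is only an illustration of the precedence described in the components (a stream-level setting wins over the global one). The config is modelled as a plain dict, and the `streams` and `skip_stream` key names, the example stream names, and the example aliases are hypothetical. Resolving aliases column by column, rather than replacing the whole global mapping, is also an assumption.

```python
from typing import Any, Mapping

# Hypothetical shape of the project config; only `skip_activity_stream`
# and `column_mappings` are names taken from the proposal.
PROJECT_CONFIG: Mapping[str, Any] = {
    "skip_activity_stream": False,
    "column_mappings": {"ts": "activity_at"},
    "streams": {
        "customer_stream": {
            "skip_stream": True,
            "column_mappings": {"customer": "customer_id"},
        },
        "order_stream": {},
    },
}


def should_skip(config: Mapping[str, Any], stream: str) -> bool:
    """Stream-level setting overrides the global `skip_activity_stream`."""
    stream_cfg = config.get("streams", {}).get(stream, {})
    if "skip_stream" in stream_cfg:
        return bool(stream_cfg["skip_stream"])
    return bool(config.get("skip_activity_stream", False))


def column_alias(config: Mapping[str, Any], stream: str, column: str) -> str:
    """Resolve an alias: stream mapping, then global mapping, then the name."""
    stream_cfg = config.get("streams", {}).get(stream, {})
    stream_map = stream_cfg.get("column_mappings", {})
    global_map = config.get("column_mappings", {})
    return stream_map.get(column, global_map.get(column, column))


if __name__ == "__main__":
    print(should_skip(PROJECT_CONFIG, "customer_stream"))
    print(column_alias(PROJECT_CONFIG, "customer_stream", "ts"))
```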
Open Questions
- The optional feature of not having to update the Activity Stream model when new Activities are added is excluded from this implementation. If the Activity registration happens in the Activity model config, how (if at all) can the dependencies for the Activity Stream be defined automatically when the DAG is built, without modifying the Activity Stream SQL file, given the restrictions dbt places on building the graph?
  - Note: a valid workaround is to register the Activity in the Activity Stream model's `meta` tag instead of in the Activity model's `meta` tag. However, if a developer has to touch a file related to the Activity Stream anyway, it is better for that file to be the model SQL, so that other developers can clearly see the dependencies.
- The `insert_by_activity` insertion strategy violates dbt's assumption that a single query is executed to create the table associated with a model. Is violating this assumption a concern?
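To make the single-query question concrete, the following Python sketch lists the statements an `insert_by_activity` build might issue, pairing the SQL that `drop_activity` and `insert_activity` would return. The function name, the `activity` column used to identify an Activity's rows, and the assumption that the model name equals the stored Activity name are all hypothetical. The sketch takes no position on whether dbt tolerates the pattern.

```python
from typing import List, Sequence


def insert_by_activity_statements(
    activities: Sequence[str],
    stream: str,
    activity_column: str = "activity",  # assumed identifying column
) -> List[str]:
    """Return one delete and one insert statement per Activity, in order."""
    statements: List[str] = []
    for name in activities:
        # Roughly what drop_activity would return for this Activity.
        statements.append(
            f"delete from {stream} where {activity_column} = '{name}'"
        )
        # Roughly what insert_activity would return for this Activity.
        statements.append(f"insert into {stream} select * from {name}")
    return statements


if __name__ == "__main__":
    for statement in insert_by_activity_statements(
        ["signed_up", "placed_order"],  # hypothetical Activity models
        "customer_stream",              # hypothetical stream name
    ):
        print(statement + ";")
```

With N Activities this yields 2N statements against the stream table, compared with the single statement the `union` strategy compiles.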
Dependencies
#25 - Activity Registry