Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Activity Stream Utilities #31

Open
bcodell opened this issue Apr 4, 2023 · 0 comments
Open

Activity Stream Utilities #31

bcodell opened this issue Apr 4, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@bcodell
Copy link
Collaborator

bcodell commented Apr 4, 2023

Description

This package should contain utilities to help users build Activity Stream models.

Requirements

The following features need to be supported:

  • multiple streams in a single project
  • incremental functionality for users to only insert new rows into the Activity Stream table
  • insert rows into the Activity Stream table one Activity at a time
  • skip the Activity Stream model build, and query Activity tables directly using the dataset macro
  • conditionally skip the Activity Stream model build based on environment (dev vs prod)
  • drop and/or re-insert all rows for a single Activity into an Activity Stream (e.g. when a new feature is added)
  • only insert new rows for a single Activity into an Activity Stream (e.g. when a data quality bug is resolved)
  • maintain lineage across models in the dbt project

The following functionality is optional but would be beneficial:

  • Activity Stream models identify upstream Activity dependencies (probably from the Activity Registry config) so that Activity Stream sql files don't need to be modified when new Activities are added

Implementation

Components

I'm proposing the following updates:

Expanding the dbt_project.yml config to take an optional skip_activity_stream parameter - a boolean that defaults to false - that is used to determine if Activity Stream models should be built in the project, or if Datasets should query Activity models directly.

Expanding the dbt_project.yml config to register the name of each activity stream contained in the project, with nested attributes for each stream to specify which standard Activity Schema columns are included in the stream itself, if that stream should be built or skipped (overrides global setting), and aliases for standard Activity Schema column names specific to that stream (overrides globally defined aliases in column_mappings attribute).

A build_activity_stream macro that does the following:

  • takes a required argument of a list of one or more dbt models that represent the Activities that should be inserted
  • validates that each of the models passed in the list have specified the Activity Stream model in the appropriate attribute in their config meta tag
  • takes an optional input argument insert_strategy that defaults to union but can optionally be insert_by_activity
  • retrieves all Activities for the Activity stream and asserts dependencies
  • compiles the appropriate sql based on the following:
    • if the model is incremental
    • if the full-refresh flag is enabled for the run
    • If the Activity Stream is configured to write records for all Activities in one statement or one Activity at a time
    • if the Activity Stream is configured to be skipped

A macro called union_activities that does the following:

  • takes a required argument of a list of one or more dbt models that represent the Activities that should be unioned
  • takes an optional argument that identifies if the run is a full-refresh run - defaults to false
  • takes an optional argument that identifies if the materialization type is incremental - defaults to false
  • returns a sql statement that unions all models together, with appropriate filters applied based on the full-refresh and incremental argument values

A macro called drop_activity that does the following:

  • takes a required argument of a model that represents an Activity to be dropped from an Activity Stream
  • takes an optional argument that identifies if the generated SQL should be executed - defaults to false - this allows users to call this macro as a run-operation for one-off updates
  • if SQL should not be executed, returns a sql statement that will remove records from the Activity Stream
  • if SQL should be executed, returns a result statement specifying the number of rows removed from the Activity Stream

A macro called insert_activity that does the following:

  • takes a required argument of a model that represents an Activity to be inserted into an Activity Stream
  • takes an optional argument that identifies if the run is a full-refresh run - defaults to false
  • takes an optional argument that identifies if the materialization type is incremental - defaults to false
  • takes an optional argument that identifies if the generated SQL should be executed - defaults to false - this allows users to call this macro as a run-operation for updates
  • returns a sql statement that unions all models together, with appropriate filters applied based on the full-refresh and incremental argument values

A refactor of the dataset macro to query Activities models directly in a non-production environment (likely using dbt targets)

A refactor of the dataset macro to query Activities models directly if the skip_activity_stream project config parameter is set to true

Deployment Plan

I'm proposing the following multi-step deployment plan for the above in order to break up the work into reasonable chunks and make useful new features available faster:

  • Add the build_activity_stream and union_activities macros to support building union-based Activity Stream models
  • Add the stream-specific configs (excluding skip_activity_stream) to support multiple streams
  • Add drop_activity and insert_activity macros and update the build_activity_stream macro to support building insert-based Activity Stream models
  • Add support for the skip_activity_stream config and update the dataset macro as specified

Dependencies

#25 - Activity Registry

Open Questions

  • The optional feature of not having to update the Activity Stream model when new Activities are added is excluded from this implementation. If the Activity Registration happens in the Activity model config, how (if at all) can the dependencies for the Activity Stream be automatically defined when the DAG is built without manipulating the Activity Stream sql file, given the restrictions dbt has for building the graph?
    • To note - a valid workaround is to register the Activity in the Activity Stream model's meta tag instead of in the Activity model's meta tag, but if a developer has to touch a file related to the Activity Stream, then it's better to be the model sql, so that other developers can clearly see the dependencies.
  • The insert_by_activity insertion strategy violates one of dbt's assumptions that a single query is executed to create the table associated with the model. Is violating this a concern?
@bcodell bcodell added the enhancement New feature or request label Apr 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant