Commit 9cd11d6

Update user_pseudo_id to client_key (#181)
* Replace user_pseudo_id with client_key
* Remove stream_id from fct_ga4__user_ids
* Add descriptions for client_key throughout package
* restore stream_id to several models
* Move client_key creation to stag_ga4__events

---------

Co-authored-by: Adam Ribaudo <[email protected]>
1 parent 236c970 commit 9cd11d6

26 files changed: +124 -110 lines changed

README.md

+2-2
@@ -20,11 +20,11 @@ Features include:
 | stg_ga4__event_items | Contains item data associated with e-commerce events (Purchase, add to cart, etc) |
 | stg_ga4__event_to_query_string_params | Mapping between each event and any query parameters & values that were contained in the event's `page_location` field |
 | stg_ga4__user_properties | Finds the most recent occurance of specified user_properties for each user |
-| stg_ga4__derived_user_properties | Finds the most recent occurance of specific event_params value and assigns them to a user_pseudo_id. Derived user properties are specified as variables (see documentation below) |
+| stg_ga4__derived_user_properties | Finds the most recent occurance of specific event_params value and assigns them to a client_key. Derived user properties are specified as variables (see documentation below) |
 | stg_ga4__derived_session_properties | Finds the most recent occurance of specific event_params or user_properties value and assigns them to a session's session_key. Derived session properties are specified as variables (see documentation below) |
 | stg_ga4__session_conversions_daily | Produces daily counts of conversions per session. The list of conversion events to include is configurable (see documentation below) |
 | stg_ga4__sessions_traffic_sources | Finds the first source, medium, campaign, content, paid search term (from UTM tracking), and default channel grouping for each session. |
-| dim_ga4__user_pseudo_ids | Dimension table for user devices as indicated by user_pseudo_ids. Contains attributes such as first and last page viewed.|
+| dim_ga4__client_keys | Dimension table for user devices as indicated by client_keys. Contains attributes such as first and last page viewed.|
 | dim_ga4__sessions | Dimension table for sessions which contains useful attributes such as geography, device information, and acquisition data. Can be expensive to run on large installs (see `dim_ga4__sessions_daily`) |
 | dim_ga4__sessions_daily | Query-optimized session dimension table that is incremental and partitioned on date. Assumes that each partition is contained within a single day |
 | fct_ga4__pages | Fact table for pages which aggregates common page metrics by page_location, date, and hour. |

models/marts/core/core.yml

+10-8
@@ -7,10 +7,11 @@ models:
 - name: session_key
 tests:
 - unique
-- name: dim_ga4__user_pseudo_ids
-description: Dimension table for user devices (user_pseudo_id) which includes data from the first and last event produced. Unique on user_pseudo_id
+- name: dim_ga4__client_keys
+description: Dimension table for user devices (client_key) which includes data from the first and last event produced. Unique on client_key
 columns:
-- name: user_pseudo_id
+- name: client_key
+description: Hashed combination of user_pseudo_id and stream_id
 tests:
 - unique
 - name: fct_ga4__sessions
@@ -26,15 +27,16 @@ models:
 description: The total engagement time for that page_location.
 - name: avg_engagement_time_denominator
 description: Use avg_engagement_time_denominator to calculate the average engagement time, which is derived by dividing the sum of total engagement time by the product of the sum of the denominator and 1000 to get the average engagement time in seconds (average_engagement_time = sum(total_engagement_time_msec)/(sum(avg_engagement_time_denominator) *1000 )). The denominator excludes page_view events where no engagement time is recorded for the page_location within a session. However, it includes subsequent page_view events to a page_location that has previously recorded a page_view event in the same session, even if the subsequent event has no recorded engagement time.
-- name: fct_ga4__user_pseudo_ids
-description: Fact table with aggregate metrics at the level of the user's device (as indicated by the user_pseudo_id). Metrics are aggregated from fct_ga4__sessions.
+- name: fct_ga4__client_keys
+description: Fact table with aggregate metrics at the level of the user's device (as indicated by the client_key). Metrics are aggregated from fct_ga4__sessions.
 columns:
-- name: user_pseudo_id
+- name: client_key
+description: Hashed combination of user_pseudo_id and stream_id
 tests:
 - unique
 - name: fct_ga4__user_ids
-description: Fact table with aggregate metrics at the level of the user_id when one is present, otherwise at the device level (as indicated by the user_pseudo_id). Metrics are aggregated from fct_ga4__user_pseudo_ids.
+description: Fact table with aggregate metrics at the level of the user_id when one is present, otherwise at the device level (as indicated by the client_key). Metrics are aggregated from fct_ga4__client_keys.
 columns:
-- name: user_id_or_user_pseudo_id
+- name: user_id_or_client_key
 tests:
 - unique
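
The `avg_engagement_time_denominator` description in core.yml above states the formula in prose. As a worked illustration only, here is a minimal Python sketch of that arithmetic. The function name and the sample rows are hypothetical; only the two column names come from the description.

```python
from typing import Iterable, Mapping, Optional


def average_engagement_time_seconds(rows: Iterable[Mapping[str, int]]) -> Optional[float]:
    """Apply sum(total_engagement_time_msec) / (sum(avg_engagement_time_denominator) * 1000)."""
    rows = list(rows)
    total_msec = sum(r["total_engagement_time_msec"] for r in rows)
    denominator = sum(r["avg_engagement_time_denominator"] for r in rows)
    if denominator == 0:
        # No page views with recorded engagement: the average is undefined.
        return None
    return total_msec / (denominator * 1000)


# Hypothetical rows, e.g. two hourly records for the same page_location.
sample_rows = [
    {"total_engagement_time_msec": 45000, "avg_engagement_time_denominator": 3},
    {"total_engagement_time_msec": 15000, "avg_engagement_time_denominator": 1},
]
print(average_engagement_time_seconds(sample_rows))  # 60000 / (4 * 1000) = 15.0
```

Summing both columns before dividing matters here: averaging per-row averages would weight each row equally regardless of how many page views it represents.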

models/marts/core/dim_ga4__user_pseudo_ids.sql → models/marts/core/dim_ga4__client_keys.sql

+7-7
@@ -1,9 +1,9 @@
--- Mart for dimensions related to user devices (based on user_pseudo_id)
+-- Mart for dimensions related to user devices (based on client_key)

 with include_first_last_events as (
 select
 *
-from {{ref('stg_ga4__user_pseudo_id_first_last_events')}}
+from {{ref('stg_ga4__client_key_first_last_events')}}
 ),
 include_first_last_page_views as (
 select
@@ -15,19 +15,19 @@ include_first_last_page_views as (
 first_last_page_views.last_page_hostname,
 first_last_page_views.last_page_referrer
 from include_first_last_events
-left join {{ref('stg_ga4__user_pseudo_id_first_last_pageviews')}} as first_last_page_views using (user_pseudo_id)
+left join {{ref('stg_ga4__client_key_first_last_pageviews')}} as first_last_page_views using (client_key)
 ),
 include_user_properties as (


 select * from include_first_last_page_views
 {% if var('derived_user_properties', false) %}
--- If derived user properties have been assigned as variables, join them on the user_pseudo_id
-left join {{ref('stg_ga4__derived_user_properties')}} using (user_pseudo_id)
+-- If derived user properties have been assigned as variables, join them on the client_key
+left join {{ref('stg_ga4__derived_user_properties')}} using (client_key)
 {% endif %}
 {% if var('user_properties', false) %}
--- If user properties have been assigned as variables, join them on the user_pseudo_id
-left join {{ref('stg_ga4__user_properties')}} using (user_pseudo_id)
+-- If user properties have been assigned as variables, join them on the client_key
+left join {{ref('stg_ga4__user_properties')}} using (client_key)
 {% endif %}

 )

models/marts/core/dim_ga4__sessions_daily.sql

+1-1
@@ -34,7 +34,7 @@
 with event_dimensions as
 (
 select
-user_pseudo_id,
+client_key,
 session_key,
 session_partition_key,
 event_date_dt as session_partition_date,

models/marts/core/fct_ga4__user_pseudo_ids.sql → models/marts/core/fct_ga4__client_keys.sql

+2-2
@@ -1,5 +1,5 @@
 select
-user_pseudo_id,
+client_key,
 stream_id,
 min(session_start_timestamp) as first_seen_timestamp,
 min(session_start_date) as first_seen_start_date,
@@ -15,5 +15,5 @@ select
 {% endfor %}
 {% endif %}
 from {{ref('fct_ga4__sessions')}}
-group by 1,2
+group by 1, 2

models/marts/core/fct_ga4__pages.sql

+2-2
@@ -21,8 +21,8 @@ with page_view as (
 page_title, -- would like to move this to dim_ga4__pages but need to think how to handle page_title changing over time
 page_engagement_key,
 count(event_name) as page_views,
-count(distinct user_pseudo_id ) as distinct_user_pseudo_ids,
-sum( if(session_number = 1,1,0)) as new_user_pseudo_ids,
+count(distinct client_key ) as distinct_client_keys,
+sum( if(session_number = 1,1,0)) as new_client_keys,
 sum(entrances) as entrances,
 from {{ref('stg_ga4__event_page_view')}}
 group by 1,2,3,4,5,6,7

models/marts/core/fct_ga4__sessions.sql

+1-1
@@ -1,7 +1,7 @@
 -- Stay mindful of performance/cost when leavin this model enabled. Making this model incremental on date is not possible because there's no way to create a single record per session AND partition on date.

 select
-user_pseudo_id,
+client_key,
 session_key,
 stream_id,
 min(session_partition_min_timestamp) as session_start_timestamp,

models/marts/core/fct_ga4__sessions_daily.sql

+2-2
@@ -35,7 +35,7 @@ with session_metrics as (
 select
 session_key,
 session_partition_key,
-user_pseudo_id,
+client_key,
 stream_id,
 min(event_date_dt) as session_partition_date, -- Date of the session partition, does not represent the true session start date which, in GA4, can span multiple days
 min(event_timestamp) as session_partition_min_timestamp,
@@ -72,7 +72,7 @@ with session_metrics as (
 ),
 join_metrics_and_conversions as (
 select
-session_metrics.user_pseudo_id,
+session_metrics.client_key,
 session_metrics.stream_id,
 session_metrics.session_partition_min_timestamp,
 session_metrics.session_partition_count_page_views,

models/marts/core/fct_ga4__user_ids.sql

+8-8
@@ -1,18 +1,18 @@
 with user_id_mapped as (
 select
-user_pseudo_ids.*,
--- Use a user_id if it exists, otherwise fall back to the user_pseudo_id
-coalesce(user_id_mapping.last_seen_user_id, user_pseudo_ids.user_pseudo_id) as user_id_or_user_pseudo_id,
--- Indicate whether the user_id_or_user_pseudo_id value is a user_id
+client_keys.*,
+-- Use a user_id if it exists, otherwise fall back to the client_key
+coalesce(user_id_mapping.last_seen_user_id, client_keys.client_key) as user_id_or_client_key,
+-- Indicate whether the user_id_or_client_key value is a user_id
 CASE
 WHEN user_id_mapping.last_seen_user_id is null THEN 0 ELSE 1
 END as is_user_id
-from {{ref('fct_ga4__user_pseudo_ids')}} user_pseudo_ids
-left join {{ref('stg_ga4__user_id_mapping')}} user_id_mapping using (user_pseudo_id)
+from {{ref('fct_ga4__client_keys')}} client_keys
+left join {{ref('stg_ga4__user_id_mapping')}} user_id_mapping using (client_key)
 )

 select
-user_id_or_user_pseudo_id,
+user_id_or_client_key,
 stream_id,
 max(is_user_id) as is_user_id,
 min(first_seen_timestamp) as first_seen_timestamp,
@@ -29,5 +29,5 @@ select
 {% endfor %}
 {% endif %}
 from user_id_mapped
-group by 1,2
+group by 1, 2
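
The fallback in fct_ga4__user_ids.sql (use the `user_id` when a mapping exists, otherwise the `client_key`, then aggregate per identifier and stream) can be sketched in plain Python. This is an illustrative approximation, not the package's implementation: the function name and sample data are hypothetical, and only the `is_user_id` and `first_seen_timestamp` aggregates visible in the hunk are modelled.

```python
from typing import Dict, List, Optional, Tuple


def aggregate_by_user(
    client_rows: List[dict],
    user_id_mapping: Dict[str, Optional[str]],
) -> List[dict]:
    """Group client-level rows by (user_id or client_key, stream_id)."""
    grouped: Dict[Tuple[str, str], dict] = {}
    for row in client_rows:
        # Left join: client keys without a mapping get None (SQL NULL).
        user_id = user_id_mapping.get(row["client_key"])
        # coalesce(last_seen_user_id, client_key)
        identifier = user_id if user_id is not None else row["client_key"]
        is_user_id = 0 if user_id is None else 1
        key = (identifier, row["stream_id"])
        current = grouped.get(key)
        if current is None:
            grouped[key] = {
                "user_id_or_client_key": identifier,
                "stream_id": row["stream_id"],
                "is_user_id": is_user_id,
                "first_seen_timestamp": row["first_seen_timestamp"],
            }
        else:
            # max(is_user_id) and min(first_seen_timestamp), as in the final select.
            current["is_user_id"] = max(current["is_user_id"], is_user_id)
            current["first_seen_timestamp"] = min(
                current["first_seen_timestamp"], row["first_seen_timestamp"]
            )
    return list(grouped.values())


# Hypothetical data: two devices belong to the same logged-in user.
clients = [
    {"client_key": "ck_a", "stream_id": "1", "first_seen_timestamp": 200},
    {"client_key": "ck_b", "stream_id": "1", "first_seen_timestamp": 100},
    {"client_key": "ck_c", "stream_id": "1", "first_seen_timestamp": 300},
]
mapping = {"ck_a": "user_1", "ck_b": "user_1"}
users = aggregate_by_user(clients, mapping)
```

With this data, `ck_a` and `ck_b` collapse into one `user_1` row, while `ck_c` keeps its own row with `is_user_id = 0`.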

models/staging/stg_ga4__user_pseudo_id_first_last_events.sql → models/staging/stg_ga4__client_key_first_last_events.sql

+10-10
@@ -4,24 +4,24 @@

 with first_last_event as (
 select
-user_pseudo_id,
-FIRST_VALUE(event_key) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_event,
-LAST_VALUE(event_key) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_event,
+client_key,
+FIRST_VALUE(event_key) OVER (PARTITION BY client_key ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_event,
+LAST_VALUE(event_key) OVER (PARTITION BY client_key ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_event,
 stream_id
 from {{ref('stg_ga4__events')}}
-where user_pseudo_id is not null --remove users with privacy settings enabled
+where client_key is not null --remove users with privacy settings enabled
 ),
-events_by_user_pseudo_id as (
+events_by_client_key as (
 select distinct
-user_pseudo_id,
+client_key,
 first_event,
 last_event,
 stream_id
 from first_last_event
 ),
 events_joined as (
 select
-events_by_user_pseudo_id.*,
+events_by_client_key.*,
 events_first.geo_continent as first_geo_continent,
 events_first.geo_country as first_geo_country,
 events_first.geo_region as first_geo_region,
@@ -74,11 +74,11 @@ events_joined as (
 events_last.user_campaign as last_user_campaign,
 events_last.user_medium as last_user_medium,
 events_last.user_source as last_user_source,
-from events_by_user_pseudo_id
+from events_by_client_key
 left join {{ref('stg_ga4__events')}} events_first
-on events_by_user_pseudo_id.first_event = events_first.event_key
+on events_by_client_key.first_event = events_first.event_key
 left join {{ref('stg_ga4__events')}} events_last
-on events_by_user_pseudo_id.last_event = events_last.event_key
+on events_by_client_key.last_event = events_last.event_key
 )

 select * from events_joined
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,10 @@
 version: 2

 models:
-- name: stg_ga4__user_pseudo_id_first_last_events
+- name: stg_ga4__client_key_first_last_events
 description: Captures the first and last event completed by the user's device in order to pull in the first and last geo, device, and traffic source seen from the user
 columns:
-- name: user_pseudo_id
+- name: client_key
+description: Hashed combination of user_pseudo_id and stream_id
 tests:
 - unique

models/staging/stg_ga4__user_pseudo_id_first_last_pageviews.sql → models/staging/stg_ga4__client_key_first_last_pageviews.sql

+10-10
@@ -4,34 +4,34 @@

 with page_views_first_last as (
 select
-user_pseudo_id,
-FIRST_VALUE(event_key) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_page_view_event_key,
-LAST_VALUE(event_key) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_page_view_event_key
+client_key,
+FIRST_VALUE(event_key) OVER (PARTITION BY client_key ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_page_view_event_key,
+LAST_VALUE(event_key) OVER (PARTITION BY client_key ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_page_view_event_key
 from {{ref('stg_ga4__event_page_view')}}
-where user_pseudo_id is not null -- Remove users with privacy settings enabled
+where client_key is not null -- Remove users with privacy settings enabled
 ),
-page_views_by_user_pseudo_id as (
+page_views_by_client_key as (
 select distinct
-user_pseudo_id,
+client_key,
 first_page_view_event_key,
 last_page_view_event_key
 from page_views_first_last
 ),

 page_views_joined as (
 select
-page_views_by_user_pseudo_id.*,
+page_views_by_client_key.*,
 first_page_view.page_location as first_page_location,
 first_page_view.page_hostname as first_page_hostname,
 first_page_view.page_referrer as first_page_referrer,
 last_page_view.page_location as last_page_location,
 last_page_view.page_hostname as last_page_hostname,
 last_page_view.page_referrer as last_page_referrer
-from page_views_by_user_pseudo_id
+from page_views_by_client_key
 left join {{ref('stg_ga4__event_page_view')}} first_page_view
-on page_views_by_user_pseudo_id.first_page_view_event_key = first_page_view.event_key
+on page_views_by_client_key.first_page_view_event_key = first_page_view.event_key
 left join {{ref('stg_ga4__event_page_view')}} last_page_view
-on page_views_by_user_pseudo_id.last_page_view_event_key = last_page_view.event_key
+on page_views_by_client_key.last_page_view_event_key = last_page_view.event_key
 )

 select * from page_views_joined
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
+version: 2
+
+models:
+- name: stg_ga4__client_key_first_last_pageviews
+description: Captures data related to the first and last page view that each user device has completed (by client_key).
+columns:
+- name: client_key
+description: Hashed combination of user_pseudo_id and stream_id
+tests:
+- unique

models/staging/stg_ga4__derived_user_properties.sql

+5-5
@@ -3,15 +3,15 @@
 materialized = "table"
 ) }}

--- Remove null user_pseudo_id (users with privacy enabled)
+-- Remove null client_key (users with privacy enabled)
 with events_from_valid_users as (
 select * from {{ref('stg_ga4__events')}}
-where user_pseudo_id is not null
+where client_key is not null
 ),
 unnest_user_properties as
 (
 select
-user_pseudo_id,
+client_key,
 event_timestamp
 {% for up in var('derived_user_properties', []) %}
 ,{{ ga4.unnest_key('event_params', up.event_parameter , up.value_type ) }}
@@ -20,9 +20,9 @@ unnest_user_properties as
 )

 SELECT DISTINCT
-user_pseudo_id
+client_key
 {% for up in var('derived_user_properties', []) %}
 , LAST_VALUE({{ up.event_parameter }} IGNORE NULLS) OVER (user_window) AS {{ up.user_property_name }}
 {% endfor %}
 FROM unnest_user_properties
-WINDOW user_window AS (PARTITION BY client_key ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+WINDOW user_window AS (PARTITION BY client_key ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
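
To illustrate what `LAST_VALUE(... IGNORE NULLS)` over the full `client_key` partition yields, here is a rough Python equivalent: for each client, keep the most recent non-null value of a parameter. The function name, the `plan` parameter, and the sample events are hypothetical; tie-breaking on equal timestamps is not defined by the SQL and is arbitrary here too.

```python
from typing import Dict, List, Optional


def last_non_null_by_client(events: List[dict], parameter: str) -> Dict[str, Optional[str]]:
    """Return the latest non-null value of `parameter` for each client_key."""
    latest: Dict[str, Optional[str]] = {}
    for event in sorted(events, key=lambda e: e["event_timestamp"]):
        client = event["client_key"]
        if client is None:
            # Mirrors `where client_key is not null` (users with privacy enabled).
            continue
        # Every valid client gets a row, even if the parameter is never set.
        latest.setdefault(client, None)
        value = event.get(parameter)
        if value is not None:
            # Later events overwrite earlier ones; nulls never overwrite.
            latest[client] = value
    return latest


# Hypothetical events with a hypothetical "plan" event parameter.
sample_events = [
    {"client_key": "ck_a", "event_timestamp": 1, "plan": "free"},
    {"client_key": "ck_a", "event_timestamp": 2, "plan": "pro"},
    {"client_key": "ck_a", "event_timestamp": 3, "plan": None},
    {"client_key": "ck_b", "event_timestamp": 1, "plan": None},
    {"client_key": None, "event_timestamp": 1, "plan": "pro"},
]
derived = last_non_null_by_client(sample_events, "plan")
```

Here `ck_a` resolves to `"pro"` because the later null is ignored, and `ck_b` is present with no value.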

models/staging/stg_ga4__derived_user_properties.yml

+3-2
@@ -2,8 +2,9 @@ version: 2

 models:
 - name: stg_ga4__derived_user_properties
-description: Optional model that will pull out the most recent instance of a particular event parameter for each device (user_pseudo_id). Later used in the dim_ga4__user_pseudo_id dimension table.
+description: Optional model that will pull out the most recent instance of a particular event parameter for each device (client_key). Later used in the dim_ga4__client_key dimension table.
 columns:
-- name: user_pseudo_id
+- name: client_key
+description: Hashed combination of user_pseudo_id and stream_id
 tests:
 - unique

models/staging/stg_ga4__events.sql

+9-3
@@ -7,12 +7,18 @@ with base_events as (
 select * from {{ref('base_ga4__events_intraday')}}
 {% endif %}
 ),
--- Add unique key for sessions. session_key will be null if user_pseudo_id is null due to consent being denied. ga_session_id may be null during audience trigger events.
+-- Add key that captures a combination of stream_id and user_pseudo_id to uniquely identify a 'client' (aka. a device) within a single stream
+include_client_key as (
+select *
+, to_base64(md5(concat(user_pseudo_id, stream_id))) as client_key
+from base_events
+),
+-- Add key for sessions. session_key will be null if client_key is null due to consent being denied. ga_session_id may be null during audience trigger events.
 include_session_key as (
 select
 *,
-to_base64(md5(CONCAT(stream_id, user_pseudo_id, CAST(session_id as STRING)))) as session_key -- Surrogate key to determine unique session across streams and users. Sessions do NOT reset after midnight in GA4
-from base_events
+to_base64(md5(CONCAT(client_key, CAST(session_id as STRING)))) as session_key
+from include_client_key
 ),
 -- Add a key that combines session key and date. Useful when working with session table within date-partitioned tables
 include_session_partition_key as (

models/staging/stg_ga4__events.yml

+2
@@ -4,6 +4,8 @@ models:
 - name: stg_ga4__events
 description: Staging model that generates keys for users, sessions, and events. Also parses URLs to remove query string params as defined in project config.
 columns:
+- name: client_key
+description: Surrogate key created from stream_id and user_pseudo_id. Provides a way to uniquely identify a user's device within a stream. Important when using the package to combine data across properties and streams.
 - name: event_key
 tests:
 - unique
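
For readers who want to reproduce the new keys outside the warehouse, the following Python sketch mirrors `to_base64(md5(concat(user_pseudo_id, stream_id)))` and the revised `session_key` expression from stg_ga4__events.sql. It assumes BigQuery semantics (`CONCAT` returns NULL if any argument is NULL, `MD5` returns 16 raw bytes) and UTF-8 string inputs. The function names and sample IDs are hypothetical and not part of the package.

```python
import base64
import hashlib
from typing import Optional


def _b64_md5(text: str) -> str:
    """Base64-encode the raw 16-byte MD5 digest of a UTF-8 string."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def client_key(user_pseudo_id: Optional[str], stream_id: Optional[str]) -> Optional[str]:
    """Mirror to_base64(md5(concat(user_pseudo_id, stream_id)))."""
    if user_pseudo_id is None or stream_id is None:
        # Assumed NULL propagation: a missing input yields a NULL key.
        return None
    return _b64_md5(user_pseudo_id + stream_id)


def session_key(key: Optional[str], session_id: Optional[int]) -> Optional[str]:
    """Mirror to_base64(md5(concat(client_key, cast(session_id as string))))."""
    if key is None or session_id is None:
        return None
    return _b64_md5(key + str(session_id))


# Hypothetical IDs: the same device ID seen in two different streams.
web_key = client_key("1234567.7654321", "1000001")
app_key = client_key("1234567.7654321", "1000002")
web_session = session_key(web_key, 1700000000)
```

Because `stream_id` is part of the hash, the same `user_pseudo_id` yields different `client_key` values in different streams, which is what allows data from several properties and streams to be combined without collisions. A null `user_pseudo_id` (consent denied) propagates to a null `client_key` and therefore a null `session_key`.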
