Skip to content

User Behavior Analysis in Practice 8:The Changing Dimension Table

esProcSPL edited this page Oct 21, 2024 · 1 revision

Target task

We have a user events table T. Below is its structure and part of its data:

Time UserID ProductID Quantity
2022/6/1 10:20 1072755 1 7
2022/6/1 12:12 1078030 2 8
2022/6/1 12:36 1005093 3 3
2022/6/1 13:21 1048655 4 9
2022/6/1 14:46 1037824 5 5
2022/6/1 15:19 1049626 6 4
2022/6/1 16:00 1009296 7 6
2022/6/1 16:39 1070713 8 7
2022/6/1 17:40 1090884 9 4

Fields in table T:

Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
ProductID Integer Product ID
Quantity Numeric Quantity

Dimension table Product:

ProductID ProductName Unit Origin ProductType
1 Apple Pound Shandong Fruits
2 Tissue Packs Guangdong Home&Personalcare
3 Beef Pound Qingdao Meat
4 Wine Bottles Shanxi Beverage
5 Pork Pound Xizang Meat
6 Bread Packs Beijing Bakery
7 Juice Bottles Xinjiang Beverage

Fields of dimension table Product:

Field name Data type Description
ProductID String Product ID
ProductName String Product name
Unit String Unit
ProductTypeID String Product type
Origin Numeric Product origin

Computing task:

Find the sales quantity of each type of product of each place of origin within the specified time period.

One thing we should take into consideration is that the place of origin is not fixed, and get sales quantity of a type of product according to the place of origin in the current transaction.

Techniques involved:

Employing time key on the dimension table. You can find more about SPL time key mechanism in Why Are There So Many Snapshot Tables in BI Systems?.

  1. Generate a dimension table with time key according to the dimension table in the production system at regular time interval.

Add a time key field to Product.btx. Here’s the dimension table’s new structure:

Field name Data type Description
eTime Datetime Effective time for the current record
ProductID String Product ID
ProductName String Product name
Unit String Sales unit
ProductTypeID String Product type
Origin Numeric Place of origin

Perform daily data dump on Product.btx with a time key according to the updated information in the production system’s Product table every day.

  1. Summarize data using the dimension table with a time key

Define a new composite primary key using eTime and ProductID for the dimension table while specifying eTime as the time key. Table *T *is associated with Product through Time and ProductID fields. Programmers do not need to take care of how to achieve the time key because SPL is designed to automatically handle the related computing logic. They just use the dimension table as a regular one.

Sample code

  1. Dump dimension table Product.btx.

For original data: Directly add the time key field.

A
1 =connect("demo").query@x("select * from Product").derive(now():eTime)
2 = file("Product.btx").export@b(A1)

When data is updated: Import the new dimension table, get the record with the latest time under each type of product from the dumped btx file with the time key, compare them with the new table, and append a new record to the btx file for the updated data and record a new effective time.

A
1 =connect("demo").cursor@x("select * from Product")
2 = T("Product.btx").keys@it(ProductID,eTime)
3 =A1.select(cmp(~.array(),A2.find(A1.ProductID).array().to(2,))!=0)
4 = A3.fetch().derive(now():eTime)
5 =file("Product.btx").export@ab(A4)

A1 Import the new dimension table and set index on the primary key.

A2 Import the old dimension table Product from the btx file, get record having the latest time for each product, and set index on the primary key.

@t option means the last key field is the time key.

A3 Get records under each primary key value from the new dimension table where the latest (largest) eTime is different from that in the records under same key value in the old dimension table. The find function will automatically select the record having the latest time corresponding to same key value.

A4 Add effective time field to A3’s records.

A5 Append and export A4’s records to the bin file.

  1. Join the fact table and the dimension table and perform aggregation

Suppose T.ctx is already generated as the above explains and sorted by Time:

A
1 >Product=T("Product.btx").keys@ti(ProductID,eTime)
2 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3 =file("T.ctx").open().cursor(ProductID,Quantity,Time;Time>=start && Time<=end)
4 =A3.switch(ProductID:Time,Product)
5 =A4.groups(ProductID.ProductType,ProductID.Origin; sum(Quantity):Quantity)

A1 Import dimension table Product, and set index on the primary key while specifying the time key.

A4 Join the fact table and the dimension table using the familiar SPL join syntax. As the joining fields contains a time field, the operation will find the record having the latest time (the largest time that is smaller than eTime) before the current time field value.

Execution result:

ProductType Origin Quantity
Fruits Shandong 1241628
Fruits Xinjiang 546357
Fruits Hainan 24526
Home&Personalcare Guangdong 7411008
Meat Qingdao 3303230
Meat Neimeng 657546
Meat Xizang 2456235
Bakery Beijing 247673
Beverage Xinjiang 3526574
Beverage Shanxi 6090112
Clone this wiki locally