Skip to content

User Behavior Analysis in Practice 12:Using Pseudo Tables

esProcSPL edited this page Oct 23, 2024 · 1 revision

Target task

We have a user events table T. Below is its structure and part of its data:

Time UserID EventType OS Browser ProductID f1 f2 f3 f4 f5
2022/6/1 10:20 1072755 Search Android IE 100001 true false false true false
2022/6/1 12:12 1078030 Browse IOS Safari 100002 false false true true true
2022/6/1 12:36 1005093 Submit Android Chrome 100003 true true true false false
2022/6/1 13:21 1048655 Login Windows Chrome false false true true true
2022/6/1 14:46 1037824 Logout Android Edge false false false true true
2022/6/1 15:19 1049626 AddtoCart Windows Edge 100004 true true false true false
2022/6/1 16:00 1009296 Submit IOS Firefox 100005 false true false false true
2022/6/1 16:39 1070713 Browse IOS Sogou 100006 true true true false false
2022/6/1 17:40 1090884 Search Windows IE 100007 true false true true false

Fields in table T:

Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
EventType String Event type, whose value is Login, Browse, Search, AddtoCart, Submit or Logout
OS String Operating system, whose value is Android, IOS, Windows or Unknown
Browser String Browser, whose value is IE, Safari, Edge, Firefox, Chrome, Sogou or Unknown
ProductID String Product ID, whose value is the ProductID field of dimension table Product
String Other fields that have enumerated values
f1 Boolean Whether it is an offsite event or not; value is true or false
f2 Boolean Whether it is a usual device or not; value is true or false
f3 Boolean Whether it is a usual browser or not; value is true or false
f4 Boolean Whether it is a cell phone or not; value is true or false
f5 Boolean Whether it is the first operation; value is true or false
Boolean Other fields that have Boolean values

In the previous article User Behavior Analysis in Practice 9: Enumerated Dimension and Tag Dimension, we convert the enumerated field and the tag field into corresponding ordinal numbers and bits respectively. Below is the structure and part of data of converted user events composite table T.ctx:

Time UserID EventType OS Browser ProductID b1
2022/6/1 10:20 1072755 3 1 1 100001 36864
2022/6/1 12:12 1078030 2 2 2 100002 14336
2022/6/1 12:36 1005093 5 1 5 100003 57344
2022/6/1 13:21 1048655 1 3 5 14336
2022/6/1 14:46 1037824 6 1 3 6144
2022/6/1 15:19 1049626 4 3 3 100004 53248
2022/6/1 16:00 1009296 5 2 4 100005 18432
2022/6/1 16:39 1070713 2 2 6 100006 57344
2022/6/1 17:40 1090884 3 3 1 100007 45056

Fields in converted table T:

Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
EventType Integer Event type, whose value is an ordinal number of the enumerated sequence
OS Integer Operating system, whose value is an ordinal number of the enumerated sequence
Browser Integer Browser, whose value is an ordinal number of the enumerated sequence
ProductID String Product ID, whose value is the ProductID field of dimension table Product
b1 Integer Integer field that stores binary fields as bits; the first five bits correspond to whether it is offsite, whether it is the usual device, whether it is the usual browser, whether it is cell phone, and whether it is the first operation

Ordinal numbers in EventType field and their description:

1 Login 2 Browse 3 Search 4 AddtoCart 5 Submit 6 Logout

Ordinal numbers in OS field and their description:

1 Android 2 IOS 3 Windows 4 Unknown

Ordinal numbers in Browser field and their description:

1 IE 2 Safari 3 Edge 4 Firefox 5 Chrome 6 Sogou 7 Unknown

Dimension table Product:

ProductID ProductName Unit Price ProductType
100001 Apple Pound 5.5 Fruits
100002 Tissue Packs 16 Home&Personalcare
100003 Beef Pound 35 Meat
100004 Wine Bottles 120 Beverage
100005 Pork Pound 25 Meat
100006 Bread Packs 10 Bakery
100007 Juice Bottles 6 Beverage

Fields in dimension table Product:

Field name Data type Description
ProductID String Product ID
ProductName String Product name
Unit String Sales unit
Price Numeric Unit price
ProductType Integer Product type

Relationship between* T* and* Product*:

Computing task:

Find the number of occurrences of each type of event performed by users who are not newcomers on a local Android or IOS system using Safari, Edge or Chrome under the product type Home & Personal care in each month within a specified time period, and count the distinct users under same conditions.

Techniques involved:

Learn more about SPL pseudo tables in SPL Pseudo Table Data Type Optimization.

We can use the pseudo table to predefine computed columns and foreign keys, and simplify the representations of enumerated dimensions and tag dimensions:

1.Define a computed column Month to calculate the month in the time when the current event occurs.

2.Define pseudo fields EventTypeName, OSName and BrowserName to store correspondence relationships between EventType, OS and Browser’s ordinal numbers and their names.

3.Define the foreign key association on ProductID field between it and the corresponding dimension table.

4.Define field name represented by each bit in b1.

After a pseudo table is defined, we can use it to achieve data dump and summarization. This can significantly reduce the amount of code for summarization.

Sample code

1.Define pseudo table

A
1 =T("Product.btx").keys@i(ProductID)
2 =file("T.ctx").create(#Time,UserID,EventType,OS,Browser,ProductID,……,b1,……)
3 =A2.close()
4 =[{file:"T.ctx",
column:[
{name:"Month",exp:"month@y(Time)"},
{name:"EventType",pseudo:"EventTypeName",enum:["Login","Browse","Search","AddtoCart","Submit","Logout"]},
{name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]},
{name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]},
{name:"b1",bits:["f1","f2","f3","f4","f5",…….]},
{name:"ProductID",dim:A1}]
}]
5 =pseudo(A4)

A1 Import dimension table Product, and set primary key and index for it.

A2 Create composite table structure after enumerated fields are converted into ordinal numbers and binary fields are transformed to bits, and write it to the empty file T.ctx (database data will later be directly appended to the file through the pseudo table without writing the conversion code for ordinal numbers and bits).

A3 Close A2’s composite table.

A4 Define a pseudo table based on composite table T.ctx:

{name:"Month",exp:"month@y(Time)"}: The code defines a computed column named Month and evaluated through month@y(Time).

{name:"EventType", pseudo:"EventTypeName", enum:[ "Login", "Browse", "Search", "AddtoCart", "Submit", "Logout"]}: The code defines correspondence relationship between EventType field values and the enumerated sequence, and represents the matching name through pseudo field EventTypeName.

{name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]}: The code defines correspondence relationship between OS field values and the enumerated sequence, and represents the matching name through pseudo field OSName.

{name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]}: The code defines correspondence relationship between Browser field values and the enumerated sequence, and represents the matching name through pseudo field BrowserName.

{name:"b1",bits:["f1","f2","f3","f4","f5",…….]}: The code defines field name each bit in the bit-based dimension b1 represents.

{name:"ProductID",dim:A1}: The code defines association relationship on ProductID field between it and dimension table A1.

A5 Generate the pseudo table.

The pseudo table definition string can be saved for direct use in later computations. This can further reduce the amount of code.

2.Use pseudo table to dump data. SPL will automatically convert data into corresponding ordinal numbers and bits according to pseudo table definition, and store the converted data in table T.

A
/ The above code for defining pseudo table
6 =connect("demo").cursor@x("select * from T order by Time")
7 =A5.append@i(A6)

A6 Connect to the database, and import data in table T to generate a cursor.

A7 Retrieve and append data in the cursor to the pseudo table.

3.Perform aggregation using the pseudo table

A
/ The above code for defining pseudo table
6 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
7 =A5.select(Time>=start && Time<=end && ProductID.ProductType=="Home&Personalcare"&& ["Safari","Edge","Chrome"].pos(BrowserName) && ["Android","IOS"].pos(OSName) && ! f1 && f4 && !f5).groups(Month,EventTypeName; count(1):Num, icount(UserID):iNum)

A7 Perform aggregation using the pseudo table. We can use pseudo fields defined for the pseudo table as ordinary fields and treat the pseudo table as an ordinary, simple table without taking care of storage and computing mechanisms.

Execution result:

Month EventTypeName Num iNum
202203 AddtoCart 307603 29252
202203 Browse 596492 58140
202203 Login 672163 65569
202203 Logout 672163 65569
202203 Search 491317 42919
202203 Submit 144552 13901
202204 AddtoCart 615222 58484
202204 Browse 1192970 116265
202204 Login 1344323 131123
202204 Logout 1344323 131123
202204 Search 982637 85843
202204 Submit 289112 27799
202205 AddtoCart 615214 58484
202205 Browse 1192976 116262
202205 Login 1344339 131133
202205 Logout 1344339 131133
202205 Search 982633 85848
202205 Submit 289108 27788
202206 AddtoCart 307635 29256
202206 Browse 596463 58124
202206 Login 672175 65575
202206 Logout 672175 65575
202206 Search 491344 42929
202206 Submit 144573 13887
Clone this wiki locally