-
Notifications
You must be signed in to change notification settings - Fork 328
The Positioning Operation and Position Utilization of SPL
Based on an ordered set, multiple operations can be extended, with the simplest being access by position, i.e. retrieving members of a record set by subscript or position. On the other hand, there is also positioning operation, which obtains the position of members in the record set in a certain way, such as the position of the record where the field Max/Min value is located, the position of records that meet certain conditions, and the position of sorted records in the original set, etc. The variety and powerful functions of positioning operations enhance and expand the expressive ability of structured data operations. Proficient in positioning operations and flexible use of positions can greatly simplify complex calculation targets and improve inefficient computing performance.
The basic data type (table) of SQL does not have a natural sequence number and belongs to an unordered set. Although there are pseudo fields such as rownum that generate temporary serial numbers, there are many restrictions and high thresholds, and there is no directly available positioning function. These reasons often result in SQL programmers not consciously using positions to solve problems.
The basic data type (table sequence) of SPL has a natural sequence number and belongs to an ordered set. SPL has built-in rich positioning operation functions, making it convenient for SPL programmers to calculate positions and make good use of them.
Starting from locating members
Positioning a member refers to obtaining the position of a specified member (record) in a set, which is the most basic positioning operation. SPL provides the function pos to implement positioning a member.
Example 1: Find the position of member banana from the fruit set, if not found, return null:
["Orange","Banana","Apple","Banana"].pos("Banana") //2
By selecting optional parameters and function options, the ability of function pos can also be extended, such as finding from back to front, using binary search when data is ordered, returning the positions where members appear multiple times, and so on. Other positioning functions have similar functions.
Using positioning members to solve problems
Although it is simple to locate members, establishing a mindset of utilizing position and applying it flexibly often simplifies complex computational tasks.
Example 2: Now there are a course schedule of a certain class and the teaching intention table made by professors based on the course schedule. The teaching intention table often has conflicts and gaps, meaning that multiple professors may have chosen the same course, or a certain course may not be chosen by any professor. Now we need to calculate the course conflict table so that the academic affairs department can adjust and determine the final teaching schedule in the next step.
schedule.txt
Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|
lua | maa | mia | jua | via |
lub | mab | mib | jub | vib |
luc | mac | mic | juc | vic |
lud | mad | mid | jud | vid |
lue | mae | mie | jue | vie |
luf | maf | mif | juf | vif |
lug | mag | mig | jug | vig |
Note: The concept of a course is a specific time period subject, such as Tuesday's third geography class, which is usually represented by code, such as luc; The concept of subject is not used in this example.
Intent.txt
Petitti | mif | mig | vif | vig | null | null | … |
---|---|---|---|---|---|---|---|
Canales | luc | lud | mac | mad | mic | mid | … |
Lucero | lub | luc | lud | lue | mab | mac | … |
Bergamaschi | lua | luf | maa | maf | mia | mif | … |
… | … | … | … | … | … | … | … |
Note: The above data does not have a column header. The first column is the name of the professor, and each subsequent column is the course intention, with null indicating a vacancy.
To calculate course conflicts, the key is to find the list of professors corresponding to each class. It is difficult to calculate the ideal result using only conditional queries, but with the pos function, it is much easier to first determine whether a certain professor exists in the course schedule. SPL code:
A | B | |
---|---|---|
1 | =file("Intent.txt").import() | / Import intent table |
2 | =A1.new(#1:professor, |
/ Generate a new table sequence, the first column is the professor ‘s name, the second column is course schedule(set) |
3 | =file("schedule.txt").import@t().conj(~.array()) | / Merge all courses horizontally and then vertically into a set |
4 | =A3.(A2.select(codeArray.pos(A3.~)).(professor)) | / Using position to calculate course conflicts |
5 | =create(Monday,Tuesday,Wednesday,Thursday,Friday).record(A4.(~.concat@c())) | / Present conflicting data as a visual course conflict table |
A4: Loop through the course set of A3, use the pos function to search for the current course in the teaching intention, and select the corresponding professor set for each class. The professor set may contain one or more professors, or it can be null.
[Bergamaschi,Puebla] |
---|
[Bergamaschi,Puebla] |
[Bergamaschi,Puebla] |
… |
A5: Fill the course conflict data of A4 in the course schedule in the order of first horizontal and then vertical. When multiple professors correspond to a course, merge them with commas. The results are as follows:
Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|
Bergamaschi,Puebla | Bergamaschi,Pue… | Bergamaschi,Puebla | Bergamaschi,Pue… | Bergamaschi,Puebla |
Lucero,Puebla,Lu… | Lucero,Mazza,Pu… | Lucero,Puebla,Chi… | Lucero,Mazza,Pe… | Lucero,Puebla,Vel… |
Canales,Lucero,P… | Canales,Lucero,M… | Canales,Lucero,P… | Canales,Lucero,M… | Lucero,Velasco,Lu… |
… | … | … | … | … |
Common positioning operations
Many positioning operations are based on structured data, such as pmax and pmin, to obtain the positions of the maximum and minimum values; ptop to locate the positions where the top N or bottom N are located; pselect to obtain the position of records that meet the conditions. By flexibly applying these positioning operations, complex structured data calculation problems can be simplified using positions.
Example 3: Based on a certain stock index, calculate the growth rate on the day when the closing price is highest. Part of the data is as follows:
Date | Open | Close | Amount |
---|---|---|---|
2019-12-31 | 3036 | 3050 | 2.27E+11 |
2019-12-30 | 2998 | 3040 | 2.67E+11 |
2019-12-27 | 3006 | 3005 | 2.58E+11 |
2019-12-26 | 2981 | 3007 | 1.96E+11 |
2019-12-25 | 2980 | 2981 | 1.91E+11 |
2019-12-24 | 2983 | 3040 | 1.92E+11 |
2019-12-23 | 2910 | 2991 | 1.99E+11 |
If position is not used, the usual approach is to first calculate the growth rate of each record, and then filter the records with the "highest closing price". The code is as follows:
A | |
---|---|
1 | =file("d:/000001.txt").import@t().sort(Date) |
2 | =A1.derive(Close/Close[-1]-1:rate) |
3 | =A2.select(Close==A1.max(Close)) |
This is quite intuitive, but to calculate the growth rate of each record, the performance is not optimal.
To improve performance, we should first find the record with the highest closing price and only calculate the growth rate of this record. The pmax function can easily achieve this goal:
A | |
---|---|
1 | =file("d:/000001.txt").import@t().sort(Date) |
2 | =A1.pmax(Close) |
3 | =A1(A2).Close/A1(A2-1).Close-1 |
The date with the highest closing price may be more than one day, pmax@a can find the positions where multiple maximum values are located, and by looping through these positions, we can find multiple growth rates:
A | |
---|---|
1 | =file("d:/000001.txt").import@t().sort(Date) |
2 | =A1.pmax@a(Close) |
3 | =A2.new(A1( |
By utilizing the positions of the top N or bottom N, we can also simplify problems and improve performance. Example 4: Calculate the increase in trading volume for the three days with the highest closing price.
A | |
---|---|
1 | =file("d:/000001.txt").import@t().sort(Date) |
2 | =A1.ptop(-3, Close) |
3 | =A2.new(A1( |
Similarly, we can use the position of records that meet the query criteria to solve problems. Example 5: What is the increase in trading volume on those trading days with a closing price increase of more than 1.03%?
A | |
---|---|
1 | =file("d:/000001.txt").import@t().sort(Date) |
2 | =A1.pselect@a(Close/Close[-1]>1.03) |
3 | =A2.new(A1( |
Partner of positioning operations
The result of positioning operations is often a set of positions, and loop calculation of each member of the set of positions is common in SPL. Using the basic syntax and functions of SPL can implement loop calculation, but because the main body of the loop is a set of positions, rather than a common set of records, the code is not intuitive enough and slightly complex. For example, in Example 3, the loop processes multiple positions with the highest closing price, and the final code is: A2.new(A1().Date, A1().Amount / A1(~-1).Amount)
In order to facilitate the loop processing of position set, SPL partners the calc function with positioning operations, replacing the loop of position set with a loop of record set, and changing the loop variable to a more intuitive current record, thus simplifying the code. Example 3 can be abbreviated as: A1.calc(A2,[Date,Close/Close[-1]-1])
The function calc can be combined with various positioning operations, and Example 4 and Example 5 can be simplified as (the code is the same):
A1.calc(A2,[Date,Amount / Amount[-1]])
Further utilization of position
In structured data computing, sometimes it is necessary to perform two sorting schemas on the same data and switch between the two schemas. It is difficult to switch directly without utilizing positions. SPL provides the psort function to calculate the position of sorted data in the original order, making it easy to switch between the two sorting schemas.
Example 6: The userConsumer table stores a large number of e-commerce users and their consumption amounts. Now, based on the external parameter p_userID, calculate how much more money the user needs to spend in order to increase their consumption ranking. This algorithm will be executed frequently.
For a large amount of data and frequent execution, the userConsumer table can be resident in memory. Calculate how much a user needs to spend to improve their ranking, that is, find the record where the user is located, and then perform cross row calculations with the previous user's consumption amount. The difference in amount is the desired goal. The difficulty lies in that the data should be sorted in reverse order of amount in order to perform cross row calculations with the previous user; However, the faster way to find the record of the specified user is to sort the data by user ID, which can be quickly searched using binary search (or hash index, etc., to be discussed separately).
If we don’t have the psort function, it is difficult to use both sorting schemas simultaneously and convert between them. The compromise approach is to only sort by amount when resident in memory:
A | |
---|---|
1 | =connect("orcl").query@x("select userID, amount |
from userConsume order by amount") | |
2 | =env(userConsume,A1) |
Binary search cannot be used for in-memory query:
A | |
---|---|
1 | =userConsume.pselect(userID: p_userID) |
2 | =userConsume.calc(A1,if(#>1,amount[-1]-amount,0)) |
The advantage of the function psort is that it uses position as a mediator to easily switch between two types of sorted data. The principle is as follows:
Original data UserConsume:
100 |
---|
300 |
200 |
Calculate the position of sorted data in the original order using psort,oPos=userConsume.psort():
1 |
---|
3 |
2 |
Generate new-order data using the results of psort, index=userConsumer (oPos):
100 |
---|
200 |
300 |
Calculate the data corresponding to the position of the new order in the original order using the results of psort: oPos (3)=300
After using the psort function, we can use two types of sorting to significantly improve performance. Change the memory-resident code to:
A | B | |
---|---|---|
1 | =connect("orcl").query@x("select userID, amount | |
from userConsume order by amount") | ||
2 | =env(userConsume,A1) | Original order |
3 | =env(oPos, userConsume.psort(userID)) | The position of sorted records in the original order |
4 | =env(index, userConsume(oPos)) | Using position to create sorted data, equivalent to indexing |
Binary search can be used for in-memory query:
A | B | |
---|---|---|
1 | =oPos(index.pselect@b(userID: p_userID)) | Perform binary search using sorted data and convert to the original order position |
2 | =userConsume.calc(A1,if(#>1,amount[-1]-amount,0)) | Perform relative position calculation in the original order |
In addition to the aforementioned, SPL also provides positioning functions such as pseg and pfind, which can be flexibly used to simplify complex computational objectives or improve inefficient computational performance.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code