First Half Ordered Sorting and Ordered Cursor
We encountered such a case; simplified, it looks like this: there is a table T in the database with two important fields, a and b. Field a is a timestamp accurate to the second; b is a user ID; the other fields describe the attributes of the event that happened to user b at time a.
The task is to export the data sorted by a and b, that is, to write the result set of SELECT * FROM T ORDER BY a, b to a file.
However, table T has billions of records, and after the SQL was submitted the database seemed dead, with no response for over an hour.
This is no surprise. The task requires sorting the data before outputting it, and sorting billions of records is very slow. When the data cannot fit in memory, the sort involves complex swapping between memory and external storage: the data has to be traversed three times (read twice and written once), so finishing within an hour is impossible.
Is there any other way?
General-purpose sorting algorithms have been studied to the extreme worldwide, and it is almost impossible to come up with a fundamentally better one. However, if we can identify some characteristics of this particular batch of data, we may be able to find a way.
After learning some of the business background, we found that this batch of data has the following characteristics:
- The data is inserted in the order of its occurrence time (i.e., a), so the physical storage order is close to the order of a, and the database has already built an index on field a;
- From the start time to the end time, data is inserted almost every second;
- The data is evenly distributed over time, with tens of thousands of entries per second; no single second contains a particularly large amount of data.
By utilizing these features, we can design the following algorithm (written in SPL), where start and end are the start and end times of the data respectively:
 | A | B
---|---|---
1 | for interval@s(start,end)+1 | =elapse@s(start,A1-1)
2 | | =db.query("SELECT * FROM T WHERE a=?",B1)
3 | | =B2.sort(b)
4 | | >outputfile.export@a(B3)
The basic logic: loop over all the seconds, retrieve the records of each second from the database, sort them by b, and append them to the file. Because the database has an index on a, and the data is stored almost in order of a, retrieving each second's records through the index is very fast. The amount of data per second is not large, so it can be sorted in memory quickly. It is easy to prove that the result set produced by this algorithm is ordered by a and b, so the big sort is accomplished without buffering any intermediate data.
Once this code runs, data begins to be output immediately, and the ordered export completes within a few hours. The hours are spent mainly on retrieving data from the database and writing the file (billions of records, terabytes of data); the sorting itself takes almost no time.
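For readers less familiar with SPL, here is a minimal Python sketch of the same per-second loop. The names `conn`, `start`, `end` and `out_path` are illustrative assumptions, not from the original script: `conn` is assumed to be a Python DB-API connection to the database holding T (the parameter placeholder style depends on the driver), and `start`/`end` are datetime values marking the first and last second of data.

```python
# A sketch of the per-second ordered export, not the author's code.
# Assumes T's first column is a, its second column is b, and no single
# second holds more rows than fit comfortably in memory.
import csv
from datetime import timedelta

def export_ordered_by_a_b(conn, start, end, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        t = start
        while t <= end:                                   # loop over every second
            cur = conn.cursor()
            cur.execute("SELECT * FROM T WHERE a = ?", (t,))
            rows = cur.fetchall()                         # at most tens of thousands of rows
            rows.sort(key=lambda r: r[1])                 # small in-memory sort by b
            writer.writerows(rows)                        # append; output stays ordered by (a, b)
            t += timedelta(seconds=1)
```

Each per-second batch is independent of the others except for the append order, which is why the memory footprint stays tiny even though the overall result is fully sorted.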
For this batch of data, we had another task: to determine whether fields a and b can serve as the primary key of T, that is, whether the combination of a and b is unique in table T.
In principle this is very simple to check with SQL: just see whether
SELECT COUNT(*) FROM T
and
SELECT COUNT(*) FROM (SELECT a,b FROM T GROUP BY a,b)
are equal (some databases do not support the COUNT(DISTINCT a,b) syntax, so it is written as a subquery here).
COUNT(*) is easy to compute, but a GROUP BY over billions of rows is implemented much like an external sort and costs about as much. It had been running for over an hour with no sign of progress.
If we utilize the above features, it is easy to calculate this value:
 | A | B
---|---|---
1 | for interval@s(start,end)+1 | =elapse@s(start,A1-1)
2 | | =db.query@1("SELECT COUNT(DISTINCT b) FROM T WHERE a=?",B1)
3 | | =@+B2
Similarly, loop through each second, compute COUNT(DISTINCT b) over the records of that second, and add the results up to get the answer we need (the correctness of this algorithm is easy to prove). This code finished in just a few minutes. Unlike the previous example, there is no export, so there is no need to retrieve the detailed data or write a file, and the per-second counts can even be computed in parallel, whereas the previous example must write sequentially and can only run serially.
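Again purely as an illustration, the same counting logic might look like this in Python, under the same assumed `conn`, `start` and `end` names as in the earlier sketch:

```python
# A sketch of the per-second distinct count, not the author's code.
from datetime import timedelta

def count_distinct_a_b(conn, start, end):
    total = 0
    t = start
    while t <= end:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(DISTINCT b) FROM T WHERE a = ?", (t,))
        total += cur.fetchone()[0]        # distinct b values within this one second
        t += timedelta(seconds=1)
    # (a, b) is unique in T exactly when this total equals SELECT COUNT(*) FROM T
    return total
```

Because the per-second counts are independent and only summed at the end, the loop could be spread over several threads or connections, which is why this task parallelizes while the ordered export does not.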
Both examples are about using an index to compute quickly, so why is this article titled "First Half Ordered Sorting"?
In fact, what we exploited is the order information already present in this batch of data. The key point of both examples is that the data must be ordered by a and b, and thanks to the index, the data already appears ordered by a; that is, the leading part of the sort key is already sorted. As long as the number of records sharing the same value of the leading fields is small enough to fit in memory, the big sort can be done without any buffering.
If the data is already stored in a file that preserves this order, the method becomes even more widely applicable: there is no need to know the start and end times of a in advance or to loop second by second, and the code is simpler.
If the records of table T are written to the data file T in the order of a, the two examples above can be written as follows:
 | A | B
---|---|---
1 | for file(T).cursor();a | =A1.sort(b)
2 | | >outputfile.export@a(B1)
 | A | B
---|---|---
1 | for file(T).cursor(a,b);a | =@+A1.id(b).len()
SPL provides ordered retrieval on cursors. In both pieces of code, A1 loops over a cursor on the data of file T: each time the value of field a changes, the batch of records just read (all sharing the same a) enters the loop body, and the cursor then goes on to read the next batch with the next value of a.
File-based traversal is several times more efficient than retrieving data from the database through an index as described above. These few pieces of code also use very little memory. A big sort is normally a memory-hungry operation: to minimize the number of merge segments, each segment should be as large as possible, so the more memory, the better the performance. After exploiting the ordered first half of the sort key, only a little memory is needed (here, enough to hold a few tens of thousands of rows) and the operation still runs at high speed.
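To make the ordered-cursor idea concrete outside SPL, here is a minimal Python sketch. It assumes the records already sit in a CSV file ordered by a, with a in the first column and b in the second; the file layout and function names are assumptions for illustration only.

```python
# Sketches of the two file-based algorithms, not the author's code.
import csv
from itertools import groupby
from operator import itemgetter

def export_sorted(in_path, out_path):
    """Read runs of equal a, sort each run by b in memory, append to the output."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for _, run in groupby(csv.reader(src), key=itemgetter(0)):   # consecutive equal a
            writer.writerows(sorted(run, key=itemgetter(1)))         # small in-memory sort by b

def count_distinct_pairs(in_path):
    """Sum the number of distinct b values within each run of equal a."""
    with open(in_path, newline="") as src:
        return sum(len({row[1] for row in run})
                   for _, run in groupby(csv.reader(src), key=itemgetter(0)))
```

Only one run of records with the same a is ever held in memory at a time, which mirrors the tens-of-thousands-of-rows footprint described above.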
Performance optimization should be adapted to the circumstances, based on the characteristics of the data and the operations.
We cannot solve the general big-sort problem, but we can design good algorithms that improve performance in specific situations. The database is too transparent: it may look as if programmers no longer need to care, but the database is not that intelligent and often fails to exploit data characteristics for automatic optimization. And within the SQL framework, even when a good algorithm is devised by hand, it is almost impossible to implement.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code