Skip to content

Performance optimization skill:TopN

esProcSPL edited this page Aug 19, 2024 · 1 revision

1. Data preparation and environment

SPL script is used to generate the data file with two columns of data. The first column ID consists of random integers of less than 2 billion, and the second column amount is random real numbers of no more than 10 million. The data record is 8 billion lines, and the original text file size generated is 169 G. With the data importing tool provided by the database, the file data is imported into Oracle's data table topn, and the SPL composite table file topn.ctx is also generated with the file data.

The test is performed on an Intel server, 2 Intel 3014 CPUs, 1.7G frequency, 12 cores in total, 64G memory. The database table data and SPL composite table file are both stored on the same SSD hard disk.

The data volume is deliberately designed larger than the memory to ensure the data exchanging between memory and external storage when sorting, which degrades the performance greatly and can be easily observed.

2. Routine TopN

Retrieve the top 100 pieces of data with the largest amount in the topn table.

a) Oracle Test

The SQL statement for query is:

select * from ( 
        select /*+ parallel(4) */ 
        * from topn order by amount desc 
    ) where rownum<=100;

Note:/*+ parallel(4) */ is the parallel query grammar of Oracle, of which 4 is the parallel number.

b) SPL Test

Write SPL script to execute the test:

A
1 =now() /record time
2 =4 /number of parallel
3 =file("/home/topn/topn.ctx").open() /generate the composite table object
4 =A3.cursor@m(id,amount;;A2).groups(;top(100;-amount))
5 =interval@s(A1,now()) /calculate the execution time

Unlike SQL, SPL regards TopN as an aggregation operation, the same as sum/count, except that TopN returns a set while sum/count returns a single value. But their computational logic is the same, i.e., only traversing the original data set once without full sorting involved.

groups(;top(100;-amount) in A4 is to do TopN aggregation operation on the whole set and calculate the top 100.

c) Conclusion and analysis

The test time of routine TopN is shown in the following table:

Test results (time unit: seconds)

Number of parallel 1 2 4 8 12
Oracle 1922 952 526 308 256
SPL composite table 2641 1565 729 371 321

It is shown that Oracle is a little faster than SPL, and SPL did not do full sorting, which indicates that Oracle will automatically optimize itself in this case.

It is also understandable that Oracle is faster than SPL, because Oracle is developed in C++ and the current version of SPL is developed in Java. It’s normal that C++ calculation is faster than Java. Besides, this test reads all the two columns of data, and the data is random and unordered, which is difficult to compress, so the columnar storage of composite table has no advantage.

3. Increasing complexity

For the most basic TopN, Oracle is smart enough to optimize it even if it is written as a sub-query. Let's increase the difficulty of the problem and do TopN in each group after grouping.

The specific operation design is: grouping by the last digit of the ID field, and then querying the first 100 records with the largest amount in each group. ID is an integer, so there are only 10 groups. The calculation amount of grouping itself is not large, but the overall calculation complexity will be slightly higher if we perform TopN on grouped data. If there is no full sorting, the total operation time will be longer than the case before, but still within the same order of magnitude.

a) Oracle Test

The SQL statement for query is:

select * from ( 
        select  /*+ parallel(2) */ 
    mod(id,10) as gid,amount,
            row_number() over (partition by mod(id,10) order by amount desc) rn 
        from topn
    ) where rn <= 100;

SQL can not directly write the operation of retrieving TopN after grouping, and can only calculate the sequence number using window function with keyword of sorting(Order by) in the syntax.

b) SPL composite table Test

Write SPL script to execute the test:

A
1 =now() /record the time
2 =4 /number of parallel
3 =file("/home/topn/topn.ctx").open() /generate the composite table object
4 =A3.cursor@m(id,amount;;A2).groups@u(id%10:gid;top(100;-amount))
5 =interval@s(A1,now()) /calculation the execution time

Because SPL regards TopN as aggregate operation, it can be easily put into grouping aggregation and has almost the same syntax as full aggregation.

c) Conclusion and analysis

Test results (time unit: seconds)

Number of parallel 1 2 4 8 12
Oracle 41649 19602 9359 4627 3211
SPL composite table 4380 2127 1007 465 349

After increasing the complexity, Oracle is more than 10 times slower than the previous simple case, and nearly 10 times slower than SPL doing the same operation, which indicates that Oracle is very likely to have done sorting in this case, and the optimization engine does not work when the situation becomes more complex.

The difference in the operation time of SPL in the two cases is less than two times, basically in the same order of magnitude, which accords with our previous analysis, and the advantages of the algorithm are fully reflected.

Clone this wiki locally