1/27/2015

esProc External Memory Computing: Principle of Sorting

It is common to sort records in the table during data analysis and computing. In esProc, sort function is used to sort data in the sequence or the table sequence. External memory sorting is required when data being sorted are massive and cannot be loaded into memory all together, for the ordinary sorting method cannot handle this situation.

1. Massive data sorting with external memory

In data statistics, cursors are usually used to fetch massive data. This applies in esProc, which also processes big data with the cursor. In esProc, the function of a cursor, which reads one or more records each time according to the position(s) marked by it and won’t return all data all at once, is similar to that in a database stored procedure.

A cursor can only fetch part of the data every time, thus operations like sorting and grouping all data in the cursors cannot be executed directly. esProc uses external memory to handle these operations on massive data. Each time it reads some of the data for computing and records the result temporarily in the external memory. Later it will merge all the sub-results into a cursor and works out the final result.


Let’s create a data table with huge data in which the dates and 8-digit phone numbers (the first and the last digit should not be zero) are generated arbitrarily. The data table will be stored in the format of a binary file for convenience. 

Altogether 100,000 rows of data are generated. Read data from the 50,001th row to the 51,000th row using the cursor. The result can be seen in C10 as follows:

We’ll take PhoneRecord, the generated data file, as an example to explore how to perform external sorting in esProc.

Using cs.sortx(x…;n) function in external memory sorting, we can sort the data in cursor cs in ascending order according to the computed result of expression x… and set the number of rows in buffer area by defining n to determine the number of records fetched each time for generating a temporary file. For example:

In order to understand how the external memory is used in esProc to sort data, we click
 in the debug area on the toolbar to execute the code step by step until A5. A2 uses the binary data file PhoneRecord to create cursors. A3 uses sortx function to sort data of the cursors. The sorting result, in fact, will be a large cursor merged orderly by many temporary cursor files. The result of A3 is as follows:

A4 fetches the first 1,000 records from this result cursor as follows:

While the code in A3 is executed, external files, which are also called as temporary files, are generated in the directory of temporary files:

Because the number of rows in buffer area was set as 20,000 while using sortx function, five temporary files have been generated for the 100,000 records in the cursor. The data of one of them will be imported:

A2 imports the data as follows:

A3 works out the number of rows in this temporary file as follows:

By comparing the data in A2 with the final sorting result previously obtained, we can see that the former is, actually, the result obtained by sorting a part of the data. This indicates that each temporary file is the sorting result of some data fetched according to the number of rows in buffer area.

Then go on to execute the previous cellset file. We may find that the temporary files will be deleted automatically when the cursor is closed.

sortx function can also be used to sort multiple fields. For example:

A3 sorts data by Date and –PhoneNum , meaning sorting data by date in ascending order, and then sorting those of the same date by phone number in descending order. In programming, it is common to precede an expression for sorting data in descending order by a negative sign. A4 fetches the first 1,000 records from the result as follows:

2. Applications of external memory sorting

In fact, from the operation of external memory sorting we can see one of the uses of the cursor-style sorting, that is, the sorted data can be used in orderly merging. The operation of orderly merging gets data from multiple cursors and computes the sorting expression. Records will be fetched from the cursor that currently has the smallest (or biggest) result. Apparently, this method can only be used when the data of every cursor are properly ordered. In addition, joining records of cursors in alignment with join@x() function also requires that data in every cursor be sorted. About the usage of join@x() function, please refer to esProc External Memory Sorting: Merge and Join Cursor Data.

When the data of a cursor are properly ordered, it can be specified that we fetch data continuously from the cursor until the expression is changed. For example:

A4 fetches data from the cursor in A3 until the Date is changed, meaning the data of the first day will be fetched; A5 skips data of consecutive three days; A6 fetches the data of the fifth day. The results of A4 and A6 are as follows:

No comments:

Post a Comment