It is common to sort records in the table during data analysis and
computing. In esProc, sort function
is used to sort data in the sequence or the table sequence. External memory sorting is required
when data being sorted are massive and cannot be loaded into memory all
together, for the ordinary sorting method cannot handle this situation.
1. Massive data sorting with external memory
In data statistics, cursors are usually used to fetch massive data. This
applies in esProc, which also processes big data with the cursor. In esProc, the
function of a cursor, which reads one or more records each time according to
the position(s) marked by it and won’t return all data all at once, is similar
to that in a database stored procedure.
A cursor can only fetch part of the data every time, thus operations
like sorting and grouping all data in the cursors cannot be executed directly. esProc
uses external memory to handle these operations on massive data. Each time it
reads some of the data for computing and records the result temporarily in the
external memory. Later it will merge all the sub-results into a cursor and works
out the final result.
Let’s create a data table with huge data in which the dates and
8-digit phone numbers (the first and the last digit should not be zero) are generated
arbitrarily. The data table will be stored in the format of a binary file for
convenience.
Altogether 100,000 rows of data are generated. Read data from the 50,001th row to the 51,000th row using the cursor. The result can be seen in C10 as follows:
We’ll take PhoneRecord,
the generated data file, as an example to explore how to perform external
sorting in esProc.
In order to understand how the external memory is used in esProc to sort data, we click
A4 fetches the first 1,000 records from this result cursor as follows:
While the code in A3 is executed, external files, which are also called as temporary files, are generated in the directory of temporary files:
Because the number of rows in buffer area was set as 20,000 while using sortx function, five temporary files have been generated for the 100,000 records in the cursor. The data of one of them will be imported:
A2 imports the data as follows:
A3 works out the number of rows in this temporary file as follows:
By comparing the data in A2 with the final sorting result previously
obtained, we can see that the former is, actually, the result obtained by
sorting a part of the data. This indicates that each temporary file is the sorting
result of some data fetched according to the number of rows in buffer area.
Then go on to execute the previous cellset file. We may find that
the temporary files will be deleted automatically when the cursor is closed.
A3 sorts data by Date and –PhoneNum , meaning sorting data by date in ascending order, and then sorting those of the same date by phone number in descending order. In programming, it is common to precede an expression for sorting data in descending order by a negative sign. A4 fetches the first 1,000 records from the result as follows:
2. Applications of external memory sorting
In fact, from the operation of external memory sorting we can see
one of the uses of the cursor-style sorting, that is, the sorted data can be used
in orderly merging. The operation of orderly merging gets data from multiple cursors
and computes the sorting expression. Records will be fetched from the cursor
that currently has the smallest (or biggest) result. Apparently, this method
can only be used when the data of every cursor are properly ordered. In
addition, joining records of cursors in alignment with join@x() function also requires that data
in every cursor be sorted. About the usage of join@x() function, please refer to esProc
External Memory Sorting: Merge and Join Cursor Data.
A4 fetches data from the cursor in A3 until the Date is changed, meaning the data of the first day will be fetched; A5 skips data of consecutive three days; A6 fetches the data of the fifth day. The results of A4 and A6 are as follows:
No comments:
Post a Comment