In data analysis, we often need to group data and then compute an aggregate value for each group, or perform other computations on each group. esProc provides the groups function to compute aggregate values for groups of data, and the group function to group the records of a table for use in subsequent computations. However, when the data being grouped is too big to be loaded into memory in its entirety, these methods no longer work and external memory grouping is required.
Mobile phone numbers used for the simulation are 8-digit integers whose first four digits are fixed as 1234 and whose remaining digits are randomly generated. The records start on a randomly generated day of August in 2014. Call durations are randomly generated integers with a 90 percent chance of being one minute and a maximum of 20 minutes. After the data file is prepared, B11 fetches the first 1,000 rows:
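The simulation rules described above can also be sketched outside esProc. The following Python fragment is only an illustration, not the esProc code used in this article: the function names, the uniform choice of start time within August 2014, and the uniform choice of durations between 2 and 20 minutes for the remaining 10 percent of calls are all assumptions.

```python
import random
from datetime import datetime, timedelta


def make_call_record(rng):
    """Return one simulated (phone, start_time, duration_minutes) record."""
    # Phone number: fixed prefix 1234 followed by four random digits.
    phone = 12340000 + rng.randrange(10000)
    # Start time: a random second within August 2014 (assumed uniform).
    offset = rng.randrange(31 * 24 * 3600)
    start = datetime(2014, 8, 1) + timedelta(seconds=offset)
    # Duration: 1 minute with probability 0.9, otherwise 2..20 minutes
    # (the distribution of the longer calls is an assumption).
    duration = 1 if rng.random() < 0.9 else rng.randint(2, 20)
    return phone, start, duration


def make_call_records(n, seed=0):
    """Return a list of n simulated records, reproducible for a given seed."""
    rng = random.Random(seed)
    return [make_call_record(rng) for _ in range(n)]


if __name__ == "__main__":
    for record in make_call_records(5):
        print(record)
```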
Perform the following computations based on the data in PhoneBill:
① Compute the total call duration of all users and their average call duration for each day of August.
② Compute every user’s total call duration in August.
③ Store each day’s call records in a file.
④ Find the numbers of the five users with the longest total call duration for each day of August.
A2 generates a cursor for the binary text data in A1. A3 groups and summarizes data from the cursor. The result is as follows:
Because the result set is small, the cursor data only needs to be traversed once. The groups function is adequate for this group and aggregate operation, and external memory grouping is not necessary. In fact, given the modest amount of data, the result set could be acquired directly without a cursor. For more related information, please see esProc External Memory Computing: Principle of Grouping. To compute the average call duration in a day, we first need the total call duration and the total number of calls on that day. The result of A4 is as follows:
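As a rough illustration of why a single traversal is enough for the first task, here is a Python sketch (not esProc code) that keeps one small accumulator per day while streaming over the records. The record layout of (phone, start_time, duration) tuples and the function name are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def daily_totals(records):
    """Return {day: (total_duration, call_count, average_duration)}.

    Only one accumulator per day is kept in memory, so the input can be
    any iterable and is traversed exactly once.
    """
    totals = defaultdict(lambda: [0, 0])  # day -> [total_duration, call_count]
    for _phone, start, duration in records:
        entry = totals[start.date()]
        entry[0] += duration
        entry[1] += 1
    return {
        day: (total, count, total / count)
        for day, (total, count) in sorted(totals.items())
    }


if __name__ == "__main__":
    sample = [
        (12340001, datetime(2014, 8, 1, 9, 0), 1),
        (12340002, datetime(2014, 8, 1, 10, 0), 3),
        (12340001, datetime(2014, 8, 2, 9, 0), 5),
    ]
    for day, (total, count, average) in daily_totals(sample).items():
        print(day, total, count, average)
```

The average is derived from the total duration and the number of calls, which is why both are accumulated for each day.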
The second task is a different case. As there are far more users than days in August, we should consider whether the grouped and summarized result can be returned to memory all at once. There are at most 10,000 users in this example, which is actually not a large number; it is merely used to illustrate the computation of a big result set. Here we suppose the memory can only hold 1,000 records:
The groupx function is needed to group and summarize big data. It fetches data from the cursor and performs the group operation in batches of the pre-specified number of buffer rows, using external memory. A3 groups and aggregates the cursor data by Phone with 1,000 buffer rows, computing each user’s total call duration in the month. The result of grouping and summarizing big data is still a cursor, and fetching data from it is no different from fetching data from any other cursor. The result of A3 is as follows:
A4 fetches the records of the first 1,000 users. Their total call duration is as follows:
By the way, if the data in the cursor hasn’t been entirely fetched, the cs.close() function needs to be called to clean up the temporary files in external memory promptly.
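The batch-and-merge idea behind external memory grouping can be sketched in Python as follows. This is an illustration of the general technique under stated assumptions, not a description of how esProc implements it: the (phone, start_time, duration) record layout, the helper names, and the use of pickled temporary files are all hypothetical.

```python
import heapq
import pickle
import tempfile
from itertools import groupby, islice
from operator import itemgetter


def _spill(partial):
    """Write one batch's (phone, total) pairs, sorted by phone, to a temp file."""
    f = tempfile.TemporaryFile()
    for pair in sorted(partial.items()):
        pickle.dump(pair, f)
    f.seek(0)
    return f


def _read(f):
    """Yield the pickled pairs stored in a spill file, in order."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_group_sum(records, buffer_rows=1000):
    """Yield (phone, total_duration) sorted by phone.

    At most buffer_rows input rows are aggregated in memory at a time;
    each batch's partial result is spilled to disk and merged afterwards.
    """
    records = iter(records)
    spills = []
    while True:
        batch = list(islice(records, buffer_rows))
        if not batch:
            break
        partial = {}
        for phone, _start, duration in batch:
            partial[phone] = partial.get(phone, 0) + duration
        spills.append(_spill(partial))
    try:
        # Every spill file is sorted by phone, so a k-way merge brings the
        # partial totals of the same phone next to each other.
        merged = heapq.merge(*(_read(f) for f in spills), key=itemgetter(0))
        for phone, pairs in groupby(merged, key=itemgetter(0)):
            yield phone, sum(total for _, total in pairs)
    finally:
        # Closing the temporary files deletes them, even if the caller
        # stops fetching before the result is exhausted.
        for f in spills:
            f.close()
```

Like the cursor returned in A3, the result here is lazy: rows are produced only as they are fetched. Calling close() on the generator before it is exhausted runs the finally block, which plays the same role as cs.close() in removing the temporary files.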
Unlike other grouping operations on big data, the groupn function requires group numbers to be specified directly in the grouping expression. In A3, the dates in DateTime are used as the group numbers. Unlike the previous grouping and summarizing result, A3’s function returns a sequence of cursors, each corresponding to a group:
The fourth and fifth lines of code store each day’s cursor data in a file. A6 selects four days’ data, and A7 fetches the first 1,000 records from it:
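Grouping by group number and storing each group in its own file can be illustrated with the Python sketch below. It is not the esProc code: the day of the month is assumed to be the group number, and the CSV layout, the file naming pattern, and the function name are hypothetical.

```python
import csv
import os
import tempfile
from datetime import datetime


def split_by_day(records, out_dir):
    """Write each record to a per-day CSV file and return {day_number: path}.

    The group number is the day of the month (1..31), so at most 31 files
    are open at once and no group has to fit in memory.
    """
    writers, handles, paths = {}, {}, {}
    try:
        for phone, start, duration in records:
            day = start.day
            if day not in writers:
                paths[day] = os.path.join(out_dir, f"phonebill_{day:02d}.csv")
                handles[day] = open(paths[day], "w", newline="")
                writers[day] = csv.writer(handles[day])
            writers[day].writerow([phone, start.isoformat(), duration])
    finally:
        for handle in handles.values():
            handle.close()
    return paths


if __name__ == "__main__":
    sample = [
        (12340001, datetime(2014, 8, 1, 9, 0), 1),
        (12340002, datetime(2014, 8, 3, 9, 0), 2),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for day, path in sorted(split_by_day(sample, tmp).items()):
            print(day, os.path.basename(path))
```

Because the group number is known from the record itself, each row can be routed to its file in a single pass without sorting.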
For the fourth problem, group data by DateTime, then group and summarize the data of each group, and finally select the desired mobile phone numbers from the aggregate results:
A3 directly specifies dates of DateTime as group numbers in the grouping expression, and returns a sequence of cursors:
The code from the fourth to the sixth line loops through the cursor data of each group to compute the total call duration per user per day and, from the aggregate result, finds the records of the five users with the longest call duration. Please note that to make the aggregate function topx sort data in descending order, just add a negative sign before the sorting expression, like -TotalDuration in B5. The data of the five numbers with the longest total call duration is selected and stored in the table sequence in B3. After the loop finishes, the final result can be viewed in B3 as follows:
An alternative is to use the files generated for the third problem instead of grouping all the original data again.
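The logic of the fourth task, including the negative-sign trick for descending order, can be sketched in Python as follows. This is a simplified in-memory illustration rather than the esProc code: the (phone, start_time, duration) record layout and the function name are assumptions, and for truly big data each day's group would be read from its own cursor or file instead of being accumulated together.

```python
import heapq
from collections import defaultdict
from datetime import datetime


def top_callers_per_day(records, n=5):
    """Return {day: [(phone, total_duration), ...]} with the n largest totals."""
    totals = defaultdict(lambda: defaultdict(int))  # day -> phone -> total
    for phone, start, duration in records:
        totals[start.date()][phone] += duration
    result = {}
    for day in sorted(totals):
        # Taking the n smallest values of the negated total is the same as
        # taking the n largest totals, mirroring the negative sign placed
        # before the sorting expression.
        result[day] = heapq.nsmallest(
            n, totals[day].items(), key=lambda item: -item[1]
        )
    return result


if __name__ == "__main__":
    sample = [
        (12340001, datetime(2014, 8, 1, 9, 0), 5),
        (12340002, datetime(2014, 8, 1, 9, 5), 3),
        (12340003, datetime(2014, 8, 1, 9, 9), 1),
    ]
    print(top_callers_per_day(sample, n=2))
```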