1/26/2015

esProc External Memory Computing : Principle of Grouping

After data are imported from a data table, we often need to group them as required, or work out grouping and summarizing result. In esProc, groups function is used to compute the result of data grouping and summarizing; or group function can be used to first group the data, then perform further analysis and computation.

But the case will be different in processing huge data, for the records cannot be loaded to the memory all together and distributed into each group. Other times the number of groups is huge and the grouping and summarizing result cannot even be returned all at once. In these two occasions, the external memory grouping is required.

1. Grouping with cursor by directly specifying group numbers


Let’s create a big, simple data table containing employee information, which includes three fields: employee ID, state and birthday. The employee IDs are generated in order and the states are written in abbreviations obtained arbitrarily from the STATES table of demo database; birthdays are the dates selected arbitrarily within 10,000 days before 1994-1-1.The data table will be stored as a binary file for convenience. 

Altogether 1,000,000 rows of data are generated. The result of reading the 50,001th ~51,000th rows of data with cursor can be seen in C10 as follows:

In the following, we’ll take the generated data file, BirthStateRecord, as an example to explore how to group data in cursor computing by directly specifying group numbers. Because the data of the big data table cannot be loaded all together into the memory, we cannot group them as we do with the data from an ordinary table sequence. To solve this problem, esProc offers cs.groupn(x) function which can distribute the records in cursor cs according to the computed results of expression x into groups with specified serial numbers and return a sequence of cursors. For example:

To explain the way in which cs.groupn(x) function performs grouping by using external memory to specify group numbers, Click in the debug area of the toolbar to execute the code step by step until A6. A2 creates a cursor with the binary data file BirthStateRecord. A4 creates a sequence using the states’ abbreviations in STATES table.A5 uses groupn function to group data of the cursor; in this process, we need to find the corresponding serial numbers of the states in A4 and make them the group numbers. During the execution of groupn function in A5, a temporary file will be generated for each group to store the grouping result and a sequence of temporary file cursors will be returned as follows:

While the code in A5 is executed, external files are generated in the directory of temporary files:

In the execution of groupn function, the number of temporary files equals that of the groups of records. We can import data from one of the temporary files in another cellset:

The data A2 imports are as follows:

It can be seen that the data of a temporary file is, in fact, the employee information of a state. Here it is the employee information of the state of Missouri. The name of a temporary file is generated arbitrarily by the esProc program and managed by it.
Click  on the toolbar in the previous cellset file and go on with the execution of this cellset. In A8, when all file cursors are closed, the temporary files will be deleted automatically. A6 fetches from the first file cursor the employee information of the state of Alabama as follows:

A7 works out the grouping and summarizing result of the 22nd group using groups function, that is, the number of employees from the state of Michigan:
Actually A6 and A7 fetch data from two groups respectively. When the data of a file cursor have been fetched, the corresponding temporary file will be deleted automatically.

It is thus clear that a sequence consisting of temporary file cursors will be returned in grouping records of cursors using directly specified group numbers. Each file cursor contains the records of a group which can be further processed.

2. Grouping and summarizing result sets of huge data

When grouping data of cursors, most of the time we needn’t to get the detailed data of each group. What we only need is the grouping and summarizing result. To get the number of employees of each state from BirthStateRecord, for example, we use groups function to compute the grouping and summarizing result: 

Thus we can get the result in A3:

Here we notice that the groups function for grouping and summarizing will return a table sequence as the result after the computation is completed. In processing massive data, sometimes they need to be divided into a lot of groups and the result set of grouping and summarizing is too big to be returned all at once. This includes the following cases: the telecom company makes statistics of each customer’s bill; online shopping malls make statistics by categories about the sales of each kind of commodity, and the like. In these cases, the use of groups function may result in a memory overflow. We can use groupx(x:F,…;y:F,…;n) function instead to perform grouping and summarizing by making use of external memory. In the function, n represents the number of rows in buffer area. For example:

Still, the code is executed step by step until A4. In A3, groupx function uses external memory to perform grouping and summarizing. In cursor computing, groupx function is used in the operations of both grouping and summarizing with external memory and grouping by directly specified group numbers. The difference of the two operations lies in the parameters. A3 performs grouping by employees’ birthdays, then summate the number of employees born on each of the dates. During the computation, the number of rows in buffer area is set as 1,000. The result returned by A3 is a cursor as follows:

After the code in A3 is executed, external files will be generated in the directory of temporary files:

The data of one of the temporary files can be imported:


The data A2 imports are as follows:

The data of A3 is as follows:

It can be seen that each temporary file is the result of grouping and summarizing a part of the data according to employees’ birthdays. A larger cursor composed of all temporary files will be merged and returned by esProc. When the temporary files are generated, esProc will select a group number suitable for computing, so the rows of data in the temporary files will be a little more than the number of rows we set in buffer area. Special attention is needed in this point.

Go on with the execution of the cellset in the previous cellset file. When cursors are closed in A5, the temporary files will be automatically deleted. A4 fetches the first 1,000 birthdays from the cursor generated in A3 and counts the number of employees born on each birth date, as shown below:

No comments:

Post a Comment