2/08/2015

esProc External Memory Computing: Cross-cellset Cursor

In the article esProcExternal Memory Computing: Concept of Cursor, we only touched on the basic usage of the cross-cellset cursor. Here we’ll delve into more issues about it.

1.Basic usage


The cross-cellset cursor is typically used to handle big data analysis and computing, but it doesn’t impose a minimum limit on data volume. First we’ll explain its usage using relatively small data volume. Cellset D:\files\FindEmployees1.dfx is as follows:

The program in the cellset is simple. It retrieves data of employees in a certain department and returns a table sequence in A3 using a result statement. The arg1 used in A1 is a cellset parameter specifying the department name. The parameter can be set by clicking Program>Parameter on the menu bar:

The article esProc Advanced Coding: Subroutines discusses the method of executing program in another cellset file via call function. The cross-cellset cursor has a similar usage, except that it will return a cursor using pcursor. It calls a cellset file in the master cellset as follows:

While creating a cross-cellset cursor using pcursor, directly specify the name of the to-be-invoked dfx file. Enter the parameter, if needed, after the file and use a semicolon to separate them. After the program is executed, the result of A1 is a cross-cellset cursor that is used in the same way as the ordinary cursor:

Using fetch, A3 returns a table sequence of employees of sales department:

Both cellsets use output function that outputs characters to the console during the execution, thus we can clearly see the execution order in the cellset. Click Tool>Console on the menu bar to see the output information:

It can be seen that the program in a specified cellset will only be invoked to kick off the cross-cellset computation when fetch starts to fetch data.

A1 calls the dfx file without using its full path. But this requires that the file have been placed into esProc’s main path or searching path, whose configuration is the same as that in the cross-cellset call by call function. We’ll reemphasize it by explaining the configuration process again.

Click Tool>Options on the menu bar to configure the main path and the searching path on the Environment page, as shown below:

The dfx file can be invoked only by its name without the necessity of writing the full path, regardless of being placed in the main path, or under any directory of the searching path in the above configuration.

In the case of being integrated, the main path and searching path of the dfx file need to be configured in the configuration file config.xml:
<dfxPathList>
            <dfxPath>E:\tools\raqsoft\esProc\demo\Case\Structural;D:\files\txt;D:\files</dfxPath>
</dfxPathList>
<mainPath>D:\files\demo</mainPath>

2. Cellset that returns multiple results

The cellset file to be invoked through the cross-cellset cursor could return multiple results, as D:\files\FindEmployees2.dfx shown below:

In this cellset, the program queries the database to get data of employees in a certain department, and returns two record sequences, composed respectively of data of female employees and of those of male ones, in A2 and A3 separately using two result statements. It sets the same cellset parameter as the previous section did.

Then the cross-cellset cursor is called in the master cellset as follows:

While creating a cross-cellset cursor by invoking pcursor, use the default value of the cellset parameter if it is omitted. That is, in this instance, to find data of employees in HR department. After execution A2 returns result as follows: 

The first half of the result table sequence is data of female employees in HR department and the second half is those of male ones in the department. We can see that the cross-cellset cursor will join in due order the multiple results that are returned via multiple result statements by the cellset being invoked.

During the invocation for the cellset that returns multiple results, all these results must have the same data structure because they will be returned altogether by the cross-cellset cursor. Otherwise errors will occur.

In order to see the detailed execution process, use output statements in the cellset to be invoked, as shown in the following cellset D:\files\FindEmployees3.dfx:

In this cellset, the program queries the database to get data of employees in a certain department, and returns two record sequences, composed respectively of data of female employees and of those of male ones, in A3 and A5 separately using two result statements. Here the same cellset parameter is used.

The cross-cellset cursor is called as follows:

In our examples, the cross-cellset cursor only contains a very small volume of data, but this is sufficient for explaining the cursor invocation. A2 loops through data in the cursor and fetches five records each time. B2 outputs the execution information during loops. B3 concatenates the records returned from the cursor into B1’s record sequence. After the program is executed, B1gets the same result as the previous instance did – records of female and male employees in HR department. The output information received by the console is as follows:

As can be seen from the output information, during the invocation of FindEmployees3.dfx by the master cellset, the program will be executed step by step according to the specified number of records to be returned. That’s why it is only after all the records of female employees in A3 have been returned that the fetching of records of male employees begins.

3. Cellset that returns data by loop

When a large volume of data need to be returned from a cross-cellset cursor, result statements can be used to return them through multiple steps as multiple results that will be concatenated in the master cellset. But a more commonly used way to handle this is to return data from the cursor by loop in multiple results in the cellset to be invoked. These results will be automatically concatenated as they are being retrieved in the master cellset. D:\files\Order1.dfx is such a cellset file for invocation:

A2 fetches data from the cursor by loop and B2 returns each batch of data as a table sequence using the result statement. This is not the way used in the previous example that the whole table sequence or record sequence is returned at a time.

In the master cellset, the cross-cellset cursor works in completely the same way as those in the above:

The cross-cellset cursor returns huge data. A2 only fetches the first 1,000 records as follows:

Since not all data in the cross-cellset cursor have been fetched out, it needs to be closed deliberately in A3. Once the cross-cellset cursor closes, the cursors in the cellset being invoked will close simultaneously.

Cursor data can also be returned as multiple cursors using multiple result statements. In this case data in every cursor will be concatenated together and fetched as a whole. This is the same as the case in which multiple record sequences are returned. So the data in every cursor need to have the same data structure.

4. Applications

By invoking a cellset file, the cross-cellset cursor converts the file’s return result into a cursor. Thus the cellset file can be used to deal with some complicated data processing tasks. During the big data analysis and processing, for example, text data are often used as the source data. But as they are not as orderly as the database data, they need to be rearranged by a routine. In this case the cross-cellset cursor is used to call this separate routine.

D:\files\EmployeeMul.txt contains employee data, as shown below:

In this text file, data of each employee hold information of ID, name, gender, entry date, birthday and city, etc. in three lines. It is common for a text file to save a record with many fields in multiple lines. The task is to find the data of Philadelphian employees from the file.

The text file cannot be used directly for computation, because esProc requires that each line of data in the file be regarded as a record when directly importing text data as the table sequence or the cursor. Data in the file EmployeeMul.txt need to be reorganized before they are used for analysis and computing. The program will become more intuitive if we choose to use a cross-cellset cursor. First reorganize the data in mergeRecord.dfx:

A1 imports the text file as a cursor. A2 fetches 5000*3 records each time and joins the fetched data together within its statement block. For every three records, B2, C2 and D2 respectively fetch the first one, the second one and the third one to create a record sequence. The table sequences they create with the last loop are separately shown below:

Only the first two columns of data in D2 are valid. B3 joins the three table sequences together using pjoin function:

After all this C3 can generate the desirable records using those source data:

Then the computation required by the task will be easily handled by invoking mergeRecord.dfx through the cross-cellset cursor.

A cross-cellset cursor allows preprocessing the complicated multi-line records in a sub-cellset, making the program focused, concise and readable. It is particularly handy in performing analysis and computation based on big data file. Data volume is not large in this example, so A3 fetches all data of Philadelphian employees at once:

Sometimes the text data have a more complicated format besides the need of combining multiple lines of data into a record, as the following D:\files\mailInfo.txt shows:

The file contains emails including recipient addresses, sender addresses and contents, which start respectively with RECIPIENT:, SENDADDRESS: and CONTENT:. As the number of lines in each mail’s content is different, it makes it impossible to import and combine a specified number of rows of data and to use them directly. Yet esProc supports the cross-cellset cursor through which the data are allowed to be first reorganized in another cellset - readMail.dfx:

A1 creates a cursor based on the text file. B1 skips the first two rows of meaningless data. For the text data, we haven’t decided how many rows are needed to form a record. It is a question under consideration. C1 stores the unprocessed data left by each loop.

By loop, A2 fetches 5,000 rows from the cursor each time. B2 concatenates the fetched data into a big string, placing the unprocessed data left by the previous loop at the head of each batch of data. Since each email starts with RECIPIENT:, C2 splits the big string into a sequence by it. The first member of the newly-created sequence will be an empty row and we cannot confirm if its last member has been processed. So the data will continue to be processed from the 3rd to the 7th line. @0 option in cs.fetch@0() in B3 is used to judge if the cursor data have all been fetched out, rather than really fetching data from the cursor. If not all has been fetched out, the remaining data will be stored in C1. For the first batch of data, results of B2 and C2 are as follows:

B7 performs regular expression parsing on each row of data in C2 using regex function. According to another two key words SENDADDRESS: and CONTENT:, it fetches the desirable data and makes them the field values. The following figure shows the parsing result of the first batch of data in B7:

B8 returns the parsing result of each batch to the main program.

Despite the complexity of the parsing process, the main program fetches data simply using a cross-cellset cursor, whose usage is nothing different from an ordinary cursor.

A2 fetches the first 1,000 rows:

No comments:

Post a Comment