In the article esProcExternal Memory Computing: Concept of Cursor, we only touched on the basic
usage of the cross-cellset cursor. Here we’ll delve into more issues about it.
1.Basic usage
The cross-cellset cursor is typically used to handle big data
analysis and computing, but it doesn’t impose a minimum limit on data volume. First
we’ll explain its usage using relatively small data volume. Cellset D:\files\FindEmployees1.dfx is as follows:
The program in the cellset is simple. It retrieves data of employees in a certain department and returns a table sequence in A3 using a result statement. The arg1 used in A1 is a cellset parameter specifying the department name. The parameter can be set by clicking Program>Parameter on the menu bar:
The article esProc Advanced Coding: Subroutines discusses the method of executing program in another cellset file via call function. The cross-cellset cursor has a similar usage, except that it will return a cursor using pcursor. It calls a cellset file in the master cellset as follows:
While creating a cross-cellset cursor using pcursor, directly specify the name of the to-be-invoked dfx file. Enter the parameter, if needed, after the file and use a semicolon to separate them. After the program is executed, the result of A1 is a cross-cellset cursor that is used in the same way as the ordinary cursor:
Using fetch, A3 returns a table sequence of employees of sales department:
Both cellsets use output function that outputs characters to the console during the execution, thus we can clearly see the execution order in the cellset. Click Tool>Console on the menu bar to see the output information:
It can be seen that the program in a specified cellset will only be invoked
to kick off the cross-cellset computation when fetch starts to fetch data.
A1 calls the dfx file without using its full
path. But this requires that the file have been placed into esProc’s main path
or searching path, whose configuration is the same as that in the cross-cellset
call by call function. We’ll
reemphasize it by explaining the configuration process again.
The dfx file can be invoked only by its name without the necessity
of writing the full path, regardless of being placed in the main path, or under
any directory of the searching path in the above configuration.
In the case of being integrated, the main path and searching path of
the dfx
file need to be configured in the configuration file config.xml:
<dfxPathList>
<dfxPath>E:\tools\raqsoft\esProc\demo\Case\Structural;D:\files\txt;D:\files</dfxPath>
</dfxPathList>
<mainPath>D:\files\demo</mainPath>
2. Cellset that returns multiple results
The cellset file to be invoked through the cross-cellset cursor could return multiple results, as D:\files\FindEmployees2.dfx shown below:
In this cellset, the program queries the database to get data of employees
in a certain department, and returns two record sequences, composed respectively
of data of female employees and of those of male ones, in A2 and A3 separately
using two result statements. It sets
the same cellset parameter as the previous section did.
While creating a cross-cellset cursor by invoking pcursor, use the default value of the cellset parameter if it is omitted. That is, in this instance, to find data of employees in HR department. After execution A2 returns result as follows:
The first half of the result table sequence is data of female
employees in HR department and the second half is those of male ones in the
department. We can see that the cross-cellset cursor will join in due order the
multiple results that are returned via multiple result statements by the cellset being invoked.
During the invocation for the cellset that returns multiple results,
all these results must have the same
data structure because they will be returned altogether by the cross-cellset
cursor. Otherwise errors will occur.
In this cellset, the program queries the database to get data of
employees in a certain department, and returns two record sequences, composed
respectively of data of female employees and of those of male ones, in A3 and
A5 separately using two result
statements. Here the same cellset parameter is used.
In our examples, the cross-cellset cursor only contains a very small volume of data, but this is sufficient for explaining the cursor invocation. A2 loops through data in the cursor and fetches five records each time. B2 outputs the execution information during loops. B3 concatenates the records returned from the cursor into B1’s record sequence. After the program is executed, B1gets the same result as the previous instance did – records of female and male employees in HR department. The output information received by the console is as follows:
As can be seen from the output information, during the invocation of
FindEmployees3.dfx by the master
cellset, the program will be executed step by step according to the specified
number of records to be returned. That’s why it is only after all the records
of female employees in A3 have been returned that the fetching of records of male
employees begins.
3. Cellset that returns data by loop
When a large volume of data need to be returned from a cross-cellset cursor, result statements can be used to return them through multiple steps as multiple results that will be concatenated in the master cellset. But a more commonly used way to handle this is to return data from the cursor by loop in multiple results in the cellset to be invoked. These results will be automatically concatenated as they are being retrieved in the master cellset. D:\files\Order1.dfx is such a cellset file for invocation:
A2 fetches data from the cursor by loop and B2 returns each batch of
data as a table sequence using the result
statement. This is not the way used in the previous example that the whole
table sequence or record sequence is returned at a time.
The cross-cellset cursor returns huge data. A2 only fetches the first 1,000 records as follows:
Since not all data in the cross-cellset cursor have been fetched
out, it needs to be closed deliberately in A3. Once the cross-cellset cursor closes, the cursors in the cellset being invoked will
close simultaneously.
Cursor data can also be returned as multiple cursors using multiple result statements. In this case data in
every cursor will be concatenated together and fetched as a whole. This is the
same as the case in which multiple record sequences are returned. So the data
in every cursor need to have the same data structure.
4. Applications
By invoking a cellset file, the cross-cellset cursor converts the
file’s return result into a cursor. Thus the cellset file can be used to deal
with some complicated data processing tasks. During the big data analysis and
processing, for example, text data are often used as the source data. But as they
are not as orderly as the database data, they need to be rearranged by a
routine. In this case the cross-cellset cursor is used to call this separate
routine.
In this text file, data of each employee hold information of ID,
name, gender, entry date, birthday and city, etc. in three lines. It is common
for a text file to save a record with many fields in multiple lines. The task
is to find the data of Philadelphian employees from the file.
A1 imports the text file as a cursor. A2 fetches 5000*3 records each time and joins the fetched data together within its statement block. For every three records, B2, C2 and D2 respectively fetch the first one, the second one and the third one to create a record sequence. The table sequences they create with the last loop are separately shown below:
Only the first two columns of data in D2 are valid. B3 joins the three table sequences together using pjoin function:
After all this C3 can generate the desirable records using those source data:
Then the computation required by the task will be easily handled by invoking mergeRecord.dfx through the cross-cellset cursor.
A cross-cellset cursor allows preprocessing the complicated multi-line records in a sub-cellset, making the program focused, concise and readable. It is particularly handy in performing analysis and computation based on big data file. Data volume is not large in this example, so A3 fetches all data of Philadelphian employees at once:
Sometimes the text data have a more complicated format besides the need of combining multiple lines of data into a record, as the following D:\files\mailInfo.txt shows:
The file contains emails including recipient addresses, sender addresses and contents, which start respectively with RECIPIENT:, SENDADDRESS: and CONTENT:. As the number of lines in each mail’s content is different, it makes it impossible to import and combine a specified number of rows of data and to use them directly. Yet esProc supports the cross-cellset cursor through which the data are allowed to be first reorganized in another cellset - readMail.dfx:
A1 creates a cursor based on the text file. B1 skips the first two
rows of meaningless data. For the text data, we haven’t decided how many rows
are needed to form a record. It is a question under consideration. C1 stores
the unprocessed data left by each loop.
B7 performs regular expression parsing on each row of data in C2 using regex function. According to another two key words SENDADDRESS: and CONTENT:, it fetches the desirable data and makes them the field values. The following figure shows the parsing result of the first batch of data in B7:
B8 returns the parsing result of each batch to the main program.
A2 fetches the first 1,000 rows:
No comments:
Post a Comment