Data Process Engine: computation outside database

Showing posts with label computation outside database. Show all posts

2/08/2015

esProc External Memory Computing: Cross-cellset Cursor

In the article esProcExternal Memory Computing: Concept of Cursor, we only touched on the basic usage of the cross-cellset cursor. Here we’ll delve into more issues about it.

1.Basic usage

The cross-cellset cursor is typically used to handle big data analysis and computing, but it doesn’t impose a minimum limit on data volume. First we’ll explain its usage using relatively small data volume. Cellset D:\files\FindEmployees1.dfx is as follows:

The program in the cellset is simple. It retrieves data of employees in a certain department and returns a table sequence in A3 using a result statement. The arg1 used in A1 is a cellset parameter specifying the department name. The parameter can be set by clicking Program>Parameter on the menu bar:

The article esProc Advanced Coding: Subroutines discusses the method of executing program in another cellset file via call function. The cross-cellset cursor has a similar usage, except that it will return a cursor using pcursor. It calls a cellset file in the master cellset as follows:

While creating a cross-cellset cursor using pcursor, directly specify the name of the to-be-invoked dfx file. Enter the parameter, if needed, after the file and use a semicolon to separate them. After the program is executed, the result of A1 is a cross-cellset cursor that is used in the same way as the ordinary cursor:

Using fetch, A3 returns a table sequence of employees of sales department:

Both cellsets use output function that outputs characters to the console during the execution, thus we can clearly see the execution order in the cellset. Click Tool>Console on the menu bar to see the output information:

It can be seen that the program in a specified cellset will only be invoked to kick off the cross-cellset computation when fetch starts to fetch data.

A1 calls the dfx file without using its full path. But this requires that the file have been placed into esProc’s main path or searching path, whose configuration is the same as that in the cross-cellset call by call function. We’ll reemphasize it by explaining the configuration process again.

Click Tool>Options on the menu bar to configure the main path and the searching path on the Environment page, as shown below:

The dfx file can be invoked only by its name without the necessity of writing the full path, regardless of being placed in the main path, or under any directory of the searching path in the above configuration.

In the case of being integrated, the main path and searching path of the dfx file need to be configured in the configuration file config.xml:

<dfxPath>E:\tools\raqsoft\esProc\demo\Case\Structural;D:\files\txt;D:\files</dfxPath>

</dfxPathList>

<mainPath>D:\files\demo</mainPath>

2. Cellset that returns multiple results

The cellset file to be invoked through the cross-cellset cursor could return multiple results, as D:\files\FindEmployees2.dfx shown below:

In this cellset, the program queries the database to get data of employees in a certain department, and returns two record sequences, composed respectively of data of female employees and of those of male ones, in A2 and A3 separately using two result statements. It sets the same cellset parameter as the previous section did.

Then the cross-cellset cursor is called in the master cellset as follows:

While creating a cross-cellset cursor by invoking pcursor, use the default value of the cellset parameter if it is omitted. That is, in this instance, to find data of employees in HR department. After execution A2 returns result as follows:

The first half of the result table sequence is data of female employees in HR department and the second half is those of male ones in the department. We can see that the cross-cellset cursor will join in due order the multiple results that are returned via multiple result statements by the cellset being invoked.

During the invocation for the cellset that returns multiple results, all these results must have the same data structure because they will be returned altogether by the cross-cellset cursor. Otherwise errors will occur.

In order to see the detailed execution process, use output statements in the cellset to be invoked, as shown in the following cellset D:\files\FindEmployees3.dfx:

In this cellset, the program queries the database to get data of employees in a certain department, and returns two record sequences, composed respectively of data of female employees and of those of male ones, in A3 and A5 separately using two result statements. Here the same cellset parameter is used.

The cross-cellset cursor is called as follows:

In our examples, the cross-cellset cursor only contains a very small volume of data, but this is sufficient for explaining the cursor invocation. A2 loops through data in the cursor and fetches five records each time. B2 outputs the execution information during loops. B3 concatenates the records returned from the cursor into B1’s record sequence. After the program is executed, B1gets the same result as the previous instance did – records of female and male employees in HR department. The output information received by the console is as follows:

As can be seen from the output information, during the invocation of FindEmployees3.dfx by the master cellset, the program will be executed step by step according to the specified number of records to be returned. That’s why it is only after all the records of female employees in A3 have been returned that the fetching of records of male employees begins.

3. Cellset that returns data by loop

When a large volume of data need to be returned from a cross-cellset cursor, result statements can be used to return them through multiple steps as multiple results that will be concatenated in the master cellset. But a more commonly used way to handle this is to return data from the cursor by loop in multiple results in the cellset to be invoked. These results will be automatically concatenated as they are being retrieved in the master cellset. D:\files\Order1.dfx is such a cellset file for invocation:

A2 fetches data from the cursor by loop and B2 returns each batch of data as a table sequence using the result statement. This is not the way used in the previous example that the whole table sequence or record sequence is returned at a time.

In the master cellset, the cross-cellset cursor works in completely the same way as those in the above:

The cross-cellset cursor returns huge data. A2 only fetches the first 1,000 records as follows:

Since not all data in the cross-cellset cursor have been fetched out, it needs to be closed deliberately in A3. Once the cross-cellset cursor closes, the cursors in the cellset being invoked will close simultaneously.

Cursor data can also be returned as multiple cursors using multiple result statements. In this case data in every cursor will be concatenated together and fetched as a whole. This is the same as the case in which multiple record sequences are returned. So the data in every cursor need to have the same data structure.

4. Applications

By invoking a cellset file, the cross-cellset cursor converts the file’s return result into a cursor. Thus the cellset file can be used to deal with some complicated data processing tasks. During the big data analysis and processing, for example, text data are often used as the source data. But as they are not as orderly as the database data, they need to be rearranged by a routine. In this case the cross-cellset cursor is used to call this separate routine.

D:\files\EmployeeMul.txt contains employee data, as shown below:

In this text file, data of each employee hold information of ID, name, gender, entry date, birthday and city, etc. in three lines. It is common for a text file to save a record with many fields in multiple lines. The task is to find the data of Philadelphian employees from the file.

The text file cannot be used directly for computation, because esProc requires that each line of data in the file be regarded as a record when directly importing text data as the table sequence or the cursor. Data in the file EmployeeMul.txt need to be reorganized before they are used for analysis and computing. The program will become more intuitive if we choose to use a cross-cellset cursor. First reorganize the data in mergeRecord.dfx:

A1 imports the text file as a cursor. A2 fetches 5000*3 records each time and joins the fetched data together within its statement block. For every three records, B2, C2 and D2 respectively fetch the first one, the second one and the third one to create a record sequence. The table sequences they create with the last loop are separately shown below:

Only the first two columns of data in D2 are valid. B3 joins the three table sequences together using pjoin function:

After all this C3 can generate the desirable records using those source data:

Then the computation required by the task will be easily handled by invoking mergeRecord.dfx through the cross-cellset cursor.

A cross-cellset cursor allows preprocessing the complicated multi-line records in a sub-cellset, making the program focused, concise and readable. It is particularly handy in performing analysis and computation based on big data file. Data volume is not large in this example, so A3 fetches all data of Philadelphian employees at once:

Sometimes the text data have a more complicated format besides the need of combining multiple lines of data into a record, as the following D:\files\mailInfo.txt shows:

The file contains emails including recipient addresses, sender addresses and contents, which start respectively with RECIPIENT:, SENDADDRESS: and CONTENT:. As the number of lines in each mail’s content is different, it makes it impossible to import and combine a specified number of rows of data and to use them directly. Yet esProc supports the cross-cellset cursor through which the data are allowed to be first reorganized in another cellset - readMail.dfx:

A1 creates a cursor based on the text file. B1 skips the first two rows of meaningless data. For the text data, we haven’t decided how many rows are needed to form a record. It is a question under consideration. C1 stores the unprocessed data left by each loop.

By loop, A2 fetches 5,000 rows from the cursor each time. B2 concatenates the fetched data into a big string, placing the unprocessed data left by the previous loop at the head of each batch of data. Since each email starts with RECIPIENT:, C2 splits the big string into a sequence by it. The first member of the newly-created sequence will be an empty row and we cannot confirm if its last member has been processed. So the data will continue to be processed from the 3^rd to the 7^th line. @0 option in cs.fetch@0() in B3 is used to judge if the cursor data have all been fetched out, rather than really fetching data from the cursor. If not all has been fetched out, the remaining data will be stored in C1. For the first batch of data, results of B2 and C2 are as follows:

B7 performs regular expression parsing on each row of data in C2 using regex function. According to another two key words SENDADDRESS: and CONTENT:, it fetches the desirable data and makes them the field values. The following figure shows the parsing result of the first batch of data in B7:

B8 returns the parsing result of each batch to the main program.

Despite the complexity of the parsing process, the main program fetches data simply using a cross-cellset cursor, whose usage is nothing different from an ordinary cursor.

A2 fetches the first 1,000 rows:

2/04/2015

esProc External Memory Computing: Group Cursor

During big data computing, besides data traversal and grouping and aggregate operations, sometimes we need to retrieve one group of data each time to analyze. For example, analyze sales data by dates, plot sales curve for each product, and study the purchase habit of each client.

1. Fetch data by groups according to the expression

In esProc, we can use cs.fetch(;x) function or cs.skip(;x) function to get or skip records till the value of expression x is changed. By doing so, a group of consecutive data can be obtained. For example, retrieve the data of one product each time and examine the sales:

The cursor in A6 contains the sales data of four kinds of products, which have been sorted by PID. From B7, records of the 20^th piece of commodity have been retrieved:

Note that the traversal of cursor data in esProc is executed just once in a one-way street. Thus the data in cursor must be in order when retrieving a single group of records each time as necessary.

2. Fetch data from file cursor segmentally

Both articles – esProcExternal Memory Computing: Text Files and esProc External Memory Computing: Binary Files – deal with the usage of cs.fetch@z(). As we know, @z option can be used to retrieve file or fetch data from cursor segmentally (or by block). However, when fetching data segmentally, it is esProc that determine how the data are divided and sometimes problems may arise.

First, let’s create a text file to save the above used data which are already sorted by PID as a new binary file Order_Products:

In the later computation, if retrieving data by segment, we will get the situation given below:

After all data are divided into 100 segments, retrieve data from the 1st segment in A3, and from the 2nd segment in A5, as shown below:

Now a problem appears. For the product with the number B1445, its sales records appear in both groups. If we need to perform the aggregate operation after each time data are fetched, then duplicate product numbers will appear in the return result. Thus re-aggregation will be necessary to get the final result. Such segmental computation is quite common for the parallel computation over big data, but the above problem will make the computation more complicated. To avoid it, we should store data through the way of segmentation by groups.

When storing the binary data with the cursor, simply use the @g option. In this case, the data written into the cursor will be segmented by groups. By doing so, the data from a same group is sure to be fully fetched all at once when retrieving data segmentally. For example:

Save the data that have been sorted by the product numbers as a binary file Order_Products_G and segment it by groups according to PID. This is slightly different to the method we adopted previously to write the data to the file of Order_Products. Please note that segmental storage is only valid for the binary file. The field by which the data are written into a binary file through the way of segmentation by groups is determined by the parameter used in export@bg() function. It is specified in A8 that A6’s cursor is segmented by groups according to PID, for instance.

At this point, result will be different if we fetch data by segments:

The binary data stored like this is fetched using the cursor in the same way as that of other binary data being fetched with the cursor. But their data fetching results are different as a result of segmenting data by groups during file saving in the former case. Data fetched in A3 and A5 are as follows:

Therefore records of the product whose ID is B1445 will all be fetched for segment one and the data for segment two will be fetched from the next product. It can be seen that, after we specify that data are written into the binary file through the way of segmentation by groups, data of the same group will be put in the same segment for being fetched from the cursor. With segmentation by group, the integrity of the data in each group is guaranteed, and the segmental computation over big data becomes simpler and easier.

2/03/2015

esProc External Memory Computing: Merge and Join Cursor Data

During data computing based on the table sequence, we can combine data from multiple table sequences together for use in the analysis and computation. For instance, use A.merge() to combine records of multiple table sequences in a certain order, or A.conj() to union them in order into a grand table, or JOIN functions, like join, pjoin and xjoin, to establish certain relationship between multiple table sequences by joining their data together and perform query or other computations based on this relation.

Cursors for processing big data can be merged and joined likewise. This article aims to teach the method of merging and joining data of the cursor with functions including CS.conj@x(), CS.merge@x(x) and join@x(), etc.

1. Merge data in order

Most of the time, data could be stored in multiple data tables, for example, sales records of multiple products and employee profiles of each department. In this case, we need to merge the data from multiple data tables together for a better use. For multiple normal homogeneous table sequences, A.conj() or A.merge(x) can be used to merge the records of these table sequence into a record sequence. For big data, use CS.conj@x() and CS.merge@x(x) to combine data in every cursor of cursor sequence CS, thus the data can be fetched as a whole.

Data in the cursor will only be traversed once, so it is impossible to re-sort them after they are merged and fetched out. In view of this, the data in each cursor must be ordered in the case of merging data from multiple cursors.

Now let’s learn about the usages of and differences between CS.conj@x() and CS.merge@x(x). Firstly, let’s look at the case of simple union.

There are four text files respectively recording the order information about wines, electrical appliances, foods, and books. In A6, the four file cursors are concatenated. To find out the order in which the data are retrieved, the following code retrieves 300 records each time and data retrieval is terminated once the record of a different kind of goods appear. The result table sequence in B7 is as follows:

As can be learned from the result, regarding the unioned cursor, after all wine order data are retrieved from the 1^st text data table, the electrical appliance data from the 2^nd text data table follows. In other words, after the simple union by using CS.conj@x() function, records in the result cursor will be retrieved in the same order as each cursor is arranged in the cursor sequence CS.

In most cases, we need more than just the union of the records from each data table one after another. We want to merge them in certain order we need. To serve this purpose, CS.merge@x(x) function can be used. Please note that the records in each cursor of the cursor sequence CS must be already sorted according to expression x when using this function. For example, sort and merge the order data of various products by the sales date:

In this case, we intend to have a clear view of the order in which the records are retrieved from the cursor after they are merged in certain order. In order to do this, only the first 300 entries are retrieved. The table sequence in B7 is shown below:

It can be seen that the data are retrieved by Date in a specified order . Once all wine order data of January 1^st is retrieved, the retrieval of all electrical appliance order data of the same day will start. Because retrieving data from the cursor is a forward-only operation, the order data in each cursor must have been ordered by the date. After data of multiple cursors are merged using function CS.merge@x() in the specified order, the result data will be retrieved from the grand cursor according to the order of one of the member cursors of sequence CS, which is selected by comparing the current expression values of the data tables. In this way, we can ultimately get the result arranged in the specified order. In data retrieving, each cursor will still traverse the records in each data table for once.

Merging the data in multiple cursors in a certain order simply creates a single grand cursor, and adjusts the orders in which data are retrieved from each cursor, without increasing or decreasing any records.

If the cursor data are not ordered, they must be sorted before merging, as illustrated in the following example:

Before cursors are merged in proper order by the product number, we must ensure the data in each cursor are ordered against the product number. To do so, A5 uses cs.sortx() function to sort cursors of various products.

Please note that the cursor and the table sequence are sorted differently. Because there is usually a great amount of data in the cursor, they cannot be loaded into the memory all at once for sorting. Therefore, the data retrieval will be performed along with the data sorting. The sorted data will be saved as a temporary data file when they are accumulated to a certain amount. Once all data are retrieved and sorted, all temporary data files will be merged in proper order and returned as the cursor. For more details about sorting cursor data, please see esProc External MemorySorting: Principle of Sorting.

In B7, the retrieved records are shown below:

It can be seen that the ordered merging is accomplished once the data in each cursor have been sorted.

2. Aligned joining

When making the statistics, sometimes we need to consolidate the data from multiple cursors, which is similar to joining the data from multiple tables together. If the data in cursor need to point to a normal table sequence, then we can use cs.switch().

What if the data to be joined are all from the cursor? As we know, it is usually impossible to retrieve all data from the cursor. How can we join these data then? In esProc, we can use join@x() function to join the data from multiple cursors together. For example:

In A5~A8, perform aggregate operations over each kind of product, and return the cursor of respective temporary files (Usage of groupx is explained in detail in esProc ExternalMemory Computing: Principle of Grouping). In A9, the daily sales data of each kind of products will be aligned and joined by the date. A10 fetches the statistical results of the first 25 days, as shown below:

After cursors are joined in alignment, a grand cursor will be returned, from which the fetched result is similar to the result of joining table sequences together, that is, every field in every row is composed of the record. Thus we must note that the joined data take up more memory than those in the normal cases. In addition, since data are composed of records instead of values, please note the writing form of the expression, particularly when re-joining the result cursors used for computation.

The result of aligned joining of cursors can also be reused for filtering or generating data, for example:

From the joined cursor, A10 selects records in which the total amount of food orders is greater than that of wine orders. Then A11 generates the result cursor. A12 returns the first 100 records:

In performing the aligned joining of cursors, we must remember that the data in the cursor cannot be fetched into and store in the memory. They can only be traversed once from the first to the last. Therefore the data in each cursor must be sorted for aligned joining, which is different from handling the database multi-table join and the join() operation for normal table sequences. As shown in the above example, the data in A5~A8 are already sorted by date to ensure a correct computation.

In order to better explain this point, we create a cursor using two in-memory table sequences in which data are not big:

The table sequences in A1 and A2 are shown below:

In A5, we will see the aligned joining result:

The cursor data are different to those in the normal table sequence. During the operation of aligned joining, when looking for the New York state corresponding to the New York city, the cursor for the State data has already moved to entry 32 and the records before it are unavailable for later computations. So most cities cannot find their corresponding states. Furthermore, options @1 and @a are not used in function join@x() to specify the left join or full join. Only data of few cities finding out corresponding states are returned.

But if having the city data sorted first, we can have the normal result:

In A1, the data are sorted by STATEID:

In A6, we can see the joining result:

The joining of data from multiple cursors doesn’t require the same cursor type. For example:

A5 joins personnel data with the corresponding state data. What’s important is that the cursor data have been sorted. A6 fetches the first 100 rows of data from A5’s cursor:

In fact the aim of joining data from different cursors is usually to get the related detail information from them. In the above example, there’s no need to join records from the two cursors together if the purpose is only to get the full names of the states where employees come from. Because the state data actually come from a table sequence, cs.switch() can be used to convert a certain field of cursor cs into the corresponding records of the table sequence. There are explanations about the function’s usage in esProc External Memory Computing: BasicUsages of the Cursor. Difference between join@x() and cs.switch() is that the former creates a mutual reference between different cursors, whereas the latter establishes a relationship, through the foreign key, between one or more certain fields of a cursor and a table sequence/sequence, which is regarded as a dimension table in the process. join@x() for joining cursor data requires the data be sorted by the field according to which they are joined. cs.switch() hasn’t such a requirement as the dimension table in use can be referenced directly from the memory.

Sometimes, in a foreign-key relationship, the composite foreign key is needed to reference certain data of a table sequence or a cursor. In this case cs.join() can be used to append fields to the cursor. For example:

A4 creates a table sequence based on the month and the states’ abbreviations and A5 inserts detailed data into it. Then A4’s table sequence is as follows:

A6 appends fields to A2’s cursor using cs.join(). Equally, the Month field holding employees’ birthday information and the State field of A2’s cursor are joined with the Month and State field of A4’s cursor. Then fetch GroupID field from A4’s cursor and append it to State field of A2’s cursor after the joining. A7 fetches the first 100 records from A6’s cursor:

With cs.join(), we can join not only the cursor, but the table sequence, with the field in another cursor. Different from join@x(),cs.join() doesn’t require data in the cursor cs be sorted. But if the dimension table in use comes from the cursor, its data must be fetched into the memory all at one. With the requirement of joining two fields of the cursor cs and the dimension table, this example cannot be handled with cs.switch().

3 Unconditional joining

In the example in the previous section, the statistical data of orders of different products are joined by the dates. In reality the number of rows of data of every kind of order is always the same and so is the data’s order in terms of the date. Thus we don’t have to specify a criterion for sorting. We simply join the data together using pjoin function, without any condition.