2/03/2015

esProc External Memory Computing: Merge and Join Cursor Data

During data computing based on the table sequence, we can combine data from multiple table sequences together for use in the analysis and computation. For instance, use A.merge() to combine records of multiple table sequences in a certain order, or A.conj() to union them in order into a grand table, or JOIN functions, like join, pjoin and xjoin, to establish certain relationship between multiple table sequences by joining their data together and perform query or other computations based on this relation.

Cursors for processing big data can be merged and joined likewise. This article aims to teach the method of merging and joining data of the cursor with functions including CS.conj@x(), CS.merge@x(x) and join@x(), etc. 

1. Merge data in order

Most of the time, data could be stored in multiple data tables, for example, sales records of multiple products and employee profiles of each department. In this case, we need to merge the data from multiple data tables together for a better use. For multiple normal homogeneous table sequences, A.conj() or A.merge(x) can be used to merge the records of these table sequence into a record sequence. For big data, use CS.conj@x() and CS.merge@x(x) to combine data in every cursor of cursor sequence CS, thus the data can be fetched as a whole.

Data in the cursor will only be traversed once, so it is impossible to re-sort them after they are merged and fetched out. In view of this, the data in each cursor must be ordered in the case of merging data from multiple cursors.

Now let’s learn about the usages of and differences between CS.conj@x() and CS.merge@x(x). Firstly, let’s look at the case of simple union.

There are four text files respectively recording the order information about wines, electrical appliances, foods, and books. In A6, the four file cursors are concatenated. To find out the order in which the data are retrieved, the following code retrieves 300 records each time and data retrieval is terminated once the record of a different kind of goods appear. The result table sequence in B7 is as follows:

As can be learned from the result, regarding the unioned cursor, after all wine order data are retrieved from the 1st text data table, the electrical appliance data from the 2nd text data table follows. In other words, after the simple union by using CS.conj@x() function, records in the result cursor will be retrieved in the same order as each cursor is arranged in the cursor sequence CS.

In most cases, we need more than just the union of the records from each data table one after another. We want to merge them in certain order we need. To serve this purpose, CS.merge@x(x) function can be used. Please note that the records in each cursor of the cursor sequence CS must be already sorted according to expression x when using this function. For example, sort and merge the order data of various products by the sales date:

In this case, we intend to have a clear view of the order in which the records are retrieved from the cursor after they are merged in certain order. In order to do this, only the first 300 entries are retrieved. The table sequence in B7 is shown below:

It can be seen that the data are retrieved by Date in a specified order . Once all wine order data of January 1st is retrieved, the retrieval of all electrical appliance order data of the same day will start. Because retrieving data from the cursor is a forward-only operation, the order data in each cursor must have been ordered by the date. After data of multiple cursors are merged using function CS.merge@x() in the specified order, the result data will be retrieved from the grand cursor according to the order of one of the member cursors of sequence CS, which is selected by comparing the current expression values of the data tables. In this way, we can ultimately get the result arranged in the specified order. In data retrieving, each cursor will still traverse the records in each data table for once.

Merging the data in multiple cursors in a certain order simply creates a single grand cursor, and adjusts the orders in which data are retrieved from each cursor, without increasing or decreasing any records.

If the cursor data are not ordered, they must be sorted before merging, as illustrated in the following example:

Before cursors are merged in proper order by the product number, we must ensure the data in each cursor are ordered against the product number. To do so, A5 uses cs.sortx() function to sort cursors of various products.

Please note that the cursor and the table sequence are sorted differently. Because there is usually a great amount of data in the cursor, they cannot be loaded into the memory all at once for sorting. Therefore, the data retrieval will be performed along with the data sorting. The sorted data will be saved as a temporary data file when they are accumulated to a certain amount. Once all data are retrieved and sorted, all temporary data files will be merged in proper order and returned as the cursor. For more details about sorting cursor data, please see esProc External MemorySorting: Principle of Sorting.

In B7, the retrieved records are shown below:

It can be seen that the ordered merging is accomplished once the data in each cursor have been sorted.

2. Aligned joining

When making the statistics, sometimes we need to consolidate the data from multiple cursors, which is similar to joining the data from multiple tables together. If the data in cursor need to point to a normal table sequence, then we can use cs.switch().

What if the data to be joined are all from the cursor? As we know, it is usually impossible to retrieve all data from the cursor. How can we join these data then? In esProc, we can use join@x() function to join the data from multiple cursors together. For example:

In A5~A8, perform aggregate operations over each kind of product, and return the cursor of respective temporary files (Usage of groupx is explained in detail in esProc ExternalMemory Computing: Principle of Grouping). In A9, the daily sales data of each kind of products will be aligned and joined by the date. A10 fetches the statistical results of the first 25 days, as shown below:

After cursors are joined in alignment, a grand cursor will be returned, from which the fetched result is similar to the result of joining table sequences together, that is, every field in every row is composed of the record. Thus we must note that the joined data take up more memory than those in the normal cases. In addition, since data are composed of records instead of values, please note the writing form of the expression, particularly when re-joining the result cursors used for computation.

The result of aligned joining of cursors can also be reused for filtering or generating data, for example:

From the joined cursor, A10 selects records in which the total amount of food orders is greater than that of wine orders. Then A11 generates the result cursor. A12 returns the first 100 records:

In performing the aligned joining of cursors, we must remember that the data in the cursor cannot be fetched into and store in the memory. They can only be traversed once from the first to the last. Therefore the data in each cursor must be sorted for aligned joining, which is different from handling the database multi-table join and the join() operation for normal table sequences. As shown in the above example, the data in A5~A8 are already sorted by date to ensure a correct computation.

In order to better explain this point, we create a cursor using two in-memory table sequences in which data are not big:

The table sequences in A1 and A2 are shown below:

In A5, we will see the aligned joining result:
The cursor data are different to those in the normal table sequence. During the operation of aligned joining, when looking for the New York state corresponding to the New York city, the cursor for the State data has already moved to entry 32 and the records before it are unavailable for later computations. So most cities cannot find their corresponding states. Furthermore, options @1 and @a are not used in function join@x() to specify the left join or full join. Only data of few cities finding out corresponding states are returned.

But if having the city data sorted first, we can have the normal result:

In A1, the data are sorted by STATEID:

In A6, we can see the joining result:

The joining of data from multiple cursors doesn’t require the same cursor type. For example:

A5 joins personnel data with the corresponding state data. What’s important is that the cursor data have been sorted. A6 fetches the first 100 rows of data from A5’s cursor:

In fact the aim of joining data from different cursors is usually to get the related detail information from them. In the above example, there’s no need to join records from the two cursors together if the purpose is only to get the full names of the states where employees come from. Because the state data actually come from a table sequence, cs.switch() can be used to convert a certain field of cursor cs into the corresponding records of the table sequence. There are explanations about the function’s usage in esProc External Memory Computing: BasicUsages of the Cursor. Difference between join@x() and cs.switch() is that the former creates a mutual reference between different cursors, whereas the latter establishes a relationship, through the foreign key, between one or more certain fields of a cursor and a table sequence/sequence, which is regarded as a dimension table in the process. join@x() for joining cursor data requires the data be sorted by the field according to which they are joined. cs.switch() hasn’t such a requirement as the dimension table in use can be referenced directly from the memory.
Sometimes, in a foreign-key relationship, the composite foreign key is needed to reference certain data of a table sequence or a cursor. In this case cs.join() can be used to append fields to the cursor. For example: 

A4 creates a table sequence based on the month and the states’ abbreviations and A5 inserts detailed data into it. Then A4’s table sequence is as follows:

A6 appends fields to A2’s cursor using cs.join(). Equally, the Month field holding employees’ birthday information and the State field of A2’s cursor are joined with the Month and State field of A4’s cursor. Then fetch GroupID field from A4’s cursor and append it to State field of A2’s cursor after the joining. A7 fetches the first 100 records from A6’s cursor:

With cs.join(), we can join not only the cursor, but the table sequence, with the field in another cursor. Different from join@x(),cs.join() doesn’t require data in the cursor cs be sorted. But if the dimension table in use comes from the cursor, its data must be fetched into the memory all at one. With the requirement of joining two fields of the cursor cs and the dimension table, this example cannot be handled with cs.switch(). 

3 Unconditional joining

In the example in the previous section, the statistical data of orders of different products are joined by the dates. In reality the number of rows of data of every kind of order is always the same and so is the data’s order in terms of the date. Thus we don’t have to specify a criterion for sorting. We simply join the data together using pjoin function, without any condition.

A9 joins cursors of sales data of four products together, without specified order. A10 fetches the first 100 records:

We can see from the result the way fields from different cursors are joined. The result can be rearranged as follows:

A11 fetches the first 100 records:

No comments:

Post a Comment