During data computing based on the table sequence, we can combine
data from multiple table sequences together for use in the analysis and computation.
For instance, use A.merge() to
combine records of multiple table sequences in a certain order, or A.conj() to union them in order into a
grand table, or JOIN functions, like join, pjoin and xjoin, to establish certain
relationship between multiple table sequences by joining their data together
and perform query or other computations based on this relation.
Cursors for processing big data can be merged and joined likewise. This article aims to teach the method of merging and joining data of the cursor with functions including CS.conj@x(), CS.merge@x(x) and join@x(), etc.
1. Merge data in order
Most of the time, data could be stored in multiple data tables, for
example, sales records of multiple products and employee profiles of each
department. In this case, we need to merge the data from multiple data tables together
for a better use. For multiple normal homogeneous table
sequences, A.conj() or A.merge(x) can be used to
merge the records of these table sequence into a record sequence. For big data, use CS.conj@x() and CS.merge@x(x) to combine data in every cursor of cursor sequence CS, thus the data can be fetched
as a whole.
Data in the cursor will only be traversed once, so it is impossible
to re-sort them after they are merged and fetched out. In view of this, the
data in each cursor must be ordered in the case of merging data from multiple cursors.
Now let’s learn about the usages of and differences between CS.conj@x() and CS.merge@x(x). Firstly, let’s
look at the case of simple union.
There are four text files respectively recording the order information about wines, electrical appliances, foods, and books. In A6, the four file cursors are concatenated. To find out the order in which the data are retrieved, the following code retrieves 300 records each time and data retrieval is terminated once the record of a different kind of goods appear. The result table sequence in B7 is as follows:
As can be learned from the result, regarding the unioned cursor,
after all wine order data are
retrieved from the 1st text data table, the electrical appliance data from the 2nd
text data table follows. In other words, after the simple union by using CS.conj@x() function,
records in the result cursor will be retrieved in the same order as each cursor
is arranged in the cursor sequence CS.
In this case, we intend to have a clear view of the order in which the records are retrieved from the cursor after they are merged in certain order. In order to do this, only the first 300 entries are retrieved. The table sequence in B7 is shown below:
It can be seen that the data are retrieved by Date in a specified
order . Once all wine order data of January 1st
is retrieved, the retrieval of all electrical appliance order data of the same
day will start. Because retrieving data from the cursor is a forward-only operation,
the order data in each cursor must have been ordered by the date. After data of
multiple cursors are merged using function CS.merge@x() in the specified order, the result data will be retrieved from the grand
cursor according to the order of one of the member cursors of sequence CS, which is selected by comparing the
current expression values of the data tables. In this
way, we can ultimately get the result arranged in the specified order. In data
retrieving, each cursor will still traverse the records in each data table for
once.
Merging the data in multiple cursors in a certain order simply creates
a single grand cursor, and adjusts the orders in which data are retrieved from each
cursor, without increasing or decreasing any records.
Before cursors are merged in proper order by the product number, we
must ensure the data in each cursor are ordered against the product number. To
do so, A5 uses cs.sortx() function to sort cursors of various products.
Please note that the cursor and the table sequence are sorted
differently. Because there is usually a great amount of data in the cursor, they
cannot be loaded into the memory all at once for sorting. Therefore, the data retrieval
will be performed along with the data sorting. The sorted data will be saved as
a temporary data file when they are accumulated to a certain amount. Once all
data are retrieved and sorted, all temporary data files will be merged in
proper order and returned as the cursor. For more details about sorting cursor
data, please see esProc External MemorySorting: Principle of Sorting.
It can be seen that the ordered merging is accomplished once the
data in each cursor have been sorted.
2. Aligned joining
When making the statistics, sometimes we need to consolidate the
data from multiple cursors, which is similar to joining the data from multiple
tables together. If the data in cursor need to point to a normal table sequence,
then we can use cs.switch().
In A5~A8, perform aggregate operations over each kind of product, and return the cursor of respective temporary files (Usage of groupx is explained in detail in esProc ExternalMemory Computing: Principle of Grouping). In A9, the daily sales data of each kind of products will be aligned and joined by the date. A10 fetches the statistical results of the first 25 days, as shown below:
After cursors are joined in alignment, a grand cursor will be
returned, from which the fetched result is similar to the result of joining table
sequences together, that is, every field in every row is
composed of the record. Thus we must note that the joined data take up more
memory than those in the normal cases. In addition,
since data are composed of records instead of values, please note the writing form of the expression, particularly when
re-joining the result cursors used for computation.
From the joined cursor, A10 selects records in which the total amount of food orders is greater than that of wine orders. Then A11 generates the result cursor. A12 returns the first 100 records:
In performing the aligned joining of cursors, we must remember that
the data in the cursor cannot be fetched into and store in the memory. They can
only be traversed once from the first to the last. Therefore
the data in each cursor must be sorted for aligned
joining, which is different from handling the database multi-table join
and the join() operation for normal table sequences. As shown in the above
example, the data in A5~A8 are already sorted by date
to ensure a correct computation.
The table sequences in A1 and A2 are shown below:
In A5, we will see the aligned joining result:
The cursor data are different to those in the normal table sequence.
During the operation of aligned joining, when looking for the New York state
corresponding to the New York city, the cursor for the State data has already
moved to entry 32 and the records before it are unavailable
for later computations. So most cities cannot find their corresponding states. Furthermore,
options @1 and @a are not used in function join@x() to specify the left join or full join. Only data of few cities finding
out corresponding states are returned.
In A1, the data are sorted by STATEID:
In A6, we can see the joining result:
The joining of data from multiple cursors doesn’t require the same cursor type. For example:
A5 joins personnel data with the corresponding state data. What’s important is that the cursor data have been sorted. A6 fetches the first 100 rows of data from A5’s cursor:
In fact the aim of joining data from different cursors is usually to
get the related detail information from them. In the above example, there’s no
need to join records from the two cursors together if the purpose is only to
get the full names of the states where employees come from. Because the state
data actually come from a table sequence, cs.switch() can be used to convert a
certain field of cursor cs into the corresponding records of
the table sequence. There are explanations about the function’s usage in esProc External Memory Computing: BasicUsages of the Cursor. Difference between join@x() and cs.switch() is that the former
creates a mutual reference between different cursors, whereas the latter establishes
a relationship, through the foreign key, between one or more certain fields of
a cursor and a table sequence/sequence, which is regarded as a dimension table
in the process. join@x() for joining
cursor data requires the data be sorted by the field according to which they
are joined. cs.switch() hasn’t such a requirement as the dimension table in
use can be referenced directly from the memory.
Sometimes, in a foreign-key relationship, the composite foreign key is needed to reference certain data of a table
sequence or a cursor. In this case cs.join() can
be used to append fields to the
cursor. For example:
A4 creates a table sequence based on the month and the states’ abbreviations and A5 inserts detailed data into it. Then A4’s table sequence is as follows:
A6 appends fields to A2’s cursor using cs.join(). Equally, the Month field holding employees’ birthday information and the State field of A2’s cursor are joined with the Month and State field of A4’s cursor. Then fetch GroupID field from A4’s cursor and append it to State field of A2’s cursor after the joining. A7 fetches the first 100 records from A6’s cursor:
With cs.join(), we can join not only the cursor, but the
table sequence, with the field in another cursor. Different from join@x(),cs.join() doesn’t
require data in the cursor cs be sorted. But if the dimension
table in use comes from the cursor, its data must be fetched into the memory
all at one. With the requirement of joining two fields of the cursor cs
and the dimension table, this example cannot be handled with cs.switch().
3 Unconditional joining
In the example in the previous section, the statistical data of orders of different products are joined by the dates. In reality the number of rows of data of every kind of order is always the same and so is the data’s order in terms of the date. Thus we don’t have to specify a criterion for sorting. We simply join the data together using pjoin function, without any condition.A9 joins cursors of sales data of four products together, without specified order. A10 fetches the first 100 records:
We can see from the result the way fields from different cursors are joined. The result can be rearranged as follows:
A11 fetches the first 100 records:
No comments:
Post a Comment