esProc supports importing big data in batches with the cursor, which
is the usual method used in big data computing. Usages of cursors, including external
file cursor, database cursor and in-memory record sequence cursor, are
basically the same. This article will take the external file cursor as the
example to explain basic usages of the cursor in esProc.
1. Data fetching
When a cursor is created, its data can be fetched by cs.fecth()
function. Different from the way data of an ordinary table are imported, data in
the cursor are fetched by specifying the number of rows to be returned, or one
or more conditions on which data will be fetched, rather than being fetched all
at once. For example:
A2 creates an external file cursor. A3 fetches the first 1,000 rows
from it, as shown below:With fetch function, each batch of data fetched from the cursor will be returned as a table sequence. A4 calls cs.close() function to close the cursor. When a cursor is closed, data cannot be fetched from it. That is why the result of A5 is null:
When the cursor has accomplished its mission and is not needed any
more, it ought to be closed through calling cs.close() function to release the memory.
A2 creates the cursor and A3 skips 45,000 rows. Then A4 fetches data from the cursor as follows:
As can be seen from the transaction ID, the data fetching begins from
the 45001th row. Notice that the fetch
function in A4 doesn’t specify the number of rows to be returned. In this case,
all the rest of the data in the cursor will be fetched. Once data in the cursor
are completely fetched, the cursor closes
automatically, without the necessity of calling cs.close() function.
In fact, both cs.skip() function and cs.fecth()
function will only traverse the cursor data once from front to back. When the data fetching is over or cs.close() function is executed, the cursor becomes
useless.
A2 creates the cursor. A3 fetches data from it until the value of Date changes. The fetched data are as follows:
It can be seen that A3 fetches all data whose Date value is 2013-01-01. When A4 goes on with fetching data from A2’s cursor, it will begin from the date 2013-01-02:
Note that we cannot specify both the number of rows to be returned
and the condition for data fetching at the same time; otherwise only the former
is valid. Besides, since not all the data in A2’s cursor have been fetched, the
cursor needs to be closed in A5 after doing its job.
A2 creates the cursor. A3, A4 and A5 respectively skip records of one day. Then A6 will fetch data from the cursor from the fourth day:
Similarly, if not all the data have been fetched from the cursor,
call cs.close() function in A7 to close the cursor.
2. Cursor loop
When fetching data from the cursor, use for statement to process the rows returned from the cursor by loop. For example:A2 creates the cursor. By loop, A3 processes 1,000 records returned by the cursor each time. After all records are traversed, the cursor closes automatically. In A3’s code block, based on the table sequence returned by the cursor each time, the code computes the total sales amount of the salesperson whose ID number is 1 and stores the result in B2. The value is shown as follows:
Sometimes not all data need to be traversed during the cursor loop. For example:
A2 creates the cursor and A3 executes cursor loop, fetching 1,000 records each time. In A3’s code block, according to the batches of data fetched from the cursor, the code selects the sale records of the salesperson whose ID number is 1 and stores them in B2. Exit the loop after the first 5 records are fetched. At this point the result seen in B2 is as follows:
One point worthy of special note: When the computation is over and
the loop is terminated by break
function, there are data still remaining in the cursor. Different from the
previous example, in this case the cursor won’t close automatically, thus close function is called in A6 to close
it and release the memory.
With the condition, A3 fetches the sales data of one day each time, computes the number of transactions and the total sales amount in this day in its code block and stores the results in B2’s table sequence. After the code is executed, B2’s result table sequence is as follows:
3. Computation with the cursor
Besides data fetching, esProc offers the use of cursor in performing common forms of data computing by directly using certain functions. The following example is to select data from the cursor according to the condition:B1 creates the external file cursor and A2 selects the sales data of the salesperson whose ID number is 1 from B1’s cursor. A2’s result is as follows:
It can be seen that, different from the table sequence or the record sequence, the select function used with the cursor returns the cursor as the result. Actually A2 hasn’t really got data from the cursor. It is in B2 where cs.fecth() function is being executed that the eligible data are fetched from B1’s cursor according to the condition. Result of B2 is as follows:
No number of rows to be returned or the condition for data fetching
is specified for cs.fecth()
function used in B2. All data, or all the sales data of the salesperson, in A2’s
cursor will be returned. After data of the cursor in A2 have all been fetched
out, the cursor will close automatically. Since the cursor in A2 is generated
from the one in B1, the data query is in fact executed by traversing data of
the cursor in B1 and the completion of fetching data from the cursor in A2
means the same thing to the cursor in B1, which will close at the same time.
Therefore the effort of A3 and B3 trying to fetch data from the cursors in
either A2 or B1 will definitely fail.
B1 creates a file cursor using some of the fields of the external file employee.txt. In A2, the employees’ full names are generated by joining NAME field and SURNAME field in B1’s cursor together. B2 computes each employee’s age according to the BIRTHDAY field in B1’s cursor and insert the results to Age field. Similar to cs.select(), both cs.new() and cs.derive() return cursor as the result, rather than returning data directly. When executed, A2 and B2 have the following results:
A3 fetches the first 100 rows from A2’s cursor:
Generating data with cs.new() function
means executing new statement on every record.
B3 fetches the first 100 rows from B2’s cursor:
Generating data with cs.derive() function
means appending computed columns into the returned table sequence.
It can be seen that B3’s data begin not from the first employee, but
from the 101th one. That’s because fetching data from cursors in A2 and B2 is,
in reality, fetching data from the cursor in B1 and cursor data will be
traversed only once from front to back. After A3 is executed, the first 100
rows of data in B1’s cursor have been traversed. So B3 can only go on with the
traversal in order in the execution of cs.fecth() function by beginning with the 101th employee.
Because data haven’t been all fetched out, the cursors need to be
closed using cs.close() function. As A4 closes the cursor
in A2, the cursor in B1 is also closed simultaneously, for the former is
generated by the latter. This affects the cursor in B2, from which B4 is trying
to fetch data but to no avail.
In practical use, we ought to
avoid using the same cursor to generate multiple ones as possibly as we can;
the same problem as the one with B3 in fetching data, or as another one that
cursors will be unavailable when the source cursor closes, will arise otherwise.
B2 replaces names of the states in the returned data by their abbreviations and still returns the cursor as the result:
A3 fetches the first 100 rows as follows:
Similar to a table sequence, the cursor allows the conversion of a certain field of its data into the corresponding records in another table sequence by using cs.switch() function. For example:
B2 uses cs.switch() to switch certain data of the cursor. Its result is also a cursor:
The data A3 fetches are as follows:
It can be seen that, in the return result, STATE field has been converted into the corresponding state
records.
B3 switches records of several fields of the cursor to the corresponding ones in another table sequence. A4 fetches the first 100 records:
In the article – RelatedComputing in esProc – Primary Keys and Index Function, we explained the
usage of switch in the ordinary
table sequence, which is similar to that of cs.switch() in external memory computing. Both functions will create
corresponding index table for certain field(s) of the dimension table in order
to increase efficiency. It is more important to adopt this processing method while
using the cursor because the data under processing are usually big.
No comments:
Post a Comment