Data Process Engine: esProc External Memory Computing: Basic Usages of the Cursor

esProc supports importing big data in batches with the cursor, which is the usual method used in big data computing. Usages of cursors, including external file cursor, database cursor and in-memory record sequence cursor, are basically the same. This article will take the external file cursor as the example to explain basic usages of the cursor in esProc.

1. Data fetching

When a cursor is created, its data can be fetched by cs.fecth() function. Different from the way data of an ordinary table are imported, data in the cursor are fetched by specifying the number of rows to be returned, or one or more conditions on which data will be fetched, rather than being fetched all at once. For example:

A2 creates an external file cursor. A3 fetches the first 1,000 rows from it, as shown below:

With fetch function, each batch of data fetched from the cursor will be returned as a table sequence. A4 calls cs.close() function to close the cursor. When a cursor is closed, data cannot be fetched from it. That is why the result of A5 is null:

When the cursor has accomplished its mission and is not needed any more, it ought to be closed through calling cs.close() function to release the memory.

Besides fetching data from the file or the database result set with the cursor, cs.skip() function can be used to skip the specified number of rows in the cursor. For example:

A2 creates the cursor and A3 skips 45,000 rows. Then A4 fetches data from the cursor as follows:

As can be seen from the transaction ID, the data fetching begins from the 45001th row. Notice that the fetch function in A4 doesn’t specify the number of rows to be returned. In this case, all the rest of the data in the cursor will be fetched. Once data in the cursor are completely fetched, the cursor closes automatically, without the necessity of calling cs.close() function.

So A5 fetches nothing from A2’s cursor:

In fact, both cs.skip() function and cs.fecth() function will only traverse the cursor data once from front to back. When the data fetching is over or cs.close() function is executed, the cursor becomes useless.

Another type of fetching data from the cursor using cs.fecth() function is cs.fetch(;x), which specifies the condition on which data fetching will continue interruptedly until the value of expression x changes. For example:

A2 creates the cursor. A3 fetches data from it until the value of Date changes. The fetched data are as follows:

It can be seen that A3 fetches all data whose Date value is 2013-01-01. When A4 goes on with fetching data from A2’s cursor, it will begin from the date 2013-01-02:

Note that we cannot specify both the number of rows to be returned and the condition for data fetching at the same time; otherwise only the former is valid. Besides, since not all the data in A2’s cursor have been fetched, the cursor needs to be closed in A5 after doing its job.

Similarly, we can also specify condition for cs.skip() function to skip certain records. For example:

A2 creates the cursor. A3, A4 and A5 respectively skip records of one day. Then A6 will fetch data from the cursor from the fourth day:

Similarly, if not all the data have been fetched from the cursor, call cs.close() function in A7 to close the cursor.

2. Cursor loop

When fetching data from the cursor, use for statement to process the rows returned from the cursor by loop. For example:

A2 creates the cursor. By loop, A3 processes 1,000 records returned by the cursor each time. After all records are traversed, the cursor closes automatically. In A3’s code block, based on the table sequence returned by the cursor each time, the code computes the total sales amount of the salesperson whose ID number is 1 and stores the result in B2. The value is shown as follows:

Sometimes not all data need to be traversed during the cursor loop. For example:

A2 creates the cursor and A3 executes cursor loop, fetching 1,000 records each time. In A3’s code block, according to the batches of data fetched from the cursor, the code selects the sale records of the salesperson whose ID number is 1 and stores them in B2. Exit the loop after the first 5 records are fetched. At this point the result seen in B2 is as follows:

One point worthy of special note: When the computation is over and the loop is terminated by break function, there are data still remaining in the cursor. Different from the previous example, in this case the cursor won’t close automatically, thus close function is called in A6 to close it and release the memory.

Similar to cs.fecth() and cs.skip(), the condition can be specified for the cursor loop, so that the data fetching will stop as soon as the value of the conditional expression changes. For example:

With the condition, A3 fetches the sales data of one day each time, computes the number of transactions and the total sales amount in this day in its code block and stores the results in B2’s table sequence. After the code is executed, B2’s result table sequence is as follows:

3. Computation with the cursor

Besides data fetching, esProc offers the use of cursor in performing common forms of data computing by directly using certain functions. The following example is to select data from the cursor according to the condition:

B1 creates the external file cursor and A2 selects the sales data of the salesperson whose ID number is 1 from B1’s cursor. A2’s result is as follows:

It can be seen that, different from the table sequence or the record sequence, the select function used with the cursor returns the cursor as the result. Actually A2 hasn’t really got data from the cursor. It is in B2 where cs.fecth() function is being executed that the eligible data are fetched from B1’s cursor according to the condition. Result of B2 is as follows:

No number of rows to be returned or the condition for data fetching is specified for cs.fecth() function used in B2. All data, or all the sales data of the salesperson, in A2’s cursor will be returned. After data of the cursor in A2 have all been fetched out, the cursor will close automatically. Since the cursor in A2 is generated from the one in B1, the data query is in fact executed by traversing data of the cursor in B1 and the completion of fetching data from the cursor in A2 means the same thing to the cursor in B1, which will close at the same time. Therefore the effort of A3 and B3 trying to fetch data from the cursors in either A2 or B1 will definitely fail.

The cursor can also be used to create new records, or append new fields. For example:

B1 creates a file cursor using some of the fields of the external file employee.txt. In A2, the employees’ full names are generated by joining NAME field and SURNAME field in B1’s cursor together. B2 computes each employee’s age according to the BIRTHDAY field in B1’s cursor and insert the results to Age field. Similar to cs.select(), both cs.new() and cs.derive() return cursor as the result, rather than returning data directly. When executed, A2 and B2 have the following results:

A3 fetches the first 100 rows from A2’s cursor:

Generating data with cs.new() function means executing new statement on every record.

B3 fetches the first 100 rows from B2’s cursor:

Generating data with cs.derive() function means appending computed columns into the returned table sequence.

It can be seen that B3’s data begin not from the first employee, but from the 101th one. That’s because fetching data from cursors in A2 and B2 is, in reality, fetching data from the cursor in B1 and cursor data will be traversed only once from front to back. After A3 is executed, the first 100 rows of data in B1’s cursor have been traversed. So B3 can only go on with the traversal in order in the execution of cs.fecth() function by beginning with the 101th employee.

Because data haven’t been all fetched out, the cursors need to be closed using cs.close() function. As A4 closes the cursor in A2, the cursor in B1 is also closed simultaneously, for the former is generated by the latter. This affects the cursor in B2, from which B4 is trying to fetch data but to no avail.

In practical use, we ought to avoid using the same cursor to generate multiple ones as possibly as we can; the same problem as the one with B3 in fetching data, or as another one that cursors will be unavailable when the source cursor closes, will arise otherwise.

What’s more, cs.run() can be used to modify values of a certain field of cursor data. For example:

B2 replaces names of the states in the returned data by their abbreviations and still returns the cursor as the result:

A3 fetches the first 100 rows as follows:

Similar to a table sequence, the cursor allows the conversion of a certain field of its data into the corresponding records in another table sequence by using cs.switch() function. For example:

B2 uses cs.switch() to switch certain data of the cursor. Its result is also a cursor:

The data A3 fetches are as follows：

It can be seen that, in the return result, STATE field has been converted into the corresponding state records.

cs.switch() can convert more than one field of the cursor. For example:

B3 switches records of several fields of the cursor to the corresponding ones in another table sequence. A4 fetches the first 100 records:

In the article – RelatedComputing in esProc – Primary Keys and Index Function, we explained the usage of switch in the ordinary table sequence, which is similar to that of cs.switch() in external memory computing. Both functions will create corresponding index table for certain field(s) of the dimension table in order to increase efficiency. It is more important to adopt this processing method while using the cursor because the data under processing are usually big.

Data Process Engine

menu

2/01/2015

esProc External Memory Computing: Basic Usages of the Cursor

1. Data fetching

2. Cursor loop

3. Computation with the cursor

No comments:

Post a Comment