1/29/2015

esProc External Memory Computing: Concept of Cursor

The concept of cursor is very important for the database. With the cursor, data can be manipulated more flexibly and returned from the data table by rows. esProc supports many types of cursor, like database cursor, file cursor and in-memory record sequence cursor, to satisfy various needs in data fetching and processing. In this article, we’ll introduce what the cursor used in esProc is and its three types.

1. What is cursor

Using the cursor in the database enables returning the result in batches, instead of trying to retrieve all of it into the memory at a time. The cursor is like a pointer, it fetches one row of data each time from the result set by moving its own position. This mechanism of retrieving data by rows is usually used to process big data table, which cannot be entirely loaded into the limited memory.

Compared with directly returning all result, the database cursor is flexible enough to avoid the memory overflow but its speed of processing is not very fast.

There are some differences, however, between esProc cursor and database cursor:
1)  esProc cursor is only used to get data without modifying the result set.
2)  Data are traversed only once from front to back during fetching data from the esProc cursor, which is similar to the setting of TYPE_FORWARD_ONLY in JDBC.
3)   esProc cursor can be created based on either the database or the data file, or the in-memory record sequence.
Different types of esProc cursors have similar usages and we can mix and match them to suit the requirements.

2. Database cursor


With db.cursor(sql) function, data returned from the database can be converted into the cursor. For example: 

A1 creates a database cursor. A2 fetches the first 100 records from it, as shown below:

Data fetched from the cursor with cs.fetch() will be returned as a table sequence.

A3 will follow on to fetch another 100 records in order, for esProc allows only a single traversal of cursor data from front to back:

As can be seen from EID field, A3 fetches the records from the 101th employee. In this example, cs.fetch() is used merely to fetch data from the cursor. For more usages of the esProc cursor, please refer to esProc External Memory Computing: Basic Usages of the Cursor.

If the data in the cursor haven’t been all fetched out when data fetching is over, use cs.close() to close the cursor. It is very important to close the cursor, particularly the database cursor, timely. Because the database cursor gets data from the database, the existence of it means the connection to the database will remain, causing unnecessary memory usage and even the overrun of cursors allowed to be created by the database.

Similar to the use of db.query() function in data query, the parameter is allowed in creating a database cursor. For example:

In SQL statements, parameters can be represented by the question mark (?). They are entered in a certain order after it and separated by the comma. A1’s cursor will return data of female employees whose EID is greater than 100. A2 returns all data of A1’s cursor, as shown below:

The number of rows of data to be returned is not specified for the cs.fetch() function used in A2, thus all data in the cursor will be returned. When data in a cursor have all been fetched out, it will close automatically with no need of calling cs.close().

3. External file cursor

In practice, particularly big data processing, data are often generated from the file rather than the database. Given this, esProc provides the function of creating the external file cursor for data fetching based on the structured file. The following text file PersonnelInfo.txt holds the personnel information:

In the file, each row holds a record and columns are separated by tabs. The external file cursor can be created according to the file for data fetching. For example:

A2 creates a cursor based on the external file and uses @t option to make the file’s first row the column names of the result set. A3 skips the first 50,000 rows and then A4 fetches 100 rows from A2’s cursor, as shown below:

As the database cursor, data are fetched from the external file cursor through a single one-way traversal. After A3 skips certain rows using cs.skip(), the data fetching in A4 will begin from the 50,001th row.

While creating a cursor based on a certain file, it is not necessary to use all its fields. We can specify fields to create the cursor, for example:

The code in the above cellset is almost the same as that the previous one, except that A2 creates the external file cursor using specified fields.
Note: It’s only after field names are imported using @t option that they can be used to specify fields. Without the option, fields can only be specified directly by their positions, like =A1.cursor(#1, #2, #6, #3).

About the external file cursor, we’ll make a deeper explanation in esProc External Memory Computing: Text Files.

4. In-memory record sequence cursor

In esProc, the result table sequence or record sequence can be converted into the cursor to be used with other types of cursor. For example:

The following is data of the states A1 selects:

A2 creates a cursor using the table sequence:

Generally the in-memory record sequence is converted into the cursor to perform some operations based on the cursor, like merging and joining the cursor data. For details in this respect, please refer to esProc External Memory Computing: Merge and Join Cursor Data.

Besides, the cursor can be created based on the result table sequence or record sequence got in another cellset file. Take cellset file createTable.dfx as an example:

The result of the cellset file returned by the result statement can be converted into a cursor, as shown below:

A2 fetches the first 100 rows from the A1’s cursor:
This is only a simple conversion of a result table sequence got in another cellset file into the in-memory record sequence cursor, which is a kind of cross-cellset cursor in this case. More usages of the cross-cellset cursor will be illustrated in the esProc External Memory Computing: Cross-cellset Cursor

No comments:

Post a Comment