Data Process Engine: January 2015

1/29/2015

esProc External Memory Computing: Concept of Cursor

The concept of cursor is very important for the database. With the cursor, data can be manipulated more flexibly and returned from the data table by rows. esProc supports many types of cursor, like database cursor, file cursor and in-memory record sequence cursor, to satisfy various needs in data fetching and processing. In this article, we’ll introduce what the cursor used in esProc is and its three types.

1. What is cursor

Using the cursor in the database enables returning the result in batches, instead of trying to retrieve all of it into the memory at a time. The cursor is like a pointer, it fetches one row of data each time from the result set by moving its own position. This mechanism of retrieving data by rows is usually used to process big data table, which cannot be entirely loaded into the limited memory.

Compared with directly returning all result, the database cursor is flexible enough to avoid the memory overflow but its speed of processing is not very fast.

There are some differences, however, between esProc cursor and database cursor:

1) esProc cursor is only used to get data without modifying the result set.

2) Data are traversed only once from front to back during fetching data from the esProc cursor, which is similar to the setting of TYPE_FORWARD_ONLY in JDBC.

3) esProc cursor can be created based on either the database or the data file, or the in-memory record sequence.

Different types of esProc cursors have similar usages and we can mix and match them to suit the requirements.

2. Database cursor

With db.cursor(sql) function, data returned from the database can be converted into the cursor. For example:

A1 creates a database cursor. A2 fetches the first 100 records from it, as shown below:

Data fetched from the cursor with cs.fetch() will be returned as a table sequence.

A3 will follow on to fetch another 100 records in order, for esProc allows only a single traversal of cursor data from front to back:

As can be seen from EID field, A3 fetches the records from the 101th employee. In this example, cs.fetch() is used merely to fetch data from the cursor. For more usages of the esProc cursor, please refer to esProc External Memory Computing: Basic Usages of the Cursor.

If the data in the cursor haven’t been all fetched out when data fetching is over, use cs.close() to close the cursor. It is very important to close the cursor, particularly the database cursor, timely. Because the database cursor gets data from the database, the existence of it means the connection to the database will remain, causing unnecessary memory usage and even the overrun of cursors allowed to be created by the database.

Similar to the use of db.query() function in data query, the parameter is allowed in creating a database cursor. For example:

In SQL statements, parameters can be represented by the question mark (?). They are entered in a certain order after it and separated by the comma. A1’s cursor will return data of female employees whose EID is greater than 100. A2 returns all data of A1’s cursor, as shown below:

The number of rows of data to be returned is not specified for the cs.fetch() function used in A2, thus all data in the cursor will be returned. When data in a cursor have all been fetched out, it will close automatically with no need of calling cs.close().

3. External file cursor

In practice, particularly big data processing, data are often generated from the file rather than the database. Given this, esProc provides the function of creating the external file cursor for data fetching based on the structured file. The following text file PersonnelInfo.txt holds the personnel information:

In the file, each row holds a record and columns are separated by tabs. The external file cursor can be created according to the file for data fetching. For example:

A2 creates a cursor based on the external file and uses @t option to make the file’s first row the column names of the result set. A3 skips the first 50,000 rows and then A4 fetches 100 rows from A2’s cursor, as shown below:

As the database cursor, data are fetched from the external file cursor through a single one-way traversal. After A3 skips certain rows using cs.skip(), the data fetching in A4 will begin from the 50,001th row.

While creating a cursor based on a certain file, it is not necessary to use all its fields. We can specify fields to create the cursor, for example:

The code in the above cellset is almost the same as that the previous one, except that A2 creates the external file cursor using specified fields.

Note: It’s only after field names are imported using @t option that they can be used to specify fields. Without the option, fields can only be specified directly by their positions, like =A1.cursor(#1, #2, #6, #3).

About the external file cursor, we’ll make a deeper explanation in esProc External Memory Computing: Text Files.

4. In-memory record sequence cursor

In esProc, the result table sequence or record sequence can be converted into the cursor to be used with other types of cursor. For example:

The following is data of the states A1 selects:

A2 creates a cursor using the table sequence:

Generally the in-memory record sequence is converted into the cursor to perform some operations based on the cursor, like merging and joining the cursor data. For details in this respect, please refer to esProc External Memory Computing: Merge and Join Cursor Data.

Besides, the cursor can be created based on the result table sequence or record sequence got in another cellset file. Take cellset file createTable.dfx as an example:

The result of the cellset file returned by the result statement can be converted into a cursor, as shown below:

A2 fetches the first 100 rows from the A1’s cursor:

This is only a simple conversion of a result table sequence got in another cellset file into the in-memory record sequence cursor, which is a kind of cross-cellset cursor in this case. More usages of the cross-cellset cursor will be illustrated in the esProc External Memory Computing: Cross-cellset Cursor.

1/28/2015

esProc External Memory Computing: Text Files

Sources of data used for analysis usually fall into two categories: the database source and the file source. Compared with the database data, the file data are simple to deploy and publish. The problem is that, since the file data generally need to be used as a whole and thus need to be loaded into the memory all at once, they are not so easy to handle when being big. esProc offers solution to this problem by fetching them with the cursor. This makes the use of file data more convenient.

This article will explore how to manipulate file data in esProc through the case of text data - the most common external file data.

1. Basic usage

file function and import function can be used to read data from text files. The text file employee.txt holds information of 500 employees, as shown below:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY

1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000

2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000

3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000

4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000

5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000

6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000

7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000

8 Megan Wilson F California 1979-04-19 2004-04-19 Marketing 11000

9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000

…

esProc sets rules for the format of the text data used in the data table: Records must be separated by the carriage return and fields should be separated by the tab. Text data can be imported using simple code:

In esProc, import function is used to import the file data as the table sequence. In the above cellset, A1 creates a file object. If the path hasn’t been specified in the file name, it can be found in the main path on the settings page. Click on Tool>Options on the menu bar to view or set the main path on the Environment page of Option window:

A2 imports the file as a table sequence. The use of @t option in the import function will import the text file’s first row as the column names. For a clearer comparison, the option is omitted in A3’s data import. After the code is executed, A2’s table sequence is as follows:

Not all files necessarily use tab as the field separator. The CSV file, for instance, uses commas. We can specify a separator for a file when using import. Thus A2’s code can be modified as =A1 .import@t(;",").

A3 generates a table sequence as follows:

It can be seen that, when @t option is omitted, the table sequence’s field names will be automatically generated and named _1,_2,_3 according to their positions respectively.

When the text data are being imported, their data type will be automatically parsed according to the data type of the first row. As the first row of employee.txt holds field names, all data of A3’s fields will be parsed into strings, like column _1 and column _9 in the result above, which are left-justified. While the type of data in A2’s EID field and SALARY field is integer, and the data are right-justified.

We can import only some of the fields to generate a table sequence. For example:

The use of @t option in A2 means that we can directly use names to specify fields to be imported. Result of A2 is as follows:

Without @t option, A3 specifies the fields with their serial numbers in the format #i. The result table sequence is as follows:

As mentioned above, during text data import, data type will be automatically parsed according to that of the first row. If a particular data type is wanted, it can be defined for the specified fields to be imported. For example:

The following is the set of settings used in esProc to specify the data type for the field:

To learn more about the data type, please refer to esProc Getting Started: Basic Data Types.

A2 specifies data types for EID field, BIRTHDAY field and SALARY field – string, date and integer respectively. When executed, A2’s result table sequence is as follows:

A3 specifies the same data types for the fields as in A2. The result table sequence is as follows:

It can be seen that if certain data of a field cannot be parsed into the specified data type, they will be parsed automatically, like those of the title row, which have been automatically parsed into strings.

About the subsequent computations after data are imported, please see the article esProcProgram: Operations on Table Sequences and Record Sequences.

In esProc, not only text data can be imported as a table sequence, in-memory data can also be exported to a text file. The to-be-exported data are usually those of the table sequence or the record sequence, or sometimes, members of the sequence. export function is used to export data to the file. For example:

A1 creates a file object, employee1.txt, to save data. But at this point the file data are still stored in the report’s main path.

A2 creates a table sequence and inserts two records into it:

A3 exports the data of the table sequence to file employee1.txt:

We can use @t option to export the column names. And we can specify certain fields to be exported. By modifying A3’s code into >A1.export@t(A2,NAME), the exported data in employee1.txt are as follows:

As in import, tab is used as the default field separator in export. Of course we can also specify the field separator used in the function.

2. Batch processing of big data

When a file holds big data, to load it all together to the memory may cause the memory overflow. In this case, only some of the data ought to be imported each time. import function can be used to realize this by specifying the beginning byte and the ending byte. For example:

The text file PersonnelInfo.txt holds the personnel information. A2 specifies the range of data import - from the 10,000^th byte to the 20,000^th byte. Since only the complete record has meaning in the result table sequence, esProc will adjust the positions of beginning byte and ending byte to ensure they are respectively in a record’s starting point and another record’s ending point. The imported data in A2 are as follows:

As we can see, @t option can still be used to import field names while importing data through specifying the range.

A3 continues to import data from the 20,000^th byte to 30,000^th byte. Result is as follows:

The two results show that the personnel information in A2 and A3 is continuous. esProc specifies the range for data import in a uniform way to ensure the continuity, integrity and uniqueness of the imported data. A text file with big data can thus be imported and processed in batches to avoid the memory overflow.

In addition to directly specifying beginning and ending positions for data import, @z option can be used in import to import data segmentally. For example:

A2 and A3 use both @t and @z in the import function to import column names as well as the data in segments. That means we can use multiple options simultaneously in a function. A2 divides data into 500 segments and imports the 100^th segment. Its result is as follows:

A3 continues to import the 101^th segment. Its result is as follows:

It can be seen that, during the batch import, esProc is also able to ensure that complete records are imported by setting appropriate positions for beginning and ending points, and at the same time, the continuity, integrity and uniqueness of the imported data.

Note: With import function in which the file is imported in batches by specifying the beginning and ending bytes or dividing it into segments, the number of imported rows of data is different for each batch because of the variation of the number of bytes each record holds. This also explains why a specified row of data cannot be got directly. To locate the record in a specified positon, all records before this position need to be traversed. This is not efficient at all. Yet, the external file cursor can be used to access the target record precisely, which will be explained in the next section.

The export of big data to a text file can also be hindered by the limited memory. Because data cannot be loaded into the memory all at once, they cannot be exported to a certain file by simply using export. But similarly, they can be exported in batches. Use @a option in export function to append data to the existing data at each batch of data export. For example:

The loop in A2 imports data from PersonnelInfo.txt in 500 segments and specifies certain fields to be written to a new file PersonnelInfo1.txt. @at options are used in export to export both column names and data by appending each batch export to the previous data. When the code is executed, PersonnelInfo1.txt gets data as follows:

But without @a option, the existing data in the file will be removed at each time of export.

3. Access big text files with the cursor

A more convenient tool to process text files containing big data is the cursor. Cursor functions can be called conveniently after the external file cursor is generated based on the text file. For example:

A2 creates a file cursor using cursor function, in which @t option is used to make the file’s first row the column names. A3 skips the first 10,000 rows. A4 and A5 respectively fetch 1,000 rows from A2’s cursor:

Data can be fetched conveniently and quickly by their position from the cursor. As with f.import() in section one, fields can be specified for creating a cursor and the data type of a certain field can be defined with f.cursor().

With the external file cursor, we are able to handle various operations easily. About its usages, please refer to esProc External Memory Computing: Basic Usages of the Cursor and other articles on esProc external memory computing.

Also through the cursor, data can be exported to a file. For example:

A2 creates a database cursor. A3 gets the expected result by computing the cursor data and then export it to the file. After the code is executed, employee2.txt holds data as follows:

Exporting data through the cursor is convenient and the code is concise. To retain the existing data during the data export, use @a option to append each batch export to the previous data.

Besides the text file, esProc also supports the binary file. The latter takes up less memory space and makes the computation more efficient. For its usages, please see article esProc External Memory Computing: BinaryFiles.

1/27/2015

esProc External Memory Computing: Principle of Sorting

It is common to sort records in the table during data analysis and computing. In esProc, sort function is used to sort data in the sequence or the table sequence. External memory sorting is required when data being sorted are massive and cannot be loaded into memory all together, for the ordinary sorting method cannot handle this situation.

1. Massive data sorting with external memory

In data statistics, cursors are usually used to fetch massive data. This applies in esProc, which also processes big data with the cursor. In esProc, the function of a cursor, which reads one or more records each time according to the position(s) marked by it and won’t return all data all at once, is similar to that in a database stored procedure.

A cursor can only fetch part of the data every time, thus operations like sorting and grouping all data in the cursors cannot be executed directly. esProc uses external memory to handle these operations on massive data. Each time it reads some of the data for computing and records the result temporarily in the external memory. Later it will merge all the sub-results into a cursor and works out the final result.

Let’s create a data table with huge data in which the dates and 8-digit phone numbers (the first and the last digit should not be zero) are generated arbitrarily. The data table will be stored in the format of a binary file for convenience.

Altogether 100,000 rows of data are generated. Read data from the 50,001^th row to the 51,000^throw using the cursor. The result can be seen in C10 as follows:

We’ll take PhoneRecord, the generated data file, as an example to explore how to perform external sorting in esProc.

Using cs.sortx(x…;n) function in external memory sorting, we can sort the data in cursor cs in ascending order according to the computed result of expression x… and set the number of rows in buffer area by defining n to determine the number of records fetched each time for generating a temporary file. For example:

In order to understand how the external memory is used in esProc to sort data, we click

in the debug area on the toolbar to execute the code step by step until A5. A2 uses the binary data file PhoneRecord to create cursors. A3 uses sortx function to sort data of the cursors. The sorting result, in fact, will be a large cursor merged orderly by many temporary cursor files. The result of A3 is as follows:

A4 fetches the first 1,000 records from this result cursor as follows:

While the code in A3 is executed, external files, which are also called as temporary files, are generated in the directory of temporary files:

Because the number of rows in buffer area was set as 20,000 while using sortx function, five temporary files have been generated for the 100,000 records in the cursor. The data of one of them will be imported:

A2 imports the data as follows:

A3 works out the number of rows in this temporary file as follows:

By comparing the data in A2 with the final sorting result previously obtained, we can see that the former is, actually, the result obtained by sorting a part of the data. This indicates that each temporary file is the sorting result of some data fetched according to the number of rows in buffer area.

Then go on to execute the previous cellset file. We may find that the temporary files will be deleted automatically when the cursor is closed.

sortx function can also be used to sort multiple fields. For example:

A3 sorts data by Date and –PhoneNum , meaning sorting data by date in ascending order, and then sorting those of the same date by phone number in descending order. In programming, it is common to precede an expression for sorting data in descending order by a negative sign. A4 fetches the first 1,000 records from the result as follows:

2. Applications of external memory sorting

In fact, from the operation of external memory sorting we can see one of the uses of the cursor-style sorting, that is, the sorted data can be used in orderly merging. The operation of orderly merging gets data from multiple cursors and computes the sorting expression. Records will be fetched from the cursor that currently has the smallest (or biggest) result. Apparently, this method can only be used when the data of every cursor are properly ordered. In addition, joining records of cursors in alignment with join@x() function also requires that data in every cursor be sorted. About the usage of join@x() function, please refer to esProc External Memory Sorting: Merge and Join Cursor Data.

When the data of a cursor are properly ordered, it can be specified that we fetch data continuously from the cursor until the expression is changed. For example:

A4 fetches data from the cursor in A3 until the Date is changed, meaning the data of the first day will be fetched; A5 skips data of consecutive three days; A6 fetches the data of the fifth day. The results of A4 and A6 are as follows: