1/28/2015

esProc External Memory Computing: Text Files

Sources of data used for analysis usually fall into two categories: the database source and the file source. Compared with the database data, the file data are simple to deploy and publish. The problem is that, since the file data generally need to be used as a whole and thus need to be loaded into the memory all at once, they are not so easy to handle when being big. esProc offers solution to this problem by fetching them with the cursor. This makes the use of file data more convenient.

This article will explore how to manipulate file data in esProc through the case of text data - the most common external file data.

1. Basic usage

file function and import function can be used to read data from text files. The text file employee.txt holds information of 500 employees, as shown below:
EID       NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1  Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2  Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3  Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4  Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5  Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6  Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7  Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8  Megan     Wilson      F       California 1979-04-19       2004-04-19       Marketing        11000
9  Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000

esProc sets rules for the format of the text data used in the data table: Records must be separated by the carriage return and fields should be separated by the tab. Text data can be imported using simple code: 
In esProc, import function is used to import the file data as the table sequence. In the above cellset, A1 creates a file object. If the path hasn’t been specified in the file name, it can be found in the main path on the settings page. Click on Tool>Options on the menu bar to view or set the main path on the Environment page of Option window:
A2 imports the file as a table sequence. The use of @t option in the import function will import the text file’s first row as the column names. For a clearer comparison, the option is omitted in A3’s data import. After the code is executed, A2’s table sequence is as follows:
Not all files necessarily use tab as the field separator. The CSV file, for instance, uses commas. We can specify a separator for a file when using import. Thus A2’s code can be modified as =A1 .import@t(;",").

A3 generates a table sequence as follows:
It can be seen that, when @t option is omitted, the table sequence’s field names will be automatically generated and named _1,_2,_3 according to their positions respectively.

When the text data are being imported, their data type will be automatically parsed according to the data type of the first row. As the first row of employee.txt holds field names, all data of A3’s fields will be parsed into strings, like column _1 and column _9 in the result above, which are left-justified. While the type of data in A2’s EID field and SALARY field is integer, and the data are right-justified.

We can import only some of the fields to generate a table sequence. For example:
The use of @t option in A2 means that we can directly use names to specify fields to be imported. Result of A2 is as follows:
Without @t option, A3 specifies the fields with their serial numbers in the format #i. The result table sequence is as follows:
As mentioned above, during text data import, data type will be automatically parsed according to that of the first row. If a particular data type is wanted, it can be defined for the specified fields to be imported. For example:
The following is the set of settings used in esProc to specify the data type for the field:
To learn more about the data type, please refer to esProc Getting Started: Basic Data Types.

A2 specifies data types for EID field, BIRTHDAY field and SALARY field – string, date and integer respectively. When executed, A2’s result table sequence is as follows:
A3 specifies the same data types for the fields as in A2. The result table sequence is as follows:
It can be seen that if certain data of a field cannot be parsed into the specified data type, they will be parsed automatically, like those of the title row, which have been automatically parsed into strings.

About the subsequent computations after data are imported, please see the article esProcProgram: Operations on Table Sequences and Record Sequences.

In esProc, not only text data can be imported as a table sequence, in-memory data can also be exported to a text file. The to-be-exported data are usually those of the table sequence or the record sequence, or sometimes, members of the sequence. export function is used to export data to the file. For example:
A1 creates a file object, employee1.txt, to save data. But at this point the file data are still stored in the report’s main path.

A2 creates a table sequence and inserts two records into it:
A3 exports the data of the table sequence to file employee1.txt:
We can use @t option to export the column names. And we can specify certain fields to be exported. By modifying A3’s code into >A1.export@t(A2,NAME), the exported data in employee1.txt are as follows:
As in import, tab is used as the default field separator in export. Of course we can also specify the field separator used in the function.

2. Batch processing of big data

When a file holds big data, to load it all together to the memory may cause the memory overflow. In this case, only some of the data ought to be imported each time. import function can be used to realize this by specifying the beginning byte and the ending byte. For example:
The text file PersonnelInfo.txt holds the personnel information. A2 specifies the range of data import - from the 10,000th byte to the 20,000th byte. Since only the complete record has meaning in the result table sequence, esProc will adjust the positions of beginning byte and ending byte to ensure they are respectively in a record’s starting point and another record’s ending point. The imported data in A2 are as follows:
As we can see, @t option can still be used to import field names while importing data through specifying the range.

A3 continues to import data from the 20,000th byte to 30,000th byte. Result is as follows:
The two results show that the personnel information in A2 and A3 is continuous. esProc specifies the range for data import in a uniform way to ensure the continuity, integrity and uniqueness of the imported data. A text file with big data can thus be imported and processed in batches to avoid the memory overflow.

In addition to directly specifying beginning and ending positions for data import, @z option can be used in import to import data segmentally. For example:
A2 and A3 use both @t and @z in the import function to import column names as well as the data in segments. That means we can use multiple options simultaneously in a function. A2 divides data into 500 segments and imports the 100th segment. Its result is as follows:
A3 continues to import the 101th segment. Its result is as follows:
It can be seen that, during the batch import, esProc is also able to ensure that complete records are imported by setting appropriate positions for beginning and ending points, and at the same time, the continuity, integrity and uniqueness of the imported data.

Note: With import function in which the file is imported in batches by specifying the beginning and ending bytes or dividing it into segments, the number of imported rows of data is different for each batch because of the variation of the number of bytes each record holds. This also explains why a specified row of data cannot be got directly. To locate the record in a specified positon, all records before this position need to be traversed. This is not efficient at all. Yet, the external file cursor can be used to access the target record precisely, which will be explained in the next section.

The export of big data to a text file can also be hindered by the limited memory. Because data cannot be loaded into the memory all at once, they cannot be exported to a certain file by simply using export. But similarly, they can be exported in batches. Use @a option in export function to append data to the existing data at each batch of data export. For example:
The loop in A2 imports data from PersonnelInfo.txt in 500 segments and specifies certain fields to be written to a new file PersonnelInfo1.txt. @at options are used in export to export both column names and data by appending each batch export to the previous data. When the code is executed, PersonnelInfo1.txt gets data as follows:
But without @a option, the existing data in the file will be removed at each time of export.

3. Access big text files with the cursor

A more convenient tool to process text files containing big data is the cursor. Cursor functions can be called conveniently after the external file cursor is generated based on the text file. For example:
A2 creates a file cursor using cursor function, in which @t option is used to make the file’s first row the column names. A3 skips the first 10,000 rows. A4 and A5 respectively fetch 1,000 rows from A2’s cursor:
Data can be fetched conveniently and quickly by their position from the cursor. As with f.import() in section one, fields can be specified for creating a cursor and the data type of a certain field can be defined with f.cursor().

With the external file cursor, we are able to handle various operations easily. About its usages, please refer to esProc External Memory Computing: Basic Usages of the Cursor and other articles on esProc external memory computing.

Also through the cursor, data can be exported to a file. For example:
A2 creates a database cursor. A3 gets the expected result by computing the cursor data and then export it to the file. After the code is executed, employee2.txt holds data as follows:
Exporting data through the cursor is convenient and the code is concise. To retain the existing data during the data export, use @a option to append each batch export to the previous data.

Besides the text file, esProc also supports the binary file. The latter takes up less memory space and makes the computation more efficient. For its usages, please see article esProc External Memory Computing: BinaryFiles

No comments:

Post a Comment