Sources of data used for analysis usually fall into two categories:
the database source and the file source. Compared with the database data, the
file data are simple to deploy and publish. The problem is that, since the file
data generally need to be used as a whole and thus need to be loaded into the
memory all at once, they are not so easy to handle when being big. esProc
offers solution to this problem by fetching them with the cursor. This makes
the use of file data more convenient.
This article will explore how to manipulate file data in esProc
through the case of text data - the most common external file data.
1. Basic usage
file function and import
function can be used to read data from text files. The text file employee.txt holds information of 500
employees, as shown below:
EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000
2 Ashley Wilson F New
York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New
Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 2004-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
…
esProc sets rules for the format of the
text data used in the data table: Records must be separated by the carriage
return and fields should be separated by the tab. Text data can be imported using simple code:
In esProc, import function is used to import the file data as the table
sequence. In the above cellset, A1 creates a file object. If the path hasn’t
been specified in the file name, it can be found in the main path on the
settings page. Click on Tool>Options on the menu bar to view
or set the main path on the Environment page of Option window:A2 imports the file as a table sequence. The use of @t option in the import function will import the text file’s first row as the column names. For a clearer comparison, the option is omitted in A3’s data import. After the code is executed, A2’s table sequence is as follows:
Not all files necessarily use tab as the field separator. The CSV
file, for instance, uses commas. We can specify a separator for a file when
using import. Thus A2’s code can be
modified as =A1 .import@t(;",").
It can be seen that, when @t option is omitted, the table
sequence’s field names will be automatically generated and named _1,_2,_3 according to their positions respectively.
When the text data are being imported,
their data type will be automatically parsed according to the data type of the
first row. As the first row of employee.txt
holds field names, all data of A3’s fields will be parsed into strings, like
column _1 and column _9 in the result above, which are
left-justified. While the type of data in A2’s EID field and SALARY
field is integer, and the data are right-justified.
The use of @t option in A2 means that we can directly use names to specify fields to be imported. Result of A2 is as follows:
Without @t option, A3 specifies the fields with their serial numbers in the format #i. The result table sequence is as follows:
As mentioned above, during text data import, data type will be automatically parsed according to that of the first row. If a particular data type is wanted, it can be defined for the specified fields to be imported. For example:
The following is the set of settings used in esProc to specify the data type for the field:
To learn more about the data type, please
refer to esProc Getting Started: Basic
Data Types.
A3 specifies the same data types for the fields as in A2. The result table sequence is as follows:
It can be seen that if certain data of a
field cannot be parsed into the specified data type, they will be parsed
automatically, like those of the title row, which have been automatically
parsed into strings.
About the subsequent computations after data
are imported, please see the article esProcProgram: Operations on Table Sequences and Record Sequences.
A1 creates a file object, employee1.txt, to save data. But at
this point the file data are still stored in the report’s main path.
A3 exports the data of the table sequence to file employee1.txt:
We can use @t option to export the column names. And we can specify certain fields to be exported. By modifying A3’s code into >A1.export@t(A2,NAME), the exported data in employee1.txt are as follows:
As in import,
tab is used as the default field
separator in export. Of course we
can also specify the field separator used in the function.
2. Batch processing of big data
When a file holds big data, to load it all together to the memory may cause the memory overflow. In this case, only some of the data ought to be imported each time. import function can be used to realize this by specifying the beginning byte and the ending byte. For example:The text file PersonnelInfo.txt holds the personnel information. A2 specifies the range of data import - from the 10,000th byte to the 20,000th byte. Since only the complete record has meaning in the result table sequence, esProc will adjust the positions of beginning byte and ending byte to ensure they are respectively in a record’s starting point and another record’s ending point. The imported data in A2 are as follows:
As we can see, @t option
can still be used to import field names while importing data through specifying
the range.
The two results show that the personnel information in A2 and A3 is continuous.
esProc specifies the range for data import in a uniform way to ensure the continuity,
integrity and uniqueness of the imported data. A text file with big data can
thus be imported and processed in batches to avoid the memory overflow.
A2 and A3 use both @t and @z in the import function to import column names as well as the data in segments. That means we can use multiple options simultaneously in a function. A2 divides data into 500 segments and imports the 100th segment. Its result is as follows:
A3 continues to import the 101th segment. Its result is as follows:
It can be seen that, during the batch import, esProc is also able to
ensure that complete records are imported by setting appropriate positions for
beginning and ending points, and at the same time, the continuity, integrity
and uniqueness of the imported data.
Note: With import
function in which the file is imported in batches by specifying the beginning
and ending bytes or dividing it into segments, the number of imported rows of
data is different for each batch because of the variation of the number of
bytes each record holds. This also explains why a specified row of data cannot
be got directly. To locate the record in a specified positon, all records
before this position need to be traversed. This is not efficient at all. Yet,
the external file cursor can be used to access the target record precisely,
which will be explained in the next section.
The loop in A2 imports data from PersonnelInfo.txt in 500 segments and specifies certain fields to be written to a new file PersonnelInfo1.txt. @at options are used in export to export both column names and data by appending each batch export to the previous data. When the code is executed, PersonnelInfo1.txt gets data as follows:
But without @a option,
the existing data in the file will be removed at each time of export.
3. Access big text files with the cursor
A more convenient tool to process text files containing big data is the cursor. Cursor functions can be called conveniently after the external file cursor is generated based on the text file. For example:A2 creates a file cursor using cursor function, in which @t option is used to make the file’s first row the column names. A3 skips the first 10,000 rows. A4 and A5 respectively fetch 1,000 rows from A2’s cursor:
Data can be fetched conveniently and
quickly by their position from the cursor. As with f.import() in section one, fields can be specified for creating a
cursor and the data type of a certain field can be defined with f.cursor().
With the external file cursor, we are able
to handle various operations easily. About its usages, please refer to esProc External Memory Computing: Basic
Usages of the Cursor and other articles on esProc external memory
computing.
A2 creates a database cursor. A3 gets the expected result by computing the cursor data and then export it to the file. After the code is executed, employee2.txt holds data as follows:
Exporting data through the cursor is convenient
and the code is concise. To retain the existing data during the data export,
use @a option to append each batch export to the previous data.
Besides the text file, esProc also supports
the binary file. The latter takes up less memory space and makes the
computation more efficient. For its usages, please see article esProc External Memory Computing: BinaryFiles.
No comments:
Post a Comment