Two kinds of data file, the ordinary text file and the binary file, are the most commonly used in esProc. The binary file uses a compressed encoding with low CPU consumption, so it takes less space than an uncompressed text file and can be read more efficiently. The binary file is therefore usually the better choice when you need to use data files.
In esProc External Memory Computing: Text Files, we learned how to manipulate text data. This article deals with the use of binary data.
1. Comparison between binary files and text files
Their usage is almost the same, except that the @b option cannot be omitted in functions such as import, export and cursor when they are used with a binary file. Let's look at this with an example:
The two files in this example, PersonnelInfo and PersonnelInfo.txt, hold the same personnel information, stored in binary format and text format respectively. Each file includes 100,000 records of 6 fields. On the hard disk, the binary file is smaller than the text file.
In the cellset, A3 creates a cursor on the binary file. A5 computes the time (in milliseconds) consumed by the grouping and aggregate operation over the binary data. The result is shown below:
In the same way, the text data are used to perform the same operation. The computing time is shown below:
In the above two cellsets, the binary file and the text file are each used to perform the same grouping and aggregate computation, counting the number of employees in each state, and A5 computes the time (in milliseconds) consumed. As the results show, the binary data are retrieved significantly faster than the text data.
In short, storing data in a binary file is the more efficient choice in esProc.
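The size difference described above can be illustrated outside esProc with a minimal Python sketch. It is not esProc's actual binary format: the sample records, the fixed-width layout and the use of zlib at its fastest level are all assumptions chosen only to show why a compact, lightly compressed encoding can be smaller than comma-separated text.

```python
import struct
import zlib

STATES = ["CA", "NY", "TX", "FL", "OH"]
# Hypothetical sample data: (id, state, salary)
records = [(i, STATES[i % 5], 3000 + (i * 37) % 5000) for i in range(1000)]

# Text format: one comma-separated line per record
text_bytes = "".join(f"{i},{s},{p}\n" for i, s, p in records).encode("ascii")

# Binary format (illustrative only): fixed-width fields without padding,
# followed by a fast, low-CPU compression pass (zlib level 1)
LAYOUT = "<I2sI"
packed = b"".join(struct.pack(LAYOUT, i, s.encode("ascii"), p) for i, s, p in records)
binary_bytes = zlib.compress(packed, 1)


def decode(blob):
    """Decompress and unpack the illustrative binary format back into records."""
    raw = zlib.decompress(blob)
    return [(i, s.decode("ascii"), p) for i, s, p in struct.iter_unpack(LAYOUT, raw)]


print("text bytes:", len(text_bytes), "binary bytes:", len(binary_bytes))
```

Reading the binary form also avoids parsing numbers from text, which is one plausible reason for the faster retrieval measured above.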
2. Fetch data segmentally with the cursor
In big data computing, a common approach is to split the data into segments and then compute each segment separately. Both text files and binary files can be imported by segment using the @z option. The article esProc External Memory Computing: Text Files explains how to import text data by segment to create a table sequence. Data can be fetched by segment through a cursor in the same way. For example:
In both A2 and A4, the @z option is used when generating the cursor to divide the data into 5 parts according to the parameters. A2 returns the 1st part and A4 returns the 2nd. A3 and A5 fetch all the data from these cursors, as shown below:
The file PersonnelInfo contains 100,000 records in total. We can see that they are divided into 5 parts of approximately equal data size, rather than precisely by number of records. When retrieving data by segment, esProc automatically adjusts the retrieval range so that every record stays intact, as with the 1st and 2nd segments here. This ensures that the segments are contiguous and that no record is read twice during computation.
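The boundary adjustment described above can be sketched in Python for a line-based text file. This is only an illustration of the general technique, not esProc's implementation: the function name read_segment and the rule "a record belongs to the segment in which its first byte falls" are assumptions made for the example.

```python
import io


def read_segment(data, k, n):
    """Return the complete lines of segment k (1-based) out of n segments.

    The byte range is split evenly, then adjusted so that a line belongs to
    the segment in which its first byte falls (an assumed rule).
    """
    size = len(data)
    start = size * (k - 1) // n
    end = size * k // n
    buf = io.BytesIO(data)
    if start > 0:
        # Step back one byte and discard the rest of that line, so that
        # reading begins at the first line starting at or after `start`.
        buf.seek(start - 1)
        buf.readline()
    lines = []
    while buf.tell() < end:
        line = buf.readline()
        if not line:
            break
        lines.append(line.rstrip(b"\n"))
    return lines


# Hypothetical sample file content: 1,000 lines of varying length
data = b"".join(f"{i},State{i % 7}\n".encode("ascii") for i in range(1000))
parts = [read_segment(data, k, 5) for k in range(1, 6)]
print([len(p) for p in parts])  # record counts per segment need not be equal
```

Because every line is assigned to exactly one segment, concatenating the five segments reproduces the file with no record lost or duplicated, even though the segments hold different numbers of records.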
The @z option is used with a text file in exactly the same way as with a binary file, except that the @b option is omitted for the text file.
When data are retrieved by segment from a binary file, the file can also be divided into segments that correspond to different groups of data. For details, see esProc External Memory Computing: Group Cursor. This feature applies only to binary files and cannot be used with text files.
3. Columnar storage of binary file
If an access-intensive big data table contains multiple fields, you can use the columnar storage of binary files to store the table as multiple files, split by field. You can then generate the cursor from only the files of the desired fields, so that unnecessary data are not read. For example:
In A9, the data in the cursor are saved as multiple binary files in columnar format, with each file storing only one column of data. In A10, the files corresponding to the desired fields are selected and combined to build a cursor, which can be used as conveniently as a normal cursor while keeping system resources from being consumed by the extra data. A11 retrieves the first 100 records, as shown below:
However, because retrieving data by segment divides a file by its data volume rather than by number of records, the segments of the different columnar files cannot be guaranteed to stay consistent with one another. Therefore, segmented access is not allowed on a file cursor composed of multiple columnar files.
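The columnar idea in this section can be sketched in Python as follows. This is a conceptual illustration under stated assumptions, not esProc's storage format: the helper names export_columns and column_cursor, the field names, and the use of JSON lines in place of a binary encoding are all hypothetical.

```python
import json
import os
import tempfile
from itertools import islice

FIELDS = ["id", "name", "state", "salary"]  # hypothetical field names


def export_columns(records, fields, directory):
    """Write one file per field; each file holds that field's values in record order."""
    for idx, field in enumerate(fields):
        path = os.path.join(directory, field + ".col")
        with open(path, "w", encoding="utf-8") as f:
            for rec in records:
                f.write(json.dumps(rec[idx]) + "\n")


def column_cursor(directory, wanted):
    """Yield records built only from the wanted column files, read in lockstep."""
    files = [open(os.path.join(directory, name + ".col"), encoding="utf-8")
             for name in wanted]
    try:
        for row in zip(*files):
            yield tuple(json.loads(value) for value in row)
    finally:
        for f in files:
            f.close()


with tempfile.TemporaryDirectory() as d:
    rows = [(i, f"Emp{i}", ["CA", "NY", "TX"][i % 3], 3000 + i) for i in range(500)]
    export_columns(rows, FIELDS, d)
    cur = column_cursor(d, ["id", "state"])  # name and salary files are never opened
    first_100 = list(islice(cur, 100))
    cur.close()  # release the open column files
```

The sketch also shows why the column files must be read in lockstep from the beginning: the n-th value of each file belongs to the n-th record, so cutting the files independently by byte position would misalign the fields.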
If multiple files are accessed simultaneously on a mechanical hard disk and the file buffer is small, data retrieval efficiency drops greatly because of the frequent switching between files. In this case the file buffer needs to be set larger, for example to 16M. Note, however, that memory overflow may occur if the file buffer is too large or there are too many parallel threads. With an SSD instead of a mechanical hard disk, the retrieval speed does not decrease so sharply, and the default file buffer setting, i.e. 64K (65536 bytes), can be kept.
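The trade-off between buffer size and memory can be made concrete with a little arithmetic. The Python sketch below assumes that each open column file in each thread gets its own buffer, which is a simplification and may not match how esProc allocates buffers; the 8 parallel threads are a hypothetical figure, while the 6 files correspond to the 6 fields of the example table.

```python
def buffer_memory_bytes(buffer_size, files_per_cursor, threads):
    """Estimate total buffer memory, assuming one buffer per open file per thread."""
    return buffer_size * files_per_cursor * threads


MIB = 1024 * 1024

# 16M buffer, 6 column files per cursor, 8 parallel threads (hypothetical)
large = buffer_memory_bytes(16 * MIB, 6, 8)

# Default 64K buffer with the same files and threads
default = buffer_memory_bytes(65536, 6, 8)

print(large // MIB, "MiB with 16M buffers")
print(default // MIB, "MiB with 64K buffers")
```

Under these assumptions the larger buffer costs 256 times as much memory, which is why the buffer size and the number of parallel threads have to be chosen together.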