1/25/2015

esProc External Memory Computing: Binary Files

esProc most often works with two kinds of data file: the ordinary text file and the binary file. The binary file uses a compressed encoding with low CPU overhead, so it takes less space than an uncompressed text file and can be read more efficiently. For these reasons, the binary file is usually the better choice when you need to use data files.

In esProc External Memory Computing: Text Files, we learned how to manipulate text data. This article deals with the use of binary data.

1. Comparison between binary files and text files


The two kinds of file are used in almost the same way, except that the @b option cannot be omitted when functions such as import, export and cursor work with a binary file. Let's look at an example:

The two files, PersonnelInfo and PersonnelInfo.txt, hold the same personnel information, stored in binary format and text format respectively. Each file contains 100,000 records of 6 fields. On the hard disk, the binary file is smaller than the text file.

Then, let’s check out how data are retrieved from these two kinds of files:

In the cellset, A3 creates a cursor on the binary file. A5 computes the time (in milliseconds) taken to perform the grouping and aggregate operation over the binary data. The result is shown below:

In the same way, the text data are used to perform the same operation. The computing time is shown below:

The two cellsets above use the binary file and the text file respectively to perform the same grouping and aggregate computation, counting the number of employees in each state, and A5 computes the time (in milliseconds) taken. As the results show, the binary data are retrieved significantly faster than the text data.

In short, storing data in binary files is the more convenient and efficient option in esProc.
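
To make the size claim concrete outside esProc, here is a minimal Python sketch. It does not reproduce esProc's file format: the record layout, the helper names (make_records, encode_text, encode_binary, decode_binary) and the use of zlib are all assumptions chosen for illustration. It encodes the same records once as tab-separated text and once as a compact, lightly compressed binary blob, so the two sizes can be compared.

```python
import csv
import io
import struct
import zlib


def make_records(n):
    """Build n hypothetical personnel records: (ID, name, state, salary)."""
    states = ["California", "Texas", "Ohio", "Florida"]
    return [(i, f"Name{i}", states[i % len(states)], 1000 + i % 500)
            for i in range(1, n + 1)]


def encode_text(records):
    """Encode records as tab-separated text, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(records)
    return buf.getvalue().encode("utf-8")


def encode_binary(records):
    """Encode records in a compact binary layout, then compress cheaply."""
    parts = []
    for rec_id, name, state, salary in records:
        name_b = name.encode("utf-8")
        state_b = state.encode("utf-8")
        parts.append(struct.pack("<IB", rec_id, len(name_b)) + name_b)
        parts.append(struct.pack("<B", len(state_b)) + state_b)
        parts.append(struct.pack("<I", salary))
    # Level 1 is zlib's fastest setting, i.e. the lowest CPU cost.
    return zlib.compress(b"".join(parts), 1)


def decode_binary(blob):
    """Inverse of encode_binary: return the list of record tuples."""
    data = zlib.decompress(blob)
    pos = 0
    records = []
    while pos < len(data):
        rec_id, name_len = struct.unpack_from("<IB", data, pos)
        pos += 5
        name = data[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (state_len,) = struct.unpack_from("<B", data, pos)
        pos += 1
        state = data[pos:pos + state_len].decode("utf-8")
        pos += state_len
        (salary,) = struct.unpack_from("<I", data, pos)
        pos += 4
        records.append((rec_id, name, state, salary))
    return records


if __name__ == "__main__":
    recs = make_records(100_000)
    print("text bytes:  ", len(encode_text(recs)))
    print("binary bytes:", len(encode_binary(recs)))
```

The exact ratio depends on the data and on the encoding chosen, so the numbers printed here say nothing about the sizes esProc itself would produce.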

2. Fetch data segmentally with the cursor

In big data computing, a common approach is to split the data into segments and process each segment separately. Both text files and binary files can be imported by segment using the @z option. The article esProc External Memory Computing: Text Files explains how to import text data by segment to create a table sequence. Data can be fetched by segment through a cursor in the same way. For example:

Both A2 and A4 use the @z option when generating the cursor, dividing the data into 5 parts according to the parameters. A2 returns the 1st part and A4 returns the 2nd. A3 and A5 then fetch all the data from their respective cursors, as shown below:

The file PersonnelInfo holds 100,000 records in total. As the results show, the records are divided into 5 parts of roughly equal data size rather than into parts with exactly equal record counts. When retrieving data by segment, esProc automatically adjusts the range being read so that every record is read whole, as with the 1st and 2nd segments here. This keeps the data continuous across segments and prevents any record from being read twice during computation.
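
The boundary adjustment described above can be sketched in Python. This is only an illustration of the general technique on a line-oriented text file, not esProc's implementation; the function name read_segment and its parameters are hypothetical. The file is cut at byte offsets, and each segment then owns exactly those records whose first byte falls inside its byte range, so no record is split or read twice.

```python
import os


def read_segment(path, k, n):
    """Return the records (lines) of segment k out of n; k is 1-based.

    Segments are cut by byte offset, then aligned to record boundaries:
    a record belongs to the segment in which its first byte falls.
    """
    size = os.path.getsize(path)
    start = size * (k - 1) // n
    end = size * k // n
    records = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline, so
            # reading begins at the first record starting at or after `start`.
            f.seek(start - 1)
            f.readline()
        # Keep reading whole records while they start before `end`; the
        # last one may run past `end`, and the next segment will skip it.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            records.append(line.rstrip(b"\n").decode("utf-8"))
    return records
```

Calling read_segment(path, 1, 5) through read_segment(path, 5, 5) and concatenating the results yields every record once, in order, while the number of records per segment varies slightly with record length.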

The @z option is used with a text file in exactly the same way as with a binary file, except that the @b option is omitted for the text file.

When retrieving data by segment from a binary file, the file can also be divided into segments that correspond to different groups of data. For details, see esProc External Memory Computing: Group Cursor. This function applies only to binary files and cannot be used with text files.

3. Columnar storage of binary file

If a frequently accessed big data table contains multiple fields, you can use the columnar storage of binary files to split the table into multiple files by field. You can then build a cursor from only the files of the fields you need and avoid reading unnecessary data. For example:

In A9, the data in the cursor are saved as multiple binary files in columnar format, with each file storing only one column of data. In A10, the files corresponding to the desired fields are selected to build a cursor jointly. This cursor can be used just like a normal cursor, without wasting system resources on unneeded data. A11 retrieves the first 100 records, as shown below:

However, retrieving data by block divides a file by data volume rather than by number of records, so the segments of the separate column files cannot be guaranteed to line up with each other. Therefore, a file cursor composed of multiple columnar files does not support access by segment.

If multiple files are accessed simultaneously on a mechanical hard disk and the file buffer is small, data retrieval efficiency drops sharply because the disk has to switch between files frequently. In that case the file buffer should be set to a larger size, such as 16M. Note, however, that memory overflow may occur if the file buffer is too large or there are too many parallel threads. With an SSD instead of a mechanical hard disk, retrieval speed does not drop this way, and the default file buffer size of 64K (65536 bytes) is sufficient.
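
The columnar idea and the buffer setting discussed in this section can be sketched together in Python. Everything here is an assumption made for illustration: the helper names export_columns and column_cursor, the file naming, and the use of pickle as the on-disk encoding are not esProc's. Each field is written to its own file, and a generator acting as a simple cursor reads only the selected column files in step, with a configurable per-file buffer size.

```python
import os
import pickle
from itertools import islice


def export_columns(records, fields, directory):
    """Write each field of the records to its own binary file."""
    paths = {}
    for idx, field in enumerate(fields):
        path = os.path.join(directory, f"PersonnelInfo_{field}.bin")
        with open(path, "wb") as f:
            for rec in records:
                pickle.dump(rec[idx], f)
        paths[field] = path
    return paths


def column_cursor(paths, wanted, buffer_size=65536):
    """Yield one dict per record, reading only the wanted column files.

    buffer_size is the read buffer per open file; a larger value means
    fewer switches between files but more memory per cursor.
    """
    files = [open(paths[name], "rb", buffering=buffer_size)
             for name in wanted]
    try:
        while True:
            row = []
            for f in files:
                try:
                    row.append(pickle.load(f))
                except EOFError:
                    return  # a column file is exhausted: no more records
            yield dict(zip(wanted, row))
    finally:
        for f in files:
            f.close()


def first_records(paths, wanted, count, buffer_size=65536):
    """Fetch the first `count` records from the selected columns."""
    cursor = column_cursor(paths, wanted, buffer_size)
    try:
        return list(islice(cursor, count))
    finally:
        cursor.close()
```

For example, first_records(paths, ["ID", "State"], 100, buffer_size=16 * 1024 * 1024) reads the first 100 records from two column files with a 16M buffer each. Memory use grows with the buffer size multiplied by the number of open files and threads, which is the trade-off noted above.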
