Data Process Engine: esProc Improves Text Processing

During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.

esProc supports performing conditional query on big files and multithreaded parallel computing, and its code for handling this kind of problem is both concise and efficient. The following example will teach you the esProc method of doing the job.

A text file - employee.txt – holds the employee data. Import the data, select the female employees born after January 1, 1981 inclusive and export the query result to result.txt.

The format of employee.txt is as follows:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY

1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000

2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000

3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000

4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000

5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000

esProc code for accomplishing the task:

A1：Open the file as a cursor. cursor function won't import all the data into the memory, it will open the file in the form of a cursor (stream) without the memory footprint. The function uses a default parameter which makes tab as the column separator to import all the fields. @t option means that the file's first line will be the column names and thus specific column names can be used in the expression later. Without the option, columns will be named _1, _2, _3…_n automatically.

A2=A1.select(${where})

Filter the data according to the condition. Here a macro is used to dynamically parse the expression. "where" is the dynamic input parameter, it needs to be pre-defined. The following is the interface on which a parameter is defined:

The esProc program will first compute the expression surrounded by ${…}, then assign the computed result as the value to the macro string and replace ${…} with it; after that, the program will interpret and execute the code. For example, if where gets assigned as BIRTHDAY>=date(1981,1,1) && GENDER=="F" according to the given condition in the example, the expression in A2 will be =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F"). The parameter can be entered into esProc’s Integration Development Environment (IDE), or can be passed from the Java code or the command line.

A3=file("D:/result.txt").export@t(A2). This line of code exports the computed result to a file. If the size of computing result is always small, use the code =A2.fetch() in A3 to fetch the results into the memory for direct observation, or use result A2.fetch() to return the results to the Java application.

The final result of this example is as follows:

This example shows the method of realizing a dynamic query, that is, there is no need to change the code when the query condition changes, just modify the value of the parameter “where”. For example, if the condition becomes "query female employees born after January 1, 1981 inclusive, or employees whose FULLNAME is RebeccaMoore", the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After the code is executed, the result set of A2 will be as follows:

The above algorithm is a sequential computation. But the use of parallel computation can further improve the performance. The method is: Import the file using multithreads, each of which accesses a part of the file with a cursor; meanwhile query the data according to the condition and finally merge the result of each cursor together.

The esProc code for parallel computing is as follows:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the efficiency won’t be really increased. The maximum number of parallel tasks can be configured in the environment option.

A2=A1.(file("d:/employee.txt").cursor@z(;, ~:A1))

This line of code will generate four cursors according to the specified number of segments. A1.(express) means computing the expression with each member of A1 in order. "~" can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.

file("d:/employee.txt ").cursor@z(;, ~:A1) surrounded in the parentheses is an expression, in which cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file will be roughly divided into four segments (A1=4) and the ~th segment will be fetched. "~" represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided "roughly".

A3=A2.(~.select(${where})). This line of code will query data of each cursor (i.e. ~) in A2 and select the eligible rows. The computed results are still four cursors.

A4=A3.conj@xm(). This line of code will merge the four cursors in A3 in parallel.

A5=file("d:/result.txt").export(A4). This line of code will export the final result to a file.

Data Process Engine

menu

12/23/2014

esProc Improves Text Processing – Conditional Query on Big Files

No comments:

Post a Comment