During text processing, you often have the
tasks of querying data from a big file on one or more conditions. Command line grep\cat
command can be used to handle some simple situations with simple command yet
low efficiency. Or high-level languages can be used to get a much higher
efficiency with complicated code. If the query conditions are complex or
dynamic, you need to create an additional SQL-like low-level class library,
which increases the complexity of the computation.
esProc supports performing conditional
query on big files and multithreaded parallel computing, and its code for
handling this kind of problem is both concise and efficient. The following
example will teach you the esProc method of doing the job.
A text file - employee.txt – holds the employee data. Import the data, select the
female employees born after January 1, 1981 inclusive and export the query result
to result.txt.
The format of employee.txt is as follows:
EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
esProc code for accomplishing the task:
A1:Open the file as a
cursor. cursor function won't import
all the data into the memory, it will open the file in the form of a cursor
(stream) without the memory footprint. The function uses a default parameter which
makes tab as the column separator to
import all the fields. @t option
means that the file's first line will be the column names and thus specific
column names can be used in the expression later. Without the option, columns will
be named _1, _2, _3…_n automatically.
A2=A1.select(${where})
Filter
the data according to the condition. Here a macro is used to dynamically parse
the expression. "where" is
the dynamic input parameter, it needs to be pre-defined. The following is the
interface on which a parameter is defined:
The
esProc program will first compute the expression surrounded by ${…}, then
assign the computed result as the value to the macro string and replace ${…}
with it; after that, the program will interpret and execute the code. For
example, if where gets assigned as BIRTHDAY>=date(1981,1,1) &&
GENDER=="F" according to the given condition in the example, the
expression in A2 will be =A1.select(BIRTHDAY>=date(1981,1,1)
&& GENDER=="F"). The parameter can be entered into
esProc’s Integration Development Environment (IDE), or can be passed from the
Java code or the command line.
A3=file("D:/result.txt").export@t(A2). This line of code exports the computed result to a file. If the size of
computing result is always small, use the code =A2.fetch()
in A3 to fetch the results into the memory for direct observation, or
use result A2.fetch() to return the results
to the Java application.
The final
result of this example is as follows:This example shows the method of realizing a dynamic query, that is, there is no need to change the code when the query condition changes, just modify the value of the parameter “where”. For example, if the condition becomes "query female employees born after January 1, 1981 inclusive, or employees whose FULLNAME is RebeccaMoore", the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After the code is executed, the result set of A2 will be as follows:
The above algorithm is a sequential computation. But the use of parallel computation can further improve the
performance. The method is: Import the file using multithreads, each of which
accesses a part of the file with a cursor; meanwhile query the data according
to the condition and finally merge the result of each cursor together.
A1=4. A1 is the number of segments, which
means the file will be divided into 4 segments. The number is equal to the
number of parallel tasks in operation, which generally should not exceed the
number of CPU cores. Otherwise the tasks will be queued for processing and the
efficiency won’t be really increased. The maximum number of parallel tasks can
be configured in the environment option.
A2=A1.(file("d:/employee.txt").cursor@z(;,
~:A1))
This line of code will generate four
cursors according to the specified number of segments. A1.(express)
means computing the expression with each member of A1 in order. "~" can be used
in the parentheses to represent the current member. Generally A1 is a set, like
["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting
with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.
file("d:/employee.txt ").cursor@z(;, ~:A1) surrounded in the parentheses is an expression, in which cursor function uses @z option to segment the file and fetch
each part with a cursor. ~:A1 means that the
file will be roughly divided into four segments (A1=4) and the ~th segment will
be fetched. "~" represents the current member in A1 and
each cursor corresponds to the first, the second, the third and the fourth
segment respectively.
Besides,
though exact division will result in incomplete lines, esProc can import complete
lines automatically by skipping the beginning half line of a segment and completing
the ending half line of the segment. This is why the file should be divided "roughly".
A3=A2.(~.select(${where})). This line of code will query
data of each cursor (i.e. ~) in A2 and select the eligible rows. The computed
results are still four cursors.
A4=A3.conj@xm(). This line of code will merge the four cursors in A3 in parallel.
A5=file("d:/result.txt").export(A4). This line of code will export the final result to a file.
No comments:
Post a Comment