Following problems will arise if you
perform conditional filtering on text files in Java alone:
1. The text file is not a database, so it
cannot be accessed by SQL. The code needs to be modified if the expression of
grouping and summarizing is changed. Besides, if you want a flexible expression,
you have to self-program the dynamic expression parsing and evaluating, resulting
in a great amount of programming work.
2. The grouping result produced during
traversing will be recorded. If the result is small in size, it can be stored
directly in the memory; if the size of the result is too big, an intermediate
result will have to be stored as cache files which should be merged later. The
process will be quite complicated.
These problems can be solved with
ready-made class library by introducing esProc to the programming in Java. Now
let's look at in detail how this will happen.
The text file employee.txt has the employee information. It is required to group
by DEPT, count the employees andsum up the total amount of their salary in each
group.
The format of text file employee.txtis as follows:
EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 R&D 13000
11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000
12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000
13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000
…
Implementation approach: call esProc script
with Java, import and compute the data, then return the result in the form of ResultSet to Java. Because esProc
supports dynamic expression parsing and evaluating, it enables Java to process
data from the text file as flexibly as SQL does.
For example, you are required to group by
DEPT, count the employees and sum up the total amount of their salary in each
group. esProc can usean input parameter "groupBy" as the dynamic grouping and summarizing condition,
which is shown below:
The value of "groupBy" is DEPT:dept;count(~):count,sum(SALARY):salary. And the code written in esProc is as follows:
A1: Define a file object and import the data, with the first row being the
title. tab is used as the field separator
by default. esProc's IDE can display the imported data visually, as shown in
the right part of the above figure.
A2: Group and summarize according to specified fields, using macro to
realize parsing the expression dynamically. The "groupBy" in this process is an
input parameter. In executing, esProc will first compute the expression
enclosed by ${…}, then replace ${…} with the computed result acting as the
macro string value and interpret and execute the code. The final code to be
executed in this example is=A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).
A3: Return the eligible result set to the external program.
You just need to modify the parameter –"groupBy" when
grouping fields are changed. For example, you are required to group by DEPT and
GENDER, count the employees and sum up the total amount of salary in each
group. The value of "groupBy" can be written as DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary.
The simple summarizing on all data can be
regarded as a special case of grouping and summarizing operation. For example,
when counting the number of employees and summating the total amount of salary,
the value of parameter "groupBy" can be written as ;count(~):count,sum(SALARY):salary. That the parameter part for
grouping is omitted means all data is put into one group. The advantage by
doing so is that multiple summarizing results of these data can be computed by
traversing them once.
The code of calling this piece of code (which
is saved as test.dfx) in Java with
esProc JDBC is as follows:
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
//call the program in esProc (the stored
procedure); test is the name of file dfx
com.esproc.jdbc.InternalCStatementst;
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call
test(?)");
//set the parameters
st.setObject(1,"DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");//
the parameters are the dynamic grouping and summarizing fields
//execute the esProc stored procedure
st.execute();
//get the result set
ResultSet set = st.getResultSet();
If the script is simple, the code can be
written directly into the program in Java that calls the esProc JDBC. It won’t
be necessary to write a special script file (test.dfx):
st=(com.
esproc.jdbc.InternalCStatement)con.createStatement();
ResultSet set=st.executeQuery("=file(\"D:/employee.txt\").cursor@t().groups(DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary)");
This piece of code in Java calls a line of
code in esProc script directly, that is, get the data from the text file and
return the result set to set– the
object of ResultSet.
groups function puts the grouping and summarizing result completely in the
memory.groupx will write the result
into temporary files if the grouping and summarizing result is beyond the
boundary of buffer rows, redistribute the memory, and then merge the temporary
files. Here the parameter 1000000 refers to buffer rows. The principle of assigning
value to it is to make the best of the memory, trying to reduce the number of
temporary files as far as possible. The number of temporary files is related to
the size of both the physical memory and the record, and should be evaluated
during programming. Generally, the recommended number is between magnitudes of several
hundred thousand to a magnitude of one million.
Though cell A3 returns a cursor, instead of
a result set, to Java, it is no need to modify the calling program of Java.
esProc will automatically fetch the data corresponding to the cursor while Java
is traversing the data with ResultSet.
Cellset parameters are made absent here for
easy understanding, yet the code is the same as that in the above:A2.groupx(${groupBy}). The parameter of select function can be written as the
macro which will be passed from Java program.
No comments:
Post a Comment