Data Process Engine: esProc Helps Process Structured Text in Java

Following problems will arise if you perform conditional filtering on text files in Java alone:

1. The text file is not a database, so it cannot be accessed by SQL. The code needs to be modified if the expression of grouping and summarizing is changed. Besides, if you want a flexible expression, you have to self-program the dynamic expression parsing and evaluating, resulting in a great amount of programming work.

2. The grouping result produced during traversing will be recorded. If the result is small in size, it can be stored directly in the memory; if the size of the result is too big, an intermediate result will have to be stored as cache files which should be merged later. The process will be quite complicated.

These problems can be solved with ready-made class library by introducing esProc to the programming in Java. Now let's look at in detail how this will happen.

The text file employee.txt has the employee information. It is required to group by DEPT, count the employees andsum up the total amount of their salary in each group.

The format of text file employee.txtis as follows:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY

1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000

2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000

3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000

4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000

5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000

6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000

7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000

8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000

9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000

10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 R&D 13000

11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000

12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000

13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000

…

Implementation approach: call esProc script with Java, import and compute the data, then return the result in the form of ResultSet to Java. Because esProc supports dynamic expression parsing and evaluating, it enables Java to process data from the text file as flexibly as SQL does.

For example, you are required to group by DEPT, count the employees and sum up the total amount of their salary in each group. esProc can usean input parameter "groupBy" as the dynamic grouping and summarizing condition, which is shown below:

The value of "groupBy" is DEPT:dept;count(~):count,sum(SALARY):salary. And the code written in esProc is as follows:

A1: Define a file object and import the data, with the first row being the title. tab is used as the field separator by default. esProc's IDE can display the imported data visually, as shown in the right part of the above figure.

A2: Group and summarize according to specified fields, using macro to realize parsing the expression dynamically. The "groupBy" in this process is an input parameter. In executing, esProc will first compute the expression enclosed by ${…}, then replace ${…} with the computed result acting as the macro string value and interpret and execute the code. The final code to be executed in this example is=A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).

A3: Return the eligible result set to the external program.

You just need to modify the parameter –"groupBy" when grouping fields are changed. For example, you are required to group by DEPT and GENDER, count the employees and sum up the total amount of salary in each group. The value of "groupBy" can be written as DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary.

The simple summarizing on all data can be regarded as a special case of grouping and summarizing operation. For example, when counting the number of employees and summating the total amount of salary, the value of parameter "groupBy" can be written as ;count(~):count,sum(SALARY):salary. That the parameter part for grouping is omitted means all data is put into one group. The advantage by doing so is that multiple summarizing results of these data can be computed by traversing them once.

The code of calling this piece of code (which is saved as test.dfx) in Java with esProc JDBC is as follows:

// create a connection using esProc JDBC

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

//call the program in esProc (the stored procedure); test is the name of file dfx

com.esproc.jdbc.InternalCStatementst;

st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");

//set the parameters

st.setObject(1,"DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");// the parameters are the dynamic grouping and summarizing fields

//execute the esProc stored procedure

st.execute();

//get the result set

ResultSet set = st.getResultSet();

If the script is simple, the code can be written directly into the program in Java that calls the esProc JDBC. It won’t be necessary to write a special script file (test.dfx):

st=(com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet set=st.executeQuery("=file(\"D:/employee.txt\").cursor@t().groups(DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary)");

This piece of code in Java calls a line of code in esProc script directly, that is, get the data from the text file and return the result set to set– the object of ResultSet.

If the result set of grouping is still too big to be entirely loaded to the memory, groupx statement will return the grouping result using file cursor. Thus the code written in esProc will be modified like this:

groups function puts the grouping and summarizing result completely in the memory.groupx will write the result into temporary files if the grouping and summarizing result is beyond the boundary of buffer rows, redistribute the memory, and then merge the temporary files. Here the parameter 1000000 refers to buffer rows. The principle of assigning value to it is to make the best of the memory, trying to reduce the number of temporary files as far as possible. The number of temporary files is related to the size of both the physical memory and the record, and should be evaluated during programming. Generally, the recommended number is between magnitudes of several hundred thousand to a magnitude of one million.

Though cell A3 returns a cursor, instead of a result set, to Java, it is no need to modify the calling program of Java. esProc will automatically fetch the data corresponding to the cursor while Java is traversing the data with ResultSet.

This piece of program can be further improved to support filtering before and after the grouping. Now the role of the program is similar to that of where and having in SQL. For example, the statistical object becomes female employees (GENDER=="F"), and it is required to retain only the departments where the number of female employees is greater than ten after grouping and summarizing operation. The code is as follows:

Cellset parameters are made absent here for easy understanding, yet the code is the same as that in the above:A2.groupx(${groupBy}). The parameter of select function can be written as the macro which will be passed from Java program.

Data Process Engine

menu

11/06/2014

esProc Helps Process Structured Text in Java – Grouping and Summarizing

No comments:

Post a Comment