10/30/2014

esPro Helps Process Structured Texts in Java – Data Import

Java provides only the most basic data import functions like specifying separators. But other common functions, such as importing specified columns by column names, specifying the order of columns and data types and that no separator should be used, need to be realized manually. Though they are not difficult to realize, the code is complicated and easy to get wrong.

But if esProc is used to help Java in developing the program, the problems will be dealt with without writing code manually. The following example will teach you how esProc works.


Here is a text file data.txt with tab being the separator and 30 columns. The first row of the file is the column names of business significance. It is required to import the following columns by column names: ID, x1Shift, x2Shift and ratio, and compute a new column - value - using the business formula "((x1Shift+x2Shift)/2)*ratio". The first rows and columns of the file are as follows:

To solve the problem in Java, we must sort out all the 30 columns, quote specified columns with subscripts and then perform the computation. If too many formulas are involved, the computation will become rather complicated and easy to get wrong. In order to reduce the error probability, we have to store each piece of data in an object and give each field a business name according to which the formula is computed.

But with the help of esProc, Java can avoid all these troubles. The code is written as follows:

A1import function is used to import the file, not all the 30 columns though. It will import the specified columns by column names instead. Parameter option @t means importing the first row of the file as the column names. The computed result of this step is as follows:

A2Perform computation directly according to the business names. The result is as follows:

In practice, sometimes the above result needs to be exported to a file. The code for this is =file("E:\\result.txt").export@t(A2.new(ID,value)). It means writing the two columns: ID and value, to the file result.txt. The content of the file is as follows:

If the computed result needs to be returned to Java for further use, we only need to write code in esProc like this, result A2.new(ID,value)), which means returning the columns ID and result to Java through JDBC interface, with a data type of resultSet. Then the result can be obtained by calling esProc script in Java through JDBC. The code is as follows:
         //create a connection using esProc jdbc
         Class.forName("com.esproc.jdbc.InternalDriver");
         con= DriverManager.getConnection("jdbc:esproc:local://");
         //call the esProc script file whose name is test
         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
         st.execute();//execute esProc stored procedure
         ResultSet set = st.getResultSet();   // get the result set

Sometimes the order of columns needs to be specified in order to manipulate data visually while data is imported. For example, import data from the same file data.txt according to a new order of the specified fields: x1Shift, x2Shift, radio and ID. esProc can specify the order automatically, and the code can be written as this:
         =file("E:\\data.txt").import@t(xShift,yShift,ratio,ID)

The computed result is as follows:

In the above code, esProc can set suitable data type automatically. For example, xShift and yShift will be set as float. But sometimes we need to specify the data types. For example, the data in ID is similar to integer but actually it is the string. If the first four characters of ID are to be imported separately, esProc will execute the following code:

A1Convert the data type of column ID by specially importing it as strings. The result is as follows:  

Note: It is a rule that strings are left justified and numbers are right justified in esProc's IDE, as shown in the above figure. 

A2Select the first four characters and the result is as follows:

During data import, sometimes there is not a separator. For example, data2.txt has 20 columns and part of the data is as follows:

It can be seen that there is no column separator in data2.txt and some data is just useless blank row. But esProc can import data correctly through executing the following code:

A1Import the data as a single-column table sequence with the column name being "_1" by default. The function option @s means the field will be imported directly without being split. The result is as follows:

A2A1.select(trim(_1)!="") is to select the non-blank rows. select function can make query by field name or row numbers. The result is as follows:

A3=A2.new(mid(_1,1,1),mid(_1,2,1),mid(_1,3,1),mid(_1,4,1),mid(_1,5,1),mid(_1,6,1), mid(_1,7,1),mid(_1,8,1),mid(_1,9,1),mid(_1,10,1),mid(_1,11,1),mid(_1,12,1),mid(_1,13,1),mid(_1,14,1),mid(_1,15,1),mid(_1,16,1),mid(_1,17,1),mid(_1,18,1),mid(_1,19,1),mid(_1,20,1))

This long line of code is to split each row of data into 20 fields. mid function has three parameters: the name of field to be split, starting and ending positions and the length of each split field. The result after splitting is as follows:

A3 shows the needed result. 

But the long code in A3 is inconvenient for checking and maintenance. We can simplify it using esProc's dynamic code as follows:

A4=20.loops(~~+"mid(_1," + string(~) + ",1),")

A5=exp=left(A4,len(A4)-1)

A6=eval("A2.new("+ A5+")")

loops function in A4 is used to perform loop computation and generate a regular string, that is, "mid(_1,1,1),mid(_1,2,1),mid(_1,3,1),mid(_1,4,1),mid(_1,5,1),mid(_1,6,1),mid(_1,7,1),mid(_1,8,1),mid(_1,9,1),mid(_1,10,1),mid(_1,11,1),mid(_1,12,1),mid(_1,13,1),mid(_1,14,1),mid(_1,15,1),mid(_1,16,1),mid(_1,17,1),mid(_1,18,1),mid(_1,19,1),mid(_1,20,1),"

Notice that there is a redundant comma at the end of the string, which can be removed using the code in A5.

A6Execute the dynamic script. eval function can parse the string dynamically to an expression. For example, eval("2+3") equals expression 2+3, whose value is 5. So actually the role of expression in A5 equals that of the code in A3 and their computed results are identical:


No comments:

Post a Comment