Data Process Engine: October 2014

10/30/2014

esPro Helps Process Structured Texts in Java – Data Import

Java provides only the most basic data import functions like specifying separators. But other common functions, such as importing specified columns by column names, specifying the order of columns and data types and that no separator should be used, need to be realized manually. Though they are not difficult to realize, the code is complicated and easy to get wrong.

But if esProc is used to help Java in developing the program, the problems will be dealt with without writing code manually. The following example will teach you how esProc works.

Here is a text file data.txt with tab being the separator and 30 columns. The first row of the file is the column names of business significance. It is required to import the following columns by column names: ID, x1Shift, x2Shift and ratio, and compute a new column - value - using the business formula "((x1Shift+x2Shift)/2)*ratio". The first rows and columns of the file are as follows:

To solve the problem in Java, we must sort out all the 30 columns, quote specified columns with subscripts and then perform the computation. If too many formulas are involved, the computation will become rather complicated and easy to get wrong. In order to reduce the error probability, we have to store each piece of data in an object and give each field a business name according to which the formula is computed.

But with the help of esProc, Java can avoid all these troubles. The code is written as follows:

A1: import function is used to import the file, not all the 30 columns though. It will import the specified columns by column names instead. Parameter option @t means importing the first row of the file as the column names. The computed result of this step is as follows:

A2: Perform computation directly according to the business names. The result is as follows:

In practice, sometimes the above result needs to be exported to a file. The code for this is =file("E:\\result.txt").export@t(A2.new(ID,value)). It means writing the two columns: ID and value, to the file result.txt. The content of the file is as follows:

If the computed result needs to be returned to Java for further use, we only need to write code in esProc like this, result A2.new(ID,value)), which means returning the columns ID and result to Java through JDBC interface, with a data type of resultSet. Then the result can be obtained by calling esProc script in Java through JDBC. The code is as follows:

//create a connection using esProc jdbc

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

//call the esProc script file whose name is test

st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");

st.execute();//execute esProc stored procedure

ResultSet set = st.getResultSet(); // get the result set

Sometimes the order of columns needs to be specified in order to manipulate data visually while data is imported. For example, import data from the same file data.txt according to a new order of the specified fields: x1Shift, x2Shift, radio and ID. esProc can specify the order automatically, and the code can be written as this:

=file("E:\\data.txt").import@t(xShift,yShift,ratio,ID)

The computed result is as follows:

In the above code, esProc can set suitable data type automatically. For example, xShift and yShift will be set as float. But sometimes we need to specify the data types. For example, the data in ID is similar to integer but actually it is the string. If the first four characters of ID are to be imported separately, esProc will execute the following code:

A1: Convert the data type of column ID by specially importing it as strings. The result is as follows:

Note: It is a rule that strings are left justified and numbers are right justified in esProc's IDE, as shown in the above figure.

A2: Select the first four characters and the result is as follows:

During data import, sometimes there is not a separator. For example, data2.txt has 20 columns and part of the data is as follows:

It can be seen that there is no column separator in data2.txt and some data is just useless blank row. But esProc can import data correctly through executing the following code:

A1: Import the data as a single-column table sequence with the column name being "_1" by default. The function option @s means the field will be imported directly without being split. The result is as follows:

A2: A1.select(trim(_1)!="") is to select the non-blank rows. select function can make query by field name or row numbers. The result is as follows:

A3：=A2.new(mid(_1,1,1),mid(_1,2,1),mid(_1,3,1),mid(_1,4,1),mid(_1,5,1),mid(_1,6,1), mid(_1,7,1),mid(_1,8,1),mid(_1,9,1),mid(_1,10,1),mid(_1,11,1),mid(_1,12,1),mid(_1,13,1),mid(_1,14,1),mid(_1,15,1),mid(_1,16,1),mid(_1,17,1),mid(_1,18,1),mid(_1,19,1),mid(_1,20,1))

This long line of code is to split each row of data into 20 fields. mid function has three parameters: the name of field to be split, starting and ending positions and the length of each split field. The result after splitting is as follows:

A3 shows the needed result.

But the long code in A3 is inconvenient for checking and maintenance. We can simplify it using esProc's dynamic code as follows:

A4：=20.loops(~~+"mid(_1," + string(~) + ",1),")

A5：=exp=left(A4,len(A4)-1)

A6：=eval("A2.new("+ A5+")")

loops function in A4 is used to perform loop computation and generate a regular string, that is, "mid(_1,1,1),mid(_1,2,1),mid(_1,3,1),mid(_1,4,1),mid(_1,5,1),mid(_1,6,1),mid(_1,7,1),mid(_1,8,1),mid(_1,9,1),mid(_1,10,1),mid(_1,11,1),mid(_1,12,1),mid(_1,13,1),mid(_1,14,1),mid(_1,15,1),mid(_1,16,1),mid(_1,17,1),mid(_1,18,1),mid(_1,19,1),mid(_1,20,1),"

Notice that there is a redundant comma at the end of the string, which can be removed using the code in A5.

A6：Execute the dynamic script. eval function can parse the string dynamically to an expression. For example, eval("2+3") equals expression 2+3, whose value is 5. So actually the role of expression in A5 equals that of the code in A3 and their computed results are identical:

esProc Getting Started: Basic Usage of JDBC

esProc can be embedded into Java program. So the latter can call the cellset program written in esProc using a way of connection such as JDBC. The method of calling the esProc program is the same as that of calling the stored procedure. The following is a brief introduction to esProc JDBC.

1. Description of the jars of esProc JDBC

esProc JDBC is like an incomplete database JDBC driver without physical tables. It can be regarded simply as a database that only supports the stored procedure. In addition, it is a built-in computing engine, thus no standalone servers are needed.

esProc JDBC has five basic jars, which are all situated in \esProc\ lib in installation directory:

dm.jar esProc computing engine and JDBC driver

poi-3.7-20101029.jar process the access of Excel files

log4j_128.jar process logs

icu4j_3_4_5.jar handle internationalization

dom4j-1.6.1.jar parse the configuration files

If other databases are to be used as the datasources of esProc JDBC, then the driver jars of these databases are required to be in place. For example, hsqldb.jar is necessary to use the demo database. Please note the esProc JDBC requires JDK1.6 or higher versions.

2. Basic usage of esProc JDBC

In the cellset code, the result set is returned by result statement.

In the application code, arg1 is a cellset parameter. This dfx file will be named as test.dfx.

Note: The result set of dfx is returned by result statement. When dfx is called, the parameter names that receive the parameters won’t be used; the values of parameters will be assigned according to their order instead.

(1) Load the jars to be used. Load the basic jars of esProc JDBC mentioned above while launching the Java application. These jars can be put in the directory of WEB-INF/lib under a web application.

(2) Deploy dfxConfig.xml, config.xml and the dfx file

Prepare file config.xml, which contains the basic configuration information of esProc, such as registration code, searching path, datasource configuration, etc. The file can be found in the path esProc/config in esProc's installation directory. The information in it is the same as that set in the esProc Option page. The configuration is allowed to be adjusted before deployment (like modifying the Searching path which is used to search the dfx file):

Or the datasources necessary for dfx can be configured in the Data Source Explorer:

After the modification, config.xml and dfxConfig.xml, which are situated in esProc\classes in the esProc's installation directory, will be saved in the class path of the application that will use them.

Put the test.dfx created in Step 1 in the class path of the application, or put it in the path specified by file dfxConfig.xml's <paths/> node (i.e. the above-mentioned Searching path).

(1) Further configure file dfxConfig.xml manually if necessary. For detailed operation, please refer to related documents. Please note the name of configured file should still be dfxConfig.xml and must not be changed.

(2) Call dfx in Java program.

public void testDataServer(){

Connection con = null;

com.esproc.jdbc.InternalCStatement st;

com.esproc.jdbc.InternalCStatement st2;

try{

//create a connection

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

//call the stored procedure in which test is the name of dfx file

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");

//set the parameters

st.setObject(1,"3");

//the result obtained by executing the following code is the same as that obtained using the above calling method

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call test(3)");

//execute the stored procedure

st.execute();

//get result set

ResultSet set = st.getResultSet();

}

catch(Exception e){

System.out.println(e);

}

finally{

//close the connection

if (con!=null) {

try {

con.close();

}

catch(Exception e) {

System.out.println(e);

}

To know more about calling methods and configuration, please refer to documents that cover a more in-depth discussion at this point.

10/29/2014

esProc Getting Started: Multilayer Parameters

esProc provides a large number of functions, many of which use many parameters. In order to clearly judge the positions of these parameters and make writing and reading easier, esProc is specially equipped with multilayer separators of function parameters.

1. Separators of function parameters

Colon (:), comma (,) and semicolon (;) are used as separators of function parameters in esProc. Their priority decreases in turn.

The most common method is to use commas to separate parameters, which is in line with function syntax in most programming languages. For example:

In this example, parameters of if(), create() and T.insert() function in A3, A4 and A5 are separated from each other by commas. After the code in A3 is executed, the result of A4 is as follows:

Some functions have "coupled" parameters, which are closely related or work together. In this case, a colon is often used to separate them.

For example, for if() function in A3, each condition corresponds to a returned result and colons are used to separate the results. In the in() function used in a condition, 150 and 180 are also separated by a colon and together they form a numerical interval [150,180]. In A5's T.insert() function, field value and field name also come in pairs with colons in between to separate them. After execution, the result in A4 is as follows:

In some functions, indicative parameters can be added to certain parameters to change the computational method relating to them. In this case, colons are usually used as separators. See below:

Both A2 and A3 sort records of cities according to state ID first, then sort by name if cities belong to the same state. Difference is that -1 is appended after NAME in A3’s function, meaning that sorting by name is in a descending order. The results in A2 and A3 are as follows:

Sometimes, parameters in a function can be divided into different parts according to their roles. Semicolons are usually used to separate these parts.

In A2's groups() function, the parameter before the semicolon is used for grouping, and those after it are for summarizing computation, whose parameters are separated by a colon to define name of the summarizing field. The result in A2 is as follows:

In A3's top() function, the parameter after the semicolon defines that the top 5 records are fetched. The result is as follows:

In some functions, parameters are quite many. Usually these parameters are divided into several groups which are separated by semicolons:

T.switch() function in A4 transforms different fields into records of another table sequence, and a semicolon is used here. Besides, comma, colon and semicolon are all used in A4 as separators. This kind of code writing creates clear layers for function parameters. The result in A4 is as follows:

2. Omission of function parameters

Some esProc parameter functions have default values and, therefore, can be omitted, making functions more concise.

The parameter after a colon is generally used to complement another’s computational model. If default mode is used, the parameter can be omitted. For example:

In A5, parameters after the colons are used to designate field names corresponding to certain values when inserting records. Parameters for designating field names can be omitted in A6 because they use default field names to set field values one by one. Two records are inserted respectively in A5 and A6, then the result in A4 is as follows:

But colons cannot be omitted when they are used to separate intervals:

In A3, in(B1+B2,180:) and in(B1+B2,:120) represent respectively B1+B2>=180 and B1+B2<=120, in which the colons cannot be omitted. The result in A3 is as follows:

For parameters separated by semicolons, the semicolon can be omitted if there is no parameter after it. For example, if n is not set in A.top() function and only the top one is needed; or when the transformation of a certain field is not needed in T.switch() function.

When commas are used as separators, they should generally be retained if parameters are set by default. For example:

Expression in A2 is equal to =A1.to(4,A1.len()), expression in A3 is equal to =A1.to(1,4). Let’s compare the results in A2 and A3:

esProc Getting Started: Option Syntax

1. esProc's function options

Many functions in esProc can use function options with which the same function can have different work patterns. The basic format of function options is f@o(…) in which o is f function’s option. For example:

We can use interval function to compute the number of days between two dates. In expressions in A3 and A4, @y and @m are interval function’s options. With these options, the function will use the year and the month as computational units in computing time interval. With function options, a function can meet multiple needs, extend its role and avoid too many function names or function parameters. The number of days, years and months between the two dates are computed respectively in A2, A3 and A4, as shown below:

2. Common function options

Some function options are common in esProc. They can be used by many functions.

@1 and @a

@1 option and @a option are frequently used by position, select and join functions , such as A.pos(), A.select(), A.pmax(), A.pselect(), A.minp(), P.align(), etc.

Use of @a option can make functions that return the first query result by default return all eligible results. Contrary to @a, functions that return multiple members by default will return only the first query result by using @1 option.

We’ll look at the use of these two options through some examples:

The results in A2, A3, A4 and A5 are as follows:

The cities whose names start with a C are selected in A2:

The first city whose name starts with a C is selected in A3:

The first cities whose STATEIDs are 6, 35 and 40 respectively are selected in A4:

All cities whose STATEIDs are 6, 35 and 40 are selected and grouped by states in A5:

By the way, since digit 1 is difficult to distinguish from the lowercase letter l, in most cases, the former is used in esProc's options.

One other thing need to be made known is that the same option in different functions can have different meanings. For example, @a used in position functions means returning all results, while in functions for file writing, like f.write() and f.export(), it means appending.

@z

@z option is often used in some functions related to order, like sort, position, select, etc. Such as A.rank(), A.sort(), A.pos(), A.pselect(), A.select() and so on.

With @z option, positioning or selecting data in a sequence or a table sequence will be executed from back to front. Thus in this example, A2 returns the position of the last 2:

A3 returns results of sorting in descending order:

Records obtained in A5 are also sorted in descending order:

@b

@b option is often used in functions to position, select, etc., like A.pos(), A.pselect(), A.select(), and so on. The use of @b option is usually accompanied by binary search algorithm, which is more efficient in query, with the prerequisite that A is ordered; otherwise results may be wrong.

Since data in A1 is not sorted according to state ID, only one result is obtained in A2 with the binary search when @b option is used:

while A4 obtains the correct results because data is sorted in A3:

But it is another thing when @b is used in functions to read and write, such as f.import(), f.export(), f.cursor(). In this case, it is the binary files that being read in or written out. In esProc, binary files use less memory space and have faster access speed, therefore the use of binary files will bring higher efficiency.

3. Use multiple options simultaneously

Multiple options can be used simultaneously in esProc when needed. There is no specific order among these options. For example:

Because both A 2 and A3 get positions of all the 2 in A1 from back to front, their results are identical:

A5 gets the first city whose STATEID is 5 from back to front:

Note that some function options are mutually exclusive , thus cannot be used simultaneously. Such as @a and @1, or @t and @b, options of f.import().

menu