Data Process Engine: November 2014

11/30/2014

esProc Helps with Computation in MongoDB – Cross Summarizing

Cross summarizing is hard to do in MongoDB itself, and it is still quite complicated in a high-level language such as Java once the data has been retrieved. In such cases, esProc can help MongoDB with the computation. The following example shows how.

Given a collection, student, with the following data:

db.student.insert ( {school:'school1', sname : 'Sean' , sub1: 4, sub2 :5 })

db.student.insert ( {school:'school1', sname : 'chris' , sub1: 4, sub2 :3 })

db.student.insert ( {school:'school1', sname : 'becky' , sub1: 5, sub2 :4 })

db.student.insert ( {school:'school1', sname : 'sam' , sub1: 5, sub2 :4 })

db.student.insert ( {school:'school2', sname : 'dustin' , sub1: 2, sub2 :2 })

db.student.insert ( {school:'school2', sname : 'greg' , sub1: 3, sub2 :4 })

db.student.insert ( {school:'school2', sname : 'peter' , sub1: 5, sub2 :1 })

db.student.insert ( {school:'school2', sname : 'brad' , sub1: 2, sub2 :2 })

db.student.insert ( {school:'school2', sname : 'liz' , sub1: 3, sub2 :null })

The goal is a cross table like the following, in which each row is a school, the first column holds the number of students whose sub1 score is 5, the second column the number whose sub1 score is 4, and so forth.

esProc script:

A1: Connect to MongoDB. The host and port are localhost:27017. The database name, user name, and password are all test.

A2: Use the find function to fetch the student collection from MongoDB and create a cursor. esProc's find function takes parameters in the same format as MongoDB's find statement. Because esProc's cursor supports fetching and processing data in batches, it avoids the memory overflow that importing a large data set all at once can cause. Here the data set is small, so the fetch function retrieves it all at once.

A3: Group the data by school.

A4: Align each school's group to the sequence [1,2,3,4,5] by sub1 score, and compute the length of each subgroup.

A5: Put the lengths obtained in A4 into the required positions to generate a record sequence as the result.

The result is as follows:

Note: esProc does not ship with MongoDB's Java driver. To access MongoDB from esProc, first put the driver (esProc requires version 2.12.2 or above, e.g. mongo-java-driver-2.12.2.jar) into [esProc installation directory]\common\jdbc.

The esProc script is easy to integrate into a Java program. Just add one more line of code, result A6, to return the result to the Java program as a ResultSet. For the detailed code, please refer to the esProc Tutorial. Likewise, MongoDB's Java driver must be on the classpath of a Java program that accesses MongoDB by calling an esProc program.
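To make steps A3 to A5 concrete, here is a minimal plain-Java sketch of the same cross summarizing, run on an in-memory copy of the student rows above. It is not the esProc script (which is not reproduced in this text); the class and method names are made up for illustration, and no MongoDB connection is involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CrossSummary {
    // One student document; sub2 is left out because the cross table does not use it.
    static final class Student {
        final String school;
        final String sname;
        final int sub1;
        Student(String school, String sname, int sub1) {
            this.school = school;
            this.sname = sname;
            this.sub1 = sub1;
        }
    }

    // Column order of the cross table: sub1 scores 5, 4, 3, 2, 1.
    static final int[] SCORES = {5, 4, 3, 2, 1};

    // In-memory copy of the rows inserted into db.student above.
    static List<Student> sampleData() {
        return Arrays.asList(
            new Student("school1", "Sean", 4), new Student("school1", "chris", 4),
            new Student("school1", "becky", 5), new Student("school1", "sam", 5),
            new Student("school2", "dustin", 2), new Student("school2", "greg", 3),
            new Student("school2", "peter", 5), new Student("school2", "brad", 2),
            new Student("school2", "liz", 3));
    }

    // Group by school, then count students per sub1 score.
    // Scores nobody achieved keep a count of 0, like an empty aligned subgroup.
    static Map<String, int[]> crossCount(List<Student> students) {
        Map<String, int[]> table = new LinkedHashMap<>();
        for (Student s : students) {
            int[] row = table.computeIfAbsent(s.school, k -> new int[SCORES.length]);
            for (int i = 0; i < SCORES.length; i++) {
                if (SCORES[i] == s.sub1) {
                    row[i]++;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, int[]> e : crossCount(sampleData()).entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
        // prints:
        // school1 [2, 2, 0, 0, 0]
        // school2 [1, 0, 2, 2, 0]
    }
}
```

Each int[] row plays the role of one record in A5: position 0 is the count for score 5, position 1 the count for score 4, and so on.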

11/27/2014

esProc Simplifies SQL-style Computations – In-group Computation

When developing database applications, we often need to perform computations within each group of grouped data: for example, listing the students who have published papers in each of the past three years, finding the employees who have taken part in every previous training session, or selecting the three days on which each client got the highest scores in a golf game. In SQL, such computations need multi-layered nested queries, which makes the code difficult to understand and maintain. By contrast, esProc handles this kind of in-group computation well and is easy to integrate with Java and reporting tools. We’ll illustrate this through an example.

From the database table SaleData, select the clients whose monthly sales amount is in the top 20 in every month of 2013. Part of the data of SaleData is as follows:

To complete the task, first select the sales data of the year of 2013, and then group the data by the month and, in each group, select the clients whose monthly sales amount is in the top 20. Finally, compute the intersection of these groups.

With esProc we can split this complicated problem into several steps and then get the final result. First, retrieve the data of 2013 from SaleData and group it by the month:

Note: The code for filtering in A2 can also be written in SQL.

esProc performs real grouping: it separates the data into multiple groups. This differs from SQL, whose group by command computes the summary value of each group directly and does not keep the intermediate results of the grouping. After grouping, the data in A3 are as follows:

esProc sorts the data automatically before grouping. Each group is a set of sales data. The data of March, for example, are as follows:

To compute every client's sales amount for each month, we need to group the data a second time, by client. In esProc, this is done by looping over each month's data and grouping it separately. A.(x) executes the loop over the members of a group, so no explicit loop code is needed.

A4: =A3.(~.group(Client))

After the second grouping, each month's data in A4 is divided into subgroups, one per client:

At this point, the data of March are as follows:

It can be seen that each group of data in March contains the sales data of a certain client.

Please note "~" in the above code represents each member of the group, and the code written with "~" is called in-group computation code, like the above-mentioned ~.group(Client).

Next, select the clients whose rankings of each month are in the top 20 through the in-group computation:

A5: =A4.(~.top(-sum(Amount);20))

A6: =A5.(~.new(Client,sum(Amount):MonthAmount))

A5 computes the top 20 clients of each month in sales amount by looping each month's data. A6 lists the clients and their sales amount every month. The result of A6 is as follows:

Finally, list the field Client of each subgroup and compute the intersection of the subgroups:

A7: =A6.(~.(Client))

A8: =A7.isect()

A7 extracts the Client field from each month's top 20 clients. A8 computes the intersection of these client lists across the twelve months. The result is as follows:

As this problem shows, esProc makes in-group computation on structured data easy, including the second grouping and sorting, and each step of the data processing is clear and visible. Moreover, operations such as looping over the members of a group or computing an intersection are simpler in esProc, which reduces the amount of code significantly.

A Java program calls esProc in much the same way as it calls an ordinary database. The JDBC driver provided by esProc returns the computed result to the Java main program as a ResultSet. For more details, please refer to the related documents.
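For comparison, the logic of steps A3 to A8 (group by month, group again by client, keep the top clients of each month, intersect the months) can be sketched in plain Java. This is only an illustration under assumptions: the Sale class is a hypothetical stand-in for a SaleData row, the sample rows are invented, and the demo uses the top 2 over three months instead of the top 20 over twelve.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

class MonthlyTopClients {
    // Hypothetical stand-in for one SaleData row; only the fields used here.
    static final class Sale {
        final int month;
        final String client;
        final double amount;
        Sale(int month, String client, double amount) {
            this.month = month;
            this.client = client;
            this.amount = amount;
        }
    }

    // Invented sample rows, not taken from the SaleData table.
    static List<Sale> sampleData() {
        return Arrays.asList(
            new Sale(1, "A", 100), new Sale(1, "B", 80), new Sale(1, "C", 50), new Sale(1, "A", 30),
            new Sale(2, "B", 200), new Sale(2, "A", 160), new Sale(2, "C", 150),
            new Sale(3, "C", 300), new Sale(3, "A", 90), new Sale(3, "B", 10));
    }

    // Returns the clients whose monthly total is among the n largest in every month.
    static Set<String> alwaysInTop(List<Sale> sales, int n) {
        // Group by month, then by client, summing the amounts (A3, A4).
        Map<Integer, Map<String, Double>> byMonth = new TreeMap<>();
        for (Sale s : sales) {
            byMonth.computeIfAbsent(s.month, k -> new HashMap<>())
                   .merge(s.client, s.amount, Double::sum);
        }
        Set<String> result = null;
        for (Map<String, Double> totals : byMonth.values()) {
            // Top n clients of this month by total amount (A5 to A7).
            Set<String> top = totals.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
            // Intersect with the months seen so far (A8).
            if (result == null) {
                result = new TreeSet<>(top);
            } else {
                result.retainAll(top);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        System.out.println(alwaysInTop(sampleData(), 2)); // prints [A]
    }
}
```

The explicit loops and the nested map are exactly the bookkeeping that the "~" notation hides in the esProc version.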

11/26/2014

esProc Simplifies SQL-style Computations – Records Corresponding to Max Value

In developing database applications, we usually need to retrieve the records corresponding to the max/min value rather than the value itself: for example, the occasion on which each employee got his/her biggest pay raise, the three lowest scores ever achieved in golf, or the five days in each month when each product had its highest sales amount. Because SQL's max function retrieves only the max value, not the records it corresponds to, handling such computations in SQL is quite complicated and requires advanced techniques such as window functions, nested sub-queries, or keep/top/row number. If multi-layered grouping or relations are involved, the computation becomes even more complicated.

With the top function in esProc, the records corresponding to the max value can be retrieved and the computation becomes much easier. The following is such an example.

The database table golf contains the scores of members in a golf club. Please select the best three scores each member has ever got. Part of the data is as follows:

The code written in esProc:

A1: Retrieve data from the database. If the data come from a structured text file, the following equivalent code can be used: =file("\\golf").import@t(). Click the cell and we can check the retrieving result:

A2: =A1.group(User_ID), i.e., group the result of A1. The result is as follows:

As shown in the above figure, the data have been separated into multiple groups by User_ID and each row is a group. Click the blue hyperlink and members of the group will be shown as follows:

A3: =A2.(~.top(-Score;3)). The code is to compute the records of each group of data whose field Score is in the top three. Here "~" represents each group of data. ~.top() represents that top function will work on every group of data in turn. The top function can retrieve the N biggest/smallest records from a data set. For example, top(Score;3) represents sorting by Score in ascending order and fetching the first three records (i.e. min values); top(-Score;3) represents sorting in descending order and fetching the first three records (i.e. max values). The result of this step is as follows:

A4: =A3.union(), which unions the data of all groups. The result is as follows:

The computation above is performed step by step, but the steps can be combined into one statement for the convenience of maintenance and debugging: db.query("select * from golf").group(User_ID).(~.top(-Score;3)).union().
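As a rough illustration of what group followed by ~.top(-Score;3) computes, the following plain-Java sketch groups scores by member and keeps the three highest of each. The rows are invented (the golf table's data is not reproduced in this text), each row is simplified to a {User_ID, Score} pair, and the method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopScoresPerMember {
    // Invented sample rows: each int[] is {User_ID, Score}.
    static List<int[]> sampleData() {
        return Arrays.asList(
            new int[]{1, 70}, new int[]{1, 85}, new int[]{1, 90}, new int[]{1, 60},
            new int[]{2, 88}, new int[]{2, 75});
    }

    // Group by User_ID, then keep the n highest scores of each group.
    static Map<Integer, List<Integer>> bestScores(List<int[]> rows, int n) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] r : rows) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        Map<Integer, List<Integer>> best = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
            List<Integer> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.reverseOrder()); // descending, like top(-Score;...)
            best.put(e.getKey(), new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size()))));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestScores(sampleData(), 3));
        // prints {1=[90, 85, 70], 2=[88, 75]}
    }
}
```

A member with fewer than three scores simply keeps all of them, which is why the sketch clamps the sublist length with Math.min.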

11/25/2014

esProc Helps Process Heterogeneous Data Sources in Java – Cross-Database Relating

JoinRowSet and FilteredRowSet, provided by RowSet (Java's class library for data computing), can perform cross-database related computing, but they have a number of weaknesses. First, JoinRowSet supports only inner join, not outer join. Second, tests show that db2, mysql and hsql work with JoinRowSet, yet joining oracle11g to other databases yields an empty result set with no error reported. Two users who performed a cross-database join with an oracle11g database did, however, get the correct result, which suggests that the JDBC drivers from different database providers can affect the result of this method. Last, the code is complicated.

esProc has proved capable of assisting Java with cross-database relating. It works with various databases, such as oracle, db2, mysql, sqlserver, sybase and postgresql, and performs a variety of cross-database related computing, including inner join and outer join on heterogeneous data. The following example shows how esProc works. Requirement: relate table sales in db2 to table employee in mysql through sales.sellerid and employee.eid, and then filter the data in both sales and employee by the criterion state="California". The code written for this task also applies when other types of databases are involved.

The structure and data of table sales are as follows:

The structure and data of table employee are as follows:

Implementation approach: Call esProc script using Java program, join the multiple databases together to realize the cross-database relating, perform filtering and return the result to Java in the form of ResultSet.

The code written in esProc is as follows:

A1: Connect to the data source db2 configured in advance.

A2: Connect to the data source mysql configured in advance. Oracle and other types of databases can be used too.

A3, A4: Retrieve the table sequences sales and employee from db2 and mysql respectively. esProc's Integrated Development Environment (IDE) can display the retrieved data visually, as shown in the right part of the figure above.

A5: Relate sales to employee through sellerid=eid using esProc's object reference mechanism.

A6: Filter the two table sequences according to state="California".

A7: Generate a new table sequence and get the desired fields.

A8: Return the result to the caller of the esProc program.

This piece of program is called in Java using esProc JDBC to get the result. The code is as follows (save the above esProc program as test.dfx):

// needs java.sql.Connection, java.sql.DriverManager and java.sql.ResultSet

// create a connection using esProc JDBC

Class.forName("com.esproc.jdbc.InternalDriver");

Connection con = DriverManager.getConnection("jdbc:esproc:local://");

// call the esProc program (as a stored procedure); test is the name of the dfx file

com.esproc.jdbc.InternalCStatement st;

st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test()");

// execute the esProc stored procedure

st.execute();

// get the result set

ResultSet set = st.getResultSet();
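To show what the relate, filter and project steps (A5 to A7) amount to, here is a plain-Java sketch that joins two in-memory lists through sellerid=eid. It assumes both tables have already been fetched from their databases. Apart from sellerid, eid and state, the field names are hypothetical, because the table structures are not reproduced in this text, and the sample rows are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CrossDatabaseJoin {
    // Hypothetical shape of an employee row (would come from mysql).
    static final class Employee {
        final int eid;
        final String name;
        final String state;
        Employee(int eid, String name, String state) {
            this.eid = eid;
            this.name = name;
            this.state = state;
        }
    }

    // Hypothetical shape of a sales row (would come from db2).
    static final class Sale {
        final int orderId;
        final int sellerId;
        final int amount;
        Sale(int orderId, int sellerId, int amount) {
            this.orderId = orderId;
            this.sellerId = sellerId;
            this.amount = amount;
        }
    }

    // Inner join on sellerId = eid, filter by state, project three fields as CSV text.
    static List<String> salesInState(List<Sale> sales, List<Employee> employees, String state) {
        Map<Integer, Employee> byEid = new HashMap<>();
        for (Employee e : employees) {
            byEid.put(e.eid, e); // index the small table by its key
        }
        List<String> rows = new ArrayList<>();
        for (Sale s : sales) {
            Employee seller = byEid.get(s.sellerId); // null when no employee matches
            if (seller != null && state.equals(seller.state)) {
                rows.add(s.orderId + "," + seller.name + "," + s.amount);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Employee> employees = Arrays.asList(
            new Employee(1, "Ann", "California"), new Employee(2, "Bob", "Texas"));
        List<Sale> sales = Arrays.asList(
            new Sale(101, 1, 500), new Sale(102, 2, 300),
            new Sale(103, 1, 250), new Sale(104, 9, 100));
        System.out.println(salesInState(sales, employees, "California"));
        // prints [101,Ann,500, 103,Ann,250]
    }
}
```

Looking the seller up in a map is a rough analogue of an object reference: each sales row reaches its employee record directly instead of rescanning the employee list.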

11/24/2014

esProc Helps with Computation in MongoDB – Relationships between Tables

MongoDB doesn’t support join. The unity JDBC recommended by its official website can perform the join operation after retrieving the data, but it charges a fee for this. Other free JDBC drivers support only basic SQL statements, which do not include join. Retrieving the data with a programming language such as Java and then performing the join yourself is still complicated.

However, the join operation can be realized with esProc, which is free of charge, helping MongoDB with the computation. Here is an example that illustrates the method in detail.

In MongoDB, there is a collection, orders, that holds the order data, and another collection, employee, that stores the employee data, as shown in the following:

MongoDB shell version: 2.6.4

connecting to: test

> db.orders.find();

{ "_id" : ObjectId("5434f88dd00ab5276493e270"), "ORDERID" : 1, "CLIENT" : "UJRNP", "SELLERID" : 17, "AMOUNT" : 392, "ORDERDATE" : "2008/11/2 15:28" }

{ "_id" : ObjectId("5434f88dd00ab5276493e271"), "ORDERID" : 2, "CLIENT" : "SJCH", "SELLERID" : 6, "AMOUNT" : 4802, "ORDERDATE" : "2008/11/9 15:28" }

{ "_id" : ObjectId("5434f88dd00ab5276493e272"), "ORDERID" : 3, "CLIENT" : "UJRNP", "SELLERID" : 16, "AMOUNT" : 13500, "ORDERDATE" : "2008/11/5 15:28" }

{ "_id" : ObjectId("5434f88dd00ab5276493e273"), "ORDERID" : 4, "CLIENT" : "PWQ", "SELLERID" : 9, "AMOUNT" : 26100, "ORDERDATE" : "2008/11/8 15:28" }

…

> db.employee.find();

{ "_id" : ObjectId("5437413513bdf2a4048f3480"), "EID" : 1, "NAME" : "Rebecca", "SURNAME" : "Moore", "GENDER" : "F", "STATE" : "California", "BIRTHDAY" : "1974-11-20", "HIREDATE" : "2005-03-11", "DEPT" : "R&D", "SALARY" : 7000 }

{ "_id" : ObjectId("5437413513bdf2a4048f3481"), "EID" : 2, "NAME" : "Ashley", "SURNAME" : "Wilson", "GENDER" : "F", "STATE" : "New York", "BIRTHDAY" : "1980-07-19", "HIREDATE" : "2008-03-16", "DEPT" : "Finance", "SALARY" : 11000 }

{ "_id" : ObjectId("5437413513bdf2a4048f3482"), "EID" : 3, "NAME" : "Rachel", "SURNAME" : "Johnson", "GENDER" : "F", "STATE" : "New Mexico", "BIRTHDAY" : "1970-12-17", "HIREDATE" : "2010-12-01", "DEPT" : "Sales", "SALARY" : 9000 }

…

The SELLERID in orders corresponds to EID in employee. The task is to select all orders whose seller's state is California. orders holds a large amount of data and thus cannot be loaded into memory entirely, but employee is small, and so is the filtering result of orders.

The conditional expression for selecting data can be passed to the esProc program as a parameter, as shown in the following figure:

esProc code:

A1: Connect to MongoDB. The host and port are localhost:27017. The database name, user name, and password are all test.

A2: The find function fetches data from MongoDB and creates a cursor. orders is the collection, the filtering condition is null, and the specified key _id is not fetched. esProc's find function takes parameters in the same format as MongoDB's find statement. esProc's cursor supports fetching and processing data in batches, thereby avoiding the memory overflow caused by importing a large data set all at once.

A3: Fetch data from employee. Because the data size to be fetched is not big, you can use fetch function to get them all at once.

A4: The switch function converts the values of the SELLERID field in A2's cursor into references to the records of employee in A3.

A5: Select the desired data according to the condition. A macro is used here to parse an expression dynamically; where is the input parameter. In esProc, an expression enclosed in ${…} is computed first, its result replaces ${…} as the macro string value, and then the code is interpreted and executed. Therefore the code actually executed in this step is =A4.select(SELLERID.STATE=="California"). Since SELLERID has been converted into references to the corresponding records in employee, you can write SELLERID.STATE. As the filtering result is small, you can fetch it all at once. If the result is still rather large, you can fetch it in batches, say 10,000 rows per batch, which is expressed as fetch(10000).

A6: Convert the values of SELLERID in the filtering result back into ordinary values.

The result of A6 is as follows:

When the filtering condition changes, you need not change the program; just modify the parameter where. For example, if the condition becomes "order information in which state is California or CLIENT is PWQ", the value of where can be written as CLIENT=="PWQ"|| SELLERID.STATE=="California".
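The idea of steps A4 and A5, replacing SELLERID with a reference to the employee record and passing the condition in from outside, can be sketched in plain Java as follows. This is an analogy rather than the esProc mechanism: a Predicate object stands in for the where macro string, an Iterator stands in for the cursor, and the sample rows and all class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class OrderFilter {
    static final class Employee {
        final int eid;
        final String state;
        Employee(int eid, String state) { this.eid = eid; this.state = state; }
    }

    // An order as read from the cursor: the seller is still a plain id.
    static final class RawOrder {
        final int orderId;
        final String client;
        final int sellerId;
        RawOrder(int orderId, String client, int sellerId) {
            this.orderId = orderId; this.client = client; this.sellerId = sellerId;
        }
    }

    // An order after the switch step: the seller is a record reference (may be null).
    static final class Order {
        final int orderId;
        final String client;
        final Employee seller;
        Order(int orderId, String client, Employee seller) {
            this.orderId = orderId; this.client = client; this.seller = seller;
        }
    }

    // Invented sample data.
    static Map<Integer, Employee> sampleEmployees() {
        Map<Integer, Employee> m = new HashMap<>();
        m.put(1, new Employee(1, "California"));
        m.put(2, new Employee(2, "New York"));
        return m;
    }

    static List<RawOrder> sampleOrders() {
        return Arrays.asList(
            new RawOrder(1, "AAA", 1), new RawOrder(2, "PWQ", 2),
            new RawOrder(3, "BBB", 2), new RawOrder(4, "CCC", 3));
    }

    // Counterpart of SELLERID.STATE=="<state>"; orders without a known seller never match.
    static Predicate<Order> inState(String state) {
        return o -> o.seller != null && state.equals(o.seller.state);
    }

    // Reads orders one at a time, resolves the seller, and keeps the matching order ids.
    static List<Integer> select(Iterator<RawOrder> cursor, Map<Integer, Employee> employees,
                                Predicate<Order> where) {
        List<Integer> ids = new ArrayList<>();
        while (cursor.hasNext()) {
            RawOrder r = cursor.next();
            Order o = new Order(r.orderId, r.client, employees.get(r.sellerId));
            if (where.test(o)) {
                ids.add(o.orderId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<Integer, Employee> emp = sampleEmployees();
        System.out.println(select(sampleOrders().iterator(), emp, inState("California")));
        // prints [1]
        Predicate<Order> caOrPwq = inState("California").or(o -> "PWQ".equals(o.client));
        System.out.println(select(sampleOrders().iterator(), emp, caOrPwq));
        // prints [1, 2]
    }
}
```

Only the small employee map and the matching ids are held in memory; the orders pass through one at a time, which mirrors why orders is read through a cursor while employee is fetched all at once.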

Note that esProc does not ship with MongoDB's Java driver. To access MongoDB from esProc, first put the driver (esProc requires version 2.12.2 or above, e.g. mongo-java-driver-2.12.2.jar) into [esProc installation directory]\common\jdbc.

The esProc script is easy to integrate into a Java program. Just add one more line of code, A7, containing result A6, to return the result to the Java program as a ResultSet. For the detailed code, please refer to the esProc Tutorial. Likewise, MongoDB's Java driver must be on the classpath of a Java program that accesses MongoDB by calling an esProc program.

11/23/2014

esProc Helps Process Heterogeneous Data Sources in Java - MongoDB

MongoDB doesn't support join directly. The unity JDBC recommended by the official website can perform the join operation after retrieving the data, but advanced functions, like join, group, functions and expressions, are only provided by the paid version of unity JDBC. Even the paid version doesn't support complicated SQL operations such as subqueries and window functions. The free JDBC drivers support only the most basic SQL statements.

The free esProc, working with MongoDB, can realize the above-mentioned complicated structured (or semi-structured) computations. We'll take join as an example to illustrate the method in detail.

As shown in the following, the collection orders in MongoDB contains the sales orders, and employee contains the employee information:

MongoDB shell version: 2.6.4

connecting to: test

>db.orders.find();