Data Process Engine: May 2015

5/28/2015

Performance Test of esProc In-Memory Computing

In this article, we’ll test the performance of esProc in handling in-memory small data computing, and compare with that of Oracle when performing the same computation.

The test involves two cases: normal simple computing and complicated related computing:

The test data used in normal computing is order information, as shown below:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE

1 287 47 5825 2013-05-31

2 89 22 8681 2013-05-04

3 47 67 7702 2009-11-22

4 76 85 8717 2011-12-13

5 307 81 8003 2008-06-01

6 366 39 6948 2009-09-25

7 295 8 1419 2013-11-11

8 496 35 6018 2011-02-18

9 273 37 9255 2011-05-04

10 212 0 2155 2009-03-22

esProc script for normal computing:

	A	B
1	=date("2013-09-01")	=date("2013-11-01")
2	=file("/ssd/data/orders.b").import@b()
3	=A2.select(ORDERDATE>A1 && ORDERDATE<B1)
4	=A3.groups@u(CLIENT,SELLERID;sum(AMOUNT):S,count(ORDERID):C)
5	=A4.select(C==13)
6	result A5

Note about the esProc script: A2 imports data in one go and its result can be used repeatedly. So it will not be counted in the execution time of in-memory computing, which only includes the time spent in executing steps from A3 to A6.

SQL script that Oracle uses:

select client,sellerid,sum(amount) s,count(orderid) c

from orders

where orderdate>to_date('2013-09-01','yyyy-MM-dd')

and orderdate<to_date('2013-11-01','yyyy-MM-dd')

group by client,sellerid

having count(orderid)=13

There are 8 million rows of data in orders.

Test result:

	esProc	Oracle
Execution time	0.570 seconds	0.623 seconds

Note: Execute Oracle’s SQL script four times and get the average of the last three times. In this way Oracle can make full use of its in-memory cached data. As for esProc script, we’ll execute it three times and get the average time.

During related computing testing, esProc imports the following files - orders.b, orders_detail.b, employee.b, department.b and performance.b. The relationships among them are as follows:

esProc script for data importing:

	A	B
1	="/ssd/data/"
2	=file(A1+"orders_detail.b").import@b()
3	=file(A1+"orders.b").import@b().primary(ORDERID).index()
4	=file(A1+"employee.b").import@b().primary(EID).index()
5	=file(A1+"department.b").import@b().primary(DEPT).index()
6	=file(A1+"performance.b").import@b().primary(EID).index()
7	=A2.switch(ORDERID,A3:ORDERID)	=A3.switch(SELLERID,A4:EID)
8	=A4.switch(DEPT,A5:DEPT)	=A5.switch(MANAGER,A4:EID)
9	=A4.switch(EID,A6:EID)	=A2.select(ORDERID.SELLERID.DEPT.MANAGER.EID.BONUS>6000)
10	=B9.groups@u(ORDERID.CLIENT:CLIENT,ORDERID.SELLERID.EID.EID:EID,PRODUCT;sum(PRICE*QUANTITY):S,count(ORDERID):C)
11	=A10.select(S>24500 && S<25000)	result A11

Note about esProc script: The code from A1 to A9 imports files and establishes the relationships (import and switch). This is done at one time and the result can be used repeatedly. This part of execution will not be included in the execution time of in-memory computing. The execution time is the time spent in executing steps from B9 to B11.

Oracle’s SQL：

select o.client,o.sellerid,od.product,sum(od.price*od.quantity) s,count(o.orderid) c

from orders_detail od left join orders o on o.orderid=od.orderid

left join employee e on o.sellerid=e.eid

left join department d on d.dept=e.dept

left join employee e1 on d.manager=e1.eid

left join performance p on p.employeeid=e1.eid

where p.bonus>=6000

group by o.client,o.sellerid,od.product

having sum(od.price*od.quantity) between 24500 and 25000

Test one: Both the orders table and the orders_detail table have 8 million rows. There are less than 1,000 rows in other tables.

Test result:

	esProc	Oracle
Execution time	4.9 seconds	9.7 seconds

Test two: Both the orders table and the orders_detail table have 4 million rows. There are less than 1,000 rows in other tables.

Test result:

	esProc	Oracle
Execution time	2.3 seconds	5.1 seconds

Similarly, we execute Oracle’s SQL script four times and get the average of the last three times, in order to let Oracle make full use of its in-memory cached data. We execute esProc script three times and get the average.

Conclusion:

For small data in-memory computing, when the computation is simple without file relating, esProc is a little faster than Oracle. This is probably because, though Oracle can cache data, it still needs to convert the cached data (without using the physical disk) before processing it. When handling the complicated related computing, esProc works much faster than Oracle. The reason is that Oracle uses the hash algorithm to join two tables, while esProc uses a pointer to reference foreign key values, and thus does not need to compute hash values and query matching values.

Therefore, when there is enough available memory capacity, esProc can achieve much higher performance than conventional relational database if we import data into memory and arrange it beforehand. esProc is great in achieving high performance for in-memory computing.

Test environment:

CPU: Core(TM) i5-3450 four cores, four threads

Memory capacity: 16GB

OS：CENT OS

Oracle11g：Maximum memory capacity is 14G

esProc 3.1 version: Maximum memory capacity is 14G

5/27/2015

Performance Test of File Retrieval in esProc

This article aims to test esProc performance of retrieving files and compares it with the result got by using Oracle JDBC.

The test data is some order records saved in the orders.b file. Import the file with esProc and its content would be like this:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE NOTE

1 287 47 5825 2013-05-31 gafcaghafdgie f ci…

2 89 22 8681 2013-05-04 gafcaghafdgie f ci…

3 47 67 7702 2009-11-22 gafcaghafdgie f ci…

4 76 85 8717 2011-12-13 gafcaghafdgie f ci…

5 307 81 8003 2008-06-01 gafcaghafdgie f ci…

6 366 39 6948 2009-09-25 gafcaghafdgie f ci…

7 295 8 1419 2013-11-11 gafcaghafdgie f ci…

8 496 35 6018 2011-02-18 gafcaghafdgie f ci…

9 273 37 9255 2011-05-04 gafcaghafdgie f ci…

10 212 0 2155 2009-03-22 gafcaghafdgie f ci…

…

note field is the string field for increasing each record’s length and doesn’t have practical meaning. Test process: Write a Java program to call esProc JDBC and Oracle JDBC respectively for retrieving and traversing the same data, and record the time taken. The Java program, esProc script, the data file and Oracle are all held in one machine, so no network is needed to transmit them.

Data volume: 890M

Hardware and software configuration for the test machine: Normal PC

CPU: Core(TM) i5-3450 4 cores, 4 threads

Memory: 16GB

SSD

OS：CENT OS

Oracle：11g

esProc version: 3.1

esProc script esProcjdbc.dfx for data retrieval (cellset parameter fileName is the file name):

	A
1	=file(fileName).import@b()
2	result A1

The test program of calling esProc JDBC by Java:

import java.sql.*;

import java.text.SimpleDateFormat;

import java.util.Date;

public class esproc {

public static void testDataServer(String input){

Connection con = null;

try{

SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); Date begin=new Date();

System.out.println("begin"+df.format(begin));

long diff=0;

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

com.esproc.jdbc.InternalCStatement st=

(com.esproc.jdbc.InternalCStatement)con.prepareCall("call esProcjdbc(?)");

st.setObject(1,input);

ResultSet set =st.executeQuery();

int len=0;

while(set.next())

{

String orderid=set.getString("ORDERID");

String client=set.getString("CLIENT");

String sellerid=set.getString("SELLERID");

float amount=set.getFloat("AMOUNT");

Timestamp ORDERDATE=set.getTimestamp("ORDERDATE");

String note=set.getString("NOTE");

len++;

}

System.out.println("len="+len);

diff=((new Date()).getTime()-begin.getTime())/1000;

System.out.println("time of all="+diff);

}

catch(Exception e){

System.out.println(e);

}

finally{

if (con!=null) {

try {

con.close();

}

catch(Exception e) {

System.out.println(e);

}

public static void main(String[] args) {

testDataServer(“/ssd/data/orders.b”);

}

The SQL statement that Oracle implements:

Select * from orders;

The test program of calling Oracle JDBC by Java:

package jdbc;

import java.sql.*;

import java.util.Date;

import java.text.SimpleDateFormat;

public class oracle {

public static void testDataServer(String input){

Connection con = null;

try{

SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); Date begin=new Date();

System.out.println("begin"+df.format(begin));

long diff=0;

Class.forName("oracle.jdbc.driver.OracleDriver");

con= DriverManager.getConnection("jdbc:oracle:thin:@192.168.1.21:1521:orcl","test","test");

Statement st=con.createStatement();

ResultSet set =st.executeQuery(input);

int len=0;

while(set.next())

{

String orderid=set.getString("ORDERID");

String client=set.getString("CLIENT");

String sellerid=set.getString("SELLERID");

float amount=set.getFloat("AMOUNT");

Timestamp ORDERDATE=set.getTimestamp("ORDERDATE");

String note=set.getString("NOTE");

len++;

}

System.out.println("len="+len);

diff=((new Date()).getTime()-begin.getTime())/1000;

System.out.println("time of all="+diff);

}

catch(Exception e){

System.out.println(e);

}

finally{

if (con!=null) {

try {

con.close();

}

catch(Exception e) {

System.out.println(e);

}

public static void main(String[] args) {

testDataServer(“select * from orders”);

}

Test result:

	esProc	Oracle
Execution time	13 seconds	43 seconds

Conclusion:

Data retrieval using Oracle JDBC is more than one time slower than that via esProc JDBC. The reason is that data flow needs to be converted to object through Oracle JDBC, which takes time. There is no need for Java-based esProc to do the same thing, so it works much faster.

One point to note is that the time taken by the first execution of calling Oracle JDBC to retrieve data is not used by the test. On the other hand, because the data volume is well within 6G, Oracle’s maximum memory capacity, it can make full use of the cache memory to save data. Therefore most of the execution time is devoted to conversion to object via JDBC.

This shows that esProc approach of retrieving data into the external memory is more efficient than Oracle’s throughput capacity.

Generally data computing within database will achieve higher performance. But in some cases you need to retrieve a great amount of data from the database. For example, data outside the database is involved, or the algorithm is too complex and specialized to be expressed in SQL. In these cases, JDBC for database connectivity will be the key factor to determine the performance.

Yet better performance can be achieved by using esProc to retrieve data from the database and save it in files for use in later computation. To handle or display a large amount of historical data by report applications, it is more efficient to retrieve data into files before processing than to process it by retrieving data directly from the database. Within the log analysis system, you can save data as files and use esProc to handle the computation if you must retrieve data from the database using a programming language to perform certain operations involving log files.

5/25/2015

Performance Test of Concurrent File Processing with esProc

This article aims to test esProc performance in processing files concurrently, based on filter, group and aggregate operations and comparison with Oracle.

Test data includes tables of ordering information of different years. With esProc, files (Orders2009.b, Orders2010.b, Orders2011.b and Orders2012.b) of binary format are imported. Below is a selection from the source data:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE NOTE

1 287 47 5825 2013-05-31 gafcaghafdgie f ci…

2 89 22 8681 2013-05-04 gafcaghafdgie f ci…

3 47 67 7702 2009-11-22 gafcaghafdgie f ci…

4 76 85 8717 2011-12-13 gafcaghafdgie f ci…

5 307 81 8003 2008-06-01 gafcaghafdgie f ci…

6 366 39 6948 2009-09-25 gafcaghafdgie f ci…

7 295 8 1419 2013-11-11 gafcaghafdgie f ci…

8 496 35 6018 2011-02-18 gafcaghafdgie f ci…

9 273 37 9255 2011-05-04 gafcaghafdgie f ci…

10 212 0 2155 2009-03-22 gafcaghafdgie f ci…

…

In which note is a string field for increasing each record’s length and hasn’t practical meaning.

Oracle imports the four tables Orders2009, Orders2010, Orders2011 and Orders2012. The data is the same as that of those files. The number of parallel threads is 4. esProc and Oracle use single thread, 2 threads and 4 threads respectively to process the computational task. Each subtask has an input parameter year.

esProc script group.dfx for data grouping and aggregation:

Oracle SQL1 for data grouping and aggregation:

"select /*+ Parallel(4) */ client,sum(amount),count(orderid) from orders"+string(year)+" group by client"

esProc script select.dfx for data filtering:

Oracle SQL2 for data filtering:

"select * from orders1g"+string(year)+" where SELLERID=1 and CLIENT=50"

Hardware description:

CPU: Core(TM) i5-3450 4 cores and 4 threads

Memory capacity: 16GB

SSD

Note:

1. This performance test focuses on cases in which the result set is small and can be entirely loaded into memory.

2. The input parameters for the four concurrent tasks are four different years.

3. The test result adopts the average value, the maximum value and the minimum value of results of four consecutive computations. Both esProc and Oracle might use the disk cache.

4. Oracle’s available memory space is 6G.

Test result one: Each subtask handles 1G data. The total data is 4G.

Time unit: Second

1.Group and aggregate

2. Query and filter

Test result two: Each subtask handles 4G data. The total data is 16G.

Time unit: Second

1.Group and aggregate

2.Query and filter

Conclusion

1. Oracle cache can be brought into full play in handling the 4G total data that is within the Oracle memory capacity. In this case Oracle gets a higher performance.

2. When the total data is 16G, which exceeds Oracle’s available memory, esProc is generally better than Oracle if using the parallel processing.

3. On the same conditions, esProc has a steady computational time, while Oracle has a big gap between the maximum time and the minimum time.

5/21/2015

Performance Test of esProc File Traversal Algorithm

This article focuses on testing performance of esProc’s handling of file traversal, using commonly used group and aggregate operations as well as query and filter operations. The performance test will include evaluation of the influence of using multithreads, and comparison between esProc approach and Oracle counterpart for the same operation.

We use ordering information as the test data. The following is a selection of the Orders.b file in esProc binary format:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE NOTE

1 287 47 5825 2013-05-31 gafcaghafdgie f ci…

2 89 22 8681 2013-05-04 gafcaghafdgie f ci…

3 47 67 7702 2009-11-22 gafcaghafdgie f ci…

4 76 85 8717 2011-12-13 gafcaghafdgie f ci…

5 307 81 8003 2008-06-01 gafcaghafdgie f ci…

6 366 39 6948 2009-09-25 gafcaghafdgie f ci…

7 295 8 1419 2013-11-11 gafcaghafdgie f ci…

8 496 35 6018 2011-02-18 gafcaghafdgie f ci…

9 273 37 9255 2011-05-04 gafcaghafdgie f ci…

10 212 0 2155 2009-03-22 gafcaghafdgie f ci…

…

NOTE field is added for the purpose of increasing each record’s length but hasn’t practical meaning.

esProc script for grouping and aggregating data：

Oracle approach to grouping and aggregating data using SQL:

select /*+ Parallel(8) */ client,sellerid,count(orderid),sum(amount) from orders group by client,sellerid。

esProc script for querying and filtering data:

Oracle approach to querying and filtering data using SQL:

select /*+ Parallel(4) */ * from orders where sellerid=1 and client=50 and to_char(orderdate,'yyyy')>= 2013。

Two types of hardware are used for the test：PC server (disk array) and regular PC (SSD)

Configuration for a PC server:

CPU: 2 Xeon E5620 ( 4 cores 8 threads, 16 threads in total)

Memory: 24GB

Hard drive: 800G HDD

Configurain for a regular PC:

CPU:1 Core(TM) i5-3450 (4 cores)

Memory: 16GB

SSD

Note: We will get the test result by computing the average value of running a script 3 times, during which the machine won’t be restarted. Both Oracle program and esProc program might store data temporarily in the disk.

Test results for the PC server:

Unit: Seconds

Test 1 for data grouping and aggregation (data volume is 25G):

Test 2 for data grouping and aggregation (data volume is 10G):

Test 3 for data grouping and aggregation (data volume is 3.5G):

Test 1 for data query and filtering (data volume is 25G):

Test 2 for data query and filtering (data volume is 10G):

Test 3 for data query and filtering (data volume is 3.5G):

Test result summary on PC server (disk array):

1.Maximum memory supported by Oracle database is 12G. When data volume is below this limit, like 10G and 3.5G, Oracle’s performance is better than esProc’s.

2.When data volume is greater than 12G, like 25G, the use of parallel computing has little effect of increasing performance. In grouping and aggregating data, esProc’s performance is significantly higher than Oracle; the two are nearly even in performing data query and filtering.

3.esProc makes quite a leap in performance in each addition of the number of parallel tasks from 1 to 4. But during the increase from 4 to 16, its performance improves not as much as Oracle’s.

Test results for the regular PC:

Unit：Seconds

Test 1 for data grouping and aggregation (data volume is 25G):

Test 2 for data grouping and aggregation (data volume is 10G):

Test 3 for data grouping and aggregation (data volume is 3.5G):

Test 1 for data query and filtering (data volume is 25G):

Test 2 for data query and filtering (data volume is 10G):

Test 3 for data query and filtering (data volume is 3.5G):

Test result summary on regular PC:

1.In this case the maximum memory supported by Oracle database is 6G. When data volume is within the memory capacity, like 3.5G, Oracle performs better than esProc.

2. When data volume exceeds 12G, like 25G and 10G, parallel processing enhances performance very little. In grouping and aggregating data, esProc’s performance is significantly higher than Oracle; but only slightly better in performing data query and filtering.

Performance test summary:

According to the test results of both types of hardware, Oracle can achieve a better performance when its available memory can hold all data. The reason may be that it stores the data temporarily in the memory, instead of actually reading the disk. So it can work faster. In comparison, esProc, with no disk cache, reads the file for each computing action and thus can only make use of the limited buffer provided by the file system.

When a single task involves data that is more than the available memory can hold, esProc’s multithreading approach can achieve higher performance.

On the other hand, although PC server supports 16-thread computational power, performance rarely increases when the number of parallel tasks is nearly 16, due to the limited throughput of the disk array. In view of this, an esProc solution should set a reasonable number of parallel tasks according to the machine’s hard drive performance.

menu

5/28/2015

Performance Test of esProc In-Memory Computing

5/27/2015

Performance Test of File Retrieval in esProc

5/25/2015

Performance Test of Concurrent File Processing with esProc

5/21/2015

Performance Test of esProc File Traversal Algorithm