1/07/2015

esProc Integration & Application: Integration by BIRT

esProc provides a standard JDBC interface to integrate with BIRT and other reporting tools. Here we'll explain the process of integrating esProc with BIRT both in the development environment and in a web application environment.

1.Integration in development environment

1.1.Create configuration files

Create the two configuration files – config.xml and dfxConfig.xml – necessary for esProc JDBC.

1.1.1 config.xml

<?xml version="1.0" encoding="UTF-8"?>
<Config Version="1">
<!--Registration code-->
<regCode>HOjw6-9da4x-NXkD6-089IX@w</regCode>
<charSet>GBK</charSet>
<!--Configure the dfx file's search path, which is an absolute path. There can be multiple paths separated by semicolons. The dfx file can also be placed in the application's classpath, which has priority over a search path when loading the file-->
<dfxPathList>
<dfxPath>D:/dfxs</dfxPath>
</dfxPathList>
<dateFormat>yyyy-MM-dd</dateFormat>
<timeFormat>HH:mm:ss</timeFormat>
<dateTimeFormat>yyyy-MM-dd HH:mm:ss</dateTimeFormat>
<!--Method one for datasource configuration: configure connection pool in the application server and specify the datasource name-->
<DBList>
<!--Datasource name must be consistent with that in the dfx file. Here HSQL database is used-->
<DB name="demo">
<property name="url" value="jdbc:hsqldb:hsql://127.0.0.1/demo"/>
<property name="driver" value="org.hsqldb.jdbcDriver"/>
<property name="type" value="HSQL"/>
<property name="user" value="sa"/>
<property name="password" value=""/>
<property name="batchSize" value="1000"/>
<property name="autoConnect" value="false"/>
<property name="useSchema" value="false"/>
<property name="addTilde" value="false"/>
</DB>
</DBList>
<mainPath/>
<tempPath>temp</tempPath>
<bufSize>65536</bufSize>
<localHost/>
<localPort>0</localPort>
<logLevel>DEBUG</logLevel>
<callxParallel>1</callxParallel>
<nodeParallel>1</nodeParallel>
</Config>

1.1.2  dfxConfig.xml

<?xml version="1.0" encoding="GBK"?>
<dfxConfig>
         <maxConcurrents>10</maxConcurrents>
         <maxWaits>10</maxWaits>
         <log></log>
</dfxConfig>

1.2  Create a report and load the driver jars


Create a report, add JDBC Data Source to the list of “DataSources”, then move on to the next step.

Click “Manage Drivers” on the New Data Source window to add esProc JDBC’s driver jars. 

Five jars - dm.jar, poi-3.7-20101029.jar, log4j_128.jar, icu4j_3_4_5.jar and dom4j-1.6.1.jar - need to be added. They can be found in the [installation directory]\esProc\lib directory of the esProc IDE. If esProc also uses other databases as data sources, the driver jars of those databases need to be added too. For instance, hsqldb.jar has been added here.

Place the two configuration files - config.xml and dfxConfig.xml - created in the first step in the classpath's root path, with their names unchanged. (No way of setting a directory-style classpath is provided here, so the configuration files can instead be packed into the root directory of any of the jars above, such as dm.jar.)

After that, select com.esproc.jdbc.InternalDriver (v1.0) in the drop-down list of the Driver Class and enter the database URL jdbc:esproc:local://. Leave both the user name and the password blank.

esProc JDBC is a fully embedded computing engine that performs all computations inside its own package, so the database URL is local. And since it is not a full database, it has neither a user name nor a password.

Having finished setting the above information, click on “Test Connection” to check if the connection has been successfully established.

1.3. Deploy esProc script

Copy the esProc script (dfx) to the dfxPath configured in config.xml. Suppose we want to query the closing stock prices of a certain month; the script stocks.dfx is as follows:

1.4.  Invocation in BIRT

Create a new data set by selecting the datasource just configured. The data set type is stored procedure. 

Then define a query text - {call stocks(?)}. Invoking an esProc script is similar to invoking a stored procedure: "stocks" is the dfx's name and "?" represents a passed parameter, which is the month being queried.

Set the parameter. 

Click on Preview Results to see the query results. At this point, the invocation of esProc script has been completed. 

The above is the process of integrating esProc and BIRT in the development environment. Now let's look at how to deploy both of them on the web server side.

2. Deploy on the web server side

Select an appropriate application server. Here Tomcat 6.0 is used as an example to deploy the application.

2.1. Prepare web distribution package


Copy the WebViewerExample directory to Tomcat’s webapps directory and rename it birt.

2.2. Create configuration files

Two configuration files – config.xml and dfxConfig.xml – are needed. The configuration information is the same as that for integration in development environment.

2.3. Copy necessary jars

Copy dm.jar, poi-3.7-20101029.jar, log4j_128.jar, icu4j_3_4_5.jar and dom4j-1.6.1.jar - the jars needed for the above "integration in development environment" - to the application's WEB-INF\lib directory and pack the above two configuration files into dm.jar.

2.4. Publish the report

When the report (like stocks.rptdesign) is completed, copy it to the application's root directory, launch Tomcat and visit http://127.0.0.1:8080/birt/frameset?report=stocks.rptdesign, where we can see the result displayed below. Now the deployment is successful.

esProc Improves Text Processing – Remove Duplicate Rows

When processing a text file, sometimes we need to remove duplicate rows from grouped data. The operation becomes complicated when the file under processing is too big to be entirely loaded into memory. esProc's group operation supports importing a whole group of data with the file cursor, as well as many options, so it can handle this kind of operation effortlessly. The following example shows how it works.


The file EPRom.log has four columns separated by tab and its data have been grouped by the second column. Duplicate rows need to be removed (only the first row of each group is wanted). Some of the source data are as follows:

esProc code for doing this

A1=file("E:\\EPRom.log").import()

This line of code imports the file into memory. By default, the separator is the tab and the column names are _1, _2, _3 and so on. If it is a CSV file, the separator needs to be specified with import(;","). If the file's first row contains column names, the @t option can be used to import them, that is, import@t(). The result of A1 is as follows:

A2=A1.group@1o(_2)

This line of code gets the first row from every group. The grouping field is _2, the second field. This is the case's final result, as shown below.

By default, the group function regroups the data. For instance, A1.group(_2) will divide A1 into two groups by the second field, as the following shows:

But the @o option prevents regrouping. For instance, the result of A1.group@o(_2) is as follows:

With the @1 option, the function gets the first row of every group. By using both @1 and @o, we hit the target of this case.
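The combined @o-plus-@1 behavior can be sketched in Python with itertools.groupby, which likewise only breaks a group where the key value changes and never re-sorts (the sample rows are invented for illustration):

```python
from itertools import groupby

# Rows as tuples; the second item (index 1) is the grouping field and the
# data are already ordered by group, matching the precondition of group@o.
rows = [
    ("a", "g1", 1), ("b", "g1", 2),
    ("c", "g2", 3),
    ("d", "g1", 4),  # same key as an earlier group, but not merged
]

# groupby, like @o, starts a new group only when the key changes, so
# non-adjacent runs with the same key stay separate groups.
firsts = [next(grp) for _, grp in groupby(rows, key=lambda r: r[1])]
# firsts now holds the first row of each run: ("a", ...), ("c", ...), ("d", ...)
```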
        
When the file is too big to be wholly loaded into memory, an esProc cursor can be used instead. The corresponding code is as follows:

A1=file("E:\\EPRom.log").cursor()
This line of code opens the log file in the form of a cursor. The cursor function returns a cursor object for the corresponding file object, with tab being the separator and _1,_2…_n being the column names by default. Notice that the code merely creates the cursor object without importing data. Data importing is started by a for statement or the fetch function.

B1= file("e:\\result.txt")
This line of code creates a file object to which the computed results can be exported later on.

A2:for A1;_2
This line of code imports data from cursor A1 by loop. Each time it imports a group of data with the same second column (column name _2). It is in this step that the data are really imported into memory.
The for statement here deserves special attention. In esProc, for cs,n imports n records from cursor cs each time, while for cs;x imports a group of records with the same x field from cursor cs each time, on the condition that the records have been grouped by the x field.
The x in the for cs;x statement can be an expression, meaning rows are imported until the result of computing expression x changes. Take for A1;floor(_1/5) as an example. It divides the _1 field by 5, rounds the result down, and puts the records with the same result into the same group, like the first row to the fifth row.

B2=file("e:\\result.txt").export@a([A2(1)])
As the loop body of the for statement in A2, this cell processes every group of data in the same way: it gets the first row of the current group and appends it to the file result.txt. A2 is the loop variable, which represents all records in the current group, so A2(1) represents the first record in A2. The export function writes structured data into a file; its @a option means appending. Since A2(1) is a single record, it needs to be converted into an array with the operator [].

We can see the final result in result.txt:

In esProc, the working range of a for statement is represented by indentation instead of parentheses or identifiers like begin/end. In the following code block, for instance, B2-B5 is A2's working range.
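The cursor-based dedup above can be sketched as a line-by-line stream in Python, again assuming the data are already ordered by the grouping column (file content is invented; in-memory streams stand in for the real files):

```python
import io
from itertools import groupby

# Hypothetical tab-separated log: four columns, already grouped by column 2.
src = io.StringIO(
    "r1\tg1\tx\ty\n"
    "r2\tg1\tx\ty\n"
    "r3\tg2\tx\ty\n"
)
out = io.StringIO()

# Like "for A1;_2" on a file cursor: stream line by line, start a new group
# whenever the second column changes, and append only each group's first line.
for _, grp in groupby(src, key=lambda line: line.split("\t")[1]):
    out.write(next(grp))
```

Since groupby consumes the input lazily, only one group is held in memory at a time, which is what makes the approach work for files bigger than memory.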

1/05/2015

Comparison of esProc and R Language in Processing Text Files

As languages for data computations, both esProc and R language have rich functions to process text files. They have many similarities in basic usage, as well as obvious differences, such as in the aspect of processing files with fixed column width and big text files, reading and writing designated columns, computational performance, etc. The article aims to compare their similarities and differences.

1.Comparison of basic functions

Description:

There are six columns in sales.txt, separated from each other by the tab \t. Lines are separated from each other by the line break \n. The first row contains column names. Read the file into memory and write it out anew. The first rows of the file are as follows:
esProc:
data=file("e:\\sales.txt").import@t()
file("e:\\salesResult.txt").export@t(data)
R language:
data<-read.table("e:\\sales.txt",sep="\t", header=TRUE)
write.table(data, file="e:\\salesResult.txt",sep="\t",quote=FALSE,row.names=FALSE)
Comparison:
1. Both esProc and R language can do this job conveniently. esProc uses the function option @t to indicate that the first row contains column names, while R language uses header=TRUE to do the same thing.

2. Line breaks are the most common separators between lines, and both esProc and R language support them by default. Tabs are the most common separators between columns. esProc supports tabs by default; if another separator such as the comma is used, the code should be import@t(;","). In R language, the default column separators are "blanks and tabs", which would mistakenly split the Client column containing blanks into two columns, so sep="\t" is needed to define the separator as the tab. In addition, quote=FALSE,row.names=FALSE in the code indicates that it is not necessary to put elements in quotes or to output row names.

3. Usually, files read into memory are stored as structured two-dimensional data objects, which are called table sequences (TSeq) in esProc and data frames (data.frame) in R language. Both TSeq and data.frame have rich computational functions. For example, group by Client and SellerId, then sum up Amount and find the maximum OrderID. The code for esProc to perform the computations is:
data.groups(Client,SellerId;sum(Amount),max(OrderID))

As data.frame doesn't directly support the simultaneous use of multiple aggregation methods, two steps are needed to sum up and find the maximum. Finally, cbind is used to combine the results. See below:
result1<-aggregate(data[,4],data[c(2,3)],sum) 
result2<-aggregate(data[,1],data[c(2,3)],max)
result<-cbind(result1,result2[,3])
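For comparison, the same simultaneous sum-and-max aggregation can be sketched in plain Python in a single pass over the data (the sample orders and field names are invented for illustration):

```python
# Single-pass analog of data.groups(Client,SellerId;sum(Amount),max(OrderID)):
# one dictionary keyed by the two grouping fields, updated per record.
orders = [
    {"OrderID": 1, "Client": "A", "SellerId": 1, "Amount": 100.0},
    {"OrderID": 2, "Client": "A", "SellerId": 1, "Amount": 250.0},
    {"OrderID": 3, "Client": "B", "SellerId": 2, "Amount": 300.0},
]

agg = {}
for r in orders:
    key = (r["Client"], r["SellerId"])
    s, m = agg.get(key, (0.0, None))
    agg[key] = (s + r["Amount"],
                r["OrderID"] if m is None else max(m, r["OrderID"]))
```

Both aggregates are maintained together, so no second scan or cbind-style merge step is needed.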

4. Besides storing files as structured two-dimensional data objects in memory, esProc can access files with cursor objects, while R language can access files with matrix objects.

Conclusion: For basic file reading and writing, both esProc and R language provide rich functions to meet users' needs.

2. Reading files with fixed column width

In some files, fixed width, instead of separators, is used to differentiate one column from another. For example, read file static.txt which contains three columns of data into the memory and modify column names respectively to col1, col2 and col3, among which the width of col1 is 1, that of col2 is 4 and that of col3 is 3. 
esProc:
data=file("e:\\static.txt").import()
data.new(mid(_1,1,1):col1, mid(_1,2,4):col2, mid(_1,6,3):col3)
R language:
data<-read.fwf("e:\\static.txt", widths=c(1, 4, 3),col.names=c("col1","col2","col3"))

Comparison: R language does this job directly, while esProc does it indirectly by reading the file into memory first and then splitting it into multiple columns. Note that in the code mid(_1,1,1), _1 is the default column name; if the file read into memory had more than one column, the default column names would follow in order: _1, _2, _3 and so on.
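The fixed-width split that esProc performs with mid() can be sketched generically in Python as plain string slicing (the sample line and widths follow the example above):

```python
# Split a line into columns by fixed widths: 1, 4 and 3 characters.
widths = [1, 4, 3]

def split_fixed(line, widths):
    """Slice a fixed-width line into one string per column."""
    cols, pos = [], 0
    for w in widths:
        cols.append(line[pos:pos + w])
        pos += w
    return cols

# Invented sample line: one char, four chars, three chars.
row = split_fixed("A2015Jan", widths)
```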

Conclusion: R language is more convenient than esProc here because it can read files with fixed column widths directly.

3.Reading and writing designated columns

Sometimes only some of the data columns are needed in order to save memory and enhance performance. In this example, read columns ORDERID, CLIENT and AMOUNT into the memory and write ORDERID and AMOUNT to a new file.
esProc:
data=file("e:\\sales.txt").import@t(ORDERID,CLIENT,AMOUNT)
file("e:\\salesResult.txt").export@t(data,ORDERID,AMOUNT)
R language:
data<-read.table("e:\\sales.txt",sep="\t", header=TRUE)
col3<-data[,c("ORDERID","CLIENT","AMOUNT")]
col2<-col3[,c("ORDERID","AMOUNT")]
write.table(col2, file="e:\\salesResult.txt", sep="\t",quote=FALSE,row.names=FALSE)

Comparison: esProc does the job directly, while R language does it indirectly by reading all columns into the memory and saving designated columns in a new variable.

Conclusion: R language can only read all columns into memory, which occupies a relatively large amount of memory.
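The same select-columns-on-read idea can be sketched in Python with the csv module (the sample data are invented); note that, like R, this still parses every column of each line and merely discards the unwanted ones early:

```python
import csv
import io

# Hypothetical tab-separated sales data with a header row.
src = io.StringIO("ORDERID\tCLIENT\tAMOUNT\n1\tA\t100\n2\tB\t200\n")
out = io.StringIO()

reader = csv.DictReader(src, delimiter="\t")
# extrasaction="ignore" silently drops the columns not in fieldnames.
writer = csv.DictWriter(out, fieldnames=["ORDERID", "AMOUNT"],
                        delimiter="\t", extrasaction="ignore",
                        lineterminator="\n")
writer.writeheader()
for row in reader:        # only the designated columns survive the write
    writer.writerow(row)
```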

4. Processing big text files

Big text files are files that are bigger than the available memory. Usually they are processed by reading and computing in batches. For example, in the big text file sales.txt, filter data by the condition Amount>2000 and sum up the Amount of each SellerID.
esProc:
A1: As reading the whole big text file into memory at once would cause memory overflow, it is read in batches with a cursor.

A2: Read by loop, with 100,000 rows of data each time, and store them in TSeq A2.

B3: Among each batch of data, select the records whose order amount is greater than 2,000.

B4: Group and summarize the filtered data to get each seller's sales amount in this batch.

B5: Append the computed results of this batch to a variable (B1), and begin the computation of the next batch.

B6: After the computations of all batches are over, each seller's sales amount of each batch can be found in B1; execute one last grouping and summarizing to get the total sales amount of each seller.
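The batch-filter-then-merge scheme described in the steps above can be sketched in Python (batch size and sample rows are invented; in practice the rows would come from a file):

```python
from itertools import islice

def batched_total(rows, batch_size=2):
    """Filter Amount > 2000 and sum per seller, batch by batch, then merge."""
    it = iter(rows)
    total = {}                      # running aggregate across batches (like B1)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # per-batch group-and-sum of the filtered records (like B3/B4)
        partial = {}
        for seller, amount in batch:
            if amount > 2000:
                partial[seller] = partial.get(seller, 0) + amount
        # append this batch's result to the running aggregate (like B5/B6)
        for seller, s in partial.items():
            total[seller] = total.get(seller, 0) + s
    return total

res = batched_total([("s1", 2500), ("s2", 1500), ("s1", 3000), ("s2", 4000)])
```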

R language:
1-4: Create an empty data frame data to hold each batch's data frame databatch.

5-9: Create an empty data frame agg to which the grouping and summarizing results of each batch are appended.

11-13: Read in the file by rows, 100,000 lines each time, skipping the column names in the first row.

15-21: In each batch of data, select the records whose order amount is greater than 2,000.

22: Group and summarize the filtered data to get each seller's sales amount in this batch.

23: Append the computed results of this batch to a variable (agg), and begin the computation of the next batch.

24: After the computations of all batches are over, each seller's sales amount of each batch can be found in agg; execute one last grouping and summarizing to get the total sales amount of each seller.

Comparison:
1. Both share the same way of thinking. The differences are that esProc does the job with library functions and its code is concise and easy to understand, while R language has to handle a great deal of details manually, so its code is lengthy, complicated and error-prone.

2. With the esProc cursor, the above computations can be performed even more easily, that is:

In this piece of code, the esProc engine automatically processes data in batches, and it is not necessary for programmers to control the batches manually with loop statements.

Conclusion: In processing big text files, esProc code is more concise, more flexible and easier to understand than R language code.

5. Processing big text files in parallel

Parallel computing can make full use of the resource of multi-core CPU and significantly improve computational performance. 
The example in the previous part is used again, but this time with parallel computing. That is, divide sales.txt into four segments for four CPU cores to compute on, then filter the data by the condition Amount>2000 and compute the total sales amount of each seller.

esProc:

Main program pro5.dfx
A1: Set the number of parallel tasks to four, meaning the file will be divided into four segments.

A2: Call the subprogram to perform multithreaded parallel computing, with two task parameters: to(A1) and A1. The value of to(A1) is [1,2,3,4], representing the segment number assigned to each task; A1 is the total number of segments. When all the tasks are completed, all computed results are stored in the current cell.

A3: Merge the computed results of every task in A2 by SellerID.

A4: Group and summarize the merged results to get each seller's sales amount.

Subprogram sub.dfx

A1: Read the file with a cursor, and decide which segment of the file the current task should process according to the parameters sent by the main program. Take the third task as an example: the value of the parameter segment is 3 and that of the parameter total is always 4.

A2: Select the records whose order amount is greater than 2,000.

A3: Group and summarize the filtered data.

A4: Return the computed results of the current task to the main program.

R language:
R language cannot do this job using parallel computing.

Comparison:
esProc can read big text files segmentally by bytes, skipping useless data to read only the designated part, and it supports multithreaded parallel computing at a low level.

Though R language can perform parallel computing on in-memory data, it cannot read files on disk segmentally by bytes. It can read data while skipping a number of rows, but this method has to traverse all the useless data, resulting in poor performance and an inability to perform parallel computing on big text files at a low level.

In addition, esProc automatically manages the situation where a segment boundary falls in the middle of a line when segmenting by bytes, as shown in the above code, so programmers don't need to handle it manually.
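The half-line handling when segmenting by bytes can be sketched as follows in Python; the data layout is invented, and the sketch runs the four segments serially, though each call is independent of the others and could run in a separate process:

```python
# Byte-range segmentation with half-line handling: each worker takes bytes
# [start, end), skips the partial first line (the previous segment owns it),
# and reads past `end` to finish its own last line.
data = b"s1\t2500\ns2\t1500\ns1\t3000\ns2\t4000\n"

def segment_lines(data, seg, total):
    size = len(data)
    start = size * seg // total
    end = size * (seg + 1) // total
    if seg > 0:                       # skip the half line at the front
        nl = data.find(b"\n", start)
        start = size if nl < 0 else nl + 1
    nl = data.find(b"\n", end)        # extend to complete the last line
    end = size if nl < 0 else nl + 1
    return data[start:end].splitlines()

# Every line lands in exactly one segment: no duplicates, no losses.
all_lines = [ln for seg in range(4) for ln in segment_lines(data, seg, 4)]
```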

Summary:
esProc can process big text files in parallel and has high computational performance. R language cannot perform parallel computing of big text files at a low level, so its performance is much poorer.

6.  Computational performance

Under the same test environment, use esProc and R language to read a 1GB file and summarize one of its fields.
esProc:
=file("d:/T21.txt").cursor@p(#1:long)
=A1.groups(;sum(#1))
R language:
con<- file("d:/T21.txt", "r")
lines=readLines(con,n=1024)
value=0
while( length(lines) != 0) {
         for(line in lines){
                   data<-strsplit(line,'\t')
                   value=value+as.numeric(data[[1]][1])
         }
         lines=readLines(con,n=1024)
}
print(value)
close(con)

Comparison:
1. It takes esProc 26 seconds and R language 9 minutes and 47 seconds respectively to finish the task. The gap exceeds an order of magnitude.

2. In processing big files, R language cannot use data frame objects and library functions; it can only write loop statements manually and compute while the file is being read, so the performance is poor. esProc can directly use cursor objects and library functions, achieving higher performance. There is no big difference between them when processing small files.

Summary: esProc's performance is far beyond that of R language in processing big text files.

1/04/2015

esProc Improves Text Processing – Parse Logs with Arbitrary Number of Lines

When parsing logs into structured data, we often find that the records consist of a variable number of lines. This makes the conversion, as well as the corresponding operations, quite complicated. Equipped with various flexible functions for structured data processing, such as regular expressions, string splitting, fetching data located in a different row, and data concatenation, esProc is ideal for processing this kind of text. The following example shows how it works.


The log file reportXXX.log holds records, each of which consists of multiple lines with 14 data items (fields) and starts with the string “Object Type”. Our goal is to rearrange the log into structured data and write the result to a new text file. Some of the source data are as follows:

esProc code for doing this

A1=file("e:\\reportXXX.log").read()
This line of code reads the logs entirely into the memory. Result is as follows:

A2=A1.array("Object Type: ").to(2,)
This line of code can be divided into two parts. The first part - A1.array("Object Type: ") – splits A1 into strings according to “Object Type”. Result is as follows:

Except for the first item, every item of data is valid. to(2,) means getting the items from the second one to the last one. The result of A2 is as follows:

The next line of code, A3, applies the same regular expression to each member of A2 and gets the 14 fields separated by commas. The following lists the first few fields:

A4=file("e:\\result.txt").export@t(A3)  
This line of code writes the final result to a new file. Tab is the default separator. The use of @t option will export the field names as the file’s first row. We can see the following data in result.txt:

The regular expression used in the code above is complicated. We can use esProc's built-in functions to make the operation more intuitive. For example, the Object Type field is the first line of each record, so we can separate the records from each other with the line break and then get the first line. left\top\right\bottom are obtained by splitting each record's second line by spaces and getting items 3, 5, 7 and 9.

The task can be handled with esProc built-in functions as follows:

In the above code, the pjoin function concatenates multiple sets together; the array function splits a string into segments by the specified delimiter and creates a set with them, in which ~.array("\r\n") splits each record by the carriage return.
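The split-by-marker step (array plus to(2,)) can be sketched in Python; the log content below is a simplified, invented stand-in for the real reportXXX.log:

```python
# Split-on-marker parsing: the text before the first marker is discarded,
# and each remaining chunk is one record.
log = ("garbage header\n"
       "Object Type: Symbol\nleft: 195\n"
       "Object Type: Line\nleft: 40\n")

records = log.split("Object Type: ")[1:]            # drop the pre-marker chunk
first_lines = [r.splitlines()[0] for r in records]  # the field on line one
```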
In the above example, we assumed that the log file is not big and can be wholly loaded into memory for computing. But sometimes the file is big and needs to be imported, parsed and exported in batches, which makes the code extremely difficult to write. Besides, because the number of lines per record is variable, there is always a record in a batch of data that cannot be imported completely. This further complicates the coding.

esProc can handle big log files with an arbitrary number of lines per record more easily using file cursors. The following is a code sample:

A1=file("\\reportXXX.log").cursor@s()
This line of code opens the log file in the form of a cursor. The cursor function returns a cursor object for the file object, with tab being the default separator and _1,_2…_n being the default column names. The @s option means ignoring the separator and importing the file as a one-column string, with _1 being the column name. Note that this code only creates a cursor object and doesn't import data. Data importing is started by a for statement or the fetch function.

A2: for A1,10000
A2 is a loop statement, which imports a batch of data (10,000 rows) each time and sends it to the loop body; this won't stop until the end of the log file. As can be seen, a loop body in esProc is visually represented by indentation instead of parentheses or identifiers like begin/end. The area B2-B7 is A2's loop body, which processes data like this: the current batch of rows is restored to text joined by carriage returns; the text is split into records again by "Object Type"; the last, incomplete record is saved in the temporary variable B1, and the first and the last items, both of which are unusable, are deleted; then the regular expression is applied to each of the remaining records, producing a two-dimensional table to be written into result.txt. The following explains this process in detail:

B2=B1+A2.(_1).string@d("\r\n")
This line of code concatenates the temporary variable B1 with the current text. In the first run of the loop, B1 is empty. After that, B1 receives the incomplete record from the previous loop and concatenates with the current text, making the incomplete record complete.
The string function concatenates the members of a set with the specified separator, and the @d option forbids surrounding members with quotation marks. The top rows of A2 are as follows:

A2.(_1) represents the set formed by field _1 in A2 :

A2.(_1).string@d("\r\n") means concatenating members of the above set into a big string, which is Object Type: Symbol Location: left: 195 top: 11 right: 123 bottom: 15 Line Color: RGB ( 1 0 0 ) Fill Color:   RGB ( 251 255 0 ) Link:l11…. 
B3=B2.array("Object Type: ")
This line of code splits the big text in B2 into strings by “Object Type”. Result of B3’s first-run loop is as follows:

Since the last string in B3 is not a complete record and cannot be computed, it is stored in the temporary variable and concatenated with the new text created in the next loop. B4's code stores this last string in the temporary variable B1.
B4=B1="Object Type: "+B3.m(-1)+"\r\n"
The m function gets one or more members of a set in normal or reverse order. For example, m(1) gets the first one, m([1,2,3]) gets the top three and m(-1) gets the bottom one. (B3(1) can also be used to get the first one.) Here we restore the "Object Type: " prefix at the beginning of the record, which was removed in the string splitting, and append the carriage return that was removed when fetching the text by rows from the cursor.
The first member of B3 is an empty string and the last one is an incomplete record; neither can be computed. We can delete them as follows:
B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0))
This line of code fetches the valid data from B3. If the batch under processing is not the last one, fetch members from the second one to the second-last one, discarding the first empty string and the last incomplete record. If the current batch is the last one, fetch members from the second one to the last one, which is complete, discarding only the first empty string.
B3.to(m,n) fetches the members from the mth to the nth in B3. B3.len() is the number of members in B3, which is the sequence number of the last record in the current batch of data. A1.fetch(n) means fetching n rows from cursor A1, and the @0 option means only peeking at the data while the position of the cursor remains unchanged. The if function has three parameters: a boolean expression, the result returned when the expression is true, and the result returned when it is false. When the current batch of data is not the last one, A1.fetch@0(1) returns records and the if function returns 1; when it is the last one, the value of A1.fetch@0(1) is null and the if function returns 0.
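The carry-over trick of B1-B5, holding back the last, possibly incomplete record until the next batch arrives, can be sketched in Python as a generator (the marker and the chunking are simplified stand-ins for the real log):

```python
def split_stream(chunks, marker="Object Type: "):
    """Yield complete records from a stream of text chunks, holding back the
    trailing (possibly incomplete) record until the next chunk arrives."""
    carry = ""
    for chunk in chunks:
        parts = (carry + chunk).split(marker)
        # parts[0] is pre-marker text (or empty); parts[-1] may be cut off
        for rec in parts[1:-1]:
            yield marker + rec               # restore the stripped marker
        carry = marker + parts[-1] if len(parts) > 1 else parts[0]
    if carry.startswith(marker):             # the final record is complete
        yield carry

# A record is cut in half across the two chunks and still comes out whole.
recs = list(split_stream(["Object Type: A\nx: 1\nObject Ty", "pe: B\ny: 2\n"]))
```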
B6=B5.regex(regular expression;field names list). This line of code applies the same regular expression to each member of B5 and gets the 14 fields separated by commas. The following lists the first few fields:

B7=file("e:\\result.txt").export@a(B6)
This line of code appends the results of B6 to result.txt. It will append a batch of records to the file after each loop until the loop is over. We can view this example’s final result in the big file result.txt:

In the above algorithm, the regular expression is used inside the loop. But evaluating it repeatedly gives relatively poor performance, so we'd better avoid that. In this case, we can use two esProc scripts along with the pcursor function to realize stream-style splitting and parsing.

First let’s look at the code for master routine main.dfx:

The pcursor function calls a subroutine and returns a cursor consisting of one-column records. A2 applies the regular expression to each record in A1 and returns structured data. Note that the result of A2 is a cursor instead of in-memory data. Data are fetched into memory from A2's cursor segmentally and automatically when the export function is executed.

The subroutine sub.dfx is used to return the cursor; its code is similar to the previous one. The difference is that the results need not be written to a new file; the one-column records are returned instead, as the following code shows:
B6's result statement converts the result of B5 into a one-column table sequence and returns it to the caller (the pcursor function in main.dfx) in the form of a cursor.

With the pcursor function, the master routine main.dfx can fetch data from the subroutine sub.dfx as if it were an ordinary cursor, ignoring the process of data generation. When main.dfx needs data, the pcursor function judges whether the loop in sub.dfx should continue, or whether it should supply data from the buffer area. The whole process is automatic.
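This arrangement resembles a generator pipeline, which can be sketched in Python; the record format and the regular expression below are made up, since the real 14-field expression is not shown here:

```python
import re

# "Subroutine": yields raw records one at a time, like sub.dfx returning
# a cursor; nothing is materialized in memory.
def sub(lines):
    for line in lines:
        yield line

# "Master routine": consumes the subroutine like an ordinary cursor and
# parses each record lazily, like main.dfx with pcursor.
def main(lines):
    pat = re.compile(r"left: (\d+) top: (\d+)")   # invented two-field pattern
    for rec in sub(lines):
        m = pat.search(rec)
        if m:
            yield int(m.group(1)), int(m.group(2))

result = list(main(["left: 195 top: 11", "left: 40 top: 7"]))
```

The caller drives the pipeline: each record is produced only when the consumer asks for it, which mirrors how pcursor supplies data on demand.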

1/03/2015

esProc Improves Text Processing – Insert Summary Values into Grouped Data

The usual way to insert summary values into grouped data is to process the data group by group: import a group of data, append them and their summary value to a new file, then do the same with the next group, and so on. But this is not easy to realize in hard coding. esProc, however, supports the group cursor, with which a whole group of data can be imported automatically. The following example shows how esProc deals with this kind of computation.


The log webdata.log has three columns separated by commas. The first column is the identifier for grouping data. The other two columns hold numerical values. Some of the data are as follows:
Notice that the first and fourth groups have the same grouping identifier.

Now we are asked to insert the average value of the second column and an empty row between each group, as shown below:

esProc code for doing this task:

A1=file("E:\\webdata.log").cursor(;",")

This line of code opens the log file in the form of a cursor. The cursor function returns a cursor object for the corresponding file object. In the function, the comma is used as the column separator (the default separator is tab) and the default column names are _1,_2…_n, in which _1 is the column that marks the data grouping. We can also specify the column names like cursor(groupName,data1,data2;",").
The code only creates cursor objects but does not import the data. The data importing will be started by for statement or fetch function.

B1=file("e:\\result.txt"). This line of code creates a file object for storing the computed results.

A2:for A1;_1

This line of code fetches data from the cursor in A1 by loop, importing a group of data with the same first column (named _1) each time. It is in this step that the data are really imported into memory.

The for statement here is worth special attention. for cs,n means fetching n rows from cursor cs at a time, while for cs;x means fetching a group of records with the same x field from cursor cs, in which the data need to be grouped beforehand by x. In this example, the data are already grouped; if they were not, they could be prepared using other esProc functions (like sortx, a function for sorting cursors).
The x in the statement for cs;x can be an expression, according to which multiple rows are imported each time until the value of the expression changes. For example, for A1;left(_1,4) judges the first four characters of the first column according to the expression, and the corresponding records are classified into the same group until those characters change.

B2-B4 is the loop body of for statement in A2. The loop body processes every group of data in the same way. Its working scope, as can be seen from the cellset, is represented by indentation rather than by parentheses or other identifiers like begin/end. What's more, the loop variable can be represented by the name of the cell where for statement resides, which, in this example, means A2 represents the records of the current group. Seen in debug mode, the value of A2 in the first-run loop is as follows:

B2=B1.export@a(A2;",")
This line of code appends A2 to the defined file object. export function exports a group of records to the file, in which @a option means appending. In order to keep consistent with the source data, comma is used here as the separator (though the default separator is tab). Open result.txt after the first loop and we can see the following data:

B3=A2._1+"_avg,"+string(A2.avg(_2))+"\r\n"
This line of code pieces together the summary string. A2._1 represents the first column of the current group; its value is "webcat_service" for the first group. The expression A2.avg(_2) gets the average value of the second column of the current group; the value is 2.25 for the first group. The string function formats the floating-point value into a string.

For the first group of data, B3’s value is this:

B4=B1.write@a(B3)
This line of code appends B3 to the result file. Both the export function and the write function can write data into a file: the former writes structured data, whereas the latter writes strings or an array of strings. The @a option appends the data, preceded by writing a carriage return into the file.
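The whole group-append-summarize loop can be sketched in Python with itertools.groupby (the sample rows mimic webdata.log, and the "_avg" output format follows the example above; in-memory streams stand in for the real files):

```python
import io
from itertools import groupby

# Insert a "<name>_avg" row and a blank line after each run of identical
# first-column values, as in the webdata.log example.
src = ["webcat_service,2,3", "webcat_service,2.5,1", "other,4,2"]
out = io.StringIO()

for name, grp in groupby(src, key=lambda ln: ln.split(",")[0]):
    rows = list(grp)                                     # one whole group
    avg = sum(float(r.split(",")[1]) for r in rows) / len(rows)
    out.write("\n".join(rows) + "\n")                    # the group itself
    out.write("%s_avg,%s\n\n" % (name, avg))             # summary + blank row
```

As with the esProc cursor version, only one group is held in memory at a time, so the approach also works for files bigger than memory.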

At this point, the above script has finished processing all data. The final result can be viewed in result.txt as follows: