12/29/2014

esProc Improves Text Processing – Parse Logs with Arbitrary Number of Lines

When parsing logs into structured data, we often find that the records consist of a variable number of lines. This makes the conversion, as well as the corresponding operations, quite complicated. Equipped with various flexible functions for structured data processing, such as regular expressions, string splitting, fetching data located in a different row, and data concatenation, esProc is ideal for processing this kind of text. The following example shows how it works.


The log file reportXXX.log holds records, each of which consists of multiple lines with 14 data items (fields) and starts with the string "Object Type". Our goal is to rearrange the log into structured data and write the result to a new text file. Some of the source data are as follows: 

esProc code for doing this:

A1=file("e:\\reportXXX.log").read()
This line of code reads the log file entirely into memory. The result is as follows:

A2=A1.array("Object Type: ").to(2,)
This line of code can be divided into two parts. The first part, A1.array("Object Type: "), splits A1 into strings by "Object Type: ". The result is as follows:

Except for the first item, every item is valid data. to(2,) means getting items from the second one through the last one. The result of A2 is as follows:

A3=A2.regex(regular expression;field names list)
This line of code applies the same regular expression to each member of A2 and gets the 14 fields separated by commas. The following lists the first fields:

A4=file("e:\\result.txt").export@t(A3)  
This line of code writes the final result to a new file. Tab is the default separator. The @t option exports the field names as the file's first row. We can see the following data in result.txt:

The regular expression used in the code above is complicated. We will now use esProc's built-in functions to make the operation more intuitive. For example, the ObjectType field is the first line of each record, so we can separate the records from each other by the line break and then take the first line. The left, top, right and bottom fields are obtained by splitting each record's second line by spaces and taking items 3, 5, 7 and 9.

The task can be handled with esProc built-in functions as follows:

In the above code, the pjoin function concatenates multiple sets together; the array function splits a string into segments by the specified delimiter and creates a set with them, in which ~.array("\r\n") splits each record by the carriage return.
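That grid is not reproduced here, so below is a simplified sketch of the same idea. It is an assumption rather than the article's original code: it uses the new function instead of pjoin and only builds the first five fields (ObjectType, left, top, right and bottom); the remaining fields would follow the same splitting pattern.

A1=file("e:\\reportXXX.log").read()
A2=A1.array("Object Type: ").to(2,)
A3=A2.(~.array("\r\n"))
A4=A3.new(~(1):ObjectType, ~(2).array(" ")(3):left, ~(2).array(" ")(5):top, ~(2).array(" ")(7):right, ~(2).array(" ")(9):bottom)
A5=file("e:\\result.txt").export@t(A4)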
In the above example, we assumed that the log file is not big and can be wholly loaded into memory for computing. But sometimes the file is big and needs to be imported, parsed and exported in batches, which makes the code much harder to write. Besides, because the number of lines per record is variable, there is almost always a record at the end of a batch that has not been imported completely. This further complicates the coding.

esProc can handle big log files whose records span an arbitrary number of lines more easily by using file cursors. The following is a code sample:

A1=file("\\reportXXX.log").cursor@s()
This line of code opens the log file in the form of a cursor. The cursor function returns a cursor object according to the file object, with tab being the default separator and _1, _2…_n being the default column names. The @s option means ignoring the separator and importing the file as a single column of strings, with _1 being the column name. Note that this code only creates a cursor object and doesn't import data. Data importing will be started by the for statement or the fetch function.

A2: for A1,10000
A2 is a loop statement, which imports a batch of data (10,000 rows) each time and sends it to the loop body. This won't stop until the end of the log file is reached. It can be seen that a loop body in esProc is represented visually by indentation instead of parentheses or identifiers like begin/end. The area B2-B7 is A2's loop body, which processes data like this: the current batch of data is restored to text joined by carriage returns; the text is split into records again according to "Object Type"; the last, incomplete record is saved in B1, a temporary variable; the first and the last records, both of which are useless, are deleted; and then the regular expression is applied to each of the remaining records, producing a two-dimensional table that is written to result.txt. The following explains this process in detail:
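Before walking through each cell, here is roughly what the whole grid looks like when the cells described below are put together. This is a sketch, not the article's original screenshot: the temporary variable B1 is assumed to be initialized to an empty string in the first row, and the regular expression and field names list in B6 are placeholders.

A1=file("e:\\reportXXX.log").cursor@s()    B1=""
A2: for A1,10000
    B2=B1+A2.(_1).string@d("\r\n")
    B3=B2.array("Object Type: ")
    B4=B1="Object Type: "+B3.m(-1)+"\r\n"
    B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0))
    B6=B5.regex(regular expression;field names list)
    B7=file("e:\\result.txt").export@a(B6)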

B2=B1+A2.(_1).string@d("\r\n")
This line of code concatenates the temporary variable B1 with the current text. In the first-run loop, B1 is empty. After that, B1 holds the incomplete record left over from the previous loop, which is concatenated with the current text, making the incomplete record complete.

The string function concatenates the members of a set with the specified separator, and the @d option prevents the members from being surrounded with quotation marks. The top rows in A2 are as follows:

A2.(_1) represents the set formed by field _1 in A2 :

A2.(_1).string@d("\r\n") means concatenating members of the above set into a big string, which is Object Type: Symbol Location: left: 195 top: 11 right: 123 bottom: 15 Line Color: RGB ( 1 0 0 ) Fill Color:   RGB ( 251 255 0 ) Link:l11…. 

B3=B2.array("Object Type: ")
This line of code splits the big text in B2 into strings by "Object Type". Result of B3's first-run loop is as follows:

Since the last string in B3 is not a complete record and cannot be computed, it will be stored in the temporary variable and concatenated with the new string created in the next loop. B4's code will store this last string in the temporary variable B1.

B4=B1="Object Type: "+B3.m(-1)+"\r\n"
The m function gets one or more members of a set in normal or reverse order. For example, m(1) gets the first one, m([1,2,3]) gets the top three and m(-1) gets the last one. Alternatively, B3(1) can be used to get the first one. The "Object Type: " at the beginning of the record, which was removed by the string splitting, is restored here, and the carriage return removed when fetching the text by rows from the cursor is appended.
The first member of B3 is an empty row and the last one is an incomplete row; neither of them can be computed. We can delete them as follows:

B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0))
This line of code fetches the valid data from B3. If the batch being processed is not the last one, it fetches rows from the second one to the second-last one, discarding the first empty row and the last incomplete row. If the current batch is the last one, it fetches rows from the second one to the last one, which is complete, discarding only the first empty row.

B3.to(m,n) fetches rows from the mth one to the nth one in B3. B3.len() represents the number of records in B3, which is the sequence number of the last record in the current batch of data. A1.fetch(n) means fetching n rows from cursor A1, and the @0 option means only peeking at the data while leaving the position of the cursor unchanged. The if function has three parameters: a boolean expression, the result returned when the expression is true, and the result returned when it is false. When the current batch of data is not the last one, A1.fetch@0(1) returns valid records and the if function returns 1; when it is the last one, A1.fetch@0(1) is null and the if function returns 0.

B6=B5.regex(regular expression;field names list). This line of code applies the same regular expression to each member of B5 and gets the 14 fields separated by commas. The following lists the first fields:

B7=file("e:\\result.txt").export@a(B6)
This line of code appends the results of B6 to result.txt. It will append a batch of records to the file after each loop until the loop is over. We can view this example's final result in the big file result.txt:

In the above algorithm, a regular expression was used inside the loop. But a regular expression has relatively poor parsing performance, so we'd better avoid using it there. In this case, we can use two esProc scripts along with the pcursor function to realize stream-style splitting and parsing.

First let's look at the code for master routine main.dfx:

The pcursor function calls a subroutine and returns a cursor consisting of one-column records. A2 applies the regular expression to each record in A1 and returns structured data. Note that the result of A2 is a cursor rather than in-memory data. When the export function is executed, data will be fetched from A2's cursor into memory for computing, segment by segment, automatically.
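Based on that description, main.dfx might look roughly like this. It is a sketch: the subroutine name sub.dfx and the regular expression with its field names list are placeholders, not the article's original values.

A1=pcursor("sub.dfx")
A2=A1.regex(regular expression;field names list)
A3=file("e:\\result.txt").export@t(A2)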

Subroutine sub.dfx is used to return the cursor, and its code is similar to the previous one. The difference is that the results need not be written to a new file; the one-column records are returned instead, as the following code shows:

B6's result statement converts the result of B5 to a one-column table sequence and returns it to the caller (the pcursor function in main.dfx) in the form of a cursor.
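Assembled from the cells described earlier, sub.dfx might look roughly like this. Again a sketch: B1 is assumed to start as an empty string, and the exact form of the result cell may differ in the original script.

A1=file("e:\\reportXXX.log").cursor@s()    B1=""
A2: for A1,10000
    B2=B1+A2.(_1).string@d("\r\n")
    B3=B2.array("Object Type: ")
    B4=B1="Object Type: "+B3.m(-1)+"\r\n"
    B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0))
    B6: result B5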

With the pcursor function, the master routine main.dfx can fetch data from the subroutine sub.dfx as if it were an ordinary cursor, ignoring the process of data generation. When main.dfx needs data, the pcursor function decides whether the loop in sub.dfx should continue, or whether the data can be supplied from the buffer. The whole process is automatic.

12/28/2014

esProc Improves Text Processing – Characters Matching

Sometimes during text processing you need to find words containing certain characters. The logic of this computation is simple, but the code is difficult to write with a regular expression because the order of the characters is flexible; moreover, that method is inefficient. You could do better by writing the program yourself, but high-level languages don't support set operations, which also makes the coding hard. By contrast, esProc can parse a string dynamically and thus can match specific characters more easily, with simple and intuitive code. Let's look at how it works through the following example.


Find out words containing e, a, c from the following Sample.txt. Some of the original data are as follows:

esProc code for doing the task:

A1=file("e:\\sample.txt").read()
This line of code reads the text file as one big string. Besides, the read function, used with the @n option, can read the data by lines. For example, the result of executing file("e:\\sampleB.txt").read@n() is as follows:

import function can be used if the data are structured. To import, for example, a file with tab being the separator and the first row being the column names, the code can be file("e:\\sampleC.txt").import@t(). The result is as follows:

A2=A1.words()
This line of code splits the big string into multiple words and creates a set with them. The words function automatically filters out the numbers and signs and selects only the alphabetic words. Adding the @d option selects only the numbers, and @a selects both the words and the numbers. The result of A2 is as follows:

A3=A2.(~.array(""))
This line of code splits each word in A2 into characters. "~" represents each member of the set (word); there is no space within the double quotation marks (""). When the code is executed, A3 holds the subsets of a set, as shown below:

A4=A3.select(set==set^~)
This line of code selects the words containing set's characters. The select function executes a query; in it, "~" represents the A3 member being computed, the operator "^" represents intersection, and "set==set^~" means that if the intersection of set and the current member equals set itself, the current member is an eligible word according to the query condition. "==" is a comparison operator; operators of the same kind include "!=" (not equal to), "<" (less than) and ">=" (greater than or equal to). "^" is a binary operator representing intersection; other operators of the same kind include "&" (union) and "\" (difference).
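As a quick, hedged illustration (assuming "^" keeps the order of its left operand): with set being ["e","a","c"], the expression ["e","a","c"]^"cat".array("") yields ["a","c"], which is not equal to set, so "cat" would be rejected; ["e","a","c"]^"Rebecca".array("") yields ["e","a","c"], which equals set, so "Rebecca" would be selected.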

set is an external parameter, which can be transferred from either the command line or a Java program according to its different usages. It can be defined on the Integration Development Environment (IDE) interface, as shown below:

Suppose the value of parameter set is ["e","a","c"], then the above line of code is equal to A3.select(["e","a","c"]==["e","a","c"]^~). Once it is executed, the result is as follows:

It can be seen that both "complicated" and "Rebecca" contain the three characters: e, a, c.

Besides computing the intersection, the operation can be realized through a position query. The corresponding code is A3.select(~.pos(set)). The pos function locates the members of set in ~ (also a set). If all of them can be found, it returns a sequence consisting of their sequence numbers (which counts as true); if not, it returns null (which counts as false).
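For instance (a hedged illustration, assuming pos returns the positions of the first occurrences): "Rebecca".array("").pos(["e","a","c"]) would return a sequence of positions such as [2,7,5], which is treated as true, while "cat".array("").pos(["e","a","c"]) would return null because "e" cannot be found.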

After A4 finds the words satisfying the query condition, the characters of each set (each set is in fact a word) are joined back together using the following code:

A5=A4.(~.conj@s())

The conj function concatenates multiple sets together to form a single set. When used with the @s option, it combines all the members of a set into a string. The final result of this example is as follows:

The above step-by-step computation is intuitive and easy to understand. Actually, you can omit the steps of splitting each word up and concatenating the characters again, so the code becomes a single line:
 file("e:\\sample.txt").read().words().select(set==set ^ ~.array(""))

12/25/2014

esProc Improves Text Processing – Fetching Data from a Batch of Files

Sometimes we need to fetch certain data from multiple files in a multi-level directory during text processing. The operation is too complicated to be performed well at the command line. Though it can be realized in high-level languages, the code is difficult to write, and the involvement of big files increases the difficulty. esProc, however, can import big files with cursors and call a script recursively, so it can fetch the data in batches. The following example shows how it works.


A directory - "D:\files" – has subdirectories of multiple levels. Each subdirectory has many files of text format. We are asked to fetch a specified line (say the second line) from each of these files and write them into a new file – result.txt. Part of the structure of D:\files is as follows:

esProc code for doing this:

First define a parameter, path, and set its initial value as "D:\files" so as to get data from this directory, as shown below:

A1=directory@p(path)
The directory function gets the file list in the root directory specified by the parameter path. The @p option means file names are presented with their full paths. The following shows some of the results:

A2=A1.(file(~).cursor@s()). This line of code opens A1's files respectively in the form of cursors. A1.(…) means processing A1's members in proper order; "~" represents the current member; file function is used to create a file object and cursor function will return a cursor object according to the file object.

Tab is used as the default separator in the cursor function, and the default column names are _1, _2…_n. The @s option means ignoring the separator and importing the file content as strings in a single column, with _1 as the column name. Note that this code only creates the cursor objects but doesn't fetch data. Data fetching will be started by the fetch function. The results of A2 are as follows:

A3=A2.((~.skip(1),~.fetch@x(1))). This line of code fetches the second row from each of A2's file cursors. A2.(…) means computing A2's cursors one by one. (~.skip(1),~.fetch@x(1)) means computing the expressions in the parentheses in order and returning the last result. ~.skip(1) skips a row; ~.fetch@x(1) fetches the row at the current position (i.e. the second row), and the @x option closes the cursor automatically after the data are fetched. ~.fetch@x(1) is the result that the parentheses operator returns.

The skip function skips multiple rows; the number of rows to skip is given by its parameter. The fetch function fetches multiple rows. To skip the first ten rows and then fetch the next two (the 11th and the 12th), for example, the code is (~.skip(10),~.fetch@x(2)).

The following shows some of the results of A3:

A4=A3.union(). This line of code unions A3's results together. The union function realizes the union operation, removing duplicate data at the same time. For example, the code for computing the union of the two sets [1,2] and [2,3] is [[1,2],[2,3]].union() and the result is [1,2,3]. If duplicate data are wanted, the conj function (for concatenation) should be used. Some of the results of A4 are as follows:

A5=file("d:\\result.txt").export@a(A4). This line of code exports the results of A4 to result.txt. export function is used to write data to a file. @a option means appending.

At this point, all data have been fetched as required from the current directory. The rest of the work is to fetch the subdirectories of the current directory and to call this script recursively.


A6=directory@dp(path). directory function is used to fetch all the subdirectories from the current directory. One of the options, d, means fetching the subdirectory names and the other one, p, means fetching the full paths. Thus A6 gets the subdirectories from D:\files:

A7=A6.(call("c:\\readfile.dfx",~)). This line of code deals with A6's members (the subdirectories). The operation is to call the esProc script - c:\\readfile.dfx, and makes the current member (one of the subdirectories) as the input parameter. Note that readfile.dfx is the name of this script.
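Putting the cells described above together, readfile.dfx reads roughly as follows (a sketch assembled from the steps in this section):

A1=directory@p(path)
A2=A1.(file(~).cursor@s())
A3=A2.((~.skip(1),~.fetch@x(1)))
A4=A3.union()
A5=file("d:\\result.txt").export@a(A4)
A6=directory@dp(path)
A7=A6.(call("c:\\readfile.dfx",~))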

Through the recursive call in A7, esProc will fetch data from a batch of files of the multilevel directory of D:\files. You can see the final result in result.txt:

12/24/2014

esProc Improves Text Processing –Batch String Replacement

During text processing, we sometimes need to replace multiple strings in a source file according to a criteria file. The command line can be used to replace a single string, but it cannot perform batch string replacement. High-level languages can only handle this task through complicated multilayer loops, and if the source file is too big to be loaded into memory, the task becomes even more difficult. esProc supports processing the loop with an iterative function and importing big files with cursors, so it can perform batch string replacement much more easily. The methods are explained in detail through the following examples.


A criteria file, condition.txt, has two columns, with tab as the separator. The before column holds the strings waiting to be replaced and the after column holds the resulting strings after replacement. Suppose we need to replace certain strings in source.txt in batch according to this configuration file and write the result to result.txt. Some of the data of condition.txt are as follows (the first row holds the column names):

The following is part of the source file, source.txt
During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.

    esProc supports performing conditional query on big files and multithreaded parallel computing, and its code for handling this kind of problem is both concise and efficient. The following example will teach you the esProc method of doing the job.

esProc code for doing the task:

A1=file("e:\\condition.txt").import@t()

This line of code imports the criteria file. import function can import a text file or a binary file as a two-dimensional table (a table sequence), with tab being the default column separator. @t means making the first row the column names. Result of A1 is as follows:

A2=file("e:\\source.txt").read()

This line of code reads the source file. read function can read a text file as a big string. Result of A2 is as follows:

A3=A1.loops(replace(~~,before,after);A2)

This line of code replaces A2's strings in batch according to A1. As an iterative function, loops performs a loop computation over a set (here A1, the set of criteria records): it takes the members of the set in order and uses each one to compute the specified expression (here replace(~~,before,after)). Each computed result is used in the next round of computation (~~ represents the previous result) until the last member is processed. A2 is the initial value for loops.

The replace function performs string replacement. It has three parameters: the source string, the string to be replaced and the replacement string, represented here by ~~, before and after respectively. before and after are the column names (field names) of the table sequence in A1.
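To make the iteration concrete, suppose (hypothetically, since condition.txt's real content isn't shown here) that A1 held just two rows, before="colour"/after="color" and before="centre"/after="center". Then loops would first compute replace(A2,"colour","color"), feed that result in as ~~ for the second round, compute replace(~~,"centre","center"), and return the final string.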

Actually only A3 really performs the replacement. The following line of code writes the result to a file.  

A4=file("e:\\result.txt").write(A3). Here write function is used to write the strings to a file.

You can also combine these steps into a single line of code:
A1=file("e:\\condition.txt").import@t().loops(replace(~~,before,after);file("e:\\source.txt").read())
A2=file("e:\\result.txt").write(A1)

If the file is too big to be loaded into memory, the data can be imported in segments, the replacement performed on each segment, and each result appended to the new file. The computation proceeds this way until the whole file is processed. The corresponding esProc code is as follows:

A1=file("e:\\condition.txt").import@t(). This line of code imports the criteria file.

A2=file("e:\\source.txt").cursor@s().

This line of code opens the source file. The cursor function won't import the whole file into memory; instead it opens the file in the form of a cursor (stream). The @s option means the data will be imported as a single-column table sequence, with _1 as the column name. Without the option, the data would be imported as a multi-column table sequence according to the separator, and the columns would be named _1, _2, _3…_n automatically.

A3:for A2,1000

This loop imports data from the cursor in A2, one batch (1,000 rows in this case) at a time.

The area B3-B5 is the loop body of A3, whose operation is similar to the previous example: perform batch string replacement on the current rows and append the result to the new file. Note that a loop body is represented visually in esProc by indentation instead of parentheses or identifiers like begin/end.

B3=A3.(_1).string@d("\r\n")

This line of code converts the current batch of data to one big string. A3 is the loop variable, representing the current batch of data. A3.(_1) means fetching column _1 from A3. The string function concatenates the members of a set into a big string with the specified separator, the carriage return in this example. The @d option prevents each member from being surrounded with double quotation marks.

B4=A1.loops(replace(~~,before,after);B3)

This line of code performs the batch string replacement on the current big string.

B4=file("e:\\result.txt").write@a(B4)
This line of code writes the replacement result of the current row to the new file. @a represents appending the result to the file.
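Assembled from the cells above, the whole grid for the big-file version looks roughly like this (B3 through B5 are indented to mark them as A3's loop body):

A1=file("e:\\condition.txt").import@t()
A2=file("e:\\source.txt").cursor@s()
A3: for A2,1000
    B3=A3.(_1).string@d("\r\n")
    B4=A1.loops(replace(~~,before,after);B3)
    B5=file("e:\\result.txt").write@a(B4)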

Thus you have completed the batch string replacement with regard to a big file. See the final data in result.txt:

12/23/2014

esProc Improves Text Processing – Conditional Query on Big Files

During text processing, you often have to query data from a big file on one or more conditions. The command-line grep/cat commands can handle some simple situations with simple commands, yet with low efficiency. High-level languages can achieve much higher efficiency, but with complicated code. If the query conditions are complex or dynamic, you also need to build an additional SQL-like low-level class library, which increases the complexity of the computation.

esProc supports performing conditional query on big files and multithreaded parallel computing, and its code for handling this kind of problem is both concise and efficient. The following example will teach you the esProc method of doing the job.

A text file, employee.txt, holds the employee data. Import the data, select the female employees born on or after January 1, 1981, and export the query result to result.txt.

The format of employee.txt is as follows:
EID  NAME     SURNAME  GENDER  STATE       BIRTHDAY    HIREDATE    DEPT     SALARY
1    Rebecca  Moore    F       California  1974-11-20  2005-03-11  R&D      7000
2    Ashley   Wilson   F       New York    1980-07-19  2008-03-16  Finance  11000
3    Rachel   Johnson  F       New Mexico  1970-12-17  2010-12-01  Sales    9000
4    Emily    Smith    F       Texas       1985-03-07  2006-08-15  HR       7000
5    Ashley   Smith    F       Texas       1975-05-13  2004-07-30  R&D      16000


esProc code for accomplishing the task: 

A1=file("d:/employee.txt").cursor@t(). Open the file as a cursor. The cursor function won't import all the data into memory; it opens the file in the form of a cursor (stream) without a memory footprint. The function uses its default parameter here, which makes tab the column separator and imports all the fields. The @t option means the file's first line holds the column names, so specific column names can be used in expressions later. Without the option, columns would be named _1, _2, _3…_n automatically.

A2=A1.select(${where})
Filter the data according to the condition. Here a macro is used to parse the expression dynamically. "where" is the dynamic input parameter and needs to be pre-defined. The following is the interface on which the parameter is defined:

The esProc program will first compute the expression surrounded by ${…}, then assign the computed result as the value to the macro string and replace ${…} with it; after that, the program will interpret and execute the code. For example, if where gets assigned as BIRTHDAY>=date(1981,1,1) && GENDER=="F" according to the given condition in the example, the expression in A2 will be =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F"). The parameter can be entered into esProc’s Integration Development Environment (IDE), or can be passed from the Java code or the command line.

A3=file("D:/result.txt").export@t(A2). This line of code exports the computed result to a file. If the size of computing result is always small, use the code =A2.fetch() in A3 to fetch the results into the memory for direct observation, or use result A2.fetch() to return the results to the Java application. 
The final result of this example is as follows:


This example shows the method of realizing a dynamic query: there is no need to change the code when the query condition changes; just modify the value of the parameter "where". For example, if the condition becomes "query female employees born on or after January 1, 1981, or employees whose full name is RebeccaMoore", the value of "where" can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After the code is executed, the result set of A2 will be as follows:

The above algorithm is a sequential computation. But the use of parallel computation can further improve the performance. The method is: Import the file using multithreads, each of which accesses a part of the file with a cursor; meanwhile query the data according to the condition and finally merge the result of each cursor together.

The esProc code for parallel computing is as follows:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number equals the number of parallel tasks, which generally should not exceed the number of CPU cores; otherwise the tasks will be queued for processing and efficiency won't really be increased. The maximum number of parallel tasks can be configured in the environment options.
A2=A1.(file("d:/employee.txt").cursor@z(;, ~:A1))

This line of code generates four cursors according to the specified number of segments. A1.(expression) means computing the expression with each member of A1 in order; "~" can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1","file2"] or [2,3]. If the members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in the simpler form 4.(expression), as in this example.

file("d:/employee.txt ").cursor@z(;, ~:A1) surrounded in the parentheses is an expression, in which cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file will be roughly divided into four segments (A1=4) and the ~th segment will be fetched. "~" represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided "roughly".

A3=A2.(~.select(${where})). This line of code will query data of each cursor (i.e. ~) in A2 and select the eligible rows. The computed results are still four cursors.

A4=A3.conj@xm(). This line of code will merge the four cursors in A3 in parallel.

A5=file("d:/result.txt").export(A4). This line of code will export the final result to a file. 

12/22/2014

esProc Improves Text Processing – String Matching with Big Files

There are many occasions during text processing that require performing string matching on big files. Coding with command-line grep/cat is simple yet inefficient. Though higher efficiency can be achieved with high-level languages, the coding is rather difficult.

Yet this operation, as well as multithreaded parallel computing, can be handled more easily in esProc, with more concise code and much better performance. The following examples will show esProc method in detail.


file1.txt has a great many strings. Find out the rows ending with ".txt" and export them to result.txt. Some of the original data are as follows:

esProc code for doing this task:

A1=file("e:\\file1.txt").cursor(). Open the file in the form of a cursor. Instead of importing all the data into memory at once, the cursor function opens the file in the form of a cursor (stream) without a memory footprint. The function uses default parameters to import all the fields, with tab as the column separator, naming them _1, _2, _3…_n automatically. There is only one field, _1, in this example.

A2=A1.select(like@c(_1,"*.txt"))
This line of code selects rows ending with ".txt" from cursor A1. select function executes the query and like function performs string matching. _1 represents the first field. The use of @c option in like function means the matching is case insensitive.

One point worth noting is that the result of A2 is still a cursor without memory footprint. Only with the use of functions like export/fetch/groups will esProc allocate suitable memory buffers and convert the cursor computing to memory computing.

A3=file("e:\\result.txt").export(A2). This line of code exports the final result to a file. Some of the data are as follows:

The matching rule in the example above is relatively simple. If the rule is complex, a regular expression will be needed. For example, find out rows starting with "c:\windows" and not ending with ".txt".

The regex function performs string matching with a regular expression. Just modify A2's code to A1.regex@c("^c:\\\\windows.*(?<!\\.txt)$"), in which the @c option means case insensitive.

Though the regular expression can be used to realize the string matching with complex rule, its performance is not satisfactory. For example, to find out rows ending with ".txt" from a file of 2.13G size in the same test environment, it takes 206 seconds with a regular expression, while it takes only 119 seconds with an ordinary expression (the select statement).

In fact, many string-matching tasks with complex rules can also be realized with ordinary expressions. Moreover, the syntax is more intuitive and the learning cost is lower. For example, emp.txt holds a large number of user records, each of which has multiple fields separated by tab, with the first row holding the column names. The requirement is: "the EId field is greater than 100, the first letter of the Name field is a, and the Birthday field is greater than 1984-01-01". You can do it in esProc as follows:

The @t option used with cursor function means that the first row will be imported as column names for the use of accessing data at a later time.

The three query conditions can be represented by EId>100, like@c(Name,"a*") and Birthday>=date("1984-01-01") respectively. The logic relation between the conditions is "AND", which can be represented by &&.
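A sketch of that grid might look like the following; the file path and the export cell are assumptions, not taken from the article:

A1=file("e:\\emp.txt").cursor@t()
A2=A1.select(EId>100 && like@c(Name,"a*") && Birthday>=date("1984-01-01"))
A3=file("e:\\result.txt").export@t(A2)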

The above algorithm is sequential computation. The performance can be further improved if parallel computing is used. The method is this: import the file using multiple threads, each of which accesses part of the file with a cursor and performs the filtering at the same time; finally, merge the results of the cursors together.

Testing the processing of a 2.13GB file under the same hardware environment, the sequential computation takes an average of 119 seconds, whereas the parallel computation takes only an average of 56 seconds, nearly doubling the performance. The algorithm used in the example is not complex, so the bottleneck is the hard drive's speed in importing data. As the complexity of the computation increases, the performance gain will be greater.

esProc code for parallel computing:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number equals the number of parallel tasks, which generally should not exceed the number of CPU cores; otherwise the tasks will be queued for processing and performance won't really be increased. The maximum number of parallel tasks can be configured in the environment options.

A2=A1.(file("e:\\file1.txt").cursor@z(;, ~:A1))
This line of code generates four cursors according to the specified number of segments. A1.(expression) means computing the expression with each member of A1 in order; "~" can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1","file2"] or [2,3]. If the members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in the simpler form 4.(expression), as in this example.

In the expression, file("e:\\file1.txt").cursor@z(;, ~:A1), surrounded in the parentheses, cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file is roughly divided into four segments (A1=4) and the ~th segment is fetched. "~" represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided "roughly".

A3=A2.(~.select(like@c(_1,"*.txt"))). This line of code queries data of each cursor (i.e. ~) in A2 and selects the eligible rows. The computed results are still four cursors.

A4=A3.conj@xm(). This line of code merges the four cursors in A3 in parallel.
A5=file("e:\\result.txt”).export(A4). This line of code exports the final result to a file.

An esProc script can not only work independently in the Integration Development Environment (IDE), it can also be called by a Java program through the JDBC interface. The calling method is the same as calling an ordinary database. A one-step esProc script can be embedded in the Java program directly, without a script file. Actually, the above steps can be combined into one single step:
file("e:\\result.txt").export(4.(file("e:\\file1.txt").cursor@z(;, ~:4)).(~.select(like@c(_1, "*.txt"))).conj@xm())

It is also possible to run this kind of one-step script from the operating system's command line. Please refer to the related documents for further information.