Java doesn't support set operations
directly, so nested loops have to be used to realize the operations of
intersection, union, complement and etc. between text files. If there are many
text files, or the file to be computed is too big to be loaded into the memory,
or it is required to perform set operations according to multiple fields, the
code will become even more complicated. However, with the assistance of esProc,
which supports set operations directly, Java can realize these operations more
easily. Through the following example, you will learn in detail how esProc
works.
There are two small files: f1.txt and f2.txt, with their first rows being the column names. You are
required to get intersection of the Name
fields of the two files. Part of the data is as follows:
File f1.txt:
File f2.txt:
esProc code:
Import the two files into the memory using import function, with tab being the separator by default. Here
function option @t means importing
the first row as the column names, thus the subsequent computation can use Name and Dept directly to reference the corresponding columns. If the first
row is not the column names, then the default column names such as _1 and _2 should be used to make the reference.
import function can import specified columns. In this case, for instance,
only Name field will be computed, so
you can merely import this column. The code for importing it is file("E:\\f1.txt").import@t(Name).
A2: isect function performs intersection operation. A1.(Name) means drawing out A1's Name field to create a set, B1.(Name) means drawing out B1's Name field. The final result of this example is as follows:
A3: The code result A2
means outputting the computed result of A2 to JDBC. You can also get the final result by combining A2 and A3
into one step, with the code being written as result
[A1.(Name),B1.(Name)].isect().
The code for complement operation is [A1.(Name),B1.(Name)].diff(). The computed result is as follows:
The concatenation operation, which is equal to union operation with duplicates retained, is special. The code for it is [A1.(Name),B1.(Name)].conj(). The computed result is as follows:
You can
also replace the functions with operators to make the code simpler. Thus the above
code for operations of intersection, union, complement and concatenation can be
rewritten as:
A1.(Name) ^ B1.(Name)
A1.(Name) & B1.(Name)
A1.(Name) \ B1.(Name)
A1.(Name) | B1.(Name)
Or you can perform set operations on
multiple files. For instance, the corresponding variables of f1.txt, f2.txt and f3.txt after
they are imported into the memory are A1, B1 and C1 respectively. The code for
getting their intersection is as follows:
Sometimes the file is so big that it would
degrade performance. In this occasion, sort
function can be used to sort data first, and then perform set operations using merge function. In this way, the
performance will be improved significantly. Function options @i, @u
and @d need to be used respectively
in performing operations of intersection, union and complement. Their
corresponding code is respectively as follows:
=[A1.(Name).sort(),B1.(Name).sort()].merge@u()
=[A1.(Name).sort(),B1.(Name).sort()].merge@d()
merge function can also be used in performing set operations on multiple
fields. Suppose different Dept
corresponds to the same Name, now it
is required to perform intersection operation by taking the two fields as a
whole. The code will be written as:
[A1.sort(Dept,Name),B1.sort(Dept,Name)].merge@i(Dept,Name)
The computed result is as follows:
For files that are too big to be loaded
into the memory, cursor function can
be used to import them and merge
function to perform set operations on them. For example, the code for
intersection operation is as follows:
A1=file("e:\\f1.txt").cursor()
B1=file("e:\\f2.txt").cursor()
A2=[ A1.sortx(Name),B1.sortx(Name)].merge@xi(Name)
B1=file("e:\\f2.txt").cursor()
A2=[ A1.sortx(Name),B1.sortx(Name)].merge@xi(Name)
Note: cursor
function won’t import all the data into the memory at once. It opens a file in
the form of cursors (or flow). esProc engine will allocate appropriate buffer
for each imported batch of data. The operation thus runs until it is completed.
Different from in-memory computing, the use
of cursor requires cursor functions.
For instance, sortx function should
be used for sorting. Here merge
function uses two function options: @i,
which means performing intersection operation, and @x, which means the object involved in the computation is cursor
instead of in-memory data. Besides, functions like union can only perform set operations on
in-memory data; hence they cannot be used to deal with big files.
The above esProc script has
completed all data processing. The rest of the work is to integrate the script
with Java program through JDBC. Java code is as follows:
// create a connection using esProc
JDBC
Class.forName("com.esproc.jdbc.InternalDriver");
con=
DriverManager.getConnection("jdbc:esproc:local://");
// call esProc script, whose name is test
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call
test()");
st.execute();//execute esProc stored
procedure
ResultSet set = st.getResultSet();//get
the result
No comments:
Post a Comment