During processing the text file, sometimes
we need to remove duplicate rows from the grouped data. But the operation
becomes complicated when the file under processing is too big to be entirely loaded
into the memory. esProc’s group operation supports importing a whole group of
data with the file cursor, as well as many options. So it can handle this kind
of operation effortlessly. The following example will show you how it works.
The file EPRom.log has four columns separated by tab and its data have been
grouped by the second column. Duplicate rows need to be removed (only the first
row of each group is wanted). Some of the source data are as follows:
esProc code for doing this:
A1=file("E:\\EPRom.log").import()
This line of code imports the file into the memory. By default, the separator is tab and column names are _1,_2,_3……. If it is a CSV file, the separator needs to be specified using the code import(;”,”). If the file’s first row contains column names, @t option can be used to import them, that is, import@t(). The result of A1 is as follows:
A2=A1.group@1o(_2)
This line of code gets the first row from every group. The group field is _2, the second field. This is the case’s final result, as shown below.
By default, group function will regroup the data. For instance, A1.group(_2) will divide A1 into two groups by the second field, as the following shows:
But the use of @o option won’t make that happen. For instance, result of A1.group@o(_2) is as follows:
With @1
option, the function will get the first row of every group. By using both @1 and @o, we’ve hit the target of this case.
A1=file("E:\\EPRom.log").cursor()
This line of code opens the log file in the
form of cursor. cursor function
returns a cursor object according to the corresponding file object, with tab
being the separator and _1,_2…_n being column names by default. Notice that the
code merely creates the cursor object without importing data. Data importing
will be started by for statement or fetch function.
B1=
file("e:\\result.txt")
This line of code creates a file object to
which the computed results can be exported later on.
A2:for A1;_2
This line of code imports data from cursor
A1 by loop. Each time it imports a group of data with the same second column
(column name is _2). In this step the data are truly being imported into the
memory.
Here for
statement deserves special attention. In esProc, for
cs,n imports n records from cursor
cs each time. for cs;x imports a group of records with the same x field from cursor cs each time, the condition is that records have been grouped by x field.
The x
in for cs;x statement can be an expression,
which means multiple rows will be imported until the result of computing expression
x changes. Take for A1 ; floor(_1/5) as
an example. It divides _1 field by 5 and rounds the result off, put the records
with the same results into the same group, like the first row to the fifth row.
B2=file("e:\\result.txt").export@a([A2(1)])
As the loop body of for statement in A2, it processes every group of data in the same
way. The method is to get the first row of the current group and append it to
file result.txt. A2 is the loop
variable which represents all records in the current group. A2(1) represents
the first record in 2. export function
is used to write the structured data into a new file, its @a option means appending. Since A2(1) is a single record, it needs
to be converted into array with the operator [].
In esProc, the working range of for statement can be represented by indentation instead of the parentheses or identifiers like begin/end. In the following code block, for instance, B2-B5 is A2’s working range.
No comments:
Post a Comment