esProc and R are both typical data processing and analysis languages built around two-dimensional structured data objects, and both are good at multi-step complex computations. However, their two-dimensional structured data objects differ considerably in the underlying mechanism. As a result, esProc is better at computation with structured data and is especially suitable for developers doing business computing, while R is better at matrix computation and more suitable for scientists doing scientific or engineering computation.
esProc’s two-dimensional structured data type is the sequence table object (TSeq). A sequence table is based on records: multiple records form a row-oriented two-dimensional table, and together with the column names this table forms a complete data structure. R’s data frame is based on vectors: multiple vectors form a column-oriented two-dimensional table, and together with the column names this table forms a complete data structure.
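The column-oriented layout of the data frame can be checked directly in R. The following is a minimal sketch; the sample values and the variable names df, amount_col and first_row are made up for illustration.

```r
# A data frame is a list of equal-length column vectors.
# Sample values are hypothetical.
df <- data.frame(
  OrderID = c(1, 2, 3),
  Client  = c("CA", "RA", "CA"),
  Amount  = c(100.5, 200.0, 50.25),
  stringsAsFactors = FALSE
)

is.list(df)                      # TRUE: the frame is a list of columns
length(df)                       # 3: one element per column

# Extracting a column yields a plain vector
amount_col <- df[["Amount"]]
is.numeric(amount_col)

# Extracting a row does not yield a vector; it is a one-row data frame
first_row <- df[1, ]
class(first_row)
```

A row has no vector of its own because its values may have different types, which is why row-wise operations tend to be more awkward in R than column-wise ones.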
These underlying mechanisms affect the actual user experience. In the following sections we compare the sequence table object and the data frame in practical use, in terms of basic functions, advanced features, an actual use case and test results.
Note: only the primitive functions of each language are used in the following comparisons; third-party extension packages are not involved.
Basic functions
Example 1: retrieve two-dimensional structured data from a file, and access the value of the second column in the first row by coordinates.
Data frame:
data<-read.table("e:/sales.txt",header=TRUE,sep="\t")
result<-data[1,2]
Sequence table:
=data=file("e:/sales.txt").import@t()
=data(1).#2
Comparison:
there is no significant difference in the most basic functions.
Note: sales.txt is a tab-separated file of structured data. Its columns, as used in the examples, are OrderID, Client, SellerId, Amount and OrderDate.
Example 2: access the value of the second column (Client) in the first row, by row number and by field name.
Data frame:
Result1<-data$Client[1]
Result2<-data[1,]$Client
Sequence table:
=data(1).(Client)
=data.(Client)(1)
Comparison: there is no significant
difference between the two.
Example 3: access column data. There are two scenarios, retrieving only the second column or retrieving the second and fourth columns together, and each can be done by column number or by column name.
Data frame:
Result1<-data[2]
Result2<-data[,c(2,4)]
Result3<-data$Client
Result4<-data[,c("Client","Amount")]
Sequence table:
=data.(#2)
=data.new(#2,#4)
=data.(Client)
=data.new(Client,Amount)
Comparison: both can access column data. The only difference is in the syntax for retrieving multiple columns: a data frame is indexed directly with the column numbers or names, while a sequence table builds a new sequence table with the new function. Although the syntax differs, the actual method is the same: both copy two columns of data from the original object into a new object.
Example 4: record manipulation, including retrieving the first two records, appending a record, inserting a record at the second row, and deleting the record in the second row.
Data frame:
Record1<-data[c(1,2),]
append<-data.frame(OrderID=152, Client="CA", SellerId=5, Amount=2961.40, OrderDate="2010-12-5 0:00:00")
data<-rbind(data, append)
insert<-data.frame(OrderID=153, Client="RA", SellerId=4, Amount=1931.20, OrderDate="2009-11-5 0:00:00")
# keep every remaining row, including the one just appended
data<-rbind(data[1,],insert,data[2:nrow(data),])
data<-data[-2,]
Sequence table:
=data([1,2])
=data.insert(0,152:OrderID,"CA":Client,5:SellerId,2961.40:Amount,"2010-12-5 0:00:00":OrderDate)
=data.insert(2,153:OrderID,"RA":Client,4:SellerId,1931.20:Amount,"2009-11-5 0:00:00":OrderDate)
=data.delete(2)
Comparison: record manipulation is possible in both. esProc is relatively more convenient: its insert function appends or inserts records into a sequence table directly, while in R we need to split the data frame and then merge the pieces again to achieve the same result indirectly.
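The split-and-merge approach can be wrapped in a small helper. The sketch below is one possible way to do it; insert_row is a hypothetical function name, not part of base R, and the sample orders are made up.

```r
# Hypothetical helper: insert the one-row data frame `row` so that it
# becomes row number `pos` of `df`.
insert_row <- function(df, row, pos) {
  n <- nrow(df)
  stopifnot(pos >= 1, pos <= n + 1)
  combined <- rbind(df, row)                 # new row sits at index n + 1
  # Build the row order with the new row moved to position `pos`
  idx <- append(seq_len(n), n + 1, after = pos - 1)
  out <- combined[idx, , drop = FALSE]
  rownames(out) <- NULL                      # renumber the rows
  out
}

# Made-up sample data
orders <- data.frame(OrderID = c(1, 2, 3),
                     Client  = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
new_order <- data.frame(OrderID = 153, Client = "RA",
                        stringsAsFactors = FALSE)

orders2 <- insert_row(orders, new_order, 2)  # new record becomes row 2
```

Even with a helper, the whole data frame is rebuilt on every insertion, which is the indirect cost the comparison refers to.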
Summary: as both the sequence table and the data frame are structured two-dimensional data objects, no significant difference exists in the basic functions for data reading/writing, data access and maintenance.
Advanced features
Example 5: modification by association. A1 and A2 are two-dimensional structured data objects with the same field, id. We now need to add the bonus field values of A2 to the salary field values of A1 according to id.
Sequence table:
A1=db.query("select id,name,salary from salary order by id")
A2=db.query("select id,bonus from bonus order by id")
A1.modify(1:A2,salary+bonus:salary)
The data frame has no function for modification by association. It has to be coded by hand, which is omitted here.
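One way such hand-written code might look is sketched below, using base R's match to align the two tables by id. The sample rows are made up, and treating a missing bonus as 0 is an assumption of this sketch rather than something the esProc example specifies.

```r
# Made-up sample data standing in for the two query results
A1 <- data.frame(id = c(1, 2, 3),
                 name = c("Ann", "Bob", "Cid"),
                 salary = c(1000, 2000, 3000),
                 stringsAsFactors = FALSE)
A2 <- data.frame(id = c(1, 3), bonus = c(100, 300))

# For each row of A1, find the row of A2 with the same id (NA if none)
pos <- match(A1$id, A2$id)
bonus <- A2$bonus[pos]
bonus[is.na(bonus)] <- 0        # assumption: no bonus row means no bonus

# Add the aligned bonus to the salary in place
A1$salary <- A1$salary + bonus
```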
Example 6: merge join. A1, A2 and A3 are two-dimensional structured data objects that share the same sequence-number field (id). Associate them with a left join. As the data is sorted by id, use a merge algorithm to speed up the association.
Sequence table:
join@m1(A1:salary,id; A2:bonus,id; A3:attendance,id)
Data frame: merge supports association of two tables, for example merge(A1,A2,by.x="id",by.y="id",all.x=TRUE) for a left join. In this case three tables are associated, which can be achieved indirectly through two two-table associations.
In addition, the data frame does not support merge joins, so no speed improvement is possible here. In other words, the data frame cannot exploit ordered data to improve performance, in association or in other operations.
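The two two-table associations can be chained with Reduce. This is a minimal sketch with made-up rows; left_join2 is a hypothetical helper name, and merge does not take advantage of the inputs already being sorted by id.

```r
# Made-up sample tables sharing the id field
A1 <- data.frame(id = 1:3, salary = c(1000, 2000, 3000))
A2 <- data.frame(id = c(1, 3), bonus = c(100, 300))
A3 <- data.frame(id = c(2, 3), attendance = c(20, 22))

# Hypothetical helper: left join of two tables on id
left_join2 <- function(x, y) merge(x, y, by = "id", all.x = TRUE)

# Fold the helper over the list: (A1 join A2) join A3
joined <- Reduce(left_join2, list(A1, A2, A3))
```

Rows of A1 without a match keep their place in the result, with NA in the columns that come from the other tables.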
Example 7: record lookup. Four scenarios: retrieve the records with Amount greater than 1000; retrieve the sequence numbers of the records with Amount greater than 1000; return the record whose primary key value is v; return the sequence number of the record whose primary key value is v.
Sequence table:
=data.select(Amount>1000)
=data.pselect(Amount>1000)
=data.find(v)
=data.pfind(v)
Data frame: only the first two scenarios can be achieved directly, with the following code:
newdata<-data[data$Amount>1000,]
positions<-which(data$Amount>1000)
The data frame has no concept of a primary key, so the other two scenarios require hand-written code as an indirect method, or a third-party package (e.g. data.table). The code is omitted here.
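A hand-written lookup could be sketched as follows. It assumes, for illustration only, that OrderID plays the role of the primary key; the sample rows and the variable names pos and rec are made up.

```r
# Made-up sample data; OrderID is assumed to be unique (the "key")
data <- data.frame(OrderID = c(10, 20, 30),
                   Amount  = c(500, 1500, 2500))
v <- 20

# Sequence number of the record whose key equals v (NA if absent)
pos <- match(v, data$OrderID)

# The record itself, or NULL when the key is not found
rec <- if (is.na(pos)) NULL else data[pos, , drop = FALSE]
```

Nothing in the data frame enforces uniqueness of the key column, so keeping OrderID unique is left to the caller.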
Example 8: grouping and aggregation. The data is grouped by Client and SellerId, and two other fields are aggregated: a sum over the Amount field and a count over the OrderID field.
Sequence table:
=data.groups(Client,SellerId;sum(Amount),count(OrderID))
Data frame: aggregate applies only one summary function per call, such as the sum of Amount, as follows:
result<-aggregate(data[,4],data[c(2,3)],sum)
To apply two different aggregations at the same time with a data frame, we have to use two separate aggregate statements and then merge the results. The code is omitted here.
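A sketch of the two-statement approach is given below, with made-up rows in the same column layout as sales.txt. The result column name OrderCount is chosen here for illustration.

```r
# Made-up sample data
data <- data.frame(
  OrderID  = 1:5,
  Client   = c("CA", "CA", "RA", "RA", "CA"),
  SellerId = c(1, 1, 2, 2, 3),
  Amount   = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

keys <- c("Client", "SellerId")

# One aggregate call per summary function
sums   <- aggregate(data["Amount"],  data[keys], sum)
counts <- aggregate(data["OrderID"], data[keys], length)

# Combine the two results on the grouping fields
result <- merge(sums, counts, by = keys)
names(result)[names(result) == "OrderID"] <- "OrderCount"
```

The data is grouped twice, once per aggregate call, before the results are joined.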
Example 9: reuse of grouping. Group the data by Client, then run multiple subsequent computations on the grouped result: the sum of Amount for each client, and the count of orders per SellerId within each client.
Sequence table:
A2=data.group(Client)
=A2.(~.sum(Amount))
=A2.(~.groups(SellerId;count(OrderID)))
The data frame does not support reuse of grouping directly. Grouping and aggregation usually have to be done in one step, which means grouping by Client twice to accomplish the same purpose, as follows:
result1<-aggregate(data[,4],data[2],sum)
result2<-aggregate(data[,1],data[c(2,3)],length)
To reuse the grouping, we must use the split function and a loop. The code is both lengthy and slow.
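The split-based approach might look like the sketch below, where sapply and lapply stand in for the explicit loop. The sample rows are made up, and groups, amount_by_client and count_by_seller are illustrative names.

```r
# Made-up sample data
data <- data.frame(
  OrderID  = 1:5,
  Client   = c("CA", "CA", "RA", "RA", "CA"),
  SellerId = c(1, 1, 2, 2, 3),
  Amount   = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Group once: a named list with one data frame per client
groups <- split(data, data$Client)

# Reuse the grouping for two different computations
amount_by_client <- sapply(groups, function(g) sum(g$Amount))
count_by_seller  <- lapply(groups, function(g) table(g$SellerId))
```

Each element of groups is a copy of the matching rows, so the grouping is reused at the cost of holding the data a second time.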
Summary: sequence tables and data frames are quite different in terms of advanced features. This shows mainly in the following five ways:
1. Richness of features. The sequence table has rich functions and is very convenient for structured data computation. The data frame originates from the matrix, has less support for structured data and lacks many features. Third-party packages can to some degree supplement the functions the data frame lacks, but these packages are no match for R’s primitive library functions in maturity and stability.
2. Difficulty of syntax. The function names of the sequence table are more intuitive. For example, select means to find, and pselect means to find the location (position). With the data frame the syntax is relatively obscure. For example, "find by field" is data[data$Amount>1000,], while "retrieve value by field" is data[,"Amount"]. The two look alike and are easy to confuse; one needs some knowledge of vectors to grasp them.
3. Memory consumption. Basically, sequence table functions return only references, which occupy very little memory. The data frame must copy records from the original object. If we need to do multiple search, association and grouping operations on large amounts of data, the data frame’s memory consumption becomes very large and can affect the whole system.
4. Code workload and code performance. The functions supported by the data frame are not rich enough, so some operations have to be hand-coded and achieved indirectly, which means more work. The R interpreter is known to be very slow, so hand-written code performs much worse than library functions.
5. Library function performance. The sequence table has many functions that improve computing performance, such as merge join, grouping functions, binary search and hash lookup. Although the data frame supports association, aggregation and search, it is hard to improve their performance.
Actual case
In this part we use a real case for a comprehensive comparison of the data frame and the sequence table.
Computation target: based on daily transaction data, select the blue-chip stocks whose prices rise for 5 days in a row.
Ideas: import the data; filter for the previous month's data; group it by ticker; sort each group by date; compute the growth of the closing price over the previous day; compute the number of consecutive days of positive growth; filter for the stocks that rise for 5 or more days in a row.
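The steps above can be sketched in R as follows. This is an illustration of the idea only, not the solution used in the comparison: the function name rising_stocks, the column names Code, Date and Close, and the sample prices are all assumptions, and the input is assumed to contain blue-chip stocks only.

```r
# Hypothetical columns: Code (ticker), Date (class Date), Close (closing price)
rising_stocks <- function(trades, from, to, min_days = 5) {
  # Keep only the requested period
  period <- trades[trades$Date >= from & trades$Date <= to, ]
  # Group by ticker
  by_code <- split(period, period$Code)
  # Longest run of consecutive positive day-over-day growth per ticker
  longest <- vapply(by_code, function(g) {
    g <- g[order(g$Date), ]            # sort by date
    up <- diff(g$Close) > 0            # growth over the previous day
    runs <- rle(up)                    # lengths of consecutive equal values
    streaks <- runs$lengths[runs$values]
    if (length(streaks) == 0) 0 else max(streaks)
  }, numeric(1))
  as.character(names(longest)[longest >= min_days])
}

# Made-up sample: AAA rises every day, BBB alternates
days <- as.Date("2013-01-01") + 0:6
trades <- data.frame(
  Code  = rep(c("AAA", "BBB"), each = 7),
  Date  = rep(days, 2),
  Close = c(1, 2, 3, 4, 5, 6, 7, 5, 6, 5, 6, 5, 6, 5),
  stringsAsFactors = FALSE
)

picked <- rising_stocks(trades, as.Date("2013-01-01"), as.Date("2013-01-31"))
```

Here rle replaces an explicit counter loop for the streak length; the per-ticker work still runs once for every group.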
Sequence Table Solution:
Data frame solution:
Comparison:
1. The data frame’s functions are not rich enough and lack specialization. We need nested loops to meet the requirement in this case, which is computationally inefficient. The sequence table has rich and diverse functions and achieves the same purpose without loop statements. The code is shorter and simpler, and the performance is higher.
2. When programming with the data frame, the code is obscure and hard to write. With the sequence table, the code is clear and easy to understand, and the cost of learning is lower.
3. When a large amount of data is involved in this scenario, memory consumption is huge. The sequence table computes by reference, which consumes less memory. The data frame computes by passing values, and its memory consumption is several times that of the sequence table, so it easily runs out of memory in this scenario.
4. To import Excel data into a data frame, R requires third-party packages, and they seem to have difficulty working together: the data import takes ten minutes to complete. With the sequence table it takes only tens of seconds.
Test Performance
Test 1: generate 10 million records in memory, each consisting of three fields whose values are random numbers. Filter the records and sum each field.
Sequence table:
Data frame:
Comparison: sequence table needs 50.534
seconds, while data frame needs 91.999 seconds. The gap is obvious.
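A test of this shape could be timed in R roughly as sketched below. This is not the code behind the figures above: the function name run_test, the field names a, b and c, and the filter condition a > 0.5 are assumptions, and the sketch uses a smaller record count so that it runs quickly.

```r
# Hypothetical reconstruction of the test's shape
run_test <- function(n) {
  # n records with three random fields
  d <- data.frame(a = runif(n), b = runif(n), c = runif(n))
  # Filter the records (the condition is an assumption)
  kept <- d[d$a > 0.5, ]
  # Sum each field of the filtered records
  colSums(kept)
}

set.seed(1)
n <- 1e5                                 # the test in the text uses 1e7
timing <- system.time(sums <- run_test(n))
elapsed_seconds <- timing[["elapsed"]]
```

Timings obtained this way depend heavily on the machine and the R version, so they are only comparable within one environment.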
Test 2: read a 1.2 GB txt file, then filter the records and sum two fields.
Sequence table:
Data frame:
Comparison: the sequence table takes 87.122 seconds, while the data frame takes 1.1347 hours, a difference of tens of times (roughly 47 times). The main reason is the data frame’s extremely low file-reading speed.
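The read, filter and sum steps of such a test might be written as in the sketch below. It is not the code that was timed: the file is a small temporary stand-in, and the field names id, x and y and the filter condition are assumptions. Declaring colClasses is a standard read.table option that spares R from guessing column types and usually reduces reading time and memory use.

```r
# Write a tiny stand-in for the large text file (made-up content)
path <- tempfile(fileext = ".txt")
sample_rows <- data.frame(id = 1:4,
                          x = c(1, 2, 3, 4),
                          y = c(10, 20, 30, 40))
write.table(sample_rows, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file with declared column types
big <- read.table(path, header = TRUE, sep = "\t",
                  colClasses = c("integer", "numeric", "numeric"))

# Filter the records, then sum two fields
kept <- big[big$x > 2, ]
totals <- c(x = sum(kept$x), y = sum(kept$y))

unlink(path)                             # remove the temporary file
```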
From the above comparison, we can see that the sequence table is better than the data frame in terms of richness of features, ease of syntax, memory consumption, development effort, library function performance and coding performance. Of course, the data frame is not the full strength of R. R has powerful vector and matrix types and a mass of associated functions, which make it more professional than esProc in scientific and engineering computation.