Data Process Engine: r language

Showing posts with label r language. Show all posts

3/16/2016

Handling 10 SQL Problems with R-like Languages

SQL is a sophisticated and all-around database programming language, making most instances of structured-data computing a painless experience. Yet there are still some instances that are difficult to handle in SQL in computer programming.

Here’s an example. duty is a MySQL table of shift schedule, in which an employee works several weekdays continuously before another employee takes his place. The task is to find out the beginning and ending of every shift based on the duty table. Below is the source table:

workday	name
2016-02-05	Ashley
2016-02-08	Ashley
2016-02-09	Ashley
2016-02-10	Johnson
2016-02-11	Johnson
2016-02-12	Johnson
2016-02-15	Ashley
2016-02-16	Ashley
2016-02-17	Ashley

Here’s the desired table:

name	begin	end
Ashley	2016-02-05	2016-02-09
Johnson	2016-02-10	2016-02-12
Ashley	2016-02-15	2016-02-17

An intuitive approach is to sort records by date, then group them by name, and finally retrieve the first and last records of each group. But it’s hard to implement it in SQL. As for MySQL, it doesn’t support window functions. So you have to shift your thinking to take the long way around: invent 4 temporary tables as the data manipulation basis, with a number field added to each, and perform joins between them by associating records with different number field values, and finally group, aggregate and sort the resulting joining table.

select name, begin, end from(

SELECT a.name, a.workday begin, MIN(c.workday) end

FROM (SELECT *,@a:=@a+1 rank FROM t,(SELECT @a:=0) vars ORDER BY workday) a

LEFT JOIN (SELECT *,@b:=@b+1 rank FROM t,(SELECT @b:=0) vars ORDER BY workday) b

ON b.name = a.name AND b.rank = a.rank - 1

LEFT JOIN (SELECT *,@c:=@c+1 rank FROM t,(SELECT @c:=0) vars ORDER BY workday) c

ON c.name = a.name AND c.rank >= a.rank

LEFT JOIN (SELECT *,@d:=@d+1 rank FROM t,(SELECT @d:=0) vars ORDER BY workday) d

ON d.name = a.name AND d.rank = c.rank + 1

WHERE b.workday IS NULL AND c.workday IS NOT NULL AND d.workday IS NULL

GROUP BY a.name,a.workday

) e order by begin

This piece of SQL code is long and difficult to understand. The code would be easier if the database supports window functions, but still it wouldn’t be simple enough.

Actually, this isn’t the only scenario where SQL performs badly. Other cases include handling the subgroups, order-related calculations, related computing and multi-step calculation. SQL’s inability is a result of incomplete set orientation, as well as the lack of explicit sets, ordered sets, and support for individual records and step-by-step calculation.

Facing these SQL weaknesses, you should choose R-like languages that supports explicit sets, ordered sets, individual records and step-by-step calculation to import the data before manipulating it. The R-like languages include Python(Pandas), esProc, R and others.Among them esProc is the most integration-friendly and convenient-to-use.

Here’s the esProc code of implementing the above task:

The group function groups records without performing aggregation (while SQL forces an aggregation). The @o option working with it forbids a re-sorting when performing the group operation, and creates a new group only when the value of the grouping criterion changes (but SQL forces a re-sorting). The new function creates a two-dimensional table based on each group of the grouped table; ~.m(1) stands for the first record of each group and ~.m(-1) stands for the last record of each group.

But you should be aware that the scripting language you use can’t replace SQL, but that it is responsible for transforming difficult SQL queries to a form of simple SQL statement plus simple script. It is suitable for being used to handle data computing tasks involving small data but complex logic, or heterogeneous data sources. It’s not a good choice when handling big data computing with simple logic, because the IO cost of importing data would be very high.

Here are some examples of those SQL problems and their solutions in esProc. All of them are simplified and adapted from the real world cases posted on the internet.

Manipulating column data dynamically

The physicalExam table contains PE grades with the following fields: Athlete, Dash, Long-distance running, Long jump, Shot put… There are four grades for each event – excellent, good, pass and fail. The task is to find out the number of people for each grade in all events.

SQL believes column is a static attribute of table data, and doesn’t provide set-oriented operations for column data. It will be extremely cumbersome to handle the task in SQL. esProc, however, supports dynamic column access with simple code:

Non-equi-grouping

It’s common to group records by certain intervals, such as grade ranges (excellent, good …) and age groups (young people, middle-aged people…). It’s always a complicated job to perform grouping by these intervals in SQL. SQL can perform grouping by a few fixed ranges using the conditional statement case when; but it needs to create temporary tables and perform non-equi-joins when there are many intervals or the conditional intervals are continuous but dynamic.

esProc penum function can return the sequence numbers of the enumerated conditions. Other times the pseg function can be conveniently used to find out the sequence number of one of a series of continuous intervals specified in the form of a sequence, and together with the groups function, can easily realize the non-equi-grouping:

A related scenario is sorting data in a specified order, which means that you specify the exact order, rather than use the standard order, to display the data manipulation results. For example, the capital city is always put in the first place when arranging data by administrative regions. SQL handles this type of sorting in a similar way as it handles grouping by conditional intervals. When the sorting criterion is simple, decode function is used; but if the criterion is long or dynamic, temporary tables need to be created and joins would be performed to generate sequence numbers.

esProc specifically provides align@s function for this type of sorting in alignment. The code is simple:

Equi-grouping doesn’t produce an empty subset, but non-equi-grouping may have one. For the latter, some tasks require that the result be continuous ranges. Therefore the missing values of the empty subset should be supplied. To do this in SQL, you need to create continuous conditional ranges manually for the grouping, and then perform left join on it and the data table, during which the complicated subqueries are necessary. The whole process is awfully cumbersome. esProc align function is specific to this requirement with convenient method of preparing the interval to which the table is aligned, making the process much simpler.

Here’s a transaction records table – transaction, whose structure is code, date and amount. The task is to calculate the accumulated transaction amount by week, with weeks having no transaction records listed.

Handling subgroups

Because SQL lacks an explicit set data type, it will force an aggregation after data grouping. But sometimes you also want the subset of the set of groups, like finding out the grade records of the students whose total scores are above 500.

With SQL, you would first group records to calculate the total score of each student and find out those whose scores are above 500, and then perform join between the name list and the achievement table or use the IN operator to specify the selected names in the achievement table. The implementation is indirect and data will be retrieved repeatedly:

with t as (select student from achievement group by student having sum(score)>500)

select tt.* from t left join achievement tt on t.student =tt.student

Having set data type and group functions that can return the subgroup, esProc is able to handle effortlessly the subgroups generated by data grouping:

Even if only the aggregate values are desired, the subgroups are worth keeping for further handling because the current aggregate operation may be too special to be performed by simple aggregate functions.

Here’s such an example. The user access table, access, has two fields – user (ID) and login (access time). The task is to find out the last login time per ID and the number of this ID’s logins within 3 days before the last time.

A SQL approach will first group the records to find out the last login time per ID, then join with the original table and get the records in the specified time period and again perform grouping and aggregation. It is roundabout and inefficient.

with t as (select "user",max(login) last from access group by "user")

select t."user",t.last,count(tt.login) from t left join access tt on t."user"=tt."user"

where t.last-tt.login<=3 group by t."user",t.last

Since esProc can retain the subgroups, it’s easy to realize a step-by-step calculation:

In the code, “~” stands for the subgroup generated from grouping records by user.

A more efficient method could be used to get the result if the records are already sorted:

Order-related aggregation

Finding top N from each group and getting record with the maximum value are also common in data handling. For example, the user access table access has the following structure: user, login time, IP…. The task is to list the first login record per user.

SQL will use a window function to generate the sequence numbers for records in each group and get all records whose sequence numbers are 1. As the window function works on the basis of a result set, you should first create a subquery before performing filtering, making the code a little complex:

select * from (select rank() over(partition by "user" order by login) rk,t.* from access t) a

where a.rk=1;

But this expression doesn’t apply to the SQL dialects that don’t support window functions.

The esProc method is using group@1 to get the first member from each group straightforwardly.

The following example is about finding Top N from each group. The structure of the stock price table transaction is code, transaction date and closing price. The task is to calculate the latest growth rate of each stock.

The growth rate calculation involves the records of the last trading days. SQL would use two levels of window functions to separately perform the intra-group inter-row calculation and then get the first row of the intermediate table. The expression is difficult to understand:

with tt as (select rank() over(partition by share order by date desc) rk,t.* from transaction t)

select share,(select tt.closingPrice from tt where tt.share =ttt.share and tt.rk=1)-

(select tt.closingPrice from tt where tt.share =ttt.share and tt.rk=2) increase from transaction ttt group by share

esProc has topN function for performing aggregation on a set of records (grouped records).

The desired records can be obtained efficiently according to sequence numbers if the original records are already sorted:

Both finding the record with the maximum value and getting the first/last record are special cases of finding top N.

Inverse grouping

The inverse grouping is at the opposite end of the spectrum from the grouping and aggregation. It splits each of aggregated records into multiple detail records.

The installment table installment includes these fields – number, total amount, beginning date, number of periods. The task is to divide each loan into a number of payment records having the structure of number, period, payment date and amount. The total amount will simply be distributed evenly to every period of one month.

It’s easy to summarize data, but it’s difficult to do the opposite. The common ways in which SQL expands a record are performing a join with a sequence number table and using recursive query. Both are roundabout. By contrast, esProc enables writing the code in an intuitive way:

Cross-row reference

The sales table holds the monthly sales amount for each product. Its fields are product, month and amount. The task is to find out the records in which the amount increases by 10% compared with the previous month.

The calculation involves cross-row reference. The early SQL standards don’t support cross-row reference. Sequence numbers need to be generated for performing a join. The procedure is achingly complicated and difficult. Thanks to the introduction of window functions, SQL is able to make the cross-row reference more conveniently, but far from concisely due to the subquery. The code will be lengthy when there are multiple referenced items.

with t as (select a.*, nvl(volume/lag(volume) over(partition by product order by month,0) "relative"

from sales a) select * from t where relative>1.1

esProc provides the intuitive and convenient-to-use cross-row reference syntax.

Conveniently, [-1] can be used to reference a value of the previous month after data is sorted. The syntax also makes it possible to filtering data based on the results of inter-row calculations.

Then based on the above sales table, we calculate the moving averages of the sales amounts across a period of time including the current month, its previous month and the following month.

The moving average calculation involves forward reference and the reference of sets. A MySQL variable can’t express a forward reference, making a direct calculation impossible. Window functions can make it but they produce bloated and unintelligible code by using subqueries:

with b as(select a.*,lag(volumn) over(partition by product order by month) f1,

lead(volumn) over (partition by product order by month) f2 from sales a)

select product, month, volumn,

(nvl(f1,0)+nvl(f2,0)+volumn)/(decode(f1,null,0,1)+decode(f2,null,0,1)+1) MA from b

esProc uses [1] to reference a value from the next record and {-1:1} to reference a set consisting values of the previous, the current and the next months. With them the code is easy to write:

Order-related grouping

The budget table records incomes and expenses with three fields – month, income and expense. The task is to find out the records where the income is less than the expense for at least 3 continuous months.

In this case the key value based on which records are grouped can’t be obtained directly from the records. It is related to the order of the records. SQL would do it by creating sequence numbers, producing bloated code. esProc provides the grouping method for order-related calculations, facilitating those aiming at finding continuous records. Here’s the esProc code:

The expression group@o means grouping records by comparing only the neighboring records. By comparing the values of income and expense the records can be divided into groups of budget surplus, budget deficit, budget surplus… Then get the budget deficit groups that have at least 3 members and concatenate them.

You can also try to find out the biggest number of consecutive months when the income increases. For this the following grouping strategy can be used: group the current record with the previous one when income increases, and put it into a new group when income decreases. Finally you can get the number of members in the group where income increases continuously.

The SQL code for implementing this grouping algorithm is hard to understand:

select max(continuousMonthNnumber) from(select count(*)continuousMonthNnumber from (select sum(isIncrease) over(order by month) notIncreaseMonthNumber from

(select month,income,case when income >lag(income) over(order by month) then 0 else 1 end isIncrease from budget))group by notIncreaseMonthNumber)

But esProc can create a new group when the grouping criterion changes (i.e. when income decreases) using group@i, generating clear and easy-to-understand code:

Merging intervals is another type of order-related group operation. Here’s an example. The event table T has two fields S (starting time) and E (ending time). The task is to merge the overlapped intervals to find out the total duration of the event.

The code offers methods of achieving different computing goals which capture the essence of cross-row operation and order-related operation and express them in a most intuitive way. It’s impossible to implement both operations in SQL using merely the window functions. The more sophisticated recursive query is required.

Record accessing by sequence number

Median price is frequently needed during the analysis of economic statistics. The value is easily calculated if you can access the records by sequence numbers. But you should create the sequence numbers and then perform filtering with SQL used. The code is tedious and inefficient:

select avg(price) from

(select price,row_number() over (order by price) r from priceTable ) t

where r=trunc((select count(price)+1 from priceTable)/2) or r=trunc((select count(price) from priceTable)/2)+1

With support for ordered sets, esProc can generate simple code:

The sequence numbers are useful in data grouping. The event table event has the following structure – number, time, action (including starting time and ending time). The task is to calculate the total duration of the whole event. That is, summing up the duration between each pair of starting action and ending action.

SQL inherits the mathematical concept of the unordered sets, stipulating that sequence numbers be created and conditional filtering be performed in order to access members in the specified positions:

with t as (select a.*,trunc((row_number() over (order by ID)-1)/2) r from event a)

select ROUND(TO_NUMBER(max(time)-min(time)) * 24 * 60) from t group by r

esProc, however, performs member accessing with simple and clear code:

The sign “#” represents the sequence number of a record. group((#-1)\2) place every two records into one group. After that esProc calculates the duration for each group and sums up the results.

Sequence numbers are easy to use when referencing a value from a neighboring row. The transaction table contains stock prices in two fields trading date and closing price. The task is to list the trading days when the stock prices are above ¥100 and calculate the growth rate each of those trading days.

You can’t calculate the growth rate after filtering records by the condition closing price>100. So with SQL you should first calculate all the growth rates and perform a filtering with the window function. This makes the code difficult to understand:

with t as (select date, closingPrice, closingPrice-lag(closingPrice) over(order by date) increase from transaction)

select * from t where closingPrice >100 and increase is not null

The esProc solution is using pselect function to find out the sequence numbers of those records that satisfying the condition. With the sequence numbers, you can easily calculate the desired growth rates, without having to calculate all growth rates and then perform a filtering as the window function does.

Using sets for string handling

A class table classtb include three fields – class, male students, and female students. The task is to change its structure and make it a student table that includes these fields – class, name, and gender.

SQL has group_concat for concatenating strings. But as with this case, strings need to be split to form records. Lacking explicit sets, SQL resorts to recursive query or a join with an alignment table. Both methods are complex. Being a dynamic language supporting explicit sets, esProc can handle this type of inverse grouping operation in a much easier way:

At times the object of string splitting is to perform set operations. Here’s a book table named book with the following structure – title and author. Each author value includes multiple people separated by the comma and with an unfixed order. The task is to find out the records where the names of the authors appear in at least two books.

Split all the author strings to form a set and sort them, then use members of the ordered set as grouping key values to perform the subsequent operations.

Using sets for date handling

Like string handling, the database is capable of dealing with normal, single date values. But it has difficulty in splitting a time interval or in generating a sequence of dates due to SQL’s vital weakness of incomplete set orientation.

The travel table contains travelling logs in multiple fields – name, starting date, ending date…. The task is to find out the 5 peak days in terms of the number of tourists.

The task requires splitting the interval defined by the starting date and the ending date into a set of single dates. esProc offers special function for doing this with the support of set data type, and thus can handle the task effortlessly:

Generating a sequence of dates is intrinsically difficult, particularly when it involves an inverse grouping and especially in SQL.

The event table T has three fields – I (event), S (starting date) and E (ending date). The task is to split the interval between the starting dates and ending dates to generate multiple intervals by month. The first month starts from the starting date and the last month ends at the ending date. The months between them have all their dates.

The pdate function works with @m option and @me options separately to get the first day and the last day of each month. after@m gets the date which is a number of months after a certain date. It will automatically adjust the new date to the last day of the corresponding month and is useful in generating a monthly interval.

3/05/2015

Comparison of Loop Function in esProc and R Language

Loop function can traverse every member of an array or a set, express complicated loop statements with simple functions, as well as reduce the amount of code and increase readability. Both esProc and R language support the loop function. The following will compare their similarities and differences in usage.

1.Generating data

Generate odd numbers between 1 and 10.

esProc: x=to(1,10).step(2)

In the code, to(1,10) generates consecutive integers from 1 to 10, step function gets members inconsecutively according to the computed result of last step and the final result is [1,3,4,5,7,9]. This type of data in esProc is called a sequence.

The code has a simpler version: x=10.step(2).

R language:x<-seq(from=1,to=10,by=2)

This piece of code gets integers directly and inconsecutively from 1 to 10. Computed result is c(1,3,4,5,9). This type of data in R language is called vector.

A simpler version of this piece of code is x<-seq(1,10,2).

Comparison:

1.Both can solve the problem in this example. esProc needs two steps to solve it, indicating theoretically a poor performance. While R language can resolve it with only one step, displaying a better performance.

2. The method for esProc to develop code is getting members from a set according to the sequence number. It is a common method. For example, there is a string sequence A1=["a", "bc", "def"……], now get strings in the positions of odd numbers. Here it’s no need to change the type of code writing, the code is x=A1.step(2).

R language generates data directly, thus it has a better performance. It can write common expressions, too. For example, get strings in the positions of odd numbers from the string vector quantity A1=c("a", "bc", "def"……), the expression in R language can be x=A1[seq(1,length(A1),2)].

3. esProc loop function has characteristics that R language hasn’t, that is, built-in loop variables and operators. “~” represents the loop variable, “#” represents the loop count, “[]” represents relative position and “{}” represents relative interval. By using these variables and operators, esProc can produce common concise expressions. For example, seek square of each member of the set A2=[2,3,4,5,6]:

A2.(~*~) / Result is [4,9,16,25,36], which can also be written as A2**A2. But the latter lacks a sense of immediacy and commonality. R language can only use A2*A2 to express the result.

Get the first three members:

A2.select(#<=3) / Result is [2,3,4]

Get each member’s previous member and create a new set:

A2.(~[-1]) / Result is [null,2,3,4,5]

Growth rate:

A2.((~ - ~[-1])/ ~[-1]) /Result is [null, 0.5, 0.33333333333, 0.25, 0.2]

Moving average:

A2.(~{-1,1}.avg()) /Result is [2.5, 3.0, 4.0, 5.0, 5.5]

Summary:In this example, that R language can directly generate data and produce common expressions shows that it is more flexible and takes less memory space than esProc.

2.Filtering records

Computational objects of a loop function can be an array or a set whose members are single value, or two-dimensional structured data objects whose members are records. In fact, loop function is mainly used in processing the latter. For example, select orders of 2010 whose amount is greater than 2,000 from sales, the order records.

Note: sales originates from a text file, some of its data are as follows:

esProc:

sales.select(ORDERDATE>=date("2010-01-01") && AMOUNT>2000)

Some of the results are:

R language:

sales[as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt("2010-01-01") & sales$AMOUNT>2000,]

Some of the results are:

Comparison:

1.Both esProc and R language can realize this function. Their difference lies that esProc uses select loop function while R language directly uses index. But there isn’t an essential distinction between them. In addition, R language can further simplify the expression by using attach function:

sales[as.POSIXlt(ORDERDATE)>=as.POSIXlt("2010-01-01") & AMOUNT>2000,]

Thus, there are more similarities between them.

2. Except query, loop function can be used to seek sequence number, sort, rank, seek Top N, group and summarize, etc. For example, seek sequence numbers of records.

sales.pselect@a(ORDERDATE>=date("2010-01-01") && AMOUNT>2000) / esProc

which(as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt("2010-01-01") & sales$AMOUNT>2000) #R language

For example, sort records by SELLERID in ascending order and by AMOUNT in descending order.

sales.sort(SELLERID,AMOUNT:-1) / esProc

sales[order(sales$SELLERID,-sales$AMOUNT),] /R language

For example, seek the top three records by AMOUNT.

sales.top(AMOUNT;3) / esProc

head(sales[order(-sales$AMOUNT),],n=3) /R language

3. Sometimes, R language computes with index, like filtering; sometimes it computes with functions, like seeking sequence numbers of records; sometimes it programs in the form of “data set + function + data set”, like sorting; and other times it works in the way of “function + data set + function”, like seeking Top N. Its programming method seems flexible but is liable to greatly confuse programmers. By comparison, esPoc always adopts object-style method “data set + function + function …” in access. The method has a simple and uniform structure and is easy for programmers to grasp.

Here is an example of performing continuous computations. Filter records and seek Top N. esProc will compute like this:

sales.select(ORDERDATE>=date("2010-01-01") && AMOUNT>2000). top(AMOUNT;3)

And R language will compute in this way:

Mid<-sales[as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt("2010-01-01") & sales$AMOUNT>2000,]

head(Mid [order(Mid$AMOUNT),],n=3)

As you can see, esProc is better at programming multi-step continuous computations.

Summary:In this example, esPoc gains the upper hand in ensuring syntax consistency and performing continuous computations, and is more beginner-friendly.

3.Grouping and summarizing

The loop function is often employed in grouping and summarizing records. For example, group by CLIENT and SELLERID, and then sum up AMOUNT and seek the maximum value.

esProc: sales.groups(CLIENT, SELLERID;sum(AMOUNT),max(AMOUNT))

Some of the results are as follows:

R language:

result1<-aggregate(sales[,4], sales[c(3,2)],sum)

result2<-aggregate(sales[,4], sales[c(3,2)],max)

result<-cbind(result1,result2[,3])

Some of the results are as follows:

Comparison:

1.In this case, more than one summarizing method is required. esProc can complete the task in one step. R language has to go through two steps to sum up and seek the maximum value, and finally, combine the results with cbind, because its built-in library function cannot directly use multiple summarizing methods simultaneously. Besides, R language will have more memory usage in completing the task.

2. Another thing is the illogical design in R language. For sales[c(3,2)], the group order in the code is that SELLERID is ahead of CLIENT, but in business, the order is completely opposite. In the result, the order changes again and becomes the same as that in the code. In a word, there is not a unified standard for business logic, the code and the computed result.

Summary:In this example, esProc has the advantages of high efficiency, small memory usage and having a unified standard.

4.Seeking quadratic sum

Use a loop function to seek quadratic sum of the set v=[2,3,4,5].

Please note that both esProc and R language have functions to seek quadratic sum, but a loop function will be used here to perform this task.

esProc:

v.loops(~~+~*~ ; 0)

R language:

Reduce(function(x,y) x+y*y, c(0,v))

Comparison:

1.Both esProc and R language can realize this function easily.

2.The use of loops function by esProc means that it sets zero as the initial value, computes every member of v in order and returns the final result. In the code, “~” represents member being computed and “~~” represents computed result of last step. For example, the arithmetic in the first step is 0+2*2 and that in the second step is 4+3*3, and so forth. The final result is 54.

The use of reduce function by R language means that it computes members of [0,2,3,4,5] in order, and puts the computed result of the current step into the next one to go on with the computation. As esProc, the arithmetic in the first step is 0+2*2 and that in the second step is 4+3*3, and so forth.

3. R language employs lambda expression to perform the operation. This is one of the programming methods of anonymous functions, and can be directly executed without specifying the function name. In this example, function(x,y), the specification, defines two parameters; x+y*y, the body, is responsible for performing the operation; c(0,v) combines 0 and v into [0,2,3,4,5] in which every member will take part in the operation in order. Because it can input a complete function, this programming method becomes quite flexible and is able to perform operations containing complicated functions.

The esProc programming method can be regarded as an implicit lambda expression, which is essentially the same as the explicit expression in R language. But it has a bare expression without function name, specification and variables and its structure is simpler. In this example, “~” represents the built-in loop variable unnecessary to be defined; ~~+~*~ is the expression responsible for performing the operation; v is a fixed parameter in which every member will take part in the operation in order. Being unable to input a function, it is not as good as R language theoretically in flexibility and ability of expression.

4. Despite being not flexible enough in theory, esProc programming method boasts convenient built-in variables and operators, like ~, ~~, #, [], {}, etc., and gets a more powerful expression in practical use. For example, esProc uses “~~” to directly represent the computed result of last step, while R language needs reduce function and extra variables to do this. esProc can use “#” to directly represent the current loop number while R language is difficult to do this. Also, esProc can use “[]” to represent relative position. For example, ~[1] is used to represent the value of next member and Close[-1] is used to represent value of the field Close in the last record.

In addition, esProc can use “{}” to represent relative interval. For example, {-1,1} represents the three members between the previous and next member. Therefore, the common expression v.(~{-1,1}.avg()) can be used to compute moving average, while R language needs specific functions to do this. For example, there is even no such a function for “seeking average” in the expression filter(v/3, rep(1, 3),sides = 1), which is difficult to understand for beginners.

Summary:In this case, the lambda expression in R language is more powerful in theory but is a little difficult to understand. By comparison, esProc programming method is easier to understand.

5. Inter-rows and –groups operation

Here is a table stock containing daily trade data of multiple stocks. Please compute daily growth rate of closing price of each stock.

Some of the original data are as follows:

esProc:

A10=stock.group(Code)

A11=A10.(~.sort(Date))

A12=A11.(~.derive((Close-Close[-1]):INC))

R language:

A10<-split(stock, stock $Code)

for(I in 1:length(A10){

A10[[i]][order(as.numeric(A10[[i]]$Date)),] #sort by Date in each group

A10[[i]]$INC<-with(A10[[i]], Close-c(0,Close[- length (Close)])) #add a column, increased price

}

Comparison:

1. Both esProc and R language can achieve the task. esProc only uses loop function in computing, achieving high performance and concise code. R language requires writing code manually by using for statement, which brings poor performance and readability.

2.To complete the task, two layers of loop are required: loop each stock, and then loop each record of the stocks. Except being good at expressing the innermost loop, loop function of R language (including lambda syntax) hasn’t built-in loop variables and is hard to express multi-layer loops. Even if it manages to work out the code, the code is unintelligible.

Loop function of esProc can not only use “~” to represent the loop variable, but also be used in nested loop, therefore, it is expert at expressing multi-layer loops. For example, A10.(~.sort(Date)) in the code is in fact the abbreviation of A10.(~.sort(~.Date)). The first “~” represents the current stock, and the second “~” represents the current record of this stock.

3.As a typical ordered operation, it is required that the closing price of last day be subtracted from the current price. With the useful built-in variables and operators, such as #,[] and {}, esProc is easy to express this type of ordered operation. For example, Close-Close[-1] can represent the increasing amount. R language can also perform the ordered operation, but its syntax is much too complicated due to the lack of facilities like loop number, relative position, relative interval and so on. For example, the expression of increasing amount is Close-c(0,Close[- length (Close)]).

Itis hard enough for loop function in R language to perform the relative simple ordered operation in this example, let alone the more complicated operations. Inthose cases, multi-layer for loop is usually needed. For example, find out how many days the stock has been rising:

A10<-split(stock, stock $Code)

for(I in 1:length(A10){

A10[[i]][order(as.numeric(A10[[i]]$Date)),] #sort by Date in each group

A10[[i]]$INC<-with(A10[[i]], Close-c(0,Close[- length (Close)])) #add a column, increased price

if(nrow(A 10[[i]])>0){ #add a column, continuous increased days

A10 [[i]]$CID[[1]]<-1

for(j in 2:nrow(A3[[i]])){

if(A10 [[i]]$INC[[j]]>0 ){

A10 [[i]]$CID[[j]]<-A10 [[i]]$CID[[j-1]]+1

}else{

A10 [[i]]$CID[[j]]<-0

}

The code in esProc is still concise and easy to understand:

A10=stock.group(Code)

A11=A10.(~.sort(Date))

A12=A11.(~.derive((Close-Close[-1]):INC), if(INC>0,CID=CID[-1]+1, 0):CID))

Summary:In performing multi-layer loops or inter-rows and -groups operations, esProc loop function has higher computational performance and more concise code.

menu

3/16/2016

Handling 10 SQL Problems with R-like Languages

3/05/2015

Comparison of Loop Function in esProc and R Language

1.Generating data

2.Filtering records

3.Grouping and summarizing

4.Seeking quadratic sum

5. Inter-rows and –groups operation