3/16/2016

Handling 10 SQL Problems with R-like Languages

SQL is a sophisticated and all-around database programming language, making most instances of structured-data computing a painless experience. Yet there are still some instances that are difficult to handle in SQL in computer programming.

Here’s an example. duty is a MySQL table of shift schedule, in which an employee works several weekdays continuously before another employee takes his place. The task is to find out the beginning and ending of every shift based on the duty table. Below is the source table:
workday
name
2016-02-05
Ashley
2016-02-08
Ashley
2016-02-09
Ashley
2016-02-10
Johnson
2016-02-11
Johnson
2016-02-12
Johnson
2016-02-15
Ashley
2016-02-16
Ashley
2016-02-17
Ashley
Here’s the desired table:
name
begin
end
Ashley
2016-02-05
2016-02-09
Johnson
2016-02-10
2016-02-12
Ashley
2016-02-15
2016-02-17
An intuitive approach is to sort records by date, then group them by name, and finally retrieve the first and last records of each group. But it’s hard to implement it in SQL. As for MySQL, it doesn’t support window functions. So you have to shift your thinking to take the long way around: invent 4 temporary tables as the data manipulation basis, with a number field added to each, and perform joins between them by associating records with different number field values, and finally group, aggregate and sort the resulting joining table.


select name, begin, end from(
SELECT a.name, a.workday begin, MIN(c.workday) end
  FROM (SELECT *,@a:=@a+1 rank FROM t,(SELECT @a:=0) vars ORDER BY workday) a
  LEFT JOIN (SELECT *,@b:=@b+1 rank FROM t,(SELECT @b:=0) vars ORDER BY workday) b
    ON b.name = a.name AND b.rank = a.rank - 1
  LEFT JOIN (SELECT *,@c:=@c+1 rank FROM t,(SELECT @c:=0) vars ORDER BY workday) c
    ON c.name = a.name AND c.rank >= a.rank
  LEFT JOIN (SELECT *,@d:=@d+1 rank FROM t,(SELECT @d:=0) vars ORDER BY workday) d
    ON d.name = a.name AND d.rank = c.rank + 1
 WHERE b.workday IS NULL AND c.workday IS NOT NULL AND d.workday IS NULL
 GROUP BY a.name,a.workday
  ) e order by begin

This piece of SQL code is long and difficult to understand. The code would be easier if the database supports window functions, but still it wouldn’t be simple enough.

Actually, this isn’t the only scenario where SQL performs badly. Other cases include handling the subgroups, order-related calculations, related computing and multi-step calculation. SQL’s inability is a result of incomplete set orientation, as well as the lack of explicit sets, ordered sets, and support for individual records and step-by-step calculation.

Facing these SQL weaknesses, you should choose R-like languages that supports explicit sets, ordered sets, individual records and step-by-step calculation to import the data before manipulating it. The R-like languages include Python(Pandas), esProc, R and others.Among them esProc is the most integration-friendly and convenient-to-use.

Here’s the esProc code of implementing the above task:

The group function groups records without performing aggregation (while SQL forces an aggregation). The @o option working with it forbids a re-sorting when performing the group operation, and creates a new group only when the value of the grouping criterion changes (but SQL forces a re-sorting). The new function creates a two-dimensional table based on each group of the grouped table; ~.m(1) stands for the first record of each group and ~.m(-1) stands for the last record of each group.

But you should be aware that the scripting language you use can’t replace SQL, but that it is responsible for transforming difficult SQL queries to a form of simple SQL statement plus simple script. It is suitable for being used to handle data computing tasks involving small data but complex logic, or heterogeneous data sources. It’s not a good choice when handling big data computing with simple logic, because the IO cost of importing data would be very high.

Here are some examples of those SQL problems and their solutions in esProc. All of them are simplified and adapted from the real world cases posted on the internet.

Manipulating column data dynamically
The physicalExam table contains PE grades with the following fields: Athlete, Dash, Long-distance running, Long jump, Shot put… There are four grades for each event – excellent, good, pass and fail. The task is to find out the number of people for each grade in all events.

SQL believes column is a static attribute of table data, and doesn’t provide set-oriented operations for column data. It will be extremely cumbersome to handle the task in SQL. esProc, however, supports dynamic column access with simple code:

Non-equi-grouping
It’s common to group records by certain intervals, such as grade ranges (excellent, good …) and age groups (young people, middle-aged people…). It’s always a complicated job to perform grouping by these intervals in SQL. SQL can perform grouping by a few fixed ranges using the conditional statement case when; but it needs to create temporary tables and perform non-equi-joins when there are many intervals or the conditional intervals are continuous but dynamic. 

esProc penum function can return the sequence numbers of the enumerated conditions. Other times the pseg function can be conveniently used to find out the sequence number of one of a series of continuous intervals specified in the form of a sequence, and together with the groups function, can easily realize the non-equi-grouping:

A related scenario is sorting data in a specified order, which means that you specify the exact order, rather than use the standard order, to display the data manipulation results. For example, the capital city is always put in the first place when arranging data by administrative regions. SQL handles this type of sorting in a similar way as it handles grouping by conditional intervals. When the sorting criterion is simple, decode function is used; but if the criterion is long or dynamic, temporary tables need to be created and joins would be performed to generate sequence numbers.

esProc specifically provides align@s function for this type of sorting in alignment. The code is simple:

Equi-grouping doesn’t produce an empty subset, but non-equi-grouping may have one. For the latter, some tasks require that the result be continuous ranges. Therefore the missing values of the empty subset should be supplied. To do this in SQL, you need to create continuous conditional ranges manually for the grouping, and then perform left join on it and the data table, during which the complicated subqueries are necessary. The whole process is awfully cumbersome. esProc align function is specific to this requirement with convenient method of preparing the interval to which the table is aligned, making the process much simpler.

Here’s a transaction records table – transaction, whose structure is code, date and amount. The task is to calculate the accumulated transaction amount by week, with weeks having no transaction records listed.

Handling subgroups
Because SQL lacks an explicit set data type, it will force an aggregation after data grouping. But sometimes you also want the subset of the set of groups, like finding out the grade records of the students whose total scores are above 500.

With SQL, you would first group records to calculate the total score of each student and find out those whose scores are above 500, and then perform join between the name list and the achievement table or use the IN operator to specify the selected names in the achievement table. The implementation is indirect and data will be retrieved repeatedly:
with t as (select student from achievement group by student having sum(score)>500)
select tt.* from t left join achievement tt on t.student =tt.student

Having set data type and group functions that can return the subgroup, esProc is able to handle effortlessly the subgroups generated by data grouping:

Even if only the aggregate values are desired, the subgroups are worth keeping for further handling because the current aggregate operation may be too special to be performed by simple aggregate functions.

Here’s such an example. The user access table, access, has two fields – user (ID) and login (access time). The task is to find out the last login time per ID and the number of this ID’s logins within 3 days before the last time.

A SQL approach will first group the records to find out the last login time per ID, then join with the original table and get the records in the specified time period and again perform grouping and aggregation. It is roundabout and inefficient.
with t as (select "user",max(login) last from access group by "user")
select t."user",t.last,count(tt.login) from t left join access tt on t."user"=tt."user"
where t.last-tt.login<=3 group by t."user",t.last

Since esProc can retain the subgroups, it’s easy to realize a step-by-step calculation:

In the code, “~” stands for the subgroup generated from grouping records by user.

A more efficient method could be used to get the result if the records are already sorted:
Order-related aggregation
Finding top N from each group and getting record with the maximum value are also common in data handling. For example, the user access table access has the following structure: user, login time, IP…. The task is to list the first login record per user.

SQL will use a window function to generate the sequence numbers for records in each group and get all records whose sequence numbers are 1. As the window function works on the basis of a result set, you should first create a subquery before performing filtering, making the code a little complex:
select * from (select rank() over(partition by "user" order by login) rk,t.* from access t) a
where a.rk=1;

But this expression doesn’t apply to the SQL dialects that don’t support window functions.

The esProc method is using group@1 to get the first member from each group straightforwardly.

The following example is about finding Top N from each group. The structure of the stock price table transaction is code, transaction date and closing price. The task is to calculate the latest growth rate of each stock.

The growth rate calculation involves the records of the last trading days. SQL would use two levels of window functions to separately perform the intra-group inter-row calculation and then get the first row of the intermediate table. The expression is difficult to understand:
with tt as (select rank() over(partition by share order by date desc)  rk,t.* from transaction t)
select share,(select tt.closingPrice from tt where tt.share =ttt.share and tt.rk=1)-
(select tt.closingPrice from tt where tt.share =ttt.share and tt.rk=2) increase from transaction ttt group by share

esProc has topN function for performing aggregation on a set of records (grouped records).
The desired records can be obtained efficiently according to sequence numbers if the original records are already sorted:

Both finding the record with the maximum value and getting the first/last record are special cases of finding top N.

Inverse grouping
The inverse grouping is at the opposite end of the spectrum from the grouping and aggregation. It splits each of aggregated records into multiple detail records.

The installment table installment includes these fields – number, total amount, beginning date, number of periods. The task is to divide each loan into a number of payment records having the structure of number, period, payment date and amount. The total amount will simply be distributed evenly to every period of one month.

It’s easy to summarize data, but it’s difficult to do the opposite. The common ways in which SQL expands a record are performing a join with a sequence number table and using recursive query. Both are roundabout. By contrast, esProc enables writing the code in an intuitive way:

Cross-row reference
The sales table holds the monthly sales amount for each product. Its fields are product, month and amount. The task is to find out the records in which the amount increases by 10% compared with the previous month.

The calculation involves cross-row reference. The early SQL standards don’t support cross-row reference. Sequence numbers need to be generated for performing a join. The procedure is achingly complicated and difficult. Thanks to the introduction of window functions, SQL is able to make the cross-row reference more conveniently, but far from concisely due to the subquery. The code will be lengthy when there are multiple referenced items.
with t as (select a.*, nvl(volume/lag(volume) over(partition by product order by month,0) "relative"
from sales a) select * from t where relative>1.1

esProc provides the intuitive and convenient-to-use cross-row reference syntax.
Conveniently, [-1] can be used to reference a value of the previous month after data is sorted. The syntax also makes it possible to filtering data based on the results of inter-row calculations.

Then based on the above sales table, we calculate the moving averages of the sales amounts across a period of time including the current month, its previous month and the following month.

The moving average calculation involves forward reference and the reference of sets. A MySQL variable can’t express a forward reference, making a direct calculation impossible. Window functions can make it but they produce bloated and unintelligible code by using subqueries:
with b as(select a.*,lag(volumn) over(partition by product order by month) f1,
lead(volumn) over (partition by product order by month) f2 from sales a)
select product, month, volumn,
(nvl(f1,0)+nvl(f2,0)+volumn)/(decode(f1,null,0,1)+decode(f2,null,0,1)+1) MA from b

esProc uses [1] to reference a value from the next record and {-1:1} to reference a set consisting values of the previous, the current and the next months. With them the code is easy to write:

Order-related grouping
The budget table records incomes and expenses with three fields – month, income and expense. The task is to find out the records where the income is less than the expense for at least 3 continuous months.

In this case the key value based on which records are grouped can’t be obtained directly from the records. It is related to the order of the records. SQL would do it by creating sequence numbers, producing bloated code. esProc provides the grouping method for order-related calculations, facilitating those aiming at finding continuous records. Here’s the esProc code:

The expression group@o means grouping records by comparing only the neighboring records. By comparing the values of income and expense the records can be divided into groups of budget surplus, budget deficit, budget surplus… Then get the budget deficit groups that have at least 3 members and concatenate them.

You can also try to find out the biggest number of consecutive months when the income increases. For this the following grouping strategy can be used: group the current record with the previous one when income increases, and put it into a new group when income decreases. Finally you can get the number of members in the group where income increases continuously.

The SQL code for implementing this grouping algorithm is hard to understand:
select max(continuousMonthNnumber) from(select count(*)continuousMonthNnumber    from (select sum(isIncrease) over(order by month) notIncreaseMonthNumber from
(select month,income,case when income >lag(income) over(order by month) then 0 else 1 end isIncrease from budget))group by notIncreaseMonthNumber)

But esProc can create a new group when the grouping criterion changes (i.e. when income decreases) using group@i, generating clear and easy-to-understand code:
Merging intervals is another type of order-related group operation. Here’s an example. The event table T has two fields S (starting time) and E (ending time). The task is to merge the overlapped intervals to find out the total duration of the event.
The code offers methods of achieving different computing goals which capture the essence of cross-row operation and order-related operation and express them in a most intuitive way. It’s impossible to implement both operations in SQL using merely the window functions. The more sophisticated recursive query is required.

Record accessing by sequence number

Median price is frequently needed during the analysis of economic statistics. The value is easily calculated if you can access the records by sequence numbers. But you should create the sequence numbers and then perform filtering with SQL used. The code is tedious and inefficient:
select avg(price) from
(select price,row_number() over (order by price) r from priceTable ) t
where r=trunc((select count(price)+1 from priceTable)/2) or r=trunc((select count(price) from priceTable)/2)+1

With support for ordered sets, esProc can generate simple code:
The sequence numbers are useful in data grouping. The event table event has the following structure – number, time, action (including starting time and ending time). The task is to calculate the total duration of the whole event. That is, summing up the duration between each pair of starting action and ending action.

SQL inherits the mathematical concept of the unordered sets, stipulating that sequence numbers be created and conditional filtering be performed in order to access members in the specified positions:
with t as (select a.*,trunc((row_number() over (order by ID)-1)/2) r from event a)
select ROUND(TO_NUMBER(max(time)-min(time)) * 24 * 60) from t group by r

esProc, however, performs member accessing with simple and clear code:
The sign “#” represents the sequence number of a record. group((#-1)\2) place every two records into one group. After that esProc calculates the duration for each group and sums up the results.
        
Sequence numbers are easy to use when referencing a value from a neighboring row. The transaction table contains stock prices in two fields trading date and closing price. The task is to list the trading days when the stock prices are above ¥100 and calculate the growth rate each of those trading days.

You can’t calculate the growth rate after filtering records by the condition closing price>100. So with SQL you should first calculate all the growth rates and perform a filtering with the window function. This makes the code difficult to understand:
with t as (select date, closingPrice, closingPrice-lag(closingPrice) over(order by date) increase from transaction)
select * from t where closingPrice >100 and increase is not null

The esProc solution is using pselect function to find out the sequence numbers of those records that satisfying the condition. With the sequence numbers, you can easily calculate the desired growth rates, without having to calculate all growth rates and then perform a filtering as the window function does.
Using sets for string handling
A class table classtb include three fields – class, male students, and female students. The task is to change its structure and make it a student table that includes these fields – class, name, and gender.

SQL has group_concat for concatenating strings. But as with this case, strings need to be split to form records. Lacking explicit sets, SQL resorts to recursive query or a join with an alignment table. Both methods are complex. Being a dynamic language supporting explicit sets, esProc can handle this type of inverse grouping operation in a much easier way:
At times the object of string splitting is to perform set operations. Here’s a book table named book with the following structure – title and author. Each author value includes multiple people separated by the comma and with an unfixed order. The task is to find out the records where the names of the authors appear in at least two books.
Split all the author strings to form a set and sort them, then use members of the ordered set as grouping key values to perform the subsequent operations.

Using sets for date handling
Like string handling, the database is capable of dealing with normal, single date values. But it has difficulty in splitting a time interval or in generating a sequence of dates due to SQL’s vital weakness of incomplete set orientation.

The travel table contains travelling logs in multiple fields – name, starting date, ending date…. The task is to find out the 5 peak days in terms of the number of tourists.

The task requires splitting the interval defined by the starting date and the ending date into a set of single dates. esProc offers special function for doing this with the support of set data type, and thus can handle the task effortlessly:
Generating a sequence of dates is intrinsically difficult, particularly when it involves an inverse grouping and especially in SQL.

The event table T has three fields – I (event), S (starting date) and E (ending date). The task is to split the interval between the starting dates and ending dates to generate multiple intervals by month. The first month starts from the starting date and the last month ends at the ending date. The months between them have all their dates.

The pdate function works with @m option and @me options separately to get the first day and the last day of each month. after@m gets the date which is a number of months after a certain date. It will automatically adjust the new date to the last day of the corresponding month and is useful in generating a monthly interval.