SQL is a
sophisticated and all-around database programming language, making most
instances of structured-data computing a painless experience. Yet there are
still some instances that are difficult to handle in SQL in computer
programming.
Here’s
an example. duty is a MySQL table of
shift schedule, in which an employee works several weekdays continuously before
another employee takes his place. The task is to find out the beginning and
ending of every shift based on the duty
table. Below is the source table:
workday
|
name
|
2016-02-05
|
Ashley
|
2016-02-08
|
Ashley
|
2016-02-09
|
Ashley
|
2016-02-10
|
Johnson
|
2016-02-11
|
Johnson
|
2016-02-12
|
Johnson
|
2016-02-15
|
Ashley
|
2016-02-16
|
Ashley
|
2016-02-17
|
Ashley
|
Here’s
the desired table:
name
|
begin
|
end
|
Ashley
|
2016-02-05
|
2016-02-09
|
Johnson
|
2016-02-10
|
2016-02-12
|
Ashley
|
2016-02-15
|
2016-02-17
|
An
intuitive approach is to sort records by date, then group them by name, and finally
retrieve the first and last records of each group. But it’s hard to implement
it in SQL. As for MySQL, it doesn’t support window functions. So you have to
shift your thinking to take the long way around: invent 4 temporary tables as
the data manipulation basis, with a number
field added to each, and perform joins between them by associating records with
different number field values, and
finally group, aggregate and sort the resulting joining table.
select
name, begin, end from(
SELECT a.name, a.workday begin,
MIN(c.workday) end
FROM (SELECT *,@a:=@a+1 rank FROM t,(SELECT
@a:=0) vars ORDER BY workday) a
LEFT JOIN (SELECT *,@b:=@b+1 rank FROM
t,(SELECT @b:=0) vars ORDER BY workday) b
ON b.name = a.name AND b.rank = a.rank -
1
LEFT JOIN (SELECT *,@c:=@c+1 rank FROM
t,(SELECT @c:=0) vars ORDER BY workday) c
ON c.name = a.name AND c.rank >=
a.rank
LEFT JOIN (SELECT *,@d:=@d+1 rank FROM
t,(SELECT @d:=0) vars ORDER BY workday) d
ON d.name = a.name AND d.rank = c.rank +
1
WHERE b.workday IS NULL AND c.workday IS NOT
NULL AND d.workday IS NULL
GROUP BY a.name,a.workday
) e order by begin
|
This
piece of SQL code is long and difficult to understand. The code would be easier
if the database supports window functions, but still it wouldn’t be simple
enough.
Actually,
this isn’t the only scenario where SQL performs badly. Other cases include
handling the subgroups, order-related calculations, related computing and
multi-step calculation. SQL’s inability is a result of incomplete set
orientation, as well as the lack of explicit sets, ordered sets, and support
for individual records and step-by-step calculation.
Facing these
SQL weaknesses, you should choose R-like languages that
supports explicit sets, ordered sets, individual records and step-by-step
calculation to import the data before manipulating it. The R-like languages include Python(Pandas), esProc, R and others.Among them esProc is the most
integration-friendly and convenient-to-use.
Here’s
the esProc code of implementing the above task:
The
group function groups records without performing aggregation (while SQL forces
an aggregation). The @o option working with it forbids a re-sorting when
performing the group operation, and creates a new group only when the value of
the grouping criterion changes (but SQL forces a re-sorting). The new function
creates a two-dimensional table based on each group of the grouped table; ~.m(1) stands for the first record of
each group and ~.m(-1) stands for the
last record of each group.
But you
should be aware that the scripting language you use can’t replace SQL, but that
it is responsible for transforming difficult SQL queries to a form of simple
SQL statement plus simple script. It is suitable for being used to handle data
computing tasks involving small data but complex logic, or heterogeneous data
sources. It’s not a good choice when handling big data computing with simple
logic, because the IO cost of importing data would be very high.
Here are
some examples of those SQL problems and their solutions in esProc. All of them
are simplified and adapted from the real world cases posted on the internet.
Manipulating
column data dynamically
The physicalExam
table contains PE grades with the following fields: Athlete, Dash, Long-distance
running, Long jump, Shot put… There are four grades for each event – excellent,
good, pass and fail. The task is to find out the number of people for each
grade in all events.
SQL believes column is a static attribute
of table data, and doesn’t provide set-oriented operations for column data. It
will be extremely cumbersome to handle the task in SQL. esProc, however,
supports dynamic column access with simple code:
Non-equi-grouping
It’s common to group records by certain intervals,
such as grade ranges (excellent, good …) and age groups (young people,
middle-aged people…). It’s always a complicated job to perform grouping by
these intervals in SQL. SQL can perform grouping by a few fixed ranges using
the conditional statement case when;
but it needs to create temporary tables and perform non-equi-joins when there
are many intervals or the conditional intervals are continuous but dynamic.
A related scenario is sorting data in a
specified order, which means that you specify the exact order, rather than use
the standard order, to display the data manipulation results. For example, the
capital city is always put in the first place when arranging data by administrative
regions. SQL handles this type of sorting in a similar way as it handles
grouping by conditional intervals. When the sorting criterion is simple, decode
function is used; but if the criterion is long or dynamic, temporary tables
need to be created and joins would be performed to generate sequence numbers.
Equi-grouping
doesn’t produce an empty subset, but non-equi-grouping may have one. For the
latter, some tasks require that the result be continuous ranges. Therefore the
missing values of the empty subset should be supplied. To do this in SQL, you need
to create continuous conditional ranges manually for the grouping, and then
perform left join on it and the data table, during which the complicated subqueries
are necessary. The whole process is awfully cumbersome. esProc align function
is specific to this requirement with convenient method of preparing the interval
to which the table is aligned, making the process much simpler.
Handling
subgroups
Because SQL lacks an explicit set data
type, it will force an aggregation after data grouping. But sometimes you also
want the subset of the set of groups, like finding out the grade records of the
students whose total scores are above 500.
With SQL, you would first group records to calculate
the total score of each student and find out those whose scores are above 500,
and then perform join between the name list and the achievement table or use
the IN operator to specify the selected names in the achievement table. The implementation
is indirect and data will be retrieved repeatedly:
with t as (select student from achievement group by student
having sum(score)>500)
select tt.* from t left join achievement tt on t.student =tt.student
Having set data type and group functions that can return the subgroup, esProc is able to handle effortlessly the subgroups generated by data grouping:
Even if only the aggregate values are
desired, the subgroups are worth keeping for further handling because the
current aggregate operation may be too special to be performed by simple
aggregate functions.
Here’s such an example. The user access
table, access, has two fields – user
(ID) and login (access time). The task is to find out the last login time per
ID and the number of this ID’s logins within 3 days before the last time.
A SQL approach will first group the records
to find out the last login time per ID, then join with the original table and
get the records in the specified time period and again perform grouping and
aggregation. It is roundabout and inefficient.
with t as (select "user",max(login) last from access
group by "user")
select t."user",t.last,count(tt.login) from t
left join access tt on t."user"=tt."user"
where t.last-tt.login<=3 group by
t."user",t.last
Since esProc can retain the subgroups, it’s easy to realize a step-by-step calculation:
In the code, “~” stands for the subgroup
generated from grouping records by user.
A more efficient method could be used to get the result if the records are already sorted:
Order-related
aggregation
Finding top N from each group and getting
record with the maximum value are also common in data handling. For example,
the user access table access has the
following structure: user, login time, IP…. The task is to list the first login
record per user.
SQL will use a window function to generate
the sequence numbers for records in each group and get all records whose
sequence numbers are 1. As the window function works on the basis of a result
set, you should first create a subquery before performing filtering, making the
code a little complex:
select * from (select rank() over(partition by
"user" order by login) rk,t.* from access t) a
where a.rk=1;
But this expression doesn’t apply to the
SQL dialects that don’t support window functions.
The esProc method is using group@1 to get the first member from each group straightforwardly.
The following example is about finding Top N
from each group. The structure of the stock price table transaction is code, transaction date and closing price. The task
is to calculate the latest growth rate of each stock.
The growth rate calculation involves the
records of the last trading days. SQL would use two levels of window functions
to separately perform the intra-group inter-row calculation and then get the
first row of the intermediate table. The expression is difficult to understand:
with tt as (select rank() over(partition by share order
by date desc) rk,t.* from transaction t)
select share,(select tt.closingPrice from tt where tt.share
=ttt.share and tt.rk=1)-
(select tt.closingPrice from tt where tt.share =ttt.share
and tt.rk=2) increase from transaction ttt group by share
The desired records can be obtained efficiently according to sequence numbers if the original records are already sorted:
Both finding the record with the maximum
value and getting the first/last record are special cases of finding top N.
Inverse
grouping
The inverse grouping is at the opposite end
of the spectrum from the grouping and aggregation. It splits each of aggregated
records into multiple detail records.
The installment table installment includes these fields – number, total amount, beginning
date, number of periods. The task is to divide each loan into a number of
payment records having the structure of number, period, payment date and
amount. The total amount will simply be distributed evenly to every period of
one month.
Cross-row
reference
The sales
table holds the monthly sales amount for each product. Its fields are product,
month and amount. The task is to find out the records in which the amount
increases by 10% compared with the previous month.
The calculation involves cross-row
reference. The early SQL standards don’t support cross-row reference. Sequence
numbers need to be generated for performing a join. The procedure is achingly complicated
and difficult. Thanks to the introduction of window functions, SQL is able to
make the cross-row reference more conveniently, but far from concisely due to
the subquery. The code will be lengthy when there are multiple referenced
items.
with t as (select a.*, nvl(volume/lag(volume)
over(partition by product order by month,0) "relative"
from sales a) select * from t where relative>1.1
esProc provides the intuitive and convenient-to-use cross-row reference syntax.
Conveniently, [-1] can be used to reference
a value of the previous month after data is sorted. The syntax also makes it
possible to filtering data based on the results of inter-row calculations.
Then based on the above sales table, we calculate the moving
averages of the sales amounts across a period of time including the current month,
its previous month and the following month.
The moving average calculation involves forward
reference and the reference of sets. A MySQL variable can’t express a forward
reference, making a direct calculation impossible. Window functions can make it
but they produce bloated and unintelligible code by using subqueries:
with b as(select a.*,lag(volumn)
over(partition by product order by month) f1,
lead(volumn)
over (partition by product order by month) f2 from sales a)
select product, month, volumn,
(nvl(f1,0)+nvl(f2,0)+volumn)/(decode(f1,null,0,1)+decode(f2,null,0,1)+1)
MA from b
Order-related
grouping
The budget
table records incomes and expenses with three fields – month, income and
expense. The task is to find out the records where the income is less than the
expense for at least 3 continuous months.
The expression group@o means grouping
records by comparing only the neighboring records. By comparing the values of
income and expense the records can be divided into groups of budget surplus, budget
deficit, budget surplus… Then get the budget deficit groups that have at least
3 members and concatenate them.
You can also try to find out the biggest number
of consecutive months when the income increases. For this the following
grouping strategy can be used: group the current record with the previous one
when income increases, and put it into a new group when income decreases.
Finally you can get the number of members in the group where income increases
continuously.
The SQL code for implementing this grouping
algorithm is hard to understand:
select max(continuousMonthNnumber) from(select
count(*)continuousMonthNnumber from
(select sum(isIncrease) over(order by month) notIncreaseMonthNumber from
(select month,income,case when income >lag(income)
over(order by month) then 0 else 1 end isIncrease from budget))group by notIncreaseMonthNumber)
Merging intervals is another type of order-related group operation. Here’s an example. The event table T has two fields S (starting time) and E (ending time). The task is to merge the overlapped intervals to find out the total duration of the event.
The code offers methods of achieving
different computing goals which capture the essence of cross-row operation and
order-related operation and express them in a most intuitive way. It’s
impossible to implement both operations in SQL using merely the window
functions. The more sophisticated recursive query is required.
Record
accessing by sequence number
Median price is frequently needed during
the analysis of economic statistics. The value is easily calculated if you can
access the records by sequence numbers. But you should create the sequence
numbers and then perform filtering with SQL used. The code is tedious and
inefficient:
select avg(price) from
(select price,row_number() over (order by price) r from priceTable
) t
where r=trunc((select count(price)+1 from priceTable)/2)
or r=trunc((select count(price) from priceTable)/2)+1
With support for ordered sets, esProc can generate simple code:
The sequence numbers are useful in data
grouping. The event table event has
the following structure – number, time, action (including starting time and
ending time). The task is to calculate the total duration of the whole event.
That is, summing up the duration between each pair of starting action and ending
action.
SQL inherits the mathematical concept of
the unordered sets, stipulating that sequence numbers be created and conditional
filtering be performed in order to access members in the specified positions:
with t as (select a.*,trunc((row_number() over (order by ID)-1)/2)
r from event a)
select ROUND(TO_NUMBER(max(time)-min(time)) * 24 * 60)
from t group by r
esProc, however, performs member accessing with simple and clear code:
The sign “#” represents the sequence number
of a record. group((#-1)\2) place every two records into one group. After that
esProc calculates the duration for each group and sums up the results.
Sequence numbers are easy to use when
referencing a value from a neighboring row. The transaction table contains stock prices in two fields trading date
and closing price. The task is to list the trading days when the stock prices
are above ¥100 and calculate the growth rate each of those trading days.
You can’t calculate the growth rate after
filtering records by the condition closing price>100. So with SQL you should
first calculate all the growth rates and perform a filtering with the window
function. This makes the code difficult to understand:
with t as (select date, closingPrice, closingPrice-lag(closingPrice)
over(order by date) increase from transaction)
select * from t where closingPrice >100 and increase is
not null
The esProc solution is using pselect function to find out the sequence numbers of those records that satisfying the condition. With the sequence numbers, you can easily calculate the desired growth rates, without having to calculate all growth rates and then perform a filtering as the window function does.
Using
sets for string handling
A class table classtb include three fields –
class, male students, and female students. The task is to change its structure
and make it a student table that includes these fields – class, name, and
gender.
At times the object of string splitting is to perform set operations. Here’s a book table named book with the following structure – title and author. Each author value includes multiple people separated by the comma and with an unfixed order. The task is to find out the records where the names of the authors appear in at least two books.
Split all the author strings to form a set
and sort them, then use members of the ordered set as grouping key values to
perform the subsequent operations.
Using
sets for date handling
Like string handling, the database is
capable of dealing with normal, single date values. But it has difficulty in splitting
a time interval or in generating a sequence of dates due to SQL’s vital
weakness of incomplete set orientation.
The travel
table contains travelling logs in multiple fields – name, starting date, ending
date…. The task is to find out the 5 peak days in terms of the number of
tourists.
Generating a sequence of dates is
intrinsically difficult, particularly when it involves an inverse grouping and especially
in SQL.
The event table T has three fields – I (event), S (starting date) and E (ending date). The task is to split the interval between the starting dates and ending dates to generate multiple intervals by month. The first month starts from the starting date and the last month ends at the ending date. The months between them have all their dates.
The pdate function works with @m option and
@me options separately to get the first day and the last day of each month.
after@m gets the date which is a number of months after a certain date. It will
automatically adjust the new date to the last day of the corresponding month
and is useful in generating a monthly interval.
No comments:
Post a Comment