I answered a few SQL questions on Stack Overflow recently, which I could solve easily by using the Tabibitosan method. It's such an elegant and efficient technique, I think it's worth giving it an extra bit of attention through this post. I'm certainly not the first to write about it: it was introduced to me by Japanese Oracle ACE Aketi Jyuuzou on OTN. He wrote a special forum post explaining his technique here, which also contains lots of examples. I have also included it in my SQL Masterclass. And Boneist and bluefrog have written about Tabibitosan in the past as well.
In its simplest form, the Tabibitosan technique lets you group consecutive rows easily, using just one level of analytic functions. And with a bit of imagination, you can also do some primitive SQL pattern matching avant la lettre. I'll show examples of both in this post.
The key of the technique is to map rows belonging to the same pattern to the same number, which can then be used for grouping or partitioning. To illustrate, let's start with a simple example to group consecutive numbers.
The table:
SQL> create table mytable (nr) 2 as 3 select 1 from dual union all 4 select 2 from dual union all 5 select 3 from dual union all 6 select 6 from dual union all 7 select 7 from dual union all 8 select 11 from dual union all 9 select 18 from dual union all 10 select 19 from dual union all 11 select 20 from dual union all 12 select 21 from dual union all 13 select 22 from dual union all 14 select 25 from dual 15 / Table created.
With the question: show me all the groups of integer values that are in sequence without gaps. For each group show the starting number and end number. So the expected result set is this:
MIN(NR) MAX(NR) ---------- ---------- 1 3 6 7 11 11 18 22 25 25
Tabibitosan works by calculating an extra column, grp in my case, by subtracting row_number() from the value that defines the sequence, nr in my case. If the interval of the values in sequence is 1, then subtracting row_number() will result in a constant value for the group members. This query shows you the core of the technique:
SQL> select nr 2 , row_number() over (order by nr) rn 3 , nr - row_number() over (order by nr) grp 4 from mytable 5 / NR RN GRP ---------- ---------- ---------- 1 1 0 2 2 0 3 3 0 6 4 2 7 5 2 11 6 5 18 7 11 19 8 11 20 9 11 21 10 11 22 11 11 25 12 13 12 rows selected.
It doesn't matter what the grp value is exactly. What matters is that it's the same constant value for all group members. This then allows for easy partitioning or grouping:
SQL> with tabibitosan as 2 ( select nr 3 , nr - row_number() over (order by nr) grp 4 from mytable 5 ) 6 select min(nr) 7 , max(nr) 8 from tabibitosan 9 group by grp 10 order by grp 11 / MIN(NR) MAX(NR) ---------- ---------- 1 3 6 7 11 11 18 22 25 25 5 rows selected.
You can see another example of this simple form of Tabibitosan in this Stack Overflow question.
Tabibitosan only works if the difference of adjacent column values in the first operand of the minus operator (here: nr) equals 1. If it's not 1, you'll have to come up with an expression to make it 1. For example, let's see how a similar example works if we'd now like to group rows for consecutive months. Here's the setup:
SQL> create table mytable (startdate) 2 as 3 select date '2013-01-01' from dual union all 4 select date '2013-02-01' from dual union all 5 select date '2013-03-01' from dual union all 6 select date '2013-06-01' from dual union all 7 select date '2013-07-01' from dual union all 8 select date '2013-10-01' from dual union all 9 select date '2013-12-01' from dual union all 10 select date '2014-01-01' from dual 11 / Table created.
And this is the expected result set:
MIN(STARTD MAX(STARTD ---------- ---------- 01-01-2013 01-03-2013 01-06-2013 01-07-2013 01-10-2013 01-10-2013 01-12-2013 01-01-2014
We can use the same technique, if only the interval between two months can be mapped to 1, which is what the MONTHS_BETWEEN function gives us. So let's use that function to calculate the grp value:
SQL> select startdate 2 , months_between(startdate,date '2010-01-01') mb 3 , months_between(startdate,date '2010-01-01') 4 - row_number() over (order by startdate) grp 5 from mytable 6 / STARTDATE MB GRP ---------- ---------- ---------- 01-01-2013 36 35 01-02-2013 37 35 01-03-2013 38 35 01-06-2013 41 37 01-07-2013 42 37 01-10-2013 45 39 01-12-2013 47 40 01-01-2014 48 40 8 rows selected.
I used an arbitrary date here (January 1, 2010). Any date on the first day of the month would be good. With the grp value calculated, the Tabibitosan query is again easy:
SQL> with tabibitosan as 2 ( select startdate 3 , months_between(startdate,date '2010-01-01') 4 - row_number() over (order by startdate) grp 5 from mytable 6 ) 7 select min(startdate) 8 , max(startdate) 9 from tabibitosan 10 group by grp 11 order by grp 12 / MIN(STARTD MAX(STARTD ---------- ---------- 01-01-2013 01-03-2013 01-06-2013 01-07-2013 01-10-2013 01-10-2013 01-12-2013 01-01-2014 4 rows selected.
When the column that defines the order doesn't always increase with a constant number, the row_number() analytic function for the first operand of the minus operator comes in handy for the solution. Since I've found this one to be quite common, I'm also showing this case here. The example below is taken from this Stack Overflow question, where I'm comparing dates but excluding saturdays and sundays. The table and its contents look like this:
SQL> create table mytable (date_worked,country) 2 as 3 select to_date('1-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 4 select to_date('4-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 5 select to_date('5-Nov-13','dd-Mon-yy'), 'India' from dual union all 6 select to_date('6-Nov-13','dd-Mon-yy'), 'India' from dual union all 7 select to_date('7-Nov-13','dd-Mon-yy'), 'India' from dual union all 8 select to_date('8-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 9 select to_date('11-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 10 select to_date('12-Nov-13','dd-Mon-yy'), 'India' from dual union all 11 select to_date('13-Nov-13','dd-Mon-yy'), 'India' from dual union all 12 select to_date('14-Nov-13','dd-Mon-yy'), 'India' from dual union all 13 select to_date('15-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 14 select to_date('18-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 15 select to_date('19-Nov-13','dd-Mon-yy'), 'India' from dual union all 16 select to_date('20-Nov-13','dd-Mon-yy'), 'India' from dual union all 17 select to_date('21-Nov-13','dd-Mon-yy'), 'India' from dual union all 18 select to_date('22-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 19 select to_date('25-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all 20 select to_date('26-Nov-13','dd-Mon-yy'), 'India' from dual union all 21 select to_date('27-Nov-13','dd-Mon-yy'), 'India' from dual union all 22 select to_date('28-Nov-13','dd-Mon-yy'), 'India' from dual union all 23 select to_date('29-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual 24 / Table created.
The query needs to return the start date and end date of each stay in a country. So the expected result set is this:
COUNTRY START_DATE END_DATE -------------- ---------- ---------- United Kingdom 01-11-2013 04-11-2013 India 05-11-2013 07-11-2013 United Kingdom 08-11-2013 11-11-2013 India 12-11-2013 14-11-2013 United Kingdom 15-11-2013 18-11-2013 India 19-11-2013 21-11-2013 United Kingdom 22-11-2013 25-11-2013 India 26-11-2013 28-11-2013 United Kingdom 29-11-2013 29-11-2013 9 rows selected.
By subtracting a partitioned row_number() from a regular full-set row_number() we can calculate the grp value:
SQL> select date_worked 2 , country 3 , row_number() over (order by date_worked) x 4 , row_number() over (partition by country order by date_worked) y 5 , row_number() over (order by date_worked) 6 - row_number() over (partition by country order by date_worked) grp 7 from mytable 8 / DATE_WORKE COUNTRY X Y GRP ---------- -------------- ---------- ---------- ---------- 01-11-2013 United Kingdom 1 1 0 04-11-2013 United Kingdom 2 2 0 05-11-2013 India 3 1 2 06-11-2013 India 4 2 2 07-11-2013 India 5 3 2 08-11-2013 United Kingdom 6 3 3 11-11-2013 United Kingdom 7 4 3 12-11-2013 India 8 4 4 13-11-2013 India 9 5 4 14-11-2013 India 10 6 4 15-11-2013 United Kingdom 11 5 6 18-11-2013 United Kingdom 12 6 6 19-11-2013 India 13 7 6 20-11-2013 India 14 8 6 21-11-2013 India 15 9 6 22-11-2013 United Kingdom 16 7 9 25-11-2013 United Kingdom 17 8 9 26-11-2013 India 18 10 8 27-11-2013 India 19 11 8 28-11-2013 India 20 12 8 29-11-2013 United Kingdom 21 9 12 21 rows selected.
Note that just using the grp value for the final grouping, could lead to overlap of groups from different countries. So we need the country for the final grouping as well. The full query becomes:
SQL> with tabibitosan as 2 ( select date_worked 3 , country 4 , row_number() over (order by date_worked) 5 - row_number() over (partition by country order by date_worked) grp 6 from mytable 7 ) 8 select country 9 , min(date_worked) start_date 10 , max(date_worked) end_date 11 from tabibitosan 12 group by country 13 , grp 14 order by start_date 15 / COUNTRY START_DATE END_DATE -------------- ---------- ---------- United Kingdom 01-11-2013 04-11-2013 India 05-11-2013 07-11-2013 United Kingdom 08-11-2013 11-11-2013 India 12-11-2013 14-11-2013 United Kingdom 15-11-2013 18-11-2013 India 19-11-2013 21-11-2013 United Kingdom 22-11-2013 25-11-2013 India 26-11-2013 28-11-2013 United Kingdom 29-11-2013 29-11-2013 9 rows selected.
Please note that these Tabibitosan queries contain only two levels and look pretty clean. Also, the performance is nice: just one full table scan and two sorts for the two analytic functions, followed by the unavoidable HASH GROUP BY and SORT ORDER BY for the grouping and the sorted display:
----------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 9 |00:00:00.01 | 3 | | | | | 1 | SORT ORDER BY | | 1 | 2 | 9 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 2 | HASH GROUP BY | | 1 | 2 | 9 |00:00:00.01 | 3 | 1088K| 1088K| 1233K (0)| | 3 | VIEW | | 1 | 21 | 21 |00:00:00.01 | 3 | | | | | 4 | WINDOW SORT | | 1 | 21 | 21 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 5 | WINDOW SORT | | 1 | 21 | 21 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 6 | TABLE ACCESS FULL| MYTABLE | 1 | 21 | 21 |00:00:00.01 | 3 | | | | -----------------------------------------------------------------------------------------------------------------------
And here is another example from Stack Overflow using Tabibitosan with two row_numbers.
If it's becoming too hard to come up with a function that maps the differences between group members to the number 1, then you have an alternative which I'll call the max-on-case-row-number technique, by lack of a better name. I first saw this technique on AskTom. I'm going to repeat the same example as in the link to allow easy comparison.
The table:
SQL> create table mytable (time,quantity) 2 as 3 select trunc(sysdate) + to_dsinterval('0 12:22:01'), 100 from dual union all 4 select trunc(sysdate) + to_dsinterval('0 12:22:03'), 200 from dual union all 5 select trunc(sysdate) + to_dsinterval('0 12:22:04'), 300 from dual union all 6 select trunc(sysdate) + to_dsinterval('0 12:22:06'), 200 from dual union all 7 select trunc(sysdate) + to_dsinterval('0 12:22:45'), 100 from dual union all 8 select trunc(sysdate) + to_dsinterval('0 12:22:46'), 200 from dual union all 9 select trunc(sysdate) + to_dsinterval('0 12:23:12'), 100 from dual union all 10 select trunc(sysdate) + to_dsinterval('0 12:23:12'), 200 from dual 11 / Table created.
The goal is to sum the amounts where the time is within 3 seconds of each other. So, the expected result set is this:
MIN(TIME) MAX(TIME) SUM(QUANTITY) ------------------- ------------------- ------------- 05-01-2014 12:22:01 05-01-2014 12:22:06 800 05-01-2014 12:22:45 05-01-2014 12:22:46 300 05-01-2014 12:23:12 05-01-2014 12:23:12 300
First (the case-row-number part), we'll compute a new column rn, and assign a row_number to the rows that mark a new group: the first row and the ones where the previous row has a gap larger than 3 seconds. All other rows don't get a rn value:
SQL> select time 2 , quantity 3 , case 4 when time - lag(time,1,date '0001-01-01') over (order by time) > 3/24/60/60 then 5 row_number() over (order by time) 6 end rn 7 from mytable 8 / TIME QUANTITY RN ------------------- ---------- ---------- 05-01-2014 12:22:01 100 1 05-01-2014 12:22:03 200 05-01-2014 12:22:04 300 05-01-2014 12:22:06 200 05-01-2014 12:22:45 100 5 05-01-2014 12:22:46 200 05-01-2014 12:23:12 100 7 05-01-2014 12:23:12 200 8 rows selected.
Second (the max-on part), we'll use the analytic function MAX to compute the grp value, which gives the rows where "rn is null" the same value as the first value of the group, as you can see in this query:
SQL> with case_row_number as 2 ( select time 3 , quantity 4 , case 5 when time - lag(time,1,date '0001-01-01') over (order by time) > 3/24/60/60 then 6 row_number() over (order by time) 7 end rn 8 from mytable 9 ) 10 select time 11 , quantity 12 , max(rn) over (order by time) grp 13 from case_row_number 14 / TIME QUANTITY GRP ------------------- ---------- ---------- 05-01-2014 12:22:01 100 1 05-01-2014 12:22:03 200 1 05-01-2014 12:22:04 300 1 05-01-2014 12:22:06 200 1 05-01-2014 12:22:45 100 5 05-01-2014 12:22:46 200 5 05-01-2014 12:23:12 100 7 05-01-2014 12:23:12 200 7 8 rows selected.
Now that we have a suitable grp value calculated, the last part -the grouping- is easy:
SQL> with case_row_number as 2 ( select time 3 , quantity 4 , case 5 when time - lag(time,1,date '0001-01-01') over (order by time) > 3/24/60/60 then 6 row_number() over (order by time) 7 end rn 8 from mytable 9 ) 10 , max_on_case_row_number as 11 ( select time 12 , quantity 13 , max(rn) over (order by time) grp 14 from case_row_number 15 ) 16 select min(time) 17 , max(time) 18 , sum(quantity) 19 from max_on_case_row_number 20 group by grp 21 order by min(time) 22 / MIN(TIME) MAX(TIME) SUM(QUANTITY) ------------------- ------------------- ------------- 05-01-2014 12:22:01 05-01-2014 12:22:06 800 05-01-2014 12:22:45 05-01-2014 12:22:46 300 05-01-2014 12:23:12 05-01-2014 12:23:12 300 3 rows selected.
Compared to Tabibitosan, the max-on-case-row-number technique has similar performance characteristics: only one full table scan and a few sort operations on top. To compare it with Tabibitosan, you'd need to compare the sort operations. This is the plan for the query above:
------------------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ------------------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | | 3 |00:00:00.01 | 3 | | | | | 1 | SORT ORDER BY | | 1 | 8 | 3 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 2 | HASH GROUP BY | | 1 | 8 | 3 |00:00:00.01 | 3 | 1034K| 1034K| 737K (0)| | 3 | VIEW | | 1 | 8 | 8 |00:00:00.01 | 3 | | | | | 4 | WINDOW BUFFER | | 1 | 8 | 8 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 5 | VIEW | | 1 | 8 | 8 |00:00:00.01 | 3 | | | | | 6 | WINDOW SORT | | 1 | 8 | 8 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 7 | TABLE ACCESS FULL| MYTABLE | 1 | 8 | 8 |00:00:00.01 | 3 | | | | ------------------------------------------------------------------------------------------------------------------------
There are three analytic functions in the query (lag, row_number and max), but all three order by time, so effectively there is only one WINDOW SORT operation. For calculating the max, Oracle discovers the intermediate result set is already sorted by time, so it does a WINDOW BUFFER instead of a WINDOW SORT. The outer HASH GROUP BY and SORT ORDER BY, are because of the "group by grp order by min(time)".
As an alternative, the Tabibitosan solution would need to map the rows within 3 seconds of each other to consecutive numbers, and leave a larger gap for the other rows. This is the best I could come up with:
SQL> with x as 2 ( select time 3 , quantity 4 , case 5 when time - lag(time,1,date '0001-01-01') over (order by time,rowid) > 3/24/60/60 then 6 'Y' 7 end gap_indicator 8 from mytable 9 ) 10 , tabibitosan as 11 ( select time 12 , quantity 13 , count(*) over (order by time,rowid) 14 + count(gap_indicator) over (order by time,rowid) 15 - row_number() over (order by time,rowid) grp 16 from x 17 ) 18 select min(time) 19 , max(time) 20 , sum(quantity) 21 from tabibitosan 22 group by grp 23 order by min(time) 24 / MIN(TIME) MAX(TIME) SUM(QUANTITY) ------------------- ------------------- ------------- 05-01-2014 12:22:01 05-01-2014 12:22:06 800 05-01-2014 12:22:45 05-01-2014 12:22:46 300 05-01-2014 12:23:12 05-01-2014 12:23:12 300 3 rows selected.
You can see the similarities with the max-on-case-row-number. Note that I had to include rowid in the ordering, because time alone is not unique. The plan is exactly the same:
------------------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ------------------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | | 3 |00:00:00.01 | 3 | | | | | 1 | SORT ORDER BY | | 1 | 8 | 3 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 2 | HASH GROUP BY | | 1 | 8 | 3 |00:00:00.01 | 3 | 1034K| 1034K| 735K (0)| | 3 | VIEW | | 1 | 8 | 8 |00:00:00.01 | 3 | | | | | 4 | WINDOW BUFFER | | 1 | 8 | 8 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 5 | VIEW | | 1 | 8 | 8 |00:00:00.01 | 3 | | | | | 6 | WINDOW SORT | | 1 | 8 | 8 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 7 | TABLE ACCESS FULL| MYTABLE | 1 | 8 | 8 |00:00:00.01 | 3 | | | | ------------------------------------------------------------------------------------------------------------------------
If you look closely to the expression with the three analytic functions, you'll notice that "count(*) over (order by time,rowid)" equals "row_number() over (order by time,rowid)" and thus they can be eliminated, which leads to a simpler non-Tabibitosan query:
SQL> with x as 2 ( select time 3 , quantity 4 , case 5 when time - lag(time,1,date '0001-01-01') over (order by time,rowid) > 3/24/60/60 then 6 'Y' 7 end gap_indicator 8 from mytable 9 ) 10 , y as 11 ( select time 12 , quantity 13 , count(gap_indicator) over (order by time,rowid) grp 14 from x 15 ) 16 select min(time) 17 , max(time) 18 , sum(quantity) 19 from y 20 group by grp 21 order by min(time) 22 / MIN(TIME) MAX(TIME) SUM(QUANTITY) ------------------- ------------------- ------------- 05-01-2014 12:22:01 05-01-2014 12:22:06 800 05-01-2014 12:22:45 05-01-2014 12:22:46 300 05-01-2014 12:23:12 05-01-2014 12:23:12 300 3 rows selected.
but with the same performance characteristics again:
------------------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ------------------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | | 3 |00:00:00.01 | 3 | | | | | 1 | SORT ORDER BY | | 1 | 8 | 3 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 2 | HASH GROUP BY | | 1 | 8 | 3 |00:00:00.01 | 3 | 1034K| 1034K| 735K (0)| | 3 | VIEW | | 1 | 8 | 8 |00:00:00.01 | 3 | | | | | 4 | WINDOW BUFFER | | 1 | 8 | 8 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 5 | VIEW | | 1 | 8 | 8 |00:00:00.01 | 3 | | | | | 6 | WINDOW SORT | | 1 | 8 | 8 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)| | 7 | TABLE ACCESS FULL| MYTABLE | 1 | 8 | 8 |00:00:00.01 | 3 | | | | ------------------------------------------------------------------------------------------------------------------------
If you can find an expression to map the distance of same group members to a constant number, then grouping or partitioning your data using Tabibitosan will lead to a simple looking two-level query. Since alternatives will be more complex and sometimes will require an extra sort, it is definitely worth its place in your toolbox.
Hi Rob,
ReplyDeleteI like your method and did something similar in order to prevent complex self joins in one case. In another case this was the only possible way to solve a problem.
Did you already see the new "Row Pattern Matching" (http://www.oracle.com/technetwork/issue-archive/2013/13-nov/o63asktom-2034271.html?msgid=3-9498526537)?
I'm pretty sure that this is an even better way to address such a use-case (and even more complex ones), but couldn't test it, as I have no 12c at hand currently.
Did you test it? If so it would be great if you could update this post ;)
Best regards,
Salek Talangi
Tabibitosan is great ;-)
ReplyDelete@Salek: Stew Ashton has compared tabibitosan with 12c row pattern matching:
http://stewashton.wordpress.com/2014/03/05/12c-match_recognize-grouping-sequences/
And also 12c pattern matching for cases where tabibitosan is not appropriate:
http://stewashton.wordpress.com/2014/03/08/12c-match_recognize-and-the-start-of-group-method/
Always good to get such examples gathered so you can expand your "toolkit" ;-)
Thanks, Rob, for gathering tabibitosan and related techniques.
Kim,
DeleteThanks for mentioning me. I just published a pointer to this article, which is more complete than mine. Rob, I add my thanks to Kim's.
This comment has been removed by the author.
ReplyDeleteThis method is marvellous. It was first introduced to me by Italian SQL guru Toma. I believe she subscribes to the same school of thought as Japanese ACE Jyuuzou
ReplyDelete