Monday, January 6, 2014

Tabibitosan

I answered a few SQL questions on Stack Overflow recently, which I could solve easily by using the Tabibitosan method. It's such an elegant and efficient technique, I think it's worth giving it an extra bit of attention through this post. I'm certainly not the first to write about it: it was introduced to me by Japanese Oracle ACE Aketi Jyuuzou on OTN. He wrote a special forum post explaining his technique here, which also contains lots of examples. I have also included it in my SQL Masterclass. And Boneist and bluefrog have written about Tabibitosan in the past as well.

In its simplest form, the Tabibitosan technique lets you group consecutive rows easily, using just one level of analytic functions. And with a bit of imagination, you can also do some primitive SQL pattern matching avant la lettre. I'll show examples of both in this post.

The key of the technique is to map rows belonging to the same pattern to the same number, which can then be used for grouping or partitioning. To illustrate, let's start with a simple example to group consecutive numbers.

The table:

SQL> create table mytable (nr)
  2  as
  3  select 1 from dual union all
  4  select 2 from dual union all
  5  select 3 from dual union all
  6  select 6 from dual union all
  7  select 7 from dual union all
  8  select 11 from dual union all
  9  select 18 from dual union all
 10  select 19 from dual union all
 11  select 20 from dual union all
 12  select 21 from dual union all
 13  select 22 from dual union all
 14  select 25 from dual
 15  /

Table created.

With the question: show me all the groups of integer values that are in sequence without gaps. For each group show the starting number and end number. So the expected result set is this:

   MIN(NR)    MAX(NR)
---------- ----------
         1          3
         6          7
        11         11
        18         22
        25         25

Tabibitosan works by calculating an extra column, grp in my case, by subtracting row_number() from the value that defines the sequence, nr in my case. If the interval of the values in sequence is 1, then subtracting row_number() will result in a constant value for the group members. This query shows you the core of the technique:

SQL> select nr
  2       , row_number() over (order by nr) rn
  3       , nr - row_number() over (order by nr) grp
  4    from mytable
  5  /

        NR         RN        GRP
---------- ---------- ----------
         1          1          0
         2          2          0
         3          3          0
         6          4          2
         7          5          2
        11          6          5
        18          7         11
        19          8         11
        20          9         11
        21         10         11
        22         11         11
        25         12         13

12 rows selected.

It doesn't matter what the grp value is exactly. What matters is that it's the same constant value for all group members. This then allows for easy partitioning or grouping:

SQL> with tabibitosan as
  2  ( select nr
  3         , nr - row_number() over (order by nr) grp
  4      from mytable
  5  )
  6  select min(nr)
  7       , max(nr)
  8    from tabibitosan
  9   group by grp
 10   order by grp
 11  /

   MIN(NR)    MAX(NR)
---------- ----------
         1          3
         6          7
        11         11
        18         22
        25         25

5 rows selected.

You can see another example of this simple form of Tabibitosan in this Stack Overflow question.

Tabibitosan only works if the difference of adjacent column values in the first operand of the minus operator (here: nr) equals 1. If it's not 1, you'll have to come up with an expression to make it 1. For example, let's see how a similar example works if we'd now like to group rows for consecutive months. Here's the setup:

SQL> create table mytable (startdate)
  2  as
  3  select date '2013-01-01' from dual union all
  4  select date '2013-02-01' from dual union all
  5  select date '2013-03-01' from dual union all
  6  select date '2013-06-01' from dual union all
  7  select date '2013-07-01' from dual union all
  8  select date '2013-10-01' from dual union all
  9  select date '2013-12-01' from dual union all
 10  select date '2014-01-01' from dual
 11  /

Table created.

And this is the expected result set:

MIN(STARTD MAX(STARTD
---------- ----------
01-01-2013 01-03-2013
01-06-2013 01-07-2013
01-10-2013 01-10-2013
01-12-2013 01-01-2014

We can use the same technique, if only the interval between two months can be mapped to 1, which is what the MONTHS_BETWEEN function gives us. So let's use that function to calculate the grp value:

SQL> select startdate
  2       , months_between(startdate,date '2010-01-01') mb
  3       , months_between(startdate,date '2010-01-01')
  4         - row_number() over (order by startdate) grp
  5    from mytable
  6  /

STARTDATE          MB        GRP
---------- ---------- ----------
01-01-2013         36         35
01-02-2013         37         35
01-03-2013         38         35
01-06-2013         41         37
01-07-2013         42         37
01-10-2013         45         39
01-12-2013         47         40
01-01-2014         48         40

8 rows selected.

I used an arbitrary date here (January 1, 2010). Any date on the first day of the month would be good. With the grp value calculated, the Tabibitosan query is again easy:

SQL> with tabibitosan as
  2  ( select startdate
  3         , months_between(startdate,date '2010-01-01')
  4           - row_number() over (order by startdate) grp
  5      from mytable
  6  )
  7  select min(startdate)
  8       , max(startdate)
  9    from tabibitosan
 10   group by grp
 11   order by grp
 12  /

MIN(STARTD MAX(STARTD
---------- ----------
01-01-2013 01-03-2013
01-06-2013 01-07-2013
01-10-2013 01-10-2013
01-12-2013 01-01-2014

4 rows selected.

When the column that defines the order doesn't always increase with a constant number, the row_number() analytic function for the first operand of the minus operator comes in handy for the solution. Since I've found this one to be quite common, I'm also showing this case here. The example below is taken from this Stack Overflow question, where I'm comparing dates but excluding saturdays and sundays. The table and its contents look like this:

SQL> create table mytable (date_worked,country)
  2  as
  3  select to_date('1-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
  4  select to_date('4-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
  5  select to_date('5-Nov-13','dd-Mon-yy'), 'India' from dual union all
  6  select to_date('6-Nov-13','dd-Mon-yy'), 'India' from dual union all
  7  select to_date('7-Nov-13','dd-Mon-yy'), 'India' from dual union all
  8  select to_date('8-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
  9  select to_date('11-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
 10  select to_date('12-Nov-13','dd-Mon-yy'), 'India' from dual union all
 11  select to_date('13-Nov-13','dd-Mon-yy'), 'India' from dual union all
 12  select to_date('14-Nov-13','dd-Mon-yy'), 'India' from dual union all
 13  select to_date('15-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
 14  select to_date('18-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
 15  select to_date('19-Nov-13','dd-Mon-yy'), 'India' from dual union all
 16  select to_date('20-Nov-13','dd-Mon-yy'), 'India' from dual union all
 17  select to_date('21-Nov-13','dd-Mon-yy'), 'India' from dual union all
 18  select to_date('22-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
 19  select to_date('25-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual union all
 20  select to_date('26-Nov-13','dd-Mon-yy'), 'India' from dual union all
 21  select to_date('27-Nov-13','dd-Mon-yy'), 'India' from dual union all
 22  select to_date('28-Nov-13','dd-Mon-yy'), 'India' from dual union all
 23  select to_date('29-Nov-13','dd-Mon-yy'), 'United Kingdom' from dual
 24  /

Table created.

The query needs to return the start date and end date of each stay in a country. So the expected result set is this:

COUNTRY        START_DATE END_DATE
-------------- ---------- ----------
United Kingdom 01-11-2013 04-11-2013
India          05-11-2013 07-11-2013
United Kingdom 08-11-2013 11-11-2013
India          12-11-2013 14-11-2013
United Kingdom 15-11-2013 18-11-2013
India          19-11-2013 21-11-2013
United Kingdom 22-11-2013 25-11-2013
India          26-11-2013 28-11-2013
United Kingdom 29-11-2013 29-11-2013

9 rows selected.

By subtracting a partitioned row_number() from a regular full-set row_number() we can calculate the grp value:

SQL> select date_worked
  2       , country
  3       , row_number() over (order by date_worked) x
  4       , row_number() over (partition by country order by date_worked) y
  5       , row_number() over (order by date_worked)
  6         - row_number() over (partition by country order by date_worked) grp
  7    from mytable
  8  /

DATE_WORKE COUNTRY                 X          Y        GRP
---------- -------------- ---------- ---------- ----------
01-11-2013 United Kingdom          1          1          0
04-11-2013 United Kingdom          2          2          0
05-11-2013 India                   3          1          2
06-11-2013 India                   4          2          2
07-11-2013 India                   5          3          2
08-11-2013 United Kingdom          6          3          3
11-11-2013 United Kingdom          7          4          3
12-11-2013 India                   8          4          4
13-11-2013 India                   9          5          4
14-11-2013 India                  10          6          4
15-11-2013 United Kingdom         11          5          6
18-11-2013 United Kingdom         12          6          6
19-11-2013 India                  13          7          6
20-11-2013 India                  14          8          6
21-11-2013 India                  15          9          6
22-11-2013 United Kingdom         16          7          9
25-11-2013 United Kingdom         17          8          9
26-11-2013 India                  18         10          8
27-11-2013 India                  19         11          8
28-11-2013 India                  20         12          8
29-11-2013 United Kingdom         21          9         12

21 rows selected.

Note that just using the grp value for the final grouping, could lead to overlap of groups from different countries. So we need the country for the final grouping as well. The full query becomes:

SQL> with tabibitosan as
  2  ( select date_worked
  3         , country
  4         , row_number() over (order by date_worked)
  5           - row_number() over (partition by country order by date_worked) grp
  6      from mytable
  7  )
  8  select country
  9       , min(date_worked) start_date
 10       , max(date_worked) end_date
 11    from tabibitosan
 12   group by country
 13       , grp
 14   order by start_date
 15  /

COUNTRY        START_DATE END_DATE
-------------- ---------- ----------
United Kingdom 01-11-2013 04-11-2013
India          05-11-2013 07-11-2013
United Kingdom 08-11-2013 11-11-2013
India          12-11-2013 14-11-2013
United Kingdom 15-11-2013 18-11-2013
India          19-11-2013 21-11-2013
United Kingdom 22-11-2013 25-11-2013
India          26-11-2013 28-11-2013
United Kingdom 29-11-2013 29-11-2013

9 rows selected.

Please note that these Tabibitosan queries contain only two levels and look pretty clean. Also, the performance is nice: just one full table scan and two sorts for the two analytic functions, followed by the unavoidable HASH GROUP BY and SORT ORDER BY for the grouping and the sorted display:
-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |         |      1 |        |      9 |00:00:00.01 |       3 |       |       |          |
|   1 |  SORT ORDER BY         |         |      1 |      2 |      9 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   2 |   HASH GROUP BY        |         |      1 |      2 |      9 |00:00:00.01 |       3 |  1088K|  1088K| 1233K (0)|
|   3 |    VIEW                |         |      1 |     21 |     21 |00:00:00.01 |       3 |       |       |          |
|   4 |     WINDOW SORT        |         |      1 |     21 |     21 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   5 |      WINDOW SORT       |         |      1 |     21 |     21 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   6 |       TABLE ACCESS FULL| MYTABLE |      1 |     21 |     21 |00:00:00.01 |       3 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------

And here is another example from Stack Overflow using Tabibitosan with two row_numbers.

If it's becoming too hard to come up with a function that maps the differences between group members to the number 1, then you have an alternative which I'll call the max-on-case-row-number technique, by lack of a better name. I first saw this technique on AskTom. I'm going to repeat the same example as in the link to allow easy comparison.

The table:

SQL> create table mytable (time,quantity)
  2  as
  3  select trunc(sysdate) + to_dsinterval('0 12:22:01'), 100 from dual union all
  4  select trunc(sysdate) + to_dsinterval('0 12:22:03'), 200 from dual union all
  5  select trunc(sysdate) + to_dsinterval('0 12:22:04'), 300 from dual union all
  6  select trunc(sysdate) + to_dsinterval('0 12:22:06'), 200 from dual union all
  7  select trunc(sysdate) + to_dsinterval('0 12:22:45'), 100 from dual union all
  8  select trunc(sysdate) + to_dsinterval('0 12:22:46'), 200 from dual union all
  9  select trunc(sysdate) + to_dsinterval('0 12:23:12'), 100 from dual union all
 10  select trunc(sysdate) + to_dsinterval('0 12:23:12'), 200 from dual
 11  /

Table created.

The goal is to sum the amounts where the time is within 3 seconds of each other. So, the expected result set is this:

MIN(TIME)           MAX(TIME)           SUM(QUANTITY)
------------------- ------------------- -------------
05-01-2014 12:22:01 05-01-2014 12:22:06           800
05-01-2014 12:22:45 05-01-2014 12:22:46           300
05-01-2014 12:23:12 05-01-2014 12:23:12           300

First (the case-row-number part), we'll compute a new column rn, and assign a row_number to the rows that mark a new group: the first row and the ones where the previous row has a gap larger than 3 seconds. All other rows don't get a rn value:

SQL> select time
  2       , quantity
  3       , case
  4         when time - lag(time,1,date '0001-01-01') over (order by time) > 3/24/60/60 then
  5           row_number() over (order by time)
  6         end rn
  7    from mytable
  8  /

TIME                  QUANTITY         RN
------------------- ---------- ----------
05-01-2014 12:22:01        100          1
05-01-2014 12:22:03        200
05-01-2014 12:22:04        300
05-01-2014 12:22:06        200
05-01-2014 12:22:45        100          5
05-01-2014 12:22:46        200
05-01-2014 12:23:12        100          7
05-01-2014 12:23:12        200

8 rows selected.

Second (the max-on part), we'll use the analytic function MAX to compute the grp value, which gives the rows where "rn is null" the same value as the first value of the group, as you can see in this query:

SQL> with case_row_number as
  2  ( select time
  3         , quantity
  4         , case
  5           when time - lag(time,1,date '0001-01-01') over (order by time) > 3/24/60/60 then
  6             row_number() over (order by time)
  7           end rn
  8      from mytable
  9  )
 10  select time
 11       , quantity
 12       , max(rn) over (order by time) grp
 13    from case_row_number
 14  /

TIME                  QUANTITY        GRP
------------------- ---------- ----------
05-01-2014 12:22:01        100          1
05-01-2014 12:22:03        200          1
05-01-2014 12:22:04        300          1
05-01-2014 12:22:06        200          1
05-01-2014 12:22:45        100          5
05-01-2014 12:22:46        200          5
05-01-2014 12:23:12        100          7
05-01-2014 12:23:12        200          7

8 rows selected.

Now that we have a suitable grp value calculated, the last part -the grouping- is easy:

SQL> with case_row_number as
  2  ( select time
  3         , quantity
  4         , case
  5           when time - lag(time,1,date '0001-01-01') over (order by time) > 3/24/60/60 then
  6             row_number() over (order by time)
  7           end rn
  8      from mytable
  9  )
 10  , max_on_case_row_number as
 11  ( select time
 12         , quantity
 13         , max(rn) over (order by time) grp
 14      from case_row_number
 15  )
 16  select min(time)
 17       , max(time)
 18       , sum(quantity)
 19    from max_on_case_row_number
 20   group by grp
 21   order by min(time)
 22  /

MIN(TIME)           MAX(TIME)           SUM(QUANTITY)
------------------- ------------------- -------------
05-01-2014 12:22:01 05-01-2014 12:22:06           800
05-01-2014 12:22:45 05-01-2014 12:22:46           300
05-01-2014 12:23:12 05-01-2014 12:23:12           300

3 rows selected.

Compared to Tabibitosan, the max-on-case-row-number technique has similar performance characteristics: only one full table scan and a few sort operations on top. To compare it with Tabibitosan, you'd need to compare the sort operations. This is the plan for the query above:

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |         |      1 |        |      3 |00:00:00.01 |       3 |       |       |          |
|   1 |  SORT ORDER BY          |         |      1 |      8 |      3 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   2 |   HASH GROUP BY         |         |      1 |      8 |      3 |00:00:00.01 |       3 |  1034K|  1034K|  737K (0)|
|   3 |    VIEW                 |         |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
|   4 |     WINDOW BUFFER       |         |      1 |      8 |      8 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   5 |      VIEW               |         |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
|   6 |       WINDOW SORT       |         |      1 |      8 |      8 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   7 |        TABLE ACCESS FULL| MYTABLE |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

There are three analytic functions in the query (lag, row_number and max), but all three order by time, so effectively there is only one WINDOW SORT operation. For calculating the max, Oracle discovers the intermediate result set is already sorted by time, so it does a WINDOW BUFFER instead of a WINDOW SORT. The outer HASH GROUP BY and SORT ORDER BY, are because of the "group by grp order by min(time)".

As an alternative, the Tabibitosan solution would need to map the rows within 3 seconds of each other to consecutive numbers, and leave a larger gap for the other rows. This is the best I could come up with:

SQL> with x as
  2  ( select time
  3         , quantity
  4         , case
  5           when time - lag(time,1,date '0001-01-01') over (order by time,rowid) > 3/24/60/60 then
  6             'Y'
  7           end gap_indicator
  8      from mytable
  9  )
 10  , tabibitosan as
 11  ( select time
 12         , quantity
 13         , count(*) over (order by time,rowid)
 14           + count(gap_indicator) over (order by time,rowid)
 15           - row_number() over (order by time,rowid) grp
 16      from x
 17  )
 18  select min(time)
 19       , max(time)
 20       , sum(quantity)
 21    from tabibitosan
 22   group by grp
 23   order by min(time)
 24  /

MIN(TIME)           MAX(TIME)           SUM(QUANTITY)
------------------- ------------------- -------------
05-01-2014 12:22:01 05-01-2014 12:22:06           800
05-01-2014 12:22:45 05-01-2014 12:22:46           300
05-01-2014 12:23:12 05-01-2014 12:23:12           300

3 rows selected.

You can see the similarities with the max-on-case-row-number. Note that I had to include rowid in the ordering, because time alone is not unique. The plan is exactly the same:

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |         |      1 |        |      3 |00:00:00.01 |       3 |       |       |          |
|   1 |  SORT ORDER BY          |         |      1 |      8 |      3 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   2 |   HASH GROUP BY         |         |      1 |      8 |      3 |00:00:00.01 |       3 |  1034K|  1034K|  735K (0)|
|   3 |    VIEW                 |         |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
|   4 |     WINDOW BUFFER       |         |      1 |      8 |      8 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   5 |      VIEW               |         |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
|   6 |       WINDOW SORT       |         |      1 |      8 |      8 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   7 |        TABLE ACCESS FULL| MYTABLE |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

If you look closely to the expression with the three analytic functions, you'll notice that "count(*) over (order by time,rowid)" equals "row_number() over (order by time,rowid)" and thus they can be eliminated, which leads to a simpler non-Tabibitosan query:

SQL> with x as
  2  ( select time
  3         , quantity
  4         , case
  5           when time - lag(time,1,date '0001-01-01') over (order by time,rowid) > 3/24/60/60 then
  6             'Y'
  7           end gap_indicator
  8      from mytable
  9  )
 10  , y as
 11  ( select time
 12         , quantity
 13         , count(gap_indicator) over (order by time,rowid) grp
 14      from x
 15  )
 16  select min(time)
 17       , max(time)
 18       , sum(quantity)
 19    from y
 20   group by grp
 21   order by min(time)
 22  /

MIN(TIME)           MAX(TIME)           SUM(QUANTITY)
------------------- ------------------- -------------
05-01-2014 12:22:01 05-01-2014 12:22:06           800
05-01-2014 12:22:45 05-01-2014 12:22:46           300
05-01-2014 12:23:12 05-01-2014 12:23:12           300

3 rows selected.

but with the same performance characteristics again:

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |         |      1 |        |      3 |00:00:00.01 |       3 |       |       |          |
|   1 |  SORT ORDER BY          |         |      1 |      8 |      3 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   2 |   HASH GROUP BY         |         |      1 |      8 |      3 |00:00:00.01 |       3 |  1034K|  1034K|  735K (0)|
|   3 |    VIEW                 |         |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
|   4 |     WINDOW BUFFER       |         |      1 |      8 |      8 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   5 |      VIEW               |         |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
|   6 |       WINDOW SORT       |         |      1 |      8 |      8 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   7 |        TABLE ACCESS FULL| MYTABLE |      1 |      8 |      8 |00:00:00.01 |       3 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

If you can find an expression to map the distance of same group members to a constant number, then grouping or partitioning your data using Tabibitosan will lead to a simple looking two-level query. Since alternatives will be more complex and sometimes will require an extra sort, it is definitely worth its place in your toolbox.