Saturday, September 15, 2012

Keep clause

You may have seen an aggregate function like this in SQL queries:

max(value) keep (dense_rank first order by mydate)
or this analytic variant:
max(value) keep (dense_rank last order by mydate) over (partition by relation_nr)
Unfortunately, when you start searching for the "keep" clause, you won't find anything in the Oracle documentation (and hopefully, because of this blogpost, people will now have a reference). Of course Oracle does document these functions. You only have to know that they are called FIRST and LAST in the SQL Language Reference.
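The general shape is an aggregate function followed by KEEP (DENSE_RANK FIRST|LAST ORDER BY ...), optionally with an OVER clause for the analytic variant. A minimal sketch, assuming the standard SCOTT.EMP demo table:

```sql
-- For each department: the salary of the longest-serving employee
-- (dense_rank first = rows ranked first by hiredate) and of the
-- most recently hired employee (dense_rank last).
select deptno
     , max(sal) keep (dense_rank first order by hiredate) first_sal
     , max(sal) keep (dense_rank last  order by hiredate) last_sal
  from emp
 group by deptno
/
```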

Even though these functions were already introduced in version 9, I've seen lots of code that could have used them, but didn't. And that's a pity, because it's a wasted opportunity to write shorter and faster code. The common use case I'm talking about is a detail table with a validity period: typically a column startdate, and optionally an enddate. For such a table, you often need the values of the currently valid row. An example: suppose we have a table RELATIONS, and for each relation we want to know its address at a certain point in time:
SQL> create table relations
  2  ( id   number       not null primary key
  3  , name varchar2(30) not null
  4  )
  5  /

Table created.

SQL> insert into relations
  2  select 1, 'Oracle Nederland' from dual union all
  3  select 2, 'Ciber Nederland' from dual
  4  /

2 rows created.

SQL> create table relation_addresses
  2  ( relation_id number       not null
  3  , startdate   date         not null
  4  , address     varchar2(30) not null
  5  , postal_code varchar2(6)  not null
  6  , city        varchar2(30) not null
  7  , constraint ra_pk primary key (relation_id,startdate)
  8  , constraint ra_r_fk foreign key (relation_id) references relations(id)
  9  )
 10  /

Table created.

SQL> insert into relation_addresses
  2  select 1, date '1995-01-01', 'Rijnzathe 6', '3454PV', 'De Meern' from dual union all
  3  select 1, date '2011-01-01', 'Hertogswetering 163-167', '3543AS', 'Utrecht' from dual union all
  4  select 2, date '2000-01-01', 'Frankrijkstraat 128', '5622AH', 'Eindhoven' from dual union all
  5  select 2, date '2006-01-01', 'Meerkollaan 15', '5613BS', 'Eindhoven' from dual union all
  6  select 2, date '2010-01-01', 'Burgemeester Burgerslaan 40b', '5245NH', 'Den Bosch' from dual union all
  7  select 2, date '2015-01-01', 'Archimedesbaan 16', '3439ME', 'Nieuwegein' from dual
  8  /

6 rows created.

SQL> begin
  2    dbms_stats.gather_table_stats(user,'relations');
  3    dbms_stats.gather_table_stats(user,'relation_addresses');
  4  end;
  5  /

PL/SQL procedure successfully completed.
Relation "Oracle Nederland" has two addresses; its current address is the one at the Hertogswetering. And, fictively, relation "Ciber Nederland" has four addresses. The current one is the Den Bosch address, and I've also recorded a future address in Nieuwegein. Note that, in real life, the latter three are all Ciber offices currently in use. To get the active relation addresses on October 1st, 2012, I can use this query:
SQL> var REFERENCE_DATE varchar2(10)
SQL> exec :REFERENCE_DATE:='2012-10-01'

PL/SQL procedure successfully completed.

SQL> select ra.relation_id
  2       , max(ra.startdate) startdate
  3    from relation_addresses ra
  4   where ra.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')
  5   group by ra.relation_id
  6  /

RELATION_ID STARTDATE
----------- -------------------
          1 01-01-2011 00:00:00
          2 01-01-2010 00:00:00

2 rows selected.
But what if I want to retrieve the current address belonging to these rows? This is, in fact, a frequently asked question in Oracle forums. Prior to Oracle 8i, you would have used a query like the one below:
SQL> select ra.relation_id
  2       , ra.startdate
  3       , ra.address
  4       , ra.postal_code
  5       , ra.city
  6    from relation_addresses ra
  7   where ra.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')
  8     and not exists
  9         ( select 'a relation_address with a more recent startdate'
 10             from relation_addresses ra2
 11            where ra2.relation_id = ra.relation_id
 12              and ra2.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')
 13              and ra2.startdate > ra.startdate
 14         )
 15  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 01-01-2011 00:00:00 Hertogswetering 163-167        3543AS Utrecht
          2 01-01-2010 00:00:00 Burgemeester Burgerslaan 40b   5245NH Den Bosch

2 rows selected.
This uses a correlated subquery, accessing table RELATION_ADDRESSES (or its index) twice. From Oracle 8i onwards, this can be avoided by using an analytic function:
SQL> select relation_id
  2       , startdate
  3       , address
  4       , postal_code
  5       , city
  6    from ( select ra.relation_id
  7                , ra.startdate
  8                , ra.address
  9                , ra.postal_code
 10                , ra.city
 11                , row_number() over (partition by ra.relation_id order by ra.startdate desc) rn
 12             from relation_addresses ra
 13            where ra.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')
 14         )
 15   where rn = 1
 16  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 01-01-2011 00:00:00 Hertogswetering 163-167        3543AS Utrecht
          2 01-01-2010 00:00:00 Burgemeester Burgerslaan 40b   5245NH Den Bosch

2 rows selected.
Here you compute a row_number per relation_id, ordering each partition by startdate in descending order. This means that, per relation_id, the row with the most recent startdate on or before the reference date gets row_number 1. By using an inline view, we can filter on the outcome of the analytic function and only select the rows with row_number 1. You'll often see this solution advised in forums. Compared to the correlated subquery, this query selects only once from table RELATION_ADDRESSES. However, you can do even better by just adding three "keep clause" functions to the original query:
SQL> select ra.relation_id
  2       , max(ra.startdate) startdate
  3       , max(ra.address) keep (dense_rank last order by ra.startdate) address
  4       , max(ra.postal_code) keep (dense_rank last order by ra.startdate) postal_code
  5       , max(ra.city) keep (dense_rank last order by ra.startdate) city
  6    from relation_addresses ra
  7   where ra.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')
  8   group by ra.relation_id
  9  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 01-01-2011 00:00:00 Hertogswetering 163-167        3543AS Utrecht
          2 01-01-2010 00:00:00 Burgemeester Burgerslaan 40b   5245NH Den Bosch

2 rows selected.
The three extra aggregate functions all do a "dense_rank last order by startdate", meaning: sort the rows by startdate and keep only those with the most recent startdate. If several rows share that startdate, the max function at the start tells Oracle to pick the maximum address/postal_code/city among them. However, since (relation_id,startdate) is unique, ties are impossible, so the max function is just a dummy here; I could equally well have used min.
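To see what the leading aggregate does when ties actually occur, here is a contrived sketch on dual: both rows share the same startdate, so "dense_rank last" keeps both, and the leading min or max decides which value is returned:

```sql
-- Contrived tie: two rows with the same startdate are both kept,
-- so the leading MAX/MIN breaks the tie on the city value itself.
select max(city) keep (dense_rank last order by startdate) max_city
     , min(city) keep (dense_rank last order by startdate) min_city
  from ( select date '2011-01-01' startdate, 'Utrecht' city from dual
         union all
         select date '2011-01-01', 'Amsterdam' from dual
       )
/
-- MAX_CITY = Utrecht, MIN_CITY = Amsterdam
```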

The query is shorter and, to me, clearer at first glance. However, the main reason for my enthusiasm for the aggregate functions FIRST and LAST is that they're simply faster. To show this, let's execute those queries against a table with 300,000 rows: 100,000 relations with 3 addresses each:
SQL> create table relations
  2  ( id   number       not null primary key
  3  , name varchar2(30) not null
  4  )
  5  /

Table created.

SQL> create table relation_addresses
  2  ( relation_id number       not null
  3  , startdate   date         not null
  4  , address     varchar2(30) not null
  5  , postal_code varchar2(6)  not null
  6  , city        varchar2(30) not null
  7  , constraint ra_pk primary key (relation_id,startdate)
  8  , constraint ra_r_fk foreign key (relation_id) references relations(id)
  9  )
 10  /

Table created.

SQL> insert into relations
  2   select level
  3        , dbms_random.string('a',30)
  4     from dual
  5  connect by level <= 100000
  6  /

100000 rows created.

SQL> insert into relation_addresses
  2   select 1 + mod(level-1,100000)
  3        , date '2013-01-01' - numtodsinterval(level,'hour')
  4        , dbms_random.string('a',30)
  5        , dbms_random.string('a',6)
  6        , dbms_random.string('a',30)
  7     from dual
  8  connect by level <= 300000
  9  /

300000 rows created.

SQL> begin
  2    dbms_stats.gather_table_stats
  3    ( user
  4    , 'relations'
  5    , cascade=>true
  6    , method_opt=>'FOR ALL INDEXED COLUMNS SIZE 254'
  7    , estimate_percent=>100
  8    );
  9    dbms_stats.gather_table_stats
 10    ( user
 11    , 'relation_addresses'
 12    , cascade=>true
 13    , method_opt=>'FOR ALL INDEXED COLUMNS SIZE 254'
 14    , estimate_percent=>100
 15    );
 16  end;
 17  /

PL/SQL procedure successfully completed.
Note that I created histograms with 254 buckets, just to make the optimizer see that it should full scan the table despite the "startdate <= :REFERENCE_DATE" predicate. This next query should give a clue what's in the table:
SQL> select *
  2    from relation_addresses
  3   where relation_id in (1,2,99999,100000)
  4   order by relation_id
  5       , startdate
  6  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 09-03-1990 15:00:00 tKgXePxuAIdhFBNJLIRRjodrlJzGOl vPIAbL pNkbFHTJPrVuDIYLxsCfUfetBsKJIE
          1 05-08-2001 07:00:00 LybVzfpzoQzXjpCAdkSZrkYrwUtZtL cWJwFe IczTRyjITWCJIOErccfITVvsqRVyMF
          1 31-12-2012 23:00:00 lNEwsdYhbwdqRxHTSCTCykgICxiXKL oXzHQF YfyKFmiboCWfmNLjVLZoKmUDoMFaDu
          2 09-03-1990 14:00:00 svOylQPkbyfympSXRMeyudfFErFvlO MLFdpG LTtAKdrpUmCwFgqEmoKxnUtWecwgcV
          2 05-08-2001 06:00:00 BsRCUviBiLHaAEjyRVnIedRAWzuVSe DlBlZW ErQmCkDgNDTMOdZzceFYrMXnZmmjxg
          2 31-12-2012 22:00:00 wqdFdXoBdmmCooLtGfWOMKukIMrDlI geRRHz DaPpWHOOdWgbjLaRkxfFDUIPgVgvEt
      99999 12-10-1978 01:00:00 FsXOjUdNIgjjGjnWpJjTTscbcuqsxa PdhVtm qOskmLwRlngSEihmlpYhmNHhvtrpBc
      99999 09-03-1990 17:00:00 sqoKYNeDntZtAUSmSDMtIQZloTSVeD uGPszi GIDctptEomcGzYGYhUGhKHgDRZJCmY
      99999 05-08-2001 09:00:00 fhHGwuGPIHSOaKdjDvDcqTzsbHZzqR tpaLAP rVYCmijzqJmhlnZZLXkHpgFmLAEiTS
     100000 12-10-1978 00:00:00 WwxfHcVfkFfItgcXfjPnKTiATlHjao nSOjSn vZNRsRySNPlmQKgCJjcpiEOhQIxzoy
     100000 09-03-1990 16:00:00 cGcVPMsFyxCBrnsZtMYBnaAflXiNff NVKRIr SseFWkWyUDgaPpbxdmENdLjurGbJPK
     100000 05-08-2001 08:00:00 dRfCmqdmbhcmaMvyYBpewPsFBCVdlG BMQWLY YPaAGnKKUkfdnAeAyLYeUBfXwezsEo

12 rows selected.
So a couple of rows are filtered out because they lie in the future, but for most relations the latest row is the current one. This is the plan of the first query, with the correlated subquery:
SQL> select * from table(dbms_xplan.display_cursor(null,null,'iostats last'))
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  d6p5uh67h65yb, child number 0
-------------------------------------
select ra.relation_id      , ra.startdate      , ra.address      ,
ra.postal_code      , ra.city   from relation_addresses ra  where
ra.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')    and not exists
       ( select 'a relation_address with a more recent startdate'
     from relation_addresses ra2           where ra2.relation_id =
ra.relation_id             and ra2.startdate <=
to_date(:REFERENCE_DATE,'yyyy-mm-dd')             and ra2.startdate >
ra.startdate        )

Plan hash value: 3749094337

---------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                    |      1 |        |    100K|00:00:00.66 |   15071 |   3681 |
|*  1 |  HASH JOIN RIGHT ANTI |                    |      1 |   2978 |    100K|00:00:00.66 |   15071 |   3681 |
|*  2 |   INDEX FAST FULL SCAN| RA_PK              |      1 |    297K|    297K|00:00:00.05 |    1240 |     35 |
|*  3 |   TABLE ACCESS FULL   | RELATION_ADDRESSES |      1 |    297K|    297K|00:00:00.12 |   13831 |   3646 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("RA2"."RELATION_ID"="RA"."RELATION_ID")
       filter("RA2"."STARTDATE">"RA"."STARTDATE")
   2 - filter("RA2"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))
   3 - filter("RA"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))


30 rows selected.
A HASH JOIN RIGHT ANTI for the not exists, and a total of .66 seconds. Next, the plan for the query with the analytic row_number function:
SQL> select * from table(dbms_xplan.display_cursor(null,null,'iostats last'))
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  1zd4wqtxkc2vz, child number 0
-------------------------------------
select relation_id      , startdate      , address      , postal_code
   , city   from ( select ra.relation_id               , ra.startdate
            , ra.address               , ra.postal_code               ,
ra.city               , row_number() over (partition by ra.relation_id
order by ra.startdate desc) rn            from relation_addresses ra
       where ra.startdate <= to_date(:REFERENCE_DATE,'yyyy-mm-dd')
  )  where rn = 1

Plan hash value: 2795878473

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                    |      1 |        |    100K|00:00:00.97 |    7238 |   3646 |
|*  1 |  VIEW                    |                    |      1 |    297K|    100K|00:00:00.97 |    7238 |   3646 |
|*  2 |   WINDOW SORT PUSHED RANK|                    |      1 |    297K|    200K|00:00:00.93 |    7238 |   3646 |
|*  3 |    TABLE ACCESS FULL     | RELATION_ADDRESSES |      1 |    297K|    297K|00:00:00.09 |    7238 |   3646 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"=1)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "RA"."RELATION_ID" ORDER BY
              INTERNAL_FUNCTION("RA"."STARTDATE") DESC )<=1)
   3 - filter("RA"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))


29 rows selected.
Note that this query takes longer than the correlated subquery above: .97 seconds versus .66 seconds. The HASH JOIN RIGHT ANTI took .49 seconds (.66 - .05 - .12), whereas computing the ROW_NUMBER took .84 seconds (.93 - .09). So here, on my laptop, I saved the .05 seconds for the INDEX FAST FULL SCAN, but spent .35 seconds (.84 - .49) more on the computation. When I/O is more expensive than on my laptop, the time of the first query will likely go up and the two timings will be closer to each other. Now the keep clause variant:
SQL> select * from table(dbms_xplan.display_cursor(null,null,'iostats last'))
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  dcw8tyyqtu2kk, child number 0
-------------------------------------
select ra.relation_id      , max(ra.startdate) startdate      ,
max(ra.address) keep (dense_rank last order by ra.startdate) address
  , max(ra.postal_code) keep (dense_rank last order by ra.startdate)
postal_code      , max(ra.city) keep (dense_rank last order by
ra.startdate) city   from relation_addresses ra  where ra.startdate <=
to_date(:REFERENCE_DATE,'yyyy-mm-dd')  group by ra.relation_id

Plan hash value: 2324030966

------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                    |      1 |        |    100K|00:00:00.55 |    7238 |   3646 |
|   1 |  SORT GROUP BY     |                    |      1 |    100K|    100K|00:00:00.55 |    7238 |   3646 |
|*  2 |   TABLE ACCESS FULL| RELATION_ADDRESSES |      1 |    297K|    297K|00:00:00.09 |    7238 |   3646 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("RA"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))


24 rows selected.
The shortest query, the shortest plan and the fastest execution. The SORT GROUP BY immediately reduces the number of intermediate rows from 297K to 100K, whereas the WINDOW SORT PUSHED RANK had to compute the row_number for all 297K rows.

PS: this topic and much more is covered in an upcoming Live Virtual Seminar for Oracle University on October 2nd.

5 comments:

  1. Rob,

    Here is an "index friendly" variant. I tested it with different data (30 addresses per relation, not 3), and it ran somewhat faster and did fewer consistent gets.

    select * from RELATION_ADDRESSES
    where rowid in (
    select max(rowid) keep (dense_rank last order by STARTDATE)
    from RELATION_ADDRESSES
    where STARTDATE <= sysdate
    group by RELATION_ID
    )

  2. Hello Stew,

    With 3 addresses per relation on the table above, this is my result:

    ------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                 | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
    ------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT          |                    |      1 |        |    100K|00:00:00.67 |   15049 |   3646 |
    |*  1 |  HASH JOIN                |                    |      1 |    300M|    100K|00:00:00.67 |   15049 |   3646 |
    |   2 |   VIEW                    | VW_NSO_1           |      1 |    100K|    100K|00:00:00.39 |    1208 |      0 |
    |   3 |    HASH UNIQUE            |                    |      1 |    100K|    100K|00:00:00.37 |    1208 |      0 |
    |   4 |     SORT GROUP BY         |                    |      1 |    100K|    100K|00:00:00.31 |    1208 |      0 |
    |*  5 |      INDEX FAST FULL SCAN | RA_PK              |      1 |    297K|    297K|00:00:00.06 |    1208 |      0 |
    |   6 |   TABLE ACCESS FULL       | RELATION_ADDRESSES |      1 |    300K|    300K|00:00:00.09 |   13841 |   3646 |
    ------------------------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

    1 - access(ROWID="$kkqu_col_1")
    5 - filter("STARTDATE"<=SYSDATE@!)

    And with 10,000 relations and 30 addresses each, the results are:
    My query: .42 seconds, 3595 consistent gets
    Your query: .43 seconds, 5214 consistent gets

    It's strange that doing an extra pass on the table/index, while doing the same type of calculation, results in faster execution and fewer consistent gets. What exactly did you do?

    Regards,
    Rob.

  3. Rob,

    Which version have you tested this on? Not sure if anything has changed in last couple of years but look at http://www.oramoss.com/blog/2010/03/12/keep-dense_rank-versus-row_number-further-details/ where Jeff mentions the row_number approach works better as the number of "other" columns needed in the resultset (corresponding to the MAX/MIN value) increase.

  4. Hi Narendra,

    This is version 11.2.0.2.
    Thanks for the reference to Jeff Moss' blog. Unfortunately, he forgot to mention the database version as well. My test above was with three columns. I also tested with 1 and 6 columns, but I didn't see any significant changes in runtime, so I left those out. My guess is that I'll never encounter a situation where the number of keep-clause-aggregate-functions is so high that the analytic function will win.

    Regards,
    Rob.

  5. Like Laurent Schneider, I found the EMP table convenient for demonstrating how to "KEEP FIRST and LAST Handy": http://it.toolbox.com/blogs/data-ruminations/keep-first-and-last-handy-49681
