
Tuesday, January 6, 2009

SQL Model Clause tutorial, part three

SQL Model Clause Tutorial, part one
SQL Model Clause Tutorial, part two

If you prefer to read in Dutch, then you can read the original article here (page 30-34), and if you prefer to read in German, then you can find a translated version here.

Practicalities of the SQL Model Clause

If developing SQL queries is part of your job, then it is definitely worth knowing the SQL model clause. Once you’ve gotten over the initial fear of the new syntax and learned the basics, you can save yourself from writing procedural code and creating auxiliary database objects by doing more in SQL itself. You may end up with shorter, more readable and sometimes faster code. This article shows a couple of uses of the model clause for fairly common types of queries, and discusses in each case whether the model clause solution is preferable to alternative queries - if any - in terms of readability and performance. By the end you should more easily recognize situations in your daily work where you can apply the model clause.

Background / Overview

The SQL model clause was introduced in Oracle10g Release 1, back in 2003. According to the documentation, the model clause can replace PC-based spreadsheets by creating a multidimensional array from query results and then applying formulas (called rules) to this array to calculate new values. However, judging from the various Oracle forums, it is not used very often yet. And when it is used, people are generally not doing spreadsheet-like calculations with it. The types of queries for which the model clause does get used are discussed in this article. These types of queries are:
  • Forecasting queries
  • Row generation
  • Variable number of calculations based on calculated values
  • Complex algorithms
  • String aggregation
Of course, the possibilities of the model clause are certainly not limited to this list.

If you are not familiar with the SQL model clause yet, then you can catch up by reading Chapter 22 of the Data Warehousing Guide. You can also read the step-by-step tutorials on my weblog here and here.

Forecasting queries

This type of query is used throughout the aforementioned chapter 22 of the Oracle Database Data Warehousing Guide. The basis for this type of query is typically a sales table or view, and the question to be answered is what future sales will look like. For example, let’s have a look at the SALES table in figure 1.

PRODUCT       YEAR     AMOUNT
------- ---------- ----------
paper         2006          2
pen           2006         80
staples       2006         18
paper         2007          4
pen           2007        100
staples       2007         30

Now suppose you want the 2008 sales forecast to be the 2007 sales multiplied by the same growth factor as from 2006 to 2007. As a formula:

sales[2008] = sales[2007] * ( sales[2007] / sales[2006] )

Using the model clause, such a question can be answered like this:
SQL> select product
 2       , year
 3       , amount
 4    from sales
 5   model
 6         partition by (product)
 7         dimension by (year)
 8         measures (amount)
 9         rules
10         ( amount[2008] = amount[2007] * amount[2007] / amount[2006]
11         )
12   order by year
13       , product
14  /

PRODUCT       YEAR     AMOUNT
------- ---------- ----------
paper         2006          2
pen           2006         80
staples       2006         18
paper         2007          4
pen           2007        100
staples       2007         30
paper         2008          8
pen           2008        125
staples       2008         50

Using the model clause for such a query is not mandatory; you can achieve the same result with a UNION ALL, calculating next year's sales in a separate query with its own extra table access. Or you can use grouping sets. But if you write those statements, you'll see they become messy and complicated quite fast. And what if you want a sales forecast for 2009 as well, for example by multiplying the 2008 sales by 2? With the model clause, you just add an extra rule:

amount[2009] = amount[2008] * 2

So forecasting queries are the example in chapter 22 of the Data Warehousing Guide for a reason: the SQL model clause calculates them without much ado.
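Readers who want to verify the forecast figures outside the database can mirror the rules in a few lines of Python. This sketch is purely illustrative and not part of any Oracle API; the data is copied from figure 1:

```python
# Mirror of the forecasting rules, per product (= per partition):
# sales[2008] = sales[2007] * (sales[2007] / sales[2006])
# sales[2009] = sales[2008] * 2
sales = {
    "paper":   {2006: 2,  2007: 4},
    "pen":     {2006: 80, 2007: 100},
    "staples": {2006: 18, 2007: 30},
}

for product, s in sales.items():
    # The model clause applies each rule once per partition.
    s[2008] = s[2007] * s[2007] / s[2006]
    s[2009] = s[2008] * 2

print(sales["paper"][2008], sales["pen"][2008], sales["staples"][2008])
# 8.0 125.0 50.0 -- matching the 2008 rows of the query output
```

The loop body corresponds one-to-one with the two rules in the rules clause.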

Row generation

For many years, you generated rows by selecting from a table or view that you knew contained at least the number of rows you wanted to generate. Usually the all_objects view was used, like this:
select rownum
  from all_objects
 where rownum <= 100000

Then Mikito Harakiri came up with this simple row generating query against dual:
 select level
   from dual
connect by level <= 100000

And with the model clause there is another alternative:
select i
  from dual
 model
       dimension by (1 i)
       measures (0 x)
       (x[for i from 2 to 100000 increment 1] = 0)

And this is a tkprof snippet of a 10046 level 8 trace of the three variants generating 100,000 rows:
select rownum
  from all_objects
 where rownum <= 100000

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.14       0.24          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     4564      9.90      12.50        557     179849          0       68433
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     4566     10.04      12.75        557     179849          0       68433

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88  

Rows     Row Source Operation
-------  ---------------------------------------------------
  68433  COUNT STOPKEY (cr=186696 pr=636 pw=0 time=61522 us)
  68433   FILTER  (cr=186696 pr=636 pw=0 time=61522 us)
  70174    HASH JOIN  (cr=5383 pr=0 pw=0 time=0 us cost=44 size=1198062 card=11862)
     91     INDEX FULL SCAN I_USER2 (cr=1 pr=0 pw=0 time=0 us cost=1 size=87 card=29)(object id 47)
  70174     HASH JOIN  (cr=5382 pr=0 pw=0 time=0 us cost=42 size=1162476 card=11862)
     91      INDEX FULL SCAN I_USER2 (cr=1 pr=0 pw=0 time=0 us cost=1 size=638 card=29)(object id 47)
  70174      TABLE ACCESS FULL OBJ$ (cr=5381 pr=0 pw=0 time=0 us cost=41 size=901512 card=11862)
   3309    TABLE ACCESS BY INDEX ROWID IND$ (cr=1616 pr=5 pw=0 time=0 us cost=2 size=8 card=1)
   4103     INDEX UNIQUE SCAN I_IND1 (cr=601 pr=4 pw=0 time=0 us cost=1 size=0 card=1)(object id 41)
  25305    HASH JOIN  (cr=43900 pr=98 pw=0 time=0 us cost=3 size=24 card=1)
  25702     INDEX RANGE SCAN I_OBJAUTH1 (cr=43900 pr=98 pw=0 time=0 us cost=2 size=11 card=1)(object id 62)
  28897     FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
      1    FIXED TABLE FULL X$KZSPR (cr=8 pr=1 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    NESTED LOOPS  (cr=88452 pr=386 pw=0 time=0 us)
      0     NESTED LOOPS  (cr=88452 pr=386 pw=0 time=0 us cost=6 size=70 card=1)
      0      NESTED LOOPS  (cr=88452 pr=386 pw=0 time=0 us cost=4 size=60 card=1)
  49740       NESTED LOOPS  (cr=83309 pr=386 pw=0 time=0 us cost=3 size=49 card=1)
  49740        MERGE JOIN CARTESIAN (cr=80820 pr=386 pw=0 time=0 us cost=2 size=46 card=1)
   2487         INDEX RANGE SCAN I_OBJ5 (cr=80820 pr=386 pw=0 time=0 us cost=2 size=33 card=1)(object id 40)
  49740         BUFFER SORT (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
  49740          FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
  49740        INDEX RANGE SCAN I_USER2 (cr=2489 pr=0 pw=0 time=0 us cost=1 size=3 card=1)(object id 47)
      0       INDEX RANGE SCAN I_OBJAUTH1 (cr=5143 pr=0 pw=0 time=0 us cost=1 size=11 card=1)(object id 62)
      0      INDEX RANGE SCAN I_DEPENDENCY1 (cr=0 pr=0 pw=0 time=0 us cost=1 size=0 card=3)(object id 106)
      0     TABLE ACCESS BY INDEX ROWID DEPENDENCY$ (cr=0 pr=0 pw=0 time=0 us cost=2 size=10 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
  23864    HASH JOIN  (cr=28167 pr=0 pw=0 time=0 us cost=3 size=24 card=1)
  24116     INDEX RANGE SCAN I_OBJAUTH1 (cr=28167 pr=0 pw=0 time=0 us cost=2 size=11 card=1)(object id 62)
  26105     FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
     99    NESTED LOOPS  (cr=1761 pr=0 pw=0 time=0 us cost=2 size=48 card=2)
    120     INDEX RANGE SCAN I_OBJAUTH1 (cr=1761 pr=0 pw=0 time=0 us cost=2 size=11 card=1)(object id 62)
     99     FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
      0    NESTED LOOPS  (cr=7012 pr=68 pw=0 time=0 us cost=6 size=70 card=1)
   4820     MERGE JOIN CARTESIAN (cr=6454 pr=68 pw=0 time=0 us cost=5 size=59 card=1)
    241      NESTED LOOPS  (cr=6454 pr=68 pw=0 time=0 us)
   1458       NESTED LOOPS  (cr=6205 pr=28 pw=0 time=0 us cost=5 size=46 card=1)
    249        NESTED LOOPS  (cr=5684 pr=0 pw=0 time=0 us cost=3 size=36 card=1)
    249         INDEX RANGE SCAN I_OBJ5 (cr=5443 pr=0 pw=0 time=0 us cost=2 size=33 card=1)(object id 40)
    249         INDEX RANGE SCAN I_USER2 (cr=241 pr=0 pw=0 time=0 us cost=1 size=3 card=1)(object id 47)
   1458        INDEX RANGE SCAN I_DEPENDENCY1 (cr=521 pr=28 pw=0 time=0 us cost=1 size=0 card=3)(object id 106)
    241       TABLE ACCESS BY INDEX ROWID DEPENDENCY$ (cr=249 pr=40 pw=0 time=0 us cost=2 size=10 card=1)
   4820      BUFFER SORT (cr=0 pr=0 pw=0 time=0 us cost=3 size=1300 card=100)
   4820       FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
      0     INDEX RANGE SCAN I_OBJAUTH1 (cr=558 pr=0 pw=0 time=0 us cost=1 size=11 card=1)(object id 62)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      4    NESTED LOOPS  (cr=2498 pr=0 pw=0 time=0 us cost=2 size=72 card=2)
     10     NESTED LOOPS  (cr=2498 pr=0 pw=0 time=0 us cost=2 size=23 card=1)
    461      TABLE ACCESS BY INDEX ROWID TRIGGER$ (cr=1967 pr=0 pw=0 time=0 us cost=1 size=12 card=1)
    488       INDEX UNIQUE SCAN I_TRIGGER2 (cr=1479 pr=0 pw=0 time=0 us cost=0 size=0 card=1)(object id 160)
     10      INDEX RANGE SCAN I_OBJAUTH1 (cr=531 pr=0 pw=0 time=0 us cost=1 size=11 card=1)(object id 62)
      4     FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
     92    VIEW  (cr=0 pr=0 pw=0 time=0 us cost=2 size=13 card=1)
     92     FAST DUAL  (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)
      2    NESTED LOOPS  (cr=1060 pr=0 pw=0 time=0 us cost=2 size=42 card=2)
      4     INDEX RANGE SCAN I_OBJAUTH1 (cr=1060 pr=0 pw=0 time=0 us cost=2 size=8 card=1)(object id 62)
      2     FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
      1    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us cost=2 size=42 card=2)
      0     INDEX RANGE SCAN I_OBJAUTH1 (cr=0 pr=0 pw=0 time=0 us cost=2 size=8 card=1)(object id 62)
      0     FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
      0    FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
      0    VIEW  (cr=0 pr=0 pw=0 time=0 us cost=1 size=16 card=1)
      0     SORT GROUP BY (cr=0 pr=0 pw=0 time=0 us cost=1 size=86 card=1)
      0      NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us cost=1 size=86 card=1)
      0       MERGE JOIN CARTESIAN (cr=0 pr=0 pw=0 time=0 us cost=0 size=78 card=1)
      0        NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us cost=0 size=65 card=1)
      0         INDEX UNIQUE SCAN I_OLAP_CUBES$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)(object id 899)
      0         TABLE ACCESS BY INDEX ROWID OLAP_DIMENSIONALITY$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=52 card=1)
      0          INDEX RANGE SCAN I_OLAP_DIMENSIONALITY$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=0 card=1)(object id 903)
      0        BUFFER SORT (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)
      0         INDEX FULL SCAN I_OLAP_CUBE_DIMENSIONS$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)(object id 887)
      0       INDEX RANGE SCAN I_OBJ1 (cr=0 pr=0 pw=0 time=0 us cost=1 size=8 card=1)(object id 36)
      0    NESTED LOOPS  (cr=0 pr=0 pw=0 time=0 us cost=2 size=27 card=1)
      0     INDEX FULL SCAN I_USER2 (cr=0 pr=0 pw=0 time=0 us cost=1 size=19 card=1)(object id 47)
      0     INDEX RANGE SCAN I_OBJ4 (cr=0 pr=0 pw=0 time=0 us cost=1 size=8 card=1)(object id 39)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    4564        0.00          0.02
  SQL*Net message from client                  4564        0.00          0.80
  db file sequential read                       557        0.07          2.36
********************************************************************************

 select level
   from dual
connect by level <= 100000

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     6668      0.31       0.36          0          0          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670      0.31       0.36          0          0          0      100000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88  

Rows     Row Source Operation
-------  ---------------------------------------------------
 100000  CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=0 us)
      1   FAST DUAL  (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    6668        0.00          0.02
  SQL*Net message from client                  6668        0.00          0.75
********************************************************************************

select i
  from dual
 model
       dimension by (1 i)
       measures (0 x)
       (x[for i from 2 to 100000 increment 1] = 0)

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     6668      3.39       3.19          0          0          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6670      3.39       3.19          0          0          0      100000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88  

Rows     Row Source Operation
-------  ---------------------------------------------------
 100000  SQL MODEL ORDERED (cr=0 pr=0 pw=0 time=0 us)
      1   FAST DUAL  (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    6668        0.00          0.02
  SQL*Net message from client                  6668        0.00          0.75



********************************************************************************

The model clause query for generating rows is faster than the good old query against all_objects, but much slower than the "connect by level" query. It is therefore not worthwhile to use the SQL model clause for pure row generation. However, for problems that reduce to row generation, the model clause can certainly be an option. One example is splitting periods into individual days in a calendar, for instance when one row with startdate January 1st and enddate January 5th has to be split into five individual rows. Another example is splitting strings into individual words:

SQL> create table t (id,str)
  2  as
  3  select 1, 'OGh Visie' from dual union all
  4  select 2, 'Oracle Gebruikersclub Holland' from dual union all
  5  select 3, null from dual union all
  6  select 4, 'OGh' from dual
  7  /

Table created.

To split these strings into individual words, you can use one of the following two queries:

SQL> select id
  2       , i seqnr
  3       , str word
  4    from t
  5   model
  6         return updated rows
  7         partition by (id)
  8         dimension by (0 i)
  9         measures (str)
 10         ( str[for i from 1 to regexp_count(str[0],' ')+1 increment 1]
 11           = regexp_substr(str[0],'[^ ]+',1,cv(i))
 12         )
 13   order by id
 14       , seqnr
 15  /

        ID      SEQNR WORD
---------- ---------- -----------------------------
         1          1 OGh
         1          2 Visie
         2          1 Oracle
         2          2 Gebruikersclub
         2          3 Holland
         4          1 OGh

6 rows selected.

SQL> select id
  2       , n seqnr
  3       , regexp_substr(str,'[^ ]+',1,n) word
  4    from t
  5       , ( select level n
  6             from ( select max(regexp_count(str,' '))+1 max_#words
  7                      from t
  8                  )
  9          connect by level <= max_#words
 10         )
 11   where n <= regexp_count(str,' ')+1
 12   order by id
 13       , seqnr
 14  /

        ID      SEQNR WORD
---------- ---------- -----------------------------
         1          1 OGh
         1          2 Visie
         2          1 Oracle
         2          2 Gebruikersclub
         2          3 Holland
         4          1 OGh

6 rows selected.

You can see how they are based on the pure row generating queries described earlier. In both solutions, "regexp_count(str,' ')+1" calculates the number of words, by counting the spaces and adding one, and "regexp_substr(str,'[^ ]+',1,n)" extracts the n-th word from the string.

Especially for the "connect by level" query, more variants are possible. All of them look pretty cryptic, making them hard to maintain. The SQL model clause query maintains its readability by using the partition clause.
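As an aside, the splitting logic itself is easy to follow when mirrored outside SQL. Here is an illustrative Python sketch (not from the original article), where re.findall plays the role that regexp_substr plays in the queries:

```python
import re

# Rows of table t: (id, str); id 3 has a NULL string.
t = [(1, "OGh Visie"),
     (2, "Oracle Gebruikersclub Holland"),
     (3, None),
     (4, "OGh")]

result = []
for id_, s in t:
    # regexp_substr(str,'[^ ]+',1,n) returns the n-th word;
    # re.findall returns all words at once. A NULL string yields no rows.
    for seqnr, word in enumerate(re.findall(r"[^ ]+", s or ""), start=1):
        result.append((id_, seqnr, word))

print(result)
# [(1, 1, 'OGh'), (1, 2, 'Visie'), (2, 1, 'Oracle'),
#  (2, 2, 'Gebruikersclub'), (2, 3, 'Holland'), (4, 1, 'OGh')]
```

Note that, just as in both SQL solutions, the row with the NULL string produces no output rows.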

With table t enlarged 10,000 times, the tkprof file looks like this:

select id
     , i seqnr
     , str word
  from t
 model
       return updated rows
       partition by (id)
       dimension by (0 i)
       measures (str)
       ( str[for i from 1 to regexp_count(str[0],' ')+1 increment 1]
         = regexp_substr(str[0],'[^ ]+',1,cv(i))
       )
 order by id
     , seqnr

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     4001      1.56       1.58          0        142          0       60000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     4003      1.56       1.58          0        142          0       60000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88  

Rows     Row Source Operation
-------  ---------------------------------------------------
  60000  SORT ORDER BY (cr=142 pr=0 pw=0 time=0 us cost=286 size=640000 card=40000)
  60000   SQL MODEL ORDERED (cr=142 pr=0 pw=0 time=0 us cost=286 size=640000 card=40000)
  40000    TABLE ACCESS FULL T (cr=142 pr=0 pw=0 time=0 us cost=69 size=640000 card=40000)

********************************************************************************

select id
     , n seqnr
     , regexp_substr(str,'[^ ]+',1,n) word
  from t
     , ( select level n
           from ( select max(regexp_count(str,' '))+1 max_#words
                    from t
                )
        connect by level <= max_#words
       )
 where n <= regexp_count(str,' ')+1
 order by id
     , seqnr

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     4001      1.32       1.29          0        568          0       60000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     4003      1.32       1.29          0        568          0       60000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88  

Rows     Row Source Operation
-------  ---------------------------------------------------
  60000  SORT ORDER BY (cr=568 pr=0 pw=0 time=0 us cost=140 size=58000 card=2000)
  60000   NESTED LOOPS  (cr=568 pr=0 pw=0 time=20508 us cost=139 size=58000 card=2000)
      3    VIEW  (cr=142 pr=0 pw=0 time=0 us cost=69 size=13 card=1)
      3     CONNECT BY WITHOUT FILTERING (cr=142 pr=0 pw=0 time=0 us)
      1      VIEW  (cr=142 pr=0 pw=0 time=0 us cost=69 size=13 card=1)
      1       SORT AGGREGATE (cr=142 pr=0 pw=0 time=0 us)
  40000        TABLE ACCESS FULL T (cr=142 pr=0 pw=0 time=0 us cost=69 size=480000 card=40000)
  60000    TABLE ACCESS FULL T (cr=426 pr=0 pw=0 time=20508 us cost=69 size=32000 card=2000)

********************************************************************************

Again, the "connect by level" trick wins, but this time it's much closer. So you may let readability prevail.
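The other inferred problem mentioned earlier, splitting a period into individual days, boils down to the same row generation. A small Python sketch of the intended result (illustrative only; the dates are the example from the text):

```python
from datetime import date, timedelta

# One row with a start date and an end date...
startdate, enddate = date(2009, 1, 1), date(2009, 1, 5)

# ...split into one row per day, which is what the model clause's
# "for ... from ... to ... increment" construct generates in SQL.
days = [startdate + timedelta(days=n)
        for n in range((enddate - startdate).days + 1)]

print(len(days))  # 5 rows: January 1st through January 5th
```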

Variable number of calculations based on calculated values

There are queries out there that seem impossible to solve in plain SQL, but for which the model clause provides an elegant solution. This is the class of problems where calculations have to be performed repeatedly on previously calculated values. An example says more than a thousand words, so here is one:
SQL> select * from deposits
  2  /

  CUSTOMER     AMOUNT DEPOSIT_DATE
---------- ---------- -------------------
         1       1000 01-01-2003 00:00:00
         1        200 01-01-2004 00:00:00
         1        500 01-01-2005 00:00:00
         1        100 01-01-2006 00:00:00
         1        800 01-01-2007 00:00:00
         2         20 01-01-2003 00:00:00
         2        150 01-01-2004 00:00:00
         2         60 01-01-2005 00:00:00
         2        100 01-01-2006 00:00:00
         2        100 01-01-2007 00:00:00

10 rows selected.

SQL> select * from interest_rates
  2  /

STARTDATE           PERCENTAGE
------------------- ----------
01-01-2003 00:00:00          5
01-01-2004 00:00:00        3.2
01-01-2005 00:00:00        4.1
01-01-2006 00:00:00        5.8
01-01-2007 00:00:00        4.9

5 rows selected.

The question is: "What's the balance at the end of each year?". For customer 1 at the end of 2003 it is: 1000 * 1.05 = 1050. At the end of 2004 it is (1050 + 200) * 1.032 = 1290, et cetera. This is what I mean by calculations based on calculated values that you have to perform a variable number of times. Using the model clause, this query becomes relatively easy:

SQL> select customer
  2       , amount
  3       , deposit_date
  4       , percentage
  5       , balance balance_at_end_of_year
  6    from deposits s
  7       , interest_rates r
  8   where s.deposit_date = r.startdate
  9   model
 10         partition by (s.customer)
 11         dimension by (s.deposit_date)
 12         measures (s.amount, r.percentage, 0 balance)
 13         rules
 14         ( balance[any] order by deposit_date
 15           = round
 16             (   (nvl(balance[add_months(cv(),-12)],0) + amount[cv()])
 17               * (1 + percentage[cv()]/100)
 18             , 2
 19             )
 20         )
 21   order by customer
 22       , deposit_date
 23  /

  CUSTOMER     AMOUNT DEPOSIT_DATE        PERCENTAGE BALANCE_AT_END_OF_YEAR
---------- ---------- ------------------- ---------- ----------------------
         1       1000 01-01-2003 00:00:00          5                   1050
         1        200 01-01-2004 00:00:00        3.2                   1290
         1        500 01-01-2005 00:00:00        4.1                1863.39
         1        100 01-01-2006 00:00:00        5.8                2077.27
         1        800 01-01-2007 00:00:00        4.9                3018.26
         2         20 01-01-2003 00:00:00          5                     21
         2        150 01-01-2004 00:00:00        3.2                 176.47
         2         60 01-01-2005 00:00:00        4.1                 246.17
         2        100 01-01-2006 00:00:00        5.8                 366.25
         2        100 01-01-2007 00:00:00        4.9                  489.1

10 rows selected.

For this type of calculation there is hardly an alternative in SQL, except a very slow one using advanced XML functions. Before Oracle10g you'd probably resort to a procedural language such as PL/SQL, but nowadays you don't need to anymore.
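The recursion in the balance rule is easy to check by hand, or with a short Python sketch like the one below (illustrative only; the deposit and rate data are copied from the listings above):

```python
# Mirror of the balance rule, per customer (= per partition):
# balance[y] = round((balance[y-1] + amount[y]) * (1 + pct[y]/100), 2)
deposits = {
    1: [(2003, 1000), (2004, 200), (2005, 500), (2006, 100), (2007, 800)],
    2: [(2003, 20), (2004, 150), (2005, 60), (2006, 100), (2007, 100)],
}
rates = {2003: 5, 2004: 3.2, 2005: 4.1, 2006: 5.8, 2007: 4.9}

balances = {}
for customer, rows in deposits.items():
    balance = 0  # nvl(balance[add_months(cv(),-12)],0) for the first year
    for year, amount in rows:  # "order by deposit_date"
        # Each year's balance depends on the previous calculated balance.
        balance = round((balance + amount) * (1 + rates[year] / 100), 2)
        balances[(customer, year)] = balance

print(balances[(1, 2007)], balances[(2, 2007)])
# 3018.26 489.1 -- matching the last row of each customer in the output
```

The "order by deposit_date" in the rule is what guarantees that each year's balance is computed after the previous year's, exactly like the inner loop here.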

Complex algorithms

With the SQL model clause, it has become possible to solve algorithms in SQL that previously could not be expressed in it, even some of the most complex ones. An example is a query that distributes tablespaces evenly among the data files of a database, in such a way that the free space in each data file remains as equal as possible. Or even a single query that solves any sudoku puzzle when supplied with a string of 81 digits. Although these can be great fun to write, you'll have to admit that they will be a nightmare to maintain if you are not the original author. Regardless of reduced lines of code or small performance gains, a (pipelined) PL/SQL function seems much more appropriate here.

String aggregation

String aggregation is really just a special case of the category "variable number of calculations based on calculated values". But it is such a frequently asked question on Oracle forums that I will treat it separately.

String aggregation is the act of aggregating several rows into one row and "stringing up" all column values, possibly separated by a special character like a comma. An example with the well known EMP table is to produce the following output:

    DEPTNO ENAMES
---------- ----------------------------------------
        10 CLARK,KING,MILLER
        20 ADAMS,FORD,JONES,SCOTT,SMITH
        30 ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

There are lots of ways to do string aggregation:
• using a user defined aggregate function, for example Tom Kyte’s stragg function
• a hierarchical query with sys_connect_by_path
• with XML functions
• using the COLLECT function, a SQL type and a user defined function
• using the undocumented (and therefore not recommended) wmsys.wm_concat

Using the model clause it is done like this:

SQL> select deptno
  2       , rtrim(ename,',') enames
  3    from ( select deptno
  4                , ename
  5                , rn
  6             from emp
  7            model
  8                  partition by (deptno)
  9                  dimension by (row_number() over
 10                                (partition by deptno order by ename) rn
 11                               )
 12                  measures     (cast(ename as varchar2(40)) ename)
 13                  rules
 14                  ( ename[any] order by rn desc = ename[cv()]||','||ename[cv()+1]
 15                  )
 16         )
 17   where rn = 1
 18   order by deptno
 19  /

    DEPTNO ENAMES
---------- ----------------------------------------
        10 CLARK,KING,MILLER
        20 ADAMS,FORD,JONES,SCOTT,SMITH
        30 ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

Just one query, no auxiliary objects, and faster than all the previously mentioned alternatives. But nothing comes for free: the price to pay is a higher PGA memory footprint. Usually this is not a problem, provided sufficient memory is available. If it is not, you'll notice the wait events 'direct path read temp' and 'direct path write temp', and the performance gain quickly turns into a performance loss.
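The mechanics of the rule are easiest to see outside SQL: the names are visited in descending rn order, and each cell appends the already-aggregated tail to its right. A Python sketch of that mechanism (illustrative only, with the EMP names hard-coded):

```python
# Mirror of: ename[any] order by rn desc = ename[cv()]||','||ename[cv()+1]
emp = {10: ["CLARK", "KING", "MILLER"],
       20: ["ADAMS", "FORD", "JONES", "SCOTT", "SMITH"],
       30: ["ALLEN", "BLAKE", "JAMES", "MARTIN", "TURNER", "WARD"]}

enames = {}
for deptno, names in emp.items():              # one partition per deptno
    cells = names + [""]                       # ename[cv()+1] past the last row is null
    for rn in range(len(names) - 1, -1, -1):   # order by rn desc
        cells[rn] = cells[rn] + "," + cells[rn + 1]
    enames[deptno] = cells[0].rstrip(",")      # rtrim(ename, ',') on row rn = 1

print(enames[20])
# ADAMS,FORD,JONES,SCOTT,SMITH
```

Row rn = 1 ends up holding the whole aggregated string with one trailing comma, which is why the outer query filters on rn = 1 and applies rtrim.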

Conclusion

For a number of problems, the SQL model clause is a very useful extension of the SQL language. This is especially true for forecasting queries and queries with a variable number of calculations based on calculated values, including string aggregation. The model clause can also be used for solving complex algorithms in SQL and for row generation. But this compromises maintainability in the first case and performance in the second case. Nevertheless, if you want to leverage SQL's capabilities, the model clause is absolutely worth studying and using.

8 comments:

  1. Excellent articles and well explained

  2. But "String aggregation" example doesn't work - getting ORA-25137: Data value out of range error.

  3. Thanks, leeso.

    I explained the ORA-25137 in this blog post: http://rwijk.blogspot.com/2008/05/string-aggregation-with-model-clause.html. It only occurs in 10.2.0.4 and 10.2.0.5 as far as I know.

    Regards,
    Rob.

  4. Thank you very much, you're spot on, we're on 10.2.0.4.

    But instead of the workaround you suggested, I did this which also works;

    ...measures(cast(ename||',' as varchar2(40) ename)
    rules(ename[any] order by rn desc = ename[cv()] || ename[cv()+1]
    ...

    which might be more memory friendly ?

  5. Ooops, missed some closing brackets!

    ...measures(cast(ename||',' as varchar2(40)) ename)
    rules(ename[any] order by rn desc = ename[cv()] || ename[cv()+1])
    ...

    ReplyDelete
  6. That's an excellent alternative for the bug! And it's as memory friendly as the original.

    Thanks, leeso.

    ReplyDelete
  7. Rob,

    I posted the following question on the OTN Discussion Forums:
    https://forums.oracle.com/forums/thread.jspa?threadID=2351437&tstart=75

    Text version:
    -----
    Hi all,

    My question is about the Analytical functions in the model clause.
    First, the database I am using:

    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    PL/SQL Release 11.2.0.1.0 - Production
    "CORE 11.2.0.1.0 Production"
    TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
    NLSRTL Version 11.2.0.1.0 - Production



    This is my sample query:

    with test_data as
    ( select 1 id, 70 n from dual
      union all select 2 id, 60 n from dual
      union all select 3 id, 50 n from dual
      union all select 4 id, 40 n from dual
      union all select 5 id, 30 n from dual
      union all select 6 id, 20 n from dual
      union all select 7 id, 10 n from dual
    )
    select *
      from test_data
     model
     dimension by (id)
     measures (n, 0 a, 0 b, 0 c, 0 d, 0 e, 0 f, 0 g, 0 h)
     rules update
     ( -- aggregate sample
       a[any] = max(n)[any]
     , b[1]   = max(n)[any]
       -- analytic sample
     , c[any] = first_value(n) over (order by id desc)
     , d[any] = first_value(n) over (order by id asc)
     , e[1]   = first_value(n) over (order by id asc)
     , f[1]   = first_value(n) over (order by id desc)
     , e[2]   = first_value(n) over (order by id asc)
     , f[2]   = first_value(n) over (order by id desc)
     , e[3]   = first_value(n) over (partition by (1) order by id asc rows between unbounded preceding and unbounded following)
     , f[3]   = first_value(n) over (partition by (1) order by id desc rows between unbounded preceding and unbounded following)
     , e[4]   = first_value(n) over (partition by (1) order by id asc range between unbounded preceding and unbounded following)
     , f[4]   = first_value(n) over (partition by (1) order by id desc range between unbounded preceding and unbounded following)
     , g[1]   = first_value(n) over (order by id asc range between 10 preceding and 2 following)
     , h[1]   = first_value(n) over (order by id desc range between 10 preceding and 2 following)
     , g[2]   = first_value(n) over (order by id asc range between 10 preceding and 10 following)
     , h[2]   = first_value(n) over (order by id desc range between 10 preceding and 10 following)
     , g[3]   = first_value(n) over (order by id asc rows between 10 preceding and 2 following)
     , h[3]   = first_value(n) over (order by id desc rows between 10 preceding and 2 following)
     , g[4]   = first_value(n) over (order by id asc rows between 10 preceding and 10 following)
     , h[4]   = first_value(n) over (order by id desc rows between 10 preceding and 10 following)
     )
     order by id
    ;

       ID     N     A     B     C     D     E     F     G     H
    ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
        1    70    70    70    10    70    70    70    70    70
        2    60    70     0    10    70    60    60    60    60
        3    50    70     0    10    70    50    50    50    50
        4    40    70     0    10    70    40    40    40    40
        5    30    70     0    10    70     0     0     0     0
        6    20    70     0    10    70     0     0     0     0
        7    10    70     0    10    70     0     0     0     0



    In the aggregate sample columns (A and B):
    A returns the max of N.
    B[1] also returns the max of N, as I expected.

    In the analytic sample columns (C to H):
    C[any] returns the first_value of N after sorting id desc.
    D[any] returns the first_value of N after sorting id asc.

    Until now everything is as I expect. However, as seen in columns E to H,
    if we specify a dimension value (other than any), the window of the analytic function is limited to that dimension value only.

    I cannot find any documentation on this behavior.

    Can anybody explain this?
    Or is there another way to force the function to look at all records of the Model?

    Regards,

    Peter


    -----

    Do you know any documentation showing this behavior?

    ReplyDelete
  8. Peter,

    I have answered your question in the OTN thread.

    Regards,
    Rob.

    ReplyDelete