Tuesday, January 6, 2009

SQL Model Clause tutorial, part three

If you prefer to read in Dutch, then you can read the original article here (page 30-34), and if you prefer to read in German, then you can find a translated version here.

Practicalities of the SQL Model Clause

If developing SQL queries is part of your job, then it is definitely worth knowing the SQL model clause. Once you’ve gotten over the initial fear of the new syntax and learned the basics, you can save yourself writing procedural code and creating auxiliary database objects even more often, by doing more in SQL itself. You may end up with shorter, more readable and sometimes faster code. This article will show a couple of uses of the model clause for fairly common type of queries and will discuss in each case whether or not the model clause solution is suitable compared to alternative queries - if any - in terms of readability and performance. By the end you will more easily recognize situations in your daily work where you can apply the model clause.

Background / Overview

The SQL model clause was introduced with the release of Oracle10g Release 1, back in 2003. According to the documentation, the model clause can replace PC-based spreadsheets by creating a multidimensional array from query results and then apply formulas (called rules) to this array to calculate new values. However, judging from the various Oracle forums, it is not used very often yet. And when it is used, people are not doing spreadsheet like calculations with it. The types of queries the model clause does get used, are discussed in this article. These types of queries are:

  • Forecasting queries
  • Row generation
  • Variable number of calculations based on calculated values
  • Complex algorithms
  • String aggregation
Of course, the possibilities of the model clause are certainly not limited to this list.

If you are not familiar with the SQL model clause yet, then you can catch up by reading Chapter 22 of the Data Warehousing Guide. You can also read the step-by-step tutorials on my weblog here and here.

Forecasting queries

This type of query is used throughout the aforementioned chapter 22 of the Oracle Database Data Warehousing Guide. The basis for this type of query is typically a sales table or view and the question to be answered is a forecast of how future sales will look like. For example, let’s have a look at the SALES table in figure 1.

PRODUCT       YEAR     AMOUNT
------- ---------- ----------
paper 2006 2
pen 2006 80
staples 2006 18
paper 2007 4
pen 2007 100
staples 2007 30

And suppose you want to calculate the 2008 sales to be the 2007 sales multiplied by the same growth factor as from 2006 to 2007. In a formula:

sales[2008] = sales[2007] * ( sales[2007] / sales[2006] )

Using the model clause, such a question can be answered like this:
SQL> select product
2 , year
3 , amount
4 from sales
5 model
6 partition by (product)
7 dimension by (year)
8 measures (amount)
9 rules
10 ( amount[2008] = amount[2007] * amount[2007] / amount[2006]
11 )
12 order by year
13 , product
14 /

PRODUCT YEAR AMOUNT
------- ---------- ----------
paper 2006 2
pen 2006 80
staples 2006 18
paper 2007 4
pen 2007 100
staples 2007 30
paper 2008 8
pen 2008 125
staples 2008 50

Using the model clause for such a query is not mandatory; you can achieve the same result by doing a UNION ALL and calculate next year’s sales in a separate query with its own extra table access. Or you can use grouping sets. But if you write these statements, you'll see they become messy and complicated quite fast. And what if you want the sales forecast for 2009 as well, for example by multiplying the sales of 2008 by 2? With the model clause, you just add an extra rule:

amount[2009] = amount[2008] * 2

So forecasting queries are the example in chapter 22 of the Data Warehousing Guide for a reason, as the SQL model clause is able to calculate them without much ado.

Row generation

For many years, you generated rows by selecting from a table or view for which you knew it contained at least the amount of rows you wanted to generate. Usually the all_objects view was used for this, like this:
select rownum
from all_objects
where rownum <= 100000

Then Mikito Harakiri came up with this simple row generating query against dual:
 select level
from dual
connect by level <= 100000

And with the model clause there is another alternative:
select i
from dual
model
dimension by (1 i)
measures (0 x)
(x[for i from 2 to 100000 increment 1] = 0)

And this is a tkprof snippet of a 10046 level 8 trace of the three variants generating 100,000 rows:
select rownum
from all_objects
where rownum <= 100000

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.14 0.24 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 4564 9.90 12.50 557 179849 0 68433
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4566 10.04 12.75 557 179849 0 68433

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88

Rows Row Source Operation
------- ---------------------------------------------------
68433 COUNT STOPKEY (cr=186696 pr=636 pw=0 time=61522 us)
68433 FILTER (cr=186696 pr=636 pw=0 time=61522 us)
70174 HASH JOIN (cr=5383 pr=0 pw=0 time=0 us cost=44 size=1198062 card=11862)
91 INDEX FULL SCAN I_USER2 (cr=1 pr=0 pw=0 time=0 us cost=1 size=87 card=29)(object id 47)
70174 HASH JOIN (cr=5382 pr=0 pw=0 time=0 us cost=42 size=1162476 card=11862)
91 INDEX FULL SCAN I_USER2 (cr=1 pr=0 pw=0 time=0 us cost=1 size=638 card=29)(object id 47)
70174 TABLE ACCESS FULL OBJ$ (cr=5381 pr=0 pw=0 time=0 us cost=41 size=901512 card=11862)
3309 TABLE ACCESS BY INDEX ROWID IND$ (cr=1616 pr=5 pw=0 time=0 us cost=2 size=8 card=1)
4103 INDEX UNIQUE SCAN I_IND1 (cr=601 pr=4 pw=0 time=0 us cost=1 size=0 card=1)(object id 41)
25305 HASH JOIN (cr=43900 pr=98 pw=0 time=0 us cost=3 size=24 card=1)
25702 INDEX RANGE SCAN I_OBJAUTH1 (cr=43900 pr=98 pw=0 time=0 us cost=2 size=11 card=1)(object id 62)
28897 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
1 FIXED TABLE FULL X$KZSPR (cr=8 pr=1 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 NESTED LOOPS (cr=88452 pr=386 pw=0 time=0 us)
0 NESTED LOOPS (cr=88452 pr=386 pw=0 time=0 us cost=6 size=70 card=1)
0 NESTED LOOPS (cr=88452 pr=386 pw=0 time=0 us cost=4 size=60 card=1)
49740 NESTED LOOPS (cr=83309 pr=386 pw=0 time=0 us cost=3 size=49 card=1)
49740 MERGE JOIN CARTESIAN (cr=80820 pr=386 pw=0 time=0 us cost=2 size=46 card=1)
2487 INDEX RANGE SCAN I_OBJ5 (cr=80820 pr=386 pw=0 time=0 us cost=2 size=33 card=1)(object id 40)
49740 BUFFER SORT (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
49740 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
49740 INDEX RANGE SCAN I_USER2 (cr=2489 pr=0 pw=0 time=0 us cost=1 size=3 card=1)(object id 47)
0 INDEX RANGE SCAN I_OBJAUTH1 (cr=5143 pr=0 pw=0 time=0 us cost=1 size=11 card=1)(object id 62)
0 INDEX RANGE SCAN I_DEPENDENCY1 (cr=0 pr=0 pw=0 time=0 us cost=1 size=0 card=3)(object id 106)
0 TABLE ACCESS BY INDEX ROWID DEPENDENCY$ (cr=0 pr=0 pw=0 time=0 us cost=2 size=10 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
23864 HASH JOIN (cr=28167 pr=0 pw=0 time=0 us cost=3 size=24 card=1)
24116 INDEX RANGE SCAN I_OBJAUTH1 (cr=28167 pr=0 pw=0 time=0 us cost=2 size=11 card=1)(object id 62)
26105 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
99 NESTED LOOPS (cr=1761 pr=0 pw=0 time=0 us cost=2 size=48 card=2)
120 INDEX RANGE SCAN I_OBJAUTH1 (cr=1761 pr=0 pw=0 time=0 us cost=2 size=11 card=1)(object id 62)
99 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
0 NESTED LOOPS (cr=7012 pr=68 pw=0 time=0 us cost=6 size=70 card=1)
4820 MERGE JOIN CARTESIAN (cr=6454 pr=68 pw=0 time=0 us cost=5 size=59 card=1)
241 NESTED LOOPS (cr=6454 pr=68 pw=0 time=0 us)
1458 NESTED LOOPS (cr=6205 pr=28 pw=0 time=0 us cost=5 size=46 card=1)
249 NESTED LOOPS (cr=5684 pr=0 pw=0 time=0 us cost=3 size=36 card=1)
249 INDEX RANGE SCAN I_OBJ5 (cr=5443 pr=0 pw=0 time=0 us cost=2 size=33 card=1)(object id 40)
249 INDEX RANGE SCAN I_USER2 (cr=241 pr=0 pw=0 time=0 us cost=1 size=3 card=1)(object id 47)
1458 INDEX RANGE SCAN I_DEPENDENCY1 (cr=521 pr=28 pw=0 time=0 us cost=1 size=0 card=3)(object id 106)
241 TABLE ACCESS BY INDEX ROWID DEPENDENCY$ (cr=249 pr=40 pw=0 time=0 us cost=2 size=10 card=1)
4820 BUFFER SORT (cr=0 pr=0 pw=0 time=0 us cost=3 size=1300 card=100)
4820 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=1300 card=100)
0 INDEX RANGE SCAN I_OBJAUTH1 (cr=558 pr=0 pw=0 time=0 us cost=1 size=11 card=1)(object id 62)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
4 NESTED LOOPS (cr=2498 pr=0 pw=0 time=0 us cost=2 size=72 card=2)
10 NESTED LOOPS (cr=2498 pr=0 pw=0 time=0 us cost=2 size=23 card=1)
461 TABLE ACCESS BY INDEX ROWID TRIGGER$ (cr=1967 pr=0 pw=0 time=0 us cost=1 size=12 card=1)
488 INDEX UNIQUE SCAN I_TRIGGER2 (cr=1479 pr=0 pw=0 time=0 us cost=0 size=0 card=1)(object id 160)
10 INDEX RANGE SCAN I_OBJAUTH1 (cr=531 pr=0 pw=0 time=0 us cost=1 size=11 card=1)(object id 62)
4 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
92 VIEW (cr=0 pr=0 pw=0 time=0 us cost=2 size=13 card=1)
92 FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)
2 NESTED LOOPS (cr=1060 pr=0 pw=0 time=0 us cost=2 size=42 card=2)
4 INDEX RANGE SCAN I_OBJAUTH1 (cr=1060 pr=0 pw=0 time=0 us cost=2 size=8 card=1)(object id 62)
2 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
1 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 NESTED LOOPS (cr=0 pr=0 pw=0 time=0 us cost=2 size=42 card=2)
0 INDEX RANGE SCAN I_OBJAUTH1 (cr=0 pr=0 pw=0 time=0 us cost=2 size=8 card=1)(object id 62)
0 FIXED TABLE FULL X$KZSRO (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=2)
0 FIXED TABLE FULL X$KZSPR (cr=0 pr=0 pw=0 time=0 us cost=0 size=26 card=1)
0 VIEW (cr=0 pr=0 pw=0 time=0 us cost=1 size=16 card=1)
0 SORT GROUP BY (cr=0 pr=0 pw=0 time=0 us cost=1 size=86 card=1)
0 NESTED LOOPS (cr=0 pr=0 pw=0 time=0 us cost=1 size=86 card=1)
0 MERGE JOIN CARTESIAN (cr=0 pr=0 pw=0 time=0 us cost=0 size=78 card=1)
0 NESTED LOOPS (cr=0 pr=0 pw=0 time=0 us cost=0 size=65 card=1)
0 INDEX UNIQUE SCAN I_OLAP_CUBES$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)(object id 899)
0 TABLE ACCESS BY INDEX ROWID OLAP_DIMENSIONALITY$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=52 card=1)
0 INDEX RANGE SCAN I_OLAP_DIMENSIONALITY$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=0 card=1)(object id 903)
0 BUFFER SORT (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)
0 INDEX FULL SCAN I_OLAP_CUBE_DIMENSIONS$ (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)(object id 887)
0 INDEX RANGE SCAN I_OBJ1 (cr=0 pr=0 pw=0 time=0 us cost=1 size=8 card=1)(object id 36)
0 NESTED LOOPS (cr=0 pr=0 pw=0 time=0 us cost=2 size=27 card=1)
0 INDEX FULL SCAN I_USER2 (cr=0 pr=0 pw=0 time=0 us cost=1 size=19 card=1)(object id 47)
0 INDEX RANGE SCAN I_OBJ4 (cr=0 pr=0 pw=0 time=0 us cost=1 size=8 card=1)(object id 39)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 4564 0.00 0.02
SQL*Net message from client 4564 0.00 0.80
db file sequential read 557 0.07 2.36
********************************************************************************

select level
from dual
connect by level <= 100000

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 6668 0.31 0.36 0 0 0 100000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 6670 0.31 0.36 0 0 0 100000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88

Rows Row Source Operation
------- ---------------------------------------------------
100000 CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=0 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 6668 0.00 0.02
SQL*Net message from client 6668 0.00 0.75
********************************************************************************

select i
from dual
model
dimension by (1 i)
measures (0 x)
(x[for i from 2 to 100000 increment 1] = 0)

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 6668 3.39 3.19 0 0 0 100000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 6670 3.39 3.19 0 0 0 100000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88

Rows Row Source Operation
------- ---------------------------------------------------
100000 SQL MODEL ORDERED (cr=0 pr=0 pw=0 time=0 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 6668 0.00 0.02
SQL*Net message from client 6668 0.00 0.75



********************************************************************************

The model clause query for generating rows is faster than the good old query against all_objects, but much slower than the the "connect by level" query. It is therefore not worthwhile to use the SQL model clause for pure row generating. However, for inferred problems of row generation, the model clause can certainly be an option. An example of such an inferred problems is splitting periods into individual days in a calendar, for example when 1 row containing a startdate January 1st and enddate January 5th, have to be split into 5 individual rows. Another example is splitting strings into individual words:

SQL> create table t (id,str)
2 as
3 select 1, 'OGh Visie' from dual union all
4 select 2, 'Oracle Gebruikersclub Holland' from dual union all
5 select 3, null from dual union all
6 select 4, 'OGh' from dual
7 /

Table created.

To split the individual words from these strings you can use one of the next two queries:

SQL> select id
2 , i seqnr
3 , str word
4 from t
5 model
6 return updated rows
7 partition by (id)
8 dimension by (0 i)
9 measures (str)
10 ( str[for i from 1 to regexp_count(str[0],' ')+1 increment 1]
11 = regexp_substr(str[0],'[^ ]+',1,cv(i))
12 )
13 order by id
14 , seqnr
15 /

ID SEQNR WORD
---------- ---------- -----------------------------
1 1 OGh
1 2 Visie
2 1 Oracle
2 2 Gebruikersclub
2 3 Holland
4 1 OGh

6 rows selected.

SQL> select id
2 , n seqnr
3 , regexp_substr(str,'[^ ]+',1,n) word
4 from t
5 , ( select level n
6 from ( select max(regexp_count(str,' '))+1 max_#words
7 from t
8 )
9 connect by level <= max_#words
10 )
11 where n <= regexp_count(str,' ')+1
12 order by id
13 , seqnr
14 /

ID SEQNR WORD
---------- ---------- -----------------------------
1 1 OGh
1 2 Visie
2 1 Oracle
2 2 Gebruikersclub
2 3 Holland
4 1 OGh

6 rows selected.

You can see how they are based on the pure row generating queries described earlier. In these solutions "regexp_count(str,' ')+1" calculates the number of words, by calculating the number of spaces and adding one. "regexp_substr(str,'[^ ]+',1,n)" gets the n-th word out of the string.

Especially for the "connect by level" query, more variants are possible. All of them look pretty cryptic, making them hard to maintain. The SQL model clause query maintains its readability by using the partition clause.

With a 10,000 times enlarged table t, the tkprof file looks like this:

select id
, i seqnr
, str word
from t
model
return updated rows
partition by (id)
dimension by (0 i)
measures (str)
( str[for i from 1 to regexp_count(str[0],' ')+1 increment 1]
= regexp_substr(str[0],'[^ ]+',1,cv(i))
)
order by id
, seqnr

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 4001 1.56 1.58 0 142 0 60000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4003 1.56 1.58 0 142 0 60000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88

Rows Row Source Operation
------- ---------------------------------------------------
60000 SORT ORDER BY (cr=142 pr=0 pw=0 time=0 us cost=286 size=640000 card=40000)
60000 SQL MODEL ORDERED (cr=142 pr=0 pw=0 time=0 us cost=286 size=640000 card=40000)
40000 TABLE ACCESS FULL T (cr=142 pr=0 pw=0 time=0 us cost=69 size=640000 card=40000)

********************************************************************************

select id
, n seqnr
, regexp_substr(str,'[^ ]+',1,n) word
from t
, ( select level n
from ( select max(regexp_count(str,' '))+1 max_#words
from t
)
connect by level <= max_#words
)
where n <= regexp_count(str,' ')+1
order by id
, seqnr

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 4001 1.32 1.29 0 568 0 60000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4003 1.32 1.29 0 568 0 60000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88

Rows Row Source Operation
------- ---------------------------------------------------
60000 SORT ORDER BY (cr=568 pr=0 pw=0 time=0 us cost=140 size=58000 card=2000)
60000 NESTED LOOPS (cr=568 pr=0 pw=0 time=20508 us cost=139 size=58000 card=2000)
3 VIEW (cr=142 pr=0 pw=0 time=0 us cost=69 size=13 card=1)
3 CONNECT BY WITHOUT FILTERING (cr=142 pr=0 pw=0 time=0 us)
1 VIEW (cr=142 pr=0 pw=0 time=0 us cost=69 size=13 card=1)
1 SORT AGGREGATE (cr=142 pr=0 pw=0 time=0 us)
40000 TABLE ACCESS FULL T (cr=142 pr=0 pw=0 time=0 us cost=69 size=480000 card=40000)
60000 TABLE ACCESS FULL T (cr=426 pr=0 pw=0 time=20508 us cost=69 size=32000 card=2000)

********************************************************************************

Again, the "connect by level" trick wins, but this time it's much closer. So you may let readability prevail.

Variable number of calculations based on calculated values

There are queries out there that just seem impossible to do with just SQL, where a model clause can provide a solution elegantly. This is a class of problems where calculations have to be done on calculated values repeatedly. An example says more than a thousand words, so here is one:
SQL> select * from deposits
2 /

CUSTOMER AMOUNT DEPOSIT_DATE
---------- ---------- -------------------
1 1000 01-01-2003 00:00:00
1 200 01-01-2004 00:00:00
1 500 01-01-2005 00:00:00
1 100 01-01-2006 00:00:00
1 800 01-01-2007 00:00:00
2 20 01-01-2003 00:00:00
2 150 01-01-2004 00:00:00
2 60 01-01-2005 00:00:00
2 100 01-01-2006 00:00:00
2 100 01-01-2007 00:00:00

10 rows selected.

SQL> select * from interest_rates
2 /

STARTDATE PERCENTAGE
------------------- ----------
01-01-2003 00:00:00 5
01-01-2004 00:00:00 3.2
01-01-2005 00:00:00 4.1
01-01-2006 00:00:00 5.8
01-01-2007 00:00:00 4.9

5 rows selected.

The question is: "What's the balance at the end of each year?". For customer 1 at the end of 2003 it is: 1000 * 1,05 = 1050. At the end of 2004 this is (1050 + 200) * 1,032 = 1290, et cetera. This is what I mean by calculations based on calculated values, that you have to do a variable number of times. Using the model clause, this query becomes relatively easy, like this:

SQL> select customer
2 , amount
3 , deposit_date
4 , percentage
5 , balance balance_at_end_of_year
6 from deposits s
7 , interest_rates r
8 where s.deposit_date = r.startdate
9 model
10 partition by (s.customer)
11 dimension by (s.deposit_date)
12 measures (s.amount, r.percentage, 0 balance)
13 rules
14 ( balance[any] order by deposit_date
15 = round
16 ( (nvl(balance[add_months(cv(),-12)],0) + amount[cv()])
17 * (1 + percentage[cv()]/100)
18 , 2
19 )
20 )
21 order by customer
22 , deposit_date
23 /

CUSTOMER AMOUNT DEPOSIT_DATE PERCENTAGE BALANCE_AT_END_OF_YEAR
---------- ---------- ------------------- ---------- ----------------------
1 1000 01-01-2003 00:00:00 5 1050
1 200 01-01-2004 00:00:00 3.2 1290
1 500 01-01-2005 00:00:00 4.1 1863.39
1 100 01-01-2006 00:00:00 5.8 2077.27
1 800 01-01-2007 00:00:00 4.9 3018.26
2 20 01-01-2003 00:00:00 5 21
2 150 01-01-2004 00:00:00 3.2 176.47
2 60 01-01-2005 00:00:00 4.1 246.17
2 100 01-01-2006 00:00:00 5.8 366.25
2 100 01-01-2007 00:00:00 4.9 489.1

10 rows selected.

For doing this type of calculations in SQL, there is hardly an alternative, except a very slow one using advanced XML functions. Before Oracle10 you'd probably resort to a procedural language such as PL/SQL. But nowadays you don't need to anymore.

Complex algorithms

With the SQL model clause, it has become possible to solve algorithms, that previously could not be done in SQL. Even some of the most complex ones. An example is a query to distribute tablespaces evenly among data files of a database, in such a way that the free space in each data file remains as equal as possible. Or even a single query to solve each sudoku puzzle when supplied with a string of 81 digits. Although it can be great fun to write, you'll have to admit that these will be a nightmare to maintain if you are not the original author. Regardless of reduced lines of code or small performance gains, a (pipelined) PL/SQL function seems much more appropriate here.

String aggregation

String aggregation is really just a special case of the category "variable number of calculations based on calculations". But it is such a frequently asked question on Oracle forums, that I will treat it separately.

String aggregation is the act of aggregating several rows into one row and "stringing up" all column values, possibly separated by a special character like a comma. An example with the well known EMP table is to produce the following output:

    DEPTNO ENAMES
---------- ----------------------------------------
10 CLARK,KING,MILLER
20 ADAMS,FORD,JONES,SCOTT,SMITH
30 ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

There are lots of ways to do string aggregation:
• using a user defined aggregate function, for example Tom Kyte’s stragg function,
• a hierarchical query with sys_connect_by_path
• with XML-functions
• using the COLLECT function, a SQL type and a user defined function
• using the undocumented (and therefore not recommended) wmsys.wm_concat

Using the model clause it is done like this:

SQL> select deptno
2 , rtrim(ename,',') enames
3 from ( select deptno
4 , ename
5 , rn
6 from emp
7 model
8 partition by (deptno)
9 dimension by (row_number() over
10 (partition by deptno order by ename) rn
11 )
12 measures (cast(ename as varchar2(40)) ename)
13 rules
14 ( ename[any] order by rn desc = ename[cv()]||','||ename[cv()+1]
15 )
16 )
17 where rn = 1
18 order by deptno
19 /

DEPTNO ENAMES
---------- ----------------------------------------
10 CLARK,KING,MILLER
20 ADAMS,FORD,JONES,SCOTT,SMITH
30 ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD

Just one query, no auxiliary objects and faster than all other previously mentioned alternatives. Nothing comes for free: the price to pay is a higher PGA memory footprint. Usually this is not a problem, if sufficient memory is available of course. If this is not the case, you'll notice the wait events 'direct path read temp' and 'direct path write temp' and the performance gain quickly turns around and becomes a performance loss.

Conclusion

For a number of problems, the SQL model clause is a very useful extension of the SQL language. This is especially true for forecasting queries and queries with a variable number of calculations based on calculated values, including string aggregation. The model clause can also be used for solving complex algorithms in SQL and for row generation. But this compromises maintainability in the first case and performance in the second case. Nevertheless, if you want to leverage SQL's capabilities, the model clause is absolutely worth studying and using.

8 comments:

  1. Excellent articles and well explained

    ReplyDelete
  2. But "String aggregation" example doesn't work - getting ORA-25137: Data value out of range error.

    ReplyDelete
  3. Thanks, leeso.

    I explained the ORA-25137 in this blog post: http://rwijk.blogspot.com/2008/05/string-aggregation-with-model-clause.html. It only occurs in 10.2.0.4 and 10.2.0.5 as far as I know.

    Regards,
    Rob.

    ReplyDelete
  4. Thank you very much, you're spot on, we're on 10.2.0.4.

    But instead of the workaround you suggested, I did this which also works;

    ...measures(cast(ename||',' as varchar2(40) ename)
    rules(ename[any] order by rn desc = ename[cv()] || ename[cv()+1]
    ...

    which might be more memory friendly ?

    ReplyDelete
  5. Ooops, missed some closing brackets!

    ...measures(cast(ename||',' as varchar2(40)) ename)
    rules(ename[any] order by rn desc = ename[cv()] || ename[cv()+1])
    ...

    ReplyDelete
  6. That's an excellent alternative for the bug! And it's as memory friendly as the original.

    Thanks, leeso.

    ReplyDelete
  7. Rob,

    I posted the folling question on OTN Discussion Forms:
    https://forums.oracle.com/forums/thread.jspa?threadID=2351437&tstart=75

    Text version:
    -----
    Hi all,

    My question is about the Analytical functions in the model clause.
    First the data base I am using:

    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    PL/SQL Release 11.2.0.1.0 - Production
    "CORE 11.2.0.1.0 Production"
    TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
    NLSRTL Version 11.2.0.1.0 - Production



    This is my sample query:

    with test_data as
    ( select 1 id , 70 n from dual
    union all select 2 id , 60 n from dual
    union all select 3 id , 50 n from dual
    union all select 4 id , 40 n from dual
    union all select 5 id , 30 n from dual
    union all select 6 id , 20 n from dual
    union all select 7 id , 10 n from dual
    )
    select
    *

    from
    test_data
    model
    dimension by (id)
    measures (n,0 a, 0 b, 0 c, 0 d, 0 e, 0 f, 0 g,0 h)
    rules
    update
    (
    --aggregate sample
    a[any] = max(n) [any]
    ,b[1] = max(n) [any]
    --Analytic sample
    ,c[any] = first_value(n) over (order by id desc)
    ,d[any] = first_value(n) over (order by id asc)
    ,e[1] = first_value(n) over (order by id asc)
    ,f[1] = first_value(n) over (order by id desc)
    ,e[2] = first_value(n) over (order by id asc)
    ,f[2] = first_value(n) over (order by id desc)
    ,e[3] = first_value(n) over (partition by (1) order by id asc rows between unbounded preceding and unbounded following)
    ,f[3] = first_value(n) over (partition by (1) order by id desc rows between unbounded preceding and unbounded following)
    ,e[4] = first_value(n) over (partition by (1) order by id asc range between unbounded preceding and unbounded following)
    ,f[4] = first_value(n) over (partition by (1) order by id desc range between unbounded preceding and unbounded following)
    ,g[1] = first_value(n) over (order by id asc range between 10 preceding and 2 following)
    ,h[1] = first_value(n) over (order by id desc range between 10 preceding and 2 following)
    ,g[2] = first_value(n) over (order by id asc range between 10 preceding and 10 following)
    ,h[2] = first_value(n) over (order by id desc range between 10 preceding and 10 following)
    ,g[3] = first_value(n) over (order by id asc rows between 10 preceding and 2 following)
    ,h[3] = first_value(n) over (order by id desc rows between 10 preceding and 2 following)
    ,g[4] = first_value(n) over (order by id asc rows between 10 preceding and 10 following)
    ,h[4] = first_value(n) over (order by id desc rows between 10 preceding and 10 following)
    )
    order by
    id
    ;

    ID N A B C D E F G H
    ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    1 70 70 70 10 70 70 70 70 70
    2 60 70 0 10 70 60 60 60 60
    3 50 70 0 10 70 50 50 50 50
    4 40 70 0 10 70 40 40 40 40
    5 30 70 0 10 70 0 0 0 0
    6 20 70 0 10 70 0 0 0 0
    7 10 70 0 10 70 0 0 0 0



    In the aggregate sample columns (A and B):
    A returns the max of N
    B[1] returns also the max of N as I expected.

    In the Analytical sample columns (C to G):
    C[any] returns the first_value of N after sorting id asc
    D[any] returns the first_value of N after sorting id desc

    Until now everything is as I expect. However as seen in the following columns D to G.
    If we specify a dimension (other than any) the window of the Analytical is limited to that dimension only.

    I cannot find any documentation on this behavior.

    Can anybody explain this?
    Or is there another way to force the function to look at all records of the Model?

    Regards,

    Peter


    -----

    Do you know any documentation showing this behavior?

    ReplyDelete
  8. Peter,

    I have answered your question in the OTN thread.

    Regards,
    Rob.

    ReplyDelete