Wednesday, November 14, 2007

Interval based row generation

Well, I didn't have a good name for this common SQL problem, so I decided to call it "interval based row generation". If someone out there knows a better or more common name for the problem, then please say so by leaving a comment.

An example says more than a thousand words, so here is an example of what I mean by the term:

Suppose I have this table:

rwijk@ORA10GR2> create table mytable (code,start_date,end_date,value)
2 as
3 select 'group1', date '2007-01-01', date '2007-01-10', 11 from dual union all
4 select 'group2', date '2007-01-01', date '2007-01-01', 22 from dual union all
5 select 'group3', date '2007-01-01', date '2007-01-03', 33 from dual
6 /

Tabel is aangemaakt.


And I want to generate rows for all dates between and including the start_date and the end_date. The query would have to generate 10 rows for group1, 1 row for group2 and 3 rows for group3. I know of six ways to do this type of query. There may be more of course. Here are the six alternatives:

rwijk@ORA10GR2> select /* alternative 1: cartesian product with hard coded upper bound */
2 m.code
3 , m.start_date + l as day
4 , m.value
5 from mytable m
6 , (select level-1 l from dual connect by level <= 1000)
7 where l <= m.end_date - m.start_date
8 order by m.code
9 , day
10 /

CODE DAY VALUE
---------- ------------------- ----------
group1 01-01-2007 00:00:00 11
group1 02-01-2007 00:00:00 11
group1 03-01-2007 00:00:00 11
group1 04-01-2007 00:00:00 11
group1 05-01-2007 00:00:00 11
group1 06-01-2007 00:00:00 11
group1 07-01-2007 00:00:00 11
group1 08-01-2007 00:00:00 11
group1 09-01-2007 00:00:00 11
group1 10-01-2007 00:00:00 11
group2 01-01-2007 00:00:00 22
group3 01-01-2007 00:00:00 33
group3 02-01-2007 00:00:00 33
group3 03-01-2007 00:00:00 33

14 rijen zijn geselecteerd.

rwijk@ORA10GR2> select /* alternative 2: cartesian product with calculated upper bound */
2 m.code
3 , m.start_date + l as day
4 , m.value
5 from mytable m
6 , ( select level-1 l
7 from (select max(end_date-start_date+1) maxinterval from mytable)
8 connect by level <= maxinterval
9 )
10 where l <= m.end_date - m.start_date
11 order by m.code
12 , day
13 /

CODE DAY VALUE
---------- ------------------- ----------
group1 01-01-2007 00:00:00 11
group1 02-01-2007 00:00:00 11
group1 03-01-2007 00:00:00 11
group1 04-01-2007 00:00:00 11
group1 05-01-2007 00:00:00 11
group1 06-01-2007 00:00:00 11
group1 07-01-2007 00:00:00 11
group1 08-01-2007 00:00:00 11
group1 09-01-2007 00:00:00 11
group1 10-01-2007 00:00:00 11
group2 01-01-2007 00:00:00 22
group3 01-01-2007 00:00:00 33
group3 02-01-2007 00:00:00 33
group3 03-01-2007 00:00:00 33

14 rijen zijn geselecteerd.

rwijk@ORA10GR2> select /* alternative 3: hierarchical query with prior dbms_random.value */
2 m.code
3 , m.start_date + level - 1 as day
4 , m.value
5 from mytable m
6 connect by level <= m.end_date - m.start_date + 1
7 and prior m.code = m.code
8 and prior dbms_random.value is not null
9 order by m.code
10 , day
11 /

CODE DAY VALUE
---------- ------------------- ----------
group1 01-01-2007 00:00:00 11
group1 02-01-2007 00:00:00 11
group1 03-01-2007 00:00:00 11
group1 04-01-2007 00:00:00 11
group1 05-01-2007 00:00:00 11
group1 06-01-2007 00:00:00 11
group1 07-01-2007 00:00:00 11
group1 08-01-2007 00:00:00 11
group1 09-01-2007 00:00:00 11
group1 10-01-2007 00:00:00 11
group2 01-01-2007 00:00:00 22
group3 01-01-2007 00:00:00 33
group3 02-01-2007 00:00:00 33
group3 03-01-2007 00:00:00 33

14 rijen zijn geselecteerd.

rwijk@ORA10GR2> select /* alternative 4: hierarchical query with connect_by_root */
2 m.code
3 , m.start_date + level - 1 as day
4 , m.value
5 from mytable m
6 connect by connect_by_root m.code = m.code
7 and level <= m.end_date - m.start_date + 1
8 order by m.code
9 , day
10 /

CODE DAY VALUE
---------- ------------------- ----------
group1 01-01-2007 00:00:00 11
group1 02-01-2007 00:00:00 11
group1 03-01-2007 00:00:00 11
group1 04-01-2007 00:00:00 11
group1 05-01-2007 00:00:00 11
group1 06-01-2007 00:00:00 11
group1 07-01-2007 00:00:00 11
group1 08-01-2007 00:00:00 11
group1 09-01-2007 00:00:00 11
group1 10-01-2007 00:00:00 11
group2 01-01-2007 00:00:00 22
group3 01-01-2007 00:00:00 33
group3 02-01-2007 00:00:00 33
group3 03-01-2007 00:00:00 33

14 rijen zijn geselecteerd.

rwijk@ORA10GR2> select /* alternative 5: Adrian Billington's multiset */
2 m.code
3 , m.start_date + (row_number() over (partition by m.code order by null)-1) as day
4 , m.value
5 from mytable m
6 , table
7 ( cast
8 ( multiset(select null from dual connect by rownum <= (m.end_date-m.start_date)+1)
9 as sys.dbms_debug_vc2coll
10 )
11 ) t
12 order by m.code
13 , day
14 /

CODE DAY VALUE
---------- ------------------- ----------
group1 01-01-2007 00:00:00 11
group1 02-01-2007 00:00:00 11
group1 03-01-2007 00:00:00 11
group1 04-01-2007 00:00:00 11
group1 05-01-2007 00:00:00 11
group1 06-01-2007 00:00:00 11
group1 07-01-2007 00:00:00 11
group1 08-01-2007 00:00:00 11
group1 09-01-2007 00:00:00 11
group1 10-01-2007 00:00:00 11
group2 01-01-2007 00:00:00 22
group3 01-01-2007 00:00:00 33
group3 02-01-2007 00:00:00 33
group3 03-01-2007 00:00:00 33

14 rijen zijn geselecteerd.

rwijk@ORA10GR2> select /* alternative 6: model clause */
2 code
3 , day
4 , value
5 from mytable
6 model
7 partition by (code,value)
8 dimension by (0 i)
9 measures (start_date day, end_date)
10 rules
11 ( day[for i from 1 to end_date[0] - day[0] increment 1] = day[0] + cv(i)
12 )
13 order by code
14 , day
15 /

CODE DAY VALUE
---------- ------------------- ----------
group1 01-01-2007 00:00:00 11
group1 02-01-2007 00:00:00 11
group1 03-01-2007 00:00:00 11
group1 04-01-2007 00:00:00 11
group1 05-01-2007 00:00:00 11
group1 06-01-2007 00:00:00 11
group1 07-01-2007 00:00:00 11
group1 08-01-2007 00:00:00 11
group1 09-01-2007 00:00:00 11
group1 10-01-2007 00:00:00 11
group2 01-01-2007 00:00:00 22
group3 01-01-2007 00:00:00 33
group3 02-01-2007 00:00:00 33
group3 03-01-2007 00:00:00 33

14 rijen zijn geselecteerd.


I noticed something remarkable when running the same queries on my Oracle11g database. Alternatives 1, 2, 5 and 6 ran as expected, but 3 & 4 did not work anymore:

rwijk@ORA11G>  select /* alternative 3: hierarchical query with prior dbms_random.value */
2 m.code
3 , m.start_date + level - 1 as day
4 , m.value
5 from mytable m
6 connect by level <= m.end_date - m.start_date + 1
7 and prior m.code = m.code
8 and prior dbms_random.value is not null
9 order by m.code
10 , day
11 /
from mytable m
*
FOUT in regel 5:
.ORA-01436: CONNECT BY-lus in gebruikersgegevens.


rwijk@ORA11G> select /* alternative 4: hierarchical query with connect_by_root */
2 m.code
3 , m.start_date + level - 1 as day
4 , m.value
5 from mytable m
6 connect by connect_by_root m.code = m.code
7 and level <= m.end_date - m.start_date + 1
8 order by m.code
9 , day
10 /
connect by connect_by_root m.code = m.code
*
FOUT in regel 6:
.ORA-30007: Operator CONNECT BY ROOT wordt niet ondersteund in de voorwaarde START WITH of CONNECT BY.


Alternative 3 uses a trick: line 8 prevented the ORA-01436 connect by loop in user data error in versions 9 and 10, but not anymore in 11.

Alternative 4 shows a new error message introduced in Oracle11g: ORA-30007. Apparently the use of the connect_by_root function has been restricted more. I don't know the reason why, although Oracle probably will have a good reason for the restriction.

In this recent OTN-thread Adrian Billington (of the excellent oracle-developer.net site) showed me alternative 5, which I had not seen before. He also did a small performance test showing that alternative 5 is slightly better than alternative 6, the model clause solution. I decided to investigate a little more by tracing/tkprof and by comparing latches with Tom Kyte's runstats_pkg.

Here are the results of a test with this table:

create table mytable (code,start_date,end_date,value)
as
select 'group' || to_char(level)
, trunc(sysdate)
, trunc(sysdate) + power(ceil(dbms_random.value(0,30)),2)
, 'value' || to_char(level)
from dual
connect by level <= 100
/


Tkprof on Oracle11g shows:

********************************************************************************

select /* alternative 1: cartesian product with hard coded upper bound */
m.code
, m.start_date + l as day
, m.value
from mytable m
, (select level-1 l from dual connect by level <= 1000)
where l <= m.end_date - m.start_date
order by m.code
, day

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 1 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2415 0.31 0.29 0 3000 0 36197
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2417 0.31 0.29 0 3001 0 36197

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 81

Rows Row Source Operation
------- ---------------------------------------------------
36197 SORT ORDER BY (cr=3000 pr=0 pw=0 time=806 us cost=6 size=395 card=5)
36197 NESTED LOOPS (cr=3000 pr=0 pw=0 time=2220 us cost=5 size=395 card=5)
1000 VIEW (cr=0 pr=0 pw=0 time=67 us cost=2 size=13 card=1)
1000 CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=27 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)
36197 TABLE ACCESS FULL MYTABLE (cr=3000 pr=0 pw=0 time=4259 us cost=3 size=330 card=5)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2415 0.00 0.00
SQL*Net message from client 2415 0.00 0.52
********************************************************************************

select /* alternative 2: cartesian product with calculated upper bound */
m.code
, m.start_date + l as day
, m.value
from mytable m
, ( select level-1 l
from (select max(end_date-start_date+1) maxinterval from mytable)
connect by level <= maxinterval
)
where l <= m.end_date - m.start_date
order by m.code
, day

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.00 0 3 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2415 0.26 0.28 0 2706 0 36197
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2417 0.28 0.29 0 2709 0 36197

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 81

Rows Row Source Operation
------- ---------------------------------------------------
36197 SORT ORDER BY (cr=2706 pr=0 pw=0 time=797 us cost=7 size=395 card=5)
36197 NESTED LOOPS (cr=2706 pr=0 pw=0 time=2251 us cost=6 size=395 card=5)
901 VIEW (cr=3 pr=0 pw=0 time=66 us cost=3 size=13 card=1)
901 CONNECT BY WITHOUT FILTERING (cr=3 pr=0 pw=0 time=28 us)
1 VIEW (cr=3 pr=0 pw=0 time=0 us cost=3 size=13 card=1)
1 SORT AGGREGATE (cr=3 pr=0 pw=0 time=0 us)
100 TABLE ACCESS FULL MYTABLE (cr=3 pr=0 pw=0 time=2 us cost=3 size=1800 card=100)
36197 TABLE ACCESS FULL MYTABLE (cr=2703 pr=0 pw=0 time=4266 us cost=3 size=330 card=5)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2415 0.00 0.00
SQL*Net message from client 2415 0.00 0.52
********************************************************************************

select /* alternative 5: Adrian Billington's multiset */
m.code
, m.start_date + (row_number() over (partition by m.code order by null)-1) as day
, m.value
from mytable m
, table
( cast
( multiset(select null from dual connect by rownum <= (m.end_date-m.start_date)+1)
as sys.dbms_debug_vc2coll
)
) t
order by m.code
, day

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 1 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2415 0.35 0.36 0 3 0 36197
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2417 0.35 0.36 0 4 0 36197

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 81

Rows Row Source Operation
------- ---------------------------------------------------
36197 SORT ORDER BY (cr=3 pr=0 pw=0 time=749 us cost=28444 size=53908800 card=816800)
36197 WINDOW SORT (cr=3 pr=0 pw=0 time=902 us cost=28444 size=53908800 card=816800)
36197 NESTED LOOPS (cr=3 pr=0 pw=0 time=4381 us cost=2726 size=53908800 card=816800)
100 TABLE ACCESS FULL MYTABLE (cr=3 pr=0 pw=0 time=4 us cost=3 size=6600 card=100)
36197 COLLECTION ITERATOR SUBQUERY FETCH (cr=0 pr=0 pw=0 time=4111 us)
36197 COUNT (cr=0 pr=0 pw=0 time=2600 us)
36197 CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=962 us)
100 FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2415 0.00 0.00
SQL*Net message from client 2415 0.00 0.53
********************************************************************************

select /* alternative 6: model clause */
code
, day
, value
from mytable
model
partition by (code,value)
dimension by (0 i)
measures (start_date day, end_date)
rules
( day[for i from 1 to end_date[0] - day[0] increment 1] = day[0] + cv(i)
)
order by code
, day

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 1 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2415 0.46 0.51 0 3 0 36197
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2417 0.46 0.51 0 4 0 36197

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 81

Rows Row Source Operation
------- ---------------------------------------------------
36197 SORT ORDER BY (cr=3 pr=0 pw=0 time=771 us cost=4 size=6600 card=100)
36197 SQL MODEL ORDERED (cr=3 pr=0 pw=0 time=697 us cost=4 size=6600 card=100)
100 TABLE ACCESS FULL MYTABLE (cr=3 pr=0 pw=0 time=3 us cost=3 size=6600 card=100)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2415 0.00 0.00
SQL*Net message from client 2415 0.00 0.53

********************************************************************************


So, the clunky cartesian joins - alternatives 1 and 2 - seem to outperform the multiset and model clause solutions. These alternatives just generate all rows possibly needed and filter out the unneeded ones.

Next I compared alternative 2 with 5 using runstats-pkg:

rwijk@ORA11G> exec runstats_pkg.rs_stop(50)
Run1 draaide in 78 hsecs
Run2 draaide in 86 hsecs
Run1 draaide in 90,7% van de tijd

Naam Run1 Run2 Verschil
LATCH.shared pool 156 222 66
LATCH.enqueues 80 14 -66
LATCH.enqueue hash chains 82 15 -67
STAT.undo change vector size 2,732 2,804 72
STAT.redo size 3,656 3,740 84
STAT.sorts (memory) 2 102 100
STAT.workarea executions - optimal 6 107 101
STAT.no work - consistent read gets 910 9 -901
STAT.table scan blocks gotten 902 1 -901
STAT.table scans (short tables) 902 1 -901
STAT.calls to get snapshot scn: kcmgss 1,811 9 -1,802
LATCH.cache buffers chains 2,856 163 -2,693
STAT.session logical reads 2,753 53 -2,700
STAT.consistent gets 2,727 26 -2,701
STAT.consistent gets from cache 2,727 26 -2,701
STAT.consistent gets from cache (fastpath) 2,720 18 -2,702
STAT.sorts (rows) 36,207 72,512 36,305
STAT.table scan rows gotten 90,200 100 -90,100

Run1 latches totaal versus run2 -- verschil en percentage
Run1 Run2 Verschil Pct
8,692 5,907 -2,785 147.15%


So alternative 2 is faster wall clock wise, but is less scalabe than alternative 5.

Now comparing 5 and 6:

rwijk@ORA11G> exec runstats_pkg.rs_stop(50)
Run1 draaide in 73 hsecs
Run2 draaide in 89 hsecs
Run1 draaide in 82,02% van de tijd

Naam Run1 Run2 Verschil
STAT.undo change vector size 2,868 2,804 -64
LATCH.SQL memory manager workarea list latch 74 9 -65
STAT.bytes received via SQL*Net from client 24,549 24,484 -65
LATCH.enqueues 2 76 74
LATCH.enqueue hash chains 3 78 75
STAT.redo size 3,876 3,784 -92
STAT.workarea executions - optimal 107 6 -101
STAT.sorts (memory) 102 1 -101
STAT.sorts (rows) 62,756 31,328 -31,428

Run1 latches totaal versus run2 -- verschil en percentage
Run1 Run2 Verschil Pct
4,700 4,811 111 97.69%


Conclusion is that Adrian's multiset solution is the close winner. The cartesian product solution will appeal more when the difference in intervals is small and the percentage of filtered rows is small. The model clause solution only wins in readability I guess, although some of you probably won't agree :-)

No comments:

Post a Comment