Friday, December 7, 2007

Multicolumn statistics

Of all new features of 11g, there are the ones that are talked about a lot, like the result cache and virtual columns. I haven't seen much about multicolumn statistics yet, apart from appearing in several bullet lists about 11g. However, I have seen several performance issues in the past, that could have benefitted from this feature, so I am quite excited about this one. The feature addresses what Wolfgang Breitling calls "Fallacy II - The Predicate Independence Assumption" in his paper Fallacies of the Cost Based Optimizer. Here is an example of how it works:

Let's create a fake order table:

rwijk@ORA11G> create table orders
2 as
3 select level nr
4 , 'Customer ' || to_char(level) customer_name
5 , case
6 when level <= 500 then 'London'
7 when level <= 550 then 'New York'
8 when level <= 600 then 'Paris'
9 when level <= 650 then 'Vienna'
10 when level <= 700 then 'Moscow'
11 when level <= 750 then 'Berlin'
12 when level <= 800 then 'Rome'
13 when level <= 850 then 'Madrid'
14 when level <= 900 then 'Brussels'
15 else 'Amsterdam'
16 end city
17 , case
18 when level < 500 then 'UK'
19 when level <= 550 then 'USA'
20 when level <= 600 then 'France'
21 when level <= 650 then 'Austria'
22 when level <= 700 then 'Russia'
23 when level <= 750 then 'Germany'
24 when level <= 800 then 'Italy'
25 when level <= 850 then 'Spain'
26 when level <= 900 then 'Belgium'
27 else 'Netherlands'
28 end country
29 from dual
30 connect by level <= 1000
31 /

Tabel is aangemaakt.


And gather statistics normally.

rwijk@ORA11G> exec dbms_stats.gather_table_stats(user,'orders')

PL/SQL-procedure is geslaagd.


Because the case expression in the create table statement may not read easily, I did these select statements to give you a better idea of what is in the table:

rwijk@ORA11G> select city
2 , count(*)
3 from orders
4 group by city
5 order by city
6 /

CITY COUNT(*)
--------- --------------------------------------
Amsterdam 100
Berlin 50
Brussels 50
London 500
Madrid 50
Moscow 50
New York 50
Paris 50
Rome 50
Vienna 50

10 rijen zijn geselecteerd.

rwijk@ORA11G> select country
2 , count(*)
3 from orders
4 group by country
5 order by country
6 /

COUNTRY COUNT(*)
----------- --------------------------------------
Austria 50
Belgium 50
France 50
Germany 50
Italy 50
Netherlands 100
Russia 50
Spain 50
UK 499
USA 51

10 rijen zijn geselecteerd.

rwijk@ORA11G> select city
2 , country
3 , count(*)
4 from orders
5 group by city
6 , country
7 order by city
8 , country
9 /

CITY COUNTRY COUNT(*)
--------- ----------- --------------------------------------
Amsterdam Netherlands 100
Berlin Germany 50
Brussels Belgium 50
London UK 499
London USA 1
Madrid Spain 50
Moscow Russia 50
New York USA 50
Paris France 50
Rome Italy 50
Vienna Austria 50

11 rijen zijn geselecteerd.


So 10 cities, 10 countries, with a clear dependency between the columns city and country. Although not exactly one-on-one, because - as everybody knows :-) - London is also a town in Ohio, USA. Don't get extracted by this fact too much; it won't play a role in the example.

Ok, so let's have a look at the cardinalities in the following query:

rwijk@ORA11G> explain plan
2 for
3 select *
4 from orders
5 where city = 'London'
6 and country = 'UK'
7 /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
2 /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 290 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| ORDERS | 10 | 290 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("CITY"='London' AND "COUNTRY"='UK')

13 rijen zijn geselecteerd.


Since histograms are not present on the table yet, the cost based optimizer has predicted 10 rows, because the filter predicate city = 'London' has 10 distinct values, and the predicate country = 'UK' also has 10 distinct values. Because of the predicate independence assumption, the cost based optimizer calculates 1000 rows * (1/10) * (1/10) = 10 rows.

Also it doesn't matter which value I put in the query:

rwijk@ORA11G> exec dbms_lock.sleep(1)

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
2 for
3 select *
4 from orders
5 where city = 'London'
6 and country = 'Netherlands'
7 /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
2 /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 290 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| ORDERS | 10 | 290 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("CITY"='London' AND "COUNTRY"='Netherlands')

13 rijen zijn geselecteerd.


Even when searching for London in the Netherlands, it predicts 10. We all know that the query would of course return 0 rows, but the cost based optimizer does not.

So let's create histograms on the columns, to have better estimates:

rwijk@ORA11G> exec dbms_stats.gather_table_stats(user,'orders',
method_opt=>'FOR ALL COLUMNS')

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
2 for
3 select *
4 from orders
5 where city = 'London'
6 and country = 'UK'
7 /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
2 /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 250 | 7250 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| ORDERS | 250 | 7250 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COUNTRY"='UK' AND "CITY"='London')

13 rijen zijn geselecteerd.


Now the prediction is 250 rows. This is calculated based on the histogram information that tells the cost based optimizer that London occurs 500 times out of 1000 rows and the UK occurs 499 times out of 1000 rows. 1000 rows * (500/1000) * (499/1000) is approximately 250 rows.

And for the query using London in the Netherlands:

rwijk@ORA11G> exec dbms_lock.sleep(1)

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
2 for
3 select *
4 from orders
5 where city = 'London'
6 and country = 'Netherlands'
7 /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
2 /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50 | 1450 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| ORDERS | 50 | 1450 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COUNTRY"='Netherlands' AND "CITY"='London')

13 rijen zijn geselecteerd.


Now 50 rows are calculated, because 1000 rows * (500/1000) * (100/1000) = 50.

The above is the best you could get before 11g. With the use of multicolumn statistics, the row counts will be REALLY accurate:

rwijk@ORA11G> exec dbms_stats.gather_table_stats(user,'orders',
method_opt=>'FOR COLUMNS (city,country)')

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> pause

rwijk@ORA11G> explain plan
2 for
3 select *
4 from orders
5 where city = 'London'
6 and country = 'UK'
7 /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
2 /

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 499 | 15469 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| ORDERS | 499 | 15469 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COUNTRY"='UK' AND "CITY"='London')

13 rijen zijn geselecteerd.


499 exactly! And for London in the Netherlands:

rwijk@ORA11G> exec dbms_lock.sleep(1)

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
2 for
3 select *
4 from orders
5 where city = 'London'
6 and country = 'Netherlands'
7 /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
2 /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 31 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| ORDERS | 1 | 31 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COUNTRY"='Netherlands' AND "CITY"='London')

13 rijen zijn geselecteerd.


Since the optimizer won't calculate with 0, it uses 1, but again very accurate! Much more precise cardinality estimates will lead to better decisions regarding whether to use an index or not, for example. It can really make a big difference in a production system. Problem queries can now be tuned more often by analyzing instead of adding hints to your code.

No comments:

Post a Comment