About Oracle: Multicolumn statistics

Of all new features of 11g, there are the ones that are talked about a lot, like the result cache and virtual columns. I haven't seen much about multicolumn statistics yet, apart from appearing in several bullet lists about 11g. However, I have seen several performance issues in the past, that could have benefitted from this feature, so I am quite excited about this one. The feature addresses what Wolfgang Breitling calls "Fallacy II - The Predicate Independence Assumption" in his paper Fallacies of the Cost Based Optimizer. Here is an example of how it works:

Let's create a fake order table:

rwijk@ORA11G> create table orders
  2  as
  3   select level nr
  4        , 'Customer ' || to_char(level) customer_name
  5        , case
  6          when level <= 500 then 'London'
  7          when level <= 550 then 'New York'
  8          when level <= 600 then 'Paris'
  9          when level <= 650 then 'Vienna'
 10          when level <= 700 then 'Moscow'
 11          when level <= 750 then 'Berlin'
 12          when level <= 800 then 'Rome'
 13          when level <= 850 then 'Madrid'
 14          when level <= 900 then 'Brussels'
 15          else 'Amsterdam'
 16          end city
 17        , case
 18          when level <  500 then 'UK'
 19          when level <= 550 then 'USA'
 20          when level <= 600 then 'France'
 21          when level <= 650 then 'Austria'
 22          when level <= 700 then 'Russia'
 23          when level <= 750 then 'Germany'
 24          when level <= 800 then 'Italy'
 25          when level <= 850 then 'Spain'
 26          when level <= 900 then 'Belgium'
 27          else 'Netherlands'
 28          end country
 29     from dual
 30  connect by level <= 1000
 31  /

Tabel is aangemaakt.

And gather statistics normally.

rwijk@ORA11G> exec dbms_stats.gather_table_stats(user,'orders')

PL/SQL-procedure is geslaagd.

Because the case expression in the create table statement may not read easily, I did these select statements to give you a better idea of what is in the table:

rwijk@ORA11G> select city
  2       , count(*)
  3    from orders
  4   group by city
  5   order by city
  6  /

CITY                                    COUNT(*)
--------- --------------------------------------
Amsterdam                                    100
Berlin                                        50
Brussels                                      50
London                                       500
Madrid                                        50
Moscow                                        50
New York                                      50
Paris                                         50
Rome                                          50
Vienna                                        50

10 rijen zijn geselecteerd.

rwijk@ORA11G> select country
  2       , count(*)
  3    from orders
  4   group by country
  5   order by country
  6  /

COUNTRY                                   COUNT(*)
----------- --------------------------------------
Austria                                         50
Belgium                                         50
France                                          50
Germany                                         50
Italy                                           50
Netherlands                                    100
Russia                                          50
Spain                                           50
UK                                             499
USA                                             51

10 rijen zijn geselecteerd.

rwijk@ORA11G> select city
  2       , country
  3       , count(*)
  4    from orders
  5   group by city
  6       , country
  7   order by city
  8       , country
  9  /

CITY      COUNTRY                                   COUNT(*)
--------- ----------- --------------------------------------
Amsterdam Netherlands                                    100
Berlin    Germany                                         50
Brussels  Belgium                                         50
London    UK                                             499
London    USA                                              1
Madrid    Spain                                           50
Moscow    Russia                                          50
New York  USA                                             50
Paris     France                                          50
Rome      Italy                                           50
Vienna    Austria                                         50

11 rijen zijn geselecteerd.

So 10 cities, 10 countries, with a clear dependency between the columns city and country. Although not exactly one-on-one, because - as everybody knows :-) - London is also a town in Ohio, USA. Don't get extracted by this fact too much; it won't play a role in the example.

Ok, so let's have a look at the cardinalities in the following query:

rwijk@ORA11G> explain plan
  2  for
  3  select *
  4    from orders
  5   where city = 'London'
  6     and country = 'UK'
  7  /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |    10 |   290 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS |    10 |   290 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("CITY"='London' AND "COUNTRY"='UK')

13 rijen zijn geselecteerd.

Since histograms are not present on the table yet, the cost based optimizer has predicted 10 rows, because the filter predicate city = 'London' has 10 distinct values, and the predicate country = 'UK' also has 10 distinct values. Because of the predicate independence assumption, the cost based optimizer calculates 1000 rows * (1/10) * (1/10) = 10 rows.

Also it doesn't matter which value I put in the query:

rwijk@ORA11G> exec dbms_lock.sleep(1)

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
  2  for
  3  select *
  4    from orders
  5   where city = 'London'
  6     and country = 'Netherlands'
  7  /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |    10 |   290 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS |    10 |   290 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("CITY"='London' AND "COUNTRY"='Netherlands')

13 rijen zijn geselecteerd.

Even when searching for London in the Netherlands, it predicts 10. We all know that the query would of course return 0 rows, but the cost based optimizer does not.

So let's create histograms on the columns, to have better estimates:

rwijk@ORA11G> exec dbms_stats.gather_table_stats(user,'orders',
method_opt=>'FOR ALL COLUMNS')

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
  2  for
  3  select *
  4    from orders
  5   where city = 'London'
  6     and country = 'UK'
  7  /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |   250 |  7250 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS |   250 |  7250 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COUNTRY"='UK' AND "CITY"='London')

13 rijen zijn geselecteerd.

Now the prediction is 250 rows. This is calculated based on the histogram information that tells the cost based optimizer that London occurs 500 times out of 1000 rows and the UK occurs 499 times out of 1000 rows. 1000 rows * (500/1000) * (499/1000) is approximately 250 rows.

And for the query using London in the Netherlands:

rwijk@ORA11G> exec dbms_lock.sleep(1)

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
  2  for
  3  select *
  4    from orders
  5   where city = 'London'
  6     and country = 'Netherlands'
  7  /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |    50 |  1450 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS |    50 |  1450 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COUNTRY"='Netherlands' AND "CITY"='London')

13 rijen zijn geselecteerd.

Now 50 rows are calculated, because 1000 rows * (500/1000) * (100/1000) = 50.

The above is the best you could get before 11g. With the use of multicolumn statistics, the row counts will be REALLY accurate:

rwijk@ORA11G> exec dbms_stats.gather_table_stats(user,'orders',
method_opt=>'FOR COLUMNS (city,country)')

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> pause

rwijk@ORA11G> explain plan
  2  for
  3  select *
  4    from orders
  5   where city = 'London'
  6     and country = 'UK'
  7  /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |   499 | 15469 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS |   499 | 15469 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COUNTRY"='UK' AND "CITY"='London')

13 rijen zijn geselecteerd.

499 exactly! And for London in the Netherlands:

rwijk@ORA11G> exec dbms_lock.sleep(1)

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> explain plan
  2  for
  3  select *
  4    from orders
  5   where city = 'London'
  6     and country = 'Netherlands'
  7  /

Uitleg is gegeven.

rwijk@ORA11G> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------

Plan hash value: 1275100350

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |     1 |    31 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS |     1 |    31 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("COUNTRY"='Netherlands' AND "CITY"='London')

13 rijen zijn geselecteerd.

Since the optimizer won't calculate with 0, it uses 1, but again very accurate! Much more precise cardinality estimates will lead to better decisions regarding whether to use an index or not, for example. It can really make a big difference in a production system. Problem queries can now be tuned more often by analyzing instead of adding hints to your code.

About Oracle

Pages

Friday, December 7, 2007

Multicolumn statistics

No comments:

Post a Comment

About me

My APEX applications

Upcoming presentations

Twitter

Oracle related links

Oracle database and APEX blogs

Blog Archive

Total Pageviews