Monday, March 24, 2008

11g sample sizes

One of the nice side effects of doing a workshop or presentation, is that you get to see new or updated behaviour on the subject. This time I could not explain why the gathered histogram information on my 100,000 row table was only an estimate. Although we did not specify the estimate_percent, the sample_size of the user_tables clearly indicated a compute, or at least I thought so, because the sample size of the table equalled the number of rows. During the workshop, I digged up another example of a presentation of more than two years ago, showing histogram information on a 10,000 row table. This one still showed computed (non-estimated) histogram information. When explicitly setting the estimate_percent parameter to null, the histogram information became exact for the 100,000 row table as well. Here is what happened.

First the situation we encountered during the workshop:

rwijk@ORA11G> create table t (id,status,description)
2 as
3 select level
4 , trunc(ln(level)/ln(10))
5 , 'description ' || to_char(level)
6 from dual
7 connect by level <= 99999
8 /

Tabel is aangemaakt.

rwijk@ORA11G> alter table t add constraint t_pk primary key (id)
2 /

Tabel is gewijzigd.

rwijk@ORA11G> create index t_status_i on t(status)
2 /

Index is aangemaakt.

rwijk@ORA11G> begin
2 dbms_stats.gather_table_stats
3 ( user
4 , 't'
5 , cascade => true
6 , method_opt => 'for columns status'
7 );
8 end;
9 /

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> select status
2 , count(*)
3 from t
4 group by status
5 order by status
6 /

STATUS COUNT(*)
---------- ----------
0 9
1 90
2 900
3 9000
4 90000

5 rijen zijn geselecteerd.

rwijk@ORA11G> select sample_size
2 , num_rows
3 , blocks
4 from user_tables
5 where table_name = 'T'
6 /

SAMPLE_SIZE NUM_ROWS BLOCKS
----------- ---------- ----------
99999 99999 443

1 rij is geselecteerd.

rwijk@ORA11G> select endpoint_value
2 , endpoint_number -
3 lag(endpoint_number,1,0) over (order by endpoint_value) bucket_height
4 from user_histograms
5 where table_name = 'T'
6 and column_name = 'STATUS'
7 /

ENDPOINT_VALUE BUCKET_HEIGHT
-------------- -------------
0 1
1 3
2 44
3 510
4 4951

5 rijen zijn geselecteerd.


The estimated histogram information I was talking about. And this is how we continued, by setting the estimate_percent to null:

rwijk@ORA11G> begin
2 dbms_stats.gather_table_stats
3 ( user
4 , 't'
5 , cascade => true
6 , method_opt => 'for columns status'
7 , estimate_percent => null
8 );
9 end;
10 /

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> select sample_size
2 , num_rows
3 , blocks
4 from user_tables
5 where table_name = 'T'
6 /

SAMPLE_SIZE NUM_ROWS BLOCKS
----------- ---------- ----------
99999 99999 443

1 rij is geselecteerd.

rwijk@ORA11G> select endpoint_value
2 , endpoint_number -
3 lag(endpoint_number,1,0) over (order by endpoint_value) bucket_height
4 from user_histograms
5 where table_name = 'T'
6 and column_name = 'STATUS'
7 /

ENDPOINT_VALUE BUCKET_HEIGHT
-------------- -------------
0 9
1 90
2 900
3 9000
4 90000

5 rijen zijn geselecteerd.


The turning point where histogram information is exactly gathered with an estimate_percent left to the default using the auto_sample_size, is between 21,999 rows and 22,000 rows. This is also the case on 10g. But I noticed a difference on that version. Here is the same situation with a 22,000 row table on 10.2.0.1:

rwijk@ORA10GR2> create table t (id,status,description)
2 as
3 select level
4 , trunc(ln(level)/ln(10))
5 , 'description ' || to_char(level)
6 from dual
7 connect by level <= 22000
8 /

Tabel is aangemaakt.

rwijk@ORA10GR2> alter table t add constraint t_pk primary key (id)
2 /

Tabel is gewijzigd.

rwijk@ORA10GR2> create index t_status_i on t(status)
2 /

Index is aangemaakt.

rwijk@ORA10GR2> begin
2 dbms_stats.gather_table_stats
3 ( user
4 , 't'
5 , cascade => true
6 , method_opt => 'for columns status'
7 );
8 end;
9 /

PL/SQL-procedure is geslaagd.

rwijk@ORA10GR2> select status
2 , count(*)
3 from t
4 group by status
5 order by status
6 /

STATUS COUNT(*)
---------- ----------
0 9
1 90
2 900
3 9000
4 12001

5 rijen zijn geselecteerd.

rwijk@ORA10GR2> select sample_size
2 , num_rows
3 , blocks
4 from user_tables
5 where table_name = 'T'
6 /

SAMPLE_SIZE NUM_ROWS BLOCKS
----------- ---------- ----------
5508 22032 101

1 rij is geselecteerd.

rwijk@ORA10GR2> select endpoint_value
2 , endpoint_number -
3 lag(endpoint_number,1,0) over (order by endpoint_value) bucket_height
4 from user_histograms
5 where table_name = 'T'
6 and column_name = 'STATUS'
7 /

ENDPOINT_VALUE BUCKET_HEIGHT
-------------- -------------
0 5
1 22
2 229
3 2289
4 2963

5 rijen zijn geselecteerd.


The same results as on 11.1.0.6, but one striking difference: the sample size of the table was 5508 instead of 22,000. This is exactly why on earlier versions I wasn´t surprised to see estimated histogram information: the table statistics indicated estimate numbers are used, and estimates were used for histograms as well. On 11g the sample sizes for table and column information can be different as can be seen in the next example. You can see this difference when comparing the sample sizes of user_tables and user_tab_columns (or user_tab_statistics and user_tab_col_statistics). On 11.1.0.6 with a 22,000 row table:

rwijk@ORA11G> create table t (id,status,description)
2 as
3 select level
4 , trunc(ln(level)/ln(10))
5 , 'description ' || to_char(level)
6 from dual
7 connect by level <= 22000
8 /

Tabel is aangemaakt.

rwijk@ORA11G> alter table t add constraint t_pk primary key (id)
2 /

Tabel is gewijzigd.

rwijk@ORA11G> create index t_status_i on t(status)
2 /

Index is aangemaakt.

rwijk@ORA11G> begin
2 dbms_stats.gather_table_stats
3 ( user
4 , 't'
5 , cascade => true
6 , method_opt => 'for columns status'
7 );
8 end;
9 /

PL/SQL-procedure is geslaagd.

rwijk@ORA11G> select status
2 , count(*)
3 from t
4 group by status
5 order by status
6 /

STATUS COUNT(*)
---------- ----------
0 9
1 90
2 900
3 9000
4 12001

5 rijen zijn geselecteerd.

rwijk@ORA11G> select sample_size
2 , num_rows
3 , blocks
4 from user_tables
5 where table_name = 'T'
6 /

SAMPLE_SIZE NUM_ROWS BLOCKS
----------- ---------- ----------
22000 22000 101

1 rij is geselecteerd.

rwijk@ORA11G> select sample_size
2 , low_value
3 , high_value
4 , num_nulls
5 , num_distinct
6 from user_tab_columns
7 where table_name = 'T'
8 and column_name = 'STATUS'
9 /

SAMPLE_SIZE LOW_VALUE HIGH_VALUE NUM_NULLS NUM_DISTINCT
----------- ---------- ---------- ---------- ------------
5527 80 C105 0 5

1 rij is geselecteerd.

rwijk@ORA11G> select endpoint_value
2 , endpoint_number -
3 lag(endpoint_number,1,0) over (order by endpoint_value) bucket_height
4 from user_histograms
5 where table_name = 'T'
6 and column_name = 'STATUS'
7 /

ENDPOINT_VALUE BUCKET_HEIGHT
-------------- -------------
0 2
1 27
2 220
3 2285
4 2993

5 rijen zijn geselecteerd.


On 11g, the table statistics are computed until 9,999,999 rows at least. My guess is that the sample sizes can be different now, to speed up the process of gathering the statistics. The gathering of several column level statistics is a much more time consuming process than the one-time gathering of table level statistics, the dbms_stats package is now able to use different sample sizes for the two.

No comments:

Post a Comment