Sunday, September 25, 2011

EXISTS

Most of us know the SQL keyword EXISTS as the condition with a subquery you use in a WHERE clause. But if you look at the documentation of EXISTS, you'll see that it says nothing about just using it in a WHERE clause. It's just a sort of function that accepts a query as input and returns "TRUE if a subquery returns at least one row". The SQL language doesn't know about booleans, but it calls these "conditions". And according to another part of the documentation about SQL conditions:

You can use a condition in the WHERE clause of these statements:

DELETE
SELECT
UPDATE

You can use a condition in any of these clauses of the SELECT statement:

WHERE
START WITH
CONNECT BY
HAVING

So you might be tempted to believe it is not possible to use EXISTS in the SELECT clause. However, in a CASE expression, more specifically the searched case expression, it expects a "condition" after the WHEN, as you can see in the syntax wire diagram. Meaning we can put an EXISTS condition after the WHEN.

An example of this came up recently when a colleague asked me for help with his conversion project with a query. His setup resembles this test case:

SQL> create table t1 (id)
2 as
3 select level
4 from dual
5 connect by level <= 4
6 /

Table created.

SQL> alter table t1 add constraint t1_pk primary key (id)
2 /

Table altered.

SQL> exec dbms_stats.gather_table_stats(user,'t1',cascade=>true)

PL/SQL procedure successfully completed.

SQL> create table t2 (id, t1_id, status, filler)
2 as
3 select 1, 1, 'w', lpad('*',1000,'*') from dual union all
4 select 2, 2, 'x', lpad('*',1000,'*') from dual union all
5 select 3, 2, 'y', lpad('*',1000,'*') from dual union all
6 select 4, 3, 'w', lpad('*',1000,'*') from dual union all
7 select 5, 3, 'x', lpad('*',1000,'*') from dual union all
8 select 6, 3, 'z', lpad('*',1000,'*') from dual union all
9 select 7, 4, 'w', lpad('*',1000,'*') from dual union all
10 select 8, 4, 'z', lpad('*',1000,'*') from dual
11 /

Table created.

SQL> alter table t2 add constraint t2_pk primary key (id)
2 /

Table altered.

SQL> alter table t2 add constraint t2_t1_fk foreign key (t1_id) references t1(id)
2 /

Table altered.

SQL> create index t2_i on t2 (status,t1_id)
2 /

Index created.

SQL> exec dbms_stats.gather_table_stats(user,'t2',cascade=>true)

PL/SQL procedure successfully completed.

Two tables with a master detail relationship. The question for me was to come up with a query that selects all rows of t1 and if there is at least one row in t2 with status x or y, then print a 'yes' in an extra column, and 'no' otherwise.

The first query that I came up with, was to use a simple outer join and remove the duplicates with a distinct.

SQL> select /*+ gather_plan_statistics */
2 distinct t1.id
3 , nvl2(t2.id,'yes','no') exists_t2
4 from t1
5 left outer join t2
6 on ( t1.id = t2.t1_id
7 and t2.status in ('x','y')
8 )
9 order by t1.id
10 /

ID EXISTS_T2
----- ----------
1 no
2 yes
3 yes
4 no

4 rows selected.

And it is done in 4 consistent gets using this plan:

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
2 /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 2mtksgp0pzx59, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ distinct t1.id ,
nvl2(t2.id,'yes','no') exists_t2 from t1 left outer join t2
on ( t1.id = t2.t1_id and t2.status in ('x','y')
) order by t1.id

Plan hash value: 1021492722

------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 4 | | | |
| 1 | SORT UNIQUE | | 1 | 4 | 4 |00:00:00.01 | 4 | 2048 | 2048 | 2048 (0)|
| 2 | MERGE JOIN OUTER | | 1 | 4 | 5 |00:00:00.01 | 4 | | | |
| 3 | INDEX FULL SCAN | T1_PK | 1 | 4 | 4 |00:00:00.01 | 1 | | | |
|* 4 | SORT JOIN | | 4 | 4 | 3 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)|
| 5 | INLIST ITERATOR | | 1 | | 3 |00:00:00.01 | 3 | | | |
| 6 | TABLE ACCESS BY INDEX ROWID| T2 | 2 | 4 | 3 |00:00:00.01 | 3 | | | |
|* 7 | INDEX RANGE SCAN | T2_I | 2 | 4 | 3 |00:00:00.01 | 2 | | | |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."ID"="T2"."T1_ID")
filter("T1"."ID"="T2"."T1_ID")
7 - access(("T2"."STATUS"='x' OR "T2"."STATUS"='y'))


29 rows selected.

Later, I offered an alternative, after realizing I can also use EXISTS in the SELECT clause, like this:

SQL> select /*+ gather_plan_statistics */
2 t1.id
3 , case
4 when exists
5 ( select ''
6 from t2
7 where t2.t1_id = t1.id
8 and t2.status in ('x','y')
9 )
10 then 'yes'
11 else 'no'
12 end exists_t2
13 from t1
14 order by t1.id
15 /

ID EXISTS_T2
----- ----------
1 no
2 yes
3 yes
4 no

4 rows selected.

So, that's the proof that you can use an EXISTS in a SELECT clause. But this query uses 8 consistent gets (6+2):

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
2 /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID b13u3f5m9ybuu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ t1.id , case
when exists ( select '' from t2 where
t2.t1_id = t1.id and t2.status in ('x','y') )
then 'yes' else 'no' end exists_t2 from t1 order
by t1.id

Plan hash value: 1623664937

-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 2 |
| 1 | INLIST ITERATOR | | 4 | | 2 |00:00:00.01 | 6 |
|* 2 | INDEX RANGE SCAN| T2_I | 6 | 1 | 2 |00:00:00.01 | 6 |
| 3 | INDEX FULL SCAN | T1_PK | 1 | 4 | 4 |00:00:00.01 | 2 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access((("T2"."STATUS"='x' OR "T2"."STATUS"='y')) AND
"T2"."T1_ID"=:B1)


25 rows selected.

So not really interesting with this data set, although the plan is shorter in number of operations. With other data distributions it can be useful though. For example, when blowing up the t2 table:

SQL> insert into t2
2 select n * 1000 + t2.id
3 , t2.t1_id
4 , t2.status
5 , t2.filler
6 from t2
7 cross join (select level n from dual connect by level <= 1000)
8 /

8000 rows created.

SQL> exec dbms_stats.gather_table_stats(user,'t2',cascade=>true,method_opt=>'for all columns size auto')

PL/SQL procedure successfully completed.

Now, the left outer join with the distinct has to join 4 rows with 3003 joins, and the resulting SQL uses 3017 consistent gets:

SQL> select /*+ gather_plan_statistics */
2 distinct t1.id
3 , nvl2(t2.id,'yes','no') exists_t2
4 from t1
5 left outer join t2
6 on ( t1.id = t2.t1_id
7 and t2.status in ('x','y')
8 )
9 order by t1.id
10 /

ID EXISTS_T2
----- ----------
1 no
2 yes
3 yes
4 no

4 rows selected.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
2 /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 2mtksgp0pzx59, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ distinct t1.id ,
nvl2(t2.id,'yes','no') exists_t2 from t1 left outer join t2
on ( t1.id = t2.t1_id and t2.status in ('x','y')
) order by t1.id

Plan hash value: 1021492722

------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 3017 | | | |
| 1 | SORT UNIQUE | | 1 | 4 | 4 |00:00:00.01 | 3017 | 2048 | 2048 | 2048 (0)|
| 2 | MERGE JOIN OUTER | | 1 | 4 | 3005 |00:00:00.01 | 3017 | | | |
| 3 | INDEX FULL SCAN | T1_PK | 1 | 4 | 4 |00:00:00.01 | 1 | | | |
|* 4 | SORT JOIN | | 4 | 4 | 3003 |00:00:00.01 | 3016 | 115K| 115K| 102K (0)|
| 5 | INLIST ITERATOR | | 1 | | 3003 |00:00:00.01 | 3016 | | | |
| 6 | TABLE ACCESS BY INDEX ROWID| T2 | 2 | 4 | 3003 |00:00:00.01 | 3016 | | | |
|* 7 | INDEX RANGE SCAN | T2_I | 2 | 4 | 3003 |00:00:00.01 | 13 | | | |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."ID"="T2"."T1_ID")
filter("T1"."ID"="T2"."T1_ID")
7 - access(("T2"."STATUS"='x' OR "T2"."STATUS"='y'))


29 rows selected.

And the second query is now a whole lot more efficient with only 15 consistent gets:

SQL> select /*+ gather_plan_statistics */
2 t1.id
3 , case
4 when exists
5 ( select ''
6 from t2
7 where t2.t1_id = t1.id
8 and t2.status in ('x','y')
9 )
10 then 'yes'
11 else 'no'
12 end exists_t2
13 from t1
14 order by t1.id
15 /

ID EXISTS_T2
----- ----------
1 no
2 yes
3 yes
4 no

4 rows selected.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
2 /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID b13u3f5m9ybuu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ t1.id , case
when exists ( select '' from t2 where
t2.t1_id = t1.id and t2.status in ('x','y') )
then 'yes' else 'no' end exists_t2 from t1 order
by t1.id

Plan hash value: 1623664937

-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 2 |
| 1 | INLIST ITERATOR | | 4 | | 2 |00:00:00.01 | 13 |
|* 2 | INDEX RANGE SCAN| T2_I | 6 | 1 | 2 |00:00:00.01 | 13 |
| 3 | INDEX FULL SCAN | T1_PK | 1 | 4 | 4 |00:00:00.01 | 2 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access((("T2"."STATUS"='x' OR "T2"."STATUS"='y')) AND
"T2"."T1_ID"=:B1)


25 rows selected.

The query selects the four rows of T1 by the INDEX FULL SCAN in line 3. The six starts of line 2 can be explained by the INLIST ITERATOR that splittes the T2 access into one for status = 'x' and one for status = 'y'. It searches for the 'x' for all four rows of T1. For t1.id 2 and 3, it finds an 'x' value, and it doesn't need to look for an 'y' value. Only for t1.id 1 and 4, it needs to search for an 'y' value.

Note that not only the data distribution is a factor, but the index on (t1_id,status) is needed as well to make a difference here.

I guess I just want to say that there is an alternative that can be useful in certain situations, that you might not have thought of initially.




PS: Something completely different: Kamran Agayev has posted an interview with me.