Saturday, May 19, 2012

Much Ado About Nothing?

I was reading a presentation PDF by Hugh Darwen recently, called How To Handle Missing Information Without Using NULL. Several great thinkers and founders of relational theory consider NULL to be the thing that should not be. For example, one slide in the above-mentioned PDF is titled SQL's Nulls Are A Disaster. And I found a paper with the amusing title The Final Null In The Coffin.

I can understand the critique. The introduction of NULL leads to three-valued logic, which makes programs much more complex and harder to prove correct. Most database professionals have likely been bitten by NULLs several times during their career, myself included. And a NULL can have several interpretations. By using NULL, you are not making clear what is meant. If the value of column hair_colour is NULL, does it mean the person is bald? Or do you know the person has hair, but you just don't know its colour? Or could the person be either bald or have hair, but you just don't know which applies? Or is the person in the midst of a hair colouring exercise and you only temporarily don't know the colour? If you're creative, I'm sure you can come up with other interpretations as well.
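To make the three-valued logic problem concrete, here is a small hypothetical example (the persons table and its contents are made up for illustration): a predicate and its negation together no longer cover all rows once NULLs enter the picture.

```sql
-- Hypothetical persons table with a nullable hair_colour column.
select count(*) from persons where hair_colour =  'Brown';
select count(*) from persons where hair_colour <> 'Brown';
-- These two counts do not add up to the total number of rows:
-- for rows where hair_colour is NULL, both predicates evaluate
-- to UNKNOWN, so such rows are returned by neither query.
select count(*) from persons where hair_colour is null;
```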

On the other hand, the theorists don't have to build database applications for end users who like reasonable response times, and I do. Avoiding NULLs at all costs typically leads to a data model with more tables than needed, requiring more joins and therefore making queries slower. So I have to make a trade-off. In general I try to avoid nullable columns as much as possible, for example by choosing subtype implementations instead of supertype implementations, and by modelling entity subtypes in the first place, but I will never let it noticeably slow down my application. At my current job, I'm making a data model right now. Having read all the use cases, I know how the data will be used, and so I know where in the model there is room to avoid an extra nullable column. One thing I'll never voluntarily do, though, is make up strange outlier values just to get rid of the NULL.
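As an illustration of what I mean by choosing a subtype implementation over a supertype implementation, here is a sketch (the table and column names are my own, not from Darwen's example) that moves an optional attribute into its own table, so every remaining column can be declared NOT NULL:

```sql
-- Supertype implementation: one table, with a nullable column.
create table persons_wide
( id             number        primary key
, name           varchar2(30)  not null
, licence_number varchar2(20)  -- NULL when the person is not a driver
);

-- Subtype implementation: the optional attribute gets its own table,
-- and no column needs to be nullable anymore.
create table persons
( id   number       primary key
, name varchar2(30) not null
);

create table drivers
( id             number       primary key
                 references persons (id)
, licence_number varchar2(20) not null
);
```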

Anyway, I was curious to see how Hugh Darwen handles missing information without using NULLs. His paper contains a concise example, which I'll translate to Oracle syntax in this blog post to see what practically needs to happen to avoid NULLs. He starts with this table:

SQL> select *
  2    from pers_info
  3  /

        ID NAME       JOB            SALARY
---------- ---------- ---------- ----------
      1234 Anne       Lawyer         100000
      1235 Boris      Banker
      1236 Cindy                      70000
      1237 Davinder

4 rows selected.
This table contains four NULL values. Their meaning can't be seen from the table itself, but this is what they are meant to express:
  • Boris earns something, but we don't know how much
  • Cindy does some job, but we don't know what it is
  • Davinder doesn't have a job
  • Davinder doesn't have a salary
So he applies a technique called vertical decomposition, followed by horizontal decomposition on the results, to arrive at the seven tables below, where everything has a clear meaning.
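Darwen's presentation shows the relations themselves, not their SQL definitions; a minimal version of the DDL behind the seven tables (my own sketch, with guessed column sizes) could look like this:

```sql
-- Minimal DDL for the decomposed model; constraints are added
-- later in this post.
create table called     ( id number, name varchar2(10) );
create table does_job   ( id number, job varchar2(10) );
create table job_unk    ( id number );
create table unemployed ( id number );
create table earns      ( id number, salary number );
create table salary_unk ( id number );
create table unsalaried ( id number );
```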
SQL> select *
  2    from called
  3  /

        ID NAME
---------- --------
      1234 Anne
      1235 Boris
      1236 Cindy
      1237 Davinder

4 rows selected.

SQL> select *
  2    from does_job
  3  /

        ID JOB
---------- ------
      1234 Lawyer
      1235 Banker

2 rows selected.

SQL> select *
  2    from job_unk
  3  /

        ID
----------
      1236

1 row selected.

SQL> select *
  2    from unemployed
  3  /

        ID
----------
      1237

1 row selected.

SQL> select *
  2    from earns
  3  /

        ID     SALARY
---------- ----------
      1234     100000
      1236      70000

2 rows selected.

SQL> select *
  2    from salary_unk
  3  /

        ID
----------
      1235

1 row selected.

SQL> select *
  2    from unsalaried
  3  /

        ID
----------
      1237

1 row selected.
Here we have achieved a data model from which every NULL has been banished.

Now what if we'd like to simulate a query against the PERS_INFO table? Darwen uses this expression to transform the seven tables back into PERS_INFO:
WITH (EXTEND JOB_UNK ADD ‘Job unknown’ AS Job_info) AS T1,
     (EXTEND UNEMPLOYED ADD ‘Unemployed’ AS Job_info) AS T2,
     (DOES_JOB RENAME (Job AS Job_info)) AS T3,
     (EXTEND SALARY_UNK ADD ‘Salary unknown’ AS Sal_info) AS T4,
     (EXTEND UNSALARIED ADD ‘Unsalaried’ AS Sal_info) AS T5,
     (EXTEND EARNS ADD CHAR(Salary) AS Sal_info) AS T6,
     (T6 { ALL BUT Salary }) AS T7,
     (UNION ( T1, T2, T3 )) AS T8,
     (UNION ( T4, T5, T7 )) AS T9,
     (JOIN ( CALLED, T8, T9 )) AS PERS_INFO :
PERS_INFO
Translated to Oracle syntax, this becomes:
SQL> with t1 as
  2  ( select id
  3         , 'Job unknown' as job_info
  4      from job_unk
  5  )
  6  , t2 as
  7  ( select id
  8         , 'Unemployed' as job_info
  9      from unemployed
 10  )
 11  , t3 as
 12  ( select id
 13         , job as job_info
 14      from does_job
 15  )
 16  , t4 as
 17  ( select id
 18         , 'Salary unknown' as sal_info
 19      from salary_unk
 20  )
 21  , t5 as
 22  ( select id
 23         , 'Unsalaried' as sal_info
 24      from unsalaried
 25  )
 26  , t6 as
 27  ( select id
 28         , salary
 29         , to_char(salary,'fm999G999') as sal_info
 30      from earns
 31  )
 32  , t7 as
 33  ( select id
 34         , sal_info
 35      from t6
 36  )
 37  , t8 as
 38  ( select id
 39         , job_info
 40      from t1
 41     union all
 42    select id
 43         , job_info
 44      from t2
 45     union all
 46    select id
 47         , job_info
 48      from t3
 49  )
 50  , t9 as
 51  ( select id
 52         , sal_info
 53      from t4
 54     union all
 55    select id
 56         , sal_info
 57      from t5
 58     union all
 59    select id
 60         , sal_info
 61      from t7
 62  )
 63  , pers_info as
 64  ( select c.id
 65         , c.name
 66         , j.job_info
 67         , s.sal_info
 68      from called c
 69           inner join t8 j on (c.id = j.id)
 70           inner join t9 s on (c.id = s.id)
 71  )
 72  select *
 73    from pers_info
 74  /

        ID NAME     JOB_INFO    SAL_INFO
---------- -------- ----------- --------------
      1235 Boris    Banker      Salary unknown
      1237 Davinder Unemployed  Unsalaried
      1234 Anne     Lawyer      100,000
      1236 Cindy    Job unknown 70,000

4 rows selected.
Very elaborate, but the optimizer does a great job of simplifying the query under the covers, as can be seen in this execution plan:
SQL> select *
  2    from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  3  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  bmrtdy0jad18p, child number 0
-------------------------------------
with t1 as ( select id        , 'Job unknown' as job_info     from
job_unk ) , t2 as ( select id        , 'Unemployed' as job_info
from unemployed ) , t3 as ( select id        , job as job_info     from
does_job ) , t4 as ( select id        , 'Salary unknown' as sal_info
 from salary_unk ) , t5 as ( select id        , 'Unsalaried' as
sal_info     from unsalaried ) , t6 as ( select id        , salary
  , to_char(salary,'fm999G999') as sal_info     from earns ) , t7 as (
select id        , sal_info     from t6 ) , t8 as ( select id        ,
job_info     from t1    union all   select id        , job_info
from t2    union all   select id        , job_info     from t3 ) , t9
as ( select id        , sal_info     from t4    union all   select id
     , sal_info     from t5    union all   select id        , sal_info
   from t7 ) , pers_info as ( select c.id        , c.name        ,
j.job_info        , s.sal_info     from called c          inner join t8
j on (c.id = j.id)

Plan hash value: 583520090

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |      1 |        |      4 |00:00:00.01 |      14 |       |       |          |
|*  1 |  HASH JOIN            |            |      1 |      4 |      4 |00:00:00.01 |      14 |  1011K|  1011K|  550K (0)|
|*  2 |   HASH JOIN           |            |      1 |      4 |      4 |00:00:00.01 |       8 |  1180K|  1180K|  548K (0)|
|   3 |    TABLE ACCESS FULL  | CALLED     |      1 |      4 |      4 |00:00:00.01 |       2 |       |       |          |
|   4 |    VIEW               |            |      1 |      4 |      4 |00:00:00.01 |       6 |       |       |          |
|   5 |     UNION-ALL         |            |      1 |        |      4 |00:00:00.01 |       6 |       |       |          |
|   6 |      TABLE ACCESS FULL| JOB_UNK    |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|   7 |      TABLE ACCESS FULL| UNEMPLOYED |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|   8 |      TABLE ACCESS FULL| DOES_JOB   |      1 |      2 |      2 |00:00:00.01 |       2 |       |       |          |
|   9 |   VIEW                |            |      1 |      4 |      4 |00:00:00.01 |       6 |       |       |          |
|  10 |    UNION-ALL          |            |      1 |        |      4 |00:00:00.01 |       6 |       |       |          |
|  11 |     TABLE ACCESS FULL | SALARY_UNK |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|  12 |     TABLE ACCESS FULL | UNSALARIED |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|  13 |     TABLE ACCESS FULL | EARNS      |      1 |      2 |      2 |00:00:00.01 |       2 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("C"."ID"="S"."ID")
   2 - access("C"."ID"="J"."ID")


45 rows selected.
If I had to build the PERS_INFO table back again with a query myself, I'd use this shorter query with six left outer joins:
SQL> select c.id
  2       , c.name
  3       , coalesce(j.job,nvl2(ju.id,'Job unknown',null),nvl2(ue.id,'Unemployed',null)) job_info
  4       , coalesce(to_char(e.salary,'fm999G999'),nvl2(su.id,'Salary unknown',null),nvl2(us.id,'Unsalaried',null)) salary_info
  5    from called c
  6         left outer join does_job j on (c.id = j.id)
  7         left outer join job_unk ju on (c.id = ju.id)
  8         left outer join unemployed ue on (c.id = ue.id)
  9         left outer join earns e on (c.id = e.id)
 10         left outer join salary_unk su on (c.id = su.id)
 11         left outer join unsalaried us on (c.id = us.id)
 12  /

        ID NAME     JOB_INFO    SALARY_INFO
---------- -------- ----------- --------------
      1234 Anne     Lawyer      100,000
      1236 Cindy    Job unknown 70,000
      1235 Boris    Banker      Salary unknown
      1237 Davinder Unemployed  Unsalaried

4 rows selected.
Although, as you can see below, the plan doesn't really improve:
SQL> select *
  2    from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  3  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  6x45b27mvpb1m, child number 0
-------------------------------------
select c.id      , c.name      , coalesce(j.job,nvl2(ju.id,'Job
unknown',null),nvl2(ue.id,'Unemployed',null)) job_info      ,
coalesce(to_char(e.salary,'fm999G999'),nvl2(su.id,'Salary
unknown',null),nvl2(us.id,'Unsalaried',null)) salary_info   from called
c        left outer join does_job j on (c.id = j.id)        left outer
join job_unk ju on (c.id = ju.id)        left outer join unemployed ue
on (c.id = ue.id)        left outer join earns e on (c.id = e.id)
 left outer join salary_unk su on (c.id = su.id)        left outer join
unsalaried us on (c.id = us.id)

Plan hash value: 3398518218

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name       | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |      1 |        |      4 |00:00:00.01 |      15 |       |       |          |
|*  1 |  HASH JOIN OUTER        |            |      1 |      4 |      4 |00:00:00.01 |      15 |   955K|   955K|  528K (0)|
|*  2 |   HASH JOIN OUTER       |            |      1 |      4 |      4 |00:00:00.01 |      12 |  1000K|  1000K|  523K (0)|
|*  3 |    HASH JOIN OUTER      |            |      1 |      4 |      4 |00:00:00.01 |      10 |  1035K|  1035K|  536K (0)|
|*  4 |     HASH JOIN OUTER     |            |      1 |      4 |      4 |00:00:00.01 |       8 |  1063K|  1063K|  536K (0)|
|*  5 |      HASH JOIN OUTER    |            |      1 |      4 |      4 |00:00:00.01 |       6 |  1114K|  1114K|  537K (0)|
|*  6 |       HASH JOIN OUTER   |            |      1 |      4 |      4 |00:00:00.01 |       4 |  1180K|  1180K|  538K (0)|
|   7 |        TABLE ACCESS FULL| CALLED     |      1 |      4 |      4 |00:00:00.01 |       2 |       |       |          |
|   8 |        TABLE ACCESS FULL| JOB_UNK    |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|   9 |       TABLE ACCESS FULL | UNEMPLOYED |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|  10 |      TABLE ACCESS FULL  | SALARY_UNK |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|  11 |     TABLE ACCESS FULL   | UNSALARIED |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|  12 |    TABLE ACCESS FULL    | DOES_JOB   |      1 |      2 |      2 |00:00:00.01 |       2 |       |       |          |
|  13 |   TABLE ACCESS FULL     | EARNS      |      1 |      2 |      2 |00:00:00.01 |       3 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("C"."ID"="E"."ID")
   2 - access("C"."ID"="J"."ID")
   3 - access("C"."ID"="US"."ID")
   4 - access("C"."ID"="SU"."ID")
   5 - access("C"."ID"="UE"."ID")
   6 - access("C"."ID"="JU"."ID")


43 rows selected.
But the two plans above are really complex compared with a simple query against the PERS_INFO table with nullable columns:
SQL> select *
  2    from pers_info
  3  /

        ID NAME       JOB            SALARY
---------- ---------- ---------- ----------
      1234 Anne       Lawyer         100000
      1235 Boris      Banker
      1236 Cindy                      70000
      1237 Davinder

4 rows selected.

SQL> select *
  2    from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  3  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  016x9f106gj27, child number 1
-------------------------------------
select *   from pers_info

Plan hash value: 1584579034

-----------------------------------------------------------------------------------------
| Id  | Operation         | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |      1 |        |      4 |00:00:00.01 |       7 |
|   1 |  TABLE ACCESS FULL| PERS_INFO |      1 |      4 |      4 |00:00:00.01 |       7 |
-----------------------------------------------------------------------------------------


13 rows selected.
If queries like this are not very frequent in your database, you might accept this extra work and avoid the NULLs. But you need to consider something else as well: the new schema requires many more constraints. Using just the PERS_INFO table, a single primary key constraint on the Id column is all you need. But for the new model, Darwen describes 9 constraints, which really amount to 15 once constraint 9 is expanded:
  1. No two CALLED rows have the same Id. (Primary key)
  2. Every row in CALLED has a matching row in either DOES_JOB, JOB_UNK, or UNEMPLOYED.
  3. No row in DOES_JOB has a matching row in JOB_UNK.
  4. No row in DOES_JOB has a matching row in UNEMPLOYED.
  5. No row in JOB_UNK has a matching row in UNEMPLOYED.
  6. Every row in DOES_JOB has a matching row in CALLED. (Foreign key)
  7. Every row in JOB_UNK has a matching row in CALLED. (Foreign key)
  8. Every row in UNEMPLOYED has a matching row in CALLED. (Foreign key)
  9. Constraints 2 through 8 repeated, mutatis mutandis, for CALLED with respect to EARNS, SALARY_UNK and UNSALARIED.
Implementing constraint 1 is easy:
SQL> alter table called add primary key (id)
  2  /

Table altered.
And so are constraints 6, 7 and 8:
SQL> alter table does_job add foreign key (id) references called (id)
  2  /

Table altered.

SQL> alter table job_unk add foreign key (id) references called (id)
  2  /

Table altered.

SQL> alter table unemployed add foreign key (id) references called (id)
  2  /

Table altered.
But constraint 2 says that the Id in table CALLED is a foreign distributed key. And constraints 3, 4 and 5 say the Ids of tables DOES_JOB, JOB_UNK and UNEMPLOYED form a distributed key. Oracle doesn't have declarative support for distributed keys or for foreign distributed keys. We could write database trigger code to implement them, which is very hard to do correctly, or we could use the materialized view trick to have the condition validated at the end of the transaction instead of at the end of the statement, which also has its downsides. And such deferred constraint checking is explicitly ruled out by The Third Manifesto as well. Nevertheless, here is how it can be done.

The distributed key (constraints 3, 4 and 5):
SQL> create materialized view log on does_job with rowid
  2  /

Materialized view log created.

SQL> create materialized view log on job_unk with rowid
  2  /

Materialized view log created.

SQL> create materialized view log on unemployed with rowid
  2  /

Materialized view log created.

SQL> create materialized view distributed_key_vw
  2    refresh fast on commit
  3  as
  4  select d.rowid rid
  5       , d.id    id
  6       , 'D'     umarker
  7    from does_job d
  8   union all
  9  select j.rowid
 10       , j.id
 11       , 'J'
 12    from job_unk j
 13   union all
 14  select u.rowid
 15       , u.id
 16       , 'U'
 17    from unemployed u
 18  /

Materialized view created.

SQL> alter table distributed_key_vw
  2    add constraint distributed_key_check
  3    primary key (id)
  4  /

Table altered.
And to show that the distributed key implementation works:
SQL> insert into job_unk values (1234)
  2  /

1 row created.

SQL> commit
  2  /
commit
*
ERROR at line 1:
ORA-12048: error encountered while refreshing materialized view "RWIJK"."DISTRIBUTED_KEY_VW"
ORA-00001: unique constraint (RWIJK.DISTRIBUTED_KEY_CHECK) violated
And the foreign distributed key ("Every row in CALLED has a matching row in either DOES_JOB, JOB_UNK, or UNEMPLOYED.") can be implemented like this:
SQL> create materialized view log on does_job with rowid
  2  /

Materialized view log created.

SQL> create materialized view log on job_unk with rowid
  2  /

Materialized view log created.

SQL> create materialized view log on unemployed with rowid
  2  /

Materialized view log created.

SQL> create materialized view foreign_distributed_key_vw
  2    refresh fast on commit
  3  as
  4  select c.rowid  c_rowid
  5       , dj.rowid dj_rowid
  6       , ju.rowid ju_rowid
  7       , ue.rowid ue_rowid
  8       , c.id     id
  9       , dj.id    dj_id
 10       , ju.id    ju_id
 11       , ue.id    ue_id
 12    from called c
 13       , does_job dj
 14       , job_unk ju
 15       , unemployed ue
 16   where c.id = dj.id (+)
 17     and c.id = ju.id (+)
 18     and c.id = ue.id (+)
 19  /

Materialized view created.

SQL> alter table foreign_distributed_key_vw
  2    add constraint foreign_distributed_key_check
  3    check (coalesce(dj_id,ju_id,ue_id) is not null)
  4  /

Table altered.
And some proof that this implementation works:
SQL> insert into called values (1238,'Elise')
  2  /

1 row created.

SQL> commit
  2  /
commit
*
ERROR at line 1:
ORA-12008: error in materialized view refresh path
ORA-02290: check constraint (RWIJK.FOREIGN_DISTRIBUTED_KEY_CHECK) violated
Would I go through the extra trouble of an implementation with 6 more tables, 14 extra constraints and worse performance, like the one above? It depends. It depends on how often the data is queried, and on how often it is updated concurrently. And on whether the distinction between the possible multiple meanings of NULL is relevant in my case. And on whether I have sufficient extra time to implement it. Using Oracle, most often I probably won't.




PS: For all Dutch DBAs: here is a symposium you don't want to miss.

7 comments:

  1. "And a NULL can have several meanings. By using NULL, you are not making clear what is meant"

    Wrong. NULL means one and only one thing: "unknown". Nothing could be more clear!
    No, the subjective interpretations of that are NOT allowed. Unknown is just that: unknown. Single word, single meaning.

  2. "Wrong. NULL means one and only one thing: "unknown". Nothing could be more clear!"

    If only real life were as simple... :)

  3. Appreciated this post, thanks Rob :) I think one of the points that Hugh Darwen (and Chris Date, as well I think) make is that the ideal is to eliminate three-valued logic implied by NULLs; but that to do this you'd need a DBMS that supports the Relational Model of data correctly. When you want to move from your logical model (with all 7 relations) to a *physical* model embodied in a SQL database, you're pretty much forced (as you found in your testing) to go back to the single table. However, you may have gotten some benefit from thinking through what the various conditions mean - in light of the analysis you might decide to add some additional columns (e.g. "is_salaried", "is_employed") to help the application interpret the NULLs correctly.

  4. @Noons, thanks for your reaction.

    Yes, that's the meaning in the Oracle realm we've all grown accustomed to. Outside the Oracle realm though, theorists have other opinions.

    Edgar F. Codd suggested we should have two distinct NULL markers: missing but applicable (A-marks) and missing but inapplicable (I-marks), as you can read here.

    Other, less scientific, opinions are to include extra meanings besides Unknown, like Conflicting values, Not applicable, Empty and Default.

    And Chris Date and Hugh Darwen suggest we should not have a NULL value at all in their Third Manifesto.

    I have rephrased that sentence to "And a NULL value can have several interpretations" now, which I think is even more accurate.

  5. I've been thinking about the dreaded NULL again recently, as I am dealing with a new database that someone else designed in the past, and the consequences of them allowing NULLs on most columns. I would tend to make columns NOT NULL by default, and allow NULLs as an exception, so that nullable columns are in the minority.

    My real issue is not with whether a NULL is meaningful for a specific column, but with the consequences for the Oracle optimizer when NULL values are allowed. NULL values are not stored in indexes (there may be exceptions, but generally they are not). Lots of queries have either/or type constraints where NULL is used to mean "any value allowed" (COL1 = 34 OR COL1 IS NULL). And there are many other cases that stop the optimizer from being able to use indexes fully, or which affect things like cardinality estimates.

  6. When you know all the use cases, implementing subtypes as separate tables might be beneficial for certain queries needed by those use cases. More often than not I've seen very "wide" tables, occupying a large segment, and indexed heavily, with function-based indexes too (required to quickly retrieve certain subsets of the rows), that cause lots of issues.

    For instance, using the example given, a query needing to retrieve all jobless employees is rather simple if a subtype exists, yet requires a much larger segment scan (or a function-based index) in the "stuff it all in one table" scenario.

  7. Would I go through the extra trouble of an implementation with 6 more tables, 14 extra constraints and worse performance like above? It depends. It depends on how often the data is queried, and on how often it is updated concurrently.



    I find it funny that you don't mention "it depends on how important it is for the data to be consistent and conforming to business rules". Because _that_ is what those extra constraints actually get you.
