Wednesday, December 9, 2009

SYS_GUID()

A colleague of mine sent me this link, and wanted me to have a look at point 5:

Choose sys_guid() instead of sequences for populating meaningless PK-columns.
Traditional sequences are used for populating an ID-column.
Sequences have several disadvantages:
• Sequences have to be created, maintained and deployed.
• Sequences need PL/SQL-code to be fetched (the ‘select into’ construction is obsolete in 11g, but defaulting a column with sequence.nextval still does not work).
• When you introduce triggers to fetch the .nextval value, you introduce significant undesired performance penalties
• When moving your data from data from one environment to another (e.g. copy PRODUCTION data to ACCEPTANCE), you have to synchronize the .LASTVAL of all sequences.
• Fetching sequences can be a performance-critical in big environment (bulk fetch in Data warehouse for instance) which has to be fixed by increasing the CACHE property
Use guid's to overcome the disadvantages of sequences:
create table employees
( guid raw(32) default sys_guid() not null primary key);

I knew about the SYS_GUID() function, but didn't realize you can use it as a default value for a primary key column. That sure is convenient. Where other RDBMS all have some sort of autonumber/identity column to automatically fill an ID column, Oracle uses sequences, which is a little more cumbersome to implement. However, I found the story not really balanced. Before addressing each point and mentioning some of the disadvantages of SYS_GUID, let's see how it works. Using sequences, a simple setup can be like this:

rwijk@ORA11GR1> create sequence seq cache 1000
2 /

Sequence created.

Here I create the sequence with "cache 1000", because I want a session to fetch 1000 numbers at once, instead of one each time. This reduces the overhead for Oracle to maintain the SYS.SEQ$ table. Gaps in the numbers might be large this way, but this shouldn't bother you. If it does bother you, you likely shouldn't be using a sequence, as they are never gap-free. A table is created without any reference to the sequence:

rwijk@ORA11GR1> create table t_seq
2 ( id number(9) primary key
3 , filler varchar2(1000)
4 )
5 /

Table created.

To fill the id column with the sequence value, I prefer to just use the nextval function directly in my api. Another option is to use database triggers, but I'm on a mission to avoid those whenever possible. The next procedure represents such an api:

rwijk@ORA11GR1> create procedure p_seq
2 is
3 begin
4 insert into t_seq (id,filler)
5 select seq.nextval
6 , lpad('*',1000,'*')
7 from dual
8 connect by level <= 20000
9 ;
10 end;
11 /

Procedure created.


Using the SYS_GUID() function, a similar setup would look like this:

rwijk@ORA11GR1> create table t_sysguid
2 ( id raw(16) default sys_guid() primary key
3 , filler varchar2(1000)
4 )
5 /

Table created.

rwijk@ORA11GR1> create procedure p_sysguid
2 is
3 begin
4 insert into t_sysguid (filler)
5 select lpad('*',1000,'*')
6 from dual
7 connect by level <= 20000
8 ;
9 end;
10 /

Procedure created.


One object less (the sequence), and you don't need to bother about the ID column in the insert statement. The mentioned disadvantages of sequences per point:

Sequences have to be created, maintained and deployed
While obviously true, I don't think it is a major point. A sequence never comes alone, so we are already creating, maintaining and deploying database objects. But it adds up a little indeed.


Sequences need PL/SQL-code to be fetched (the ‘select into’ construction is obsolete in 11g, but defaulting a column with sequence.nextval still does not work)
You don't necessarily need PL/SQL to use a sequence, as shown above. So this is untrue. Although it is true that you cannot default a column with sequence.nextval.


When you introduce triggers to fetch the .nextval value, you introduce significant undesired performance penalties
So don't introduce database triggers to fetch the .nextval value :-)


When moving your data from data from one environment to another (e.g. copy PRODUCTION data to ACCEPTANCE), you have to synchronize the .LASTVAL of all sequences
This is true. Very minor disadvantage though, in my opinion. If we refresh the acceptance database, we don't only refresh the data, but also the objects itself.


Fetching sequences can be a performance-critical in big environment (bulk fetch in Data warehouse for instance) which has to be fixed by increasing the CACHE property.
The solution to the possible problem is already presented here.


No advantages of sequences over SYS_GUID are mentioned, so here are some of mine:

A RAW(16) column takes up more space than a NUMBER column.
The SYS_GUID is a RAW(16) taking up 16 bytes, whereas as an ID populated by a sequences typically is a NUMBER(6), NUMBER(9) or NUMBER(12). A reasonable average is NUMBER(9) which takes up only 4 bytes. When you have foreign key columns in your table, this effect is multiplied.


Ad-hoc querying has become more cumbersome.
Which one would you rather type. This:
select id from t_sysguid where id = '48000DA3C76E41A789C7E8925A91E28A'
Or this:
select id from t_seq where id = 1234


You can no longer see the order of inserts by the ID column.
When using the cache property and multiple sessions, this is neither necessarily true for sequence populated ID columns, but there is a strong correlation between the number and the time it was populated. This "information" is not there with SYS_GUID() populated columns.


Performance seems to be slightly less when using SYS_GUID()
The results of my test varied a lot. But generally it favoured the sequence approach. Here is the output of one of the more representative tests:

rwijk@ORA11GR1> begin
2 runstats_pkg.rs_start;
3 p_seq;
4 runstats_pkg.rs_middle;
5 p_sysguid;
6 runstats_pkg.rs_stop(1000);
7 end;
8 /
Run1 draaide in 1670 hsecs
Run2 draaide in 2220 hsecs
Run1 draaide in 75.23% van de tijd

Naam Run1 Run2 Verschil
STAT.recursive calls 1,162 49 -1,113
LATCH.enqueue hash chains 1,626 508 -1,118
STAT.hot buffers moved to head of LRU 1,213 75 -1,138
STAT.dirty buffers inspected 5,613 3,266 -2,347
STAT.free buffer requested 6,215 3,482 -2,733
STAT.free buffer inspected 6,678 3,577 -3,101
STAT.redo subscn max counts 4,147 753 -3,394
STAT.heap block compress 3,678 7,119 3,441
LATCH.cache buffer handles 4,922 0 -4,922
STAT.Heap Segment Array Inserts 12,512 7,156 -5,356
STAT.HSC Heap Segment Block Changes 12,582 7,156 -5,426
STAT.calls to get snapshot scn: kcmgss 6,752 152 -6,600
LATCH.checkpoint queue latch 17,879 9,818 -8,061
LATCH.cache buffers lru chain 27,793 13,845 -13,948
LATCH.object queue header operation 40,308 25,913 -14,395
STAT.consistent gets 26,810 7,302 -19,508
STAT.consistent gets from cache 26,810 7,302 -19,508
STAT.consistent gets from cache (fastpath) 19,687 145 -19,542
STAT.redo entries 42,552 71,492 28,940
STAT.db block gets from cache (fastpath) 21,080 57,202 36,122
STAT.session logical reads 98,920 144,900 45,980
LATCH.simulator hash latch 5,752 56,028 50,276
STAT.db block gets from cache 72,110 137,598 65,488
STAT.db block gets 72,110 137,598 65,488
STAT.db block changes 63,379 129,357 65,978
LATCH.sequence cache 150,100 0 -150,100
LATCH.cache buffers chains 396,737 572,383 175,646
STAT.undo change vector size 2,564,000 4,921,872 2,357,872
STAT.physical read total bytes 17,776,640 23,355,392 5,578,752
STAT.physical read bytes 17,776,640 23,355,392 5,578,752
STAT.cell physical IO interconnect bytes 17,776,640 23,355,392 5,578,752
STAT.physical IO disk bytes 17,776,640 23,355,392 5,578,752
STAT.redo size 59,361,376 66,097,504 6,736,128

Run1 latches totaal versus run2 -- verschil en percentage
Run1 Run2 Verschil Pct
679,069 707,672 28,603 95.96%

PL/SQL procedure successfully completed.


So while I am certainly not saying no to SYS_GUID(), I thought there was a little more to the story than the all-positive points in the link.

If you know some more advantages or disadvantages from one over the other, please say so in a comment.

15 comments:

  1. Indexes ! Sequence generated values will always increment and be added at the high end of the index.
    SYS_GUID values will be all over the place. On the plus side, that may reduce contention for hot blocks , but on the minus side your index inserts will be all over the place. All the arguments for/against reverse key indexes would apply.

    ReplyDelete
  2. I've had a few people trying to convince me to use SYS_GUIDs, I'm just not buying it yet. I'm not convinced as to any real or perceived advantage.

    "You can no longer see the order of inserts by the ID column."
    You probably know this, but I thought it worth mentioning, on a RAC system you can't count on the sequence numbers to be in order. That quickly stopped me from using the ID for any type of ordering.

    chet

    ReplyDelete
  3. Hi.

    I've used SYS_GUID in a project and it worked fine. Actually, the project in question used a combination of SYG_GUID and a Java stored procedure to generate UUIDs.

    http://www.oracle-base.com/articles/9i/UUID9i.php

    I wouldn't say either method (sequences or GUIDs) are a clear winner. As always, both methods have their pros and cons.

    If you know you will be transferring data between systems, then avoiding sequences is actually a very big plus.

    Cheers

    Tim...

    ReplyDelete
  4. I can vouch for comparative slowness of guid over sequences; see Unique IDs for Multi-master Replication.

    Regards Nigel

    ReplyDelete
  5. Hi Rob,

    Interesting post! There also must be a reasons why GUIDs are used in for example BPEL and ESB schemas.
    A quick performance comparison can be found here.
    Other links on the subject.

    Regards,
    Peter Paul

    ReplyDelete
  6. Excellent comments, all. Thanks for the extra (dis)advantages, twists (I forgot about RAC!), thoughts and the links!

    Regards,
    Rob.

    ReplyDelete
  7. You *can* guarantee sequence order in a RAC system (but you do incur some cross-instance traffic to do so).

    Cheers
    Connor

    ReplyDelete
  8. Thanks Connor.

    I just looked up some information and found this article about sequences and RAC: http://www.pythian.com/news/383/sequences-in-oracle-10g-rac/.

    ReplyDelete
  9. Is the SYS_GUID is incremental. I.e. can it's output be in sequencial order? or its just a unique string?

    ReplyDelete
  10. @Rohan: it's not sequential, it's hust a string:

    SQL> select sys_guid() from dual;

    SYS_GUID()
    --------------------------------
    9B1085FF611A4518A3211A495E36B703

    SQL> /

    SYS_GUID()
    --------------------------------
    9BAE7CE712D24D90A289861086B450C7

    SQL>

    HTH, Nigel

    ReplyDelete
  11. I should add, it isn't monotonic either; next two selects are:

    SQL> /

    SYS_GUID()
    --------------------------------
    A72FA8729A38444B82AEDEB875E6DF47

    SQL> /

    SYS_GUID()
    --------------------------------
    886B805C723340F3BD37F2C2A322ECC3

    SQL>

    Regards

    ReplyDelete
  12. FYI, I just had opposite results testing 1 million inserts with RAW(16) GUID vs NUMBER(16) sequence-based key.

    About 13 seconds (on multiple runs) for GUID.
    About 30 seconds (on multiple runs) for Sequence.

    Connected to Oracle Database 11g Enterprise Edition Release 11.1.0.7.0

    drop table mat;
    create table mat (
    mat_id raw(16) default sys_guid() primary key,
    mat_val number
    );
    insert into mat (
    mat_val
    )
    select level as mat_val
    from dual
    connect by level < 1000000
    ;
    commit;

    drop table mat;
    drop sequence mat_seq;
    create table mat (
    mat_id number(16) primary key,
    mat_val number
    );
    create sequence mat_seq;
    insert into mat (
    mat_id,
    mat_val
    )
    select mat_seq.nextval as mat_id,
    level as mat_val
    from dual
    connect by level < 1000000
    ;
    commit;

    ReplyDelete
  13. Another con to sequences in a web environment is security. Sequence IDs can be easily guessed and allow access to data in nearby rows in the database.

    ReplyDelete
    Replies
    1. If your app allows access to any row given an arbitrary key value, it's already insecure, regardless of whether you use a sequence or not.

      Delete
    2. Like Jeffrey mentionend, if guessing ID's breaks your system, you should have a look for another job. Basically i had to change for several years from Oracle to SQL Server, where this "crazy idea" of GUIDs is omnipresent. in short:
      Tons of waste data to represent easy, small and basically useless information. as a PK is just an arbitrary value, you do not have to order it or whatever and indeed, readable ID-Values make life of a developer really much easier and less error prawn (similar looking is not similar value !).
      In real large and high concurrency systems GUID's are the real performance killers, as KEYS are usually Indexed, just the UK-Check on inserts raise for example physically/logically reads dramatically, especially if tables are not partitioned (talkin about billion rows tables here..). Imagine a table with 5 billion rows .. just calc size of the index ..and if you have more than one table..as usual a model..you'll need tons of RAM just for indexes ...and now have 200.000 users doing concurrent inserts (by luck no update/delete).. i had this situation. only possibility to safe system before death was to refactor guids to int .. and all was fine. .. so like always, for small systems .. it probably doesnt matter, but if you plan to go far/big, NEVER use it !

      Delete