Monday, March 29, 2010

Shredding XML into multiple tables in one shot

To insert the data from an XML document into multiple relational tables, you can use PL/SQL, loop statements and the extract function. But I wanted to know whether it is possible to do it using only SQL. With the XMLTable function you can transform an XML document to a relational format. This post will show how you can use the XMLTable function to store the data into relational tables with a master-detail-detail relationship, when the connecting columns are not part of the XML; they are surrogate key columns.

For this example, I'll use this XML:

SQL> select xml
2 from xml_documents_to_process
3 /

XML
----------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<msg:message xmlns:msg="namespace1">
<msg:metadata>
<msg:CreationTime>2010-03-16T15:23:14.56</msg:CreationTime>
</msg:metadata>
<msg:content xmlns:tcim="namespace2">
<tcim:ReportingDate>2010-03-19</tcim:ReportingDate>
<tcim:Persons>
<tcim:Person>
<tcim:Name>Jeffrey Lebowski</tcim:Name>
<tcim:Items>
<tcim:Item>
<tcim:Name>Rug</tcim:Name>
<tcim:Remarks>Really ties the room together</tcim:Remarks>
</tcim:Item>
<tcim:Item>
<tcim:Name>Creedence Tapes</tcim:Name>
<tcim:Remarks>Stolen</tcim:Remarks>
</tcim:Item>
<tcim:Item>
<tcim:Name>Toe</tcim:Name>
<tcim:Remarks>With nail polish</tcim:Remarks>
</tcim:Item>
</tcim:Items>
</tcim:Person>
<tcim:Person>
<tcim:Name>Jesus Quintana</tcim:Name>
<tcim:Items>
<tcim:Item>
<tcim:Name>Bowling ball</tcim:Name>
<tcim:Remarks>Warmed up</tcim:Remarks>
</tcim:Item>
</tcim:Items>
</tcim:Person>
<tcim:Person>
<tcim:Name>Walter Sobchak</tcim:Name>
</tcim:Person>
</tcim:Persons>
</msg:content>
</msg:message>


1 row selected.

The contents have to be stored in these three tables:

SQL> create table messages
2 ( guid raw(16) primary key
3 , creation_time timestamp(2) not null
4 , reporting_date date not null
5 )
6 /

Table created.

SQL> create table persons
2 ( guid raw(16) primary key
3 , message_guid raw(16) not null references messages(guid)
4 , name varchar2(30) not null
5 )
6 /

Table created.

SQL> create table items
2 ( guid raw(16) primary key
3 , person_guid raw(16) not null references persons(guid)
4 , name varchar2(20) not null
5 , remarks varchar2(40) not null
6 )
7 /

Table created.

Note that I use globally unique identifiers (guid's) here, because I could not make the example below work when using sequences. I challenge you to show me a reasonable way to do the same as I do below, using sequences :-). I really like to know. But for now, I know one more advantage of SYS_GUID() over sequences.

Back to the example. Let me first show how to shred this XML into these three tables in one shot. The explanation comes afterwards.

SQL> create procedure from_xml_to_relational (p_xml in xmltype)
2 is
3 begin
4 insert all
5 when add_messages_indicator = 'Y'
6 then
7 into messages values
8 ( m_guid
9 , creation_time
10 , reporting_date
11 )
12 when add_persons_indicator = 'Y'
13 then
14 into persons values
15 ( p_guid
16 , m_guid
17 , p_name
18 )
19 when add_items_indicator = 'Y'
20 then
21 into items values
22 ( i_guid
23 , p_guid
24 , i_name
25 , remarks
26 )
27 select case when p.id = 1 and nvl(i.id,1) = 1 then 'Y' end add_messages_indicator
28 , case when nvl(i.id,1) = 1 then 'Y' end add_persons_indicator
29 , nvl2(i.id,'Y',null) add_items_indicator
30 , first_value(sys_guid()) over (partition by m.id order by p.id,i.id) m_guid
31 , first_value(sys_guid()) over (partition by m.id,p.id order by i.id) p_guid
32 , sys_guid() i_guid
33 , to_timestamp(m.creation_time,'yyyy-mm-dd"T"hh24:mi:ss.ff') creation_time
34 , to_date(m.reporting_date,'yyyy-mm-dd') reporting_date
35 , p.name p_name
36 , i.name i_name
37 , i.remarks remarks
38 from xmltable
39 ( xmlnamespaces ('namespace1' as "msg", default 'namespace2')
40 , 'msg:message'
41 passing p_xml
42 columns id for ordinality
43 , creation_time varchar2(50) path 'msg:metadata/msg:CreationTime'
44 , reporting_date varchar2(50) path 'msg:content/ReportingDate'
45 , persons xmltype path 'msg:content/Persons'
46 ) m
47 , xmltable
48 ( xmlnamespaces (default 'namespace2')
49 , 'Persons/Person'
50 passing m.persons
51 columns id for ordinality
52 , name varchar2(30) path 'Name'
53 , items xmltype path 'Items'
54 ) p
55 , xmltable
56 ( xmlnamespaces(default 'namespace2')
57 , 'Items/Item'
58 passing p.items
59 columns id for ordinality
60 , name varchar2(20) path 'Name'
61 , remarks varchar2(40) path 'Remarks'
62 ) (+) i
63 ;
64 end;
65 /

Procedure created.

The first challenge is to join the three levels in this XML with each other. This is done by passing a part of the XML to the lower level. For example, in line 45 the Persons-part of the Message XML is caught and passed on to subquery p in line 50. This is repeated from Persons to Items in line 53 and 58.

A second challenge is to do an outerjoin with XMLTable. This wasn't a challenge after all, because - as you can see - all you have to do is to add a (+) sign after the XMLTable function.

The resulting query produces 5 rows. The third challenge is to make the INSERT ALL statement do only 1 insert into the messages table, 3 inserts into persons and 4 inserts into items. To support this challenge, I introduced an "id for ordinality" numeric column. This id column helps to construct expressions add_messages_indicator, add_persons_indicator and add_items_indicator. The id's also help for constructing the right guid's for each row to link the related records to each other. To see how these expressions work, let's show the result of the select:

SQL> select case when p.id = 1 and nvl(i.id,1) = 1 then 'Y' end add_messages_indicator
2 , case when nvl(i.id,1) = 1 then 'Y' end add_persons_indicator
3 , nvl2(i.id,'Y',null) add_items_indicator
4 , first_value(sys_guid()) over (partition by m.id order by p.id,i.id) m_guid
5 , first_value(sys_guid()) over (partition by m.id,p.id order by i.id) p_guid
6 , sys_guid() i_guid
7 , to_timestamp(m.creation_time,'yyyy-mm-dd"T"hh24:mi:ss.ff') creation_time
8 , to_date(m.reporting_date,'yyyy-mm-dd') reporting_date
9 , p.name p_name
10 , i.name i_name
11 , i.remarks remarks
12 from xml_documents_to_process x
13 , xmltable
14 ( xmlnamespaces ('namespace1' as "msg", default 'namespace2')
15 , 'msg:message'
16 passing x.xml
17 columns id for ordinality
18 , creation_time varchar2(50) path 'msg:metadata/msg:CreationTime'
19 , reporting_date varchar2(50) path 'msg:content/ReportingDate'
20 , persons xmltype path 'msg:content/Persons'
21 ) m
22 , xmltable
23 ( xmlnamespaces (default 'namespace2')
24 , 'Persons/Person'
25 passing m.persons
26 columns id for ordinality
27 , name varchar2(30) path 'Name'
28 , items xmltype path 'Items'
29 ) p
30 , xmltable
31 ( xmlnamespaces(default 'namespace2')
32 , 'Items/Item'
33 passing p.items
34 columns id for ordinality
35 , name varchar2(20) path 'Name'
36 , remarks varchar2(40) path 'Remarks'
37 ) (+) i
38 /

ADD ADD ADD M_GUID P_GUID
--- --- --- -------------------------------- --------------------------------
I_GUID CREATION_TIME REPORTING_DATE
-------------------------------- -------------------- -------------------
P_NAME
--------------------------------------------------------------------------------
I_NAME
------------------------------------------------------------
REMARKS
------------------------------
Y Y Y 82F643037E04E60FE0400000000001C0 82F643037E04E60FE0400000000001C0
82F643037E09E60FE0400000000001C0 16-03-10 15:23:14,56 19-03-2010 00:00:00
0000000
Jeffrey Lebowski
Rug
Really ties the room together

Y 82F643037E04E60FE0400000000001C0 82F643037E04E60FE0400000000001C0
82F643037E0AE60FE0400000000001C0 16-03-10 15:23:14,56 19-03-2010 00:00:00
0000000
Jeffrey Lebowski
Creedence Tapes
Stolen

Y 82F643037E04E60FE0400000000001C0 82F643037E04E60FE0400000000001C0
82F643037E0BE60FE0400000000001C0 16-03-10 15:23:14,56 19-03-2010 00:00:00
0000000
Jeffrey Lebowski
Toe
With nail polish

Y Y 82F643037E04E60FE0400000000001C0 82F643037E07E60FE0400000000001C0
82F643037E0CE60FE0400000000001C0 16-03-10 15:23:14,56 19-03-2010 00:00:00
0000000
Jesus Quintana
Bowling ball
Warmed up

Y 82F643037E04E60FE0400000000001C0 82F643037E08E60FE0400000000001C0
82F643037E0DE60FE0400000000001C0 16-03-10 15:23:14,56 19-03-2010 00:00:00
0000000
Walter Sobchak




5 rows selected.

The guid's all look alike, but don't be fooled (as I was initially): they differ where they need to differ. Just look at byte 6 (position 12). If you look closely, you'll see that the column m_guid all contain the same guid's. The first three p_guid's are the same, but row 4 and 5 differ. This is what the partition clause in the first_value analytic function achieves. The i_guid's all differ. With the guid's defined like this, the foreign key columns refer to the right parent records.

Finally, let's show that the INSERT ALL statement has done its job:

SQL> declare
2 l_xml xmltype;
3 begin
4 select xml
5 into l_xml
6 from xml_documents_to_process
7 ;
8 from_xml_to_relational(l_xml);
9 end;
10 /

PL/SQL procedure successfully completed.

SQL> select * from messages
2 /

GUID CREATION_TIME REPORTING_DATE
-------------------------------- -------------------- -------------------
82F643037DDCE60FE0400000000001C0 16-03-10 15:23:14,56 19-03-2010 00:00:00

1 row selected.

SQL> select * from persons
2 /

GUID MESSAGE_GUID NAME
-------------------------------- -------------------------------- --------------------
82F643037DDCE60FE0400000000001C0 82F643037DDCE60FE0400000000001C0 Jeffrey Lebowski
82F643037DDFE60FE0400000000001C0 82F643037DDCE60FE0400000000001C0 Jesus Quintana
82F643037DE0E60FE0400000000001C0 82F643037DDCE60FE0400000000001C0 Walter Sobchak

3 rows selected.

SQL> select * from items
2 /

GUID PERSON_GUID NAME REMARKS
-------------------------------- -------------------------------- -------------------- ------------------------------
82F643037DE1E60FE0400000000001C0 82F643037DDCE60FE0400000000001C0 Rug Really ties the room together
82F643037DE2E60FE0400000000001C0 82F643037DDCE60FE0400000000001C0 Creedence Tapes Stolen
82F643037DE3E60FE0400000000001C0 82F643037DDCE60FE0400000000001C0 Toe With nail polish
82F643037DE4E60FE0400000000001C0 82F643037DDFE60FE0400000000001C0 Bowling ball Warmed up

4 rows selected.

SQL> select m.creation_time
2 , m.reporting_date
3 , p.name
4 , i.name
5 , i.remarks
6 from messages m
7 , persons p
8 , items i
9 where m.guid = p.message_guid
10 and p.guid = i.person_guid (+)
11 /

CREATION_TIME REPORTING_DATE NAME NAME REMARKS
-------------------- ------------------- -------------------- -------------------- ------------------------------
16-03-10 15:23:14,56 19-03-2010 00:00:00 Jeffrey Lebowski Rug Really ties the room together
16-03-10 15:23:14,56 19-03-2010 00:00:00 Jeffrey Lebowski Creedence Tapes Stolen
16-03-10 15:23:14,56 19-03-2010 00:00:00 Jeffrey Lebowski Toe With nail polish
16-03-10 15:23:14,56 19-03-2010 00:00:00 Jesus Quintana Bowling ball Warmed up
16-03-10 15:23:14,56 19-03-2010 00:00:00 Walter Sobchak

5 rows selected.

7 comments:

  1. Thank you Rob for this helpful tip. I didn't know what term to use to google this but I'm truely glad for finding this.

    ReplyDelete
  2. Thanks Rob, this has proved very helpful. Brilliant.

    ReplyDelete
  3. Mate,
    It is an excellent post on dealing with xml files with a very good, easy to follow and understand example. specially when there isn't that many posts on this topic.

    WELL DONE & THANKS

    ReplyDelete
  4. Hey Rob,

    I had meant to mention this sometime back but hadn't actually done it. Since you have mentioned in the above post that you could not get this working using sequences, will the technique used in http://dspsd.blogspot.co.uk/2012/07/insert-all-masterdetail-data-from-xml.html help?

    ReplyDelete
  5. This is really useful - bookmarked. Excellent.

    ReplyDelete