Friday, March 9, 2012

Keeping relational design with XML

I'm trying to prove or disprove whether or not it is best
to keep a relational db structure and use the built in
functionality for XML. I have a developer who want to
just plop the entire XML string into a single column. I
can prove the performance, but I'm hoping to find some
documentation to back myself up and don't really find any
reference to the fact that the schema of the database
should still be designed relationally. Does that make
any sense, can anyone help, do you know of any resources?
Please and Thanks,
Rachel
Start here:
http://www.oceaninformatics.biz/publications/e2.pdf
http://www.dbazine.com/pascal4.html
http://searchdatabase.techtarget.com...511729,00.html
"Rachel" <anonymous@.discussions.microsoft.com> wrote in message
news:168f01c4ad46$faa638d0$a301280a@.phx.gbl...
> I'm trying to prove or disprove whether or not it is best
> to keep a relational db structure and use the built in
> functionality for XML. I have a developer who want to
> just plop the entire XML string into a single column. I
> can prove the performance, but I'm hoping to find some
> documentation to back myself up and don't really find any
> reference to the fact that the schema of the database
> should still be designed relationally. Does that make
> any sense, can anyone help, do you know of any resources?
> Please and Thanks,
> Rachel
>
|||Hmm. The first paper is mainly arguing that you should not expect XML to
help you solve semantic heterogeneity. Well, that is fine, but neither does
the relational model. And the paper does not address the question asked.
The second and third articles are nitpick articles by Fabian Pascal who has
his own way of looking at things and again, do not give a good reply to the
question asked (except for adding FUD and "proof by authority").
The third one needs registration.
To give you a more constructive answer, consider the following:
1. How is the data represented in the programming layer? Objects, XML
documents, individual information, DataSet?
2. What do you want to do with the data, once it is in the database? Do you
want to query it, repurpose it (ie, the XML has customer elements containing
order elements but you may want to run reports only over orders), update it
at a property level? If so, you probably want to shred it into relational
form (assuming that you can extract the information that way) and use XML
publishing functionality (in SQL Server 2000 look at OpenXML, FOR XML or
annotated schema).
If you mainly want to store the XML and retrieve it as such (basically cache
it to disk), the storing it as XML if it was already XML in the program
layer seems like a good idea.
3. Does the XML represent hierarchical relational data or document markup,
where order is important?
For some of that, feel free to take a look at my TechEd presentation on XML
that contains a slide about when to go relational and not XML and when to
consider tradeoffs. You can find it from on my Blog at
http://sqljunkies.com/weblog/mrys
Best regards
Michael
"Adam Machanic" <amachanic@.hotmail._removetoemail_.com> wrote in message
news:eW0UeeUrEHA.2244@.TK2MSFTNGP15.phx.gbl...
> Start here:
> http://www.oceaninformatics.biz/publications/e2.pdf
> http://www.dbazine.com/pascal4.html
> http://searchdatabase.techtarget.com...511729,00.html
> "Rachel" <anonymous@.discussions.microsoft.com> wrote in message
> news:168f01c4ad46$faa638d0$a301280a@.phx.gbl...
>
|||"Michael Rys [MSFT]" <mrys@.online.microsoft.com> wrote in message
news:%23zaIqpZrEHA.1952@.TK2MSFTNGP12.phx.gbl...
> Hmm. The first paper is mainly arguing that you should not expect XML to
> help you solve semantic heterogeneity. Well, that is fine, but neither
does
> the relational model. And the paper does not address the question asked.
The first paper was posted because the most common argument I hear from
people who want to store their data in XML form rather than relational form
is that it facilitates rapid development by allowing software engineers to
use XML parsers rather than loop through recordsets and allows them to
"look" at the data in order to debug issues more quickly. This paper points
out some of the problems with the latter argument.

> The second and third articles are nitpick articles by Fabian Pascal who
has
> his own way of looking at things and again, do not give a good reply to
the
> question asked (except for adding FUD and "proof by authority").
Why do they not give a good reply? They very clearly state issues with
storing data in XML format. Some of the arguments, whcih I find to be
compelling:
XML lacks a formal data model and query algebra. XML is rigidly
hierarchical. XML does not provide data integrity features. (see second
article)
And there is a good reason for all of this: XML is not designed for
data management. XML is designed for document management, an entirely
different field. (see third article)

> To give you a more constructive answer, consider the following:
> 1. How is the data represented in the programming layer? Objects, XML
> documents, individual information, DataSet?
This generally has little bearing on how the data should be stored in
the database. Storing data based on how it's represented in application
code creates a tight coupling between the two that defeats the ideas of
application agnostic databases and encapsulation of the data model by the
DBMS. This can create huge problems when it comes time to create new
applications that use the same data. There are good reasons that we moved
from embedding SQL in the application to using stored procedures.

> If you mainly want to store the XML and retrieve it as such (basically
cache
> it to disk), the storing it as XML if it was already XML in the program
> layer seems like a good idea.
I agree with this, but only to a point. I have seen too many projects
where the original scope, "we will not query this XML", was later changed
to, "we need to query some of this XML". A great deal of pain would have
been avoided by storing the data relationally from the beginning.
|||Michael,
Can you please provide a brief description of, perhaps with an example, what
semantic heterogeneity is?
I assume it means some kind of miscommunication of meanings. BTW, I did some
googling, the term is being used extensively by the research community, but
could not find a meaningful definition in the context of databases.
Anith
|||See below.
Best regards
Michael
"Adam Machanic" <amachanic@.hotmail._removetoemail_.com> wrote in message
news:e1hpVttrEHA.3896@.TK2MSFTNGP15.phx.gbl...
> "Michael Rys [MSFT]" <mrys@.online.microsoft.com> wrote in message
> news:%23zaIqpZrEHA.1952@.TK2MSFTNGP12.phx.gbl...
> does
> The first paper was posted because the most common argument I hear from
> people who want to store their data in XML form rather than relational
> form
> is that it facilitates rapid development by allowing software engineers to
> use XML parsers rather than loop through recordsets and allows them to
> "look" at the data in order to debug issues more quickly. This paper
> points
> out some of the problems with the latter argument.
> has
> the
> Why do they not give a good reply? They very clearly state issues with
> storing data in XML format. Some of the arguments, whcih I find to be
> compelling:
> XML lacks a formal data model and query algebra. XML is rigidly
> hierarchical. XML does not provide data integrity features. (see second
> article)
The formal data model and query algebra is what the XQuery WG is working on
(currently in Last Call). And there are more research papers on alternative
data models and query algebras since at least 1998 (I am author of at least
two).
So claiming that such a thing does not exist is - let's say - either
ignorant or worse.
Data integrity features exist with DTD and XML Schema. However they do not
follow the relation FK-PK model. And that relational database systems (or to
use Fabian's "term" SQL database systems) do not yet provide a direct FK-PK
mechanism between a relational field and some XML element/attribute has more
to do with the product-cycles not having progressed that far than any
theoretical problem.

> And there is a good reason for all of this: XML is not designed for
> data management. XML is designed for document management, an entirely
> different field. (see third article)
Well, document management also needs formal data models and algebras.
Especially if you start using XML as a tree model to describe semistructured
data (see Lore and UnQL for some pre-XML semistructured data models).

> This generally has little bearing on how the data should be stored in
> the database. Storing data based on how it's represented in application
> code creates a tight coupling between the two that defeats the ideas of
> application agnostic databases and encapsulation of the data model by the
> DBMS. This can create huge problems when it comes time to create new
> applications that use the same data. There are good reasons that we moved
> from embedding SQL in the application to using stored procedures.
Actually, if the data is not represented in XML then you probably should ask
yourself twice whether you want to store it as such. So therefore there is a
very good reason to ask this question. Also, if the data indeed is an XML
_document_ then knowing that in order to know how to store it is also
important (since the general shredding does not work).

> cache
> I agree with this, but only to a point. I have seen too many projects
> where the original scope, "we will not query this XML", was later changed
> to, "we need to query some of this XML". A great deal of pain would have
> been avoided by storing the data relationally from the beginning.
If it fits. Often it may not. And that's where the XQuery support in SQL
Server 2005 comes to play.
Best regards
Michael
>
|||This is a category of problems associated with different parties (companies,
applications etc.) using terms that are semantically related but different
or semantically not related but the same. This miscommunication of meaning
is what makes interoperability between different database schemata or
database and application schema more difficult. It is often used as a
counter-argument to the claim that XML helps interop. The point is that the
issue is not specific to XML. XML has other benefits in interop (can
represent hierarchies which CSV cannot, parsers are widely available, is
based on Unicode (thus international)) but does not address this.
Best regards
Michael
"Anith Sen" <anith@.bizdatasolutions.com> wrote in message
news:%235Ld4TzrEHA.3980@.TK2MSFTNGP12.phx.gbl...
> Michael,
> Can you please provide a brief description of, perhaps with an example,
> what semantic heterogeneity is?
> I assume it means some kind of miscommunication of meanings. BTW, I did
> some googling, the term is being used extensively by the research
> community, but could not find a meaningful definition in the context of
> databases.
> --
> Anith
>
|||Thanks for your response. Since you seem to be knowledgeable in XML related
data management issues, perhaps you could clear up some of my confusion
regarding XML's utility.
[vbcol=seagreen]
I fail to see any of the above points to be reasonable to overwhelmingly
consider XML as an alternative. Why can't we use comma separated path
enumerated strings to represent hierarchies in an ASCII file, since the
structure is always an acyclic graph? For instance, like:
Root, Level1, Level2,...Level n-2, Leaf
Regarding the availability of parsers, it is like putting the cart before
the horse. Nobody wrote XML parsers before XML's invention, right?. So
claiming that as a benefit can be moot. On the other hand, it may probably
take about a few hours for an average programmer to come up with a CSV
parser in his/her favorite programming language.
Regarding UNICODE support, how is an n-character CSV any different? Tons of
APIs are out there which could convert non-Unicode text to 16-bit encoded
Unicode & vice-versa.
[vbcol=seagreen]
Actually regarding interoperability, I was under that impression as well
till now. If that is not related to XML and if XML has its place in data
management, which specific data management problem that is not addressed in
a relational system, does XML address?
Again, thanks for your time.
Anith
|||See below.
Best regards
Michael
"Anith Sen" <anith@.bizdatasolutions.com> wrote in message
news:OYJk5m1rEHA.896@.TK2MSFTNGP12.phx.gbl...
> Thanks for your response. Since you seem to be knowledgeable in XML
> related data management issues, perhaps you could clear up some of my
> confusion regarding XML's utility.
>
> I fail to see any of the above points to be reasonable to overwhelmingly
> consider XML as an alternative. Why can't we use comma separated path
> enumerated strings to represent hierarchies in an ASCII file, since the
> structure is always an acyclic graph? For instance, like:
> Root, Level1, Level2,...Level n-2, Leaf
Sure you can. But you will have to document the format, make sure that it
can work across Windows, Unix, mainframe in different codepage encodings,
and even then nobody else will understand it or will be locked into your
format (note the wide availability of XML as the common syntax allows you to
build decoupled systems).

> Regarding the availability of parsers, it is like putting the cart before
> the horse. Nobody wrote XML parsers before XML's invention, right?. So
> claiming that as a benefit can be moot. On the other hand, it may probably
> take about a few hours for an average programmer to come up with a CSV
> parser in his/her favorite programming language.
Correct. Everyone comes up with his or her own format and you cannot use
commoditized tools. Is that really a benefit?

> Regarding UNICODE support, how is an n-character CSV any different? Tons
> of APIs are out there which could convert non-Unicode text to 16-bit
> encoded Unicode & vice-versa.
It is not the Unicode per se, but also the ability of the format to be
fairly good at communicating what encoding the document actually is in.

> Actually regarding interoperability, I was under that impression as well
> till now. If that is not related to XML and if XML has its place in data
> management, which specific data management problem that is not addressed
> in a relational system, does XML address?
Can I point you to my TechEd 2004 presentation linked from
http://sqljunkies.com/WebLog/mrys/ar.../08/2967.aspx?
In general:
- data exchange message format (for loosely coupled systems)
- document management (where you use XML as a markup format) because you can
use XQuery/XPath to query it.
- semi-structured data representation

> Again, thanks for your time.
> --
> Anith
>
|||>> Sure you can. But you will have to document the format, make sure that it[vbcol=seagreen]
In other words, if we can standardize the documentation of formats ( like
DTD, schema ), CSVs can also address the same data exchange issues.
[vbcol=seagreen]
In other words, if we can standardize the CSV, it will no longer be an
XML-specific benefit.
[vbcol=seagreen]
In other words, if we can standardize various encodings, CSV can address
this issue as well.
So it is clear that there is nothing special to XML other than it is a
well-standardized text format which is poised to be ( or already ) widely
popular. In a nutshell, one can conclude that we could have come up with any
formats or any text based structures and as long as it is standardized, it
can address the same data exchange problems that XML is intended to address.
Am I correct?
[vbcol=seagreen]
[vbcol=seagreen]
Thanks for the references. I have several questions with the points on your
presentations, but this may not be the right venue to address them.
Anith

No comments:

Post a Comment