Edit: the question is stated more clearly at http://www.dbforums.com/showthread.php?p=4477661&posted=1#post4477661
OK, this question is not really SQL Server related, but since we will build the database in MS SQL Server (2000) I think this is the proper forum to ask it.
It's actually quite a simple question, I guess, but my experience with databases is very limited (I am building my first serious database, actually).
Suppose I have two tables, Organization and Sample (samples coming from an organization), with the following fields:
Organization:
InstituteID (identification)
Name
Address
...
Sample:
SampleID (sample identification)
InstituteID (references the Organization table)
SampleDate
Location
...
In general, is it better to use SampleID and InstituteID combined as the primary key in the second (Sample) table (and build an index on the combined key), or to generate a new unique key for each record? And why should I choose one of these options over the other?
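For concreteness, here is a minimal T-SQL sketch of the two options; only the table and column names come from the post, while the data types and constraint names are my assumptions:

```sql
CREATE TABLE Organization (
    InstituteID int          NOT NULL PRIMARY KEY,
    Name        varchar(80)  NOT NULL,
    Address     varchar(120) NULL
)

-- Option 1: composite natural key (InstituteID + SampleID)
CREATE TABLE Sample (
    InstituteID int          NOT NULL REFERENCES Organization (InstituteID),
    SampleID    int          NOT NULL,  -- the institute's own sample number
    SampleDate  datetime     NOT NULL,
    Location    varchar(50)  NULL,
    CONSTRAINT PK_Sample PRIMARY KEY (InstituteID, SampleID)
)

-- Option 2: a generated surrogate key (IDENTITY), with the natural
-- combination still enforced by a separate unique constraint
CREATE TABLE SampleSurrogate (
    SampleKey   int IDENTITY (1, 1) PRIMARY KEY,
    InstituteID int          NOT NULL REFERENCES Organization (InstituteID),
    SampleID    int          NOT NULL,
    SampleDate  datetime     NOT NULL,
    Location    varchar(50)  NULL,
    CONSTRAINT UQ_SampleSurrogate UNIQUE (InstituteID, SampleID)
)
```

Either way the DBMS enforces uniqueness of (InstituteID, SampleID); the difference is which key gets propagated into the child tables.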
One advantage of the first option is that you don't have to generate keys yourself (or use some autonumbering system); the disadvantage is that the key grows bigger and bigger as you add more tables (fractions taken from samples, analyses performed on fractions, analyses giving results).|||This is an age-old question/debate. An artificial key allows you to have a narrow key, but a natural key allows you to cover the query. Depending on who you ask, opinions and supporting *proof* of performance gains can be shown either way.
Personally, I would go with natural keys, but I have used artificial keys in the past.
Here is a sample of previous discussions on this topic.
http://groups.google.com/groups?q=sqlserver+artificial+key+vs+natural+key|||I am already reading a lot about it, so I hope I can decide for myself, but any opinions are still appreciated. One more thing that might be important: when using natural keys in deeper tables (such as results), a natural key can require up to 7 fields to make a record unique.
Edit: Another thing to consider: because of the complexity of the data checking (not done by the DBMS) needed before records are inserted, most of the data will most likely be inserted by a single administrator executing these procedures.|||One more question:
If I have, say, a number of countries (max length 20 characters), should I put them in a separate table (they don't have additional attributes) and use integer numbers to refer to this table, or just store the full country name in the field? If I do put them in a different table, you have to generate IDs for them and need one more join to join the data. On the other hand, you do save space in the database, but is it really worth the hassle?|||If this is an OLTP system, normalization should be observed. Thus, a separate table for COUNTRIES would be needed.
As to your earlier question about the wide multi-column key, I would suggest you search for Kimberly Tripp's article/paper on covering indexes.|||OK, I'm really a starter, so why would that be? For data inserts/updates it will take longer and use more space if countries are not normalized, but for data retrieval, joining another table will make the time to retrieve results longer, will it not?|||I'm confused. Is SampleID unique for each Sample record, as its name implies? If so, then under no circumstances would you want to combine it with InstituteID as a composite key. It is a primary key on its own.
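To make the country question concrete, here is a sketch of both designs; all object names and types here are hypothetical:

```sql
-- Denormalized: store the full name directly (up to 20 bytes per row)
CREATE TABLE InstituteDenorm (
    InstituteID int         NOT NULL PRIMARY KEY,
    Country     varchar(20) NOT NULL
)

-- Normalized: a lookup table plus a small integer foreign key
CREATE TABLE Country (
    CountryID   int IDENTITY (1, 1) PRIMARY KEY,
    CountryName varchar(20) NOT NULL UNIQUE
)

CREATE TABLE InstituteNorm (
    InstituteID int NOT NULL PRIMARY KEY,
    CountryID   int NOT NULL REFERENCES Country (CountryID)
)
```

Retrieval from the normalized version costs one extra join, e.g. `SELECT i.InstituteID, c.CountryName FROM InstituteNorm i JOIN Country c ON c.CountryID = i.CountryID`; in exchange, the lookup table also prevents two differently spelled versions of the same country from creeping into the data.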
If SampleID is not unique for each Sample record, then you might want to rethink your object names. If the Sample record is simply implementing a many-to-many join between Institutes and some other sample table, then an additional surrogate key may not be necessary, but may simplify application programming.
I am in the camp that puts a surrogate key on everything, because composite keys become unwieldy after several layers of depth, and I believe that consistency in development is important.|||Different institutes supply unique samples, though the SampleID institute 1 gives to a sample can be the same as the SampleID institute 2 gives to another sample. That is why only InstituteID + SampleID make a record unique.
From this sample, fractions are taken; the fractions are analyzed, and the analyses give results (so all one-to-many relations).
If I continue using keys the way I use them in the Sample table example, I will end up with the following fields identifying a record in the results table:
InstituteID
SampleID
FractionMethodID
AnalysisMethodID
MeasurementNumber
CASNumber (identifying a result)
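Spelled out as DDL, the fully natural-keyed results table would look something like this (the six key columns come from the list above; the data types and the ResultValue column are my guesses — the point is the six-column primary key):

```sql
CREATE TABLE Result (
    InstituteID       int         NOT NULL,
    SampleID          int         NOT NULL,
    FractionMethodID  int         NOT NULL,
    AnalysisMethodID  int         NOT NULL,
    MeasurementNumber int         NOT NULL,
    CASNumber         varchar(12) NOT NULL,  -- identifies the result
    ResultValue       float       NULL,
    CONSTRAINT PK_Result PRIMARY KEY
        (InstituteID, SampleID, FractionMethodID,
         AnalysisMethodID, MeasurementNumber, CASNumber)
    -- plus a five-column foreign key back to the analysis-level table,
    -- repeated in full at every child level
)
```

Note that every index built on this table carries the full width of that key.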
So the two questions I have:
Should I replace this key (built mostly from natural keys) with one numeric key value (a surrogate key)? This means more joins to retrieve certain information (all tables have to be included to link results to an institute). Which will work faster?
Or should I replace the natural keys in every table with surrogate keys, but use those surrogates the same way I use the natural keys now (propagating them into the child tables), so I am still able to use fewer joins to retrieve data? (You know what I mean here?)|||It is a matter of choice and style. I make sure every table I create has a non-composite primary key, usually surrogate.
Regardless, you will want a unique index/constraint on InstituteID/SampleID if you elect not to make it the primary key.|||I have read pages and pages about this topic already, but every person, site, or newsgroup seems to have a different opinion. That doesn't make it easier to come to a final decision. Even checking performance afterwards seems to be no option, since at this moment we have no idea how big the database will grow, and this also seems to influence which path to take.
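A sketch of the surrogate-per-table approach for one level of the chain, keeping the unique constraint on the natural key as advised above (SampleKey, FractionKey, and the constraint names are hypothetical):

```sql
CREATE TABLE Sample (
    SampleKey   int IDENTITY (1, 1) PRIMARY KEY,
    InstituteID int NOT NULL,
    SampleID    int NOT NULL,
    -- the natural key stays enforced even though it is not the PK
    CONSTRAINT UQ_Sample UNIQUE (InstituteID, SampleID)
)

CREATE TABLE Fraction (
    FractionKey      int IDENTITY (1, 1) PRIMARY KEY,
    SampleKey        int NOT NULL REFERENCES Sample (SampleKey),
    FractionMethodID int NOT NULL,
    CONSTRAINT UQ_Fraction UNIQUE (SampleKey, FractionMethodID)
)
```

Each child table now carries only its parent's single-column key, but linking a Fraction back to an institute requires joining through Sample.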
It all comes down to one question:
What works faster for data RETRIEVAL: normalize as far as you can go and generate surrogate keys in every table where necessary, resulting in more joins to retrieve data, or use natural composite keys that can result in keys of up to 7 fields but avoid having to join a lot of tables to get the same result? Expect the database to grow pretty large over time.
If anyone could answer that, it would help greatly in coming to a decision about which path to take.|||I think you are trying to make it simpler than it is. You really can't boil it down to that one either-or decision.
Database design is what it is because thought and trial and error have brought it to where it is. Complete/full normalization is ideal if you want to get anything from anywhere, coming from any direction. The reality is that there may be tradeoffs to performance in a fully normalized DB. This accounts for much of the reason there are different LEVELS of normalization.
If you really want speed in the retrieval of data, you perhaps should use an indexed flat file, or a single table with a single unique key. Yeah, going way the other way, but you get the idea.
There is no silver bullet.
Everything's a tradeoff.
Which accounts for the number and variety of opinions on the subject.|||I know this is not an easy topic with one straightforward answer. But my problem is that, because I am still a rookie at database design, I am trying to find out which things are most difficult for a DBMS (MS SQL Server in this case), before deciding which option I will go with (mostly surrogate or natural keys).
I don't expect a clear answer, but I am hoping that someone who does have experience with the MS SQL Server DBMS and who has built a database similar to mine (a database with, say, six one-to-many related tables and some lookup tables attached to it) could come up with the basic advantages and disadvantages of using surrogate keys in such a database.
If I have 6 tables and the 6th table uses the primary key of the 5th table plus one field to make tuples in this 6th table unique, I just have the feeling it is not right, since you have a lot of fields in more than one table (not properly normalized, if you ask me). It feels better to just create one surrogate key for each of these tables (which does result in having one surrogate primary key unique for a tuple, and having a surrogate reference key referring to its parent table).
But am I right here? Will SELECT queries run fast in both systems as long as the tables are properly indexed, or does one of these methods have a significant disadvantage compared to the other?
I just want to be well informed before making a decision based on a feeling that is not right :)|||If the database is expected to be large, you do not want to carry around any extra storage space: bigger backups, more resources to archive and purge data, etc.
It's not a question of all or nothing, it's possible to use natural keys to a certain level in the hierarchy, then at some point assign a surrogate identity for further referencing from that point.
Personally I prefer to use natural keys, but use surrogates when it makes sense to do so.
As for the performance questions you had, I think neither method has a distinct advantage over the other. That depends more on usage patterns of the data (and indexes). But methodwise, properly implemented both are ok.
rockmoose|||So when does it make sense for you to use surrogates?|||After reading a lot, I am leaning toward surrogate keys now, mostly because using a number of columns as primary key and clustering them makes the indexes larger, which can result in decreased performance. Next to that, a lot more storage is required, and updates to data (which won't happen often, though) require a lot of cascade updating.
The only big disadvantage is having to use most tables in every query to filter on certain data. However, I have the feeling that if the DBMS is smart enough in query optimization, this shouldn't cause too high a performance penalty.|||From a performance and technical perspective, an ideal key is thin, stable, and monotonically increasing. From a usage perspective, it also conveys business meaning.
thin - this is particularly important if the key is going to be propagated to large child tables; you don't want to carry the overhead of large (or multiple-column) keys.
stable - a changing key is not a good choice of key; changes will have to be propagated throughout the database.
monotonically increasing - this is purely for technical and performance reasons; for large tables with many inserts this is important. Not so important for "static" or small tables.
conveys business meaning - easier to work with; fewer lookups/joins necessary.
If a table has a surrogate key, you still need a unique constraint to make sure that no duplicate data is inserted.
So when does it make sense to me?
Well, when a surrogate would clearly give more of the above benefits than a natural key would.
When I design a database and introduce a surrogate, I want to be able to justify it.
e.g. "I put a surrogate here, because the natural key is 4 columns and 32 bytes wide, and it is going to be propagated into this 20 mil row transaction table"
Did you read all the links to discussions oj gave you ;) ?!??!
__________________________________
I wish we had a NATURAL JOIN operator
rockmoose|||After reading a lot i am leaning toward surrogate keys now. Mostly because using a number of colums as primary key and clustering them makes the the indexes larger, which can result in deecreased performance. Next to that a lot more storage is required and updates to data (that won't hapen often though) require a lot of cascade updating.
The only big disadvantage is having to use most tables in every query to filter on certain data. However i have the feeling that when the DBMS is smart enough in query optimization this shoulden't cause a too high performance penalty
You can create views that do the joins for you, and query these.
Have a physical storage layer with surrogates, and a logical view layer with more "business meaning".
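A sketch of such a view layer, following the surrogate-key schema discussed in the thread; all object and column names here are assumptions, and the view presupposes those tables exist:

```sql
-- Logical layer: a view that hides the surrogate-key joins, so
-- applications can query by business terms instead of key chains
CREATE VIEW InstituteSamples
AS
SELECT o.Name    AS InstituteName,
       s.SampleID,            -- the institute's own sample number
       s.SampleDate,
       s.Location
FROM Organization o
     JOIN Sample s ON s.InstituteID = o.InstituteID
```

Queries against the view read as if the natural keys were everywhere, while the physical layer keeps its narrow surrogates.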
IMO, when introducing a surrogate key, one is introducing an extra layer of indirection; i.e., we tell the system that this surrogate stands for this natural key, and that from this point on we have to perform a lookup to access the data.
This should be done judiciously and for a reason.|||Yes, I did read those topics, but for a beginning DB developer it is still not easy to decide which option to choose.
Even if I go for surrogate keys, you have the option to propagate the key into the child tables (again resulting in multiple-column keys, but at least limiting their size), or not to do this and instead generate a new surrogate key in each table where no keys are propagated.
I have read about this last option nowhere in any of the topics on the Google newsgroups.
Is there even a suggestion for when to go for surrogates? I mean, if a key consists of more than X columns, or the column length of a key is larger than Y characters or bytes (and the key is stable and unique)?
Another thing that might be good for me to know: if a key field is, say, 20 characters, what influence will that have on the index generated on such a field? Will it be larger than an index generated on a 5-character field?