Wednesday, March 7, 2012

Keeping dirty data...

The organization I'm in has the business need of collecting data from
outside organizations and tracking what data is bad and what data is
good. When I say bad data I mean everything from things outside of
range to absolute crap - characters in integer columns, integers in
character columns, special characters, etc. The data comes in in the
form of flat file so it's a free for all until it hits ssis & the db
engine.
Eventually of course they work to get the data corrected at the source
& resubmitted but in the meantime, they have the legitimate need of
not only pushing the data into the database (dirty or not), but
keeping all the bad stuff, running reports on the bad stuff, etc. I
can't in good conscience make everything a varchar to catch everything
- that would go against the database gods. IMO - I still must make an
integer be an integer , characters are characters, etc. But what do I
do with the junk? Any thoughts? Right now I'm throwing everything
over to some side catch-all tables as varchars, and pushing clean data
into real tables, but that feels wrong too. Suggestions?You're doing exactly what a lot of people do in BI applications - just using
an intermediate staging area to temporarily store data until you run the
next process to clean it. Some people prefer to push the data directly into
the production system without an intermediate "staging" area, and just let
the SSIS error flows catch the bad rows while importing from flat files.
You might find this method more efficient, since you don't need two separate
processes.
"CB" <unc27932@.yahoo.com> wrote in message
news:415c48ac-cfad-4d70-9619-4dec8cd062ac@.l1g2000hsa.googlegroups.com...
> The organization I'm in has the business need of collecting data from
> outside organizations and tracking what data is bad and what data is
> good. When I say bad data I mean everything from things outside of
> range to absolute crap - characters in integer columns, integers in
> character columns, special characters, etc. The data comes in in the
> form of flat file so it's a free for all until it hits ssis & the db
> engine.
> Eventually of course they work to get the data corrected at the source
> & resubmitted but in the meantime, they have the legitimate need of
> not only pushing the data into the database (dirty or not), but
> keeping all the bad stuff, running reports on the bad stuff, etc. I
> can't in good conscience make everything a varchar to catch everything
> - that would go against the database gods. IMO - I still must make an
> integer be an integer , characters are characters, etc. But what do I
> do with the junk? Any thoughts? Right now I'm throwing everything
> over to some side catch-all tables as varchars, and pushing clean data
> into real tables, but that feels wrong too. Suggestions?

No comments:

Post a Comment