Skip navigation
0

Data Chop Shop

Posted by David Birmingham Feb 17, 2009

Before throwing the data on the grill, it needs prep. We might want to chop it, slice it, dress it and otherwise tenderize it before throwing it on the flames, but throw it on the fire we must.

 

Here's an interesting problem - what if we're presented with data that just won't intake. It's horribly formatted, if at all, and we don't have access to any kind of ETL or data-shaping environment to get the data inside. We have bad dates, bad numerics, and the only thing we have that actually works are the varchars! Whoo hoo. Okay, not so fast.


To make matters worse, we have intake records that face concatenated views, some of which have almost two-hundred columns. If any columns are bad, it's a lot like a submarine hunt, without the submarine. How bad is the data? Well, not all that bad where it actually resides. The users do a lot of inline and stored-proc math, and don't like those pesky overflows in their math. So the best option for them is to define all their numerics as "number", with no specified precision. Hey, that works great as long as the data never has to leave its home.

 

But now it's escaped, and it's on our front door with a trick-or-treat bag without the treats. When the source database doesn't define the numeric precision of its source columns, we'll find values with very oddball characteristics. Simple things, like 45 digits to the right of the decimal. Not particularly useful digits, but those leftovers like .33333 etc that just showed up without being invited.

 

How do we shock-the-system with these values, or just trim them out? On intake we have the potential of an empty column, too. The danger never ends. check out this example, if we want to clamp the numeric data to a more palatable value:

 

select

-- for a numeric

     case when mycol is null then null else substring(MyColumnName, 1, 38)::numeric(38,8) end MyColumnName

 

--or for a date

     case when mycol is null then null else date(MyColumnName) end MyColumnName

 

 

Now we've stripped the data to something we won't choke on, and it's within lasso-distance of our numeric precision. What's that? You don't want to do this for every column in case it's something pervasive (and it usually is) - and you don't like the idea of bringing the numeric precision out of the catalog and putting it into the intake statement? What kind of perfectionist are you?

 

Not to worry, I don't like this kind of construct either, at least, not "out in the open". If I really need to use this, I would rather find a way to automate its construction right off the catalog. Our "substring" doesn't have to change, it will chop the physical data to a non-choking size. The catalog-based precision we can get, well, off the catalog. I am a huge fan of using the catalog for meta-data-based constructs, especially for common, automated tasks like intake and publication.

 

Let's say we have a very-wide intake record, like 200 columns of varying types. Do we really want to carefully craft an intake statement including the construct above? The capability is willing, but the flesh is weak. I don't find such mind-numbing work to be profitable or productive, even though it's sometimes insanely necessary because the intake data is so junky.

 

The cool part is just this: get the data into the machine! We don't have to push load-ready data into the machine, we just need to push it to a safe location. Once inside the machine, we can use Netezza's SPUs to beat the living daylights out of the data. And when we think about it, once the data's in the machine, it's like putty in our hands. We can create, teardown and rebuild whole data models, several-a-day if we want, to shape and mold the structures to the form we want. But the potter's wheel is awefully lonely with no clay.

 

I've been place where we literally waited for weeks upon weeks to get data into the machine, largely because the data we want in the machine has never (by design) left its home for another machine. Once it leaves, people see how ugly the data looks "out in the open" -- but something funny happens. We might ask them "could you format those pesky dates and numerics to something more palatable?". The answer we get back is as honest and refreshing as Nestle iced tea: "You guys own the warehouse, and the whole chain of data cleansing. I'm not the cleaner, you are. Why are you asking me to do your job?"

 

Ouch, well, there it is, and quite frankly they're right about it. It's awfully hard to tell someone to restructure their data to a form that meets our needs - it's 2008 after all. What would it take to get information into the machine, especially junky data? Do we really need to push back on our DBAs? Our analysts? Our DBAs are well-paid to manage and deliver the information in a pre-defined form, usually not in bulk. It's not particularly daunting for them, but they cannot read your mind, either.

 

So how would we access the catalog to automate this intake problem? And while we're at it, why not solve other intake problems, not just the pesky numeric precisions? How about solving the need for file space to land a flat file for intake? Or that the data in the source doesn't completely match the final target tables (we've added some administrative columns and other items that don't have a source-side equivalent. Intake-mechanics are actually pretty simple, and once we solve some of the basics, we can do some pretty advanced intake at the push of a button, and then use the common SQL-based ELT to take the data to its final home.

 

But getting the data into the box is no different than getting a player on the field, No player, no game. If we examine the common failure points that are in our way, we have the potential for source-failure in a database or file system, network errors, power-outages - you name it - all between our machine and our data. If we can get the data into the machine and cut the ropes from whence it came, we can do anything we want. If we can't get the data into the machine, well, what on earth are we standing around for?