[ www.netezza.com ]
0

First-Bubble Blues

Posted by David Birmingham Feb 26, 2009

About a year ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload.

 

So the team landed on the ground with all hopes high. The client had purchased a 10200 (216 processors) for production deployment and a 10100 system for development. Oddly, the same thing happened here as happens in many places. The 10200 was dispatched to the protected production enclave and the 10100 was dropped into the local data center with the developers salivating to get started. And get started they did.

 

The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 10100, the 10200 sat elsewhere, humming with nothing to do.


We know that in any environment we encounter, with any technoogy you can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this problem for our first project, do we?

 

And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. the bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a devlopment or integration environment as a means to boil out any latent bugs prior to its maiden - and only  - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.

 

And yet, we're doing it on the smaller machine. No doubt the 10100 has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifing piece of logic. If we'd done this on the larger 10200, we would have finished it in a week or less. Double the power, half the time-to-deliver (when the time is deliver is governed by testing) In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compered to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble.

 

So this is a case for any data warehouse project, not just one with a Netezza machine. The first bubble is the worst bubble. As our techs slave themselves over a hot CPU, sweating out the extreme workload of the initial conversion, they will quickly start to compare the machine they are working on versus that production machine sitting over there with nothing to do. It wouldn't matter what the technology happened to be - the equation is out of kilter. We need all the available power to get past the first bubble.

 

But I've had this conversaion with more people than I can count. Why can't you deploy the production-destined machine with all its power, for development/testing use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers,  who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its birthday-suit - its originally deployed default environment. Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that blow-out and instantiate the inplementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, then treat it as an operational production asset.

 

Most projects like this have a very short runway, and we do a disservise to the hard-working folks who are doing their best to stand up this environment, They need all the power they can get, especially when they enter the testing cycle. And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, and it is clear that the total wall-clock hours spent in producing quality technology breaks into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.

 

And if the majority of the time is spent in testing, what are we testing on Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it. So why are we doing it on one-third of the available hardware (seeing that we're on a 10100 and the 10200 is sitting over there, humming away and taunting us from a distance!)

 

I can say that multiple small teams can get a lot of "ongoing" work done on a 10100, no doubt a very powerful environment. I can also say that a machine like this, for multiple teams in the first-bubble effort, will gaze longingly at the 10200 in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close. With that, Netezza gives us the power to close faster than any other environment, to get past this first-bubble without the blues - we only hurt ourselves with rules for the environment that are impractical for the first-bubble. So all things considered, if we were on a traditional platform we would see months pass for the relative weeks it would take for a Netezza machine to do the same work.

 

Alas, when one has a Netezza machine, it bends gravity and dilates time. Months become weeks. Weeks become days. And yet, we still need more power. More is never enough to wash away the blues.

 

Those first-bubble-blues.

0 Comments Permalink
0

Data Chop Shop

Posted by David Birmingham Feb 17, 2009

Before throwing the data on the grill, it needs prep. We might want to chop it, slice it, dress it and otherwise tenderize it before throwing it on the flames, but throw it on the fire we must.

 

Here's an interesting problem - what if we're presented with data that just won't intake. It's horribly formatted, if at all, and we don't have access to any kind of ETL or data-shaping environment to get the data inside. We have bad dates, bad numerics, and the only thing we have that actually works are the varchars! Whoo hoo. Okay, not so fast.


To make matters worse, we have intake records that face concatenated views, some of which have almost two-hundred columns. If any columns are bad, it's a lot like a submarine hunt, without the submarine. How bad is the data? Well, not all that bad where it actually resides. The users do a lot of inline and stored-proc math, and don't like those pesky overflows in their math. So the best option for them is to define all their numerics as "number", with no specified precision. Hey, that works great as long as the data never has to leave its home.

 

But now it's escaped, and it's on our front door with a trick-or-treat bag without the treats. When the source database doesn't define the numeric precision of its source columns, we'll find values with very oddball characteristics. Simple things, like 45 digits to the right of the decimal. Not particularly useful digits, but those leftovers like .33333 etc that just showed up without being invited.

 

How do we shock-the-system with these values, or just trim them out? On intake we have the potential of an empty column, too. The danger never ends. check out this example, if we want to clamp the numeric data to a more palatable value:

 

select

-- for a numeric

     case when mycol is null then null else substring(MyColumnName, 1, 38)::numeric(38,8) end MyColumnName

 

--or for a date

     case when mycol is null then null else date(MyColumnName) end MyColumnName

 

 

Now we've stripped the data to something we won't choke on, and it's within lasso-distance of our numeric precision. What's that? You don't want to do this for every column in case it's something pervasive (and it usually is) - and you don't like the idea of bringing the numeric precision out of the catalog and putting it into the intake statement? What kind of perfectionist are you?

 

Not to worry, I don't like this kind of construct either, at least, not "out in the open". If I really need to use this, I would rather find a way to automate its construction right off the catalog. Our "substring" doesn't have to change, it will chop the physical data to a non-choking size. The catalog-based precision we can get, well, off the catalog. I am a huge fan of using the catalog for meta-data-based constructs, especially for common, automated tasks like intake and publication.

 

Let's say we have a very-wide intake record, like 200 columns of varying types. Do we really want to carefully craft an intake statement including the construct above? The capability is willing, but the flesh is weak. I don't find such mind-numbing work to be profitable or productive, even though it's sometimes insanely necessary because the intake data is so junky.

 

The cool part is just this: get the data into the machine! We don't have to push load-ready data into the machine, we just need to push it to a safe location. Once inside the machine, we can use Netezza's SPUs to beat the living daylights out of the data. And when we think about it, once the data's in the machine, it's like putty in our hands. We can create, teardown and rebuild whole data models, several-a-day if we want, to shape and mold the structures to the form we want. But the potter's wheel is awefully lonely with no clay.

 

I've been place where we literally waited for weeks upon weeks to get data into the machine, largely because the data we want in the machine has never (by design) left its home for another machine. Once it leaves, people see how ugly the data looks "out in the open" -- but something funny happens. We might ask them "could you format those pesky dates and numerics to something more palatable?". The answer we get back is as honest and refreshing as Nestle iced tea: "You guys own the warehouse, and the whole chain of data cleansing. I'm not the cleaner, you are. Why are you asking me to do your job?"

 

Ouch, well, there it is, and quite frankly they're right about it. It's awfully hard to tell someone to restructure their data to a form that meets our needs - it's 2008 after all. What would it take to get information into the machine, especially junky data? Do we really need to push back on our DBAs? Our analysts? Our DBAs are well-paid to manage and deliver the information in a pre-defined form, usually not in bulk. It's not particularly daunting for them, but they cannot read your mind, either.

 

So how would we access the catalog to automate this intake problem? And while we're at it, why not solve other intake problems, not just the pesky numeric precisions? How about solving the need for file space to land a flat file for intake? Or that the data in the source doesn't completely match the final target tables (we've added some administrative columns and other items that don't have a source-side equivalent. Intake-mechanics are actually pretty simple, and once we solve some of the basics, we can do some pretty advanced intake at the push of a button, and then use the common SQL-based ELT to take the data to its final home.

 

But getting the data into the box is no different than getting a player on the field, No player, no game. If we examine the common failure points that are in our way, we have the potential for source-failure in a database or file system, network errors, power-outages - you name it - all between our machine and our data. If we can get the data into the machine and cut the ropes from whence it came, we can do anything we want. If we can't get the data into the machine, well, what on earth are we standing around for?

0 Comments Permalink