
Of haystacks, needles and hostage negotiation

Posted by David Birmingham on Jan 13, 2009 7:49:47 AM


And now, a drum roll please, the inaugural entry for this auspicious occasion. I realize that many people who read this will be Enzees and non-Enzees, so for those who want to know a little more about the machine, sort of dipping the toes in the water, I'm talkin' to you.

 

And for those who are already swimming in the deep end, and those in the deep end without a Netezza machine, I'll try to shape some thoughts for your own discussions. I've noted in other venues (particularly in the Netezza Underground, and that's my only shameless plug for the book!) that the Netezza machine addresses problems-of-scale. We know what a function-point looks like, and we know when data is transformed from one shape to another. What we might not consider is what it takes to make this happen in scale. And not to belabor a point, but Netezza is an appliance. While we know that we can make enough drinks for a small dinner party in our blender-appliance, and serve coffee to them afterward from our automatic coffee maker-appliance, will these same appliances cover, say, one thousand people? Ten thousand people? For that matter, if we have a simple toaster-appliance, will even our four-slot toaster satisfy warm-toast-requirements for, say, a thousand hungry teenagers? I recall having a winter get-together for a bunch of teenagers in our home, and one of the parents had signed up to provide chili. I still remember the kids piling into the house, about thirty or more, cold and thirsty from being outside in the weather. The parent showed up with a mini crock-pot containing the warmed contents of two cans of store-bought chili, and had clearly missed the memo to show up prepared to feed the masses. We ordered pizza.

 

The requirements for processing data in scale are no different, and we have several primary hurdles to overcome that Netezza has recognized, embraced, harnessed and solved. I would be remiss in not pointing them out, because for some people these are not obvious. But before we delve, consider this: Let's say I have a compliance model and I want to find the few thousand records among millions or billions that are not in compliance. This is a needle-in-a-haystack problem, and for some of you that pile of hay is pretty big! If we did this in a "typical" fashion (without Netezza), we would examine each row for a variety of anomalies, all based on some criteria or set of rules. We really need to get it right on the first pass, because a multiple-pass model on the data is unthinkable.

 

Does this always work? What if one of the anomalies gets past the sifter because we never wrote a rule for it, and one of the bad-boy records is now deftly sidestepping our checks without our knowledge? Will this dirt creep into our warehouse? Will our users see the dirt before we do? Will this chaos potentially spell doom for someone, somewhere, and can we rescue it from the tracks before the locomotive arrives? Stop throwing popcorn, it's only a melodrama.

 

But with Netezza we have a very interesting option, in that Netezza operates on the principle of where-not-to-look. If we give it enough information, it can ignore the data we don't care about and by default bubble-up the data we want. In the above case, let's say we have two broad categories of data, the "hay" and the "needles". We know that the needles don't belong, and we can certainly tell what hay looks like. Now what if we did a single pass on the data, identifying everything that looks like hay, and rolled the identifiers for these records into a temporary table? In the next pass (that's right, another pass!) we simply anti-join the original data with this temporary table (using a where-not-exists), and voila! - the needles fall out of this as a natural result. Will all of these results be needles for certain? Well, we know that if all the other records are definitely hay, we now have a much smaller and objective subset of needle-candidates to work with, and that all of our needles are in it, even though some anomalous "borderline hay" might be there, too. Either way, the needle-candidates are in the tank, ready for examination. And more importantly - our haystack itself is pure and ready for the next downstream operation.
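
For those who like to see it rather than read about it, here is a rough sketch of the two passes. To be clear about what is assumed: SQLite (through Python's standard library) is only standing in for the Netezza machine so the example runs anywhere, and the table name `txn`, its columns and the definition of "hay" are all made up for illustration. The shape of the SQL is the point, not the engine.

```python
import sqlite3

# SQLite is a stand-in for the warehouse; txn and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [
        (1, 10.0, "OK"),
        (2, 25.5, "OK"),
        (3, -4.0, "OK"),   # negative amount: does not look like hay
        (4, 99.0, None),   # missing status: does not look like hay
        (5, 12.0, "OK"),
    ],
)

# Pass 1: roll up the identifiers of everything that positively looks like hay.
conn.execute(
    """
    CREATE TEMP TABLE hay_ids AS
    SELECT id FROM txn
    WHERE amount >= 0 AND status = 'OK'
    """
)

# Pass 2: anti-join the original data against the hay identifiers.
needle_candidates = conn.execute(
    """
    SELECT t.id, t.amount, t.status
    FROM txn t
    WHERE NOT EXISTS (SELECT 1 FROM hay_ids h WHERE h.id = t.id)
    ORDER BY t.id
    """
).fetchall()

print(needle_candidates)  # [(3, -4.0, 'OK'), (4, 99.0, None)]
```

Notice row 4. A needle-hunting rule such as `status <> 'OK'` would never flag it, because comparing against a missing value is neither true nor false in SQL. Defining the hay and subtracting it catches that row without anyone having written a rule for it.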

 

And this is what's important - if we have billions of rows in a table and we want to find the few thousand that will cause us trouble, we can perform this two-step carving using Netezza's massively parallel power, carve out the troublesome ones and pass good data to the next downstream process. We then deal with the anomalies in a more administrative manner. In short, we don't have to "stop the presses" for the sake of the needles. We just remove the needles and continue. I know that some people have the philosophy that if any needles exist at all, we must stop-the-presses and figure things out before proceeding. This might work for a few hundred anomalies in a few thousand, but it won't scale.
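
Here is a small sketch of what "remove the needles and continue" might look like. Same caveats as any toy: SQLite via Python's standard library is standing in for the real machine, and the names `staging`, `downstream` and `quarantine`, along with the rule for what counts as hay, are hypothetical.

```python
import sqlite3

# SQLite is a stand-in for the warehouse; all table names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO staging VALUES (?, ?)",
    [(1, 5.0), (2, None), (3, 7.5), (4, -1.0)],
)

# Identify the hay once.
conn.execute("CREATE TEMP TABLE hay_ids AS SELECT id FROM staging WHERE amount >= 0")

# Clean rows keep moving to the next downstream process.
conn.execute(
    "CREATE TABLE downstream AS "
    "SELECT s.* FROM staging s JOIN hay_ids h ON h.id = s.id"
)

# Everything else is parked for administrative follow-up, not deleted.
conn.execute(
    "CREATE TABLE quarantine AS "
    "SELECT s.* FROM staging s "
    "WHERE NOT EXISTS (SELECT 1 FROM hay_ids h WHERE h.id = s.id)"
)

clean_ids = [r[0] for r in conn.execute("SELECT id FROM downstream ORDER BY id")]
parked_ids = [r[0] for r in conn.execute("SELECT id FROM quarantine ORDER BY id")]
total = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]

# Every staged row lands in exactly one of the two tables.
assert len(clean_ids) + len(parked_ids) == total
print(clean_ids, parked_ids)
```

The closing check is the part worth keeping in any real version of this: the two piles together should always account for every row that came in, so nothing quietly disappears while the presses keep running.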

 

Our analysts will grow impatient and our troubleshooters catatonic from the sheer volume. Someone will scream an epithet that begins with the words "Why don't you just..." and hopefully the end of this sentence will be something professional like "take the dirt out and deal with it elsewhere, but don't hold the rest of our data hostage!"

 

I'll deal with hostage negotiation in a later entry.
