[ www.netezza.com ]

Here I am in Washington DC. Yes, that's right, Washington DC on the eve of the inauguration. And one may well wonder why a boy from Texas is hanging out in DC? I don't really have plans to attend the events of the next several days, as I am ensconced at a client site on a new and challenging project. I'll share some of the more generic challenges and opportunities at this site at a later time, scrubbing the content, as always. However, the DC area is about to be inundated with what is nothing less than a problem-of-scale. Estimates are that over three million people will attend the inauguration tomorrow, and they've pulled out all the stops and added extra manpower to assist in this mass ingress (and egress) of humanity. The parks services here often use indirect readings to get a better picture of total headcount (like water usage, purchases of refreshments, etc). In the most recent Earth Day celebration, they over-estimated the total attendees because the crowd left more trash on the ground than crowds at the usual events do. The irony is not lost.

 

It is instructive to see what the local authorities have initiated - the bridges and roads all around the city are closed to motor traffic - allowing foot-traffic only. In case of emergency, people can make a quick and orderly exit from the city and into the surrounding areas. Of course, after watching the mayhem of 9/11, we know that if a catastrophic emergency arises, people will make it to the riverfront and then wait patiently in line to cross the closest bridge, right? Who are we kidding? Expect to see swimmers in the water, and all that. Not to put a damper on the festivities, of course. This is an important day for America in many ways, and the people in charge are very carefully seeing to it that all forms of chaos are under control.

 

But are we? How does chaos, even explosive, career-threatening chaos, enter our environments and wreak havoc on our systems? I noted in a prior entry that simply watching for what is known while ignoring the unknown - or even accidentally allowing the unknown to pass through because we failed to recognize it - isn't good enough. It wouldn't be good enough this week for the Obama and Biden families, so why is it ever good enough for our mission-critical systems? It is because we are accustomed only to checking our information against the standard to which it must rise, rather than checking it against a minimum standard to which it cannot rise at all.

 

Now, that sounds a little oblique so here's an example: Whenever a police dog from any given K-9 unit in America is searching for a suspect in the dead of night amidst flashlights, annoying shouts and eerie background music, it is searching for one and only one thing - the scent that isn't there. In fact, for any of these dogs you could easily perform a test - fill a room with fifty people and have one of them leave and hide. Then bring in the dog. It will immediately lock upon the missing scent and go after the person who isn't there. How amazing is that? Of course, their olfactory sensors are several hundred thousand times more sensitive than ours.

 

Within international terrorist discussions on the chit-chat shows, we hear a maxim of "our intelligence has to be right 100% of the time, but a terrorist only has to get it right once." In other words, the terrorist may make one thousand attempts to take down a target, and only has to get it right once to succeed. But the people protecting it have no options - it has to be right all the time. Errors and junk in our data are a form of virtual terrorism - but only by latent effort on their part. Like dust slowly settling on electronic parts, if the parts are not protected from the creeping effect of buildup, the layer will eventually reach a critical mass and bring down the system. No one speck of dust did the trick, only the lack of attention to the dirt. When we say, "what are the odds?" - that word "odd" has a meaning. It is the meaning of "what doesn't belong here".

 

In a data processing environment, chaos thrives amid neglect. If we're in hard pursuit of chaos it has less of a chance to succeed - but we're only talking about probability now. Sooner or later, will the odds be in the favor of the last speck of dust that really counts, or will we keep the dust to an acceptable level so that there is no appreciable or dangerous buildup? Before I start sounding like a cleaning commercial, let's try to keep in mind that we need some serious sifting power, and not just spot-check, sampling or rules-based checking of row-at-a-time data. Dirt and all its patterns show up slowly in some systems, but show up in bulk in ours.

 

We need a way to bulk-lift the dirt out and away. Or for that matter, redirect the dirt in quantity so that it never finds a home. How do we do this? By challenging as-suspect every row that arrives at the front door. Of course, doing this "upon arrival" is a bit daunting, highly inefficient and error-prone, and not really practical. No, what we want to do is pull all the candidates inside, take a look at those that belong and those that don't, and sift them while they are in the pipeline before they ever land in an Important Place. We need something akin to a switching device on a train track, or like a letter-sorter in a postal office, or like a coin-sifter in a change-making machine. We need for this sifting effect to fall out using normal physics, not as a result of carefully handling the elements. We just don't have the time or bandwidth to taste-touch-handle each row.

 

Within a Netezza domain, start thinking about doing everything in scale. Rather than examining a problem as though we're trying to find a criminal, examine the problem in terms of what criminals look like, and what they don't. These provide clues to Netezza on "where not to look" - and we'll find our suspect faster.

 

Many years ago I assisted (from a virtual distance) a prosecutor in a dragnet operation where the various detectives had, over time, photographed or filmed a number of drug-purchases by nefarious characters and denizens of society. In one day and night of misfortune (for the perps), a task force netted all of these people in a several-hour sweep of three large districts in three states. Within a matter of hours, all but one of the over three-hundred perps had been captured, and the final one on the next morning when he walked in his front door. We hear about occasional "creative stings" - one in Dallas where every person with an outstanding warrant was sent a mailer saying that they had won an all-expense-paid cruise vacation and all they needed to do was show up at Texas Stadium to claim their prize. Of course, in reality only police officers awaited them, and scuttled the hapless souls out the back door to awaiting transportation to a local incarceration facility.

 

Why does all this matter? When we want to find the bad data, we need a way to call it out. We need a way to find it in a manner that separates it from the rest of the data. In human terms, we often think in terms of identifying the perps and going after them, because we cannot possibly sift through the counties on a door-to-door search. We might know for sure that a nefarious character lives in a certain neighborhood, but without specific intelligence to locate the character, he remains at large. Netezza doesn't do a door-to-door search either. In fact, if we can only identify "what isn't a perp" - the suspects stand out like the hapless lemmings at Texas Stadium. Both of them arrived by a form of natural physics. The perps stood out by the natural human gravity of something-for-nothing. Our bad data stands out using the natural physics of the Netezza architecture.

 

Or in another vein, Secret Service agents and bank employees are taught how to spot a counterfeit bill based on what the real thing looks like. While they may know various counterfeiting techniques, nothing beats the ability to know the real thing so well that a counterfeit stands out like a sore thumb. Got a pile of cash in your pocket? Could you tell which ones are real and which ones are counterfeit because you know "what isn't a counterfeit" better? Same technique applies here - give Netezza the cues on where not-to-look and we have our counterfeit data, our posers, our perps and wannabes. And make the domain safe before bedtime.



And now, a drum roll please, the inaugural entry for this auspicious occasion. I realize that many people who read this will be Enzees and non-Enzees, so for those who want to know a little more about the machine, sort of dipping the toes in the water, I'm talkin' to you.

 

And for those who are already swimming in the deep end, and those in the deep end without a Netezza machine, I'll try to shape some thoughts for your own discussions. I've noted in other venues (particularly in the Netezza Underground, and that's my only shameless plug for the book!) that the Netezza machine addresses problems-of-scale. We know what a function-point looks like, and we know when data is transformed from one shape to another. What we might not consider is what it takes to make this happen in scale. And not to belabor a point, but Netezza is an appliance. While we know that we can make enough drinks for a small dinner party in our blender-appliance, and serve coffee to them afterward from our automatic coffee maker-appliance, will these same appliances cover, say, one thousand people? Ten thousand people? For that matter, if we have a simple toaster-appliance, will even our four-slot toaster satisfy warm-toast-requirements for say, a thousand hungry teenagers? I recall having a winter get-together for a bunch of teenagers in our home, and one of the parents had signed up to provide chili. I still remember the kids piling into the house, about thirty or more, cold and thirsty from being outside in the weather. The parent showed up with a mini-crock-pot containing the warmed contents of two cans of store-bought chili, and had clearly missed the memo to show up prepared to feed the masses. We ordered pizza.

 

The requirements for processing data in scale are no different, and we have several primary hurdles to overcome that Netezza has recognized, embraced, harnessed and solved, and I would be remiss for not pointing them out, because for some people these are not obvious. But before we delve, consider this: Let's say I have a compliance model and I want to find the few thousand records among millions or billions that are not in compliance. This is a needle-in-a-haystack-problem, and for some of you that pile of hay is pretty big! If we did this in a "typical" fashion (without Netezza), we would examine each row for a variety of anomalies, all based on some criteria or set of rules. We really need to get it right on the first pass, because a multiple-pass-model on the data is unthinkable.

 

Does this always work? What if one of the anomalies gets past the sifter because we did not actively include a rule that one of the bad-boy records is now deftly sidestepping without our knowledge? Will this dirt creep into our warehouse? Will our users see the dirt before we do? Will this chaos potentially spell doom for someone, somewhere, and can we rescue it from the tracks before the locomotive arrives? Stop throwing popcorn, it's only a melodrama.

 

But with Netezza we have a very interesting option, in that Netezza operates on the principle of where-not-to-look. If we give it enough information, it can ignore the data we don't care about and by default bubble-up the data we want. In the above case, let's say we have two broad categories of data, the "hay" and the "needles". We know that the needles don't belong, and we can certainly tell what hay looks like. Now what if we did a single pass on the data, identifying everything that looks like hay, and rolled the identifiers for these records into a temporary table? In the next pass (that's right, another pass!) we simply anti-join the original data with this temporary table (using a where-not-exists), and voila! - the needles fall out of this as a natural result. Will all of these results be needles for certain? Well, we know that if all the other records are definitely hay, we now have a much smaller and objective subset of needle-candidates to work with, and that all of our needles are in it, even though some anomalous "borderline hay" might be there, too. Either way, the needle-candidates are in the tank, ready for examination. And more importantly - our haystack itself is pure and ready for the next downstream operation.
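
For those who like to see it on the grill, here is a minimal sketch of that two-pass sift. It is not Netezza code: it uses Python's built-in sqlite3 module as a small stand-in, and the table name `incoming`, its columns, and the rule for "what hay looks like" are all hypothetical, made up purely for illustration.

```python
import sqlite3


def split_hay_and_needles(rows):
    """Two-pass, set-based sift: return (hay_ids, needle_candidate_ids).

    `rows` is an iterable of (id, amount, currency) tuples. The schema and
    the "looks like hay" rule below are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE incoming (id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO incoming VALUES (?, ?, ?)", rows)

    # Pass 1: describe what hay looks like and roll the identifiers of
    # every matching record into a temporary table.
    conn.execute(
        """
        CREATE TEMP TABLE hay_ids AS
        SELECT id
        FROM incoming
        WHERE amount IS NOT NULL
          AND amount >= 0
          AND currency IN ('USD', 'EUR')
        """
    )

    # Pass 2: anti-join the original data against the hay identifiers.
    # Whatever is left over is a needle-candidate, including rows that
    # break rules nobody thought to write down.
    needles = [
        r[0]
        for r in conn.execute(
            """
            SELECT i.id
            FROM incoming AS i
            WHERE NOT EXISTS (SELECT 1 FROM hay_ids AS h WHERE h.id = i.id)
            ORDER BY i.id
            """
        )
    ]
    hay = [r[0] for r in conn.execute("SELECT id FROM hay_ids ORDER BY id")]
    conn.close()
    return hay, needles


if __name__ == "__main__":
    sample = [
        (1, 10.0, "USD"),
        (2, -5.0, "USD"),   # negative amount
        (3, None, "EUR"),   # missing amount
        (4, 20.0, "XXX"),   # unrecognized currency
        (5, 7.5, "EUR"),
    ]
    print(split_hay_and_needles(sample))
```

Notice that the second pass never names a single anomaly. We only said what hay looks like, and everything else fell out on its own, which is the whole point of where-not-to-look.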

 

And this is what's important - if we have billions of rows in a table and we want to find the few thousand that will cause us trouble, we can perform this two-step carving using Netezza's massively parallel power, carve out the troublesome ones and pass good data to the next downstream process. We then deal with the anomalies in a more administrative manner. In short, we don't have to "stop the presses" for the sake of the needles. We just remove the needles and continue. I know that some people have the philosophy that if any needles exist at all, we must stop-the-presses and figure things out before proceeding. This might work for a few hundred anomalies in a few thousand, but it won't scale.

 

Our analysts will grow impatient and our troubleshooters catatonic from the sheer volume. Someone will scream an epithet that begins with the words "Why don't you just..." and hopefully the end of this sentence will be something professional like "take the dirt out and deal with it elsewhere, but don't hold the rest of our data hostage!"

 

I'll deal with hostage negotiation in a later entry.


Fire it up!

Posted by David Birmingham Jan 12, 2009

Greetings to all Enzees everywhere and welcome to the Grill!

 

Now, to level-set a bit on where this "grill" concept came from, I was hanging out at the house with friends and family shooting the breeze, cooking food on the outdoor grill and it sort of dawned on me that this is a lot like the virtual atmosphere in the Enzee Community. Maybe some of the analogies fall short, but the spirit of the atmosphere is what we're after.

 

And of course, the grill itself. A place where we toss things on the cooker, or cook-things-up, or take a raw idea to its conclusion, along with the transformation that happens from raw-to-cooked, and hopefully we won't have anything half-baked!

 

But keep in mind this is a grill, not a roast! So let's keep the atmosphere jovial and forward-looking, because that's where all the action is. And like any good cookout, while there's a guy flippin' the food who seems to be in charge, everyone's opinion counts. I'll write most of the entries in essay-style form, perhaps even persuasive-essay style, because this is a blog and I have opinions too - but I don't do one-liner entries (mostly).

 

I'd like to thank the folks at Netezza for this opportunity to interact with the Enzee community at large, and hope it becomes a very fruitful discussion for all.

 

In the spirit of the political season, the "inaugural" discussion will appear next!
