
Here I am in Washington DC. Yes, that's right, Washington DC on the eve of the inauguration. And one may well wonder why a boy from Texas is hanging out in DC. I don't really have plans to attend the events of the next several days, as I am ensconced at a client site on a new and challenging project. I'll share some of the more generic challenges and opportunities at this site at a later time, scrubbing the content, as always. However, the DC area is about to be inundated with what is nothing less than a problem-of-scale. Estimates are that over three million people will attend the inauguration tomorrow, and the authorities have pulled out all the stops and added extra manpower to assist in this mass ingress (and egress) of humanity. The parks services here often use indirect readings to get a better picture of total headcount (like water usage, purchases of refreshments, etc). At the most recent Earth Day celebration, they over-estimated the total attendance because the attendees left more trash on the ground than crowds at the usual events do. The irony is not lost.

 

It is instructive to see what the local authorities have initiated - the bridges and roads all around the city are closed to motor traffic, allowing foot-traffic only. In case of emergency, people can make a quick and orderly exit from the city and into the surrounding areas. Of course, after watching the mayhem of 9/11, we know that if a catastrophic emergency arises, people will make it to the riverfront and then wait patiently in line to cross the closest bridge, right? Who are we kidding? Expect to see swimmers in the water, and all that. Not to put a damper on the festivities, of course. This is an important day for America in many ways, and the people in charge are very carefully seeing to it that all forms of chaos are under control.

 

But are we? How does chaos, even explosive, career-threatening chaos, enter our environments and wreak havoc on our systems? I noted in a prior entry that simply watching for what is known while ignoring the unknown - or even accidentally allowing the unknown to pass through because we failed to recognize it - isn't good enough. It wouldn't be good enough this week for the Obama and Biden families, so why is it ever good enough for our mission-critical systems? It is because we are accustomed only to checking our information against the standard to which it must rise, rather than checking it against a minimum standard to which it cannot rise at all.

 

Now, that sounds a little oblique, so here's an example: Whenever a police dog from any given K-9 unit in America is searching for a suspect in the dead of night amidst flashlights, annoying shouts and eerie background music, it is searching for one and only one thing - the scent that isn't there. In fact, for any of these dogs you could easily perform a test - fill a room with fifty people and have one of them leave and hide. Then bring in the dog. It will immediately lock onto the missing scent and go after the person who isn't there. How amazing is that? Of course, their olfactory sensors are several hundred thousand times more sensitive than ours.

 

In discussions of international terrorism on the chit-chat shows, we hear a maxim of "our intelligence has to be right 100% of the time, but a terrorist only has to get it right once." In other words, the terrorist may make one thousand attempts to take down a target, and only has to get it right once to succeed. But the people protecting it have no options - they have to be right all the time. Errors and junk in our data are a form of virtual terrorism - but only by latent effort on their part. Like dust slowly settling on electronic parts, if the parts are not protected from the creeping effect of buildup, the layer will eventually reach a critical mass and bring down the system. No one speck of dust did the trick, only the lack of attention to the dirt. When we say, "what are the odds?" - that word "odd" has a meaning. It is the meaning of "what doesn't belong here".

 

In a data processing environment, chaos thrives amid neglect. If we're in hard pursuit of chaos it has less of a chance to succeed - but we're only talking about probability now. Sooner or later, will the odds be in the favor of the last speck of dust that really counts, or will we keep the dust to an acceptable level so that there is no appreciable or dangerous buildup? Before I start sounding like a cleaning commercial, let's try to keep in mind that we need some serious sifting power, and not just spot-checking, sampling or rules-based checking of row-at-a-time data. Dirt and all its patterns show up slowly in some systems, but show up in bulk in ours.

 

We need a way to bulk-lift the dirt out and away. Or for that matter, redirect the dirt in quantity so that it never finds a home. How do we do this? By challenging as-suspect every row that arrives at the front door. Of course, doing this "upon arrival" is a bit daunting, highly inefficient and error-prone, and not really practical. No, what we want to do is pull all the candidates inside, take a look at those that belong and those that don't, and sift them while they are in the pipeline before they ever land in an Important Place. We need something akin to a switching device on a train track, or like a letter-sorter in a post office, or like a coin-sifter in a change-making machine. We need this sifting effect to fall out using normal physics, not as a result of carefully handling the elements. We just don't have the time or bandwidth to taste-touch-handle each row.
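To make the "pull everything inside, then sift in the pipeline" idea a little more concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a set-based engine (it is not Netezza), and the table names, columns and the BELONGS rule are all hypothetical. The point is only the shape: every row lands in staging first, and then two bulk statements route the whole set, with no row-at-a-time handling.

```python
import sqlite3

# Hypothetical rule for "belongs here"; real rules would come from the business.
BELONGS = "customer IS NOT NULL AND amount >= 0"

def sift(rows):
    """Land all rows in staging, then route them in bulk to target or reject."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging (id INTEGER, customer TEXT, amount REAL);
        CREATE TABLE target  (id INTEGER, customer TEXT, amount REAL);
        CREATE TABLE reject  (id INTEGER, customer TEXT, amount REAL);
    """)
    # Step 1: pull all the candidates inside, unchallenged.
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
    # Step 2: one set-based statement per destination, like a track switch.
    conn.execute(f"INSERT INTO target SELECT * FROM staging WHERE ({BELONGS})")
    # "IS NOT 1" also catches rows where the rule evaluates to NULL (unknown),
    # so nothing slips through the gap between true and false.
    conn.execute(
        f"INSERT INTO reject SELECT * FROM staging WHERE ({BELONGS}) IS NOT 1"
    )
    target_ids = [r[0] for r in conn.execute("SELECT id FROM target ORDER BY id")]
    reject_ids = [r[0] for r in conn.execute("SELECT id FROM reject ORDER BY id")]
    conn.close()
    return target_ids, reject_ids

# Made-up sample rows: (id, customer, amount)
sample = [
    (1, "acme", 10.0),
    (2, None, 5.0),        # missing customer
    (3, "globex", -4.0),   # negative amount
    (4, "initech", 7.5),
    (5, "umbrella", None), # unknown amount
]
target_ids, reject_ids = sift(sample)
print("target:", target_ids)
print("reject:", reject_ids)
```

Notice that the reject statement is defined as "everything the rule did not positively accept" rather than as a list of known failure modes. That way a row nobody anticipated still ends up in the reject pile instead of the Important Place.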

 

Within a Netezza domain, start thinking about doing everything in scale. Rather than examining a problem as though we're trying to find a criminal, examine the problem in terms of what criminals look like, and what they don't. These provide clues to Netezza on "where not to look" - and we'll find our suspect faster.

 

Many years ago I assisted (from a virtual distance) a prosecutor in a dragnet operation where the various detectives had, over time, photographed or filmed a number of drug-purchases by nefarious characters and denizens of society. In one day and night of misfortune (for the perps), a task force netted all of these people in a several-hour sweep of three large districts in three states. Within a matter of hours, all but one of the more than three hundred perps had been captured, and the final one was picked up the next morning when he walked in his front door. We hear about occasional "creative stings" - one in Dallas where every person with an outstanding warrant was sent a mailer saying that they had won an all-expense-paid cruise vacation and all they needed to do was show up at Texas Stadium to claim their prize. Of course, in reality only police officers awaited them, and scuttled the hapless souls out the back door to waiting transportation to a local incarceration facility.

 

Why does all this matter? When we want to find the bad data, we need a way to call it out. We need a way to find it in a manner that separates it from the rest of the data. In human terms, we often think in terms of identifying the perps and going after them, because we cannot possibly sift through the counties on a door-to-door search. We might know for sure that a nefarious character lives in a certain neighborhood, but without specific intelligence to locate the character, he remains at large. Netezza doesn't do a door-to-door search either. In fact, if we can only identify "what isn't a perp" - the suspects stand out like the hapless lemmings at Texas Stadium. Both of them arrived by a form of natural physics. The perps stood out by the natural human gravity of something-for-nothing. Our bad data stands out using the natural physics of the Netezza architecture.
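Here is a rough sketch of the "identify what isn't a perp" approach, again using Python's sqlite3 module as a stand-in rather than Netezza itself. The tables, the country-code column and the sample values are all made up for illustration. We describe only the known-good values, and a single anti-join lets everything else stand out, including kinds of junk we never thought to write a rule for.

```python
import sqlite3

def find_strangers(incoming, known_good):
    """Return incoming rows whose country has no match in the known-good set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE incoming (id INTEGER, country TEXT)")
    conn.execute("CREATE TABLE known_good (country TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO incoming VALUES (?, ?)", incoming)
    conn.executemany(
        "INSERT INTO known_good VALUES (?)", [(c,) for c in known_good]
    )
    # Anti-join: keep only the rows that found no partner in known_good.
    # A NULL country never matches anything, so it stands out as well.
    rows = conn.execute("""
        SELECT i.id, i.country
        FROM incoming AS i
        LEFT JOIN known_good AS k ON k.country = i.country
        WHERE k.country IS NULL
        ORDER BY i.id
    """).fetchall()
    conn.close()
    return rows

# Made-up sample data: (id, country)
incoming = [(1, "US"), (2, "XX"), (3, "DE"), (4, None), (5, "us")]
known_good = ["US", "DE", "FR"]

strangers = find_strangers(incoming, known_good)
print(strangers)
```

Nothing in that query says what a bad value looks like. The lowercase "us" and the missing value were never described anywhere, yet they fall out along with "XX" simply because they are not the real thing.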

 

Or in another vein, Secret Service agents and bank employees are taught how to spot a counterfeit bill based on what the real thing looks like. While they may know various counterfeiting techniques, nothing beats the ability to know the real thing so well that a counterfeit stands out like a sore thumb. Got a pile of cash in your pocket? Could you tell which bills are real and which are counterfeit because you know "what isn't a counterfeit" better? Same technique applies here - give Netezza the cues on where not-to-look and we have our counterfeit data, our posers, our perps and wannabes. And make the domain safe before bedtime.