[ www.netezza.com ]

Here's a shout-out to all you ELT aficionados out there - those who have embraced the call to use the Netezza machine for hard-core data processing, and not just query acceleration. What's that? You've deployed it as a query accelerator because that was your functional requirement? Tish-tosh, you are under-utilizing the machine.

 

In ELT space, we see data arriving on our machine's eastern shore like immigrants from a foreign land. Give us your poor, tired and huddled data, and buried information yearning to be free, and all that. We need liberation! (a subject of another like-minded site) and the big-black-box is a beacon to collect the uncollectable, love the unlovable, and process the unprocessable data arriving in completely un-integrated form. We see information from this source or that, arriving on rafts, boats, inner-tubes and the like, and we want to believe that all are created equal, yet our process for assimilation and naturalization of the data has an uptake, doesn't it? Perhaps we'll stage the information (give it a green card) and maybe even load it partially-cleansed into an ODS - but one way or another we have to challenge the information, make it consumable "enough" even when it first arrives.

 

ELT is a practice already found in many RDBMS's, engaged by people who have no desire to purchase middleware, and honestly believe that pulling the data out of the database, processing it only to put it right back, is a waste of time and resources. Fire up a stored procedure, they will say, and process the data on the machine. Isn't this the most efficient means to achieve our goal?

 

On the Netezza machine, you bet. We have hundred(s) of processors working in purpose-built synergy toward this goal. But on an SMP-based RDBMS, no way. It won't scale and is destined to run out of gas. It's only a matter of time. And because we would have to use stored procs to effect our outcome, we also embrace a black-box processing scenario that really is black, lights-out, underground, all the bad things. Our poor operators will watch it kick off, run for hours and hope it finishes on time, and correctly. Once it becomes visible to an operator or admin, it's already running out of gas, and now we'll watch the engineers swarm to do what they do best - engineer - and the danse-macabre of propping up a dying process with artificial respiration.

 

Which is why we have Rule #10, isn't it?

 

One reason some don't embrace ELT - that is, simple data transport followed by hard-core data processing in the database engine - is because it's a bad practice to do bulk data processing in the (SMP-based) database engine. Since Netezza has broken this envelope, we now have freedom to proceed, but wait. We need a way to manage the ELT flow itself. After all, the ELT flow is just a series of SQL statements, right? Even the most robust "ETL" tools will only support "ELT" by firing off sequential SQL statements because they are not really in control of the data. What we'd like to see is a flow-based mechanism like Ab Initio, Informatica, Expressor or the like, to transparently harness the SQL statement like a true transformation component, even though in the background it will "only" fire off a SQL statement to effect the outcome. With the Netezza power we really can process the data with mind-bending speed, but after the smoke clears we need a way to report, track and manage this process. A SQL statement seems rather primitive, raw, and too much like hand-coding. Largely because it is. If we had a tool that could manufacture these kinds of transforms on-the-fly and manage the process as a visual flow, hey, life would be good.

 

What would this look like? Well, sort of like what we see today in flow-based management. Expressor, for example, allows us to leverage Visio to describe the flow, then Expressor components consume the Visio diagram's metadata and manufacture a living program. Ab Initio uses its proprietary graphical canvas to effect a similar scenario. What we really need is the ability in one of these products (or another product entirely) to pull a Netezza ELT component onto the canvas, connect one end to a source table, one end to a target table (albeit a temp-table if necessary) and allow us to describe the transform between the two - just like any other transform. Ultimately this would provide us with graphical control over the flow in a visible, manageable and traceable form. Alas, we as the machine owners must (for now) embrace some degree of scripted logic to effect our desired outcome. I see this as a temporary state of affairs. Someone will rise to the challenge.
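Until that someone shows up, the "scripted logic" can at least be made visible and traceable. Here is a minimal sketch of the idea, not anyone's product: each step names a source table, a target table and the transform between them, and a small runner fires the SQL and records row counts and timings. The names `EltStep` and `run_flow` and the tables are hypothetical, and Python's built-in sqlite3 stands in for the engine purely so the sketch runs anywhere; on a real machine the statements would go through whatever driver the site uses.

```python
import sqlite3
import time
from dataclasses import dataclass


@dataclass
class EltStep:
    """One hypothetical 'ELT component': source table -> transform -> target table."""
    name: str
    source: str
    target: str
    select_sql: str  # the SELECT that reads the source and shapes the rows


def run_flow(conn, steps):
    """Run each step in order as CREATE TABLE ... AS SELECT and log the outcome."""
    log = []
    for step in steps:
        started = time.perf_counter()
        conn.execute(f"CREATE TABLE {step.target} AS {step.select_sql}")
        rows = conn.execute(f"SELECT COUNT(*) FROM {step.target}").fetchone()[0]
        log.append({
            "step": step.name,
            "source": step.source,
            "target": step.target,
            "rows": rows,
            "seconds": time.perf_counter() - started,
        })
    return log


conn = sqlite3.connect(":memory:")  # stand-in engine only, not a Netezza connection
conn.execute("CREATE TABLE stage_orders (id INTEGER, amount TEXT)")
conn.executemany(
    "INSERT INTO stage_orders VALUES (?, ?)",
    [(1, "10.50"), (2, "abc"), (3, "7.25")],
)

steps = [
    # Keep only rows whose amount starts with a digit, and give it a numeric type.
    EltStep("typed", "stage_orders", "orders_typed",
            "SELECT id, CAST(amount AS REAL) AS amount FROM stage_orders "
            "WHERE amount GLOB '[0-9]*'"),
    # Roll the typed rows up into a one-row summary.
    EltStep("summary", "orders_typed", "orders_summary",
            "SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders_typed"),
]

flow_log = run_flow(conn, steps)
for entry in flow_log:
    print(entry["step"], entry["source"], "->", entry["target"], entry["rows"], "rows")
```

The log is the trail boss's status report in miniature: every step says where the data came from, where it went and how many rows made the trip. A graphical tool would draw the same thing as boxes and arrows.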

 

Oh, come on, David, we know that the folks who live and breathe Netezza are the pioneers, who eat from the back of wagons, sleep under the stars and change their own horse-shoes. Innovative problem solvers, braving the wilds of the prairie with fearless resolve. Yeah, uh, before we go down that path, describing a cowboy (and don't get me started), let's examine what the enterprise needs. Whether the cowboys (intone John Williams theme song here) get the job done rustling the data and wrangling the loads, at the end of the day the trail boss will want a status. Have we lost any dogies? Are all of them fed and watered? How much farther to the end of the trail? What about the weather? Wild animals? Data-rustlers (hackers)? The list of pitfalls and opportunities is boundless, and the trail boss wants to know "where we stand". You know, like the business intelligence dashboards.


The reason we might not see this "ELT" harness scenario any time soon from the power-hitter products is that ELT requires the power-hitter to maintain local control but externalize the processing power (delegating it to Netezza). This is unpalatable for the product vendors that claim we don't need Netezza to process the data (and of course, Netezza is horning-in on their action like a good competitor). Yet we have this big-black-box machine sitting on the floor that has the power to perform seriously hard-core processing on a breathtaking scale, achieving internal bandwidth that these power-hitter software products cannot achieve (because their hardware platforms constrain them). Let's face it, put a power-hitter software product on a 32-way Sun machine and then attempt to process data in the same scale as a 200+ processor Netezza machine. No slight on the software product, because it could probably process at Netezza's scale if given enough hardware, but do we really want to deploy a 200 processor Sun machine?

 

Another reason is that Netezza is the only product that truly unleashes the processing capacity to make ELT a practical and easy reality, and is seen as a competitor by the power-hitter software product vendors. Yet another reason is that those who would embark on this path have to commit resources to Netezza's (somewhat) more rarefied market, and for now are simply unwilling to do so. Time will change this, however.

 

I'm not one to tell competitors how they should behave in the marketplace, because competition always increases quality. But if we all get together and shout "we are here" - perhaps at least one maverick elephant will hear our cry. With all apologies to Dr Seuss, we could start our own web presence as Horton Hears an ELT, or Horton.com, or even MaverickElephant.com - I don't know - just thinkin' out loud here.

1 Comments Permalink

I was amazed at how many people showed up for the inauguration. Here in DC the roads were closed, the weather was cold and the atmosphere was, in a word, electric. I also witnessed some disturbing things, like people who had arrived from hundreds of miles away but could not get in, and many more who did get in but had to watch it all on the jumbo-tron screens anyhow. Still, being there is being there, and nobody can take it away.

 

And this is a segue into my observations on some interesting patterns of humanity that align perfectly with our solution domains, so follow me on this. As I sat in a hotel room on Tuesday, debating whether I should brave the extreme cold and traffic, I was reminded that one of our presidents (Harrison) gave a too-long inauguration speech in such conditions, developed a cold and died of pneumonia a month later. As for me, I had a problem to solve - when I turned on the morning news I found a common "of scale" ingress and egress scenario, something I had predicted, though I was stunned by the magnitude of its reality. Yes, an important day for America and for the world - and worth at least a superficial examination of the logistics of people-movement.

 

I noted that the cabs, trains, buses public and private, and other transportation mechanisms had been trickling people onto the scene for days, increasing by the hour. The DC train systems had dropped off over 200k people in a matter of hours, probably setting records as well. But when we think of the drop-off mechanics, these are not people arriving in-bulk. They are collecting from various places in a near-transactional manner. A handful here, a handful there, each one patted-down for their belongings one-at-a-time no differently than our transactional stored procedures check our inbound data one-at-a-time. Then the festivities started, the regalia of our peaceful transition of power, and when it was all done, we saw another interesting effect. It was time for everyone to make an orderly exit.

 

So people who had arrived early in the day and had likewise confirmed a front-row position for the festivities - now did an about-face to exit - only to find a sea of humanity between themselves and the exits. I use the word "exits" here loosely, because they would not be patted-down to leave like they had been to enter, so the egress was a bit smoother and more steady. Uh, you know, like we pat-down the data when it first enters our warehouse, and deliver it clean on-demand. Ohhh, the parallels.

 

Discussing this at the watercooler, a couple of colleagues wondered out loud how to make a mass-egress work. How would we (safely) empty the Mall of the majority of people in a short period? After all, some of the attendees didn't make it back to their hotels until almost nine hours later! One person suggested doing it "the Netezza way" by providing 100 helicopter pads, plus 400 helicopters, each of them alighting on the pads every five minutes with a 20-minute round trip to spirit people out. This would leverage the vertical space, not just the horizontal. But this model won't serve, since the helicopters waste the return leg. One of them suggested conveyor belts, objecting to the Netezza way. But I suggested that this better represented the Netezza way, a streaming model of constantly moving data. The helicopters could move people at only 1/100th the speed of the same number of conveyors, and the conveyors wouldn't have to move all that fast.

 

The streaming model is something that shakes the rafters on our reporting models, but as with any problem of scale, we must provide the physical plant first, and it has to address the problem on purpose.

0 Comments Permalink

Here I am in Washington DC. Yes, that's right, Washington DC on the eve of the inauguration. And one may well wonder why a boy from Texas is hanging out in DC? I don't really have plans to attend the events of the next several days, as I am ensconced at a client site on a new and challenging project. I'll share some of the more generic challenges and opportunities at this site at a later time, scrubbing the content, as always. However, the DC area is about to be inundated with what is nothing less than a problem-of-scale. Estimates are that over three million people will attend the inauguration tomorrow, and they've pulled out all the stops and added extra manpower to assist in this mass ingress (and egress) of humanity. The parks services here often use indirect readings to get a better picture of total headcount (like water usage, purchases of refreshments, etc). In the most recent Earth Day celebration, they over-estimated the total attendees because they left more trash on the ground than the usual events. The irony is not lost.

 

It is instructive what the local authorities have initiated - the bridges and roads all around the city are closed to motored traffic - allowing foot-traffic only. In case of emergency, people can make a quick and orderly exit from the city and into the surrounding areas. Of course, after watching the mayhem of 9/11, we know that if a catastrophic emergency arises, people will make it to the riverfront and then wait patiently in line to cross the closest bridge, right? Who are we kidding? Expect to see swimmers in the water, and all that. Not to put a damper on the festivities, of course, this is an important day for America in many ways, and the people in charge are very carefully seeing to it that all forms of chaos are under control.

 

But are we? How does chaos, even explosive, career-threatening chaos enter our environments and wreak havoc on our systems? I noted in a prior entry that simply watching for what is known while ignoring the unknown - or even accidentally allowing the unknown to pass through because we failed to recognize it - isn't good enough. It wouldn't be good enough this week for the Obama and Biden families, so why is it ever good enough for our mission-critical systems? It is because we are accustomed only to checking our information against the standard to which it must rise, rather than checking it against a minimum standard to which it cannot rise at all.

 

Now, that sounds a little oblique so here's an example: Whenever a police dog from any given K-9 unit in America is searching for a suspect in the dead of night amidst flashlights, annoying shouts and eerie background music, it is searching for one and only one thing - the scent that isn't there. In fact, for any of these dogs you could easily perform a test - fill a room with fifty people and have one of them leave and hide. Then bring in the dog. It will immediately lock upon the missing-scent and go after the person who isn't there. How amazing is that? Of course, their olfactory sensors are several hundred thousand times more sensitive than ours.

 

Within international terrorist discussions on the chit-chat shows, we hear a maxim of "our intelligence has to be right 100% of the time, but a terrorist only has to get it right once." In other words, the terrorist may make one thousand attempts to take down a target, and only has to get it right once to succeed. But the people protecting it have no options - it has to be right all the time. Errors and junk in our data are a form of virtual terrorism - but only by latent effort on their part. Like dust slowly settling on electronic parts, if the parts are not protected from the creeping effect of buildup, the layer will eventually reach a critical mass and bring down the system. No one speck of dust did the trick, only the lack of attention to the dirt. When we say, "what are the odds?" - that word "odd" has a meaning. It is the meaning of "what doesn't belong here".

 

In a data processing environment, chaos thrives amid neglect. If we're in hard pursuit of chaos it has less of a chance to succeed - but we're only talking about probability now. Sooner or later, will the odds be in the favor of the last speck of dust that really counts, or will we keep the dust to an acceptable level so that there is no appreciable or dangerous buildup? Before I start sounding like a cleaning commercial, let's try to keep in mind that we need some serious sifting power, and not just spot-check, sampling or rules-based checking of row-at-a-time data. Dirt and all its patterns show up slowly in some systems, but show up in bulk in ours.

 

We need a way to bulk-lift the dirt out and away. Or for that matter, redirect the dirt in quantity so that it never finds a home. How do we do this? By challenging as-suspect every row that arrives on the front door. Of course, doing this "upon arrival" is a bit daunting, highly inefficient and error-prone, and not really practical. No, what we want to do is pull all the candidates inside, take a look at those that belong and those that don't, and sift them while they are in the pipeline before they ever land in an Important Place. We need something akin to a switching device on a train track, or like a letter-sorter in a postal office, or like a coin-sifter in a change-making machine. We need for this sifting effect to fall-out using normal physics, not as a result of carefully handling the elements. We just don't have the time or bandwidth to taste-touch-handle each row.

 

Within a Netezza domain, start thinking about doing everything in scale. Rather than examining a problem as though we're trying to find a criminal, examine the problem in terms of what criminals look like, and what they don't. These provide clues to Netezza on "where not to look" - and we'll find our suspect faster.

 

Many years ago I assisted (from a virtual distance) a prosecutor in a dragnet operation where the various detectives had, over time, photographed or filmed a number of drug-purchases by nefarious characters and denizens of society. In one day and night of misfortune (for the perps), a task force netted all of these people in a several-hour sweep of three large districts in three states. Within a matter of hours, all but one of the over three-hundred perps had been captured, and the final one on the next morning when he walked in his front door. We hear about occasional "creative stings" - one in Dallas where every person with an outstanding warrant was sent a mailer saying that they had won an all-expense-paid cruise vacation and all they needed to do was show up at Texas Stadium to claim their prize. Of course, in reality only police officers awaited them, and scuttled the hapless souls out the back door to awaiting transportation to a local incarceration facility.

 

Why does all this matter? When we want to find the bad data, we need a way to call it out. We need a way to find it in a manner that separates it from the rest of the data. In human terms, we often think in terms of identifying the perps and going after them, because we cannot possibly sift through the counties on a door-to-door search. We might know for sure that a nefarious character lives in a certain neighborhood, but without specific intelligence to locate the character, he remains at large. Netezza doesn't do a door-to-door search either. In fact, if we can only identify "what isn't a perp" - the suspects stand out like the hapless lemmings at Texas Stadium. Both of them arrived by a form of natural physics. The perps stood out by the natural human gravity of something-for-nothing. Our bad data stands out using the natural physics of the Netezza architecture.

 

Or in another vein, Secret Service agents and bank employees are taught how to spot a counterfeit bill based on what the real thing looks like. While they may know various counterfeiting techniques, nothing beats the ability to know the real thing so well that a counterfeit stands out like a sore thumb. Got a pile of cash in your pocket? Could you tell which ones are real and which ones are counterfeit because you know "what isn't a counterfeit" better? Same technique applies here - give Netezza the cues on where not-to-look and we have our counterfeit data, our posers, our perps and wannabes. And make the domain safe before bedtime.

0 Comments Permalink


And now, a drum roll please, the inaugural entry for this auspicious occasion. I realize that many people who read this will be Enzees and non-Enzees, so for those who want to know a little more about the machine, sort of dipping the toes in the water, I'm talkin' to you.

 

And for those who are already swimming in the deep end, and those in the deep end without a Netezza machine, I'll try to shape some thoughts for your own discussions. I've noted in other venues (particularly in the Netezza Underground, and that's my only shameless plug for the book!) that the Netezza machine addresses problems-of-scale. We know what a function-point looks like, and we know when data is transformed from one shape to another. What we might not consider is what it takes to make this happen in scale. And not to belabor a point, but Netezza is an appliance. While we know that we can make enough drinks for a small dinner party in our blender-appliance, and serve coffee to them afterward from our automatic coffee maker-appliance, will these same appliances cover, say, one thousand people? Ten thousand people? For that matter, if we have a simple toaster-appliance, will even our four-slot toaster satisfy warm-toast-requirements for, say, a thousand hungry teenagers? I recall having a winter get-together for a bunch of teenagers in our home, and one of the parents had signed up to provide chili. I still remember the kids piling into the house, about thirty or more, cold and thirsty from being outside in the weather. The parent showed up with a mini-crock-pot containing the warmed contents of two cans of store-bought chili, and had clearly missed the memo to show up prepared to feed the masses. We ordered pizza.

 

The requirements for processing data in scale are no different, and we have several primary hurdles to overcome that Netezza has recognized, embraced, harnessed and solved, and I would be remiss for not pointing them out, because for some people these are not obvious. But before we delve, consider this: Let's say I have a compliance model and I want to find the few thousand records among millions or billions that are not in compliance. This is a needle-in-a-haystack-problem, and for some of you that pile of hay is pretty big! If we did this in a "typical" fashion (without Netezza), we would examine each row for a variety of anomalies, all based on some criteria or set of rules. We really need to get it right on the first pass, because a multiple-pass-model on the data is unthinkable.

 

Does this always work? What if one of the anomalies gets past the sifter because we did not actively include a rule that one of the bad-boy records is now deftly sidestepping without our knowledge? Will this dirt creep into our warehouse? Will our users see the dirt before we do? Will this chaos potentially spell doom for someone, somewhere, and can we rescue it from the tracks before the locomotive arrives? Stop throwing popcorn, it's only a melodrama.

 

But with Netezza we have a very interesting option, in that Netezza operates on the principle of where-not-to-look. If we give it enough information, it can ignore the data we don't care about and by default bubble-up the data we want. In the above case, let's say we have two broad categories of data, the "hay" and the "needles". We know that the needles don't belong, and we can certainly tell what hay looks like. Now what if we did a single-pass on the data, identifying everything that looks like hay, and rolled the identifiers for these records into a temporary table? In the next pass (that's right, another pass!) we simply anti-join the original data with this temporary table (using a where-not-exists), and voila! - the needles fall out of this as a natural result. Will all of these results be needles for certain? Well, we know that if all the other records are definitely hay, we now have a much smaller and objective subset of needle-candidates to work with, and that all of our needles are in it, even though some anomalous "borderline hay" might be there, too. Either way, the needle-candidates are in the tank, ready for examination. And more importantly - our haystack itself is pure and ready for the next downstream operation.
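For those who like to see it on the page, here is a small sketch of the two passes. Everything in it is made up for illustration: the `txn` table, the `hay_ids` temp table and the rule for "what hay looks like" are assumptions, and Python's built-in sqlite3 is only a stand-in so the SQL can run anywhere. On the machine itself the same two statements would run against the real tables, in scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine for illustration only
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(1, 25.0, "US"), (2, -4.0, "US"), (3, 80.0, None), (4, 12.5, "CA"), (5, 60.0, "US")],
)

# Pass 1: roll up the identifiers of everything that positively looks like hay.
# The rule below (positive amount, known country) is a hypothetical definition.
conn.execute("""
    CREATE TEMP TABLE hay_ids AS
    SELECT id FROM txn
    WHERE amount > 0 AND country IN ('US', 'CA')
""")

# Pass 2: anti-join the original data against the hay, using where-not-exists.
needle_candidates = [row[0] for row in conn.execute("""
    SELECT t.id FROM txn t
    WHERE NOT EXISTS (SELECT 1 FROM hay_ids h WHERE h.id = t.id)
    ORDER BY t.id
""")]

print(needle_candidates)  # prints [2, 3]
```

Note row 3, the one with the missing country. Nobody wrote a rule that says "a missing country is bad", and it falls out anyway, simply because it never qualified as hay. That is the point of describing the hay rather than chasing each kind of needle.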

 

And this is what's important - if we have billions of rows in a table and we want to find the few thousand that will cause us trouble, we can perform this two-step carving using Netezza's massively parallel power, carve out the troublesome ones and pass good data to the next downstream process. We then deal with the anomalies in a more administrative manner. In short, we don't have to "stop the presses" for the sake of the needles. We just remove the needles and continue. I know that some people have the philosophy that if any needles exist at all, we must stop-the-presses and figure things out before proceeding. This might work for a few hundred anomalies in a few thousand, but it won't scale.
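A rough sketch of "remove the needles and continue" follows, again with hypothetical names (`inbound`, `clean`, `quarantine`) and sqlite3 standing in for the engine. The hay rule is an assumption. What matters is the shape: one table of good rows flows on downstream, one table of needle-candidates waits for administrative attention, and the two reconcile to the original count so nothing quietly disappears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine for illustration only
conn.execute("CREATE TABLE inbound (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.executemany(
    "INSERT INTO inbound VALUES (?, ?)",
    [(1, 5), (2, None), (3, 12), (4, -1), (5, 3)],
)

# Step 1: identify the hay (hypothetical rule: a non-negative quantity).
conn.execute("CREATE TEMP TABLE hay_ids AS SELECT id FROM inbound WHERE qty >= 0")

# Step 2: carve the data in two. Good rows move on, the rest are set aside.
conn.execute("""
    CREATE TABLE clean AS
    SELECT i.* FROM inbound i
    WHERE EXISTS (SELECT 1 FROM hay_ids h WHERE h.id = i.id)
""")
conn.execute("""
    CREATE TABLE quarantine AS
    SELECT i.* FROM inbound i
    WHERE NOT EXISTS (SELECT 1 FROM hay_ids h WHERE h.id = i.id)
""")


def count(table):
    """Row count for one of the tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


total, kept, held = count("inbound"), count("clean"), count("quarantine")
print("inbound:", total, "clean:", kept, "quarantine:", held)

# The downstream process carries on with the clean rows; no presses stopped.
downstream_total = conn.execute("SELECT SUM(qty) FROM clean").fetchone()[0]
print("downstream total:", downstream_total)
```

The reconciliation (clean plus quarantine equals inbound) is the cheap insurance here. It lets the operators confirm that every row went one way or the other before the next step fires.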

 

Our analysts will grow impatient and our troubleshooters catatonic from the sheer volume. Someone will scream an epithet that begins with the words "Why don't you just..." and hopefully the end of this sentence will be something professional like "take the dirt out and deal with it elsewhere, but don't hold the rest of our data hostage!"

 

I'll deal with hostage negotiation in a later entry.

0 Comments Permalink

Fire it up!

Posted by David Birmingham Jan 12, 2009

Greetings to all Enzees everywhere and welcome to the Grill!

 

Now, to level-set a bit on where this "grill" concept came from, I was hanging out at the house with friends and family shooting the breeze, cooking food on the outdoor grill and it sort of dawned on me that this is a lot like the virtual atmosphere in the Enzee Community. Maybe some of the analogies fall short, but the spirit of the atmosphere is what we're after.

 

And of course, the grill itself. A place where we toss things on the cooker, or cook-things-up, or take a raw idea to its conclusion, along with the transformation that happens from raw-to-cooked, and hopefully we won't have anything half-baked!

 

But keep in mind this is a grill, not a roast! So let's keep the atmosphere jovial and forward-looking, because that's where all the action is. And like any good cookout, while there's a guy flippin' the food who seems to be in charge, everyone's opinion counts. I'll write most of the entries in essay-style form, perhaps even persuasive-essay style, because this is a blog and I have opinions too - but I don't do one-liner entries (mostly).

 

I'd like to thank the folks at Netezza for this opportunity to interact with the Enzee community at large, and hope it becomes a very fruitful discussion for all.

 

In the spirit of the political season, the "inaugural" discussion will appear next!

0 Comments Permalink