Here's a shout-out to all you ELT aficionados out there - those who have embraced the call to use the Netezza machine for hard-core data processing, and not just query acceleration. What's that? You've deployed it as a query accelerator because that's was your functional requirement? Tish-tosh, you are under-utilizing the machine.
In ELT space, we see data arriving on our machine's eastern shore like immigrants from a foreign land. Give us your poor, tired and huddled data, and buried information yearning to be free, and all that. We need liberation! (a subject of another like-minded site) and the big-black-box is a beacon to collect the uncollectable, love the unlovable, and process the unprocessable data arriving in completely un-integrated form. We see information from this source or that, arriving on rafts, boats, inner-tubes and the like, and we want to believe that all are created equal, yet our process for assimilation and naturalization of the data has an uptake, doesn't it? Perhaps we'll stage the information (give it a green card) and maybe even load it partially-cleansed into an ODS - but one way or another we have to challenge the information, make it consumable "enough" even when it first arrives.
ELT is a practice already found in many RDBMS's, engaged by people who have no desire to purchase middleware, and honestly believe that pulling the data out of the database, processing it only to put it right back, is a waste of time and resources. Fire up a stored procedure, they will say, and process the data on the machine. Isn't this the most efficient means to achieve our goal?
On the Netezza machine, you bet. We have hundred(s) of processors working in purpose-built synergy toward this goal. But on an SMP-based RDBMS, no way. It won't scale and is destined to run out of gas. It's only a matter of time. And because we would have to use stored procs to affect our outcome, we also embrace a black-box processing scenario that really is black, lights-out, underground, all the bad things. Our poor operators will watch it kick off, run for hours and hope it finishes on time, and correctly. Once it becomes visible to an operator or admin, it's already running out of gas, and now we'll watch the engineers swarm to do what they do best - engineer - and the danse-macabre of propping up a dying process with artificial respiration.
Which is why we have Rule #10 isn't it?
One reason some don't embrace ELT - that is - simple data transport followed by hard-core data processing in the database engine, is because it's a bad practice to do bulk data processing in the (SMP-based) database engine. Since Netezza has broken this envelope, we now have freedom to proceed, but wait. We need a way to manage the ELT flow itself. After all, the ELT flow is just a series of SQL statements, right? Even the most robust "ETL" tools will only support "ELT" by firing off sequential SQL statements because they are not really in control of the data. What we'd like to see, is a flow-based mechanism like Ab Initio, Informatica, Expressor or the like, to transparently harness the SQL statement like a true transformation component, even though in the background it will "only" fire off a SQL statement to affect the outcome. With the Netezza power we really can process the data with mind-bending speed, but after the smoke clears we need a way to report, track and manage this process. A SQL statement seems rather primitive, raw, and too much like hand-coding. Largely because it is. If we had a tool that could manufacture these kinds of transforms on-the-fly and manage the process as a visual flow, hey, life would be good.
What would this look like? Well, sort of like what we see today in flow-based management. Expressor, for example, allows us to leverage Visio to describe the flow, then Expressor components consume the Visio diagram's metadata and manufacture a living program. Ab Initio uses its proprietary graphical canvas to affect a similar scenario. What we really need is the ability in one of these products (or another product entirely) to pull a Netezza ELT component onto the canvas, connect one end to a source table, one end to a target table (albeit a temp-table if necessary) and allow us to describe the transform between the two - just like any other transform. Ultimately this would provide us with graphical control over the flow in a visible, manageable and traceable form. Alas, we as the machine owners must (for now) embrace some degree of scripted logic to affect our desired outcome. I see this as a temporary state of affairs. Someone will rise to the challenge.
Oh, come on, David, we know that the folks who live and breathe Netezza are the pioneers, who eat from the back of wagons, sleep under the stars and change their own horse-shoes. Innovative problem solvers, braving the wilds of the prairie with fearless resolve. Yeah, uh, before we go down that path, describng a cowboy (and don't get me started) let's examine what the enterprise needs. Whether the cowboys (intone John Williams theme song here) get the job done rustling the data and wrangling the loads, at the end of the day the trail boss will want a status. Have we lost any dogies? Are all of them fed and watered? How much farther to the end of the trail? What about the weather? Wild animals? Data-rustlers (hackers). The list of pitfalls and opportunities is boundless, and the trail boss wants to know "where we stand". You know, like the business intelligence dashboards.
The reason we might not see this "ELT" harness scenario any time soon from the power-hitter products is that ELT requires the power-hitter to maintain local control but externalize the processing power (delegating it to Netezza). This is unpalatable for the product vendors that claim we don't need Netezza to process the data (and of course, Netezza is horning-in on their action like a good competitor). Yet we have this big-black-box machine sitting on the floor that has the power to perform seriously hard-core processing on a breathtaking scale, achieving internal bandwidth that these power-hitter software products cannot achieve (because their hardware platforms constrain them). Let's face it, put a power-hitter software product on a 32-way Sun machine and then attempt to process data in the same scale as a 200+ processor Netezza machine. No slight on the software product, because it could probably process at Netezza's scale if given enough hardware, but do we really want to deploy a 200 processor Sun machine?
Another reason is that Netezza is the only product that truly unleashes the processing capacity to make ELT a practical and easy reality, and is seen as a competitor by the power-hitter software product vendors. Yet another reason is that those who would embark on this path have to commit resources to Netezza's (somewhat) more rarefied market, and for now are simply unwilling to do so. Time will change this, however.
I'm not one to tell competitors how they should behave in the marketplace, because competition always increases quality. But if we all get together and shout "we are here" - perhaps at least one maverick elephant will hear our cry. With all apologies to Dr Seuss, we could start our own web presence as Horton Hears an ELT, or Horton.com, or even MaverickElephant.com - I don't know - just thinkin' out loud here.
Great post, David - just thought I'd throw in a couple of comments re 3rd party 'ELT' style products..
Before it got bought by Oracle, Sunopsis enabled some of what you describe, and I see from a recent blog posting that Informatica's PushDown Option (PDO) now includes Netezza support. I've done some experimentation with earlier Informatica PDO versions on Netezza, and they're certainly going in the right direction. Many (though not all) transformations (e.g. aggregations, look-ups, some expressions, sorts, joins, unions..) can be pushed down to the Netezza system if the source and target are on the same connection. So the system is capable of generating INSERT.... SELECT.. type code from a mapping described in the normal developer GUI, while retaining the appropriate metadata in the Informatica repository.
This doesn't however mean that an existing 'Oracle style' mapping will generate pushed down code if the source and target are migrated to Netezza... to get the best out of this functionality will generally require a rethink in approach, just the same as when 'hand coding' the ELT operations.
I'll hopefully be doing some work with the latest Informatica version very soon - I'll report back on what I find out.
D.
Principal Consultant
Edge Assoicates