At the June conference we covered a lot of practices around harnessing the Netezza machine for fun and profit. But the buzz on ELT was rising then, and when I arrived (and remained after) the November London Roadshow, people were practically tackling me in the hallways to learn more about how ELT works, how well it work, and how practical it is.
Well, the ELT buzz is growing even louder. I've been practically inundated with requests for more information on ELT and how it operates. Of all the opinions on the subject, the conversation usually boils down to the irreducible: How do we get plain vanilla queries to do more than be, well, you know - plain vanilla queries? I mean, the business logic and flow rules are all bound into it! And if the queries can be thrown into the machine with reckless abandon how can we possibly make any sense of them? ELT means forming structured flows of information using intermediate tables as checkpoints. Netezza is one of the few platforms in the world that can do this effortlessly and productively. Anyone now exposed to this (like a common RDBMS developer) would claim that ELT is unworkable on any database. Well, perhaps on your database, but not on Netezza.
Of course, we cannot allow arbitrary SQL to enter the flow can we? But does this mean arbitrarily submitted, or arbitrarily formed? You see, if the SQL has a consistent form, we have all the freedom we need for arbitrary submission. I suppose it's the "degree" of arbitrary we're looking for no?
Case in point, a vendor tells me "Oh, you can send any arbitrary query to the machine and.." Of course he means anything, absolutely anything our developers choose to concoct of their own cleverness or ignorance, or anywhere in between. In fact, cleverness is perhaps the least of our worries since the clever queries often follow some degree of structured discipline. It's the queries produced by the newbies that pose the most danger. It's hard to remediate a lot of work that took place "in the wild", so to speak. Of course, for those who like their wild-game-SQL or pehaps free-range-SQL, I will not disparage their personal tastes. But this is enterprise computing after all, and many of my personal tastes are already checked at the door.
And as I've noted in the past, metadata is where it all happens anyhow so let's take a short dive into the future of ELT and what I suspect is the next wave of stuff we'll see.
1 - insert/select queries, while they are the ultimate bread-and-butter of ELT, their hard-wired form should not rule the roost. One need not fear being enslaved by the queries, but should instead harness them. In fact, the queries should never completely exist in one place, but be manufactured at run time. Now, our BI tools already do this, right? We post a request into our favorite BI tool and it leverages its metadata repository to form-the-perfect-query. That's where we need to be. Provide a template and some steering logic, but the query itself is manufactured on-the-fly.
2 - Automate everything, from DDL to physical model, from intake to reposit, and it should be a fairly simple approach. When we takea look at the patterns of what we're trying to accomplish, we are essentially punting around metadata (tables, columns, data types, rules) and these are the object-oriented assets shaping the solution-at-hand. We should not be in the businss of crafting SQL statements (let a product do that for us) but in the business of shaping logic. The ETL tools purport to do this but cannot scale (for set-based operations) as effortlessly as Netezza can. So in some ways, the ELT aficionado is stuck in a bubble waitng for the technologies to mature.
3 - Think in terms of metadata, factories and steering logic. Okay, what does that mean? Factories mean components that take raw parts and manufacture them into SQL statements and other assets. This allows us to drive practically everything from knowledge of the Netezza catalog. Source table to target table? All we need are the rules in the middle and we're done, right? So if we can make simple lists of rules that are tied to particular table/colums as targets, we are now in the zone to leverage a factory to manufacture the proper insert/select statement to send to the machine. We don't really need to have the predefied insert/select, only a list of target columns tied to rules. The rules then become our source map, and the target columns become our fodder for manufacturing the insert-clause. Bolt a filter / where-clause on the end of it and yes, we've have valid SQL. But wait, what about steering logic? As my factory is manufacturing the query, I can make all kinds of decisions about the table and columns as I encounter them. For example, if I am making an extraction query to face a database, as I encounter each column in the catalog, I can compare this to yet another repository of information that can tell me, does this column use a different name on the extraction source than the target? Perhaps I need to exclude it from the query altogether? Perhaps I need to alter the value of a given column to a default (data masking) or I may need to bolt-on columns to the end of the structure. In short, the steering logic is a simple set of exports that I can tie by-name to the source and table name so that it is surgically applied as necessary. This means every extraction is now programmable because we've built the programmability into the factory - the switches and signals serve as steering logic as the factory weaves the perfect query.
What is BI anyhow? Some companies who do all-data-warehousing have jumped into BI thinking it's just the icing on the cake, so to speak. Others embrace BI as their primary focus but then drive into a pit of miry clay when they have to access data untouched by the scrubbing bubbles of an enterprise-class data warehouse. So which is it? Data warehousing and BI? Does BI encompass data warehousing? I've had this conversation with so many people who cannot seem to put their finger on the definition. One argued passionately that we should simple consider data warehousing and BI to be one and the same, such that when we talk about BI, we're talking about both. Then not five minutes later he claimed that Microsoft Excel was the most widely used BI tool in the world, to which I asked wryly, "since when did Excel start doing data warehousing and ELT" - he sort of stared at me, not knowing why I had asked the question, forgetting his prior claims as though second thoughts entirely. My simple assertion is that the open marketplace calls them out as different - BI is not data warehousing and likely never will be.
Cached stuff - Like any system, catalog-hits can become cumbersome and the factories leveraging them can begin to take protracted lifecycles for building out the queries, some of which can be very complex. However, if they are based on static metadata (like that in the catalog) it is easier to cache the manufactured assets (less any table/column-centric adjustments) and avoid re-manufacturing these assets each and every time we need them. In one case, manufacturing the assets for each extract/load took about five minutes to complete when this metadata had to be manufactured from scratch each time. After caching, however the entire duration became less than a minute. Many ELT-oriented tools actually do consume the catalog but unfortunately do not consider this aspect of data processing - that the metadata itself - when manufactured from static assets, becomes a repeatable product. A perfect candidate for a cache.
Transform metadata - talk about "all the rage". Hey, people want ELT, but they don't want to risk losing track of how to trace data lineage through the multiple stages of SQL statements. After all, by the time the data actually arrives in our BI tool, we might see the data undergo hundreds of insert/select queries to arrive at the final data outcome. What if an analyst wants to perform track-back lineage of the data, or perhaps even do an impact analyis of the environment? These are impossible without metadata-driven artifacts, rather than allowing SQL guys to manufacture any arbitrary query on-the-fly. What does this look like? Well, think in terms of source columns, rules applied to those columns and their outome sent to target columns. Then we need a filter clause (join, where, filter, group) that is easily identified. Of all this, we need to capture rules, participating tables, participating columns (both in the source to target map and in the filter clause). It's not really all that hard to capture it if the developers have followed certain rules.
Rules-based asset manufacturing - (a spinoff of one of the above) is simply providing a structured template where developers can formulate and test a single query for accuracy, or multiple dependent queries. But in either case, the metadata in the catalog can provide us with the the raw structures - columns, tables, etc and we can then formulate templates from these. We then hand off this template to a developer and they fill in the details. They can't really get themselves into trouble because if the developer tries to do something illegal or ignorant, it's all part of the game to us. Our environment requires the developer to adapt - or the query simply won't run. I was on a site where we had only a handful of basic programming rules. Easy enough - like falling off a log. But rules will not protect us from someone who is willing to break them. Here we have templates that provide the balance between rigidity and infinite flexibility. Make the template work and you are by definition in-compliance.
Without elaborating into infinite detail, the realm of ELT is still maturing. ETL tools jockey for position in push-down approaches, but know that if they are really good at this, they are essentially working themselves out of a job. If for no better reason than they have effectively harnesses a Netezza machine to do all of the heavy-lifting their own hard-won capabilities currently do. And if so, the irony is that any one of their smaller competitors stands to gain much more in this marketing space because they have less to lose. After all, we don't buy a power-hitter ETL tool and relegate it to data-transport/transfer activities (it's new role alongside the Netezza machine) but a smaller player would love to fill this spot, grow it, mature it and deliver it as a packaged method to create fire-from-fingertips.

