Skip navigation

Gather 'round the Grill

2 Posts tagged with the data tag
0

ELT buzz is rising

Posted by David Birmingham Dec 20, 2010

At the June conference we covered a lot of practices around harnessing the Netezza machine for fun and profit. But the buzz on ELT was rising then, and when I arrived (and remained after) the November London Roadshow, people were practically tackling me in the hallways to learn more about how ELT works, how well it work, and how practical it is.

 

Well, the ELT buzz is growing even louder. I've been practically inundated with requests for more information on ELT and how it operates. Of all the opinions on the subject, the conversation usually boils down to the irreducible: How do we get plain vanilla queries to do more than be, well, you know - plain vanilla queries? I mean, the business logic and flow rules are all bound into it! And if the queries can be thrown into the machine with reckless abandon how can we possibly make any sense of them? ELT means forming structured flows of information using intermediate tables as checkpoints. Netezza is one of the few platforms in the world that can do this effortlessly and productively. Anyone now exposed to this (like a common RDBMS developer) would claim that ELT is unworkable on any database. Well, perhaps on your database, but not on Netezza.

 

Of course, we cannot allow arbitrary SQL to enter the flow can we? But does this mean arbitrarily submitted, or arbitrarily formed? You see, if the SQL has a consistent form, we have all the freedom we need for arbitrary submission. I suppose it's the "degree" of arbitrary we're looking for no?

 

Case in point, a vendor tells me "Oh, you can send any arbitrary query to the machine and.." Of course he means anything, absolutely anything our developers choose to concoct of their own cleverness or ignorance, or anywhere in between. In fact, cleverness is perhaps the least of our worries since the clever queries often follow some degree of structured discipline. It's the queries produced by the newbies that pose the most danger. It's hard to remediate a lot of work that took place "in the wild", so to speak. Of course, for those who like their wild-game-SQL or pehaps free-range-SQL, I will not disparage their personal tastes. But this is enterprise computing after all, and many of my personal tastes are already checked at the door.

 

And as I've noted in the past, metadata is where it all happens anyhow so let's take a short dive into the future of ELT and what I suspect is the next wave of stuff we'll see.

 

1 - insert/select queries, while they are the ultimate bread-and-butter of ELT, their hard-wired form should not rule the roost. One need not fear being enslaved by the queries, but should instead harness them. In fact, the queries should never completely exist in one place, but be manufactured at run time. Now, our BI tools already do this, right? We post a request into our favorite BI tool and it leverages its metadata repository to form-the-perfect-query. That's where we need to be. Provide a template and some steering logic, but the query itself is manufactured on-the-fly.

 

2 - Automate everything, from DDL to physical model, from intake to reposit, and it should be a fairly simple approach. When we takea look at the patterns of what we're trying to accomplish, we are essentially punting around metadata (tables, columns, data types, rules) and these are the object-oriented assets shaping the solution-at-hand. We should not be in the businss of crafting SQL statements (let a product do that for us) but in the business of shaping logic. The ETL tools purport to do this but cannot scale (for set-based operations) as effortlessly as Netezza can. So in some ways, the ELT aficionado is stuck in a bubble waitng for the technologies to mature.

 

3 - Think in terms of metadata, factories and steering logic. Okay, what does that mean? Factories mean components that take raw parts and manufacture them into SQL statements and other assets. This allows us to drive practically everything from knowledge of the Netezza catalog. Source table to target table? All we need are the rules in the middle and we're done, right? So if we can make simple lists of rules that are tied to particular table/colums as targets, we are now in the zone to leverage a factory to manufacture the proper insert/select statement to send to the machine. We don't really need to have the predefied insert/select, only a list of target columns tied to rules. The rules then become our source map, and the target columns become our fodder for manufacturing the insert-clause. Bolt a filter / where-clause on the end of it and yes, we've have valid SQL. But wait, what about steering logic? As my factory is manufacturing the query, I can make all kinds of decisions about the table and columns as I encounter them. For example, if I am making an extraction query to face a database, as I encounter each column in the catalog, I can compare this to yet another repository of information that can tell me, does this column use a different name on the extraction source than the target? Perhaps I need to exclude it from the query altogether? Perhaps I need to  alter the value of a given column to a default (data masking) or I may need to bolt-on columns to the end of the structure. In short, the steering logic is a simple set of exports that I can tie by-name to the source and table name so that it is surgically applied as necessary. This means every extraction is now programmable because we've built the programmability into the factory - the switches and signals serve as steering logic as the factory weaves the perfect query.

 

What is BI anyhow? Some companies who do all-data-warehousing have jumped into BI thinking it's just the icing on the cake, so to speak. Others embrace BI as their primary focus but then drive into a pit of miry clay when they have to access data untouched by the scrubbing bubbles of an enterprise-class data warehouse. So which is it? Data warehousing and BI? Does BI encompass data warehousing? I've had this conversation with so many people who cannot seem to put their finger on the definition. One argued passionately that we should simple consider data warehousing and BI to be one and the same, such that when we talk about BI, we're talking about both. Then not five minutes later he claimed that Microsoft Excel was the most widely used BI tool in the world, to which I asked wryly, "since when did Excel start doing data warehousing and ELT" - he sort of stared at me, not knowing why I had asked the question, forgetting his prior claims as though second thoughts entirely. My simple assertion is that the open marketplace calls them out as different - BI is not data warehousing and likely never will be.

 

Cached stuff - Like any system, catalog-hits can become cumbersome and the factories leveraging them can begin to take protracted lifecycles for building out the queries, some of which can be very complex. However, if they are based on static metadata (like that in the catalog) it is easier to cache the manufactured assets (less any table/column-centric adjustments) and avoid re-manufacturing these assets each and every time we need them. In one case, manufacturing the assets for each extract/load took about five minutes to complete when this metadata had to be manufactured from scratch each time. After caching, however the entire duration became less than a minute. Many ELT-oriented tools actually do consume the catalog but unfortunately  do not consider this aspect of data processing - that the metadata itself - when manufactured from static assets, becomes a repeatable product. A perfect candidate for a cache.

 

Transform metadata - talk about "all the rage". Hey, people want ELT, but they don't want to risk losing track of how to trace data lineage through the multiple stages of SQL statements. After all, by the time the data actually arrives in our BI tool, we might see the data undergo hundreds of insert/select queries to arrive at the final data outcome. What if an analyst wants to perform track-back lineage of the data, or perhaps even do an impact analyis of the environment? These are impossible without metadata-driven artifacts, rather than allowing SQL guys to manufacture any arbitrary query on-the-fly.  What does this look like? Well, think in terms of source columns, rules applied to those columns and their outome sent to target columns. Then we need a filter clause (join, where, filter, group) that is easily identified. Of all this, we need to capture rules, participating tables, participating columns (both in the source to target map and in the filter clause). It's not really all that hard to capture it if the developers have followed certain rules.

 

Rules-based asset manufacturing - (a spinoff of one of the above) is simply providing a structured template where developers can formulate and test a single query for accuracy, or multiple dependent queries. But in either case, the metadata in the catalog can provide us with the the raw structures -  columns, tables, etc and we can then formulate templates from these. We then hand off this template to a developer and they fill in the details. They can't really get themselves into trouble because if the developer tries to do something illegal or ignorant, it's all part of the game to us. Our environment requires the developer to adapt - or the query simply won't run. I was on a site where we had only a handful of basic programming rules. Easy enough - like falling off a log. But rules will not protect us from someone who is willing to break them. Here we have templates that provide the balance between rigidity and infinite flexibility. Make the template work and you are by definition in-compliance.

 

Without elaborating into infinite detail, the realm of ELT is still maturing. ETL tools jockey for position in push-down approaches, but know that if they are really good at this, they are essentially working themselves out of a job. If for no better reason than they have effectively harnesses a Netezza machine to do all of the heavy-lifting their own hard-won capabilities currently do. And if so, the irony is that any one of their smaller competitors stands to gain much more in this marketing space because they have less to lose. After all, we don't buy a power-hitter ETL tool and relegate it to data-transport/transfer activities (it's new role alongside the Netezza machine) but a smaller player would love to fill this spot, grow it, mature it and deliver it as a packaged method to create fire-from-fingertips.

0

Many years ago someone impressed upon me the need for simplicity in matters of scale. The problem of course, is that simplicity is impossible without power. This is why we see secondhand RDBMS environments proliferate their complexity into a functionally catatonic state, ultimately calling for its wholesale replacement. And upon the call, others swoop in to save the day, saying that they can "replicate your functionality" in another stronger environment that is geared for high complexity, without even once looking askance at the complexity's necessity. By that I mean, the complexity arrived as a function of an underpowered environment, with sweat-labor from engineers working diligently to prop up a fading machine. It means - all that addtional power-propping, is artificial and we need to regard it as a necessary evil. If we'd had the power way back when, the complexity never would have arisen in the first place. So the complexity is a symptom, not an attribute.

 

If we have the power, it gives us the capability to implement functionally sophisticated solutions with ease of maintenance and operation. Functional sophistication is key, because we don't want to dumb-down the functionality just because we don't have enough hardware. Yet the people responsibile for buying us more hardware look at us warily, wondering "You know, I signed off on the last hardware purchase thinking it would be my last hardware purchase. Yet now we are already out of gas and it seems like we didn't get the return-on-investment for the last purchase."  Ahh, but wait, says the engineer, the systems are voracious and so are the users. Adding more hardware is the only way to stay ahead of them. To which the purchaser objects "How do I know that you are making the most efficient use of the hardware you already have? Can you tell me that you haven't tried to optimize the environment?"  To which the engineer skulks away, formulates a plan and spends the next six months wrapping the solution in an engineered cocoon of complexity that offers marginal boost, but a boost nevertheless. The purchaser feels vindicated. "I'll stand my ground next time, and they'll have to go through another optimization! Aha! The key is to get these deadbeat engineers to do their job, and engineer!"

 

And so at the end of this bitter and tumultuous cycle, we have an over-engineered, underpowered machine that nobody is happy with except for the purchaser. who held out until the very end. Often the purchaser is overridden by a super-purchaser, like the CTO or CEO, who finally releases the funds, offers the directives, and the purchaser dutifully though reluctantly complies, certain in his heart that the engineers have at least one more optimization cycle left in them. If you've never been the recipient or the participant in such a repeatable and pervasive cycle, what a blessing to be in the food service or housekeeping industries in these perilous times!

 

In one case, I told a senior leader point-blank that the reason his secondhand RDBMS system was running out of gas, was the proliferation of cursor-based stored procedures draining the lifeblood from the box. He was stunned at this assertion, because he'd been assured by his implementers that this was the right thing to do. The implementers in the room objected by asking if I was "actually suggesting" that they pull the data into a third-party tool, process it, and put it back. Such phrases as "are you telling me", "seriously"  and "everybody knows that" were the commom prefixes of all their objections. Oookay-fine. This doesn't detract one iota from the simple fact that the secondhand RDBMS doesn't process bulk data well, and it doesn't really support third-party environments that require it to (you know, bulk loading and extract). The whole bulk-loading and extract domain is something that secondhand RDBMS systems provide their own utility for, and dogpile one caveat after another against its regular use for operational, lights-out data processing. This is because, as Rule #10 tells us, secondhand RDBMS systems are lousy at row-by-row processing. This has always been true.

 

And think from the perspective of a newbie. Imagine being introduced now to a Netezza machine where it has been specifically purpose-built to support all the things that the secondhand RDBMS's treat as - well - secondhand? Our newbies come to their cubicle with their hard-won tales of derring-do and the many dragons they have captured and thrown down. They drag their canvas bags filled with dragon-fighting gear, empty it on the floor and we marvel at all the tools we once used to be effective. Ahh, the aroma of the dragon's blood wafts from the antiquated instruments of war, reminding us of days gone by, and all that bygone stuff, too.  It's hard to immediately schedule their pickup for the Warehousing Museum on the 7th floor, where all of our stuff is on display and gathering dust. Instead we scratch the Museum's number on the back of a business card and hand it to the warrior for later. Still, all of those weapons and their attendant experience are very valuable, but not in the way one might think.

 

What's a seasoned warrior to do when the machine can devour a data warehouse dragon like a tree shredder, digest it, repackage it and dungeon-ize it for all eternity, and do it as an appliance? Almost like pressing the button on a toaster, but not quite. Our hero rolls up his sleeves and starts slinging row-by-row, cursor-based stuff so prevalent in secondhand implementations. Upon first execution, it runs worse than a dog. It runs like a wet dog. And doesn't smell any better either. The hero scratches his head and tries again. Everything the hero knows about data processing just went "counter-intuitive". The hero mutters "I don't get it". Ahh, but the hero will get it, because the hero is emotionally engaged.

 

This emotional engagement is something that we, as seasoned Netezza folk, should leverage to help our newbies make the necessary shift in their approach. They really need to take ownership of the knowledge, but sometimes (I've been told) it feels a lot like pushing a string (for the seasoned people) and a lot like playing some kind of strange board-game for the newbie. To avoid mental pain and injury, and heavy-lifting on the part of the seasoned people (and the attendant frustration from helping a newbie blossom) the seasoned pro only need to remember the power of the emotional engagement. The newbie will want to conquer the machine. Harness it for the good of all mankind. They will brandish their blades with the familiar shhhhiiiiiinnnngg as it leaves the sheath. But there's only one hero in the room. The big black box. The warrior's emotions are drawing him/her inexorably closer to the final conclusion - the machine works for us. Really works for us. Unlike the secondhand RDBMS machines that once enslaved us. Now we are the master. The new hero does our bidding, and does it well.

 

So our warrior, after many days of working with the machine, finds himself no longer using his old gear. No longer using his old ways. In fact, now stumbling over the gear and getting his feet tangled in the pull-strings on the canvas bag. One day he will pull out the number we gave him for the Data Warehouse Museum on the 7th floor and call for a pickup. On that day, take the warrior to lunch. The transition is complete.

 

What has the warrior embraced - that complexity is no longer the key to winning. We don't have to "do it the old way" and we don't have to figure out a way to shoehorn our former implementations into the new domain. Those implementations were necessary evils, using technology that wasn't geared to support them, rigged for an outcome on an underpowered and overwhelmed environment. When moving to the new environment (whether it's a newbie learning the ropes or a large-scale migration of a secondhand RDBMS into the big black box) we don't take the prior implementation with us. We don't take the prior table structures, cursor-based processes, or anything else about the original implementations. If artificial complexity means that even part of the implementation is suspect, then all of the implementation is suspect. We have a new way of doing things, a new machine to do them on, and the outcome will be so much better if we now do things the way the machine best performs, rather than doing things the way some other secondhand machine performs poorly.

 

And this is the difference between complexity and simplicity. In an underpowered secondhand RDBMS, the complexity props up the weakness of the technology and is a symptom of weakness in every way imaginable. Simplicity on the other hand, is the mark of strength of the technology and its ease of implementation and maintenance.

 

For a (human) hero, complexity is the reason for existence and simplicity is for the simple-minded peasant. After all, the peasants do the (back labor) work of the field and save the thought labor for the feudal lord and his lackeys. But one can see the parallel immediately - the secondhand technology is actually a feudal lord, its high-priced product engineers are its lackeys, and we, my friend are its indentured servants. Quite the converse for Netezza folk - we are the feudal lord, requiring no lackeys, and the machine is our indentured servant. The machine works for us.

 

I cannot count how many times I've been approached (even in the hallway of a Netezza conference!) of people asking me about re-engineering their existing systems for Netezza. No, my friend, we don't re-engineer, we de-engineer. There's a huge difference.

 

What does simplification buy us? We can now move away from the raw complexity required to prop up a lack of performance, and move toward the sophistication required of a competitive solution. The sophistication is key, because anyone can build a simple system for simple business reasons. But when the complexity of the environment hinders us from the next level of functional sophistication, the complexity has now enslaved our business model, and effectively the business itself. Simplified implementations are stable and adaptable, so can scale to breathtaking heights, giving us the  necessary edge for competitive, sophisticated offerings that aren't even possible in the secondhand technologies.