
Gather 'round the Grill

2 Posts tagged with the engineering tag

Many years ago someone impressed upon me the need for simplicity in matters of scale. The problem, of course, is that simplicity is impossible without power. This is why we see secondhand RDBMS environments proliferate their complexity into a functionally catatonic state, ultimately calling for their wholesale replacement. And upon the call, others swoop in to save the day, saying that they can "replicate your functionality" in another, stronger environment that is geared for high complexity, without even once looking askance at the complexity's necessity. By that I mean, the complexity arrived as a function of an underpowered environment, with sweat-labor from engineers working diligently to prop up a fading machine. It means that all that additional power-propping is artificial, and we need to regard it as a necessary evil. If we'd had the power way back when, the complexity never would have arisen in the first place. So the complexity is a symptom, not an attribute.

 

If we have the power, it gives us the capability to implement functionally sophisticated solutions with ease of maintenance and operation. Functional sophistication is key, because we don't want to dumb-down the functionality just because we don't have enough hardware. Yet the people responsible for buying us more hardware look at us warily, wondering "You know, I signed off on the last hardware purchase thinking it would be my last hardware purchase. Yet now we are already out of gas and it seems like we didn't get the return-on-investment for the last purchase." Ahh, but wait, says the engineer, the systems are voracious and so are the users. Adding more hardware is the only way to stay ahead of them. To which the purchaser objects "How do I know that you are making the most efficient use of the hardware you already have? Can you tell me that you haven't tried to optimize the environment?" At which the engineer skulks away, formulates a plan and spends the next six months wrapping the solution in an engineered cocoon of complexity that offers marginal boost, but a boost nevertheless. The purchaser feels vindicated. "I'll stand my ground next time, and they'll have to go through another optimization! Aha! The key is to get these deadbeat engineers to do their job, and engineer!"

 

And so at the end of this bitter and tumultuous cycle, we have an over-engineered, underpowered machine that nobody is happy with except for the purchaser, who held out until the very end. Often the purchaser is overridden by a super-purchaser, like the CTO or CEO, who finally releases the funds and offers the directives, and the purchaser dutifully though reluctantly complies, certain in his heart that the engineers have at least one more optimization cycle left in them. If you've never been the recipient or the participant in such a repeatable and pervasive cycle, what a blessing to be in the food service or housekeeping industries in these perilous times!

 

In one case, I told a senior leader point-blank that the reason his secondhand RDBMS system was running out of gas was the proliferation of cursor-based stored procedures draining the lifeblood from the box. He was stunned at this assertion, because he'd been assured by his implementers that this was the right thing to do. The implementers in the room objected by asking if I was "actually suggesting" that they pull the data into a third-party tool, process it, and put it back. Such phrases as "are you telling me", "seriously" and "everybody knows that" were the common prefixes of all their objections. Oookay-fine. This doesn't detract one iota from the simple fact that the secondhand RDBMS doesn't process bulk data well, and it doesn't really support third-party environments that require it to (you know, bulk loading and extract). The whole bulk-loading and extract domain is something that secondhand RDBMS systems provide their own utility for, and then dogpile one caveat after another against its regular use for operational, lights-out data processing. This is because, as Rule #10 tells us, secondhand RDBMS systems are lousy at row-by-row processing. This has always been true.
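For readers who have not lived through the cursor-versus-set argument, here is a minimal sketch of the two styles. It uses Python's built-in sqlite3 module purely as a stand-in (it is not Netezza and not the RDBMS from the story), and the `sales` table, its columns and the row count are all hypothetical. The point is only that both approaches land on the same answer, while one issues a statement per row and the other issues a single statement.

```python
import sqlite3

def make_db(n_rows=1000):
    """Build a throwaway in-memory table; names and sizes are illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, qty INTEGER, "
        "unit_cents INTEGER, total_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sales (id, qty, unit_cents) VALUES (?, ?, ?)",
        [(i, i % 7 + 1, 199) for i in range(1, n_rows + 1)],
    )
    return conn

def row_by_row(conn):
    """Cursor style: pull every row out, compute in the client, write it back."""
    rows = conn.execute("SELECT id, qty, unit_cents FROM sales").fetchall()
    for row_id, qty, unit_cents in rows:
        conn.execute(
            "UPDATE sales SET total_cents = ? WHERE id = ?",
            (qty * unit_cents, row_id),
        )
    return len(rows)  # number of UPDATE statements issued

def set_based(conn):
    """Set style: one statement, and the engine touches every row itself."""
    conn.execute("UPDATE sales SET total_cents = qty * unit_cents")
    return 1  # number of UPDATE statements issued

def grand_total(conn):
    """Sum of the computed column, used to compare the two approaches."""
    return conn.execute("SELECT SUM(total_cents) FROM sales").fetchone()[0]

a, b = make_db(), make_db()
print(row_by_row(a), "statements versus", set_based(b))
print("same result:", grand_total(a) == grand_total(b))
```

The statement count is the thing to watch. On a machine built for bulk, set-oriented work, the single statement is the shape the hardware wants to see.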

 

And think from the perspective of a newbie. Imagine being introduced now to a Netezza machine that has been purpose-built to support all the things that the secondhand RDBMSs treat as - well - secondhand. Our newbies come to their cubicle with their hard-won tales of derring-do and the many dragons they have captured and thrown down. They drag their canvas bags filled with dragon-fighting gear, empty it on the floor and we marvel at all the tools we once used to be effective. Ahh, the aroma of the dragon's blood wafts from the antiquated instruments of war, reminding us of days gone by, and all that bygone stuff, too. It's hard to immediately schedule their pickup for the Data Warehouse Museum on the 7th floor, where all of our stuff is on display and gathering dust. Instead we scratch the Museum's number on the back of a business card and hand it to the warrior for later. Still, all of those weapons and their attendant experience are very valuable, but not in the way one might think.

 

What's a seasoned warrior to do when the machine can devour a data warehouse dragon like a tree shredder, digest it, repackage it and dungeon-ize it for all eternity, and do it as an appliance? Almost like pressing the button on a toaster, but not quite. Our hero rolls up his sleeves and starts slinging row-by-row, cursor-based stuff so prevalent in secondhand implementations. Upon first execution, it runs worse than a dog. It runs like a wet dog. And doesn't smell any better either. The hero scratches his head and tries again. Everything the hero knows about data processing just went "counter-intuitive". The hero mutters "I don't get it". Ahh, but the hero will get it, because the hero is emotionally engaged.

 

This emotional engagement is something that we, as seasoned Netezza folk, should leverage to help our newbies make the necessary shift in their approach. They really need to take ownership of the knowledge, but sometimes (I've been told) it feels a lot like pushing a string (for the seasoned people) and a lot like playing some kind of strange board-game (for the newbie). To avoid mental pain and injury, and heavy-lifting on the part of the seasoned people (and the attendant frustration from helping a newbie blossom), the seasoned pro needs only to remember the power of the emotional engagement. The newbie will want to conquer the machine. Harness it for the good of all mankind. They will brandish their blades with the familiar shhhhiiiiiinnnngg as it leaves the sheath. But there's only one hero in the room. The big black box. The warrior's emotions are drawing him/her inexorably closer to the final conclusion - the machine works for us. Really works for us. Unlike the secondhand RDBMS machines that once enslaved us. Now we are the master. The new hero does our bidding, and does it well.

 

So our warrior, after many days of working with the machine, finds himself no longer using his old gear. No longer using his old ways. In fact, now stumbling over the gear and getting his feet tangled in the pull-strings on the canvas bag. One day he will pull out the number we gave him for the Data Warehouse Museum on the 7th floor and call for a pickup. On that day, take the warrior to lunch. The transition is complete.

 

What has the warrior embraced? That complexity is no longer the key to winning. We don't have to "do it the old way" and we don't have to figure out a way to shoehorn our former implementations into the new domain. Those implementations were necessary evils, using technology that wasn't geared to support them, rigged for an outcome on an underpowered and overwhelmed environment. When moving to the new environment (whether it's a newbie learning the ropes or a large-scale migration of a secondhand RDBMS into the big black box) we don't take the prior implementation with us. We don't take the prior table structures, cursor-based processes, or anything else about the original implementations. If artificial complexity means that even part of the implementation is suspect, then all of the implementation is suspect. We have a new way of doing things, a new machine to do them on, and the outcome will be so much better if we now do things the way the machine best performs, rather than doing things the way some other secondhand machine performs poorly.

 

And this is the difference between complexity and simplicity. In an underpowered secondhand RDBMS, the complexity props up the weakness of the technology and is a symptom of weakness in every way imaginable. Simplicity on the other hand, is the mark of strength of the technology and its ease of implementation and maintenance.

 

For a (human) hero, complexity is the reason for existence and simplicity is for the simple-minded peasant. After all, the peasants do the (back labor) work of the field and save the thought labor for the feudal lord and his lackeys. But one can see the parallel immediately - the secondhand technology is actually a feudal lord, its high-priced product engineers are its lackeys, and we, my friend, are its indentured servants. Quite the converse for Netezza folk - we are the feudal lord, requiring no lackeys, and the machine is our indentured servant. The machine works for us.

 

I cannot count how many times I've been approached (even in the hallway of a Netezza conference!) by people asking me about re-engineering their existing systems for Netezza. No, my friend, we don't re-engineer, we de-engineer. There's a huge difference.

 

What does simplification buy us? We can now move away from the raw complexity required to prop up a lack of performance, and move toward the sophistication required of a competitive solution. The sophistication is key, because anyone can build a simple system for simple business reasons. But when the complexity of the environment hinders us from the next level of functional sophistication, the complexity has now enslaved our business model, and effectively the business itself. Simplified implementations are stable and adaptable, so they can scale to breathtaking heights, giving us the necessary edge for competitive, sophisticated offerings that aren't even possible in the secondhand technologies.


Famous words, or some such like, uttered by Orson Welles as he launched into a scary parody of alien terror on national radio. Really scary for some. And proffered on Halloween eve in 1938, so dare I say, 'tis the season (almost).

 

Ahh, not to fear, this purports to be a painless foray. But I do have a story to tell.

 

Several projects ago (I always start this way, so you won't think I'm talking about you!) - I worked with some really sharp data engineers on boiling out a solution for retail operational reporting. The data arrived every five minutes or more, or less, and sometimes in parallel loads, with 24x7 regularity. More and more Netezza implementations are going this way, and you too, should look into processing data at the speed of thought. In any case, the reporting users wanted to plumb the depths of this data store, to the tune of eighty billion records and growing. (Okay, small I know (for some of you) but humor me).

 

Well and good, except rather late in the game, the reporting users spontaneously expressed a desire to review the detail through a metadata-based "lens", that is, set up some drilling levels and other metadata-based entry points, such that the entire operational model would be seen through this reporting "lens" and it would provide all the context for the consumers.

 

Now, such a model as described, would require such enormous power from a standard SMP/RDBMS-styled system, that we might well cause structural damage on the raised floor for sheer physical weight of said system. That is, if we really expected a report to return within a day or two of the request. Ahem! as I facetiously clear my literary throat.

 

But the worst-case for any given query for the above was around 8 minutes, and over 99 percent of the thousands of queries submitted, returned in less than 30 seconds. Oh, yeah, it was smokin' hot. In most queries using zone maps and the like, we saw returns in mere multiple seconds. Pshaw! Says the tick-tock-man, chocolate and vanilla, don't waste my time.

 

However (and there's always a catch) many of the larger reports were actually conglomerations of these smaller queries, and their aggregate time would occasionally exceed ten minutes or more. And even though this was a far cry from the "days away" we would expect from an SMP/RDBMS system, it was still 'too slow' for the users. Now, this is true adrenalin-junkie stuff, sort of like the old Far-Side cartoon of a young man standing with a fork in front of a waffle iron, captioned "Wendell Zurkowitz, slave to the waffle light". I recall how one man noted that many years ago we would wait hour(s) for a traditional oven to finish cooking, and now get impatient when the microwave instructions are greater than five minutes.

 

Perspective.

 

And rather than punt to the users and say, "Hey guys, this is just unrealistic" and degenerate into "expectation management" - the challenge was to actually achieve faster turnaround times on the reports. And here, I'm talking about getting these ten-minute reports into the 30-second zone. Would we have to embrace some extreme engineering for this feat? Methinks not - but the form of the process to get there was quite instructive.

 

Now recall I noted that the above model had operational tables, which were to be the detailed source, and a retail reporting hierarchy that was largely metadata-based. This reporting hierarchy had some significant size as well, perhaps a fourth the size of the eighty-billion-record fact table it had to link into. Yet both of these were on separate distribution keys. Querying one meant broadcasting another.

 

And now, for broadcasting.

 

Whenever two tables are distributed on different keys, a join between them cannot be initially co-located. To support the co-location, Netezza will broadcast the salient information from one table's context to the other. This means the physical data has to move from its home SPU, out onto the inter-SPU network fabric, and find its way to the target SPU where it will be further examined. Broadcasting for small tables is inconsequential and barely a blink on the radar. For larger tables it can have strange effects. For example, we saw one query return consistently in ten seconds. Yet when running side-by-side with itself (multiple users) it could take several times longer.
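To make the co-location idea concrete, here is a toy model in Python. Everything in it is an assumption for illustration: four "SPUs", a plain modulo standing in for the real distribution hash, made-up `store_id` and `node_id` columns, and a deliberately crude broadcast cost (every row copied to every other SPU). It is a sketch of the principle, not a description of Netezza internals.

```python
def distribute(rows, key, n_spus):
    """Place each row on a SPU by its distribution key (modulo as a stand-in hash)."""
    slices = [[] for _ in range(n_spus)]
    for row in rows:
        slices[row[key] % n_spus].append(row)
    return slices

def colocated(fact_slices, dim_slices, join_key, n_spus):
    """True when every row of both tables already sits on the SPU its join key maps to."""
    for spu in range(n_spus):
        for row in fact_slices[spu] + dim_slices[spu]:
            if row[join_key] % n_spus != spu:
                return False
    return True

def broadcast_row_copies(dim_slices, n_spus):
    """Crude cost model: every row is shipped to each of the other SPUs."""
    return sum(len(s) for s in dim_slices) * (n_spus - 1)

N_SPUS = 4  # hypothetical, tiny on purpose
fact = [{"txn_id": t, "store_id": t % 8} for t in range(32)]
hierarchy = [{"node_id": s + 1, "store_id": s} for s in range(8)]

fact_by_store = distribute(fact, "store_id", N_SPUS)
hier_by_node = distribute(hierarchy, "node_id", N_SPUS)    # different key
hier_by_store = distribute(hierarchy, "store_id", N_SPUS)  # same key as the fact

print(colocated(fact_by_store, hier_by_node, "store_id", N_SPUS))
print(broadcast_row_copies(hier_by_node, N_SPUS))
print(colocated(fact_by_store, hier_by_store, "store_id", N_SPUS))
```

With the hierarchy distributed on its own key, the join cannot happen in place and rows have to travel. Distributed on the fact table's key, every match is already sitting on the same SPU and nothing needs to move.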


The reason is that both queries were competing for bandwidth on the inter-SPU fabric, among other things. The simplest solution, of course, is to get our metadata table distributed on the same key as the operational tables. The problem was simply in the complexity of this metadata table and how it mapped to the core information. "Blowing it out" into a materialized form of information would require significant planning and design, because a misstep could easily make the reports turn out wrong, and this was unthinkable. In all this, the maintainability had to be considered, because if our initial complexity is too high, the maintainability is in jeopardy - by design.

 

Of course, we would spend most of our time in testing this scenario. Coding and implementation in most BI shops is a nit compared to the testing we have to execute to validate the outcome. Netezza is no different, except we can close the testing loop sooner if we have more power. And of course, for something of this magnitude, to test the change from minutes to seconds, we would need a powerful machine to measure the difference. Whenever we ran the new solution on a smaller machine, the difference couldn't even be measured. No, the power of the machine makes the testable difference visible and measurable.

 

As I noted, the form of this exercise was the most instructive part. Rather than form a means to align these two tables for co-located joins, the first effort was in attempting to tune the queries. You know, "query engineering", which is the mainstay of performance engineering on an SMP/RDBMS platform, and old habits are hard to break. The data engineers were somehow in denial that they would receive extraordinary power from configuring the data. Rather they trusted their instincts and chose to attack the queries.

 

Now, in any platform, regardless of shape, size or vendor, power is always and forever the domain of hardware. Software cannot manufacture more CPUs or network speed. If the physical plant is not ready, the software can only use what it has at its disposal. The software itself is largely a cost center, because it can only drain the machine's energy through inefficiency. In an SMP/RDBMS machine, the only option we have is to engineer the queries, because the physical plant is configured to be general purpose.

 

In a purpose-built machine, however, the query is simply a controlling mechanism for Netezza's resources. The host will chop it apart into snippets and dispatch these to the components that they will serve. Extreme query engineering, on the other hand, assumes that jockeying around with the query can actually affect our fate. (By contrast, repairing a poorly written query is a different matter from extreme engineering of a well-written one.) And besides, do we really want to spend our time carefully engineering the query to the point of functional brittleness? In an SMP/RDBMS machine we will see queries that extend for tens of pages in a very daunting complexity. Maintaining these is a full-time job for our consultants. They swarm on the machine, and carefully tune their handiwork to avoid breakage.

 

Yet, we purchased a Netezza machine to get away from this complexity. To reduce, clarify and simplify our administration and consumption of the data. So as I watched these engineers bat themselves against the problem, no differently than a fly batting against a window, I watched them pull out their hair in generous tufts when little they did offered the significant gains they expected. This outcome was entirely counter-intuitive to their training. They were accustomed to using and tuning software to make things work faster.


Sweeping the hair from the floor one evening, I mentioned (for the x-teenth time) that the broadcast effect was killing them. Once our engineers grasped the broadcasting problem, I thought we would make headway, but things actually got worse. They started trying to control the broadcast as the root cause rather than the symptom. In one test, I saw one of the largest tables leap into a broadcast and we just killed the query outright (it would probably still be running, even today). The engineers lamented: How do we make sure the larger table doesn't broadcast? How do we control the broadcasting to our benefit? Answers exist to all of these, but it's like talking to a drug addict, one who is addicted to the drug of SMP/RDBMS and claims he can 'quit anytime'.

 

And then the truth came out, "David, if we can make this 10100 machine process data like a 10400 machine, we'll look like heroes!" To which I ask "How?" to which the response is: "We can save them all that money they would have spent on the hardware..." Well, not really. You've just chosen something else to spend the money on, namely performance engineering, the cost of time-to-market, the cost of a marginal implementation and the cost of human labor (the most expensive asset you have, by the way). But since the only way to get a 10100 to perform like a 10400 is to actually be a 10400, well, you see the futility. 432 SPUs versus 108 SPUs? And they really, truly thought they could - I mean - seriously. Let's keep in mind that the opposite is true. If we can't make the 10100 process data like a 10400, perhaps our approach is flawed? Heroes or goats. Take your pick. In my estimation, there's only one hero in the room. The big black box.
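A little back-of-the-envelope arithmetic shows why. The sketch below assumes perfectly even distribution and borrows the eighty-billion-row figure from earlier in this post purely as a workload size (the post does not say which machine that table lived on). The SPU counts are the ones quoted above.

```python
def rows_per_spu(total_rows, n_spus):
    """Average rows each SPU must scan if the table is spread evenly."""
    return total_rows / n_spus

TOTAL_ROWS = 80_000_000_000  # workload size assumed for illustration
SPUS_10100 = 108
SPUS_10400 = 432

small = rows_per_spu(TOTAL_ROWS, SPUS_10100)
large = rows_per_spu(TOTAL_ROWS, SPUS_10400)

print(f"10100: {small:,.0f} rows per SPU")
print(f"10400: {large:,.0f} rows per SPU")
print(f"each 10100 SPU carries {small / large:.1f}x the load")
```

No amount of clever SQL changes that divisor. The only way to cut the per-SPU share is to add SPUs.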

 

So the broadcast is the symptom, not the root cause. How about, we quit broadcasting, cold turkey? Take the data model through a detox program and the engineers through a series of deprogramming seminars to - well - it's not that bad. Typically the average engineer only has to see it operate in an adverse manner to become a believer. But a believer they must be, or they will not take action to correct the problem, correctly.

 

So one of them finally decided to produce a map table, one that would map the metadata into the operational tables such that all core joins would become co-located, with a common distribution. And lo, the first test of this blew their minds. Even the complex reports were now coming back in single-digit times, and the reports that had been running ten minutes or longer were now under a minute, even with multiple users. In fact, they saw the performance and scalability practically handed to them - simply because they configured the data correctly. It had little to do with query engineering.
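The post does not show what that map table looked like, so what follows is only a guess at the general shape, sketched in Python with a hypothetical three-level hierarchy (store, district, region). The idea is to flatten the hierarchy so that every row carries the operational key (here `store`), which lets the map table share the fact table's distribution and lets a rollup at any level join on that key alone.

```python
# Hypothetical reporting hierarchy: child -> parent (None marks the root).
PARENT = {
    "store_1": "district_A",
    "store_2": "district_A",
    "store_3": "district_B",
    "district_A": "region_X",
    "district_B": "region_X",
    "region_X": None,
}

def build_map_table(parent, leaf_keys):
    """Flatten the hierarchy into (store, level, node) rows keyed by the leaf store."""
    rows = []
    for leaf in leaf_keys:
        level, node = 0, leaf
        while node is not None:
            rows.append({"store": leaf, "level": level, "node": node})
            node = parent[node]
            level += 1
    return rows

def rollup(facts, map_rows, level):
    """Sum fact amounts per hierarchy node at one level, joining on store only."""
    node_for_store = {r["store"]: r["node"] for r in map_rows if r["level"] == level}
    totals = {}
    for store, amount in facts:
        node = node_for_store[store]
        totals[node] = totals.get(node, 0) + amount
    return totals

stores = ["store_1", "store_2", "store_3"]
map_table = build_map_table(PARENT, stores)
facts = [("store_1", 10), ("store_2", 5), ("store_3", 7)]  # made-up amounts

print(len(map_table), "map rows")
print(rollup(facts, map_table, level=1))
print(rollup(facts, map_table, level=2))
```

The trade is extra rows in the map table in exchange for joins that never leave the SPU. As noted earlier, getting the flattening right is where the care has to go, because a misstep here makes the reports wrong.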

 

Now one may ask the obvious question, and please do so now: Why don't you just build out some user-facing tables and forget leveraging the operational tables? After all, we don't build our non-Netezza reporting systems on top of operational data, do we? We build-out dimensional models and other handy structures to positively affect the user experience and simplify the flow (and the maintenance). This functional decoupling is a mainstay of reporting environments. (Okay, the next entry will focus on this). But in this case, suffice to say that the owner of the machine had placed down a hard-mandate on disk utilization. At no time could we foray into replicated detail, or even summary of detail without a plan to access the operational detail on a drill-down and the like. Interestingly, the required reporting tables would have only cost mere fractions of the cost (on disk) of the time/labor and effort put into making the operational tables viable. This is why it deserves its own treatment in a separate rant - er - essay. Stay tuned, and don't touch that radio dial.


Back to the drama - A telltale symptom that we're doing something wrong is when we start down the engineering path. It's an appliance. We don't engineer toasters, blenders or laundry machines. But the difference here seems to be subtle. It's not. In this case, the culprit was the broadcast, something to be eliminated rather than managed. And no amount of creative query hoop-jumping would overcome this. Get the joins onto the SPUs. It seems obvious to those who have been around the machine for a bit. But for those who have not, the learning curve is upon them. Be patient with them for as long as it takes to get it right. Once we have a believer, we'll never have the conversation again. As long as we stay in a theoretical zone, however, expect them to stay in the spin cycle. This is like many things scientific. Seeing is believing.

 

Whenever I (and others like me) observe a ritual of performance engineering, each participant holding out the hope that "just one thing" will offer stratospheric boost so they can all wipe their foreheads and go home - this is the surest sign of one of two things: Either the data is poorly configured and is causing the queries to be inefficient, or the data is properly configured and the machine does not have enough physics to achieve the goal. If the focus is on query engineering, they are wasting time. If the focus is on data engineering, at some point it will reach a "diminishing return". Either the machine has the power or it doesn't. Time to switch to Netezza, or if using Netezza, time to add some physics (a frame or two) to make it happen.


Moral of the story: Performance is found in the physics, not the carefully engineered queries. If we find ourselves "engineering" our queries for performance reasons - we should take a step back, take a deep breath - click our heels together and say softly: "There's no power like SPU power. There's no power like SPU power." Repeat as necessary.

 

And pay no attention to the man behind the curtain. I'll bet he and Orson Welles never even met.
