[ www.netezza.com ]
0

One of the questions oft-asked in best-practices sessions and in general consulting: How do we get a "newbie" on-boarded quickly? Some concern usually arises when the new Enzee approaches the Netezza machine with the same thinking processes as with a traditional RDBMS. While there are "gross" similarities, it is the differences we want to leverage, and these are not either/or questions. There is a better way to implement things in Netezza, and a better way in the traditional RDBMS. Mixing the two is not optimum and can be detrimental.

 

The primary discussion fulcrum is simple: One is a transactional database and one is not. Moving away from "transactional thinking" is the key. How to accomplish this?

 

One of the best ways is to discuss and actually demonstrate the primary differences between bulk and transactional processing. As this is largely the crux of misunderstanding, or even the necessary "paradigm shift" our newbie needs to embrace, a significant hurdle it seems, is the newbie's belief that the core engine functionality of their favorite RDBMS is somehow being indicted or set aside as useless. After all, the transactional RDBMS is just that - transactional - and this is what we want the newbie to move away from. What? All that hard-won and industry-hardened capability - and we're just setting it aside? Really?

 

In a word - Yes.

 

It's not that the transactional capabilities are useless. They simply aren't useful in a data warehouse. More importantly, they don't even exist in a Netezza machine. So attempting to shoe-horn transactional thinking into this machine is a huge disconnect - no differently than using a lawnmower as a hedge-trimmer. Netezza is purpose-built. Transactions are missing by design.

 

Now at least one person is bristling because they know, administratively, that transactional support is handy for logging, managing metadata, troubleshooting hooks and other administrative support. I don't disagree with that, but it's not the activity of bulk data processing. It is far easier to set up a smaller database machine alongside the Netezza machine to perform these administrative transactional tasks. Each machine then has an objective role and purpose, and off we go.

 

What are some of the demonstrable ways that we can introduce the new Enzee to this issue, in a manner that really drives the point home? Well, I can't seem to count how many times I've had (sometimes rather contentious) discussions with "outsiders" (or perhaps "purists" ) on the subject of transactional exception handling. Inserting a record into a transactional database, with its glorious constraints turned on, will guarantee that it will pushback on us with an exception. Said exception requiring the dutiful compliance of an exception handler. You know the drill.

 

But in data warehousing, such transactional exceptions are in the way of our bulk load. We don't want the database to examine each and every record as it arrives, potentially formulating an exception (and its attendant overhead) for each record, or passing each one through after its constraint-based integrity check. We just finished taking all that data through a detailed sieve of business rules in the ETL layer, didn't we? The database needn't trouble itself, just load the data, thank you.

 

Now at least one more "outsider" is bristling. How dare you say that we should set aside the constraint-based exception handling? What possible justification could there be for such a gross trampling of RDBMS functionality? Explain yourself!

 

In a word - performance.

 

Storytime: Just after 9/11, the airports over-compensated with all kinds of rigorous shakedown protocols. Travelers had to show a boarding pass and ID at a checkpoint, then keep them handy for just after the checkpoint. And then also for presentation at the gate prior to boarding, along with random bag searching. If you were the first one to board, or made eye-contact with the bag-search team, it was guaranteed that you would be taken aside and your luggage rummaged. A friend of mine told me that the rummagers liked to carry on a conversation to make you feel more comfortable about their pulling your private things out into the open air for all to see. One of them held up a nose hair trimmer to one of his cohorts and said What the heck is this? Makes one wonder what other kinds of personal appliances we could "salt" the bag with just to embarrass the daylights out of them, hmmm?

 

My friend told me that he was pulled aside a lot, and started experimenting with "stated professions" that the rummager would not care to talk about. At one point he blurted "I'm a professional bodyguard" to which the rummager alerted like a trained narcotics dog and said "So you would know how to use weapons?" to which my friend simply said "Or not."  This of course made the rummager gulp and go quiet, but it still wasn't good enough. My friend didn't want them to talk at all, so they wouldn't waste any time in their rummaging and just get-it-over-with. So at one point he said "I own a funeral home." Which of course, stopped the chatter completely. Nobody really knows how to continue a casual conversation about such a subject.

 

The point being, he'd already had his bags electronically scanned at the checkpoint. Do we really need to check it again? And unlike a constraint-based exception handler, the rummager had the option of only picking out random hapless travelers. The exception handler rummages the bags of every traveler in the line. We can see how utterly inefficient this is. Nowadays, they screen the bags and then don't even check ID again at the gate. Except for random gates on occasion because nefarious people sometimes swap tickets when they get behind the checkpoint. In any case, if we've already exhaustively checked the bags to get the traveler where he is, more checking is a waste of everyone's time. Just like the exception handler. If we just ran the entire set of data through rignorous validation rules, we have no need whatsoever of the transactional exception handling in the database. It will waste processing time.

 

And wasting time, we don't have the luxury to do.

 

The transactionally-constrained bulk-load of data will be, on average, five to ten times slower in operation than its non-constrained equivalent. If our objective is to achieve a fast load - and trust me - it really is - we don't want constraints turned on. We're talking about loading millions if not billions of records. Even in an RDBMS, we cannot afford to convert what could be a thirty-minute operation into a two-plus-hour operation. The window of time simply does not exist. In some locations, if this kind of window ever existed, it is rapidly vanishing as their businesses go-global and need to process data as-the-world-turns.

 

On the flip side, think about the main reason for a transactional exception - it is to keep a transactional application honest. If the data does not comply, the user fixes and re-submits. It's interactive, and it deals with a single entity at a time, not millions of entities at a time.

 

The "outsider" will now brace on this assertion as well, because they think that having thousands of users interacting with the system constitutes this many-entities-at-a-time, but it simply doesn't. And here's why: RDBMS systems are meant to assimilate data in small chunks with high frequency. They are not designed to deal with large chunks at low frequency (e.g. a batch load once a night). They will accommodate such activities, but not do them well. In this case, "well" means loading a million rows a second. The RDBMS cannot approach this.

 

And this is the reasoning behind the Rule #10, which is - when loading bulk data, never involve the database in row-level activities. This means, without exception, turn off the exception handling. Because the database will just protract the duration of the flow, checking each and every record and slowing down all of them as a whole, in order to find the few exceptions. It is the equivalent of making the entire flow suffer for the sake of a few records. this is a bad tradeoff. And once again - didn't we just validate and scrub all these exceptions from the flow, in the ETL/data processing environment? Why are we asking the database to validate them again?

 

And the worst part, is that the potential exceptions are all anticipated and known. What does this mean? As a back-end programmer buildng the data flow, we have direct and objective access to each and every failure point that will stop the data from loading. Why would we delegate this to the database, since it is so inefficient in performing it? Note - not so lacking in functionality, because the RDBMS has lots of functionality to perform it. It is simply too inefficient with bulk loading to be a viable resource.

 

So what are the anticipated exceptions? Let's go for popularity:

 

  1. Bad or null data
  2. Unique key violations
  3. Primary/foreign key violations

 

 

In fact, the above constitute the primary reasons for the load to fail. So lets walk through the basic process we would need to follow if we delegate this to the RDBMS database.

 

In transactional mode, the RDBMS data load will kick out an exception for each of these it finds. Even if it completes with no error, someone in the room will say -It took too long. Even if it didn't find any exceptions.Fix it. Make it faster! The hard-core transactional engenue will attempt to optimize it without turning off exceptions, and find that it cannot be done. If the load of one record requires 1 second, it will take 1 million seconds to load 1 million records.

 

We just don't have 1 million seconds.

 

(Incidentally, in Netezza, the load of 1 record requires 1 second. The load of 1 million records requires 1 second. Use your second wisely!)

 

So our new Enzee will grouse a bit and then look around at the data warehousing sites for answers, and all of them will say, turn the RDBMS exceptions off, load the data, and then turn them back on. The newbie will object - but wait - when I try to turn them back on, the database yelps and says there are constraint violations. I will have to back the records out and try again. Oh yes, we now have a mess on our hands. In the time it took to load the RDBMS data - say thirty minutes - we have now accumulated errors that might take hours to back out, fix and then retry. And we'll have to do it while the batch-window clock is ticking, not in a pre-process where we had more breathing room. We don't have this kind of time window.

 

We will never have this kind of time window.

 

So the next fallback is to fix the exceptions in the data processing realm (ETL tool), prior to loading the data. But isn't this what the data processing realm is for? Really? This means we do all null checking and constraint checking prior to loading. How? We download the primary and foreign keys into the local data processing environment and perform a localized join-filter to remove the exceptions. This is a data warehousing 101 best practice.

 

The newbie will now brace on the idea of downloading all the key values. All of them? That could take, well, it could take a long time!  It will take mere minutes to pull down all the key values. And those mere minutes are nothing compared to the duration of the recovery mess we will endure if we don't take this step.

 

Pay a little time now, or a lot of time later. Use your time wisely.

 

So here is the tradeoff (again in traditional RDBMS space)

  1. Turn off constraints, load the data, and then deal with the mess after the fact. Plan to spend hours backing out the mess and then running the load from scratch.
  2. Download keys, join/filter the key exceptions, turn off constraints, load the data with the expectation that no mess will arise. (because it won't)

 

In short, downloading the keys for constraint checking is a necessary evil. Our only "next best" fallback is to load the data into a pre-target staging table and do the gross comparison there. Then we copy the good records into the target table. But wait - now we've incurred the penalty of the load twice (one for the ETL to staging and one for staging to target). Isn't it cheaper to pull down the keys once than it is to load all the data twice? Not to mention the fact that the average RDBMS engine does not efficiently copy tables either. So even if we decide to go with loading a staging table, the copy of the staging-to-target will take longer than we are willing to wait.

 

Think about this: When the data exception arises in (1) above, where will we fix the problem? In the database, or in the data processing realm? The database can only report the issue, not fix it. If we must fix it in the data processing zone anyhow, why woudn't we fix it proactively rather than reactively?

 

So this approach means something even more valuable - if we find the exceptions in the data processing realm prior to loading, we will have found them proactively and administratively, not
reactively and operationally.

 

This makes a huge difference in the reconciliation of data exceptions when we're dealing with millione or billions of entities.

 

And yet another issue our newbie is pleasantly unaware of - data processing on this scale has to be beholden to the constraints of the lights-out operation, administration, and logistical capabilities of the physical plant around the machine. If the operators have to get involved in the data recovery, with data processing on this scale, it needs to be for incidental reasons, not mass data recovery.
In essence, delegating this activity to the RDBMS, is setting up our operators to fail. We will find them entirely intolerant of this approach. Fix it, they will say. If our answer to them is - hey, me architect, you operator, so gird up thy loins and get thee to work - we have punted (and dangerously so) something we should take complete responsibility for. Because make no mistake, we will be held completely acccountable for it as well. They will call us in the middle of the night. They will only help us incidentally. It's your mess, you clean it up!

 

The primary issue here is that the traditional RDBMS load has to be not only load ready, but consumption-ready. When we load the data, we have to be completely and thoroughly finished with all data processing before it hits the target table. From there, the user should be able to consume it right away. Load-ready and consumption-ready is the name of the game, and it's accomplished for the RDBMS in the ETL environment, because it cannot be efficiently accomplished inside the RDBMS. The RDBMS is simply too slow and inefficient for any form of bulk operation. And again I say, if the only place to actually fix the data is the data processing realm, it only makes sense to do it proactively, not reactively.

 

Now let's flip over to the Netezza side of things.

 

In the Netezza machine, we can stage the data "dirty" if we want to, and we often do. The data can essentially be copied as-is from its external dirty location directly into the machine with an nzload to a staging database. From there, we have it in massively parallel form and can use a series of CTAS operations (ELT-style) to cleanse and shape the data. Once we're ready, when can then do a massively parallel join from the incoming table to the target, validating primary and foreign key values in bulk. Then we just copy the good data and we're done. When using Netezza, it is always faster to let the machine do the data cleanup and integration in a massively parallel, set-based operation (even a series of them) than it is to pull the data out, process it in an ETL tool, and put it back. ETL tools, on average, cannot compete with the massively parallel power of Netezza's engine.

 

Let's look at what we accomplished with little effort: (1) We cleansed the data of dirt. (2) In a single, massively parallel join we validated unique constraints.(3) In a single, massively parallel join we validated foreign keys (one join per key). The total time to accomplish the second two tasks is fractional, often a matter of minutes even on billion-row tables and billion-row loads. The time for the first task is shrunken too, since we can apply our row-level data scrubbing rules in-bulk with sweeping operations rather than row-level operations.

 

Case Study Short: Working with a SQL-server based model, the client was loading 15 million records into the database with the bulk loader and the largest machines available. Total time to load - over 2 hours. Tried it again on an Oracle platform, with a top-line 16-core machine with plenty of high-end disk space. This operation took 30 minutes. This was attempted on a Netezza platform, same data, same volume, and it took 15 seconds. There is a contrast, but not a comparison. Nothing adequately compares to a 15-second data load.

 

The important takeaway is this: If I can load the data in 15 seconds, I have a luxury of time to perform internal ELT, data scrubbing and integration, key checking and the like in a matter of minutes, still ensconcing the data into the final target table before the other two databases even get started. More importantly, I did it without standing up a formal external ETL tool. All of it happened "under the air" of the Netezza machine.

 

Now an interesting exercise for the new Enzee would be to actually walk through the processes noted above. In a problem-solving series of exercises, they should get some data that has embedded constraint violations, then attempt to load the data to an RDBMS with transactional constraints turned on, then turn on the creative juices to see how it can be done more efficiently. I would not suggest loading millions of rows to an RDBMS for this exercise, since they are so inefficient at this. Try it with a smaller row-count and then extrapolate the necessary time-to-load. What they will discover is that they will find themselves slowly backing out their precious transactional exception handling to fix the problem another way. The faster they get, the more the the chosen path will start looking very lean on RDBMS capabilities.

 

The final form of their solution, they will find, is supported de-facto and in massively parallel inside the Netezza appliance at no additional charge.In the end, they will see why Enzees have run, not walked to a Netezza platform for just this kind of capability. We know they have made the transition when we can hear them having a conversation with another newbie about transactional versus bulk processing, and they are coaching the newbie away from the transactional model.  Ahh, a beautiful thing, indeed.

 

This is why Netezza is in no way, no how a transactional machine, and why it doesn't enforce primary and foreign key constraints. These can be installed as metadata, but the expectation is that they will be used by an external, intelligent operation that will leverage them for administrative key validation - in bulk. After all, I can read the key metadata from the Netezza catalog, formulate a series of validation operations that will work for any table, any key, any time. Install it as a stored proc and invoke it when necessary. This allows me to set up the load operations and prepare the final copy to the target (which is often the accumulation of dozens of operations to integrate the data into a common pre-target table). Then validate the data just before it is finally copied to the target. This keeps me from having to do it a record-at-a-time, or to have an exception processor accidentally execute the operation before I am completely finished formulating the data for the load.

 

Row-level exception handling is a beautiful thing - transactionally. If the domain where the exception must be fixed is already the data processing zone, we need to proactively embrace this responsibility and just do it. In the end, row-level exception handling has to be completely removed from our thinking processes. We need to invoke sweeping operations that capture the exceptions in-bulk, not a row-at-a-time. Fix and integrate them in bulk, not a row-at-a-time. Bulk is the name of the game, and always has been.

0 Comments Permalink
0

Many years ago someone impressed upon me the need for simplicity in matters of scale. The problem of course, is that simplicity is impossible without power. This is why we see secondhand RDBMS environments proliferate their complexity into a functionally catatonic state, ultimately calling for its wholesale replacement. And upon the call, others swoop in to save the day, saying that they can "replicate your functionality" in another stronger environment that is geared for high complexity, without even once looking askance at the complexity's necessity. By that I mean, the complexity arrived as a function of an underpowered environment, with sweat-labor from engineers working diligently to prop up a fading machine. It means - all that addtional power-propping, is artificial and we need to regard it as a necessary evil. If we'd had the power way back when, the complexity never would have arisen in the first place. So the complexity is a symptom, not an attribute.

 

If we have the power, it gives us the capability to implement functionally sophisticated solutions with ease of maintenance and operation. Functional sophistication is key, because we don't want to dumb-down the functionality just because we don't have enough hardware. Yet the people responsibile for buying us more hardware look at us warily, wondering "You know, I signed off on the last hardware purchase thinking it would be my last hardware purchase. Yet now we are already out of gas and it seems like we didn't get the return-on-investment for the last purchase."  Ahh, but wait, says the engineer, the systems are voracious and so are the users. Adding more hardware is the only way to stay ahead of them. To which the purchaser objects "How do I know that you are making the most efficient use of the hardware you already have? Can you tell me that you haven't tried to optimize the environment?"  To which the engineer skulks away, formulates a plan and spends the next six months wrapping the solution in an engineered cocoon of complexity that offers marginal boost, but a boost nevertheless. The purchaser feels vindicated. "I'll stand my ground next time, and they'll have to go through another optimization! Aha! The key is to get these deadbeat engineers to do their job, and engineer!"

 

And so at the end of this bitter and tumultuous cycle, we have an over-engineered, underpowered machine that nobody is happy with except for the purchaser. who held out until the very end. Often the purchaser is overridden by a super-purchaser, like the CTO or CEO, who finally releases the funds, offers the directives, and the purchaser dutifully though reluctantly complies, certain in his heart that the engineers have at least one more optimization cycle left in them. If you've never been the recipient or the participant in such a repeatable and pervasive cycle, what a blessing to be in the food service or housekeeping industries in these perilous times!

 

In one case, I told a senior leader point-blank that the reason his secondhand RDBMS system was running out of gas, was the proliferation of cursor-based stored procedures draining the lifeblood from the box. He was stunned at this assertion, because he'd been assured by his implementers that this was the right thing to do. The implementers in the room objected by asking if I was "actually suggesting" that they pull the data into a third-party tool, process it, and put it back. Such phrases as "are you telling me", "seriously"  and "everybody knows that" were the commom prefixes of all their objections. Oookay-fine. This doesn't detract one iota from the simple fact that the secondhand RDBMS doesn't process bulk data well, and it doesn't really support third-party environments that require it to (you know, bulk loading and extract). The whole bulk-loading and extract domain is something that secondhand RDBMS systems provide their own utility for, and dogpile one caveat after another against its regular use for operational, lights-out data processing. This is because, as Rule #10 tells us, secondhand RDBMS systems are lousy at row-by-row processing. This has always been true.

 

And think from the perspective of a newbie. Imagine being introduced now to a Netezza machine where it has been specifically purpose-built to support all the things that the secondhand RDBMS's treat as - well - secondhand? Our newbies come to their cubicle with their hard-won tales of derring-do and the many dragons they have captured and thrown down. They drag their canvas bags filled with dragon-fighting gear, empty it on the floor and we marvel at all the tools we once used to be effective. Ahh, the aroma of the dragon's blood wafts from the antiquated instruments of war, reminding us of days gone by, and all that bygone stuff, too.  It's hard to immediately schedule their pickup for the Warehousing Museum on the 7th floor, where all of our stuff is on display and gathering dust. Instead we scratch the Museum's number on the back of a business card and hand it to the warrior for later. Still, all of those weapons and their attendant experience are very valuable, but not in the way one might think.

 

What's a seasoned warrior to do when the machine can devour a data warehouse dragon like a tree shredder, digest it, repackage it and dungeon-ize it for all eternity, and do it as an appliance? Almost like pressing the button on a toaster, but not quite. Our hero rolls up his sleeves and starts slinging row-by-row, cursor-based stuff so prevalent in secondhand implementations. Upon first execution, it runs worse than a dog. It runs like a wet dog. And doesn't smell any better either. The hero scratches his head and tries again. Everything the hero knows about data processing just went "counter-intuitive". The hero mutters "I don't get it". Ahh, but the hero will get it, because the hero is emotionally engaged.

 

This emotional engagement is something that we, as seasoned Netezza folk, should leverage to help our newbies make the necessary shift in their approach. They really need to take ownership of the knowledge, but sometimes (I've been told) it feels a lot like pushing a string (for the seasoned people) and a lot like playing some kind of strange board-game for the newbie. To avoid mental pain and injury, and heavy-lifting on the part of the seasoned people (and the attendant frustration from helping a newbie blossom) the seasoned pro only need to remember the power of the emotional engagement. The newbie will want to conquer the machine. Harness it for the good of all mankind. They will brandish their blades with the familiar shhhhiiiiiinnnngg as it leaves the sheath. But there's only one hero in the room. The big black box. The warrior's emotions are drawing him/her inexorably closer to the final conclusion - the machine works for us. Really works for us. Unlike the secondhand RDBMS machines that once enslaved us. Now we are the master. The new hero does our bidding, and does it well.

 

So our warrior, after many days of working with the machine, finds himself no longer using his old gear. No longer using his old ways. In fact, now stumbling over the gear and getting his feet tangled in the pull-strings on the canvas bag. One day he will pull out the number we gave him for the Data Warehouse Museum on the 7th floor and call for a pickup. On that day, take the warrior to lunch. The transition is complete.

 

What has the warrior embraced - that complexity is no longer the key to winning. We don't have to "do it the old way" and we don't have to figure out a way to shoehorn our former implementations into the new domain. Those implementations were necessary evils, using technology that wasn't geared to support them, rigged for an outcome on an underpowered and overwhelmed environment. When moving to the new environment (whether it's a newbie learning the ropes or a large-scale migration of a secondhand RDBMS into the big black box) we don't take the prior implementation with us. We don't take the prior table structures, cursor-based processes, or anything else about the original implementations. If artificial complexity means that even part of the implementation is suspect, then all of the implementation is suspect. We have a new way of doing things, a new machine to do them on, and the outcome will be so much better if we now do things the way the machine best performs, rather than doing things the way some other secondhand machine performs poorly.

 

And this is the difference between complexity and simplicity. In an underpowered secondhand RDBMS, the complexity props up the weakness of the technology and is a symptom of weakness in every way imaginable. Simplicity on the other hand, is the mark of strength of the technology and its ease of implementation and maintenance.

 

For a (human) hero, complexity is the reason for existence and simplicity is for the simple-minded peasant. After all, the peasants do the (back labor) work of the field and save the thought labor for the feudal lord and his lackeys. But one can see the parallel immediately - the secondhand technology is actually a feudal lord, its high-priced product engineers are its lackeys, and we, my friend are its indentured servants. Quite the converse for Netezza folk - we are the feudal lord, requiring no lackeys, and the machine is our indentured servant. The machine works for us.

 

I cannot count how many times I've been approached (even in the hallway of a Netezza conference!) of people asking me about re-engineering their existing systems for Netezza. No, my friend, we don't re-engineer, we de-engineer. There's a huge difference.

 

What does simplification buy us? We can now move away from the raw complexity required to prop up a lack of performance, and move toward the sophistication required of a competitive solution. The sophistication is key, because anyone can build a simple system for simple business reasons. But when the complexity of the environment hinders us from the next level of functional sophistication, the complexity has now enslaved our business model, and effectively the business itself. Simplified implementations are stable and adaptable, so can scale to breathtaking heights, giving us the  necessary edge for competitive, sophisticated offerings that aren't even possible in the secondhand technologies.

0 Comments Permalink
0

Manhattan Skylines

Posted by David Birmingham Mar 4, 2010

Marcus Gray watched in consternation as the viral program cranked up. He knew that in moments the band of hackers would once again take over the Manhattan power grid. For now, they were doing it as a prank. But he also realized it could be a test run for something even bigger. Like a grid-by-grid shutdown of the entire system, opening the door for untold mayhem on the darkened streets.

 

Moments later, messages from the hacker gang started appearing on all their terminals. Taunting barbs letting everyone know that they were in complete control and nobody could stop them. Gray shook his head and closed his eyes, hoping that this would pass quickly. Losing power even in one part of the grid could spell pandemonium and place lives and fortunes at risk. The weight on his shoulders was crushing.

 

"I think I can help," said a voice from behind. Lane McBride from the Federal Counter-Terrorism Unit based in Manhattan, leaned over to regard Gray's terminal.

 

Gray turned to the voice, recognizing it with hope in his eyes, and said, "They're at it again."

 

"I saw the precursors," McBride noted, "That they were entering the system."

 

"Yeah, but it doesn't matter if we can't find exactly where they are," Gray sighed, shaking his head, "They're in a hundred different buildings, including the Empire State. You guys have agents standing by at all of them, but they have to search the buildings floor-by-floor to find them. The problem is, we have to shut down communications for the building so that they can't warn each other. So even if we could catch a few, do you have any idea how long a floor-to-floor search takes in the Empire State? We can't keep that building offline from communication for that long."

 

"Not to worry," McBride grinned, "I have an algorithm that will directly pinpoint their floors. All we have to do is send our officers up to the floor, and I bet we can round them up in minutes."

 

"Wow," Gray whistled, "I'd like to see that."

 

McBride whipped out a flash stick, plugged it in and let the program do its work. Within seconds, it had pinpointed each hacker, the building their signal was coming from and the floor of the building. "Here we go."

 

"I like it," Gray grinned.

 

McBride touched several buttons on his phone and dispatched the information, and monitored as each of the officers acknowledged the information and the plan. "We'll know soon enough."

 

Gray noted, "The problem has always been that they could hear us coming and could shift floors anytime they wanted."

 

"Not this time," McBride smirked, "At least, not if we do it right."

 

The first officer to report back was from the Empire State. Two of the hackers had been stationed there on separate floors. Both were now in custody and unable to warn their cohorts in the other buildings. Gray listened in awe as one by one, the officers reported in, having captured their respective quarries with minimal effort.

 

"That was brilliant," Gray stared at the screen as the weight seemed to lift from his shoulders, "How did you come up with the algorithm?"

 

"Simple process of elimination. I just looked at the problem from a very-large-scale search. The most important information is where the perps aren't - not where they are. The algorithm zones in on the candidate floor by understanding which floors are not candidates. Process of elimination leads the way. So we can search the Empire State and Chrysler buildings just as quickly as a single-story, capture the floor number and we're done."


---------------------

Some of you already see the parallels. It's how a zone map works. But how does it apply?

 

When we take a look at the Record Distribution option in the Netezza Admin GUI, we're often happy with a "ragged edge" for all the SPUs. And a "flat top" is the ticket. But what about the case of a "Manhattan Skyline", where we have high peaks and low valleys? This is higher than normal skew (something we're supposed to avoid, right?) People see those and shun them. However, these are often the natural result of an intermediate table produced by an ELT operation, and often a result of multi-pass queries in a BI tool. These usually leverage the mainstay workhorse CTAS (Create-Table-As-Select), so in many cases, people are tempted to turn on "random" for all CTAS operations. Or just maybe - one of our regular static supporting tables is deliberately distributed as a Manhattan Skyline just because we want to regularly perform co-located joins with it on larger master table using the same distribution key.

 

In any case, a primary reason we would get this kind of Manhattan Skyline distribution is if we are trying to preserve an existing distribution in order to perform a follow-on operation with tables on the same distribution. Whew! And why would we allow this to continue? Isn't a random distribution better than a Manhattan Skyline? Our problem remains: if the table has such a Manhattan Skyline distribution, we have higher than normal skew. Any full-scan on the table will cause the query to perform as slow as the "tallest bar"  (the SPU with too much of the table's data). As the table grows in size, the problem worsens. It is not a scalable distribution in its latent form, so don't embrace one without a plan.

 

Well, random distribution has a risk too, especially at the BI level, of negatively affecting concurrency performance. Even if our individual queries are not hindered by the data-broadcast incurred by the random distribution, they could just be a one-hit-wonder, because running many of these operations side-by-side can sometimes saturate the inter-SPU fabric, affecting concurrency. If we can keep the processing on the SPUs, we can avoid this problem entirely. So the issue is one of user scalability, something that all of us care about and that the other vendors (sometimes) turn a blind eye to. Netezza has it covered, and as usual, it's so simple a cave man could do it (now I'll get mail!)

 

So now we have two options, neither of which seem good - (a) keep the Manhattan Skyline distribution or (b) use a random one. Let me say that random is not always bad, but it poses a potential danger for concurrency. Likewise the Manhattan Skyline can often be a latent result of an intermediate CTAS so is unavoidable anyhow. And why would we want to preserve an existing distribution on a CTAS? The answer - because it will be a co-located write (blazingly fast). But wait! Don't we get a co-located write by default?

 

Maybe.

 

I have noted in prior posts how the default distribution for a CTAS might not be what we want or expected, so here's a quick recap:

 

(a) For simple single-table CTAS, it will preserve the source distribution key - (co-located write)

(b) For simple multi-table-join CTAS, it will leverage the first column result in the "select" clause (maybe a co-located write)

(c) For CTAS using summaries/group functions in the select, it will leverage the columns in the "group-by" clause (rarely a co-located write)

 

If any of the above are not the original distribution of the source(s), we could inadvertently sacrfice our co-located write. But we can preserve it if we specifically use "distribute on" with the CTAS execution. With co-located writes, this means the data never leaves the SPUs. If we distribute the CTAS on anything else, the data must leave its current SPU and find its way to another one. This initiates a data broadcast (and can negatively affect concurrency). Preserving the distribution, we get the benefit of a co-located write (avoiding broadcast to make the table) and set up the next operation for a co-located read (also avoid the broadcast to leverage the table). Short answer: preserving the distribution preserves concurrency performance. Now the SPUs are working for us at physics-speed.

 

Rather than just live with the latent effects, lets embrace and harness them for the good of all mankind. Well - er -  at least for our user base.

 

What we really want is threefold -

 

(1) preserve the distribution with a co-located write (preserve concurrency, potential Manhattan Skyline as latent artifact)
(2) leverage the result with a co-located read (preserve concurrency, potential penalty from Manhattan Skyline)
(3) mitigate the Manhattan Skyline with a zone map (ahh, best of all worlds)

 

So to get the first two, we can simply preserve the distribution with a "distribute on (key)" clause and make sure the distribution key is part of the "where/join" operations.. This is the simple part.

 

To get the third, we need to either (a) sort the data as it is created, or (b) make a materialized view after-the-fact to get the zone map effect for selected columns. The first one (sorting) is often easier than it sounds, and with strongly filtered intermediate tables is also very scalable. The second one (materialized view) has some caveats but is very fast to create. What does the zone map actually do? It effectively stripes each SPUs portion of the table so that only the section in the zone is actually addressed. Like McBride's algorithm, it's as though the rest of the data isn't even there, because the zone map has guided the optimizer to completely ignore it. So whether the SPU's data has a tall bar or a short bar, the performance is the same. We need all three of the above and the zone map mitigates the potential problem of unexpectedly high skew from an intermediate distribution - or an outlier table that we need to distribute on a common key. Even if (1) and (2) above give us a good distribution today, it could always "go Manhattan" in the future.

 

Another obvious question is "If this is an intermediate result, why bother? Just filter out the stuff I don't want and then there's no issue, right?" Well, technically yes, for a single operation, but I know of at least a dozen cases where the intermediate table is used for a lot of downstream activity, not just a one-off throwaway. So our stewardship rule is: make the data better. For the next downstream process or the ultimate data consumer, the data should get better every time we touch it.

 

Rather than rewrite or re-design a carefully tested and detailed process, adding a simple "order by" or MV is easy and preserves the existing logic, and data model, with little impact and high return. This is especially true of a static supporting table, because we can install what we need on the table's creation. The consuming processes all benefit from it with no more than regular query execution (materialized views are transparent).

 

In the end, we can still leverage the plain-vanilla parts of the Netezza performance model (zone maps, co-location) without having to over-engineer the data using indexes, intersection tables or summaries. This preserves something more  - the ongoing resilience and adaptability of the model itself.

 

Recap:

 

  • Apply the "distribute on" clause of the CTAS to avoid the latent effect of default distribution.
  • Preserve co-location for reads and writes in intermediate tables.
  • If a potential Manhattan Skyline distribution is the CTAS result, rather than go random, sort the CTAS result by a selected column or use a materialized view.
  • As always, apply strong filters to the CTAS creation so that it's not simply copying one table's contents to another (carve the data out).
  • Experiment for the best fit, but remember that Netezza is an appliance.
  • We don't need to engineer the queries, only apply simple performance model alignments in the data itself, to leverage the machine's physics
0 Comments Permalink