[ www.netezza.com ]

I heard this sentiment from a senior manager at a large-scale data processing facility, so I thought I'd post it as a provocative talking point. In his mind, when something went really south in the scheme of things, he had to evaluate people as to whether they were incompetent or immoral. Or something in between. You never know what a manager is thinking, apparently.

 

You see, in his mind, he needed a means to label a person rather than an activity. On the other hand, I like the sentiment of another famous consultant, who when asked how things could get so bad, would simply quip "Because honest, hard-working people did the best they could with what they had." Hmm - there's no incompetence or immorality there, just the realization that things can and do go wrong. Case in point, last year we had a project where the workload grossly exceeded the headcount to make-it-happen. With a thousand spinning plates in the air, and not enough people to keep them spinning, invariably a plate would fall to the floor and crash. The manager above would call a meeting and review why the plate crashed, and find someone to blame for it. But in the end, the plate crashed because it's what plates do.

 

Debating the "why" is a waste of time.

 

We see this when the "critical mass" exists to switch horses, sometimes in midstream, from a powerhouse, legacy and mainstay kind of technology to a new, shining future in another, more promising technology. Ahh - you see where this leads. Someone now has turf to protect, and the review of a new technology - or even the hint of replacement - is viewed as an indictment of the existing technology. And, you guessed it, an indictment of the existing technologists. Because of the manager above, the people in the mix begin to wonder if they are being labeled as incompetent, for defending an inadequate technology, or immoral, for having another motive, like defending the technology because it will help keep their job, regardless of whether it is the best choice for their company or team.

 

And if they perceive this labeling, they too will fire off their own labels, and soon we see the makings of a classic conflict. I spoke with a leader who had just weathered such a conflict, and he said that he couldn't believe how quickly his seemingly objective, science-minded technologists were reduced to feral animals, practically overnight. He didn't really have to muscle through the process like some other extreme cases, but it is important to note that the conflict is real. The drama brings out interesting colors in people, and shows us what they are made of. And as the second consultant above would say, it's usually not bad stuff. Just human stuff.

 

People who make an investment in one technology find themselves with an emotional and professional attachment to it. Like hanging on to a stock even when it's in free-fall, hope springs eternal. Our investment isn't really for nought - if we can just wait it out. My challenge to the average Joe out there is to do what you've been doing, stay the course and keep the high road. Bad-mouthing the existing technology, or the existing people running the technology, is not a profitable path. When we think about it, the "Enzee way" is to let the machine's power and architecture speak for itself. After all, if we have to resort to the same nefarious activities as a wannabe competitor, doesn't it speak volumes about what we really think of our favored son?

 

Some time ago, I was helping with a competitive POC and when we finally reported the metrics for loading, query and whatnot, we decided to show Netezza in its best possible light, and we agreed with everyone that this would be the case. We watched across-the-way as the competitors stayed late nights, carefully tuning their machine and its attendant parts, while we just tossed data into the Netezza box, did some basic distribution tuning and that was that. In fact, after getting some initial metrics, we took the worst times and reported on them, not the best times for the loads and queries.

 

When we reported our final numbers, our metrics blew away the competition by a factor of five or more, in some cases much more. When we told the decision-makers that we'd only spent a few hours on the POC, and even then only reported the worst-case numbers, they were stunned. Primarily because the competitive team had spent so much energy on tuning their technology, only to fall short at 20 percent or less of what the Netezza machine could do.

 

And doesn't this kind of story speak volumes - and show Netezza in the best possible light? After all, it's not really a good story to tell if we have to spend countless hours tuning the machine. The decision-makers know that for the sake of competition, we might spend a lot of time to "get that benchmark", but it will be the only time they ever see the benchmarked metric because they know we won't embrace that kind of intensity when we've actually deployed the technology.

 

I saw a "famous" benchmark on the internet, touted by vendors other than Netezza using technologies that were carefully tuned for the outcome. You know, like an Olympic athlete who trains the daylights out of their body to get that one-shining-moment. But catch up with the same athlete years later and find them out-of-the-game, no longer the feared competitor for one primary reason - they can meet the bar once. But they can't sustain it. And this is true of the famous benchmark. They can tune the daylights out of those technologies, and take them to new, never-known heights, and break world records. But if you really want to deploy these at your site, you'll get the standard disclaimer.

 

Your actual mileage may vary.

 

This variability is not found in the Netezza experience. The machine delivers the kind of sheer power and turnaround we need just by breaking the plastic and plugging it in. We don't have to spend countless hours tuning the machine under a hot lamp and after gallons of Red Bull. Power - effortless power - at our fingertips - really is the best possible light for showcasing what the machine can do. One of the customer decision makers said just that - if it requires a swarm of people to get the competitive technology to remotely the same level that a Netezza machine can reach just by powering-on, what kind of story does this tell? That we are committing to high-intensity deployment and maintenance for the life of the technology?

 

Cost of ownership has a lot of different meanings, no?


On a more recent project, we did the common "light" organization of the data and then the report developers cut their BI tool loose on it. When the smoke cleared, the turnaround times on the reports were abysmal. Some of them executed in minutes, some of them tens of minutes and some of them never came back at all. Then the finger-pointing started from the reporting team, who could not lay enough blame at the foot of the Netezza machine. But soft, what light through yonder window breaks? It is the Marlboro Man, come to carefully show the reporting gurus why the Netezza machine is not an SMP/RDBMS machine, and needs a few additional hints (e.g. zone map refs at the query level) to make the reports turn around at keyboard-speed. Honestly, if it were any other technology - like an SMP/RDBMS - and we encountered such abysmal turnaround time, the answer really would be to fix the database, in the data structures, the indexing, or even at the hardware level. How amazing is it that rather than "going back to formula" - we can just tweak a query or two, and lo, we have stratospheric performance?

 

As it should be.


There is a temptation, you see, to protect the turf one loves so well, by somehow telling a story that does not meet with reality. And in all this, it's no different than saying we cannot get our toaster or blender to work in our kitchen, even though we aren't using them as described in the owner's manual. Netezza is an appliance. It has measurable, deterministic behavior and simply does not deviate from its prescribed, self-contained nature. For someone to claim that a kitchen toaster doesn't work, one only has to ask a few simple questions to determine whether the toaster really doesn't work, or if it's just not being used correctly.

 

And in our case, the Netezza machine is a more complex horse. But the interface to the horse is still the same - a pair of reins and a pat on the neck, and the horse behaves just like we expect it to. Of course, getting four hundred horses to behave the same way in lockstep is a matter of architecture. But imagine how much work you could get done if you could package four hundred horsepower for useful work? It's the difference between a 32-horsepower (SMP/RDBMS) oatmeal-mobile - or a 400-horsepower street machine with e-brake for those drifting stunts.

 

Yeah-man, give me the street machine any day.


Now here's a luxury we don't see every day. After all, if we're a car manufacturer, perhaps an aircraft builder, we have to get it right and make it fast all as part of the original design.

 

Which is not to say that we should just choke up a bunch of data structures and expect Netezza to cover our backs - oh wait - perhaps we really can do this, but it's not always practical.

 

Early in my career I worked with embedded / real-time systems, and while some really believe that they work with "real time" - here's the purist definition: A robot balancing a broom on its open palm. The robot has to make infinitesimal adjustments, in real time at the microsecond level, to keep the broom from falling. In "business" real time, however, we have the luxury of whole seconds to make a decision!

 

In these embedded systems, we had to take things through to a complete functional shakeout. Only then could we see where events collided or didn't make sense. So-called "race" conditions and meta-stable conditions - yep - for those of you who know what these things mean, they can cause you to go gray early and stay that way.

 

Ahem.

 

So the maxim here was always "get it right, and then make it fast". I took this "RTI" (little three-letter-acronym (TLA) for Run-Time-Improvement), into a commercial venture where I developed an expert system engine for medical claims processing. Only when the system was behaving correctly could we then find ways to pinpoint the hot-spots. In one case, one percent of the claims took over 90 percent of the processing. We dug deep to find the issue, optimized for it and voila - the system screams like a banshee. In fact, these improvements boosted processing speed for the areas that were already fast - so it was a double-win.

 

People asked us at the time why we allowed the development process to "suffer" with "poor" performance up until the very end - when we knew good and well that attempting to optimize it while it was still in functional flux, didn't make any sense. We can do sweeping improvements only so much and only so deep. But what if we spent all of our time improving something that three weeks later the client says they don't want any more, or want to go in another direction, and all of our improvements are for nought?

 

Nay, wait a little longer - make it right, then make it fast.

 

Fortunately - and quite luxuriously - in our space the Netezza platform dovetails directly into this approach. It already has all the juice to help us succeed even if we do things badly or inefficiently. But when the smoke clears and we have a working prototype, it's time to roll up sleeves and pop-the-hood, so to speak, to do some RTI on a working system. This is where it gets fun.

 

The following is a short list, and is by no means comprehensive:

 

(A) Line up the operations in their natural sequence. Find the ones that are longer-running and optimize them. This will provide some degree of relief, but it's still just low-hanging fruit.

 

(B) Locate in-line calculations in the where-clause or join-clause, and reduce or eliminate them. In one case, we needed to join on a date-time where the "drift" allowed the timestamps to differ by plus-or-minus three seconds. Rather than put the time+3 and time-3 calculations in the actual query, we precalculated two additional columns, one with the time minus three and one with the time plus three. We then used a simple "between" operator to get the answer. The payoff - a 1000:1 difference between the two run times. In-line calculations in the where-clause are an invisible power drain. Get them outta there.

 

(C) Find the processing patterns and consolidate them. For example, we had a BI tool executing eight queries to achieve a report output - and each query took approximately 20 seconds. Not bad for having to plumb the equivalent of 53 terabytes of information. Yet eight queries like this delayed the report's display by 160 seconds - almost three minutes. Users don't like to wait this long. So to optimize, we focused on the pattern, that the same basic query existed at the core of each of the eight. By taking the "hard part" and performing an up-front query that did all the heavy-lifting at once, we were able to take the 20 second penalty only once, and the 8 downstream queries returned in 4 seconds or less. So - 160 seconds to 34 seconds with a simple logistical change.

 

(D) Pre-filter or precalculate when consolidating. This means taking a common downstream operation, especially a repetitive one, and moving it upstream to another, even unrelated operation. The above time-drift is one example. Calculate and filter as early as possible, and then all the downstream operations benefit from it. This can offer up more than just a "spot" boost, because if it shaves a few seconds off every downstream query, this can quickly add up to shaving tens of minutes and even hours off our overall processing time.

 

(E) Mind the gap - between what we understand about how an RDBMS works and how Netezza works. For example, if we want to leverage filter power in Netezza, we could use a "where exists" clause rather than a regular join. If the regular join cannot leverage a distribution key, then the where-exists is a highly performant option. The same goes for a view in the join that does more than just serve up data, like doing a sub-join itself. This can be very costly, and is another hidden drain on performance that we can pull into a where-exists. Another "gap" is in merging two data sets that have no common distribution key. The where-exists and similar operations force the machine to obey our optimizations, because we actually know the data, whereas the machine simply exchanges it with us.

 

(F) Avoid squeezing blood from a stone - It is tempting for the adrenalin junkie in us to see a cool way to reduce 2 minutes to several seconds - after a rush like this, it becomes addictive. We should not let it go to our head. In one case of processing a nightly batch, one of the client's 14-hour processes had been reduced to less than fifteen minutes. Yet some in the room still groused for more - they came up with an outrageous plan to reduce the processing to less than five minutes, but only four people on the entire planet could ever maintain it, much less enhance or improve it. At some point we have to agree that enough is enough - especially when we are sacrificing valuable things (like extensibility, flexibility etc) for the sake of a few more minutes of adrenalin rush. We must resist!

 

(G) Focus on the target system, not the one you are leaving behind. I've never known a case where someone moves into a new home and refuses to use the appliances in the home just because they didn't exist in their old one. Nor have I ever seen someone buy a new home and then attempt to fill it with new furniture that would only fit in their former home. Who does stuff like this? No, we should use the former system as a functional baseline (describing what) but not focus on how we implemented the baseline - rather our new technology gives us the ability to spread our wings and fly in ways that we never could in the old environment. Example: One client had over 400 stored procedures to convert, and regarded these stored procedures as the baseline for the actual work, rather than the baseline for the functionality alone. When re-characterized, it reduced to four flows with a handful of core operations each - all with very simple implementations. Trust me, when it takes its final form in Netezza, it will look nothing like its former self.
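
To make items (B) and (E) a bit more concrete, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module purely as a stand-in, so it shows the shape of the SQL and that the answers match, not Netezza's run times. The tables (events, signals, events_win), their columns and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; ts is a timestamp in whole seconds.
    CREATE TABLE events  (event_id  INTEGER, ts INTEGER);
    CREATE TABLE signals (signal_id INTEGER, ts INTEGER);
    INSERT INTO events  VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO signals VALUES (10, 98), (11, 103), (12, 202), (13, 250);
""")

# (B) "Before": the plus-or-minus-three arithmetic sits inside the join clause.
inline = conn.execute("""
    SELECT e.event_id, s.signal_id
    FROM events e
    JOIN signals s ON s.ts >= e.ts - 3 AND s.ts <= e.ts + 3
    ORDER BY e.event_id, s.signal_id
""").fetchall()

# (B) "After": precalculate the two extra columns once, then use a plain BETWEEN.
conn.executescript("""
    CREATE TEMP TABLE events_win AS
    SELECT event_id, ts, ts - 3 AS ts_lo, ts + 3 AS ts_hi
    FROM events;
""")
precalc = conn.execute("""
    SELECT w.event_id, s.signal_id
    FROM events_win w
    JOIN signals s ON s.ts BETWEEN w.ts_lo AND w.ts_hi
    ORDER BY w.event_id, s.signal_id
""").fetchall()

# (E) When the other table is only needed as a filter, a where-exists
# keeps one row per event instead of one row per matching pair.
exists_ids = conn.execute("""
    SELECT w.event_id
    FROM events_win w
    WHERE EXISTS (SELECT 1 FROM signals s
                  WHERE s.ts BETWEEN w.ts_lo AND w.ts_hi)
    ORDER BY w.event_id
""").fetchall()

print(inline)      # pairs found with the in-line calculation
print(precalc)     # same pairs from the precalculated columns
print(exists_ids)  # events that have at least one signal in the window
```

The point of the sketch is only that the precalculated form and the where-exists form return the same answer as their in-line counterparts. Whether the speed-up is anywhere near what we saw depends on the data and on the machine.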

 

Haven't yet experienced a run-time improvement cycle, or committed time to making it happen? It's worth it - even if only for a sanity-check on an existing implementation or a proof-of-concept on an upcoming one.

 

Either way, we can reach functional closure in a fraction of the time of any other system, and once the functionality is stable (not necessarily locked down, just stable) we can find surprising and dramatic boosts with little additional effort.

 

 

Make it right (functionally speaking) - and then make it fast.


Before a professional visit to London last year, a friend of mine said to me "Mind the Gap" - and said it's something I would hear a lot. He did not mention the primary context of this phrase. Seems that in some of the London Tube (subway) stations, there is a significant gap between the platform and the door to the subway car. Westminster station and Kensington station even have "Mind the Gap" engraved on the platform, with a pleasant voice intoning this phrase over the PA. That "gap" can be as much as 8 inches, too wide to just drag a roll-aboard into the car, and too wide to expect a small child to get it right.

 

I can say that in standing up a new Netezza environment there is a common "gap" between what the new users expect to see and what they experience. Closing this gap will not only accelerate productivity, it will get you to closure as quickly as humanly possible, without the humans doing the heavy lifting.

 

SPUs do the work - In the NPS, the SPUs are the workhorses, not the overall machine. Push all of the work to the SPUs and avoid hitting the host for a lot of work.

 

What does this mean? Typically you will find a new user implementing the machine in the same way they would implement an RDBMS - that is - thinking about the problem in a single-record-at-a-time model. It is sometimes a challenge for people to restructure their thinking into a "bulk" approach versus a singleton approach. Here's an example:

 

Let's say I have an RDBMS stored procedure with four rules. The common approach is to open a cursor, pull a record-at-a-time, process each of these entities in context of the four rules, and then persist the results. One of our customers has a particular procedure that follows this protocol, and each of these 4-rule entity processes takes about 30 seconds. For 3500 entities, this can eventually add up to hours of time as the table grows.

 

How would we solve this in Netezza?

 

We would focus the work so that each rule is applied en-masse to all the records. We know that Netezza can process 3500 records in the blink of an eye. So if we simply take these rules and "stand them on their side" - we get the effect - Rule 1 for 3500 records, Rule 2 for 3500 records etc - and we finish all the rules in less than a minute.

 

When we say - "apply rule 1 to 3500 records" - this means persisting the data to a temporary table to be consumed by Rule #2.
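
Here is a small sketch of what "standing the rules on their side" can look like. It uses SQLite through Python's sqlite3 module as a stand-in, so it shows the shape of the approach rather than Netezza's speed. The entities table and the four rules are invented for illustration and are not the customer's actual procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [(i, (i % 50) * 10) for i in range(1, 3501)],  # 3500 made-up entities
)

# Singleton approach: pull a record at a time and run all four rules on it.
def process_one(amount):
    net = amount * 90 // 100 if amount >= 100 else amount  # rule 1: discount
    total = net + 5                                         # rule 2: flat fee
    tier = "high" if total >= 200 else "low"                # rule 3: classify
    return total if tier == "high" else None                # rule 4: keep 'high'

row_by_row = []
for entity_id, amount in conn.execute("SELECT id, amount FROM entities ORDER BY id"):
    total = process_one(amount)
    if total is not None:
        row_by_row.append((entity_id, total))

# Bulk approach: each rule runs once over all the records and persists its
# output to a temp table that the next rule consumes.
conn.executescript("""
    CREATE TEMP TABLE rule1 AS
        SELECT id,
               CASE WHEN amount >= 100 THEN amount * 90 / 100 ELSE amount END AS net
        FROM entities;
    CREATE TEMP TABLE rule2 AS
        SELECT id, net + 5 AS total FROM rule1;
    CREATE TEMP TABLE rule3 AS
        SELECT id, total,
               CASE WHEN total >= 200 THEN 'high' ELSE 'low' END AS tier
        FROM rule2;
    CREATE TEMP TABLE rule4 AS
        SELECT id, total FROM rule3 WHERE tier = 'high';
""")
bulk = conn.execute("SELECT id, total FROM rule4 ORDER BY id").fetchall()

print(bulk == row_by_row)  # both approaches should agree on the survivors
```

Same rules, same answer. The difference is that the bulk version never touches a single record on its own, which is the habit we are trying to build.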

 

Yikes - some of you will say - temp tables. You're kidding right? This first, knee-jerk reaction to temp-tables is expected from those who shun temp-tables in the RDBMS, because they are so expensive. In Netezza space, the temp-table is your friend. In fact, a most significant ally.

 

Queries do the work?  In an RDBMS setting, it is typical to see "big fat" queries that span pages and pages of work. The owner of this kind of query knows that the data coming off the disk had better come off only once, and be written back only once, and all the work that needs to get done had better get done in-transit, while the data is on-the-move.

 

The maxim of this approach is transactional-thinking - that "If the data is in my hands, I should do as much as possible before sending it back to storage".

 

But this maxim is anathema to a Netezza implementation, where we might see dozens upon dozens of "ELT" queries that manufacture intermediate results toward a conclusion in a fraction of the time of their "big fat" counterparts.

 

In short, when we force the query to do the work, we dogpile all of our logic into a single query. When we let the SPUs do the work, we snap apart the query into smaller, more digestible chunks, and the data never leaves the SPUs until we want to consume the final product.

 

SPU means something: in the Netezza machine it stands for Snippet Processing Unit, which tells us that the machine already intends to break the SQL apart into manageable chunks called snippets. Each snippet finds a home in various parts of the architecture. What we want to make sure of is that the SPUs are getting the majority of the snippets in the query (not the host or the network fabric between SPUs), and one of the best ways to do this is to avoid dogpiling a lot of logic into a single SQL query with the expectation that Netezza will just sort it all out. Oh, Netezza will give you an answer, usually in a fraction of the time of its RDBMS counterpart.

 

But when we really want to optimize the machine, we need to think like the machine does, and this often means injecting simpler SQL statements into the machine, capturing intermediate work in parallel tables, and deriving the same conclusion in yet another fraction of the time.

 

As an example, one of our clients has a query structure like this:

 

select
    sum(a), sum(b), sum(c) ... etc.
from
    lots of join conditions here
    lots of filter conditions here
    lots of group-by conditions here

 

For this query, executed by a BI tool, the result came back in about 10 minutes, owing to the fact that one of the tables was over 8 terabytes in size and two others were over four terabytes in size. We had applied all the zone maps we could, and had generally tuned the query for the best fit, down from (a lot more than 10 minutes) to 10 minutes. Yet 10 minutes still seemed like an eternity.

 

One of the problems - the joins occurred across distribution boundaries - so the larger tables were on one distribution and the smaller tables on another, so that bridging them was problematic.

 

The simplest fix was to divide this query into two, with the heavy lifting split across them.

 

Query 1:
select
    (raw columns)
from
    heavy-lifting tables
    using applicable filters
into a temp-table containing only a subset of data
    distributed on the key for Query 2 tables

Query 2:
select
    sum(raw columns)
from
    temp table above
    joined to additional tables, leveraging co-location
    and final filters for additional tables

 

 

The execution time for the first query was about 15 seconds. For the second one, about 5 seconds.

 

So in a single implementation, we have reduced the time from 10 minutes to 20 seconds just by breaking up the "big fat query" into more digestible parts.
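
For those who want to see the split in runnable form, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in, so it only demonstrates that the two-step version reaches the same answer as the single big query. The sales and stores tables, their columns and the filter values are assumptions made up for illustration, not the client's schema, and none of the timings above will reproduce on a toy like this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: sales plays the heavy-lifting table,
    -- stores plays the additional table joined at the end.
    CREATE TABLE sales  (store_id INTEGER, sale_day INTEGER, qty INTEGER, amount INTEGER);
    CREATE TABLE stores (store_id INTEGER, region TEXT, active INTEGER);
    INSERT INTO stores VALUES (1, 'east', 1), (2, 'west', 1), (3, 'west', 0);
    INSERT INTO sales VALUES
        (1, 1, 2, 20), (1, 5, 1, 10), (2, 5, 3, 30),
        (2, 6, 4, 45), (3, 6, 5, 50), (1, 7, 6, 60);
""")

# The "big fat query": joins, filters and sums all in one statement.
big_fat = conn.execute("""
    SELECT st.region, SUM(s.qty), SUM(s.amount)
    FROM sales s
    JOIN stores st ON st.store_id = s.store_id
    WHERE s.sale_day >= 5 AND st.active = 1
    GROUP BY st.region
    ORDER BY st.region
""").fetchall()

# Query 1: raw columns only, filtered down to a subset, parked in a temp table.
# On Netezza this step would presumably also name the distribution key that the
# Query 2 tables share; SQLite has no such concept, so it is left out here.
conn.executescript("""
    CREATE TEMP TABLE sales_subset AS
    SELECT store_id, qty, amount
    FROM sales
    WHERE sale_day >= 5;
""")

# Query 2: the sums, run against the much smaller temp table plus the final filters.
two_step = conn.execute("""
    SELECT st.region, SUM(t.qty), SUM(t.amount)
    FROM sales_subset t
    JOIN stores st ON st.store_id = t.store_id
    WHERE st.active = 1
    GROUP BY st.region
    ORDER BY st.region
""").fetchall()

print(big_fat)
print(two_step)  # should match the single-query result
```

The answers line up, which is the only thing this sketch can prove. The payoff in elapsed time comes from where the data lives on the real machine.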

 

 

Does it always work?  In most cases, if we approach the problem in a way the machine will ultimately solve it at the SPU level, and keep the data on the SPUs for as long as possible, then yes, emphatically so. It always works.

 

Is it always necessary? Hmm. Well, you tell me. If your users are okay with a 10-minute turnaround on a query, then no, it's not necessary. What is necessary, however, is to be a proper steward of the data and the processing resources hosting it. In practically every case, running a big-fat-query is inefficient and wasteful, and largely born of the transactional maxims/constraints of the RDBMS approach.

 

Unhook your brain from RDBMS-styled problem solving, and get away from transactional thinking. This is bulk data processing. Everything we do should address data in millions-of-records-at-a-time, not a record-at-a-time.
