Skip navigation
Currently Being Moderated

Mind the Gap - Let the SPUs do the work!

Posted by David Birmingham on Jul 13, 2009 10:02:23 AM

Before a professional visit to London last year, a friend of mine said to me "Mind the Gap" - and said it's something I would hear a lot. He did not mention the primary context of this phrase. Seems that in some of the London Tube (subway) Stations, their is a significant gap between the platform and the door to the subway car. Westminster station and Kensington station even have "Mind the Gap" engraved on the platform, with a pleasant voice intoning this phrase over the PA. That "gap" can be as much as 8 inches, too wide to just drag a roll-aboard into the car, and too wide to expect a small child to get it right.

 

I can say that in standing up a new Netezza environment there is a common "gap" between what the new users expect to see versus what they experience. Closing this gap will not only accelerate productivity, you will get to closure as quickly as humanly possible, without the humans doing the heavy lifting

 

SPUs do the work - In the NPS, the SPUs are the workhorses, not the overall machine. Push all of the work to the SPUs and avoid hitting the host for a lot of work.

 

What does this mean? Typically you will find a new user implementing the machine in the same way they would implement an RDBMS - that is - thinking about the problem in a single-record-at-a-time model. It is sometimes a challenge for people to restructure their thinking into a "bulk" approach versus a singleton approach. Here's an example:

 

Let's say I have an RDBMS stored procedure with four rules. The common approach is to open a cursor, pull a record-at-a-time, process each of these entities in context of the four rules, and then persist the results. One of our customers has a particular procedure that follows this protocol, and each of these 4-rule entity processes takes about 30 seconds. For 3500 entities, this can eventually add up to hours of time as the table grows.

 

How would we solve this in Netezza?

 

We would focus the work so that each rule is applied en-masse to all the records. We know that Netezza can process 3500 records in the blink of an eye. So if we simply take these rules and "stand them on their side" - we get the effect - Rule 1 for 3500 records, Rule 2 for 3500 records etc - and we finish all the rules in less than a minute.

 

When we say - "apply rule 1 to 3500 records" - this means persisting the data to a temporary table to be consumed by Rule #2.

 

Yikes - some of you will say - temp tables. You're kidding right? This first, knee-jerk reaction to temp-tables is expected from those who shun temp-tables in the RDBMS, because they are so expensive. In Netezza space, the temp-table is your friend. In fact, a most significant ally.

 

Queries do the work?  In an RDBMS setting, it is typical to see "big fat" queries that span pages and pages of work. The owner of this kind of query knows that the data coming off the disk had better come off only once, and be written back only once, and all the work that needs to get done had better get done in-transit, while the data is on-the-move.

 

The maxim of this approach is transactional-thinking - that "If the data is in my hands, I should do as much as possible before sending it back to storage".

 

But this maxim is anathema to a Netezza implementation, where we might see dozens upon dozens of "ELT" queries that manufacture intermediate results toward a conclusion in fraction of the time of their "big fat" counterparts.

 

In short, when we force the query to do the work, we dogpile all of our logic into a single query. When we let the SPUs do the work, we snap apart the query into smaller, more digestible chunks, and the data never leaves the SPUs until we want to consume the final product.

 

SPU means something: In the Netezza machine, the Snippet Processing Unit - that the machine already intends to break the SQL apart into manageable chunks call snippets. Each snippet finds a home in various parts of the architecture. What we want to make sure of - is that the SPUS are getting the majority of the snippets in the query (not the host or the network fabric between SPUS) and one of the best ways to do this is to avoid dogpiling a lot of logic into a single SQL query with the expectation that Netezza will just sort it all out. Oh, Netezza will give you an answer, usually in a fraction of the time of its RDBMS counterpart.

 

But when we really want to optimize the machine, we need to think like the machine does, and this often means injecting simpler SQL statements into the machine, capturing intermediate work in parallel tables, and deriving the same conclusion in yet another fraction of the time.

 

As an example, one of our clients has a query structure like this:

 

select

   sum(a), sum(b), sum(c)....etc

from

    lots of join conditions here

    lots of filter conditions here

    lots of group-by conditions here.

 

For this query, executed by a BI tool, the result came back in about 10 minutes, owing to the fact that one of the tables was over 8 terabytes in size and two others were over four terabytes in size. We had applied all the zone maps we could, and had generally tuned the query for the best fit, down from (a lot more than 10 minutes) to 10 minutes. Yet 10 minutes still seemed like an eternity.

 

One of the problems - the joins occurred across distribution boundaries - so the larger tables were on one distribution and the smaller tables on another, so that bridging them was problematic.

 

Simplest fix: was to divide this query into two - with the heavy-lifting split across the two.

 

Query 1:

select

    (raw columns)

from

    heavy-lifting tables

    using applicable filters

into a temp-table containing only a subset of data

   distributed on the key for Query 2 tables

 

Query 2:

select

       sum( raw columns)

from

        temp table above

        joined to additional tables, leveraging co-location

        and final filters for additional tables

 

 

the execution for the first query was about 15 seconds. For the second one, about 5 seconds.

 

So in a single implementation, we have reduced the time from 10 minuites to 20 seconds just by breaking up the "big fat query" into more digestible parts.

 

 

Does it always work?  In most cases, if we approach the problem in a way the machine will ultimately solve it at the SPU level, and keep the data on the SPUs for as long as possible, then yes, emphatically so. It always works.

 

Is it always necessary? Hmm. well, you tell me. If your users are okay with a 10-minute turnaround on a query, then no, It's not necessary. What is necessary, however, is to be a proper steward of the data and the processing resources hosting it. In practically every case, running a big-fat-query is inefficient and wasteful, and largely borne on the transactional maxims/constraints of the RDBMS approach.

 

Unhook your brain from RDBMS-styled problem solving, and get away from transactional thinking. This is bulk data processing. Everything we do should address data in millions-of-records-at-a-time, not an record-at-a-time.

Comments (0)