[ www.netezza.com ]

Gather 'round the Grill

2 Posts tagged with the bi tag
0

Manhattan Skylines

Posted by David Birmingham Mar 4, 2010

Marcus Gray watched in consternation as the viral program cranked up. He knew that in moments the band of hackers would once again take over the Manhattan power grid. For now, they were doing it as a prank. But he also realized it could be a test run for something even bigger. Like a grid-by-grid shutdown of the entire system, opening the door for untold mayhem on the darkened streets.

 

Moments later, messages from the hacker gang started appearing on all their terminals. Taunting barbs letting everyone know that they were in complete control and nobody could stop them. Gray shook his head and closed his eyes, hoping that this would pass quickly. Losing power even in one part of the grid could spell pandemonium and place lives and fortunes at risk. The weight on his shoulders was crushing.

 

"I think I can help," said a voice from behind. Lane McBride from the Federal Counter-Terrorism Unit based in Manhattan, leaned over to regard Gray's terminal.

 

Gray turned to the voice, recognizing it with hope in his eyes, and said, "They're at it again."

 

"I saw the precursors," McBride noted, "That they were entering the system."

 

"Yeah, but it doesn't matter if we can't find exactly where they are," Gray sighed, shaking his head, "They're in a hundred different buildings, including the Empire State. You guys have agents standing by at all of them, but they have to search the buildings floor-by-floor to find them. The problem is, we have to shut down communications for the building so that they can't warn each other. So even if we could catch a few, do you have any idea how long a floor-to-floor search takes in the Empire State? We can't keep that building offline from communication for that long."

 

"Not to worry," McBride grinned, "I have an algorithm that will directly pinpoint their floors. All we have to do is send our officers up to the floor, and I bet we can round them up in minutes."

 

"Wow," Gray whistled, "I'd like to see that."

 

McBride whipped out a flash stick, plugged it in and let the program do its work. Within seconds, it had pinpointed each hacker, the building their signal was coming from and the floor of the building. "Here we go."

 

"I like it," Gray grinned.

 

McBride touched several buttons on his phone and dispatched the information, and monitored as each of the officers acknowledged the information and the plan. "We'll know soon enough."

 

Gray noted, "The problem has always been that they could hear us coming and could shift floors anytime they wanted."

 

"Not this time," McBride smirked, "At least, not if we do it right."

 

The first officer to report back was from the Empire State. Two of the hackers had been stationed there on separate floors. Both were now in custody and unable to warn their cohorts in the other buildings. Gray listened in awe as one by one, the officers reported in, having captured their respective quarries with minimal effort.

 

"That was brilliant," Gray stared at the screen as the weight seemed to lift from his shoulders, "How did you come up with the algorithm?"

 

"Simple process of elimination. I just looked at the problem from a very-large-scale search. The most important information is where the perps aren't - not where they are. The algorithm zones in on the candidate floor by understanding which floors are not candidates. Process of elimination leads the way. So we can search the Empire State and Chrysler buildings just as quickly as a single-story, capture the floor number and we're done."


---------------------

Some of you already see the parallels. It's how a zone map works. But how does it apply?

 

When we take a look at the Record Distribution option in the Netezza Admin GUI, we're often happy with a "ragged edge" for all the SPUs. And a "flat top" is the ticket. But what about the case of a "Manhattan Skyline", where we have high peaks and low valleys? This is higher than normal skew (something we're supposed to avoid, right?) People see those and shun them. However, these are often the natural result of an intermediate table produced by an ELT operation, and often a result of multi-pass queries in a BI tool. These usually leverage the mainstay workhorse CTAS (Create-Table-As-Select), so in many cases, people are tempted to turn on "random" for all CTAS operations. Or just maybe - one of our regular static supporting tables is deliberately distributed as a Manhattan Skyline just because we want to regularly perform co-located joins with it on larger master table using the same distribution key.

 

In any case, a primary reason we would get this kind of Manhattan Skyline distribution is if we are trying to preserve an existing distribution in order to perform a follow-on operation with tables on the same distribution. Whew! And why would we allow this to continue? Isn't a random distribution better than a Manhattan Skyline? Our problem remains: if the table has such a Manhattan Skyline distribution, we have higher than normal skew. Any full-scan on the table will cause the query to perform as slow as the "tallest bar"  (the SPU with too much of the table's data). As the table grows in size, the problem worsens. It is not a scalable distribution in its latent form, so don't embrace one without a plan.

 

Well, random distribution has a risk too, especially at the BI level, of negatively affecting concurrency performance. Even if our individual queries are not hindered by the data-broadcast incurred by the random distribution, they could just be a one-hit-wonder, because running many of these operations side-by-side can sometimes saturate the inter-SPU fabric, affecting concurrency. If we can keep the processing on the SPUs, we can avoid this problem entirely. So the issue is one of user scalability, something that all of us care about and that the other vendors (sometimes) turn a blind eye to. Netezza has it covered, and as usual, it's so simple a cave man could do it (now I'll get mail!)

 

So now we have two options, neither of which seem good - (a) keep the Manhattan Skyline distribution or (b) use a random one. Let me say that random is not always bad, but it poses a potential danger for concurrency. Likewise the Manhattan Skyline can often be a latent result of an intermediate CTAS so is unavoidable anyhow. And why would we want to preserve an existing distribution on a CTAS? The answer - because it will be a co-located write (blazingly fast). But wait! Don't we get a co-located write by default?

 

Maybe.

 

I have noted in prior posts how the default distribution for a CTAS might not be what we want or expected, so here's a quick recap:

 

(a) For simple single-table CTAS, it will preserve the source distribution key - (co-located write)

(b) For simple multi-table-join CTAS, it will leverage the first column result in the "select" clause (maybe a co-located write)

(c) For CTAS using summaries/group functions in the select, it will leverage the columns in the "group-by" clause (rarely a co-located write)

 

If any of the above are not the original distribution of the source(s), we could inadvertently sacrfice our co-located write. But we can preserve it if we specifically use "distribute on" with the CTAS execution. With co-located writes, this means the data never leaves the SPUs. If we distribute the CTAS on anything else, the data must leave its current SPU and find its way to another one. This initiates a data broadcast (and can negatively affect concurrency). Preserving the distribution, we get the benefit of a co-located write (avoiding broadcast to make the table) and set up the next operation for a co-located read (also avoid the broadcast to leverage the table). Short answer: preserving the distribution preserves concurrency performance. Now the SPUs are working for us at physics-speed.

 

Rather than just live with the latent effects, lets embrace and harness them for the good of all mankind. Well - er -  at least for our user base.

 

What we really want is threefold -

 

(1) preserve the distribution with a co-located write (preserve concurrency, potential Manhattan Skyline as latent artifact)
(2) leverage the result with a co-located read (preserve concurrency, potential penalty from Manhattan Skyline)
(3) mitigate the Manhattan Skyline with a zone map (ahh, best of all worlds)

 

So to get the first two, we can simply preserve the distribution with a "distribute on (key)" clause and make sure the distribution key is part of the "where/join" operations.. This is the simple part.

 

To get the third, we need to either (a) sort the data as it is created, or (b) make a materialized view after-the-fact to get the zone map effect for selected columns. The first one (sorting) is often easier than it sounds, and with strongly filtered intermediate tables is also very scalable. The second one (materialized view) has some caveats but is very fast to create. What does the zone map actually do? It effectively stripes each SPUs portion of the table so that only the section in the zone is actually addressed. Like McBride's algorithm, it's as though the rest of the data isn't even there, because the zone map has guided the optimizer to completely ignore it. So whether the SPU's data has a tall bar or a short bar, the performance is the same. We need all three of the above and the zone map mitigates the potential problem of unexpectedly high skew from an intermediate distribution - or an outlier table that we need to distribute on a common key. Even if (1) and (2) above give us a good distribution today, it could always "go Manhattan" in the future.

 

Another obvious question is "If this is an intermediate result, why bother? Just filter out the stuff I don't want and then there's no issue, right?" Well, technically yes, for a single operation, but I know of at least a dozen cases where the intermediate table is used for a lot of downstream activity, not just a one-off throwaway. So our stewardship rule is: make the data better. For the next downstream process or the ultimate data consumer, the data should get better every time we touch it.

 

Rather than rewrite or re-design a carefully tested and detailed process, adding a simple "order by" or MV is easy and preserves the existing logic, and data model, with little impact and high return. This is especially true of a static supporting table, because we can install what we need on the table's creation. The consuming processes all benefit from it with no more than regular query execution (materialized views are transparent).

 

In the end, we can still leverage the plain-vanilla parts of the Netezza performance model (zone maps, co-location) without having to over-engineer the data using indexes, intersection tables or summaries. This preserves something more  - the ongoing resilience and adaptability of the model itself.

 

Recap:

 

  • Apply the "distribute on" clause of the CTAS to avoid the latent effect of default distribution.
  • Preserve co-location for reads and writes in intermediate tables.
  • If a potential Manhattan Skyline distribution is the CTAS result, rather than go random, sort the CTAS result by a selected column or use a materialized view.
  • As always, apply strong filters to the CTAS creation so that it's not simply copying one table's contents to another (carve the data out).
  • Experiment for the best fit, but remember that Netezza is an appliance.
  • We don't need to engineer the queries, only apply simple performance model alignments in the data itself, to leverage the machine's physics
0 Comments Permalink
0

Many of those who integrate the mainstream BI tools into various underpinning data sources find subtle nuances. Not the least of which is how the database will respond to the queries presented. In Netezza data access especially, the power is not found in the query, but in the hardware. We can certainly degrade our experience with bad queries, but we would not tune queries in the same manner as with an SMP/RDBMS.

 

For example, I've watched RDBMS engineers work black-magic with a query by simply rearranging this-or-that in the monolithic query to provide boosts in the orders-of-magnitude. This is because the query is being used to guide the general-purpose physics. In Netezza, however, the purpose-driven physics snips the query apart. The physics then guides the query's mechanics. I've watched newbie Netezza folks nearly pull their hair out - and their eyelashes too! - when trying to "make the machine do what I want". Hmm, no, the machine does what it does. It's an appliance. We get what we want when we conform the data to the physics. The query is just along for the ride.

 

How does all this apply to multi-pass SQL in a BI Tool? Well, most BI tools come to the table with a pre-conceived notion that all databases are created equal. Unless they have specific VLDB hooks, and unless those hooks fully embrace VLDB principles, the BI tool will not experience the expected lift and we'll likely have to help it out. In fact, little about a BI tool is purpose-built in regards to its data source. It regards data sources as general purpose interfaces so it can be as vendor-neutral as possible.


Unlike a standard star-schema, many VLDB tables are fact-sized tables containing billions of rows, as are their dimensional counterparts. So a single one-shot query will sometimes provide the functional answer but with unacceptable performance. Many of us have seen multi-page (hey, 100+ page) queries that try to do everything in one shot. The average RDBMS leaves us few options. The VLDB and especially Netezza is not so constrained. We can make multiple passes on the data often with little penalty. The danger here is in the inefficiency of the passes, not whether multi-pass is okay. Multi-pass, or more appropriately multi-stage SQL,  is a necessary approach with large-scale tables. Netezza makes it simple and fast, using built-in concepts of its performance model.

 

Here is a spot case-study - a BI tool needed to access several tables that were each in the many billions of records. The end result was a summary of user-selected values. The temp-table creation here is done automatically by the BI-Tool, so we may have limited options in getting it to shape them as needed. In the examples below, I'll label the queries so we can reference them later.


A typical BI tool, upon realizing it needs a summary, will often divide the answer into multiiple stages of work. Each stage will store its result in a temporary table using a CTAS, leveraged in one or more following passes. Unfortunately these passes are sometimes inefficient. In the case below (this is pseudo-SQL, so bear with me here)


(1a) create t1 as select region, district, store, sum(transaction_amt) sumtran, sum(transaction_tax) sumtax from transactions where district_id=4 group by region, district, store;  (1 million records)

-

(1b) create t2 as Select  employee_id, employee_name, t2.store_id from employee_master t2, employee_lookup t3 where store_id=6 and t2.store_id=t3.store_id                   (500 records)

-

(1c) select store_id, employee_id, employee_name, sumtran, sumtax from  t1, t2 where t1.store_id = t2.store_id and t2.region_id in (41,42) and t1.store_id = 6;                     (450 records)

 

Note how in the above, the filter effects are largely applied last (1b and 1c) with the summaries applied first (1a). In this case, it is summarizing over a million values but it throws away over 90 percent of this result on the last operation, reducing 1 million records to 450. It is still accessing the larger table (transactions) only once. It just does it at the wrong time.

 

If we invert this chain and regard the filters first, we might see queries like this:

 

(2a) create t1 as select region, district, store, transaction_amt, transaction_tax from transactions where district_id=4 and region_id in (41,42) and store_id=6;            (15,000 raw records)

-

(2b) create t2 as Select  employee_id, employee_name, t2.store_id from employee_master t2, employee_lookup t3 where store_id=6 and t2.store_id=t3.store_id              (500 records)

-

(2c) select store_id, employee_id, employee_name, sum(transaction_amt) sumtran, sum(transaction_tax) sumtax from  t1, t2 where t1.store_id = t2.store_id;       (450 records)


In the above, the filters are pushed into the first part of the query chain (2a) to squeeze down the data sizes, but to also glean out the raw values for the final summary (transaction_amt, transaction_tax). The (2b) query is still a filter, but by the time we get to (2c) all we really need to do is summarize based on the intermediate table results. We don't have to "go back to the well" of the larger table. Everything we need for the final result is already in our hands, and a much smaller workload.

 

The simple inversion of the query order has significantly reduced the workload of the entire chain of events. This of course, does not answer whether our BI tool will actually implement the query in this order or manner. Anecdotally, with the above tables the original "transactions" table was over 30 billion very wide rows. The first query chain (1a-1c) takes no less than a minute, but only because key1 is zone mapped. The second query chain (2a-2c) takes 6 seconds or less, and it better represents a flow of data from larger-to-smaller, like a common source-to-target flow. It is easier to visualize and manage, and is more efficient.

 

Note: Can our BI tool shape a query chain in this manner? Can it glean out in the raw columns to an intermediate table, later summarizing on the intermediate? Or will it always require us to summarize at the outset and then squeeze out from there? Some BI tools are very close to this model already.

 

Yet another pernicious issue is not obvious from the above - temp table distribution. This last query chain, though 6 seconds in duration, is still a one-hit wonder. Once two or more users start hitting the machine, concurrency will reveal all. The machine is quickly saturated and all of the queries start to take more and more time. In one case of just five users on the machine, all of the queries took over a minute, and one took over five minutes. Concurrency tuning is a bread-and-butter issue, too, so what's going on here?

 

In both query chains, the CTAS is not being given explicit instructions on how to distribute its results. The outcome is unpredictable from the BI tool's perspective, but very predictable for us. When the CTAS result remains distributed on its original distribution, we get a co-located write. If the CTAS does not use the original distribution, it will have to redistribute the data, broadcasting it all over the SPUs. We need to avoid this because co-located writes are desireable and muey caliente.

 

The original distribution key for the transaction table is (transaction_id). This doesn't do us much good if we are later focusing on the store_id (2b, 2c) as the primary distribution. In order for the final activities to be as quick as possible, we need to bridge the transactions into the store_id. We could set up data structures to do this, but in the end with so few records coming off the transaction table in the (2a-2c) chain, an intermediate broadcast is already in the mix. We can do it deliberately under our control, or allow it to use CTAS defaults. In this case, the CTAS default is worse.

-

In the first chain of queries (1a-1c), we would expect to see the following CTAS defaults:

 

(1a) - distributed on (region, district, store) because this is the group-by clause. It cannot use transaction_id for a co-located write because it's not even in the result set. Those who understand distribution keys know that this is not an optimal state of affairs.
-

(1b) - distributed on (employee_id) because it happens to be the first column in the select-clause. This query uses two tables in the join, so
     CTAS will opt for using a column in the select clause.

 

So in this case, the CTAS will not preserve the original distribution or even a useful distribution. Don't get me wrong here. CTAS defaults are acceptable in over 90 percent of cases. This example is offered as a typical one-off of BI automated query construction. The first query (1a) will produce a million records (and honestly, some cases it produced a couple of billion records) we really need some optimization here.


If we were to take (2a) and (2b) above to deliberately enforce the distribution, we would use the "distribute on (store_id)", but we would have to include store_id in the result set. In each case, this would prepare both tables for the final query (2c) for a co-located join.

 

Note: This brings up another BI tool issue, in that we need to affect the order of the sequence, and also provide for columns that are adminstrative (like store_id) but not part of the final result. Some BI tools are picky this way. If the column is not required in the final reporting output, it trims or ignores the need for the column in the intermediate tables.

 

To continue, we have now pushed the workload into the physics, not the query itself. But as noted, concurrency is the test.  This final chain of co-located queries then returned in less than 3 seconds, and did not grow beyond 4 seconds until 20 users were running the same query at the same time, and even then tended to hover between 3 and 5 seconds as even
more users were added. Isn't this the kind of scalable performance we want?

 

Additional note: If we really want to push this harder, it would be best for us to manufacture a "store_transactions" table that is distributed on the store_id already (for the 2a query). This would be a report-facing table that essentially mirrored the transactions table, but only carrying the high-traffic reporting columns. In this way, the store_id becomes the universal distribution even for the very first query. Keep in mind that while this strategy may cost disk space, it will further eliminate concurrency issues. I am not a big fan of preserving disk space when performance issues are in play. We will still need to perform a "distribute on (store_id)" for each (2a,2b) but it will preserve the distribution with a co-located write.

 

But we can see, the two protocols we will need in play from the BI tool is to use capture-filtration-summary, and then also apply distribution keys deliberately to the first passes to preserive distribution. We often apply these very same protocols in ELT because they make sense. But we have complete, detailed control of query construction in ELT, not so in the BI Tool world.

 

Conclusion: Rather than use a BI tool's default of summary-filter chain, what we need is capture-filter-summary chain. This guarantees that we can leverage the VLDB physics, but also moves the data from larger-to-smaller in the most efficient manner.

 

Recap for Multi-Stage SQL:

  • especially for summary data, should perform the summary as the final operation, with capture-and-filtration in the first passes. This allows the final operation to be a simple summary, since all the filtration has already been applied. In other words, no more where-clause activity apart from the join criteria.
  • Organize the tables (including additional tables) on the distribution key in play. Bridging one distribution to another can give us the performance, but if broadcasting it can eventually create a concurrency problem
  • the chain should not address the same large table more than once. Get everything we will need and get out - don't keep coming back for something the first pass did not get.
  • the chain should capture raw information into an intermediate table, foregoing the summary until the final operation.
  • should provide a means to bridge one distribution key into another, for maximum efficiency, rather than using CTAS defaults.
  • should perform filtration at the outset, as a method toward attacking the larger table(s) with zone maps etc.. Move from larger data sets to smaller ones.
  • should preserve distribution to leverage co-located write and read where possible. This maximizes overall performance but also optimizes concurrency.


What if the BI tool will not, as a general-purpose tool, perform these deliberate and purposeful query chains? At this point, we need to have a heart-to-heart with the BI Tool vendor stating our concerns. Assume the best, that the tool vendor may eventually fix the issue, just not in time to help us now. We then need to consider two purpose-built options, each of which has its own issues. These are offered in the spirit of temporary adaptation until the BI tool is smart enough to bypass them.

 

Summary tables: These are often constructed to prop up database performance issues. They are just as viable for functional reasons, such as providing data in a form that is only available and most efficient when summarized, or to intersect details with pre-summarized data. But if used as a performance prop or BI Tool helper, put some effort into making it an adaptation that could be deprecated when the BI Tool is smarter. This way, we're not committed to it forever.

 

Stored procedures: Used in an appliance as an adaptation mechanism (in this context). Effectively bridges the BI tool to the data with a temporary procedural construct (the procedure) rather than a more permanent structure (like a summary table). Stored procedures pull application features down to the database level and adapt the BI tool into the Netezza performance model.

 

When or whether to use either of the above is always a design decison, not necessarily dictated by the tools themselves. But keep in mind the idea of temporary adaptation. I am always of the mindset that the warehouse and BI environment must exist with the expectation of change, so in general, adaptability and adaptation concepts are always desireable. They allow us to be more responsive to future requirements

0 Comments Permalink