[ www.netezza.com ]
0

"You didn't kill it," fumed the customer, "You said you would kill it."

 

"We've had some, er, labor setbacks," said Bjorn, head of DragonSlayers Inc, a startup boutique firm from several valleys away.

 

"I don't see an excuse clause in the contract," the customer shot back, "Kill the dragon or we're done."

 

"The dragon can't be killed," said a rich Scottish voice striding up to meet them.

 

Bjorn recognized the stealthy character, by name of Connery, from the Information Superhighway Roadside Assistance Service.

 

"I didn't realize that RAS was in the area," Bjorn quipped, offering a hand to Connery.

 

Connery grasped Bjorn's hand and shook it once, "We're all over. Been doing a little cleanup of this or that."

 

"What's this about the dragon," asked the customer, "That it can't be killed?"

 

"Of course not," Connery smiled, "It's a dragon. It's immortal."

 

"Did you know this?" the customer glared at Bjorn, "Have you been stringing us along?"

 

"No," Bjorn defended, "We kill dragons. It's what we do."

 

"Well," Connery chuckled, "Not real dragons, anyhow."

 

The customer's lackey approached them with a small flagon of tea, poured a stein for each of them, and departed.

 

"The dragon is immortal," Connery muttered, sipping his tea.

 

"That's impossible," Bjorn said through a long gasp, "We've killed dragons before - we "

 

"But of course you have," Connery smiled dismissively, drawing another casual sip.

 

Bjorn stared at him, unable to form another word.

 

"If the dragon can't be killed," asked the customer, "Then what?"

 

"In the nether worlds, beyond the mapped regions, you'll see little notation There Be Dragons," Connery said softly, "And whether there be dragons or not, it's uncharted territory. Places no man has ventured, but rest assured danger lurks. Unknown to the uninitiated."

 

"So you know what lies in the uncharted territories?" Bjorn sneered.

 

"It's why I'm a guide and you're a dragonslayer," Connery huffed, "Whether you know your way or not, dragon chow comes in many shapes and sizes," he put his hands up as if to size-up Bjorn, "Many shapes and sizes."


"Funny," Bjorn quipped, but it clearly wasn't funny, "All we have to do is get close enough."


"Reminds me of a time," Connery said wistfully, "Once I knew a man who you could skewer a hundred times and he'd still get right back up."

 

"Ahh, the Highlander," said Bjorn, "I've heard of him."

 

"Well, he never lost his head," Connery huffed, "Or that would've been the end of him."

 

"What are you saying?"

 

"The treacheries of the lands beyond are many. You have to keep your wits about you. Keep your head."

 

"Keep my head, got it," Bjorn said sarcastically, "Anything else?"

 

"You need to deal with the whole dragon," Connery advised, "Not just the part you wrap with that silly leash. It won't hold the dragon. Only a dungeon will."

 

"So we need an enchanter?" Bjorn smirked.

 

"In no uncertain terms," Connery said, laughing, "You have a go at that dragon on your own. Go in there with no more than an enchanter's bag of tricks, and he'll make an ash out of you!"

 

Bjorn gulped, "We'll see about that!"

 

One of the lackeys turned to the other and chortled, "He thinks he's James Bond!"

 

"What do you know about it?" Connery shot back with piercing eyes, "The dragon sends your consultants to the street and you send the dragon to the morgue. Is that how it's done in data warehousing?"


"Basically, yes," snickered a lackey.


Connery whirled, "No morgue will hold him." He turned to the customer and glared hotly, "What are you prepared to do?"


"Sign the contract," said the customer, quickly applying a signature. He stuffed the papers into Connery's hands and hastily departed, leaving the men to set sail and dispatch the dragon as soon as possible.


The boat ride to the dragon's coast was uneventful until the boat ran aground near the shore, screeching loudly against the rocks as its keel protested with a deep, gutteral groan.

 

"That's noise will stir the dragon," Bjorn bemoaned. He'd hoped for a more stealthy entrance.

 

"Hopefully only stirred," Connery quipped as he snatched up his bag, "Not shaken. Won't do to have him awake when we approach, right?"

 

"Coastline is enormous," Bjorn complained, "How will we ever pinpoint his location?"

 

"To find the dragon, you'll need to think his thoughts. Know your adversary. Know his heart."

 

"Yeah, Dragonheart," chuckled a lackey, "Seen the movie."

 

Connery ignored him and leapt from the boat onto the dry shore. "Welcome to the Rock," and then looked out over the vast, scorched wasteland, a product of the dragon's handiwork. He led the team up the rocky slope to the first rise, whipped out his spyglass and waved his hand to the others to belay their ascent.

 

"What's he doing?" asked one lackey to another.

 

"Lookin' around, I guess," smickered one, "Guess nobody told him that the dragon sleeps all day."

 

"What was that?" Connery whispered loudly enough for them to hear, "You think the dragon sleeps all day? Who are you kidding? Maybe you only struggle with him in his lair at night, but he breathes fire all day long. He never sleeps. He never dies."

 

"Where did we find this kook?" asked another, "He's as nutty as a fruitcake."

 

"He'll eat you alive," Connery sneered, trying to spot motion anywhere along the landscape before proceeding. In the distance, a dank mist arose from the ground near some caves. Connery zoomed in and spied dragon scales littering the ground. "Let's go."


The team made the tedious crossing without incident, until they stood before the open, reeking maw of the dragon's lair.

 

"Who wants to go first?" Connery chuckled.

 

"I will," said a lackey fearlessly, "I've taken down enough of these."

 

"But of course you have," Connery strode to the nearest large boulder while the others scattered for cover. After several tedious minutes, all of them could now feel the impact tremors shaking the ground, growing in intensity as the beast ascended from his lair to the cave's mouth.


Then the horns appeared, fifty feet from point-to-point as they slowly rose from the hole. Then the head,larger than a common city bus and almost twice as long. The dragon stared down the lackey for a long moment, then continued to ascend from the hole, growing larger and more hideous with each passing second until his entire upper body was revealed, from his head down to his midsection, standing over ten stories tall. He burst-extended his massive wing membranes with a loud, deafening snap, and then pointed his head straight up to gather a deep breath of air.

 

Connery reached down to pick up one of the many dragon scales scattered all over the ground. Five inches across and eight inches long, made of the most impervious stuff on earth. He flipped it over and shuddered to realize the dragon's age, betrayed in the scale's growth rings. Four thousand years, this animal had been eating and breathing fire.


The young lackey had forgotten to breathe. This dragon was orders-of-magnitude larger than any dragon he'd ever dealt with. In fact, the sheer scale of the dragon made him feel light-headed. Gathering his presence of mind, he took a defiant stance and shouted, "Begone, Dragon!"

 

Connery turned away, trying to hold back a snicker that could reveal his location to the dragon's attenuated senses.

 

The dragon pointed his nose straight down, cocked his head to the side, opened his mouth and released his breath. The column of high-intensity chemical fire blasted downward on the lackey, instantly reducing him to ash and causing the rocks all around where he'd stood to glow and almost melt.

 

Connery glanced over to the rest of the team, cowering behind the rocks in hiding, not believing that the dragon was so huge and powerful, and feeling completely beyond their depth. They stared, partly in awe and partly in concern, as Connery stepped out from behind his hiding place and boldly strode up to the dragon's cave.

 

The dragon once again drew breath into his nostrils to recharge his furnace, when Connery simply placed his hands behind his back and stared deeply into the dragon's eyes.

 

The dragon stared back, unable to comprehend the feeling of drowsiness suddenly overtaking him. He slowly lowered his head, then his body, down to the ground to gently lay next to Connery, unable to break his eyes away from Connery's deep, mesmerizing gaze.

 

Once completely settled, Connery reached out to tap the dragon's front jawbone as it drifted off to sleep, "There now," Connery said soothingly, "That's a good lad."

 

"How is this possible?" Bjorn gasped, stunned at how easily Connery had mastered the beast.

 

"Your friend told the dragon to leave," Connery huffed, "But the dragon isn't going anywhere. He lives here and you people don't. In fact, he's been around so long, and you people come and go so often, that he sees you as decorations, not even permanent fixtures in his home."

 

"But he just laid his head down and went to sleep," Bjorn noted, "How did you do it?"


"The dragon serves me," Connery said slowly, "Not the other way around. If the dragon needs to breathe fire, it's because we've not done a good job harnessing the dragon, not just because the dragon is mean."

 

"So dragon's aren't mean?"

 

"Oh, their born mean," Connery chuckled, "And they bite. Whom they bite and when, is ours to control. That's why we have dungeons. Places where the dragon will survive but under our control. Think about putting that dragon's breath to work in boiling water, making steam to run a turbine. Now the dragon is working for us."

 

"Can't be a happy existence for him."

 

"Happy? Perhaps not. Necessary? Most definitely. You came here to kill him or banish him. He knows his place. He only responds to someone who knows it as well as he does."

 

"You're an enchanter, aren't you?" Bjorn said, realizing Connery's identity.

 

"Some call me, Tim."

0 Comments Permalink
0

Many of those who integrate the mainstream BI tools into various underpinning data sources find subtle nuances. Not the least of which is how the database will respond to the queries presented. In Netezza data access especially, the power is not found in the query, but in the hardware. We can certainly degrade our experience with bad queries, but we would not tune queries in the same manner as with an SMP/RDBMS.

 

For example, I've watched RDBMS engineers work black-magic with a query by simply rearranging this-or-that in the monolithic query to provide boosts in the orders-of-magnitude. This is because the query is being used to guide the general-purpose physics. In Netezza, however, the purpose-driven physics snips the query apart. The physics then guides the query's mechanics. I've watched newbie Netezza folks nearly pull their hair out - and their eyelashes too! - when trying to "make the machine do what I want". Hmm, no, the machine does what it does. It's an appliance. We get what we want when we conform the data to the physics. The query is just along for the ride.

 

How does all this apply to multi-pass SQL in a BI Tool? Well, most BI tools come to the table with a pre-conceived notion that all databases are created equal. Unless they have specific VLDB hooks, and unless those hooks fully embrace VLDB principles, the BI tool will not experience the expected lift and we'll likely have to help it out. In fact, little about a BI tool is purpose-built in regards to its data source. It regards data sources as general purpose interfaces so it can be as vendor-neutral as possible.


Unlike a standard star-schema, many VLDB tables are fact-sized tables containing billions of rows, as are their dimensional counterparts. So a single one-shot query will sometimes provide the functional answer but with unacceptable performance. Many of us have seen multi-page (hey, 100+ page) queries that try to do everything in one shot. The average RDBMS leaves us few options. The VLDB and especially Netezza is not so constrained. We can make multiple passes on the data often with little penalty. The danger here is in the inefficiency of the passes, not whether multi-pass is okay. Multi-pass, or more appropriately multi-stage SQL,  is a necessary approach with large-scale tables. Netezza makes it simple and fast, using built-in concepts of its performance model.

 

Here is a spot case-study - a BI tool needed to access several tables that were each in the many billions of records. The end result was a summary of user-selected values. The temp-table creation here is done automatically by the BI-Tool, so we may have limited options in getting it to shape them as needed. In the examples below, I'll label the queries so we can reference them later.


A typical BI tool, upon realizing it needs a summary, will often divide the answer into multiiple stages of work. Each stage will store its result in a temporary table using a CTAS, leveraged in one or more following passes. Unfortunately these passes are sometimes inefficient. In the case below (this is pseudo-SQL, so bear with me here)


(1a) create t1 as select region, district, store, sum(transaction_amt) sumtran, sum(transaction_tax) sumtax from transactions where district_id=4 group by region, district, store;  (1 million records)

-

(1b) create t2 as Select  employee_id, employee_name, t2.store_id from employee_master t2, employee_lookup t3 where store_id=6 and t2.store_id=t3.store_id                   (500 records)

-

(1c) select store_id, employee_id, employee_name, sumtran, sumtax from  t1, t2 where t1.store_id = t2.store_id and t2.region_id in (41,42) and t1.store_id = 6;                     (450 records)

 

Note how in the above, the filter effects are largely applied last (1b and 1c) with the summaries applied first (1a). In this case, it is summarizing over a million values but it throws away over 90 percent of this result on the last operation, reducing 1 million records to 450. It is still accessing the larger table (transactions) only once. It just does it at the wrong time.

 

If we invert this chain and regard the filters first, we might see queries like this:

 

(2a) create t1 as select region, district, store, transaction_amt, transaction_tax from transactions where district_id=4 and region_id in (41,42) and store_id=6;            (15,000 raw records)

-

(2b) create t2 as Select  employee_id, employee_name, t2.store_id from employee_master t2, employee_lookup t3 where store_id=6 and t2.store_id=t3.store_id              (500 records)

-

(2c) select store_id, employee_id, employee_name, sum(transaction_amt) sumtran, sum(transaction_tax) sumtax from  t1, t2 where t1.store_id = t2.store_id;       (450 records)


In the above, the filters are pushed into the first part of the query chain (2a) to squeeze down the data sizes, but to also glean out the raw values for the final summary (transaction_amt, transaction_tax). The (2b) query is still a filter, but by the time we get to (2c) all we really need to do is summarize based on the intermediate table results. We don't have to "go back to the well" of the larger table. Everything we need for the final result is already in our hands, and a much smaller workload.

 

The simple inversion of the query order has significantly reduced the workload of the entire chain of events. This of course, does not answer whether our BI tool will actually implement the query in this order or manner. Anecdotally, with the above tables the original "transactions" table was over 30 billion very wide rows. The first query chain (1a-1c) takes no less than a minute, but only because key1 is zone mapped. The second query chain (2a-2c) takes 6 seconds or less, and it better represents a flow of data from larger-to-smaller, like a common source-to-target flow. It is easier to visualize and manage, and is more efficient.

 

Note: Can our BI tool shape a query chain in this manner? Can it glean out in the raw columns to an intermediate table, later summarizing on the intermediate? Or will it always require us to summarize at the outset and then squeeze out from there? Some BI tools are very close to this model already.

 

Yet another pernicious issue is not obvious from the above - temp table distribution. This last query chain, though 6 seconds in duration, is still a one-hit wonder. Once two or more users start hitting the machine, concurrency will reveal all. The machine is quickly saturated and all of the queries start to take more and more time. In one case of just five users on the machine, all of the queries took over a minute, and one took over five minutes. Concurrency tuning is a bread-and-butter issue, too, so what's going on here?

 

In both query chains, the CTAS is not being given explicit instructions on how to distribute its results. The outcome is unpredictable from the BI tool's perspective, but very predictable for us. When the CTAS result remains distributed on its original distribution, we get a co-located write. If the CTAS does not use the original distribution, it will have to redistribute the data, broadcasting it all over the SPUs. We need to avoid this because co-located writes are desireable and muey caliente.

 

The original distribution key for the transaction table is (transaction_id). This doesn't do us much good if we are later focusing on the store_id (2b, 2c) as the primary distribution. In order for the final activities to be as quick as possible, we need to bridge the transactions into the store_id. We could set up data structures to do this, but in the end with so few records coming off the transaction table in the (2a-2c) chain, an intermediate broadcast is already in the mix. We can do it deliberately under our control, or allow it to use CTAS defaults. In this case, the CTAS default is worse.

-

In the first chain of queries (1a-1c), we would expect to see the following CTAS defaults:

 

(1a) - distributed on (region, district, store) because this is the group-by clause. It cannot use transaction_id for a co-located write because it's not even in the result set. Those who understand distribution keys know that this is not an optimal state of affairs.
-

(1b) - distributed on (employee_id) because it happens to be the first column in the select-clause. This query uses two tables in the join, so
     CTAS will opt for using a column in the select clause.

 

So in this case, the CTAS will not preserve the original distribution or even a useful distribution. Don't get me wrong here. CTAS defaults are acceptable in over 90 percent of cases. This example is offered as a typical one-off of BI automated query construction. The first query (1a) will produce a million records (and honestly, some cases it produced a couple of billion records) we really need some optimization here.


If we were to take (2a) and (2b) above to deliberately enforce the distribution, we would use the "distribute on (store_id)", but we would have to include store_id in the result set. In each case, this would prepare both tables for the final query (2c) for a co-located join.

 

Note: This brings up another BI tool issue, in that we need to affect the order of the sequence, and also provide for columns that are adminstrative (like store_id) but not part of the final result. Some BI tools are picky this way. If the column is not required in the final reporting output, it trims or ignores the need for the column in the intermediate tables.

 

To continue, we have now pushed the workload into the physics, not the query itself. But as noted, concurrency is the test.  This final chain of co-located queries then returned in less than 3 seconds, and did not grow beyond 4 seconds until 20 users were running the same query at the same time, and even then tended to hover between 3 and 5 seconds as even
more users were added. Isn't this the kind of scalable performance we want?

 

Additional note: If we really want to push this harder, it would be best for us to manufacture a "store_transactions" table that is distributed on the store_id already (for the 2a query). This would be a report-facing table that essentially mirrored the transactions table, but only carrying the high-traffic reporting columns. In this way, the store_id becomes the universal distribution even for the very first query. Keep in mind that while this strategy may cost disk space, it will further eliminate concurrency issues. I am not a big fan of preserving disk space when performance issues are in play. We will still need to perform a "distribute on (store_id)" for each (2a,2b) but it will preserve the distribution with a co-located write.

 

But we can see, the two protocols we will need in play from the BI tool is to use capture-filtration-summary, and then also apply distribution keys deliberately to the first passes to preserive distribution. We often apply these very same protocols in ELT because they make sense. But we have complete, detailed control of query construction in ELT, not so in the BI Tool world.

 

Conclusion: Rather than use a BI tool's default of summary-filter chain, what we need is capture-filter-summary chain. This guarantees that we can leverage the VLDB physics, but also moves the data from larger-to-smaller in the most efficient manner.

 

Recap for Multi-Stage SQL:

  • especially for summary data, should perform the summary as the final operation, with capture-and-filtration in the first passes. This allows the final operation to be a simple summary, since all the filtration has already been applied. In other words, no more where-clause activity apart from the join criteria.
  • Organize the tables (including additional tables) on the distribution key in play. Bridging one distribution to another can give us the performance, but if broadcasting it can eventually create a concurrency problem
  • the chain should not address the same large table more than once. Get everything we will need and get out - don't keep coming back for something the first pass did not get.
  • the chain should capture raw information into an intermediate table, foregoing the summary until the final operation.
  • should provide a means to bridge one distribution key into another, for maximum efficiency, rather than using CTAS defaults.
  • should perform filtration at the outset, as a method toward attacking the larger table(s) with zone maps etc.. Move from larger data sets to smaller ones.
  • should preserve distribution to leverage co-located write and read where possible. This maximizes overall performance but also optimizes concurrency.


What if the BI tool will not, as a general-purpose tool, perform these deliberate and purposeful query chains? At this point, we need to have a heart-to-heart with the BI Tool vendor stating our concerns. Assume the best, that the tool vendor may eventually fix the issue, just not in time to help us now. We then need to consider two purpose-built options, each of which has its own issues. These are offered in the spirit of temporary adaptation until the BI tool is smart enough to bypass them.

 

Summary tables: These are often constructed to prop up database performance issues. They are just as viable for functional reasons, such as providing data in a form that is only available and most efficient when summarized, or to intersect details with pre-summarized data. But if used as a performance prop or BI Tool helper, put some effort into making it an adaptation that could be deprecated when the BI Tool is smarter. This way, we're not committed to it forever.

 

Stored procedures: Used in an appliance as an adaptation mechanism (in this context). Effectively bridges the BI tool to the data with a temporary procedural construct (the procedure) rather than a more permanent structure (like a summary table). Stored procedures pull application features down to the database level and adapt the BI tool into the Netezza performance model.

 

When or whether to use either of the above is always a design decison, not necessarily dictated by the tools themselves. But keep in mind the idea of temporary adaptation. I am always of the mindset that the warehouse and BI environment must exist with the expectation of change, so in general, adaptability and adaptation concepts are always desireable. They allow us to be more responsive to future requirements

0 Comments Permalink