[ www.netezza.com ]

Gather 'round the Grill


Team Makeup

Posted by David Birmingham Aug 28, 2010

As some of you know, here in Texas the sport of football is a bit more than a team sport. In some locales, it's practically royalty. Back in high school and college, I was ever-aware of the sports fans that liked to wear the team colors. One in particular invited me to lunch while his parents were in town (I already knew them from back home) but I never knew that his father owned a car that was actually painted in Dallas Cowboys colors. Another colleague of mine was stuck in a parking garage with a dead battery as I was leaving work, and asked if I could help him with a jump. Upon saying yes, he produced a set of maroon-and-white jumper cables, honoring Texas A&M University. Oh yeah, he was a fan.

 

Then several of my friends kept large jars of team-color makeup and would smear it all over themselves before a game. No, not to go see the game at the stadium, but in their apartment! Yes, the team makeup was always a big hit, and their parties were solely centered on the tiny color TV in the center of their rather modest apartment. I had once wondered what the people next door thought of their shouts, whoops and antics, until I learned that the folks next door were just as rowdy with their own team. Hey, ya gotta be a fan.

 

Okay that's not the team makeup I'm talking about here, but those were just such funny memories I thought I'd share. And to make a point of course. The enthusiasm of your developers and architects, and their desire to cheer you on with your goals to achieve, are directly dependent on the team makeup. Of people, that is. So what does it take to make a data warehouse? Or for that matter, what does it take to effectively roll out a new one, or migrate an old one?


What most people fail to estimate in taming such a beast, is the level of testing required to make it a reality. As an example, many of you must produce documents as part of your regular workday. Those documents are often hard to write, but even more work to proofread. In fact, proofreading is the same form of content-and-context testing we would do for a data warehouse. The chief reason is the product - information and knowledge. Business intelligence is the same way - it has a way of taking on a life of its own, but the only way we can reliably roll out a viable business intelligence platform, is to test, test and test some more. Eyeballs-to-page however, may be required for a book or document, but it won't scale for a warehouse.

 

Many people don't realize that the testing portion of the warehouse can take as much as 80 percent of the project's resources. While we can compress this somewhat with agile methods, we cannot afford to test such quantities of data with simplistic manual approaches. And by that I mean eye-ball examination or screen-shot testing. No, the majority of testing is in the data itself, and on a Netezza machine it's in the billions of records. Eyeballs don't have the bandwidth. We need to use the actual power of the machine to scale this mountain.

 

So what should the team itself look like? May I suggest that for every Architect you would have several developers, and for every developer you would have two or more testers. Ideally three testers for each developer, regardless of how many developers are actually doing the work. I will also suggest that you keep the total count of developers rather lean. Five is perhaps overkill for the back end. For the front end, three solid developers can be an army, and five is about the upper limit. The reason is simple: logistics. If you have five developers on the back end and five on the front end, and three testers in the wake of each, this is a team of 40 people - which quite frankly is overkill in any sense of the word. Not to say that an overall team might not be comprised of 40 people once we include all of the infrastructure folks, but not for pure develop-and-test. We can and should make it leaner.

 

As an aside, I had the rather disturbing experience, numerous times, of encountering folks who worked these things out with overblown spreadsheets that they normally used for application development estimations. A data warehouse gig is completely different from an application development gig. But of course, if one of these spreadsheet guys ever showed up and plugged in his metrics, he would spout off that we need a 30-person team to migrate a couple of tables from one machine to the new one. Once this number is in the air, it becomes the de-facto standard by which all discussions are measured, even though it is completely wrong. In another setting, another spreadsheet-guy plugged in his numbers and characterized a project as a $900k gig when our competitors were bidding $300k for the same work. Knock yourself out, dude, because the client ain't a-bitin' three times more expensive projects because they like our faces. True to form, the $300k bid actually won. But the irony was, that the potential client had no desire whatsoever to pay more than about $400k, so the bid fit their budget just fine. The eventual winner of the bid took a bath, however. The truth is always somewhere in the middle.

 

I still say, watch those spreadsheet-guys. Somethin' up with that.

 

Perhaps it goes without saying that an architect needs to lay out a framework so that all can work comfortably in the same sandbox. This is a challenge and should not be left to the developers to forge on their own. Harnessing it later will be impossible, because too many opportunities for flow-based consolidation will be lost. Workarounds and repetitive logic will become the rule. Let's not go there.

 

If we have, say, three solid developers each in the back end and the front end, they can and should cross-test each other's work. In this case, we have the senior developers and architect working on the core logic, and the junior developers bench-checking their work and zipping it up for a formal testing team. Here we have a synergy: a senior developer can crank out ten times the work and quality of the juniors (so say DeMarco and Lister; your actual mileage may vary). We would not want to put a junior developer in the front seat of this chain, because the testers will be waiting on him more often than not. But with the ten-times-more-power driving the front like a locomotive, the junior developers can wrap up the many tactical areas of the warehouse and cross-test each other, but also receive the work products from the senior developer.

 

Now think about what this kind of model means. The senior developer is literally force-feeding the pipeline with work products and is doing it with the highest quality the team has to offer. The junior developers are learning from the senior without injecting their own inexperience into the mix, which will invariably have to be reviewed by the senior developer anyhow. No, the senior developer is more productive and experienced, so let him/her drive. Seems like every senior developer I talk to, they really, really want to develop and have the testing tedium off their plates. The junior developers really, really want to learn from a senior developer, and of course want to do some development themselves. I'm not saying this is off-limits, but the senior developer can delegate-what-he-knows to the junior developers because they cannot go too far astray without his guidance anyhow, so it's a win-win. And of course, I and every other person who was ever a junior developer had to pay our dues, so not everyone can be the leader. I don't say this dismissively, but we know in a business intelligence project there has to be a driving mind. Too much consensus means too little leadership, and in the famous words of Margaret Thatcher "Consensus is the absence of leadership".

 

But for the people doing the testing, they need something that will scale. To billions of records. And it had better be solid on the first round or they will be playing catch-up for every round after that. While writing and proofreading a document is an eyeballs-only model, don't you think I could at least do myself a favor and run a spell-check and grammar-check on the contents? Such set-based operations resolve a world of problems and let me focus my eyeballs on the harder stuff. But in a data warehouse, our eyeballs will never have enough bandwidth, and will never scale to the necessary heights. Set-based testing is all we have, but it's also all we need. And with a Netezza machine, we're so in the zone.

 

Now testing of the report screens can involve eyeball-based activities but doesn't have to be so egregious. Automated testing tools go a long way to mitigate the necessity for eyeballs on these as well (for the subjective parts like positioning, banners or colors especially). However, if the data is wrong, no amount of pretty-pretty will fix it. As Murphy would say "Beauty is only skin deep, but ugly goes to the bone."

 

Now, no sooner will I write this than I will get feedback from those junior developers who say that they have been relegated, but not to fear. This particular article is in context of a high-productivity bubble of work, normally found with new projects or migrations. The priority is not to make people feel better about their role, but to get past the workload so that everyone can feel better about the work products. I am always looking for opportunities to stretch the developers, both junior and senior. When a junior is ready to sit in the driver's seat of the locomotive, it's because he's passed the DeMarco and Lister smoke test. Now what the heck is that, anyhow?

 

Get a copy of DeMarco and Lister's Peopleware, a classic in every decade. Something they have empirically measured is that a junior developer will start out at one level of productivity, and then in a sudden epiphany will transform into someone who is ten times more productive than before. Something mentally and/or emotionally clicks and they get this whoosh. They claim the timing is different for each developer, but it usually takes about two years to make this transition. This is perhaps one reason why so many job-search requirement listings show "X years of experience in Y" and the "X" is never less than two years. Not because the poster has ever read Peopleware, but we who are in the field want folks who are 2 years along because we already know they have (at least) transitioned into a high-productivity asset.

 

But this is the mechanism driving the team makeup - and the experience of the developers and their known levels of productivity should help us find the right role for them on the team. We don't want a low productivity person in the locomotive chair. But having one in the wake of a strong developer only makes them stronger and exposes them to practices that will accelerate their transition into the higher productivity person we always wanted anyhow. And then, of course, once the person has made the 10x transition and is self-aware of their value, we have another problem: They are self-aware of their salary level too! Making someone stronger makes them more valuable. Be prepared to recognize the value (or rest assured that your competitor will). But all this, is the nature of the beast we purport to tame, no?

 

Back to set-based testing. This has as much to do with using the right data as it does the right method. The right data means - select a subset of known data that will deliberately exercise all of your business rules and software paths. Nothing is worse than realizing such errors in production. Then, we need set-based testing methods. This means we need three primary assets: (a) source data that we can sluice through our application transforms to get a result (b) a saved baseline result to compare this recent result against and (c) tested components that compare these two results in a reliable manner so that we get a statistical report on what passed and what didn't, and a detailed report on what specific records didn't make the cut. Counts, amounts, checksums and summaries all reveal deviations, especially for regression testing. You might recognize this as an exception report, and this is exactly the spirit of the effort. Our testing has to deal with statistical exceptions, because it is the only practical and scalable way to validate billions of rows.
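As a rough sketch of assets (b) and (c), the comparison itself can be a handful of set-based statements that a small script emits. Everything named here is an assumption for illustration: the tables `baseline_result` and `current_result`, the column `tran_amt`, and the availability of the `except` set operator on your platform. The script only prints the SQL; how you submit it to the machine is up to you.

```shell
#!/bin/sh
# Hypothetical table names: a saved baseline and the result of the latest run.
BASELINE=baseline_result
CURRENT=current_result

# Summary check: a row count and an amount total for one table.
# tran_amt is an assumed column name.
summary_sql() {
  printf 'select count(*) as row_cnt, sum(tran_amt) as amt_total from %s;\n' "$1"
}

# Detail check: rows present in the first table but missing from the second.
# Run it in both directions to catch both missing and unexpected rows.
exception_sql() {
  printf 'select * from %s except select * from %s;\n' "$1" "$2"
}

summary_sql "$BASELINE"
summary_sql "$CURRENT"
exception_sql "$BASELINE" "$CURRENT"
exception_sql "$CURRENT" "$BASELINE"
```

The two summary statements give the statistical pass/fail picture, and the two exception statements give the record-level report of what did not make the cut.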

 

And also notice that such a practice would be the kiss-of-death in many other "secondhand" data warehouse platforms. Such platforms are in no wise optimized to compare monstrous sets of data to each other column-for-column, row-for-row. Queries like that can dim-the-lights and may not return for hours, if not days. We cannot afford a protracted testing phase, and with Netezza we don't have to. Scan times and comparison times are very objective and knowable. The tests will take the same amount of time each time they run, and we always have the option to optimize them further with the Netezza performance model. Power is in the physics.

 

And again, why all the focus on testing? I have seen data warehouses blind-side an organization that planned for 80 percent development and 20 percent testing when, more often than not, exactly the opposite is true. Under that (wrong) model, a two-month development effort looks like at most a three-month project. Why then does it metastasize into a ten-month effort? Because those two months of development are really only 20 percent of the work, and the remaining 80 percent (8 months) is testing.
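For those who like to see the arithmetic, here is a small sketch. The function name `total_months` is made up for this example; it simply divides the development months by the share of the project that development is assumed to represent.

```shell
#!/bin/sh
# Total project months implied by a development effort (in months) and the
# percentage of the whole project that development is assumed to represent.
total_months() {
  awk -v dev="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", dev * 100 / pct }'
}

total_months 2 80   # the wrong model: development is 80 percent, prints 2.5
total_months 2 20   # the usual reality: development is 20 percent, prints 10.0
```

Same two months of development, but the second model leaves eight months of testing that the first one never budgeted for.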

 

That is, if we just embrace the standard model. By embracing the aforementioned model, we get the development out of the way quickly and deliberately, entering the testing phase much sooner, and if more heads are deliberately dedicated to set-based testing we can close this part off even sooner. I have watched very-large-scale projects, with a Netezza team in the middle and strong developers in the locomotive seat, enter their first UAT phase within two months of the project's inception. The funny thing is, the model requires rapid turnaround that only the Netezza workhorse can provide. Try pulling off this team makeup with any other lower-productivity technology, and it won't sing in the same key. A high-productivity developer is meaningless on low-productivity technology. And high-productivity testing methods are useless if enslaved to a low-productivity technology.

 

Start it, shape it, ship it. Netezza is the ticket home.


In the past several projects, the issue of using views has consistently arisen, as to when to use, not to use and what to expect. Views are one of those mainstay workhorses that we love to hate and sometimes hate to love, but used correctly, can save a world of hurt and lost development time.

 

So we would ask the question - why use a view at all? Isn't the table definition good enough? And what of a synonym? Isn't this just as good?

 

Well, synonyms are handy for configuration management and invaluable for testing. For BI, however, they don't pass through the metadata of their underpinning relation for consumption by the BI tool, so this can be problematic. We also cannot refresh a synonym easily. It has to be dropped and then created in two operations, whereas a view gives us the concurrency-protected operation of "create or replace view", which is muy bien. More on synonyms in another blog entry.
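To make the refresh difference concrete, here is a minimal sketch. The object names `sales_current` and `sales_2010_q3` are hypothetical, and the script only prints the statements rather than running them.

```shell
#!/bin/sh
# Hypothetical object names throughout; the script only prints the SQL.

# Repointing a synonym takes two separate operations.
SYNONYM_REFRESH="drop synonym sales_current;
create synonym sales_current for sales_2010_q3;"

# Repointing a view is a single statement.
VIEW_REFRESH="create or replace view sales_current as select * from sales_2010_q3;"

echo "$SYNONYM_REFRESH"
echo "$VIEW_REFRESH"
```

The point is only the count of operations: two for the synonym, one for the view.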

 

Views can easily reach across databases, giving us the ability to stand up a consumption-point that contains one-part tables and one-part views without having to push data around (very handy for, say, a reference database that we want as an on-demand resource of fresh information). I'm a big fan of setting up consumption-point databases so that a user comes to a pre-designated place, not the master repository, to fulfill their information needs. This decouples the user from the master repository and gives us enormous freedom in the ongoing enhancement of their user experience. Views are the vehicle towards this goal.

 

Views also let us do on-demand case/when conversions and typecasting that can be completely encapsulated from the consuming process.


And of course a really cool part about Netezza views is that we can include as many columns as we want in the view's "select" clause, and the view will not fetch them all, only the ones mentioned in the select that consumes the view. This is a win-win, because otherwise it would fetch all of the columns and then drop the majority on the floor to deliver a few.
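A small sketch of these two ideas together: a wide view that encapsulates a case/when conversion and a typecast, and a consumer that asks for only two of its columns. The table `customer`, the view `customer_v` and every column name are hypothetical, and the script only prints the SQL.

```shell
#!/bin/sh
# Hypothetical table and column names; the script only prints the SQL.

# A wide view that hides a case/when conversion and a typecast from consumers.
VIEW_DDL="create or replace view customer_v as
select customer_id,
       customer_name,
       case status_cd when 'A' then 'Active' when 'C' then 'Closed' else 'Unknown' end as status_desc,
       cast(open_dt as date) as open_date,
       region_cd,
       balance_amt
from customer;"

# A consumer that names only two of the view's six columns.
CONSUMER_SQL="select customer_id, status_desc from customer_v;"

echo "$VIEW_DDL"
echo "$CONSUMER_SQL"
```

The consumer never sees the raw status code or the original data type, and it names only the columns it actually needs.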

 

Views have the lightweight nature of a single SQL statement that can be easily installed, where a stored proc often contains multiple SQL statements. Both of these mechanisms serve to hide the logic from the BI tool. But think about this - would we use the stored proc as part of another join? Or would we expect to just select from the stored proc and consume an answer? The more complex the operation, the more we need to just select-and-consume, and take the burden off the BI tool to know more than it has to.

 

A pernicious part of integrating BI tools is just that - expecting that the tool will know all it needs to know to interact with the Netezza MPP. This is - as you may painfully discover - a false expectation. Case in point, we might have a very creative intersection table between two large fact tables, and we can formulate a query that will browse the information-we-want in mere seconds. Then we plug in our BI tool and ask it to manufacture a query to do the same thing, but it struggles. Now we have to make a call - do we deploy the BI tool in the hopes that later releases will resolve this, or do we install a view or stored procedure that adapts the BI tool to our data model, and then wait for the BI tool to get better in a later release? You see, we can always toss the adaptation when our BI tool gets better. But we cannot allow our user-experience to languish on the same terms. More on this in another essay.


So before I jump into a lot of other things we like about views, I'll address some of the above in their more malignant form.

 

I'll loosely divide views into two buckets - simple and complex. The simple view consumes a single table and may have columnar transforms on it. A complex view, simply put, has more than one table in the join logic.

 

A simple view cannot be easily misused, but a complex view can be misused so easily it will make the head spin on your best troubleshooter. For example, I cannot count the times I've seen a case where a master query joined on a view, which in turn joined on a view, which in turn joined on a view etc. How deep can you go? This is not the issue at all. The issue is in treating the view as though it is a reusable, inheritable object rather than a standalone select-and-consume capability. So where do we draw the line?

 

Transactional thinking - that is - the notion that we can install nested (inherited) views because they handle transaction-at-a-time anyhow and any given instance of them will have a negligible performance problem - is completely washed away when dealing with multi-billion-row scales on a Netezza platform. It's not a transactional platform, so each view potentially initiates a full table scan. Multiply these nested upon nested views and we have nested table scans - sometimes several separate scans on the same table. Which is more efficient, to look at a multi-billion row table once, or multiple times?

 

One customer had a query that started running very slowly one day. We went through a process of discovery to find out what had changed. Seems that a new version of an existing view had been installed, and the bad query was consuming this view deep under the covers. The bad query and the view were both accessing one of the largest tables in the database, so the bad query was now scanning the big table twice, taking a double-hit on the master query itself. Even worse, the changed view did not leverage the big table's zone maps or its distribution key. So a change in one place dramatically affected unchanged functionality of a master query.

 

Because we are embracing economies of extraordinary scale, dynamic objects have a propensity to lose performance integrity over time. What worked yesterday may not work today, so we have to tune it. Netezza is so efficient that this tuning necessity may not arise for years after the implementation (in one case, four years afterward). By that time, the knowledge of the system's dependencies is not fresh on everyone's mind, so it is easy to make a spot-fix on the view and deploy it. In so doing, we may create a cascading effect for all the other places that consume the view and do so with the expectation of original behavior. In short, the latent nested view architecture is a minefield. We should not implement it because it creates trouble from day one, even though nobody has stepped on a mine just yet.

 

At one customer site we had to sift through six levels of view logic to find the performance problem. The customer wanted to know what they should do to fix the problem, but "the problem" was in the overall implementation and the nested views, not the one bad view, or for that matter, the recent performance symptoms of a minefield implementation.

 

Views can behave as traditional objects if they are single-table views or they leverage additional tables that are small and inconsequential to performance. Don't ever include a big-fat table in a view as part of a performance boosting strategy unless you can designate that the view is in fact a standalone entry point and not something that can arbitrarily participate in the JOIN clause of another master query. Why is this? Invariably we will forget the complexity of the view and then attempt to join it in another operation. For a BI tool, this could be highly problematic as well, because a view that was once simple could spontaneously go complex, and if it affects performance, we'll be pulling our hair out to find the problem through what reduces to a scavenger hunt, or worse, a submarine hunt.

 

Many BI tools simply choke on automatically forming a "complex" Netezza query because there is an implicit assumption of indexes via primary keys, and if these don't exist, the BI tool does the best it can, which in many cases is the least-common-denominator of a query structure. This doesn't play well on the SPUs for large-scale queries. I cannot count how many times I've seen a convoluted query that we just de-engineered and simplified, and that then ran an order-of-magnitude faster than the one conjured by the BI tool, yet nothing the tool folks could do seemed to make the BI tool form it the same way. To the rescue: a view that did the right thing - and that was that.

 

What's that? Putting together a view diminishes the flexibility of the query? Only marginally, and since we're dealing with billions of rows, we don't have much runway for "ultimate" flexibility anyhow. The larger the datasets, the more we need to make sure the queries are as efficiently formed as possible. And since this means as simply formed as possible, we're not talking about BI Tool query engineering, but query de-engineering.


To avoid pain and injury, don't treat all views the same. If we have a complex view, we should tune and designate it as standalone. No matter how much we like its results, it is better not to just arbitrarily include it in another join, the primary reason being that most views are not set up to regard a distribution. So when we include it with our other join, the resolution of distribution might take the form of the least-performing, lowest common denominator. We don't want that.

 

One alternative, oddly, is to CTAS - execute such a view in context and insert its data into a temporary table, then use the temporary table in the master join. This affords us the option to (a) leverage the view's normally small output, (b) preserve the distribution or (c) align distribution to the next operation, and (d) simplify the implementation. Of course, your BI tool may not support this, or may support it in an inefficient fashion. Most of the major BI tools will accommodate advanced scenarios, so get your product support rep on the wire and have a heart-to-heart.
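A minimal sketch of the CTAS approach, under stated assumptions: `v_customer_rollup` stands in for a complex view, `big_fact` for the large table in the master query, and `customer_id` for the join key that both are distributed on. All of these names are hypothetical, and the script only prints the statements.

```shell
#!/bin/sh
# Hypothetical names; the script only prints the SQL.

# Step 1: run the complex view once, landing its (normally small) output in a
# temporary table distributed on the key the master query will join on.
CTAS_SQL="create temp table tmp_customer_rollup as
select customer_id, rollup_amt
from v_customer_rollup
distribute on (customer_id);"

# Step 2: the master query joins to the temporary table, not to the view.
MASTER_SQL="select f.customer_id, f.tran_amt, t.rollup_amt
from big_fact f
join tmp_customer_rollup t on f.customer_id = t.customer_id;"

echo "$CTAS_SQL"
echo "$MASTER_SQL"
```

The design choice here is that the view stays standalone: it is selected from exactly once, and the join only ever sees a plain table whose distribution we chose on purpose.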

 

Yet another alternative is to use the view like an in-line view, except in the where-clause as a correlated sub-query. This can often take the form of a where-not-exists clause or the like and can also be very efficient.
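One way this might look, as a sketch only: the view `v_excluded_customers`, the table `big_fact` and the column names are hypothetical, and the script just prints the query.

```shell
#!/bin/sh
# Hypothetical names; the script only prints the SQL.

# The view appears in the where-clause as a correlated sub-query,
# not in the join clause of the master query.
NOT_EXISTS_SQL="select f.customer_id, f.tran_amt
from big_fact f
where not exists (
  select 1
  from v_excluded_customers x
  where x.customer_id = f.customer_id
);"

echo "$NOT_EXISTS_SQL"
```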

 

Another alternative is to break apart the view's logic and assimilate it into the larger view so that all logic is preserved. But you'll be maintaining that logic in two places, right? Not necessarily. We have a lot of view DDL executables that do not directly spawn from a modeling tool. Several of those are in BASH script, which provides for parameterization of logic. If we put the logic into a parameter, then produce the views by including the parameterized logic, we will maintain the core logic in one place (script) but actually deploy two views that leverage it. This is essentially what happens under the covers with many object-oriented environments anyhow. Multiple objects will consume another class and deploy an instance that includes that class, so this approach embraces that inheritance pattern. Not in the dynamic run-time of the view, but in the view's initial DDL-level deployment.

 

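# Shared predicates on mytable (alias a), maintained in one place.
MYLOGIC=$( cat <<!
a.limit1 between 50 and 60 and
a.limit2 between 1000 and 50000 and
a.tran_amt < 10000
!
)

# view1 reads only mytable, so it uses only the shared predicates.
view1="create view view1 as select col1, col2, col3 from mytable a where $MYLOGIC ;"

# view2 also joins yourtable (alias b), so the predicate on b lives here;
# col1 is qualified because it exists on both tables.
view2="create view view2 as select a.col1, col2, col4 from mytable a join yourtable b on a.id = b.id where a.col1 = b.col1 and b.employee_id <> 9999 and $MYLOGIC ;"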

 

If our modeling tool supports this capability as part of its functionality, we should leverage it before simply bolting a view into a join. If our modeling tool does not support it, this scripted DDL scenario is easy enough to formulate and leverage without a lot of overhead. The objective: two views that both behave as optimized joins, rather than one view that behaves as a join-with-a-view.

 

Either way, there is a theme here, that simply including the complex view as part of another join's logic - as though it was a table - is risky and can, even at the outset, offer up such bad performance as to be a non-starter. So a plain-vanilla practice should be to make the complex view behave in a standalone query-and-consume fashion by default. Make no assumptions that it is okay to arbitrarily include it in a larger query's join clause.

 

The further downside is that a de-facto join-with-a-view can work really well at the outset, but the scale of the data can catch up to even the most robust of implementations, and wiring up the complex view dependencies creates a problem that will not scale, but will only become obvious over time (a minefield).

 

One group invoked a standard for view naming conventions. The simple views would have no prefix at all, so they would look like tables to the casual user. Fair game and all that. The complex views were labeled as v_<viewname> as a cue to a user or report builder: don't use it in the join of a larger query. You'd think that if there was an implicit rule to avoid using anything with a "v_" prefix, people would play nicely. But not so, since your reporting users may have come from an RDBMS background where it's perfectly okay to mix views into the master query. Awareness of the standard is one thing, but actually embracing it is another. We cannot protect our systems from people who either don't know the rules, don't understand them, or cannot map their experiences from an RDBMS to an MPP.

 

So a suggestion here would be to name the view in a manner that is a departure from common view nomenclature. Calling it an sp_(NAME) might draw the ire of your admins who want stored procs named for what they are, and not obfuscate their names. But if our views are not really common views, and have caveats on their usage, we need a safer naming convention, one that aligns with the goal we are trying to achieve - that of adapting the BI tool to the MPP. One group used a naming convention of "bi_", while another used "rpt_", and still another used the common acronym for their given BI tool. The point is to adopt a convention that is somewhat unconventional, so that those with conventional thinking are able to transform their thinking without finding themselves in a minefield.

 

Nothing is worse than overlooking a minefield - it's a scary view - a view to a kill.


Whew! The Enzee Universe this year was quite an experience.

 

I would like to offer my sincerest thanks to all of you who attended the Best Practices session held in marathon-form on Monday before the keynote. Over 300 people signed up, and many of you arrived the evening before, and at the end of the session, you were still there!

 

Afterwards we took a checkpoint and then added more material to the PowerPoint presentations that we used during the sessions, and these will be added to the content of the Enzee Universe downloads for those who attended.


Some of you also asked me about the music selections we played at the intro and during the breaks. These were selected in terms of "Your Theme Songs", because some of them were from superhero movies, and some from action-adventure flicks. Here they are, in no particular order, the tune, origin and reason for selection:

 

Theme from "The Incredibles" - because we come from a family (The Enzee Universe) of multi-talented superstars

Theme from "Batman/The Dark Knight" - because sometimes we have to work in a thankless role (however personally rewarding, with high-tech toys)

Theme from "Superman" - because to competitors, Netezza is like Kryptonite, and to the rest of us, it solves World Problems and makes us look good without having to wear our underwear on the outside

Theme from "Spirit, Stallion of the Cimarron" - surging music for those who entered the frontier on a Mustang

Theme from "Surf's Up" - probably enough said, here

Theme from "Mission Impossible" - a congenially offered rebuttal to those naysayers who say it can't be done

Theme from "James Bond" - because he, like many of you, is an MTBA - That's Multi-Talented Bad A**

Herbie Hancock - Rockit - because that's what you do

 

I would also like to thank Netezza for the opportunity to share these ideas, many of which I have gathered over the course of my Netezza derring-do from people just like you, so some of the information is what-is-practiced in the field, and some of it is idea-works that we have re-synthesized into practices that seem to work well as a sort of "adaptive composite". The objective of course is to share with you what others are doing, to enrich your base of ideas, but these are certainly not hard-and-fast rules. The Netezza appliance is one that unlocks creativity, harnessing it for the Good of All Mankind. So guidelines and practices give us more critical mass to solve problems.

 

Likewise I would like to thank Netezza for the Enzee Community Voice Award presented to me on Tuesday night at the Gala, in recognition of being such a vocal supporter. My words then apply just as well now: "The people and the product create a synergy that's like electric current. I love interacting with the Enzee Community, and being a part of it".

 

I also noted that a larger number of independent consultant/contractors were present on this go-round. In the best practices sessions, there is a wide range of professionals, from those who are Netezza customers, to consultants working for firms, independent consultants, analysts and the like. The Enzee Universe has various video screens constantly running slow-motion surfing videos in keeping with the TwinFin Theme. One day you might be lookin' at that guy on the wave, thinkin' about the TwinFin and wondering if you're on the wave, or just watching it pass.

 

As with all professions, you might be in the zone where your project is just ending, or is about to, and you're wondering about the next great thing. And as you know, I'm always on the hunt for bright architects and engineers, especially in this economy, so if any of you independent types are looking for an opportunity, give me a shout. I also extend the invitation to anyone else who is reading, with the qualified apology that I am not lurking at the doorways to steal away your company's valuable resources. But I have seen in the past that some bright folks find themselves tapping a pencil on their desk, coming down from the exhilaration of a Netezza "experience" and wishing for more. I can say - the work is out there. I'm often in contact with people who need someone just like you - hooked on the technology. Hey, who isn't?

 

Finally I offer a simple salutation to everyone who "gathered round the grill" this past week, sampled the wares, wishes and whatcha-ma-callits of the various vendors, trainers and speakers, and came away enriched and enabled to dream a little stronger, solve a little simpler, and crush those waves with the shredding confidence of parallel power. So I'll either see you in your natural habitat, interact with you here, or catch up with you in person when the Enzee Universe cranks up another adventure.


I know some of you are wondering if I have "gone dark" or something. But hey, there's a conference comin' up and I'm on the hot seat!

 

Much of what will be discussed, I could (and may) post here eventually, but for now it's time to burn this stuff into the presentations so all of us have a productive meeting on Day One.

 

Now some of you have also fired off to me some interesting requests for the discussion, most of which I agree with so we'll definitely make room for it. In the end, the whole of the best practice session is not really for you to hear me prattle on about derring-do - although I do have tales to tell - but more to gather the tales that you have to tell, so that all of us are better for when we return to our respective machines.

 

One of the requests is a deeper dive into the Ten Rules of Large Scale Data Processing that I added to Netezza Underground. We could probably spend all day on that stuff - because it can go really deep, like outta-hand deep on some of them. Not to worry, as I am a highly trained professional moderator and promise to keep you on track.

 

Also note that on Day Two there is a special session on TwinFin migration, with some emphasis on Netezza-To-Netezza migrations since some of us are in the middle of these kinds of conversions. I think this will be especially productive if some of the Enzees (that would be you) are present (and not shy!) so that the attendees can wrap their heads around the overall process and what to expect. So that's another invitation to not only bring your notepad, but your notebook (the one that you use for design sessions) and give us a little non-proprietary insight on how you conquered the world (or at least, your corner of it).

 

Not so strangely, the overall, high-level methodology for moving from an older Netezza to a TwinFin does not change (compared to moving from say, Oracle to a TwinFin), but the challenges can be multivariant. It is these challenges, nuances and subtleties that make life fun - so bring some of them with you! We'd like to hear the adventure.

 

And yes, the rumors are true. I have come under intense pressure to publish yet another Netezza book, and this was evident and underway as of last Fall. The real problem is in deciding what goes into the book and what should be left out - but the left-out stuff is valuable too. I think the consensus now is that the content needs to be divided into two books, both published in rapid succession. Hey, it worked for Back to the Future, The Matrix and Lord of the Rings - but I don't have any delusions of such wild success - I just need to get them into print.

 

I am working right now on a monograph concerning inheritable objects, namely views and such, that in some ways do, and some ways don't translate into systems of scale. After all, once the data gets really big, the physics becomes even more critical. Any inefficiency will only get worse over time. Okay, that's my hook to pique your curiosity. More later!


In an ongoing campaign to take his environment to the next level, John (name changed to protect the innocent) started holding orientation sessions for his DBAs, programmers, architects and other technical resources as a means to get-the-word-out or at least get some education into his people - so that they would be more self-contained in the field, as it were, building out their environments.

 

While they weren't using Netezza (yet), they were still using set-based operations in a high-powered ETL environment, so the basic principles were the same. That is, perform operations on whole groups of records at a time, not several operations on one record at a time.
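
To make the distinction concrete, here is a minimal sketch that applies the same change two ways: one record at a time, and as a single set-based statement. It uses Python's built-in sqlite3 module purely as a stand-in for whatever engine is on hand; the `orders` table, its columns and the surcharge rule are all made up for illustration.

```python
import sqlite3


def make_db():
    """Build a small in-memory table (hypothetical schema, amounts in cents)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER, amount INTEGER, region TEXT)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, 10000, "TX" if i % 2 == 0 else "OK") for i in range(1000)],
    )
    return con


def row_at_a_time(con):
    """Transactional habit: fetch each record, decide, then update it alone."""
    rows = con.execute("SELECT id, region FROM orders").fetchall()
    for row_id, region in rows:
        if region == "TX":
            con.execute(
                "UPDATE orders SET amount = amount + 1000 WHERE id = ?", (row_id,)
            )


def set_based(con):
    """Set-based habit: one statement describes the whole group of records."""
    con.execute("UPDATE orders SET amount = amount + 1000 WHERE region = 'TX'")


def total(con):
    return con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


a = make_db()
row_at_a_time(a)
b = make_db()
set_based(b)
print(total(a), total(b))  # same answer, very different number of statements
```

Both paths land on the same result. The difference is that the first issues one statement per qualifying record while the second issues one statement for the whole set, and that gap is what grows with volume.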

 

Well, as it turns out the primary mental resistance had little to do with understanding the technology, but more in wanting to be "the guy" to break the mold. What mold is that? The mold that makes things feel so "mainframey", with flat files and these "archaic" approaches to data processing that technology "left behind" so many moons ago. Why are we going back to the old days? they lament. Surely someone, somewhere missed the memo that we just don't use flat files! C'mon people! Get with the program!

 

And the more John used his Jedi mind powers to change their minds, the more they dug into this whole notion that they were being dragged back to the dark ages of programming. It might seem a bit dramatic, but many of us have actually seen the techies roll their eyes at the mere mention of flat files, as though we have made a statement more expected from a shaman medicine man or medieval witch. What's next? Put leeches on the machine to cure its bugs? C'mon people! Get with the program!

 

Some fifteen years ago I worked with a group that had some very complex integration issues between their custom application and the Oracle Financials application. They wanted to transport invoicing and other billing issues from their system into OF and have a transparent interaction. OF wasn't as mature back then, so this approach had some challenges, and since the primary interface between the two applications was Oracle's Pro-C, the technical team naturally chose C-language (not C++) to make the interfaces work. A debate ensued on this project as to whether we should "code everything in Pro-C" or use C language at all. The confusion? Pro-C isn't a control language. It's a database I/O specification for C-language. We could always interact with the database through Pro-C, but even the simplest decision-tree operation would still require a formal procedural language, and Pro-C didn't qualify. The Pro-C proponents felt like they had lost the debate, but something more was lost - in all this the technical team was required to use Pro-C for all interactions with the database, including the batch-uploads of the invoices and other instruments. The time for this operation was so egregiously slow that we made a mid-stream decision to deal with the problem more transactionally. That is, mini-batch operations more often than once an evening (some recognize this as "continuous" operation).

 

This only forestalled the inevitable. The volume of data quickly grew so large that the back-end was running continuously and unable to keep up with the front-end. Even worse, the OF functionality was not being used in real-time, even though we'd set it up to behave that way. It all came crashing down when we had to install the system where the volume would crush it. The answer? Flat files.

 

Now you can imagine the outcry from the Pro-C people. Not only had we pushed back on them to use C-language for control, but we were now moving even further away from Pro-C into - gulp - flat files. The gauntlet was thrown down that the entire back-end architecture was flawed and needed complete rework. By this time I had moved on to warmer climes and could watch this battle from a distance, but one of the new engineers called me up one evening to get the down-low on the principals in the environment. He claimed that the whole place was crazy and this myopic fixation on Pro-C would be their undoing. Why? Pro-C is a transactional protocol, not a batch protocol. It invokes the database at the API level rather than performing the leaner bulk-insert operations. Over a year later, the problem remained unsolved so they abandoned the OF interface altogether and started using a third-party provider for their financial reporting.

 

The irony - the third-party provider required them to ship over their transactions nightly. In flat files! (the horror!)

 

Something I continuously but gently point out is that in the data warehousing realm the flat-file is a mainstay. It's not going away and should not have to. The denial that the flat file is a permanent player is the path to mayhem.

 

Not too very long ago, I had the opportunity to assist an online brokerage in how they assimilated transactions from their various member firms. The member firms could ship their data via a web service, over the internet, manually enter it on the brokerage web site (ideal for small shops) or ship it via flat files. Enormous resources had been leveraged to program, maintain and enhance the automated interactions for all these pathways - except for flat files. They had been treated like a necessary evil. Something to be tolerated until they could be replaced with one of the "more mainstream" methods. Even today, they have not moved away from flat files. And they represent over 30 percent of the brokerage's total transactional volume. Not something to be relegated or trivialized.

 

It all came to a head one day when a clerk called up a leader at one of the upstream member firms claiming that they had not transmitted their transaction file on time. (Note - in brokerage terms this leads to an SEC action, so it's serious business). This claim quickly escalated to the member firm's top echelon and within the hour the CEO of the firm was on a conference call with the CEO of the brokerage firm, evidence-in-hand that they had in fact transmitted the file and would not stand for this. The CEO of the brokerage took a deep dive into the responsibility chain only to discover that the member firm really had transmitted things via flat file, and the brokerage was so lacking in attention to this medium that they didn't even have auditing capabilities or anything to tell them definitively whether the file had arrived or not. The file had arrived but somewhere in the night, its loading process had aborted for reasons other than data. The poor clerk could only review a morning report, with no visibility to the actual problem. This state of affairs was intolerable.

 

The brokerage CEO mandated audit-style processing and flat-file receipt for everything, not because it's a good idea, but because it's the law (SEC-wise), after all. And the outcry from this could not have been more vehement. The techies were being told in no uncertain terms - formalize and institutionalize flat file handling. This was taken no differently than telling the network group that they would have to support every prior version of Windows, including 3.0, just-in-case. Woe is us.
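
What might "audit-style processing and flat-file receipt" look like at its simplest? The sketch below is only an illustration, not what the brokerage built: it records the arrival of a file (name, size, checksum, timestamp) separately from the outcome of its load, so a morning report can tell "never arrived" apart from "arrived, but the load aborted". The ledger layout, function names and file name are all hypothetical.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def record_receipt(ledger, path):
    """Log that a file arrived, before any load is attempted."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    entry = {
        "file": os.path.basename(path),
        "bytes": os.path.getsize(path),
        "sha256": digest,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "load_status": "PENDING",
        "detail": "",
    }
    ledger.append(entry)
    return entry


def record_load_result(entry, ok, detail=""):
    """Log the load outcome against the receipt it belongs to."""
    entry["load_status"] = "LOADED" if ok else "FAILED"
    entry["detail"] = detail


def morning_report(ledger):
    """What the clerk sees: every file received, and what became of it."""
    return [(e["file"], e["load_status"], e["detail"]) for e in ledger]


# Demo: the file arrived, but its load aborted overnight.
ledger = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "member_firm_trades.dat")
    with open(path, "wb") as fh:
        fh.write(b"T1|100\nT2|250\n")
    entry = record_receipt(ledger, path)
    record_load_result(entry, ok=False, detail="loader aborted (not a data error)")

report = morning_report(ledger)
print(report)
```

With a receipt on file, the conversation with the member firm starts from "we have your file, our load failed" instead of an accusation.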

 

But the techies missed the point, in that the flat file is a modern-day marvel in its resilience and capability. As many other types of storage mechanisms have come and gone, the flat file continues, impervious to the changes in technology all around. Flat files underpin every major database storage mechanism, are ultimately the storage form for more recent formats like XML and its derivatives, are scalable on any platform, and have generally stood the test of time. Flat files are like a gallon of gasoline. They are always predictable, reliable, never break, easily scale and can be used practically anywhere - and usually are.

 

So why the consistent resistance to flat files? It's because they don't seem exotic or challenging. After all, it's a flat file. A caveman could do it. Somehow, suggesting a flat file solution makes a person feel like they are not a contributor. Yeah, anybody can whip up a flat file - where's the technical prowess in that? In my humble opinion, the diminishing prowess in effectively using and embracing flat files - especially where they are supposed to be used - is rapidly becoming a lost art form and part of a lost world. Just as we lost the architectures and methods of the wonders of the ancient world, the knowledge of effective flat-file usage is invaluable in enterprise computing. Formalizing and institutionalizing it has extraordinary value, especially when it comes to tracking critical enterprise assets.

 

But in the Netezza space, where do they play?

 

How about data intake? In the Netezza platform we can load data at extraordinary speeds. About the only technology that can possibly feed Netezza at its maximum rate of intake is the raw physics of a file system. Consider that should we need to extract something from a database, the request will ultimately go through the database engine's software layer down into its bowels to arrive at - flat files - and then pull the data, rise back into the engine's CPUs and be delivered, largely via software processes, to the extraction point of the information. Reading from this extraction point will always be slower than reading from a flat file. This is why products like WisdomForce can extract Oracle data so much faster than an interface-level extract. Their Fastreader goes for the data on the file-system level, and has embraced the audacious notion that performance is found close-to-the-physics. Where have we heard that before?

 

If we were to perform a simple test - let's say we pull data from SQLPLUS into a pipe (to eliminate the write-drag of the file system) and then perform a simultaneous nzload from that same pipe, the flow will move only as fast as the extraction. In one particular case, 3 million records (even with a parallel extract) took over fifteen minutes to pull from the database. Netezza's nzload waited patiently, perhaps scraping its virtual nails on its internal chalkboards waiting for the maddeningly slow load to finally finish. In the second version of the test, we extracted the data to a flat file and once completed, performed the nzload. The extraction still took fifteen minutes. The nzload took a few seconds.

 

But think about this - in the first test, for the fifteen minutes that the Netezza machine was tied up with an nzload, it could have been doing other things. After all, there are a finite number of load threads we can invoke for work, and this long, slow stream tied up one of them for much longer than it should have. Pushing to a flat file and then performing an nzload, the load thread is only occupied for a short window and is then free for more work. This is a more efficient use of Netezza's interface. Of course, if the box has nothing else whatsoever to do in those fifteen minutes, then go for it. A while back, we installed and burned in a sqlplus-to-pipe-to-netezza intake framework that worked just fine, and its window of operation was during a quiet time. On the flip side, other interfaces had detailed data feeds, most of them integrated to arrive at the same time, and they were all being pushed as flat files. Netezza simply inhaled them - in seconds - and moved on.

 

In yet another venue, the upstream systems were pulling and pushing small data files from their various internet-based sources. All of the data files carried a small part of the same information stream, so we could essentially "cat" these files together into one. The problem was that they were all being written to a common file server, and then other processes kicked off to load these snippets of data individually. Since each one was a tiny file, this created an enormous burden on Netezza. Every load has a finite cost. (Recall, we can load 1 record in 1 second, or 1 million records in 1 second - either way it costs us 1 second). So by formalizing a "collector" mechanism for these files, we could effectively have one nzload load as many of them as were available on the file system as one large stream of data. In this regard, we could "cat" hundreds of files into a pipe and nzload from the pipe. This is a good use of "cat", since it is writing to memory and not back out to the file system, essentially reading and stuffing data into a pipe for our consumption. This alone stabilized the intake protocol. Where the existing implementation had saturated the Netezza machine's interface, the collector freed it up and allowed them to take on even more capacity without additional overhead. Think about the mechanics for a moment. If we have 1500 files, collectively containing 150 million records, we can choose to load 1 file at a time, requiring 1500 seconds (25 minutes), or we can cat the files into a single nzload for a load that takes about two minutes. If the issue is load-time and performance, then we need to formalize and harness the way we intend to load all those flat files. Make a design decision, embrace it and institutionalize it for maximum firepower.
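
The collector idea can be sketched in a few lines. This is a simplified stand-in, not the framework described above: it streams many small files into one sink the way "cat" would, and the sink here is an in-memory buffer rather than a real pipe feeding nzload (that wiring is deliberately left out). The one-second fixed cost per load is the rule of thumb from the text; the function names and file names are hypothetical, and each file is assumed to end with a newline.

```python
import io
import os
import shutil
import tempfile

FIXED_COST_PER_LOAD_S = 1.0  # assumption: the "1 second either way" rule of thumb


def per_file_load_seconds(n_files, fixed_cost_s=FIXED_COST_PER_LOAD_S):
    """Fixed overhead paid when every small file gets its own load."""
    return n_files * fixed_cost_s


def collect(paths, sink):
    """Stream many small files into one sink, as `cat` would into a pipe.

    Assumes every file ends with a newline so records do not run together.
    """
    count = 0
    for path in sorted(paths):
        with open(path, "rb") as src:
            shutil.copyfileobj(src, sink)
        count += 1
    return count


# Demo with three tiny files of two records each.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        path = os.path.join(tmp, f"snippet_{i}.dat")
        with open(path, "wb") as fh:
            fh.write(f"{i}|a\n{i}|b\n".encode())
        paths.append(path)
    sink = io.BytesIO()
    files_collected = collect(paths, sink)

stream = sink.getvalue()
print(files_collected, "files ->", stream.count(b"\n"), "records in one stream")
print(per_file_load_seconds(1500) / 60, "minutes of fixed overhead for 1500 separate loads")
```

The per-file path pays the fixed cost 1500 times; the collected stream pays it once, and the rest of its duration is pure data movement.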

 

The Netezza Underground offers up a number of rules (the first ten of which are early in the book) that are specific and non-optional for systems of scale. One of the rules is to use the most scalable mechanisms and assets available, and embrace them as a regular part of everyday data warehousing life. But mention "flat file" to someone steeped in transactional or visual technologies flowing from Redmond or Silicon Valley, and they first roll their eyes. When they see we are serious, they push back as though fighting-on-principle. When the verdict is in (flat files are here to stay) and they see they cannot win, they update their resume and leave. Their reasoning: they don't want to go back to the "dark ages". You think I'm kidding.

 

We again come full circle to the whole idea that bulk processing is not transactional processing. And with Netezza, the bulk processing is on a sometimes mind-numbing scale. Things that we used to do as neat-clean operations in transactional space, suddenly have no viability whatsoever with data sizes of this magnitude. Which is, of course, the primary reason that people will reduce to flat file operation when the performance starts to lag. Flat files are scalable. Software, not so much. Transactional, for bulk, never scales. Stop now.


One of the questions oft-asked in best-practices sessions and in general consulting: How do we get a "newbie" on-boarded quickly? Some concern usually arises when the new Enzee approaches the Netezza machine with the same thinking processes as with a traditional RDBMS. While there are "gross" similarities, it is the differences we want to leverage, and these are not either/or questions. There is a better way to implement things in Netezza, and a better way in the traditional RDBMS. Mixing the two is not optimum and can be detrimental.

 

The primary discussion fulcrum is simple: One is a transactional database and one is not. Moving away from "transactional thinking" is the key. How to accomplish this?

 

One of the best ways is to discuss and actually demonstrate the primary differences between bulk and transactional processing. This is largely the crux of the misunderstanding, and the necessary "paradigm shift" our newbie needs to embrace. A significant hurdle, it seems, is the newbie's belief that the core engine functionality of their favorite RDBMS is somehow being indicted or set aside as useless. After all, the transactional RDBMS is just that - transactional - and this is what we want the newbie to move away from. What? All that hard-won and industry-hardened capability - and we're just setting it aside? Really?

 

In a word - Yes.

 

It's not that the transactional capabilities are useless. They simply aren't useful in a data warehouse. More importantly, they don't even exist in a Netezza machine. So attempting to shoe-horn transactional thinking into this machine is a huge disconnect - no differently than using a lawnmower as a hedge-trimmer. Netezza is purpose-built. Transactions are missing by design.

 

Now at least one person is bristling because they know, administratively, that transactional support is handy for logging, managing metadata, troubleshooting hooks and other administrative support. I don't disagree with that, but it's not the activity of bulk data processing. It is far easier to set up a smaller database machine alongside the Netezza machine to perform these administrative transactional tasks. Each machine then has an objective role and purpose, and off we go.

 

What are some of the demonstrable ways that we can introduce the new Enzee to this issue, in a manner that really drives the point home? Well, I can't seem to count how many times I've had (sometimes rather contentious) discussions with "outsiders" (or perhaps "purists") on the subject of transactional exception handling. Inserting a record into a transactional database, with its glorious constraints turned on, will guarantee that it will push back on us with an exception. Said exception requiring the dutiful compliance of an exception handler. You know the drill.

 

But in data warehousing, such transactional exceptions are in the way of our bulk load. We don't want the database to examine each and every record as it arrives, potentially formulating an exception (and its attendant overhead) for each record, or passing each one through after its constraint-based integrity check. We just finished taking all that data through a detailed sieve of business rules in the ETL layer, didn't we? The database needn't trouble itself, just load the data, thank you.

 

Now at least one more "outsider" is bristling. How dare you say that we should set aside the constraint-based exception handling? What possible justification could there be for such a gross trampling of RDBMS functionality? Explain yourself!

 

In a word - performance.

 

Storytime: Just after 9/11, the airports over-compensated with all kinds of rigorous shakedown protocols. Travelers had to show a boarding pass and ID at a checkpoint, then keep them handy for just after the checkpoint. And then also for presentation at the gate prior to boarding, along with random bag searching. If you were the first one to board, or made eye-contact with the bag-search team, it was guaranteed that you would be taken aside and your luggage rummaged. A friend of mine told me that the rummagers liked to carry on a conversation to make you feel more comfortable about their pulling your private things out into the open air for all to see. One of them held up a nose hair trimmer to one of his cohorts and said "What the heck is this?" Makes one wonder what other kinds of personal appliances we could "salt" the bag with just to embarrass the daylights out of them, hmmm?

 

My friend told me that he was pulled aside a lot, and started experimenting with "stated professions" that the rummager would not care to talk about. At one point he blurted "I'm a professional bodyguard" to which the rummager alerted like a trained narcotics dog and said "So you would know how to use weapons?" to which my friend simply said "Or not."  This of course made the rummager gulp and go quiet, but it still wasn't good enough. My friend didn't want them to talk at all, so they wouldn't waste any time in their rummaging and just get-it-over-with. So at one point he said "I own a funeral home." Which of course, stopped the chatter completely. Nobody really knows how to continue a casual conversation about such a subject.

 

The point being, he'd already had his bags electronically scanned at the checkpoint. Do we really need to check them again? And unlike a constraint-based exception handler, the rummager had the option of only picking out random hapless travelers. The exception handler rummages the bags of every traveler in the line. We can see how utterly inefficient this is. Nowadays, they screen the bags and then don't even check ID again at the gate. Except for random gates on occasion because nefarious people sometimes swap tickets when they get behind the checkpoint. In any case, if we've already exhaustively checked the bags to get the traveler where he is, more checking is a waste of everyone's time. Just like the exception handler. If we just ran the entire set of data through rigorous validation rules, we have no need whatsoever of the transactional exception handling in the database. It will waste processing time.

 

And wasting time, we don't have the luxury to do.

 

The transactionally-constrained bulk-load of data will be, on average, five to ten times slower in operation than its non-constrained equivalent. If our objective is to achieve a fast load - and trust me - it really is - we don't want constraints turned on. We're talking about loading millions if not billions of records. Even in an RDBMS, we cannot afford to convert what could be a thirty-minute operation into a two-plus-hour operation. The window of time simply does not exist. In some locations, if this kind of window ever existed, it is rapidly vanishing as their businesses go-global and need to process data as-the-world-turns.

 

On the flip side, think about the main reason for a transactional exception - it is to keep a transactional application honest. If the data does not comply, the user fixes and re-submits. It's interactive, and it deals with a single entity at a time, not millions of entities at a time.

 

The "outsider" will now brace on this assertion as well, because they think that having thousands of users interacting with the system constitutes this many-entities-at-a-time, but it simply doesn't. And here's why: RDBMS systems are meant to assimilate data in small chunks with high frequency. They are not designed to deal with large chunks at low frequency (e.g. a batch load once a night). They will accommodate such activities, but not do them well. In this case, "well" means loading a million rows a second. The RDBMS cannot approach this.

 

And this is the reasoning behind Rule #10, which is - when loading bulk data, never involve the database in row-level activities. This means, without exception, turn off the exception handling. Because the database will just protract the duration of the flow, checking each and every record and slowing down all of them as a whole, in order to find the few exceptions. It is the equivalent of making the entire flow suffer for the sake of a few records. This is a bad tradeoff. And once again - didn't we just validate and scrub all these exceptions from the flow, in the ETL/data processing environment? Why are we asking the database to validate them again?

 

And the worst part is that the potential exceptions are all anticipated and known. What does this mean? As a back-end programmer building the data flow, we have direct and objective access to each and every failure point that will stop the data from loading. Why would we delegate this to the database, since it is so inefficient in performing it? Note - not so lacking in functionality, because the RDBMS has lots of functionality to perform it. It is simply too inefficient with bulk loading to be a viable resource.

 

So what are the anticipated exceptions? Let's go for popularity:

 

  1. Bad or null data
  2. Unique key violations
  3. Primary/foreign key violations

 

 

In fact, the above constitute the primary reasons for the load to fail. So let's walk through the basic process we would need to follow if we delegate this to the RDBMS.

 

In transactional mode, the RDBMS data load will kick out an exception for each of these it finds. Even if it completes with no error, someone in the room will say "It took too long. Even if it didn't find any exceptions. Fix it. Make it faster!" The hard-core transactional ingenue will attempt to optimize it without turning off exceptions, and find that it cannot be done. If the load of one record requires 1 second, it will take 1 million seconds to load 1 million records.

 

We just don't have 1 million seconds.

 

(Incidentally, in Netezza, the load of 1 record requires 1 second. The load of 1 million records requires 1 second. Use your second wisely!)
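
For a sense of scale, the arithmetic behind those two figures is worth a quick look. The sketch below just converts the one-second-per-record illustration into days and sets it beside the one-second bulk load; both inputs are the round numbers used in the text, not measurements, and the function name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def row_at_a_time_days(records, seconds_per_record=1.0):
    """Elapsed days if every record pays the full per-load cost on its own."""
    return records * seconds_per_record / SECONDS_PER_DAY


days = row_at_a_time_days(1_000_000)
bulk_seconds = 1.0  # the text's rule of thumb for a single bulk load

print(f"row-at-a-time: {days:.1f} days; bulk: {bulk_seconds:.0f} second")
```

One million seconds is more than eleven and a half days, which is why no batch window will ever accommodate it.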

 

So our new Enzee will grouse a bit and then look around at the data warehousing sites for answers, and all of them will say, turn the RDBMS exceptions off, load the data, and then turn them back on. The newbie will object - but wait - when I try to turn them back on, the database yelps and says there are constraint violations. I will have to back the records out and try again. Oh yes, we now have a mess on our hands. In the time it took to load the RDBMS data - say thirty minutes - we have now accumulated errors that might take hours to back out, fix and then retry. And we'll have to do it while the batch-window clock is ticking, not in a pre-process where we had more breathing room. We don't have this kind of time window.

 

We will never have this kind of time window.

 

So the next fallback is to fix the exceptions in the data processing realm (ETL tool), prior to loading the data. But isn't this what the data processing realm is for? Really? This means we do all null checking and constraint checking prior to loading. How? We download the primary and foreign keys into the local data processing environment and perform a localized join-filter to remove the exceptions. This is a data warehousing 101 best practice.
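
As a rough picture of that localized join-filter, here is a small in-memory sketch. It assumes the existing primary keys and the valid foreign keys have already been pulled down into sets, then splits an incoming batch into load-ready rows and rejects, covering the three popular exceptions listed earlier. The field names (`order_id`, `customer_id`, `amount`) and the function name are hypothetical; a real ETL tool would do the same thing with its own join and filter components.

```python
def split_load_ready(rows, existing_keys, valid_customer_ids):
    """Separate rows that will load cleanly from rows that would raise exceptions."""
    good, rejects = [], []
    seen = set(existing_keys)  # keys already in the target, plus keys seen in this batch
    for row in rows:
        key, cust, amount = row["order_id"], row["customer_id"], row["amount"]
        if key is None or cust is None or amount is None:
            rejects.append((row, "null value"))
        elif key in seen:
            rejects.append((row, "duplicate key"))
        elif cust not in valid_customer_ids:
            rejects.append((row, "unknown foreign key"))
        else:
            seen.add(key)
            good.append(row)
    return good, rejects


# Keys "downloaded" from the target database ahead of the load (made-up values).
existing_keys = {1, 2}
valid_customer_ids = {"A", "B"}

batch = [
    {"order_id": 3, "customer_id": "A", "amount": 10},
    {"order_id": 2, "customer_id": "A", "amount": 5},     # already in the target
    {"order_id": 4, "customer_id": "Z", "amount": 7},     # no such customer
    {"order_id": 5, "customer_id": None, "amount": 1},    # null data
    {"order_id": 3, "customer_id": "B", "amount": 2},     # duplicate within the batch
]

good, rejects = split_load_ready(batch, existing_keys, valid_customer_ids)
print(len(good), "load-ready;", [reason for _, reason in rejects])
```

The rejects come out with a reason attached, in the data processing realm, before the load ever starts. That is the proactive half of the tradeoff.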

 

The newbie will now brace on the idea of downloading all the key values. All of them? That could take, well, it could take a long time! In fact, it will take mere minutes to pull down all the key values. And those mere minutes are nothing compared to the duration of the recovery mess we will endure if we don't take this step.

 

Pay a little time now, or a lot of time later. Use your time wisely.

 

So here is the tradeoff (again in traditional RDBMS space)

  1. Turn off constraints, load the data, and then deal with the mess after the fact. Plan to spend hours backing out the mess and then running the load from scratch.
  2. Download keys, join/filter the key exceptions, turn off constraints, load the data with the expectation that no mess will arise. (because it won't)

 

In short, downloading the keys for constraint checking is a necessary evil. Our only "next best" fallback is to load the data into a pre-target staging table and do the gross comparison there. Then we copy the good records into the target table. But wait - now we've incurred the penalty of the load twice (one for the ETL to staging and one for staging to target). Isn't it cheaper to pull down the keys once than it is to load all the data twice? Not to mention the fact that the average RDBMS engine does not efficiently copy tables either. So even if we decide to go with loading a staging table, the copy of the staging-to-target will take longer than we are willing to wait.

 

Think about this: When the data exception arises in (1) above, where will we fix the problem? In the database, or in the data processing realm? The database can only report the issue, not fix it. If we must fix it in the data processing zone anyhow, why wouldn't we fix it proactively rather than reactively?

 

So this approach means something even more valuable - if we find the exceptions in the data processing realm prior to loading, we will have found them proactively and administratively, not reactively and operationally.

 

This makes a huge difference in the reconciliation of data exceptions when we're dealing with millions or billions of entities.

 

And yet another issue our newbie is pleasantly unaware of - data processing on this scale has to be beholden to the constraints of the lights-out operation, administration, and logistical capabilities of the physical plant around the machine. If the operators have to get involved in the data recovery, with data processing on this scale, it needs to be for incidental reasons, not mass data recovery.
In essence, delegating this activity to the RDBMS is setting up our operators to fail. We will find them entirely intolerant of this approach. Fix it, they will say. If our answer to them is - hey, me architect, you operator, so gird up thy loins and get thee to work - we have punted (and dangerously so) something we should take complete responsibility for. Because make no mistake, we will be held completely accountable for it as well. They will call us in the middle of the night. They will only help us incidentally. It's your mess, you clean it up!

 

The primary issue here is that the traditional RDBMS load has to be not only load ready, but consumption-ready. When we load the data, we have to be completely and thoroughly finished with all data processing before it hits the target table. From there, the user should be able to consume it right away. Load-ready and consumption-ready is the name of the game, and it's accomplished for the RDBMS in the ETL environment, because it cannot be efficiently accomplished inside the RDBMS. The RDBMS is simply too slow and inefficient for any form of bulk operation. And again I say, if the only place to actually fix the data is the data processing realm, it only makes sense to do it proactively, not reactively.

 

Now let's flip over to the Netezza side of things.

 

In the Netezza machine, we can stage the data "dirty" if we want to, and we often do. The data can essentially be copied as-is from its external dirty location directly into the machine with an nzload to a staging database. From there, we have it in massively parallel form and can use a series of CTAS operations (ELT-style) to cleanse and shape the data. Once we're ready, we can then do a massively parallel join from the incoming table to the target, validating primary and foreign key values in bulk. Then we just copy the good data and we're done. When using Netezza, it is always faster to let the machine do the data cleanup and integration in a massively parallel, set-based operation (even a series of them) than it is to pull the data out, process it in an ETL tool, and put it back. ETL tools, on average, cannot compete with the massively parallel power of Netezza's engine.

 

Let's look at what we accomplished with little effort: (1) We cleansed the data of dirt. (2) In a single, massively parallel join we validated unique constraints. (3) In a single, massively parallel join we validated foreign keys (one join per key). The total time to accomplish the latter two tasks is fractional, often a matter of minutes even on billion-row tables and billion-row loads. The time for the first task is shrunken too, since we can apply our row-level data scrubbing rules in-bulk with sweeping operations rather than row-level operations.
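To make the shape of those joins concrete, here is a small sketch that uses Python's built-in `sqlite3` module purely as a stand-in for the set-based SQL. It is not Netezza, it is not parallel, and the table and column names are hypothetical. It shows one join for the unique-key check, one anti-join for the foreign key, and a final copy of only the good rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer      (customer_id INTEGER);
    CREATE TABLE orders_target (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE orders_stage  (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customer      VALUES (101), (102), (103);
    INSERT INTO orders_target VALUES (9001, 101), (9002, 102);
    INSERT INTO orders_stage  VALUES (9003, 101), (9002, 102),
                                     (9004, 999), (9005, 103);
""")

# Unique-key check: one join from the staged rows to the target.
# (Duplicates inside the staging table itself would need a GROUP BY check.)
pk_violations = con.execute("""
    SELECT s.order_id
    FROM orders_stage s
    JOIN orders_target t ON t.order_id = s.order_id
    ORDER BY s.order_id
""").fetchall()

# Foreign-key check: one anti-join per key.
fk_violations = con.execute("""
    SELECT s.order_id
    FROM orders_stage s
    LEFT JOIN customer c ON c.customer_id = s.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY s.order_id
""").fetchall()

# Copy only the rows that passed both checks.
con.execute("""
    INSERT INTO orders_target
    SELECT s.order_id, s.customer_id
    FROM orders_stage s
    WHERE NOT EXISTS (SELECT 1 FROM orders_target t WHERE t.order_id = s.order_id)
      AND EXISTS (SELECT 1 FROM customer c WHERE c.customer_id = s.customer_id)
""")
loaded = con.execute("SELECT order_id FROM orders_target ORDER BY order_id").fetchall()
print(pk_violations, fk_violations, loaded)
```

Each check is a single sweeping statement over the whole staged set, with no row-level exception handler anywhere in sight.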

 

Case Study Short: Working with a SQL-server based model, the client was loading 15 million records into the database with the bulk loader and the largest machines available. Total time to load - over 2 hours. Tried it again on an Oracle platform, with a top-line 16-core machine with plenty of high-end disk space. This operation took 30 minutes. This was attempted on a Netezza platform, same data, same volume, and it took 15 seconds. There is a contrast, but not a comparison. Nothing adequately compares to a 15-second data load.
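For a sense of scale, the arithmetic behind those three timings can be sketched in a few lines of Python. The only inputs are the figures quoted above. Since the SQL Server load took "over 2 hours", its rate here is an upper bound.

```python
ROWS = 15_000_000

# Elapsed load times from the case study, in seconds.
elapsed_seconds = {
    "SQL Server": 2 * 60 * 60,   # "over 2 hours", so at least this long
    "Oracle": 30 * 60,
    "Netezza": 15,
}

# Rows loaded per second on each platform.
rates = {platform: ROWS / seconds for platform, seconds in elapsed_seconds.items()}

for platform, rate in rates.items():
    speedup = elapsed_seconds[platform] / elapsed_seconds["Netezza"]
    print(f"{platform}: about {rate:,.0f} rows/sec; Netezza is {speedup:,.0f}x faster")
```

The same division works in reverse for a newbie doing a small-scale experiment: measure rows per second on a modest sample, then divide the full row count by that rate to extrapolate the time-to-load.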

 

The important takeaway is this: If I can load the data in 15 seconds, I have a luxury of time to perform internal ELT, data scrubbing and integration, key checking and the like in a matter of minutes, still ensconcing the data into the final target table before the other two databases even get started. More importantly, I did it without standing up a formal external ETL tool. All of it happened "under the air" of the Netezza machine.

 

Now an interesting exercise for the new Enzee would be to actually walk through the processes noted above. In a problem-solving series of exercises, they should get some data that has embedded constraint violations, then attempt to load the data to an RDBMS with transactional constraints turned on, then turn on the creative juices to see how it can be done more efficiently. I would not suggest loading millions of rows to an RDBMS for this exercise, since they are so inefficient at this. Try it with a smaller row-count and then extrapolate the necessary time-to-load. What they will discover is that they will find themselves slowly backing out their precious transactional exception handling to fix the problem another way. The faster they get, the more the chosen path will start looking very lean on RDBMS capabilities.

 

The final form of their solution, they will find, is supported de-facto and in massively parallel inside the Netezza appliance at no additional charge. In the end, they will see why Enzees have run, not walked to a Netezza platform for just this kind of capability. We know they have made the transition when we can hear them having a conversation with another newbie about transactional versus bulk processing, and they are coaching the newbie away from the transactional model. Ahh, a beautiful thing, indeed.

 

This is why Netezza is in no way, no how a transactional machine, and why it doesn't enforce primary and foreign key constraints. These can be installed as metadata, but the expectation is that they will be used by an external, intelligent operation that will leverage them for administrative key validation - in bulk. After all, I can read the key metadata from the Netezza catalog, formulate a series of validation operations that will work for any table, any key, any time. Install it as a stored proc and invoke it when necessary. This allows me to set up the load operations and prepare the final copy to the target (which is often the accumulation of dozens of operations to integrate the data into a common pre-target table). Then validate the data just before it is finally copied to the target. This keeps me from having to do it a record-at-a-time, or to have an exception processor accidentally execute the operation before I am completely finished formulating the data for the load.
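Here is a hedged sketch of that "any table, any key" idea, written in Python rather than as a stored proc. The key metadata is hard-coded in a hypothetical `key_metadata` dictionary. How it would actually be read from the Netezza catalog is not shown, and the generated SQL is plain, generic SQL rather than anything Netezza-specific.

```python
def duplicate_key_sql(table, key_cols):
    """Build one bulk query that lists key values appearing more than once."""
    cols = ", ".join(key_cols)
    return (f"SELECT {cols}, COUNT(*) AS dup_count FROM {table} "
            f"GROUP BY {cols} HAVING COUNT(*) > 1")

def orphan_key_sql(child, child_cols, parent, parent_cols):
    """Build one bulk anti-join that lists child keys with no parent row."""
    on = " AND ".join(f"p.{pc} = c.{cc}" for cc, pc in zip(child_cols, parent_cols))
    cols = ", ".join(f"c.{cc}" for cc in child_cols)
    return (f"SELECT {cols} FROM {child} c LEFT JOIN {parent} p ON {on} "
            f"WHERE p.{parent_cols[0]} IS NULL")

# Hypothetical metadata, standing in for key definitions read from a catalog.
key_metadata = {
    "table": "orders_pretarget",
    "primary_key": ["order_id"],
    "foreign_keys": [
        {"columns": ["customer_id"], "parent": "customer", "parent_columns": ["customer_id"]},
    ],
}

checks = [duplicate_key_sql(key_metadata["table"], key_metadata["primary_key"])]
for fk in key_metadata["foreign_keys"]:
    checks.append(orphan_key_sql(key_metadata["table"], fk["columns"],
                                 fk["parent"], fk["parent_columns"]))

for sql in checks:
    print(sql)
```

Because the statements are generated from metadata, the same routine can be pointed at any pre-target table just before the final copy. Any check that returns rows is the bulk exception list.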

 

Row-level exception handling is a beautiful thing - transactionally. If the domain where the exception must be fixed is already the data processing zone, we need to proactively embrace this responsibility and just do it. In the end, row-level exception handling has to be completely removed from our thinking processes. We need to invoke sweeping operations that capture the exceptions in-bulk, not a row-at-a-time. Fix and integrate them in bulk, not a row-at-a-time. Bulk is the name of the game, and always has been.


Many years ago someone impressed upon me the need for simplicity in matters of scale. The problem of course, is that simplicity is impossible without power. This is why we see secondhand RDBMS environments proliferate their complexity into a functionally catatonic state, ultimately calling for its wholesale replacement. And upon the call, others swoop in to save the day, saying that they can "replicate your functionality" in another stronger environment that is geared for high complexity, without even once looking askance at the complexity's necessity. By that I mean, the complexity arrived as a function of an underpowered environment, with sweat-labor from engineers working diligently to prop up a fading machine. It means all that additional power-propping is artificial, and we need to regard it as a necessary evil. If we'd had the power way back when, the complexity never would have arisen in the first place. So the complexity is a symptom, not an attribute.

 

If we have the power, it gives us the capability to implement functionally sophisticated solutions with ease of maintenance and operation. Functional sophistication is key, because we don't want to dumb-down the functionality just because we don't have enough hardware. Yet the people responsible for buying us more hardware look at us warily, wondering "You know, I signed off on the last hardware purchase thinking it would be my last hardware purchase. Yet now we are already out of gas and it seems like we didn't get the return-on-investment for the last purchase." Ahh, but wait, says the engineer, the systems are voracious and so are the users. Adding more hardware is the only way to stay ahead of them. To which the purchaser objects "How do I know that you are making the most efficient use of the hardware you already have? Can you tell me that you haven't tried to optimize the environment?" To which the engineer skulks away, formulates a plan and spends the next six months wrapping the solution in an engineered cocoon of complexity that offers marginal boost, but a boost nevertheless. The purchaser feels vindicated. "I'll stand my ground next time, and they'll have to go through another optimization! Aha! The key is to get these deadbeat engineers to do their job, and engineer!"

 

And so at the end of this bitter and tumultuous cycle, we have an over-engineered, underpowered machine that nobody is happy with except for the purchaser, who held out until the very end. Often the purchaser is overridden by a super-purchaser, like the CTO or CEO, who finally releases the funds, offers the directives, and the purchaser dutifully though reluctantly complies, certain in his heart that the engineers have at least one more optimization cycle left in them. If you've never been the recipient or the participant in such a repeatable and pervasive cycle, what a blessing to be in the food service or housekeeping industries in these perilous times!

 

In one case, I told a senior leader point-blank that the reason his secondhand RDBMS system was running out of gas, was the proliferation of cursor-based stored procedures draining the lifeblood from the box. He was stunned at this assertion, because he'd been assured by his implementers that this was the right thing to do. The implementers in the room objected by asking if I was "actually suggesting" that they pull the data into a third-party tool, process it, and put it back. Such phrases as "are you telling me", "seriously" and "everybody knows that" were the common prefixes of all their objections. Oookay-fine. This doesn't detract one iota from the simple fact that the secondhand RDBMS doesn't process bulk data well, and it doesn't really support third-party environments that require it to (you know, bulk loading and extract). The whole bulk-loading and extract domain is something that secondhand RDBMS systems provide their own utility for, and dogpile one caveat after another against its regular use for operational, lights-out data processing. This is because, as Rule #10 tells us, secondhand RDBMS systems are lousy at row-by-row processing. This has always been true.
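For readers who have not lived through the cursor-versus-set debate, here is a toy contrast. It uses Python's built-in `sqlite3` module only to show the two shapes of code, with a hypothetical `sales` table and a made-up doubling rule. It makes no claim about how any particular RDBMS, or Netezza, would time them.

```python
import sqlite3

def make_db():
    """Create a small, hypothetical sales table in memory."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount INTEGER, doubled INTEGER)")
    con.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)",
                    [(i, i) for i in range(1, 1001)])
    return con

def cursor_style(con):
    """Row-by-row: fetch each row, compute in the loop, issue one UPDATE per row."""
    rows = con.execute("SELECT id, amount FROM sales").fetchall()
    for row_id, amount in rows:
        con.execute("UPDATE sales SET doubled = ? WHERE id = ?", (amount * 2, row_id))

def set_based(con):
    """Set-based: one sweeping statement covers every row."""
    con.execute("UPDATE sales SET doubled = amount * 2")

db_a, db_b = make_db(), make_db()
cursor_style(db_a)
set_based(db_b)

result_a = db_a.execute("SELECT id, doubled FROM sales ORDER BY id").fetchall()
result_b = db_b.execute("SELECT id, doubled FROM sales ORDER BY id").fetchall()
print("same result:", result_a == result_b)
```

Both paths land on the same data. The first issues a thousand statements to get there and the second issues one, and that is the whole argument in miniature.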

 

And think from the perspective of a newbie. Imagine being introduced now to a Netezza machine where it has been specifically purpose-built to support all the things that the secondhand RDBMS's treat as - well - secondhand? Our newbies come to their cubicle with their hard-won tales of derring-do and the many dragons they have captured and thrown down. They drag their canvas bags filled with dragon-fighting gear, empty it on the floor and we marvel at all the tools we once used to be effective. Ahh, the aroma of the dragon's blood wafts from the antiquated instruments of war, reminding us of days gone by, and all that bygone stuff, too.  It's hard to immediately schedule their pickup for the Warehousing Museum on the 7th floor, where all of our stuff is on display and gathering dust. Instead we scratch the Museum's number on the back of a business card and hand it to the warrior for later. Still, all of those weapons and their attendant experience are very valuable, but not in the way one might think.

 

What's a seasoned warrior to do when the machine can devour a data warehouse dragon like a tree shredder, digest it, repackage it and dungeon-ize it for all eternity, and do it as an appliance? Almost like pressing the button on a toaster, but not quite. Our hero rolls up his sleeves and starts slinging row-by-row, cursor-based stuff so prevalent in secondhand implementations. Upon first execution, it runs worse than a dog. It runs like a wet dog. And doesn't smell any better either. The hero scratches his head and tries again. Everything the hero knows about data processing just went "counter-intuitive". The hero mutters "I don't get it". Ahh, but the hero will get it, because the hero is emotionally engaged.

 

This emotional engagement is something that we, as seasoned Netezza folk, should leverage to help our newbies make the necessary shift in their approach. They really need to take ownership of the knowledge, but sometimes (I've been told) it feels a lot like pushing a string (for the seasoned people) and a lot like playing some kind of strange board-game for the newbie. To avoid mental pain and injury, and heavy-lifting on the part of the seasoned people (and the attendant frustration from helping a newbie blossom) the seasoned pro only need to remember the power of the emotional engagement. The newbie will want to conquer the machine. Harness it for the good of all mankind. They will brandish their blades with the familiar shhhhiiiiiinnnngg as it leaves the sheath. But there's only one hero in the room. The big black box. The warrior's emotions are drawing him/her inexorably closer to the final conclusion - the machine works for us. Really works for us. Unlike the secondhand RDBMS machines that once enslaved us. Now we are the master. The new hero does our bidding, and does it well.

 

So our warrior, after many days of working with the machine, finds himself no longer using his old gear. No longer using his old ways. In fact, now stumbling over the gear and getting his feet tangled in the pull-strings on the canvas bag. One day he will pull out the number we gave him for the Data Warehouse Museum on the 7th floor and call for a pickup. On that day, take the warrior to lunch. The transition is complete.

 

What has the warrior embraced - that complexity is no longer the key to winning. We don't have to "do it the old way" and we don't have to figure out a way to shoehorn our former implementations into the new domain. Those implementations were necessary evils, using technology that wasn't geared to support them, rigged for an outcome on an underpowered and overwhelmed environment. When moving to the new environment (whether it's a newbie learning the ropes or a large-scale migration of a secondhand RDBMS into the big black box) we don't take the prior implementation with us. We don't take the prior table structures, cursor-based processes, or anything else about the original implementations. If artificial complexity means that even part of the implementation is suspect, then all of the implementation is suspect. We have a new way of doing things, a new machine to do them on, and the outcome will be so much better if we now do things the way the machine best performs, rather than doing things the way some other secondhand machine performs poorly.

 

And this is the difference between complexity and simplicity. In an underpowered secondhand RDBMS, the complexity props up the weakness of the technology and is a symptom of weakness in every way imaginable. Simplicity on the other hand, is the mark of strength of the technology and its ease of implementation and maintenance.

 

For a (human) hero, complexity is the reason for existence and simplicity is for the simple-minded peasant. After all, the peasants do the (back labor) work of the field and save the thought labor for the feudal lord and his lackeys. But one can see the parallel immediately - the secondhand technology is actually a feudal lord, its high-priced product engineers are its lackeys, and we, my friend are its indentured servants. Quite the converse for Netezza folk - we are the feudal lord, requiring no lackeys, and the machine is our indentured servant. The machine works for us.

 

I cannot count how many times I've been approached (even in the hallway of a Netezza conference!) by people asking me about re-engineering their existing systems for Netezza. No, my friend, we don't re-engineer, we de-engineer. There's a huge difference.

 

What does simplification buy us? We can now move away from the raw complexity required to prop up a lack of performance, and move toward the sophistication required of a competitive solution. The sophistication is key, because anyone can build a simple system for simple business reasons. But when the complexity of the environment hinders us from the next level of functional sophistication, the complexity has now enslaved our business model, and effectively the business itself. Simplified implementations are stable and adaptable, so can scale to breathtaking heights, giving us the  necessary edge for competitive, sophisticated offerings that aren't even possible in the secondhand technologies.


Manhattan Skylines

Posted by David Birmingham Mar 4, 2010

Marcus Gray watched in consternation as the viral program cranked up. He knew that in moments the band of hackers would once again take over the Manhattan power grid. For now, they were doing it as a prank. But he also realized it could be a test run for something even bigger. Like a grid-by-grid shutdown of the entire system, opening the door for untold mayhem on the darkened streets.

 

Moments later, messages from the hacker gang started appearing on all their terminals. Taunting barbs letting everyone know that they were in complete control and nobody could stop them. Gray shook his head and closed his eyes, hoping that this would pass quickly. Losing power even in one part of the grid could spell pandemonium and place lives and fortunes at risk. The weight on his shoulders was crushing.

 

"I think I can help," said a voice from behind. Lane McBride from the Federal Counter-Terrorism Unit based in Manhattan, leaned over to regard Gray's terminal.

 

Gray turned to the voice, recognizing it with hope in his eyes, and said, "They're at it again."

 

"I saw the precursors," McBride noted, "That they were entering the system."

 

"Yeah, but it doesn't matter if we can't find exactly where they are," Gray sighed, shaking his head, "They're in a hundred different buildings, including the Empire State. You guys have agents standing by at all of them, but they have to search the buildings floor-by-floor to find them. The problem is, we have to shut down communications for the building so that they can't warn each other. So even if we could catch a few, do you have any idea how long a floor-to-floor search takes in the Empire State? We can't keep that building offline from communication for that long."

 

"Not to worry," McBride grinned, "I have an algorithm that will directly pinpoint their floors. All we have to do is send our officers up to the floor, and I bet we can round them up in minutes."

 

"Wow," Gray whistled, "I'd like to see that."

 

McBride whipped out a flash stick, plugged it in and let the program do its work. Within seconds, it had pinpointed each hacker, the building their signal was coming from and the floor of the building. "Here we go."

 

"I like it," Gray grinned.

 

McBride touched several buttons on his phone and dispatched the information, and monitored as each of the officers acknowledged the information and the plan. "We'll know soon enough."

 

Gray noted, "The problem has always been that they could hear us coming and could shift floors anytime they wanted."

 

"Not this time," McBride smirked, "At least, not if we do it right."

 

The first officer to report back was from the Empire State. Two of the hackers had been stationed there on separate floors. Both were now in custody and unable to warn their cohorts in the other buildings. Gray listened in awe as one by one, the officers reported in, having captured their respective quarries with minimal effort.

 

"That was brilliant," Gray stared at the screen as the weight seemed to lift from his shoulders, "How did you come up with the algorithm?"

 

"Simple process of elimination. I just looked at the problem from a very-large-scale search. The most important information is where the perps aren't - not where they are. The algorithm zones in on the candidate floor by understanding which floors are not candidates. Process of elimination leads the way. So we can search the Empire State and Chrysler buildings just as quickly as a single-story, capture the floor number and we're done."


---------------------

Some of you already see the parallels. It's how a zone map works. But how does it apply?

 

When we take a look at the Record Distribution option in the Netezza Admin GUI, we're often happy with a "ragged edge" for all the SPUs. And a "flat top" is the ticket. But what about the case of a "Manhattan Skyline", where we have high peaks and low valleys? This is higher than normal skew (something we're supposed to avoid, right?) People see those and shun them. However, these are often the natural result of an intermediate table produced by an ELT operation, and often a result of multi-pass queries in a BI tool. These usually leverage the mainstay workhorse CTAS (Create-Table-As-Select), so in many cases, people are tempted to turn on "random" for all CTAS operations. Or just maybe - one of our regular static supporting tables is deliberately distributed as a Manhattan Skyline just because we want to regularly perform co-located joins with it on a larger master table using the same distribution key.

 

In any case, a primary reason we would get this kind of Manhattan Skyline distribution is if we are trying to preserve an existing distribution in order to perform a follow-on operation with tables on the same distribution. Whew! And why would we allow this to continue? Isn't a random distribution better than a Manhattan Skyline? Our problem remains: if the table has such a Manhattan Skyline distribution, we have higher than normal skew. Any full-scan on the table will cause the query to perform as slow as the "tallest bar"  (the SPU with too much of the table's data). As the table grows in size, the problem worsens. It is not a scalable distribution in its latent form, so don't embrace one without a plan.
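To put a number on the "tallest bar", here is a toy model in Python. It assumes a hypothetical 8-SPU machine and uses a simple modulo as a stand-in for the real hash (Netezza's actual hashing is different), so treat it as a picture of skew, not of the appliance.

```python
from collections import Counter

N_SPUS = 8  # hypothetical SPU count

def distribute(keys, n_spus=N_SPUS):
    """Count rows per SPU, using key % n_spus as a stand-in hash."""
    counts = Counter(key % n_spus for key in keys)
    return [counts.get(spu, 0) for spu in range(n_spus)]

def skew(rows_per_spu):
    """Tallest bar divided by the average bar; 1.0 is a perfect flat top."""
    average = sum(rows_per_spu) / len(rows_per_spu)
    return max(rows_per_spu) / average

# Flat top: 8,000 rows with evenly spread key values.
flat = distribute(range(8000))

# Manhattan Skyline: 5,000 rows share one key value, 3,000 are spread evenly.
skyline = distribute([3] * 5000 + list(range(3000)))

print("flat top:", flat, "skew", skew(flat))
print("skyline: ", skyline, "skew", skew(skyline))
```

Both tables hold the same 8,000 rows, but in the skyline case one SPU carries 5,375 of them. A full scan waits on that SPU, so in this toy model it behaves like a scan of a table more than five times its size.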

 

Well, random distribution has a risk too, especially at the BI level, of negatively affecting concurrency performance. Even if our individual queries are not hindered by the data-broadcast incurred by the random distribution, they could just be a one-hit-wonder, because running many of these operations side-by-side can sometimes saturate the inter-SPU fabric, affecting concurrency. If we can keep the processing on the SPUs, we can avoid this problem entirely. So the issue is one of user scalability, something that all of us care about and that the other vendors (sometimes) turn a blind eye to. Netezza has it covered, and as usual, it's so simple a cave man could do it (now I'll get mail!)

 

So now we have two options, neither of which seem good - (a) keep the Manhattan Skyline distribution or (b) use a random one. Let me say that random is not always bad, but it poses a potential danger for concurrency. Likewise the Manhattan Skyline can often be a latent result of an intermediate CTAS so is unavoidable anyhow. And why would we want to preserve an existing distribution on a CTAS? The answer - because it will be a co-located write (blazingly fast). But wait! Don't we get a co-located write by default?

 

Maybe.

 

I have noted in prior posts how the default distribution for a CTAS might not be what we want or expected, so here's a quick recap:

 

(a) For simple single-table CTAS, it will preserve the source distribution key - (co-located write)

(b) For simple multi-table-join CTAS, it will leverage the first column result in the "select" clause (maybe a co-located write)

(c) For CTAS using summaries/group functions in the select, it will leverage the columns in the "group-by" clause (rarely a co-located write)

 

If any of the above are not the original distribution of the source(s), we could inadvertently sacrifice our co-located write. But we can preserve it if we specifically use "distribute on" with the CTAS execution. With co-located writes, this means the data never leaves the SPUs. If we distribute the CTAS on anything else, the data must leave its current SPU and find its way to another one. This initiates a data broadcast (and can negatively affect concurrency). Preserving the distribution, we get the benefit of a co-located write (avoiding broadcast to make the table) and set up the next operation for a co-located read (also avoiding the broadcast to leverage the table). Short answer: preserving the distribution preserves concurrency performance. Now the SPUs are working for us at physics-speed.
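As a sketch of what "specifically use distribute on" can look like when the CTAS is generated by a script, here is a small Python helper that always names the distribution key. The table and column names are hypothetical, and the placement of the clause after the select reflects my understanding of Netezza's CTAS syntax, so check it against the SQL reference for your release.

```python
def build_ctas(target, select_sql, distribute_on):
    """Return a CTAS statement that names its distribution key explicitly."""
    if not distribute_on:
        raise ValueError("name at least one distribution column")
    body = select_sql.strip().rstrip(";")
    keys = ", ".join(distribute_on)
    # Assumption: the DISTRIBUTE ON clause follows the select in a Netezza CTAS.
    return f"CREATE TABLE {target} AS\n{body}\nDISTRIBUTE ON ({keys});"

# Hypothetical case: 'orders' is distributed on customer_id. Grouping by two
# columns would otherwise hand the default distribution to the group-by
# columns, per rule (c) above.
stmt = build_ctas(
    "orders_daily_by_customer",
    "SELECT o.customer_id, o.order_date, SUM(o.amount) AS total_amount\n"
    "FROM orders o\n"
    "GROUP BY o.customer_id, o.order_date",
    ["customer_id"],
)
print(stmt)
```

Routing every generated CTAS through one helper like this means the distribution is always a stated decision rather than a latent side effect of how the select happened to be written.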

 

Rather than just live with the latent effects, let's embrace and harness them for the good of all mankind. Well - er - at least for our user base.

 

What we really want is threefold -

 

(1) preserve the distribution with a co-located write (preserve concurrency, potential Manhattan Skyline as latent artifact)
(2) leverage the result with a co-located read (preserve concurrency, potential penalty from Manhattan Skyline)
(3) mitigate the Manhattan Skyline with a zone map (ahh, best of all worlds)

 

So to get the first two, we can simply preserve the distribution with a "distribute on (key)" clause and make sure the distribution key is part of the "where/join" operations. This is the simple part.

 

To get the third, we need to either (a) sort the data as it is created, or (b) make a materialized view after-the-fact to get the zone map effect for selected columns. The first one (sorting) is often easier than it sounds, and with strongly filtered intermediate tables is also very scalable. The second one (materialized view) has some caveats but is very fast to create. What does the zone map actually do? It effectively stripes each SPU's portion of the table so that only the section in the zone is actually addressed. Like McBride's algorithm, it's as though the rest of the data isn't even there, because the zone map has guided the optimizer to completely ignore it. So whether the SPU's data has a tall bar or a short bar, the performance is the same. We need all three of the above and the zone map mitigates the potential problem of unexpectedly high skew from an intermediate distribution - or an outlier table that we need to distribute on a common key. Even if (1) and (2) above give us a good distribution today, it could always "go Manhattan" in the future.
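Here is a toy model of that process of elimination, in Python. It assumes a zone map is simply a (min, max) pair kept for each fixed-size block of rows on one SPU. The block size of 100 rows and the data are hypothetical, and the real machine's storage details are not modeled.

```python
EXTENT_ROWS = 100  # hypothetical rows per block

def build_zone_map(values, extent_rows=EXTENT_ROWS):
    """Record the (min, max) of the column for each block of rows."""
    blocks = (values[i:i + extent_rows] for i in range(0, len(values), extent_rows))
    return [(min(block), max(block)) for block in blocks]

def extents_to_scan(zone_map, low, high):
    """Keep only blocks whose range could contain low..high; skip the rest."""
    return [i for i, (zmin, zmax) in enumerate(zone_map)
            if zmax >= low and zmin <= high]

# 10,000 distinct values in scrambled order (7919 is coprime with 10,000,
# so this is a permutation of 0..9999), and the same values sorted.
scrambled = [(i * 7919) % 10_000 for i in range(10_000)]
ordered = sorted(scrambled)

low, high = 4000, 4100
scan_scrambled = extents_to_scan(build_zone_map(scrambled), low, high)
scan_ordered = extents_to_scan(build_zone_map(ordered), low, high)

print("scrambled:", len(scan_scrambled), "of 100 blocks scanned")
print("ordered:  ", len(scan_ordered), "of 100 blocks scanned")
```

With the scrambled data every block's range overlaps the filter, so nothing can be eliminated. With the sorted data only two blocks survive, which is why the "order by" on the CTAS matters: the zone map can only rule out the floors where the perps aren't if the data is arranged so that most floors are obviously empty.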

 

Another obvious question is "If this is an intermediate result, why bother? Just filter out the stuff I don't want and then there's no issue, right?" Well, technically yes, for a single operation, but I know of at least a dozen cases where the intermediate table is used for a lot of downstream activity, not just a one-off throwaway. So our stewardship rule is: make the data better. For the next downstream process or the ultimate data consumer, the data should get better every time we touch it.

 

Rather than rewrite or re-design a carefully tested and detailed process, adding a simple "order by" or MV is easy and preserves the existing logic, and data model, with little impact and high return. This is especially true of a static supporting table, because we can install what we need on the table's creation. The consuming processes all benefit from it with no more than regular query execution (materialized views are transparent).

 

In the end, we can still leverage the plain-vanilla parts of the Netezza performance model (zone maps, co-location) without having to over-engineer the data using indexes, intersection tables or summaries. This preserves something more  - the ongoing resilience and adaptability of the model itself.

 

Recap:

 

  • Apply the "distribute on" clause of the CTAS to avoid the latent effect of default distribution.
  • Preserve co-location for reads and writes in intermediate tables.
  • If a potential Manhattan Skyline distribution is the CTAS result, rather than go random, sort the CTAS result by a selected column or use a materialized view.
  • As always, apply strong filters to the CTAS creation so that it's not simply copying one table's contents to another (carve the data out).
  • Experiment for the best fit, but remember that Netezza is an appliance.
  • We don't need to engineer the queries, only apply simple performance model alignments in the data itself, to leverage the machine's physics

"You didn't kill it," fumed the customer, "You said you would kill it."

 

"We've had some, er, labor setbacks," said Bjorn, head of DragonSlayers Inc, a startup boutique firm from several valleys away.

 

"I don't see an excuse clause in the contract," the customer shot back, "Kill the dragon or we're done."

 

"The dragon can't be killed," said a rich Scottish voice striding up to meet them.

 

Bjorn recognized the stealthy character, by name of Connery, from the Information Superhighway Roadside Assistance Service.

 

"I didn't realize that RAS was in the area," Bjorn quipped, offering a hand to Connery.

 

Connery grasped Bjorn's hand and shook it once, "We're all over. Been doing a little cleanup of this or that."

 

"What's this about the dragon," asked the customer, "That it can't be killed?"

 

"Of course not," Connery smiled, "It's a dragon. It's immortal."

 

"Did you know this?" the customer glared at Bjorn, "Have you been stringing us along?"

 

"No," Bjorn defended, "We kill dragons. It's what we do."

 

"Well," Connery chuckled, "Not real dragons, anyhow."

 

The customer's lackey approached them with a small flagon of tea, poured a stein for each of them, and departed.

 

"The dragon is immortal," Connery muttered, sipping his tea.

 

"That's impossible," Bjorn said through a long gasp, "We've killed dragons before - we "

 

"But of course you have," Connery smiled dismissively, drawing another casual sip.

 

Bjorn stared at him, unable to form another word.

 

"If the dragon can't be killed," asked the customer, "Then what?"

 

"In the nether worlds, beyond the mapped regions, you'll see the little notation 'There Be Dragons'," Connery said softly, "And whether there be dragons or not, it's uncharted territory. Places no man has ventured, but rest assured danger lurks. Unknown to the uninitiated."

 

"So you know what lies in the uncharted territories?" Bjorn sneered.

 

"It's why I'm a guide and you're a dragonslayer," Connery huffed, "Whether you know your way or not, dragon chow comes in many shapes and sizes," he put his hands up as if to size-up Bjorn, "Many shapes and sizes."


"Funny," Bjorn quipped, but it clearly wasn't funny, "All we have to do is get close enough."


"Reminds me of a time," Connery said wistfully, "Once I knew a man who you could skewer a hundred times and he'd still get right back up."

 

"Ahh, the Highlander," said Bjorn, "I've heard of him."

 

"Well, he never lost his head," Connery huffed, "Or that would've been the end of him."

 

"What are you saying?"

 

"The treacheries of the lands beyond are many. You have to keep your wits about you. Keep your head."

 

"Keep my head, got it," Bjorn said sarcastically, "Anything else?"

 

"You need to deal with the whole dragon," Connery advised, "Not just the part you wrap with that silly leash. It won't hold the dragon. Only a dungeon will."

 

"So we need an enchanter?" Bjorn smirked.

 

"In no uncertain terms," Connery said, laughing, "You have a go at that dragon on your own. Go in there with no more than an enchanter's bag of tricks, and he'll make an ash out of you!"

 

Bjorn gulped, "We'll see about that!"

 

One of the lackeys turned to the other and chortled, "He thinks he's James Bond!"

 

"What do you know about it?" Connery shot back with piercing eyes, "The dragon sends your consultants to the street and you send the dragon to the morgue. Is that how it's done in data warehousing?"


"Basically, yes," snickered a lackey.


Connery whirled, "No morgue will hold him." He turned to the customer and glared hotly, "What are you prepared to do?"


"Sign the contract," said the customer, quickly applying a signature. He stuffed the papers into Connery's hands and hastily departed, leaving the men to set sail and dispatch the dragon as soon as possible.


The boat ride to the dragon's coast was uneventful until the boat ran aground near the shore, screeching loudly against the rocks as its keel protested with a deep, guttural groan.

 

"That noise will stir the dragon," Bjorn bemoaned. He'd hoped for a more stealthy entrance.

 

"Hopefully only stirred," Connery quipped as he snatched up his bag, "Not shaken. Won't do to have him awake when we approach, right?"

 

"Coastline is enormous," Bjorn complained, "How will we ever pinpoint his location?"

 

"To find the dragon, you'll need to think his thoughts. Know your adversary. Know his heart."

 

"Yeah, Dragonheart," chuckled a lackey, "Seen the movie."

 

Connery ignored him and leapt from the boat onto the dry shore. "Welcome to the Rock," and then looked out over the vast, scorched wasteland, a product of the dragon's handiwork. He led the team up the rocky slope to the first rise, whipped out his spyglass and waved his hand to the others to belay their ascent.

 

"What's he doing?" asked one lackey to another.

 

"Lookin' around, I guess," snickered one, "Guess nobody told him that the dragon sleeps all day."

 

"What was that?" Connery whispered loudly enough for them to hear, "You think the dragon sleeps all day? Who are you kidding? Maybe you only struggle with him in his lair at night, but he breathes fire all day long. He never sleeps. He never dies."

 

"Where did we find this kook?" asked another, "He's as nutty as a fruitcake."

 

"He'll eat you alive," Connery sneered, trying to spot motion anywhere along the landscape before proceeding. In the distance, a dank mist arose from the ground near some caves. Connery zoomed in and spied dragon scales littering the ground. "Let's go."


The team made the tedious crossing without incident, until they stood before the open, reeking maw of the dragon's lair.

 

"Who wants to go first?" Connery chuckled.

 

"I will," said a lackey fearlessly, "I've taken down enough of these."

 

"But of course you have," Connery strode to the nearest large boulder while the others scattered for cover. After several tedious minutes, all of them could now feel the impact tremors shaking the ground, growing in intensity as the beast ascended from his lair to the cave's mouth.


Then the horns appeared, fifty feet from point-to-point as they slowly rose from the hole. Then the head, larger than a common city bus and almost twice as long. The dragon stared down the lackey for a long moment, then continued to ascend from the hole, growing larger and more hideous with each passing second until his entire upper body was revealed, from his head down to his midsection, standing over ten stories tall. He burst-extended his massive wing membranes with a loud, deafening snap, and then pointed his head straight up to gather a deep breath of air.

 

Connery reached down to pick up one of the many dragon scales scattered all over the ground. Five inches across and eight inches long, made of the most impervious stuff on earth. He flipped it over and shuddered to realize the dragon's age, betrayed in the scale's growth rings. Four thousand years, this animal had been eating and breathing fire.


The young lackey had forgotten to breathe. This dragon was orders-of-magnitude larger than any dragon he'd ever dealt with. In fact, the sheer scale of the dragon made him feel light-headed. Gathering his presence of mind, he took a defiant stance and shouted, "Begone, Dragon!"

 

Connery turned away, trying to hold back a snicker that could reveal his location to the dragon's attuned senses.

 

The dragon pointed his nose straight down, cocked his head to the side, opened his mouth and released his breath. The column of high-intensity chemical fire blasted downward on the lackey, instantly reducing him to ash and causing the rocks all around where he'd stood to glow and almost melt.

 

Connery glanced over to the rest of the team, cowering behind the rocks in hiding, not believing that the dragon was so huge and powerful, and feeling completely beyond their depth. They stared, partly in awe and partly in concern, as Connery stepped out from behind his hiding place and boldly strode up to the dragon's cave.

 

The dragon once again drew breath into his nostrils to recharge his furnace, when Connery simply placed his hands behind his back and stared deeply into the dragon's eyes.

 

The dragon stared back, unable to comprehend the feeling of drowsiness suddenly overtaking him. He slowly lowered his head, then his body, down to the ground to lie gently next to Connery, unable to break his eyes away from Connery's deep, mesmerizing gaze.

 

Once completely settled, Connery reached out to tap the dragon's front jawbone as it drifted off to sleep, "There now," Connery said soothingly, "That's a good lad."

 

"How is this possible?" Bjorn gasped, stunned at how easily Connery had mastered the beast.

 

"Your friend told the dragon to leave," Connery huffed, "But the dragon isn't going anywhere. He lives here and you people don't. In fact, he's been around so long, and you people come and go so often, that he sees you as decorations, not even permanent fixtures in his home."

 

"But he just laid his head down and went to sleep," Bjorn noted, "How did you do it?"


"The dragon serves me," Connery said slowly, "Not the other way around. If the dragon needs to breathe fire, it's because we've not done a good job harnessing the dragon, not just because the dragon is mean."

 

"So dragons aren't mean?"

 

"Oh, they're born mean," Connery chuckled, "And they bite. Whom they bite and when, is ours to control. That's why we have dungeons. Places where the dragon will survive but under our control. Think about putting that dragon's breath to work in boiling water, making steam to run a turbine. Now the dragon is working for us."

 

"Can't be a happy existence for him."

 

"Happy? Perhaps not. Necessary? Most definitely. You came here to kill him or banish him. He knows his place. He only responds to someone who knows it as well as he does."

 

"You're an enchanter, aren't you?" Bjorn said, realizing Connery's identity.

 

"Some call me, Tim."


Many of those who integrate the mainstream BI tools with various underlying data sources find subtle nuances, not the least of which is how the database will respond to the queries presented. In Netezza data access especially, the power is not found in the query, but in the hardware. We can certainly degrade our experience with bad queries, but we would not tune queries in the same manner as with an SMP/RDBMS.

 

For example, I've watched RDBMS engineers work black-magic with a query by simply rearranging this-or-that in the monolithic query to provide boosts in the orders-of-magnitude. This is because the query is being used to guide the general-purpose physics. In Netezza, however, the purpose-driven physics snips the query apart. The physics then guides the query's mechanics. I've watched newbie Netezza folks nearly pull their hair out - and their eyelashes too! - when trying to "make the machine do what I want". Hmm, no, the machine does what it does. It's an appliance. We get what we want when we conform the data to the physics. The query is just along for the ride.

 

How does all this apply to multi-pass SQL in a BI Tool? Well, most BI tools come to the table with a pre-conceived notion that all databases are created equal. Unless they have specific VLDB hooks, and unless those hooks fully embrace VLDB principles, the BI tool will not experience the expected lift and we'll likely have to help it out. In fact, little about a BI tool is purpose-built in regards to its data source. It regards data sources as general purpose interfaces so it can be as vendor-neutral as possible.


Unlike those in a standard star-schema, many VLDB tables are fact-sized tables containing billions of rows, as are their dimensional counterparts. So a single one-shot query will sometimes provide the functional answer but with unacceptable performance. Many of us have seen multi-page (hey, 100+ page) queries that try to do everything in one shot. The average RDBMS leaves us few options. The VLDB, and especially Netezza, is not so constrained. We can make multiple passes on the data, often with little penalty. The danger here is in the inefficiency of the passes, not whether multi-pass is okay. Multi-pass, or more appropriately multi-stage SQL, is a necessary approach with large-scale tables. Netezza makes it simple and fast, using built-in concepts of its performance model.

 

Here is a spot case-study - a BI tool needed to access several tables that were each in the many billions of records. The end result was a summary of user-selected values. The temp-table creation here is done automatically by the BI-Tool, so we may have limited options in getting it to shape them as needed. In the examples below, I'll label the queries so we can reference them later.


A typical BI tool, upon realizing it needs a summary, will often divide the answer into multiple stages of work. Each stage will store its result in a temporary table using a CTAS, leveraged in one or more following passes. Unfortunately these passes are sometimes inefficient. Consider the case below (this is pseudo-SQL, so bear with me here):


(1a) create table t1 as select region_id, district_id, store_id, sum(transaction_amt) sumtran, sum(transaction_tax) sumtax from transactions where district_id=4 group by region_id, district_id, store_id;  (1 million records)

(1b) create table t2 as select em.employee_id, em.employee_name, em.store_id from employee_master em, employee_lookup el where em.store_id=6 and em.store_id=el.store_id;  (500 records)

(1c) select t1.store_id, employee_id, employee_name, sumtran, sumtax from t1, t2 where t1.store_id = t2.store_id and t1.region_id in (41,42) and t1.store_id = 6;  (450 records)

 

Note how in the above, the filter effects are largely applied last (1b and 1c) with the summaries applied first (1a). In this case, it is summarizing over a million values but it throws away more than 99.9 percent of this result on the last operation, reducing 1 million records to 450. It is still accessing the larger table (transactions) only once. It just does it at the wrong time.

 

If we invert this chain and regard the filters first, we might see queries like this:

 

(2a) create table t1 as select region_id, district_id, store_id, transaction_amt, transaction_tax from transactions where district_id=4 and region_id in (41,42) and store_id=6;  (15,000 raw records)

(2b) create table t2 as select em.employee_id, em.employee_name, em.store_id from employee_master em, employee_lookup el where em.store_id=6 and em.store_id=el.store_id;  (500 records)

(2c) select t1.store_id, employee_id, employee_name, sum(transaction_amt) sumtran, sum(transaction_tax) sumtax from t1, t2 where t1.store_id = t2.store_id group by t1.store_id, employee_id, employee_name;  (450 records)


In the above, the filters are pushed into the first part of the query chain (2a) to squeeze down the data sizes, but to also glean out the raw values for the final summary (transaction_amt, transaction_tax). The (2b) query is still a filter, but by the time we get to (2c) all we really need to do is summarize based on the intermediate table results. We don't have to "go back to the well" of the larger table. Everything we need for the final result is already in our hands, and a much smaller workload.
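For those who want to poke at the two chain shapes, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. SQLite has none of the Netezza physics (no SPUs, no zone maps, no distribution), so this only shows that both chains return the same answer while the intermediate tables differ in size. The table contents, the single employees table, the temp table names (s1, e1, f1) and the tiny row counts are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""create table transactions (
    transaction_id integer, region_id integer, district_id integer,
    store_id integer, transaction_amt integer, transaction_tax integer)""")
cur.execute("""create table employees (
    employee_id integer, employee_name text, store_id integer)""")

# Made-up data: 2 regions x 3 districts x 10 stores x 5 transactions each.
rows = []
tid = 0
for region in (41, 43):
    for district in (3, 4, 5):
        for store in range(1, 11):
            for n in range(5):
                tid += 1
                rows.append((tid, region, district, store, 100 + n, 8))
cur.executemany("insert into transactions values (?,?,?,?,?,?)", rows)
cur.executemany("insert into employees values (?,?,?)",
                [(1, "Ann", 6), (2, "Bob", 6), (3, "Cy", 7)])

# Employee filter, shared by both chains (stands in for 1b / 2b).
cur.execute("""create temp table e1 as
    select employee_id, employee_name, store_id
    from employees where store_id = 6""")

# Chain 1: summarize first, filter last.
cur.execute("""create temp table s1 as
    select region_id, district_id, store_id,
           sum(transaction_amt) sumtran, sum(transaction_tax) sumtax
    from transactions where district_id = 4
    group by region_id, district_id, store_id""")
chain1 = cur.execute("""
    select s1.store_id, employee_id, employee_name, sumtran, sumtax
    from s1, e1
    where s1.store_id = e1.store_id
      and s1.region_id in (41, 42) and s1.store_id = 6
    order by employee_id""").fetchall()

# Chain 2: capture and filter first, summarize last.
cur.execute("""create temp table f1 as
    select region_id, district_id, store_id, transaction_amt, transaction_tax
    from transactions
    where district_id = 4 and region_id in (41, 42) and store_id = 6""")
chain2 = cur.execute("""
    select f1.store_id, employee_id, employee_name,
           sum(transaction_amt) sumtran, sum(transaction_tax) sumtax
    from f1, e1
    where f1.store_id = e1.store_id
    group by f1.store_id, employee_id, employee_name
    order by employee_id""").fetchall()

s1_rows = cur.execute("select count(*) from s1").fetchone()[0]
f1_rows = cur.execute("select count(*) from f1").fetchone()[0]
print(chain1 == chain2, s1_rows, f1_rows)
```

With this toy data the summary-first intermediate holds summary rows for every store in the district, while the filter-first intermediate holds only the raw rows for the one store we care about. The real payoff depends on the data and the machine, so treat this as a shape check and nothing more.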

 

The simple inversion of the query order has significantly reduced the workload of the entire chain of events. This of course, does not answer whether our BI tool will actually implement the query in this order or manner. Anecdotally, with the above tables the original "transactions" table was over 30 billion very wide rows. The first query chain (1a-1c) takes no less than a minute, but only because key1 is zone mapped. The second query chain (2a-2c) takes 6 seconds or less, and it better represents a flow of data from larger-to-smaller, like a common source-to-target flow. It is easier to visualize and manage, and is more efficient.

 

Note: Can our BI tool shape a query chain in this manner? Can it glean out the raw columns into an intermediate table, later summarizing on the intermediate? Or will it always require us to summarize at the outset and then squeeze out from there? Some BI tools are very close to this model already.

 

Yet another pernicious issue is not obvious from the above - temp table distribution. This last query chain, though 6 seconds in duration, is still a one-hit wonder. Once two or more users start hitting the machine, concurrency will reveal all. The machine is quickly saturated and all of the queries start to take more and more time. In one case of just five users on the machine, all of the queries took over a minute, and one took over five minutes. Concurrency tuning is a bread-and-butter issue, too, so what's going on here?

 

In both query chains, the CTAS is not being given explicit instructions on how to distribute its results. The outcome is unpredictable from the BI tool's perspective, but very predictable for us. When the CTAS result remains distributed on its original distribution, we get a co-located write. If the CTAS does not use the original distribution, it will have to redistribute the data, broadcasting it all over the SPUs. We need to avoid this because co-located writes are desirable and muy caliente.

 

The original distribution key for the transaction table is (transaction_id). This doesn't do us much good if we are later focusing on the store_id (2b, 2c) as the primary distribution. In order for the final activities to be as quick as possible, we need to bridge the transactions into the store_id. We could set up data structures to do this, but in the end with so few records coming off the transaction table in the (2a-2c) chain, an intermediate broadcast is already in the mix. We can do it deliberately under our control, or allow it to use CTAS defaults. In this case, the CTAS default is worse.


In the first chain of queries (1a-1c), we would expect to see the following CTAS defaults:

 

(1a) - distributed on (region, district, store) because this is the group-by clause. It cannot use transaction_id for a co-located write because it's not even in the result set. Those who understand distribution keys know that this is not an optimal state of affairs.

(1b) - distributed on (employee_id) because it happens to be the first column in the select-clause. This query uses two tables in the join, so CTAS will opt for using a column in the select clause.

 

So in this case, the CTAS will not preserve the original distribution or even a useful distribution. Don't get me wrong here. CTAS defaults are acceptable in over 90 percent of cases. This example is offered as a typical one-off of BI automated query construction. Since the first query (1a) will produce a million records (and honestly, in some cases it produced a couple of billion records), we really need some optimization here.


If we were to take (2a) and (2b) above to deliberately enforce the distribution, we would use the "distribute on (store_id)", but we would have to include store_id in the result set. In each case, this would prepare both tables for the final query (2c) for a co-located join.
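To see why the distribution key matters here, a toy simulation can help. The sketch below is plain Python, not Netezza: the number of SPUs, the modulo "hash" in home_spu and the made-up rows are all assumptions for illustration. It counts how many rows of an intermediate table would change SPUs when re-keyed from transaction_id to store_id, and checks whether a join on store_id would find its partners on the same SPU.

```python
NUM_SPUS = 8  # hypothetical number of SPUs


def home_spu(key):
    # Stand-in for the appliance's hash function; plain modulo, for illustration only.
    return key % NUM_SPUS


def rows_to_move(rows, current_key, target_key):
    """Count rows whose home SPU changes when re-keyed from current_key to target_key."""
    return sum(1 for r in rows
               if home_spu(r[current_key]) != home_spu(r[target_key]))


def co_located(left, left_dist, right, right_dist, join_col):
    """True if every matching pair of rows already lives on the same SPU."""
    right_spus = {}
    for r in right:
        right_spus.setdefault(r[join_col], set()).add(home_spu(r[right_dist]))
    for l in left:
        spus = right_spus.get(l[join_col], set())
        if spus and spus != {home_spu(l[left_dist])}:
            return False
    return True


# Made-up intermediate tables standing in for the results of (2a) and (2b).
t1 = [{"transaction_id": i, "store_id": i % 50 + 1} for i in range(1, 1001)]
t2 = [{"employee_id": e, "store_id": e % 50 + 1} for e in range(1, 201)]

# Re-keying from transaction_id to store_id forces rows to change SPUs.
moved_from_txn_key = rows_to_move(t1, "transaction_id", "store_id")
# A table already keyed on store_id stays put: a co-located write.
moved_from_store_key = rows_to_move(t1, "store_id", "store_id")

# Both tables distributed on store_id: the join partners share an SPU.
join_ok = co_located(t1, "store_id", t2, "store_id", "store_id")
# Default-style keys (transaction_id, employee_id): partners are scattered.
join_scattered = co_located(t1, "transaction_id", t2, "employee_id", "store_id")

print(moved_from_txn_key, moved_from_store_key, join_ok, join_scattered)
```

The point of the toy is only the contrast: once both intermediates share the store_id key, nothing has to travel for the final join.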

 

Note: This brings up another BI tool issue, in that we need to affect the order of the sequence, and also provide for columns that are administrative (like store_id) but not part of the final result. Some BI tools are picky this way. If the column is not required in the final reporting output, they trim or ignore the need for the column in the intermediate tables.

 

To continue, we have now pushed the workload into the physics, not the query itself. But as noted, concurrency is the test. This final chain of co-located queries then returned in less than 3 seconds, and did not grow beyond 4 seconds until 20 users were running the same query at the same time, and even then tended to hover between 3 and 5 seconds as even more users were added. Isn't this the kind of scalable performance we want?

 

Additional note: If we really want to push this harder, it would be best for us to manufacture a "store_transactions" table that is distributed on the store_id already (for the 2a query). This would be a report-facing table that essentially mirrored the transactions table, but only carrying the high-traffic reporting columns. In this way, the store_id becomes the universal distribution even for the very first query. Keep in mind that while this strategy may cost disk space, it will further eliminate concurrency issues. I am not a big fan of preserving disk space when performance issues are in play. We will still need to perform a "distribute on (store_id)" for each (2a,2b) but it will preserve the distribution with a co-located write.

 

As we can see, the two protocols we need from the BI tool are to use capture-filtration-summary, and to apply distribution keys deliberately in the first passes to preserve distribution. We often apply these very same protocols in ELT because they make sense. But we have complete, detailed control of query construction in ELT, not so in the BI tool world.

 

Conclusion: Rather than use a BI tool's default summary-filter chain, what we need is a capture-filter-summary chain. This guarantees that we can leverage the VLDB physics, but also moves the data from larger-to-smaller in the most efficient manner.

 

Recap for Multi-Stage SQL:

  • Especially for summary data, the chain should perform the summary as the final operation, with capture-and-filtration in the first passes. This allows the final operation to be a simple summary, since all the filtration has already been applied. In other words, no more where-clause activity apart from the join criteria.
  • The chain should organize the tables (including additional tables) on the distribution key in play. Bridging one distribution to another can give us the performance, but broadcasting can eventually create a concurrency problem.
  • The chain should not address the same large table more than once. Get everything we will need and get out - don't keep coming back for something the first pass did not get.
  • The chain should capture raw information into an intermediate table, forgoing the summary until the final operation.
  • The chain should provide a means to bridge one distribution key into another, for maximum efficiency, rather than using CTAS defaults.
  • The chain should perform filtration at the outset, as a method toward attacking the larger table(s) with zone maps etc. Move from larger data sets to smaller ones.
  • The chain should preserve distribution to leverage co-located write and read where possible. This maximizes overall performance but also optimizes concurrency.


What if the BI tool will not, as a general-purpose tool, perform these deliberate and purposeful query chains? At this point, we need to have a heart-to-heart with the BI Tool vendor stating our concerns. Assume the best, that the tool vendor may eventually fix the issue, just not in time to help us now. We then need to consider two purpose-built options, each of which has its own issues. These are offered in the spirit of temporary adaptation until the BI tool is smart enough to bypass them.

 

Summary tables: These are often constructed to prop up database performance issues. They are just as viable for functional reasons, such as providing data in a form that is only available and most efficient when summarized, or to intersect details with pre-summarized data. But if used as a performance prop or BI Tool helper, put some effort into making it an adaptation that could be deprecated when the BI Tool is smarter. This way, we're not committed to it forever.

 

Stored procedures: In this context, these are used in an appliance as an adaptation mechanism. A stored procedure effectively bridges the BI tool to the data with a temporary procedural construct (the procedure) rather than a more permanent structure (like a summary table). Stored procedures pull application features down to the database level and adapt the BI tool into the Netezza performance model.

 

When or whether to use either of the above is always a design decision, not necessarily dictated by the tools themselves. But keep in mind the idea of temporary adaptation. I am always of the mindset that the warehouse and BI environment must exist with the expectation of change, so in general, adaptability and adaptation concepts are always desirable. They allow us to be more responsive to future requirements.


Riding the Waves

Posted by David Birmingham Jan 20, 2010

I've been noticeably quiet over the past weeks as I've switched horses, so to speak, and joined Brightlight Consulting. I had already been following Brightlight for a number of years, encountering their significantly talented people at various Netezza sites across the fruited plain.

 

The press release is here:

 

http://www.brightlightconsulting.com/news_2010_DavidBirmingham.htm

 

 

At the Netezza conferences this year, many of you saw the slow-motion videos of surfers mastering those monstrous waves. Also during this season, I happened to attend another conference where the speaker shared some famous words from Shakespeare's Julius Caesar in a similar context to what I was now experiencing right there on the conference room floor:

 

 

There is a tide in the affairs of men,
Which, taken at the flood, leads on to fortune;
Omitted, all the voyage of their life
Is bound in shallows...

 

 

I was standing on the top of the wave, so to speak, and had a choice before me. Ride the wave, or return to the shallows.

 

Now, I don't put a lot of stock in epiphany-styled revelations, but in this case a tingle went up my spine, realizing that the TwinFin had completely changed the game - it was time to seriously get on the wave and ride it, or commit to the shallows of the everyday. As many of you know, Brightlight has stood out for a number of years as being a go-to partner for all-things Netezza, and their VLDB consultants have solved large-scale problems where others feared to tread.

 

"You have some serious thrill issues, dude," Crush the Turtle, Finding Nemo

 

As I have been inundated with pings and kudos from many of you who already know the story, I thought it was worth sharing, especially for you Shakespeare and surfing aficionados, a rare breed indeed.

 

And to surfin' Enzees everywhere - here's to a "so totally awesome" 2010 and all the promises it offers. All the best.

 

See you on the waves!


A Tastier Float

Posted by David Birmingham Nov 5, 2009

One of our primary tables, we'll call it a fact table, contained a number of columns that had arrived through some pretty hairy ELT-based math algorithms. In all the crunching, we would see spontaneous overflow errors, so we converted some of them to float. More explosions occurred, and we converted more to float. After several more iterations, we converted them all to float. Then we discovered that the reporting layer also had to perform some hairy on-demand calculations, so it was a good thing we had float values to give them. Now everyone was safe.

 

However, as this table grew, and they always do, the floats became "bloats". Netezza does not compress a float data type. One day we looked up and the table was approaching 20 TB in size, with no end in sight. The theory was that if we could reduce these float values to numeric data types, we could save half the storage right away, and even more so with Netezza's compression, but it would put the reporting layer in danger of a spontaneous overflow explosion.

 

Once we performed the conversion of the table (as a test case) and saw it reduce in size to about 7 TB, we were hooked on the possibilities of compression but vexed as to the impact this would have on the consumers of the data.

 

We had experimented with surgically casting the data from numeric to float on-the-fly, but this would create a lot of headaches for the users if they always had to wrap every field with a casting notation. It did however, prove out one thing, that the time to cast the numeric-to-float is inconsequential when compared to the amount of I/O required to pull a float value from the SPU's disk "as is". In essence, we traded the time we saved in compression, and converted it to time used in casting.

 

So the next step was to put a special view on top of the fact table, such that it would automatically cast every numeric column into a floating point value. Thus, whenever a reporting layer query required data, it would automatically and transparently leverage the view, pull less data from the disk, convert it to float in the CPU and then leverage it as float in memory. We effectively eliminated the cycles spent in I/O to rip the float value from the disk drive. We spent a little of it in the cast of the data to a float. We made the operation transparent to the reporting layer.

 

old way:

 

FLOAT ->>>>>>>>>>>>>>>>>>>>>>>CPU -> QUERY MATH

(16 bytes, no compression)

 

new way:

 

NUMERIC ->>>>>>>>>>>>>>>>>>>>FLOAT -> CPU -> QUERY MATH

(8 bytes or less, with compression)
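The casting view can be sketched in a few lines. This uses Python's built-in sqlite3 module purely as a stand-in: SQLite knows nothing about Netezza's storage or compression, and the table name (sales_fact), view name (sales_fact_v) and values are hypothetical. What it does show is the shape of the idea: store the numeric form, and let a view hand the reporting layer a floating-point value without anyone wrapping columns in casts by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Base table keeps the compact numeric form (hypothetical names and values).
cur.execute("""create table sales_fact (
    store_id integer, amt numeric(18,2), tax numeric(18,2))""")
cur.executemany("insert into sales_fact values (?,?,?)",
                [(6, 100, 8), (6, 250, 20)])

# The view casts every numeric column to floating point on the way out.
cur.execute("""create view sales_fact_v as
    select store_id,
           cast(amt as real) as amt,
           cast(tax as real) as tax
    from sales_fact""")

# What is stored versus what the reporting layer sees.
base_types = cur.execute(
    "select typeof(amt), typeof(tax) from sales_fact limit 1").fetchone()
view_types = cur.execute(
    "select typeof(amt), typeof(tax) from sales_fact_v limit 1").fetchone()

# Reporting-layer math runs against the view and gets float behavior.
ratio = cur.execute(
    "select sum(tax) / sum(amt) from sales_fact_v").fetchone()[0]

print(base_types, view_types, ratio)
```

The reporting queries simply point at the view instead of the table, which is what keeps the change transparent to the consumers.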

 

All of the CPU-level math then becomes inconsequential when we move to the TwinFin, since it has its own floating-point processor and can handily deal with the float type. But we can continue to mitigate the I/O hit for the data by storing it in a compressible numeric format, and converting this on-demand to a float at the CPU level.


Famous words, or some such like, uttered by Orson Welles as he launched into a scary parody of alien terror on national radio. Really scary for some. And proffered on Halloween night in 1938, so dare I say, 'tis the season (almost).

 

Ahh, not to fear, this purports to be a painless foray. But I do have a story to tell.

 

Several projects ago (I always start this way, so you won't think I'm talking about you!) - I worked with some really sharp data engineers on boiling out a solution for retail operational reporting. The data arrived every five minutes or more, or less, and sometimes in parallel loads, with 24x7 regularity. More and more Netezza implementations are going this way, and you too, should look into processing data at the speed of thought. In any case, the reporting users wanted to plumb the depths of this data store, to the tune of eighty billion records and growing. (Okay, small I know (for some of you) but humor me).

 

Well and good, except rather late in the game, the reporting users spontaneously expressed a desire to review the detail through a metadata-based "lens", that is, set up some drilling levels and other metadata-based entry points, such that the entire operational model would be seen through this reporting "lens" and it would provide all the context for the consumers.

 

Now, such a model as described, would require such enormous power from a standard SMP/RDBMS-styled system, that we might well cause structural damage on the raised floor for sheer physical weight of said system. That is, if we really expected a report to return within a day or two of the request. Ahem! as I facetiously clear my literary throat.

 

But the worst-case for any given query for the above was around 8 minutes, and over 99 percent of the thousands of queries submitted, returned in less than 30 seconds. Oh, yeah, it was smokin' hot. In most queries using zone maps and the like, we saw returns in mere multiple seconds. Pshaw! Says the tick-tock-man, chocolate and vanilla, don't waste my time.

 

However (and there's always a catch) many of the larger reports were actually conglomerations of these smaller queries, and their aggregate time would occasionally exceed ten minutes or more. And even though this was a far cry from the "days away" we would expect from an SMP/RDBMS system, it was still 'too slow' for the users. Now, this is true adrenalin-junkie stuff, sort of like the old Far-Side cartoon of a young man standing with a fork in front of a waffle iron, captioned "Wendell Zurkowitz, slave to the waffle light". I recall how one man noted that many years ago we would wait hour(s) for a traditional oven to finish cooking, and now get impatient when the microwave instructions are greater than five minutes.

 

Perspective.

 

And rather than punt to the users and say, "Hey guys, this is just unrealistic" and degenerate into "expectation management" - the challenge was to actually achieve faster turnaround times on the reports. And here, I'm talking about getting these ten-minute reports into the 30-second zone. Would we have to embrace some extreme engineering for this feat? Methinks not - but the form of the process to get there was quite instructive.

 

Now recall I noted that the above model had operational tables, which were to be the detailed source, and a retail reporting hierarchy that was largely metadata-based. This reporting hierarchy had some significant size as well, perhaps a fourth the size of the eighty-billion-record fact table it had to link into. Yet both of these were on separate distribution keys. Querying one meant broadcasting another.

 

And now, for broadcasting.

 

Whenever two tables are distributed on different keys, a join between them cannot be initially co-located. To support the co-location, Netezza will broadcast the salient information from one table's context to the other. This means the physical data has to move from its home SPU, out onto the inter-SPU network fabric, and find its way to the target SPU where it will be further examined. Broadcasting for small tables is inconsequential and barely a blink on the radar. For larger tables it can have strange effects. For example, we saw one query return consistently in ten seconds. Yet when running side-by-side with itself (multiple users) it could take several times longer.
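A back-of-envelope model shows why a broadcast that is invisible for a small table starts to hurt for a large one, and why it gets worse when users pile on. This is a rough sketch and not how Netezza's planner actually accounts for traffic: the function name broadcast_traffic, the SPU count and the row counts are all made up, and it assumes every row is copied to every other SPU.

```python
NUM_SPUS = 100  # hypothetical SPU count


def broadcast_traffic(table_rows, num_spus, concurrent_queries=1):
    """Rows crossing the inter-SPU fabric under the assumptions above."""
    # Each row already sits on one SPU and is copied to the remaining ones.
    return table_rows * (num_spus - 1) * concurrent_queries


# A small lookup table: barely a blink.
small = broadcast_traffic(500, NUM_SPUS)

# A large table, one query at a time.
large = broadcast_traffic(20_000_000, NUM_SPUS)

# The same large table with five users running side by side.
large_x5 = broadcast_traffic(20_000_000, NUM_SPUS, concurrent_queries=5)

# A co-located join moves nothing across the fabric.
co_located = 0

print(small, large, large_x5, co_located)
```

Under these assumptions the traffic grows linearly with both table size and user count, which is the kind of pressure that shows up only once queries run side by side.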


The reason is that both queries were competing for bandwidth on the inter-SPU fabric, among other things. The simplest solution, of course, is to get our metadata table distributed on the same key as the operational tables. The problem was simply in the complexity of this metadata table and how it mapped to the core information. "Blowing it out" into a materialized form of information would require significant planning and design, because a misstep could easily make the reports turn out wrong, and this was unthinkable. In all this, the maintainability had to be considered, because if our initial complexity is too high, the maintainability is in jeopardy - by design.

 

Of course, we would spend most of our time in testing this scenario. Coding and implementation in most BI shops is a nit compared to the testing we have to execute to validate the outcome. Netezza is no different, except we can close the testing loop sooner if we have more power. And of course, for something of this magnitude, to test the change from minutes to seconds, we would need a powerful machine to measure the difference. Whenever we ran the new solution on a smaller machine, the difference couldn't even be measured. No, the power of the machine makes the testable difference visible and measurable.

 

As I noted, the form of this exercise was the most instructive part. Rather than form a means to align these two tables for co-located joins, the first effort was in attempting to tune the queries. You know, "query engineering", which is the mainstay of performance engineering on an SMP/RDBMS platform, and old habits are hard to break. The data engineers were somehow in denial that they would receive extraordinary power from configuring the data. Rather they trusted their instincts and chose to attack the queries.

 

Now, in any platform, regardless of shape, size or vendor, power is always and forever the domain of hardware. Software cannot manufacture more CPUs or network speed. If the physical plant is not ready, the software can only use what it has at its disposal. The software itself is largely a cost center, because it can only drain the machine's energy through inefficiency. In an SMP/RDBMS machine, the only option we have is to engineer the queries, because the physical plant is configured to be general purpose.

 

In a purpose-built machine, however, the query is simply a controlling mechanism to Netezza's resources. The host will chop it apart into snippets and dispatch these to the component that they will serve. Extreme query engineering, on the other hand, assumes that jockeying around with the query can actually affect our fate. (Contrast: a poorly written query is different from directly engineering a well-written query.) And besides, do we really want to spend our time carefully engineering the query to the point of functional brittleness? In an SMP/RDBMS machine we will see queries that extend for tens of pages in a very daunting complexity. Maintaining these is a full-time job for our consultants. They swarm on the machine, and carefully tune their handiwork to avoid breakage.

 

Yet, we purchased a Netezza machine to get away from this complexity. To reduce, clarify and simplify our administration and consumption of the data. So as I watched these engineers bat themselves against the problem, no differently than a fly batting against a window, I watched them pull out their hair in generous tufts when little they did offered the significant gains they expected. This outcome was entirely counter-intuitive to their training. They were accustomed to using and tuning software to make things work faster.


Sweeping the hair from the floor one evening, I mentioned (for the x-teenth time) that the broadcast effect was killing them. Once our engineers grasped the broadcasting problem, I thought we would make headway, but things actually got worse. They started trying to control the broadcast as the root cause rather than the symptom. In one test, I saw one of the largest tables leap into a broadcast and we just killed the query outright (it would probably still be running, even today). The engineers lamented: How do we make sure the larger table doesn't broadcast? How do we control the broadcasting to our benefit? Answers exist to all of these, but it's like talking to a drug addict, one who is addicted to the drug of SMP/RDBMS and claims he can 'quit anytime'.

 

And then the truth came out, "David, if we can make this 10100 machine process data like a 10400 machine, we'll look like heroes!" To which I ask "How?" to which the response is: "We can save them all that money they would have spent on the hardware..." Well, not really. You've just chosen something else to spend the money on, namely performance engineering, the cost of time-to-market, the cost of a marginal implementation and the cost of human labor (the most expensive asset you have, by the way). But since the only way to get a 10100 to perform like a 10400 is to actually be a 10400, well, you see the futility. 432 SPUs versus 108 SPUs? And they really, truly thought they could - I mean - seriously. Let's keep in mind that the opposite is true. If we can't make the 10100 process data like a 10400, perhaps our approach is flawed? Heroes or goats. Take your pick. In my estimation, there's only one hero in the room. The big black box.
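The arithmetic behind that futility is short enough to write down. The sketch below uses the two SPU counts quoted above and assumes, purely for illustration, a hypothetical one-billion-row table spread evenly across the SPUs. Real tables and real workloads are messier, so treat it as a back-of-the-envelope picture and nothing more.

```python
def rows_per_spu(total_rows, num_spus):
    """Rows each SPU must scan when a table is spread evenly (ceiling division)."""
    return -(-total_rows // num_spus)


TOTAL_ROWS = 1_000_000_000  # hypothetical table size

small_machine = rows_per_spu(TOTAL_ROWS, 108)  # SPU count quoted for the 10100
large_machine = rows_per_spu(TOTAL_ROWS, 432)  # SPU count quoted for the 10400
ratio = small_machine / large_machine

print("rows per SPU on 108 SPUs:", small_machine)
print("rows per SPU on 432 SPUs:", large_machine)
print("work per SPU, small vs. large:", ratio)
```

Under that even-spread assumption, each SPU on the smaller machine carries about four times the rows for the same table, and no rewording of the query changes that division.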

 

So the broadcast is the symptom, not the root cause. How about, we quit broadcasting, cold turkey? Take the data model through a detox program and the engineers through a series of deprogramming seminars to - well - it's not that bad. Typically the average engineer only has to see it operate in an adverse manner to become a believer. But a believer they must be, or they will not take action to correct the problem, correctly.

 

So one of them finally decided to produce a map table, one that would map the metadata into the operational tables such that all core joins would become co-located, with a common distribution. And lo, the first test of this blew their minds. Even the complex reports were now coming back in single-digit times, and the reports that had been running ten minutes or longer were now under a minute, even with multiple users. In fact, they saw the performance and scalability practically handed to them - simply because they configured the data correctly. It had little to do with query engineering.
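As a rough sketch of the idea (not the team's actual design, which is not shown here), the Python below builds a hypothetical map table that carries the operational key, so it can be distributed the same way as the operational table. The matching rule (`category`), the column names, the four-SPU count and the modulo "hash" are all assumptions for illustration. The point is that each SPU can then complete its share of the join using only the rows it already holds.

```python
from collections import defaultdict

NUM_SPUS = 4  # hypothetical count, kept tiny for illustration


def home_spu(value, num_spus=NUM_SPUS):
    """Toy distribution function: integer key modulo the SPU count."""
    return value % num_spus


def build_map_table(metadata_rows, operational_rows):
    """Materialize one row per (operational key, metadata row) pairing.

    Assumed rule: metadata applies to an operational row when categories match.
    """
    by_category = defaultdict(list)
    for meta in metadata_rows:
        by_category[meta["category"]].append(meta)
    map_rows = []
    for op in operational_rows:
        for meta in by_category.get(op["category"], []):
            map_rows.append({"account_id": op["account_id"],
                             "meta_id": meta["meta_id"],
                             "label": meta["label"]})
    return map_rows


def colocated_join(operational_rows, map_rows, num_spus=NUM_SPUS):
    """Join SPU by SPU, touching only the rows that live on that SPU."""
    ops_on, maps_on = defaultdict(list), defaultdict(list)
    for op in operational_rows:
        ops_on[home_spu(op["account_id"], num_spus)].append(op)
    for m in map_rows:
        maps_on[home_spu(m["account_id"], num_spus)].append(m)
    results = []
    for spu in range(num_spus):
        local = defaultdict(list)
        for m in maps_on[spu]:
            local[m["account_id"]].append(m)
        for op in ops_on[spu]:
            for m in local[op["account_id"]]:
                results.append((op["account_id"], op["amount"], m["label"]))
    return results


metadata = [{"meta_id": 1, "category": "retail", "label": "Retail"},
            {"meta_id": 2, "category": "wholesale", "label": "Wholesale"}]
operational = [{"account_id": i,
                "category": "retail" if i % 2 == 0 else "wholesale",
                "amount": 10 * i} for i in range(8)]

map_table = build_map_table(metadata, operational)
joined = colocated_join(operational, map_table)

# Reference answer: the same join computed with no notion of SPUs at all.
labels = {m["category"]: m["label"] for m in metadata}
expected = [(op["account_id"], op["amount"], labels[op["category"]])
            for op in operational]
print(sorted(joined) == sorted(expected))
```

Because both sides are placed by the same key, the per-SPU join finds every match without any row leaving its home, which is the whole appeal of the common distribution.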

 

Now one may ask the obvious question, and please do so now: Why don't you just build out some user-facing tables and forget leveraging the operational tables? After all, we don't build our non-Netezza reporting systems on top of operational data, do we? We build out dimensional models and other handy structures to positively affect the user experience and simplify the flow (and the maintenance). This functional decoupling is a mainstay of reporting environments. (Okay, the next entry will focus on this). But in this case, suffice it to say that the owner of the machine had laid down a hard mandate on disk utilization. At no time could we foray into replicated detail, or even summary of detail, without a plan to access the operational detail on a drill-down and the like. Interestingly, the required reporting tables would have cost (on disk) only a fraction of the time, labor and effort put into making the operational tables viable. This is why it deserves its own treatment in a separate rant - er - essay. Stay tuned, and don't touch that radio dial.


Back to the drama - A telltale symptom that we're doing something wrong is when we start down the engineering path. It's an appliance. We don't engineer toasters, blenders or laundry machines. But the difference here seems to be subtle. It's not. In this case, the culprit was the broadcast, something to be eliminated rather than managed. And no amount of creative query hoop-jumping would overcome this. Get the joins onto the SPUs. It seems obvious to those who have been around the machine for a bit. But for those who have not, the learning curve is upon them. Be patient with them for as long as it takes to get it right. Once we have a believer, we'll never have the conversation again. As long as we stay in a theoretical zone, however, expect them to stay in the spin cycle. This is like many things scientific. Seeing is believing.

 

Whenever I (and others like me) observe a ritual of performance engineering, each participant holding out the hope that "just one thing" will offer stratospheric boost so they can all wipe their foreheads and go home - this is the surest sign of one of two things: Either the data is poorly configured and is causing the queries to be inefficient, or the data is properly configured and the machine does not have enough physics to achieve the goal. If the focus is on query engineering, they are wasting time. If the focus is on data engineering, at some point it will reach a "diminishing return". Either the machine has the power or it doesn't. Time to switch to Netezza, or if using Netezza, time to add some physics (a frame or two) to make it happen.


Moral of the story: Performance is found in the physics, not the carefully engineered queries. If we find ourselves "engineering" our queries for performance reasons - we should take a step back, take a deep breath - click our heels together and say softly: "There's no power like SPU power. There's no power like SPU power." Repeat as necessary.

 

And pay no attention to the man behind the curtain. I'll bet he and Orson Welles never even met.


A number of months ago I wrote about how the World Tour Awaits, and all the buzz in the air about the new TwinFin. I was honored to moderate the best practices forums in North America and London, and many thanks to the rather effervescent participation by the panelists. Kudos go out to David from Brightlight, David from Edge Associates, and Jeff from Quantisense, each of whom has one of those over-the-top kind of personalities that turn the session into an "experience" more than just a discussion.

 

But all in all, the sessions flew like lightning. If any of you have additional questions or insights, may I invite you to post them here on the Netezza community. The discussion never ends, you know.

 

It is interesting to note that many of the questions coming from Enzees in every venue struck a common chord and followed a common thread: Enzees are unique and have a rarefied problem and solution domain, and they are able to approach it with the confidence of Spartacus in the arena, or Jackie Chan on the streets of New York. Comments often began with, "I have a table with <seventy, eighty, ninety, your number here> billion records and I want to..."  I mean, seriously, those on the outside lookin' in will look askance at such an opening statement, and marvel at the ensuing, rather casual discussion about it. Nothing is casual about these data sizes in the outside world.

 

It goes like this: Bring it on, baby. Because the question of whether it can be done is behind me, now I just want to know how to do it well. The audacity!

 

Kudos also for the Enzee crowd members who injected their insights and wisdom into the discussion, freely sharing their technical and political battleground knowledge for the betterment of all. This was not the same as "iron sharpening iron", because at this scale of data processing, iron crumbles. No, this was a lot like titanium sharpening titanium, and was exciting to participate in, to say the least.

 

Many thanks also to Netezza for inviting me to the tour. It was a whirlwind to be sure, but well worth the ride. Tim, Olga, Courtney and Karina made it easy for me (actually all of us) to participate. Thanks to all for your hard work and a World Tour Well Done!



As the sunrise peeked over the horizon, it cast long shadows over the four cars awaiting the break of dawn. Stretching before them, the expanse of the salt flat beckoned, nay taunted them, to accelerate across its ancient surface. Not caring for the winner or loser, it merely provided a level playing field for them to test their wares and technology. But yawned at the futility of the race itself. The salt flat had always been, and always would be. Come one, come all, it invited daily, almost mockingly.

 

The leader for team-Exa sat in his racer's driver seat, eyes closed. When he felt the warmth of the morning touch his face, he raised an eyelid to examine the time. Now thirty minutes from flag-down, the sun would still be at his back when he won the race. And he would win the race.

 

The lead for team-Terra pushed back into her driver's chair to stretch her legs as her eyes fluttered open. She glanced toward her left to the Exa racer, gleaming in the morning sun, and then to her right at the NZ racer. Its plain black lines and nondescript exterior, she knew, hid the power under its frame, and it was nothing to be trifled with.

 

The fourth car on the end, entered in the eleventh hour, was a plain vanilla Volkswagen Beetle with a rocket engine attached to its backside. No frills, no nonsense and nothing hidden. Five men from Redmond had delivered it last evening. They hadn't even had time to take a test run on the flat.

 

Minutes later all four drivers and their lackeys met in front of the four cars, partly to wish each other luck and partly to offer last minute trash-talk. Dominic Toretto, the driver of the NZ machine, ran his hands over his bald scalp and rubbed it vigorously, as if massaging the sleep from his head, then yawned and said, "Okay gentlemen. We're fifteen minutes from flag-down. Anyone want to back out? I swear we won't hold it against you."

 

"Dude," laughed Excel, the driver for the Redmond machine, "In your dreams. I have investors watching."

 

"As do I," smiled Tara, the only female driver, who would command the blue-streamlined Terra racer, named for its ability to master the earth and its elements. "We're all in this for keeps." She batted her eyes and tilted her head flirtatiously, "You want to see under my hood?"

 

"Out here in the open?" Toretto laughed, drawing chuckles from the others, "Sure, let's see what you have."

 

She ignored the innuendo and pointed her keytag toward the Terra racer and pressed a button, causing both side doors to slide away and the hood to pop open. Toretto strolled over to examine the engine. He'd seen these before.

 

"Lot of power under that hood," he quipped.

 

"Yeah," she said, expecting a bit more enthusiasm for her machine. She wouldn't find it among any of these drivers, though. They lived and breathed adrenalin, and knew as much about her machine as she did. And weren't in denial about its weaknesses, either.

 

"Looks plain," said Jeff, driver for the Exa-car, "And as you can see, not enough control."

 

"So let's look at yours," Toretto said, a twinkle in his eye.

 

As they sauntered to the next car, Jeff's lackey whispered in Toretto's ear, "We've radar-mapped the entire flat between here and the finish line. Every bump is programmed into the machine. You think that's a competitive advantage?" He slapped Toretto on the back and laughed loudly.

 

"Bumps don't matter," Toretto muttered, with the strength and experience of someone who would know.

 

Jeff spun to face him, "What was that?" he laughed, "Bumps don't matter. Did you hear that?" he looked around him to the others, with his lackey already laughing, "He says bumps don't matter." He crossed his arms, "Would it matter to you if I said that ignoring bumps at these speeds is like a death wish?"

 

"No."

 

"No, what? No it won't matter what I say, or bumps still don't matter?"

 

"Either way," Toretto said with a wry grin, "Bumps don't matter."

 

Jeff threw up his hands in frustration as Toretto poked his head into the Exa-racer's driver side window. Jeff asked, "What do you think, huh?"

 

Toretto examined the interior, laid out like a Boeing 757 cockpit. Three LCD screens loaded with controls and meters, flashing lights all around the dashboard and dozens of knobs and gears. "Got a lot of moving parts," Toretto sighed, "Think you'll need all that?"

 

"No more, no less," Jeff said, "Our investors are very demanding. All the tires and wheels are measured for pressure and impact, the dual-redundant monitors compensate for any detected differences, and the pre-mapped radar anticipates every bump and turn."

 

"It's a salt flat," Toretto grinned, patting him on the side of his shoulder, "There are no turns. And bumps don't matter."

 

Jeff nearly bit his tongue, but instead smiled and shook his head while Toretto continued his examination.

 

"Looks to me like," Toretto finally said, "You decked out the car just for this ride."

 

"Yeah. So?"

 

"Well, it might work for a salt flat under controlled conditions, but it's not streetworthy."

 

"We're not testing on a street," Jeff fired back, "All that matters is who makes it to the other side."

 

"Really?" Toretto raised an eyebrow, "You think people will be knocking on your door to buy a few of these to come out here to run on salt flats?" He laughed, "Your investors will expect to see the performance you show here," he pointed toward the West, "Out there. Or they can't make any money. Optimizing your car, just for this test, doesn't mean anything."

 

"We'll see," Jeff snapped.

 

"I'd like an assessment of my car, if you don't mind," said Less, the driver for the Redmond car.

 

Toretto simply said, "Not much different from the Exa. Except you don't make any bones about the fact that you've strapped a rocket engine to an underpowered car. You think those wheels and frame can handle the stress of the race? We'll see how you do on the flats. That's all I can say."

 

"Gentlemen," intoned a voice all around them, coming from well-placed speakers, "We're five minutes from flags-down so anything you need for warm-up, do it now."

 

Jeff punched a button on his keytag to remotely initiate his computers into a final pre-race system check. Toretto slowly strolled back to his car, opened the door and flopped into the driver's seat. His lackey Mark, younger than he but the sharpest of his crew, brushed back a long black lock of hair and positioned it over his ear, then silently joined Toretto in the passenger seat. After Toretto punched several buttons to initiate the engine, Mark could no longer hold it in.

 

"Don't you think we're about to get smoked here?" Mark said, glancing to the Exa car, "I mean, radar mapping, all those controls and - I mean - "

 

"I know what you mean," Toretto said casually, engaging the first gear, "Just trust the machine."

 

"I know what your philosophy is," Mark sighed, shaking his head, "Put it all under the hood, make it self contained, but what if you need to get creative in the middle of the race?"

 

"Would one of our customers have the option to get creative?" Toretto asked, allowing the car to roll ahead to the starting line. "Do we let them add stuff to the machine? Do we require them to know a lot about what's under the hood?"

 

"No, but -"

 

"But what?"

 

"I don't know what! It just seems like they have more, you know, more -"

 

"More what?"

 

"I don't know what! It just seems like more."

 

"More to break. More to maintain and watch - when the real mission is to go fast on the flats. And everywhere else."

 

"You think we'll win?"

 

"Trust the machine."

 

Presently a racing judge appeared with a flag in each hand, and took his place between the two middle cars. Watching the clock count down, he raised the flags high, then started counting down loudly.

 

"Hold on to your chair," Toretto mumbled, "It's a little rough out of the gate."

 

"I'm ready," Mark said, holding tightly to the chair, pushing against the floorboard to press his back into the chair's leather. He'd made the mistake of eating a meal just prior to the first test runs the week before, and had spent an hour cleaning his half-digested meal from the dashboard and interior windshield. This time, he'd fasted for twenty four hours. Nothing remained in his stomach, he was sure of it.

 

Over in the Exa-racer, Jeff had strapped himself into his seat, and his onboard systems had just finished their run-through only seconds before the flags would fall. The carefully tuned machine would master the flats today. The machine, and his name, would soon be synonymous with extreme speed and power. He would win this race. He was sure of it.

 

Each driver sat in breathless anticipation as the judge counted down to zero, and watched almost in slow motion as the flags went down. But that's when anything "slow motion" utterly ended. Each of the machines engaged its own form of acceleration. The Redmond machine driver simply turned a valve and flooded the rocket engine with fuel. Its ignition was like an explosion of TNT and it blasted from the line like, well, like a rocket.

 

"They're getting ahead of us," Mark complained as the NZ car's acceleration pulled him deeper into the leather.

 

"It's just a side effect of packaging," Toretto said, his pulse rate not having changed one beat faster, "Just be patient."

 

Without warning, the Redmond machine sputtered and fishtailed its wheels as they passed it. Mark spun his head as the Redmond machine flew past them, and they left it in a wall of salty dust. He then looked back at the Exa racer, and at Jeff's eyes riveted forward, set like flint against the Western sky.

 

"How did you -" Mark began.

 

"Know it would run out of power?" Toretto lifted one side of his mouth, "Get real."

 

"We're still ahead of the others," Mark noted pensively, glancing around toward Tara, who seemed oblivious to everything around her.

 

"It will stay that way," Toretto said simply.

 

"So that's it," said Mark, "We stay in these race positions until the end?"

 

"No, they will think the race is over soon, and make their move."

 

Suddenly Tara's car started gaining ground, like something pushing it from behind. Mark saw her pulling up behind them fast, and faster still, "She's coming. She's coming really fast."

 

"Naah, she's just changed her fuel mix. Thinks going from 55/50 to 25/50 will actually matter."

 

Mark spun toward the Exa racer, now closing the distance, "He's coming too. Are we slowing down, or are they -"

 

"Making their move," Toretto said quietly.

 

"Aren't you going to do something? They're gaining!"

 

"Let them burn out," Toretto chided as the two competitor machines passed them and gained their respective leads, "And besides, the race is won in the architecture, not the gadgets."

 

"What difference does it make if we're behind?"

 

Toretto watched as the odometer slowly ticked over. And over again. "We're almost there, are you strapped in?"

 

"Yes, I'm strapped in, but almost where? Where is there?"

 

"There," Toretto pointed to a tinted stain in the salt flat, and watched the odometer tick over to the prescribed reading. "Here we go. Hold on."

 

"What are you doing?"

 

Toretto ignored him and pressed a switch on the dashboard. They could hear a whining mechanical noise coming from the rear as two gleaming foils slowly rose from the tail of their accelerating vehicle.

 

"What are those?"

 

"What did the Exa driver say?" Toretto reminded, "That at these speeds, bumps count. Actually, at these speeds, what counts is stabilization."

 

"How will those make us more stable? It looks like they're slowing us down!"

 

"Brace yourself," Toretto said, and punched the second button. "Accelerators engaged."

 

In that instant, the air inside the car seemed to grow thin, and the air around them seemed to radically change, buffeting the racer with increasing intensity. Then Mark felt it, a pulling, g-force of acceleration as it pressed him deep into the leather of his chair, and caused the blood to run from his face and into the back of his head. With a whoosh-whoosh, they passed the other two cars as though they were standing still.

 

Jeff watched helplessly as the NZ racer flew past them. Glancing down and across the controls, he saw that all of their gauges were standing at the max, pinned almost into the red line. Even if he could make it go faster, the car would incur irreversible structural stress, and possibly crack apart on the flats, spinning into a million pieces. Jeff furiously spun dials and adjusted controls, attempting to squeeze just a bit more power from the machine. If he couldn't come in first, second place would have to do. Jeff now cursed his own racer as it entered the NZ racer's dust trail. His investors would be livid.


Tara furiously slammed her palm into the steering wheel, repeatedly cursing as the NZ car disappeared into the distance. Switching her fuel mixture from 55/50 to 25/50 had made her car lighter and more agile, but had not offered the additional speed. At least, not that kind of speed.


Then something rushed toward both their cars. As the NZ racer crossed the sound barrier, a shockwave ripped up the surface of the salt flat and met them head-on. The Terra car was more stable, so the wave simply bounced its wheels. The Exa car was not so lucky. When the shockwave hit, the passengers heard the sonic boom before they felt it lift the racer's front end and flip it backwards, spinning it in a barrel-roll as it tried to find its footing again. Its back wheels landed first, then the front, causing the back wheels to lift off again, then the front, rocking violently back and forth like this at least five times before the right front tire blew out, sending the vehicle into a wild spin.

 

Jeff could hear and feel the car's structure releasing and popping from the stress. At this speed and rate of rotation, the Exa-racer's uncontrolled spin would rapidly develop enough centrifugal force to turn human brains to scrambled eggs. Jeff felt the red-out coming as an automatic release triggered and both their ejection seats activated, separately catapulting them hundreds of feet into the air. Their parachutes deployed when they reached apex, and Jeff witnessed his car disintegrate on the salt flat.

 

Jeff lifted his gaze into the West, watching the NZ car disappear like a speck in the wake of its own shockwave, churning up the ground behind it. It would likely reach the finish line before his parachute even touched him to the ground.

 

Toretto casually glanced to his rear-view mirror, watching the salt flat behind him, practically corrugating the ground in his wake. "Hmmm," he finally said, "Maybe bumps do count. Just not for us. And I don't mind giving them a bumpy ride." He settled into his seat, "No sir." And with that, he fully understood the frustrated rage building in the minds of his competitors, and soon their investors.

 

And more fully understanding the difference between being fast, and being furious.
