[ www.netezza.com ]

Rick Deckard wiped the sweat from his brow as he holstered his high-powered weapon. Lifting the communicator from his belt, he muttered several codes and closed the transceiver.

 

"Skin jobs," he said to himself, surveying the replicant sprawled on the floor, and amazed at the technology's ability to mimic the most complex entities on earth. He softly kicked the replicant's front panel, observing large hole his weapon had created in the technology's logo. The half-remaining "T" and the "ata" telling him he'd scored big. Another wannabee down for the count.

 

His communicator buzzed for attention. He lifted it, beeped-in and said "Deckard" like he really didn't want to be bothered, but knew such sentiments were useless. Apparently more replicants were on the prowl, having stolen their way into enterprises with myopic POCs, NDAs and a variety of other three-letter-acronyms. He so longed to go Solo.

 

"We've spotted another one," said the dispatcher on the other end, "People are dying."

 

"Dying?" Deckard raised an eyebrow. "That's new."

 

"Dying to get their jobs back after a misfired deployment with a replicant," said the dispatcher, "Get with the program Deckard. You were called from retirement, but you can't be this rusty. Not with this much at stake."

 

"You wanna come out here and be my backup?" Deckard shot back, irritated, "It's easy to criticize from behind a desk."

 

"Keep on talkin'," laughed the dispatcher, "But the day's slippin' by - and so will your replicant if you don't get on the stick."

 

"Yeah, yeah, whatever," Deckard beeped out, sighed and replaced the communicator. The steam rising from the replicant's body reminded him of why his work was important. Stolen money. Stolen dreams.

 

Less than fifteen minutes later, Deckard found himself crouching behind a stack of crates, one eye on the replicant and one eye on his pistol as he wrested it from its holster. Time was, he could draw, shoot and replace it before a replicant could take one mechanical breath. Now, countless CPU clocks dishonored his rustiness, and he needed a new weapon if he ever intended to win.

 

Too late he realized that he'd spent too much time fiddling with the pistol, and upon looking up, found the replicant nowhere in sight. In that moment, he felt the replicant's mechanical breath on the back of his neck, and he whirled to confront it.

 

"Deckard!" shouted the replicant as he delivered a hard backfist, reeling Deckard over the crates to fall hard on the other side. "You should never have returned! You know I can't be beaten in toe-to-toe comparison!" He then split the crates apart and tossed them to each side.

 

Deckard had already reached for his pistol, but it had been just loose enough to fall from the holster when the replicant had ambushed him. Glancing around feverishly, the fear rose in his throat as the replicant took one step forward, grabbed him by the shirt and shook him once. He pulled his fist back and Deckard could hear it hitch, meaning that some special spring had latched in preparation for release, and if the replicant's fist now threw a punch, the impact would take his head clean off his shoulders.

 

"Sleep tight," said the replicant wickedly.

 

But the punch never came. Instead the replicant's eyes widened, his breath shortened and his strength seemed to instantly leave his body. He dropped Deckard like a sack of potatoes, and Deckard wasted no time in scrambling clear. The replicant fell to his knees with a bone-crunching impact, his eyes vacuous, and fell forward with a whump.

 

Deckard glanced around for his weapon, only to be met face to face with another, much younger Blade Runner, holding a smoking weapon, clearly more advanced than his own.

 

"I'm TwinFin," said the Blade Runner meekly, pointing to the twitching mass that was the replicant  "I see you've just run across a more advanced model than you're accustomed to."

 

"Stronger than before," Deckard rasped, wiping the sweat from his face with both hands, "It's been awhile."

 

"Yes," he said, "This one's name is A-Data. He is the most advanced of his kind. A front-loader and high-volume storage capability. Also fast response. Almost as fast as yours, even with age."

 

"Thanks," Deckard responded flatly, unamused, "A-Data, eh?" he smirked, tapping the replicant's leg with his foot, "Well, now he's just an ex A-Data."

 

"True," smiled TwinFin, "But you'll need more power if you want to stay ahead of them," he held out his weapon, a POC-killer if ever Deckard had seen one. On the weapon's barrel, in old-Gothic script, he read the weapon's name "The Closer."

 

"Nice," Deckard quipped.

 

TwinFin suddenly produced an auto-ject unit with the "enzee" logo emblazoned on it, snatched Deckard's hand, and before Deckard could object, injected the enzee accelerant into Deckard's bloodstream.

 

"What the?" Deckard now snatched his hand back, but suddenly felt the chemical's surge of power, "What's in that stuff?"

 

"Secret sauce," TwinFin smiled, "You'll be five-X or more faster response than they are. Your next replicant will go down for the count before the count even begins."

 

"Tight."

 

"You have no idea," he smiled, "And by the way, I'll be right behind you."

 

"I hear some of them are looking for their makers," Deckard posited.

 

"Wouldn't you?" TwinFin said, "I'd sure wonder why I was made that way. Changed from one purpose to another in the middle of my cycle."

 

"I wonder if anyone has noticed, that the replicants are always trying to be like us?"

 

"It's because we're the only standard they know, by which they are measured."


"I also wonder," mused Deckard, "If these replicants dream of electric customers."

0 Comments Permalink

"Blade?" Hannibal King touched the sleeping warrior gently on the shoulder, "Wake up, dude."

 

Blade raised one eyebrow, then slowly opened his left eye. Unafraid of the day or night, the warrior moved his hand ever so slightly to verify the presence of his sword. King could see the tautness of Blade's shoulder sinews as he slowly shifted his weight on the pallet.

 

"This has better be good," Blade rasped, "I was in the middle of a dream. Kickin' bloodsucker tail," he wiped his hand over his face as though it would wipe away the sleep from his eyes, or the fatigue in his body, but it did neither.

 

"We have some news," King said with a low voice, "The upgrades have arrived."

 

Blade's other eye slowly opened, "Oh?"

 

"Yeah," King laughed, "You're gonna like it."

 

"I'll be there in five," Blade said, half of him wanting to roll over and sleep, and half of him curious about the upgrades. Blade always had a half-and-half approach to life. The bloodsuckers hated him for it.

 

A number of minutes later, the warrior strolled slowly into the main atrium of his personal lair, only to find it strewn with boxes, styrofoam and bubble wrap, "What's all this mess?" he rasped.

 

King appeared from behind one of the largest boxes, a vertical package over eight feet tall, holding a swatch of bubble wrap, "Don't you just love this stuff?" he quipped, violently popping several dozen bubbles with vigorous manipulation.

 

"Stop that!" Blade commanded, ever-despising King's cheeky nature, "Tell me what all this is."

 

"All this," King pointed to a far wall where the apparatus had been installed, "is just for you. At your service."

 

"Blade servers, eh?" Blade took two short steps toward the machines, "What does it do?"

 

"Only slices, dices and makes Julie-Anne cry!" King cackled.

 

Blade was not amused.

 

"Okay, seriously," King began, "Recall some of our - er clients - had some run-ins with the bloodsuckers? Their problems were really that they were working with too little information. Or that it was inaccurate, or not arriving in time. The BI bloodsuckers swoop in to save the day."

 

"I hate bloodsuckers," Blade seethed.

 

"Oookay, so they fell prey to the wiles of the bloodsuckers, promising a better mousetrap and all that."

 

"They always promise."

 

"Moving right along, they promise but don't deliver. Here's where we come in, and help them get on the right track."

 

"How do these machines do that?"

 

"The Blade servers include a special sauce - "

 

"Special sauce. Is it red?"

 

"Uhh, no. But it's all painted in your favorite color. The better part is that you can use this machinery during the day to find opportunities, and still let it work at night, you know, when you're - uh - out."

 

"Hunting bloodsuckers."

 

"Uhh, yeah, so let's focus here. The new server has a special acclerator that basically lights up the night."

 

"Is it ultra-violet light?"

 

"No, but it's ultra-clear light. The kind of light we need to shine on business priorities, SLAs and how to leverage the machine at the enterprise level. You know, best practices."

 

"I don't need any practice. When the sun goes down - "

 

"Okay, look," King interrupted, "The accelerator sits on the blade and does all the analytic streaming work. The server then allows for cache RAM to sit between the disk drives and the processor, so we can keep stuff in memory longer."

 

"I have a long memory for bloodsuckers."

 

"And some clients," King rolled his eyes, "May need long memory for lookup tables, oft-used dimensions and the like."

 

"Are you starting all that other-dimension talk again? I thought I'd made a deal with Stan that we would never introduce - "

 

"No, not alternative dimensions in spacetime," King smirked, "But multidimensional analysis."

 

"I don't follow."

 

"Data analysis."

 

"To what purpose? What are we looking for?"

 

King thought about the question for a moment, realizing that the answer could capture Blade's attention or lose him forever. He finally said "Bloodsuckers."

 

Blade's eyes flashed, "If this will help us find the bloodsuckers, why do we only have one? Why not more?"

 

"Now, now, we should start small and grow tall - "

 

"Platitudes," Blade huffed, "Time is short. Will it find the bloodsuckers or not?"

 

King knew that when he said bloodsuckers, he'd meant the broken processes and data that drain the lifeblood from a company, "Yes, it can help us find them."

 

"Good," Blade finally said, slowly strolling toward the machines. He stared at them for a long moment and finally said. "You work for me, now."

 

"Uhh, Blade," King said, "They can't hear you, they're machines."

 

Blade didn't say anything.

 

"Oh, and I have this," King produced a small metal plate and held it out to Blade.

 

The warrior turned and stared at the object, curious as to its nature. "And this?"

 

"Is a Final Interrogation Node," King said, "For use when you are about to dispatch a bloodsucker."

 

"How does it work?"

 

"You wrap the wrist-strap here," he applied the strap to his own wrist, holding the plate in his hand, then flicked his wrist. The plate flew to nearest stone column, remaining connected to King's wrist with a tether made of high-tensile filament. The plate sank into the stone with a dull rrrriiiiinggg. . King then flicked his wrist again and the plate dismounted, the tension in the tether returning it immediately to his open palm.

 

"That was fun, but what does it do, really?"

 

"When you're done asking questions that anyone can get answers for, the FIN takes it to the next level. And if you have one in each hand - "

 

"Twin Fins, very funny."

 

"You'll still get the answers you're looking for."

 

"I'll always get the answer I want eventually."

 

"Uhh, well, isn't that what the bloodsuckers say? Anyone can give the right answer slow. But these," he held up the FINs," Get the right answers faster than anything."

 

"Even faster than me?"

 

"Faster than Blade alone," King smiled, "Yep, even faster than a blade and all its servers. You still need the FIN's and special sauce. Bloodsuckers don't have those."

 

"Competitive advantage," Blade said in a low whisper, "I like it."

0 Comments Permalink

When I was a young lad, my Dad purchased a 1946 Willys Jeep. For any of you who are Jeep aficionados, you know that this is a direct, post-war Jeep complete with a starter button (war Jeeps didn't have car keys) and four-wheel shift gears. Dad had this thing re-fitted with a power take-off (a rear-gear for attaching appliances) and had purchased a bush-hog to attach to it. Off my Dad went on our property, Jeep in full tilt and bush-hog in tow, slicing and dicing bushes and small trees from our property like a veteran landscape engineer.

 

One day the trailer hitch had an issue - the towing ball had somehow become bent and needed replacement. Yes, Dad worked these machines to their extreme. Now, if you feel a bit out of place with all these odd terms, imagine my hubris in thinking I knew everything about them just by watching my Dad work with them from the sidelines.

 

In any case, he took the Jeep in to a shop to get the thing fixed, and this mechanic started working on the trailer hitch to loosen it up. Strange thing, though, he was turning the bolt clockwise to get it undone. And everyone knows that in order to undo a bolt, you turn it counterclockwise, right? Of course, those in Australia and Brazil might not turn it this way, but that's an inside joke, too. So I quipped, "You're turning it the wrong way."

 

To which this mechanic simply replied, willing to engage an uppity kid while my Dad just offered me a hot stare, "Are you sure?". To which I responded, thinking that the mechanic actually thought I was a viable entity, "Yep, I'm sure." To which the mechanic said, "You want to bet ten dollars on it?" To which I immediately responded, thinking easy money - "You bet."

 

At this point my Dad simply leaned into me and said the words I would never forget, even to this day, as I share them with you.

 

"Never bet on the other man's game."

 

This initially had a hollow ring, considering that I was on the brink of winning ten dollars, but in that moment the mechanic wrested the object free from its mooring in spite of having turned it the wrong way all that time. And I learned something new, that some devices actually do unscrew in a clockwise direction. Lesson learned, and I did not lose ten dollars. The mechanic was merciful.

 

Licking my wounds and regarding my status of having dodged a bullet, I gained a new appreciation of knowledge, learned in a simple way, that the other man's game is something to approach with high trepidation and respect. If it really is the other man's game, he knows it better than I, so what business do I have on betting with it? It's a sucker bet at best. He knows the game better than I do.

 

So it is with the appliance wanna-bees who have attempted to bet on Netezza's game. That the appliance is the way to go, and they have invested many millions of dollars in attempting to topple Netezza, or at least steal the market share. But this is yet another case of betting on the other man's game, and nobody knows this game better than Netezza.

 

And now, Netezza has changed the game, leaving the competition in the dust to once again lick its wounds and wonder, why did they ever bet on the other man's game, and now, what game are they really in?

 

The new Netezza architecture has upped the ante on the existing game, and moved it in another direction that, in no uncertain terms, changes both the game and the stakes to play it.

 

Apart from browsing the white papers and gathering your own general specification insights to the environment, I can say as a veteran who has worked with this technology extensively that I had a short wish list of things that I thought would be really nice to have. I had a short list of what I thought were functional shortcomings that I had found simple workarounds for, and could painlessly ignore. But now, with the new architecture, those few shortcomings were washed away. The short wish list was fulfilled, and so much more. And in the end, I am a happy clam.

 

The short runway of things I am looking forward to includes the capacity to cache whole tables, Linux on the lower deck, the Intel-programmability of the parallel environment, and the additional capacity both in storage and in processing power. And these are just a few of my favorite things.

 

Once upon a time, I worked with real-time engines for embedded systems, and was enamored with one software vendor's ability to stay ahead of the pack by simply assimilating the innovations of other competitors. One has to imagine that once a vendor is out-in-front, they can maintain their position through this assimilation process. If they are not out in front, then assimilating other vendors' innovations doesn't have the same impact, because nobody is a frontrunner.

 

That Netezza can take the innovations of other (major) vendors such as IBM and leverage them through simple assimilation, is yet another testimony to Netezza's position as the well-in-front frontrunner. While other vendors attempt to duplicate or imitate, Netezza just moves on, changes the game and leaves them in the dust. Innovations from the vendor remain ensconced (and enhanced) in the new architecture, while other technologies are easily assimilated. That this has given the architecture a stratospheric boost is a testimony to the original architects and visionaries, as well as the existing ones.

 

All that's a lot of gushy sentiment, though, compared to the tailspins that the wanna-bee competitors have been in since they got their first news that the winds were changing. I could use a lot of sailor/sailing analogs here, but I'll spare you. The fact remains, the competitors are scrambling all-hands-on-deck to reset their goal for market share they never really achieved. Could this mean that they are sunk altogether and don't know it yet? Who has a crystal ball, except that we could now pump these quantities into the Netezza architecture and get an answer back faster than they could.

 

Right answer faster: Priceless.

0 Comments Permalink

I heard this sentiment from a senior manager at a large-scale data processing facility, so I thought I'd post it as a provocative talking point. In his mind, when something went really south in the scheme of things, he had to evaluate people as to whether they were incompetent or immoral. Or something in between. You never know what a manager is thinking, apparently.

 

You see, in his mind, he needed a means to label a person rather than an activity. On the other hand, I like the sentiment of another famous consultant, who when asked how things could get so bad, would simply quip "Because honest, hard-working people did the best they could with what they had." Hmm - there's no incompetence or immorality there, just the realization that things can and do go wrong. Case in point, last year we had a project where the workload grossly exceeded the headcount to make-it-happen. With a thousand spinning plates in the air, and not enough people to keep them spinning, invariably a plate would fall to the floor and crash. The manager above would call a meeting and review why the plate crashed, and find someone to blame for it. But in the end, the plate crashed because it's what plates do.

 

Debating the "why" is a waste of time.

 

We see this when the "critical mass" exists to switch horses, sometimes in midstream, from a powerhouse, legacy and mainstay kind of technology to a new, shining future in another, more promising technology. Ahh - you see where this leads. Someone now has turf to protect, and the review of a new technology - or even the hint of replacement - is viewed as an indictment of the existing technology. And, you guessed it, an indictment of the existing technologists. Because of the manager above, the people in the mix begin to wonder if they are being labeled as incompetent, for defending an inadequate technology, or immoral, for having another motive, like defending the technology because it will help keep their job, regardless of whether it is the best choice for their company or team.

 

And if they perceive this labeling, they too will fire off their own labels and soon we see the makings of a classic conflict. I spoke with a leader who had just weathered such a conflict, and he said that he couldn't believe how quickly his seemingly objective, science-minded technologists were reduced to feral animals practically overnight. He didn't really have to muscle-through the process like some other extreme cases, but it is important to note that the conflict is real. The drama brings out interesting colors in people, and shows us what they are made of. And like the second consultant above, it's usually not bad stuff. Just human stuff.

 

People who make an investment in one technology find themselves with an emotional and professional attachment to it. Like hanging on to a stock ticker even when it's in free-fall, hope springs eternal. Our investment isn't really for nought - if we can just wait it out. My challenge to the average Joe out there is to do what you've been doing, stay the course and keep the high road. Bad-mouthing the existing technology, or the existing people running the technology, is not a profitable path. When we think about it, the "Enzee way" is to let the machine's power and architecture speak for itself. After all, if we have to resort to the same nefarious activities as a wannabee competitor, doesn't it speak volumes about what we really think of our favored son?

 

Some time ago, I was helping with a competitive POC and when we finally reported the metrics for loading, query and whatnot, we decided to show Netezza in its best possible light, and we agreed with everyone that this would be the case. We watched across-the-way as the competitors stayed late nights, carefully tuning their machine and its attendant parts, while we just tossed data into the Netezza box, did some basic distribution tuning and that was that. In fact, after getting some initial metrics, we took the worst times and reported on them, not the best times for the loads and queries.

 

When we reported our final numbers, our metrics blew away the competition by a factor of five or more, in some cases much more. When we told the decision-makers that we'd only spent a few hours on the POC, and even then only reported the worst-case numbers, they were stunned. Primarily because the competitive team had spent so much energy on tuning their technology, only to fall short at 20 percent or less of what the Netezza machine could do.

 

And doesn't this kind of story speak volumes - and show Netezza in the best possible light? After all, it's not really a good story to tell if we have to spend countless hours tuning the machine. The decision-makers know that for the sake of competition, we might spend a lot of time to "get that benchmark", but it will be the only time they ever see the benchmarked metric because they know we won't embrace that kind of intensity when we've actually deployed the technology.

 

I saw a "famous" benchmark on the internet, touted by vendors other than Netezza using technologies that were carefully tuned for the outcome. You know, like an Olympic athlete trains the daylights out of his body to get that one-shining-moment. But catch up with the same athlete years later and find them out-of-the-game, no longer the feared competitor for one primary reason - they can meet the bar once. But they can't sustain it. And this is true of the famous benchmark. They can tune the daylights out of those technologies, and take them to new, never-known heights, and break world records. But if you really want to deploy these at your site, you'll get the standard disclaimer.

 

Your actual mileage may vary.

 

This variability is not found in the Netezza experience. The machine delivers the kind of sheer power and turnaround we need just by breaking the plastic and plugging it in. We don't have to spend countless hours tuning the machine under a hot lamp and after gallons of Red Bull. Power - effortless power - at our fingertips - really is the best possible light for showcasing what the machine can do. One of the customer decision makers said just that - if it requires a swarm of people to get the competitive technology to remotely the same level as a Netezza machine can reach just by powering-on, what kind of story does this tell? That we are committing to high-intensity deployment and maintenance for the life of the technology?

 

Cost of ownership has a lot of different meanings, no?


On a more recent project, we did the common "light" organization of the data and then the report developers cut their BI tool loose on it. When the smoke cleared, the turnaround times on the reports were abysmal. Some of them executed in minutes, some of them tens of minutes and some of them never came back at all. Then the finger-pointing started (from the reporting team), and they could not lay enough blame at the foot of the Netezza machine. But soft, what light through yonder window breaks? It is the Marlboro Man, to carefully show the reporting gurus why the Netezza machine is not an SMP/RDBMS machine, and needs a few additional hints (e.g. zone map refs at the query level) to make the reports turn around at keyboard-speed. Honestly, if it were any other technology - like an SMP/RDBMS - and we encountered such abysmal turnaround time, the answer really would be to fix the database, in the data structures, the indexing, or even at the hardware level. How amazing is it that rather than "going back to formula" - we can just tweak a query or two, and lo, we have stratospheric performance?

 

As it should be.


There is a temptation, you see, to protect the turf one loves so well, by somehow telling a story that does not meet with reality. And in all this, it's no different than saying we cannot get our toaster or blender to work in our kitchen, even though we aren't using them as described in the owner's manual. Netezza is an appliance. It has measurable, deterministic behavior and simply does not deviate from its prescribed, self-contained nature. For someone to claim that a kitchen toaster doesn't work, one only has to ask a few simple questions to determine whether the toaster really doesn't work, or if it's just not being used correctly.

 

And in our case, the Netezza machine is a more complex horse. But the interface to the horse is still the same - a pair of reins and a pat on the neck, and the horse behaves just like we expect it to. Of course, getting four-hundred horses to behave the same way in lockstep is a matter of architecture. But imagine how much work you could get done if you could package four-hundred-horsepower for useful work? It's the difference between a 32-horsepower (SMP/RDBMS) oatmeal-mobile - or a 400-horsepower street machine with e-brake for those drifting stunts.

 

Yeah-man, give me the street machine any day.

0 Comments Permalink

Now here's a luxury we don't see every day. After all, if we're a car manufacturer, perhaps an aircraft builder, we have to get it right and make it fast all as part of the original design.

 

Which is not to say that we should just choke up a bunch of data structures and expect Netezza to cover our backs - oh wait - perhaps we really can do this, but it's not always practical.

 

Early in my career I worked with embedded / real-time systems, and while some really believe that they work with "real time" - here's the purist definition: A robot balancing a broom on its open palm. The robot has to make infinitesimal adjustments, in real time at the microsecond level, to keep the broom from falling. In "business" real time, however, we have the luxury of whole seconds to make a decision!

 

In these embedded systems, we had to take things through to a complete functional shakeout. Only then could we see where events collided or didn't make sense. So-called "race" conditions and meta-stable conditions - yep - for those of you who know what these things mean, they can cause you to go gray early and stay that way.

 

Ahem.

 

So the maxim here was always "get it right, and then make it fast". I took this "RTI" (little three-letter-acronym (TLA) for Run-Time-Improvement), into a commercial venture where I developed an expert system engine for medical claims processing. Only when the system was behaving correctly could we then find ways to pinpoint the hot-spots. In one case, one percent of the claims took over 90 percent of the processing. We dug deep to find the issue, optimized for it and voila - the system screams like a banshee. In fact, these improvements boosted processing speed for the areas that were already fast - so it was a double-win.

 

People asked us at the time why we allowed the development process to "suffer" with "poor" performance up until the very end - when we knew good and well that attempting to optimize it while it was still in functional flux, didn't make any sense. We can do sweeping improvements only so much and only so deep. But what if we spent all of our time improving something that three weeks later the client says they don't want any more, or want to go in another direction, and all of our improvements are for nought?

 

Nay, wait a little longer - make it right, then make it fast.

 

Fortunately - and quite luxuriously - in our space the Netezza platform dovetails directly into this approach. It already has all the juice to help us succeed even if we do things badly or inefficiently. But when the smoke clears and we have a working prototype, it's time to roll up sleeves and pop-the-hood, so to speak, to do some RTI on a working system. This is where it gets fun.

 

The following is a short list and is by no means comprehensive:

 

(A) Line up the operations in their natural sequence. Find the ones that are longer-running and optimize them. This will provide some degree of relief, but it's still just low-hanging fruit.

 

(B) Locate in-line calculations in the where-clause or join-clause, and reduce or eliminate them. In one case, we needed to join on a date-time where the "drift" allowed the timestamps to differ by plus-or-minus three seconds. Rather than put the time+3 and time-3 calculations in the actual query, we precalculated two additional columns, one with the time minus three and one with the time plus three. We then used a simple "between" operator to get the answer. The payoff: a 1000:1 difference in the two run times. In-line calculations in the where-clause are an invisible power drain. Get them outta there.

 

(C) Find the processing patterns and consolidate them. For example, we had a BI tool executing eight queries to achieve a report output - and each query took approximately 20 seconds. Not bad for having to plumb the equivalent of 53 terabytes of information. Yet eight queries like this delayed the report's display by 160 seconds - almost three minutes. Users don't like to wait this long. So to optimize, we focused on the pattern, that the same basic query existed at the core of each of the eight. By taking the "hard part" and performing an up-front query that did all the heavy-lifting at once, we were able to take the 20 second penalty only once, and the 8 downstream queries returned in 4 seconds or less. So - 160 seconds to 34 seconds with a simple logistical change.

 

(D) Pre-filter or precalculate when consolidating. This means taking a common downstream operation, especially a repetitive one, and moving it upstream to another, even unrelated operation. The above time-drift is one example. Calculate and filter as early as possible, and then all the downstream operations benefit from it. This can offer up more than just a "spot" boost, because if it shaves a few seconds off every downstream query, this can quickly add up to shaving tens of minutes and even hours off our overall processing time.

 

(E) Mind the gap - of what we understand about how an RDBMS works versus Netezza. For example, if we want to leverage filter power in Netezza, we could use a "where exists" clause rather than a regular join. If the regular join cannot leverage a distribution key, then the where-exists is a highly performant option. Likewise if we have a view in the join that does more than just serve up data, like doing a sub-join itself. This can be very costly, and is another hidden drain on the performance that we can pull into a where-exists. So another "gap" is in merging two data sets where they have no common distribution key. The where-exists and similar operations force the machine to obey our optimizations, because we actually know the data, where the machine simply exchanges it with us.

 

(F) Avoid squeezing blood from a stone - It is tempting for said adrenalin junkie to see a cool way to reduce 2 minutes to several seconds - after a rush like this, it becomes addictive. We should not let it go to our head. In one case of processing a nightly batch, one of the client's 14-hour processes had been reduced to less than fifteen minutes. Yet some in the room still groused for more - they came up with an outrageous plan to reduce the processing to less than five minutes, but only four people on the entire planet could ever maintain it, much less enhance or improve it. At some point we have to agree that enough is enough - especially when we are sacrificing valuable things (like extensibility, flexibility etc) for the sake of a few more minutes of adrenalin rush. We must resist!

 

(G) Focus on the target system, not the one you are leaving behind. I've never known a case where someone moves into a new home and refuses to use the appliances in the home just because they didn't exist in their old one. Nor have I ever seen someone buy a new home and then attempt to fill it with new furniture that would only fit in their former home. Who does stuff like this? No, we should use the former system as a functional baseline (describing what) but not focus on how we implemented the baseline - rather our new technology gives us the ability to spread our wings and fly in ways that we never could in the old environment. Example: One client had over 400 stored procedures to convert, and regarded these stored procedures as the baseline for the actual work, rather than the baseline for the functionality alone. When re-characterized, it reduced to four flows with a handful of core operations each - all with very simple implementations. Trust me, when it takes its final form in Netezza, it will look nothing like its former self.

 

Haven't yet experienced a run-time improvement cycle, or committed time to making it happen? It's worth it - even if only for a sanity-check on an existing implementation or a proof-of-concept on an upcoming one.

 

Either way, we can reach functional closure in a fraction of the time of any other system, and once the functionality is stable (not necessarily locked down, just stable) we can find surprising and dramatic boosts with little additional effort.

 

 

Make it right (functionally speaking) - and then make it fast -


Before a professional visit to London last year, a friend of mine said to me "Mind the Gap" - and said it's something I would hear a lot. He did not mention the primary context of this phrase. Seems that in some of the London Tube (subway) Stations, there is a significant gap between the platform and the door to the subway car. Westminster station and Kensington station even have "Mind the Gap" engraved on the platform, with a pleasant voice intoning this phrase over the PA. That "gap" can be as much as 8 inches, too wide to just drag a roll-aboard into the car, and too wide to expect a small child to get it right.

 

I can say that in standing up a new Netezza environment there is a common "gap" between what the new users expect to see versus what they experience. Closing this gap will not only accelerate productivity, it will get you to closure as quickly as humanly possible, without the humans doing the heavy lifting.

 

SPUs do the work - In the NPS, the SPUs are the workhorses, not the overall machine. Push all of the work to the SPUs and avoid hitting the host for a lot of work.

 

What does this mean? Typically you will find a new user implementing the machine in the same way they would implement an RDBMS - that is - thinking about the problem in a single-record-at-a-time model. It is sometimes a challenge for people to restructure their thinking into a "bulk" approach versus a singleton approach. Here's an example:

 

Let's say I have an RDBMS stored procedure with four rules. The common approach is to open a cursor, pull a record-at-a-time, process each of these entities in context of the four rules, and then persist the results. One of our customers has a particular procedure that follows this protocol, and each of these 4-rule entity processes takes about 30 seconds. For 3500 entities, this can eventually add up to hours of time as the table grows.

 

How would we solve this in Netezza?

 

We would focus the work so that each rule is applied en-masse to all the records. We know that Netezza can process 3500 records in the blink of an eye. So if we simply take these rules and "stand them on their side" - we get the effect - Rule 1 for 3500 records, Rule 2 for 3500 records etc - and we finish all the rules in less than a minute.

 

When we say - "apply rule 1 to 3500 records" - this means persisting the data to a temporary table to be consumed by Rule #2.

 

Yikes - some of you will say - temp tables. You're kidding right? This first, knee-jerk reaction to temp-tables is expected from those who shun temp-tables in the RDBMS, because they are so expensive. In Netezza space, the temp-table is your friend. In fact, a most significant ally.
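The idea of standing the rules "on their side" can be sketched as a small generator that emits one set-based statement per rule, each one reading the temp table left behind by the rule before it. Everything named here is hypothetical (the entities table, the rule_N temp tables, and the four predicates), and for brevity each rule is modeled as a simple filter. The script only prints SQL; it would need to be handed to nzsql or a similar client to do any work.

```shell
#!/bin/bash
# Sketch only: emit one CTAS per rule, each consuming the prior rule's output.
# Table names and rule predicates below are hypothetical.

emit_rule_chain() {
    local source_table=$1; shift
    local n=0 prev=$source_table
    local predicate
    for predicate in "$@"; do
        n=$((n + 1))
        # Each rule runs once against ALL rows, not once per row.
        echo "create temp table rule_${n} as select * from ${prev} where ${predicate};"
        prev="rule_${n}"
    done
    # The last temp table holds the rows that survived every rule.
    echo "select * from ${prev};"
}

emit_rule_chain entities \
    "status = 'OPEN'" \
    "amount > 0" \
    "region is not null" \
    "updated_on >= created_on"
```

Because each intermediate result lands in its own table, any single rule can be inspected or rerun on its own, which is much harder when all four rules live inside one cursor loop.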

 

Queries do the work?  In an RDBMS setting, it is typical to see "big fat" queries that span pages and pages of work. The owner of this kind of query knows that the data coming off the disk had better come off only once, and be written back only once, and all the work that needs to get done had better get done in-transit, while the data is on-the-move.

 

The maxim of this approach is transactional-thinking - that "If the data is in my hands, I should do as much as possible before sending it back to storage".

 

But this maxim is anathema to a Netezza implementation, where we might see dozens upon dozens of "ELT" queries that manufacture intermediate results toward a conclusion in a fraction of the time of their "big fat" counterparts.

 

In short, when we force the query to do the work, we dogpile all of our logic into a single query. When we let the SPUs do the work, we snap apart the query into smaller, more digestible chunks, and the data never leaves the SPUs until we want to consume the final product.

 

SPU means something: In the Netezza machine, SPU stands for Snippet Processing Unit, which tells us that the machine already intends to break the SQL apart into manageable chunks called snippets. Each snippet finds a home in various parts of the architecture. What we want to make sure of is that the SPUs are getting the majority of the snippets in the query (not the host or the network fabric between SPUs), and one of the best ways to do this is to avoid dogpiling a lot of logic into a single SQL query with the expectation that Netezza will just sort it all out. Oh, Netezza will give you an answer, usually in a fraction of the time of its RDBMS counterpart.

 

But when we really want to optimize the machine, we need to think like the machine does, and this often means injecting simpler SQL statements into the machine, capturing intermediate work in parallel tables, and deriving the same conclusion in yet another fraction of the time.

 

As an example, one of our clients has a query structure like this:

 

select

   sum(a), sum(b), sum(c)....etc

from

    lots of join conditions here

    lots of filter conditions here

    lots of group-by conditions here.

 

For this query, executed by a BI tool, the result came back in about 10 minutes, owing to the fact that one of the tables was over 8 terabytes in size and two others were over four terabytes in size. We had applied all the zone maps we could, and had generally tuned the query for the best fit, down from (a lot more than 10 minutes) to 10 minutes. Yet 10 minutes still seemed like an eternity.

 

One of the problems - the joins occurred across distribution boundaries - so the larger tables were on one distribution and the smaller tables on another, so that bridging them was problematic.

 

The simplest fix was to divide this query into two, with the heavy lifting split across them.

 

Query 1:

select

    (raw columns)

from

    heavy-lifting tables

    using applicable filters

into a temp-table containing only a subset of data

   distributed on the key for Query 2 tables

 

Query 2:

select

       sum( raw columns)

from

        temp table above

        joined to additional tables, leveraging co-location

        and final filters for additional tables

 

 

The execution time for the first query was about 15 seconds. For the second one, about 5 seconds.

 

So in a single implementation, we have reduced the time from 10 minutes to 20 seconds just by breaking up the "big fat query" into more digestible parts.
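To make the two-step shape more concrete, here is a sketch that prints one possible pair of statements. All table and column names (big_fact, big_detail, account, account_id and so on) are hypothetical, as is the date filter, and the sketch assumes the smaller account table is already distributed on account_id. The "distribute on" clause on the create-table-as statement is what lines the temp table up with the Query 2 tables; check the exact syntax against your NPS release before relying on it.

```shell
#!/bin/bash
# Sketch only: the "big fat query" split into two statements.
# All table and column names are hypothetical; SQL is printed, not run.

query_1() {
cat <<'SQL'
-- Query 1: heavy lifting, filtered early, redistributed on the Query 2 join key
create temp table heavy_subset as
select f.account_id, f.a, f.b, f.c
from big_fact f
join big_detail d on d.fact_id = f.fact_id
where f.txn_date >= '2009-01-01'
distribute on (account_id);
SQL
}

query_2() {
cat <<'SQL'
-- Query 2: co-located join to the smaller tables, then summarize
select acct.region, sum(h.a), sum(h.b), sum(h.c)
from heavy_subset h
join account acct on acct.account_id = h.account_id
where acct.active = 'Y'
group by acct.region;
SQL
}

query_1
query_2
```

The design point is that the expensive redistribution happens once, on a filtered subset, instead of on the full multi-terabyte tables every time the report runs.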

 

 

Does it always work?  In most cases, if we approach the problem in a way the machine will ultimately solve it at the SPU level, and keep the data on the SPUs for as long as possible, then yes, emphatically so. It always works.

 

Is it always necessary? Hmm. Well, you tell me. If your users are okay with a 10-minute turnaround on a query, then no, it's not necessary. What is necessary, however, is to be a proper steward of the data and the processing resources hosting it. In practically every case, running a big-fat-query is inefficient and wasteful, and largely borne on the transactional maxims/constraints of the RDBMS approach.

 

Unhook your brain from RDBMS-styled problem solving, and get away from transactional thinking. This is bulk data processing. Everything we do should address data in millions-of-records-at-a-time, not a record-at-a-time.


What's heating up about as fast as Summer here in Texas, is the excitement over the upcoming EnZee World Tour.

 

I am especially excited this year because I've been tapped to host/emcee the Best Practices sessions in each of the cities, which means that I'll get a front-row seat to hear how the masters of the technology ply their trade and make the Netezza machine sing.

 

After all my fellow Enzees - you are the ones gathered 'round the grill and the ones who make-it-happen. Others of us are often in awe of the rather inspired means and outcomes you so deftly deploy with the technology, and integrate it to the technologies around you.

 

Of all the questions I hear at a customer site on the basic workin's of the machine, there's nothing like sharing war stories with people who pull all those things together and instantiate an operational environment. Especially when you do it by utterly eclipsing the performance of Netezza's displaced predecessor. And here's where we really want to hear the down-low on how things used-to-be versus how-things are.

 

In many cases, I hear that you had an easy time of bringing in the box and making it go. But making the technology go wasn't nearly as difficult as bringing-in-the-box - especially if you have to wheel it past the sneering eyes of doubters or political players who want to see it fail, or at least see it be not-so-wildly successful as the current expectations might dictate.

 

But Netezza really does meet those lofty expectations, doesn't it? And one of the stories we all love to hear is that type of victory - the dark horse so to speak - championing the cause amidst the pressure of anything-but-technology. The odd thing about new, better technologies is that they are so much better than old technologies that the older technologists cannot believe their own ears. Orders-of-magnitude more power you say? Tish tosh, you must be mad.

 

So when we get into best practice sessions, we speak of things like scanning a terabyte, or 2 or 10, and complain that our query can't seem to cross the X-number-of-seconds boundary. Seconds, mind you. And people hear this and wonder what the complaint really is - after all we can't be working with real data because terabyte-sized table queries always take hours to run, or hadn't you heard this?

 

I recall sitting in on a session with a bunch of people who honestly had money-to-burn. One of them complained that they could not get up to New York often enough, and every time they went their favorite restaurant/play/whatever seemed to be oversold. One of them complained about a broken drawer in his private jet, while another complained about the drafty interior of one of his summer homes. Still another said that they had spent 150k on custom teak wood in their 140-foot sailboat, and had it all ripped out and replaced because it "didn't look right". Ahh, money to burn. People with a completely different list of priorities than the average Joe like me.

 

I say this for contrast, because the things we speak of as Enzees, with the power available at our fingertips in the machine, are utterly foreign to people who have never experienced the power themselves. And it's interesting in best-practice space when we talk about squeezing 9-hour processes into 9 minutes, and then hear our business counterparts wonder if we could squeeze out just a few more. A best-practice balancing act is getting to the solution without over-engineering, and some of you consider this an art form.

 

So Enzees, Artists and those who would kick-the-tires, gather round the grill and let's fire up those steaks, veggies and what-have-you - then the only thing hotter than Summer will be the ideas coming off the cooker -


Honor the Host

Posted by David Birmingham May 26, 2009

Some enterprises will stand up a Netezza machine and point all their data processing towards it. They wouldn't think of actually installing anything on the Netezza machine (such as database clients or other client software) and of course, are strongly advised against it by the vendor. Why is this? The Netezza host has a lot of work to do in keeping those spinning SPUs happy and busy. Adding other duties can detract from this critical mission, and we don't want that.

 

But we can also abuse the host in subtle ways. A case in point follows - you may have other tales to tell.

 

We always have a need to pull in a wide variety of files. In this particular case, dozens of intake tables in their various staging locations. In many installations, the intake table definitions are few, discrete and stable. But in just as many, the staging tables will mirror the upstream sources, with one table for each upstream interface. In our case, handling source-to-target with no ETL in between. We extract directly from the source into an intake table definition that mimics source column names, but the data types are all varchar to facilitate "dirty" intake. The objective is to get the data into the machine.

 

Then we convert this intake table to its final form, the internal Netezza table that is identical to the source table in column name and type. This conversion is a simple table copy, mechanically speaking, but we have to do some light ELT to make it happen. For example, we need to guard against nulls, empty strings, bogus numeric values and the like. In our case, numerics could be dozens of characters in width because the upstream definition happened to be a view with no defined precision. A typical intake SQL could look like:

 

select

case when column is null then value else column end,
case when translate(column, '-+.0123456789', '') = '' then column else null end,

etc

 

Such that each column is wrapped with this kind of logic (call it "Intake ELT"). Now, we don't manually wrap these column defs, we do it dynamically from the Netezza catalog definition. (And for efficiency, we cache it for later reuse, but that's another story).
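A minimal sketch of that dynamic wrapping might look like the following. It reads "column_name target_type" pairs on standard input, which stands in for whatever the catalog lookup returns, and emits a select statement with each column wrapped in guard logic. The function names, the type list, and the choice to turn null text into an empty string are all assumptions for illustration, not the actual framework.

```shell
#!/bin/bash
# Sketch only: wrap each intake column in guard logic based on its target type.
# Input lines are "column_name target_type"; in practice these would come from
# the catalog. The guard rules here are illustrative assumptions.

wrap_column() {
    local col=$1 type=$2
    case $type in
        numeric|integer|bigint)
            # Keep the value only if nothing remains after removing numeric characters.
            echo "case when translate(${col}, '-+.0123456789', '') = '' then ${col} else null end as ${col}"
            ;;
        *)
            # Treat everything else as text; turn nulls into empty strings.
            echo "case when ${col} is null then '' else ${col} end as ${col}"
            ;;
    esac
}

build_intake_select() {
    local intake_table=$1 sep="" col type
    echo "select"
    while read -r col type; do
        [ -z "$col" ] && continue
        echo "    ${sep}$(wrap_column "$col" "$type")"
        sep=", "
    done
    echo "from ${intake_table};"
}

printf '%s\n' "customer_name varchar" "balance numeric" | build_intake_select intake_customer
```

Generating the statement once and caching the text, as mentioned above, keeps the per-run cost down to a single file read.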

 

Now we have an intake-ELT that looks thus:


External Database Table -> network -> intake table ->  Intake ELT -> Staging Table

 

Note for clarity - the External Database Table and Staging Table are "book ends" to this operation, and have the same column names, data types and column order. We don't absolutely require common column ordering, but it's handy for troubleshooting.

 

Note also that this works just as well for flat file intake as database intake. Better, in fact, because we can more easily load multiple files at once than multiple tables at once (the database might not like multiple extracts)


All of this worked swimmingly until we encountered a slightly different kind of data feed, one that had to be extracted from an archival source into flat files. Rather than present the flat file as normal (on the network) the admins decided to use the available on-board Netezza storage pad (5 TB of space). Keep in mind that we were not allowed to execute anything directly on the machine, so we had to set up External Tables on top of these files to load them, rather than using NZLOAD. This, too, worked transparently and all was well. Then a "bright idea" occurred, that in the above equation the Intake ELT faced a table (our intake table) and couldn't we just use the intake ELT right on top of the External Table, eliminating the additional middle-man?

 

Like so:


flat File -> External Table -> Intake ELT -> Staging Table


The above configuration only appears more efficient by eliminating the Intake table. Looks are quite deceiving, considering how much "per-column work" the Intake ELT had to perform to get data into the Staging Table. What is not obvious is that the Intake ELT is now sitting on top of the External Table, which is a Host-managed table, not a SPU-managed table. In this configuration, we have reduced our power from a 108-SPU problem to a 4-(Host) CPU problem. The immediate loss of power was measurable in orders of magnitude.

 

So under the covers, here's the power-plant difference in the two models:

 

External Database Table -> network -> intake table ->  Intake ELT -> Staging Table
                          |----HOST -------------|---SPUs--------------------------------|

 

flat File -> External Table -> Intake ELT -> Staging Table
             |------HOST ------------------------------|


So we can see that the second model is abusing the host with the Intake ELT, and if we go with the original model, the ELT will be handled by the SPUs, offering the necessary scalability and power. In a continuum, we can see where we might initially install nzload or external tables and perhaps "tweak" them along the way. Then a maintenance developer comes along and sees that the "easiest" place to add a fix is in the external table or the nzload rather than pushing it to SPUs. The external table and nzload can (and should) do light-intake formatting per their interface specifications, but no further.

 

The over-arching directive remains the same - get the data into the SPU-based tables as rapidly as possible and then do the "dirty-work" with massively parallel power.


Ahh, the theme of so many horror stories, where the heroes plod along life until met with something they don't understand. Like a member of the Borg Collective, Agent Smith from the bowels of the Matrix, or Tom Cruise, cruisin' along in a dead-end life before the Tripods from Mars rip a hole in his, er, reality.

 

Got a call last week from a buddy of mine who's in an all-RDBMS shop. And powerful, too. Some of their SMP machines have forty and fifty-plus processors on them. They do heavy-duty processing, don'tcha know, and have no need for any technologies newer than what they already have. The reason this call was so strange, was that we'd just caught up not six months prior, at which time I reveled on about the Netezza technology. He didn't have anything to say about it, no opinion as to its benefit or purpose. This call was different.

 

Like many conversations about this, following is a composite of several, but I'll use my buddy as a springboard, because he's a good sport.

 

"David, need your help," he said, a tinge of urgency in his voice.
"Shoot."
"Some people here are talking about bringing in Netezza for a test drive," he tells me.
"Good for them. You - "
"Stop there," he pushed back, "Let me tell you what they want to do. They want to replace our primary data marts with this stuff."
"It's good for that," I said, "It's purpose-built for high-speed reporting."
"But our reports run fine," he asserted, but didn't really sound convinced.
"All that mess you went through last year," I reminded, "When you needed to add more dimensionality and had to take your entire schema back-to-formula?"
"So?" he said, "No technology is immune from that."
"Restructuring your entire indexing strategy to get better performance, and all that denormalization and renormalization to balance the workload?"
"Necessary evil," he asserted.
"Evil, true, but necessary is a function of the chosen technology."
"What, you mean Netezza can just keep assimilating information without ever having to refactor the indexing? Who are you kidding?"
"Not kidding at all," I said calmly, "Netezza has no index structures, so there's nothing to manage."
Silence.
"You still there?"
"I'll have to call you back."
"Okey-dokey - " but that's all I could summon before the line went dead.
Fifteen minutes later he called back, out of breath, "Okay, tell me more about this no-indexing thing."
"Just that, no indexes. Netezza doesn't need 'em or use 'em."
"Then it's slow as molasses, and not a threat."
"Don't kid yourself."
"I know data, dude. Don't try to - "
"It's not about the data. It's about the hardware. Netezza embraced the truth that power is found in the hardware, and bulk data processing needs access to a lot of it."
"We have a lot of hardware too -"
"But it's configured for general purpose processing, and I'll bet you don't do any of it inside the database."
"Well, no, that would be insane. We do the bulk processing with an ETL tool. It's just faster."
"Have you ever considered why it's so much faster? Or why the rise of the ETL tools? People generally agree we can't do bulk processing inside the SMP machine, because it's not built for it. It will pull data in quantity off the disk drives, process it in quantity and push it back, and the data is meeting itself coming and going on the SMP's backplane."
"So? How does that change anything?"
"Netezza processes data down in the parallel SPUs - data doesn't leave the disk at all, and if the database needs to process data, all of the CPUs handle their own little section of it. That's why you don't need indexes, because an SMP/RDBMS sees the data as a single logical table with monolithic physical data, where Netezza sees a single logical table on hundreds of physical drives."
"I'm not following."
"Okay, when a general marshals troops, does he give specific commands to each troop member, or does he formulate a plan and delegate it to the masses?"
"That's obvious."
"Because it works. It's the only way to manage physical scale. With multiple actors who are incapable of completing the mission alone, we need synergy."
"Or Jack Bauer."
"Not even Jack Bauer could - "
"Hey, you're bad-mouthin' Jack now!" he said playfully.
"Jack's good," I said, "But not that good. Imagine what he could do with a thousand Chloes back at the ranch?"
"I see your point, but if this is just for reporting stuff, I don't really have much to worry about."
"For now."
"What do you mean, for now - that sounds sinister, like the Pod-People leader from Invasion of the - "
"Data Mart Snatchers?" I laughed, "Didn't mean to sound mysterious, but there's more to this."
"Oh?"
"Well, once the mart is in place and operating, someone will notice a pattern of activity. It goes like this: We spend hours processing the data to get it ready for Netezza, and then load it in seconds. The box sits there idle for the next few hours until the users start pounding it, then it goes idle for some protracted cycle until the few seconds it requires to load the next day's data. Something is amiss, because now the slow point is the ETL environment."
"Our ETL environment is state-of-the-art," he said, "We push data like crazy through it."
"And I'll bet if you examine the larger part of the load, you will find that it spends most of its cycles in joining or summarizing, even sorting. Those operations take a lot of hardcore CPU power."
"They always have."
"But what if you loaded the raw data from your ETL tool into Netezza? You said yourself it only takes seconds, right?"
"But then where will we integrate it?"
"Massively parallel joins and rollups inside a Netezza machine are orders of magnitude faster than your ETL."
Silence.
"You there?"
"No, I see what you mean," he said, "Our sixteen-way ETL machine cannot even theoretically compete with a 108-processor Netezza machine."
"And the 108-processor version is the development box."
"Thanks for that."
"Seriously, people will take a look at the "T" in the ETL, and experiment by dividing it into row-level activities and bulk inter-row activities. They will keep the row-level stuff in the ETL and move the larger-scale transforms into the Netezza machine."
"Okay,"
"Assimilating the data mart and the larger scale bulk processing into a single platform."
"Hmm, that would be troubling."
"Why is that?" I smiled.
"Because then our high-powered and expensive ETL tool is relegated to nothing more than row-level scrubbing and data transport."
"Want another kick in the pants?"
"May as well."
"You can do a lot of that scrubbing in regular SQL once the data is inside the box. In massively parallel form."
(Sigh)
"Meaning that your ETL tool is now nothing more than a raw data transport mechanism."
"We have one of those already," he sighed again, then laughed, "It's called 'scp'."
"At this point you've moved a lot of processing "under air" as they say."
"As who says?"
"Netezza of course."
"You mean, they assimilate all this stuff by design?"
"Resistance is useless. Close your eyes and join the collective."
"Aaaggghh!"


One of the most significant challenges of ELT-based processing is the need for housekeeping infrastructure. I mean, we will find ourselves needing temporary tables, and that's okay. But we'll also need persistent temporary tables - that is - tables we create as processing resources in context of a given set of operations, that we might keep coming back to. Or that we might want for troubleshooting. We have to admit, a truly temporary table that evaporates at the end of the session is handy for housekeeping but lousy for troubleshooting. In many ways, the "temp table" should be a means to organize our immediate thoughts, like de-duping a resource list or whatnot. Utility-stuff makes them handy. But when we need to debug intermediate results, we need a persistent table. And then we need to ditch it, with some rules.

 

If we run things in shell-script, whether inside the Netezza Linux host or on a companion Linux box, we'll almost always need temporary files as well. Once we create these, we need a way to get rid of them. And upon creating any of these resources, we need a systematic way to keep them safe, meaning a completely unique way of identifying them. For temporary files, we have many options to timestamp their file names, but for a database table this can be somewhat daunting, considering that we could litter our database with lots of orphaned tables in no time flat. The last thing an operator wants to do is go into a littered database and clean out the trash.

 

The latest release of the Netezza environment provides for stored procedures, which will formalize an invitation (for some) to black-box everything inside the black box. Do not fall into this trap. The black box is only colored that way, it is not intended to be used that way. In the end, we might have componentized stored procedures that do a lot of handy stuff for us, but if their activities are too hidden, we'll hear the same complaints from operators as we hear now when someone violates Rule #10 and performs bulk-data-processing inside their RDBMS, usually with a stored proc. Don't go there. Use the stored procs in peace, but be kind to your operators. Then they won't call you at midnight for answers.

 

So it goes with any operational scenario, but if we embrace some simple housekeeping hooks in a frameworked sense, we can avoid all these woes and pitfalls. The clear objective is to get more business functionality and solutions "under air" - that is - let the black box do what it does best - crunch and munch the data at high volumes.

 

A framework is a systematic way to startup a process, provide common resources for the process, allow the process to consume and leverage the resources as a foundation, and when the process completes, the resources are torn-down and tossed. This allows a given process to simply request a resource from the framework, with the expectation of delivery of course, without having to worry about giving it back, tearing it down or anything else in a housekeeping sense.

 

However, just to keep the operators happy, there is yet another practical means to avoid this headache, and this is to provide more than one database to support the data flow. Many of you are aware of the simplest form of this, where we have an intake database, a workspace database (for transforms) and a target database such as a repository or mart.

 

Staging -> Transforms -> Reposit

 

What this means is that we can intake data from any number of sources and assume that the information is in an unknown or dirty state. We can then apply large-scale transformation (joins, rollups, etc) to the data in massively parallel form. Once completely done, we commit our work to the target with simple table copies. We can afford a table copy in Netezza because of its ability to move data rapidly at the SPU level. Given the right distribution, copying data is a minor penalty when we see what we get back for it:

 

The ability to control the transformation process in a safe zone so that if it fails or corrupts, it is never committed to the repository. We use this technique in ETL all the time, pull data from a source, transform it and prepare it for insert, then commit it. We need to embrace this for ELT as well, because we need to protect the target from corruption with intermediate results from an incomplete operation.
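One way to picture the safe-zone idea is a flow runner that refuses to touch the target unless every transform step succeeded. In the sketch below, run_sql is a hypothetical stand-in that only records statements to a log file; a real version would hand each statement to the database client and return its exit status. The database and table names are placeholders, and connection handling is left out: the sketch assumes the final copy runs in a session connected to the target database and reads across from the workspace.

```shell
#!/bin/bash
# Sketch only: run transforms in a workspace and publish only if every step succeeds.
# run_sql is a hypothetical stand-in that just records statements; a real
# version would hand them to the database client and return its exit status.

SQL_LOG=${SQL_LOG:-$(mktemp)}

run_sql() {
    echo "$1" >> "$SQL_LOG"
}

run_flow() {
    local work_db=$1 table=$2; shift 2
    local stmt
    for stmt in "$@"; do
        # Any failed transform stops the flow before the target is touched.
        run_sql "$stmt" || { echo "transform failed; target untouched" >&2; return 1; }
    done
    # Commit step: a plain table copy from the workspace into the target
    # (assumes this statement runs while connected to the target database).
    run_sql "insert into ${table} select * from ${work_db}..${table}_work;"
}

run_flow my_transforms my_table \
    "create table my_table_work as select * from my_staging..my_table_intake;" \
    && cat "$SQL_LOG"
```

If any step returns a failure, the insert into the target is never issued, which is the same protection an ETL tool gives by preparing rows before committing them.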


Once we assume the need for such a backbone, we don't really need a lot of high-functionality to support it. After all, if shell-script can help this along, I suppose any of you could provide a frameworked model in Perl, Java or .NET. But we might not want to launch an entire development environment just to support these simple activities. They seem simple because they are. Shell script is good because it is inherently a control language and does not tempt us to do a lot of programming. Since we don't need a lot of programming, this is good.

 

When we launch an ELT framework, we need a number of different items to provide context. One is a reference to the aforementioned databases. Understand also that if we have a database that is for our personal development use, it can behave as all three. Once we move toward integration, we would break apart the references and make sure nothing else breaks in the process.

 

We'll need a holding place for end-of-run teardown. Like a teardown-file we will execute at the end of the run. In this file, we will simply export our commands to tear down the resources we create, as we create them. So if nothing is created, there is nothing to tear down. At the end, we just execute the teardown file and it does the trick.

 

Now we'll need a way to create assets both as a local resource and as a more visible one. In building a local resource, we might create-table, or create-view or create-synonym etc and then call a housekeeping function with the database name and the resource name so that it's marked for teardown. This is as simple as an nzsql "drop table tablename" statement. Either way, we drop the resource at the end of the run.

 

We'll also need the more visible form, that is creating a local asset in our Transform database that is identical to the target version of the asset. CTAS magic works for us here, in that we offer up several parameters to a given function, such as the table-to-create name, the "real" table name and the target database. For example, if our transform database is my_transforms, and the target database is my_target, with a table name of my_table, we would want to create a table thusly:

 

create table my_table_temp as select * from my_target..my_table limit 0;


But we can see a limitation here. We cannot run an application with multiples of these without accidentally stomping on each other. If another thread of this ELT stream is launched (say this one is running behind) it will attempt to create my_table_temp as well, and will fail (but might think it succeeded, the table is there after all) and start to write its data into the same resource that is being used by another thread.

 

No, the most appropriate way to deal with this is the simplicity of the AUDIT_ID, a bigint value that we capture at the beginning of the framework's run (and yes, it's a simple sequence value we pull from a sequencer on the Transforms database).
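A minimal sketch of capturing the AUDIT_ID at the top of a run might look like the following. The sequence name audit_seq, the database name my_transforms and the helper name get_audit_id are placeholders for illustration only, and the sketch assumes nzsql accepts -d and -c in the same way as its psql ancestor.

```shell
#!/usr/bin/env bash
# Sketch only: pull one value from a sequence and keep it for the whole run.
# audit_seq and my_transforms are assumed names; substitute your own.

get_audit_id() {
    local transform_db=${1:-my_transforms}
    local raw
    # -A (unaligned) and -t (tuples only) leave just the bare value on stdout.
    raw=$(nzsql -d "$transform_db" -q -A -t -c "select next value for audit_seq;") || return 1
    # Strip stray whitespace before validating.
    raw=$(printf '%s' "$raw" | tr -d '[:space:]')
    case "$raw" in
        ''|*[!0-9]*) echo "could not obtain an AUDIT_ID" >&2; return 1 ;;
    esac
    echo "$raw"
}

# Typical use at the top of the framework:
#   AUDIT_ID=$(get_audit_id my_transforms) || exit 1
#   export AUDIT_ID
```

Refusing to continue without a clean numeric value matters here, because every table name created later in the run is built from it.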

 

With this in hand, we now create with confidence:


create table my_table_temp_$AUDIT_ID as select * from my_target..my_table limit 0;

 

Now we have our very own copy and no other thread, even for debug, will gain access to it. After this, we simply echo the drop command to the housekeeping file, and we're done:

 

echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file

 

At the end of the run, we'll execute each item in the housekeeping file and this will tear down all the assets we created.


Alternatively, we could echo this to another housekeeping file such as:

 

echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file_operator

 

And now we have a way to keep the tables around without losing track of them. The "drop" statement is logged to another external housekeeping file that we will not execute at the end of our current run. Rather, the operator can execute the file once a night or once a week, or whatever, with the guarantee that the assets can be dropped without any surgical activity from the operator.

 

We'll do something else with this table name, though. How about a FINALIZE file that keeps track of each of these assets? After all, we pulled the definition from the Repository as a means to fill it with data that is destined for the same table in the Repository.

 

 

echo "insert into my_table select * from my_transforms..my_table_temp_$AUDIT_ID ;" >> $FINALIZE_FILE
echo "generate express statistics on my_table; " >> $FINALIZE_FILE

 

Now if we never make it to the end, we will toss the FINALIZE_FILE and do nothing. Otherwise we execute the FINALIZE file and commit the temporary tables to the target.

 

And the order of execution is, of course, FINALIZE first and HOUSEKEEPING second!
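The end-of-run sequencing can be sketched as below. The helper names finish_run and run_sql_file are hypothetical, and the choice to leave the housekeeping file untouched on a failed run (so the intermediate tables survive for inspection or restart) is an assumption, not a rule from the framework.

```shell
#!/usr/bin/env bash
# Sketch only: run FINALIZE first, HOUSEKEEPING second.
# run_sql_file is a hypothetical wrapper; here it simply hands a file to nzsql -f.
# It must return non-zero on failure for the sequencing below to be meaningful.

run_sql_file() {
    nzsql -f "$1"
}

finish_run() {
    local run_status=$1          # 0 when every transform step succeeded
    local finalize_file=$2
    local housekeeping_file=$3

    if [ "$run_status" -ne 0 ]; then
        # Never made it to the end: toss the FINALIZE file, commit nothing.
        rm -f "$finalize_file"
        return 1
    fi

    # 1. FINALIZE: copy the temporary tables into the target.
    if [ -s "$finalize_file" ]; then
        run_sql_file "$finalize_file" || return 1
    fi

    # 2. HOUSEKEEPING: only now drop the temporary assets.
    if [ -s "$housekeeping_file" ]; then
        run_sql_file "$housekeeping_file" || return 1
    fi
    rm -f "$finalize_file" "$housekeeping_file"
}
```

Reversing the two steps would drop the temporary tables before their contents reach the target, which is why the order is fixed inside one function rather than left to the caller.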

 

Another simple suggestion is that we make this call into a formal function, such as

 

MY_TABLE=$( create_table_from  "my_table"  "temp"  "$AUDIT_ID" "my_target" )

 

Now we have the variables aligned - the source table in my_target, the "temp" prefix and $AUDIT_ID suffix. We also have something else, as we can now reference our new table in simpler form, as follows:

 

insert into $MY_TABLE (
column1,
column2
)
select * from my_source;

 

But what if the "my_source" here were created in yet another similar thread - itself creating a temporary asset:

 

insert into $MY_TABLE (
column1,
column2
)
select * from $MY_SOURCE;

 

In this case, the value of "MY_SOURCE" is actually "staging_database..my_source". But we don't need to know that, do we? What if we loaded up another source table called "my_source_test"? Could we now point the variable $MY_SOURCE to this new table and have it remain transparent to the above ELT SQL statement? You see where this leads? Flexibility, portability, troubleshooting etc. - all because we embrace a framework that is letting us out of the box.
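One way the create_table_from call could be fleshed out is sketched here, assuming the framework has already set TRANSFORM_DB, housekeeping_file and FINALIZE_FILE. The function body is illustrative and only the call signature comes from the text above.

```shell
#!/usr/bin/env bash
# Sketch only. TRANSFORM_DB, housekeeping_file and FINALIZE_FILE are assumed
# to be set by the framework before this function is called.

create_table_from() {
    local table=$1 prefix=$2 audit_id=$3 target_db=$4
    local new_table="${prefix}_${table}_${audit_id}"

    # CTAS with "limit 0" copies the target's column definitions, no rows.
    nzsql -d "$TRANSFORM_DB" -q \
        -c "create table $new_table as select * from $target_db..$table limit 0;" \
        >/dev/null || return 1

    # Mark the new asset for teardown and for the end-of-run commit.
    echo "drop table $new_table ;" >> "$housekeeping_file"
    echo "insert into $table select * from $TRANSFORM_DB..$new_table ;" >> "$FINALIZE_FILE"

    # The caller captures the generated name from stdout.
    echo "$new_table"
}

# Typical use, mirroring the call in the text:
#   MY_TABLE=$( create_table_from "my_table" "temp" "$AUDIT_ID" "my_target" )
```

Because the function prints nothing but the generated name, the caller can hold it in a variable and never hard-code the suffix into an ELT statement.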


We can build up resources and not worry about whether they exist locally, remotely, in staging or whatever. Another simple aspect of this approach is this - if we don't get rid of the resource(s) at the end of the run, or the run aborts prematurely for any particular reason, all of the resources we have already built remain built and filled - we don't need to start the ELT from the beginning.

 

Let's say we pull from ten different sources, integrate the data with a workload that takes about an hour (hard churning on billions of records) toward a final target of five reporting tables. If we get to the end and the last table has to abort for any number of reasons, we quit and don't commit. However, our threads are in a condition that allows us to restart from where we left off, and not repeat all that work again, with the added benefit that nothing has been committed to the repository - yet. So we don't lose all the processing time, we just pick up where we left off. Such a scenario requires a manual restart, but the primary takeaway is the ability to checkpoint our work de-facto without ever corrupting the final target. Netezza gives us the power to do these things inside the machine.

 

How, you might ask, do we determine where we left off so we can pick right back up? Hey, I'm out of space on this one, but later -


When punting data around inside our magical machine, one may wonder how to keep track of it all. Some will eschew ELT because it boils down to a pile of SQL statements, and it sometimes feels out of control. Control of course, is what we make of it. Even a well-defined development product is no match for someone who doesn't like controls.

 

However, we know this really does boil down to insert/select combinations like so:

 

Insert into Mytarget (

column list here

)

select

yet another column list

from some tables using complex join and filter logic

 

It seems we have a handle on the top and bottom, but the "select" clause is where the primary transform and integrations are applied. Things can get really ugly here, especially if we're moving from one legacy platform to another. Our select-statements will look very hairy, indeed.

 

The "insert" clause is largely along for the ride.

 

Now it doesn't seem likely that this could get out of control until we're presented with tables that contain, say, a hundred columns. Or even fifty, or say twenty-five. Just enough, you see, to keep them from appearing on the same editor page. We might want to add a column to the mix. Hey, add it down below in the select - and make sure you add it in the top to the insert - and don't get anything out of order! And what if the columns are misaligned? Data corruption is a higher-than-average danger here.

 

It feels a little primitive, but all we really need is some assistance on the source-to-target mapping and we're good to go. It's impractical to do a source-to-target with unwieldy insert/select statements, so let's apply a little Netezza magic. Considering that the cost of an ELT statement can sometimes run into minutes of execution time, we can afford to sacrifice a few extra seconds up front, just to support our weary eyes.

 

Let's say we automate the scenario a little bit. I have a table called customers that I want to roll together from our old customers and customer-properties tables. The target table is a reporting table with denormalized stuff to support our ad-hoc folks with a lot of pre-calculated goodies. Once we have the calculations, we want to put them into business visibility.

 

insert into Customers (
customer_id,
first_name,
last_name,
most_recent_purchase_dt,
total_purchases_ctr,
lots of other columns here
)
select
c_id,
f_name,
l_name,
max(b.purchase_dt),
sum(b.daily_purchase_ctr),
lots of other columns here
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;

 

 

We can readily see that this very typical SQL statement is doing some heavy lifting for us, just like we want it to do inside the machine. But what if the insert/select clauses had a lot more columns? It wouldn't take much for us to feel nervous about its maintainability. What if we have to integrate another table into the mix? Left outer joins? The Select clause has pretty much unlimited potential for complexity.

 

ANSI SQL supports aliases, so let's run with that. We have our source columns in the Select and the Target columns in the Insert, so let's align them thusly (I'll just use the first few for brevity)

 

 

select
c_id                        customer_id,
f_name                      first_name,
l_name                      last_name,
max(b.purchase_dt)          most_recent_purchase_dt,
sum(b.daily_purchase_ctr)   total_purchases_ctr,
$AUDIT_ID                   audit_id
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;

 

 

And lo, we have the makings of a source-to-target map. Don't we? Of course - the Insert-columns appear on the right, ready to functionally redefine the source values on the left. We do this all the time, don't we? But largely for spontaneous reports and the like. Let's look a little further, because having something like this in open-text doesn't really benefit us.

 

By circumscribing it with a "cat" we can gain two major benefits without sacrificing clarity - one is the ability to put the SQL statement into a place where we can use it, and one is to provide a means to resolve any $ variables that happen to be in the SQL statement. Note the use of the AUDIT_ID variable.

 

 

MY_SELECT=$( cat <<!
select
c_id                        customer_id,
f_name                      first_name,
l_name                      last_name,
max(b.purchase_dt)          most_recent_purchase_dt,
sum(b.daily_purchase_ctr)   total_purchases_ctr,
$AUDIT_ID                   audit_id
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc
!
)

 

Okay, now we have some options - so let's try this:

 

nzsql -a <<!

$MY_SELECT  limit 0 ;

!

 

This lets us test the SQL statement, but only to make sure we formatted it right. Now let's do something more useful:

 

 

nzsql -a <<!

create table temp_target as $MY_SELECT  limit 0 ;

!

 

Now we have a persistent table in the catalog, with correctly named and sequenced columns that align with the select statement. Note that the columns in the catalog will also have expected data types, which we could check against the target table's data types for consistency, but for now we just need something that the system will accept without complaining.

 

I'm a big fan of letting the Netezza environment do the heavy lifting. We could set up a parsing function to rip through our SELECT statement and find the alias'd column names, but this will fall apart with the more complex SQL statements. We already have a highpowered SQL parser at our disposal, don't we? And doesn't the CTAS have a thousand-and-one uses, after all?

 

Let's do a CTAS like the above - with "limit 0" - meaning that it won't do any real processing work, but will give us the power of its parsing engine to find the target columns with the added benefit of registering them by name and in the proper order - but to a temporary table

 

 

Now let's put the CTAS together with a way to pull the columns off the catalog - I'm throwing this to a flat file for debugging, but you probably know how to stream this directly into a loop - to follow

 

nzsql -A -t -o outputfile.txt   <<!
create temp table temp_target as $MY_SELECT  limit 0 ;
select attname from _v_relation_column where name=upper('temp_target') order by attnum;
!

 

Now let's pull this file into a quick loop:

 

M_SEP=""
INSERT_STR=""
while read -r line
do
  INSERT_STR="$INSERT_STR$M_SEP$line"
  M_SEP=","
done < outputfile.txt

 

 

or how about

 

INSERT_STR=$( nzsql -q -A -t  <<!
create temp table my_temp as  $MY_SELECT limit 0;
select
case when attnum = 1 then ''
        else ',' end ||
attname
from _v_relation_column where name = upper('my_temp') order by attnum ;
!
)

INSERT_STR="insert into TARGET_TBL ( ${INSERT_STR} )"

 

 

 

 

 

Either way, the column list inside INSERT_STR ends up looking like this:

 

 

   customer_id
,  first_name
,  last_name
,  most_recent_purchase_dt
,  total_purchases_ctr
,  audit_id

 

Now what? Execute the Insert?

 

nzsql -a <<!

$INSERT_STR  $MY_SELECT ;

!

 

------------------------------------------------

 

If we put the above activities into a bash function call we would find a setup like this:

 

nz_insert_from_select()
{
    # put all the above activities in here
    :
}
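A fuller sketch of such a function, under some stated assumptions, might read as follows. The scratch-table name is made up for illustration, the select text passed in must not end with a semicolon (so that "limit 0" can be appended), and the catalog lookup follows the _v_relation_column query used earlier.

```shell
#!/usr/bin/env bash
# Sketch only: derive the insert column list from the select's aliases,
# then run the insert. The scratch-table name is an assumption.

nz_insert_from_select() {
    local target=$1 select_sql=$2
    local scratch="s2t_probe_$$"      # hypothetical scratch name, unique per process
    local cols col col_list="" sep=""

    # One nzsql session: the temp table exists only while this session lives.
    # "limit 0" parses the select and registers its columns without moving data.
    cols=$(nzsql -q -A -t <<EOF
create temp table $scratch as $select_sql limit 0;
select attname from _v_relation_column
 where name = upper('$scratch') order by attnum;
EOF
    ) || return 1

    # Build the comma-separated list in catalog (attnum) order.
    while read -r col; do
        [ -n "$col" ] || continue
        col_list="$col_list$sep$col"
        sep=", "
    done <<< "$cols"

    if [ -z "$col_list" ]; then
        echo "no columns found for: $target" >&2
        return 1
    fi

    nzsql -a <<EOF
insert into $target ( $col_list )
$select_sql ;
EOF
}
```

The two nzsql calls are deliberate: the first costs a catalog hit and no data movement, while the second is the only one that does real work.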

 

 

-------------------------------------------------------------------------------------------------------------------

So here is what we would implement for any given ELT - we get a visual source-to-target map

 

MY_SELECT=$( cat <<!
select
c_id                        customer_id,
f_name                      first_name,
l_name                      last_name,
max(b.purchase_dt)          most_recent_purchase_dt,
sum(b.daily_purchase_ctr)   total_purchases_ctr,
$AUDIT_ID                   audit_id
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc
!
)

 

 

mret=$( nz_insert_from_select  target_table "$MY_SELECT" )

 

 

 

------------------------------------------------------------------------------------

 

 

So in ELT space, one of the keys is to balance how much we need to program versus how much is already programmed for us - in the Netezza parsing engine for starters. Catalog-hits are inconsequential when compared to the functional benefit we achieve: visually-aligned column names, even for very large tables. We can then add or delete columns from the ELT by adding or deleting lines in the Select. We don't have to align the columns on the top (Insert) and bottom (Select) because they are side-by-side - and we know exactly what is going where.


First-Bubble Blues

Posted by David Birmingham Feb 26, 2009

About a year ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload.

 

So the team landed on the ground with all hopes high. The client had purchased a 10200 (216 processors) for production deployment and a 10100 system for development. Oddly, the same thing happened here as happens in many places. The 10200 was dispatched to the protected production enclave and the 10100 was dropped into the local data center with the developers salivating to get started. And get started they did.

 

The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 10100, the 10200 sat elsewhere, humming with nothing to do.


We know that in any environment we encounter, with any technology you can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this problem for our first project, do we?

 

And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. The bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a development or integration environment as a means to boil out any latent bugs prior to its maiden - and only - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.

 

And yet, we're doing it on the smaller machine. No doubt the 10100 has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifting piece of logic. If we'd done this on the larger 10200, we would have finished it in a week or less. Double the power, half the time-to-deliver (when the time to deliver is governed by testing). In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compared to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble.

 

So this is a case for any data warehouse project, not just one with a Netezza machine. The first bubble is the worst bubble. As our techs slave themselves over a hot CPU, sweating out the extreme workload of the initial conversion, they will quickly start to compare the machine they are working on versus that production machine sitting over there with nothing to do. It wouldn't matter what the technology happened to be - the equation is out of kilter. We need all the available power to get past the first bubble.

 

But I've had this conversation with more people than I can count. Why can't you deploy the production-destined machine with all its power, for development/testing use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers, who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its birthday-suit - its originally deployed default environment. Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that blow-out and instantiate the implementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, then treat it as an operational production asset.

 

Most projects like this have a very short runway, and we do a disservice to the hard-working folks who are doing their best to stand up this environment. They need all the power they can get, especially when they enter the testing cycle. And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, it is clear that the total wall-clock hours spent in producing quality technology break into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.

 

And if the majority of the time is spent in testing, what are we testing in Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading, don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it. So why are we doing it on one-third of the available hardware (seeing that we're on a 10100 and the 10200 is sitting over there, humming away and taunting us from a distance)?

 

I can say that multiple small teams can get a lot of "ongoing" work done on a 10100, no doubt a very powerful environment. I can also say that multiple teams sharing a machine like this in the first-bubble effort will gaze longingly at the 10200 in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close. With that, Netezza gives us the power to close faster than any other environment, to get past this first-bubble without the blues - we only hurt ourselves with rules for the environment that are impractical for the first-bubble. So all things considered, if we were on a traditional platform we would see months pass for the relative weeks it would take for a Netezza machine to do the same work.

 

Alas, when one has a Netezza machine, it bends gravity and dilates time. Months become weeks. Weeks become days. And yet, we still need more power. More is never enough to wash away the blues.

 

Those first-bubble-blues.


Data Chop Shop

Posted by David Birmingham Feb 17, 2009

Before throwing the data on the grill, it needs prep. We might want to chop it, slice it, dress it and otherwise tenderize it before throwing it on the flames, but throw it on the fire we must.

 

Here's an interesting problem - what if we're presented with data that just won't intake? It's horribly formatted, if at all, and we don't have access to any kind of ETL or data-shaping environment to get the data inside. We have bad dates, bad numerics, and the only thing we have that actually works are the varchars! Whoo hoo. Okay, not so fast.


To make matters worse, we have intake records that face concatenated views, some of which have almost two-hundred columns. If any columns are bad, it's a lot like a submarine hunt, without the submarine. How bad is the data? Well, not all that bad where it actually resides. The users do a lot of inline and stored-proc math, and don't like those pesky overflows in their math. So the best option for them is to define all their numerics as "number", with no specified precision. Hey, that works great as long as the data never has to leave its home.

 

But now it's escaped, and it's on our front door with a trick-or-treat bag without the treats. When the source database doesn't define the numeric precision of its source columns, we'll find values with very oddball characteristics. Simple things, like 45 digits to the right of the decimal. Not particularly useful digits, but those leftovers like .33333 etc that just showed up without being invited.

 

How do we shock-the-system with these values, or just trim them out? On intake we have the potential of an empty column, too. The danger never ends. Check out this example, if we want to clamp the numeric data to a more palatable value:

 

select
-- for a numeric
     case when MyColumnName is null then null else substring(MyColumnName, 1, 38)::numeric(38,8) end MyColumnName

-- or for a date
     case when MyColumnName is null then null else date(MyColumnName) end MyColumnName

 

 

Now we've stripped the data to something we won't choke on, and it's within lasso-distance of our numeric precision. What's that? You don't want to do this for every column in case it's something pervasive (and it usually is) - and you don't like the idea of bringing the numeric precision out of the catalog and putting it into the intake statement? What kind of perfectionist are you?

 

Not to worry, I don't like this kind of construct either, at least, not "out in the open". If I really need to use this, I would rather find a way to automate its construction right off the catalog. Our "substring" doesn't have to change, it will chop the physical data to a non-choking size. The catalog-based precision we can get, well, off the catalog. I am a huge fan of using the catalog for meta-data-based constructs, especially for common, automated tasks like intake and publication.

 

Let's say we have a very-wide intake record, like 200 columns of varying types. Do we really want to carefully craft an intake statement including the construct above? The capability is willing, but the flesh is weak. I don't find such mind-numbing work to be profitable or productive, even though it's sometimes insanely necessary because the intake data is so junky.
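Automating that construction off the catalog could be sketched like this. The helper names intake_expr and build_intake_select are hypothetical, and the sketch assumes the catalog view exposes each column's type in a format_type column (for example NUMERIC(38,8) or DATE) alongside attname; check this against your own catalog before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: build the clamping expressions from catalog metadata.
# Input on stdin: one "column|type" pair per line, the shape nzsql -A -t
# would produce for a two-column catalog query such as (assumed columns):
#   select attname, format_type from _v_relation_column
#    where name = upper('my_table') order by attnum;

intake_expr() {
    local col=$1 type
    type=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    case "$type" in
        NUMERIC*)
            # Chop the raw text first, then cast to the catalog's precision.
            echo "case when $col is null then null else substring($col, 1, 38)::$type end $col" ;;
        DATE)
            echo "case when $col is null then null else date($col) end $col" ;;
        *)
            echo "$col" ;;       # varchars and everything else pass through
    esac
}

build_intake_select() {
    local source_table=$1 sep=" " col type
    echo "select"
    while IFS='|' read -r col type; do
        [ -n "$col" ] || continue
        echo "  $sep $(intake_expr "$col" "$type")"
        sep=","
    done
    echo "from $source_table ;"
}
```

Piping the catalog query's output into build_intake_select with the name of the all-varchar landing table yields the full 200-column statement without anyone typing a single case expression by hand.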

 

The cool part is just this: get the data into the machine! We don't have to push load-ready data into the machine, we just need to push it to a safe location. Once inside the machine, we can use Netezza's SPUs to beat the living daylights out of the data. And when we think about it, once the data's in the machine, it's like putty in our hands. We can create, teardown and rebuild whole data models, several-a-day if we want, to shape and mold the structures to the form we want. But the potter's wheel is awfully lonely with no clay.

 

I've been places where we literally waited for weeks upon weeks to get data into the machine, largely because the data we want in the machine has never (by design) left its home for another machine. Once it leaves, people see how ugly the data looks "out in the open" -- but something funny happens. We might ask them "could you format those pesky dates and numerics to something more palatable?". The answer we get back is as honest and refreshing as Nestle iced tea: "You guys own the warehouse, and the whole chain of data cleansing. I'm not the cleaner, you are. Why are you asking me to do your job?"

 

Ouch, well, there it is, and quite frankly they're right about it. It's awfully hard to tell someone to restructure their data to a form that meets our needs - it's 2008 after all. What would it take to get information into the machine, especially junky data? Do we really need to push back on our DBAs? Our analysts? Our DBAs are well-paid to manage and deliver the information in a pre-defined form, usually not in bulk. It's not particularly daunting for them, but they cannot read your mind, either.

 

So how would we access the catalog to automate this intake problem? And while we're at it, why not solve other intake problems, not just the pesky numeric precisions? How about solving the need for file space to land a flat file for intake? Or that the data in the source doesn't completely match the final target tables (we've added some administrative columns and other items that don't have a source-side equivalent)? Intake-mechanics are actually pretty simple, and once we solve some of the basics, we can do some pretty advanced intake at the push of a button, and then use the common SQL-based ELT to take the data to its final home.

 

But getting the data into the box is no different than getting a player on the field. No player, no game. If we examine the common failure points that are in our way, we have the potential for source-failure in a database or file system, network errors, power-outages - you name it - all between our machine and our data. If we can get the data into the machine and cut the ropes from whence it came, we can do anything we want. If we can't get the data into the machine, well, what on earth are we standing around for?


Here's a shout-out to all you ELT aficionados out there - those who have embraced the call to use the Netezza machine for hard-core data processing, and not just query acceleration. What's that? You've deployed it as a query accelerator because that was your functional requirement? Tish-tosh, you are under-utilizing the machine.

 

In ELT space, we see data arriving on our machine's eastern shore like immigrants from a foreign land. Give us your poor, tired and huddled data, and buried information yearning to be free, and all that. We need liberation! (a subject of another like-minded site) and the big-black-box is a beacon to collect the uncollectable, love the unlovable, and process the unprocessable data arriving in completely un-integrated form. We see information from this source or that, arriving on rafts, boats, inner-tubes and the like, and we want to believe that all are created equal, yet our process for assimilation and naturalization of the data has an uptake, doesn't it? Perhaps we'll stage the information (give it a green card) and maybe even load it partially-cleansed into an ODS - but one way or another we have to challenge the information, make it consumable "enough" even when it first arrives.

 

ELT is a practice already found in many RDBMS's, engaged by people who have no desire to purchase middleware, and honestly believe that pulling the data out of the database, processing it only to put it right back, is a waste of time and resources. Fire up a stored procedure, they will say, and process the data on the machine. Isn't this the most efficient means to achieve our goal?

 

On the Netezza machine, you bet. We have hundred(s) of processors working in purpose-built synergy toward this goal. But on an SMP-based RDBMS, no way. It won't scale and is destined to run out of gas. It's only a matter of time. And because we would have to use stored procs to effect our outcome, we also embrace a black-box processing scenario that really is black, lights-out, underground, all the bad things. Our poor operators will watch it kick off, run for hours and hope it finishes on time, and correctly. Once it becomes visible to an operator or admin, it's already running out of gas, and now we'll watch the engineers swarm to do what they do best - engineer - and the danse-macabre of propping up a dying process with artificial respiration.

 

Which is why we have Rule #10 isn't it?

 

One reason some don't embrace ELT - that is, simple data transport followed by hard-core data processing in the database engine - is because it's a bad practice to do bulk data processing in the (SMP-based) database engine. Since Netezza has broken this envelope, we now have freedom to proceed, but wait. We need a way to manage the ELT flow itself. After all, the ELT flow is just a series of SQL statements, right? Even the most robust "ETL" tools will only support "ELT" by firing off sequential SQL statements because they are not really in control of the data. What we'd like to see is a flow-based mechanism like Ab Initio, Informatica, Expressor or the like, to transparently harness the SQL statement like a true transformation component, even though in the background it will "only" fire off a SQL statement to effect the outcome. With the Netezza power we really can process the data with mind-bending speed, but after the smoke clears we need a way to report, track and manage this process. A SQL statement seems rather primitive, raw, and too much like hand-coding. Largely because it is. If we had a tool that could manufacture these kinds of transforms on-the-fly and manage the process as a visual flow, hey, life would be good.

 

What would this look like? Well, sort of like what we see today in flow-based management. Expressor, for example, allows us to leverage Visio to describe the flow, then Expressor components consume the Visio diagram's metadata and manufacture a living program. Ab Initio uses its proprietary graphical canvas to effect a similar scenario. What we really need is the ability in one of these products (or another product entirely) to pull a Netezza ELT component onto the canvas, connect one end to a source table, one end to a target table (albeit a temp-table if necessary) and allow us to describe the transform between the two - just like any other transform. Ultimately this would provide us with graphical control over the flow in a visible, manageable and traceable form. Alas, we as the machine owners must (for now) embrace some degree of scripted logic to effect our desired outcome. I see this as a temporary state of affairs. Someone will rise to the challenge.

 

Oh, come on, David, we know that the folks who live and breathe Netezza are the pioneers, who eat from the back of wagons, sleep under the stars and change their own horse-shoes. Innovative problem solvers, braving the wilds of the prairie with fearless resolve. Yeah, uh, before we go down that path, describing a cowboy (and don't get me started), let's examine what the enterprise needs. Whether the cowboys (intone John Williams theme song here) get the job done rustling the data and wrangling the loads, at the end of the day the trail boss will want a status. Have we lost any dogies? Are all of them fed and watered? How much farther to the end of the trail? What about the weather? Wild animals? Data-rustlers (hackers)? The list of pitfalls and opportunities is boundless, and the trail boss wants to know "where we stand". You know, like the business intelligence dashboards.


The reason we might not see this "ELT" harness scenario any time soon from the power-hitter products is that ELT requires the power-hitter to maintain local control but externalize the processing power (delegating it to Netezza). This is unpalatable for the product vendors that claim we don't need Netezza to process the data (and of course, Netezza is horning-in on their action like a good competitor). Yet we have this big-black-box machine sitting on the floor that has the power to perform seriously hard-core processing on a breathtaking scale, achieving internal bandwidth that these power-hitter software products cannot achieve (because their hardware platforms constrain them). Let's face it, put a power-hitter software product on a 32-way Sun machine and then attempt to process data at the same scale as a 200+ processor Netezza machine. No slight on the software product, because it could probably process at Netezza's scale if given enough hardware, but do we really want to deploy a 200-processor Sun machine?

 

Another reason is that Netezza is the only product that truly unleashes the processing capacity to make ELT a practical and easy reality, and is seen as a competitor by the power-hitter software product vendors. Yet another reason is that those who would embark on this path have to commit resources to Netezza's (somewhat) more rarefied market, and for now are simply unwilling to do so. Time will change this, however.

 

I'm not one to tell competitors how they should behave in the marketplace, because competition always increases quality. But if we all get together and shout "we are here" - perhaps at least one maverick elephant will hear our cry. With all apologies to Dr Seuss, we could start our own web presence as Horton Hears an ELT, or Horton.com, or even MaverickElephant.com - I don't know - just thinkin' out loud here.


I was amazed at how many people showed up for the inauguration. Here in DC the roads were closed, the weather was cold and the atmosphere was, in a word, electric. I also witnessed some disturbing things, like people who had arrived from hundreds of miles away but could not get in, and many more who did get in but had to watch it all on the jumbo-tron screens anyhow. Still, being there is being there, and nobody can take it away.

 

And this is a segue into my observations on some interesting patterns of humanity that align perfectly with our solution domains, so follow me on this. As I sat in a hotel room on Tuesday, debating whether I should brave the extreme cold and traffic, I was reminded that one of our presidents (Harrison) gave a too-long inauguration speech in such conditions, developed a cold and died of pneumonia a month later. As for me, I had a problem to solve - when I turned on the morning news I found a common "of scale" ingress and egress scenario, something I had predicted, though I was stunned by its magnitude. Yes, an important day for America and for the world - and worth at least a superficial examination of the logistics of people-movement.

 

I noted that the cabs, trains, buses public and private, and other transportation mechanisms had been trickling people onto the scene for days, increasing by the hour. The DC train systems had dropped off over 200k people in a matter of hours, probably setting records as well. But when we think of the drop-off mechanics, these are not people arriving in-bulk. They are collecting from various places in a near-transactional manner. A handful here, a handful there, each one patted-down for their belongings one-at-a-time no differently than our transactional stored procedures check our inbound data one-at-a-time. Then the festivities started, the regalia of our peaceful transition of power, and when it was all done, we saw another interesting effect. It was time for everyone to make an orderly exit.

 

So people who had arrived early in the day and had secured a front-row position for the festivities now did an about-face to exit - only to find a sea of humanity between themselves and the exits. I use the word "exits" here loosely, because they would not be patted-down to leave like they had been to enter, so the egress was a bit smoother and more steady. Uh, you know, like we pat-down the data when it first enters our warehouse, and deliver it clean on-demand. Ohhh, the parallels.

 

Discussing this at the watercooler, a couple of colleagues wondered out loud how to make a mass-egress work. How would we (safely) empty the Mall of the majority of people in a short period? After all, some of the attendees didn't make it back to their hotels until almost nine hours later! One person suggested doing it "the Netezza way" by providing 100 helicopter pads, plus 400 helicopters, each of them alighting on the pads every five minutes with a 20-minute round trip to spirit people out. This would leverage the vertical space, not just the horizontal. But this model won't serve, since the helicopters waste the return leg. Another colleague objected to the Netezza way and suggested conveyor belts instead. But I suggested that the conveyors better represented the Netezza way, a streaming model of constantly moving data. The helicopters could move people at only 1/100th the speed of the same number of conveyors, and the conveyors wouldn't have to move all that fast.
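For the curious, the helicopter numbers above do hang together: 400 helicopters on a 20-minute round trip produce the same landing rate as 100 pads taking a landing every five minutes. The sketch below checks that, then compares batch and streaming throughput. Everything beyond the pad, helicopter and timing figures is an assumption made up for illustration (the seats per helicopter, the number of belts, the people per metre of belt and the belt speed), so the resulting ratio is not meant to reproduce the 1/100th figure, only to show how the comparison is set up.

```python
def helicopter_landings_per_minute(helicopters: int, round_trip_min: float) -> float:
    """Each helicopter returns once per round trip, so the fleet lands at this rate."""
    return helicopters / round_trip_min


def pad_landings_per_minute(pads: int, minutes_between_landings: float) -> float:
    """Landing rate the pads can absorb if each takes one landing per interval."""
    return pads / minutes_between_landings


def helicopter_people_per_hour(helicopters: int, round_trip_min: float, seats: int) -> float:
    """Batch model: people only move when a full helicopter lifts off."""
    return helicopter_landings_per_minute(helicopters, round_trip_min) * 60 * seats


def conveyor_people_per_hour(belts: int, people_per_metre: float,
                             metres_per_minute: float) -> float:
    """Streaming model: every belt delivers people continuously."""
    return belts * people_per_metre * metres_per_minute * 60


# Figures from the watercooler discussion.
fleet_rate = helicopter_landings_per_minute(400, 20)
pad_rate = pad_landings_per_minute(100, 5)
print("fleet landings/min:", fleet_rate, "| pad landings/min:", pad_rate)

# Assumed figures (NOT from the discussion): 10 seats per helicopter,
# 100 belts, 2 people per metre of belt, belts moving 30 metres per minute.
batch = helicopter_people_per_hour(400, 20, seats=10)
stream = conveyor_people_per_hour(100, people_per_metre=2.0, metres_per_minute=30.0)
print("helicopters, people/hour:", batch)
print("conveyors, people/hour:", stream)
```

Change the assumed figures and the gap widens or narrows, but the shape of the result stays the same: the batch model pays for every empty return leg, while the streaming model never stops delivering.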

 

The streaming model is something that shakes the rafters on our reporting models, but as with any problem of scale, we must provide the physical plant first, and it has to address the problem on purpose.
