Skip navigation
1 2 3 Previous Next

Gather 'round the Grill

44 Posts
0
When introducing the Netezza platform to a new environment, or even trying to leverage existing technologies to support it, very often the infrastructure admins will have a lot of questions, especially concerning backups and disaster recovery. Not the least of which are "how much" "how often" and such like. More often than not, every one our responses will be met with a common pattern, a sentence starting with the same two words:
What the....?
Case in point, when we had a casual conversation with some overseers of backup technology as a precursor to "the big meeting" we almost - quite accidentally - shut down the conversation entirely. Just the mention of "billions" of rows, or speaking of the database in "Carl Sagan" scaled terms, caused them to want to scramble for budget and market surveys of technologies that were more scalable than their paltry nightly tape-backup routines. In this particular conversation, we were talking about nightly backups that were larger than the monthly backups of all of their other systems combined. Clearly we were about to pop the seams of their systems and they wanted a little runway to head off the problem in the right way.
But what is the "right way" to perform a backup where one of the tables is over 20 terabytes in size, the entire database is over 40 terabytes in size and the backup systems require two or even three weeks to extract and store this information for just one backup cycle?  Quipped one admin "It takes so long to back up the system that we need to start the cycle again before the first cycle is even partially done." and another "Forget the backup. What about the restore? Will it take us three weeks to restore the data too? This seems unreasonable."
Yes it does seem unreasonable precisely because it is - quite unreasonable. As many of you may have already discovered, the Netezza platform is a change-agent, and will either transform or crush the environment where it is installed, so voracious are its processing needs and so mighty is its power to store mind-numbing quantities of information.
The aforementioned admins simply plugged their backup system into the Netezza machine, closed their eyes and flipped the switch, then helplessly watched the malaise unfold. It doesn't have to be so painful. These are very-large-scale systems that we are attempting to interface with smaller-scale systems. We might think that the backup system is the largest scaled system in our entire enclosure, but put a Netezza machine next to it and watch it scream like a little girl.
So here's the deal: No environment of this size shoiuld be handled in a manner that is logistically unworkable for the infrastructure hosting it. We can say all day that these lower-scaled technologies should work better or that Netezza should pony-up some stuff to bridge the difference, but we all know that it's not that easy. Netezza has simplified a lot of things, but simplification of things outside the Netezza machine - aren't we asking a bit much of one vendor?
To avoid pain and injury, think about the things that we need to accomplish that are daunting us, and solve the problem. The problem is not in the technology but in the largesse of the information. We would have the same problem on our home computers if we had a terabyte of data to backup onto a common 50-gig tape drive. We would need twenty tapes to store the data. The backup/restore technology works perfectly fine and reasonably well for a variety of large-scale purposes. We simply need to be creative about adapting it to the Netezza machine. Don't plug it in and hope for the best. Don't do monolithic backups. The data did not arrive in the machine in a monolithic manner so why are we trying to preserve it that way? Leave large-scale storage and retrieval to the Netezza machine and don't crush the supporting technologies with a mission they were never designed for.
Several equally viable schools of thought are in play here. What we are looking for is the most reliable one. Which one will instill the highest confidence with least complexity? The more complex a backup/restore solution becomes, the less operational confidence we have in it. If it cannot backup and restore in a reasonable time frame, we exist in a rather anxious frontier, wondering when the time will come that the restoration may be required and we put our faith in the notion that it either won't, or when it does all of the other collateral operational ssues will eclipse the importance of the restoration. In other words. future circumstance will get us off the hook. There is a better way, like a deterministic and testable means to truly backup and restore the system with high reliability and confidence.
On deck is the simplest form of the solution - another Netezza machine. Many of you already have a Disaster-Recovery machine in play. Trust me when I tell you that this should be fleshed out as a fully functional capability (discussed next) and then the need for a commodity backup/restore technology evaporates. Using another Netezza machine, especially when leveraging the Netezza-compressed information form, allows us to replicate terabytes of information in a matter of minutes. I don't have to point out that none of our secondary technologies can compete with this.
A second strategy requires a bit more thought, but it actually does leverage our current backup/restore technology in a manner that won't choke it. It won't change the fact that the restoration, while reliable, may be slow simply because moving many terabytes in and out of one of these secondary environments is inherently slow already.
A third strategy is a hybrid of the second, in that enormous SAN/NAS resources are deployed as the active storage mechanism for the data that is not on (or no longer on) the Netezza machine.  This can be a very expensive proposition on its own. We know of sites that keep the data on SAN in load-ready form, and then load data on-demand to the Netezza machine, query the just-loaded data, return the result to the user and then drop the table. You may not have on-demand needs of this scale, but this shows that Netezza is ready to scale into it.
A fourth strategy is a hybrid of the first, that is we still use a Netezza machine to back up our other Netezza machine, but we use the more cost-effective Netezza High-Capacity server, which is less expensive than the common TwinFin (fewer CPUS, more disk drives) but otherwise behaves in every way identically to its more high-powered brethren. And honestly if we were to put apples-to-apples in a comparison between the cost of a big SAN plant to store these archives versus the High Capacity Server, the server wins hands-down. It's cheaper, simpler to operate, doesn't require any special adaptation and we can replicate data in terms guided by catalog metadata rather than adapting one technology to another.
So let's take these from the least viable to the most viable and compare them in context and contrast, and let the computer chips fall where they may.
Commodity backup/restore technology
If we want to leverage this, we need to understand that it cannot be used to perform monolothic operations. These are unmanageable for a lot of reasons:
  • The time it takes to perform a backup and restore is many times longer than a common operational cycle - that is the data is already on-the-move and being processed over the many days required to backup the data. By the time we are done, it does not represent a true backup
  • If the backup fails, we've lost all that time and have to restart from scratch

  • The time it takes to restore the data is too long to be viable. We cannot wait for days or weeks to get the data back under control and also lose days or weeks of processing time
  • The prospect of data loss, over many days if not weeks, is a real threat
  • The backup/restore technology is oblivious to the data content, but the content itself provides the necessary boundaries for proper backup operation
  • The vast majority of the content is static already, meaning that the vast majority of the backup is redundant and has no value in secondary form
To mitigate the above, we need to adapt the large-scale database to the backup technology by decoupling and downsizing the operation into manageable chunks. This is a direct application of the themes surrounding protracted data movement in any environment. The larger the data set, the more the need for checkpointed operations so that the overall event is an aggregation of smaller, manageable events. If any single event fails, it is independently restartable without having to start over. Case in point, if I have 100 steps to complete a task and they are all dependent upon one another, and the series should die on step 71, I still have 29 steps remaining that may have completed without incident, but I cannot run them without first completing step 71. This is what a monolithic backup buys us - an all-or-nothing dependency that is not manageable and I would argue is entirely artificial.
To continue this analogy, lets say that any one of these 100 steps only takes one minute. In the above, I am still 30 minutes away from completion. I arrive at 6am to find that step 71 died, and now I have to restart from step 1 and it will cost me another 100 minutes. Even if I could restart at step 71, I am still 30 minutes away from completion. 
Contrast the above to a checkpointed, independent model. If we have 100 independent steps and the step 71 should die, the remaining 29 steps will still continue. We arrive at 6am to find that only one of the 100 steps died and we are only 1 minute away from full completion. The difference in the two models is very dramatic:
Monolithic means we are operationally reactive when failure occurs. The clock is ticking and we have to get things back on track and keep moving. Checkpointed means we are administratively responsive when failure occurs. We don't have to scramble to keep things going. In fact, in the above example, if step 71 should fail and the operator is notified, doesn't the operator have at least half an hour to initiate and close step 71 independently of the remaining 29 steps? Operators need breathing room, not an anxious existence.
Monolithic methods are supported de-facto by the backup/restore technology. If we want to perform a checkpointed operation, we have to adapt the backup/restore process to the physical or logical content of the information. We don't want to directly mate the backup technology itself, so we need to adapt it.
Logical Content Boundaries
This means we have to define logical content boundaries in the data. What's that? You don't have any logical content boundaries, not even for operational purposes? Well, per my constant harping on enhancing our solution data structures with operational content, such as job/audit ID and other quantities, perhaps we need to take a step back and underscore the value of these things because they exist for a variety of reasons. One of them is now upon us - the operatonal support of content-bounded backups. It is required for scalability and adaptability and is not particularly hard to apply or maintain.
A more important aspect of content boundary is the ability to identify old versus new data. If the data is carved out in manageable chunks, some will always be least-recent and some more-recent. Invariably the least-recent chunks will be identical in content no matter how many times we extract them for backup. This means we can extract/backup these only once and then focus our attention on the most active data. In a monolithic model, there is no distinction between least or most active, least or most recent. In large-scale databases, the least-recent data is the majority, so the monolithic backup is painfully redundant when it need not be.  
Do we absolutely need content-bounded backups for all of our tables to function correctly? Of course not. But by applying this as a universal theme it allows us to treat all tables as part of a backup framework where all of them behave the same way. So part of this is in the capture of the data but a larger part is the operational consistency of the solution.
Many reference tables such as lookups will never grow larger and we know this. In fact, their data may remain static for many years. For the ones that are tied to the application and grow or change every day, these we will call solution tables. They are typically fed by an upstream source and are modified on a regular basis. Any of these tables can grow out of control. The reference data then represents a very low operational risk. Why then would we not simply fold the reference data into the larger body and treat all the tables the same? There is no operational penalty for it, but enormous benefit from being able to treat all tables the same way inside a common solution.
At this point, the backup/restore framework will address all of the tables the same way, but now we have the ability to leverage rules and conditions within the framework so that special handling is available if necessary. This is a common theme in large-scale processing: Handle everything as though it will grow, but accommodate exceptions with configuration rules. I'll forego this aspect for now and let' take a look at what we need in basic terms:
  • An intermedaite landing zone: Some off-machine storage location that can hold the data while in transit - e.g a NAS or SAN volume. This will not be the same size as the database but some fraction of it. It is a workspace for intermediate files, not permanant storage
  • Content boundaries: A means to define and manage content bthese are tags or quantities in the data itself. They need to be consistent for all tables so that the backups and restore operate the same way. Of course, if we don't have these, we need to apply them.
  • A restoration database: a separate database that will be used for the restoration process, because we won't be restoring the data directly into the table from whence it came. Why not? There is an increasing likelihood that the data structures have changed since the last backup. If the structure has changed, we cannot reload the data into it. We need a way to shepherd the data back into the machine.
  • Processes: that capture / backup / restore the data using content boundaries and common utilities. These are typically control scripts that are easily configured, deployed and maintained
  • Archival Database (next): Physical content replicas of the masters that hold all of the original information in a form useful for backup and restore. Whether they keep the data active/online is immaterial. Their role is to interface with the backup/restore mechanisms so that any data once made offline can easily be made online through this database.
Archival Databases
Setup formal archival databases. Whether these reside on the same server, or on a DR / High Capacity server (below) is immaterial. The point is that the data in the master tables will be actively rolled into these tables, which will form the backbone for all backup and restoration operations. We therefore replicate the masters (with streaming replication) into the archival stores and then at some interval perform the backup of the archives, not the master.
 
image
Foreshorten the Master Tables
Now that we have a means to define content boundaries - you did apply those boundaries right? We can now look at the database holistically for optimizations based on active data.
At one site we have a table with ninety billion records in three different fact tables spanning over ten years of information, However, the end-users and principals claimed (and we verified) that the most active data on any given day is the most recent six months. Anything prior to that, they would query perhaps once or twice a year for investigative purposes, but had not tied any reports to it.
image 
So now we have an opportunity to get agreement from the users to shorten the master tables. This is especially necessary for those fact tables. In the end, these fact tables (above) were shortened to less than four billion rows each and these are kept trimmed on a regular basis. The original long-term data is held in another archival database that is the foundation for the backups and restores.
 
Protocol
  • On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration

  • Execute the extractions to create flat files

  • Each extraction will pull data to a Netezza-compressed flat file and also supply another file with the metadata instructions necessary to restore the file's contents. On a per-file basis, these instructions will include:
    • Original table definition, used to construct a template of the table in the restoration database

    • File load instructions used to get the content back into the machine

    • Transformation command, used to move data from the restoration database into its original home

  • The files will be extracted (using content boundaries) onto the storage landing zone
  • The smaller extractions can be performed in parallel, are independent and form their own checkpointed process
  • The backup techology will begin its backup process once the second (metadata companion) file arrives for the set. The metadata companion file behaves as a trigger.
  • Once the backup for the file-set is complete, the backup/restore deletes the file and its companion
  • This decouples the backup/restore processing from the database (completely) so that it can focus solely on file-based backup and restore
  • For restoration, both the content file(s) and companion metadata file(s) will be retrieved and placed on the storage landing zone
  • The metadata will be used to reconstruct an empty intermediate version of the original table in the Restoration Database
  • Load the table using nzload
  • Then the metadata will be used (even a preformulated SQL statement) to copy the data into the original target
  • Throw away the intermediate load table, delete the data file and its companion metadata file
  • The restoration process can be run as many ways-parallel as supported by nzload
  • The restoration process can also be surgical, with the more agile ability to restore data in smaller segments
image 
The above protocol, while likely easy to pull off an administer, still has a number of moving parts that the Netezza based-equivalent (later) will not have.

SAN-based hybrid

The only difference between the above protocol and one using a SAN-based storage mechanism, is the absence of a formal backup/restore technology. Rather the SAN is the long-term storage location and we perform the incremental extractions onto it. Rather than delete the files, we keep them.
This has significant implications for the cost of the SAN. After all, if we intend to interface this to the Netezza machine, we would not want common NAS storage because it is too slow and the vendors actively disclaim their technology from being viable for data warehousing. The primary reason is that the network and CPUs are set up for load-balancing, like a transactional database, but not bulk onload/offload of the data.
Not only will we need enough SAN to backup the environment, but to also carry fathers and grandfathers if need be (this is a policy decision). With checkpointed extracts, the father/grandfather issue is largely moot. This is because once a checkpointed extract of older data is pulled and stored, it won't be changing and capturing another one just like it has no value. 
Netezza-based Hybrid
In this approach we leverage another Netezza machine like a DR server, as our backup, archival and restore foundation. It can easily hold the information quantity. The difference here is in price, of course, since a fully-functional TwinFin is more expensive than most common SAN installations. However, the High Capacity Server (below) mitigates this pricing problem while delivering a consistent data experience.
One primary benefit of performing backups into the DR server is that it can automatically serve the role of a hot-swap server in case of failure in the primary server.
For this scenario to work however, we would want streaming replication between the active databases and the DR server so that the data is being reconciled while being processed. This allows us to have a fully functional hot-swap if the primary crashes, and we can continue uninterrupted while the primary is serviced. Word to the wise on this kind of scenario however, bringing back the primary means that it is out of sync, since the secondary took over for a span of time. So we would need to be able to reverse the streaming replication to make it whole.
Scenarios like this often embrace the practice of operationally swapping the two machines on defined boundaries, like once a month or once a quarter, where they actually switch roles each time. This allows the operations staff to gain confidence in the two machines as redundant to each other in every way. I have seen cases like this where the primary machine went down, the secondary machine kicked in seamlessly and all was well. I have also seen cases where the principals kept the DR server up to date but when it came time to operationally switch, some important piece (usually in the infrastructure between the devices) was missing causing the failover itself to fail. It is best to have a plan in place, but it's better to have tested the plan and that it actually works. 
Protocol 
  • On a per-table basis, use the content boundaries to manufacture/compile a set of checkpointed instructions for each table to support extraction and restoration

  • Using a variation of the nzmigrate utility, perform the table-level extractions

  • Extract data in Netezza-compressed form to a flat file

  • Load the flat file into a restoration database in the second machine

  • Perform the transform-execution to copy data from the restoration database into the target table

  • The extractions, restoratons and transformations can all leverage simple scripts and catalog metadata, not become dependent on deeply hand-crafted code

  • Intermediate (SAN) landing zone requires less space because the data is being transferred in Netezza-compressed format and it is cleaned-up on-the-fly

  • Transfer is quicker because Netezza-compressed data is written to the data-slices directly instead of coming through the host.

A word on Netezza-compressed transfer. I wrote about this in Netezza Transformation but it is important to highlight here. We performed an experiment moving half a terabyte scattered across a hundred or more tables. This data was moved from its original home to a database in another machine. The first method used simple SQL-extract into an nzLoad component. This process took over an hour. The second method used transient external tables with compression, coupled with an nzload in compressed mode. The entire transfer took less than six minutes. This was because the compressed form of the data was already 14x compressed.
In other experiment using over 20x compression for the data, we were able to transfer ten terabytes in less than an hour. This kind of data transfer speaks well for the streaming replication necessary for DR server operation (above) but underscores the fact that even when transferring between Netezza machines, it's as though we haven't left the machine at all.
Netezza-based High-Capacity Server
This option is simply a form of the Netezza-based hybrid (above) but on a dedicated server designed to support backup and recovery.
The better part about this server is that is has more disk drives and fewer CPUs, making it far more cost effective for storage than common SAN devices. Couple this with the minimal overhead required for transferring data between machines, and the ability to surgically control the content with the content-boundaries and catalog metadata, and we get the best of all worlds with this device.
Not only that, but it is also scalable to support storage of all other Netezza devices in the shop as well as any non-Netezza device where we simply want to capture structured information for archival purposes. The High Capacity server is queryable also, meaning that even the ad-hoc folks will find some value in keeping the data online and available.
Lastly, in Spring 2010, as part of the safe-harbor presentation one of the principals at IBM Netezza announced plans for a replication server. I can only imagine that this device will deliver us from any additional hiccups associated with streaming replication that we might now be doing in script or other utility control language.
At Brightlight, our data integration framework (nzDIF) has the nz_migrate techniques built directly into the flow substrate of the processing controller, as well as the enforcement and maintenance of the aforementioned content boundaries. We are actively acquiring and applying best, most scalable and simplified approaches as a solution framework firmly lashed to a purpose-built machine. I am a big proponent of encouraging Enzees to take on these things themselves, or at least let us coach you on how to make it happen. The solutions are simple because the Netezza platform itself is simplified in its operation. Stand on the shoulders of genius - the air is good up here.
0

Window to the Sanctum

Posted by David Birmingham Aug 26, 2011

"What's this?" asked the CFO of the Data Warehousing Director, now holding a fresh-off-the-press request for more hardware.

"We've hit capacity," said the Director with a sigh, lowering himself wearily into a padded chair.

"What? That machine is barely a year old! And just last month you claimed it would be two years before we would need more capacity. I didn't allocate any budget for hardware because you said we wouldn't need it!"

"I was wrong. We just implemented an upgrade of the application and it started hammering the machine. We didn't know we had hit capacity. We thought we had enough."

"I've had enough," said the CFO, "I want some answers before I sign off on this."

 

And so began three weeks of malaise in searching for the right answer. An answer that nobody was particularly trained to find. Throwing in the towel, they sought outside help and eventually discovered some very interesting artifacts.

 

  • Firstly, the solution using the machine had been inefficient from the beginning. The machine's power had masked the inefficiency. The new implementation only created a tipping point.
  • Secondly, they discovered that of all the SQL statments hitting the machine, over eighty percent of them were singleton operations. Of these, over half of them were singleton updates. Of these, none of them were using a distribution key to support the update. This information was generally available in the query history database, if one only knew where to look. And if one had the time. And the patience.
  • Thirdly, many other inefficiencies were pervasive, including the use of big-fat transforms (as opposed to more and leaner transforms with intermediate tables) and the inability to see the spikes in application activity.
  • Lastly, much of this activity went across machines (several machines shared the load of processing) so that finding the nefarious operations was nothing short of a submarine hunt.
  • Several of the large tables had been distributed on keys that were more suitable for zone maps than distribution, meaning that the vast majority of queries would only leverage one or two CPUs out of hundreds. Sort of like trying to pull liquid cement through a soda straw. This is process skew rather than data skew and is sometimes difficult to diagnose.
  • Couple the above with the lack of proper controls for update/delete operations, resulting in an enormous count of aborted queries and their attendant instabilties - and the pernicious growth of unrecovered space and we have hidden inefficiencies that require a different level of resolution and investigation.
  • The cost to remediate these problems was not as substantial as wholesale upgrade, but could have been avoided or at least mitigated had the principals been aware, or had visibility to the lurking dangers and inefficiencies that were literally draining the lifeblood from the systems like parasitic insects on an unsuspecting host.

 

Now the reader may well assert, ahh, we have enterprise tools that allow us to see the machines on the network and in the operations center. Those tools should tell us everything we want to know. Well, sure, about the Linux host of the machine, but not about the health of the various databases, usage stats, trending and most certainly not application-level nuances like misapplied or ignored distribution keys, mangled zone maps and tables that haven't been groomed. What's that, the nzPortal could give us some of this? Sure, on a per-machine basis. But the machine admins want to deal with machine health on one level, while the information architects want a completely different, even oblique level of information that may require deep immersion into the logistics of one or more enteprise applications.

 

What's the real problem here? Logistical complexity. The simple fact that as we add more technologies, applications and functionality around the machine, it becomes the fulcrum for the enterprise. Without management of the activities both in detail and in context, logistical complexity arises. We want to embrace the simplicity of management and administration, and on a purely administrative level (e.g. what the common DBA role fulfills) it's still a part-time job. But this may hide the fact that the application engineers and implementors have been given, or rather been delegated with, the additional responsibility of logistics. Perhaps they didn't realize that implementation and architectural logistics was now on their plate. After all, they only build applications. Deploying, operating and maintaining them has always been someone else's gig. In any other technology besides Netezza, the DBAs make it their business to know the data and its processing nuances. But in those technologies, there is no power in the machine to take us very high, so the possibility of logistical complexity is held in check by the lack of power in the machine. This constraint released, logistical complexity now becomes a very real threat.

 

A number of years ago these needs reached critical mass and the ninjas of the Enzee Community that had been addressing these issues as onsite consultants finally congealed these capabilities into a platform that can not only see these trends, nuances and capacity issues, it can do it across multiple machines and applications. In the spirit of Netezza's "driving toward simplicity", this platform plumbs the depths of Netezza's more complex interior and serves up the nuggets in visualized, actionable form. It is essentially the application-administrative level of business intelligence - for the machines.

 

In the aforementioned "submarine hunt" the principals learned an important lesson of very-powerful machines: they can make even the most horrible implementation look stellar. By correcting the discovered issues, they reduced the machine's load by over eighty percent. Imagine recovering eighty percent of a machine's capacity just by applying some simple fixes. One day the machine is overloaded and we're looking at upgrading for non-trivial costs, and the next day the machine is barely breaching twenty percent capacity and we won't be chatting with the CFO anytime soon. Years even.

 

Some of you have similar stories (I've heard many of them!) and sincerely want a better way to deal with multiple machines and a wide array of applications and users in a more holistic manner, with an eye on what counts in a Netezza machine, not just what a typical database does. This takes administration to a level that simplifies and clarifies the complex in a form that intersects to the language base and nomenclature of the Enzee Universe.More importantly, the Enzee Universe is an ecosystem where many of us have found, through experience of success or pitfall, what works and what most people are asking for (and wondering why some of the leaders aren't taking the bull by the horns to solve it). Well, many members of the ecosystem have worked around the logistical issues without privy to the expanding capabilities of core observational needs, codified and implemented by people in-the-trench.

 

The inception of this observation technology was both brilliant and simple. It would faithfully gather stats from each of the machines as a summarized, transmission-safe extract and drop it to disk. Their on-site consultants could install it, kick it off and then go about their other analytic activities. At the end of a day (or two) they could examine the summaries and these would reveal all. As scheduled operations using the machine's power to process the machine's own statistics, it has a very low footprint and can be brought under workload management. As structured extracts. it was only a matter of time before they would put a pretty face on it. Now it's a dashboard to the inner sanctum of the Netezza machines, from the people who have collectively installed and shepherded more Netezza installations into production than any other, with no close second.

 

The Observation Deck, from Brightlight Consulting.

 

Next Tuesday, August 30th is a Webinar demonstration of this technology and I would like to cordially invite the Enzee Community (and all other interested parties!) to attend and take a deep look at what it can do, and what it could do for you.No pressure and no sales push. The need for this is obvious and we are simply demonstrating that it can be filled and supported. But the participants will be the judge as to its viability. It is the ecosystem's way after all.

 

 

The webinar is on Tuesday, 8/30 at 11 am pacific / 2 pm eastern.  You can get more details and sign up for it at:  http://advancedmonitoring.eventbrite.com/

0

Host: Ladies and Gentlemen, I'm your host here with Hairball Plotter-
Hairy: Call me Hairy
Host: (Pause) Oookay, I'm here with Hairy Plotter and he's going to tell us about some of his - er -

Hairy: Innovations

Host: Innovations? I thought they were just the opposite -

Hairy: Well, eccentric innovations anyhow. You know, the kind of stuff that gets the job done, but with a lot
more moving parts!

Host: Please explain

Hairy: Well, you see, it's not incumbent upon me as a consultant to work myself out of a job. I need a way
to stay employed and ensconced, and complexity is just the ticket

Host: How so?

Hairy: The more complex the environment, the better job security I have

Host: We have a caller from Ontario on the line, a question for Mr. Plotter?

Caller: Yes, I'm an IT manager and I've found that when people try to embed themselves as you describe, they create risk
for me, and I don't like that.

Hairy: Thanks for that input. I'll file that.

Caller: Wait a second, are you blowing me off?

Hairy: What difference does it make? If you like it or you don't, you're still stuck with me.

Caller: Unless I find something better.

Hairy: I'll just milk your budget so there's nothing to spend on something better. Works for me.

Caller: But that doesn't work for me.

Hairy: Noted.

Caller: But -
(click)

Host: We have another caller from a financial services firm in New York. Caller you're on the air.

Caller: Thanks, I agree with Hairy on this. Contractors are punted around and treated like fodder. We need a way to
keep our jobs and pay our bills. Working ourselves out of a job doesn't fit that mold. We need more ways to
make work for ourselves, even if it's artificial.

Host: Artificial?

Hairy: Well said!

Host: Wait a second, artificial?

Hairy: Well, of course its artificial. All of us are smart enough to do it better, faster, smarter. Heck, I could deploy a
high-reuse/low maintenance implementation that basically lets you run things lights-out.

Host: Good for you!

Hairy: Oh no, so not good for me. Once it's deployed I get a final paycheck and have to ride into the sunset. Seems romantic
but the sunset doesn't pay the bills!

Host: So you find a way to make yourself useful.

Hairy: No, I find a way to make myself necessary. Usefulness is for suckers.

Host: Now wait a second -

Hairy: Now listen. Everyone does it. At one of my favorite clients, all of the contractors are on a perpetual time-and-materials contract. When we
had some folks show up on a fixed-price gig, they practically pulled their eyeballs out when they couldn't get any urgency out of anyone.

Host: Why not?

Hairy: Because all those time-and-materials folks had a vested interest in protracting the work until the next day or the next week. Practically

every time we said we needed something right away because our clock was ticking, they would offer it up for delivery "next Friday" or some such.

But you know, If they ever looked ike they were wrapping up, it could spell curtains for their cushy little tushy.

Host: (laughs) I see what you mean.

Hairy: It's just how the game is played.

Host: So IT staffing is just a game?

Hairy: Musical chairs. In so many ways.

Host: So what are some of the ways that you - well - set this up?

Hairy: Invest in a lot of wool.

Host: Sorry?

Hairy: We'll be pulling it over the manager's eyes.

Host: Oh, I see, that's kind of funny.

Hairy: Thought you would like that. But seriously, we take the simplest, most direct way to implement something, so that we
know exactly what it would look like, then do the opposite.

Host: Seriously?

Hairy: Well of course. If we do the simplest approach, there's no room for a hero to step in an save the day. Lots of little
virtual terrorists running around in a complex system. Think about how the 9/11 terrorists were able to evade
electronic surveillance - they stayed off the grid and used cash, and were able to fool the other systems into thinking
they were out of the country.

Host: So you use the same technique?

Hairy: Well, the same forms of deception of course. If we built over-complex systems accidentally people would say that we're
clever, but if we do it deliberately, we're clearly diabolical. I would never deliberately paint myself to look bad, so we have to be a little

deceptive, right?

Host: Hmm, so this is a little disturbing. I want to know - oh we have a caller from Orlando.

Caller: Hey man, I want to come work with you.

Hairy: Shoot me a resume.

Host: Stay on the line and we'll take your contact information. There's another caller from Dallas Texas. Dallas, you're on the air.

Caller: Hey, this whole deliberate-snowstorm thing is really a different way to look at things.

Hairy: Snowstorm, I like that. When one of my first mentors realized this technique, he said in a gruff voice - "You're a blizzard, Hairy."
At least I'm not slithering - around you know like those snakepits where the idea-mongers hang out. Ideas are good for a spell, then they wear off.
Almost like-   Like snake-charming, or snake-whispering.

Host: But I mean, doing it deliberately seems, well -

Hairy: I know what you're about to say. But think about it. Whether we do it deliberately or accidentally, the outcome is
the same. They are interested in the system, but we're interested in self-preservation. If I have to choose, I say, don't
do self-preservation by accident. Do it deliberately. This means set yourself up to be necessary.
Host: Explain necessary?

Hairy: Of course. The more complex the system the more they need you. If it's simple, what do they need you for?

Host: But if they figure out what you're doing, you're cooked, right?

Hairy: But if I work myself out of a job, the outcome is the same. No paycheck.

Host: So you're saying that most data warehousing is just accidental brilliance?

Hairy: Acccidental or deliberate, the outcome is the same.

Host: Oh, please, don't go fatalist on me.

Hairy: Come on, man. Most folks can't pull this off whether accidental or deliberate. The fact is, we still stand up a functional,
operating environment. Consumers are being fed, users are getting what they want. Forget the fact that the
environment is stood up on pallettes and serviced by swarming people on rollerskates.

Host: I- suppose.

Hairy: Again, if someone stood up the working environment and had all this stuff in it, you'd call it clever. Only when
you realize that the complexity is artificial would you take exception to it. If you never realize this, I will always
have a paycheck and you will always have an operational system. You'll never fire me, because I'm your most visible hero.

Host: Ever heard of Munchhausen-by-proxy syndrome?

Hairy: Not familiar with it, no.

Host: Where a care-giver deliberately harms their child in order to be seen as the savior that delivers them from harm.

Hairy: I see a pattern, sure.

Host: You said it yourself, diabolical.

Hairy: Only if they realize it. Perception is the key.

Host: But you're admitting to it here. On the air.

Hairy: Nobody will remember it. For one good reason: It's so outrageous that it simply cannot be true.

Host: Unbelievable.

Hairy: ExACTLY!

Host: But what about scalability? When volumes increase, invariably the complex systems are crushed.

Hairy: Oh please, only about five percent of all implementations have that issue, so if I stay away from those, I'm gold.

Host: But what if one of your implementations grows into this? Seems like you'd have some explaining to do.

Hairy: What are the odds? I can easily place myself in the lower-scale zone and make a good living at it.

Host: But you would agree that the larger the scale, the more the need for simplicity?

Hairy: I don't do scale, so why would I care? I blow smoke into a manager's face, nose or other orifices and take a paycheck. What
does scale mean to me?

Host: You have a lot of disparaging things to say about these decision-makers. Aren't they your customers?

Hairy: Look, you have people like me, who work the magic, and people who sit in the office, smug in their confidence that
they know exactly what to do and how to do it. My objective is to make sure they stay on the outside looking in,
snuggled next to their branded-coffee cup and blissfully unaware of my agenda. We call them the Smuggles.

Host: Smuggles?

Hairy: Yeah, Smug people who snuggle with coffee. Their users are the Smuggees. Smug people who know data but don't know beans
about how to make it operational. Always offering opinions. Who cares what they think?

Host: Clearly not you.

Hairy: Well, I care what they think to an extent. As long as their reports are running to spec, it doesn't matter how we pulled
it off, only that they get the data they want.

Host: The "how" and "what" question. I've heard that before.

Hairy: And what I've heard before, is the endless droning of users dictating to us how we should deploy the systems.

Host: And what do you do with that?

Hairy: If their suggestion will make the environment more complex and difficult to manage, we're all over it. If it makes the
environment easier to operate, we push back. We call it the "you asked for it" policy. It's a theme, you see.

Host: Yes, I can see that.

Hairy: O I love 'em. Nothing is better than deliberately choosing a platform that can't cut the mustard. I mean, they're not really
hard to pick, you know. Imagine getting almost to the end of the project and hitting that hard
wall - flying right into it like a blind witch on a runaway broom. It's hysterical to watch all these
folks running around like headless chickens. You can't pay for that kind of entertainment. But of course,
they do pay for it, and handsomely.

Host: Until they realize that you're in the center of it?

Hairy: Eye of the storm you mean, where the seas are calm. I never lose my cool, so they always think things
are under control until someone pulls the single thread that unravels it all. I have plausible deniability. Keeps
me working and the paychecks coming. Sweet.

Host: Don't you think this is a bit - you know - underhanded?

Hairy: The mass layoffs of the turn of the century were underhanded. They created a 1099-Culture that basically means all of us are
mercenaries. Soldiers of fortune. We go to the highest bidder no matter what. There's no conscience in that existence, especially when we could
be working for one company today and their competitor tomorrow. Those companies treat the 1099's like batteries - plug 'em in, burn 'em out and toss 'em.

Host: Seems a bit cynical.

Hairy: One of the better parts about this kind of consulting is that I can propose the solution without
actually producing anything. Then I can flit from flower to flower, pollinating these ideas. They pay
me for the ideas, not the actual work, you see.

Host: So you propose, but you don't actually execute?

Hairy: Execution is somone else's problem. Why should I stick around to see if the proposals actually work? If
they don't, there's another feather in my hairball cap, or rather another notch in my mayhem gun. But
if the solution works, good grief, all that work for nothing? May as well stay home and play video
games.

Host: What would you like to leave our audience with?

Hairy: Oh, I suppose, don't worry, be happy and all that. I have a new book coming out called Managing Expectations.

Host: So tell us about that.

Hairy: It's all about how to give the user a false sense of security while we do whatever we want. I'm all about what's expedient
for me, but most of the time when the client sees what I'm doing, they love it because it makes things expedient for them as well. Soon
everyone is following the beat of the same drummer.

Host: Which would be?

Hairy: Get it done no matter what the cost. That way, I can charge whatever I want. Cost is no limit.

Host: I see. I think.

Hairy: When you think about it, most IT folk want to do things the expedient way anyhow. Over-thinking the problem seems so stodgy to them.
In fact, most of the time when we're going through the analysis phase, I can see it written in their expression and practically popping from
their eyeballs.

Host: What's that?

Hairy: The desire to start coding! Heck, practically every conversation is punctuated with how they intend to do it - long before the analysis
is complete. That's because IT folk just don't have the patience for analysis. They want to get coding. I just set the expectation that we'll
code first and analyze later.

Host: So when does the analysis come into play?

Hairy: What analysis? If we start coding the analysis never happens. There's never enough time. After all, the herald cry of
expedience is - "If you don't have time to do it right, when will you have time to do it over?" What do you get when you start coding
without any analysis? Hairball city. That's my stomping ground.

Host: Thank you for your time Mr. Plotter. This has been very - uh - enlightening.

Hairy: My pleasure

(pause)

Host: That was Hairy Plotter with his half-baked stints. Be with us next time when Jason Statham joins us for
"Data Transporter 2", where he kicks a bunch of back-ends with ETL tools. Until then, happy computing and keep that data flowing!

--

--

--

0

"You hate me don't you?" she whimpered, teary-eyed.
"Nobody hates you, Spooki," Bill assured, "It's just business."
"Just business?" she spat, the hurt spilling from her mouth, "Who died and made you king?"
"I thought you didn't want to hear about it."
"Never mind. We're already in Season 4, our ratings are tanking and you want to bring in who, exactly?"
"Well, we have some options. You know, other leading characters from bloodsucking tales -"

"What, the Lincoln Lawyer? You'll never get Matt to sign on -"

"No attorneys. People don't like to mix attorneys and bloodsucking, no matter how cute it sounds."

"Worked for Angel."

"Or not. He's in syndication now, right?"
"Then, like Lugosi?"
"Well, Bella anyhow, but that's not important. What's important is that we reel in this storyline so people can actually follow it."
Spooki rubbed her shoulder, pressing the fabric against her skin, "Where did we get this wardrobe? It's itchy."
"Wolfe collection. Has a lot of wool in it, none of us have problems with it."
"Did they think we would boost ratings if they got us all to wear Wolfe?"
"Some of us, anyhow. Were you like, asleep for Season 2?"
"Look, Bill, I've carried this series for three seasons, so - "
"But you're not a name-brand player, and we need some lift. I mean, not just to pull the ratings out, but to scale to more viewers. We're all over the world now. Streaming internet video, graphic novels, action figures -"
"That's just great. Merchandising. It's all about merchandising."
"No, it's all about scalability. The storyline's not scalable unless we bring in some power hitters."
"So the critics snipe at us and all of you go running to momma."
"Speaking of Snipe -"
"Oh no way, you're not bringing in the - the -"
"Daywalker. Yep, he's got more chops that the rest of us combined."
"Karate chops, you mean. He's a vamp-killer, Bill. Bring him on board and the storyline will die because all of you will be dead."
"Hey, the best vamps survived the Mighty Buff, so I think the writers can handle this."
"Speaking of buff -"
"Writers seemed to be cramped on it, I know, but hey, it's about ratings after all."
"So why isn't anyone blaming the writers?" Spooki cried, "I mean, the show is only as good as the writing, right? Make the writing more efficient and we get more lift, right?"
"I know it's easy to blame the soft skills but the problem is the platform itself."
"Me. You mean the problem is me?"
"Look, Spooki, I'm just the messenger so -"
"What's Blade got to offer the franchise? I thought he was off on his website Fantasy Pro Golf Assoociation."
"Yep, well, fantasy golf has taken off, worldwide has more participants that we could home to amass in a hundred seasons. By bringing Blade into the picture, we automatically get the FPGA to join us. It's not about Blade alone, but Blade and the FPGA together."
"I don't get it."
"Following, it's all about following. Blade and the FPGA will eclipse you into the twilight."
"It's so unfair. Let's talk about it over dinner."
"Sure, who are we having?"
"That's not funny."
"Over, I mean, Who are we having over?"
Spooki rolls her eyes, weary of the double-speak.
"We could go into town and have a sit-down, maybe have a nice Merlot -"
"Now why would you say that? Sam's a nice guy even if he is a little shifty."
"Ahh, he's a vendor rep. Some of them are that way. He hangs out with shifty people, you know."
"I'm just trying to get my head around all this."
"Let me put it to you this way. Your character has become too complex. People can't keep up with all the different facets. We certainly can't add any more character features. It's just better if we sunset Spooki and take it into something simpler, easier to manage."

"Are you saying," she swallowed hard, "That I'm a high maintenance character?"
"Well, not to put too fine a point on it -"
"How am I too complex?"
"Rather than staying grounded in the everyday, it's become a fairy tale. Our fans are not just moving away, they're hemorraging. It's a true bleed."
"A fairy tale? Seriously?"
"People want to get back to roots. You know. good and evil. Black hats and white hats. Battle lines obvious. It just simplifies the story for the viever. They need simplification or they will get lost."
"Where's Travolta when you need him?"
"Look, if I get knocked off, I have other options."
"Oh, what that BBC show with the cross-dressers? Get real."
"Hey, Ex-Men is very popular."
"I don't think the viewers are lost. They asked for more meat in the storyline and more depth in the characters."
"Yeah, but like anything else, the extra character features don't lead to scalability of the storyline. We've reduced the entire story to just a few people. Now we're bleeding out because we can't spin off."
"Spin offs? They want spin offs?"
"If you can't spin off, you can't scale."
"Spin off! I'll show you spin off!"
"Now Spooki, don't get carried away. Not all of us have a fairy godmother looking after us."
"Actually, thanks to you, none of us have one of those."
"Let's go into town. They just opened a new Which-Wich and I want to check it out."

"You would.  I just don't see how Blade will change anything."

"He has more bandwidth and better technology. Lookit, how do you think Blade got so successful? He could find vamps practically anywhere, no matter how deep they tried to hide themselves."

"Yeah, how did he do that? And why can't we have writers that good?"

"Feature films, Spooki. Different realm altogether. Just ask Bella."

"Bella has hers, I have mine," she sighed, "So how does Blade do it? Haven't our writers gone to every extent to boost the story?"

"Simple. He knows where the vamps aren't. Not at the supermarket. Not at the cinema. Just by process of elimination, Blade can zone-in on his prey anywhere on the map. He's groomed for it. He doesn't have to go to every extent to find what he wants."

"There's just got to be some way we can reclaim our lead in all this. I can't believe we've been eclipsed."

"Hey, reclaim is a thing of the past. We groom our characters now. It's the only way to fly."

0

I offer a simple note of thanks to the folks at IBM Netezza who executed what is arguably the best Enzee Universe ever.

 

I also thank the many of you who attended the Best Practices sessions on Monday. I counted over 450 in attendance just by chair-count. I realize that there is enormous curiosity and hunger among the Enzees for these topics and conversations and I hope you feel that your time was honored and it was worthwhile to you.

 

About the music: I had a number of questions about the "bumper music" played during session breaks. These were selections from Tron:The Legacy soundtrack, Blue Man Group, other movie themes from last year including James Bond, Mission Impossible and a cut from Steppenwolf.

 

Of all the questions asked during the session, the one that interested me most was "What does Netezza have that Oracle does not?" Honestly, we only had a few minutes left in the question-and-answer session and if I gave this answer any honor, it would have taken hours. I wrote up some of the differences in both Netezza Underground and Netezza Transformation, but perhaps its time to revisit these in an open discussion so that they are up-to-date and out-in-front. The problems for Oracle have only become more apparent over the years, so it will be easier to clarify and differentiate without requiring a deep technical dive (in other words, they are obvious on the surface).

 

I was also offered the privilege to sit for an online interview with Mike Kearney - linked here: http://www.youtube.com/user/NetezzaCorp#p/c/55393907B7B3BDBE/3/aLQMoa1S8P8

But keep in mind that this was after all day in the hot-seat with you guys without breakfast or lunch (whew!) so the whole day was a bit of a whirlwind.

 

I wish each and every attendee (and even those who wanted to attend) all the best and hope to see you again soon.

0

The long-awaited "SQL" to Netezza Underground has hit Amazon.Com.

 

http://www.amazon.com/Netezza-Transformation-without-alchemy-sorcery/dp/1453874054/ref=sr_1_1?ie=UTF8&s=books&qid=1306624705&sr=1-1

 

This book tackles some deeper issues around transforming our data warehouse, our approaches and even our thinking to align with the arrival of the TwinFin. No, it's not the TwinFin we'll transform, but that the TwinFin will transform us. Once again, a little tongue-in-cheek irony

 

The book offers working examples of checkpointing, exception handling (primary/foreign key), windowing (SCDs and deduplication) as well as a cookbook on more details to watch for in a migration project. And of course, is replete with Case Study Short and a whole chapter on Case Studies. There's also an appendix at the end to offer up some simple scripting jump-start routines that can make bash so much easier.

 

I also threw my hat into the ring on the Exadata question, as well as several other products that need a kick in the pants.

0

Enzee Universe is ON

Posted by David Birmingham May 19, 2011

Okay folks, the Enzee Universe is upon us and the Best Practice sessios is fired up, the power is vibrating the tracks and the train is about to leave the station. It's a good ride - so get on it.

 

We have a lot of stuff planned for the Best Practice session, to kick off Monday morning before the keynote. It's a six-hour run but hopefully not a white-knuckle experience. As with last year, there's more material than we have time for, so even if you're a little late, jump right in because there's no better time to join than right away.

 

Last year we had over three hundred-fifty in attendance. The conversations and participation last year was very high so we're expecting an even more scintillating performance this year, from you. That's right, the Enzees were the stars last year, turning out in number and unafraid to speak out on the tough issues and cross-pollinate their knowledge to everyone. I received an avalanche of feedback last year on the hot-seat sessions and the different subjects.

 

As with last year, we'll stay on the core tracks but we have some significant updates to the material. So bring your notepad, your thinking cap and hey, maybe even a megaphone if the temperature rises too high. Ahh, but no voice is too great or small in the sessions because the purpose is for all of us to walk away better prepared and enabled to tackle the next-great-solution with a Twinfin at our side.

 

So get a good breakfast, arrive early if you need to, and we'll get moving when the starting pistol sounds.

 

See you in Boston!

0

A question I receive often can be characterized as "do I really need to do this for smaller tables? Isn't this only necessary for larger tables?"

 

And the question could run the gamut from things like when-to-use-bigint to how-to-distribute-tables. The interesting thing - some of it really can be deferred until later, since Netezza has the power to turn-the-ship so to speak. Other things we need to consider for different reasons entirely. For example, will the table always be small? Or can it grow without asking our permission?

 

Case in point - a discussion arises as to whether we need bigint surrogate key types for small lookup tables that will never exceed hundreds or even thousands. The bigint type seems like overkill. So this makes some sense in that we don't want such overkill mechanics in the data model. Then someone notes, "and we'll save disk space too". Well, not really. On the smaller tables the additional bytes are negligible, add to this the compression ratio and the fact that all tables have a minimum required allocation of space - and we see that the avoid-bigint-to-save-space isn't really valid at all.

 

The avoid-bigint-to-reduce-domain-size tends to make sense, until we get into the issues of actually filling them with data. If we want to use a hashN() from the toolkit, we'll have to coordinate between the hash8() or the hash4() for example, to make sure we use the right one. If we plan to use sequence generators, we'll also have to coordinate the separate use of those, one for bigint and one for integer etc. In this aspect, the bigint (since it isn't costing anything in disk space) has higher value in its data-type-portability and the portability of the code or SQL leveraging it. We reduce logistical complexity with no penalty. The bigint then acquires a more polymorphic flavor to it, as a pointer to data rather than as a numeric counter or index.

 

Another aspect is distribution. How do we want to distribute the smaller tables, and what should we choose? For the smaller tables, recall, if they are not distributed on the same key as the larger joining tables  (usually the case) then they will automatically broadcast to the larger table. As this is actually the expected behavior, is there anything that could change this? I have noted on some sites that when queries get complicated, the optimizer does not have enough information to properly make a call on broadcasting, and can actually reverse this outcome - broadcast the larger table toward the smaller. Such things are very difficult to find when the data sizes are marginal and only rear their heads when the data sizes (on the larger table) hit a certain threshold. Meaning: it will work for a while, perhaps even a very long time, and then show signs of inexplicable drag.

 

Of course. this can also happen if the statistics on the table are stale, so we always want the optimizer to make to right call. We could always keep our stats updated and we could make sure those complex queries never got-that-way. We could-and-should do that over time and maintenance activities, but we should also never get bitten in between.

 

To avoid pain and injury, the way to make the smaller tables always behave is to distribute them on random. The optmizer then has no basis to broadcast a larger table into the smaller, because no distribution exists. The smaller will always broadcast to the larger (the behavior we expected, right?) and all is well. The beauty on the TwinFin is that this smaller table will be cached in memory at no additional charge, further boosting the query and any subsequent queries using it - a beautiful thing.

0

So a number of months ago I was in a chat with several colleagues on their use of nz_reclaim on their Mustang systems. If you are a TwinFin user, this really doesn't apply (after all, on the TwinFIn under the covers, the nzreclaim just calls the groom function anyhow).

 

As discussed at the conference and on many forum posts here, the nz_reclaim is one of those options of last resort. People say no, we use a CTAS or other mechanism. The problem with the CTAS, they lament, is that this resets all the rowids in the table and makes it a candidate for a full backup.Groom of course, mitigates all this.

 

But back to Mustang space, because most of us entered the frontier on a Mustang,

 

In their particular environments, which I am seeing is more the case in many places, they run an operational reclaim on their largest tables. Other places just kick off a reclaim on the weekend and let it scan through all the tables. This works great at first, then starts to present problems as we add databases and tables. Then we need something a bit more efficient.

 

If we apply a simple set of queries on the table before engaging the nz_reclaim, even on the largest tables, we can defer the nz_reclaim operation to execute when it's necessary, based on applied thresholds. I mentioned this at the sessions in the past years at the Enzee conferences, but afterward people would ask - how do you find the percentages you are talking about, or to discover those thresholds?

 

So the underpinning mechanics are fairly simple. Each time we perform a delete operation, this is actually a soft-delete. The record is not removed from the used-space of the table, and if you do a lot of operational deletes, you could end up with a lot of unusable dead space owned by the table and unavailable for other use. The update performs a full-record delete-and-insert. So in the end, the problem is the same - we need to clean up the dead space. The TwinFin uses the groom, but the Mustangs use the nz-reclaim, which works a lot like a disk-defragmentation. It is a very expensive operation (in duration especially) and locks the table from access.

 

I will depend upon your own acumen for deriving a list of tables to submit to the function below. It performs the check on the table and gives you a thumbs-up or down based on the provided threshold. I have given a default here of 20 percent, which basically means that if you compare the total records in the table to the total records deleted, if the total records deleted exceeds the threshold, it is declared a candidate. Now that you have a list of candidates, execute nzreclaim on them.

 

The best part of this is that even if you have a very large table that has not exceeded the threshold (say by the weekend run it has only 10 percent dead space) you can forego reclaiming it until the threshold is exceeded. I know of several sites that have operational deletes and updates that use these kinds of thresholds, and they don't perform reclaims nearly as often as those sites that don't use thresholds, or are using an all-or-nothing schedule.

 

In order to "see" the deleted records, we need to access the deletexid field, a hidden column on each Netezza row. If this value is set to "0" it means that the record is normal. If the value is non-zero, it means that the record has been deleted and will be affected by a reclaim. Of course, normal queries ignore the records that have a non-zero deletexid.

 

The function is called nz_reclaim_candidate(), and it accepts the name of a database, a table and an optional parameter for a percentage threshold

 

Note here that TARGET_XFR contain the name of a database on the Netezza machine. All the other assets are manufactured for the demonstration

 

To start out, I'll make a table and add some records to it, forgive the clunkiness of these singleton inserts, as I am only doing this so you can see what I am putting into the table.

 

nzsql -a -d $TARGET_XFR <<!

 

create table zx1 (int1 integer) ;

insert into zx1 select 1;

insert into zx1 select 1;

insert into zx1 select 1;

insert into zx1 select 1;

insert into zx1 select 1;

insert into zx1 select 2;

insert into zx1 select 2;

insert into zx1 select 2;

insert into zx1 select 3;

insert into zx1 select 3;

insert into zx1 select 3;

insert into zx1 select 3;

 

---- now delete some records - this will be 25 percent of the table

delete  from zx1 where int1 = 2;

 

--now lets try to see the records we deleted - this query normally won't show them

select deletexid from zx1;

 

 

--- now lets set up the switch to show the deleted records, active only for this session

set show_deleted_records=true;

 

-- and now we can see the deleted records. It is now simple enough to compare total row count to deleted count, and get a percentage

 

select deletexid from zx1;

 

 

!

 

 

 

# Now let's declare the bash function

 

#==================================

nz_reclaim_candidate()

#==================================

{

exec >&2

DBNAME=$1

TABNAME=$2

MAX_PCT=$3

if [ "x${MAX_PCT}" = "x" ] ; then MAX_PCT=20 ; fi

MDEL=$(nzsql -q -d $DBNAME -A -t -c "set show_deleted_records=true;select count(*) from $TABNAME where deletexid <> 0;" )

echo "delete count=$MDEL"

if [ $MDEL -eq 0 ] ; then return 0 ; fi

MCNT=$(nzsql -q -d $DBNAME -A -t -c "select count(*) from $TABNAME ;" )

echo "table count=$MCNT"

if [ $MCNT -lt $MDEL ] ; then return 1 ; fi

MPCT=$(nzsql -q -d $DBNAME -A -t -c "select (${MDEL}*100)::bigint / ${MCNT}::bigint  ; " )

echo "percentage deleted=$MPCT "

if [ $MPCT -ge $MAX_PCT ] ; then return 1 ; fi

 

return 0

 

}

 

 

 

#### Here you would have a list of tables – I can show you how to grab a list of tables from the database if you want ########

 

###  So lets go through the table list and it will find the table's percentage based on the counts

 

 

#=================================================================================

TABNAMES=zx1

 

for TABNAME in $TABNAMES

do

mtab=$(nz_reclaim_candidate $TARGET_XFR $TABNAME 20 )

mret=$?

if [ $mret -eq 0 ]

then

echo "Not a Candidate"

else

echo "NzReclaim Candidate"

fi

done

 

 

##### drop the test case ####################

 

nzsql -a -d $TARGET_XFR <<!

drop table zx1;

!

 

 

Hopefully this can provide some value to those of you who have mature Mustang environments and need some relief from the "global" nzreclaim approach.

 

Happy hunting!

0

ELT buzz is rising

Posted by David Birmingham Dec 20, 2010

At the June conference we covered a lot of practices around harnessing the Netezza machine for fun and profit. But the buzz on ELT was rising then, and when I arrived (and remained after) the November London Roadshow, people were practically tackling me in the hallways to learn more about how ELT works, how well it work, and how practical it is.

 

Well, the ELT buzz is growing even louder. I've been practically inundated with requests for more information on ELT and how it operates. Of all the opinions on the subject, the conversation usually boils down to the irreducible: How do we get plain vanilla queries to do more than be, well, you know - plain vanilla queries? I mean, the business logic and flow rules are all bound into it! And if the queries can be thrown into the machine with reckless abandon how can we possibly make any sense of them? ELT means forming structured flows of information using intermediate tables as checkpoints. Netezza is one of the few platforms in the world that can do this effortlessly and productively. Anyone now exposed to this (like a common RDBMS developer) would claim that ELT is unworkable on any database. Well, perhaps on your database, but not on Netezza.

 

Of course, we cannot allow arbitrary SQL to enter the flow can we? But does this mean arbitrarily submitted, or arbitrarily formed? You see, if the SQL has a consistent form, we have all the freedom we need for arbitrary submission. I suppose it's the "degree" of arbitrary we're looking for no?

 

Case in point, a vendor tells me "Oh, you can send any arbitrary query to the machine and.." Of course he means anything, absolutely anything our developers choose to concoct of their own cleverness or ignorance, or anywhere in between. In fact, cleverness is perhaps the least of our worries since the clever queries often follow some degree of structured discipline. It's the queries produced by the newbies that pose the most danger. It's hard to remediate a lot of work that took place "in the wild", so to speak. Of course, for those who like their wild-game-SQL or pehaps free-range-SQL, I will not disparage their personal tastes. But this is enterprise computing after all, and many of my personal tastes are already checked at the door.

 

And as I've noted in the past, metadata is where it all happens anyhow so let's take a short dive into the future of ELT and what I suspect is the next wave of stuff we'll see.

 

1 - insert/select queries, while they are the ultimate bread-and-butter of ELT, their hard-wired form should not rule the roost. One need not fear being enslaved by the queries, but should instead harness them. In fact, the queries should never completely exist in one place, but be manufactured at run time. Now, our BI tools already do this, right? We post a request into our favorite BI tool and it leverages its metadata repository to form-the-perfect-query. That's where we need to be. Provide a template and some steering logic, but the query itself is manufactured on-the-fly.

 

2 - Automate everything, from DDL to physical model, from intake to reposit, and it should be a fairly simple approach. When we takea look at the patterns of what we're trying to accomplish, we are essentially punting around metadata (tables, columns, data types, rules) and these are the object-oriented assets shaping the solution-at-hand. We should not be in the businss of crafting SQL statements (let a product do that for us) but in the business of shaping logic. The ETL tools purport to do this but cannot scale (for set-based operations) as effortlessly as Netezza can. So in some ways, the ELT aficionado is stuck in a bubble waitng for the technologies to mature.

 

3 - Think in terms of metadata, factories and steering logic. Okay, what does that mean? Factories mean components that take raw parts and manufacture them into SQL statements and other assets. This allows us to drive practically everything from knowledge of the Netezza catalog. Source table to target table? All we need are the rules in the middle and we're done, right? So if we can make simple lists of rules that are tied to particular table/colums as targets, we are now in the zone to leverage a factory to manufacture the proper insert/select statement to send to the machine. We don't really need to have the predefied insert/select, only a list of target columns tied to rules. The rules then become our source map, and the target columns become our fodder for manufacturing the insert-clause. Bolt a filter / where-clause on the end of it and yes, we've have valid SQL. But wait, what about steering logic? As my factory is manufacturing the query, I can make all kinds of decisions about the table and columns as I encounter them. For example, if I am making an extraction query to face a database, as I encounter each column in the catalog, I can compare this to yet another repository of information that can tell me, does this column use a different name on the extraction source than the target? Perhaps I need to exclude it from the query altogether? Perhaps I need to  alter the value of a given column to a default (data masking) or I may need to bolt-on columns to the end of the structure. In short, the steering logic is a simple set of exports that I can tie by-name to the source and table name so that it is surgically applied as necessary. This means every extraction is now programmable because we've built the programmability into the factory - the switches and signals serve as steering logic as the factory weaves the perfect query.

 

What is BI anyhow? Some companies who do all-data-warehousing have jumped into BI thinking it's just the icing on the cake, so to speak. Others embrace BI as their primary focus but then drive into a pit of miry clay when they have to access data untouched by the scrubbing bubbles of an enterprise-class data warehouse. So which is it? Data warehousing and BI? Does BI encompass data warehousing? I've had this conversation with so many people who cannot seem to put their finger on the definition. One argued passionately that we should simple consider data warehousing and BI to be one and the same, such that when we talk about BI, we're talking about both. Then not five minutes later he claimed that Microsoft Excel was the most widely used BI tool in the world, to which I asked wryly, "since when did Excel start doing data warehousing and ELT" - he sort of stared at me, not knowing why I had asked the question, forgetting his prior claims as though second thoughts entirely. My simple assertion is that the open marketplace calls them out as different - BI is not data warehousing and likely never will be.

 

Cached stuff - Like any system, catalog-hits can become cumbersome and the factories leveraging them can begin to take protracted lifecycles for building out the queries, some of which can be very complex. However, if they are based on static metadata (like that in the catalog) it is easier to cache the manufactured assets (less any table/column-centric adjustments) and avoid re-manufacturing these assets each and every time we need them. In one case, manufacturing the assets for each extract/load took about five minutes to complete when this metadata had to be manufactured from scratch each time. After caching, however the entire duration became less than a minute. Many ELT-oriented tools actually do consume the catalog but unfortunately  do not consider this aspect of data processing - that the metadata itself - when manufactured from static assets, becomes a repeatable product. A perfect candidate for a cache.

 

Transform metadata - talk about "all the rage". Hey, people want ELT, but they don't want to risk losing track of how to trace data lineage through the multiple stages of SQL statements. After all, by the time the data actually arrives in our BI tool, we might see the data undergo hundreds of insert/select queries to arrive at the final data outcome. What if an analyst wants to perform track-back lineage of the data, or perhaps even do an impact analyis of the environment? These are impossible without metadata-driven artifacts, rather than allowing SQL guys to manufacture any arbitrary query on-the-fly.  What does this look like? Well, think in terms of source columns, rules applied to those columns and their outome sent to target columns. Then we need a filter clause (join, where, filter, group) that is easily identified. Of all this, we need to capture rules, participating tables, participating columns (both in the source to target map and in the filter clause). It's not really all that hard to capture it if the developers have followed certain rules.

 

Rules-based asset manufacturing - (a spinoff of one of the above) is simply providing a structured template where developers can formulate and test a single query for accuracy, or multiple dependent queries. But in either case, the metadata in the catalog can provide us with the the raw structures -  columns, tables, etc and we can then formulate templates from these. We then hand off this template to a developer and they fill in the details. They can't really get themselves into trouble because if the developer tries to do something illegal or ignorant, it's all part of the game to us. Our environment requires the developer to adapt - or the query simply won't run. I was on a site where we had only a handful of basic programming rules. Easy enough - like falling off a log. But rules will not protect us from someone who is willing to break them. Here we have templates that provide the balance between rigidity and infinite flexibility. Make the template work and you are by definition in-compliance.

 

Without elaborating into infinite detail, the realm of ELT is still maturing. ETL tools jockey for position in push-down approaches, but know that if they are really good at this, they are essentially working themselves out of a job. If for no better reason than they have effectively harnesses a Netezza machine to do all of the heavy-lifting their own hard-won capabilities currently do. And if so, the irony is that any one of their smaller competitors stands to gain much more in this marketing space because they have less to lose. After all, we don't buy a power-hitter ETL tool and relegate it to data-transport/transfer activities (it's new role alongside the Netezza machine) but a smaller player would love to fill this spot, grow it, mature it and deliver it as a packaged method to create fire-from-fingertips.

0

I've recently been involved in a rousing exchange out on Linked-In with one of Oracle's employees - on the Exadata machine and how it compares to Netezza. The conversation simply had to stop when the Oracle employee implied that I might be biased. While I am a big fan of Netezza (and unashamedly so) it continually rests on Netezza's ability to deliver, not because I am an employee (because I am not). I otherwise receive no financial consideration - whatsoever - for my presence here.

 

So the following dissertation is for Enzees on the subject, because much of the impetus for Exadata's very existence is Oracle's marketplace response to Netezza's products, rather than truly offering up something original.

 

Objectivity on the matter requires us to spin back in time to 2007 when Oracle's senior management publicly recognized the appliances (Netezza in particular) as a threat to their market share. I was in attendance at the 2007 Netezza Conference in Boston when the head of Business Objects global strategy group noted that the Netezza appliance had officially relegated many "mainstay" data warehouse technologies to "secondhand" status for data warehousing. The first on his list was Oracle. When this hit the big-screen, it elicited a standing ovation, because the vast majority of those folks had moved away from Oracle to Netezza, and the Oracle management had finally recognized it as a serious threat - one that then (and continually) affected Oracle's public image outside of some rag-tag gaggle of naysaying defectors. The standing-O was an affirmation that all of them had done the right thing by moving to Netezza.

 

It was on.

 

So some of us watched with anticipation as Oracle, the industry giant, made a move. Some thought they might actually tender an offer to Netezza (if you can't beat 'em, buy 'em). Then the rumors arose that Oracle was building its own mousetrap, partnering with other power-hitters to make it happen, and with all of Oracle's experience and exposure to the data warehouse domain, and its laundry list of evident shortcomings that its users had streamed its way nonstop for over a decade. Surely, surely Oracle would be able to circumscribe the problem and offer a solution. Anticipation was high.

 

But nobody anticipated that Oracle would miss the mark so badly. Not just once, but twice. Putting the transactional Oracle engine on top of "parallel" hardware could not - and has not - solved the problem. In fact, it's almost as if (and I'm speculating here) that Oracle's internal discussions about the solution had resolved the wrong problem completely. On the one hand, Oracle had the opportunity to create a data warehousing engine that would resolve the banes and shortcomings that had earned it a "secondhand" status for RAC. A logical and reasonable question would have been "why are we secondhand?" But nobody at Oracle was asking that question. Were they?

 

Oddly, Oracle still maintains the debunked notion that performance is in the software, because Oracle is a software company. Alas, performance is not found in software. And that is the summary of how Oracle missed the mark. (Okay, argue with me that performance is in the software, and I will ask you to run RAC on a laptop, and see how well that goes).


The Exadata machine looks for all the world like a late-1990's "system-slam". That is - take some specially configured commodity hardware and strap some commodity software to it, and call it a day. Recall the old days when they would roll up their sleeves and do a TPC-D test on a carefully tuned, optimized commodity hardware configuration as a one-hit wonder for the sake of scoring high on the test? This looks (again speculation) like someone had the epiphany to put all that specially configured hardware into a machine to bypass the need to purchase separate assets. This was close, but then they spoiled the ride. Adding RAC on top of it. Whoosh is the sound of the fly-by as the missed opportunity rushes past.

 

This solution still carries the baggage of being an integrated solution rather than a purpose-built solution. Its evident purpose was to compete with Netezza, to protect Oracle's existing user base from bolting, but not to actually solve data warehouse problems (because it didn't do that). In short, Oracle's priority seemed to be targeted towards stemming the hemmorage of lost market share, not provide an innovative technical solution. Time to market was critical for this to take place, so there wasn't time to actually put some thought into a real data warehouse engine.

 

Oracle evidently embraced the RAC engine as the core of the solution with

(a) the fear of rolling out a brand new data-warehouse-centric engine or

(b) use the existing Oracle engine because it is mature and fully debugged.

 

Of course, going with (a) assumes that Oracle has a cadre of engineers that understand data warehousing and can build this sort of engine. The Exadata offering is a clear indication that their engineers either don't understand warehousing, or don't have a voice in the architectural decisions. Of course, going with (b) makes the rather bold assumption that potential data warehouse users actually want and like the Oracle engine for data warehousing, when the industry in general has bolted for Netezza precisely because they don't like the Oracle engine for data warehousing. Someone at Oracle missed that memo, too.

 

So we are left with

(a) Oracle's own data warehousing engineers are voices in the wilderness, don't have the necessary experience, or they don't exist at all,

(b) someone dropped the ball in the solution requirements analysis, or there simply was no analysis at all - and finally

(c) Oracle really hasn't been listening all these years to all those complaints about its shortcomings in the data warehousing world, because Exadata sends this message loud and clear.

 

Pernicious hanging questions include: Why would management deliberately embrace the secondhand engine unless it was unaware that it is secondhand, or it is in denial of its secondhand status? Would the outcome have been different had Oracle recognized that the solution is found in the hardware, with a carefully, specifically constructed (purpose-built) software engine that embraces this? Is recognizing the critical role of hardware
even possible for a software company? Can Dudley Do-Right save Marigold from the evil villian before the commercial break?


The answer to these questions are: It doesn't matter. The outcome is the same. By publishing Exadata as its very best shot at the solution, Oracle has publicly stated, in no uncertain terms, "we don't get it."

 

So the moniker of "secondhand" technology remains firmly affixed to Oracle's data warehousing lapel. At this point the data warehousing marketplace has made-the-call, something akin to calling time-of-death for the patient,

 

because the Exadata offering - so far - remains SOA (Secondhand On Arrival).

0

Team Makeup

Posted by David Birmingham Aug 28, 2010

As some of you know, here in Texas the sport of football is a bit more than a team sport. In some locales, it's practically royalty. Back in high school and college, I was ever-aware of the sports fans that liked to wear the team colors. One in particular invited me to lunch while his parents were in town (I already knew them from back home) but I never knew that his father owned a car that was actually painted in Dallas Cowboys colors. Another colleague of mine was stuck in a parking garage with a dead battery as I was leaving work, and asked if I could help him with a jump. Upon saying yes, he produced a set of maroon-and-white jumper cables, honoring Texas A&M University. Oh yeah, he was a fan.

 

Then several of my friends kept large jars of team-color makeup and would smear it all over them before a game. No, not to go see the game at the stadium, but in their apartment! Yes, the team makeup was always a big hit, and their parties were solely centered on the tiny color TV in the center of their rather modest apartment. I had once wondered what the people next door thought of their shouts, whoops and antics, until I learned that the folks next door were just as rowdy with their own team. Hey, ya gatta be a fan.

 

Okay that's not the team makeup I'm talking about here, but those were just such funny memories I thought I'd share. And to make a point of course. The enthusiasm of your developers and architects, and their desire to cheer you on with your goals to achieve, are directly dependent on the team makeup. Of people, that is. So what does it take to make a data warehouse? Or for that matter, what does it take to effectively roll out a new one, or migrate an old one?


What most people fail to estimate in taming such a beast, is the level of testing required to make it a reality. As an example, many of you must produce documents as part of your regular workday. Those documents are often hard to write, but even more work to proofread. In fact, proofreading is the same form of content-and-context testing we would do for a data warehouse. The chief reason is the product - information and knowledge. Business intelligence is the same way - it has a way of taking on a life of its own, but the only way we can reliably roll out a viable business intelligence platform, is to test, test and test some more. Eyeballs-to-page however, may be required for a book or document, but it won't scale for a warehouse.

 

Many people don't realize that the testing portion of the warehouse can take as much as 80 percent of the project's resources. While we can compress this somewhat with agile methods, we cannot afford to test such quantities of data with simplistic manual approaches. And by that I mean eye-ball examination or screen-shot testing. No, the majority of testing is in the data itself, and on a Netezza machine it's in the billions of records. Eyeballs don't have the bandwidth. We need to use the actual power of the machine to scale this mountain.

 

So what should the team itself look like? May I suggest that for every Architect you would have several developers, and for every developer you would have two or more testers. Ideally three testers for each developer, regardless of how many developers are actually doing the work. I will also suggest that you keep the total count of developers rather lean. Five is perhaps overkill for the back end. For the front end, three solid developers can be an army, and five is about the upper limit. The reason is simple: logistics. If you have five developers on the back end and five on the front end, and three testers in the wake of each, this is a team of 40 people - which quite frankly is overkill in any sense of the word. Not to say that an overall team might not be comprised of 40 people once we include all of the infrastructure folks, but not for pure develop-and-test. We can and should make it leaner.

 

As an aside, I had the rather disturbing experience, numerous times, of encountering folks who worked these things out with overblown spreadsheets that they normally used for application development estimations. A data warehouse gig is completely different from an application development gig. But of course, if one of these spreadsheet guys ever showed up and plugged in his metrics, he would spout off that we need a 30-person team to migrate a couple of tables from one machine to the new one. Once this number is in the air, it becomes the de-facto standard by which all discussions are measured, even though it is completely wrong. In another setting, another spreadsheet-guy plugged in his numbers and characterized a project as a $900k gig when our competitors were bidding $300k for the same work. Knock yourself out, dude, because the client ain't a-bitin' three times more expensive projects because they like our faces. True to form, the $300k bid actually won. But the irony was, that the potential client had no desire whatsoever to pay more than about $400k, so the bid fit their budget just fine. The eventual winner of the bid took a bath, however. The truth is always somewhere in the middle.

 

I still say, watch those spreadsheet-guys. Somethin' up with that.

 

Perhaps it goes without saying that an architect needs to lay out a framework so that all can work comfortably in the same sandbox. This is a challenge and should not be left to the developers to forge on their own. Harnessing it later will be impossible, because too many opportunities for flow-based consolidation will be lost. Workarounds and repetitive logic will become the rule. Let's not go there.

 

If we have say, three solid developers in the back and front end each, they can and should cross-test each other's  work. In this case, we have the senior developers and architect working on the core logic, and the junior developers bench-checking their work and zipping it up for a formal testing team. Here we have a synergy, that a senior developer can crank out ten times the work and quality of the juniors (so says Demarco and Lister, your actual mileage may vary)  but nonetheless, we would not want to put a junior developer in the front seat of this chaim because the testers will be waiting on him more often than not. But with the ten-times-more-power driving the front like a locomotive, the junior developers can wrap up the many tactical areas of the warehouse and cross-test each other, but also receive the work products from the senior developer.

 

Now think about what this kind of model means. The senior developer is literally force-feeding the pipeline with work products and is doing it with the highest quality the team has to offer. The junior developers are learning from the senior without injecting their own inexperience into the mix, which will invariably have to be reviewed by the senior developer anyhow. No, the senior developer is more productive and experienced, so let him/her drive. Seems like every senior developer I talk to, they really, really want to develop and have the testing tedium off their plates. The junior developers really, really want to learn from a senior developer, and of course want to do some development themselves. I'm not saying this is off-limits, but the senior developer can delegate-what-he-knows to the junior developers because they cannot go too far astray without his guidance anyhow, so it's a win-win. And of course, I and every other person who was ever a junior developer had to pay our dues, so not everyone can be the leader. I don't say this dismissively, but we know in a business intelligence project there has to be a driving mind. Too much consensus means too little leadership, and in the famous words of Margaret Thatcher "Consensus is the absence of leadership".

 

But for the people doing the testing, they need something that will scale. To billions of records. And it had better be solid on the first round or they will be playing catch-up for every round after that. While writing and proofreading a document is an eyeballs-only model, don't you think I could at least do myself a favor and run a spell-check and grammar-check on the contents? Such set-based operations resolve a world of problems and let me focus my eyeballs on the harder stuff. But in a data warehouse, our eyeballs will never have enough bandwidth, and will never scale to the necessary heights. Set-based testing is all we have, but it's also all we need. And with a Netezza machine, we're so in the zone.

 

Now testing of the report screens can involve eyeball-based activities but doesn't have to be so egregious. Automated testing tools go a long way to mitigate the necessity for eyeballs on these as well (for the subjective parts like positioning, banners or colors especially). However, if the data is wrong, no amount of pretty-pretty will fix it. As Murphy would say "Beauty is only skin deep, but ugly goes to the bone."

 

Now, no sooner will I write this than I will get feedback from those junior developers who say that they have been relegated, but not to fear. This particular article is in context of a high-productivity bubble of work, normally found with new projects or migrations. The priority is not to make people feel better about their role, but to get past the workload so that everyone can feel better about the work products. I am always looking for opportunities to stretch the developers, both junior and senior. When a junior is ready to sit in the driver's seat of the locomotive, it's because he's passed the Demarco and Lister smoke test. Now what the heck is that, anyhow?

 

Get a copy of Demarco and Lister's Peopleware, a classic in every decade. Something they have empirically measured, is that a junior developer will start out at one level of productivity, and then in a sudden epiphany will transform into someone who is ten times more productive than before. Something mentally and/or emotionally clicks and they get this whoosh. They claim it is different timing for each developer, but usually takes about two years to make this transition. This is perhaps one reason why so many job-search requirement listings show "X years of experience in Y" and the "X" is never less than two years. Not because the poster has ever read Peopleware, but we who are in the field want folks who are 2 years along because we already know they have (at least) transitioned into a high-productivity asset.

 

But this is the mechanism driving the team makeup - and the experience of the developers and their known levels of productivity should help us find the right role for them on the team. We don't want a low productivity person in the locomotive chair. But having one in the wake of a strong developer only makes them stronger and exposes them to practices that will accelerate their transition into the higher productivity person we always wanted anyhow. And then, of course, once the person has made the 10x transition and is self-aware of their value, we have another problem: They are self-aware of their salary level too! Making someone stronger makes them more valuable. Be prepared to recognize the value (or rest assured that your competitor will). But all this, is the nature of the beast we purport to tame, no?

 

Back to set-based testing. This has as much to do with using the right data as it does the right method. The right data means - select a subset of known data that will deliberately exercise all of your business rules and software paths. Nothing is worse than realizing such errors in production. Then, we need set-based testing methods. This means we need three primary assets: (a) source data that we can sluice through our application transforms to get a result (b) a saved baseline result to compare this recent result against and (c) tested components that compare these two results in a reliable manner so that we get a statistical report on what passed and what didn't, and a detailed report on what specific records didn't make the cut. Counts, amounts, checksums and summaries all reveal deviations, especially for regression testing. You might recognize this as an exception report, and this is exactly the spirit of the effort. Our testing has to deal with statistical exceptions, because it is the only practical and scalable way to validate billions of rows.

 

And also notice that such a practice would be the kiss-of-death in many other "secondhand" data warehouse platforms. Such platforms are in no wise optimized to compare monstrous sets of data to each other column-for-column, row-for-row. Queries like that can dim-the-lights and may not return for hours, if not days. We cannot afford a protracted testing phase, and with Netezza we don't have to. Scan times and comparison times are very objective and knowable. The tests will take the same amount of time each time they run, and we always have the option to optimize them further with the Netezza performance model. Power is in the physics.

 

And again, why all the focus on testing? I have seen data warehouses blind-side an organization that accounted only for the opposite equation - 80 percent development and 20 percent testing, when more often than not, exactly the opposite is true. This would mean that if a two-month development effort were characterized with one model (the wrong one) it would look like at most a three-month effort. Why then does it metastasize into a ten-month effort? Because 20 percent (2 months) tranlates to 80 percent (8 months) of testing.

 

That is, if we just embrace the standard model. By embracing the aforementioned model, we get the development out of the way quickly and deliberately, entering the testing phase much sooner, and if more heads are deliberately dedicated to set-based testing we can close this part off even sooner. I have watched very-large-scale projects, with a Netezza team in the middle and strong developers in the locomotive seat, enter their first UAT phase within two months of the project's inception. The funny thing is, the model requires rapid turnaound that only the Netezza workhorse can provide, Try pulling off this team makeup with any other lower-productivity technology, and it won't sing the same key. A high-productivity developer is meaningless on low-productivity technology. And high-productivity testing methods are useless if enslaved to a low-productivity technology.

 

Start it, shape it, ship it. Netezza is the ticket home.

0

In the past several projects, the issue of using views has consistently arisen, as to when to use, not to use and what to expect. Views are one of those mainstay workhorses that we love to hate and sometimes hate to love, but used correctly, can save a world of hurt and lost development time.

 

So we would ask the question - why use a view at all? Isn't the table definition good enough? And what of a synonym? Isn't this just as good?

 

Well, synonyms are handy for configuration management and invaluable for testing. For BI, however, they don't pass-thru the metadata of their underpinning relation's metadata for consumption by the BI tool, so this can be problematic. We also cannot refresh a synonym easily. It has to be dropped and then created in two operations, where view gives us the concurrency-protected operation of "create or replace view" and is muey bien. More on synonyms in another blog entry.

 

Views can easily reach across databases, giving us the ability to stand-up a consumption-point that contains one-part tables and one-part views without having to push data around (very handy for say, a reference database that we want as an on-demand resource of fresh information). I'm a big fan of setting up consumption-point databases so that a user comes to a pre-designated place, not the master repository, to fulfill their information needs. This decouples the user from the master repository and gives us enormous freedom in the ongoing enhancement of their user experience.Views are the vehicle towards this goal.

 

Views also let us do on-demand case/when conversions and typecasting that can be completely encapsulated from the consuming process.


And of course a really cool part about Netezza views - is that we can include as many columns as we want in its "select" clause, the view will not fetch them all, only the one's mentioned in the select that consumes the view - this is a win-win because otherwise it would fetch all of the columns and then drop the majority on the floor to deliver a few.

 

Views have the lightweight nature of a single SQL statement that can be easily installed, where a stored proc often contains multiple SQL statements. Both of these mechanisms serve to hide the logic from the BI tool. But think about this - would we use the stored proc as part of another join? Or would we expect to just select from the stored proc and consume an answer? The more complex the operation, the more we need to just select-and-consume, and take the burden off the BI tool to know more than it has to.

 

A pernicious part of integrating BI tools is just that - expecting that it will know all it needs to know to interact with the Netezza MPP. This is - as you may painfully discover - a false expectation. Case in point, we might have a very creative intersection table between two large fact tables, and we can formulate a query that will browse the information-we-want in mere seconds. Then we plug in our BI tool and ask it to manufacture a query to do the same thing, but it struggles. Now we have to make a call - do we deploy the BI tool in the hopes that later releases will resolve this, or do we install a view or stored procedure that adapt the BI tool to our data model, and then wait for the BI tool to get better in a later release? You see, we can always toss the adaptation when our BI tool gets better. But we cannot allow our user-experience to languish on the same terms. More on this in another essay.


So before I jump into a lot of other things we like about views, I'll address some of the above in their more malignant form.

 

I'll loosely divide views into two buckets - simple and complex. The simple view consumes a single table and may have columnar transforms on it. A complex view, simply put, has more than one table in the join logic.

 

A simple view cannot be easily misused, but a complex view can be misused so easily it will make the head spin on your best troubleshooter. For example, I cannot count the times I've seen a case where a master query joining on a view, which in turn joined on a view, which in turn joined on a view etc. How deep can you go? This is not the issue at all. The issue is in treating the view as though it is a reusable, inheritable object rather than a standalone select-and-consume capability. So where do we draw the line?

 

Transactional thinking - that is - the notion that we can install nested (inherited) views because they handle transaction-at-a-time anyhow and any given instance of them will have a negligible performance problem - is completely washed away when dealing with multi-billion-row scales on a Netezza platform. It's not a transactional platform, so each view potentially initiates a full table scan. Multiply these nested upon nested views and we have nested tables scans - sometimes several separate scans on the same table. Which is more efficient, to look at a multi-billion row table one, or multiple times?

 

One customer had a query that started running very slow one day. We went through a process of discovery to find out what had changed. Seems that a new version of an existing view had
been installed, and the bad query was consuming this view deep under the covers. The bad query and view  were both accessing one of the largest tables in the database, the bad query was now scanning the big table twice, taking a double-hit on the master query itself. Even worse, the changed view did not leverage the big table's zone maps or its distribution key. So a change in one place dramatically affected unchanged functionality of a master query.

 

Because we are embracing economies of extraordinary scale, dynamic objects have a propensity to lose performance integrity over time. What worked yesterday may not work today, so we have to tune it. Netezza is so efficient that this tuning necessity may not arise for years after the implementation. (In one case, four years afterward). By that time, the knowledge of the system's dependencies are not fresh on everyone's mind, so it is easy to make a spot-fix on the view and deploy it. In so doing, we may create a cascading effect for all the other places that consume the view and do so with the expectation of original behavior. In short, the latent nested view architecture is a minefield. We should not implement it because it creates trouble from day one, even though nobody has stepped on mine just yet.

 

At one customer site we had to sift through six levels of view logic to find the performance problem. The customer wanted to know what they should do to fix the problem, but "the problem" was in the overall inplementation and the nested views, not the one bad view, or for that matter, the recent performance symptoms of a minefield implementation.

 

Views can behave as traditional objects if they are single-table views or they leverage additional tables that are small and inconsequential to performance. Don't ever include a big-fat table in a view as part of a performance boosting strategy unless you can designate that the view is in fact a standalone entry point and not something that can arbitrarily participate in the JOIN clause of another master query. Why is this? Invariably we will forget the complexity of the view and then attempt to join it in another operation. For a BI tool, this could be highly problematic as well, because a view that was once simple could spontaneously go complex, and if it affects performance, we'll be pulling our hair out to find the problem through what reduces to a scavenger hunt, or worse, a submarine hunt.

 

Many BI tools simply choke on automatically forming a "complex" Netezza query because there is an implicit assumption of indexes via primary keys, and if these don't exist, the BI tool does the best it can, which in many cases is the least-common-denominator of a query structure. This this doesn't play well on the SPUs for large-scale queries. I cannot count how many times I've seen a convoluted query that we just de-engineered and simplified, and ran an order-of-magnitude faster than the one conjured by the BI tool, yet nothing the tool folks could do seemed to make the BI tool form it the same way. To the rescue: a view that did the right thing - and that was that.

 

What's that? Putting together a view diminishes the flexibilty of the query? Only marginally, and since we're dealing with billions of rows, we don't have much runway for "ultimate" flexibility anyhow. The larger the datasets, the more we need to make sure the queries are as efficiently formed as possible. And since this means as simply formed as possible, we're not talking about BI Tool query engineering, but query de-engineering.


To avoid pain and injury, don't treat all views the same. If we have a complex view, we should tune and designate it as standalone. No matter how much we like its results, it is better not to just arbitrarily include it in another join. the primary reason being - most views are not set up to regard a distribution. So when we include it with our other join, the resolution of distribution might take the form of of least-performing, lower-denominator. We don't want that.

 

One alternative, oddly, is to CTAS - execute such a view in context and insert its data into a temporary table, then use the temporary table in the master join. This affords us the option to (a) leverage the view's normally small output (b) preserve the distribution or (c) align distribution to the next operation (d) simplify the implementation. Of course, your BI tool may not support this, or may support it in an inefficient fashion. Most of the major BI tools will accommodate advanced scenarios, so get your product support rep on the wire and have a heart-to-heart.

 

Yet another alternative is to use the view like an in-line view, except in the where-clause in correlated sub-query. This can often take the form of a where-not-exists clause or the like and can also be very efficient.

 

Another alternative is to break apart the view's logic and assimilate it into the larger view so that all logic is preserved. But you'll be maintaining that logic in two places, right? Not necessarily. We have a lot of view DDL executables that do not directly spawn from a modeling tool. Several of those being in BASH script, which provides for parameterization of logic. If we put the logic into a parameter, then produce the views by including the parameterized logic, we will maintain the core logic in one place (script) but actually deploy two views that leverage it. This is essentially what happens under the covers with many object-oriented environments anyhow. Multiple objects will consume another class and deploy an instance that includes that class, so this approach embraces that inheritance pattern. Not in the dynamic run-time of the view, but in the view's initial DLL-level deployment.

 

MYLOGIC=$( cat <<!
a.limit1 between 50 and 60 and
a.limit2 between 1000 and 50000 and
a.tran_amt < 10000 and
b.employee_id <> 9999
!
)

 

view1="create view view1 as select col1, col2, col3 from mytable a where $MYLOGIC ;"


view2="create view view2 as select col1, col2, col4 from mytable a join yourtable b on a.id = b.id where a.col1 = b.col1 and $MYLOGIC ;"

 

If our modeling tool supports this capability as part of its functionality, and we should leverage it before simply bolting a view into a join. If our modeling tool does not support it, this scripted DDL scenario is easy enough to formulate and leverage without a lot of overhead. The objective: two views that both behave as optimized joins, rather than one view that behaves as a join-with-a-view.

 

Either way, there is a theme here, that simply including the complex view as part of another join's  logic - as though it was a table - is risky and can, even at the outset, offer up such bad performance as to be a non-starter. So a plain-vanilla practice should be to make the complex view behave in a standalone query-and-consume fashion by default. Make no assumptions that it is okay to arbitrarily include it in a larger query's join clause.

 

The further downside is that a de-facto join-with-a-view can work really well at the outset, but the scale of the data can catch up to the even the most robust of implementations, and wiring up the complex view dependencies creates a problem that will not scale, but will only become obvious over time (a minefield)

 

One group invoked a standard for view naming conventions. The simple views would have no prefix at all, so they would look like tables to the casual user. Fair game and all that. The complex views were labeled as v_<viewname> as a cue to a user or report builder: don't use it in the join of a larger query. You'd think that if there was an implicit rule to avoid using anything "v_" prefix that people would play nicely. But not so, since your reporting users may have come from a RDBMS background where it's perfectly okay to mix views into the master query. Awareness of the standard is one thing, but actually embracing it is another. We cannot protect our systems from people who either don't know the rules, don't understand them, or cannot map their experiences from an RDBMS to an MPP.

 

So a suggestion here would be to name the view in a manner that is a departure from common view nomenclature. Calling it an sp_(NAME) might draw the ire of your admins who want stored procs named for what they are, and not obfuscate their names. But if our views are not really common views, and have caveats on their usage, we need a safer naming convention, one that aligns with the goal we are trying to achieve - that of adapting the BI tool to the MPP. One group used a naming convention of "bi_", while another used "rpt_", and still another used the common acronym for their given BI tool. The point is to adopt a convention that is somewhat unconventional, so that those with conventional thinking are able to transform their thinking without finding themselves in a minefield.

 

Nothing is worse than overlooking a minefield - it's a scary view - a view to a kill.

0

Whew! The Enzee Universe this year was quite an experience.

 

I would like to offer my sincerest thanks to all of you who attended the Best Practices session held in marathon-form on Monday before the keynote. Over 300 people signed up, and many of you arrived the evening before, and at the end of the session, you were still there!

 

Afterwards we took a checkpoint and then added more material to the powerpoint presentations that we used during the sessions, and these will be added to the content of the Enzee Universe downloads for those who attended.


Some of you also asked me about the music selections we played at the intro and during the breaks. These were selected in terms of "Your Theme Songs", because some of them were from superhero movies, and some from action-adventure flicks. Here they are, in no particular order, the tune, origin and reason for selection:

 

Theme from "The Incredibles" - because we come from a family (The Enzee Universe) of mult-talented superstars

Theme from "Batman/The Dark Knight" - because sometimes we have to work in a thankless role (however personally rewarding, with high-tech toys)

Theme from "Superman" - because to competitors, Netezza is like Kryptonite, and to the rest of us, it solves World Problems and makes us look good without having to wear our underwear on the outside

Theme from "Spirit, Stallion of the Cimarron" - surging music for those who entered the frontier on a Mustang

Theme from "Surf's Up" - probably enough said, here

Theme from "Mission Impossible" - a congenially offered rebuttal to those naysayers who say it can't be done

Theme from "James Bond" - because he, like many of you, is an MTBA - That's Multi-Talented Bad A**

Herbie Hancock - Rockit - because that's what you do

 

I would also like to thank Netezza for the opportunity to share these ideas, many of which I have gathered over the course of my Netezza derring-do from people just like you, so some of the information is what-is-practiced in the field, and some of it is idea-works that we have re-synthesized into practices that seem work well as a sort of "adaptive composite". The objective of course is to share with you what others are doing, to enrich your base of ideas, but are certainly not hard-and-fast rules. The Netezza appliance is one that unlocks creativity, harnessing it for the Good of All Mankind. So guidelines and practices give us more critical mass to solve problems.

 

Likewise I would like to thank Netezza for the Enzee Community Voice Award presented to me on Tuesday night at the Gala, in recognition for being such a vocal supporter. But my words then apply as now "The people and the product create a synergy that's like electric current. I love interacting with the Enzee Community, and being a part of it".

 

I also noted that a larger number of independent consultant/contractors were present on this go-round. In the best practices sessions, there are a wide range of professionals, from those who are Netezza customers, to consultants working for firms, independent consultants, analysts and the like. The Enzee Universe has various video screens constantly running slow-motion surfing videos in keeping with the TwinFin Theme. One day you might be lookin' at that guy on the wave, thinkin' about the Twinfin and wondering if you're on the wave, or just watching it pass.

 

As with all professions, you might be in the zone where your project is just ending, or is about to, and you're wondering about the next great thing. And as you know, I'm always on the hunt for bright architects and engineers, especially in this economy, so if any of you independent types are looking for an opportunity, give me a shout. I also extend the invitation to anyone else who is reading, with the qualified apology that I am not lurking at the doorways to steal away your company's valuable resources. But I have seen in the past that some bright folks find themselves tapping a pencil on their desk, coming down from the exhilaration of a Netezza 'experience" and wishing for more. I can say - the work is out there. I'm often in contact with people who need someone just like you - hooked on the technology. Hey, who isn't?

 

Finally I offer a simple salutation to everyone who "gathered round the grill" this past week, sampled the wares, wishes and whatcha-ma-callits of the various vendors, trainers and speakers, and came away enriched and enabled to dream a little stronger, solve a little simpler, and crush those waves with the shredding confidence of parallel power. So I'll either see you in your natural habitat, interact with you here, or catch up with you in person when the Enzee Universe cranks up another adventure.

0

I know some of you are wondering if I have "gone dark" or something. But hey, there a conference comin' up and I'm on the hot seat!

 

Much of what will be discussed, I could (and may) post here eventually, but for now it's time to burn this stuff into the presentations so all of us have a productive meeting on Day One.

 

Now some of you have also fired off to me some interesting requests for the discussion, most of which I agree with so we'll definitely make room for it. In the end, the whole of the best practice session is not really for you to hear me prattle on about derring-do - although I do have tales to tell - but more to gather the tales that you have to tell, so that all of us are better for when we return to our respective machines.

 

One of the requests is a deeper dive into the Ten Rules of Large Scale Data Processing that I added to Netezza Underground. We could probably spend all day on that stuff - because it can go really deep, like outta-hand deep on some of them. Not to worry, as I am a highly trained professional moderator and promise to keep you on track.

 

Also note that on Day Two there is a special session on TwinFin migration, with some emphasis on Netezza-To-Netezza migrations since some of us are in the middle of these kinds of conversions. I think this will be especially productive if some of the Enzees (that would be you) are present (and not shy!) so that the attendees can wrap their heads around the overall process and what to expect.So that's another invitation to not only bring your notepad, but your notebook (the one that you use for design sessions) and give us a little non-proprietary insight on how you conquered the world (or at least, your corner of it).

 

Not so strangely, the overall, high-level methodology for moving from an older Netezza to a Twinfin does not change (compared to moving from say, Oracle to a Twinfin), but the challenges can be multivariant. It is these challenges, nuances and subtlties that make life fun - so bring some of them with you! We'd like to hear the adventure.

 

And yes, the rumors are true. I have come under intense pressure to publish yet another Netezza book, and this was evident and underway as of last Fall. The real problem is in deciding what goes into the book and what should be left out - but the left-out stuff is valuable too. I think the consensus now is that the content needs to be divided into two books, and publish them both in a rapid fashion. Hey, it worked for Back to the Future, The Matrix and Lord of the Rings - but I don't have any delusions of such wild success - I just need to get them into print.

 

I am working right now on a mongraph concerning inheritable objects, namely views and such, that in some ways do, and some ways don't translate into systems of scale. After all, once the data gets really big, the physics becomes even more critical. Any inefficiency will only get worse over time. Okay, that's my hook to pique your curiosity. More later!

1 2 3 Previous Next