Skip navigation

Gather 'round the Grill

2 Posts tagged with the staging tag
0

One of the questions oft-asked in best-practices sessions and in general consulting: How do we get a "newbie" on-boarded quickly? Some concern usually arises when the new Enzee approaches the Netezza machine with the same thinking processes as with a traditional RDBMS. While there are "gross" similarities, it is the differences we want to leverage, and these are not either/or questions. There is a better way to implement things in Netezza, and a better way in the traditional RDBMS. Mixing the two is not optimum and can be detrimental.

 

The primary discussion fulcrum is simple: One is a transactional database and one is not. Moving away from "transactional thinking" is the key. How to accomplish this?

 

One of the best ways is to discuss and actually demonstrate the primary differences between bulk and transactional processing. As this is largely the crux of misunderstanding, or even the necessary "paradigm shift" our newbie needs to embrace, a significant hurdle it seems, is the newbie's belief that the core engine functionality of their favorite RDBMS is somehow being indicted or set aside as useless. After all, the transactional RDBMS is just that - transactional - and this is what we want the newbie to move away from. What? All that hard-won and industry-hardened capability - and we're just setting it aside? Really?

 

In a word - Yes.

 

It's not that the transactional capabilities are useless. They simply aren't useful in a data warehouse. More importantly, they don't even exist in a Netezza machine. So attempting to shoe-horn transactional thinking into this machine is a huge disconnect - no differently than using a lawnmower as a hedge-trimmer. Netezza is purpose-built. Transactions are missing by design.

 

Now at least one person is bristling because they know, administratively, that transactional support is handy for logging, managing metadata, troubleshooting hooks and other administrative support. I don't disagree with that, but it's not the activity of bulk data processing. It is far easier to set up a smaller database machine alongside the Netezza machine to perform these administrative transactional tasks. Each machine then has an objective role and purpose, and off we go.

 

What are some of the demonstrable ways that we can introduce the new Enzee to this issue, in a manner that really drives the point home? Well, I can't seem to count how many times I've had (sometimes rather contentious) discussions with "outsiders" (or perhaps "purists" ) on the subject of transactional exception handling. Inserting a record into a transactional database, with its glorious constraints turned on, will guarantee that it will pushback on us with an exception. Said exception requiring the dutiful compliance of an exception handler. You know the drill.

 

But in data warehousing, such transactional exceptions are in the way of our bulk load. We don't want the database to examine each and every record as it arrives, potentially formulating an exception (and its attendant overhead) for each record, or passing each one through after its constraint-based integrity check. We just finished taking all that data through a detailed sieve of business rules in the ETL layer, didn't we? The database needn't trouble itself, just load the data, thank you.

 

Now at least one more "outsider" is bristling. How dare you say that we should set aside the constraint-based exception handling? What possible justification could there be for such a gross trampling of RDBMS functionality? Explain yourself!

 

In a word - performance.

 

Storytime: Just after 9/11, the airports over-compensated with all kinds of rigorous shakedown protocols. Travelers had to show a boarding pass and ID at a checkpoint, then keep them handy for just after the checkpoint. And then also for presentation at the gate prior to boarding, along with random bag searching. If you were the first one to board, or made eye-contact with the bag-search team, it was guaranteed that you would be taken aside and your luggage rummaged. A friend of mine told me that the rummagers liked to carry on a conversation to make you feel more comfortable about their pulling your private things out into the open air for all to see. One of them held up a nose hair trimmer to one of his cohorts and said What the heck is this? Makes one wonder what other kinds of personal appliances we could "salt" the bag with just to embarrass the daylights out of them, hmmm?

 

My friend told me that he was pulled aside a lot, and started experimenting with "stated professions" that the rummager would not care to talk about. At one point he blurted "I'm a professional bodyguard" to which the rummager alerted like a trained narcotics dog and said "So you would know how to use weapons?" to which my friend simply said "Or not."  This of course made the rummager gulp and go quiet, but it still wasn't good enough. My friend didn't want them to talk at all, so they wouldn't waste any time in their rummaging and just get-it-over-with. So at one point he said "I own a funeral home." Which of course, stopped the chatter completely. Nobody really knows how to continue a casual conversation about such a subject.

 

The point being, he'd already had his bags electronically scanned at the checkpoint. Do we really need to check it again? And unlike a constraint-based exception handler, the rummager had the option of only picking out random hapless travelers. The exception handler rummages the bags of every traveler in the line. We can see how utterly inefficient this is. Nowadays, they screen the bags and then don't even check ID again at the gate. Except for random gates on occasion because nefarious people sometimes swap tickets when they get behind the checkpoint. In any case, if we've already exhaustively checked the bags to get the traveler where he is, more checking is a waste of everyone's time. Just like the exception handler. If we just ran the entire set of data through rignorous validation rules, we have no need whatsoever of the transactional exception handling in the database. It will waste processing time.

 

And wasting time, we don't have the luxury to do.

 

The transactionally-constrained bulk-load of data will be, on average, five to ten times slower in operation than its non-constrained equivalent. If our objective is to achieve a fast load - and trust me - it really is - we don't want constraints turned on. We're talking about loading millions if not billions of records. Even in an RDBMS, we cannot afford to convert what could be a thirty-minute operation into a two-plus-hour operation. The window of time simply does not exist. In some locations, if this kind of window ever existed, it is rapidly vanishing as their businesses go-global and need to process data as-the-world-turns.

 

On the flip side, think about the main reason for a transactional exception - it is to keep a transactional application honest. If the data does not comply, the user fixes and re-submits. It's interactive, and it deals with a single entity at a time, not millions of entities at a time.

 

The "outsider" will now brace on this assertion as well, because they think that having thousands of users interacting with the system constitutes this many-entities-at-a-time, but it simply doesn't. And here's why: RDBMS systems are meant to assimilate data in small chunks with high frequency. They are not designed to deal with large chunks at low frequency (e.g. a batch load once a night). They will accommodate such activities, but not do them well. In this case, "well" means loading a million rows a second. The RDBMS cannot approach this.

 

And this is the reasoning behind the Rule #10, which is - when loading bulk data, never involve the database in row-level activities. This means, without exception, turn off the exception handling. Because the database will just protract the duration of the flow, checking each and every record and slowing down all of them as a whole, in order to find the few exceptions. It is the equivalent of making the entire flow suffer for the sake of a few records. this is a bad tradeoff. And once again - didn't we just validate and scrub all these exceptions from the flow, in the ETL/data processing environment? Why are we asking the database to validate them again?

 

And the worst part, is that the potential exceptions are all anticipated and known. What does this mean? As a back-end programmer buildng the data flow, we have direct and objective access to each and every failure point that will stop the data from loading. Why would we delegate this to the database, since it is so inefficient in performing it? Note - not so lacking in functionality, because the RDBMS has lots of functionality to perform it. It is simply too inefficient with bulk loading to be a viable resource.

 

So what are the anticipated exceptions? Let's go for popularity:

 

  1. Bad or null data
  2. Unique key violations
  3. Primary/foreign key violations

 

 

In fact, the above constitute the primary reasons for the load to fail. So lets walk through the basic process we would need to follow if we delegate this to the RDBMS database.

 

In transactional mode, the RDBMS data load will kick out an exception for each of these it finds. Even if it completes with no error, someone in the room will say -It took too long. Even if it didn't find any exceptions.Fix it. Make it faster! The hard-core transactional engenue will attempt to optimize it without turning off exceptions, and find that it cannot be done. If the load of one record requires 1 second, it will take 1 million seconds to load 1 million records.

 

We just don't have 1 million seconds.

 

(Incidentally, in Netezza, the load of 1 record requires 1 second. The load of 1 million records requires 1 second. Use your second wisely!)

 

So our new Enzee will grouse a bit and then look around at the data warehousing sites for answers, and all of them will say, turn the RDBMS exceptions off, load the data, and then turn them back on. The newbie will object - but wait - when I try to turn them back on, the database yelps and says there are constraint violations. I will have to back the records out and try again. Oh yes, we now have a mess on our hands. In the time it took to load the RDBMS data - say thirty minutes - we have now accumulated errors that might take hours to back out, fix and then retry. And we'll have to do it while the batch-window clock is ticking, not in a pre-process where we had more breathing room. We don't have this kind of time window.

 

We will never have this kind of time window.

 

So the next fallback is to fix the exceptions in the data processing realm (ETL tool), prior to loading the data. But isn't this what the data processing realm is for? Really? This means we do all null checking and constraint checking prior to loading. How? We download the primary and foreign keys into the local data processing environment and perform a localized join-filter to remove the exceptions. This is a data warehousing 101 best practice.

 

The newbie will now brace on the idea of downloading all the key values. All of them? That could take, well, it could take a long time!  It will take mere minutes to pull down all the key values. And those mere minutes are nothing compared to the duration of the recovery mess we will endure if we don't take this step.

 

Pay a little time now, or a lot of time later. Use your time wisely.

 

So here is the tradeoff (again in traditional RDBMS space)

  1. Turn off constraints, load the data, and then deal with the mess after the fact. Plan to spend hours backing out the mess and then running the load from scratch.
  2. Download keys, join/filter the key exceptions, turn off constraints, load the data with the expectation that no mess will arise. (because it won't)

 

In short, downloading the keys for constraint checking is a necessary evil. Our only "next best" fallback is to load the data into a pre-target staging table and do the gross comparison there. Then we copy the good records into the target table. But wait - now we've incurred the penalty of the load twice (one for the ETL to staging and one for staging to target). Isn't it cheaper to pull down the keys once than it is to load all the data twice? Not to mention the fact that the average RDBMS engine does not efficiently copy tables either. So even if we decide to go with loading a staging table, the copy of the staging-to-target will take longer than we are willing to wait.

 

Think about this: When the data exception arises in (1) above, where will we fix the problem? In the database, or in the data processing realm? The database can only report the issue, not fix it. If we must fix it in the data processing zone anyhow, why woudn't we fix it proactively rather than reactively?

 

So this approach means something even more valuable - if we find the exceptions in the data processing realm prior to loading, we will have found them proactively and administratively, not
reactively and operationally.

 

This makes a huge difference in the reconciliation of data exceptions when we're dealing with millione or billions of entities.

 

And yet another issue our newbie is pleasantly unaware of - data processing on this scale has to be beholden to the constraints of the lights-out operation, administration, and logistical capabilities of the physical plant around the machine. If the operators have to get involved in the data recovery, with data processing on this scale, it needs to be for incidental reasons, not mass data recovery.
In essence, delegating this activity to the RDBMS, is setting up our operators to fail. We will find them entirely intolerant of this approach. Fix it, they will say. If our answer to them is - hey, me architect, you operator, so gird up thy loins and get thee to work - we have punted (and dangerously so) something we should take complete responsibility for. Because make no mistake, we will be held completely acccountable for it as well. They will call us in the middle of the night. They will only help us incidentally. It's your mess, you clean it up!

 

The primary issue here is that the traditional RDBMS load has to be not only load ready, but consumption-ready. When we load the data, we have to be completely and thoroughly finished with all data processing before it hits the target table. From there, the user should be able to consume it right away. Load-ready and consumption-ready is the name of the game, and it's accomplished for the RDBMS in the ETL environment, because it cannot be efficiently accomplished inside the RDBMS. The RDBMS is simply too slow and inefficient for any form of bulk operation. And again I say, if the only place to actually fix the data is the data processing realm, it only makes sense to do it proactively, not reactively.

 

Now let's flip over to the Netezza side of things.

 

In the Netezza machine, we can stage the data "dirty" if we want to, and we often do. The data can essentially be copied as-is from its external dirty location directly into the machine with an nzload to a staging database. From there, we have it in massively parallel form and can use a series of CTAS operations (ELT-style) to cleanse and shape the data. Once we're ready, when can then do a massively parallel join from the incoming table to the target, validating primary and foreign key values in bulk. Then we just copy the good data and we're done. When using Netezza, it is always faster to let the machine do the data cleanup and integration in a massively parallel, set-based operation (even a series of them) than it is to pull the data out, process it in an ETL tool, and put it back. ETL tools, on average, cannot compete with the massively parallel power of Netezza's engine.

 

Let's look at what we accomplished with little effort: (1) We cleansed the data of dirt. (2) In a single, massively parallel join we validated unique constraints.(3) In a single, massively parallel join we validated foreign keys (one join per key). The total time to accomplish the second two tasks is fractional, often a matter of minutes even on billion-row tables and billion-row loads. The time for the first task is shrunken too, since we can apply our row-level data scrubbing rules in-bulk with sweeping operations rather than row-level operations.

 

Case Study Short: Working with a SQL-server based model, the client was loading 15 million records into the database with the bulk loader and the largest machines available. Total time to load - over 2 hours. Tried it again on an Oracle platform, with a top-line 16-core machine with plenty of high-end disk space. This operation took 30 minutes. This was attempted on a Netezza platform, same data, same volume, and it took 15 seconds. There is a contrast, but not a comparison. Nothing adequately compares to a 15-second data load.

 

The important takeaway is this: If I can load the data in 15 seconds, I have a luxury of time to perform internal ELT, data scrubbing and integration, key checking and the like in a matter of minutes, still ensconcing the data into the final target table before the other two databases even get started. More importantly, I did it without standing up a formal external ETL tool. All of it happened "under the air" of the Netezza machine.

 

Now an interesting exercise for the new Enzee would be to actually walk through the processes noted above. In a problem-solving series of exercises, they should get some data that has embedded constraint violations, then attempt to load the data to an RDBMS with transactional constraints turned on, then turn on the creative juices to see how it can be done more efficiently. I would not suggest loading millions of rows to an RDBMS for this exercise, since they are so inefficient at this. Try it with a smaller row-count and then extrapolate the necessary time-to-load. What they will discover is that they will find themselves slowly backing out their precious transactional exception handling to fix the problem another way. The faster they get, the more the the chosen path will start looking very lean on RDBMS capabilities.

 

The final form of their solution, they will find, is supported de-facto and in massively parallel inside the Netezza appliance at no additional charge.In the end, they will see why Enzees have run, not walked to a Netezza platform for just this kind of capability. We know they have made the transition when we can hear them having a conversation with another newbie about transactional versus bulk processing, and they are coaching the newbie away from the transactional model.  Ahh, a beautiful thing, indeed.

 

This is why Netezza is in no way, no how a transactional machine, and why it doesn't enforce primary and foreign key constraints. These can be installed as metadata, but the expectation is that they will be used by an external, intelligent operation that will leverage them for administrative key validation - in bulk. After all, I can read the key metadata from the Netezza catalog, formulate a series of validation operations that will work for any table, any key, any time. Install it as a stored proc and invoke it when necessary. This allows me to set up the load operations and prepare the final copy to the target (which is often the accumulation of dozens of operations to integrate the data into a common pre-target table). Then validate the data just before it is finally copied to the target. This keeps me from having to do it a record-at-a-time, or to have an exception processor accidentally execute the operation before I am completely finished formulating the data for the load.

 

Row-level exception handling is a beautiful thing - transactionally. If the domain where the exception must be fixed is already the data processing zone, we need to proactively embrace this responsibility and just do it. In the end, row-level exception handling has to be completely removed from our thinking processes. We need to invoke sweeping operations that capture the exceptions in-bulk, not a row-at-a-time. Fix and integrate them in bulk, not a row-at-a-time. Bulk is the name of the game, and always has been.

0

Honor the Host

Posted by David Birmingham May 26, 2009

Some enterprises will stand up a Netezza machine and point all their data processing towards it. They wouldn't think of actually installing anything on the Netezza machine (such as database clients or other client software) and of course, are strongly advised against by the vendor. Why is this? The Netezza host has a lot of work to do in keeping those spinning SPUs happy and busy. Adding other duties can detract from this critical mission, and we don't want that.

 

But we can also abuse the host in subtle ways. A case in point follows - you may have other tales to tell.

 

We always have a need to pull in a wide variety of files. In this particular case, dozens of intake tables in their various staging locations. In many installations, the intake table definitions are few, discrete and stable. But in just as many, the staging tables will mirror the upstream sources, with one table for each upstream interface. In our case, handling source-to-target with no ETL in between. We extract directly from the source into an intake table definition that mimics source column names, but the data types are all varchar to facilitate "dirty" intake. The objective is to get the data into the machine.

 

Then we convert this intake table to its final form, the internal Netezza table that is identical to the source table in column name and type. This conversion is a simple table copy, mechanically speaking, but we have to do some light ELT to make it happen. For example, we need to guard against nulls, empty strings, bogus numeric values and the like. In our case, numerics could be dozens of characters in width because the upstream definition happened to be a view with no defined precision. A typical intake SQL could look like:

 

select

case when column is null then value else column end,
case when translate('-+.0123456789','') = '' then column else null end,

etc

 

Such that each column is wrapped with this kind of logic (call it "Intake ELT"). Now, we don't manually wrap these column defs, we do it dynamically from the Netezza catalog definition. (And for efficiency, we cache it for later reuse, but that's another story).

 

Now we have an intake-ELT that looks thus:


External Database Table -> network -> intake table ->  Intake ELT -> Staging Table

 

Note for clarity - the External Database Table and Staging Table are "book ends" to this operation, and have the same column names, data types and column order. We don't absolutely require common column ordering, but it's handy for troubleshooting.

 

Note also that this works just as well for flat file intake as database intake. Better, in fact, because we can more easily load multiple files at once than multiple tables at once (the database might not like multiple extracts)


All of this worked swimmingly until we encountered a slightly different kind of data feed, one that had to be extracted from an archival source into flat files. Rather than present the flat file as normal (on the network) the admins decided to use the available on-board Netezza storage pad (5 TB of space). Keep in mind that we were not allowed to execute anything directly on the machine, so we had to set up External Tables on top of these files to load them, rather than using NZLOAD. This, too, worked transparently and all was well. Then a "bright idea" occurred, that in the above equation the Intake ELT faced a table (our intake table) and couldn't we just use the intake ELT right on top of the External Table, eliminating the additional middle-man?

 

Like so:


flat File -> External Table -> Intake ELT -> Staging Table


The above configuration only appears more efficient by eliminating the Intake table. Looks are quite deceiving, cconsidering how much "per-column work" the Intake ELT had to perform to get data into the Staging Table. What is not obvious, is that the Intake ELT is now sitting on top of the External Table, which is a Host-managed table, not a SPU-managed table. In this configuration, we have reduced our power from a 108-SPU problem to a 4-(Host) CPU problem. The immediate loss of power was measurable in orders of magnitude.

 

So under the covers, here's the power-plant difference in the two models:

 

External Database Table -> network -> intake table ->  Intake ELT -> Staging Table
                          |----HOST -------------|---SPUs--------------------------------|

 

flat File -> External Table -> Intake ELT -> Staging Table
             |------HOST ------------------------------|


So we can see that the second model is abusing the host with the Intake ELT, and if we go with the original model, the ELT will be handled by the SPUs, offering the necessary scalability and power. In a continuum, we can see where we might initially install nzload or external tables and perhaps "tweak" them along the way. Then a maintenance developer comes along and sees that the "easiest" place to add a fix is in the external table or the nzload rather than pushing it to SPUs. The external table and nzload can (and should) do light-intake formatting per their interface specifications, but no further.

 

The over-arching directive remains the same - get the data into the SPU-based tables as rapidly as possible and then do the "dirty-work" with massively parallel power.