[ www.netezza.com ]
0

One of the most significant challenges of ELT-based processing is the need for housekeeping infrastructure. I mean, we will find ourselves needing temporary tables, and that's okay. But we'll also need persistent temporary tables - that is - tables we create as processing resources in context of a given set of operations, that we might keep coming back to. Or that we might want for troubleshooting. We have to admit, a truly temporary table that evaporates at the end of the session is handy for housekeeping but lousy for troubleshooting. In many ways, the "temp table" should be a means to organize our immediate thoughts, like de-duping a resource list or whatnot. Utility-stuff makes them handy. But when we need to debug intermediate results, we need a persistent table. And then we need to ditch it, with some rules.

 

If we run things in shell-script, whether inside the Netezza Linux host or on a companion Linux box, we'll almost always need temporary files as well. Once we create these, we need a way to get rid of them. And upon creating any of these resources, we need a systematic way to keep them safe, meaning a completely unique way of identifying them. For temporary files, we have many options to timestamp their file names, but for a database table this can be somewhat daunting, considering that we could litter our database with lots of orphaned tables in no time flat. The last thing an operator wants to do is go into a littered database and clean out the trash.

 

The latest release of the Netezza environment provides for stored procedures, which will formalize an invitation (for some) to black-box everything inside the black box. Do not fall into this trap. The black box is only colored that way, it is not intended to be used that way. In the end, we might have componentized stored procedures that do a lot of handy stuff for us, but if their activities are too hidden, we'll hear the same complaints from operators as we hear now when someone violates Rule #10 and performs bulk-data-processing inside their RDBMS, usually with a stored proc. Don't go there. Use the stored procs in peace, but be kind to your operators. Then they won't call you at midnight for answers.

 

So it goes with any operational scenario, but if we embrace some simple housekeeping hooks in a frameworked sense, we can avoid all these woes and pitfalls. The clear objective is to get more business functionality and solutions "under air" - that is - let the black box do what it does best - crunch and munch the data at high volumes.

 

A framework is a systematic way to startup a process, provide common resources for the process, allow the process to consume and leverage the resources as a foundation, and when the process completes, the resources are torn-down and tossed. This allows a given process to simply request a resource from the framework, with the expectation of delivery of course, without having to worry about giving it back, tearing it down or anything else in a housekeeping sense.

 

However, just to keep the operators happy, there is yet another practical means to avoid this headache, and this is to provide more than one database to support the data flow. Many of you are aware of the simplest form of this, where we have an intake database, a workspace database (for transforms) and a target database such as a repository or mart.

 

Staging -> Transforms -> Reposit

 

What this means is that we can intake data from any number of sources and assume that the information is in an unknown or dirty state. We can then apply large-scale transformation (joins, rollups, etc) to the data in massively parallel form. Once completely done, we commit our work to the target with simple table copies. We can afford a table copy in Netezza because of its ability to move data rapidly at the SPU level. Given the right distribution, copying data is a minor penalty when we see what we get back for it:

 

The ability to control the transformation process in a safe zone so that if it fails or corrupts, it is never committed to the repository. We use this technique in ETL all the time, pull data from a source, transform it and prepare it for insert, then commit it. We need to embrace this for ELT as well, because we need to protect the target from corruption with intermediate results from an incomplete operation.


Once we assume the need for such a backbone, we don't really have a lot of high-functionality to support it. After all, if shell-script can help this along, I suppose any of you could provide a frameworked model in Perl, Java or .NET. But might not want to launch an entire development environment just to support these simple activities. They seem simple because they are. Shell script is good because it is inherently a control language and does not tempt us to do a lot of programming. Since we don't need a lot of programming, this is good.

 

When we launch an ELT framework, we need a number of different items to provide context. Once is a reference to the aforementioned databases. Understand also that if we have a database that is for our personal development use, it can behave as all three. Once we move toward integration, we would break apart the references and make sure nothing else breaks in the process.

 

We'll need a holding place for end-of-run teardown. Like a teardown-file we will execute at the end of the run. In this file, we will simply export our commands to tear down the resources we create, as we create them. So if nothing is created, there is nothing to tear down. At the end, we just execute the teardown file and it does the trick.

 

Now we'll need a way to create assets both as a local resource and as a more visible one. In bulding a local resource, we might create-table, or create-view or create-synonym etc and then call a housekeeping functon with the database name and the resource name so that it's marked for teardown. this is as simple as an nzsql "drop table tablename" statement. Either way, we drop the resource at the end of the run.

 

We'll also need the more visible form, that is creating a local asset in our Transform database that is identical to the target version of the asset. CTAS magic works for us here, in that we offer up several parameters to a given function, such as the table-to-create name, the "real" table name and the target database. For example, if our transform database is a my_transforms, and the target database is my_target, with a table name of my_table, we would want to create a table thusly

 

create table my_table_temp as select * from my_target..my_table limit 0;


But we can see a limitation here. We cannot run an application with multiples of these without accidentally stomping on each other. If another thread of this ELT stream is launched (say this one is running behind) it will attempt to create my_table_temp as well, and will fail (but might think it succeeded, the table is there after all) and start to write its data into the same resource that is being used by another thread.

 

No, the most appropriate way to deal with this is the simplicity of the AUDIT_ID, a bigint value that we capture at the beginning of the framework's run (and yes, it's a simple sequence value we pull from a sequencer on the Transforms database).

 

this in hand, we now create with confidence:


create table my_table_temp_$AUDIT_ID as select * from my_target..my_table limit 0;

 

Now we have our very own copy and no other thread, even for debug, will gain access to it. What do we do after this, is we simply echo the drop command to the houskeeping file, and we're done:

 

echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file

 

At the end of the run, we'll execute each item in the housekeeping file and this will tear down all the assets we created.


Conversely, we could echo this to another houskeeping_file such as:

 

echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file_operator

 

And now we have a way to keep the tables around without losing track of them. The "drop" statment is logged to another external housekeeping file that we will not execute at the end of our current run. Rather, the operator can execute the file once a night or once a week, or whatever, and guarantees that the assets can be dropped without any surgical activity from the operator.

 

We'll do something else with this table name, though, how ahout a FINALIZE file that keeps track of each of these assets? After all, we pulled the definition from the Repository as a means to fill it with data that is destined for the same table in the Repository.

 

 

echo "insert into my_table select * from transforms..temp_my_table_$AUDIT_ID" >> $FINALIZE_FILE
echo "generate express statistics on my_table; " >> $FINALIZE_FILE

 

Now if we never make it to the end, we will toss the FINALIZE_FILE and do nothing. Otherwise we execute the FINALIZE file and commit the temporary tables to the target.

 

And the order of execution is, of course, FINALIZE first and HOUSEKEEPING second!

 

Another simple suggestion is that we make this call into a formal function, such as

 

MY_TABLE=$( create_table_from  "my_table"  "temp"  "$AUDIT_ID" "my_target" )

 

now we have the variables aligned - the source table in my_target, the "temp" prefix and $AUDIT_ID suffix. We also have something else, as we can now reference our new table in simpler form, following:

 

insert into $MY_TABLE (
column1,
column 2
)
select * from my_source;

 

but what if the "my_source" here were created in yet another similar thread - itself creating a temporary asset:

 

insert into $MY_TABLE (
column1,
column 2
)
select * from $MY_SOURCE;

 

In this case, the value of "MY_SOURCE" is actually "staging_database..my_source". But we don't need to know that, do we? What if we loaded up another source table called "my_source_test" ? Could we now point the variable $MY_SOURCE to this new table and it remain transparent to the above ELT SQL statement? You see where this leads? Flexibility, portability. troubleshooting etc - all because we embrace a framework that is letting us out of the box.


We can build up resources and not worry about whether they exist locally, remotely, in staging or whatever. Another simple aspect of this approach is this - if we don't get rid if the resource(s) at the end of the run, or the run aborts prematurely for any particular reason, all of the resources we have already built - remain bult and filled - we don't need to start the ELT from the beginning.

 

Let's say we pull from ten different sources, integrate the data with a workload that takes about an hour (hard churning on billions of records) toward a final target of five reporting tables. If we get to the end and the last table has to abort for any number of reasons, we quit and don't commit. However, our threads are in a condition that allows us to restart from where we left off, and not repeat all that work again, with the added benefit that nothing has been committed to the repository - yet. So we don't lose all the processing time, we just pick up where we left off. Such a scenario requires a manual restart, but the primary takeaway is the ability to checkpoint our work de-facto without ever corrupting the final target. Netezza gives us the power to do these things inside the machine.

 

How, you might ask, do we determine where we left off so we can pick right back up? Hey, I'm out of space on this one, but later -

0 Comments Permalink
0

When punting data around inside our magical machine, one may wonder how to keep track of it all. Some will eschew ELT because it boils down to a pile of SQL statements, and it sometimes feels out of control. Control of course, is what we make of it. Even a well-defined development product is no match for someone who doesn't like controls.

 

However, we know this really does boil down to insert/select combinations like so:

 

Insert into Mytarget (

column list here

)

select

yet another column list

from some tables using complex join and filter logic

 

It seems we have a handle on the top and bottom, but the "select" clause is where the primary transform and integrations are applied. Things can get really ugly here, especially if we're moving from one legacy platform to another. Our select-statements will look very hairy, indeed.

 

The "insert" clause is largely along for the ride.

 

Now it doesn't seem likely that this could get out of control until we're presented with tables that contain, say, a hundred columns. Or even fifty, or say twenty-five. Just enough you see, to keep them from appearing on the same editor page. We might want to add a column to the mix. Hey, add it down below in the select - and make sure you add it in the top to the insert - and don't get anything out of order! And what of the columns are misalgned - data corruptions are a higher-than-everage danger here.

 

It feels a little primitive, but all we really need is some assistance on the source-to-target mapping and we're good to go. It's impractical to do a source-to-target with unweildy insert/select statements, so let's apply a little Netezza magic. Now, considering that the cost of an ELT statement can sometimes run into minutes of execution time, sacrificing a few extra seconds up front, just to support our weary eyes.

 

Let's say we automate the scenario a little bit. I have a table called customers that I want to roll together from our old customers and customer-properties tables. The target table is a reporting table with denormalized stuff to support our ad-hoc folks with a lot of pre-calculated goodies. Once we have the calculations, we want to put them into business visibility.

 

insert into Customers (

customer_id,

first_name,

last_name,

most_recent_purchase_dt,

total_purchases_ctr

lots of other columns here

)

select

c_id,

f_name,

l_name,

max(b.purchase_dt),

sum(b.daily_purchase_ctr),

lots of other columns here

from

  old_customer_table a,

  old_properties_table b

where a.customer_id = b.customer_id

group by a.customer_id, l_name, f_name, etc;

 

 

We can readily see that this very typical SQL statement is doing some heavy lifting for us, just like we want it to do inside the machine. But what if the inser/select clauses had a lot more columns? It wouldn't take much for this to feel nervous about its maintainability. What if we have to interate another table to the mix? Left outer joins? The Select clause has pretty much unlimited potential for complexity.

 

ANSI SQL supports aliases, so let's run with that. We have our source columns in the Select and the Target columns in the Insert, so let's align them thusly (I'll just use the first few for brevity)

 

 

select

c_id                                      customer_id,

f_name                                  first_name,

l_name                                  last_name,

max(b.purchase_dt)               most_recent_purchase_dt,

sum(b.daily_purchase_ctr)      total_purchases_ctr,

$AUDIT_ID                            audit_id

from

  old_customer_table a,

  old_properties_table b

where a.customer_id = b.customer_id

group by a.customer_id, l_name, f_name, etc;

 

 

And lo, we have the makings of a source-to-target map. Don't we? Of course - the Insert-columns appear on the right, ready to functionally redefine the souce values on the left. We do this all the time, don't we? But largely for spontaneous reports and the like. Let's look a little further, because having something like this in open-text doesn't really benefit us.

 

By circumscribing it with a "cat" we can gain two major benefits without sacrificing clarity - one is the ability to put the SQL statement into a place where we can use it, and one is to provide a means to resolve any $ variables that happen to be in the SQL statement. Note the use of the AUDIT_ID variable.

 

 

MY_SELECT=$( cat <<!

select

c_id                                      customer_id,

f_name                                  first_name,

l_name                                  last_name,

max(b.purchase_dt)               most_recent_purchase_dt,

sum(b.daily_purchase_ctr)      total_purchases_ctr,

$AUDIT_ID                            audit_id

from

  old_customer_table a,

  old_properties_table b

where a.customer_id = b.customer_id

group by a.customer_id, l_name, f_name, etc;

 

!

)

 

Okay, now we have some options - so let's try this:

 

nzsql -a <<!

$MY_SELECT  limit 0 ;

!

 

this lets us test the SQL statement, but only to make sure we formatted it right. Now let's do something more useful:

 

 

nzsql -a <<!

create table temp_target as $MY_SELECT  limit 0 ;

!

 

now we have a persistent table in catalog, with correctly named  and sequenced columns that align with the select statement. Note that the columns on the catalog will also have expected data types, which we could check against the target table's data types for consistency, but for now we just need something that the system will accept without complaining.

 

I'm a big fan of letting the Netezza environment do the heavy lifting. We could set up a parsing function to rip through our SELECT statement and find the alias'd column names, but this will fall apart with the more complex SQL statements. We already have a highpowered SQL parser at our disposal, don't we? And doesn't the CTAS have a thousand-and-one uses, after all?

 

Let's do a CTAS like the above - with "limit 0" - meaning that it won't do any real processing work, but will give us the power of its parsing engine to find the target columns with the added benefit of registering them by name and in the proper order - but to a temporary table

 

 

Now let's put the CTAS together with a way to pull the columns off the catalog - I'm throwing this to a flat file for debugging, but you probably know how to stream this directly into a loop - to follow

 

nzsql -A -t -o outputfile.txt   <<!

create temp table temp_target as $MY_SELECT  limit 0 ;

select attname from _v_relation_column where name=upper('temp_target');

EOF

 

Now let's pull this file into a quick loop,

 

M_SEP=""

foreach line in outputfile,txt

do

INSERT_STR=$INSERT_STR $M_SEP $line

M_SEP=","

done

 

 

or how about

 

INSERT_STR=$( nzsql -q -A -t  <<!

create temp table my_temp as  $MY_SELECT limit 0;

select

case when attnum = 1 then ''

        else ',' end ||

attname

from _v_relation_column where name = upper('temp_table') order by attnum ;

!

 

INSERT_STR="insert into TARGET_TBL ("${INSERT_STR}")"

 

 

 

 

 

and form it into a string INSERT_STR that looks like this:

 

 

   customer_id

,  first_name

,  last_name

,  most_recent_purchase_dt,

,  total_purchases_ctr

 

Now what?  Execute the the Insert?

 

nzsql -a <<!

$INSERT_STR  $MY_SELECT ;

!

 

------------------------------------------------

 

If we put the above activities into a bash function call we would find a setup like this:

 

nz_insert_from_select()

{

 

put all the above activities in here

 

}

 

 

-------------------------------------------------------------------------------------------------------------------

So here is what we would implement for any given ELT - we get a visual source-to-target map

 

MY_SELECT=$( cat <<!

select

c_id                                      customer_id,

f_name                                  first_name,

l_name                                  last_name,

max(b.purchase_dt)               most_recent_purchase_dt,

sum(b.daily_purchase_ctr)      total_purchases_ctr,

$AUDIT_ID                            audit_id

from

  old_customer_table a,

  old_properties_table b

where a.customer_id = b.customer_id

group by a.customer_id, l_name, f_name, etc;

 

!

)

 

 

mret=$( nz_insert_from_select  target_table "$MY_SELECT" )

 

 

 

------------------------------------------------------------------------------------

 

 

So in ELT space, one of the keys is to balance how much we need to program versus how much is already programmed for us - in the Netezza parsing engine for starters. Catalog-hits are inconsequential when compared to the functional benefit we achieve, and the visually-aligned columns names even for very large tables. We can then add or delete columns from the ELT by adding or deleting lines in the Select. We dont have to align the columns on the top (Insert) and bottom (Select) because they are side-by side - and we know exactly what is going where.

0 Comments Permalink