Skip navigation
1643 Views 20 Replies Latest reply: Aug 6, 2009 2:28 PM by Paras RSS 1 2 Previous Next
tonypulladan New Enzee 9 posts since
Jun 10, 2009
Currently Being Moderated

Jun 18, 2009 3:04 AM

Hi all, Can somebody please throw some light into how netezza handles simultaneous multiple updates to same table?

Hi all, Can somebody please throw some light into how netezza handles simultaneous multiple updates to same table?

  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    Netezza forces serialization.  Only one session can update a table at a time and it must commit before another session can update any data.  INSERT however can be performed by multiple sessions at the same time.  Put in simple terms Netezza does not have row level locking, only table level locking.  Two attempts to update the same table will be serialized automatically (the 2nd pauses until the first completes).

     

    Data loads/updates on Netezza should be done by inserting data into a staging table using one or many parallel insert processes and then using a single SQL insert in a single transaction to merge that data into the target table.  I attached a script that I wrote a few years back that makes the basic process of loading data into a staging table and then into the target table pretty simple.  It requires that the target table must have a primary key defined.  The script does not do updates to the target table, it will replace any records in the target table that have the same PK value as records in the staging table.

     

    The process is basically this:

     

    Something like:

     

    begin transaction;

    delete from targettable where (a,b,c) in (select a,b,c from stagingtable where data_load_id=$LOADID);

    insert into targettable (a,b,c,d,e,f,g) select a,b,c,d,e,f,g from stagingtable where data_load_id=$LOADID;

    commit;

    or maybe you want to update / increment data, you can write a custom script something like this:

     

    /* create temp table to merge data between target and staging table */

    create temp table temptable as

    select

         t.a,

         t.b,

         t.c,

         t.d + s.d,

         t.e - s.e

    from

         targettable t, stagingtable s

    where

         t.a = s.a and

         t.b = s.b and

         t.c = s.b and

         s.data_load_id = $LOADID;

    /* insert new records into temp table which do not exist in target table */

    insert into temptable

    select a,b,c,d,e from stagingtable

    where data_load_id = $LOADID and (a,b,c) not in (select a,b,c from temptable);

    /* now load final resultset into target table */

    begin transaction;

    delete from targettable where (a,b,c) in (select a,b,c from temptable);

    insert into targettable (a,b,c,d,e,f,g) select a,b,c,d,e,f,g from temptable;

    commit;

    Attachments:
  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    Just as an additional comment, this type of process is extremly fast.  I have one nightly data load which loads a bit over 200 million records into a staging table and then executes a delete/insert process against another table with 30 billion+ rows and the delete/insert from the staging tables runs in a bit over 60 seconds (fairly wide rows, around 25 columns).

     

    All of the processing time is taken up by the ETL to generate the data.  Once that is done and loaded into the temp table the merge into the target table is very fast.

  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    I think you may be trying to misuse Netezza.  If you have multiple sessions which need to update the same record then the table should reside on a standard database like Oracle, SQL Server, MySQL, PostgreSQL, etc.  Netezza is for large scale data loading / analysis, it is not designed to be the backend for an application which does a lot of single row updates/inserts/deletes.

     

    So data flow should look something like:

     

    Application -> OLTP database -> ETL processes -> staging table(s) -> target Netezza table

     

    The ETL processes to push data into Netezza could run on a regular basis, every few minutes, once per day, etc.

     

    When I was referring to multiple sessions loading data into a staging table, I'm talking about standard ETL type processing where you have massive amounts of data to process.  The data gets split between multiple ETL processes to enable scaling across multiple systems/CPUs (generally using a HASH based algorithm).  All of those parallel ETL processes dump data into a staging table in Netezza and then once the ETL processing of the data is completed you run a SQL script to push that data into the target table.

     

    Attempting to have Netezza act as the backend database for a normal application will at best cause you a lot of difficultly and at worst end in complete failure.

  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    In addition to this many people also do additional processing inside of Netezza as well:

     

    application -> oltp database -> etl processing -> Netezza staging table -> SQL transformation -> Netezza staging table2 -> Netezza target table

  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    I think you'll have difficulty getting this to work without having a method to guarantee that you don't have two different load processes running at the same time.  In my environment this would be pretty easy using our scheduling tool since we can just create a resource requirement (basically a lock, each job requires resource X so only 1 of the 3 jobs could be running at any one time) but I do not know what you have available.  If you aren't using an enterprise scheduling tool you'll have to use some scripts to manage the concurrent updates or just run the updates in sequence.

     

    Even in an OLTP database such as Oracle this load strategy could result in deadlocks in certain situations so you would need to sequence the update processes to prevent multiple updates from running at the same time.

     

    As I see it there are only two ways to solve this problem.

    1) Create 3 tables in Netezza and just join them together using SQL or a view.

    2) Design your ETL process in a way to prevent multiple sessions from trying to load data at the same time.

     

    If I had to have data from multtiple source in a single table, I'd write my ETL processes to extract the 3 datasets in parallel, consolidate multiple updates based on the sequence value into a final result, merge the 3 datasets together, push that into a work table in Netezza, use a SQL statement to get any data out of my target table and other lookup tables that is still needed, dump that into a 2nd work table, then execute a delete/insert from that work table into the target table.  There are many ways to solve this problem, but the gist of it is to design your process to make sure that you don't have two different processes trying to update the target table at the same time.

     

    We have several processes somewhat like this and our solution has always been to just run the extract jobs in parallel into some temporary work area then have the final update into the target table be done by a single script.  It is pretty easy to manage this with any decent job scheduling tool.

  • David Birmingham Active Enzee 426 posts since
    Sep 24, 2007

    Commit to building an environment where updates and deletes are rare and administrative, like for data repair.

     

    Otherwise build an "insert-only" model - you will find that this might stir your creative juices a bit more, but it will defnitely elicit more elegant and creative data processing solutions and most definitely divorce your brain from "RDBMS-think".

     

    Update in Netezza is a combined delete-insert operation of an entire row. So you take the insert penalty and create a hole in your data. And more unfortunately, disrupt your zone maps. A carefully organized zone map can be disrupted with a single update, because the new row goes to the end of the table, where the otherwise zone mapped value may not find a home.

  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    The first question to understand is if you can even benefit from zone maps on the particular table in question, then the 2nd question is do the updates 'disrupt' the zone maps.  Many scenarios make an 'insert only' strategy for your data warehouse impossible so there is no other solution to that problem.  While David makes a great point that it is a nice thing to strive for, in many  industry verticals it just isn't going to be practical.

     

    Zone maps are only effective if the data has been physically sorted in the table (by a time column for example, or maybe a phone number, customer id, or whatever).  The only way to guarantee that a table is sorted is to either know exactly how the data is loaded (for example, the table is only inserted into and is thus inherently ordered by time or some other id type column) or to regularly reorder the entire table (create newtable as select * from oldtable order by a,b,c and swap the new table with the old table).  Another alternative is to create materialized views and use the views to sort the table (assuming you have enough extra storage).  In general most of my data is automatically loaded in a fairly sorted order (by time) so mostly we rely on that.  We also have some materialized views which we use to only select a subset of the columns in the table which are most commonly used and to also guarantee sorting based on the MVIEW definition.

     

    Prior to NPS version 4.0 when a record was deleted it would totally destroy the zone maps altogether but fortunatly that problem was fixed.  The only problem with delete/update now is that you have to come back and reclaim the space and you may end up with unsorted data (when you update a record it is going to move to the end of the table, leaving a deleted record in the middle until nzreclaim cleans it up).  The only way to fix that problem is as I mentioned above if  you cannot design a loading method which avoids updates.

     

    I've attached a script which will show  you how effective your zone maps are on a given table.  Must be ran as a user that already has the NZ_USER and NZ_PASSWORD variables set (or has used nzpassword to cache the encrypted password).

    Attachments:
  • Shawn Fox Enzee Exraordinaire 1,324 posts since
    Aug 15, 2006

    One other comment on zone maps in general is that Netezza doesn't even maintain a zone map on small tables.  I don't recall the exact threshold but I believe it is 5 extents per SPU before the zone map will be maintained.  An extent is 3MB, so if you have a 56SPU system, that means that table has to be at least 3*5*56 = 840MB to have any zone maps.  In practical terms the zone maps really aren't very useful until the cost of using the zone maps overcomes the reduced amount of data scanned.  In the 5 extent per SPU table, the best a zone map can do is to eliminate 80% of the data scanned (if the table is perfectly sorted).

     

    Netezza has been talking about having a much more granular zone map based system for a while which is supposed to be delivered this fall.  The new method will give zone maps per 128KB block instead of 3MB extent, basically making the zone maps 24x more useful.  With the additional ability to declare a clustering strategy at table definition time to guarantee that data is automatically sorted in the table the multi dimensional clustering feature is going to massively improve performance in many cases.

     

    We are supposed to be getting clustering + better zone maps + non locking reclaim (technically a non locking multi dimensional clustering rebuild which will be a different feature than nzreclaim but will clean up deleted records at the same time) in the release this fall according to the information they were talking about at last year's conference.

     

    If these features are delivered it won't matter how you are loading data and most of this conversation thread becomes irrelevant.  Of course there are a lot of ifs and whens involved.  It could easily be that you wouldn't see these features in a production environment until early next year (if the features are even delivered at all).

1 2 Previous Next

More Like This

  • Retrieving data ...

Bookmarked By (1)

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points