3 Replies Last post: Jan 19, 2010 5:04 PM by rmenon  
rmenon   32 posts since
Apr 23, 2009
Currently Being Moderated

Jan 8, 2010 5:22 PM

update versus deletes

Hi folks

I am being told that in Netezza updates are in general slower than deletes and that for big tables they are not scalable.

 

I know they are two different statements but in a solution where you have two alternatives - one involving insert/delete and the other updates, is there a good engineering reason for us to choose deletes over updates. Is there any documentation about this anywhere?

 

Thank you!

Shawn Fox   392 posts since
Aug 15, 2006
Currently Being Moderated
1. Jan 8, 2010 6:14 PM in response to: rmenon
Re: update versus deletes

One good reason to do delete/insert is that you cannot update a distribution column.  Depending on your data, that can be a problem.

 

In testing I have done on 10000 series hardware, update/insert is faster than a delete/insert combination, but not by a substantial margin

 

My general practice is to load all the data into a staging table, then perform delete + insert against the target table.  There are certainly exceptions, but you'll always do well with the delete + insert model.

 

In all cases, if you aren't dealing with big tables it probably isn't going to matter a great deal.  Still it is best to follow exactly the same model with every process, a system is easier to support when every process follows exactly the same standard.  Your main goal should be to make it easy to audit the changes to the data each day.

 

On Netezza performance of your ETL is rarely going to be an issue.  The main thing to focus on is making sure that when there is a problem you can determine what went wrong and you can easily recover from it.  For dimension type tables, that means keeping a full copy of the way the data looked prior to your ETL process running.  When possible, you should keep copies from the last several days.  For everything but your largest tables, storage is very cheap compared to the time spent debugging problems.

David Birmingham   161 posts since
Sep 24, 2007
Currently Being Moderated
2. Jan 8, 2010 6:20 PM in response to: rmenon
Re: update versus deletes

You also have the option, used mostly in "continuous" models, to add a companion table that contains all of the "deleted" records simply by using a unique identifier and the same distribution key. By joining to this table using a "where not exists" you effectively filter the deleted records from every query. Painless even on large tables, but again finds its practicality more in continuous models where there is no batch window for maintenance.

 

As Shawn noted - be consistent with whatever you do.

More Like This

  • Retrieving data ...