Hi folks
I am being told that in Netezza updates are in general slower than deletes and that for big tables they are not scalable.
I know they are two different statements but in a solution where you have two alternatives - one involving insert/delete and the other updates, is there a good engineering reason for us to choose deletes over updates. Is there any documentation about this anywhere?
Thank you!
One good reason to do delete/insert is that you cannot update a distribution column. Depending on your data, that can be a problem.
In testing I have done on 10000 series hardware, update/insert is faster than a delete/insert combination, but not by a substantial margin
My general practice is to load all the data into a staging table, then perform delete + insert against the target table. There are certainly exceptions, but you'll always do well with the delete + insert model.
In all cases, if you aren't dealing with big tables it probably isn't going to matter a great deal. Still it is best to follow exactly the same model with every process, a system is easier to support when every process follows exactly the same standard. Your main goal should be to make it easy to audit the changes to the data each day.
On Netezza performance of your ETL is rarely going to be an issue. The main thing to focus on is making sure that when there is a problem you can determine what went wrong and you can easily recover from it. For dimension type tables, that means keeping a full copy of the way the data looked prior to your ETL process running. When possible, you should keep copies from the last several days. For everything but your largest tables, storage is very cheap compared to the time spent debugging problems.
You also have the option, used mostly in "continuous" models, to add a companion table that contains all of the "deleted" records simply by using a unique identifier and the same distribution key. By joining to this table using a "where not exists" you effectively filter the deleted records from every query. Painless even on large tables, but again finds its practicality more in continuous models where there is no batch window for maintenance.
As Shawn noted - be consistent with whatever you do.
Thanx Shawn. I was looking for some obvious performance differences between update versus delete/insert strategy but from your numbers looks like there are none.
Not updating the distribution key is not an issue for us.
Thanx again for your comments,
Menon
