Skip navigation
26964 Views 12 Replies Latest reply: Feb 22, 2012 12:31 AM by Saket Kale RSS
AmbarishDon New Enzee 67 posts since
May 11, 2010
Currently Being Moderated

May 20, 2010 9:37 AM

Which is faster, nzload or external tables?

 

Which is faster, using nzload for loading the data files and loading data through transient external tables? I have done small tests and seen that every time external table is faster than nzload for loading same data file in same table.

Functionally both nzload and external tables have same capabilities apart from load continuation. However load continuation itself slowdown the nzload process.

 

-AD



  • Shawn Fox Enzee Exraordinaire 1,486 posts since
    Aug 15, 2006
    Currently Being Moderated
    1. May 20, 2010 10:20 AM (in response to AmbarishDon)
    Re: Which is faster, nzload or external tables?

    nzload uses external tables to perform the load, so in reality the performance should be close to identical.  The external table mechanism just picks up the data out of the file (does not parse it) and ships it to the NPS host system where the data is actually parsed.  The net result being that the CPU load gets pushed to the NPS host system and is not on the client system.  If you look at pg.log when you use nzload you will see that all nzload does is issue a statement something like "insert into table select * from external 'file' using (...)"

     

    The big advantage you get with external tables over nzload is that you can deal with columns that are not in the same order in your file as they are in the target table without any additional CPU overhead.  For example:

     

    insert into testload(x,y,dt,tm,str)
    select x, y, dt, tm, str
    from external 'c:\temp\test1.csv'
    (
            x integer,

            dt date,
            tm timestamp,
            y numeric(18,4),
            str nvarchar(100)
    ) using (
            remotesource 'odbc'
            ctrlchars true
            delim ','
            logdir 'c:\temp'
            maxerrors 1
            encoding 'internal'
    );

    It is also possible to use regular SQL functions to transform the data during the insert.  For example you might have dates in multiple formats in the file and nzload does not provide a way to deal with that, so you can just read the data in as varchar values, then use TO_DATE to transform the varchar to a date.  The main thing to be careful of is that all of this work happens on the NPS host system, not on the SPUs, so you are only utilizing 1 CPU.  You may get better throughput by dumping the data into a staging table and then using SQL to process the data.  The other issue is that if any of the SQL fails the entire load fails and that is never a good thing.  If you load all of your data into a staging table you can then use SQL to filter out bad records and the staging table mechanism also gives you an audit trail for what the data originally looked like as well as making the code easier to maintain.

     

    I would generally advise that you load data into NPS tables with as little massaging as possible, and then using SQL to process the data.  This allows you to utilize the parallel processing power of the NPS system and also breaks the process up into smaller pieces that are much easier to maintain over the life of your data warehouse.  Using external tables and processing the data with SQL functions during the load process is ok if the data is fairly simple and not huge volumes, but I would recommend against it for high volume / complex transformations.  Also, you should never join the data from an external table to tables in NPS directly since all the work happens on the host server.  If you need to do a join, always load the data into a staging table and then do the join with a 2nd SQL statement.

  • David Birmingham Active Enzee 429 posts since
    Sep 24, 2007
    Currently Being Moderated
    3. May 27, 2010 8:33 AM (in response to AmbarishDon)
    Re: Which is faster, nzload or external tables?

    Just keep in mind the configuration constraints. For example, the external table form has to have visibility to is underpinnng file from the Netezza machine. This is not always possible and in most cases impractical. For example, one enterprise had their file system on a mount that the Netezza host could not see, so they bridged it with file synonyms to another directory that the host could see. Another enterprise physically copied the files into lasso-distance from the machine etc. If you have file space and patient administrators, I suppose these are workable options, but not as practical.

     

    In another setting, they were loading files into the Netezza machine from a windows environment after extracting from their SQL-server instances and web logs. This landed on the local Windows server that the Netezza host had no visibility to, so the external table was not an option at all. The Windows-based nzload client was the answer.

     

    It is important to integrate from a perspective that has the least maintenance overhead. This would tend to mean that since the nzload is appropriate for all cases but the external table is not, then standardizing on the nzload will be a more stable implementation. Then external tables can be used for exceptional cases.

     

    I also agree with Shawn on transform overkill in the external table SQL - I put this into a Gather "Round the Grill blog essay called "Honor the Host" here on the Netezza Community.

  • Shawn Fox Enzee Exraordinaire 1,486 posts since
    Aug 15, 2006
    Currently Being Moderated
    5. May 27, 2010 9:14 AM (in response to AmbarishDon)
    Re: Which is faster, nzload or external tables?

    As far as Netezza is concerned, external tables and nzload are the same thing... it cannot tell the difference.  If you use an external table and SQL functions that might cause the command to be treated differently than a standard external table that does not have any SQL transformations (such as would be used by nzload).  As to workload management and external tables, that has undergone quite a few changes from version to version.  For quite a while external tables didn't operate within WLM but in the most recent releases of NPS they do.  Just to repeat again though, nzload and external tables are exactly the same thing.  nzload just acts as a client for a remote external table... there is no special code in the Netezza backend to support nzload specifically.

     

    The more recent versions of most ETL tools that work on Netezza use ODBC and load data using remote external tables with the remotesource 'odbc' clause, they do not use nzload.  Some support an option that allows you to use either nzload or external tables, but in the end, as far as the Netezza, nzload is the same as using an external table and the backend cannot tell the difference.

  • David Birmingham Active Enzee 429 posts since
    Sep 24, 2007
    Currently Being Moderated
    6. May 27, 2010 5:28 PM (in response to Shawn Fox)
    Re: Which is faster, nzload or external tables?

    At many sites, Linux machines are being stood up alongside the Netezza machine to behave as its application server, with the Netezza machine being more locked-down. nzload is then installed on the Linux machine. It is very difficult to convince one of their admins to jump through the file-mounting hoops for the external tables when nzload is ready to go.

     

    In short, if the security protocol is such that (or changed such that) the Netezza Host is an "island" and deliberately disabled from reaching out to the file system or other devices, then the only external table you can load is from it's onboard disk space. Since in these cases the security protocol will not let you access the machine's disk space externally to write data to it, there is no way to put data onto the internal disk space. They will lock it down all around and offer up the simple solution "nzload". Can't really argue with that.

     

    So for you folks working internally to the customer, you have options. To you folks who are consulting and will want to develop something more reusable, wrap some stuff around nzload and go home happy.

     

    I agree that the external table offers up some flexibility in remapping column ordering, but we covered this in another way with no sweat and nzload is still the workhorse. The problem expressed to us by the client is that the external table "seems" much more complex and cryptic implementation than nzload, so it's easier for their maintenance and troubleshooter folks to manage the utility rather than the subtlies of the external table's notation. I know of sites that use the external table but only as it's wrapped in another script that hides the cryptic parts - but this is what nzload does anyhow. Still we use external tables for loading up logs and even for fetching internalized flat files (like logs) on the netezza machine.

  • Shawn Fox Enzee Exraordinaire 1,486 posts since
    Aug 15, 2006

    I have to admit that remote external tables are not highlighted in the documentation, but they documented.  You can use the REMOTESOURCE option of an external table to cause it to load data from a remote client machine, *not* from the NPS host system.  This feature goes way back to NPS 3.1 at least and may have been around longer.

     

    For example, when I use nzload from a remote client system:

     

    nzload -df test.txt -t foo

     

    This is the SQL that gets executed to do the load:

     

    insert into FOO SELECT * FROM EXTERNAL 'test.txt' USING ( BOOLSTYLE '1_0' COMPRESS FALSE CRINSTRING FALSE CTRLCHARS FALSE DATEDELIM '-' DATESTYLE 'YMD' DELIMITER 'TAB' ENCODING 'latin9' FILLRECORD FALSE IGNOREZERO FALSE LOGDIR '' MAXERRORS 1 MAXROWS 0 NULLVALUE 'NULL' QUOTEDVALUE 'NO' REMOTESOURCE 'odbc' REQUIREQUOTES FALSE SOCKETBUFSIZE 8388608 TIMEDELIM ':' TIMEEXTRAZEROS FALSE TIMESTYLE '24HOUR' TRUNCSTRING FALSE Y2BASE 0)

    Note that all nzload is doing is using a transient external table with the REMOTESOURCE 'odbc' option specified.  You can issue the same command using any ODBC connection (using the isql command which comes with unixODBC for example) and get the same result.  If you are connecting using JDBC just change it to REMOTESOURCE 'jdbc'.  nzload is just a very simple wrapper around the external table functionality.  Thus you can use external tables to be quite a bit more flexible than nzload is (columns don't have to be in the same order as they are in the table, for example) and get identical performance.

     

    Unfortunately remote external tables are not yet supported using nzsql.  This is a frequently requested feature but it hasn't yet made it into the product.

  • David Birmingham Active Enzee 429 posts since
    Sep 24, 2007
    Currently Being Moderated
    9. May 28, 2010 10:27 PM (in response to AmbarishDon)
    Re: Which is faster, nzload or external tables?

    Depends on if you want to support an nzload-based solution or an nzload-based solution AND an external-table-based solution. You will always have a need for nzload, but you will not always require external tables.

     

    See my post above Shawn's - the environmental constraints in more and more sites are locking down the Netezza machine, limiting the visibility of external files and sources - meaning you will have to use nzload in these cases.

     

    The simplicity of the nzload is easier to train, maintain and explain, but when you look at the syntax of the external table, and the fact that you will have to manage these details manually, it still means that the nzload is shorter distance to delivery

  • Saket Kale New Enzee 17 posts since
    Jan 12, 2012
    Currently Being Moderated
    10. Feb 21, 2012 3:52 AM (in response to Shawn Fox)
    Re: Which is faster, nzload or external tables?

    Hello,

     

    Is it possible to load all the files in a particular folder using a control file.

     

    Here this folder will be populated by new files to be added to the warehouse and then the nzload action will take place after every few hours/minutes etc.

     

    I am not aware whether we can directly add all the files in a folder to the box.

     

    Please advise,

     

    Thanks,

    Saket Kale.

  • Chris Rodgers Active Enzee 185 posts since
    Sep 19, 2006
    Currently Being Moderated
    11. Feb 21, 2012 5:00 PM (in response to Saket Kale)
    Re: Which is faster, nzload or external tables?

    I do not see any examples of filemasks in the definition of a file to be loaded by nzload.

     

    You could run a script that does an nzload for every file in the directory.

     

    If you have a known set of file names that are always there, you can define multiple files in your control file.  The example is in the Netezza Data Loading Guide.  Basically, Netezza allows you to write multiple DATAFILE /dir/filename {options} entries in a config file.  One call to nzload specifying this control file executes loads for all defined  DATAFILE entries.

  • Saket Kale New Enzee 17 posts since
    Jan 12, 2012
    Currently Being Moderated
    12. Feb 22, 2012 12:31 AM (in response to Chris Rodgers)
    Re: Which is faster, nzload or external tables?

    Hello Chris,

     

    Thanks for your reply.

     

    However, what I want to do is the files in the directory will keep changing every three hours or so.

    So, I want to have a shell script which will scan the folder and then take the new file in the folder and load it.

     

    I am guessing this cannot be done in control file since the filename will have to be written there.

     

    Please advise,

    Saket Kale.

More Like This

  • Retrieving data ...

Bookmarked By (0)

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points