Skip navigation
14372 Views 2 Replies Latest reply: Jul 7, 2010 11:33 PM by Jeff Feinsmith RSS
heng.mei@opinmind.com New Enzee 1 posts since
Mar 19, 2010
Currently Being Moderated

Jul 7, 2010 9:51 PM

string-to-date conversions, how to ignore invalid dates?

Hi netezza experts,

 

I have a table containing strings that represent dates e.g. '2010-03-22'.  Some of the rows contain invalid date strings.

 

When I try to query I get the following error:

 

nzsql> select to_date(string_column,'YYYY-MM-DD') from my_table;
ERROR:  Invalid Date.

 

Instead of an error.. is it possible for my query to only return the validly formatted dates... while ignore the bad rows?     In general what's the best way to detect whether a date string is valid?

 

Thanks,

Heng

  • David Birmingham Active Enzee 426 posts since
    Sep 24, 2007

    The best way to get rid of invalid dates is before they ever enter the system. A warehouse is intended to be the repository for scrubbed, non-null data. If poorly formatted data is in the warehouse, it means a step was missed in the process to get the data into the table.

     

    So take the steps to fix the places that are allowing this bad data through -

     

    and here's the reason:

     

    Whenever we allow bad data into the warehouse, it requires every downstream process to over-code, buffering themselves or shielding themselves from bad data. Isvalid, isnull, case/when start popping up all over the business implementation, cluttering it with junk-wrappers, when in fact, the business implementation should be the cleanest, most concise SQL in the whole shop.

     

    Rip out the bad dates by changing their values to some known default, like 01-01-1900 or some such. Don't pick null, because this even worse.

     

    But ignoring invalid dates sounds a lot like "continuing to ignore invalid dates" when they should be dealt with as a priority, upon first entry.

     

    Here's the second corollary: stewardship. In the principle of stewardship, we "make the data better every time we touch it" - or in the case of processing the data toward our target tables, "make the data better for the next downstream consumer"

     

    In this case, because bad dates are in the tables, it also means that a lot of other bad data has probably escaped the attention of the data processing flow. There could be a lot more bad data than just the dates. This is a symptom, and not something to ignore.

  • Jeff Feinsmith New Enzee 13 posts since
    Mar 28, 2007

    While I completely agree with David that bad date content is a symptom of more widespread data quality issues and that really needs to be investigated, there are two approaches to identifying bad date content.  This is particularly important if you are using the Netezza database as part of your data quality management and stewardship processes.

     

    The first and perhaps most flexible approach is to create a UDF that executes isdate type of logic (Netezza doesn't have an isdate built-in function).  If you are using Netezza 5.0.x or later, you may find nzLua a convenient way to quickly code up a UDF without bothering to deal with C/C++ (see http://lua-users.org/wiki/DateFormattingFunctions for sample date functions).

     

    Alternatively, you could create a valid dates lookup table and join it to your date-like content (recipe credit goes to Natasha Yanayt):

     

    Create a CALENDAR table  containing all the dates from 0AD:

    CREATE TABLE  CALENDAR (DT DATE);

    INSERT INTO CALENDAR VALUES (‘0001-01-01’);
    INSERT  INTO CALENDAR SELECT DT + 2^  0 FROM CALENDAR;
    INSERT INTO  CALENDAR SELECT DT + 2^  1 FROM CALENDAR;
    INSERT INTO CALENDAR  SELECT DT + 2^  2 FROM CALENDAR;
    INSERT INTO CALENDAR SELECT DT +  2^  3 FROM CALENDAR;
    INSERT INTO CALENDAR SELECT DT + 2^  4 FROM  CALENDAR;
    INSERT INTO CALENDAR SELECT DT + 2^  5 FROM CALENDAR;
    INSERT  INTO CALENDAR SELECT DT + 2^  6 FROM CALENDAR;
    INSERT INTO  CALENDAR SELECT DT + 2^  7 FROM CALENDAR;
    INSERT INTO CALENDAR  SELECT DT + 2^  8 FROM CALENDAR;
    INSERT INTO CALENDAR SELECT DT +  2^  9 FROM CALENDAR;
    INSERT INTO CALENDAR SELECT DT + 2^ 10 FROM  CALENDAR;
    INSERT INTO CALENDAR SELECT DT + 2^ 11 FROM CALENDAR;
    INSERT  INTO CALENDAR SELECT DT + 2^ 12 FROM CALENDAR;
    INSERT INTO  CALENDAR SELECT DT + 2^ 13 FROM CALENDAR;
    INSERT INTO CALENDAR  SELECT DT + 2^ 14 FROM CALENDAR;
    INSERT INTO CALENDAR SELECT DT +  2^ 15 FROM CALENDAR;
    INSERT INTO CALENDAR SELECT DT + 2^ 16 FROM  CALENDAR;
    INSERT INTO CALENDAR SELECT DT + 2^ 17 FROM CALENDAR;
    INSERT  INTO CALENDAR SELECT DT + 2^ 18 FROM CALENDAR;
    INSERT INTO  CALENDAR SELECT DT + 2^ 19 FROM CALENDAR;
    INSERT INTO CALENDAR  SELECT DT + 2^ 20 FROM CALENDAR;

    Join your table to the CALENDAR table. It should be LEFT join, something  like this:

    SELECT  CASE WHEN CALENDAR.DT IS NOT NULL  THEN 1 ELSE 0 END AS DATE_FLAG
    FROM      <your table>  A7
    LEFT  JOIN  CALENDAR        ON A7.DATE_LIKE_COLUMN = TO_CHAR(CALENDAR.DT, ‘YYYYMMDD’)

    The format specified in the to_char() function should match the format of your DATE_LIKE_COLUMN.

    Now, you will have a flag showing if  the
    DATE_LIKE_COLUMN column is a date or not. You can use it in your query.

     

    -Jeff

More Like This

  • Retrieving data ...

Bookmarked By (0)

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points