Jul 7, 2010 9:51 PM
string-to-date conversions, how to ignore invalid dates?
-
Like (0)
Hi netezza experts,
I have a table containing strings that represent dates e.g. '2010-03-22'. Some of the rows contain invalid date strings.
When I try to query I get the following error:
nzsql> select to_date(string_column,'YYYY-MM-DD') from my_table;
ERROR: Invalid Date.
Instead of an error.. is it possible for my query to only return the validly formatted dates... while ignore the bad rows? In general what's the best way to detect whether a date string is valid?
Thanks,
Heng
The best way to get rid of invalid dates is before they ever enter the system. A warehouse is intended to be the repository for scrubbed, non-null data. If poorly formatted data is in the warehouse, it means a step was missed in the process to get the data into the table.
So take the steps to fix the places that are allowing this bad data through -
and here's the reason:
Whenever we allow bad data into the warehouse, it requires every downstream process to over-code, buffering themselves or shielding themselves from bad data. Isvalid, isnull, case/when start popping up all over the business implementation, cluttering it with junk-wrappers, when in fact, the business implementation should be the cleanest, most concise SQL in the whole shop.
Rip out the bad dates by changing their values to some known default, like 01-01-1900 or some such. Don't pick null, because this even worse.
But ignoring invalid dates sounds a lot like "continuing to ignore invalid dates" when they should be dealt with as a priority, upon first entry.
Here's the second corollary: stewardship. In the principle of stewardship, we "make the data better every time we touch it" - or in the case of processing the data toward our target tables, "make the data better for the next downstream consumer"
In this case, because bad dates are in the tables, it also means that a lot of other bad data has probably escaped the attention of the data processing flow. There could be a lot more bad data than just the dates. This is a symptom, and not something to ignore.
While I completely agree with David that bad date content is a symptom of more widespread data quality issues and that really needs to be investigated, there are two approaches to identifying bad date content. This is particularly important if you are using the Netezza database as part of your data quality management and stewardship processes.
The first and perhaps most flexible approach is to create a UDF that executes isdate type of logic (Netezza doesn't have an isdate built-in function). If you are using Netezza 5.0.x or later, you may find nzLua a convenient way to quickly code up a UDF without bothering to deal with C/C++ (see http://lua-users.org/wiki/DateFormattingFunctions for sample date functions).
Alternatively, you could create a valid dates lookup table and join it to your date-like content (recipe credit goes to Natasha Yanayt):
Create a CALENDAR table containing all the dates from 0AD:
CREATE TABLE CALENDAR (DT DATE);
INSERT INTO CALENDAR VALUES (‘0001-01-01’);
INSERT INTO CALENDAR SELECT DT + 2^ 0 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 1 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 2 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 3 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 4 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 5 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 6 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 7 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 8 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 9 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 10 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 11 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 12 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 13 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 14 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 15 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 16 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 17 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 18 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 19 FROM CALENDAR;
INSERT INTO CALENDAR SELECT DT + 2^ 20 FROM CALENDAR;
Join your table to the CALENDAR table. It should be LEFT join, something like this:
SELECT CASE WHEN CALENDAR.DT IS NOT NULL THEN 1 ELSE 0 END AS DATE_FLAG
FROM <your table> A7
LEFT JOIN CALENDAR ON A7.DATE_LIKE_COLUMN = TO_CHAR(CALENDAR.DT, ‘YYYYMMDD’)
The format specified in the to_char() function should match the format of your DATE_LIKE_COLUMN.
Now, you will have a flag showing if the DATE_LIKE_COLUMN column is a date or not. You can use it in your query.
-Jeff

