Nov 2, 2009 11:33 AM
int4 overflow error
-
Like (0)
What is the best way to dissect this error? I know that I am getting this due to bad data, but Im trying to figure out the best solution to find the culprits when dealing with 70 billion rows.
Im trying to load click stream data and turning a data type of DATE to an integer. Sounds simple enough and most loads are working except a few.
here is what I am doing
cast(to_char(day,'YYYYMMDD') as int) as date_id
Im joining on an INTEGER id as well as the day which is a DATE.
Im inserting into a column DATE_ID which is INTEGER
insert into fct_click
(date_id)
select cast(to_char(day,'YYYYMMDD') as int) as date_id
from stage_click a
inner join stage_click_processed b on a.id = b.id
and a.day = b.day;
id = integer
day = date
date_id = integer
Any ideas?
Thanks in advance!
tom_k
I would try to run a few parts separately, seeing if they fail, or looking for min/max values. The overflow sounds like it would be happening on a conversion, aggregation, or insert process.
For example, run just this part:
select cast(to_char(day,'YYYYMMDD') as int) as date_id
from stage_click
and see if that is what is giving you issues. If it does not fail, get the min/max values from date_id that is generated and see if they are beyond the definition of an int.
Additionally, double check the definition of your target table compared to your source data. Is ID really an int? or a bigint in the source data?
Thanks Chris.
Everything seems to check out from your suggestions. Im running the min and max as we speak. There is one difference though. I see the distribution in the stage is set at random while the processed table is set on id and day as the distro.
It seems to be only certain days will not load. The majority of these days did load but some of these one offs did not.
Any other suggestions?
/tom_k
We had a case of multi-billion-row sifting for bogus dates in files - namelly that they had illegal/garbage characters, some of which did not display
a quick way to scan your data for bogus values is to load the data into a varchar, and then follow up with the function as so:
when (translate(trim(column_name), '-:/0123456789','')='') then date(colum_name) else null end column_name
the above checks for valid chars in our date-type values, so just give it a character list that makes send for your data type - if all the characters reviewed end up forming an empty string, then the entire value "passes" and it's a keeper - otherwise do something else with it (as in the above, we made it null) but I am not a big fan of nulls in the final target - our downstream processing scrubbed nulls before the repository received them
In addition, when converting from the varchar, capture any values with a string length of > 8 - this will help you drive out the culprits and discover any common patterns in the bad data.
If you can find a day that will not load, whittle it down to a record that will not load. Create the reproducable case where you have 1 row in each of the tables being joined that will cause the failure. At that point, the error may be come much clearer (take a look at the plan after it runs to see which snippet is failing). If not, then post the ddl and row values here and maybe we can spot it. Otherwise, open a ticket with Netezza and they can torubleshoot it for you - especially if you have a repro set up with a couple of table and small set of data.
That is how I approach most sql bugs - whittle it down to the specific data that does not work...

