Skip navigation
17435 Views 10 Replies Latest reply: Jun 19, 2010 4:06 AM by ankur RSS
ankur New Enzee 10 posts since
Apr 12, 2010
Currently Being Moderated

Jun 10, 2010 7:02 AM

Does distribution columns order matter while applying an aggregate function?

Hi,

 

I am trying to get a unique user count for a period from a temporary table. Temp table is a snippet of the main table T1 and is distributed on i_id.

 

I created 2 temp table using the same query and have the same data, only difference is the distribution columns.

 

Columns:

user_id   bigint not null,

date1     date   not null,

imp_cnt  bigint not null

 

Table Temp1 is distributed on ( user_id, date1) and second table Temp2 on (date1, user_id).

 

when I execute the query : select count(distinct user_id) from Temp2  RESULT count is some 2 billion and it completes in 2 mins but when i execute the same query on Temp1, it takes 5-6 mins to complete.

 

Could you please help me in understanding what is happening here? What am i doing wrong with the temp1?

 

Also, when I run the query, Netezza creates 6 snippets but if I do something like 'select count(*) from (select user_id from temp1 group by user_id) tmp', it creates only 3 snippets. I also observered that my first query completes earlier than the second query. Does the snippet numbers affects the execution time and if yes, how can I make my query to run on more snippets?

  • ravi New Enzee 55 posts since
    Sep 14, 2009

    the more is the unique key distribution more faster will be the query response.

     

    In your case you have distributed Temp 1 on USER_ID and in Temp 2 on DATE, as per my understanding, you must have date column with same data w.r.t. to USER_ID column . Due to this you are getting less time in Temp 1 and more time in Temp 2.

     

    As per your problem, it seems like it does matter with order of declaration for Distribution key in table.

     

    Snippets increases with the data distribution on table.

  • ravi New Enzee 55 posts since
    Sep 14, 2009

    yes, that is what i am saying that it does matter with column distribution. Netezza is giving priority to first distribution key as per your analysis.

  • Shawn Fox Enzee Exraordinaire 1,333 posts since
    Aug 15, 2006

    To further clarify, the column order must be the same.  (user_id,date1) does not give the same distribution as (date1,user_id)

  • Shawn Fox Enzee Exraordinaire 1,333 posts since
    Aug 15, 2006

    It isn't that one columns has 'priority' over the other, but that the hashing calculation uses the columns in a specific order.  I don't actually know exactly how the calculation works, but I'd imagine it is something like a CRC32 algorithm.  So something like MOD(CRC32( 'foo' || 'bar' ), <dataslice_count>) is not going to yield the same results as MOD(CRC32('bar'||'foo'), <dataslice_count>).

     

    It is nothing like an index... just because the order is (column1,column2) doesn't mean the distribution key will help you out for queries which join on column1 or group by on column1.  The only queries that benefit from the distribution must use *all* columns in the distribution key as their join key or group by key.  Also as mentioned in the prior post (and by others), the order of the distribution columns must be the same in both tables.

     

    This is the primary reason behind the normal recommendation to use as few columns in the distribution key as possible to achieve a reasonably good distribution.

  • ravi New Enzee 55 posts since
    Sep 14, 2009

    Try this query to verify your distribution across all SPUs with your tables

     

     

    select datasliceid, count(datasliceid) as “Rows”

    from table_name group by datasliceid order by “Rows”;



     

  • Shawn Fox Enzee Exraordinaire 1,333 posts since
    Aug 15, 2006

    I just realized that no one had addressed most of your other questions, so I'll go back to those as well.

     

    First the simple question as to why the number of snippets is different between the two queries:

         select count(distinct user_id) from Temp2

         select count(*) from (select user_id from temp1 group by user_id) tmp

     

    These two queries are doing very different things, although in this case the result is the same.  The distinct operator inside of the count() has to be implemented in a way so that it does not interfere with other aggregates in the same select clause, that is the first query has the extra snippets.

     

    As to why the execution time would be so different in the two cases, I don't have a good explanation for that.  I would guess that you end up moving less data across the network in one case vs. the other but I don't see why that would be.  There may be some nuance with the distribution column order that I am not aware of.  Do both tables end up with similar skew?  In my experience the skew is generally about the same although not identical when reversing column order in a distribution clause.

  • David Birmingham Active Enzee 426 posts since
    Sep 24, 2007

    While Shawn has no doubt consummately answered this OP - as an additional note - for your consideration -

     

    If you actually use columns to filter a query (as in count of distinct user_ids), you may also experience "process skew" and "date" types are notorious for this effect. It is why date types are really bad choices for distribution keys.

     

    Here's an example:

     

    Let's say I have 2 billion records representing 2 years of retail data. I distribute the data on "transaction date" and when I check the distribution, the data is physically spread out very evenly.

     

    Now when I query for the data, say for the day after Thanksgiving to get a peak sales count, the database responds slowly. Why? Because all of the data for this day has been put on one SPU, so only one SPU is involved in supporting your query. It is obvious why this is a bad state of affairs. If I were to redistribute the data on random, it would get better performance simply because the query will involve more SPUs.

     

    A better use for the transaction data above, is as a zone map, because this will make every SPUs activity more efficieint.

     

    So in your case, you might have great distribution physically, but when you are trying to get a select-distinct in one form, it leverages fewer SPUs in the process, slowing down the query.

     

    Back to your issue on aggregation, Whenever you aggregate something using a CTAS, Netezza's default behavior is to leverage the fields in the group-by clause as the distribution key. You can and usually should override this default with a distribute-on clause. If you let Netezza default, what you will forego is a co-located write. This is very important because a CTAS without a co-located write will blast data all over the machine rather than keeping it localized on the SPU. Redistribution is less efficient than a co-located write.

     

    In addition - using multiple distribution keys rarely has any value. Mainly because once you distribute on those keys, you are required to use ALL of the keys in joins if you want to get co-location. Using less than all of them will functionally work, but will be slower. Distribution keys are not indexes - they are cat'd and hashed together, and the hash value determines which SPU it will land on. If you don't use all the keys in subsequent joins, Netezza will not be able to perform a co-located join and will have to scan the table. So the distribution key did not help you much, except to spread the data out.

     

    Recommendation - stay away from dates as distribution keys - this is a known best practice and pitfall. Also stay away from multiple distribution keys unless you have an obvious and necessary reason to use them this way.

     

    Want more insight - there is a whole chapter in "Netezza Underground" (Amazon.com) dedicated to distribution and how it works.

More Like This

  • Retrieving data ...

Bookmarked By (0)