Skip navigation
14976 Views 9 Replies Latest reply: Feb 28, 2010 7:38 AM by Vijay RSS
sdennyk New Enzee 18 posts since
Nov 22, 2009
Currently Being Moderated

Feb 25, 2010 11:01 AM

Query on distribution keys in a snow flaked schema

Hi ,

 

We have an implementation of a data warehouse. These are the high level details.

 

Dimensions

 

Product hierarchy

 

T1 (Highest level)   T2                          T3

 

PK1                      PK2 (=T1.FK1)        PK3 (=T2.FK2)

FK1                      FK2                        FK3

 

 

Fact Table

 

FACT

 

Key1

Key2

Key3

PK3

Metric 1

Metric 2

 

We have a below sample report query.

 

select * from T3 join T2 on (T3.PK3=T2.FK2)

join T1 on (T1.FK1=T2.PK2)

join FACT on (FACT.PK3=T3.PK3)

 

I am receiving a very bad plan

 

My doubt is how to distribute these dimension tables and FACT so as to get the optimal performance.

 

Also please suggest ceratin methods to fine tune this query.

 

Many Thanks

 

Sebastian

  • Superuser Rookie 90 posts since
    Sep 19, 2008
    Currently Being Moderated
    1. Feb 25, 2010 11:16 AM (in response to sdennyk)
    Re: Query on distribution keys in a snow flaked schema

    need more inputs, its dangerous to decide by just considering one sample query fired on Fact tables. if this is the only query which is fired all the time on this FACT table, you can distribute the FACT on PK3. what are the distributions u used when u got the bad plan?

  • Shawn Fox Enzee Exraordinaire 1,333 posts since
    Aug 15, 2006
    Currently Being Moderated
    2. Feb 25, 2010 11:20 AM (in response to sdennyk)
    Re: Query on distribution keys in a snow flaked schema

    Based on the limited information provided, I would distribute the fact table on product ID  (I'm assuming that is your PK3)

  • Superuser Rookie 90 posts since
    Sep 19, 2008
    Currently Being Moderated
    4. Feb 25, 2010 11:41 AM (in response to sdennyk)
    Re: Query on distribution keys in a snow flaked schema

    cust_table getting braodcasted is fine. but did u observe any redistributions on FACT table? wats ur FACT table present distribution key?

     

    with the below query analyze the plan file for any redistributions on FACT table

     

     

    select * from T3 join T2 on (T3.PK3=T2.FK2)

    join T1 on (T1.FK1=T2.PK2)

    join FACT on (FACT.PK3=T3.PK3)

    join ( SELECT KEY FROM CUST_TABLE WHERE CUST_TABLE.STATUS='Active') foo on (foo.KEY=FACT.KEY)



  • Shawn Fox Enzee Exraordinaire 1,333 posts since
    Aug 15, 2006
    Currently Being Moderated
    5. Feb 25, 2010 11:54 AM (in response to Superuser)
    Re: Query on distribution keys in a snow flaked schema

    I would not be so quick to say that broadcasting the customer table is ok.  If the customer table has 100 million records in it, that can become a very big problem.  It is not uncommon to distribute your fact table on your customer_id column in this case.  As has already been pointed out, however, I have no idea what else is running on the system.  Distributing your fact table on customer_id might benefit this specific query but slow down 95% of the other common queries in your system.

     

    A very common problem in situations like this is that you might have some 'generic' customer key values which occur in large percentage of your records.  For example, if you have a generic customer key such as 'Unknown Customer' that key might be 25-50% of your records.  In such a case if you distribute on customer_id that will result in highly skewed data and terrible performance.

     

    Even if you are just joining on customer_id as in your SQL, if the optimizer decides to redistrbute the fact table and the customer table on customer id to perform the join, you can end up pushing an excessivly large number of records to a single CPU core and thus your query will take forever to complete.

  • Superuser Rookie 90 posts since
    Sep 19, 2008
    Currently Being Moderated
    6. Feb 25, 2010 12:29 PM (in response to Shawn Fox)
    Re: Query on distribution keys in a snow flaked schema

    yup, ur rite.

  • David Birmingham Active Enzee 426 posts since
    Sep 24, 2007
    Currently Being Moderated
    7. Feb 25, 2010 12:51 PM (in response to Superuser)
    Re: Query on distribution keys in a snow flaked schema

    Try not to focus on the pk/fk as the summation of the problem. You can have very different pk/fk from your distribution. A distribution does not have to participate as a pk or fk.

     

    Think about this - if your tables are all distributed on X, where your pk and fk are always Y and Z, then your Y and Z form the functional connection for the data, where X forms the

    performance substrate for comparing them between tables. In many cases X may be a primary key. You would join on all three, even though X might not have anything to do with the

    problem-at-hand, excluding it will cost you in performance.

     

    So find the common denominator for all of your tables in play. Distribute them on a common key and then apply your query accordingly, using the regular functiional join logic but also

    including the distribution key to connect them on the hardware. It doesn't matter if the distribution key actually plays a functional role in the data as long as it keep the data on the same

    SPU during the join of the functional keys.

  • Vijay New Enzee 9 posts since
    Jan 5, 2009
    Currently Being Moderated
    9. Feb 28, 2010 7:38 AM (in response to sdennyk)
    Re: Query on distribution keys in a snow flaked schema

    If you want to solve the problem of this particular query only by creating a copy of the fact table and redistibuting it on the necessary key. Use this strategy only if you are having the report this query is used frequently

More Like This

  • Retrieving data ...

Bookmarked By (0)