Feb 25, 2010 11:01 AM
Query on distribution keys in a snow flaked schema
-
Like (0)
Hi ,
We have an implementation of a data warehouse. These are the high level details.
Dimensions
Product hierarchy
T1 (Highest level) T2 T3
PK1 PK2 (=T1.FK1) PK3 (=T2.FK2)
FK1 FK2 FK3
Fact Table
FACT
Key1
Key2
Key3
PK3
Metric 1
Metric 2
We have a below sample report query.
select * from T3 join T2 on (T3.PK3=T2.FK2)
join T1 on (T1.FK1=T2.PK2)
join FACT on (FACT.PK3=T3.PK3)
I am receiving a very bad plan
My doubt is how to distribute these dimension tables and FACT so as to get the optimal performance.
Also please suggest ceratin methods to fine tune this query.
Many Thanks
Sebastian
need more inputs, its dangerous to decide by just considering one sample query fired on Fact tables.
if this is the only query which is fired all the time on this FACT table, you can distribute the FACT on PK3. what are the distributions u used when u got the bad plan?
Based on the limited information provided, I would distribute the fact table on product ID (I'm assuming that is your PK3)
Hi
I analyzed the plan and I can see that the query has a subquery as below
select * from T3 join T2 on (T3.PK3=T2.FK2)
join T1 on (T1.FK1=T2.PK2)
join FACT on (FACT.PK3=T3.PK3)
where FACT.KEY in ( SELECT KEY FROM CUST_TABLE WHERE CUST_TABLE.STATUS='Active'
The problem is that the join of the FACT table and the CUST_TABLE is being broadcasted.
Any suggestions on how to avoid the same.
Thanks
Sebastian
cust_table getting braodcasted is fine. but did u observe any redistributions on FACT table? wats ur FACT table present distribution key?
with the below query analyze the plan file for any redistributions on FACT table
select * from T3 join T2 on (T3.PK3=T2.FK2)
join T1 on (T1.FK1=T2.PK2)
join FACT on (FACT.PK3=T3.PK3)
join ( SELECT KEY FROM CUST_TABLE WHERE CUST_TABLE.STATUS='Active') foo on (foo.KEY=FACT.KEY)
I would not be so quick to say that broadcasting the customer table is ok. If the customer table has 100 million records in it, that can become a very big problem. It is not uncommon to distribute your fact table on your customer_id column in this case. As has already been pointed out, however, I have no idea what else is running on the system. Distributing your fact table on customer_id might benefit this specific query but slow down 95% of the other common queries in your system.
A very common problem in situations like this is that you might have some 'generic' customer key values which occur in large percentage of your records. For example, if you have a generic customer key such as 'Unknown Customer' that key might be 25-50% of your records. In such a case if you distribute on customer_id that will result in highly skewed data and terrible performance.
Even if you are just joining on customer_id as in your SQL, if the optimizer decides to redistrbute the fact table and the customer table on customer id to perform the join, you can end up pushing an excessivly large number of records to a single CPU core and thus your query will take forever to complete.
yup, ur rite.
Try not to focus on the pk/fk as the summation of the problem. You can have very different pk/fk from your distribution. A distribution does not have to participate as a pk or fk.
Think about this - if your tables are all distributed on X, where your pk and fk are always Y and Z, then your Y and Z form the functional connection for the data, where X forms the
performance substrate for comparing them between tables. In many cases X may be a primary key. You would join on all three, even though X might not have anything to do with the
problem-at-hand, excluding it will cost you in performance.
So find the common denominator for all of your tables in play. Distribute them on a common key and then apply your query accordingly, using the regular functiional join logic but also
including the distribution key to connect them on the hardware. It doesn't matter if the distribution key actually plays a functional role in the data as long as it keep the data on the same
SPU during the join of the functional keys.
Hi All,
The FACT is ditributed on a surrogate key. However I cannot change the ditribution of the FACT as it is joined with FACT2 on the same surrogate key in a different query.
I think the only way out is to set the fact rel planner system parameter to suppress the broadcasting the fact table.
Please correct me if Iam wrong.
Thanks
Sebastian
If you want to solve the problem of this particular query only by creating a copy of the fact table and redistibuting it on the necessary key. Use this strategy only if you are having the report this query is used frequently

