How the optimizer may use an index for cardinality estimates without actually using it

Recently I was sent a data warehouse SQL query that performed badly: it ran for more than an hour and then timed out. It was the first time the query had been run with a particular parameter. Here I will show how I solved this by creating an index that gave the optimizer better information, even though the optimizer did not actually use the index to execute the query.

Whenever I can, I try to get a feel for the cardinality estimates the optimizer makes, an approach originally inspired by Wolfgang Breitling. The optimizer is strongly driven by expected cardinalities, and miscalculations, especially in the early stages of an execution plan, lead to suboptimal or even bad plans. There are several causes of such miscalculations (among them data skew and correlated predicates), for which histograms, dynamic sampling and, where possible, query redesign are some of the solutions. Far too often I see predicates in a query that are heavily or even completely correlated; if such predicates can be avoided, so much the better.
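To see why correlated predicates cause trouble, consider a hypothetical example (the cars table, its columns and the row counts are invented purely for illustration). By default the optimizer multiplies the selectivities of the individual predicates, as if they were independent:

```sql
-- Hypothetical: 1,000,000 cars, 10 distinct makes, 100 distinct models,
-- and every model belongs to exactly one make, so the two predicates
-- below are completely correlated.
SELECT COUNT(*)
  FROM cars
 WHERE make  = 'FORD'
   AND model = 'FOCUS';

-- Independence assumption: 1,000,000 * (1/10) * (1/100) =  1,000 rows.
-- Reality: every FOCUS is a FORD, so:  1,000,000 * (1/100) = 10,000 rows.
-- The estimate is off by a factor of 10, and such errors only get worse
-- as more tables are joined on top of it.
```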

The query mentioned above was also a case of predicate correlation, and I want to show you that the mere existence of an index, even one not actually used for query execution, can still give the optimizer better information.

This is the query:

SELECT   t70910.lvl1anc_divn AS c1, SUBSTR (t70910.lvl1anc_postn, 6, 4) AS c2,
         t70910.lvl1anc_postn AS c3,
         CONCAT (CONCAT (t153459.emp_last_name, ', '),
                 t153459.emp_fst_name
                ) AS c4,
         COUNT
            (DISTINCT CASE
                WHEN t427832.accnt_con_flg = 'C'
                   THEN t427832.contact_wid
             END
            ) AS c5
    FROM wc_prod_rank_tc_a t426428 /* Contact Rank OA (WC_PROD_RANK_TC_A) */,
         w_source_d t163300 /* Sales Cycle (W_SOURCE_D) */,
         w_position_d t69295 /* Position (W_POSITION_D) */,
         w_position_dh t70910 /* SalesForce_Area */,
         w_position_d t153459 /* Lvl1Position_Area (W_POSITION_D) */,
         wc_per_rank_tc_a t427832
   WHERE (    t163300.row_wid = t427832.cycles_wid
          AND t70910.lvl1anc_postn_id = t153459.integration_id
          AND t69295.row_wid = t70910.row_wid
          AND t69295.row_wid = t427832.owner_postn_wid
          AND t70910.lvl1anc_divn = 'SomeValue'
          AND t163300.mktpln_name = 'SomeValue 2008'
          AND t426428.row_wid = t427832.contact_rnk_wid
          AND t427832.datasource_num_id = 10008.0
          AND t427832.accnt_con_flg = 'C'
          AND CAST (t426428.ranking AS INTEGER) <> 1
          AND t70910.lvl1anc_postn LIKE 'UK-1-%'
          AND (t69295.datasource_num_id IN (-1.0, 10008.0))
          AND (t163300.datasource_num_id IN (-1.0, 10008.0))
          AND (t426428.tc_cat IN ('Professional Overall', 'Unspecified'))
          AND (t426428.datasource_num_id IN (-1.0, 10008.0))
          AND t70910.lvl1anc_postn NOT LIKE '%OLD%'
         )
GROUP BY t70910.lvl1anc_postn,
         t70910.lvl1anc_divn,
         CONCAT (CONCAT (t153459.emp_last_name, ', '), t153459.emp_fst_name)
  HAVING 0 <
            COUNT
               (DISTINCT CASE
                   WHEN t427832.accnt_con_flg = 'C'
                      THEN t427832.contact_wid
                END
               )
     AND 0 <
            COUNT
               (DISTINCT CASE
                   WHEN t427832.accnt_con_flg = 'C'
                      THEN t427832.contact_wid
                END
               )
ORDER BY c3, c4, c1;

By examining the execution plan with the “gather_plan_statistics” hint, I noticed quite a difference between the estimated number of rows (186) for the WC_PROD_RANK_TC_A materialized view access and the actual number of rows (912). I started with dynamic sampling at levels 3 and 4, and they already made the query run in 47 seconds. It was immediately visible that the cardinality estimate for the WC_PROD_RANK_TC_A part was now completely accurate: 912.

However, at that moment dynamic sampling was not an option: we could not add hints, we could not alter sessions, and we could not simply set it at the system level without testing (a representative set of) all other reports. Testing dynamic sampling level 4 is certainly planned for the near future.
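For reference, these are the three levels at which dynamic sampling can be switched on; the first two were not available to us, and the third required testing all other reports first (the alias and level shown are just examples matching this query):

```sql
-- Statement level, via a hint (not possible: we could not add hints):
SELECT /*+ dynamic_sampling(t426428 4) */ *
  FROM wc_prod_rank_tc_a t426428;

-- Session level (not possible: we could not alter sessions):
ALTER SESSION SET optimizer_dynamic_sampling = 4;

-- System level (possible, but it affects every report, hence the testing):
ALTER SYSTEM SET optimizer_dynamic_sampling = 4;
```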

Here is the explain plan for this part of the query without dynamic sampling:

SELECT *   FROM wc_prod_rank_tc_a t426428  
WHERE CAST (t426428.ranking AS INTEGER) <> 1   
  AND (t426428.tc_cat IN ('Professional Overall', 'Unspecified'))   
  AND (t426428.datasource_num_id IN (-1.0, 10008.0));
Plan
ALL_ROWS SELECT STATEMENT Cost: 4 Bytes: 24,924 Cardinality: 186 Time: 1
  2 PARTITION LIST INLIST Cost: 4 Bytes: 24,924 Cardinality: 186 Time: 1 Partition #: 1 Partitions accessed #KEY(INLIST)
    1 SIEBELP.WC_PROD_RANK_TC_A Filter Predicates: ("T426428"."DATASOURCE_NUM_ID"=(-1) OR "T426428"."DATASOURCE_NUM_ID"=10008.0) AND ("T426428"."TC_CAT"='Professional Overall' OR "T426428"."TC_CAT"='Unspecified') AND CAST("T426428"."RANKING" AS INTEGER)<>1 MAT_VIEW ACCESS FULL MAT_VIEW Cost: 4 Bytes: 24,924 Cardinality: 186 Time: 1 Partition #: 1 Partitions accessed #KEY(INLIST)

And with dynamic sampling level 4 (level 3 in this case was also enough):

Plan
ALL_ROWS SELECT STATEMENT Cost: 4 Bytes: 122,208 Cardinality: 912 Time: 1
  2 PARTITION LIST INLIST Cost: 4 Bytes: 122,208 Cardinality: 912 Time: 1 Partition #: 1 Partitions accessed #KEY(INLIST)
    1 SIEBELP.WC_PROD_RANK_TC_A Filter Predicates: ("T426428"."DATASOURCE_NUM_ID"=(-1) OR "T426428"."DATASOURCE_NUM_ID"=10008.0) AND ("T426428"."TC_CAT"='Professional Overall' OR "T426428"."TC_CAT"='Unspecified') AND CAST("T426428"."RANKING" AS INTEGER)<>1 MAT_VIEW ACCESS FULL MAT_VIEW Cost: 4 Bytes: 122,208 Cardinality: 912 Time: 1 Partition #: 1 Partitions accessed #KEY(INLIST)

So, with dynamic sampling “not an option for now”, my next plan was to create a (function-based) index on the predicate columns of this table:

CREATE INDEX SIEBELP.TVB_WC_PROD_RANK_TC_A
ON SIEBELP.WC_PROD_RANK_TC_A (TC_CAT, CAST("RANKING" AS INTEGER), ROW_WID)  
LOCAL;
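As it turned out (see the discussion in the comments), the new index only influenced the estimate after regular table-level statistics were gathered, so that the index statistics became global statistics. A sketch of such a gathering call:

```sql
-- Gather table-level statistics so the new index gets global statistics
-- (GLOBAL_STATS = 'YES') rather than statistics aggregated from partitions.
BEGIN
   DBMS_STATS.gather_table_stats (
      ownname => 'SIEBELP',
      tabname => 'WC_PROD_RANK_TC_A',
      cascade => TRUE);
END;
/
```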

And now look at that same statement’s explain plan:

Plan
ALL_ROWS SELECT STATEMENT Cost: 4 Bytes: 406,422 Cardinality: 3,033 Time: 1
  2 PARTITION LIST INLIST Cost: 4 Bytes: 406,422 Cardinality: 3,033 Time: 1 Partition #: 1 Partitions accessed #KEY(INLIST)
    1 SIEBELP.WC_PROD_RANK_TC_A Filter Predicates: ("T426428"."DATASOURCE_NUM_ID"=(-1) OR "T426428"."DATASOURCE_NUM_ID"=10008.0) AND ("T426428"."TC_CAT"='Professional Overall' OR "T426428"."TC_CAT"='Unspecified') AND CAST("T426428"."RANKING" AS INTEGER)<>1 MAT_VIEW ACCESS FULL MAT_VIEW Cost: 4 Bytes: 406,422 Cardinality: 3,033 Time: 1 Partition #: 1 Partitions accessed #KEY(INLIST)

The estimate is not entirely accurate, but overestimating by a factor of about 3 (3,033 versus the actual 912) was, in this case, much better than the original underestimate by a factor of 5 (186). With the index in place, even though it was not used to execute the query but only to estimate the cardinality, the query returned in about 40 seconds instead of more than an hour.


2 thoughts on “How the optimizer may use an index for cardinality estimates without actually using it”

  1. Hmm… that is interesting.
    By any chance, do you know what exactly “caused” the optimizer to “guess” a different cardinality, especially when the index is not being used? It would be great to learn that.

    • Hi Narendra,

      I did some analysis on that, and the only statistics-related difference I could find after creating the index was the extra index itself and therefore its num_distinct statistic. The funny thing was that right after creating the index, the table-level statistics showed a very inaccurate num_distinct statistic and GLOBAL_STATS was set to NO, meaning the index statistics at the table level were derived from the partition-level index statistics. At that moment, the explain plan did not yet show any difference from the situation without the index. But after gathering statistics at the table level, the statistics for the new index changed to GLOBAL_STATS=YES and the num_distinct statistic became completely accurate (and, also important: nothing changed in the column statistics for the table's columns).

      After that, the cardinality estimate in the explain plan changed to what I mentioned in my post (3,033). What I derive from this, though I may be missing something, is that the optimizer, which has to use table-level statistics because the filter on the partition key (datasource_num_id) hits two partitions instead of exactly one, decides not to use the index statistic as long as it is inaccurate (derived from partition statistics). But after statistics are gathered at the table level, it does decide to use this information to calculate the cardinality. Since the extra index, and therefore the extra statistics on it, was the only difference I could find, I assume for now that it is the index statistics (not the column statistics) it uses to calculate the cardinality. And that might be the reason: Jonathan Lewis has a post on cardinality feedback; comment (5) there contains a remark on this.
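The index statistics mentioned above can be inspected in the data dictionary; a sketch (DISTINCT_KEYS is the index-level counterpart of num_distinct):

```sql
-- Check whether the table-level statistics of the index are global
-- (GLOBAL_STATS = 'YES') or derived from partition-level statistics.
SELECT index_name, distinct_keys, num_rows, global_stats, last_analyzed
  FROM all_indexes
 WHERE owner = 'SIEBELP'
   AND table_name = 'WC_PROD_RANK_TC_A';
```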
