Varlena, LLC | PostgreSQL General Bits Newsletter

PostgreSQL Training,
Consulting & Support

By A. Elein Mustain

08-Nov-2004 Issue: 92

Archives |

General Tidbits |

Google General Bits |

Prev |

General Bits is a column loosely based on the PostgreSQL mailing list pgsql-general.
To find out more about the pgsql-general list and PostgreSQL, see www.PostgreSQL.org.

Query Flattening
Improving Queries	06-Nov-2004

In many applications, queries grow complex by having criteria tacked onto the end (or middle!) of them instead of integrating the criteria into the query as a whole. This is often due to programmers who are used to procedural languages. In procedural languages subroutines and isolation of objects is important. SQL is not a procedural language. It is a language which draws a flow of data from several sources with various criteria and operations. Thinking in flows or cascades of data is quite distinct from the thought process required for a procedural language. Thinking programmatically for queries results in queries that are constructed like the Winchester Mystery House.

When faced with improving existing queries, there is a fair amount of analysis and improvement of queries that can be done even without running the query through EXPLAIN ANALYZE. Making the effort to understand the expected result of each component of a query is vital to being able to improve the queries. This analysis can be done algebraicly or intuitively. My approach is to understand what the results of each component are and to reduce it intuitively. Here are some guidelines that I use.

The first thing to notice is unnecessary subselects. When the target is equivalent in nested subselects, it is likely that they can be collapsed into one query.

Whenever both SUM() and COUNT() are used on the same columns in the same query it bears looking into. The SUM( COUNT(*) ) can usually be the equivalent of a COUNT(*).

Look also for multiple DISTINCT clauses, especially on the same target list and similarly multiple GROUP BY statements on similar subselects. Eliminating subselects with unnecessary GROUP BY or DISTINCT clauses can simplify and speed up your query.

Separating functions into different subqueries can also make for unnecessary subqueries. Remember that aggregates referenced more than once in the same query scope will be shared and that functions, though not aggregates, can be nested.

Although the query planner (post 7.3) will probably reorder the joins effectively where parentheses isolate portions of the joins, for clarity and to prevent the possibility of overriding the planner, it is best to flatten the joins into one query.

In this example, the things that jump out at us are duplicate distinct selections with the same target list and two inner joins without further qualification. The DISTINCTS are collapsible. And (SELECT ... INNER JOIN ...) s INNER JOIN is equivalent to SELECT ... INNER JOIN ... INNER JOIN.

   --
   -- original query 1
   --
   -- EXPLAIN ANALYZE 
   SELECT DISTINCT acol, bcol, dcol, ecol
   FROM (
      SELECT DISTINCT acol, bcol, dcol, ecol
      FROM ONE o
      INNER JOIN two t USING (acol )
   ) AS foo
   INNER JOIN three th USING (acol, dcol);
   
   --
   -- improved query
   -- EXPLAIN ANALYZE 
   SELECT DISTINCT acol, bcol, dcol, ecol
   FROM one o
   INNER JOIN two t USING (acol)
   INNER JOIN three th USING (acol, dcol);

On small vacuumed tables, the queries were proven to be equivalent and the time for the second query, per EXPLAIN ANALYZE was a bit less than half the time for the first query.

It is important to check to be sure your improvements are correct and actually improve the query. In this next example one level of improvements makes the query more readable but does not speed it up. The second level of improvements turn out to be wrong (!). The separation of the two GROUP BY clauses are really needed. This second example is a tale of caution.

In this example we see both SUM() and COUNT(*) and the separation of the two by a subquery that only executes CASE statements. These should be able to be collapsed.

   --
   -- original query 2
   --
   \echo query 2 original
   -- EXPLAIN ANALYZE
   SELECT acol, dcol, sum(count_one) AS count_one, 
      sum(count_two) AS count_two, count(DISTINCT gcol) AS gcol_count
   FROM (
      SELECT acol, dcol, gcol,
         CASE WHEN gcol = 1 THEN cnt ELSE 0 END AS count_one,
         CASE WHEN gcol = 2 THEN cnt ELSE 0 END AS count_two
      FROM (
         SELECT acol, dcol, count(*) AS cnt, gcol
         FROM one o
         INNER JOIN two t USING (acol)
         INNER JOIN three the USING (acol, dcol)
         GROUP BY acol, dcol, gcol
      ) AS bar
   ) AS foo
   GROUP BY acol, dcol;
   
   --
   -- improved query
   --
   \echo query 2 improved
   EXPLAIN ANALYZE
   SELECT acol, dcol, sum(count_one) AS count_one,
      sum(count_two) AS count_two, count(DISTINCT gcol) AS gcol_count
   FROM (
      SELECT acol, dcol, gcol,
         CASE WHEN gcol = 1 THEN count(*) ELSE 0 END AS count_one,
         CASE WHEN gcol = 2 THEN count(*) ELSE 0 END AS count_two
      FROM one o
      INNER JOIN two t USING (acol)
      INNER JOIN three the USING (acol, dcol)
      GROUP BY acol, dcol, gcol
   ) AS foo
   GROUP BY acol, dcol;

Taking the improvements one step too far we get a collapsed GROUP BY clause which changes the end result of the query so that it is now incorrect. This is because the count(*) should only be grouped by acol, dcol and not all three columns acol, dcol and gcol.

   \echo bad query 2 
   EXPLAIN ANALYZE
   SELECT acol, dcol,
      (CASE WHEN gcol = 1 THEN count(*) ELSE 0 END) AS count_one,
      (CASE WHEN gcol = 2 THEN count(*) ELSE 0 END) AS count_two,
      count( gcol ) as gcol_count
   FROM one o
   INNER JOIN two t USING (acol)
   INNER JOIN three th USING (acol, dcol)
   GROUP BY acol, dcol, gcol;

Contributors: elein at varlena.com

More on Group By
[SQL] tricky GROUP BY / JOIN question 07-Nov-2004

The example below joins 4 tables ITEM, BRAND, MODEL and CONDITION. In human understandable terms: a [secondhand] Item is of a particular Model and Brand. The Items retail at different prices depending on their condition.

This selection is close to what is wanted.

   SELECT brand.brand_name, model.model_name,
      min (item.price), max (item.price)
      min (condition.position), max (condition.position)
   FROM item
   LEFT OUTER JOIN model  ON model_pk =item.model_fk
   LEFT OUTER JOIN brand  ON brand.brand_pk =model.brand_fk
   LEFT OUTER JOIN condition on condition.condition_pk = item.condition_fk
   GROUP BY brand.brand_name,model.model_name

But what is really wanted is the condition name of the min and max for each model. The name of the condition is different from the position of the item; name is a text field and position is an integer field. This is what the result should look like:

   Brand | Model | Cond | Cond | Price | Price
         |       | min  | max  | min   | max
   -------------------------------------------
   Canon | A1    | Exc  | Mint | 139   | 155   
   Canon | F1N   | Exc++| Mint-| 329   | 379
   Canon | 24mm  | Exc--| Mint+|  99   | 179
   Nikon | 50mm  | Exc--| Mint+| 109   | 119

This query should do the trick according to Tom Lane. It fetches the name of the condition based on the minimum and maximum positions in separate subqueries.

   SELECT brand.brand_name, model.model_name,
      min (item.price), max (item.price)
      (SELECT name FROM condition c1 WHERE position = min(condition.position)),
      (SELECT name FROM condition c2 WHERE position = max(condition.position)),
   FROM ITEM
   LEFT OUTER JOIN model  ON model_pk =item.model_fk
   LEFT OUTER JOIN brand  ON brand.brand_pk =model.brand_fk
   LEFT OUTER JOIN condition ON condition.condition_pk = item.condition_fk
   GROUP BY brand.brand_name,model.model_name

Note well, however, this technique carries an important warning with regards to Postgres versions earlier than 7.4. "[Prior to 7.4]," Tom writes, "we would have mis-interpreted the aggregate calls to indicate aggregation within the sub-selects. The current interpretation is per SQL spec: since the aggregate argument is a variable of the outer select, the aggregation occurs with respect to that select, and the aggregate result is passed down to the sub-select as a scalar."

This means that prior to 7.4, the inner aggregates would be incorrectly run against the condition table rather than the outer query which includes the correct JOINs and GROUP BY clauses.

Contributors: T E Schmitz mailreg at numerixtechnology.de Tom Lane tgl at sss.pgh.pa.us

Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in!

Top