|
||||||||||||
|
||||||||||||
|
||||||||||||
|
||||||||||||
|
||||||||||||
In many applications, queries grow complex by having criteria tacked onto the end (or middle!) of them instead of integrating the criteria into the query as a whole. This is often due to programmers who are used to procedural languages. In procedural languages subroutines and isolation of objects is important. SQL is not a procedural language. It is a language which draws a flow of data from several sources with various criteria and operations. Thinking in flows or cascades of data is quite distinct from the thought process required for a procedural language. Thinking programmatically for queries results in queries that are constructed like the Winchester Mystery House. When faced with improving existing queries, there is a fair amount of analysis and improvement of queries that can be done even without running the query through EXPLAIN ANALYZE. Making the effort to understand the expected result of each component of a query is vital to being able to improve the queries. This analysis can be done algebraicly or intuitively. My approach is to understand what the results of each component are and to reduce it intuitively. Here are some guidelines that I use. The first thing to notice is unnecessary subselects. When the target is equivalent in nested subselects, it is likely that they can be collapsed into one query. Whenever both SUM() and COUNT() are used on the same columns in the same query it bears looking into. The SUM( COUNT(*) ) can usually be the equivalent of a COUNT(*). Look also for multiple DISTINCT clauses, especially on the same target list and similarly multiple GROUP BY statements on similar subselects. Eliminating subselects with unnecessary GROUP BY or DISTINCT clauses can simplify and speed up your query. Separating functions into different subqueries can also make for unnecessary subqueries. Remember that aggregates referenced more than once in the same query scope will be shared and that functions, though not aggregates, can be nested. Although the query planner (post 7.3) will probably reorder the joins effectively where parentheses isolate portions of the joins, for clarity and to prevent the possibility of overriding the planner, it is best to flatten the joins into one query. In this example, the things that jump out at us are duplicate distinct selections with the same target list and two inner joins without further qualification. The DISTINCTS are collapsible. And (SELECT ... INNER JOIN ...) s INNER JOIN is equivalent to SELECT ... INNER JOIN ... INNER JOIN. -- -- original query 1 -- -- EXPLAIN ANALYZE SELECT DISTINCT acol, bcol, dcol, ecol FROM ( SELECT DISTINCT acol, bcol, dcol, ecol FROM ONE o INNER JOIN two t USING (acol ) ) AS foo INNER JOIN three th USING (acol, dcol); -- -- improved query -- EXPLAIN ANALYZE SELECT DISTINCT acol, bcol, dcol, ecol FROM one o INNER JOIN two t USING (acol) INNER JOIN three th USING (acol, dcol);On small vacuumed tables, the queries were proven to be equivalent and the time for the second query, per EXPLAIN ANALYZE was a bit less than half the time for the first query. It is important to check to be sure your improvements are correct and actually improve the query. In this next example one level of improvements makes the query more readable but does not speed it up. The second level of improvements turn out to be wrong (!). The separation of the two GROUP BY clauses are really needed. This second example is a tale of caution. In this example we see both SUM() and COUNT(*) and the separation of the two by a subquery that only executes CASE statements. These should be able to be collapsed. -- -- original query 2 -- \echo query 2 original -- EXPLAIN ANALYZE SELECT acol, dcol, sum(count_one) AS count_one, sum(count_two) AS count_two, count(DISTINCT gcol) AS gcol_count FROM ( SELECT acol, dcol, gcol, CASE WHEN gcol = 1 THEN cnt ELSE 0 END AS count_one, CASE WHEN gcol = 2 THEN cnt ELSE 0 END AS count_two FROM ( SELECT acol, dcol, count(*) AS cnt, gcol FROM one o INNER JOIN two t USING (acol) INNER JOIN three the USING (acol, dcol) GROUP BY acol, dcol, gcol ) AS bar ) AS foo GROUP BY acol, dcol; -- -- improved query -- \echo query 2 improved EXPLAIN ANALYZE SELECT acol, dcol, sum(count_one) AS count_one, sum(count_two) AS count_two, count(DISTINCT gcol) AS gcol_count FROM ( SELECT acol, dcol, gcol, CASE WHEN gcol = 1 THEN count(*) ELSE 0 END AS count_one, CASE WHEN gcol = 2 THEN count(*) ELSE 0 END AS count_two FROM one o INNER JOIN two t USING (acol) INNER JOIN three the USING (acol, dcol) GROUP BY acol, dcol, gcol ) AS foo GROUP BY acol, dcol;Taking the improvements one step too far we get a collapsed GROUP BY clause which changes the end result of the query so that it is now incorrect. This is because the count(*) should only be grouped by acol, dcol and not all three columns acol, dcol and gcol. \echo bad query 2 EXPLAIN ANALYZE SELECT acol, dcol, (CASE WHEN gcol = 1 THEN count(*) ELSE 0 END) AS count_one, (CASE WHEN gcol = 2 THEN count(*) ELSE 0 END) AS count_two, count( gcol ) as gcol_count FROM one o INNER JOIN two t USING (acol) INNER JOIN three th USING (acol, dcol) GROUP BY acol, dcol, gcol;
The example below joins 4 tables ITEM, BRAND, MODEL and CONDITION. In human understandable terms: a [secondhand] Item is of a particular Model and Brand. The Items retail at different prices depending on their condition. This selection is close to what is wanted.
SELECT brand.brand_name, model.model_name, min (item.price), max (item.price) min (condition.position), max (condition.position) FROM item LEFT OUTER JOIN model ON model_pk =item.model_fk LEFT OUTER JOIN brand ON brand.brand_pk =model.brand_fk LEFT OUTER JOIN condition on condition.condition_pk = item.condition_fk GROUP BY brand.brand_name,model.model_nameBut what is really wanted is the condition name of the min and max for each model. The name of the condition is different from the position of the item; name is a text field and position is an integer field. This is what the result should look like: Brand | Model | Cond | Cond | Price | Price | | min | max | min | max ------------------------------------------- Canon | A1 | Exc | Mint | 139 | 155 Canon | F1N | Exc++| Mint-| 329 | 379 Canon | 24mm | Exc--| Mint+| 99 | 179 Nikon | 50mm | Exc--| Mint+| 109 | 119 This query should do the trick according to Tom Lane. It fetches the name of the condition based on the minimum and maximum positions in separate subqueries. SELECT brand.brand_name, model.model_name, min (item.price), max (item.price) (SELECT name FROM condition c1 WHERE position = min(condition.position)), (SELECT name FROM condition c2 WHERE position = max(condition.position)), FROM ITEM LEFT OUTER JOIN model ON model_pk =item.model_fk LEFT OUTER JOIN brand ON brand.brand_pk =model.brand_fk LEFT OUTER JOIN condition ON condition.condition_pk = item.condition_fk GROUP BY brand.brand_name,model.model_nameNote well, however, this technique carries an important warning with regards to Postgres versions earlier than 7.4. "[Prior to 7.4]," Tom writes, "we would have mis-interpreted the aggregate calls to indicate aggregation within the sub-selects. The current interpretation is per SQL spec: since the aggregate argument is a variable of the outer select, the aggregation occurs with respect to that select, and the aggregate result is passed down to the sub-select as a scalar." This means that prior to 7.4, the inner aggregates would be incorrectly run against the condition table rather than the outer query which includes the correct JOINs and GROUP BY clauses.
|
||||||||||||
Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in! Copyright A. Elein Mustain 2003, 2004, 2005, 2006, 2007, 2008, 2009 |