|
||||||||||||
| |
||||||||||||
|
||||||||||||
| |
||||||||||||
|
||||||||||||
In many applications, queries grow complex by having criteria tacked onto the end (or middle!) of them instead of integrating the criteria into the query as a whole. This is often due to programmers who are used to procedural languages. In procedural languages subroutines and isolation of objects is important. SQL is not a procedural language. It is a language which draws a flow of data from several sources with various criteria and operations. Thinking in flows or cascades of data is quite distinct from the thought process required for a procedural language. Thinking programmatically for queries results in queries that are constructed like the Winchester Mystery House. When faced with improving existing queries, there is a fair amount of analysis and improvement of queries that can be done even without running the query through EXPLAIN ANALYZE. Making the effort to understand the expected result of each component of a query is vital to being able to improve the queries. This analysis can be done algebraicly or intuitively. My approach is to understand what the results of each component are and to reduce it intuitively. Here are some guidelines that I use. The first thing to notice is unnecessary subselects. When the target is equivalent in nested subselects, it is likely that they can be collapsed into one query. Whenever both SUM() and COUNT() are used on the same columns in the same query it bears looking into. The SUM( COUNT(*) ) can usually be the equivalent of a COUNT(*). Look also for multiple DISTINCT clauses, especially on the same target list and similarly multiple GROUP BY statements on similar subselects. Eliminating subselects with unnecessary GROUP BY or DISTINCT clauses can simplify and speed up your query. Separating functions into different subqueries can also make for unnecessary subqueries. Remember that aggregates referenced more than once in the same query scope will be shared and that functions, though not aggregates, can be nested. Although the query planner (post 7.3) will probably reorder the joins effectively where parentheses isolate portions of the joins, for clarity and to prevent the possibility of overriding the planner, it is best to flatten the joins into one query. In this example, the things that jump out at us are duplicate distinct selections with the same target list and two inner joins without further qualification. The DISTINCTS are collapsible. And (SELECT ... INNER JOIN ...) s INNER JOIN is equivalent to SELECT ... INNER JOIN ... INNER JOIN.
--
-- original query 1
--
-- EXPLAIN ANALYZE
SELECT DISTINCT acol, bcol, dcol, ecol
FROM (
SELECT DISTINCT acol, bcol, dcol, ecol
FROM ONE o
INNER JOIN two t USING (acol )
) AS foo
INNER JOIN three th USING (acol, dcol);
--
-- improved query
-- EXPLAIN ANALYZE
SELECT DISTINCT acol, bcol, dcol, ecol
FROM one o
INNER JOIN two t USING (acol)
INNER JOIN three th USING (acol, dcol);
On small vacuumed tables, the queries
were proven to be equivalent and the time for the second query,
per EXPLAIN ANALYZE was a bit less
than half the time for the first query.
It is important to check to be sure your improvements are correct and actually improve the query. In this next example one level of improvements makes the query more readable but does not speed it up. The second level of improvements turn out to be wrong (!). The separation of the two GROUP BY clauses are really needed. This second example is a tale of caution. In this example we see both SUM() and COUNT(*) and the separation of the two by a subquery that only executes CASE statements. These should be able to be collapsed.
--
-- original query 2
--
\echo query 2 original
-- EXPLAIN ANALYZE
SELECT acol, dcol, sum(count_one) AS count_one,
sum(count_two) AS count_two, count(DISTINCT gcol) AS gcol_count
FROM (
SELECT acol, dcol, gcol,
CASE WHEN gcol = 1 THEN cnt ELSE 0 END AS count_one,
CASE WHEN gcol = 2 THEN cnt ELSE 0 END AS count_two
FROM (
SELECT acol, dcol, count(*) AS cnt, gcol
FROM one o
INNER JOIN two t USING (acol)
INNER JOIN three the USING (acol, dcol)
GROUP BY acol, dcol, gcol
) AS bar
) AS foo
GROUP BY acol, dcol;
--
-- improved query
--
\echo query 2 improved
EXPLAIN ANALYZE
SELECT acol, dcol, sum(count_one) AS count_one,
sum(count_two) AS count_two, count(DISTINCT gcol) AS gcol_count
FROM (
SELECT acol, dcol, gcol,
CASE WHEN gcol = 1 THEN count(*) ELSE 0 END AS count_one,
CASE WHEN gcol = 2 THEN count(*) ELSE 0 END AS count_two
FROM one o
INNER JOIN two t USING (acol)
INNER JOIN three the USING (acol, dcol)
GROUP BY acol, dcol, gcol
) AS foo
GROUP BY acol, dcol;
Taking the improvements one step too far we get a collapsed
GROUP BY clause which changes the end result of the query
so that it is now incorrect.
This is because the count(*) should only be grouped by
acol, dcol and not all three columns acol, dcol and gcol.
\echo bad query 2
EXPLAIN ANALYZE
SELECT acol, dcol,
(CASE WHEN gcol = 1 THEN count(*) ELSE 0 END) AS count_one,
(CASE WHEN gcol = 2 THEN count(*) ELSE 0 END) AS count_two,
count( gcol ) as gcol_count
FROM one o
INNER JOIN two t USING (acol)
INNER JOIN three th USING (acol, dcol)
GROUP BY acol, dcol, gcol;
The example below joins 4 tables ITEM, BRAND, MODEL and CONDITION. In human understandable terms: a [secondhand] Item is of a particular Model and Brand. The Items retail at different prices depending on their condition. This selection is close to what is wanted.
SELECT brand.brand_name, model.model_name,
min (item.price), max (item.price)
min (condition.position), max (condition.position)
FROM item
LEFT OUTER JOIN model ON model_pk =item.model_fk
LEFT OUTER JOIN brand ON brand.brand_pk =model.brand_fk
LEFT OUTER JOIN condition on condition.condition_pk = item.condition_fk
GROUP BY brand.brand_name,model.model_name
But what is really wanted is the condition name of the
min and max for each model. The name of the condition
is different from the position of the item; name is a text
field and position is an integer field. This is what the
result should look like:
Brand | Model | Cond | Cond | Price | Price
| | min | max | min | max
-------------------------------------------
Canon | A1 | Exc | Mint | 139 | 155
Canon | F1N | Exc++| Mint-| 329 | 379
Canon | 24mm | Exc--| Mint+| 99 | 179
Nikon | 50mm | Exc--| Mint+| 109 | 119
This query should do the trick according to Tom Lane. It fetches the name of the condition based on the minimum and maximum positions in separate subqueries.
SELECT brand.brand_name, model.model_name,
min (item.price), max (item.price)
(SELECT name FROM condition c1 WHERE position = min(condition.position)),
(SELECT name FROM condition c2 WHERE position = max(condition.position)),
FROM ITEM
LEFT OUTER JOIN model ON model_pk =item.model_fk
LEFT OUTER JOIN brand ON brand.brand_pk =model.brand_fk
LEFT OUTER JOIN condition ON condition.condition_pk = item.condition_fk
GROUP BY brand.brand_name,model.model_name
Note well, however, this technique carries an important warning with regards
to Postgres versions earlier than 7.4.
"[Prior to 7.4]," Tom writes, "we would have mis-interpreted the aggregate
calls to indicate aggregation within the sub-selects.
The current interpretation is per SQL spec:
since the aggregate argument is a variable of the outer select, the
aggregation occurs with respect to that select, and the aggregate result
is passed down to the sub-select as a scalar."
This means that prior to 7.4, the inner aggregates would be incorrectly run against the condition table rather than the outer query which includes the correct JOINs and GROUP BY clauses.
|
||||||||||||
|
Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in!
Copyright A. Elein Mustain 2003, 2004, 2005, 2006, 2007, 2008, 2009
|
||||||||||||