|
||||||||||||
|
||||||||||||
|
||||||||||||
|
||||||||||||
|
||||||||||||
The problem at hand is to select the customers delinquent on their payments. So we want to select the customers where the customers' purchases are 30 days old. The customer table and the purchases tables are what we want. The query that should be in your mind first off follows. Unfortunately, the best query does not always come to mind first. SELECT name, address FROM customers c JOIN purchases ON (purchaser_id = customer_id) WHERE purchase_date > current_date-30;This query creates a set of customers and purchases and then filters in the ones with purchases 30 days ago. It leverages the JOIN capability of the database and lets the optimizer do the filtering. This query does the same thing and gives the same results. But is unnecessarily ugly, complicated and slow. SELECT name, address FROM customers where EXISTS (SELECT 1 FROM purchases WHERE purchaser_id=customers.id AND purchase_date > current_date-30 LIMIT 1);The statement does this: look at each customer, then see if that customer has a purchase greater than 30 days ago. This type of statement construction is a result of not really understanding and leveraging JOINs. This is a common misunderstanding with people who are used to programming in loops and not in sets. A variation on the above query uses IN. SELECT name, address FROM customers WHERE id IN (SELECT purchaser_id FROM purchases WHERE purchase_date > current_date-30 LIMIT 1);This query is no better than the previous query. EXISTS and IN are comparable functionality. Both of these functions are correct but are less than optimal queries. They each do a select of purchases for each customer. (Note also that in 7.3 and earlier IN is particularly slower than EXISTS.) When constructing a SELECT statement, first build the set of data that you want to return as well as the data you want to qualify on. Use the JOIN statement to bring two (or more) tables together to construct that set. The target list will be the data you want to return and the condition should identify the primary condition of the query.
In my family, we are mostly women, kids and pets. This is a picture
of the family tree from my mother with three daughters and the
sisters' kids and all of our pets. The people are blue the pets
are brown. (Names have been changed to protect the innocent :)
The tables we have are these: What I want to know is how may kids and pets each person has. So that query is easy. SELECT people.login, COUNT(pets.name) AS pets, COUNT(kids.name) AS kids FROM people JOIN pets ON (login=owner) JOIN kids ON (login = parent1) GROUP BY people.login; login | pets | kids -----------+------+------ ann | 3 | 3 rachel | 3 | 3 cora | 1 | 1 stephanie | 1 | 1 (4 rows)Whoa. That was terribly wrong. Ann and Rachel only have one pet each. But they do each have three kids. For some reason, Cora and Stephanie were right--one pet and one kid each. But Cat and I am missing completely. Oh. That must be because I should have used a left outer join. Let's try again. SELECT people.login, COUNT(pets.name) AS pets, COUNT(kids.name) AS kids FROM people LEFT OUTER JOIN pets ON (login=owner) LEFT OUTER JOIN kids ON (login = parent1) group by people.login; login | pets | kids -----------+------+------ ann | 3 | 3 cat | 1 | 0 CJ | 0 | 0 cora | 1 | 1 david | 0 | 0 elein | 4 | 0 laura | 0 | 0 rachel | 3 | 3 stephanie | 1 | 1 (9 rows)Well, OK. That brought in the rest of the gang. But Ann and Rachel's counts are still wrong. What is going on? Let us look at the data in kids and people separately. SELECT login, count(pets.name ) FROM people LEFT OUTER JOIN pets ON (login=owner) GROUP BY login; login | count -----------+------- ann | 1 cat | 1 CJ | 0 cora | 1 david | 0 elein | 4 laura | 0 rachel | 1 stephanie | 1 SELECT login, COUNT(kids.name ) FROM people LEFT OUTER JOIN kids ON (login=parent1) GROUP BY login; login | count -----------+------- ann | 3 cat | 0 CJ | 0 cora | 1 david | 0 elein | 0 laura | 0 rachel | 3 stephanie | 1These numbers are correct in every case. What is really happening when both kids and pets are brought in together? Let us remove the COUNT and look at the data. SELECT login, pets.name AS pet, kids.name AS kid FROM people LEFT OUTER JOIN pets on (login=owner) LEFT OUTER JOIN kids ON (login = parent1); login | pet | kid -----------+-----------+----------- ann | buster | elein ann | buster | rachel ann | buster | stephanie cat | edward | CJ | | cora | bowwow | CJ david | | elein | geordie | elein | belle | elein | bubbajack | elein | leaf | laura | | rachel | kiwi | cat rachel | kiwi | laura rachel | kiwi | david stephanie | chewy | cora (16 rows)Oh, my. The pets are repeated for every kid. This is an inadvertent Cartesian product of kids and pets. This accounts for both the correct and incorrect answers above. Ann and Rachel each have one pet but three kids. But the pets are repeated for each kid and it looks like they have more pets than they do. Because there is no correlation between pets and kids we will have to JOIN them to people separately. Abandoning the JOIN syntax for the where clause we join in the people table twice: SELECT p1.login, COUNT(k.name) AS kids, COUNT(pt.name) AS pets FROM kids k , pets pt, people p1, people p2 WHERE k.parent1 = p1.login AND pt.owner = p2.login AND p1.login = p2.login GROUP BY p1.login; login | kids | pets -----------+------+------ ann | 3 | 3 rachel | 3 | 3 cora | 1 | 1 stephanie | 1 | 1 (4 rows)Good grief. That was wrong, too! Let us back up a bit. We want to join people to pets and kids separately and then we want the counts in the resulting target list. So let us separate things like that. SELECT login, kids, pets FROM (SELECT login, COUNT(kids.name) AS kids FROM people LEFT OUTER JOIN kids ON (login=parent1) group by login) k JOIN (SELECT login, COUNT(pets.name) AS pets FROM people LEFT OUTER JOIN pets ON (login=owner) GROUP BY login ) p USING (login); login | kids | pets -----------+------+------ ann | 3 | 1 cat | 0 | 1 CJ | 0 | 0 cora | 1 | 1 david | 0 | 0 elein | 0 | 4 laura | 0 | 0 rachel | 3 | 1 stephanie | 1 | 1 (9 rows)Aha! That is correct. Finally. The two issues we addressed were the inadvertent Cartesian product multiplying the number of pets and the JOIN type leaving out the people with no kids or no pets. We did this by using LEFT OUTER JOIN first to get back the missing people. Then we looked at what the real data was instead of the counts. This told us about the Cartesian product. When we phrased the question correctly, we found the correct methodology. Test your aggregates, especially if you are using multiple aggregates in one statement. More often than not, aggregates on different things cannot be done without subselects. Use subselects sparingly, but you will find them invaluable for statements with multiple aggregates.
|
||||||||||||
Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in! Copyright A. Elein Mustain 2003, 2004, 2005, 2006, 2007, 2008, 2009 |