Here is a slow query which needs some help. It looks like this:

   SELECT DISTINCT message_board_topics.rec_num,
          message_board_topics.topic_name,
          message_board_topics.topic_body,
          message_board_topics.topic_author,
          message_board_topics.topic_author_email,
          message_board_topics.topic_updated,
          message_board_topics.administrator_topic,
          message_board_topics.number_of_comments,
          to_char(topic_date, 'MM.DD.YYYY') as formatted_date
     FROM message_board_topics
     LEFT JOIN message_board_comments
          ON (message_board_comments.topic_id = message_board_topics.rec_num)
    WHERE upper(topic_name) LIKE upper('madbrowser')
       OR upper(topic_body) LIKE upper('madbrowser')
       OR upper(topic_author) LIKE upper('madbrowser')
       OR upper(message_board_comments.comment_author) LIKE upper('madbrowser')
    ORDER BY message_board_topics.rec_num DESC;

There are a couple of obvious things to start off with. We assume there are already indexes on the join columns. The first is to change LIKE to = since no partial match is needed. The next thing to do is to run EXPLAIN ANALYZE, which shows us that the query is doing a sequential scan. By turning off enable_seqscan we can try EXPLAIN ANALYZE again and compare:

   With enable_seqscan = false: Total runtime: 40855.59 msec
   With enable_seqscan = true:  Total runtime: 13108.33 msec

The planner is choosing the sequential scan because it is faster. "(Hot diggety ;-))", says Tom Lane. Unfortunately the query is still slow. Another idea is to provide functional indexes on upper(topic_name), etc. That could certainly help, but the real problem is the multitude of ORs in the statement. It is possible to divide out the ORs by using a UNION. It might look like this (with a minimal target list):
   select message_board_topics.rec_num
     from message_board_topics
    where upper(topic_name) LIKE upper('madbrowser')
   union
   select message_board_topics.rec_num
     from message_board_topics
    where upper(topic_body) LIKE upper('madbrowser')
   union
   select message_board_topics.rec_num
     from message_board_topics
    where upper(topic_author) LIKE upper('madbrowser')
   union
   select message_board_topics.rec_num
     from message_board_topics, message_board_comments
    where message_board_comments.topic_id = message_board_topics.rec_num
      and upper(message_board_comments.comment_author) LIKE upper('madbrowser')
    order by 1 desc;

This query was pronounced "MUCH faster than my old query...". However, Stephan Szabo noted, "the best solution is probably some sort of full text indexing solution. Putting the keywords from the various columns you want to index along with the rec_num (or topic_id) of the row and an index on the text. Then you could join message_board_topics with that and probably get a much better plan."
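The OR-to-UNION rewrite is not PostgreSQL-specific. As a minimal, runnable sketch of the same idea, here is a demonstration using Python's built-in sqlite3 module; the toy tables and data are invented stand-ins for the message board schema above:

```python
import sqlite3

# Toy schema loosely modeled on the thread's message board (names invented).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE topics (rec_num INTEGER PRIMARY KEY, topic_author TEXT);
    CREATE TABLE comments (topic_id INTEGER, comment_author TEXT);
    INSERT INTO topics VALUES (1, 'madbrowser'), (2, 'alice');
    INSERT INTO comments VALUES (2, 'madbrowser'), (2, 'bob');
""")

# One UNION branch per former OR condition. UNION removes duplicates,
# so the DISTINCT from the original query is no longer needed.
rows = con.execute("""
    SELECT rec_num FROM topics
     WHERE upper(topic_author) = upper('madbrowser')
    UNION
    SELECT t.rec_num FROM topics t
      JOIN comments c ON c.topic_id = t.rec_num
     WHERE upper(c.comment_author) = upper('madbrowser')
     ORDER BY 1 DESC
""").fetchall()
print([r[0] for r in rows])  # → [2, 1]
```

Each branch can now use its own index in isolation, which is exactly what the planner could not do with the chained ORs.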
The following SQL causes the obscure error message:

   ERROR: ExecutePlan: (junk) `ctid' is NULL!

   UPDATE Tsummary
      SET participants = count(distinct credit_id),
          teams = count(distinct team_id)
     FROM email_contrib_today ect
    WHERE ect.project_id = projectid

The underlying complaint of the ERROR is that the aggregate count() is not in the context of a SELECT. Having an aggregate at the top level of an UPDATE statement is illegal per the SQL spec. There are some alternatives: either you can use an UPDATE with a SELECT, or you can use pl/pgsql. In both cases you preselect the aggregate and assign its result in the UPDATE statement. The UPDATE with SELECT can be done either field by field or with one SELECT statement. The first statement below works field by field; the second shows how to select more than one value from the SELECT statement.

   UPDATE Tsummary
      SET participants = (select count(distinct credit_id)
                            from email_contrib_today ect
                           where ect.project_id = projectid),
          teams = (select count(distinct team_id)
                     from email_contrib_today ect
                    where ect.project_id = projectid);

   UPDATE Tsummary
      SET participants = f.ccount,
          teams = f.tcount
     FROM (SELECT count(distinct credit_id) as ccount,
                  count(distinct team_id) as tcount
             FROM email_contrib_today ect
            WHERE ect.project_id = projectid) f;
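The preselect-the-aggregate pattern carries over to other SQL engines as well. Here is a hedged, self-contained illustration using Python's sqlite3 module; the summary and contrib tables are invented stand-ins for Tsummary and email_contrib_today:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE summary (project_id INTEGER, participants INTEGER);
    CREATE TABLE contrib (project_id INTEGER, credit_id INTEGER);
    INSERT INTO summary VALUES (7, 0);
    INSERT INTO contrib VALUES (7, 100), (7, 100), (7, 200);
""")

# Illegal:  UPDATE summary SET participants = count(DISTINCT credit_id) ...
# Legal:    wrap the aggregate in a subselect and assign its result.
con.execute("""
    UPDATE summary
       SET participants = (SELECT count(DISTINCT credit_id)
                             FROM contrib
                            WHERE contrib.project_id = summary.project_id)
""")
print(con.execute("SELECT participants FROM summary").fetchone()[0])  # → 2
```

The aggregate now lives inside a SELECT, which is where the spec requires it to be.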
For large amounts of data it is important to know how much storage to anticipate. This information should help guide you in estimating the proper amount of storage for your system in terms of row and index memory usage. For the sake of discussion we will follow the example in the thread, which has three float4 columns per row. Estimating column size in general is a whole other topic. A minimal row does not have OIDs or nulls. Assuming Intel-like alignment rules, and that the column widths are fixed multiples of four, a row will have 28 bytes of row overhead plus the column data: 28 + (3 x 4) = 40 bytes per row.
Therefore 800,000 such rows take about 32MB, while their indexes take up about 16MB. However, there are some wishy-washy factors to take into account (besides columns with varying data types). Creating an index takes more space than the finished index itself, though it should not take more than twice its size; this is transient, occurring only while the entire index is being built. Data is usually stored on heap pages, which are packed as full as possible. Indexes are stored on b-tree index pages, which are initially 65% to 75% full, following the standard theory of the best configuration for a b-tree. (Completely full b-tree pages in a non-static table cause excessive page splitting, which destroys the physical ordering and in turn negates the benefit of saving a few index pages.)
Each column of any type begins fresh on a 4-byte boundary (or possibly 8, depending on your architecture). Fixed and variable length character columns are stored as varlenas, that is, with a four-byte length word followed by the data. This is the most flexible method of storage as it supports both single and multi-byte character sets. The storage size of any fixed-size text-like column is:

   4 + size of col + (4 - (size of col % 4))

The following table was defined with all fixed-size character types. It was defined using fixed sizes because the data was truly fixed size.

   mdcdb=# \d codes
                  Table "codes"
    Attribute        | Type          | storage size
   ------------------+---------------+-----------------------------
    rpt_code         | character(5)  | 4 + 5 + (4 - (5 % 4))  = 12
    xcd_code         | character(5)  | 4 + 5 + (4 - (5 % 4))  = 12
    mod_1            | character(2)  | 4 + 2 + (4 - (2 % 4))  = 8
    mod_2            | character(2)  | 4 + 2 + (4 - (2 % 4))  = 8
    carrier          | character(5)  | 4 + 5 + (4 - (5 % 4))  = 12
    state_code       | character(2)  | 4 + 2 + (4 - (2 % 4))  = 8
    specialty_code   | character(2)  | 4 + 2 + (4 - (2 % 4))  = 8
    par              | character(1)  | 4 + 1 + (4 - (1 % 4))  = 8
    type_code        | character(1)  | 4 + 1 + (4 - (1 % 4))  = 8
    place_code       | character(2)  | 4 + 2 + (4 - (2 % 4))  = 8
    submitted_charge | character(13) | 4 + 13 + (4 - (13 % 4)) = 20

The column storage sizes total 112 bytes. The row also has an overhead of 28 bytes, so the row size is 140, assuming the row has no OIDs or nulls. It is interesting to note that the table could have been defined with all columns of type text and still maintain the same storage size. It would, however, lose the blank padding provided by the character type.
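The per-column arithmetic above is easy to script. This small Python helper (mine, not from the original article) applies the article's formula to the codes table and reproduces the 112-byte and 140-byte totals:

```python
def char_col_storage(n):
    """Storage for a character(n) column per the article's formula:
    a 4-byte varlena length word plus the data, padded to a 4-byte
    boundary. (Note: as written, the formula adds a full pad word
    when n is already a multiple of 4.)"""
    return 4 + n + (4 - n % 4)

ROW_OVERHEAD = 28  # bytes of row overhead, assuming no OIDs and no nulls

# Column widths of the codes table, in declaration order.
widths = [5, 5, 2, 2, 5, 2, 2, 1, 1, 2, 13]
data = sum(char_col_storage(n) for n in widths)
print(data)                 # → 112
print(ROW_OVERHEAD + data)  # → 140
```

Swapping in different widths gives a quick first-cut estimate for any similar all-character table.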
If you have got a set of duplicate rows and want to update only one of them, you cannot use the usual update statement because both will be changed. (We will not go into why having duplicate rows is bad and wrong here :-) and just examine the problem in question. Obviously the data was bad and we just want to correct it.) There are two fields that can uniquely identify the rows, the OID and the CTID. By selecting the rows' OIDs (or CTIDs) you can choose one and update it:

   SELECT oid FROM table WHERE ... LIMIT 1;
   UPDATE table SET ... WHERE oid = ...;

This doesn't really need to be done in two queries, though. It can be done using a simple update from a selection like this:

   UPDATE foo SET field1 = 'ZZZ'
    WHERE oid = (SELECT min(oid) FROM foo WHERE field2 = 'XXX');

Note, of course, that max(oid) would work as well.
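PostgreSQL's oid/ctid trick has rough analogs elsewhere. As a runnable sketch, SQLite's implicit rowid can play the same tie-breaking role; the table and values below are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (field1 TEXT, field2 TEXT)")
# Two fully duplicate rows -- only their hidden rowids differ.
con.executemany("INSERT INTO foo VALUES (?, ?)",
                [("AAA", "XXX"), ("AAA", "XXX")])

# Update just one of the duplicates by picking the smallest rowid,
# exactly like the min(oid) trick above.
con.execute("""
    UPDATE foo SET field1 = 'ZZZ'
     WHERE rowid = (SELECT min(rowid) FROM foo WHERE field2 = 'XXX')
""")
print(sorted(r[0] for r in con.execute("SELECT field1 FROM foo")))
# → ['AAA', 'ZZZ']
```

Only one of the two identical rows is touched; the other keeps its original value.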
In response to the question of how to append to an array of unknown size, Joe Conway came up with this nice little function returning the next index into an array. This is then used to update the array. This script requires 7.3.

   CREATE OR REPLACE FUNCTION array_next(text[]) RETURNS int AS '
   DECLARE
      arr ALIAS FOR $1;
      high int;
   BEGIN
      high := 1 + replace(split_part(array_dims(arr),'':'',2),'']'','''')::int;
      RETURN high;
   END;
   ' LANGUAGE 'plpgsql' IMMUTABLE STRICT;

   create table mytable (myarray text[]);
   insert into mytable values ('{"abc","d e f"}');
   update mytable set myarray[array_next(myarray)] = 'new element';

   regression=# select * from mytable;
              myarray
   -----------------------------
    {abc,"d e f","new element"}
   (1 row)

This is how the function works, from the inside out:

   1 + replace( split_part( array_dims(arr), '':'', 2), '']'', '''')::int;

array_dims gives you the string description of the array's dimensions. Our array had two elements in it:

   select array_dims(myarray) from mytable;             --> [1:2]

We split that on the colon, returning the second field:

   select split_part(array_dims(myarray),':',2) from mytable;  --> 2]

The ] is replaced with an empty string:

   replace(split_part(array_dims(arr),'':'',2),'']'','''')    --> 2

And we cast the remaining 2 to an integer. Back at the top we add 1 to it, getting 3. Phew! Using the result of this function as an index into the array, we can do the update:

   update mytable set myarray[array_next(myarray)] = 'new element';

It was also noted that contrib/intarray contains a large set of operators that can perform even more operations on integer arrays.
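The string surgery inside array_next is easy to mirror outside plpgsql. This Python sketch (mine, not from the thread) parses the same '[1:2]'-style dimension string and performs the identical steps:

```python
def array_next(dims: str) -> int:
    """Next free 1-based index, given array_dims() output like '[1:2]'.
    Mirrors the plpgsql: split on ':', take the second field,
    strip the trailing ']', cast to int, and add 1."""
    upper_bound = dims.split(":")[1].replace("]", "")
    return 1 + int(upper_bound)

print(array_next("[1:2]"))  # → 3
```

Seeing the same pipeline in another language makes it clear the function is pure string parsing; nothing array-specific happens until the result is used as a subscript in the UPDATE.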
Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in! Copyright A. Elein Mustain 2003, 2004, 2005, 2006, 2007, 2008, 2009