Varlena, LLC | PostgreSQL General Bits Newsletter

PostgreSQL Training,
Consulting & Support

By A. Elein Mustain

11-Apr-2005 Issue: 105

Archives |

General Tidbits |

Google General Bits |

Prev |

General Bits is a column loosely based on the PostgreSQL mailing list pgsql-general.
To find out more about the pgsql-general list and PostgreSQL, see www.PostgreSQL.org.

Seeing Log Debug Statements
Debug_print	07-04-2005

The debug_print options in postgresql.conf file can be set so that you can see many details in the processing of a query. However, even if you set these variables, it is required that server_min_messages also be set to DEBUG1 in order for the debug_print messages to get to the log file at all.

These are the debug_print messages available. Also note that to get the "pretty" version of the output, set debug_pretty_print.

debug_print_parse
debug_print_rewritten
debug_print_plan

The output you get with these debug flags can be very interesting or very obtuse. I ran the following query on my 8.0.1 data base. It is followed here by its explain analyze output.

   webstat=# select grandtot, to_char(current_date,'DD-Mon-YY') as when,
          grandtot/(date((current_date-1)) - date((current_date-7)) +1) as perrange,
          (date((current_date-1)) - date((current_date-7)) +1) as numdays,
          to_char( date((current_date-7)),'DD-Mon-YY') as startdate,
          to_char( date((current_date-1)),'DD-Mon-YY') as enddate 
          from ( 
             select count(*) as grandtot 
             from rawlog 
             where (code=200 or code=301) and 
                   ldate >= (current_date-7) and 
                   ldate <= (current_date-1) and doc like '%General%' 
          ) foo;
    grandtot |   when    | perrange | numdays | startdate |  enddate
   ----------+-----------+----------+---------+-----------+-----------
        8443 | 07-Apr-05 |     1206 |       7 | 31-Mar-05 | 06-Apr-05
   (1 row)

                             QUERY PLAN 
   ---------------------------------------------------------------------------------------
    Subquery Scan foo  (cost=20266.77..20266.84 rows=1 width=8) 
                       (actual time=72.860..72.863 rows=1 loops=1)
      ->  Aggregate  (cost=20266.77..20266.77 rows=1 width=0) 
                     (actual time=72.707..72.708 rows=1 loops=1)
            ->  Index Scan using logdate on rawlog  (cost=0.01..20266.77 rows=2 width=0) 
                                                    (actual time=0.111..63.919 rows=8443 loops=1)
                  Index Cond: ((ldate >= (('now'::text)::date - 7)) 
                              AND (ldate <= (('now'::text)::date - 1)))
                  Filter: (((code = 200) OR (code = 301)) AND (doc ~~ '%General%'::text))
    Total runtime: 73.199 ms
   (6 rows)

The parsetree, the rewritten plan and the plan can be viewed here. As you can see, it gets rather long, but you can follow how the query was parsed originally and as rewritten and the final scan for the query.

You might want to try it first on a simple query so you can begin to recognize the common elements of the parsed plans.

Contributors: elein at varlena.com, Tom Lane tgl at sss.pgh.pa.us

Organizing Your Environment for Multiple Databases and Installations
How to use pg_service.conf 08-Apr-2005

On my machine I run multiple installations of postgreSQL in non-standard locations. I have 7.4 installed in /local/pgsql74 and 8.0 installed in /local/pgsql80. In order to run anything from an installation I source a file which sets up the appropriate PG variables for whichever installation I indicate on the command line. This script also adds the appropriate bin to my path. Here is an abbreviated version of my pgenv script which I invoke by typing . pgenv 8 in a shell.

   # pgenv
   if [ $1 = 4 ]
   then
           #
           # 7.4 PRODUCTION
           #
           PG_INST=/local/pgsql74
           PGPORT=5434
           MYPG=PG74
   elif [ $1 = 8 ]
   then
           #
           # 8.0 BETA
           #
           PG_INST=/local/pgsql80
           PGPORT=5430
           MYPG=PG80
   fi
   PGDATA=$PG_INST/data
   PGLIB=$PG_INST/lib
   PGHOST=localhost
   PATH=.:$PG_INST/bin:$PATH
   export PGDATA PGHOST PGPORT PG_INST PATH MYPG PGLIB

While this serves me very well, there is another, almost simpler way to do this that includes more detail than my little script. This is the pg_service.conf file. This file allows you to name a service which consists of PG variable settings. Then you can just use the service name, PG_SERVICE, to indicate what group of settings you need.

This is how it works. If you have multiple, isolated installations and one of them in $PG_INST, then in $PG_INST/share is a file called pg_service.conf.sample. You will need to edit it (we'll get to that) and put it in $PG_INST/etc/pg_service.conf. If you have a packaged install the file should be put into /usr/etc/postgresql/pg_service.conf or some variation on that depending on the package installations.

If you had to put $PG_INST on your PATH before, you still need to do that. Then, you will only need to export PG_SERVICE=servicename to pick up all of the PG variables defined in your pg_service.conf file.

This is what my pg_service.conf file looks like. I have a 74 installation and an 80 installation on different ports. I have an elein and a gb database on each of them. I created four service names, elein74, elein80, gb74 and gb80.

	[elein74]
	dbname=elein
	user=elein
	port=5434

	[gb74]
	dbname=gb
	user=elein
	port=5434

	[elein80]
	dbname=elein
	user=elein
	port=5430

	[gb80]
	dbname=gb
	user=elein
	port=5430

Now at the shell, I ensure my PATH contains the $PG_INST/bin directory and export PGSERVICE as gb74.

   $ printenv | grep PG
   PGSERVICE=gb74
   $ echo $PATH
   /local/pgsql80/bin:.:/home/elein/bin:/usr/X11R6/bin:
   /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:

Now when I type psql I get the gb database in the 7.4 installation while running the 8.0 version of psql. Which executables you use is strictly up to where those executables lie on your PATH.

I no longer need to use my installation setting script. I can use pg_service.conf to point to specific databases by setting db name, to different installations by setting the port and different host and user names. Note that it is not necessary to set all of the variables. For example, if I ommitted dbname for a service called [80] then I would still need to reference the database name when I called up a program.

Client applications written using libpq can pass a service=servicename variable in the conninfo string when opening the connection. Use service just as you would use the database or port number.

Contributors: elein at varlena.com

Querying and Flattening Array Columns
Fun with Array Functions 09-Apr-2005

It is important to fool around with arrays and their functions in order to get a good sense of the arrangement of data in multi-dimensional arrays. In this article we will look at flattening arrays into column. Flattening arrays into columns is not necessary to do conditional testing against array elements IN arrayval can do that. This exercise is to enable you to better visualize the data when you use arrays.

An Aside: There is a running religious argument regarding whether or not arrays belong in a "pure" relational database. I assert that if the information describing an object is best described in an array, then it belongs in an array. The best example is that of a polygon. The definition of a polygon is an array of specific points. This article is not about the arguments for or against arrays, but rather how they can be used when you have them. Having said that, the examples in this article, taken from the documentation, are not good examples of when and why to use an array.

Let us start with creating a table that contains two columns each of which is an array. The first is a one dimensional array and the second is a three dimensional array. We use the ARRAY constructor to insert data into the table.

CREATE TABLE sal_emp (
    name            text,
    pay_by_quarter  integer[],
    schedule        text[][][]
);

INSERT INTO sal_emp
    VALUES ('Bill',
    ARRAY[10000, 10000, 10000, 10000],
    ARRAY[['meeting', 'lunch', 'noon'], ['training', 'presentation','2pm']]
    );

INSERT INTO sal_emp
    VALUES ('Carol',
    ARRAY[20000, 25000, 25000, 25000],
    ARRAY[['breakfast', 'consulting','10am'], ['meeting', 'lunch','noon']]);

select * from sal_emp;

-- name  |      pay_by_quarter       |                      schedule
---------+---------------------------+----------------------------------------------------
-- Bill  | {10000,10000,10000,10000} | {{meeting,lunch,noon},{training,presentation,2pm}}
-- Carol | {20000,25000,25000,25000} | {{breakfast,consulting,10am},{meeting,lunch,noon}}
--(2 rows)

pay_by_quarter is a one level array and schedule is a two level array. The array level corresponds to the nesting count of the square brackets which do not match exactly to syntax used to create the table, but rather to the syntax required to do the inserts. For example, the array level of pay_by_quarter is 1. The array level of schedule is 2. The upper bound of pay_by_quarter at level 1 is 4. The upper bound of schedule at level 1 is 2, the upper bound of the array at level 2 is 3. We care about the upper bound of arrays in order to iterate over them.

select array_upper(pay_by_quarter,1) from sal_emp;
--  array_upper
-- -------------
--            4
--            4
(3 rows)

select array_upper(schedule,1) from sal_emp;;
--  array_upper
-- -------------
--            2
--            2
--
select array_upper(schedule,2) from sal_emp;;
--  array_upper
-- -------------
--            3
--            3

The basic technique for iterating over array elements includes using the function generate_series(). This function takes a starting number, an ending number and an option step count.

 select * from generate_series(1,10,2);
--  generate_series
-- -----------------
--               1
--               3
--               5
--               7
--               9
-- (5 rows)

To use the generate_series() function to iterate over an array, you can do something like this. We have chosen an higher than needed end number. And we aliased the function to idx(n) so that we can use it's return value as n.

select name, idx.n, pay_by_quarter[idx.n] from sal_emp, generate_series(1,7) idx(n) order by name;
--  name  | n  | pay_by_quarter
-- -------+----+----------------
--  Bill  |  1 |          10000
--  Bill  |  2 |          10000
--  Bill  |  3 |          10000
--  Bill  |  4 |          10000
--  Bill  |  5 |
--  Bill  |  6 |
--  Bill  |  7 |
--  Carol |  1 |          20000
--  Carol |  2 |          25000
--  Carol |  3 |          25000
--  Carol |  4 |          25000
--  Carol |  5 |
--  Carol |  6 |
--  Carol |  7 |

Notice that the values are NULL for the extra rows generated by generate_series(). To eliminate the NULL values you can use:

select name, idx.n, pay_by_quarter[idx.n] 
from sal_emp, generate_series(1,7) idx(n) 
where pay_by_quarter[idx.n] is not null
order by name;

schedule has two levels of array to then we can use two instances of generate_series() with different alias to flatten out all of the elements of the schedule column.

select name, idx.i, idx2.j, schedule[idx.i][idx2.j] 
from sal_emp, generate_series(1,10) idx(i), generate_series(1,10) idx2(j) 
where schedule[idx.i][idx2.j] is not null 
order by name;
--  name  | i | j |   schedule
-- -------+---+---+--------------
--  Bill  | 1 | 1 | meeting
--  Bill  | 1 | 2 | lunch
--  Bill  | 1 | 3 | noon
--  Bill  | 2 | 1 | training
--  Bill  | 2 | 2 | presentation
--  Bill  | 2 | 3 | 2pm
--  Carol | 1 | 1 | breakfast
--  Carol | 1 | 2 | consulting
--  Carol | 1 | 3 | 10am
--  Carol | 2 | 1 | meeting
--  Carol | 2 | 2 | lunch
--  Carol | 2 | 3 | noon
-- (12 rows)

But we are still using an arbitrary value of 10 as the maximum array index. What if we really did not know off hand what the upper bound of the array is? We can know the upper bound of the array programmatically, but we cannot insert it in the appropriate place in the call to generate_series(). select ... from sal_emp, generate_series(1,array_upper(schedule,1))... will not work because you cannot reference a column in a table that is also in your from clause. You will get an error. So, the next thing to try is a function.

The first of these two functions flattens the pay_by_quarter column. The second flattens the schedule column. Each selects the max(array_upper()) and then uses it in a LOOP query returning each row. array_upper() returns a value for each row so we want to use max(array_upper(schedule,1)) to ensure that we are able to join all of the array elements.

Contributors: elein at varlena.com

Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in!

Top