Title Slide

Script Notes for talk on Running Aggregates using PostgreSQL and plpython.

© A. Elein Mustain 2003

Presented July 2003 at O'Reilly Open Source Conference, Portland, OR

Caveat: These are my notes on this talk. They are not necessarily complete nor are they a definitive record of any live talk, real or imagined.

Modifications to content of the original script based on new information are noted in italics.

Slide one


The Problem Statement

A question that is often brought up in many forums is how to perform functions on the current row based on the values of the previous row. This talk will show one technique in detail.

I will demonstrate how to use plpython functions in PostgreSQL in order to perform inter-row calculations

You don't need to know plpython for this talk.

This technique can be applied to many different problems. Using this technique you can create simple or elaborate running counters, sums, differences between rows and whatever other useful functions you want to execute on a running set of data.

I would like to acknowledge my comrade David Gould who came up with the original idea of doing inter-row calculations with plpython.

For a discussion of recent changes in plpython, see General Bits Issue #32

Slide two

I am...

My name is Elein Mustain I am ex-Ingres, ex-Illustra, ex-Informix.

I've worked exclusively on and with ORDBMS for more than 10 years and in the database industry for more than 15 years.

I'm the editor/author PostgreSQL General Bits, a web column of informational items mostly from the mailing list pgsql-general.

I'm currently working on a book about Extensibility in PostgreSQL. (No ETA...)

And I have recently released a beta version of a simple reporting engine for PostgreSQL written in perl.

I also do database consulting and PostgreSQL support for various types of applications.

Slide three

PostgreSQL is...

an Open Source Object Relational Database based on the Postgres project from UCBerkeley. It is a feature-rich and flexible RBMS.

Some of the possibilities of the extensibility features will be shown in this talk, with the creation of server side functions.

Server functions are invoked in SQL and processed in the server. (The mechanics of this is a whole 'nother talk ...)

The client or middleware sees the data that the server sends up. SO other than calling SQL with functions, the client usually needs to know nothing about server functions.

plpython is a server side procedural language based on python.

PostgreSQL supports many types of server side procedural languages. For example,

plpgsql is comparable to many other kinds of "SPL" type languages in other databases.

plpython and plperl enable you to write server side functions in python or perl.

Trigger functions, (which we will not be talking about here) can only be written using C, plpgsql or plpython. Trigger functions can also be written in pltcl

This talk is just about a minor trick to process your data. The implementation is almost trivial with the tools that are available in PostgreSQL.

This functionality is also available in C functions. But the point in writing it in plpython is that it is simpler for those who are not C hackers. (C functions also is a whole 'nother talk.)

Slide four

What is a running aggregate?

What I'm calling a running aggregate is a function that returns intermediate values,
eg: counters, running sums, etc.

You can create aggregates in PostgreSQL but they have to be written in C and don't usually produce intermediate results except via group by clauses.

It was pointed out that it is possible to write user defined aggregates in other languages. I will investigate further and this will be another item in another column.
PostgreSQL also has many built in aggregates.

I am also making a distinction between running aggregates and "ordinary" aggregates like the built-in aggregates already available in PostgreSQL like avg, stddev, sum, count, etc.

The process of an aggregate is:

Different aggregates have variations on these steps. In the case of running aggregates, the one row processing returns an intermediate value and the final return and end are not different than the one row processing.

Slide five

Creating a Running counter

This is the simplest running aggregate. It starts with 1 and increments by 1 for each row retrieved by the query.

Yes, this could be done in many different ways, but the point is to demonstrate the technique.

The function simple_counter() is executed once for each row qualified by the selection.


Slide six

Function Definition

Slide seven

The code says

Slide eight

(for python novices) What was that Dictionary thingo?

Dictionaries in python

In my mind a dictionary in Python is like a hash in Perl.

In the case of plpython functions there are three predefined dictionaries in which you can store named data.

Memory leaks on the connection are caused by saving lots of data at once.

Slide nine

One more plpython debugging note:

Use these functions for error handling & debugging.
plpy.error("msg") -- aborts transaction
plpy.fatal("msg") -- messy
plpy.debug("msg") -- like notice but tuneable
plpy.notice("msg") -- notice to logfile and stderr: good for immediate debugging
I use plpy.notice() for debugging plpython functions.

Slide ten

Back to the function

You might think we are done,
but we are not.
This solution will not work as you expect it.

Let us look at HOW and then WHY.


Slide eleven

How this fails

If you do your rigorous testing
you will find that
if you run the test twice,
the counter doesn't reinitialize itself between statements.

This is HOW is fails.


How this fails can get even weirder when you use the function in several of interleaved queries or call it multiple times within one statement.

Slide twelve

WHY this fails

We gave a hint about this earlier.

The scope of the SD dictionary is the connection not really the statement as its name implies.

The scope is also limited to the function. Different functions have separate dictionaries so you don't have to worry about dictionary names clashing between functions.

But you do need to worry about dictionaries of the same function clashing. Calling the same function on the same connection will use the same static dictionary.

The value of SD["currval"] remains in existence from the first assignment through the closing of the connection.

Now we need to know HOW to fix this and WHEN to fix this.

Lets start with seeing how to change the function for initialization.

Slide thirteen

Handling the Scope Issues

HOW: One part of the problem is that we need to initialize the SD value.

So first let's change the function to allow initialization of the current value.

We choose to init it by passing in an initialization flag. If the flag is 1 the function initializes the value. Otherwise it behaves as it did before.


The next problem is WHEN to initialize the current value. In general you would like it initialized before each SQL statement.

But it cannot be done anywhere within the statement or it would be executed for every row selected.

Slide fourteen

Brute Force

The approach taken to solve the initialization timing is a brute force approach:
call the function with the init flag in the statement before using the function.

Note, wrapping these two statements in yet another function won't work, why? This solution does not work for multiple invocations of the function within one statement. Why?

Connection Pool Implications:

You must send statements together on the same connection.

Also Remember that:

you don't know which is the first row, or the last row

There may be a change we can make to plpython to create a properly scoped dictionary for this task. If the scope of the dictionary were the duration of the function instance, there would be no need for this hokey initialization. No promises, but I'll bother people for the details and see what can be done.

Slide fifteen

Extended Application

The idea that you can easily save data between function calls opens the door to many interrow applications.

All this is possible with C functions. But also with plpython.

I will show one more here.

For example you may want to have a function return the previous row's value

This would be done by:

Slide sixteen

Bus Times

Scenario and Data:
Buses flow from the Yard to the stops on the route.

We know WHERE the buses are stopping and WHEN

We want to know the ACTUAL time differences between buses at each stop.

This means two things. First that we need to know:

For each bus stopping at a stop, when the previous bus (if any) was there. Group the results by Stop
Slide seventeen


We are working backwards. First we will assume we already have a function which returns the time for the previous bus at a stop.

These are the results we want to see. Disregard the yard rows. bad data. oops.


			Note that bus #1 arrived @ ashby @ 10:59.
			The  next bus #3 arrived @ ashby @ 11:14
				There was 15:00 minutes between those buses.
			The next bus #5 arrived @ ashby @ 11:18
				3 minutes later.
			Bus #7 then arrived @ ashby @ 11:43,
				24 minutes later.

What we now know is that the driver on #5 was going a little fast and the driver of #7 was going a little slow.

This is what we want to know.

Also note that we handle the first arrivals at a stop in a reasonable manner.

So if we already have a function which returns the previous time, what would the query look like to produce this data.

Slide eighteen

The Query


Key Points

It was kindly pointed out the Tom's recent change to hash aggregates will remove the sorting side effect of group by that I was showing here. Thanks... The alternative is to sort in yet another subquery.
select prevtime(1,NULL,NULL);
select bus, stop, atime, ptime, atime - ptime as headway
   (select bus, stop, atime, prevtime(0, stop, atime) as ptime 
			from (select bus, stop, atime
         	from buslog where route = 'BART'
         	order by sord, atime, bus, stop) foo) bar;

Slide nineteen

The function


Pass in an initialization flag.
Pass in the stop identifier.
Pass in the arrival time.
Return the previous arrival time.

   create or replace function prevtime(integer,text, timestamptz)
   returns timestamp as
   if args[0] == 1:                      | Initialization
         SD["curr"] = None               |    initialization
         SD["prev"] = None               |    initialization
         SD["currstop"] = None           |    initialization
         return "12/31/1959"             |    return my birthday
      else:                              | Runtime
         if SD["currstop"] == None:      | First Stop?
            SD["currstop"] = args[1]     |    save stop
            SD["curr"] = args[2]         |    save current time
            return None                  |    but return null

         if args[1] != SD["currstop"]:   | New Stop?
            SD["currstop"] = args[1]     |    save stop
            SD["curr"] = args[2]         |    save current time
            return None                  |    but return null
                                         | Else same stop, previous time exists
         SD["prev"] = SD["curr"]         | set previous to old current time
         SD["curr"] = args[2]            | save new current time
         return SD["prev"]               | return old current time (now prev)
   ' language 'plpython';

The cases for First Stop and New Stop can be combined. They are broken out for clarity.
Slide twenty

Your ideas wrt interrow functions?

Application to various problems such as All currently available with C. But easier with plpython.

Do not forget to read about the latest changes in plpython. See General Bits Issue #32