Varlena, LLC | PostgreSQL General Bits Newsletter

PostgreSQL Training,
Consulting & Support

By A. Elein Mustain

25-Oct-2005 Issue: 122

Archives |

General Tidbits |

Google General Bits |

Prev |

General Bits is a column loosely based on the PostgreSQL mailing list pgsql-general.
To find out more about the pgsql-general list and PostgreSQL, see www.PostgreSQL.org.

Example of a Timetravel Table
Implementing Timetravel	22-Oct-2005

Timetravel is the Postgres name for no-overwrite storage with no vacuum. In Postgres, originally, min and max timestamps were stored with each row. If you did not vacuum, then you were able to select data from your tables at any point in time by qualifying on the min and max timestamps. This was a great feature that enabled you to view your data sets historically.

Sadly, the feature was removed by PostgreSQL in order to reduce the row overhead. However, it has always been possible to create your own timetravelling tables. If you have a complex data which changes from day to day and you must keep track of days gone by then you can implement some form of timetravel.

There are many ways to implement time travel. The general idea is never delete rows, but instead mark the old row with a start and an end timestamp. That sounds easy, no? Well, the tricky part is that an update overwrites and therefore "loses" the old row. If the problem were only delete, then it would be easy, wouldn't it? I will show you an SQL and plpgsql-only solution to timetravel.

Note that if you want to have timetravel in your database you must be prepared for three things:

table space over head of two timestamp columns
query qualification on those columns for every select
small update overhead

How a Timetravel Table Appears

A timetravel table should look and seem to behave as any other table with regards to INSERTS, UPDATES and DELETES. Selection from a timetravel table needs to be qualified as either now or at some point in time.

Our table called timetravel has a "unique" id and some data in addition to its start and end timestamps. The view timetravel_now selects the data from the current rows. The current rows are those that have no end timestamps.

	CREATE VIEW timetravel_now AS
	   SELECT id, data FROM timetravel WHERE edate IS NULL;

Insert adds the data as usual as you can see with the timetravel_now view. Selecting from the base table shows you the underlying start and end times.

	INSERT INTO timetravel VALUES ('1','one');
	INSERT INTO timetravel VALUES ('2','two');
	INSERT INTO timetravel VALUES ('3','three');
	INSERT INTO timetravel VALUES ('4','four');
	INSERT INTO timetravel VALUES ('5','five');

	=# SELECT * FROM timetravel_now;
	 id | data
	----+-------
	 1  | one
	 2  | two
	 3  | three
	 4  | four
	 5  | five
	(5 rows)

	# SELECT * FROM timetravel;
	 id | data  |           sdate            | edate
	----+-------+----------------------------+-------
	 1  | one   | 2005-10-24 15:39:03.307916 |
	 2  | two   | 2005-10-24 15:39:03.349796 |
	 3  | three | 2005-10-24 15:39:03.39507  |
	 4  | four  | 2005-10-24 15:39:03.438123 |
	 5  | five  | 2005-10-24 15:39:03.489085 |
	(5 rows)

Updates to the timetravel table also behaves as expected. The current data is shown using the timetravel_now view. When you select the base table, you can again see the underlying changes to the start and end timestamps.

	UPDATE timetravel SET data='one.one' WHERE id=1;
	UPDATE timetravel SET data='three.one' WHERE id=3;
	UPDATE timetravel SET data='four.one' WHERE id=4;
	UPDATE timetravel SET data='five.one' WHERE id=5;

	=# SELECT * FROM timetravel_now;
	 id |   data
	----+-----------
	 1  | one.one
	 2  | two
	 3  | three.one
	 4  | four.one
	 5  | five.one
	(5 rows)

	=# SELECT * FROM timetravel ORDER BY id, sdate;
	 id |   data    |           sdate            |           edate
	----+-----------+----------------------------+----------------------------
	 1  | one       | 2005-10-24 15:39:03.307916 | 2005-10-24 15:40:38.294473
	 1  | one.one   | 2005-10-24 15:40:38.294473 |
	 2  | two       | 2005-10-24 15:39:03.349796 |
	 3  | three     | 2005-10-24 15:39:03.39507  | 2005-10-24 15:40:38.376969
	 3  | three.one | 2005-10-24 15:40:38.376969 |
	 4  | four      | 2005-10-24 15:39:03.438123 | 2005-10-24 15:40:38.392555
	 4  | four.one  | 2005-10-24 15:40:38.392555 |
	 5  | five      | 2005-10-24 15:39:03.489085 | 2005-10-24 15:40:38.44119
	 5  | five.one  | 2005-10-24 15:40:38.44119  |
	(9 rows)

Deletion of a row also appears to do what is expected when selecting timetravel_now. In the base table the deleted row is closed by setting the end times.

	DELETE FROM timetravel WHERE id=1;

	=# SELECT * FROM timetravel_now;
	 id |   data
	----+-----------
	 2  | two
	 3  | three.one
	 4  | four.one
	 5  | five.one
	(4 rows)

	# SELECT * FROM timetravel ORDER BY id, sdate;
	 id |   data    |           sdate            |           edate
	----+-----------+----------------------------+----------------------------
	 1  | one       | 2005-10-24 15:39:03.307916 | 2005-10-24 15:40:38.294473
	 1  | one.one   | 2005-10-24 15:40:38.294473 | 2005-10-24 15:43:12.650496
	 2  | two       | 2005-10-24 15:39:03.349796 |
	 3  | three     | 2005-10-24 15:39:03.39507  | 2005-10-24 15:40:38.376969
	 3  | three.one | 2005-10-24 15:40:38.376969 |
	 4  | four      | 2005-10-24 15:39:03.438123 | 2005-10-24 15:40:38.392555
	 4  | four.one  | 2005-10-24 15:40:38.392555 |
	 5  | five      | 2005-10-24 15:39:03.489085 | 2005-10-24 15:40:38.44119
	 5  | five.one  | 2005-10-24 15:40:38.44119  |
	(9 rows)

The point is that we want the table to behave as any other table, but to have the added feature of being able to see the data at different points in time. The timetravel_now view enables you to see the current data. A function timetravel_when show how to select out the data at a point in time. I will show how the table is defined, then how the timetravel_when() function works and finally how the timetravelling rows are managed.

Defining a Timetravel Table

So what do we have to do to make timetravel work on a table? The table must have two columns delineating its valid times. We will call these sdate and edate for start times and end times even though they are timestamp columns.

The underlying table cannot have id defined as its primary key, because there will be multiple values or versions of each row. There should however be a unique index for current rows (id and edate IS NULL) and the combination of the id and start and end times must also unique. The latter is acceptable as an alternative primary key for replication. Just id and sdate will not work as unique keys because of the method update uses to save the old row.

This is our timetravel table definition. Notice that sdate defaults to now().

	CREATE TABLE timetravel (
	   id    TEXT,
	   data  TEXT,
	   sdate TIMESTAMPTZ DEFAULT now(),
	   edate TIMESTAMPTZ,
	   UNIQUE (id, sdate, edate)
	);
	CREATE UNIQUE INDEX timetravel_id ON timetravel (id) WHERE edate IS NULL;

Queries on a Timetravel Tables

Any current row in the table has a NULL value in its end time column so the view for the current values in the tables selects all of its data and qualifies on the end time. This is what we used above to see the current rows in the table. Of course the naming of the underlying table and the "now" view is up to you.

	create view timetravel_now as
	   select id, data from timetravel where edate is null;

When you want to look at historical data then you use the set returning function timetravel_when( timestamptz ). It takes the original table and qualifies it based on the parameter. If the parameter is greater than the current row's start time or if it is between the start and end time of a row, then that row is returned.

In this somewhat contrived example, we want to see what the table looked like when we updated id 4. (Usually, you would just pass a timestamp instead of a sub-query.)

	=# SELECT * FROM timetravel_when(
	(#    (SELECT sdate FROM timetravel WHERE id=4 ORDER BY sdate DESC LIMIT 1 ));
	 id |   data
	----+-----------
	 2  | two
	 3  | three.one
	 4  | four.one
	 5  | five
	 1  | one.one

You can see that when we updated id 4, 1 and 3 were already updated but id 5 was not yet updated. Also at that point in time, id 1 still existed.

To see the table at any point in time we invoke the timetravel_when(timestamptz) function. Notice that the WHERE clause is enclosed in a CASE statement and that we cannot use BETWEEN since start times may equal end times.

	CREATE OR REPLACE FUNCTION timetravel_when (timestamptz)
	RETURNS SETOF timetravel_now AS
	$$
	   SELECT id, data FROM timetravel
	   WHERE (SELECT CASE WHEN edate IS NULL
	         THEN (sdate <= $1)
	         ELSE (sdate <= $1 AND edate > $1)
	         END) ;
	$$ LANGUAGE 'sql';

Because this is an SQL function, other qualifications to the query should get pushed down as is logical to the function's query. We opted for simplicity and clearness in the historical case by wrapping the query in a function making it a parameterized query.

Managing the Timestamps and Rows for a Timetravel Table

Let us start with the simplest case, delete. When we want to delete a row, we don't really want to delete it. Instead we want to set the end time to close it. The implementation of DELETE therefore is a RULE. The delete rule is qualified by the characteristic of the id where edate is null. We are "deleting" the current row. If you try to delete any row that is already closed or deleted, nothing happens.

	CREATE OR REPLACE RULE del_timetravel
	AS ON DELETE TO timetravel
	DO INSTEAD
	   (UPDATE timetravel SET edate=now()
	   WHERE edate IS NULL AND id=OLD.id;
	   );

Take notice, however, that the delete rule then invokes the update trigger. We use the set value of edate to determine that this is an update from the delete rule. The update trigger is written such that this update only updates the current row by setting edate.

Updates, implemented with a trigger, allow the update to occur, but save a copy of the old row with a closing end time. We do not allow update of previously closed or deleted rows. This is done by returning NULL from the trigger.

If this is a user update and not an UPDATE from the DELETE RULE, then edate will be NULL. In that case we will want to save the old row by reinserting it with an end time.

	CREATE OR REPLACE FUNCTION upd_timetravel()
	RETURNS TRIGGER AS
	$$
	BEGIN
	   IF OLD.edate IS NOT NULL THEN -- NEW.edate can be non-null
	      RETURN NULL; -- no update
	   END IF;
	   IF NEW.edate IS NULL THEN
	      INSERT INTO timetravel VALUES (OLD.id, OLD.data, OLD.sdate, now());
	      NEW.sdate = now();
	   END IF;
	   RETURN NEW;
	END;
	$$ LANGUAGE 'plpgsql';
	
	CREATE TRIGGER upd_timetravel BEFORE UPDATE ON timetravel
	FOR EACH ROW EXECUTE PROCEDURE upd_timetravel();

Remember that the insert statement in the update trigger will invoke the insert trigger.

Inserts are also implemented with a trigger. If the id of the row already exists we treat the INSERT as an UPDATE and disallow the INSERT. This is not standard protocol for SQL but it seemed to me to be intuitive with time travelling tables. The goal of this trigger is to ensure that there is only one record for each id where edate is NULL.

	CREATE OR REPLACE FUNCTION ins_timetravel()
	RETURNS TRIGGER AS
	$$
	DECLARE
	   this_id TEXT;
	BEGIN
	   SELECT INTO this_id 1 FROM timetravel WHERE id = NEW.id AND edate IS NULL;
	   IF FOUND AND NEW.edate IS NULL THEN
	      UPDATE timetravel SET data=NEW.data WHERE id=NEW.id;
	      RETURN NULL; -- no insert
	   END IF;
	   RETURN NEW;
	END;
	$$ LANGUAGE 'plpgsql';
	
	CREATE TRIGGER ins_timetravel BEFORE INSERT ON timetravel
	FOR EACH ROW EXECUTE PROCEDURE ins_timetravel();

That is the meat of time travel. This is what we did:

Define a table
- having a "unique" key
- with start and end timestamps
- add unique indexes across the "unique key" and the start and end times
- add unique indexes across the "unique key" where the end time is null
Define a Delete Rule
- set end time and pass the row to update
Define an Update Trigger
- Allow plain updates from delete (where end time is not null)
- Otherwise save the old row using insert and set the start date on the changed row.
Define an Insert Trigger
- If the row already exists, pass row off to update
- Otherwise insert the row with start time set.
Define a View for the current rows
- where the end time is null.
Define a function for rows at a point in time.
- select the qualifying current rows or
- old rows in that time range.

One more small item would also be helpful to present the view timetravel_now as an updateable view. This would be done by adding RULES on the view to perform updates, inserts and deletes on the underlying tables. With this, the application would always use timetravel_now for the usual functionality.

The even more interesting part of this timetravel feature is to include more related tables. This allows you to generate accurate historical reports while your data moves forward day to day. A master and detail join using the same time qualification should always match. The problem here, however, is that foreign key references will have to be qualified by both id and start time in a user defined trigger since a unique index cannot be enforced on only id and sdate due to the order the statments are executed in the update rule.

The creation of a timetravel table is clearly laid out in this example and can be applied to other tables with little effort. But it must be done carefully for each individual table and relation.

The code for setting up this example is here and the test cases for the table are found here.

Contributors: elein at varlena.com, daveg at sonic.net, david at fetter.org

Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in!

Top