Quick Outline of ErServer Internals

by Andrew Sullivan
23-Oct-2003

Abstract

This is a quick conceptual outline of what erserver does once you have it working, if you ever get it there. That is, this is not an outline of how the setup scripts, etc., work (or don't), but how a tuple on the master ends up on the slave(s).

I. Required support

A. On the master

The master must have the following tables set up:

Any other tables are not strictly required. There is a table that normally gets set up by the system, _rserv_servers_, but it is never actually used. It was apparently a stub for some planned features. There is a server number in _rserv_sync_, and you could therefore in principle use _rserv_servers_ for convenient lookups, but it won't get used by anything.

The master must also have a couple of installed sequences:

Again, there is a sequence which exists for _rserv_servers_. It's not actually used.

The function for the replication log trigger must be created. It's generally called _rserv_log_, but the important detail is that it is what is called by the trigger created on each replicated table, and that it in turn is a C function which is compiled from src/erserver.c; it should compile to a function called erserver.so, which will be in the directory where you installed erserver. (In case you're wondering, by the way, yes this breaks the nice $libdir trick in 7.2 and later.)

B. On the slave

On the slave, you need exact copies of the tables you are going to replicate, plus you need _rserv_slave_tables_ and _rserv_slave_sync_.

C. General

All this is set up as a database super user. In fact, you really only need to be super user to install the function, because it's a C function. In practice, it's easier to set up the relevant bits as super user, and then make some limited GRANTs to the relevant tables.

II. Queueing things to be replicated

A table which is to be replicated must meet several conditions on the master.

A. It must be listed in the table _rserv_tables_:

Column Type Modifiers
tname name
cname name
reloid oid
key integer

The tname is the table name. The cname is the unique, not null key of the table for replication purposes (the setup script creates _ers_uniq, but there's nothing requiring that; if you had single-column primary keys on all your tables, you could use them). The reloid is the oid in pg_class for that table (i.e. select a.oid from pg_class a, _rserv_tables_ b where a.relname = b.tname and b.tname = [target table]), and the key is the attnum from pg_attribute for that field (i.e. select attnum from pg_attribute a, _rserv_tables_ b where a.attname = b.cname and a.attrelid = b.reloid and b.tname = [target table]; note that you'd have to have reloid set correctly for this to work).

B. Trigger Installed

On the master, it must have the replication trigger installed. It is normally called _rserv_trigger_t_, but the name is not important. What _is_ important is that the trigger calls the _rserv_log_ C function (the actual relevant bit is that the function which is called is the function, distributed with the erserver source, which compiles to rserv.so).

C. Permissions

Users writing to a replicated table must have permissions to read and write on _rserv_log_1_ and _rserv_log_2_, and must be able to select from _rserv_active_log_id_.

D. Trigger action

Suppose we are replicating target_table, then. We have the relevant entries and the trigger is set up. When you insert a record into target_table, you will cause the trigger to fire. It will insert into the current log table (see below about cleanlog, but if this is the first time, it should be _rserv_log_1_) a record with the reloid (see _rserv_tables_), the logid (which is the xid from the transaction), the timestamp, whether the record is deleted (-1 is an insert, 0 is an update, 1 a delete), and the value of the key (that is, the actual value in the key field for that row).

III. Replicating

A. Things on the slave

In order to replicate to a slave, you must have set up _rserv_slave_tables_ on the slave analogously to how it is set up on the master. Note that the oids are unlikely to be the same, so you can't just copy the values from the master. Also, your slave tables must have _exactly_ the same structure as they do on the master. In particular, the key field must be available on the slave. Some versions of the setup scripts apparently have done things in the wrong order, and the _ers_uniq field is not correctly set up on the slave. You'll be wanting to fix that. If you don't have the slave completely set up before the sync starts, you will lose data.

B. Configuration file.

If people ask for an annotation of the config file, I'll provide one. I think the comments are not too bad in there, though. It should be relatively obvious if you have a misconfiguration there. (Send mail to the erserver-general list if not.)

C. Run the replication engine

You _must_ use the Sun JDK to run this software: apparently there are bits of it which depend on special Sun bits. Java is only actually portable if it's written that way, and this isn't. Patches are welcome. The logs normally end up /path/to/erserver/logs.

D. Worth noting

If you look at the query generated by the replication engine, you will note that it joins _rserv_log_1_ (or _2_) to the replicated tables, and replicates the version of the row in the table. That means that _old versions of the row are not replicated_. This is efficient, but can surprise you if you're not expecting it.

E. What the engine does

The engine looks at _rserv_sync_ for the master and the current thread's target slave, and goes through the current log (either _rserv_log_1_ or _2_, depending on the value in _rserv_active_log_id_ and the state of _rserv_old_log_status_) and gets everything that is affected _after_ the most recent successful sync for the target server, and which is not currently active. It joins that result to the tables in question (using LEFT JOIN so that it gets null rows for the missing ones), and then applies all the affected rows in the result set for each table to the slave. All this is done in one transaction on the master and one on the slave. The master commits last (which means it is possible that the master crashes without knowing a slave has been updated, but not that the slave crashes when the master thinks rows have been committed, but haven't been).

IV. CleanLog

A. What is a CleanLog?

The erserver engine uses more than one log into which it writes in order to keep track of the rows to be replicated. That could be inefficient, or it could cause maintenance to have to be done regularly. Instead, a strategy of log rotation is used.

B. How does it work?

The _rserv_log_ trigger inserts its results into whichever table is dictated by the current value of _rserv_active_log_id_. So, when it is time to switch logs, the clean log portion of the program just selects nextval('_rserv_active_log_id_'). The sequence cycles, so it returns to 1 after 2.

Of course, immediately after the switch, there is a period in which old transactions still are using _rserv_log_1_ while _rserv_log_2_ is filling up. The replication engine knows to use both logs by the value of _rserv_old_log_status_. Once it has caught up all the slaves with all the items in _rserv_log_1_, it sets the _rserv_old_log_status_ back to normal, and then clean log knows it can proceed with issuing a TRUNCATE on _rserv_log_1_.

The interval between clean log events is controlled in the configuration file by replic.server.CleanLogInterval. The default is once every 24 hours. You'll find that is too infrequent on a busy database.

That's everything I can think of for now. I hope these notes are useful. Feel free to ask questions.

Andrew Sullivan
andrew at libertyrms.info
Formatting by A. Elein Mustain