varlena
PostgreSQL Training,
Consulting & Support
General Bits
By A. Elein Mustain

7-Feb-2005 Issue: 96


General Bits is a column loosely based on the PostgreSQL mailing list pgsql-general.
To find out more about the pgsql-general list and PostgreSQL, see www.PostgreSQL.org.

The Saga of the ARC Algorithm and Patent
Discussions on Development 6-Feb-2005

PostgreSQL development continues to research and implement leading edge technologies. Recently a new caching algorithm, ARC, was introduced and discussed. Problems then ensued regarding the patent on ARC, fomenting further discussions. This is an overview of those discussions.

ARC

PostgreSQL uses a cache to save pages in its local memory for easy access by repeated queries. The basic idea is that once you read a page in, you hold onto it in case someone else needs it later on.

Many algorithms have been written and published over the years in an effort to select the algorithm which is the "most efficient." The algorithms determine the basis for keeping elements in the cache and the timing for freeing them. "Most Efficient" is an elusive qualifier because the activity on different systems is, well, different.

Highly repetitive workloads such as TPC-C-like transactions tend to perform better with an LRU (Least Recently Used) algorithm, which frees up pages as necessary based on how recently they were used. Database systems often use LRU algorithms, but many are switching to other algorithms to better handle a variety of behaviors not handled well by LRU. For example, a single very large sequential scan can flood the cache with pages that are not expected to be used again any time soon. The cache is then not helpful until it is re-populated with more commonly used pages. The PostgreSQL caching algorithm at the time was LRU-K, implemented by Tom Lane a few years earlier.
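
To make the flooding problem concrete, here is a rough sketch (mine, not code from PostgreSQL's buffer manager) of a minimal LRU cache in Python. Every page touched by a large sequential scan becomes "most recently used" and pushes the genuinely popular pages out.

  from collections import OrderedDict

  class LRUCache:
      """Minimal LRU page cache: evicts the least recently used page."""
      def __init__(self, capacity):
          self.capacity = capacity
          self.pages = OrderedDict()             # page_id -> data, oldest first

      def get(self, page_id):
          if page_id in self.pages:
              self.pages.move_to_end(page_id)    # mark as most recently used
              return self.pages[page_id]
          return None                            # cache miss

      def put(self, page_id, data):
          if page_id in self.pages:
              self.pages.move_to_end(page_id)
          elif len(self.pages) >= self.capacity:
              self.pages.popitem(last=False)     # evict the least recently used page
          self.pages[page_id] = data

  # A big sequential scan floods the cache: after reading pages 100..199,
  # the frequently used pages 0..9 have all been evicted.
  cache = LRUCache(capacity=10)
  for hot in range(10):
      cache.put(hot, "hot page")
  for scanned in range(100, 200):
      cache.put(scanned, "scanned once")
  print(cache.get(1))                            # None -- the hot page is gone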

The ARC algorithm tries to balance how frequently pages are used against how recently they are used. Theoretically, this keeps the pages that are used most frequently in the cache in balance with those that were used most recently. It does this by tracking frequent pages and recent pages in two lists and dynamically determining the best balance between the two.
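
As a much-simplified illustration of the two-list idea (my sketch in Python, not the published ARC algorithm and not PostgreSQL's implementation), the cache below keeps pages seen once on a "recent" list, promotes pages referenced a second time to a "frequent" list, and uses "ghost" lists of recently evicted page ids to nudge the balance between the two. The real ARC adapts the split with considerably more care, but the effect is similar: a one-time sequential scan only churns the recent list, while frequently referenced pages survive on the frequent list.

  from collections import OrderedDict

  class AdaptiveTwoListCache:
      """Simplified sketch in the spirit of ARC; not the exact algorithm."""
      def __init__(self, capacity):
          self.c = capacity
          self.target_recent = capacity // 2     # adaptive share for 'recent'
          self.recent = OrderedDict()            # pages referenced once
          self.frequent = OrderedDict()          # pages referenced 2+ times
          self.ghost_recent = OrderedDict()      # ids recently evicted from 'recent'
          self.ghost_frequent = OrderedDict()    # ids recently evicted from 'frequent'

      def _evict_one(self):
          if self.recent and (len(self.recent) >= self.target_recent or not self.frequent):
              pid, _ = self.recent.popitem(last=False)
              self.ghost_recent[pid] = None
          else:
              pid, _ = self.frequent.popitem(last=False)
              self.ghost_frequent[pid] = None
          # keep each ghost list no larger than the cache itself
          while len(self.ghost_recent) > self.c:
              self.ghost_recent.popitem(last=False)
          while len(self.ghost_frequent) > self.c:
              self.ghost_frequent.popitem(last=False)

      def access(self, pid):
          """Reference page pid; returns True on a cache hit."""
          if pid in self.recent:                 # second reference: promote
              self.frequent[pid] = self.recent.pop(pid)
              return True
          if pid in self.frequent:
              self.frequent.move_to_end(pid)     # refresh within 'frequent'
              return True
          # Miss. A hit in a ghost list means that list was evicted too
          # eagerly, so shift the target balance toward it.
          if pid in self.ghost_recent:
              self.target_recent = min(self.c, self.target_recent + 1)
              del self.ghost_recent[pid]
          elif pid in self.ghost_frequent:
              self.target_recent = max(0, self.target_recent - 1)
              del self.ghost_frequent[pid]
          if len(self.recent) + len(self.frequent) >= self.c:
              self._evict_one()
          self.recent[pid] = None                # fault the page in
          return False

  # Five hot pages referenced repeatedly survive a large one-time scan,
  # because the scan only churns the 'recent' list.
  cache = AdaptiveTwoListCache(capacity=10)
  for _ in range(3):
      for hot in range(5):
          cache.access(hot)
  for scanned in range(100, 200):
      cache.access(scanned)
  print(all(cache.access(hot) for hot in range(5)))   # True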

The details of any caching algorithm can be quite interesting; those who are interested are referred to the papers mentioned and linked here, as well as to the detailed and interesting discussion of the ARC implementation in the HACKERS archive.

PostgreSQL and ARC

In November 2003, Jan Wieck proposed implementing an ARC (Adaptive Replacement Cache) algorithm for the PostgreSQL caching system. This was based on what was described in the following paper and article, pointed out by Cuong Bui.
ARC: A Self-Tuning, Low Overhead Replacement Cache,
USENIX File & Storage Technologies Conference (FAST), March 31, 2003
"One Up on LRU", ;login:, August 2003

The proposal was discussed on hackers. Related and some unrelated issues came up, including vacuum changes, free space mapping, fsync calls, and the factors or variables involved in the algorithm.

Jan then committed the ARC implementation into the (then) 7.5 code branch, and Simon Riggs, Tom Lane, and others began detailed testing, which included reporting tool patches and bug fixes. Meanwhile, discussion continued about fine-tuning the details and the relative impact on the planned background writer and vacuum changes.

Simon Riggs then brought up an update on the ARC idea:

Sorav Bansal and Dharmendra S. Modha,
CAR: Clock with Adaptive Replacement,
in Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pages 187--200, March 2004.
This modified the sizes of the two lists adaptively, introducing a "temporal locality window." These ideas became part of the discussion concerning the list size limits, which were primarily based on the shared buffers setting and the maximum number of backends.

Some of the discussion centered around the use of the effective_cache_size GUC parameter to configure the sizes of the cache lists. Currently it is only used to inform the planner when it makes index versus table scan and join order decisions. It was documented incorrectly, and the correction for that was also proposed and applied.
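
As a small aside (my example, not from the thread), effective_cache_size can be inspected from any client session; it allocates no memory itself, it only tells the planner how much caching it may assume when costing plans. The snippet below uses the psycopg2 Python driver and a hypothetical database name.

  import psycopg2

  # Hypothetical connection string; adjust for your installation.
  conn = psycopg2.connect("dbname=test")
  cur = conn.cursor()

  # effective_cache_size only informs the planner's cost estimates
  # (index scan vs. table scan, join ordering); it allocates nothing.
  cur.execute("SHOW effective_cache_size;")
  print(cur.fetchone()[0])

  cur.close()
  conn.close()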

The Patent Issue

A special note here is important. I am not a lawyer and probably all of the engineers involved in the discussions are also not lawyers. For strict legal interpretations we will have to rely on the lawyers and the court system. For better or worse.

Around January 17, 2005, Neil Conway discovered that IBM had applied for a patent on the ARC algorithm in November 2003.

Thus the discussion turned to the implications of the patent application and removing the ARC code.

The code containing the ARC algorithm was targeted for the 8.0 release of PostgreSQL. Because the patent was pending and not yet issued, Tom Lane suggested that we go ahead with the code as is and look into the issue as soon as possible after the release.

Discussion ensued regarding the effects of patented software on an open source implementation released under a BSD license. Andrew Sullivan pointed out that it would be possible for the PostgreSQL Global Development Group to receive a cease and desist order requesting that all code be eradicated from the net, which would be a big problem. It was pointed out by Joshua Drake that the commercial distributors of PostgreSQL would not like a cease and desist order against them either. This would be a particular problem for any company taking PostgreSQL and reselling it as closed source. (This is perfectly legal under the BSD license.) One of the greater fears was that corporate users would be billed by IBM.

IBM does have a policy enabling open source projects approved by the Open Source Initiative to use its patents. However, our BSD license says people can take PostgreSQL code and do *anything* with it. Including any patented code would preclude a third party from packaging and reselling PostgreSQL unless they paid license fees or removed the patented code. This changes the essence of the BSD license and so is not desirable.

The source code containing ARC is already out on the net in beta and RC releases (and now also as 8.0), and we know that people are not likely to upgrade quickly once they have moved PostgreSQL into a production system. But there may be liability for those people who have ARC code on their systems. Eradicating the code from the net is a task for Sisyphus.

It is pertinent that 1) we now knew about the possible patent and 2) this issue arose before a major release of possibly infringing code. However, IBM had only just applied for the patent; it had not yet been granted. The algorithm was publicly published, and therefore we assumed (incorrectly) that it was available for use. Of course, incorrect assumptions have no legal bearing. Nor does a lack of due diligence in making an effort to find patents.

A side thread reviewed the general implications for handling patent issues and patent threats. Josh Berkus and Bruce Momjian argued for a general policy for handling releases to rectify possible patent infringements.

Tom Lane argued that IBM would be unlikely to sue us in the near future, since doing so would destroy the good will it has established with the open source community, but that we should not rely on that indefinitely. We could never retroactively reach all of the people who are running 8.0 beta and RC releases, but a change going forward would be the right thing to do. He still favored a replacement plan.

Jeff Trout suggested that perhaps we should talk to IBM.

Andrew Dunstan found a very recent paper on an alternative to ARC which claims superior performance. Neil Conway suggested a more recent LRU algorithm, the LRU-2Q system.

Replacing the algorithm in the PostgreSQL code is a non-trivial change. It would require a major release, say 8.1, because it would have a major impact on both performance and reliability; dedicating anything less than a full major release cycle of testing and beta to it would lower our standards. While pointing this out, Tom Lane also suggested another possible route: changing the use of the current ARC algorithm to avoid the exact pattern specified by the patent.

At this point the two main lines of discussion clarified a bit. It was understood that a change would be required pre-emptively in case IBM was granted the patent, and that both the release vehicle for the change and the nature of the change needed to be worked out. The possibilities were changing the code to defeat the patent or replacing the algorithm altogether. Which change was chosen would factor into the timing of implementing it in a release.

Tom Lane brought up the idea of a release 8.1 which would not require an initdb from 8.0. This would facilitate the upgrade from 8.0 while still signifying a major version change if the decision were reached to replace the algorithm rather than alter it. On the other hand, having such a restriction on 8.1 would severely limit other development in progress that was counting on system catalog changes, which require an initdb. Some of these projects included shared dependencies and shared row locking. It would also affect the plans of other long term projects not yet started. Another angle on the issue of initdb for 8.1 was to resurrect pg_upgrade, an automatic way to upgrade major releases without an initdb.

Conjecture about release cycles was complicated by the fact that we were still unsure about the nature of the change required. Without an immediate small change to the ARC algorithm, the argument was to have a full development cycle, however theoretically shortened, to test and beta a full algorithm change. However, some people, with faith in finding a smaller change or prior art that would bypass the patent, felt more strongly about a minor release, 8.0.x. It was also argued that, given enough testing time, we could provide both an ARC and an LRU algorithm, chosen at startup time.

Proposals about bringing back the LRU algorithm got some attention; however, most of them included additional modifications to help with the sequential scan flooding issue.

The discussion turned back to the patent and its potential effects on PostgreSQL. It is clear that we based our algorithm on the ARC paper and did not come up with it entirely on our own; prior art was unknown. It was repeated by some that IBM would not sue over this issue as long as we change it, but SCO's name was thrown about as a counterargument. The colorful phrase "flying roosters butt" was also used.

Examining the patent application, at least the first bullet, using two lists to manage the caches, has many instances of prior art, including code in PostgreSQL dating back to the UC Berkeley days. But we cannot guess which elements of the patent will be approved. Therefore, Tom Lane argued, it is reasonable to find out whether we actually violate the patent while at the same time making a plan to remove or change the algorithm.

Heavy digging went on and turned up a number of additional papers that may or may not be prior art with regards to the ARC algorithm.

A couple of other people further pushed the previously unanswered question about contacting IBM directly. Tom Lane pointed out, "People seem to be assuming that asking IBM is a zero-risk thing. It's not. If they are forced to deal with the issue, they might well feel that they have to take action that we'd not like." And IBM might be pushed to acknowledge PostgreSQL as a competitor, something they'd not like to do publicly.

In the meanwhile, PostgreSQL 8.0 was released, and release 8.0.1, addressing a security issue, was prepared and then also released. The patent had been neither rejected nor approved by the US Patent Office. Neil Conway submitted a patch to revert to an LRU algorithm, but there was not enough testing planned for it to go into a minor release.

It was pointed out that a US patent would not hold the same weight in other countries and that PostgreSQL is an international group. Software patents in Europe are falling out of favor. This, however, does not detract from the fact that the US patent, if approved, would have an enormous effect in the States, as well as on commercial PostgreSQL distributors.

In early February, Tom Lane proposed an altered version of the ARC algorithm more closely tied to the 2Q paper published by Johnson and Shasha. Their 1994 paper described something very similar to ARC, but not "adaptive," and the differences seemed to meet our need of avoiding cache flooding. The change would require removing more code from the current PostgreSQL than adding, and seemed a more conservative approach.

On February 3, 2005, Tom Lane committed the small change revising the ARC algorithm to eliminate possible patent infringement. It is to be released in the next 8.0 minor version and in 8.1.

Contributors: elein at varlena.com


Comments and Corrections are welcome. Suggestions and contributions of items are also welcome. Send them in!
Copyright A. Elein Mustain 2003, 2004, 2005, 2006, 2007, 2008, 2009
