author: Lennon Day-Reynolds
last mod: Mon Dec 18 09:35:47 PST 2006
keywords: web scaling apache postgresql linux carp failover ha
Dynamic, database-backed web applications are increasingly important in day-to-day operations at many institutions, making high availability and redundancy desirable properties for any new deployment. However, many aspects of standard web application design, including reliance on stateful user sessions and centralization of application data in a single back-end relational database, make redundancy non-trivial to achieve.
An example service that sees a great deal of use here at Reed is our webmail server. A large percentage of the student population rely on webmail as their primary, if not only, mail client. We use a lightly-customized version of the IMP mail client, a featureful web frontend to IMAP mail services which utilizes the Horde application framework written in PHP. We chose IMP in large part because of its tight integration with other Horde applications, including the Turba contact manager and Ingo mail filtering and vacation-message editor, and because it is an open source application designed to run in a traditional LAMP hosting environment.
However, the most basic LAMP model, in which the web server, scripting-language “app server”, and database backend all run on the same physical system, has obvious problems when one begins thinking in terms of redundancy and scaling. We have long been running near the maximum load capacity of our primary webmail server (a relatively fast dual-CPU server running locally-compiled builds of Apache, PHP, and MySQL), so any redundant system must perform at least as well, while minimizing “single failure points” that could bring down the entire service.
After researching existing models and deployments of similar services, we decided to implement a load-balanced set of web application servers, backed by a redundant pair of database servers. Ideally, this should allow the entire system to keep running in the event that any individual system goes down, while allowing us to continue to scale out the most resource-intensive pieces of the system, which have historically been the web server processes running the PHP code.
This model also allows us to provide the same scaling and reliability improvements for applications besides webmail, without re-implementing the load-balancing and database infrastructure. As new applications are added to the system, we can add new web servers to the pool to match the load with minimal configuration changes.
In order to realize this implementation model, we needed at minimum three major components, each of which should itself be redundant in order to ensure maximum reliability: a frontend load balancer which actually accepts incoming client requests, a pool of “application servers” running the PHP application code, and a replicated database backend.
(or, Apache is our Swiss Army Knife)
We implemented the load balancing service using a pair of inexpensive, relatively low-powered single-CPU servers running OpenBSD and Apache 2.2. For our purposes, the Apache web server’s bundled load-balancing proxy module provides an excellent fit: it matches our existing expertise and preferred server environment, and allows for very flexible dispatching of incoming requests based on the URL, client network address, and even request headers or contents. It also transparently tracks the status of backend hosts, allowing machines to be dynamically added to and removed from the pool of available systems.
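To make the dispatching idea concrete, here is a minimal sketch in Python of prefix-based routing over pools of backends, with backends that can be marked up or down. It is only an illustrative model, not how Apache’s proxy module is implemented; the class name, URL prefixes, hostnames, and ports are all hypothetical.

```python
from itertools import count


class Balancer:
    """Toy model of URL-prefix dispatch over pools of backends."""

    def __init__(self, pools):
        # pools maps a URL prefix to a list of "host:port" backends.
        self.pools = pools
        self.down = set()     # backends currently marked unavailable
        self._turn = count()  # shared round-robin counter

    def mark_down(self, backend):
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def route(self, path):
        """Return a live backend for the request path, or None."""
        # The longest matching prefix wins.
        matches = [p for p in self.pools if path.startswith(p)]
        if not matches:
            return None
        prefix = max(matches, key=len)
        # Skip any backend that has been marked down.
        live = [b for b in self.pools[prefix] if b not in self.down]
        if not live:
            return None
        return live[next(self._turn) % len(live)]


# Hypothetical pools: one application per high-numbered port.
lb = Balancer({
    "/mail": ["app1:8001", "app2:8001"],
    "/wiki": ["app2:8002"],
})
first = lb.route("/mail/inbox")
second = lb.route("/mail/inbox")
```

Successive requests for the same prefix rotate through the live backends, and a backend marked down simply drops out of the rotation until it is marked up again.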
The choice of OpenBSD may seem somewhat esoteric, given its normal use for low-level security appliances such as firewalls, but its built-in support for shared virtual IP addresses via the CARP protocol made it relatively easy to build a “hot spare” replication pair, where the backup load balancer will automatically begin handling requests if the primary ever fails.
In addition, OpenBSD’s excellent track record on security issues allows us to treat the load balancers as a security boundary behind which sensitive data can be transmitted in plaintext, as no network traffic can reach the backend app servers and database without first passing through the OpenBSD hosts. (We of course still encrypt all traffic passing out of the web cluster, including to and from the central IMAP and SMTP servers.)
Our application servers, by contrast, are configured as more or less standard web servers: they run Debian Linux, and use the standard Apache and PHP packages provided by the distribution. Each application is hosted by an Apache listener running on a different high-numbered port, which allows us to pick which applications are run on different physical application servers. (From experience, this can prove critical, given the potential incompatibilities that varied web applications can have due to differing dependencies or configuration.)
This de-coupling of applications from the normal single-process, single-port model also opens up the possibility in the future of running server processes other than Apache for some applications. This is becoming standard practice in web application runtime environments other than PHP, including Java servlets and Ruby on Rails applications.
One component of web application hosting which we examined in some depth, but were unable to find a satisfactory method of fully distributing across the web servers, was server-side session handling. The PHP runtime provides transparent support for storing state between client requests in session objects, which are traditionally stored on the web server filesystem.
In order to ensure that the session data is available to users in a load-balanced system such as ours, there are two primary options: make the session available to all the backend servers (either via the database backend, or some additional sharing mechanism), or ensure that the user will be directed to the same backend throughout the life of their session. Since our particular target applications can often have sizeable amounts of data (on the order of 50-100KB per user) in the session objects, the latter option proved to have much better performance.
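The second option, often called session affinity or “sticky sessions”, can be sketched in a few lines of Python. This is a simplified model under stated assumptions rather than a description of any particular balancer: the class name and backend names are hypothetical, and the assignment table is kept in memory purely for illustration.

```python
class StickyRouter:
    """Toy session-affinity router: a session stays on one backend."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.assigned = {}  # session id -> backend name
        self._next = 0      # round-robin position for new sessions

    def route(self, session_id, live=None):
        """Return the backend for this session.

        `live` is an optional collection of backends known to be up;
        by default every backend is assumed to be available.
        """
        if live is None:
            candidates = self.backends
        else:
            candidates = [b for b in self.backends if b in live]
        if not candidates:
            raise RuntimeError("no backend available")

        backend = self.assigned.get(session_id)
        if backend not in candidates:
            # New session, or the previous backend is gone. In the
            # second case any session data held on the old backend
            # is lost, which is the cost of not sharing sessions.
            backend = candidates[self._next % len(candidates)]
            self._next += 1
            self.assigned[session_id] = backend
        return backend


router = StickyRouter(["app1", "app2"])
a = router.route("session-a")
b = router.route("session-b")
a_again = router.route("session-a")
```

The trade-off is visible in the fallback branch: affinity avoids copying 50-100KB of session data between hosts on every request, but a user whose backend fails will be reassigned and will have to start a fresh session.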
Building a low-cost, fully-replicated database service is almost certainly the most complex and interesting component of the whole system. In the world of free databases, there are two primary options: MySQL and PostgreSQL. Much has been written about the relative strengths and advantages of these two products, but in the end most stable open source web applications support both more or less equally well, so the selection of one or the other often comes down to personal preferences.
I admit to a personal bias towards PostgreSQL, based both on personal experience with it in production applications, and the belief that its features are a superset of those in MySQL. Again, the additional flexibility offered in its richer feature set is not essential for our webmail application, but it may prove valuable in the future for applications that can more effectively utilize it.
Unfortunately, neither database engine offers a complete highly-available replication solution “out of the box.” MySQL offers a simple master/slave replication model which allows for one read/write “master” server and one or more read-only “slaves”, but there is no built-in mechanism for one of the slaves to assume master status if the original master goes down.
PostgreSQL, on the other hand, simply provides no built-in replication support. This has prompted several development efforts (incl. Slony, PGPool, and others) to attempt, with varying levels of success, to “bolt on” a solution for Postgres users.
Most of these packages have been based on a set of triggers and procedures running in the master database which fire on every INSERT, UPDATE, and DELETE call; however, they generally do not support schema synchronization, and cannot guarantee that certain internal details (like OIDs) will be kept in sync between replicas.
More recently, the PostgreSQL core team have implemented Write-Ahead-Logging, or WAL, which could conceivably be used to implement a master/slave replication by simply pushing the transaction logs to one or more read-only replicas, and then forcing “recovery” of the database (i.e., replay of the logs) to bring them up to date with the primary database.
None of these options were really appealing—the poor performance, integrity issues, and lack of support in the PostgreSQL core all led us to look for alternatives to the usual approaches.
Since PostgreSQL lacks integral replication support, we began examining more general-purpose solutions borrowed from the world of Linux clusters. There are a variety of “cluster filesystems” which distribute data across a number of physical hosts, but most involve a fairly large amount of configuration and infrastructure support in pursuit of scaling to arbitrarily-large clusters. Since our goal in “clustering” our database services was reliability, not performance, we preferred in this case to implement as simple a replication model as possible.
Our attention was quickly drawn to DRBD, a Linux kernel module that implements a network-mirrored block device. It offers many of the benefits of shared-storage replication models such as a Fibre Channel SAN, without requiring expensive host adapters or standalone RAID units. Obviously, using regular TCP/IP instead of a highly-optimized storage bus will impact performance, but the cost savings are quite noticeable, and we avoid introducing a single failure point at the shared storage device itself.
Our DRBD deployment is relatively simple: each database server has an approx. 300GB internal RAID-5 array with three active disks and one hot spare. The RAID volumes are identically partitioned on both hosts: modest partitions are provided for the local root filesystem, log storage, etc., while the majority (~260GB) was initially left uninitialized for use as a DRBD volume.
Since DRBD itself only exports another virtual block device, a regular filesystem must be built atop that space. We chose ext3 for the time being, due to its general reliability and consistent performance. More careful selection, perhaps beginning with a full benchmark of relative performance under a number of filesystems, would likely be an interesting exercise.
Building and configuring DRBD is relatively straightforward, especially using the Debian kernel module support tools. (There are also several excellent HOWTO guides online; Google ‘drbd nfs debian’ for more.) Once the DRBD filesystem was initialized, moving the PostgreSQL data to it was straightforward: we simply placed symbolic links from the normal database paths (’/var/lib/postgresql’, in a standard Debian environment) to a directory on the DRBD volume.
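The symbolic-link relocation can be illustrated with a short Python sketch. It runs entirely inside a scratch directory, so the paths and file names below are hypothetical stand-ins for the real database path and the DRBD mount point, and a real move would of course be done with the database service stopped.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(original, new_parent):
    """Move `original` under `new_parent`, leaving a symlink behind."""
    target = os.path.join(new_parent, os.path.basename(original))
    shutil.move(original, target)  # move the directory tree
    os.symlink(target, original)   # old path now points at the new one
    return target


# Demonstration in a temporary directory (stand-in paths only).
root = tempfile.mkdtemp()
old_path = os.path.join(root, "postgresql")    # stands in for the data path
drbd_mount = os.path.join(root, "drbd_volume")  # stands in for the DRBD mount
os.makedirs(old_path)
os.makedirs(drbd_mount)
with open(os.path.join(old_path, "example.dat"), "w") as fh:
    fh.write("data\n")

new_path = relocate_with_symlink(old_path, drbd_mount)

# Programs that open the old path are transparently redirected.
with open(os.path.join(old_path, "example.dat")) as fh:
    contents = fh.read()
```

Because the old path still resolves, software configured with the standard location needs no changes; only the node that currently has the DRBD volume mounted will find real data behind the link.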
With this accomplished, no further special configuration was required for PostgreSQL. The implementation of PostgreSQL has long been proven reasonably robust on a variety of filesystems, including in some cases NFS shares, so dealing with the additional latency introduced by DRBD’s “two-phase commit” model should be no problem.
Our overall configuration diverges from those described in the HOWTO guides primarily in its use of UCARP, a userspace implementation of the same CARP protocol our load balancers utilize for virtual-IP address takeover, to monitor host status and trigger failover, rather than the more common ‘heartbeat’ tool used in Linux high-availability settings.
(The decision to use UCARP stemmed both from familiarity with the underlying protocol from the load balancer deployment, and some mixed impressions of heartbeat based on its use in other high-availability services at our institution. It is an excellent tool, but properly configuring it for rapid failover without introducing potentially undesirable behaviors is a truly difficult task.)
At its most basic, UCARP can be considered a simple status monitor; effectively, two hosts with different priorities periodically announce their status via a simple heartbeat message, and the “live” host with the highest priority at any given time is designated as the “primary”. Any time a host either becomes the primary, or loses that status, the UCARP service runs a script associated with that event.
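The election rule just described can be modeled in a few lines of Python. This is a deliberately simplified sketch: the function names, host names, and numeric priorities are hypothetical, and it ignores the timing and messaging details of the real protocol.

```python
def elect_primary(hosts):
    """Return the live host with the highest priority, or None.

    `hosts` maps a host name to a (priority, alive) pair.
    """
    live = [(prio, name) for name, (prio, alive) in hosts.items() if alive]
    if not live:
        return None
    return max(live)[1]


def transitions(before, after):
    """List the (host, event) pairs a status change would trigger."""
    old, new = elect_primary(before), elect_primary(after)
    if old == new:
        return []  # same primary, so no script runs
    events = []
    if old is not None:
        # A host that has crashed outright cannot run its script,
        # but one that merely lost its network link still can.
        events.append((old, "demote"))
    if new is not None:
        events.append((new, "promote"))
    return events


both_up = {"db1": (200, True), "db2": (100, True)}
db1_down = {"db1": (200, False), "db2": (100, True)}

failover = transitions(both_up, db1_down)
failback = transitions(db1_down, both_up)
```

Note that in this model the higher-priority host reclaims the primary role as soon as it returns, which means a recovery triggers a second round of demotion and promotion scripts.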
It is assumed in most cases that hosts using UCARP will share a “virtual IP” address that always points to the current primary. Thus, the first action of these promotion/demotion scripts is to assume or relinquish control of this network address as appropriate.
In the case of a host that has just become the primary, the next action is to assume control of the DRBD filesystem, and then start the PostgreSQL service. Conversely, upon notification from UCARP that it should relinquish primary status, the server shuts down PostgreSQL and unmounts the DRBD filesystem.
We have also begun experimenting with DRBD-backed NFS, as described in the aforementioned HOWTO, as a shared file store for content uploaded via the web applications.
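The ordering of those steps matters, so it may help to see it written out. The following Python sketch builds the two command sequences and, by default, only prints them. Everything specific in it is an assumption for illustration: the address, interface, DRBD resource name, device, mount point, and init script path are hypothetical placeholders, and the final step that returns the DRBD resource to the secondary role is our addition rather than something stated above.

```python
import subprocess

# Hypothetical placeholder values, not the settings of any real host.
CONFIG = {
    "vip": "192.0.2.10/24",
    "iface": "eth0",
    "resource": "r0",
    "device": "/dev/drbd0",
    "mountpoint": "/srv/drbd",
    "service": "/etc/init.d/postgresql",
}


def promotion_steps(cfg):
    """Commands for a host that has just become the primary."""
    return [
        ["ip", "addr", "add", cfg["vip"], "dev", cfg["iface"]],
        ["drbdadm", "primary", cfg["resource"]],
        ["mount", cfg["device"], cfg["mountpoint"]],
        [cfg["service"], "start"],
    ]


def demotion_steps(cfg):
    """Commands for a host that must give up primary status."""
    return [
        ["ip", "addr", "del", cfg["vip"], "dev", cfg["iface"]],
        [cfg["service"], "stop"],
        ["umount", cfg["mountpoint"]],
        # Assumption: release the DRBD resource so the peer can take it.
        ["drbdadm", "secondary", cfg["resource"]],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


run_steps(promotion_steps(CONFIG))
```

The database is started only after the filesystem is mounted, and stopped before it is unmounted, so PostgreSQL never runs without its data directory underneath it.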
We’ve observed decent, if unexciting, performance from the layered system. Reading data (i.e., SQL SELECT queries) occurs at about 90% of “raw” speed, while writes occur at 25% of the baseline. In our environment, that translates to about 200 SELECT transactions per second, and 35-40 UPDATE or DELETEs. Most web applications perform vastly more reads from the database than writes, and webmail especially so, since much of the state to be updated lives on the mail server, not in the database.
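As a rough cross-check, and assuming the percentages and the per-second rates describe the same benchmark runs, the unreplicated baseline implied by these figures can be worked out directly. The function name below is hypothetical, and the results are derived estimates rather than separately measured values.

```python
def implied_baseline(observed_per_sec, fraction_of_raw):
    """Estimate raw throughput from observed rate and relative speed."""
    return observed_per_sec / fraction_of_raw


# Reads: about 200 per second at roughly 90% of raw speed.
baseline_reads = implied_baseline(200, 0.90)

# Writes: 35-40 per second at roughly 25% of the baseline.
baseline_writes_low = implied_baseline(35, 0.25)
baseline_writes_high = implied_baseline(40, 0.25)

print(round(baseline_reads), baseline_writes_low, baseline_writes_high)
```

Under that assumption the same hardware without DRBD would manage a little over 220 reads and somewhere around 140 to 160 writes per second, so the replication cost falls almost entirely on the write path.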
Testing of the entire stack against our current server shows comparable total request time, but vastly reduced CPU load on the application server(s). With SSL encryption duties offloaded to the load balancer, and database query processing moved to the PostgreSQL server, we expect to be able to handle much higher numbers of concurrent connections.
The failover logic appears quite effective for the simple scenarios we’ve developed thus far: if the primary host is taken offline (either logically, by disabling networking from a local console login, or physically, by disconnecting it from the network switch), the secondary host is able to mount the shared filesystem and resume database services within a few seconds.