On Thu, Mar 08, 2007 at 02:40:07AM +0900, Greg Loriman wrote:
> I'm putting together a relatively simple site which I want to design from 
> the ground up for horizontal scalability, partly for the challenge, partly 
> because I need to learn and get experience. To help me do this I am going to 
> run at least two virtual machines to enforce the correct environment.
> 
> Currently my idea is to federate the data so that the users are divided 
> between the 2 or more machines, perhaps splitting alphabetically by user 
> name (ie. A-G to machine 1, etc). Where there is interaction between account 
> holders I am thinking of Drb'ing. Obviously rails is not going to be able to 
> do the interaction side of things, but I am fine with that; I'm prepared for 
> a bit of manual labour.
> 
> I would love comments/advice on my above ideas, and further insights into 
> horizontal scaling.

Well: my advice is that the sort of "loose federation" you describe is
something which is very difficult to build. You can make it work for, say, a
proxy-based POP3/IMAP mail cluster: here the protocol is straightforward,
the session can be unambiguously proxied to the right backend server, and
there is no interaction between accounts. (When you start using IMAP shared
folders, this breaks down.)

However even in such a scenario, you don't have resilience. If you lose the
machine where the A to G accounts are stored, then all those users lose
their mail. So in fact each backend machine has to be a mini-cluster, or at
least, have mirrored disks and a warm spare machine to plug them into when
disaster strikes.

Many people have resilience as high, or higher, on their agenda than
performance. So this doesn't sound like a good way to go.

My advice would be:

1. Keep your database in one place, so that all the front-ends have shared
access to the same data at all times.

To start with have a single database machine. Then expand this to a
2-machine database cluster. You can then point 2, 3, 4 or more front-ends at
this cluster; for many applications you may find that you won't need to
scale the database until later.

(Note that regardless of whether your application is heavier on front-end
CPU or back-end database resources, scaling the front-ends and the database
cluster separately makes it much easier to monitor resource utilisation and
scale each part as necessary.)

The easiest way to do database clustering is with a master-slave
arrangement: do all your updates on the master, and let these replicate out
to the slaves, where read-only queries take place. Of course, this isn't
good enough for all applications, but for others it's fine.
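The read/write split can be sketched in a few lines of Ruby. This is only an
illustration of the routing rule, not a real Rails adapter; the "master" and
"slave" connection objects are stand-ins for whatever database handles you
actually use:

```ruby
# A minimal sketch of master-slave query routing. Any object that
# responds to #execute can stand in for a real database connection.
class ReplicatedConnection
  def initialize(master, slave)
    @master = master   # receives all writes
    @slave  = slave    # receives read-only queries
  end

  # Send SELECTs to the slave; everything else must go to the master
  # so that replication remains the only write path.
  def execute(sql)
    if sql =~ /\A\s*select/i
      @slave.execute(sql)
    else
      @master.execute(sql)
    end
  end
end
```

The catch with this scheme is replication lag: a read issued immediately
after a write may not see it on the slave, which is why it isn't good enough
for every application.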

Full database clustering is challenging, but if your site is making you lots
of money you can always throw an Oracle 10g grid at it. If you're seriously
thinking of that route, you can start with Oracle on day one; it is now free
for a single processor with up to 1GB of RAM and 4GB of table space.

2. For transient session state, assuming your session objects aren't
enormous, use DRb to start with. Point all your front-ends at the same DRb
server. DRb is remarkably fast for what it does, since all the marshalling
is done in C.

When you outgrow that, go to memcached instead. This is actually not hard to
set up: you just run a memcached process on each server. The session data is
automatically distributed between the nodes.

Neither approach is totally bombproof: if you lose a node, you'll lose some
session data. Either put important session data in the database, or build a
bombproof memcached server [boots from flash, no hard drive, fanless]

If that level of resilience isn't important, then you don't need a separate
memcached server: if you have N webapp front-ends, just run memcached on
each of them.
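The "automatic distribution" is nothing magical: the client hashes each key
and uses the hash to pick a node. A sketch of the idea in plain Ruby (the
node list is illustrative, and real clients like memcache-client use their
own hashing scheme):

```ruby
require 'digest/md5'

# Sketch of how a memcached client spreads keys across nodes:
# hash the key, then use the hash to pick one server from the list.
NODES = ['app1:11211', 'app2:11211', 'app3:11211']

def node_for(key)
  index = Digest::MD5.hexdigest(key).to_i(16) % NODES.size
  NODES[index]
end
```

Note that with plain modulo hashing, adding or removing a node remaps most
keys; that's acceptable for a cache, since a miss just means a rebuild.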

> To facilitate the above I need some kind of proxy in front of the two 
> machines directing incoming requests to the correct machine based on the 
> login name which will be part of the url. Here I come unstuck since I have 
> no idea how to do this.

Well, the traditional approach is to buy a web loadbalancing appliance (or
resilient pair of them), and configure it for "sticky" load balancing based
on a session cookie or some other attribute in the URL.

Hardware appliances are generally good. They are reliable over time; there's
much less to go wrong than a PC. They do a single job well.

You could instead decide to use a recent version of Apache with mod_proxy to
do the proxying for you.

But it may be better to design your app with a single shared database and a
single shared session store, such that it actually doesn't matter where each
request arrives.

> Can anyone give me a few pointers? Is squid the thing? Mongrel (I don't 
> really know what mongrel is)? Can apache be made to do this, and if so is it 
> a bad idea? Obviously it needs to be pluggable since I'll be using my own 
> code (C or Pascal) to do the lookups for the redirection.

mod_proxy with mod_rewrite is "pluggable" in the way you describe. See
http://httpd.apache.org/docs/2.2/misc/rewriteguide.html
and skip to the section headed "Proxy Throughput Round-Robin".

You'd use an External Rewriting Program (map type prg) to choose which
back-end server to redirect to. The example above is written in Perl, but
the same is equally possible in Ruby, C, Pascal or whatever.
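The prg protocol is trivial: Apache writes one lookup key per line to the
program's stdin and reads the mapped value back from stdout, which must be
unbuffered. A sketch in Ruby, using your alphabetical split (the hostnames
and file path are illustrative):

```ruby
#!/usr/bin/env ruby
# External Rewriting Program (RewriteMap prg type) sketch.
# Hypothetical directive in httpd.conf:
#   RewriteMap backend prg:/usr/local/bin/pick_backend.rb

# Map a login name to a backend host: A-G to machine1, the rest
# to machine2, matching the split described above.
def backend_for(login)
  login[0,1].downcase.between?('a', 'g') ? 'machine1' : 'machine2'
end

$stdout.sync = true            # Apache requires unbuffered replies
while line = $stdin.gets
  puts backend_for(line.strip)
end
```

Beware that Apache starts exactly one copy of this program and serialises
lookups through it, so keep it fast and never let it block.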

However, if you don't know anything about Apache, this is certainly not
where I'd recommend you start.

squid is a proxy cache. You can use it to accelerate static content, but it
won't help you much with dynamic pages from Rails. Mongrel is a webserver
written in Ruby, much as webrick is, although it is apparently more
efficient.

In summary I'd say start your design with the KISS principle:

- one database; scale it horizontally (by database clustering) when needed

- one global session store; scale it horizontally when needed

- one frontend application server; scale horizontally when needed

In addition to that, consider:

- serve your static HTML, images and CSS from a fast webserver
  (e.g. apache, lighttpd). This is easy to arrange.

- consider serving your Rails application from the same webserver using
  fastcgi (e.g. Apache mod_fcgid), rather than a Ruby webserver like
  mongrel or webrick. Harder to set up, but you can migrate to this later.
  Then most HTTP protocol handling is being done in C.

- profile your application carefully to find out where the bottlenecks are,
  before you throw hardware at performance problems.

HTH,

Brian.