On Thu, 4 Oct 2007 04:40:52 +0900, Chad Perrin wrote:

>> No argument there, as long as it's understood that there are limits to
>> what can be achieved.  I don't want to discourage anyone from seeking
>> linear scalability as an ideal, but it's not a realistic thing to
>> promise or assume.
> 
> It's close enough (again), for many purposes, to "realistic".  When you
> can get roughly linear scaling up to 100 times as much scaling needs, as
> opposed to trying to get similar scaling capabilities out of throwing
> programmers (or programmer time) at the problem, that's certainly
> "realistic" in my estimation.

A lot depends on your application requirements.  If you design it from the
ground up to be "shared nothing", then you may well be lucky enough to
truly HAVE shared nothing.  But you'll also have a pretty limited feature
set.

What's the big buzzword today?  Social networking.  What did we use to
call that?  "Community."  What was the single biggest sticky-paper
community feature?  Buddy lists.  Who does buddy lists besides the Big Guys
(who can throw money at it) and the really small guys (who fit on a single
server)?  Nobody.  Why?  Doesn't scale linearly.  Think about what it takes
to offer a feature that, for every simultaneous user, checks the list of
every other simultaneous user for people you know.  Shared-nothing *that*.
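
To make that concrete, here's a toy sketch in Python.  Everything in it
(the Node class, the node names, the users) is made up for illustration,
and it's not how any real service does it.  Each front-end node knows only
its own sessions, so answering "which of my buddies are online?" means
asking every node:

```python
# Toy model: each node holds only the sessions of users connected to it.
class Node:
    def __init__(self, name):
        self.name = name
        self.online = set()   # screen names connected to this node

    def login(self, user):
        self.online.add(user)


def online_buddies(buddy_list, nodes):
    """Return (buddies currently online, number of nodes we had to ask)."""
    found = set()
    asked = 0
    for node in nodes:        # no shared presence table, so ask everyone
        asked += 1
        found |= buddy_list & node.online
    return found, asked


nodes = [Node(f"fe{i}") for i in range(4)]
nodes[0].login("jim")
nodes[2].login("chad")
nodes[3].login("jay")

found, asked = online_buddies({"jim", "jay", "kris"}, nodes)
print(sorted(found), asked)   # ['jay', 'jim'] 4
```

In this toy, every lookup touches every node, so doubling the nodes doubles
the work per lookup.  The way out is a shared presence table, which is
exactly what shared-nothing rules out.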

My area of expertise was the AOL mail system.  And, looking back, there
were a number of core features we offered that simply couldn't be done in a
shared-nothing world over slow phone lines:

- You could see when each recipient had read your e-mail.
- If nobody had read it, you could unsend it. 
- You could forward large attachments and long threads without re-uploading
them.  
- Corollary: the server could handle large attachments and long threads
without storing multiple copies.  (Disks were tiny then.)
- When sending e-mail, the system would check if all your recipients were
valid, not full, accepting e-mail from you, etc.  If any were not, the
message wouldn't send.  No bounces.  (This requires a two-phase commit.)
- If the sender or other recipients of an e-mail were online, their
address would become a hyperlink so you could IM them.  (Buddy lists
again.)
- Your outbox pointed to the same message body as your recipients' inbox.
And all recipients' inboxes pointed to the same message as well. (Disk
space again.)
- The same e-mail message would appear differently to different clients,
depending on the feature set of that client.
- If you were the BCC recipient of an e-mail, you'd see a BCC header with
your name.  If you were the author, you'd see all the BCCs.
- Large bulk mailings stored only a single copy of the message.
- E-mail complaints could be sent to us in a manner that preserved their
evidentiary value in court.
- The client stayed in a wait state after sending until the servers could
guarantee that your e-mail had actually been delivered.  (Two-phase commit
again.)
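
For anyone who hasn't met two-phase commit, here's a minimal sketch of a
"no bounces" send in Python.  The Mailbox class, its fields, and the checks
are hypothetical stand-ins, not the real system.  The point is only that
the coordinator has to hear "yes" from every recipient's mailbox before
anything gets delivered to anyone:

```python
class Mailbox:
    def __init__(self, owner, capacity=2, blocked=()):
        self.owner = owner
        self.capacity = capacity
        self.blocked = set(blocked)   # senders this user refuses
        self.messages = []
        self.reserved = 0             # slots held by in-flight sends

    def prepare(self, sender):
        """Phase 1: vote yes only if delivery is guaranteed to succeed."""
        if sender in self.blocked:
            return False
        if len(self.messages) + self.reserved >= self.capacity:
            return False
        self.reserved += 1            # hold a slot until commit or abort
        return True

    def commit(self, msg):
        """Phase 2: turn the reserved slot into a delivered message."""
        self.reserved -= 1
        self.messages.append(msg)

    def abort(self):
        """Release the slot held by a failed send."""
        self.reserved -= 1


def send(sender, msg, directory, recipients):
    """Deliver to all recipients or to none.  True means delivered."""
    prepared = []
    for name in recipients:
        box = directory.get(name)     # unknown address: no mailbox at all
        if box is None or not box.prepare(sender):
            for p in prepared:        # somebody voted no: release holds
                p.abort()
            return False
        prepared.append(box)
    for box in prepared:
        box.commit(msg)
    return True


directory = {"chad": Mailbox("chad"),
             "kris": Mailbox("kris", blocked=["spammer"])}
print(send("jay", "hello", directory, ["chad", "kris"]))   # True
print(send("jay", "oops", directory, ["chad", "nobody"]))  # False
print(len(directory["chad"].messages))                     # 1
```

Note that the sender waits through both phases, and every mailbox involved
has to hold state on behalf of a transaction it doesn't own.  A real
implementation also has to cope with a participant dying between the two
phases, which this sketch ignores.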

That's just off the top of my head; I'm sure there were dozens of other
things I've forgotten that were designed when all the mail servers fit on
one machine, and then had to scale to multiple replicated data centers.

Could e-mail live without these features?  Sure.  Internet e-mail never had
them, a generation grew up without them, and these days nobody bemoans the
fact that you can't instantly know your message was delivered to its
destination; in fact, even bounces are becoming a thing of the past.  If
you mistype an address, you may never know, and that's just the way it
works.  "Did you get my e-mail?" is a real question, not just a
passive-aggressive way of saying "I see you read my message, but have not
yet responded."

And some of the features were only important in an age where pipes (both
last-mile and LAN) were very narrow and disks and RAM were very small.
Spam, in particular, made the "one copy of each message" model obsolete,
because spammers wouldn't play by the rules.  
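
The "one copy" model itself is easy to sketch.  What follows is a toy
Python version with invented names (MessageStore, link, unlink), not the
real design: bodies live in one shared store with a reference count, and
every inbox and outbox holds only a pointer.

```python
class MessageStore:
    def __init__(self):
        self.next_id = 1
        self.bodies = {}   # message id -> body text
        self.refs = {}     # message id -> number of folders pointing at it

    def put(self, body):
        """Store a body once and return its id."""
        mid = self.next_id
        self.next_id += 1
        self.bodies[mid] = body
        self.refs[mid] = 0
        return mid

    def link(self, mid, folder):
        """Add a pointer to the message in one folder."""
        folder.append(mid)
        self.refs[mid] += 1

    def unlink(self, mid, folder):
        """Drop one pointer; free the body when the last one goes."""
        folder.remove(mid)
        self.refs[mid] -= 1
        if self.refs[mid] == 0:
            del self.bodies[mid]
            del self.refs[mid]


store = MessageStore()
outbox, chad_inbox, kris_inbox = [], [], []

mid = store.put("long thread plus large attachment")
for folder in (outbox, chad_inbox, kris_inbox):
    store.link(mid, folder)
copies_stored = len(store.bodies)    # one body, three pointers

store.unlink(mid, chad_inbox)
store.unlink(mid, kris_inbox)
still_there = mid in store.bodies    # the outbox still points at it

store.unlink(mid, outbox)
gone = mid not in store.bodies

print(copies_stored, still_there, gone)   # 1 True True
```

The store and its reference counts are shared by every mailbox that touches
the message, which is the opposite of shared-nothing.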

But restricting yourself to only shared-nothing features means ruling out
an awful lot of features.  Including anything depending on a database
index, or a table that fits completely in memory, or any sort of
rate-limiting or duplicate-detection or spam prevention, or in fact
anything that makes any assumptions at all about the state of any database
you're interacting with or relational integrity or any other transaction in
the system, ever.  Including whether the disk drive holding the transaction
you just wrote to disk has disappeared in a puff of head crash.
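
Rate-limiting is a handy illustration.  Here's a small Python sketch, with
an arbitrary limit and made-up names, comparing a per-sender limit enforced
independently on each node against the same limit enforced through one
counter that every node consults:

```python
from collections import Counter

LIMIT = 5   # messages allowed per sender per window (arbitrary number)


class LocalLimiter:
    """Counts only the traffic this one limiter has seen itself."""
    def __init__(self):
        self.seen = Counter()

    def allow(self, sender):
        if self.seen[sender] >= LIMIT:
            return False
        self.seen[sender] += 1
        return True


def accepted(limiters, sender, attempts):
    """Spread attempts round-robin over the limiters; count what passes."""
    ok = 0
    for i in range(attempts):
        if limiters[i % len(limiters)].allow(sender):
            ok += 1
    return ok


# Four independent nodes: the sender gets 4 x LIMIT through.
shared_nothing = accepted([LocalLimiter() for _ in range(4)], "spammer", 100)

# One counter consulted by everybody: the limit actually holds.
shared_counter = accepted([LocalLimiter()], "spammer", 100)

print(shared_nothing, shared_counter)   # 20 5
```

With independent nodes the effective limit grows with the number of nodes,
so the feature gets weaker precisely as you scale out.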

It was always the little things that bit us.  Know why AOL screen names are
often "Jim293852"?  Well, it started out as "The name 'Jim' is already
taken.  Would you like 'Jim2'?".  Guess how well that scales when the first
available Jim is "Jim35000"?  Not very.
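
Here's roughly why, as a Python sketch.  The function names and the
in-memory set are my inventions for illustration; in real life each
membership test would be a lookup against the account database.  The first
function walks the suffixes in order.  The second guesses a random
six-digit suffix, which is one way you could end up with a name like
"Jim293852":

```python
import random


def suggest_sequential(base, taken):
    """Try base2, base3, ...  Returns (name, number of lookups)."""
    lookups = 0
    n = 2
    while True:
        lookups += 1
        candidate = f"{base}{n}"
        if candidate not in taken:
            return candidate, lookups
        n += 1


def suggest_random(base, taken, rng, digits=6):
    """Try random suffixes until one is free.  Returns (name, lookups)."""
    lookups = 0
    while True:
        lookups += 1
        candidate = f"{base}{rng.randrange(10 ** digits)}"
        if candidate not in taken:
            return candidate, lookups


# Jim and Jim2 through Jim34999 are already taken.
taken = {"Jim"} | {f"Jim{n}" for n in range(2, 35000)}

name, lookups = suggest_sequential("Jim", taken)
print(name, lookups)   # Jim35000 34999

name, lookups = suggest_random("Jim", taken, random.Random(0))
print(name, lookups)   # some free name, usually after a lookup or two
```

The sequential version costs one lookup per Jim who got there first, for
every new Jim.  The random version usually needs only a lookup or two while
the suffix space is mostly empty, at the price of uglier names.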

Pop quiz:  Which of *your* core features would you have to eliminate with
three million simultaneous users?  
-- 
Jay Levitt                |
Boston, MA                | My character doesn't like it when they
Faster: jay at jay dot fm | cry or shout or hit.
http://www.jay.fm         | - Kristoffer