tabled vs. BDB high availability

From: Jeff Garzik <jeff@garzik.org>
To: Project Hail <hail-devel@vger.kernel.org>
Subject: tabled vs. BDB high availability
Date: Sun, 07 Mar 2010 17:51:41 -0500
Message-ID: <4B942DFD.5020102@garzik.org>

As I tersely noted on [1], we have a bit of an issue with regard to
tabled endpoint high availability, even with BDB replication+failover.

For all HTTP methods that update the database, we return a 
RedirectClient error with HTTP Location header.  Even ignoring the FIXME 
in cli_err() [tabled/server/server.c], and even though this behavior is 
within the S3 API, it presents problems as we have currently implemented 
things.

If a site implements a single endpoint "tabled.example.com" that returns
multiple A/AAAA records in DNS, then the client could potentially cycle
through a large number of redirects from tabled slaves before finally
hitting the tabled master, with no guarantee of -ever- finding the
master.  And the larger your tabled cell, the more redirects each client
must suffer before finding a master.
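
To put a rough number on that, here is a minimal simulation sketch.  It
is not tabled code.  It assumes each lookup of the shared name picks one
of the cell's addresses uniformly at random, and that every slave
redirects the client back to that same shared name.  The function names
and the max_hops cutoff are made up for illustration.

```python
import random


def redirects_until_master(n_nodes, rng, max_hops=1000):
    """Count redirects one client follows before it reaches the master.

    Assumption: every attempt lands on a uniformly random node, and a
    slave can only bounce the client back to the shared endpoint name.
    """
    master = 0  # arbitrary choice: node 0 is the current master
    hops = 0
    while hops < max_hops:
        if rng.randrange(n_nodes) == master:
            return hops
        hops += 1  # hit a slave, follow its redirect and try again
    return hops  # gave up: nothing guarantees the master is ever found


def mean_redirects(n_nodes, trials=20000, seed=1):
    """Average redirect count over many simulated clients."""
    rng = random.Random(seed)
    total = sum(redirects_until_master(n_nodes, rng) for _ in range(trials))
    return total / trials
```

Under those assumptions the redirect count is geometric, so the mean is
n_nodes - 1: calling mean_redirects(4) and mean_redirects(8) should give
values close to 3 and 7 respectively, i.e. the cost grows linearly with
cell size.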

If a site implements distinct endpoints for each tabled node 
("t1.example.com", "t2.example.com", etc.) then redirects should result 
in directing clients to the current master, assuming that slaves have a 
deterministic manner of discovering the current master.

Such a setup also makes it impossible to use IP Virtual Server.

But that brings us to our second problem, a common problem in computer 
science:  the thundering herd.

When a tabled endpoint crashes or loses its master status, clients must 
move en masse to the new master.  As client counts increase, this 
becomes a "thundering herd" DDoS'ing the new target machine.

Third, our current setup, which concentrates writes on the master,
really limits parallelism.  Only the _BDB database_ must be written on
the master, but due to our design, this also limits
client->tabled->chunkd writes to the master.

Ideally, we want to enable writing on every tabled node in a cell.
Given that the metadata write is the only bit that _must_ be performed
on the master, it seems like the least-effort, least-cost solution for
us is for slaves to send a "write metadata" message to the master, and
then perform the data write itself.
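
A minimal sketch of that flow follows, as a toy model rather than tabled
code.  The Master and Slave classes, their method names, and the
in-memory dicts standing in for the BDB database and chunkd storage are
all hypothetical.  A direct method call stands in for the "write
metadata" message.

```python
class Master:
    """Toy master: the only node that touches the metadata store."""

    def __init__(self):
        self.metadata = {}  # stands in for the BDB database

    def write_metadata(self, key, size, writer):
        # In a real cell this would be a database transaction.
        self.metadata[key] = {"size": size, "writer": writer}
        return True


class Slave:
    """Toy slave: accepts client writes and stores the data itself."""

    def __init__(self, name, master):
        self.name = name
        self.master = master
        self.chunks = {}  # stands in for chunkd storage

    def put(self, key, data):
        # Step 1: only the small metadata update goes to the master.
        if not self.master.write_metadata(key, len(data), self.name):
            raise RuntimeError("master rejected metadata write")
        # Step 2: the bulk data write happens on this node and never
        # passes through the master.
        self.chunks[key] = data


master = Master()
t1 = Slave("t1", master)
t2 = Slave("t2", master)
t1.put("bucket/a", b"xyz")
t2.put("bucket/b", b"12")
```

After the two put() calls, the master holds metadata for both keys while
each data payload lives only on the slave that accepted it.  The sketch
ignores failure handling, e.g. what happens if the data write fails
after the metadata has been recorded.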

BDB documentation[2] hints that the database replication infrastructure 
may be used to send application-specific messages between BDB slaves and 
the BDB master.  That sounds worth investigating.

	Jeff

[1] http://hail.wiki.kernel.org/index.php/Extended_status
[2] file:///usr/share/doc/db4-devel-4.7.25/api_c/rep_transport.html
