From: Jeff Garzik <jeff@garzik.org>
To: Project Hail <hail-devel@vger.kernel.org>
Subject: tabled vs. BDB high availability
Date: Sun, 07 Mar 2010 17:51:41 -0500
Message-ID: <4B942DFD.5020102@garzik.org>
As I tersely noted on [1], we have a bit of an issue with regard to
tabled endpoint high availability, even with BDB replication+failover.
For all HTTP methods that update the database, we return a
RedirectClient error with an HTTP Location header. Even ignoring the FIXME
in cli_err() [tabled/server/server.c], and even though this behavior is
within the S3 API, it presents problems as we have currently implemented
things.
If a site implements a single endpoint "tabled.example.com" that returns
multiple A/AAAA records in DNS, then the client could potentially cycle
through a large number of redirects from tabled slaves, before finally
hitting the tabled master... with no guarantee of -ever- finding the
master. And the larger your tabled cell, the more redirects each client
must suffer before finding a master.
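To put a rough number on this, here is a back-of-the-envelope sketch. It assumes each redirect sends the client back through DNS and that it then lands on one of the N nodes uniformly at random, with exactly one master. That model is an assumption for illustration (real resolvers and clients may rotate or cache differently), and the function names are made up.

```python
def expected_redirects(nodes: int) -> float:
    """Mean redirects before a client reaches the single master.

    Assumes each attempt independently hits the master with p = 1/nodes,
    so the number of misses is geometric with mean (1 - p) / p.
    """
    if nodes < 1:
        raise ValueError("a cell needs at least one node")
    p = 1.0 / nodes
    return (1.0 - p) / p


def prob_master_not_found(nodes: int, attempts: int) -> float:
    """Chance that `attempts` independent tries all land on slaves."""
    if nodes < 1 or attempts < 0:
        raise ValueError("nodes must be >= 1 and attempts >= 0")
    return (1.0 - 1.0 / nodes) ** attempts


if __name__ == "__main__":
    for n in (2, 5, 20):
        print(n, expected_redirects(n), prob_master_not_found(n, 10))
```

Under this model the mean grows linearly with cell size, and the miss probability shrinks with each attempt but never reaches zero for cells larger than one node, which is the "no guarantee" case.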
If a site implements distinct endpoints for each tabled node
("t1.example.com", "t2.example.com", etc.) then redirects should result
in directing clients to the current master, assuming that slaves have a
deterministic manner of discovering the current master.
Such a setup also makes use of IP Virtual Server impossible.
But that brings us to our second problem, a common problem in computer
science: the thundering herd.
When a tabled endpoint crashes or loses its master status, clients must
move en masse to the new master. As client counts increase, this
becomes a "thundering herd" DDoS'ing the new target machine.
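For reference, one common client-side mitigation is randomized ("full jitter") exponential backoff on reconnect. This is not something tabled or its clients are known to do, and it is not the fix proposed in this message. It is only a sketch of the general technique, with a hypothetical function name and made-up default values.

```python
import random


def reconnect_delay(attempt: int, rng: random.Random,
                    base: float = 0.1, cap: float = 30.0) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    "Full jitter": pick uniformly in [0, min(cap, base * 2**attempt)] so
    clients that failed at the same instant do not all retry together.
    `base` and `cap` are illustrative defaults, not tuned values.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random()
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt, rng), 3))
```

This spreads the herd out in time but does nothing about the underlying concentration of writes on one node.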
Third, our current setup that concentrates writes on the master really
limits parallelism. It is the _BDB database_ that must only write on
the master, but due to our design, this also limits
client->tabled->chunkd writes to the master.
Ideally, we want to enable writing on every tabled node in a cell.
Given that the metadata is the only bit that _must_ be performed on the
master, it seems like the least-effort, least-cost solution for us is
for slaves to send a "write metadata" message to the master, and then
perform the data write itself.
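A toy model of that flow, to make the split concrete. Everything here is hypothetical: the class and method names are invented, a direct method call stands in for whatever message transport ends up being used, and a plain dict stands in for chunkd. It is a sketch of the idea, not tabled code.

```python
class Master:
    """Stands in for the tabled node that owns the writable BDB database."""

    def __init__(self):
        self.metadata = {}  # object key -> metadata record

    def write_metadata(self, key, size, writer):
        # Only the master touches the metadata database.
        self.metadata[key] = {"size": size, "written_by": writer}
        return True


class Slave:
    """A non-master node that still accepts client PUTs."""

    def __init__(self, name, master, chunk_store):
        self.name = name
        self.master = master
        self.chunk_store = chunk_store  # stands in for chunkd

    def put(self, key, data):
        # Step 1: send the "write metadata" message to the master.
        if not self.master.write_metadata(key, len(data), self.name):
            return False
        # Step 2: perform the data write from this node, not via the master.
        self.chunk_store[key] = data
        return True


if __name__ == "__main__":
    chunks = {}
    master = Master()
    slave = Slave("t2", master, chunks)
    print(slave.put("bucket/obj", b"hello"), master.metadata, chunks)
```

The sketch leaves open what happens if the data write fails after the master has already recorded the metadata; a real design would need to settle that ordering and cleanup question.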
BDB documentation[2] hints that the database replication infrastructure
may be used to send application-specific messages between BDB slaves and
the BDB master. That sounds worth investigating.
Jeff
[1] http://hail.wiki.kernel.org/index.php/Extended_status
[2] file:///usr/share/doc/db4-devel-4.7.25/api_c/rep_transport.html