From: Jeff Garzik <jeff@garzik.org>
To: Project Hail <hail-devel@vger.kernel.org>
Subject: tabled vs. BDB high availability
Date: Sun, 07 Mar 2010 17:51:41 -0500
Message-ID: <4B942DFD.5020102@garzik.org>
As I tersely noted on [1], we have a bit of an issue with regard to
tabled endpoint high availability, even with BDB replication+failover.
For all HTTP methods that update the database, we return a
RedirectClient error with HTTP Location header. Even ignoring the FIXME
in cli_err() [tabled/server/server.c], and even though this behavior is
within the S3 API, it presents problems as we have currently implemented
things.
If a site implements a single endpoint "tabled.example.com" that returns
multiple A/AAAA records in DNS, then the client could potentially cycle
through a large number of redirects from tabled slaves, before finally
hitting the tabled master... with no guarantee of -ever- finding the
master. And the larger your tabled cell, the more redirects each client
must suffer before finding a master.
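To put a rough number on that, here is a small simulation sketch. It
assumes (illustrative assumptions, not measured tabled behavior) that every
attempt resolves to a uniformly random node in the cell, and that a slave's
redirect sends the client back through the same round-robin name. The
function names are made up for this example.

```python
import random


def redirects_until_master(n_nodes, rng, max_tries=1000):
    """Count the redirects a client suffers before it lands on the master.

    Assumption: each attempt picks one of n_nodes uniformly at random,
    and node 0 is the master. Returns None if the master is not found
    within max_tries attempts.
    """
    master = 0
    for redirects in range(max_tries):
        if rng.randrange(n_nodes) == master:
            return redirects
    return None


def mean_redirects(n_nodes, trials=20000, seed=1):
    """Average redirect count over many simulated clients."""
    rng = random.Random(seed)
    counts = [redirects_until_master(n_nodes, rng) for _ in range(trials)]
    found = [c for c in counts if c is not None]
    return sum(found) / len(found)
```

Under those assumptions the expected number of redirects is N-1 for an
N-node cell, so mean_redirects(5) comes out close to 4, and any single
client can still be unlucky for much longer than that.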
If a site implements distinct endpoints for each tabled node
("t1.example.com", "t2.example.com", etc.) then redirects should result
in directing clients to the current master, assuming that slaves have a
deterministic manner of discovering the current master.
Such a setup also makes use of IP Virtual Server impossible.
But that brings us to our second problem, a common problem in computer
science: the thundering herd.
When a tabled endpoint crashes or loses its master status, clients must
move en masse to the new master. As client counts increase, this
becomes a "thundering herd" DDoS'ing the new target machine.
Third, our current setup that concentrates writes on the master really
limits parallelism. It is the _BDB database_ that must only write on
the master, but due to our design, this also limits
client->tabled->chunkd writes to the master.
Ideally, we want to enable writing on every tabled node in a cell.
Given that the metadata update is the only step that _must_ be performed
on the master, it seems like the least-effort, least-cost solution for us
is for a slave to send a "write metadata" message to the master, and then
perform the data write itself.
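A minimal sketch of that write path, just to make the idea concrete. All of
the class and method names here are hypothetical, the dictionaries are
stand-ins for the BDB database and for chunkd, and the "message" is a plain
method call rather than anything sent over the network. Failure handling
(for example, metadata committed but data write failing) is not addressed.

```python
class Master:
    """Hypothetical master node: the only place metadata is written."""

    def __init__(self):
        self.metadata = {}  # stand-in for the BDB database

    def write_metadata(self, key, info):
        # In a real cell this would be a message received from a slave.
        self.metadata[key] = info
        return True


class Slave:
    """Hypothetical slave node that accepts client writes directly."""

    def __init__(self, master, chunk_store):
        self.master = master
        self.chunk_store = chunk_store  # stand-in for chunkd

    def put(self, key, data):
        # Step 1: ask the master to record the metadata.
        if not self.master.write_metadata(key, {"size": len(data)}):
            return False
        # Step 2: perform the data write from this node, not the master.
        self.chunk_store[key] = data
        return True
```

With something like this, a client could issue a PUT against any node, and
only the small metadata update would be funneled through the master.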
BDB documentation[2] hints that the database replication infrastructure
may be used to send application-specific messages between BDB slaves and
the BDB master. That sounds worth investigating.
Jeff
[1] http://hail.wiki.kernel.org/index.php/Extended_status
[2] file:///usr/share/doc/db4-devel-4.7.25/api_c/rep_transport.html