All of lore.kernel.org
 help / color / mirror / Atom feed
* tabled vs. BDB high availability
@ 2010-03-07 22:51 Jeff Garzik
  2010-03-08  4:56 ` Pete Zaitcev
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff Garzik @ 2010-03-07 22:51 UTC (permalink / raw)
  To: Project Hail


As I tersely noted on [1], we have a bit of issue with regards to tabled 
endpoint high availability, even with BDB replication+failover.

For all HTTP methods that update the database, we return a 
RedirectClient error with HTTP Location header.  Even ignoring the FIXME 
in cli_err() [tabled/server/server.c], and even though this behavior is 
within the S3 API, it presents problems as we have currently implemented 
things.

If a site implements a single endpoint "tabled.example.com" that returns 
multiple A/AAAA records in DNS, then the client could potentially cycle 
through a large number of redirects from tabled slaves,  before finally 
hitting the tabled master...  with no guarantee of -ever- finding the 
master.  And the larger your tabled cell, the more redirects each client 
must suffer before finding a master.

If a site implements distinct endpoints for each tabled node 
("t1.example.com", "t2.example.com", etc.) then redirects should result 
in directing clients to the current master, assuming that slaves have a 
deterministic manner of discovering the current master.

Such a setup also makes use of IP Virtual Server impossible.

But that brings us to our second problem, a common problem in computer 
science:  the thundering herd.

When a tabled endpoint crashes or loses its master status, clients must 
move en masse to the new master.  As client counts increase, this 
becomes a "thundering herd" DDoS'ing the new target machine.

Third, our current setup that concentrates writes on the master really 
limits parallelism.  It is the _BDB database_ that must only write on 
the master, but due to our design, this also limits 
client->tabled->chunkd writes to the master.

Ideally, we want to enable writing on every tabled node in a cell. 
Given that the metadata is the only bit that _must_ be performed on the 
master, it seems like the least-effort, least-cost solution for us is 
for slaves to send a "write metadata" message to the master, and then 
perform the data write itself.

BDB documentation[2] hints that the database replication infrastructure 
may be used to send application-specific messages between BDB slaves and 
the BDB master.  That sounds worth investigating.

	Jeff


[1] http://hail.wiki.kernel.org/index.php/Extended_status
[2] file:///usr/share/doc/db4-devel-4.7.25/api_c/rep_transport.html



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2010-03-08 12:36 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-07 22:51 tabled vs. BDB high availability Jeff Garzik
2010-03-08  4:56 ` Pete Zaitcev
2010-03-08 12:36   ` Jeff Garzik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.