* tabled vs. BDB high availability
From: Jeff Garzik @ 2010-03-07 22:51 UTC (permalink / raw)
  To: Project Hail


As I tersely noted on [1], we have a bit of an issue with regard to 
tabled endpoint high availability, even with BDB replication+failover.

For all HTTP methods that update the database, we return a 
RedirectClient error with HTTP Location header.  Even ignoring the FIXME 
in cli_err() [tabled/server/server.c], and even though this behavior is 
within the S3 API, it presents problems as we have currently implemented 
things.

If a site implements a single endpoint "tabled.example.com" that returns 
multiple A/AAAA records in DNS, then the client could potentially cycle 
through a large number of redirects from tabled slaves,  before finally 
hitting the tabled master...  with no guarantee of -ever- finding the 
master.  And the larger your tabled cell, the more redirects each client 
must suffer before finding a master.
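
To put rough numbers on the single-endpoint case, here is a minimal sketch. It assumes (this is not stated above) that every redirect sends the client back to the shared name, and that the resolver then picks one of the N records uniformly at random, independently on each attempt. The function names are illustrative only.

```python
from fractions import Fraction


def expected_redirects(nodes: int) -> Fraction:
    """Mean number of redirects before a client lands on the master.

    Assumption: one master among `nodes` DNS records, and a uniform,
    independent pick per attempt. Attempts are then geometric with
    success probability 1/nodes, so the mean number of attempts is
    `nodes` and the mean number of redirects is `nodes - 1`.
    """
    if nodes < 1:
        raise ValueError("a cell needs at least one node")
    return Fraction(nodes) - 1


def miss_probability(nodes: int, attempts: int) -> Fraction:
    """Probability that `attempts` consecutive picks all hit slaves."""
    if nodes < 1 or attempts < 0:
        raise ValueError("nodes must be >= 1 and attempts >= 0")
    return Fraction(nodes - 1, nodes) ** attempts
```

Under these assumptions a 5-node cell averages 4 redirects per write, and for any cell with more than one node the miss probability stays above zero for every finite number of attempts, which is the "no guarantee" case.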

If a site implements distinct endpoints for each tabled node 
("t1.example.com", "t2.example.com", etc.) then redirects should result 
in directing clients to the current master, assuming that slaves have a 
deterministic manner of discovering the current master.

Such a setup also makes use of IP Virtual Server impossible.
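
A rough sketch of the distinct-endpoint redirect decision, for illustration only. The status codes (307 for the redirect, 503 when no master is known), the set of write methods, and every name below are assumptions; none of this is taken from tabled's cli_err() or the rest of server.c.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumption: these are the HTTP methods that update the database.
WRITE_METHODS = frozenset({"PUT", "POST", "DELETE"})


@dataclass(frozen=True)
class Node:
    hostname: str   # e.g. "t2.example.com"
    is_master: bool


def route_request(node: Node, method: str, path: str,
                  master_hostname: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request arriving at `node`."""
    if node.is_master or method not in WRITE_METHODS:
        # Master handles everything; slaves serve non-updating requests.
        return 200, {}
    if master_hostname is None:
        # Slave cannot name a master right now (e.g. mid-failover).
        return 503, {}
    # Slave points the client at the master's own hostname, so a
    # single redirect is enough.
    return 307, {"Location": f"http://{master_hostname}{path}"}
```

The point of the per-node hostnames is visible in the last line: the Location header names the master directly, so the client does not go back through a round-robin lookup.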

But that brings us to our second problem, a common problem in computer 
science:  the thundering herd.

When a tabled endpoint crashes or loses its master status, clients must 
move en masse to the new master.  As client counts increase, this 
becomes a "thundering herd" DDoS'ing the new target machine.

Third, our current setup that concentrates writes on the master really 
limits parallelism.  Only the _BDB database_ is restricted to writing on 
the master, but due to our design, this also limits 
client->tabled->chunkd writes to the master.

Ideally, we want to enable writing on every tabled node in a cell. 
Given that the metadata is the only bit that _must_ be performed on the 
master, it seems like the least-effort, least-cost solution for us is 
for slaves to send a "write metadata" message to the master, and then 
perform the data write itself.
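
The split described above could look roughly like the following in-memory sketch. Everything here is hypothetical: the class names, the metadata fields, and the dict standing in for chunkd. It only shows the ordering (metadata message to the master, then the local data write) and says nothing about failure handling or the transport.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Master:
    """Stand-in for the node that owns the writable BDB database."""
    metadata: Dict[str, dict] = field(default_factory=dict)

    def write_metadata(self, key: str, meta: dict) -> None:
        # In a real cell this would be a BDB write on the master.
        self.metadata[key] = meta


@dataclass
class TabledNode:
    """Stand-in for any tabled node (slave or master) accepting a PUT."""
    name: str
    master: Master
    # Stand-in for the chunkd storage this node writes object data to.
    chunk_store: Dict[str, bytes] = field(default_factory=dict)

    def put_object(self, key: str, data: bytes) -> None:
        # Step 1: send the "write metadata" message to the master.
        self.master.write_metadata(
            key, {"size": len(data), "stored_by": self.name})
        # Step 2: perform the data write from this node, not the master.
        self.chunk_store[key] = data
```

For example, `TabledNode("t2", master).put_object("bucket/key", b"hello")` leaves the object bytes with t2's storage while only the small metadata record travels to the master.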

BDB documentation[2] hints that the database replication infrastructure 
may be used to send application-specific messages between BDB slaves and 
the BDB master.  That sounds worth investigating.

	Jeff


[1] http://hail.wiki.kernel.org/index.php/Extended_status
[2] file:///usr/share/doc/db4-devel-4.7.25/api_c/rep_transport.html




* Re: tabled vs. BDB high availability
From: Pete Zaitcev @ 2010-03-08  4:56 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Project Hail, zaitcev

On Sun, 07 Mar 2010 17:51:41 -0500
Jeff Garzik <jeff@garzik.org> wrote:

> If a site implements distinct endpoints for each tabled node 
> ("t1.example.com", "t2.example.com", etc.) then redirects should result 
> in directing clients to the current master, assuming that slaves have a 
> deterministic manner of discovering the current master.

I did not implement it, but it's trivial in CLD. Just create a file
named "master" in the group. For added cleverness, split up the
notions of "group master" (who owns the file and most of the DB,
and knows about each master for each bucket), and "bucket master".
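
A minimal sketch of the "master" file idea, using an in-memory stand-in for the lock service. This is NOT the CLD API; the class, its methods, and the path layout are all made up for illustration.

```python
from typing import Dict, Optional


class FakeLockService:
    """In-memory stand-in for a coordination service (hypothetical)."""

    def __init__(self) -> None:
        self._files: Dict[str, str] = {}

    def create_exclusive(self, path: str, owner: str) -> bool:
        """Create `path` owned by `owner`; fail if it already exists."""
        if path in self._files:
            return False
        self._files[path] = owner
        return True

    def read(self, path: str) -> Optional[str]:
        """Return the owner recorded in `path`, or None if absent."""
        return self._files.get(path)


def join_group(svc: FakeLockService, group: str, hostname: str) -> str:
    """Try to become master by creating the group's "master" file."""
    if svc.create_exclusive(f"{group}/master", hostname):
        return "master"
    return "slave"


def current_master(svc: FakeLockService, group: str) -> Optional[str]:
    """What a slave would look up before issuing a redirect."""
    return svc.read(f"{group}/master")
```

With this shape, whichever node creates the file first is the master, and every slave has one deterministic place to look when it needs to fill in a redirect target.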

> Such a setup also makes use of IP Virtual Server impossible.

Bah, big deal.

> But that brings us to our second problem, a common problem in computer 
> science:  the thundering herd.
> 
> When a tabled endpoint crashes or loses its master status, clients must 
> move en masse to the new master.  As client counts increase, this 
> becomes a "thundering herd" DDoS'ing the new target machine.

Not a problem in practice, I expect. S3 clients are not clients that
sit connected and then are notified about a failover. Instead, they
connect, perform operations as fast as they can, quit. Therefore,
there is not going to be a spike in traffic because of the failover
that is significantly greater than the normal operations rate.

> Ideally, we want to enable writing on every tabled node in a cell. 
> Given that the metadata is the only bit that _must_ be performed on the 
> master, it seems like the least-effort, least-cost solution for us is 
> for slaves to send a "write metadata" message to the master, and then 
> perform the data write itself.

I would not do it, at least not yet. A better effect would be to
have separate DBs with separate masters for each bucket.

Another thing, how many clients do you think tabled is going to
have accessing it at any given time in any realistic deployments
for years to come? How about ONE (although, it may be multi-threaded)?

One retarded thing we can do now is to rush into implementing things
like slave-to-master metadata forwarding when we do not have a single
installation to guide us.

Anyway, my first priority is to make sure that slave mode works at all.
Currently, tabled will suicide if it cannot grab master, instead of
signing up for lock notifications, etc.

-- Pete


* Re: tabled vs. BDB high availability
From: Jeff Garzik @ 2010-03-08 12:36 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: Project Hail

On 03/07/2010 11:56 PM, Pete Zaitcev wrote:
> On Sun, 07 Mar 2010 17:51:41 -0500
> Jeff Garzik <jeff@garzik.org> wrote:
>
>> If a site implements distinct endpoints for each tabled node
>> ("t1.example.com", "t2.example.com", etc.) then redirects should result
>> in directing clients to the current master, assuming that slaves have a
>> deterministic manner of discovering the current master.
>
> I did not implement it, but it's trivial in CLD. Just create a file
> named "master" in the group. For added cleverness, split up the
> notions of "group master" (who owns the file and most of the DB,
> and knows about each master for each bucket), and "bucket master".
>
>> Such a setup also makes use of IP Virtual Server impossible.
>
> Bah, big deal.

It is, if we actually want to attract users.


>> But that brings us to our second problem, a common problem in computer
>> science:  the thundering herd.
>>
>> When a tabled endpoint crashes or loses its master status, clients must
>> move en masse to the new master.  As client counts increase, this
>> becomes a "thundering herd" DDoS'ing the new target machine.
>
> Not a problem in practice, I expect. S3 clients are not clients that
> sit connected and then are notified about a failover. Instead, they
> connect, perform operations as fast as they can, quit. Therefore,
> there is not going to be a spike in traffic because of the failover
> that is significantly greater than the normal operations rate.

Think standard web browser behavior, including HTTP 1.1 pipelining and 
extended connections...  Think also about the length of time it takes to 
negotiate a new master, and what the clients will do in the meantime. 
Major thundering herd.


>> Ideally, we want to enable writing on every tabled node in a cell.
>> Given that the metadata is the only bit that _must_ be performed on the
>> master, it seems like the least-effort, least-cost solution for us is
>> for slaves to send a "write metadata" message to the master, and then
>> perform the data write itself.
>
> I would not do it, at least not yet. A better effect would be to
> have separate DBs with separate masters for each bucket.

No, that just multiplies the problems already inherent in the current 
design, as well as creating new problems.  BDB just isn't built for 
that, so scaling that solution is a major problem.  A bucket should be a 
scalable unit, and that does nothing to solve it.  Whereas if we solve 
the current problem described in $thread, buckets are automatically 
scalable as well.


> Another thing, how many clients do you think tabled is going to
> have accessing it at any given time in any realistic deployments
> for years to come? How about ONE (although, it may be multi-threaded)?
>
> One retarded thing we can do now is to rush into implementing things
> like slave-to-master metadata forwarding when we do not have a single
> installation to guide us.

I am glad Apache httpd hackers never set such strict, low goals :) 
tabled is a web server, with all that entails, because the S3 API is 
often used to front a web site (or at least the static portion thereof). 
Standard web browser behavior and multiple, pipelining clients are 
part of tabled's client base.

If we fail to understand and solve problems that people already solved 
ten years ago, then tabled certainly will not attract end-user 
installations.  Understanding standard web server design, and the 
problems and solutions that arose from it, is very important.

	Jeff



