Thought on tabled design; scaling tabled

All of lore.kernel.org
 help / color / mirror / Atom feed

* Thought on tabled design; scaling tabled
@ 2009-08-16 12:48 Jeff Garzik
  2009-09-01  2:02 ` Pete Zaitcev
  0 siblings, 1 reply; 2+ messages in thread
From: Jeff Garzik @ 2009-08-16 12:48 UTC (permalink / raw)
  To: Project Hail; +Cc: Pete Zaitcev

So, here's a bit of a tabled brain dump...

single-master scalability
-------------------------
Any db4 replication setup enables high availability through metadata 
replication and fail-over, but it remains a single-master setup.  This 
is quite OK for CLD, whose end-state (probably?) remains a single-master 
setup, but isn't as great for tabled.

The slave tabled's are able to serve read-only HTTP requests, because 
it's trivial to do with the current db4 implementation.  That enables 
/some/ additional parallelism, but at a cost:

Existing S3 clients do not expect certain nodes to be read-only, so that 
expectation would need to be coded into tabled clients and/or site DNS 
setups.  That increases, rather than decreases, our distance from the 
goal of being compatible with existing S3 client code.

Not a huge penalty, but it will definitely prevent us working with some 
existing installations.  Some S3 applications expect massively parallel 
writes, and we will have difficulty fulfilling this expectation for some 
time.

In general, the S3 design was not really intended for a single-master 
setup.  Our design limits us due to the _metadata_ replication scheme, 
which will result in clients copying massive amounts of _data_ through 
the master.  Our metadata implementation limits our data scalability.

forward-to-master solution?
---------------------------
Are there ways to minimize work and maximize re-use of db4 replication 
framework, while enabling some additional parallelism in tabled?

One method is to permit the slave to service database read requests, 
while providing a longer circuit for database write requests:

	cli -> slave tabled -> master tabled -> slave tabled -> cli

This increases latency over a cli->master->cli circuit, but since we are 
only talking about _metadata_, it is not a huge cost.  The slave 
tabled's continue to have direct read/write access to chunkd for object 
data, after all.

Of course, if you want to scale this solution, the tabled master will 
become loaded with requests from the slave tabled.  Consider, for 
example, 20 nodes running tabled.

At that point, you might consider a solution where _only_ slave tabled's 
serve HTTP requests, and the master tabled is dedicated 100% to handling 
the metadata database requests from slave tabled nodes.

app note: tabled slave startup times
------------------------------------
It is important to note that replication startup time is unbounded. 
Each db4 replication client may be forced to download the entire 
metadata database from a remote peer, before it can permit any db4 
database access.

This will not be apparent from our current testsuite, but it will 
-definitely- be an issue in the field.

So keep that in mind, when designing tabled initialization.  You might 
consider two parallel threads at startup:

1) database init
2) CLD init

and refuse to serve HTTP requests until both threads finish.

client guarantees
-----------------
There is also the minor [until you have an app that cares...] detail 
that you can see an older version of the metadata than what you would 
see talking at the same instant to the master.

This is normal in distributed systems, but it is important to be aware 
of it, as this goes to the core guarantees that we provide to each 
client.  Amazon S3 specification is especially lenient in this area, 
noting [essentially] that it is undefined which of two PUTs, for the 
same key, will actually be stored on that key.  Underlying S3 client 
applications are expected to employ their own synchronization, to deal 
with race issues such as this.

failure handling and garbage
----------------------------
Need to pay attention to issues where a client disconnects before 
completing a PUT, or where a tabled or chunkd dies while we are 
pipelining client data to chunkd's.  Right now, we are pretty sloppy 
about leaving objects around in case of error.  Being sloppy can be OK 
-- provided (a) there is some method of garbage collection, and (b) 
later clients will not see inconsistent data via GET.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Thought on tabled design; scaling tabled
  2009-08-16 12:48 Thought on tabled design; scaling tabled Jeff Garzik
@ 2009-09-01  2:02 ` Pete Zaitcev
  0 siblings, 0 replies; 2+ messages in thread
From: Pete Zaitcev @ 2009-09-01  2:02 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Project Hail

On Sun, 16 Aug 2009 08:48:45 -0400, Jeff Garzik <jeff@garzik.org> wrote:

> single-master scalability
> -------------------------
> Any db4 replication setup enables high availability through metadata 
> replication and fail-over, but it remains a single-master setup. []

Believe me, I know, but I thought the current goal is to have
something to what users of Fedora can trust their data, and soon.
Once we have our ext3 we may look at XFSes of the cloud world.

> forward-to-master solution?
> ---------------------------

I want to leapfrog the forwarding phase and go multi-master
without DB4 (and hopefuly fix startup times, and other issues
you identified).

The reason I'm poking Jeff Darcy and everyone else with questions
about Walrus etc. is that I'm open to using anything that works.
The Eucaliptus dude denied strongly (at the Red Hat Cloud Forum)
that he's going to produce anything for the end users. I don't
know if he was just being modest or it was for real.

-- Pete

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2009-09-01  2:02 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-16 12:48 Thought on tabled design; scaling tabled Jeff Garzik
2009-09-01  2:02 ` Pete Zaitcev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.