From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Garzik Subject: tabled vs. BDB high availability Date: Sun, 07 Mar 2010 17:51:41 -0500 Message-ID: <4B942DFD.5020102@garzik.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:subject:content-type :content-transfer-encoding; bh=E+PwN6ajUDyHfb+wVvVdfie88td7npEIBJm5tQlQWvA=; b=hDIei8OYiOZeKvSaj0PenomSM6MVzk2pKC1yftynuFzLskGG3DOXmaCszejZXG5PX9 2lUlSl8ToW08n5civIV2IY3Ue/sxD6fzTIwLKgvA3/HBig4Bsz5RbW0nn4uFHqZ+yyqc cUP3sxZ+CtRXrhYtxiI00imDMf79bB48tWCgc= Sender: hail-devel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Project Hail As I tersely noted on [1], we have a bit of issue with regards to tabled endpoint high availability, even with BDB replication+failover. For all HTTP methods that update the database, we return a RedirectClient error with HTTP Location header. Even ignoring the FIXME in cli_err() [tabled/server/server.c], and even though this behavior is within the S3 API, it presents problems as we have currently implemented things. If a site implements a single endpoint "tabled.example.com" that returns multiple A/AAAA records in DNS, then the client could potentially cycle through a large number of redirects from tabled slaves, before finally hitting the tabled master... with no guarantee of -ever- finding the master. And the larger your tabled cell, the more redirects each client must suffer before finding a master. If a site implements distinct endpoints for each tabled node ("t1.example.com", "t2.example.com", etc.) then redirects should result in directing clients to the current master, assuming that slaves have a deterministic manner of discovering the current master. Such a setup also makes use of IP Virtual Server impossible. But that brings us to our second problem, a common problem in computer science: the thundering herd. When a tabled endpoint crashes or loses its master status, clients must move en masse to the new master. As client counts increase, this becomes a "thundering herd" DDoS'ing the new target machine. Third, our current setup that concentrates writes on the master really limits parallelism. It is the _BDB database_ that must only write on the master, but due to our design, this also limits client->tabled->chunkd writes to the master. Ideally, we want to enable writing on every tabled node in a cell. Given that the metadata is the only bit that _must_ be performed on the master, it seems like the least-effort, least-cost solution for us is for slaves to send a "write metadata" message to the master, and then perform the data write itself. BDB documentation[2] hints that the database replication infrastructure may be used to send application-specific messages between BDB slaves and the BDB master. That sounds worth investigating. Jeff [1] http://hail.wiki.kernel.org/index.php/Extended_status [2] file:///usr/share/doc/db4-devel-4.7.25/api_c/rep_transport.html