From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sage Weil
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
Date: Wed, 14 May 2008 09:09:58 -0700 (PDT)
Message-ID:
References: <20080513174523.GA1677@2ka.mipt.ru> <4829E752.8030104@garzik.org> <20080513205114.GA16489@2ka.mipt.ru> <20080514140908.GA14987@shareable.org>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: Evgeniy Polyakov, Jeff Garzik, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org
To: Jamie Lokier
Return-path:
In-Reply-To: <20080514140908.GA14987@shareable.org>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

On Wed, 14 May 2008, Jamie Lokier wrote:
> > Similarly, if only 1 out of 3 replicas is surviving, most people want to
> > be able to read their data, while Paxos demands a majority to ensure it is
> > correct.
>
> (Generalising to any "quorum" (majority vote) protocol).
>
> That's true if you require that all results are guaranteed consistent
> or blocked, in the event of any kind of network failure.
>
> But if you prefer incoherent results in the event of a network split
> (and those are often mergable later), and only want to protect against
> media/node failures to the best extent possible at any given time,
> then quorum protocols can gracefully degrade so you still have access
> without a majority of working nodes.

Right. In my case, I require guaranteed consistent results for critical
cluster state, and use (slightly modified) Paxos for that. For file data,
I leverage that cluster state to still maintain perfect consistency in
most failure scenarios, while also degrading gracefully to read/write
access to a single replica.
When problem situations arise (e.g., replicating to A+B, A fails,
read/write to just B for a while, B fails, A recovers), an administrator
can step in and explicitly indicate we want to relax consistency to
continue (e.g., if B is found to be unsalvageable and a stale A is the
best we can do).

> In that model, neighbour sensing is used to find the largest coherency
> domains fitting a set of parameters (such as "replicate datum X to N
> nodes with maximum comms latency T"). If the parameters are able to
> be met, quorum gives you the desired robustness in the event of
> node/network failures. During any time while the coherency parameters
> cannot be met, the robustness reduces to the best it can do
> temporarily, and recovers when possible later. As a bonus, you have
> some timing guarantees if they are more important.

Anything that silently relaxes consistency like that scares me. Does
anybody really do that in practice?

sage
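
The A+B scenario described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the mechanism
actually used in any particular filesystem: it assumes the consistent
cluster state records which replicas were acting in each epoch, and that
an administrator can explicitly declare a replica lost. The names
ClusterState, new_epoch, mark_lost and can_serve are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical cluster state, assumed to be kept consistent (e.g. via Paxos)."""
    history: list = field(default_factory=list)  # acting replica set, one per epoch
    lost: set = field(default_factory=set)       # replicas an admin declared unsalvageable

    def new_epoch(self, acting):
        """Record the set of replicas accepting writes in a new epoch."""
        self.history.append(frozenset(acting))

    def mark_lost(self, replica):
        """Explicit administrator decision: give up on this replica's data."""
        self.lost.add(replica)

    def can_serve(self, up):
        """Return the live replicas that are allowed to serve reads and writes."""
        up = set(up)
        # Walk back from the most recent epoch, which may hold the newest writes.
        for acting in reversed(self.history):
            survivors = acting & up
            if survivors:
                return set(survivors)  # a live replica saw this epoch's writes
            if not acting <= self.lost:
                return set()  # newest data may still come back: block, do not go stale
            # Every holder of this epoch was declared lost; fall back to an older epoch.
        return set()


state = ClusterState()
state.new_epoch({"A", "B"})        # replicating to A+B
state.new_epoch({"B"})             # A fails, read/write to just B
print(state.can_serve({"A"}))      # B fails, A recovers: nothing may serve
state.mark_lost("B")               # admin decides B is unsalvageable
print(state.can_serve({"A"}))      # stale A is now the best we can do
```

The point of the sketch is that the relaxation is never silent: a stale
replica serves data only after an explicit mark_lost() call.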