Message-ID: <482B378C.5070807@garzik.org>
Date: Wed, 14 May 2008 15:03:40 -0400
From: Jeff Garzik
To: Evgeniy Polyakov
CC: Sage Weil, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
References: <20080513174523.GA1677@2ka.mipt.ru> <4829E752.8030104@garzik.org> <20080513205114.GA16489@2ka.mipt.ru> <20080514135156.GA23131@2ka.mipt.ru>
In-Reply-To: <20080514135156.GA23131@2ka.mipt.ru>

Evgeniy Polyakov wrote:
> Hi Sage.
>
> On Wed, May 14, 2008 at 06:35:19AM -0700, Sage Weil (sage@newdream.net) wrote:
>>>> What is your opinion of the Paxos algorithm?
>>> It is slow. But it does solve failure cases.
>> For writes, Paxos is actually more or less optimal (in the non-failure
>> cases, at least). Reads are trickier, but there are ways to keep that
>> fast as well. FWIW, Ceph extends basic Paxos with a leasing mechanism to
>> keep reads fast, consistent, and distributed. It's only used for cluster
>> state, though, not file data.
>
> Well, it depends...
> If we are talking about single node performance,
> then any protocol, which requires to wait for authorization (or any
> approach, which waits for acknowledge just after data was sent) is slow.

Quite true, but IMO single-node performance is largely an academic
exercise today. What production system is run without backups or
replication?

> If we are talking about aggregate parallel performance, then its basic
> protocol with 2 messages is (probably) optimal, but still I'm not
> convinced, that 2 messages case is a good choice, I want one :)

I think part of Paxos' attraction is that it is provably correct for the
chosen goal, which historically has not been true for the hand-rolled
consensus algorithms often found these days.

There are a bunch of variants (Fast Paxos, Byzantine Paxos, Fast
Byzantine Paxos, etc.) based on Classical Paxos which make improvements
in the performance/latency areas. There is even a Paxos Commit, which
appears to be more efficient than the standard two-phase commit used for
transactions by several existing clustered databases.

	Jeff