From: Philipp Reisner
To: James Bottomley
Cc: david@lang.hm, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel@vger.kernel.org, Jens Axboe, Greg KH, Neil Brown,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
Date: Tue, 5 May 2009 23:45:19 +0200
Message-Id: <200905052345.20515.philipp.reisner@linbit.com>
In-Reply-To: <1241543146.3312.57.camel@mulgrave.int.hansenpartnership.com>
References: <1241090812-13516-1-git-send-email-philipp.reisner@linbit.com>
	<200905051756.29703.philipp.reisner@linbit.com>
	<1241543146.3312.57.camel@mulgrave.int.hansenpartnership.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tuesday 05 May 2009 19:05:46, James Bottomley wrote:
> On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> > On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > > When you do asynchronous replication, how do you ensure that
> > > > > > implicit write-after-write dependencies in the stream of
> > > > > > writes you get from the file system above are not violated
> > > > > > on the secondary?
> > > > [...]
> > > > > The way nbd does it (in the updated tools) is to use DIRECT_IO
> > > > > and fsync.
> > > > [...]
> > > I think you'll find the dio/fsync method above actually does solve
> > > all of these issues (mainly because it enforces the semantics from
> > > top to bottom in the stack). I agree one could use more elaborate
> > > semantics like you do for drbd, but since the simple ones worked
> > > efficiently for md/nbd, there didn't seem to be much point.
> >
> > Do I get it right that you enforce the exact same write order on the
> > secondary node as the stream of writes was coming in on the primary?
>
> Um, yes ... that's the textbook way of doing replication: write order
> preservation.
>
> > Using either DIRECT_IO or fsync() calls?
>
> Yes.
>
> > Is DIRECT_IO/fsync() enabled by default?
>
> I'd have to look at the tools (and, unfortunately, there are many
> variants) but it was certainly true in the variant I used.

[...]

My experience is that enforcing the exact same write order as on the
primary, by draining IO, kills performance. Of course, things are
changing in a world where everybody uses a RAID controller with a
gigabyte of battery-backed RAM, but there are certainly embedded users
who run the replication technology on top of plain hard disks.

The point I want to make is that DRBD has the capability to allow
limited reordering on the secondary, achieving the highest possible
performance while still maintaining these implicit write-after-write
dependencies.

> I also think you're not quite looking at the important case: if you
> think about it, the real necessity for the ordered domain is the
> network, not so much the actual secondary server. The reason is that
> it's very hard to find a failure case where the write order on the
> secondary from the network tap to disk actually matters (as long as
> the flight into the network tap was in order).
> The standard failure is of the primary, not the secondary, so the
> network stream stops and so does the secondary writing: as long as we
> guarantee to stop at a consistent point in flight, everything works.
> If the secondary fails while the primary is still up, that's just a
> standard replay to bring the secondary back into replication, so the
> issue doesn't arise there either.

A common power failure is possible. We aim for an HA system; we cannot
ignore a possible failure scenario. No user will buy: "Well, in most
scenarios we do it correctly, but in the unlikely case of a common
power failure, where you also lose your former primary, you might end
up with a secondary that has the last write but not the write before
it!"

Correctness before efficiency!

But I will stop this discussion now. Proving that DRBD does some
details better than the md/nbd approach is pointless now that we have
agreed that DRBD can get merged as a driver. We will focus on the
necessary code cleanups.

-Phil
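P.S.: For illustration only, the "limited reordering" idea can be
sketched as a toy model in a few lines of Python (this is NOT DRBD
code; the class and method names are invented). The secondary groups
incoming writes into epochs separated by barriers from the primary;
writes within one epoch may hit the disk in any order, but an epoch is
flushed to stable storage before the next epoch's writes are issued,
which is exactly the write-after-write guarantee:

```python
# Toy model of epoch-based replication on the secondary (invented
# names, not DRBD's actual implementation).
from collections import deque

class Secondary:
    def __init__(self, device):
        self.device = device      # file-like object for the backing disk
        self.epochs = deque([[]]) # FIFO of epochs; each epoch is a list
                                  # of pending (offset, data) writes

    def receive_write(self, offset, data):
        # Writes arriving between two barriers belong to the same epoch
        # and carry no ordering constraint against each other.
        self.epochs[-1].append((offset, data))

    def receive_barrier(self):
        # The primary sends a barrier where a write-after-write
        # dependency may exist; everything after it starts a new epoch.
        self.epochs.append([])

    def flush_oldest_epoch(self):
        # Issue the oldest epoch's writes (the disk may reorder them),
        # then flush once so the whole epoch is stable before any write
        # of the next epoch is allowed to touch the disk.
        for offset, data in self.epochs.popleft():
            self.device.seek(offset)
            self.device.write(data)
        self.device.flush()
        # on a real block device: os.fsync(self.device.fileno())
```

Note the single flush per epoch: compared with fsync'ing after every
write to preserve the exact primary order, this still never lets a
dependent write reach the disk before the write it depends on.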