From: Philipp Reisner <philipp.reisner@linbit.com>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: david@lang.hm, Willy Tarreau <w@1wt.eu>,
Bart Van Assche <bart.vanassche@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, Jens Axboe <jens.axboe@oracle.com>,
Greg KH <gregkh@suse.de>, Neil Brown <neilb@suse.de>,
Sam Ravnborg <sam@ravnborg.org>, Dave Jones <davej@redhat.com>,
Nikanth Karthikesan <knikanth@suse.de>,
"Lars Marowsky-Bree" <lmb@suse.de>,
Kyle Moffett <kyle@moffetthome.net>,
Lars Ellenberg <lars.ellenberg@linbit.com>
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
Date: Tue, 5 May 2009 23:45:19 +0200
Message-ID: <200905052345.20515.philipp.reisner@linbit.com>
In-Reply-To: <1241543146.3312.57.camel@mulgrave.int.hansenpartnership.com>
On Tuesday 05 May 2009 19:05:46 James Bottomley wrote:
> On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> > On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > > When you do asynchronous replication, how do you ensure that
> > > > > > implicit write-after-write dependencies in the stream of writes
> > > > > > you get from the file system above, are not violated on the
> > > > > > secondary ?
> >
> > [...]
> >
> > > > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > > > fsync).
> >
> > [...]
> >
> > > I think you'll find the dio/fsync method above actually does solve all
> > > of these issues (mainly because it enforces the semantics from top to
> > > bottom in the stack). I agree one could use more elaborate semantics
> > > like you do for drbd, but since the simple ones worked efficiently for
> > > md/nbd, there didn't seem to be much point.
> >
> > Do I get it right, that you enforce the exact same write order on the
> > secondary node as the stream of writes was coming in on the primary?
>
> Um, yes ... that's the text book way of doing replication: write order
> preservation.
>
> > Using either DIRECT_IO or fsync() calls ?
>
> Yes.
>
> > Is DIRECT_IO/fsync() enabled by default ?
>
> I'd have to look at the tools (and, unfortunately, there are many
> variants) but it was certainly true in the variant I used.
[...]
My experience is that enforcing the exact same write order as on the primary
by draining IO kills performance. Of course, things are changing in
a world where everybody uses a RAID controller with a gigabyte of
battery-backed RAM. But there are certainly some embedded users who run
the replication technology on top of plain hard disks.

What I want to point out is that DRBD is able to allow
limited reordering on the secondary, to achieve the highest possible
performance, while maintaining these implicit write-after-write dependencies.
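
To illustrate "limited reordering", here is a minimal sketch in Python. It is
not DRBD code. It assumes the primary inserts a marker (called BARRIER here,
a hypothetical name) into the stream wherever a later write could depend on
an earlier one; how those boundaries are found is not described in this mail.
The secondary may then reorder writes between two markers, but never across
one.

```python
import random

BARRIER = "BARRIER"  # hypothetical marker name, chosen for this sketch


def split_into_epochs(stream):
    """Group writes into epochs; each BARRIER closes the current epoch."""
    epochs, current = [], []
    for item in stream:
        if item == BARRIER:
            if current:
                epochs.append(current)
            current = []
        else:
            current.append(item)
    if current:
        epochs.append(current)
    return epochs


def secondary_apply_order(stream, rng):
    """Return one order in which the secondary might submit the writes.

    Writes inside an epoch are shuffled (standing in for whatever order
    the disk scheduler prefers); epochs themselves are never interleaved.
    """
    order = []
    for epoch in split_into_epochs(stream):
        reordered = list(epoch)
        rng.shuffle(reordered)
        order.extend(reordered)
    return order


stream = ["w1", "w2", BARRIER, "w3", BARRIER, "w4", "w5", "w6"]
order = secondary_apply_order(stream, random.Random(0))
print(order)
```

Whatever the shuffle does, w1 and w2 are submitted before w3, and w3 before
w4, w5 and w6; only the order inside each group varies.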
> I also think you're not quite looking at the important case: if you
> think about it, the real necessity for the ordered domain is the
> network, not so much the actual secondary server. The reason is that
> it's very hard to find a failure case where the write order on the
> secondary from the network tap to disk actually matters (as long as the
> flight into the network tap was in order). The standard failure is of
> the primary, not the secondary, so the network stream stops and so does
> the secondary writing: as long as we guarantee to stop at a consistent
> point in flight, everything works. If the secondary fails while the
> primary is still up, that's just a standard replay to bring the
> secondary back into replication, so the issue doesn't arise there
> either.
A common power failure is possible. We aim for an HA system, so we cannot
ignore a possible failure scenario. No user will buy this: "Well, in most
scenarios we do it correctly, but in the unlikely case of a common power
failure where you lose your former primary at the same time, you might
have a secondary with the last write but not the write before it!"
Correctness before efficiency!
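
The failure case can be modelled with a small sketch, again in Python and
with hypothetical names throughout. It assumes each write carries an epoch
number, with a higher epoch meaning "may depend on earlier epochs", and it
treats every prefix of the secondary's apply order as a possible on-disk
image after a power failure.

```python
def is_consistent(applied, epoch_of):
    """A crash image is consistent if, whenever a write is on disk,
    every write from an earlier epoch is on disk as well."""
    if not applied:
        return True
    newest = max(epoch_of[w] for w in applied)
    return all(w in applied for w, e in epoch_of.items() if e < newest)


def crash_images(order):
    """What is on disk if power fails after 0, 1, ..., n applied writes."""
    return [set(order[:k]) for k in range(len(order) + 1)]


# w2 is in a later epoch than w1, i.e. w2 may depend on w1.
epoch_of = {"w1": 0, "w2": 1}

in_order = ["w1", "w2"]    # secondary keeps the dependency
reordered = ["w2", "w1"]   # secondary applies the later write first

safe = all(is_consistent(img, epoch_of) for img in crash_images(in_order))
unsafe = [img for img in crash_images(reordered)
          if not is_consistent(img, epoch_of)]
print(safe, unsafe)
```

The reordered case yields exactly one bad image, the one holding w2 without
w1, which is the "last write but not the write before it" situation.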

But I will stop this discussion now. Proving that DRBD does some
details better than the md/nbd approach becomes pointless now that we have
agreed that DRBD can get merged as a driver. We will focus on the necessary
code cleanups.

-Phil