From: Philipp Reisner <philipp.reisner@linbit.com>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: david@lang.hm, Willy Tarreau <w@1wt.eu>,
	Bart Van Assche <bart.vanassche@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, Jens Axboe <jens.axboe@oracle.com>,
	Greg KH <gregkh@suse.de>, Neil Brown <neilb@suse.de>,
	Sam Ravnborg <sam@ravnborg.org>, Dave Jones <davej@redhat.com>,
	Nikanth Karthikesan <knikanth@suse.de>,
	"Lars Marowsky-Bree" <lmb@suse.de>,
	Kyle Moffett <kyle@moffetthome.net>,
	Lars Ellenberg <lars.ellenberg@linbit.com>
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
Date: Tue, 5 May 2009 10:21:32 +0200
Message-ID: <200905051021.33461.philipp.reisner@linbit.com>
In-Reply-To: <1241457851.3315.41.camel@mulgrave.int.hansenpartnership.com>

On Monday 04 May 2009 19:24:11 James Bottomley wrote:
> On Mon, 2009-05-04 at 10:28 +0200, Philipp Reisner wrote:
> > On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> > > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > > > On Sun, 3 May 2009, James Bottomley wrote:
> > > > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > > > >
> > > > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > > > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > > > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > > > > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > > > > >>>> <akpm@linux-foundation.org> wrote:
> > > > > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner
> > > > > >>>>> <philipp.reisner@linbit.com> wrote:
> > > > >>>>>> This is a repost of DRBD
> > > > >>>>>
> > > > >>>>> Is it being used anywhere for anything?  If so, where and what?
> > > > >>>>
> > > > >>>> One popular application is to run iSCSI and HA software on top
> > > > >>>> of DRBD in order to build a highly available iSCSI storage
> > > > >>>> target.
> > > > >>>
> > > > >>> Confirmed, I have several customers who're doing exactly that.
> > > > >>
> > > > > >> I will also say that there are a lot of us out here who would have
> > > > > >> a use for DRBD in our HA setups, but have held off implementing it
> > > > > >> specifically because it's not yet in the upstream kernel.
> > > > >
> > > > > Actually, that's not a particularly strong reason because we
> > > > > already have an in-kernel replicator that has much of the
> > > > > functionality of drbd that you could use.  The main reason for
> > > > > wanting drbd in kernel is that it has a *current* user base.
> > > > >
> > > > > Both the in kernel md/nbd and drbd do sync and async replication
> > > > > with primary side bitmaps.  The main differences are:
> > > > >
> > > > >      * md/nbd can do 1 to N replication,
> > > > >      * drbd can do active/active replication (useful for cluster
> > > > >        filesystems)
> > > > >      * The chunk size of the md/nbd is tunable
> > > > > >      * With the updated nbd-tools, current md/nbd can do point in
> > > > > >        time rollback on transaction logged secondaries (a BCS
> > > > > >        requirement)
> > > > > >      * drbd manages the mirror state explicitly, md/nbd needs a
> > > > > >        user space helper
> > > > >
> > > > > And probably a few others I forget.
> > > >
> > > > one very big one:
> > > >
> > > > > DRBD has better support for dealing with split brain situations and
> > > > > recovering from them.
> > >
> > > I don't really think so.  The decision about which (or if a) node
> > > should be killed lies with the HA harness outside of the province of
> > > the replication.
> > >
> > > One could argue that the symmetric active mode of drbd allows both
> > > nodes to continue rather than having the harness make a kill decision
> > > about one.  However, if they both alter the same data, you get an
> > > irreconcilable data corruption fault which, one can argue, is directly
> > > counter to HA principles and so allowing drbd continuation is arguably
> > > the wrong thing to do.
> >
> > When you do asynchronous replication, how do you ensure that implicit
> > write-after-write dependencies in the stream of writes you get from
> > the file system above, are not violated on the secondary ?
>
> Are you telling me drbd doesn't currently do this?
>

No, I am not. DRBD does exactly this!
But I am wondering how that is achieved in the MD/NBD stack when running
in async mode.

DRBD has handled this issue since its early days (back in 2000).
Both the issue and the solution we have in DRBD are described in this paper:

http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf

> The way nbd does it (in the updated tools) is to use DIRECT_IO and
> fsync.

Is that available in the existing tools? And are the updated tools
something that will only be available in the future?

Are you telling me md/nbd (async) doesn't currently do this?

> > There might be a disk scheduler on the secondary.
>
> There usually is a disk scheduler ... you just have to take the required
> action to persuade it to preserve ordering ... a simplistic way of doing
> this is to switch to the noop scheduler.

The issue actually goes further down the stack. Not only might the
in-kernel disk scheduler reorder writes; the driver and finally the
drive itself might do so as well.
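
To make the hazard concrete, here is a small Python sketch. It is not
DRBD code; the block names and the crash model are assumptions made up
for illustration. A file system writes a journal record, waits for its
completion, and then writes the commit block. If any layer below
persists the two in the opposite order and the node crashes in between,
the commit block is on disk without the record it refers to.

```python
def surviving_blocks(submitted, device_order, crash_after):
    """Return the set of blocks on stable storage if the node crashes
    after `crash_after` writes have reached the medium.

    submitted    -- writes in the order the upper layer issued them
    device_order -- indices into `submitted`, in the order in which the
                    scheduler/driver/drive actually persists them
    """
    return {submitted[i] for i in device_order[:crash_after]}


def dependency_violated(on_disk, dependencies):
    """A (first, second) dependency is violated when `second` is on
    disk but `first` is not."""
    return any(second in on_disk and first not in on_disk
               for first, second in dependencies)


# Hypothetical example: the commit block depends on the journal record.
writes = ["journal_record", "commit_block"]
deps = [("journal_record", "commit_block")]

in_order = surviving_blocks(writes, [0, 1], crash_after=1)
reordered = surviving_blocks(writes, [1, 0], crash_after=1)

print(dependency_violated(in_order, deps))   # False
print(dependency_violated(reordered, deps))  # True
```

It does not matter which layer did the reordering; the result on the
medium is the same.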

What we have in DRBD boils down to:

* We obey all possible write-after-write dependencies in the stream of
  writes we get from the upper layers, and generate DRBD-internal
  reorder barriers for the packet stream.
* On the secondary node we impose these barriers onto the stream of writes
  submitted to the stack below us by one of the following:

   - Let previously submitted write-IO drain before we submit write-IO after
     such a DRBD barrier. (We have had that since 2000 or so.)

   - Additionally issue a blkdev_issue_flush()

   - Use write requests with BIO_RW_BARRIER. This method has two advantages:
     We can continue to submit writes after the DRBD internal barrier
     immediately, and the number of requests with BIO_RW_BARRIER can be
     further reduced. 
     See section 6 of
     http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
     for more details, and nice illustrations.

     Unfortunately only high-end SAN devices seem to benefit from this
     method. For most in-machine disk controllers this method does not
     achieve the highest throughput.

In other words:
We allow reordering on the secondary node only to the extent that we can
still guarantee that no implicit write-after-write dependencies are violated.
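
The drain method can be sketched roughly as follows. This is a toy
Python model, not the DRBD implementation: the BARRIER marker, the
function names and the use of a random shuffle to stand in for
reordering by the lower layers are all assumptions for illustration.
How the primary decides where to place barriers is not modelled; the
stream is simply taken as given.

```python
import random

# Hypothetical marker standing in for a DRBD-internal barrier packet.
BARRIER = object()


def split_into_epochs(stream):
    """Group the writes between barrier markers into epochs."""
    epochs, current = [], []
    for item in stream:
        if item is BARRIER:
            if current:
                epochs.append(current)
            current = []
        else:
            current.append(item)
    if current:
        epochs.append(current)
    return epochs


def secondary_completion_order(stream, rng):
    """Model the drain method: writes of one epoch may complete in any
    order, but the next epoch is submitted only after the previous one
    has drained completely."""
    completed = []
    for epoch in split_into_epochs(stream):
        in_flight = list(epoch)
        rng.shuffle(in_flight)        # lower layers reorder within an epoch
        completed.extend(in_flight)   # drain before touching the next epoch
    return completed


def respects_barriers(stream, completed):
    """True if no write completes before a write of an earlier epoch."""
    epoch_of = {w: n
                for n, epoch in enumerate(split_into_epochs(stream))
                for w in epoch}
    seen = [epoch_of[w] for w in completed]
    return seen == sorted(seen)


stream = ["A", "B", "C", BARRIER, "D", "E", BARRIER, "F"]
order = secondary_completion_order(stream, random.Random(0))
print(respects_barriers(stream, order))
print(respects_barriers(stream, ["D", "A", "B", "C", "E", "F"]))
```

The first check holds for any shuffle, because reordering is confined
to a single epoch; the second order lets D overtake the first barrier
and is rejected.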

Coming back to the idea of disabling the in-kernel Linux IO scheduler: it
might solve the issue for some devices, but it is not guaranteed to solve it.

-Phil
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.

