From: Neil Brown <neilb@suse.de>
To: "Williams, Dan J" <dan.j.williams@intel.com>
Cc: Mark Hahn <hahn@mcmaster.ca>, linux-raid@vger.kernel.org
Subject: RE: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
Date: Thu, 12 Apr 2007 16:21:32 +1000
Message-ID: <17949.53228.365167.89832@notabene.brown>
In-Reply-To: message from Williams, Dan J on Wednesday April 11

On Wednesday April 11, dan.j.williams@intel.com wrote:
> > From: Mark Hahn [mailto:hahn@mcmaster.ca]
> > 
> > > In its current implementation write-back mode acknowledges writes
> > > before they have reached non-volatile media.
> > 
> > which is basically normal for unix, no?
> I am referring to when bi_end_io is called on the bio submitted to MD.
> Normally it is not called until after the bi_end_io event for the bio
> submitted to the backing disk.
> 
> > 
> > are you planning to support barriers?  (which are the block system's
> > way of supporting filesystem atomicity).
> Not as a part of these performance experiments.  But, I have wondered
> what the underlying issues are behind raid5 not supporting barriers.
> Currently in raid5.c:make_request:
> 
> 	if (unlikely(bio_barrier(bi))) {
> 		bio_endio(bi, bi->bi_size, -EOPNOTSUPP);
> 		return 0;
> 	}

I should be getting this explanation down to a fine art.  I seem to be
delivering it in multiple forums.

My position is that for a virtual device that stores some blocks on
some devices and other blocks on other devices (e.g. raid0, raid5,
linear, LVM, but not raid1) barrier support in the individual devices
is unusable, and that to achieve the goal it is just as easy for the
filesystem to order requests and to use blkdev_issue_flush to force
sync-to-disk. 

The semantics of a barrier (as I understand them) are that all writes
prior to the barrier are safe before the barrier write is commenced,
and that the barrier write itself is safe before any subsequent write
is commenced.  (I think those semantics are stronger than we should be
exporting - just the first half should be enough - but such is life.)

On a single drive, this is achieved by not re-ordering requests around
a barrier, and asking the device to not re-order requests either.
When you have multiple devices, you cannot ask them not to re-order
requests with respect to each other, so the same mechanism cannot be
used.
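
For concreteness, here is a rough sketch of what that looks like
against a single real drive on a 2.6 kernel.  Treat it as illustration
only: the function name and commit_sector are made up and the payload
setup is elided; bio_alloc(), submit_bio() and BIO_RW_BARRIER are the
real interfaces, and the barrier bit is exactly what the bio_barrier()
test in raid5's make_request rejects today.

	/* Sketch only - not code from this patch set. */
	static void submit_barrier_write(struct block_device *bdev,
					 sector_t commit_sector)
	{
		struct bio *bio = bio_alloc(GFP_NOIO, 1);

		bio->bi_bdev   = bdev;		/* the one real device */
		bio->bi_sector = commit_sector;	/* assumed: block being committed */
		/* ... bi_io_vec, bi_size, bi_end_io filled in as for any write ... */

		/*
		 * The queue and the drive must not re-order around this
		 * request: everything submitted before it reaches media
		 * first, and it reaches media before anything submitted
		 * after it.
		 */
		submit_bio(WRITE | (1 << BIO_RW_BARRIER), bio);
	}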

Instead, you would have to plug the meta-device, unplug all the lower
level queues, wait for all writes to complete, call blkdev_issue_flush
to make sure the data is safe, issue the barrier write and wait for it
to complete, call blkdev_issue_flush again (well, maybe the barrier
write could have been sent with BIO_RW_BARRIER for the same effect),
and then unplug the queue.
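
Spelled out, that sequence looks something like the sketch below.  All
of the meta_* helpers are made-up names just to show the steps; only
blkdev_issue_flush() and BIO_RW_BARRIER are real interfaces, so this
is the shape of the work, not working md code.

	/* Rough sketch of barrier emulation in a striped meta-device. */
	static int meta_handle_barrier(struct meta_dev *meta, struct bio *barrier_bio)
	{
		int i;

		/* 1. Plug the meta-device: hold back new requests. */
		meta_plug(meta);				/* made up */

		/* 2. Unplug the lower-level queues and wait for every
		 *    outstanding write on every member to complete. */
		for (i = 0; i < meta->nr_members; i++)
			meta_unplug_member_and_wait(meta, i);	/* made up */

		/* 3. Make sure what the members have acknowledged is on media. */
		for (i = 0; i < meta->nr_members; i++)
			blkdev_issue_flush(meta->member_bdev[i], NULL);

		/* 4. Issue the barrier write itself and wait for it to
		 *    complete.  (Sending it with BIO_RW_BARRIER to the one
		 *    member that owns the block would fold steps 4 and 5
		 *    together.) */
		meta_submit_and_wait(meta, barrier_bio);	/* made up */

		/* 5. Flush again so the barrier write itself is safe. */
		for (i = 0; i < meta->nr_members; i++)
			blkdev_issue_flush(meta->member_bdev[i], NULL);

		/* 6. Unplug the meta-device and let requests flow again. */
		meta_unplug(meta);				/* made up */
		return 0;
	}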

And the thing is that all of that complexity ALREADY needs to be in the
filesystem.  Because if a device doesn't support barriers, the
filesystem should wait for all dependent writes to complete and then
issue the 'barrier' write (and probably call blkdev_issue_flush as
well).

And the filesystem is positioned to do this BETTER because it can know
which writes are really dependent and which might be incidental.
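
To make that concrete, here is a sketch of the fallback path a
journalling filesystem can take when its barrier write comes back
-EOPNOTSUPP (as raid5 returns today).  The wait_on_dependent_writes()
and write_commit_record() helpers are placeholders, not real ext3/JBD
functions; blkdev_issue_flush() is the real block-layer call.

	static int commit_without_barriers(struct block_device *bdev)
	{
		int err;

		/* Wait only for the writes this commit actually depends on;
		 * the filesystem knows which those are, a meta-device does not. */
		wait_on_dependent_writes();		/* placeholder */

		/* Push the drive cache to media before the commit record goes out. */
		err = blkdev_issue_flush(bdev, NULL);
		if (err)
			return err;

		/* The commit record can now be an ordinary write... */
		write_commit_record(bdev);		/* placeholder */

		/* ...followed by one more flush so the commit itself is durable. */
		return blkdev_issue_flush(bdev, NULL);
	}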

Ext3 gets this right except that it never bothers with
blkdev_issue_flush.  XFS doesn't even bother trying (it's designed to
be used with reliable drives).  reiserfs might actually get it
completely right as it does have a call to blkdev_issue_flush in what
looks like the right place, but I cannot be sure without lots of code
review. 

dm/stripe currently gets this wrong.  If it gets a barrier request it
just passes it down to the one target drive, thus failing to ensure
any ordering wrt the other drives.

All that said:  raid5 is probably in a better position than most to
implement a barrier as it keeps careful track of everything that is
happening, and could easily wait for all prior writes to complete.
This might mesh well with the write-back caching approach.  But I
would still rather that the filesystem just got it right for us.
With a single drive, the drive can implement a barrier more
efficiently than the filesystem.  With multiple drives, the
meta-device can at best be as efficient as the filesystem.

NeilBrown
