From: David Chinner <dgc@sgi.com>
To: Neil Brown <neilb@suse.de>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
dm-devel@redhat.com, linux-raid@vger.kernel.org,
Jens Axboe <jens.axboe@oracle.com>, David Chinner <dgc@sgi.com>,
Phillip Susi <psusi@cfl.rr.com>,
Stefan Bader <Stefan.Bader@de.ibm.com>,
Andreas Dilger <adilger@clusterfs.com>,
Tejun Heo <htejun@gmail.com>
Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Date: Mon, 28 May 2007 12:45:59 +1000
Message-ID: <20070528024559.GA85884050@sgi.com>
In-Reply-To: <18010.12472.209452.148229@notabene.brown>

On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
>
> Thanks everyone for your input. There were some very valuable
> observations in the various emails.
> I will try to pull most of it together and bring out what seem to be
> the important points.
>
>
> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.
Sounds good to me, but how do we test to see if the underlying
device supports barriers? Do we just assume that they do and
only change behaviour if -o nobarrier is specified in the mount
options?
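
To make the two options in that question concrete, here is a minimal
sketch of a mount-time decision. It is a toy model only: ToyDevice,
submit_barrier_write() and barriers_enabled_at_mount() are hypothetical
names invented for illustration, not kernel or XFS interfaces, and the
probe assumes a device that still reports -EOPNOTSUPP.

```python
import errno


class ToyDevice:
    """Hypothetical stand-in for a block device; not a kernel API."""

    def __init__(self, supports_barriers):
        self.supports_barriers = supports_barriers

    def submit_barrier_write(self, block, data):
        # Mimic a device that rejects barrier writes outright.
        if not self.supports_barriers:
            return -errno.EOPNOTSUPP
        return 0


def barriers_enabled_at_mount(dev, mount_opts):
    """Decide once, at mount time, whether to issue barriers."""
    # An explicit -o nobarrier always wins.
    if "nobarrier" in mount_opts:
        return False
    # Otherwise probe with a single barrier write (block 0 is an
    # arbitrary choice for this toy) and see whether it is rejected.
    rc = dev.submit_barrier_write(block=0, data=b"probe")
    return rc != -errno.EOPNOTSUPP
```

If barrier requests can never fail, the probe branch always succeeds
and only the nobarrier check is left, which is the "just assume"
behaviour asked about above.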
> 2/ Maybe barriers provide stronger semantics than are required.
>
> All write requests are synchronised around a barrier write. This is
> often more than is required and apparently can cause a measurable
> slowdown.
>
> Also the FUA for the actual commit write might not be needed. It is
> important for consistency that the preceding writes are in safe
> storage before the commit write, but it is not so important that the
> commit write is immediately safe on storage. That isn't needed until
> a 'sync' or 'fsync' or similar.
XFS's use of barriers assumes that the commit write is on stable
storage before the barrier I/O returns. One of the ordering guarantees
that we need is that the transaction (commit write) is on disk before
the metadata block containing the change in the transaction is written
to disk, and the current barrier behaviour gives us that.
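
The guarantee described here can be illustrated with a small toy model.
This is a sketch under stated assumptions, not how the block layer is
implemented: ToyDisk, its methods and the block names ("log-body",
"log-commit", "inode-7") are all made up for illustration. The model
assumes a volatile write-back cache, a barrier write that drains the
cache first, and a barrier block that goes straight to media (FUA-like).

```python
class ToyDisk:
    """Toy disk with a volatile cache and stable media (illustrative)."""

    def __init__(self):
        self.cache = {}  # volatile write-back cache
        self.media = {}  # stable storage

    def write(self, block, data):
        # Ordinary writes land in the volatile cache.
        self.cache[block] = data

    def flush(self):
        # Move everything cached so far onto stable media.
        self.media.update(self.cache)
        self.cache.clear()

    def barrier_write(self, block, data):
        # Preceding writes reach media first ...
        self.flush()
        # ... then the barrier block itself bypasses the cache, so it
        # is on stable storage by the time this call returns.
        self.media[block] = data

    def crash(self):
        # Power loss: the cache is gone, only the media survives.
        self.cache.clear()
        return dict(self.media)


def ordering_holds(image):
    """The metadata block may only be on disk if the commit is too."""
    return "inode-7" not in image or "log-commit" in image


disk = ToyDisk()
disk.write("log-body", "txn 1: change inode 7")
disk.barrier_write("log-commit", "txn 1 committed")
disk.write("inode-7", "new contents")  # metadata writeback, issued later
survived = disk.crash()
```

In this model the metadata write is only issued after barrier_write()
has returned, so no crash point can leave the metadata block on media
without the commit record.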
> One possible alternative is:
> - writes can overtake barriers, but barrier cannot overtake writes.
No, that breaks the above usage of a barrier....
> - flush before the barrier, not after.
>
> This is considerably weaker, and hence cheaper. But I think it is
> enough for all filesystems (providing it is still an option to call
> blkdev_issue_flush on 'fsync').
No, not enough for XFS.
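
A sketch of why the weaker scheme falls short for this usage, again as
a toy model under stated assumptions: the barrier flushes what came
before it, but the barrier block itself only reaches the volatile
cache, and the device may destage cached blocks in any order. The
function and block names are hypothetical and chosen for illustration.

```python
from itertools import combinations


def possible_crash_states(media, cache):
    """Every media image a crash could leave behind.

    Any subset of the cached blocks may have been destaged before the
    crash, because the device is free to write them out in any order.
    """
    blocks = list(cache)
    states = []
    for r in range(len(blocks) + 1):
        for subset in combinations(blocks, r):
            image = dict(media)
            image.update({b: cache[b] for b in subset})
            states.append(image)
    return states


def weak_barrier_run():
    media, cache = {}, {}
    cache["log-body"] = "txn 1: change inode 7"
    # Weak barrier: flush what came before, not what comes after.
    media.update(cache)
    cache.clear()
    # The commit write completes once it is in the volatile cache.
    cache["log-commit"] = "txn 1 committed"
    # Metadata writeback is issued after the commit write "completed".
    cache["inode-7"] = "new contents"
    return possible_crash_states(media, cache)


def violates_ordering(image):
    # Metadata on disk without the commit record that covers it.
    return "inode-7" in image and "log-commit" not in image


states = weak_barrier_run()
bad = [s for s in states if violates_ordering(s)]
```

At least one reachable crash state has the metadata block on media
without the commit record, which is exactly the ordering the log
depends on never seeing.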
> Another alternative would be to tag each bio as being in a
> particular barrier-group. Then bios in different groups could
> overtake each other in either direction, but a BARRIER request must
> be totally ordered w.r.t. other requests in the barrier group.
> This would require an extra bio field, and would give the filesystem
> more appearance of control. I'm not yet sure how much it would
> really help...
And that assumes the filesystem is tracking exact dependencies
between I/Os. Such a mechanism would probably require filesystems
to be redesigned to use this, but I can see how it would be useful
for doing things like ensuring ordering between just an inode and
its data writes. What would the overhead of having to support
several hundred thousand different barrier groups be (i.e. one per
dirty inode in a system)?
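
For reference, the barrier-group rule can be written down as a small
check. This is only a sketch of the idea as described in the quoted
proposal: the Bio class, its "group" field and both functions are
hypothetical, and using one group per inode number is an assumption
made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bio:
    name: str
    group: int            # hypothetical barrier-group tag
    barrier: bool = False


def may_overtake(later, earlier):
    """True if `later` may be serviced before `earlier`."""
    # Requests in different groups are never ordered against each other.
    if later.group != earlier.group:
        return True
    # Within a group, nothing may cross a barrier in either direction.
    return not (later.barrier or earlier.barrier)


def legal_order(submitted, serviced):
    """Check a service order against the submission order."""
    pos = {bio: i for i, bio in enumerate(serviced)}
    for i, earlier in enumerate(submitted):
        for later in submitted[i + 1:]:
            if pos[later] < pos[earlier] and not may_overtake(later, earlier):
                return False
    return True


data = Bio("inode 7 data", group=7)
inode = Bio("inode 7", group=7, barrier=True)
other = Bio("inode 9 data", group=9)
submitted = [data, inode, other]

ok = legal_order(submitted, [other, data, inode])   # other group overtakes
bad = legal_order(submitted, [inode, data, other])  # barrier overtakes data
```

Even in this toy, whatever enforces the rule has to know each
request's group, which is where the question about per-group overhead
with one group per dirty inode comes from.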
> I think the implementation priorities here are:
Depending on the answer to my first question:
0/ implement a specific test for filesystems to run at mount time
to determine if barriers are supported or not.
> 1/ implement a zero-length BIO_RW_BARRIER option.
> 2/ Use it (or otherwise) to make all dm and md modules handle
> barriers (and loop?).
> 3/ Devise and implement appropriate fall-backs within the block layer
> so that -EOPNOTSUPP is never returned.
> 4/ Remove unneeded cruft from filesystems (and elsewhere).
Sounds like a good start. ;)
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group