From: Bill Davidsen <davidsen@tmr.com>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: David Chinner <dgc@sgi.com>,
david@lang.hm, Phillip Susi <psusi@cfl.rr.com>,
Neil Brown <neilb@suse.de>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
dm-devel@redhat.com, linux-raid@vger.kernel.org,
Stefan Bader <Stefan.Bader@de.ibm.com>,
Andreas Dilger <adilger@clusterfs.com>,
Tejun Heo <htejun@gmail.com>
Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Date: Sat, 02 Jun 2007 15:55:08 -0400
Message-ID: <4661CB1C.60806@tmr.com>
In-Reply-To: <20070602145133.GG32105@kernel.dk>
Jens Axboe wrote:
> On Fri, Jun 01 2007, Bill Davidsen wrote:
>
>> Jens Axboe wrote:
>>
>>> On Thu, May 31 2007, Bill Davidsen wrote:
>>>
>>>
>>>> Jens Axboe wrote:
>>>>
>>>>
>>>>> On Thu, May 31 2007, David Chinner wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Thu, May 31 2007, David Chinner wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> IOWs, there are two parts to the problem:
>>>>>>>>
>>>>>>>> 1 - guaranteeing I/O ordering
>>>>>>>> 2 - guaranteeing blocks are on persistent storage.
>>>>>>>>
>>>>>>>> Right now, a single barrier I/O is used to provide both of these
>>>>>>>> guarantees. In most cases, all we really need to provide is 1); the
>>>>>>>> need for 2) is a much rarer condition but still needs to be
>>>>>>>> provided.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> if I am understanding it correctly, the big win for barriers is that
>>>>>>>>> you do NOT have to stop and wait until the data is on persistent
>>>>>>>>> media before you can continue.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Yes, if we define a barrier to only guarantee 1), then yes this
>>>>>>>> would be a big win (esp. for XFS). But that requires all filesystems
>>>>>>>> to handle sync writes differently, and sync_blockdev() needs to
>>>>>>>> call blkdev_issue_flush() as well....
>>>>>>>>
>>>>>>>> So, what do we do here? Do we define a barrier I/O to only provide
>>>>>>>> ordering, or do we define it to also provide persistent storage
>>>>>>>> writeback? Whatever we decide, it needs to be documented....
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> The block layer already has a notion of the two types of barriers, with
>>>>>>> a very small amount of tweaking we could expose that. There's absolutely
>>>>>>> zero reason we can't easily support both types of barriers.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> That sounds like a good idea - we can leave the existing
>>>>>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>>>>>> behaviour that only guarantees ordering. The filesystem can then
>>>>>> choose which to use where appropriate....
>>>>>>
>>>>>>
>>>>>>
>>>>> Precisely. The current definition of barriers is what Chris and I came
>>>>> up with many years ago, when solving the problem for reiserfs
>>>>> originally. It is by no means the only feasible approach.
>>>>>
>>>>> I'll add a WRITE_ORDERED command to the #barrier branch, it already
>>>>> contains the empty-bio barrier support I posted yesterday (well a
>>>>> slightly modified and cleaned up version).
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Wait. Do filesystems expect (depend on) anything but ordering now? Does
>>>> md? Having users of barriers as they currently behave suddenly getting
>>>> SYNC behavior where they expect ORDERED is likely to have a negative
>>>> effect on performance. Or do I misread what is actually guaranteed by
>>>> WRITE_BARRIER now, and a flush is currently happening in all cases?
>>>>
>>>>
>>> See the above stuff you quote, it's answered there. It's not a change,
>>> this is how the Linux barrier write has always worked since I first
>>> implemented it. What David and I are talking about is adding a more
>>> relaxed version as well, that just implies ordering.
>>>
>>>
>> I was reading the documentation in block/biodoc.txt, which seems to just
>> say ordered:
>>
>> 1.2.1 I/O Barriers
>>
>> There is a way to enforce strict ordering for i/os through barriers.
>> All requests before a barrier point must be serviced before the barrier
>> request and any other requests arriving after the barrier will not be
>> serviced until after the barrier has completed. This is useful for higher
>> level control on write ordering, e.g flushing a log of committed updates
>> to disk before the corresponding updates themselves.
>>
>> A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
>> The generic i/o scheduler would make sure that it places the barrier
>> request and all other requests coming after it after all the previous
>> requests in the queue. Barriers may be implemented in different ways
>> depending on the driver. A SCSI driver for example could make use of
>> ordered tags to preserve the necessary ordering with a lower impact on
>> throughput. For IDE this might be two sync cache flush: a pre and post
>> flush when encountering a barrier write.
>>
>> The "flush" comment is associated with IDE, so it wasn't clear that the
>> device cache is always cleared to force the data to the platter.
>>
>
> The above should mention that the ordered tag comment for SCSI assumes
> that the drive uses write through caching. If it does, then an ordered
> tag is enough. If it doesn't, then you need a bit more than that (a post
> flush, after the ordered tag has completed).
>
>
Thanks, got it.
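
To check my understanding, here is a rough sketch in C of the rule as
described above. The names (barrier_step, device_caps, barrier_steps) are
made up for illustration and are not kernel identifiers, and the drain step
for devices without ordered tags is my own assumption, not something stated
in this thread.

```c
#include <stdbool.h>

/* Hypothetical step flags for illustration; not kernel identifiers. */
enum barrier_step {
    STEP_DRAIN       = 1 << 0, /* wait for earlier requests to complete (assumed) */
    STEP_PRE_FLUSH   = 1 << 1, /* flush the device cache before the barrier write */
    STEP_ORDERED_TAG = 1 << 2, /* issue the barrier write with an ordered tag */
    STEP_POST_FLUSH  = 1 << 3  /* flush the device cache after the barrier write */
};

/* Hypothetical description of what a device can do. */
struct device_caps {
    bool ordered_tags; /* device honours ordered tags (e.g. SCSI) */
    bool write_back;   /* device cache is write-back, not write-through */
};

/* Return the set of steps a barrier write would need on this device. */
static unsigned barrier_steps(struct device_caps caps)
{
    unsigned steps = 0;

    if (caps.ordered_tags)
        steps |= STEP_ORDERED_TAG;   /* the tag alone preserves ordering */
    else
        steps |= STEP_DRAIN;         /* assumption: order by draining the queue */

    if (caps.write_back) {
        if (!caps.ordered_tags)
            steps |= STEP_PRE_FLUSH; /* the IDE-style pre flush from biodoc.txt */
        steps |= STEP_POST_FLUSH;    /* needed whenever the cache is write-back */
    }
    return steps;
}
```

So a tagged device with a write-through cache needs only the ordered tag, a
tagged device with a write-back cache needs the ordered tag plus a post
flush, and an untagged write-back device gets the pre and post flush pair.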
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979