linux-fsdevel.vger.kernel.org archive mirror
From: Bill Davidsen <davidsen@tmr.com>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: David Chinner <dgc@sgi.com>,
	david@lang.hm, Phillip Susi <psusi@cfl.rr.com>,
	Neil Brown <neilb@suse.de>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	dm-devel@redhat.com, linux-raid@vger.kernel.org,
	Stefan Bader <Stefan.Bader@de.ibm.com>,
	Andreas Dilger <adilger@clusterfs.com>,
	Tejun Heo <htejun@gmail.com>
Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Date: Sat, 02 Jun 2007 15:55:08 -0400
Message-ID: <4661CB1C.60806@tmr.com>
In-Reply-To: <20070602145133.GG32105@kernel.dk>

Jens Axboe wrote:
> On Fri, Jun 01 2007, Bill Davidsen wrote:
>> Jens Axboe wrote:
>>> On Thu, May 31 2007, Bill Davidsen wrote:
>>>> Jens Axboe wrote:
>>>>> On Thu, May 31 2007, David Chinner wrote:
>>>>>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>>>>>> On Thu, May 31 2007, David Chinner wrote:
>>>>>>>> IOWs, there are two parts to the problem:
>>>>>>>>
>>>>>>>> 	1 - guaranteeing I/O ordering
>>>>>>>> 	2 - guaranteeing blocks are on persistent storage.
>>>>>>>>
>>>>>>>> Right now, a single barrier I/O is used to provide both of these
>>>>>>>> guarantees. In most cases, all we really need to provide is 1); the
>>>>>>>> need for 2) is a much rarer condition but still needs to be
>>>>>>>> provided.
>>>>>>>>
>>>>>>>>> if I am understanding it correctly, the big win for barriers is that
>>>>>>>>> you do NOT have to stop and wait until the data is on persistent
>>>>>>>>> media before you can continue.
>>>>>>>> Yes, if we define a barrier to only guarantee 1), then yes this
>>>>>>>> would be a big win (esp. for XFS). But that requires all filesystems
>>>>>>>> to handle sync writes differently, and sync_blockdev() needs to
>>>>>>>> call blkdev_issue_flush() as well....
>>>>>>>>
>>>>>>>> So, what do we do here? Do we define a barrier I/O to only provide
>>>>>>>> ordering, or do we define it to also provide persistent storage
>>>>>>>> writeback? Whatever we decide, it needs to be documented....
>>>>>>> The block layer already has a notion of the two types of barriers; with
>>>>>>> a very small amount of tweaking we could expose that. There's absolutely
>>>>>>> zero reason we can't easily support both types of barriers.
>>>>>> That sounds like a good idea - we can leave the existing
>>>>>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>>>>>> behaviour that only guarantees ordering. The filesystem can then
>>>>>> choose which to use where appropriate....
>>>>> Precisely. The current definition of barriers is what Chris and I came
>>>>> up with many years ago, when solving the problem for reiserfs
>>>>> originally. It is by no means the only feasible approach.
>>>>>
>>>>> I'll add a WRITE_ORDERED command to the #barrier branch; it already
>>>>> contains the empty-bio barrier support I posted yesterday (well, a
>>>>> slightly modified and cleaned up version).
>>>> Wait. Do filesystems expect (depend on) anything but ordering now? Does 
>>>> md? Having users of barriers as they currently behave suddenly getting 
>>>> SYNC behavior where they expect ORDERED is likely to have a negative 
>>>> effect on performance. Or do I misread what is actually guaranteed by 
>>>> WRITE_BARRIER now, and a flush is currently happening in all cases?
>>> See the above stuff you quote, it's answered there. It's not a change,
>>> this is how the Linux barrier write has always worked since I first
>>> implemented it. What David and I are talking about is adding a more
>>> relaxed version as well, that just implies ordering.
>> I was reading the documentation in block/biodoc.txt, which seems to just 
>> say ordered:
>>
>>    1.2.1 I/O Barriers
>>
>>    There is a way to enforce strict ordering for i/os through barriers.
>>    All requests before a barrier point must be serviced before the barrier
>>    request and any other requests arriving after the barrier will not be
>>    serviced until after the barrier has completed. This is useful for
>>    higher level control on write ordering, e.g flushing a log of committed
>>    updates to disk before the corresponding updates themselves.
>>
>>    A flag in the bio structure, BIO_BARRIER is used to identify a
>>    barrier i/o. The generic i/o scheduler would make sure that it places
>>    the barrier request and all other requests coming after it after all
>>    the previous requests in the queue. Barriers may be implemented in
>>    different ways depending on the driver. A SCSI driver for example could
>>    make use of ordered tags to preserve the necessary ordering with a
>>    lower impact on throughput. For IDE this might be two sync cache flush:
>>    a pre and post flush when encountering a barrier write.
>>
>> The "flush" comment is associated with IDE, so it wasn't clear that the 
>> device cache is always cleared to force the data to the platter.
>
> The above should mention that the ordered tag comment for SCSI assumes
> that the drive uses write through caching. If it does, then an ordered
> tag is enough. If it doesn't, then you need a bit more than that (a post
> flush, after the ordered tag has completed).
>
Thanks, got it.
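
For the record, the rule above can be restated as a rough sketch. This is a
hypothetical Python model, not kernel code: the names Cache and
barrier_steps, and the step labels, are made up for illustration. The
"drain queue" step for devices without ordered tags is an assumption; the
thread and biodoc.txt only mention the pre and post flush for that case.

```python
from enum import Enum
from typing import List


class Cache(Enum):
    WRITE_THROUGH = "write-through"  # a completed write is already on the media
    WRITE_BACK = "write-back"        # a completed write may still sit in the drive cache


def barrier_steps(ordered_tags: bool, cache: Cache) -> List[str]:
    """Return the command sequence for one barrier write (illustrative model only)."""
    if ordered_tags:
        # SCSI-style: the ordered tag alone enforces ordering.
        steps = ["ordered-tag write"]
        if cache is Cache.WRITE_BACK:
            # Ordering is not persistence: flush once the tagged write completes.
            steps.append("post-flush")
        return steps

    # No ordered tags (the IDE case in biodoc.txt).
    # Assumption: the queue is drained first so nothing is reordered around the barrier.
    steps = ["drain queue"]
    if cache is Cache.WRITE_BACK:
        steps += ["pre-flush", "barrier write", "post-flush"]
    else:
        steps.append("barrier write")
    return steps


if __name__ == "__main__":
    for tags in (True, False):
        for cache in Cache:
            print(tags, cache.value, barrier_steps(tags, cache))
```

The point of the model is only that the post flush depends on the cache
mode, not on whether the device is SCSI or IDE.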

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

