* Device loses barrier support (was: Fixed patch for simple barriers.) [not found] ` <20081204100050.GN6703@one.firstfloor.org> @ 2008-12-04 14:00 ` Mikulas Patocka 2008-12-04 14:20 ` Andi Kleen 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-12-04 14:00 UTC (permalink / raw) To: Andi Kleen, linux-kernel, xfs; +Cc: Alasdair G Kergon, Andi Kleen, Milan Broz On Thu, 4 Dec 2008, Andi Kleen wrote: > On Thu, Dec 04, 2008 at 12:09:56AM -0500, Mikulas Patocka wrote: > > > > BTW. how is this patch supposed to work with pvmove? I.e. you advertise to > > a filesystem that you support barriers, then the user runs pvmove and you > > drop barrier support while the filesystem is mounted - that will confuse > > the filesystem and maybe produce a data corruption. I wouldn't recommend > > File systems handle this generally. Also the pvmove itself will > act as a barrier. > > -Andi How do you want to handle this? Imagine: the filesystem submits a 1st write request the filesystem submits a 2nd write barrier request the filesystem submits a 3rd write request ... time passes ... the 1st write request ends with success the 2nd write request ends with -EOPNOTSUPP the 3rd write request ends with success --- when you first see -EOPNOTSUPP, you have already corrupted filesystem (the 3rd write passed while the filesystem expected that it would be finished after the 2nd write) and you are in an interrupt context, where you can't reissue -EOPNOTSUPP request. So what do you want to do? Possible ways how to solve it: 1) Wait synchronously for barriers, don't issue any other writes while barrier is pending. - this basically supresses any performance advantage barriers could have. Ext3 is doing this. - this solion is right. But if this is "the way it should be done", you could rip barriers from the kernel completely and replace them with a simple call to flush hardware cache. In this use scenario, they have no advantage over a simple call to flush cache. 2) Resubmit the failed -EOPNOTSUPP request from a thread. - this is what XFS is doing. Bad for code complexity (there must be a special thread just to catch failed IOs). Also, it still produces corrupted filesystem for a brief period of time. 3) Fail barriers only synchronously? (so that the caller can detect missing barrier support before issuing other writes) - unimplemntable in device mapper, if the device is suspended, it queues bios. 4) Disallow losing barrier support? - for me it looks like a sensible solution. Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 14:00 ` Device loses barrier support (was: Fixed patch for simple barriers.) Mikulas Patocka @ 2008-12-04 14:20 ` Andi Kleen 2008-12-04 14:17 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Andi Kleen @ 2008-12-04 14:20 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > the 1st write request ends with success > the 2nd write request ends with -EOPNOTSUPP > the 3rd write request ends with success > > --- when you first see -EOPNOTSUPP, you have already corrupted filesystem > (the 3rd write passed while the filesystem expected that it would be There's no passing of requests during pvmove. It's a really strong barrier. > finished after the 2nd write) and you are in an interrupt context, where > you can't reissue -EOPNOTSUPP request. So what do you want to do? The barrier aware file systems I know of just resubmit synchronously when a barrier fails. -Andi ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 14:20 ` Andi Kleen @ 2008-12-04 14:17 ` Mikulas Patocka 2008-12-04 14:58 ` Andi Kleen 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-12-04 14:17 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Thu, 4 Dec 2008, Andi Kleen wrote: > > the 1st write request ends with success > > the 2nd write request ends with -EOPNOTSUPP > > the 3rd write request ends with success > > > > --- when you first see -EOPNOTSUPP, you have already corrupted filesystem > > (the 3rd write passed while the filesystem expected that it would be > > There's no passing of requests during pvmove. It's a really strong > barrier. You start pvmove. The filesystem doesn't know about pvmove. The next time filesystem does somethig, it submits these 3 requests and the 2nd fill unexpectedly fail. So the fact that pvmove drains the request queue won't help you. > > finished after the 2nd write) and you are in an interrupt context, where > > you can't reissue -EOPNOTSUPP request. So what do you want to do? > > The barrier aware file systems I know of just resubmit synchronously when > a barrier fails. ... and produce structure corruption for certain period in time, because the writes meant to be ordered are submitted unordered. Mikulas > -Andi > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 14:17 ` Mikulas Patocka @ 2008-12-04 14:58 ` Andi Kleen 2008-12-04 16:45 ` Mikulas Patocka 2008-12-05 5:44 ` Device loses barrier support Timothy Shimmin 0 siblings, 2 replies; 20+ messages in thread From: Andi Kleen @ 2008-12-04 14:58 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > You start pvmove. The filesystem doesn't know about pvmove. > > The next time filesystem does somethig, it submits these 3 requests and > the 2nd fill unexpectedly fail. Again the file systems handle failing barriers and they expect it. > > So the fact that pvmove drains the request queue won't help you. Help you against what? > > > > finished after the 2nd write) and you are in an interrupt context, where > > > you can't reissue -EOPNOTSUPP request. So what do you want to do? > > > > The barrier aware file systems I know of just resubmit synchronously when > > a barrier fails. > > ... and produce structure corruption for certain period in time, because > the writes meant to be ordered are submitted unordered. No there is nothing unordered. The file system path typically looks like commit of a transaction if (i have never seen a barrier failing) write block with barrier if (EOPNOTSUPP) { record failure submit synchronously } } else submit synchronously So if a pvmove barrier fails it will just submit synchronously. The write block with barrier bit varies, jbd/gfs2 do it synchronously too and xfs does it asynchronously (with io done callbacks), but in both cases they handle an EOPNOTSUPP comming out in the final io done. When the pvmove migrates from no barrier support to barrier support there won't be any barrier on the file system for the time of the current mount, but that's also fine. -Andi ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 14:58 ` Andi Kleen @ 2008-12-04 16:45 ` Mikulas Patocka 2008-12-04 17:48 ` Andi Kleen 2008-12-07 4:17 ` Device loses barrier support (was: Fixed patch for simple barriers.) Dave Chinner 2008-12-05 5:44 ` Device loses barrier support Timothy Shimmin 1 sibling, 2 replies; 20+ messages in thread From: Mikulas Patocka @ 2008-12-04 16:45 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > > > > finished after the 2nd write) and you are in an interrupt context, where > > > > you can't reissue -EOPNOTSUPP request. So what do you want to do? > > > > > > The barrier aware file systems I know of just resubmit synchronously when > > > a barrier fails. > > > > ... and produce structure corruption for certain period in time, because > > the writes meant to be ordered are submitted unordered. > > No there is nothing unordered. The file system path typically looks like > > commit of a transaction > if (i have never seen a barrier failing) > write block with barrier > if (EOPNOTSUPP) { > record failure > submit synchronously > } > } else > submit synchronously > If you view this as a "right" way of using barriers, then you can drop barrier support at all and replace this code sequence with: flush disk cache submit write synchronously flush disk cache --- because synchronous barriers bring you no performance advantage over the above sequence. > So if a pvmove barrier fails it will just submit synchronously. > > The write block with barrier bit varies, jbd/gfs2 do it synchronously > too and xfs does it asynchronously (with io done callbacks), but And how does xfs preserve write ordering, if the barrier asynchronously fails with -EOPNOTSUPP and there are other writes submitted after the barrier? > in both cases they handle an EOPNOTSUPP comming out in the final > io done. Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 16:45 ` Mikulas Patocka @ 2008-12-04 17:48 ` Andi Kleen 2008-12-04 17:53 ` Christoph Hellwig 2008-12-04 19:37 ` Mikulas Patocka 2008-12-07 4:17 ` Device loses barrier support (was: Fixed patch for simple barriers.) Dave Chinner 1 sibling, 2 replies; 20+ messages in thread From: Andi Kleen @ 2008-12-04 17:48 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Thu, Dec 04, 2008 at 11:45:44AM -0500, Mikulas Patocka wrote: > > No there is nothing unordered. The file system path typically looks like > > > > commit of a transaction > > if (i have never seen a barrier failing) > > write block with barrier > > if (EOPNOTSUPP) { > > record failure > > submit synchronously > > } > > } else > > submit synchronously > > > > If you view this as a "right" way of using barriers, then you can drop It's the way the file systems do it. If you don't believe me feel free to read the code for yourself. > barrier support at all and replace this code sequence with: > > flush disk cache > submit write synchronously > flush disk cache > > --- because synchronous barriers bring you no performance advantage over > the above sequence. Remember this is done by a commit thread in a journaling file system. Commits are ordered so the thread cannot really order out of order anyways. And yes the barriers are essentially a way to flush the cache regularly for selected commits. The alternative (if you want to guarantee transaction order) would be to disable the write cache completely and do it synchronous on each IO. > > > So if a pvmove barrier fails it will just submit synchronously. > > > > The write block with barrier bit varies, jbd/gfs2 do it synchronously > > too and xfs does it asynchronously (with io done callbacks), but > > And how does xfs preserve write ordering, if the barrier asynchronously > fails with -EOPNOTSUPP and there are other writes submitted after the > barrier? >From the high level journaling perspective they are not asynchronous I think. Just the low level xfs_buf interface happens to use the asynchronous callbacks instead of calling into the block layer directly like jbd et.al. do. -Andi ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 17:48 ` Andi Kleen @ 2008-12-04 17:53 ` Christoph Hellwig 2008-12-04 19:37 ` Mikulas Patocka 1 sibling, 0 replies; 20+ messages in thread From: Christoph Hellwig @ 2008-12-04 17:53 UTC (permalink / raw) To: Andi Kleen Cc: Mikulas Patocka, Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Milan Broz On Thu, Dec 04, 2008 at 06:48:38PM +0100, Andi Kleen wrote: > I think. Just the low level xfs_buf interface happens to use the asynchronous > callbacks instead of calling into the block layer directly like jbd et.al. > do. Only delwri buffers are delayed in XFS, but the journaling code only uses async buffers which *synchronously* call into the block layer, but just don't wait for it to complete.. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 17:48 ` Andi Kleen 2008-12-04 17:53 ` Christoph Hellwig @ 2008-12-04 19:37 ` Mikulas Patocka 2008-12-04 22:15 ` Andi Kleen 2008-12-05 3:26 ` Device loses barrier support Eric Sandeen 1 sibling, 2 replies; 20+ messages in thread From: Mikulas Patocka @ 2008-12-04 19:37 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Thu, 4 Dec 2008, Andi Kleen wrote: > On Thu, Dec 04, 2008 at 11:45:44AM -0500, Mikulas Patocka wrote: > > > No there is nothing unordered. The file system path typically looks like > > > > > > commit of a transaction > > > if (i have never seen a barrier failing) > > > write block with barrier > > > if (EOPNOTSUPP) { > > > record failure > > > submit synchronously > > > } > > > } else > > > submit synchronously > > > > > > > If you view this as a "right" way of using barriers, then you can drop > > It's the way the file systems do it. If you don't believe me feel > free to read the code for yourself. > > > barrier support at all and replace this code sequence with: > > > > flush disk cache > > submit write synchronously > > flush disk cache > > > > --- because synchronous barriers bring you no performance advantage over > > the above sequence. > > Remember this is done by a commit thread in a journaling file system. > Commits are ordered so the thread cannot really order out of order > anyways. And yes the barriers are essentially a way to flush the cache > regularly for selected commits. The alternative (if you > want to guarantee transaction order) would be to disable > the write cache completely and do it synchronous on each IO. Some times ago, I used barriers the asynchronous way in the spadfs filesystem and they helped very much. The commit logic is: - take the transaction lock (it prevents any updates) - write remaining dirty buffers (but don't wait) - submit barrier write to move to new transaction (but don't wait) - drop the transaction lock - eventually wait for completion of writes (if this was issued by fsync or sync) or don't wait at all if it was issued by other things (periodic syncer, emergency sync to reclaim free space or so). The advantage of this approach is that the lock is held for very small time, no IOs are really waited for inside the lock. So it doesn't block concurrent activity while someone is committing. Note, that after the lock is dropped, anyone can for example call mark_buffer_dirty, but writing of the new buffer won't be reordered with the barrier write that is already pending in the queue, so consistency is maintained. I somehow got the idea that this was the reason why barriers are implemented the way they are and that this was their intended mode of operation. Note that you can't achieve the same thing (don't wait inside the lock) if you submit barriers synchronously or if you use non-barrier writes and disk cache flushes. If you are pushing what you are pushing --- barriers allowing to return EOPNOTSUPP anytime --- then asynchronous barrier submits can no longer be used, because by the time EOPNOTSUPP is detected, the filesystem is already corrupted. Also, read Documentation/block/barrier.txt and see what you are breaking in the document: "all requests queued after the barrier request must be started only after the barrier request is finished (again, made it to the physical medium)." "Preceding requests are processed before the barrier and following requests after." --- you are going to break these rules. If the barrier can return -EOPNOTSUPP anytime, the following request will be finished before the barrier write is finished. (because the barrier write must be resubmitted without barrier) Basically, if you allow randomly failed barriers, you can drop barriers at all and replace them with blkdev_issue_flush(); write&wait_synchronous(); blkdev_issue_flush(). There is no more reason to maintain complicated logic of barriers, if you cripple them to the unusable point when they randomly fail. Another thing: I'm wondering, where in fsync() does Linux wait for hardware disk cache to be flushed? Isn't there a bug that fsync() will return before the cache is flushed? I couldn't really find it. The last thing do_fsync calls is filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush). Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 19:37 ` Mikulas Patocka @ 2008-12-04 22:15 ` Andi Kleen 2008-12-04 23:08 ` Mikulas Patocka 2008-12-05 3:26 ` Device loses barrier support Eric Sandeen 1 sibling, 1 reply; 20+ messages in thread From: Andi Kleen @ 2008-12-04 22:15 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > If you are pushing what you are pushing --- barriers allowing to return > EOPNOTSUPP anytime --- then asynchronous barrier submits can no longer be > used, because by the time EOPNOTSUPP is detected, the filesystem is > already corrupted. Chris Mason pointed out that this can actually already happen. From a quick review this can happen in MD raid1 at least (their barriers_work flag is pretty similar to the DM implementation I did). So everyone has to handle this already anyways. > I'm wondering, where in fsync() does Linux wait for hardware disk cache to > be flushed? Isn't there a bug that fsync() will return before the cache is > flushed? I couldn't really find it. The last thing do_fsync calls is > filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush). At least in fsync() on journaling fs the metadata update should push it. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 22:15 ` Andi Kleen @ 2008-12-04 23:08 ` Mikulas Patocka 2008-12-05 0:48 ` Andi Kleen 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-12-04 23:08 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Thu, 4 Dec 2008, Andi Kleen wrote: > > If you are pushing what you are pushing --- barriers allowing to return > > EOPNOTSUPP anytime --- then asynchronous barrier submits can no longer be > > used, because by the time EOPNOTSUPP is detected, the filesystem is > > already corrupted. > > Chris Mason pointed out that this can actually already happen. From > a quick review this can happen in MD raid1 at least (their barriers_work > flag is pretty similar to the DM implementation I did). So everyone > has to handle this already anyways. So: remove barriers completely and use only blkdev_issue_flush to flush disk cache. Because none of the major filesystems learned to use barrier-optimized commits and this "barriers-randomly-fail-with-EOPNOTSUPP" fact makes it impossible to use them in an optimized way anyway. There is another point: "what is the main performance advantage of barriers?" - "that the user can turn on hardware write cache with hdparm -W 1 command". And if barriers fail at random points, the user can't turn on disk cache anyway (he would get data corruption if barrier write failed and hardware write cache was enabled). So barriers make no sense here. > > I'm wondering, where in fsync() does Linux wait for hardware disk cache to > > be flushed? Isn't there a bug that fsync() will return before the cache is > > flushed? I couldn't really find it. The last thing do_fsync calls is > > filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush). > > At least in fsync() on journaling fs the metadata update should push it. > > -Andi And what about fdatasync()? Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 23:08 ` Mikulas Patocka @ 2008-12-05 0:48 ` Andi Kleen 2008-12-05 1:16 ` Mikulas Patocka 0 siblings, 1 reply; 20+ messages in thread From: Andi Kleen @ 2008-12-05 0:48 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > And if barriers fail at random points, the user can't turn on disk cache > anyway (he would get data corruption if barrier write failed and hardware I think we already established earlier in the thread that there is no disk corruption > > > I'm wondering, where in fsync() does Linux wait for hardware disk cache to > > > be flushed? Isn't there a bug that fsync() will return before the cache is > > > flushed? I couldn't really find it. The last thing do_fsync calls is > > > filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush). > > > > At least in fsync() on journaling fs the metadata update should push it. > > > > -Andi > > And what about fdatasync()? I don't know. The surest way to find out is to instrument it and try. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-05 0:48 ` Andi Kleen @ 2008-12-05 1:16 ` Mikulas Patocka 2008-12-05 1:37 ` Andi Kleen 0 siblings, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-12-05 1:16 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Fri, 5 Dec 2008, Andi Kleen wrote: > > And if barriers fail at random points, the user can't turn on disk cache > > anyway (he would get data corruption if barrier write failed and hardware > > I think we already established earlier in the thread that there is no disk > corruption So, the facts are: * barrier support in md-raid1 deviates from the specification at Documentation/block/barrier.txt. The specification says that requests submitted after the barrier request hit the media after the barrier request hits the media. The reality is that the barrier request can be randomly aborted and the requests submitted after it hit the media before the barrier request. * the filesystems developed hacks to work around this issue, the hacks involve not submitting more requests after the barrier request, synchronously waiting for the barrier request and eventually retrying it. These hacks suppress any performance advantage barriers could bring. * you submit a patch that makes barriers even more often deviate from the specification and you argue that the patch is correct because filesystems handle this deviation. This is runaway logic that will eventually turn Linux into unmaintainable mess. What do you think we'll be doing when we'll be implementing barriers into other dm targets? Do you really think it'll be fun to write code to double-submit all metadata writes just because you and some person at md-raid1 provided an unreliable interface? I am again repeating: either make barriers consistent with the specification, or remove them at all. Mikulas ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-05 1:16 ` Mikulas Patocka @ 2008-12-05 1:37 ` Andi Kleen 2008-12-05 2:21 ` Mikulas Patocka 2008-12-05 11:52 ` Alan Cox 0 siblings, 2 replies; 20+ messages in thread From: Andi Kleen @ 2008-12-05 1:37 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > * barrier support in md-raid1 deviates from the specification at > Documentation/block/barrier.txt. The specification says that requests > submitted after the barrier request hit the media after the barrier > request hits the media. The reality is that the barrier request can be > randomly aborted and the requests submitted after it hit the media before > the barrier request. Yes the spec should be probably updated. But also see Linus' rant from yesterday about code vs documentation. When in doubt the code wins. > > * the filesystems developed hacks to work around this issue, the hacks > involve not submitting more requests after the barrier request, I suspect the reason the file systems did it this way is that it was a much simpler change than to rewrite the transaction manager for this. > synchronously waiting for the barrier request and eventually retrying it. > These hacks suppress any performance advantage barriers could bring. > > * you submit a patch that makes barriers even more often deviate from the > specification and you argue that the patch is correct because filesystems > handle this deviation. Sorry what counts is the code behaviour, not the specification. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-05 1:37 ` Andi Kleen @ 2008-12-05 2:21 ` Mikulas Patocka 2008-12-05 3:09 ` Andi Kleen 2008-12-05 11:52 ` Alan Cox 1 sibling, 1 reply; 20+ messages in thread From: Mikulas Patocka @ 2008-12-05 2:21 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Fri, 5 Dec 2008, Andi Kleen wrote: > > * barrier support in md-raid1 deviates from the specification at > > Documentation/block/barrier.txt. The specification says that requests > > submitted after the barrier request hit the media after the barrier > > request hits the media. The reality is that the barrier request can be > > randomly aborted and the requests submitted after it hit the media before > > the barrier request. > > Yes the spec should be probably updated. > > But also see Linus' rant from yesterday about code vs documentation. > When in doubt the code wins. The only one offender is "md". It is less overhead to change "md" to play nice and be reliable than to double-submit requests in all the places that needs write ordering. > > * the filesystems developed hacks to work around this issue, the hacks > > involve not submitting more requests after the barrier request, > > I suspect the reason the file systems did it this way is that > it was a much simpler change than to rewrite the transaction > manager for this. It could be initial reason. But this unreliability also disallows any improvement in filesystems. No one can write asynchronous transaction manager because of that evil EOPNOTSUPP. > > synchronously waiting for the barrier request and eventually retrying it. > > These hacks suppress any performance advantage barriers could bring. > > > > * you submit a patch that makes barriers even more often deviate from the > > specification and you argue that the patch is correct because filesystems > > handle this deviation. > > Sorry what counts is the code behaviour, not the specification. Better interface is that one that has less maintenance overhead. And I don't see requiring the programmers of all IO code to double-submit requests as less maintenance overhead. > -Andi Mikulas --- If you want to make it easier to infer functionality from the code, apply this patch :) --- block/blk-core.c | 8 ++++++++ 1 file changed, 8 insertions(+) Index: linux-2.6.28-rc5-devel/block/blk-core.c =================================================================== --- linux-2.6.28-rc5-devel.orig/block/blk-core.c 2008-12-05 02:54:25.000000000 +0100 +++ linux-2.6.28-rc5-devel/block/blk-core.c 2008-12-05 03:14:23.000000000 +0100 @@ -28,6 +28,7 @@ #include <linux/task_io_accounting_ops.h> #include <linux/blktrace_api.h> #include <linux/fault-inject.h> +#include <linux/random.h> #include "blk.h" @@ -1528,6 +1529,13 @@ void submit_bio(int rw, struct bio *bio) bio->bi_rw |= rw; + /* At least, make the true nature of write barriers obvious. */ + + if (bio_barrier(bio) && !(random32() % 42)) { + bio_endio(bio, -EOPNOTSUPP); + return; + } + /* * If it's a regular read/write or a barrier with data attached, * go through the normal accounting stuff before submission. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-05 2:21 ` Mikulas Patocka @ 2008-12-05 3:09 ` Andi Kleen 0 siblings, 0 replies; 20+ messages in thread From: Andi Kleen @ 2008-12-05 3:09 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > The only one offender is "md". I'm not sure. It wouldn't surprise me if it can happen with other setups too. Perhaps Chris knows more. > It is less overhead to change "md" to play > nice and be reliable than to double-submit requests in all the places that > needs write ordering. They do that already anyways. > > > > * the filesystems developed hacks to work around this issue, the hacks > > > involve not submitting more requests after the barrier request, > > > > I suspect the reason the file systems did it this way is that > > it was a much simpler change than to rewrite the transaction > > manager for this. > > It could be initial reason. But this unreliability also disallows any > improvement in filesystems. No one can write asynchronous transaction > manager because of that evil EOPNOTSUPP. Doesn't seem right. It would be a simple state machine to handle it fully asynchronous. Alternatively you could always use empty barriers. But we can worry about that when some in tree file system actually tries to do that. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-05 1:37 ` Andi Kleen 2008-12-05 2:21 ` Mikulas Patocka @ 2008-12-05 11:52 ` Alan Cox 2008-12-05 12:29 ` Andi Kleen 1 sibling, 1 reply; 20+ messages in thread From: Alan Cox @ 2008-12-05 11:52 UTC (permalink / raw) To: Andi Kleen Cc: Mikulas Patocka, Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Fri, 5 Dec 2008 02:37:39 +0100 Andi Kleen <andi@firstfloor.org> wrote: > > * barrier support in md-raid1 deviates from the specification at > > Documentation/block/barrier.txt. The specification says that requests > > submitted after the barrier request hit the media after the barrier > > request hits the media. The reality is that the barrier request can be > > randomly aborted and the requests submitted after it hit the media before > > the barrier request. > > Yes the spec should be probably updated. > > But also see Linus' rant from yesterday about code vs documentation. > When in doubt the code wins. Not when the fundamental design of the code is broken and trashes performance. The documented behaviour is strongly desirable. Alan ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-05 11:52 ` Alan Cox @ 2008-12-05 12:29 ` Andi Kleen 0 siblings, 0 replies; 20+ messages in thread From: Andi Kleen @ 2008-12-05 12:29 UTC (permalink / raw) To: Alan Cox Cc: Andi Kleen, Mikulas Patocka, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz > Not when the fundamental design of the code is broken and trashes > performance. Sorry but that's just not correct. There's nothing in late failing barriers that "trashes performance". The file system writers have to be careful to handle it, but at least the current ones all do. And also if someone writes a hypothetical fully asynchronously driven barrier based IO transaction system it would be still possible to handle the late failing barrier without too many complications. Also late failing barriers is pretty much the only sane way to implement barriers in software remapping schemes like DM and MD. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support 2008-12-04 19:37 ` Mikulas Patocka 2008-12-04 22:15 ` Andi Kleen @ 2008-12-05 3:26 ` Eric Sandeen 1 sibling, 0 replies; 20+ messages in thread From: Eric Sandeen @ 2008-12-05 3:26 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, Milan Broz, Andi Kleen, linux-kernel, Alasdair G Kergon, xfs Mikulas Patocka wrote: > Another thing: > > I'm wondering, where in fsync() does Linux wait for hardware disk cache to > be flushed? Isn't there a bug that fsync() will return before the cache is > flushed? I couldn't really find it. The last thing do_fsync calls is > filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush). ext4, reiserfs, and xfs all call blkdev_issue_flush() in their ->fsync file operations (or down that path). -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support (was: Fixed patch for simple barriers.) 2008-12-04 16:45 ` Mikulas Patocka 2008-12-04 17:48 ` Andi Kleen @ 2008-12-07 4:17 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Dave Chinner @ 2008-12-07 4:17 UTC (permalink / raw) To: Mikulas Patocka Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz On Thu, Dec 04, 2008 at 11:45:44AM -0500, Mikulas Patocka wrote: > > The write block with barrier bit varies, jbd/gfs2 do it synchronously > > too and xfs does it asynchronously (with io done callbacks), but > > And how does xfs preserve write ordering, if the barrier asynchronously > fails with -EOPNOTSUPP and there are other writes submitted after the > barrier? Doesn't matter. XFS executes journal state changes during the I/O completion callbacks a layer above this resubmit code. Hence, if the I/O is resubmitted before the completion callbacks are run, it just appears that the I/O has taken longer than expected and the state change is delayed.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Device loses barrier support 2008-12-04 14:58 ` Andi Kleen 2008-12-04 16:45 ` Mikulas Patocka @ 2008-12-05 5:44 ` Timothy Shimmin 1 sibling, 0 replies; 20+ messages in thread From: Timothy Shimmin @ 2008-12-05 5:44 UTC (permalink / raw) To: Andi Kleen Cc: Mikulas Patocka, Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Milan Broz Andi Kleen wrote: > The write block with barrier bit varies, jbd/gfs2 do it synchronously > too and xfs does it asynchronously (with io done callbacks), but > in both cases they handle an EOPNOTSUPP comming out in the final > io done. > Yes, XFS handles it, however, it doesn't look like we currently propagate the EOPNOTSUPP up to where we test it (not set for b_error). Patch disscussed recently on xfs list to rectify this. --Tim ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2008-12-07 4:17 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <Pine.LNX.4.64.0812040009340.15169@hs20-bc2-1.build.redhat.com>
[not found] ` <20081204100050.GN6703@one.firstfloor.org>
2008-12-04 14:00 ` Device loses barrier support (was: Fixed patch for simple barriers.) Mikulas Patocka
2008-12-04 14:20 ` Andi Kleen
2008-12-04 14:17 ` Mikulas Patocka
2008-12-04 14:58 ` Andi Kleen
2008-12-04 16:45 ` Mikulas Patocka
2008-12-04 17:48 ` Andi Kleen
2008-12-04 17:53 ` Christoph Hellwig
2008-12-04 19:37 ` Mikulas Patocka
2008-12-04 22:15 ` Andi Kleen
2008-12-04 23:08 ` Mikulas Patocka
2008-12-05 0:48 ` Andi Kleen
2008-12-05 1:16 ` Mikulas Patocka
2008-12-05 1:37 ` Andi Kleen
2008-12-05 2:21 ` Mikulas Patocka
2008-12-05 3:09 ` Andi Kleen
2008-12-05 11:52 ` Alan Cox
2008-12-05 12:29 ` Andi Kleen
2008-12-05 3:26 ` Device loses barrier support Eric Sandeen
2008-12-07 4:17 ` Device loses barrier support (was: Fixed patch for simple barriers.) Dave Chinner
2008-12-05 5:44 ` Device loses barrier support Timothy Shimmin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox