public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Device loses barrier support (was: Fixed patch for simple barriers.)
       [not found] ` <20081204100050.GN6703@one.firstfloor.org>
@ 2008-12-04 14:00   ` Mikulas Patocka
  2008-12-04 14:20     ` Andi Kleen
  0 siblings, 1 reply; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-04 14:00 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel, xfs; +Cc: Alasdair G Kergon, Andi Kleen, Milan Broz

On Thu, 4 Dec 2008, Andi Kleen wrote:

> On Thu, Dec 04, 2008 at 12:09:56AM -0500, Mikulas Patocka wrote:
> > 
> > BTW. how is this patch supposed to work with pvmove? I.e. you advertise to 
> > a filesystem that you support barriers, then the user runs pvmove and you 
> > drop barrier support while the filesystem is mounted - that will confuse 
> > the filesystem and maybe produce a data corruption. I wouldn't recommend 
> 
> File systems handle this generally. Also the pvmove itself will
> act as a barrier.
> 
> -Andi

How do you want to handle this?

Imagine:
the filesystem submits a 1st write request
the filesystem submits a 2nd write barrier request
the filesystem submits a 3rd write request

... time passes ...

the 1st write request ends with success
the 2nd write request ends with -EOPNOTSUPP
the 3rd write request ends with success

--- when you first see -EOPNOTSUPP, you have already corrupted filesystem 
(the 3rd write passed while the filesystem expected that it would be 
finished after the 2nd write) and you are in an interrupt context, where 
you can't reissue -EOPNOTSUPP request. So what do you want to do?


Possible ways how to solve it:

1) Wait synchronously for barriers, don't issue any other writes while 
barrier is pending.

- this basically supresses any performance advantage barriers could have. 
Ext3 is doing this.

- this solion is right. But if this is "the way it should be done", you 
could rip barriers from the kernel completely and replace them with a 
simple call to flush hardware cache. In this use scenario, they have no 
advantage over a simple call to flush cache.

2) Resubmit the failed -EOPNOTSUPP request from a thread.

- this is what XFS is doing. Bad for code complexity (there must be a 
special thread just to catch failed IOs). Also, it still produces 
corrupted filesystem for a brief period of time.

3) Fail barriers only synchronously? (so that the caller can detect 
missing barrier support before issuing other writes)

- unimplemntable in device mapper, if the device is suspended, it queues 
bios.

4) Disallow losing barrier support?

- for me it looks like a sensible solution.

Mikulas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 14:20     ` Andi Kleen
@ 2008-12-04 14:17       ` Mikulas Patocka
  2008-12-04 14:58         ` Andi Kleen
  0 siblings, 1 reply; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-04 14:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz

On Thu, 4 Dec 2008, Andi Kleen wrote:

> > the 1st write request ends with success
> > the 2nd write request ends with -EOPNOTSUPP
> > the 3rd write request ends with success
> > 
> > --- when you first see -EOPNOTSUPP, you have already corrupted filesystem 
> > (the 3rd write passed while the filesystem expected that it would be 
> 
> There's no passing of requests during pvmove. It's a really strong
> barrier.

You start pvmove. The filesystem doesn't know about pvmove.

The next time filesystem does somethig, it submits these 3 requests and 
the 2nd fill unexpectedly fail.

So the fact that pvmove drains the request queue won't help you.

> > finished after the 2nd write) and you are in an interrupt context, where 
> > you can't reissue -EOPNOTSUPP request. So what do you want to do?
> 
> The barrier aware file systems I know of just resubmit synchronously when 
> a barrier fails.

... and produce structure corruption for certain period in time, because 
the writes meant to be ordered are submitted unordered.

Mikulas

> -Andi
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 14:00   ` Device loses barrier support (was: Fixed patch for simple barriers.) Mikulas Patocka
@ 2008-12-04 14:20     ` Andi Kleen
  2008-12-04 14:17       ` Mikulas Patocka
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2008-12-04 14:20 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

> the 1st write request ends with success
> the 2nd write request ends with -EOPNOTSUPP
> the 3rd write request ends with success
> 
> --- when you first see -EOPNOTSUPP, you have already corrupted filesystem 
> (the 3rd write passed while the filesystem expected that it would be 

There's no passing of requests during pvmove. It's a really strong
barrier.

> finished after the 2nd write) and you are in an interrupt context, where 
> you can't reissue -EOPNOTSUPP request. So what do you want to do?

The barrier aware file systems I know of just resubmit synchronously when 
a barrier fails.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 14:17       ` Mikulas Patocka
@ 2008-12-04 14:58         ` Andi Kleen
  2008-12-04 16:45           ` Mikulas Patocka
  2008-12-05  5:44           ` Device loses barrier support Timothy Shimmin
  0 siblings, 2 replies; 20+ messages in thread
From: Andi Kleen @ 2008-12-04 14:58 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

> You start pvmove. The filesystem doesn't know about pvmove.
> 
> The next time filesystem does somethig, it submits these 3 requests and 
> the 2nd fill unexpectedly fail.

Again the file systems handle failing barriers and they expect it.

> 
> So the fact that pvmove drains the request queue won't help you.

Help you against what? 

> 
> > > finished after the 2nd write) and you are in an interrupt context, where 
> > > you can't reissue -EOPNOTSUPP request. So what do you want to do?
> > 
> > The barrier aware file systems I know of just resubmit synchronously when 
> > a barrier fails.
> 
> ... and produce structure corruption for certain period in time, because 
> the writes meant to be ordered are submitted unordered.

No there is nothing unordered. The file system path typically looks like

commit of a transaction
	if (i have never seen a barrier failing) 
		write block with barrier
		if (EOPNOTSUPP) {
			record failure
			submit synchronously
		}
	} else
		submit synchronously


So if a pvmove barrier fails it will just submit synchronously.

The write block with barrier bit varies, jbd/gfs2 do it synchronously
too and xfs does it asynchronously (with io done callbacks), but
in both cases they handle an EOPNOTSUPP comming out in the final
io done.

When the pvmove migrates from no barrier support to barrier support
there won't be any barrier on the file system for the time of the
current mount, but that's also fine.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 14:58         ` Andi Kleen
@ 2008-12-04 16:45           ` Mikulas Patocka
  2008-12-04 17:48             ` Andi Kleen
  2008-12-07  4:17             ` Device loses barrier support (was: Fixed patch for simple barriers.) Dave Chinner
  2008-12-05  5:44           ` Device loses barrier support Timothy Shimmin
  1 sibling, 2 replies; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-04 16:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz

> > > > finished after the 2nd write) and you are in an interrupt context, where 
> > > > you can't reissue -EOPNOTSUPP request. So what do you want to do?
> > > 
> > > The barrier aware file systems I know of just resubmit synchronously when 
> > > a barrier fails.
> > 
> > ... and produce structure corruption for certain period in time, because 
> > the writes meant to be ordered are submitted unordered.
> 
> No there is nothing unordered. The file system path typically looks like
> 
> commit of a transaction
> 	if (i have never seen a barrier failing) 
> 		write block with barrier
> 		if (EOPNOTSUPP) {
> 			record failure
> 			submit synchronously
> 		}
> 	} else
> 		submit synchronously
> 

If you view this as a "right" way of using barriers, then you can drop 
barrier support at all and replace this code sequence with:

 flush disk cache
 submit write synchronously
 flush disk cache

--- because synchronous barriers bring you no performance advantage over 
the above sequence.

> So if a pvmove barrier fails it will just submit synchronously.
> 
> The write block with barrier bit varies, jbd/gfs2 do it synchronously
> too and xfs does it asynchronously (with io done callbacks), but

And how does xfs preserve write ordering, if the barrier asynchronously 
fails with -EOPNOTSUPP and there are other writes submitted after the 
barrier?

> in both cases they handle an EOPNOTSUPP comming out in the final
> io done.

Mikulas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 16:45           ` Mikulas Patocka
@ 2008-12-04 17:48             ` Andi Kleen
  2008-12-04 17:53               ` Christoph Hellwig
  2008-12-04 19:37               ` Mikulas Patocka
  2008-12-07  4:17             ` Device loses barrier support (was: Fixed patch for simple barriers.) Dave Chinner
  1 sibling, 2 replies; 20+ messages in thread
From: Andi Kleen @ 2008-12-04 17:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

On Thu, Dec 04, 2008 at 11:45:44AM -0500, Mikulas Patocka wrote:
> > No there is nothing unordered. The file system path typically looks like
> > 
> > commit of a transaction
> > 	if (i have never seen a barrier failing) 
> > 		write block with barrier
> > 		if (EOPNOTSUPP) {
> > 			record failure
> > 			submit synchronously
> > 		}
> > 	} else
> > 		submit synchronously
> > 
> 
> If you view this as a "right" way of using barriers, then you can drop 

It's the way the file systems do it. If you don't believe me feel
free to read the code for yourself.

> barrier support at all and replace this code sequence with:
> 
>  flush disk cache
>  submit write synchronously
>  flush disk cache
> 
> --- because synchronous barriers bring you no performance advantage over 
> the above sequence.

Remember this is done by a commit thread in a journaling file system. 
Commits are ordered so the thread cannot really order out of order 
anyways. And yes the barriers are essentially a way to flush the cache
regularly for selected commits. The alternative (if you
want to guarantee transaction order) would be to disable
the write cache completely and do it synchronous on each IO.

> 
> > So if a pvmove barrier fails it will just submit synchronously.
> > 
> > The write block with barrier bit varies, jbd/gfs2 do it synchronously
> > too and xfs does it asynchronously (with io done callbacks), but
> 
> And how does xfs preserve write ordering, if the barrier asynchronously 
> fails with -EOPNOTSUPP and there are other writes submitted after the 
> barrier?

>From the high level journaling perspective they are not asynchronous
I think. Just the low level xfs_buf interface happens to use the asynchronous
callbacks instead of calling into the block layer directly like jbd et.al.
do.

-Andi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 17:48             ` Andi Kleen
@ 2008-12-04 17:53               ` Christoph Hellwig
  2008-12-04 19:37               ` Mikulas Patocka
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2008-12-04 17:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mikulas Patocka, Andi Kleen, linux-kernel, xfs, Alasdair G Kergon,
	Milan Broz

On Thu, Dec 04, 2008 at 06:48:38PM +0100, Andi Kleen wrote:
> I think. Just the low level xfs_buf interface happens to use the asynchronous
> callbacks instead of calling into the block layer directly like jbd et.al.
> do.

Only delwri buffers are delayed in XFS, but the journaling code only
uses async buffers which *synchronously* call into the block layer, but
just don't wait for it to complete..


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 17:48             ` Andi Kleen
  2008-12-04 17:53               ` Christoph Hellwig
@ 2008-12-04 19:37               ` Mikulas Patocka
  2008-12-04 22:15                 ` Andi Kleen
  2008-12-05  3:26                 ` Device loses barrier support Eric Sandeen
  1 sibling, 2 replies; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-04 19:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz



On Thu, 4 Dec 2008, Andi Kleen wrote:

> On Thu, Dec 04, 2008 at 11:45:44AM -0500, Mikulas Patocka wrote:
> > > No there is nothing unordered. The file system path typically looks like
> > > 
> > > commit of a transaction
> > > 	if (i have never seen a barrier failing) 
> > > 		write block with barrier
> > > 		if (EOPNOTSUPP) {
> > > 			record failure
> > > 			submit synchronously
> > > 		}
> > > 	} else
> > > 		submit synchronously
> > > 
> > 
> > If you view this as a "right" way of using barriers, then you can drop 
> 
> It's the way the file systems do it. If you don't believe me feel
> free to read the code for yourself.
> 
> > barrier support at all and replace this code sequence with:
> > 
> >  flush disk cache
> >  submit write synchronously
> >  flush disk cache
> > 
> > --- because synchronous barriers bring you no performance advantage over 
> > the above sequence.
> 
> Remember this is done by a commit thread in a journaling file system. 
> Commits are ordered so the thread cannot really order out of order 
> anyways. And yes the barriers are essentially a way to flush the cache
> regularly for selected commits. The alternative (if you
> want to guarantee transaction order) would be to disable
> the write cache completely and do it synchronous on each IO.

Some times ago, I used barriers the asynchronous way in the spadfs 
filesystem and they helped very much. The commit logic is:

- take the transaction lock (it prevents any updates)
- write remaining dirty buffers (but don't wait)
- submit barrier write to move to new transaction (but don't wait)
- drop the transaction lock
- eventually wait for completion of writes (if this was issued by fsync or 
sync) or don't wait at all if it was issued by other things (periodic 
syncer, emergency sync to reclaim free space or so).

The advantage of this approach is that the lock is held for very small 
time, no IOs are really waited for inside the lock. So it doesn't block 
concurrent activity while someone is committing. Note, that after the lock 
is dropped, anyone can for example call mark_buffer_dirty, but writing of 
the new buffer won't be reordered with the barrier write that is already 
pending in the queue, so consistency is maintained.

I somehow got the idea that this was the reason why barriers are 
implemented the way they are and that this was their intended mode of 
operation. Note that you can't achieve the same thing (don't wait inside 
the lock) if you submit barriers synchronously or if you use non-barrier 
writes and disk cache flushes.

If you are pushing what you are pushing --- barriers allowing to return 
EOPNOTSUPP anytime --- then asynchronous barrier submits can no longer be 
used, because by the time EOPNOTSUPP is detected, the filesystem is 
already corrupted.

Also, read Documentation/block/barrier.txt and see what you are breaking 
in the document:

"all requests queued after the barrier request must be started only after 
the barrier request is finished (again, made it to the physical medium)."

"Preceding requests are processed before the barrier and
following requests after."

--- you are going to break these rules. If the barrier can return 
-EOPNOTSUPP anytime, the following request will be finished before the 
barrier write is finished. (because the barrier write must be resubmitted 
without barrier)

Basically, if you allow randomly failed barriers, you can drop barriers at 
all and replace them with blkdev_issue_flush(); write&wait_synchronous(); 
blkdev_issue_flush(). There is no more reason to maintain complicated 
logic of barriers, if you cripple them to the unusable point when they 
randomly fail.


Another thing:

I'm wondering, where in fsync() does Linux wait for hardware disk cache to 
be flushed? Isn't there a bug that fsync() will return before the cache is 
flushed? I couldn't really find it. The last thing do_fsync calls is 
filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush).

Mikulas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 19:37               ` Mikulas Patocka
@ 2008-12-04 22:15                 ` Andi Kleen
  2008-12-04 23:08                   ` Mikulas Patocka
  2008-12-05  3:26                 ` Device loses barrier support Eric Sandeen
  1 sibling, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2008-12-04 22:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

> If you are pushing what you are pushing --- barriers allowing to return 
> EOPNOTSUPP anytime --- then asynchronous barrier submits can no longer be 
> used, because by the time EOPNOTSUPP is detected, the filesystem is 
> already corrupted.

Chris Mason pointed out that this can actually already happen. From
a quick review this can happen in MD raid1 at least (their barriers_work
flag is pretty similar to the DM implementation I did).  So everyone
has to handle this already anyways.

> I'm wondering, where in fsync() does Linux wait for hardware disk cache to 
> be flushed? Isn't there a bug that fsync() will return before the cache is 
> flushed? I couldn't really find it. The last thing do_fsync calls is 
> filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush).

At least in fsync() on journaling fs the metadata update should push it.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 22:15                 ` Andi Kleen
@ 2008-12-04 23:08                   ` Mikulas Patocka
  2008-12-05  0:48                     ` Andi Kleen
  0 siblings, 1 reply; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-04 23:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz

On Thu, 4 Dec 2008, Andi Kleen wrote:

> > If you are pushing what you are pushing --- barriers allowing to return 
> > EOPNOTSUPP anytime --- then asynchronous barrier submits can no longer be 
> > used, because by the time EOPNOTSUPP is detected, the filesystem is 
> > already corrupted.
> 
> Chris Mason pointed out that this can actually already happen. From
> a quick review this can happen in MD raid1 at least (their barriers_work
> flag is pretty similar to the DM implementation I did).  So everyone
> has to handle this already anyways.

So: remove barriers completely and use only blkdev_issue_flush to flush 
disk cache. Because none of the major filesystems learned to use 
barrier-optimized commits and this 
"barriers-randomly-fail-with-EOPNOTSUPP" fact makes it impossible to use 
them in an optimized way anyway.

There is another point:
"what is the main performance advantage of barriers?" - "that the user can 
turn on hardware write cache with hdparm -W 1 command".

And if barriers fail at random points, the user can't turn on disk cache 
anyway (he would get data corruption if barrier write failed and hardware 
write cache was enabled). So barriers make no sense here.

> > I'm wondering, where in fsync() does Linux wait for hardware disk cache to 
> > be flushed? Isn't there a bug that fsync() will return before the cache is 
> > flushed? I couldn't really find it. The last thing do_fsync calls is 
> > filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush).
> 
> At least in fsync() on journaling fs the metadata update should push it.
> 
> -Andi

And what about fdatasync()?

Mikulas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 23:08                   ` Mikulas Patocka
@ 2008-12-05  0:48                     ` Andi Kleen
  2008-12-05  1:16                       ` Mikulas Patocka
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2008-12-05  0:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

> And if barriers fail at random points, the user can't turn on disk cache 
> anyway (he would get data corruption if barrier write failed and hardware 

I think we already established earlier in the thread that there is no disk
corruption

> > > I'm wondering, where in fsync() does Linux wait for hardware disk cache to 
> > > be flushed? Isn't there a bug that fsync() will return before the cache is 
> > > flushed? I couldn't really find it. The last thing do_fsync calls is 
> > > filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush).
> > 
> > At least in fsync() on journaling fs the metadata update should push it.
> > 
> > -Andi
> 
> And what about fdatasync()?

I don't know. The surest way to find out is to instrument it and try.

-Andi
-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-05  0:48                     ` Andi Kleen
@ 2008-12-05  1:16                       ` Mikulas Patocka
  2008-12-05  1:37                         ` Andi Kleen
  0 siblings, 1 reply; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-05  1:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz

On Fri, 5 Dec 2008, Andi Kleen wrote:

> > And if barriers fail at random points, the user can't turn on disk cache 
> > anyway (he would get data corruption if barrier write failed and hardware 
> 
> I think we already established earlier in the thread that there is no disk
> corruption

So, the facts are:

* barrier support in md-raid1 deviates from the specification at 
Documentation/block/barrier.txt. The specification says that requests 
submitted after the barrier request hit the media after the barrier 
request hits the media. The reality is that the barrier request can be 
randomly aborted and the requests submitted after it hit the media before 
the barrier request.

* the filesystems developed hacks to work around this issue, the hacks 
involve not submitting more requests after the barrier request, 
synchronously waiting for the barrier request and eventually retrying it. 
These hacks suppress any performance advantage barriers could bring.

* you submit a patch that makes barriers even more often deviate from the 
specification and you argue that the patch is correct because filesystems 
handle this deviation.

This is runaway logic that will eventually turn Linux into unmaintainable 
mess. What do you think we'll be doing when we'll be implementing barriers 
into other dm targets? Do you really think it'll be fun to write code to 
double-submit all metadata writes just because you and some person at 
md-raid1 provided an unreliable interface?

I am again repeating: either make barriers consistent with the 
specification, or remove them at all.

Mikulas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-05  1:16                       ` Mikulas Patocka
@ 2008-12-05  1:37                         ` Andi Kleen
  2008-12-05  2:21                           ` Mikulas Patocka
  2008-12-05 11:52                           ` Alan Cox
  0 siblings, 2 replies; 20+ messages in thread
From: Andi Kleen @ 2008-12-05  1:37 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

> * barrier support in md-raid1 deviates from the specification at 
> Documentation/block/barrier.txt. The specification says that requests 
> submitted after the barrier request hit the media after the barrier 
> request hits the media. The reality is that the barrier request can be 
> randomly aborted and the requests submitted after it hit the media before 
> the barrier request.

Yes the spec should be probably updated.

But also see Linus' rant from yesterday about code vs documentation.
When in doubt the code wins.
> 
> * the filesystems developed hacks to work around this issue, the hacks 
> involve not submitting more requests after the barrier request, 

I suspect the reason the file systems did it this way is that
it was a much simpler change than to rewrite the transaction
manager for this.

> synchronously waiting for the barrier request and eventually retrying it. 
> These hacks suppress any performance advantage barriers could bring.
> 
> * you submit a patch that makes barriers even more often deviate from the 
> specification and you argue that the patch is correct because filesystems 
> handle this deviation.

Sorry what counts is the code behaviour, not the specification.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-05  1:37                         ` Andi Kleen
@ 2008-12-05  2:21                           ` Mikulas Patocka
  2008-12-05  3:09                             ` Andi Kleen
  2008-12-05 11:52                           ` Alan Cox
  1 sibling, 1 reply; 20+ messages in thread
From: Mikulas Patocka @ 2008-12-05  2:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, xfs, Alasdair G Kergon, Andi Kleen, Milan Broz



On Fri, 5 Dec 2008, Andi Kleen wrote:

> > * barrier support in md-raid1 deviates from the specification at 
> > Documentation/block/barrier.txt. The specification says that requests 
> > submitted after the barrier request hit the media after the barrier 
> > request hits the media. The reality is that the barrier request can be 
> > randomly aborted and the requests submitted after it hit the media before 
> > the barrier request.
> 
> Yes the spec should be probably updated.
> 
> But also see Linus' rant from yesterday about code vs documentation.
> When in doubt the code wins.

The only one offender is "md". It is less overhead to change "md" to play 
nice and be reliable than to double-submit requests in all the places that 
needs write ordering.

> > * the filesystems developed hacks to work around this issue, the hacks 
> > involve not submitting more requests after the barrier request, 
> 
> I suspect the reason the file systems did it this way is that
> it was a much simpler change than to rewrite the transaction
> manager for this.

It could be initial reason. But this unreliability also disallows any 
improvement in filesystems. No one can write asynchronous transaction 
manager because of that evil EOPNOTSUPP.

> > synchronously waiting for the barrier request and eventually retrying it. 
> > These hacks suppress any performance advantage barriers could bring.
> > 
> > * you submit a patch that makes barriers even more often deviate from the 
> > specification and you argue that the patch is correct because filesystems 
> > handle this deviation.
> 
> Sorry what counts is the code behaviour, not the specification.

Better interface is that one that has less maintenance overhead. And I 
don't see requiring the programmers of all IO code to double-submit 
requests as less maintenance overhead.

> -Andi

Mikulas

---

If you want to make it easier to infer functionality from the code, apply 
this patch :)

---
 block/blk-core.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6.28-rc5-devel/block/blk-core.c
===================================================================
--- linux-2.6.28-rc5-devel.orig/block/blk-core.c	2008-12-05 02:54:25.000000000 +0100
+++ linux-2.6.28-rc5-devel/block/blk-core.c	2008-12-05 03:14:23.000000000 +0100
@@ -28,6 +28,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
+#include <linux/random.h>
 
 #include "blk.h"
 
@@ -1528,6 +1529,13 @@ void submit_bio(int rw, struct bio *bio)
 
 	bio->bi_rw |= rw;
 
+	/* At least, make the true nature of write barriers obvious. */
+
+	if (bio_barrier(bio) && !(random32() % 42)) {
+		bio_endio(bio, -EOPNOTSUPP);
+		return;
+	}
+
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-05  2:21                           ` Mikulas Patocka
@ 2008-12-05  3:09                             ` Andi Kleen
  0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2008-12-05  3:09 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

> The only one offender is "md". 

I'm not sure. It wouldn't surprise me if it can happen with other setups
too. Perhaps Chris knows more.

> It is less overhead to change "md" to play 
> nice and be reliable than to double-submit requests in all the places that 
> needs write ordering.

They do that already anyways.

> 
> > > * the filesystems developed hacks to work around this issue, the hacks 
> > > involve not submitting more requests after the barrier request, 
> > 
> > I suspect the reason the file systems did it this way is that
> > it was a much simpler change than to rewrite the transaction
> > manager for this.
> 
> It could be initial reason. But this unreliability also disallows any 
> improvement in filesystems. No one can write asynchronous transaction 
> manager because of that evil EOPNOTSUPP.

Doesn't seem right. It would be a simple state machine
to handle it fully asynchronous. Alternatively you could
always use empty barriers.

But we can worry about that when some in tree file system
actually tries to do that.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support
  2008-12-04 19:37               ` Mikulas Patocka
  2008-12-04 22:15                 ` Andi Kleen
@ 2008-12-05  3:26                 ` Eric Sandeen
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Sandeen @ 2008-12-05  3:26 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, Milan Broz, Andi Kleen, linux-kernel,
	Alasdair G Kergon, xfs

Mikulas Patocka wrote:

> Another thing:
> 
> I'm wondering, where in fsync() does Linux wait for hardware disk cache to 
> be flushed? Isn't there a bug that fsync() will return before the cache is 
> flushed? I couldn't really find it. The last thing do_fsync calls is 
> filemap_fdatawait and it doesn't do cache flush (blkdev_issue_flush).

ext4, reiserfs, and xfs all call blkdev_issue_flush() in their ->fsync
file operations (or down that path).

-Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support
  2008-12-04 14:58         ` Andi Kleen
  2008-12-04 16:45           ` Mikulas Patocka
@ 2008-12-05  5:44           ` Timothy Shimmin
  1 sibling, 0 replies; 20+ messages in thread
From: Timothy Shimmin @ 2008-12-05  5:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mikulas Patocka, Andi Kleen, linux-kernel, xfs, Alasdair G Kergon,
	Milan Broz

Andi Kleen wrote:
> The write block with barrier bit varies, jbd/gfs2 do it synchronously
> too and xfs does it asynchronously (with io done callbacks), but
> in both cases they handle an EOPNOTSUPP comming out in the final
> io done.
> 
Yes, XFS handles it, however,
it doesn't look like we currently propagate the EOPNOTSUPP
up to where we test it (not set for b_error).
Patch disscussed recently on xfs list to rectify this.

--Tim


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-05  1:37                         ` Andi Kleen
  2008-12-05  2:21                           ` Mikulas Patocka
@ 2008-12-05 11:52                           ` Alan Cox
  2008-12-05 12:29                             ` Andi Kleen
  1 sibling, 1 reply; 20+ messages in thread
From: Alan Cox @ 2008-12-05 11:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mikulas Patocka, Andi Kleen, linux-kernel, xfs, Alasdair G Kergon,
	Andi Kleen, Milan Broz

On Fri, 5 Dec 2008 02:37:39 +0100
Andi Kleen <andi@firstfloor.org> wrote:

> > * barrier support in md-raid1 deviates from the specification at 
> > Documentation/block/barrier.txt. The specification says that requests 
> > submitted after the barrier request hit the media after the barrier 
> > request hits the media. The reality is that the barrier request can be 
> > randomly aborted and the requests submitted after it hit the media before 
> > the barrier request.
> 
> Yes the spec should be probably updated.
> 
> But also see Linus' rant from yesterday about code vs documentation.
> When in doubt the code wins.

Not when the fundamental design of the code is broken and trashes
performance. The documented behaviour is strongly desirable.

Alan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-05 11:52                           ` Alan Cox
@ 2008-12-05 12:29                             ` Andi Kleen
  0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2008-12-05 12:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Mikulas Patocka, linux-kernel, xfs, Alasdair G Kergon,
	Andi Kleen, Milan Broz

> Not when the fundamental design of the code is broken and trashes
> performance. 

Sorry but that's just not correct. There's nothing in late failing
barriers that "trashes performance". The file system writers have
to be careful to handle it, but at least the current ones all do.
And also if someone writes a hypothetical fully asynchronously driven 
barrier based IO transaction system it would be still possible to handle 
the late failing barrier without too many complications.

Also late failing barriers is pretty much the only sane way to implement
barriers in software remapping schemes like DM and MD.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Device loses barrier support (was: Fixed patch for simple barriers.)
  2008-12-04 16:45           ` Mikulas Patocka
  2008-12-04 17:48             ` Andi Kleen
@ 2008-12-07  4:17             ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2008-12-07  4:17 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andi Kleen, linux-kernel, xfs, Alasdair G Kergon, Andi Kleen,
	Milan Broz

On Thu, Dec 04, 2008 at 11:45:44AM -0500, Mikulas Patocka wrote:
> > The write block with barrier bit varies, jbd/gfs2 do it synchronously
> > too and xfs does it asynchronously (with io done callbacks), but
> 
> And how does xfs preserve write ordering, if the barrier asynchronously 
> fails with -EOPNOTSUPP and there are other writes submitted after the 
> barrier?

Doesn't matter. XFS executes journal state changes during the I/O
completion callbacks a layer above this resubmit code. Hence,
if the I/O is resubmitted before the completion callbacks are
run, it just appears that the I/O has taken longer than expected
and the state change is delayed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2008-12-07  4:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.64.0812040009340.15169@hs20-bc2-1.build.redhat.com>
     [not found] ` <20081204100050.GN6703@one.firstfloor.org>
2008-12-04 14:00   ` Device loses barrier support (was: Fixed patch for simple barriers.) Mikulas Patocka
2008-12-04 14:20     ` Andi Kleen
2008-12-04 14:17       ` Mikulas Patocka
2008-12-04 14:58         ` Andi Kleen
2008-12-04 16:45           ` Mikulas Patocka
2008-12-04 17:48             ` Andi Kleen
2008-12-04 17:53               ` Christoph Hellwig
2008-12-04 19:37               ` Mikulas Patocka
2008-12-04 22:15                 ` Andi Kleen
2008-12-04 23:08                   ` Mikulas Patocka
2008-12-05  0:48                     ` Andi Kleen
2008-12-05  1:16                       ` Mikulas Patocka
2008-12-05  1:37                         ` Andi Kleen
2008-12-05  2:21                           ` Mikulas Patocka
2008-12-05  3:09                             ` Andi Kleen
2008-12-05 11:52                           ` Alan Cox
2008-12-05 12:29                             ` Andi Kleen
2008-12-05  3:26                 ` Device loses barrier support Eric Sandeen
2008-12-07  4:17             ` Device loses barrier support (was: Fixed patch for simple barriers.) Dave Chinner
2008-12-05  5:44           ` Device loses barrier support Timothy Shimmin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox