* [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
@ 2007-05-25 7:58 Neil Brown
2007-05-25 11:15 ` David Chinner
` (7 more replies)
0 siblings, 8 replies; 102+ messages in thread
From: Neil Brown @ 2007-05-25 7:58 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-raid, dm-devel, David Chinner, linux-kernel, Jens Axboe
This mail is about an issue that has been of concern to me for quite a
while and I think it is (well past) time to air it more widely and try
to come to a resolution.
This issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers.
The following is my understanding, which could well be wrong in
various specifics. Corrections and other comments are more than
welcome.
------------
What are barriers?
==================
Barriers (as generated by requests with BIO_RW_BARRIER) are intended
to ensure that the data in the barrier request is not visible until
all writes submitted earlier are safe on the media, and that the data
is safe on the media before any subsequently submitted requests
are visible on the device.
This is achieved by tagging requests in the elevator (or any other
request queue) so that no re-ordering is performed around a
BIO_RW_BARRIER request, and by sending appropriate commands to the
device so that any write-behind caching is defeated by the barrier
request.
Alongside BIO_RW_BARRIER is blkdev_issue_flush, which calls
q->issue_flush_fn. This can be used to achieve similar effects.
There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.
Conversely, blkdev_issue_flush must be supported on any device that
uses write-behind caching (if it cannot be supported, then
write-behind caching should be turned off, at least by default).
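To make the two primitives concrete, here is a minimal sketch of issuing a
barrier write and a flush from a block-layer client, assuming the interfaces
of this era (submit_bio taking an rw flags word, the three-argument bi_end_io,
and blkdev_issue_flush taking an error_sector pointer).  The function names
write_commit_block/flush_device are made up for illustration, and a real
caller would also need to notice -EOPNOTSUPP on the barrier so it can fall
back, which this sketch glosses over:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

/* old-style three-argument completion callback: just wake the submitter */
static int commit_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
        if (bio->bi_size)               /* not fully complete yet */
                return 1;
        complete((struct completion *)bio->bi_private);
        return 0;
}

/* submit one page as a barrier write and wait for it */
static int write_commit_block(struct block_device *bdev, sector_t sector,
                              struct page *page)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct bio *bio = bio_alloc(GFP_NOIO, 1);
        int ret;

        bio->bi_bdev = bdev;
        bio->bi_sector = sector;
        bio->bi_end_io = commit_end_io;
        bio->bi_private = &done;
        bio_add_page(bio, page, PAGE_SIZE, 0);

        /* WRITE plus the barrier bit */
        submit_bio((1 << BIO_RW) | (1 << BIO_RW_BARRIER), bio);
        wait_for_completion(&done);

        ret = test_bit(BIO_UPTODATE, &bio->bi_flags) ? 0 : -EIO;
        bio_put(bio);
        return ret;
}

/* the other primitive: flush whatever the device has cached */
static int flush_device(struct block_device *bdev)
{
        sector_t error_sector;

        return blkdev_issue_flush(bdev, &error_sector);
}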
We can think of there being three types of devices:
1/ SAFE. With a SAFE device, there is no write-behind cache, or if
there is it is non-volatile. Once a write completes it is
completely safe. Such a device does not require barriers
or ->issue_flush_fn, and can respond to them either by a
no-op or with -EOPNOTSUPP (the former is preferred).
2/ FLUSHABLE.
A FLUSHABLE device may have a volatile write-behind cache.
This cache can be flushed with a call to blkdev_issue_flush.
It may not support barrier requests.
3/ BARRIER.
A BARRIER device supports both blkdev_issue_flush and
BIO_RW_BARRIER. Either may be used to synchronise any
write-behind cache to non-volatile storage (media).
Handling of SAFE and FLUSHABLE devices is essentially the same and can
work on a BARRIER device. The BARRIER device has the option of more
efficient handling.
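Purely as an illustration of the classification (the kernel doesn't hand you
this directly), a rough guess can be made from what the queue advertises.
The q->ordered check is what XFS uses (see later in the thread); the rest is
guesswork, and as discussed below the answer can turn out to be wrong or can
even change under you:

#include <linux/blkdev.h>
#include <linux/genhd.h>

enum dev_class { DEV_SAFE, DEV_FLUSHABLE, DEV_BARRIER };

/* hypothetical helper: guess which of the three classes a device is in */
static enum dev_class classify_bdev(struct block_device *bdev)
{
        struct request_queue *q = bdev->bd_disk->queue;

        if (q->ordered != QUEUE_ORDERED_NONE)
                return DEV_BARRIER;     /* claims to handle BIO_RW_BARRIER */
        if (q->issue_flush_fn)
                return DEV_FLUSHABLE;   /* can at least be flushed */
        return DEV_SAFE;                /* really "no information" */
}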
How does a filesystem use this?
===============================
A filesystem will often have a concept of a 'commit' block which makes
an assertion about the correctness of other blocks in the filesystem.
In the most gross sense, this could be the writing of the superblock
of an ext2 filesystem, with the "dirty" bit clear. This write commits
all other writes to the filesystem that precede it.
More subtle/useful is the commit block in a journal as with ext3 and
others. This write commits some number of preceding writes in the
journal or elsewhere.
The filesystem will want to ensure that all preceding writes are safe
before writing the barrier block. There are two ways to achieve this.
1/ Issue all 'preceding writes', wait for them to complete (bi_endio
called), call blkdev_issue_flush, issue the commit write, wait
for it to complete, call blkdev_issue_flush a second time.
(This is needed for FLUSHABLE)
2/ Set the BIO_RW_BARRIER bit in the write request for the commit
block.
(This is more efficient on BARRIER).
The second, while much easier, can fail. So a filesystem should be
prepared to deal with that failure by falling back to the first
option.
Thus the general sequence might be:
a/ issue all "preceding writes".
b/ issue the commit write with BIO_RW_BARRIER
c/ wait for the commit to complete.
If it was successful - done.
If it failed other than with EOPNOTSUPP, abort
else continue
d/ wait for all 'preceding writes' to complete
e/ call blkdev_issue_flush
f/ issue commit write without BIO_RW_BARRIER
g/ wait for commit write to complete
if it failed, abort
h/ call blkdev_issue
DONE
steps b and c can be left out if it is known that the device does not
support barriers. The only way to discover this is to try and see if it
fails.
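Pulled together as code, the sequence might look like the sketch below.  It
assumes the current blkdev_issue_flush(bdev, &error_sector); the other
helpers (issue_preceding_writes, wait_for_preceding_writes, write_commit)
are stand-ins for filesystem-specific machinery, not real kernel interfaces:

#include <linux/blkdev.h>

/* filesystem-specific stand-ins */
extern void issue_preceding_writes(void);
extern void wait_for_preceding_writes(void);
extern int write_commit(struct block_device *bdev, int barrier);

static int commit_transaction(struct block_device *bdev)
{
        sector_t err_sector;
        int ret;

        issue_preceding_writes();                       /* step a */

        ret = write_commit(bdev, 1);                    /* steps b, c */
        if (ret == 0)
                return 0;                               /* done */
        if (ret != -EOPNOTSUPP)
                return ret;                             /* real failure: abort */

        /* barrier not supported: fall back to flush-write-flush */
        wait_for_preceding_writes();                    /* step d */
        ret = blkdev_issue_flush(bdev, &err_sector);    /* step e */
        if (ret)
                return ret;
        ret = write_commit(bdev, 0);                    /* steps f, g */
        if (ret)
                return ret;
        return blkdev_issue_flush(bdev, &err_sector);   /* step h */
}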
I don't think any filesystem follows all these steps.
ext3 has the right structure, but it doesn't include steps e and h.
reiserfs is similar. It does have a call to blkdev_issue_flush, but
that is only on the fsync path, so it isn't really protecting
general journal commits.
XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f'
depending on whether it thinks the device handles barriers,
and finally 'g'.
I haven't looked at other filesystems.
So for devices that support BIO_RW_BARRIER, and for devices that don't
need any flush, they work OK, but for devices that need flushing, but
don't support BIO_RW_BARRIER, none of them work. This should be easy
to fix.
HOW DO MD or DM USE THIS
========================
1/ striping devices.
This includes md/raid0 md/linear dm-linear dm-stripe and probably
others.
These devices can easily support blkdev_issue_flush by simply
calling blkdev_issue_flush on all component devices.
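As a sketch (modelled loosely on the ->issue_flush_fn hooks the md
personalities already provide - the structure and field names here are
placeholders), the composite device just walks its components:

#include <linux/blkdev.h>
#include <linux/genhd.h>

/* hypothetical striped-device private data: just an array of components */
struct stripe_conf {
        int                     nr_devs;
        struct block_device     *component[16];
};

/* ->issue_flush_fn for the composite device: flush every component and
 * report the first error seen */
static int stripe_issue_flush(struct request_queue *q, struct gendisk *disk,
                              sector_t *error_sector)
{
        struct stripe_conf *conf = q->queuedata;
        int i, ret = 0;

        for (i = 0; i < conf->nr_devs; i++) {
                int err = blkdev_issue_flush(conf->component[i], error_sector);

                if (err && !ret)
                        ret = err;
        }
        return ret;
}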
These devices would find it very hard to support BIO_RW_BARRIER.
Doing this would require keeping track of all in-flight requests
(which some, possibly all, of the above don't) and then:
When a BIO_RW_BARRIER request arrives:
wait for all pending writes to complete
call blkdev_issue_flush on all devices
issue the barrier write to the target device(s)
as BIO_RW_BARRIER,
if that fails with -EOPNOTSUPP, re-issue, wait, flush.
Currently none of the listed modules do that.
md/raid0 and md/linear fail any BIO_RW_BARRIER with -EOPNOTSUPP.
dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down,
which means data may not be flushed correctly: the commit block
might be written to one device before a preceding block is
written to another device.
I think the best approach for this class of devices is to return
-EOPNOTSUPP. If the filesystem does the wait (which they all do
already) and the blkdev_issue_flush (which is easy to add), these
devices don't need to support BIO_RW_BARRIER.
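For completeness, rejecting barriers up front is only a couple of lines in
the make_request function - roughly what md/raid0 does today, shown here as
a simplified sketch using the three-argument bio_endio of this era:

#include <linux/bio.h>
#include <linux/blkdev.h>

static int stripe_make_request(struct request_queue *q, struct bio *bio)
{
        /* refuse ordering guarantees we cannot provide; the filesystem
         * is expected to fall back to wait + blkdev_issue_flush */
        if (unlikely(bio_barrier(bio))) {
                bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
                return 0;
        }

        /* ... normal striped remapping of the bio would go here ... */
        return 0;
}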
2/ Mirror devices. This includes md/raid1 and dm-raid1.
These devices can trivially implement blkdev_issue_flush much like
the striping devices, and can support BIO_RW_BARRIER to some
extent.
md/raid1 currently tries. I'm not sure about dm-raid1.
md/raid1 determines if the underlying devices can handle
BIO_RW_BARRIER. If any cannot, it rejects such requests (-EOPNOTSUPP)
itself.
If all underlying devices do appear to support barriers, md/raid1
will pass a barrier-write down to all devices.
The difficulty comes if it fails on one device, but not all
devices. In this case it is not clear what to do. Failing the
request is a lie, because some data has been written (possibly too
early). Succeeding the request (after re-submitting the failed
requests) is also a lie as the barrier wasn't really honoured.
md/raid1 currently takes the latter approach, but will only do it
once - after that it fails all barrier requests.
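In outline the retry looks something like the sketch below.  This is heavily
simplified - the real md/raid1 code requeues the request to its helper
thread rather than resubmitting from the completion handler - and the
structure and helper names here are invented for illustration:

#include <linux/bio.h>

struct mirror_request {                 /* placeholder for md's per-request state */
        int barriers_supported;
        /* ... per-leg bookkeeping ... */
};

extern void resubmit_without_barrier(struct mirror_request *mr);
extern void note_leg_completion(struct mirror_request *mr, int error);

/* much-simplified completion path for one leg of a mirrored barrier write */
static int mirror_write_end_io(struct bio *bio, unsigned int bytes_done,
                               int error)
{
        struct mirror_request *mr = bio->bi_private;

        if (bio->bi_size)
                return 1;

        if (error == -EOPNOTSUPP && bio_barrier(bio)) {
                mr->barriers_supported = 0;             /* only lie once */
                resubmit_without_barrier(mr);
                return 0;
        }

        note_leg_completion(mr, error);
        return 0;
}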
Hopefully this is unlikely to happen. What device would work
correctly with barriers once, and then not the next time?
The answer is md/raid1. If you remove a failed device and add a
new device that doesn't support barriers, md/raid1 will notice and
stop supporting barriers.
If md/raid1 can change from supporting barrier to not, then maybe
some other device could too?
I'm not sure what to do about this - maybe just ignore it...
3/ Other modules
Other md and dm modules (raid5, mpath, crypt) do not add anything
interesting to the above. Either handling BIO_RW_BARRIER is
trivial, or extremely difficult.
HOW DO LOW LEVEL DEVICES HANDLE THIS
====================================
This is part of the picture that I haven't explored greatly. My
feeling is that most if not all devices support blkdev_issue_flush
properly, and support barriers reasonably well providing that the
hardware does.
There is an exception I recently found, though.
For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
the controller can be tagged as barriers), SCSI will use the
SYNCHRONIZE_CACHE command to flush the cache after the barrier
request (a bit like the filesystem calling blkdev_issue_flush, but at
a lower level). However it does this without setting the SYNC_NV bit.
This means that a device with a non-volatile cache will be required --
needlessly -- to flush that cache to media.
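For reference, SYNC_NV is just a bit in the SYNCHRONIZE CACHE (10) CDB.  A
sketch of building such a command is below; the opcode and bit position are
taken from SBC, but treat the exact layout as something to check against the
spec rather than a description of what sd.c currently does:

#include <linux/types.h>
#include <linux/string.h>

#define SYNCHRONIZE_CACHE_10    0x35
#define SYNC_NV_BIT             0x04    /* byte 1, bit 2 */

/* build a SYNCHRONIZE CACHE (10) CDB covering the whole device; with
 * SYNC_NV set, a drive with non-volatile cache need not push that cache
 * all the way to the media */
static void build_sync_cache_cdb(u8 cdb[10], int sync_nv)
{
        memset(cdb, 0, 10);
        cdb[0] = SYNCHRONIZE_CACHE_10;
        if (sync_nv)
                cdb[1] |= SYNC_NV_BIT;
        /* LBA (bytes 2-5) and number of blocks (bytes 7-8) left as zero,
         * which per SBC means "all cached data for the medium" */
}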
So: some questions to help encourage response:
- Is the above substantially correct? Totally correct?
- Should the various filesystems be "fixed" as suggested above? Is
someone willing to do that?
- Is the approach to barriers taken by md appropriate? Should dm
do the same? Who will do that?
- Is setting the SYNC_NV bit really the right thing to do? Are there
any other places where the wrong sort of sync might be happening?
Are there any callers that require SYNC_NV to be clear?
- The comment above blkdev_issue_flush says "Caller must run
wait_for_completion() on its own". What does that mean?
- Are there other bits that we could handle better?
BIO_RW_FAILFAST? BIO_RW_SYNC? What exactly do they mean?
Thank you for your attention.
NeilBrown
^ permalink raw reply [flat|nested] 102+ messages in thread* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown @ 2007-05-25 11:15 ` David Chinner 2007-05-25 11:49 ` Jens Axboe 2007-05-25 13:52 ` Stefan Bader ` (6 subsequent siblings) 7 siblings, 1 reply; 102+ messages in thread From: David Chinner @ 2007-05-25 11:15 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote: > We can think of there being three types of devices: > > 1/ SAFE. With a SAFE device, there is no write-behind cache, or if > there is it is non-volatile. Once a write completes it is > completely safe. Such a device does not require barriers > or ->issue_flush_fn, and can respond to them either by a > no-op or with -EOPNOTSUPP (the former is preferred). > > 2/ FLUSHABLE. > A FLUSHABLE device may have a volatile write-behind cache. > This cache can be flushed with a call to blkdev_issue_flush. > It may not support barrier requests. So returns -EOPNOTSUPP to any barrier request? > 3/ BARRIER. > A BARRIER device supports both blkdev_issue_flush and > BIO_RW_BARRIER. Either may be used to synchronise any > write-behind cache to non-volatile storage (media). > > Handling of SAFE and FLUSHABLE devices is essentially the same and can > work on a BARRIER device. The BARRIER device has the option of more > efficient handling. > > How does a filesystem use this? > =============================== .... > > The filesystem will want to ensure that all preceding writes are safe > before writing the barrier block. There are two ways to achieve this. Three, actually. > 1/ Issue all 'preceding writes', wait for them to complete (bi_endio > called), call blkdev_issue_flush, issue the commit write, wait > for it to complete, call blkdev_issue_flush a second time. > (This is needed for FLUSHABLE) *nod* > 2/ Set the BIO_RW_BARRIER bit in the write request for the commit > block. > (This is more efficient on BARRIER). *nod* 3/ Use a SAFE device. > The second, while much easier, can fail. So we do a test I/O to see if the device supports them before enabling that mode. But, as we've recently discovered, this is not sufficient to detect *correctly functioning* barrier support. > So a filesystem should be > prepared to deal with that failure by falling back to the first > option. I don't buy that argument..... > Thus the general sequence might be: > > a/ issue all "preceding writes". > b/ issue the commit write with BIO_RW_BARRIER At this point, the filesystem has done everything it needs to ensure that the block layer has been informed of the I/O ordering requirements. Why should the filesystem now have to detect block layer breakage, and then use a different block layer API to issue the same I/O under the same constraints? > c/ wait for the commit to complete. > If it was successful - done. > If it failed other than with EOPNOTSUPP, abort > else continue > d/ wait for all 'preceding writes' to complete > e/ call blkdev_issue_flush > f/ issue commit write without BIO_RW_BARRIER > g/ wait for commit write to complete > if it failed, abort > h/ call blkdev_issue ^^^^^^^^^^^^_flush? > DONE > > steps b and c can be left out if it is known that the device does not > support barriers. The only way to discover this to try and see if it > fails. That's a very linear, single-threaded way of looking at it... 
;) > I don't think any filesystem follows all these steps. > > ext3 has the right structure, but it doesn't include steps e and h. > reiserfs is similar. It does have a call to blkdev_issue_flush, but > that is only on the fsync path, so it isn't really protecting > general journal commits. > XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f' > depending on a whether it thinks the device handles barriers, > and finally 'g'. That's right, except for the "g" (or "c") bit - commit writes are async and nothing waits for them - the io completion wakes anything waiting on it's completion.... (yes, all XFS barrier I/Os are issued async which is why having to handle an -EOPNOTSUPP error is a real pain. The fix I currently have is to reissue the I/O from the completion handler with is ugly, ugly, ugly.....) > So for devices that support BIO_RW_BARRIER, and for devices that don't > need any flush, they work OK, but for device that need flushing, but > don't support BIO_RW_BARRIER, none of them work. This should be easy > to fix. Right - XFS as it stands was designed to work on SAFE devices, and we've modified it to work on BARRIER devices. We don't support FLUSHABLE devices at all. But if the filesystem supports BARRIER devices, I don't see any reason why a filesystem needs to be modified to support FLUSHABLE devices - the key point being that by the time the filesystem has issued the "commit write" it has already waited for all it's dependent I/O, and so all the block device needs to do is issue flushes either side of the commit write.... > HOW DO MD or DM USE THIS > ======================== > > 1/ striping devices. > This includes md/raid0 md/linear dm-linear dm-stripe and probably > others. > > These devices can easily support blkdev_issue_flush by simply > calling blkdev_issue_flush on all component devices. > > These devices would find it very hard to support BIO_RW_BARRIER. > Doing this would require keeping track of all in-flight requests > (which some, possibly all, of the above don't) and then: > When a BIO_RW_BARRIER request arrives: > wait for all pending writes to complete A count of outstanding I/Os and a wait queue is sufficient to implement this, I think. ..... > I think the best approach for this class of devices is to return > -EOPNOSUP. If the filesystem does the wait (which they all do > already) and the blkdev_issue_flush (which is easy to add), they > don't need to support BIO_RW_BARRIER. So you want to define these as a FLUSHABLE device? > 2/ Mirror devices. This includes md/raid1 and dm-raid1. ...... > Hopefully this is unlikely to happen. What device would work > correctly with barriers once, and then not the next time? > The answer is md/raid1. If you remove a failed device and add a > new device that doesn't support barriers, md/raid1 will notice and > stop supporting barriers. In case you hadn't already guess, I don't like this behaviour at all. It makes async I/O completion of barrier I/O an ugly, messy business, and every place you do sync I/O completion you need to put special error handling. If this happens to md/raid1, then why can't it simply do a blkdev_issue_flush, write, blkdev_issue_flush sequence to the device that doesn't support barriers and then the md device *never changes behaviour*. Next time the filesystem is mounted, it will turn off barriers because they won't be supported.... You do that, and suddenly there's 10 filesystems that no longer have to handle this extremely rare corner case. 
> So: some questions to help encourage response: > > - Is the above substantial correct? Totally correct? Mostly valid ;) I like the idea of clearly defining the different types of devices so we can say exactly what each device behaves like. It also points out (to me, anyway) that from a filesystem POV there is no logical difference between a BARRIER and a FLUSHABLE block device; the only difference is that filesystems are forced to use a different API to support them (and hence they are not supported). > - Should the various filesystems be "fixed" as suggested above? Is > someone willing to do that? Alternate viewpoint - should the block layer be fixed so that the filesystems only need to use one barrier API that provides static behaviour for the life of the mount? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 11:15 ` David Chinner @ 2007-05-25 11:49 ` Jens Axboe 2007-05-25 14:49 ` Phillip Susi 0 siblings, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-05-25 11:49 UTC (permalink / raw) To: David Chinner Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid On Fri, May 25 2007, David Chinner wrote: > > The second, while much easier, can fail. > > So we do a test I/O to see if the device supports them before > enabling that mode. But, as we've recently discovered, this is not > sufficient to detect *correctly functioning* barrier support. Right, those are two different things. But paranoia aside, will this ever be a real life problem? I've always been of the opinion to just nicely ignore them. We can't easily detect it and tell the user his hw is crap. > > So a filesystem should be > > prepared to deal with that failure by falling back to the first > > option. > > I don't buy that argument..... The problem with Neils reasoning there is that blkdev_issue_flush() may use the same method as the barrier to ensure data is on platter. A barrier write will include a flush, but it may also use the FUA bit to ensure data is on platter. So the only situation where a fallback from a barrier to flush would be valid, is if the device lied and told you it could do FUA but it could not and that is the reason why the barrier write failed. If that is the case, the block layer should stop using FUA and fallback to flush-write-flush. And if it does that, then there's never a valid reason to switch from using barrier writes to blkdev_issue_flush() since both methods would either both work or both fail. > > Thus the general sequence might be: > > > > a/ issue all "preceding writes". > > b/ issue the commit write with BIO_RW_BARRIER > > At this point, the filesystem has done everything it needs to ensure > that the block layer has been informed of the I/O ordering > requirements. Why should the filesystem now have to detect block > layer breakage, and then use a different block layer API to issue > the same I/O under the same constraints? It's not block layer breakage, it's a device issue. > > 2/ Mirror devices. This includes md/raid1 and dm-raid1. > ...... > > Hopefully this is unlikely to happen. What device would work > > correctly with barriers once, and then not the next time? > > The answer is md/raid1. If you remove a failed device and add a > > new device that doesn't support barriers, md/raid1 will notice and > > stop supporting barriers. > > In case you hadn't already guess, I don't like this behaviour at > all. It makes async I/O completion of barrier I/O an ugly, messy > business, and every place you do sync I/O completion you need to put > special error handling. That's unfortunately very true. It's an artifact of the sometimes problematic device capability discovery. > If this happens to md/raid1, then why can't it simply do a > blkdev_issue_flush, write, blkdev_issue_flush sequence to the device > that doesn't support barriers and then the md device *never changes > behaviour*. Next time the filesystem is mounted, it will turn off > barriers because they won't be supported.... Because if it doesn't support barriers, blkdev_issue_flush() wouldn't work either. At least that is the case for SATA/IDE, SCSI is somewhat different (and has somewhat other issues). > > - Should the various filesystems be "fixed" as suggested above? Is > > someone willing to do that? 
> > Alternate viewpoint - should the block layer be fixed so that the > filesystems only need to use one barrier API that provides static > behaviour for the life of the mount? blkdev_issue_flush() isn't part of the barrier API, and using it as a work-around for a device that has barrier "issues" is wrong for the reasons listed above. The DRAIN_FUA -> DRAIN_FLUSH automatic downgrade I mentioned above should be added, in which case blkdev_issue_flush() would never be needed (unless you want to do a data-less barrier, and we should probably add that specific functionality with an empty bio instead of providing an alternate way of doing that). -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 11:49 ` Jens Axboe @ 2007-05-25 14:49 ` Phillip Susi 2007-05-28 18:32 ` [dm-devel] " Jens Axboe 0 siblings, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-25 14:49 UTC (permalink / raw) To: device-mapper development Cc: linux-fsdevel, linux-raid, David Chinner, linux-kernel Jens Axboe wrote: > A barrier write will include a flush, but it may also use the FUA bit to > ensure data is on platter. So the only situation where a fallback from a > barrier to flush would be valid, is if the device lied and told you it > could do FUA but it could not and that is the reason why the barrier > write failed. If that is the case, the block layer should stop using FUA > and fallback to flush-write-flush. And if it does that, then there's > never a valid reason to switch from using barrier writes to > blkdev_issue_flush() since both methods would either both work or both > fail. IIRC, the FUA bit only forces THAT request to hit the platter before it is completed; it does not flush any previous requests still sitting in the write back queue. Because all io before the barrier must be on the platter as well, setting the FUA bit on the barrier request means you don't have to follow it with a flush, but you still have to precede it with a flush. > It's not block layer breakage, it's a device issue. How isn't it block layer breakage? If the device does not support barriers, isn't it the job of the block layer ( probably the scheduler ) to fall back to flush-write-flush? ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 14:49 ` Phillip Susi @ 2007-05-28 18:32 ` Jens Axboe 0 siblings, 0 replies; 102+ messages in thread From: Jens Axboe @ 2007-05-28 18:32 UTC (permalink / raw) To: Phillip Susi Cc: device-mapper development, David Chinner, linux-fsdevel, linux-raid, linux-kernel (dunny why you explicitly dropped me off the cc/to list when replying to my email, hence I missed it for 3 days) On Fri, May 25 2007, Phillip Susi wrote: > Jens Axboe wrote: > >A barrier write will include a flush, but it may also use the FUA bit to > >ensure data is on platter. So the only situation where a fallback from a > >barrier to flush would be valid, is if the device lied and told you it > >could do FUA but it could not and that is the reason why the barrier > >write failed. If that is the case, the block layer should stop using FUA > >and fallback to flush-write-flush. And if it does that, then there's > >never a valid reason to switch from using barrier writes to > >blkdev_issue_flush() since both methods would either both work or both > >fail. > > IIRC, the FUA bit only forces THAT request to hit the platter before it > is completed; it does not flush any previous requests still sitting in > the write back queue. Because all io before the barrier must be on the > platter as well, setting the FUA bit on the barrier request means you > don't have to follow it with a flush, but you still have to precede it > with a flush. I'm well aware of how FUA works, hence the barrier FUA implementation does flush and then write-fua. The win compared to flush-write-flush is just a saved command, essentially. > >It's not block layer breakage, it's a device issue. > > How isn't it block layer breakage? If the device does not support > barriers, isn't it the job of the block layer ( probably the scheduler ) > to fall back to flush-write-flush? The problem is flush not working, the block layer can't fix that for you obviously. If it's FUA not working, the block layer should fall back to flush-write-flush, as they are obviously functionally equivalent. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown 2007-05-25 11:15 ` David Chinner @ 2007-05-25 13:52 ` Stefan Bader 2007-05-28 1:37 ` Neil Brown 2007-05-25 15:11 ` Phillip Susi ` (5 subsequent siblings) 7 siblings, 1 reply; 102+ messages in thread From: Stefan Bader @ 2007-05-25 13:52 UTC (permalink / raw) To: linux-fsdevel Cc: linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Neil Brown 2007/5/25, Neil Brown <neilb@suse.de>: > > HOW DO MD or DM USE THIS > ======================== > > 1/ striping devices. > This includes md/raid0 md/linear dm-linear dm-stripe and probably > others. > > These devices can easily support blkdev_issue_flush by simply > calling blkdev_issue_flush on all component devices. > This ensures that all of the previous requests have been processed but does this guarantee they where successful? This might be too paranoid but if I understood the concept correctly the success of a barrier request should indicate success of all previous request between this barrier and the last one. > These devices would find it very hard to support BIO_RW_BARRIER. > Doing this would require keeping track of all in-flight requests > (which some, possibly all, of the above don't) and then: > When a BIO_RW_BARRIER request arrives: > wait for all pending writes to complete > call blkdev_issue_flush on all devices > issue the barrier write to the target device(s) > as BIO_RW_BARRIER, > if that is -EOPNOTSUP, re-issue, wait, flush. > I guess just keep a count of submitted requests and errors since the last barrier might be enough. As long as all of the underlying device support at least support a flush the dm device could pretend to support BIO_RW_BARRIER. > > dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down, > which means data may not be flushed correctly: the commit block > might be written to one device before a preceding block is > written to another device. > Hm, even worse: if the barrier requests accidentally end up on a device that does support barriers and another one on the map doesn't. Would any layer/fs above care to issue a flush call? > I think the best approach for this class of devices is to return > -EOPNOSUP. If the filesystem does the wait (which they all do > already) and the blkdev_issue_flush (which is easy to add), they > don't need to support BIO_RW_BARRIER. > Without any additional code these really should report -EOPNOTSUPP. If disaster strikes there is no way to make assumptions on the real state on disk. > 2/ Mirror devices. This includes md/raid1 and dm-raid1. > > These device can trivially implement blkdev_issue_flush much like > the striping devices, and can support BIO_RW_BARRIER to some > extent. > md/raid1 currently tries. I'm not sure about dm-raid1. > I fear this is more broken as with linear and stripe. There is no code to check the features of underlying devices and the request itself isn't sent forward but privately built ones (which do not have the barrier flag)... 3/ Multipath devices Requests are sent to the same device but one different paths. So at least with them the chance of one path supporting barriers but not another one seems little (as long as the paths do not use completely different transport layers). But passing on a request with the barrier flag also doesn't seem to be a good idea since previous requests can arrive at the device later. 
IMHO the best way to handle barriers for dm would be to add the sequence described to the generic mapping layer of dm (before calling the targets mapping function). There is already some sort of counting in-flight requests (suspend/resume needs that) and I guess the downgrade could also be rather simple. If a flush call to the target (mapped device) fails report -EOPNOTSUPP and stay that way (until next boot). > So: some questions to help encourage response: > > - Is the approach to barriers taken by md appropriate? Should dm > do the same? Who will do that? If my assumption about barrier semantics is true, then also md has to somehow make sure all previous requests have _successfully_ completed. In the mirror case I guess it is valid to report success if the mirror itself is in a clean state. Which is all previous requests (and the barrier) where successful on at least one mirror half and this state can be recovered. Question to dm-devel: What do people there think of the possible generic implementation in dm.c? > - The comment above blkdev_issue_flush says "Caller must run > wait_for_completion() on its own". What does that mean? > Guess this means it initiates a flush but doesn't wait for completion. So the caller must wait for the completion of the separate requests on its own, doesn't it? > - Are there other bit that we could handle better? > BIO_RW_FAILFAST? BIO_RW_SYNC? What exactly do they mean? > BIO_RW_FAILFAST: means low-level driver shouldn't do much (or no) error recovery. Mainly used by mutlipath targets to avoid long SCSI recovery. This should just be propagated when passing requests on. BIO_RW_SYNC: means this is a bio of a synchronous request. I don't know whether there are more uses to it but this at least causes queues to be flushed immediately instead of waiting for more requests for a short time. Should also just be passed on. Otherwise performance gets poor since something above will rather wait for the current request/bio to complete instead of sending more. Stefan Bader ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 13:52 ` Stefan Bader @ 2007-05-28 1:37 ` Neil Brown 2007-05-29 9:12 ` Stefan Bader 0 siblings, 1 reply; 102+ messages in thread From: Neil Brown @ 2007-05-28 1:37 UTC (permalink / raw) To: Stefan Bader Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner On Friday May 25, Stefan.Bader@de.ibm.com wrote: > 2007/5/25, Neil Brown <neilb@suse.de>: > > - Are there other bit that we could handle better? > > BIO_RW_FAILFAST? BIO_RW_SYNC? What exactly do they mean? > > > BIO_RW_FAILFAST: means low-level driver shouldn't do much (or no) > error recovery. Mainly used by mutlipath targets to avoid long SCSI > recovery. This should just be propagated when passing requests on. Is it "much" or "no"? Would it be reasonable to use this for reads from a non-degraded raid1? What about writes? What I would really like is some clarification on what sort of errors get retried, how often, and how much timeout there is.. And does the 'error' code returned in ->bi_end_io allow us to differentiate media errors from other errors yet? > > BIO_RW_SYNC: means this is a bio of a synchronous request. I don't > know whether there are more uses to it but this at least causes queues > to be flushed immediately instead of waiting for more requests for a > short time. Should also just be passed on. Otherwise performance gets > poor since something above will rather wait for the current > request/bio to complete instead of sending more. Yes, this one is pretty straight forward.. I mentioned it more as a reminder to my self that I really should support it in raid5 :-( NeilBrown ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 1:37 ` Neil Brown @ 2007-05-29 9:12 ` Stefan Bader 0 siblings, 0 replies; 102+ messages in thread From: Stefan Bader @ 2007-05-29 9:12 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Alasdair Kergon > > 2007/5/25, Neil Brown <neilb@suse.de>: > > BIO_RW_FAILFAST: means low-level driver shouldn't do much (or no) > > error recovery. Mainly used by mutlipath targets to avoid long SCSI > > recovery. This should just be propagated when passing requests on. > > Is it "much" or "no"? > Would it be reasonable to use this for reads from a non-degraded > raid1? What about writes? > This depends on the device driver's implementation. AFAIK there is no fix rule how to handle that flag exactly. The SCSI driver seems to omit internal recovery procedures but requests still can take as long as the SCSI request time-out. I am not sure of all internals. Maybe some error recovery is done as long as it shouldn't take very long. For the DASD driver on zSeries this flags will only affect situations when the driver decides there is no other way of succeeding. Recovery is still done. Using this flag was intended to move error handling to an upper layer in the device stack. For multipathing it is good to be able to map a request to another path instead of waiting until the SCSI layer finally would give up with one path. For a RAID1 this might cause requests to fail which would have been recovered. This might require more error handling in md. The error code as it is at this time doesn't say much in detail. I once saw patches (and there are comments about a path missing from Jens Axboe) to pass sense data (from SCSI) in the bio. I am not sure whether this was dropped for some reason or just is in the pipe. Jens? Stefan ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown 2007-05-25 11:15 ` David Chinner 2007-05-25 13:52 ` Stefan Bader @ 2007-05-25 15:11 ` Phillip Susi 2007-05-26 1:03 ` Andreas Dilger ` (4 subsequent siblings) 7 siblings, 0 replies; 102+ messages in thread From: Phillip Susi @ 2007-05-25 15:11 UTC (permalink / raw) To: device-mapper development Cc: linux-fsdevel, linux-raid, David Chinner, linux-kernel, Jens Axboe Neil Brown wrote: > There is no guarantee that a device can support BIO_RW_BARRIER - it is > always possible that a request will fail with EOPNOTSUPP. Why is it not the job of the block layer to translate for broken devices and send them a flush/write/flush? > These devices would find it very hard to support BIO_RW_BARRIER. > Doing this would require keeping track of all in-flight requests > (which some, possibly all, of the above don't) and then: The device mapper keeps track of in flight requests already. When switching tables it has to hold new requests and wait for in flight requests to complete before switching to the new table. When it gets a barrier request it just needs to do the same thing, only not switch tables. > I think the best approach for this class of devices is to return > -EOPNOSUP. If the filesystem does the wait (which they all do > already) and the blkdev_issue_flush (which is easy to add), they > don't need to support BIO_RW_BARRIER. Why? The personalities should just pass the BARRIER flag down to each underlying device, and the dm common code should wait for all in flight io to complete before sending the barrier to the personality. > For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to > the controller can be tagged as barriers), SCSI will use the > SYNCHRONIZE_CACHE command to flush the cache after the barrier > request (a bit like the filesystem calling blkdev_issue_flush, but at Don't you have to flush the cache BEFORE the barrier to ensure that previous IO is committed first, THEN the barrier write? ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown ` (2 preceding siblings ...) 2007-05-25 15:11 ` Phillip Susi @ 2007-05-26 1:03 ` Andreas Dilger 2007-05-26 10:27 ` Tejun Heo ` (3 subsequent siblings) 7 siblings, 0 replies; 102+ messages in thread From: Andreas Dilger @ 2007-05-26 1:03 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner On May 25, 2007 17:58 +1000, Neil Brown wrote: > These devices would find it very hard to support BIO_RW_BARRIER. > Doing this would require keeping track of all in-flight requests > (which some, possibly all, of the above don't) and then: > When a BIO_RW_BARRIER request arrives: > wait for all pending writes to complete > call blkdev_issue_flush on all devices > issue the barrier write to the target device(s) > as BIO_RW_BARRIER, > if that is -EOPNOTSUP, re-issue, wait, flush. We noticed when testing the SLES10 kernel (which has barriers enabled by default) that ext3 write throughput went from about 170MB/s to about 130MB/s (on high-end RAID storage using no-op scheduler). The reason (as far as we could tell) is that the barriers are implemented by flushing and waiting for all previosly submitted IOs to finish, but all that ext3/jbd really care about is that the journal blocks are safely on disk. Since the journal blocks are only a small fraction of the total IO in flight, the barrier + write cache ends up being a lot worse than just doing synchronous IO with the write cache disabled because no new IO can be submitted past the barrier, and since that IO is large and contiguous it might complete much faster than the scattered metadata updates that are also being checkpointed to disk from the previous transactions. With jbd there can be both a running and a committing transaction, and multiple checkpointing transactions, and the use of barriers breaks this important optimization. If ext3 used an external journal this problem would be avoided, but then there isn't really a need for barriers in the first place, since the jbd code already will handle the wait for the commit block itself. We've got a pretty-much complete version of the ext3 journal checksumming patch that avoids the need to do the pre-commit barrier, since the checksum can verify at recovery time whether all of the transaction's blocks made it to disk or not (which is what the commit block is all about in the end). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown ` (3 preceding siblings ...) 2007-05-26 1:03 ` Andreas Dilger @ 2007-05-26 10:27 ` Tejun Heo 2007-05-28 1:30 ` Neil Brown ` (2 subsequent siblings) 7 siblings, 0 replies; 102+ messages in thread From: Tejun Heo @ 2007-05-26 10:27 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner Hello, Neil Brown. Please cc me on blkdev barriers and, if you haven't yet, reading Documentation/block/barrier.txt can be helpful too. Neil Brown wrote: [--snip--] > 1/ SAFE. With a SAFE device, there is no write-behind cache, or if > there is it is non-volatile. Once a write completes it is > completely safe. Such a device does not require barriers > or ->issue_flush_fn, and can respond to them either by a > no-op or with -EOPNOTSUPP (the former is preferred). > > 2/ FLUSHABLE. > A FLUSHABLE device may have a volatile write-behind cache. > This cache can be flushed with a call to blkdev_issue_flush. > It may not support barrier requests. > > 3/ BARRIER. > A BARRIER device supports both blkdev_issue_flush and > BIO_RW_BARRIER. Either may be used to synchronise any > write-behind cache to non-volatile storage (media). > > Handling of SAFE and FLUSHABLE devices is essentially the same and can > work on a BARRIER device. The BARRIER device has the option of more > efficient handling. Actually, all above three are handled by blkdev flush code. > How does a filesystem use this? > =============================== > [--snip--] > 2/ Set the BIO_RW_BARRIER bit in the write request for the commit > block. > (This is more efficient on BARRIER). This really should be enough. > HOW DO MD or DM USE THIS > ======================== > > 1/ striping devices. > This includes md/raid0 md/linear dm-linear dm-stripe and probably > others. > > These devices can easily support blkdev_issue_flush by simply > calling blkdev_issue_flush on all component devices. > > These devices would find it very hard to support BIO_RW_BARRIER. > Doing this would require keeping track of all in-flight requests > (which some, possibly all, of the above don't) and then: > When a BIO_RW_BARRIER request arrives: > wait for all pending writes to complete > call blkdev_issue_flush on all devices > issue the barrier write to the target device(s) > as BIO_RW_BARRIER, > if that is -EOPNOTSUP, re-issue, wait, flush. Hmm... What do you think about introducing zero-length BIO_RW_BARRIER for this case? > 2/ Mirror devices. This includes md/raid1 and dm-raid1. > > These device can trivially implement blkdev_issue_flush much like > the striping devices, and can support BIO_RW_BARRIER to some > extent. > md/raid1 currently tries. I'm not sure about dm-raid1. > > md/raid1 determines if the underlying devices can handle > BIO_RW_BARRIER. If any cannot, it rejects such requests (EOPNOTSUP) > itself. > If all underlying devices do appear to support barriers, md/raid1 > will pass a barrier-write down to all devices. > The difficulty comes if it fails on one device, but not all > devices. In this case it is not clear what to do. Failing the > request is a lie, because some data has been written (possible too > early). Succeeding the request (after re-submitting the failed > requests) is also a lie as the barrier wasn't really honoured. 
> md/raid1 currently takes the latter approach, but will only do it > once - after that it fails all barrier requests. > > Hopefully this is unlikely to happen. What device would work > correctly with barriers once, and then not the next time? > The answer is md/raid1. If you remove a failed device and add a > new device that doesn't support barriers, md/raid1 will notice and > stop supporting barriers. > If md/raid1 can change from supporting barrier to not, then maybe > some other device could too? > > I'm not sure what to do about this - maybe just ignore it... That sounds good. :-) > 3/ Other modules > > Other md and dm modules (raid5, mpath, crypt) do not add anything > interesting to the above. Either handling BIO_RW_BARRIER is > trivial, or extremely difficult. > > HOW DO LOW LEVEL DEVICES HANDLE THIS > ==================================== > > This is part of the picture that I haven't explored greatly. My > feeling is that most if not all devices support blkdev_issue_flush > properly, and support barriers reasonably well providing that the > hardware does. > There in an exception I recently found though. > For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to > the controller can be tagged as barriers), SCSI will use the > SYNCHRONIZE_CACHE command to flush the cache after the barrier > request (a bit like the filesystem calling blkdev_issue_flush, but at > a lower level). However it does this without setting the SYNC_NV bit. > This means that a device with a non-volatile cache will be required -- > needlessly -- to flush that cache to media. Yeah, it probably needs updating but some devices might react badly too. > So: some questions to help encourage response: > > - Is the above substantial correct? Totally correct? > - Should the various filesystems be "fixed" as suggested above? Is > someone willing to do that? I don't think adding the complexity to each and every FS is necessary. Except for broken devices, the only reason barrier fails is when the device lied about its capability - either about ordered tag or FUA. It would be far nicer if we can do proper capability testing during device initialization but unfortunately barriers are writes and we can't test without side effects. While developing the current flush code, I had automatic fallback mechanism but removed it before submitting because 1. I wasn't sure whether it would be necessary and 2. it couldn't handle fall back from ordered tag properly (because ordered tag doesn't guarantee failure of latter requests when an earlier one fails, you're already too late when you get the error report from the device). This can be solved by running the first sequence in more restrictive way (ie. we do capability probing at the first barrier from FS). So, if barrier failure due to devices lying about their capability is an actual problem (ATA hasn't seen much if any), it can be solved inside block layer proper. No need to update filesystems. Just issuing barrier when ordering is needed should be enough. If there have been actual reports of these failures, please point me to them. Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown ` (4 preceding siblings ...) 2007-05-26 10:27 ` Tejun Heo @ 2007-05-28 1:30 ` Neil Brown 2007-05-28 2:45 ` David Chinner ` (4 more replies) 2007-05-28 11:17 ` [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Nikita Danilov 2007-05-28 14:43 ` Bill Davidsen 7 siblings, 5 replies; 102+ messages in thread From: Neil Brown @ 2007-05-28 1:30 UTC (permalink / raw) To: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo Thanks everyone for your input. There was some very valuable observations in the various emails. I will try to pull most of it together and bring out what seem to be the important points. 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. This is certainly a very attractive position - it makes the interface cleaner and makes life easier for filesystems and other clients of the block interface. Currently filesystems handle -EOPNOTSUP by a/ resubmitting the request without the BARRIER (after waiting for earlier requests to complete) and b/ possibly printing an error message to the kernel logs. The block layer can do both of these just as easily and it does make sense to do it there. md/dm modules could keep count of requests as has been suggested (though that would be a fairly big change for raid0 as it currently doesn't know when a request completes - bi_endio goes directly to the filesystem). However I think the idea of a zero-length BIO_RW_BARRIER would be a good option. raid0 could send one of these down each device, and when they all return, the barrier request can be sent to it's target device(s). I think this is a worthy goal that we should work towards. 2/ Maybe barriers provide stronger semantics than are required. All write requests are synchronised around a barrier write. This is often more than is required and apparently can cause a measurable slowdown. Also the FUA for the actual commit write might not be needed. It is important for consistency that the preceding writes are in safe storage before the commit write, but it is not so important that the commit write is immediately safe on storage. That isn't needed until a 'sync' or 'fsync' or similar. One possible alternative is: - writes can overtake barriers, but barrier cannot overtake writes. - flush before the barrier, not after. This is considerably weaker, and hence cheaper. But I think it is enough for all filesystems (providing it is still an option to call blkdev_issue_flush on 'fsync'). Another alternative would be to tag each bio was being in a particular barrier-group. Then bio's in different groups could overtake each other in either direction, but a BARRIER request must be totally ordered w.r.t. other requests in the barrier group. This would require an extra bio field, and would give the filesystem more appearance of control. I'm not yet sure how much it would really help... It would allow us to set FUA on all bios with a non-zero barrier-group. That would mean we don't have to flush the entire cache, just those blocks that are critical.... but I'm still not sure it's a good idea. Of course, these weaker rules would only apply inside the elevator. Once the request goes to the device we need to work with what the device provides, which probably means total-ordering around the barrier. 
I think this requires more discussion before a way forward is clear. 3/ Do we need explicit control of the 'ordered' mode? Consider a SCSI device that has NV RAM cache. mode_sense reports that write-back is enabled, so _FUA or _FLUSH will be used. But as it is *NV* ram, QUEUE_ORDER_DRAIN is really the best mode. But it seems there is no way to query this information. Using _FLUSH causes the NVRAM to be flushed to media which is a terrible performance problem. Setting SYNC_NV doesn't work on the particular device in question. We currently tell customers to mount with -o nobarriers, but that really feels like the wrong solution. We should be telling the scsi device "don't flush". An advantage of 'nobarriers' is it can go in /etc/fstab. Where would you record that a SCSI drive should be set to QUEUE_ORDERD_DRAIN ?? I think the implementation priorities here are: 1/ implement a zero-length BIO_RW_BARRIER option. 2/ Use it (or otherwise) to make all dm and md modules handle barriers (and loop?). 3/ Devise and implement appropriate fall-backs with-in the block layer so that -EOPNOTSUP is never returned. 4/ Remove unneeded cruft from filesystems (and elsewhere). Comments? Thanks, NeilBrown ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 1:30 ` Neil Brown @ 2007-05-28 2:45 ` David Chinner 2007-05-28 2:57 ` Neil Brown ` (2 more replies) 2007-05-28 9:29 ` Tejun Heo ` (3 subsequent siblings) 4 siblings, 3 replies; 102+ messages in thread From: David Chinner @ 2007-05-28 2:45 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: > > Thanks everyone for your input. There was some very valuable > observations in the various emails. > I will try to pull most of it together and bring out what seem to be > the important points. > > > 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. Sounds good to me, but how do we test to see if the underlying device supports barriers? Do we just assume that they do and only change behaviour if -o nobarrier is specified in the mount options? > 2/ Maybe barriers provide stronger semantics than are required. > > All write requests are synchronised around a barrier write. This is > often more than is required and apparently can cause a measurable > slowdown. > > Also the FUA for the actual commit write might not be needed. It is > important for consistency that the preceding writes are in safe > storage before the commit write, but it is not so important that the > commit write is immediately safe on storage. That isn't needed until > a 'sync' or 'fsync' or similar. The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk and the current barrier behaviour gives us that. > One possible alternative is: > - writes can overtake barriers, but barrier cannot overtake writes. No, that breaks the above usage of a barrier.... > - flush before the barrier, not after. > > This is considerably weaker, and hence cheaper. But I think it is > enough for all filesystems (providing it is still an option to call > blkdev_issue_flush on 'fsync'). No, not enough for XFS. > Another alternative would be to tag each bio was being in a > particular barrier-group. Then bio's in different groups could > overtake each other in either direction, but a BARRIER request must > be totally ordered w.r.t. other requests in the barrier group. > This would require an extra bio field, and would give the filesystem > more appearance of control. I'm not yet sure how much it would > really help... And that assumes the filesystem is tracking exact dependencies between I/Os. Such a mechanism would probably require filesystems to be redesigned to use this, but I can see how it would be useful for doing things like ensuring ordering between just an inode and it's data writes. What would the overhead of having to support several hundred thousand different barrier groups be (i.e. one per dirty inode in a system)? > I think the implementation priorities here are: Depending on the answer to my first question: 0/ implement a specific test for filesystems to run at mount time to determine if barriers are supported or not. > 1/ implement a zero-length BIO_RW_BARRIER option. > 2/ Use it (or otherwise) to make all dm and md modules handle > barriers (and loop?). 
> 3/ Devise and implement appropriate fall-backs with-in the block layer > so that -EOPNOTSUP is never returned. > 4/ Remove unneeded cruft from filesystems (and elsewhere). Sounds like a good start. ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 2:45 ` David Chinner @ 2007-05-28 2:57 ` Neil Brown 2007-05-28 4:29 ` David Chinner 2007-05-28 4:48 ` Timothy Shimmin 2007-05-29 20:03 ` Phillip Susi 2 siblings, 1 reply; 102+ messages in thread From: Neil Brown @ 2007-05-28 2:57 UTC (permalink / raw) To: David Chinner Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo On Monday May 28, dgc@sgi.com wrote: > On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: > > > > Thanks everyone for your input. There was some very valuable > > observations in the various emails. > > I will try to pull most of it together and bring out what seem to be > > the important points. > > > > > > 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. > > Sounds good to me, but how do we test to see if the underlying > device supports barriers? Do we just assume that they do and > only change behaviour if -o nobarrier is specified in the mount > options? > What exactly do you want to know, and why do you care? The idea is that every "struct block_device" supports barriers. If the underlying hardware doesn't support them directly, then they get simulated by draining the queue and issuing a flush. Theoretically there could be devices which have a write-back cache that cannot be flushed, and you couldn't implement barriers on such a device. So throw it out and buy another? As far as I can tell, the only thing XFS does differently with devices that don't support barriers is that it prints a warning message to the kernel logs. If the underlying device printed the message when it detected that barriers couldn't be supported, XFS wouldn't need to care at all. NeilBrown ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 2:57 ` Neil Brown @ 2007-05-28 4:29 ` David Chinner 2007-05-31 0:46 ` Neil Brown 0 siblings, 1 reply; 102+ messages in thread From: David Chinner @ 2007-05-28 4:29 UTC (permalink / raw) To: Neil Brown Cc: Tejun Heo, David Chinner, linux-kernel, linux-raid, dm-devel, Jens Axboe, linux-fsdevel, Andreas Dilger On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote: > On Monday May 28, dgc@sgi.com wrote: > > On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: > > > Thanks everyone for your input. There was some very valuable > > > observations in the various emails. > > > I will try to pull most of it together and bring out what seem to be > > > the important points. > > > > > > 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. > > > > Sounds good to me, but how do we test to see if the underlying > > device supports barriers? Do we just assume that they do and > > only change behaviour if -o nobarrier is specified in the mount > > options? > > What exactly do you want to know, and why do you care? If someone explicitly mounts "-o barrier" and the underlying device cannot do it, then we want to issue a warning or reject the mount. > The idea is that every "struct block_device" supports barriers. If the > underlying hardware doesn't support them directly, then they get > simulated by draining the queue and issuing a flush. Ok. But you also seem to be implying that there will be devices that cannot support barriers. Even if all devices do eventually support barriers, it may take some time before we reach that goal. Why not start by making it easy to determine what the capabilities of each device are. This can then be removed once we reach the holy grail.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 4:29 ` David Chinner @ 2007-05-31 0:46 ` Neil Brown 2007-05-31 0:57 ` Alasdair G Kergon 2007-05-31 1:07 ` Alasdair G Kergon 0 siblings, 2 replies; 102+ messages in thread From: Neil Brown @ 2007-05-31 0:46 UTC (permalink / raw) To: David Chinner Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo On Monday May 28, dgc@sgi.com wrote: > On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote: > > What exactly do you want to know, and why do you care? > > If someone explicitly mounts "-o barrier" and the underlying device > cannot do it, then we want to issue a warning or reject the > mount. I guess that makes sense. But apparently you cannot tell what a device supports until you write to it. So maybe you need to write some metadata as a barrier, then ask the device what its barrier status is. The options might be: YES - barriers are fully handled NO - best effort, but due to missing device features, it might not work DISABLED - admin has requested that barriers be ignored. ?? > > > The idea is that every "struct block_device" supports barriers. If the > > underlying hardware doesn't support them directly, then they get > > simulated by draining the queue and issuing a flush. > > Ok. But you also seem to be implying that there will be devices that > cannot support barriers. It seems there will always be hardware that doesn't meet specs. If a device doesn't support SYNCHRONIZE_CACHE or FUA, then implementing barriers all the way to the media would be hard. > > Even if all devices do eventually support barriers, it may take some > time before we reach that goal. Why not start by making it easy to > determine what the capabilities of each device are. This can then be > removed once we reach the holy grail.... I'd rather not add something that we plan to remove. We currently have -EOPNOTSUP. I don't think there is much point having more than that. I would really like to get to the stage where -EOPNOTSUP is never returned. If a filesystem cares, it could 'ask' as suggested above. What would be a good interface for asking? What if the truth changes (as can happen with md or dm)? NeilBrown ^ permalink raw reply [flat|nested] 102+ messages in thread
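A purely hypothetical sketch of the three-valued status described above; every name here is invented for illustration and nothing like it exists in the kernel being discussed.

enum barrier_status {
	BARRIER_YES,		/* barriers are fully handled */
	BARRIER_NO,		/* best effort, may silently degrade */
	BARRIER_DISABLED,	/* administratively turned off */
};

/* could be exported per queue, e.g. as a sysfs attribute or ioctl */
static const char *barrier_status_name(enum barrier_status s)
{
	switch (s) {
	case BARRIER_YES:	return "yes";
	case BARRIER_NO:	return "no";
	case BARRIER_DISABLED:	return "disabled";
	}
	return "unknown";
}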
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 0:46 ` Neil Brown @ 2007-05-31 0:57 ` Alasdair G Kergon 2007-05-31 1:07 ` Alasdair G Kergon 1 sibling, 0 replies; 102+ messages in thread From: Alasdair G Kergon @ 2007-05-31 0:57 UTC (permalink / raw) To: device-mapper development Cc: Tejun Heo, David Chinner, linux-kernel, linux-raid, Jens Axboe, linux-fsdevel, Andreas Dilger On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: > What if the truth changes (as can happen with md or dm)? You get notified in endio() that the barrier had to be emulated? Alasdair -- agk@redhat.com ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 0:46 ` Neil Brown 2007-05-31 0:57 ` Alasdair G Kergon @ 2007-05-31 1:07 ` Alasdair G Kergon 2007-05-31 1:11 ` David Chinner 1 sibling, 1 reply; 102+ messages in thread From: Alasdair G Kergon @ 2007-05-31 1:07 UTC (permalink / raw) To: device-mapper development Cc: Tejun Heo, David Chinner, linux-kernel, linux-raid, Jens Axboe, linux-fsdevel, Andreas Dilger On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: > If a filesystem cares, it could 'ask' as suggested above. > What would be a good interface for asking? XFS already tests: bd_disk->queue->ordered == QUEUE_ORDERED_NONE Alasdair -- agk@redhat.com ^ permalink raw reply [flat|nested] 102+ messages in thread
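For context, the check Alasdair refers to amounts to something like the fragment below. The wrapper function is invented here; the q->ordered field and the QUEUE_ORDERED_NONE constant belong to the 2.6-era block layer, so treat the details as version-dependent.

static int blkdev_claims_barrier_support(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	/* QUEUE_ORDERED_NONE means the driver never told the block layer
	 * how to order/flush, i.e. barrier requests will be refused. */
	return q && q->ordered != QUEUE_ORDERED_NONE;
}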
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 1:07 ` Alasdair G Kergon @ 2007-05-31 1:11 ` David Chinner 0 siblings, 0 replies; 102+ messages in thread From: David Chinner @ 2007-05-31 1:11 UTC (permalink / raw) To: device-mapper development, David Chinner, Tejun Heo, linux-kernel, linux-raid, Jens Axboe, linux-fsdevel, Andreas Dilger On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote: > On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: > > If a filesystem cares, it could 'ask' as suggested above. > > What would be a good interface for asking? > > XFS already tests: > bd_disk->queue->ordered == QUEUE_ORDERED_NONE The side effects of removing that check is what started this whole discussion. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 2:45 ` David Chinner 2007-05-28 2:57 ` Neil Brown @ 2007-05-28 4:48 ` Timothy Shimmin 2007-05-29 6:45 ` Jeremy Higdon 2007-05-29 20:03 ` Phillip Susi 2 siblings, 1 reply; 102+ messages in thread From: Timothy Shimmin @ 2007-05-28 4:48 UTC (permalink / raw) To: David Chinner, Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo Hi, --On 28 May 2007 12:45:59 PM +1000 David Chinner <dgc@sgi.com> wrote: > On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: >> >> Thanks everyone for your input. There was some very valuable >> observations in the various emails. >> I will try to pull most of it together and bring out what seem to be >> the important points. >> >> >> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. > > Sounds good to me, but how do we test to see if the underlying > device supports barriers? Do we just assume that they do and > only change behaviour if -o nobarrier is specified in the mount > options? > I would assume so. Then, when the block layer finds that they aren't supported and does non-barrier ones, it could report a message. We (XFS) can't take much other course of action, I guess, and we aren't doing much now other than not requesting them anymore and printing an error message. >> 2/ Maybe barriers provide stronger semantics than are required. >> >> All write requests are synchronised around a barrier write. This is >> often more than is required and apparently can cause a measurable >> slowdown. >> >> Also the FUA for the actual commit write might not be needed. It is >> important for consistency that the preceding writes are in safe >> storage before the commit write, but it is not so important that the >> commit write is immediately safe on storage. That isn't needed until >> a 'sync' or 'fsync' or similar. > > The use of barriers in XFS assumes the commit write to be on stable > storage before it returns. One of the ordering guarantees that we > need is that the transaction (commit write) is on disk before the > metadata block containing the change in the transaction is written > to disk and the current barrier behaviour gives us that. > Yep, and that one is what we want the FUA for - for the write into the log. I'm taking it that the FUA write will just guarantee that that particular write has made it to disk on i/o completion (and no write cache flush is done). The other XFS constraint is that we need to know when the metadata hits the disk so that we can move the tail of the log. And that is what we are effectively getting from the pre-write-flush part of the barrier. It would ensure that any metadata not yet to disk would be on disk before we overwrite the tail of the log. If we could determine cases when we don't have to worry about overwriting the tail of the log, then it would be good if we could just do FUA writes for constraint 1 above. Is that possible? --Tim ^ permalink raw reply [flat|nested] 102+ messages in thread
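To make the two constraints above concrete, here is a toy sequence showing where the pre-flush and the FUA commit write each earn their keep. This is not XFS code; device_flush_cache(), write_fua() and write_async() are hypothetical helpers.

static void toy_commit_transaction(struct toy_dev *dev,
				   struct toy_buf *log_commit,
				   struct toy_buf *changed_metadata)
{
	/* Constraint 2: metadata already issued must be stable before the
	 * tail of the log can safely be overwritten -- this is what the
	 * pre-write-flush half of the current barrier provides. */
	device_flush_cache(dev);

	/* Constraint 1: the commit record must be stable before the
	 * metadata it describes is written back in place -- a FUA write
	 * gives that per-command guarantee without another full flush. */
	write_fua(dev, log_commit);

	/* Only now may writeback of the changed metadata be queued. */
	write_async(dev, changed_metadata);
}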
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 4:48 ` Timothy Shimmin @ 2007-05-29 6:45 ` Jeremy Higdon 0 siblings, 0 replies; 102+ messages in thread From: Jeremy Higdon @ 2007-05-29 6:45 UTC (permalink / raw) To: Timothy Shimmin Cc: David Chinner, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo On Mon, May 28, 2007 at 02:48:45PM +1000, Timothy Shimmin wrote: > I'm taking it that the FUA write will just guarantee that that > particular write has made it to disk on i/o completion > (and no write cache flush is done). Correct. It only applies to that one write command. jeremy ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 2:45 ` David Chinner 2007-05-28 2:57 ` Neil Brown 2007-05-28 4:48 ` Timothy Shimmin @ 2007-05-29 20:03 ` Phillip Susi 2007-05-29 23:48 ` David Chinner 2 siblings, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-29 20:03 UTC (permalink / raw) To: David Chinner Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Stefan Bader, Andreas Dilger, Tejun Heo David Chinner wrote: > Sounds good to me, but how do we test to see if the underlying > device supports barriers? Do we just assume that they do and > only change behaviour if -o nobarrier is specified in the mount > options? The idea is that ALL block devices will support barriers; if the underlying driver doesn't, then the block layer will work around it. > The use of barriers in XFS assumes the commit write to be on stable > storage before it returns. One of the ordering guarantees that we > need is that the transaction (commit write) is on disk before the > metadata block containing the change in the transaction is written > to disk and the current barrier behaviour gives us that. Barrier != synchronous write, so if XFS relies on that block being on the media when the request is completed, then it is broken. It should only care that the ordering of log-data-log is maintained, not exactly when each specific request completes. ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-29 20:03 ` Phillip Susi @ 2007-05-29 23:48 ` David Chinner 2007-05-30 0:01 ` david 2007-05-30 16:45 ` Phillip Susi 0 siblings, 2 replies; 102+ messages in thread From: David Chinner @ 2007-05-29 23:48 UTC (permalink / raw) To: Phillip Susi Cc: David Chinner, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Stefan Bader, Andreas Dilger, Tejun Heo On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: > David Chinner wrote: > >The use of barriers in XFS assumes the commit write to be on stable > >storage before it returns. One of the ordering guarantees that we > >need is that the transaction (commit write) is on disk before the > >metadata block containing the change in the transaction is written > >to disk and the current barrier behaviour gives us that. > > Barrier != synchronous write, Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you indicate that the barrier write is guaranteed to be on stable storage when it returns. > so if XFS relies on that block being on > the media when the request is completed, then it is broken. XFS relies on the block being stable before any other write goes to disk. That is the semantic that the barrier I/Os currently have. How that is implemented in the device is irrelevant to me, but if I issue a barrier I/O, I do not expect *any* I/O to be reordered around it. > It should > only care that the ordering of log-data-log is maintained, not exactly > when each specific request completes. Yes, and that is provided to XFS by the fact that barrier I/Os are full barriers.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-29 23:48 ` David Chinner @ 2007-05-30 0:01 ` david 2007-05-30 6:17 ` David Chinner 2007-05-30 16:45 ` Phillip Susi 1 sibling, 1 reply; 102+ messages in thread From: david @ 2007-05-30 0:01 UTC (permalink / raw) To: David Chinner Cc: Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Stefan Bader, Andreas Dilger, Tejun Heo On Wed, 30 May 2007, David Chinner wrote: > On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: >> David Chinner wrote: >>> The use of barriers in XFS assumes the commit write to be on stable >>> storage before it returns. One of the ordering guarantees that we >>> need is that the transaction (commit write) is on disk before the >>> metadata block containing the change in the transaction is written >>> to disk and the current barrier behaviour gives us that. >> >> Barrier != synchronous write, > > Of course. FYI, XFS only issues barriers on *async* writes. > > But barrier semantics - as far as they've been described by everyone > but you indicate that the barrier write is guaranteed to be on stable > storage when it returns. this doesn't match what I have seen wtih barriers it's perfectly legal to have the following sequence of events 1. app writes block 10 to OS 2. app writes block 4 to OS 3. app writes barrier to OS 4. app writes block 5 to OS 5. app writes block 20 to OS 6. OS writes block 4 to disk drive 7. OS writes block 10 to disk drive 8. OS writes barrier to disk drive 9. OS writes block 5 to disk drive 10. OS writes block 20 to disk drive 11. disk drive writes block 10 to platter 12. disk drive writes block 4 to platter 13. disk drive writes block 20 to platter 14. disk drive writes block 5 to platter there is nothing that says that when the app finishes step #3 that the OS has even sent the data to the drive, let alone that the drive has flushed it to a platter if the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before step #9, 13, 14 David Lang ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 0:01 ` david @ 2007-05-30 6:17 ` David Chinner 2007-05-30 8:55 ` Stefan Bader 2007-05-30 16:52 ` david 0 siblings, 2 replies; 102+ messages in thread From: David Chinner @ 2007-05-30 6:17 UTC (permalink / raw) To: david Cc: David Chinner, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Stefan Bader, Andreas Dilger, Tejun Heo On Tue, May 29, 2007 at 05:01:24PM -0700, david@lang.hm wrote: > On Wed, 30 May 2007, David Chinner wrote: > > >On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: > >>David Chinner wrote: > >>>The use of barriers in XFS assumes the commit write to be on stable > >>>storage before it returns. One of the ordering guarantees that we > >>>need is that the transaction (commit write) is on disk before the > >>>metadata block containing the change in the transaction is written > >>>to disk and the current barrier behaviour gives us that. > >> > >>Barrier != synchronous write, > > > >Of course. FYI, XFS only issues barriers on *async* writes. > > > >But barrier semantics - as far as they've been described by everyone > >but you indicate that the barrier write is guaranteed to be on stable > >storage when it returns. > > this doesn't match what I have seen > > wtih barriers it's perfectly legal to have the following sequence of > events > > 1. app writes block 10 to OS > 2. app writes block 4 to OS > 3. app writes barrier to OS > 4. app writes block 5 to OS > 5. app writes block 20 to OS hmmmmm - applications can't issue barriers to the filesystem. However, if you consider the barrier to be an "fsync()" for example, then it's still the filesystem that is issuing the barrier and there's a block that needs to be written that is associated with that barrier (either an inode or a transaction commit) that needs to be on stable storage before the filesystem returns to userspace. > 6. OS writes block 4 to disk drive > 7. OS writes block 10 to disk drive > 8. OS writes barrier to disk drive > 9. OS writes block 5 to disk drive > 10. OS writes block 20 to disk drive Replace OS with filesystem, and combine 7+8 together - we don't have zero-length barriers and hence they are *always* associated with a write to a certain block on disk. i.e.: 1. FS writes block 4 to disk drive 2. FS writes block 10 to disk drive 3. FS writes *barrier* block X to disk drive 4. FS writes block 5 to disk drive 5. FS writes block 20 to disk drive The order that these are expected by the filesystem to hit stable storage are: 1. block 4 and 10 on stable storage in any order 2. barrier block X on stable storage 3. block 5 and 20 on stable storage in any order The point I'm trying to make is that in XFS, block 5 and 20 cannot be allowed to hit the disk before the barrier block because they have strict order dependency on block X being stable before them, just like block X has strict order dependency that block 4 and 10 must be stable before we start the barrier block write. > 11. disk drive writes block 10 to platter > 12. disk drive writes block 4 to platter > 13. disk drive writes block 20 to platter > 14. disk drive writes block 5 to platter > if the disk drive doesn't support barriers then step #8 becomes 'issue > flush' and steps 11 and 12 take place before step #9, 13, 14 No, you need a flush on either side of the block X write to maintain the same semantics as barrier writes currently have. 
We have filesystems that require barriers to prevent reordering of writes in both directions and to ensure that the block associated with the barrier is on stable storage when I/O completion is signalled. The existing barrier implementation (where it works) provides these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
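A minimal sketch of the flush-write-flush emulation described above, for a device whose volatile write cache can only be flushed. All helpers here are hypothetical; the point is the sequencing around the barrier block ("block X").

static int emulate_barrier_write(struct toy_dev *dev, struct toy_req *block_x)
{
	int err;

	/* all earlier writes must be issued and completed first */
	drain_queued_writes(dev);

	/* ...and be on the platter before block X goes out */
	err = issue_cache_flush(dev);
	if (err)
		return err;

	/* the barrier write itself */
	err = submit_and_wait(dev, block_x);
	if (err)
		return err;

	/* block X must be stable before any later write is started */
	return issue_cache_flush(dev);
}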
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 6:17 ` David Chinner @ 2007-05-30 8:55 ` Stefan Bader 2007-05-30 16:52 ` david 1 sibling, 0 replies; 102+ messages in thread From: Stefan Bader @ 2007-05-30 8:55 UTC (permalink / raw) To: David Chinner Cc: david, Tejun Heo, linux-kernel, linux-raid, dm-devel, Jens Axboe, linux-fsdevel, Andreas Dilger > The order that these are expected by the filesystem to hit stable > storage are: > > 1. block 4 and 10 on stable storage in any order > 2. barrier block X on stable storage > 3. block 5 and 20 on stable storage in any order > > The point I'm trying to make is that in XFS, block 5 and 20 cannot > be allowed to hit the disk before the barrier block because they > have strict order dependency on block X being stable before them, > just like block X has strict order dependency that block 4 and 10 > must be stable before we start the barrier block write. > That would be the exactly how I understand Documentation/block/barrier.txt: "In other words, I/O barrier requests have the following two properties. 1. Request ordering ... 2. Forced flushing to physical medium" "So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order." Stefan ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 6:17 ` David Chinner 2007-05-30 8:55 ` Stefan Bader @ 2007-05-30 16:52 ` david 2007-05-31 0:20 ` David Chinner 1 sibling, 1 reply; 102+ messages in thread From: david @ 2007-05-30 16:52 UTC (permalink / raw) To: David Chinner Cc: Tejun Heo, linux-kernel, linux-raid, dm-devel, Jens Axboe, linux-fsdevel, Andreas Dilger On Wed, 30 May 2007, David Chinner wrote: > On Tue, May 29, 2007 at 05:01:24PM -0700, david@lang.hm wrote: >> On Wed, 30 May 2007, David Chinner wrote: >> >>> On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: >>>> David Chinner wrote: >>>>> The use of barriers in XFS assumes the commit write to be on stable >>>>> storage before it returns. One of the ordering guarantees that we >>>>> need is that the transaction (commit write) is on disk before the >>>>> metadata block containing the change in the transaction is written >>>>> to disk and the current barrier behaviour gives us that. >>>> >>>> Barrier != synchronous write, >>> >>> Of course. FYI, XFS only issues barriers on *async* writes. >>> >>> But barrier semantics - as far as they've been described by everyone >>> but you indicate that the barrier write is guaranteed to be on stable >>> storage when it returns. >> >> this doesn't match what I have seen >> >> wtih barriers it's perfectly legal to have the following sequence of >> events >> >> 1. app writes block 10 to OS >> 2. app writes block 4 to OS >> 3. app writes barrier to OS >> 4. app writes block 5 to OS >> 5. app writes block 20 to OS > > hmmmmm - applications can't issue barriers to the filesystem. > However, if you consider the barrier to be an "fsync()" for example, > then it's still the filesystem that is issuing the barrier and > there's a block that needs to be written that is associated with > that barrier (either an inode or a transaction commit) that needs to > be on stable storage before the filesystem returns to userspace. > >> 6. OS writes block 4 to disk drive >> 7. OS writes block 10 to disk drive >> 8. OS writes barrier to disk drive >> 9. OS writes block 5 to disk drive >> 10. OS writes block 20 to disk drive > > Replace OS with filesystem, and combine 7+8 together - we don't have > zero-length barriers and hence they are *always* associated with a > write to a certain block on disk. i.e.: > > 1. FS writes block 4 to disk drive > 2. FS writes block 10 to disk drive > 3. FS writes *barrier* block X to disk drive > 4. FS writes block 5 to disk drive > 5. FS writes block 20 to disk drive > > The order that these are expected by the filesystem to hit stable > storage are: > > 1. block 4 and 10 on stable storage in any order > 2. barrier block X on stable storage > 3. block 5 and 20 on stable storage in any order > > The point I'm trying to make is that in XFS, block 5 and 20 cannot > be allowed to hit the disk before the barrier block because they > have strict order dependency on block X being stable before them, > just like block X has strict order dependency that block 4 and 10 > must be stable before we start the barrier block write. > >> 11. disk drive writes block 10 to platter >> 12. disk drive writes block 4 to platter >> 13. disk drive writes block 20 to platter >> 14. 
disk drive writes block 5 to platter > >> if the disk drive doesn't support barriers then step #8 becomes 'issue >> flush' and steps 11 and 12 take place before step #9, 13, 14 > > No, you need a flush on either side of the block X write to maintain > the same semantics as barrier writes currently have. > > We have filesystems that require barriers to prevent reordering of > writes in both directions and to ensure that the block associated > with the barrier is on stable storage when I/o completion is > signalled. The existing barrier implementation (where it works) > provide these requirements. We need barriers to retain these > semantics, otherwise we'll still have to do special stuff in > the filesystems to get the semantics that we need. one of us is misunderstanding barriers here. you are understanding barriers to be the same as syncronous writes. (and therefor the data is on persistant media before the call returns) I am understanding barriers to only indicate ordering requirements. things before the barrier can be reordered freely, things after the barrier can be reordered freely, but things cannot be reordered across the barrier. if I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistant media before you can continue. in the past barriers have not been fully implmented in most cases, and as a result they have been simulated by forcing a full flush of the buffers to persistant media before any other writes are allowed. This has made them _in practice_ operate the same way as syncronous writes (matching your understanding), but the current thread is talking about fixing the implementation to the official symantics for all hardware that can actually support barriers (and fix it at the OS level) David Lang ^ permalink raw reply [flat|nested] 102+ messages in thread
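The ordering-only interpretation described above can be modelled in a few lines of plain C (a toy, not kernel code): requests may be reordered freely within a segment, but nothing moves across a barrier. Here "reordering" is just a sort by sector within each segment.

#include <stdio.h>
#include <stdlib.h>

struct req { long sector; int barrier; };

static int by_sector(const void *a, const void *b)
{
	const struct req *x = a, *y = b;
	return (x->sector > y->sector) - (x->sector < y->sector);
}

int main(void)
{
	/* blocks 10 and 4, a barrier write (block 7), then 5 and 20 */
	struct req q[] = { {10, 0}, {4, 0}, {7, 1}, {5, 0}, {20, 0} };
	size_t n = sizeof(q) / sizeof(q[0]), start = 0, i;

	for (i = 0; i <= n; i++) {
		if (i == n || q[i].barrier) {
			/* reorder the segment; the barrier stays in place */
			qsort(q + start, i - start, sizeof(q[0]), by_sector);
			start = (i < n) ? i + 1 : i;
		}
	}
	for (i = 0; i < n; i++)
		printf("%ld%s\n", q[i].sector, q[i].barrier ? " (barrier)" : "");
	return 0;
}

Under this model the drive (or OS) is free to emit 4 and 10 in either order and 5 and 20 in either order, exactly as in the example above; the open question in the rest of the thread is whether completion of the barrier also implies the data is on the platter.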
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 16:52 ` david @ 2007-05-31 0:20 ` David Chinner 2007-05-31 6:26 ` Jens Axboe 2007-05-31 18:24 ` Phillip Susi 0 siblings, 2 replies; 102+ messages in thread From: David Chinner @ 2007-05-31 0:20 UTC (permalink / raw) To: david Cc: David Chinner, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Stefan Bader, Andreas Dilger, Tejun Heo On Wed, May 30, 2007 at 09:52:49AM -0700, david@lang.hm wrote: > On Wed, 30 May 2007, David Chinner wrote: > >with the barrier is on stable storage when I/O completion is > >signalled. The existing barrier implementation (where it works) > >provides these requirements. We need barriers to retain these > >semantics, otherwise we'll still have to do special stuff in > >the filesystems to get the semantics that we need. > > one of us is misunderstanding barriers here. No, I think we are both on the same level here - it's what barriers are used for that is not clearly understood, I think. > you are understanding barriers to be the same as syncronous writes. (and > therefor the data is on persistant media before the call returns) No, I'm describing the high level behaviour that is expected by a filesystem. The reasons for this are below.... > I am understanding barriers to only indicate ordering requirements. things > before the barrier can be reordered freely, things after the barrier can > be reordered freely, but things cannot be reordered across the barrier. Ok, that's my understanding of how *device based barriers* can work, but there's more to it than that. As far as the filesystem is concerned the barrier write needs to *behave* exactly like a sync write because of the guarantees the filesystem has to provide userspace. Specifically - sync, sync writes and fsync. This is the big problem, right? If we use barriers for commit writes, the filesystem can return to userspace after a sync write or fsync() and an *ordered barrier device implementation* may not have written the blocks to persistent media. If we then pull the plug on the box, we've just lost data that sync or fsync said was successfully on disk. That's BAD. Right now a barrier write on the last block of the fsync/sync write is sufficient to prevent that because of the FUA on the barrier block write. A purely ordered barrier implementation does not provide this guarantee. This is the crux of my argument - from a filesystem perspective, there is a *major* difference between a barrier implemented to just guarantee ordering and a barrier implemented with a flush+FUA or flush+write+flush. IOWs, there are two parts to the problem: 1 - guaranteeing I/O ordering 2 - guaranteeing blocks are on persistent storage. Right now, a single barrier I/O is used to provide both of these guarantees. In most cases, all we really need to provide is 1); the need for 2) is a much rarer condition but still needs to be provided. > if I am understanding it correctly, the big win for barriers is that you > do NOT have to stop and wait until the data is on persistant media before > you can continue. Yes, if we define a barrier to only guarantee 1), then yes this would be a big win (esp. for XFS). But that requires all filesystems to handle sync writes differently, and sync_blockdev() needs to call blkdev_issue_flush() as well.... So, what do we do here? Do we define a barrier I/O to only provide ordering, or do we define it to also provide persistent storage writeback?
Whatever we decide, it needs to be documented.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
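If barriers were redefined to give only guarantee 1), the fsync/sync paths would have to supply guarantee 2) themselves, roughly as in the sketch below. The myfs_* helpers are hypothetical; blkdev_issue_flush() is the call mentioned above, though its exact signature has varied between kernel versions.

static int myfs_fsync_with_ordered_commit(struct myfs_mount *fs)
{
	int err;

	/* commit write submitted with ordering-only semantics */
	err = myfs_write_commit_ordered(fs);
	if (err)
		return err;

	/* persistence is now the filesystem's problem: push the device's
	 * volatile cache out before telling userspace the data is safe */
	return blkdev_issue_flush(fs->bdev, NULL);
}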
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 0:20 ` David Chinner @ 2007-05-31 6:26 ` Jens Axboe 2007-05-31 7:03 ` David Chinner 2007-05-31 18:24 ` Phillip Susi 1 sibling, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-05-31 6:26 UTC (permalink / raw) To: David Chinner Cc: david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31 2007, David Chinner wrote: > IOWs, there are two parts to the problem: > > 1 - guaranteeing I/O ordering > 2 - guaranteeing blocks are on persistent storage. > > Right now, a single barrier I/O is used to provide both of these > guarantees. In most cases, all we really need to provide is 1); the > need for 2) is a much rarer condition but still needs to be > provided. > > > if I am understanding it correctly, the big win for barriers is that you > > do NOT have to stop and wait until the data is on persistant media before > > you can continue. > > Yes, if we define a barrier to only guarantee 1), then yes this > would be a big win (esp. for XFS). But that requires all filesystems > to handle sync writes differently, and sync_blockdev() needs to > call blkdev_issue_flush() as well.... > > So, what do we do here? Do we define a barrier I/O to only provide > ordering, or do we define it to also provide persistent storage > writeback? Whatever we decide, it needs to be documented.... The block layer already has a notion of the two types of barriers, with a very small amount of tweaking we could expose that. There's absolutely zero reason we can't easily support both types of barriers. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 6:26 ` Jens Axboe @ 2007-05-31 7:03 ` David Chinner 2007-05-31 7:06 ` Jens Axboe 2007-05-31 18:31 ` Phillip Susi 0 siblings, 2 replies; 102+ messages in thread From: David Chinner @ 2007-05-31 7:03 UTC (permalink / raw) To: Jens Axboe Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: > On Thu, May 31 2007, David Chinner wrote: > > IOWs, there are two parts to the problem: > > > > 1 - guaranteeing I/O ordering > > 2 - guaranteeing blocks are on persistent storage. > > > > Right now, a single barrier I/O is used to provide both of these > > guarantees. In most cases, all we really need to provide is 1); the > > need for 2) is a much rarer condition but still needs to be > > provided. > > > > > if I am understanding it correctly, the big win for barriers is that you > > > do NOT have to stop and wait until the data is on persistant media before > > > you can continue. > > > > Yes, if we define a barrier to only guarantee 1), then yes this > > would be a big win (esp. for XFS). But that requires all filesystems > > to handle sync writes differently, and sync_blockdev() needs to > > call blkdev_issue_flush() as well.... > > > > So, what do we do here? Do we define a barrier I/O to only provide > > ordering, or do we define it to also provide persistent storage > > writeback? Whatever we decide, it needs to be documented.... > > The block layer already has a notion of the two types of barriers, with > a very small amount of tweaking we could expose that. There's absolutely > zero reason we can't easily support both types of barriers. That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 7:03 ` David Chinner @ 2007-05-31 7:06 ` Jens Axboe 2007-05-31 13:30 ` Bill Davidsen 2007-06-01 3:16 ` Tejun Heo 2007-05-31 18:31 ` Phillip Susi 1 sibling, 2 replies; 102+ messages in thread From: Jens Axboe @ 2007-05-31 7:06 UTC (permalink / raw) To: David Chinner Cc: david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31 2007, David Chinner wrote: > On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: > > On Thu, May 31 2007, David Chinner wrote: > > > IOWs, there are two parts to the problem: > > > > > > 1 - guaranteeing I/O ordering > > > 2 - guaranteeing blocks are on persistent storage. > > > > > > Right now, a single barrier I/O is used to provide both of these > > > guarantees. In most cases, all we really need to provide is 1); the > > > need for 2) is a much rarer condition but still needs to be > > > provided. > > > > > > > if I am understanding it correctly, the big win for barriers is that you > > > > do NOT have to stop and wait until the data is on persistant media before > > > > you can continue. > > > > > > Yes, if we define a barrier to only guarantee 1), then yes this > > > would be a big win (esp. for XFS). But that requires all filesystems > > > to handle sync writes differently, and sync_blockdev() needs to > > > call blkdev_issue_flush() as well.... > > > > > > So, what do we do here? Do we define a barrier I/O to only provide > > > ordering, or do we define it to also provide persistent storage > > > writeback? Whatever we decide, it needs to be documented.... > > > > The block layer already has a notion of the two types of barriers, with > > a very small amount of tweaking we could expose that. There's absolutely > > zero reason we can't easily support both types of barriers. > > That sounds like a good idea - we can leave the existing > WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > behaviour that only guarantees ordering. The filesystem can then > choose which to use where appropriate.... Precisely. The current definition of barriers are what Chris and I came up with many years ago, when solving the problem for reiserfs originally. It is by no means the only feasible approach. I'll add a WRITE_ORDERED command to the #barrier branch, it already contains the empty-bio barrier support I posted yesterday (well a slightly modified and cleaned up version). -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
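From the filesystem side, the split being agreed on here would let ordinary journal commits pick the cheaper semantics and reserve flush/FUA behaviour for the sync paths. A rough sketch follows; WRITE_ORDERED is the flag proposed in this subthread, not an existing interface, and the myfs helper is invented.

static void myfs_submit_commit_bh(struct buffer_head *bh, int need_stable)
{
	/*
	 * Journal commits normally only need ordering (the proposed
	 * WRITE_ORDERED); fsync/sync-driven commits also need the block
	 * on stable storage before returning, i.e. today's WRITE_BARRIER.
	 */
	int rw = need_stable ? WRITE_BARRIER : WRITE_ORDERED;

	submit_bh(rw, bh);
}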
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 7:06 ` Jens Axboe @ 2007-05-31 13:30 ` Bill Davidsen 2007-05-31 13:36 ` Jens Axboe 2007-06-01 3:16 ` Tejun Heo 1 sibling, 1 reply; 102+ messages in thread From: Bill Davidsen @ 2007-05-31 13:30 UTC (permalink / raw) To: Jens Axboe Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo Jens Axboe wrote: > On Thu, May 31 2007, David Chinner wrote: > >> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: >> >>> On Thu, May 31 2007, David Chinner wrote: >>> >>>> IOWs, there are two parts to the problem: >>>> >>>> 1 - guaranteeing I/O ordering >>>> 2 - guaranteeing blocks are on persistent storage. >>>> >>>> Right now, a single barrier I/O is used to provide both of these >>>> guarantees. In most cases, all we really need to provide is 1); the >>>> need for 2) is a much rarer condition but still needs to be >>>> provided. >>>> >>>> >>>>> if I am understanding it correctly, the big win for barriers is that you >>>>> do NOT have to stop and wait until the data is on persistant media before >>>>> you can continue. >>>>> >>>> Yes, if we define a barrier to only guarantee 1), then yes this >>>> would be a big win (esp. for XFS). But that requires all filesystems >>>> to handle sync writes differently, and sync_blockdev() needs to >>>> call blkdev_issue_flush() as well.... >>>> >>>> So, what do we do here? Do we define a barrier I/O to only provide >>>> ordering, or do we define it to also provide persistent storage >>>> writeback? Whatever we decide, it needs to be documented.... >>>> >>> The block layer already has a notion of the two types of barriers, with >>> a very small amount of tweaking we could expose that. There's absolutely >>> zero reason we can't easily support both types of barriers. >>> >> That sounds like a good idea - we can leave the existing >> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED >> behaviour that only guarantees ordering. The filesystem can then >> choose which to use where appropriate.... >> > > Precisely. The current definition of barriers are what Chris and I came > up with many years ago, when solving the problem for reiserfs > originally. It is by no means the only feasible approach. > > I'll add a WRITE_ORDERED command to the #barrier branch, it already > contains the empty-bio barrier support I posted yesterday (well a > slightly modified and cleaned up version). > > Wait. Do filesystems expect (depend on) anything but ordering now? Does md? Having users of barriers as they currently behave suddenly getting SYNC behavior where they expect ORDERED is likely to have a negative effect on performance. Or do I misread what is actually guaranteed by WRITE_BARRIER now, and a flush is currently happening in all cases? And will this also be available to user space f/s, since I just proposed a project which uses one? :-( I think the goal is good, more choice is almost always better choice, I just want to be sure there won't be big disk performance regressions. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 13:30 ` Bill Davidsen @ 2007-05-31 13:36 ` Jens Axboe 2007-06-01 16:04 ` Bill Davidsen 0 siblings, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-05-31 13:36 UTC (permalink / raw) To: Bill Davidsen Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31 2007, Bill Davidsen wrote: > Jens Axboe wrote: > >On Thu, May 31 2007, David Chinner wrote: > > > >>On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: > >> > >>>On Thu, May 31 2007, David Chinner wrote: > >>> > >>>>IOWs, there are two parts to the problem: > >>>> > >>>> 1 - guaranteeing I/O ordering > >>>> 2 - guaranteeing blocks are on persistent storage. > >>>> > >>>>Right now, a single barrier I/O is used to provide both of these > >>>>guarantees. In most cases, all we really need to provide is 1); the > >>>>need for 2) is a much rarer condition but still needs to be > >>>>provided. > >>>> > >>>> > >>>>>if I am understanding it correctly, the big win for barriers is that > >>>>>you do NOT have to stop and wait until the data is on persistant media > >>>>>before you can continue. > >>>>> > >>>>Yes, if we define a barrier to only guarantee 1), then yes this > >>>>would be a big win (esp. for XFS). But that requires all filesystems > >>>>to handle sync writes differently, and sync_blockdev() needs to > >>>>call blkdev_issue_flush() as well.... > >>>> > >>>>So, what do we do here? Do we define a barrier I/O to only provide > >>>>ordering, or do we define it to also provide persistent storage > >>>>writeback? Whatever we decide, it needs to be documented.... > >>>> > >>>The block layer already has a notion of the two types of barriers, with > >>>a very small amount of tweaking we could expose that. There's absolutely > >>>zero reason we can't easily support both types of barriers. > >>> > >>That sounds like a good idea - we can leave the existing > >>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >>behaviour that only guarantees ordering. The filesystem can then > >>choose which to use where appropriate.... > >> > > > >Precisely. The current definition of barriers are what Chris and I came > >up with many years ago, when solving the problem for reiserfs > >originally. It is by no means the only feasible approach. > > > >I'll add a WRITE_ORDERED command to the #barrier branch, it already > >contains the empty-bio barrier support I posted yesterday (well a > >slightly modified and cleaned up version). > > > > > Wait. Do filesystems expect (depend on) anything but ordering now? Does > md? Having users of barriers as they currently behave suddenly getting > SYNC behavior where they expect ORDERED is likely to have a negative > effect on performance. Or do I misread what is actually guaranteed by > WRITE_BARRIER now, and a flush is currently happening in all cases? See the above stuff you quote, it's answered there. It's not a change, this is how the Linux barrier write has always worked since I first implemented it. What David and I are talking about is adding a more relaxed version as well, that just implies ordering. > And will this also be available to user space f/s, since I just proposed > a project which uses one? :-( I see several uses for that, so I'd hope so. > I think the goal is good, more choice is almost always better choice, I > just want to be sure there won't be big disk performance regressions. 
We can't get more heavy weight than the current barrier, it's about as conservative as you can get. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 13:36 ` Jens Axboe @ 2007-06-01 16:04 ` Bill Davidsen 2007-06-02 14:51 ` Jens Axboe 0 siblings, 1 reply; 102+ messages in thread From: Bill Davidsen @ 2007-06-01 16:04 UTC (permalink / raw) To: Jens Axboe Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo Jens Axboe wrote: > On Thu, May 31 2007, Bill Davidsen wrote: > >> Jens Axboe wrote: >> >>> On Thu, May 31 2007, David Chinner wrote: >>> >>> >>>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: >>>> >>>> >>>>> On Thu, May 31 2007, David Chinner wrote: >>>>> >>>>> >>>>>> IOWs, there are two parts to the problem: >>>>>> >>>>>> 1 - guaranteeing I/O ordering >>>>>> 2 - guaranteeing blocks are on persistent storage. >>>>>> >>>>>> Right now, a single barrier I/O is used to provide both of these >>>>>> guarantees. In most cases, all we really need to provide is 1); the >>>>>> need for 2) is a much rarer condition but still needs to be >>>>>> provided. >>>>>> >>>>>> >>>>>> >>>>>>> if I am understanding it correctly, the big win for barriers is that >>>>>>> you do NOT have to stop and wait until the data is on persistant media >>>>>>> before you can continue. >>>>>>> >>>>>>> >>>>>> Yes, if we define a barrier to only guarantee 1), then yes this >>>>>> would be a big win (esp. for XFS). But that requires all filesystems >>>>>> to handle sync writes differently, and sync_blockdev() needs to >>>>>> call blkdev_issue_flush() as well.... >>>>>> >>>>>> So, what do we do here? Do we define a barrier I/O to only provide >>>>>> ordering, or do we define it to also provide persistent storage >>>>>> writeback? Whatever we decide, it needs to be documented.... >>>>>> >>>>>> >>>>> The block layer already has a notion of the two types of barriers, with >>>>> a very small amount of tweaking we could expose that. There's absolutely >>>>> zero reason we can't easily support both types of barriers. >>>>> >>>>> >>>> That sounds like a good idea - we can leave the existing >>>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED >>>> behaviour that only guarantees ordering. The filesystem can then >>>> choose which to use where appropriate.... >>>> >>>> >>> Precisely. The current definition of barriers are what Chris and I came >>> up with many years ago, when solving the problem for reiserfs >>> originally. It is by no means the only feasible approach. >>> >>> I'll add a WRITE_ORDERED command to the #barrier branch, it already >>> contains the empty-bio barrier support I posted yesterday (well a >>> slightly modified and cleaned up version). >>> >>> >>> >> Wait. Do filesystems expect (depend on) anything but ordering now? Does >> md? Having users of barriers as they currently behave suddenly getting >> SYNC behavior where they expect ORDERED is likely to have a negative >> effect on performance. Or do I misread what is actually guaranteed by >> WRITE_BARRIER now, and a flush is currently happening in all cases? >> > > See the above stuff you quote, it's answered there. It's not a change, > this is how the Linux barrier write has always worked since I first > implemented it. What David and I are talking about is adding a more > relaxed version as well, that just implies ordering. > I was reading the documentation in block/biodoc.txt, which seems to just say ordered: 1.2.1 I/O Barriers There is a way to enforce strict ordering for i/os through barriers. 
All requests before a barrier point must be serviced before the barrier request and any other requests arriving after the barrier will not be serviced until after the barrier has completed. This is useful for higher level control on write ordering, e.g flushing a log of committed updates to disk before the corresponding updates themselves. A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o. The generic i/o scheduler would make sure that it places the barrier request and all other requests coming after it after all the previous requests in the queue. Barriers may be implemented in different ways depending on the driver. A SCSI driver for example could make use of ordered tags to preserve the necessary ordering with a lower impact on throughput. For IDE this might be two sync cache flush: a pre and post flush when encountering a barrier write. The "flush" comment is associated with IDE, so it wasn't clear that the device cache is always cleared to force the data to the platter. >> And will this also be available to user space f/s, since I just proposed >> a project which uses one? :-( >> > > I see several uses for that, so I'd hope so. > > >> I think the goal is good, more choice is almost always better choice, I >> just want to be sure there won't be big disk performance regressions. >> > > We can't get more heavy weight than the current barrier, it's about as > conservative as you can get. > > -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 16:04 ` Bill Davidsen @ 2007-06-02 14:51 ` Jens Axboe 2007-06-02 19:55 ` Bill Davidsen 0 siblings, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-06-02 14:51 UTC (permalink / raw) To: Bill Davidsen Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Fri, Jun 01 2007, Bill Davidsen wrote: > Jens Axboe wrote: > >On Thu, May 31 2007, Bill Davidsen wrote: > > > >>Jens Axboe wrote: > >> > >>>On Thu, May 31 2007, David Chinner wrote: > >>> > >>> > >>>>On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: > >>>> > >>>> > >>>>>On Thu, May 31 2007, David Chinner wrote: > >>>>> > >>>>> > >>>>>>IOWs, there are two parts to the problem: > >>>>>> > >>>>>> 1 - guaranteeing I/O ordering > >>>>>> 2 - guaranteeing blocks are on persistent storage. > >>>>>> > >>>>>>Right now, a single barrier I/O is used to provide both of these > >>>>>>guarantees. In most cases, all we really need to provide is 1); the > >>>>>>need for 2) is a much rarer condition but still needs to be > >>>>>>provided. > >>>>>> > >>>>>> > >>>>>> > >>>>>>>if I am understanding it correctly, the big win for barriers is that > >>>>>>>you do NOT have to stop and wait until the data is on persistant > >>>>>>>media before you can continue. > >>>>>>> > >>>>>>> > >>>>>>Yes, if we define a barrier to only guarantee 1), then yes this > >>>>>>would be a big win (esp. for XFS). But that requires all filesystems > >>>>>>to handle sync writes differently, and sync_blockdev() needs to > >>>>>>call blkdev_issue_flush() as well.... > >>>>>> > >>>>>>So, what do we do here? Do we define a barrier I/O to only provide > >>>>>>ordering, or do we define it to also provide persistent storage > >>>>>>writeback? Whatever we decide, it needs to be documented.... > >>>>>> > >>>>>> > >>>>>The block layer already has a notion of the two types of barriers, with > >>>>>a very small amount of tweaking we could expose that. There's > >>>>>absolutely > >>>>>zero reason we can't easily support both types of barriers. > >>>>> > >>>>> > >>>>That sounds like a good idea - we can leave the existing > >>>>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >>>>behaviour that only guarantees ordering. The filesystem can then > >>>>choose which to use where appropriate.... > >>>> > >>>> > >>>Precisely. The current definition of barriers are what Chris and I came > >>>up with many years ago, when solving the problem for reiserfs > >>>originally. It is by no means the only feasible approach. > >>> > >>>I'll add a WRITE_ORDERED command to the #barrier branch, it already > >>>contains the empty-bio barrier support I posted yesterday (well a > >>>slightly modified and cleaned up version). > >>> > >>> > >>> > >>Wait. Do filesystems expect (depend on) anything but ordering now? Does > >>md? Having users of barriers as they currently behave suddenly getting > >>SYNC behavior where they expect ORDERED is likely to have a negative > >>effect on performance. Or do I misread what is actually guaranteed by > >>WRITE_BARRIER now, and a flush is currently happening in all cases? > >> > > > >See the above stuff you quote, it's answered there. It's not a change, > >this is how the Linux barrier write has always worked since I first > >implemented it. What David and I are talking about is adding a more > >relaxed version as well, that just implies ordering. 
> > > > I was reading the documentation in block/biodoc.txt, which seems to just > say ordered: > > 1.2.1 I/O Barriers > > There is a way to enforce strict ordering for i/os through barriers. > All requests before a barrier point must be serviced before the barrier > request and any other requests arriving after the barrier will not be > serviced until after the barrier has completed. This is useful for > higher > level control on write ordering, e.g flushing a log of committed updates > to disk before the corresponding updates themselves. > > A flag in the bio structure, BIO_BARRIER is used to identify a > barrier i/o. > The generic i/o scheduler would make sure that it places the barrier > request and > all other requests coming after it after all the previous requests > in the > queue. Barriers may be implemented in different ways depending on the > driver. A SCSI driver for example could make use of ordered tags to > preserve the necessary ordering with a lower impact on throughput. > For IDE > this might be two sync cache flush: a pre and post flush when > encountering > a barrier write. > > The "flush" comment is associated with IDE, so it wasn't clear that the > device cache is always cleared to force the data to the platter. The above should mention that the ordered tag comment for SCSI assumes that the drive uses write through caching. If it does, then an ordered tag is enough. If it doesn't, then you need a bit more than that (a post flush, after the ordered tag has completed). -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
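The two SCSI cases described in that answer, written out as a toy sequence (hypothetical helpers; not the block layer's actual ordered-mode state machine):

static void toy_scsi_barrier(struct toy_scsi_dev *dev, struct toy_req *rq)
{
	if (dev->write_cache_enabled) {
		/* Write-back cache: an ordered tag only orders commands,
		 * so the cache still has to be flushed afterwards. */
		issue_with_ordered_tag(dev, rq);
		wait_for_request(dev, rq);
		issue_synchronize_cache(dev);	/* post flush */
	} else {
		/* Write-through cache: ordering alone is sufficient; data
		 * is on the media when each command completes. */
		issue_with_ordered_tag(dev, rq);
	}
}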
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-02 14:51 ` Jens Axboe @ 2007-06-02 19:55 ` Bill Davidsen 0 siblings, 0 replies; 102+ messages in thread From: Bill Davidsen @ 2007-06-02 19:55 UTC (permalink / raw) To: Jens Axboe Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo Jens Axboe wrote: > On Fri, Jun 01 2007, Bill Davidsen wrote: > >> Jens Axboe wrote: >> >>> On Thu, May 31 2007, Bill Davidsen wrote: >>> >>> >>>> Jens Axboe wrote: >>>> >>>> >>>>> On Thu, May 31 2007, David Chinner wrote: >>>>> >>>>> >>>>> >>>>>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: >>>>>> >>>>>> >>>>>> >>>>>>> On Thu, May 31 2007, David Chinner wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> IOWs, there are two parts to the problem: >>>>>>>> >>>>>>>> 1 - guaranteeing I/O ordering >>>>>>>> 2 - guaranteeing blocks are on persistent storage. >>>>>>>> >>>>>>>> Right now, a single barrier I/O is used to provide both of these >>>>>>>> guarantees. In most cases, all we really need to provide is 1); the >>>>>>>> need for 2) is a much rarer condition but still needs to be >>>>>>>> provided. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> if I am understanding it correctly, the big win for barriers is that >>>>>>>>> you do NOT have to stop and wait until the data is on persistant >>>>>>>>> media before you can continue. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> Yes, if we define a barrier to only guarantee 1), then yes this >>>>>>>> would be a big win (esp. for XFS). But that requires all filesystems >>>>>>>> to handle sync writes differently, and sync_blockdev() needs to >>>>>>>> call blkdev_issue_flush() as well.... >>>>>>>> >>>>>>>> So, what do we do here? Do we define a barrier I/O to only provide >>>>>>>> ordering, or do we define it to also provide persistent storage >>>>>>>> writeback? Whatever we decide, it needs to be documented.... >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> The block layer already has a notion of the two types of barriers, with >>>>>>> a very small amount of tweaking we could expose that. There's >>>>>>> absolutely >>>>>>> zero reason we can't easily support both types of barriers. >>>>>>> >>>>>>> >>>>>>> >>>>>> That sounds like a good idea - we can leave the existing >>>>>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED >>>>>> behaviour that only guarantees ordering. The filesystem can then >>>>>> choose which to use where appropriate.... >>>>>> >>>>>> >>>>>> >>>>> Precisely. The current definition of barriers are what Chris and I came >>>>> up with many years ago, when solving the problem for reiserfs >>>>> originally. It is by no means the only feasible approach. >>>>> >>>>> I'll add a WRITE_ORDERED command to the #barrier branch, it already >>>>> contains the empty-bio barrier support I posted yesterday (well a >>>>> slightly modified and cleaned up version). >>>>> >>>>> >>>>> >>>>> >>>> Wait. Do filesystems expect (depend on) anything but ordering now? Does >>>> md? Having users of barriers as they currently behave suddenly getting >>>> SYNC behavior where they expect ORDERED is likely to have a negative >>>> effect on performance. Or do I misread what is actually guaranteed by >>>> WRITE_BARRIER now, and a flush is currently happening in all cases? >>>> >>>> >>> See the above stuff you quote, it's answered there. It's not a change, >>> this is how the Linux barrier write has always worked since I first >>> implemented it. 
What David and I are talking about is adding a more >>> relaxed version as well, that just implies ordering. >>> >>> >> I was reading the documentation in block/biodoc.txt, which seems to just >> say ordered: >> >> 1.2.1 I/O Barriers >> >> There is a way to enforce strict ordering for i/os through barriers. >> All requests before a barrier point must be serviced before the barrier >> request and any other requests arriving after the barrier will not be >> serviced until after the barrier has completed. This is useful for >> higher >> level control on write ordering, e.g flushing a log of committed updates >> to disk before the corresponding updates themselves. >> >> A flag in the bio structure, BIO_BARRIER is used to identify a >> barrier i/o. >> The generic i/o scheduler would make sure that it places the barrier >> request and >> all other requests coming after it after all the previous requests >> in the >> queue. Barriers may be implemented in different ways depending on the >> driver. A SCSI driver for example could make use of ordered tags to >> preserve the necessary ordering with a lower impact on throughput. >> For IDE >> this might be two sync cache flush: a pre and post flush when >> encountering >> a barrier write. >> >> The "flush" comment is associated with IDE, so it wasn't clear that the >> device cache is always cleared to force the data to the platter. >> > > The above should mention that the ordered tag comment for SCSI assumes > that the drive uses write through caching. If it does, then an ordered > tag is enough. If it doesn't, then you need a bit more than that (a post > flush, after the ordered tag has completed). > > Thanks, go it. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 7:06 ` Jens Axboe 2007-05-31 13:30 ` Bill Davidsen @ 2007-06-01 3:16 ` Tejun Heo 2007-06-01 8:21 ` Jens Axboe 1 sibling, 1 reply; 102+ messages in thread From: Tejun Heo @ 2007-06-01 3:16 UTC (permalink / raw) To: Jens Axboe Cc: david, David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel, Andreas Dilger Jens Axboe wrote: > On Thu, May 31 2007, David Chinner wrote: >> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: >>> On Thu, May 31 2007, David Chinner wrote: >>>> IOWs, there are two parts to the problem: >>>> >>>> 1 - guaranteeing I/O ordering >>>> 2 - guaranteeing blocks are on persistent storage. >>>> >>>> Right now, a single barrier I/O is used to provide both of these >>>> guarantees. In most cases, all we really need to provide is 1); the >>>> need for 2) is a much rarer condition but still needs to be >>>> provided. >>>> >>>>> if I am understanding it correctly, the big win for barriers is that you >>>>> do NOT have to stop and wait until the data is on persistant media before >>>>> you can continue. >>>> Yes, if we define a barrier to only guarantee 1), then yes this >>>> would be a big win (esp. for XFS). But that requires all filesystems >>>> to handle sync writes differently, and sync_blockdev() needs to >>>> call blkdev_issue_flush() as well.... >>>> >>>> So, what do we do here? Do we define a barrier I/O to only provide >>>> ordering, or do we define it to also provide persistent storage >>>> writeback? Whatever we decide, it needs to be documented.... >>> The block layer already has a notion of the two types of barriers, with >>> a very small amount of tweaking we could expose that. There's absolutely >>> zero reason we can't easily support both types of barriers. >> That sounds like a good idea - we can leave the existing >> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED >> behaviour that only guarantees ordering. The filesystem can then >> choose which to use where appropriate.... > > Precisely. The current definition of barriers are what Chris and I came > up with many years ago, when solving the problem for reiserfs > originally. It is by no means the only feasible approach. > > I'll add a WRITE_ORDERED command to the #barrier branch, it already > contains the empty-bio barrier support I posted yesterday (well a > slightly modified and cleaned up version). Would that be very different from issuing barrier and not waiting for its completion? For ATA and SCSI, we'll have to flush write back cache anyway, so I don't see how we can get performance advantage by implementing separate WRITE_ORDERED. I think zero-length barrier (haven't looked at the code yet, still recovering from jet lag :-) can serve as genuine barrier without the extra write tho. Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 3:16 ` Tejun Heo @ 2007-06-01 8:21 ` Jens Axboe 2007-06-02 9:20 ` Tejun Heo 0 siblings, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-06-01 8:21 UTC (permalink / raw) To: Tejun Heo Cc: david, David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel, Andreas Dilger On Fri, Jun 01 2007, Tejun Heo wrote: > Jens Axboe wrote: > > On Thu, May 31 2007, David Chinner wrote: > >> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: > >>> On Thu, May 31 2007, David Chinner wrote: > >>>> IOWs, there are two parts to the problem: > >>>> > >>>> 1 - guaranteeing I/O ordering > >>>> 2 - guaranteeing blocks are on persistent storage. > >>>> > >>>> Right now, a single barrier I/O is used to provide both of these > >>>> guarantees. In most cases, all we really need to provide is 1); the > >>>> need for 2) is a much rarer condition but still needs to be > >>>> provided. > >>>> > >>>>> if I am understanding it correctly, the big win for barriers is that you > >>>>> do NOT have to stop and wait until the data is on persistant media before > >>>>> you can continue. > >>>> Yes, if we define a barrier to only guarantee 1), then yes this > >>>> would be a big win (esp. for XFS). But that requires all filesystems > >>>> to handle sync writes differently, and sync_blockdev() needs to > >>>> call blkdev_issue_flush() as well.... > >>>> > >>>> So, what do we do here? Do we define a barrier I/O to only provide > >>>> ordering, or do we define it to also provide persistent storage > >>>> writeback? Whatever we decide, it needs to be documented.... > >>> The block layer already has a notion of the two types of barriers, with > >>> a very small amount of tweaking we could expose that. There's absolutely > >>> zero reason we can't easily support both types of barriers. > >> That sounds like a good idea - we can leave the existing > >> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >> behaviour that only guarantees ordering. The filesystem can then > >> choose which to use where appropriate.... > > > > Precisely. The current definition of barriers are what Chris and I came > > up with many years ago, when solving the problem for reiserfs > > originally. It is by no means the only feasible approach. > > > > I'll add a WRITE_ORDERED command to the #barrier branch, it already > > contains the empty-bio barrier support I posted yesterday (well a > > slightly modified and cleaned up version). > > Would that be very different from issuing barrier and not waiting for > its completion? For ATA and SCSI, we'll have to flush write back cache > anyway, so I don't see how we can get performance advantage by > implementing separate WRITE_ORDERED. I think zero-length barrier > (haven't looked at the code yet, still recovering from jet lag :-) can > serve as genuine barrier without the extra write tho. As always, it depends :-) If you are doing pure flush barriers, then there's no difference. Unless you only guarantee ordering wrt previously submitted requests, in which case you can eliminate the post flush. If you are doing ordered tags, then just setting the ordered bit is enough. That is different from the barrier in that we don't need a flush of FUA bit set. In reality maybe the difference isn't all that great, at least we can start by having WRITE_ORDERED == WRITE_BARRIER. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
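In code, the starting point Jens describes is tiny; a hypothetical sketch only -- WRITE_ORDERED and BIO_RW_ORDERED do not exist in the tree at this point, and WRITE_BARRIER is assumed to be the definition from include/linux/fs.h of this era:

    #include <linux/fs.h>
    #include <linux/bio.h>

    /* step 1: same semantics as today's barrier write */
    #define WRITE_ORDERED   WRITE_BARRIER

    /* a later, relaxed variant might drop the post-flush and only promise
     * ordering, e.g. something like
     *      #define WRITE_ORDERED   ((1 << BIO_RW) | (1 << BIO_RW_ORDERED))
     * where BIO_RW_ORDERED is a new, not-yet-existing bio flag */

    static void write_commit_block(struct bio *commit_bio)
    {
            /* earlier writes reach the media first, but there is no promise
             * that they are on media when this bio completes */
            submit_bio(WRITE_ORDERED, commit_bio);
    }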
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 8:21 ` Jens Axboe @ 2007-06-02 9:20 ` Tejun Heo 2007-06-02 14:34 ` Jens Axboe 0 siblings, 1 reply; 102+ messages in thread From: Tejun Heo @ 2007-06-02 9:20 UTC (permalink / raw) To: Jens Axboe Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger Hello, Jens Axboe wrote: >> Would that be very different from issuing barrier and not waiting for >> its completion? For ATA and SCSI, we'll have to flush write back cache >> anyway, so I don't see how we can get performance advantage by >> implementing separate WRITE_ORDERED. I think zero-length barrier >> (haven't looked at the code yet, still recovering from jet lag :-) can >> serve as genuine barrier without the extra write tho. > > As always, it depends :-) > > If you are doing pure flush barriers, then there's no difference. Unless > you only guarantee ordering wrt previously submitted requests, in which > case you can eliminate the post flush. > > If you are doing ordered tags, then just setting the ordered bit is > enough. That is different from the barrier in that we don't need a flush > of FUA bit set. Hmmm... I'm feeling dense. Zero-length barrier also requires only one flush to separate requests before and after it (haven't looked at the code yet, will soon). Can you enlighten me? Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-02 9:20 ` Tejun Heo @ 2007-06-02 14:34 ` Jens Axboe 2007-06-02 22:57 ` Guy Watkins 2007-06-04 7:39 ` Tejun Heo 0 siblings, 2 replies; 102+ messages in thread From: Jens Axboe @ 2007-06-02 14:34 UTC (permalink / raw) To: Tejun Heo Cc: david, David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel, Andreas Dilger On Sat, Jun 02 2007, Tejun Heo wrote: > Hello, > > Jens Axboe wrote: > >> Would that be very different from issuing barrier and not waiting for > >> its completion? For ATA and SCSI, we'll have to flush write back cache > >> anyway, so I don't see how we can get performance advantage by > >> implementing separate WRITE_ORDERED. I think zero-length barrier > >> (haven't looked at the code yet, still recovering from jet lag :-) can > >> serve as genuine barrier without the extra write tho. > > > > As always, it depends :-) > > > > If you are doing pure flush barriers, then there's no difference. Unless > > you only guarantee ordering wrt previously submitted requests, in which > > case you can eliminate the post flush. > > > > If you are doing ordered tags, then just setting the ordered bit is > > enough. That is different from the barrier in that we don't need a flush > > of FUA bit set. > > Hmmm... I'm feeling dense. Zero-length barrier also requires only one > flush to separate requests before and after it (haven't looked at the > code yet, will soon). Can you enlighten me? Yeah, that's what the zero-length barrier implementation I posted does. Not sure if you have a question beyond that, if so fire away :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
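For reference, issuing such an empty barrier from a caller could look roughly like the sketch below, assuming the empty-bio barrier support Jens posted is applied; the end_io prototype follows the three-argument form used by kernels of this era, and the helper names are made up:

    #include <linux/bio.h>
    #include <linux/fs.h>
    #include <linux/completion.h>

    static int empty_barrier_end_io(struct bio *bio, unsigned int bytes_done,
                                    int err)
    {
            if (bio->bi_size)
                    return 1;
            complete(bio->bi_private);
            bio_put(bio);
            return 0;
    }

    static int issue_empty_barrier(struct block_device *bdev)
    {
            DECLARE_COMPLETION_ONSTACK(done);
            struct bio *bio = bio_alloc(GFP_KERNEL, 0);

            if (!bio)
                    return -ENOMEM;
            bio->bi_bdev = bdev;
            bio->bi_end_io = empty_barrier_end_io;
            bio->bi_private = &done;

            /* no payload: this only separates what came before from what follows */
            submit_bio(WRITE_BARRIER, bio);
            wait_for_completion(&done);
            return 0;
    }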
* RE: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-02 14:34 ` Jens Axboe @ 2007-06-02 22:57 ` Guy Watkins 2007-06-04 7:39 ` Tejun Heo 1 sibling, 0 replies; 102+ messages in thread From: Guy Watkins @ 2007-06-02 22:57 UTC (permalink / raw) To: 'Jens Axboe', 'Tejun Heo' Cc: 'David Chinner', david, 'Phillip Susi', 'Neil Brown', linux-fsdevel, linux-kernel, dm-devel, linux-raid, 'Stefan Bader', 'Andreas Dilger' } -----Original Message----- } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- } owner@vger.kernel.org] On Behalf Of Jens Axboe } Sent: Saturday, June 02, 2007 10:35 AM } To: Tejun Heo } Cc: David Chinner; david@lang.hm; Phillip Susi; Neil Brown; linux- } fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; dm- } devel@redhat.com; linux-raid@vger.kernel.org; Stefan Bader; Andreas Dilger } Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices, } filesystems, and dm/md. } } On Sat, Jun 02 2007, Tejun Heo wrote: } > Hello, } > } > Jens Axboe wrote: } > >> Would that be very different from issuing barrier and not waiting for } > >> its completion? For ATA and SCSI, we'll have to flush write back } cache } > >> anyway, so I don't see how we can get performance advantage by } > >> implementing separate WRITE_ORDERED. I think zero-length barrier } > >> (haven't looked at the code yet, still recovering from jet lag :-) } can } > >> serve as genuine barrier without the extra write tho. } > > } > > As always, it depends :-) } > > } > > If you are doing pure flush barriers, then there's no difference. } Unless } > > you only guarantee ordering wrt previously submitted requests, in } which } > > case you can eliminate the post flush. } > > } > > If you are doing ordered tags, then just setting the ordered bit is } > > enough. That is different from the barrier in that we don't need a } flush } > > of FUA bit set. } > } > Hmmm... I'm feeling dense. Zero-length barrier also requires only one } > flush to separate requests before and after it (haven't looked at the } > code yet, will soon). Can you enlighten me? } } Yeah, that's what the zero-length barrier implementation I posted does. } Not sure if you have a question beyond that, if so fire away :-) } } -- } Jens Axboe I must admit I have only read some of the barrier related posts, so this issue may have been covered. If so, sorry. What I have read seems to be related to a single disk. What if a logical disk is used (md, LVM, ...)? If a barrier is issued to a logical disk and that driver issues barriers to all related devices (logical or physical), all the devices MUST honor the barrier together. If 1 device crosses the barrier before another reaches the barrier, corruption should be assumed. It seems to me each block device that represents more than 2 other devices must do a flush at a barrier so that all devices will cross the barrier at the same time. Guy ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-02 14:34 ` Jens Axboe 2007-06-02 22:57 ` Guy Watkins @ 2007-06-04 7:39 ` Tejun Heo 1 sibling, 0 replies; 102+ messages in thread From: Tejun Heo @ 2007-06-04 7:39 UTC (permalink / raw) To: Jens Axboe Cc: David Chinner, david, Phillip Susi, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger Jens Axboe wrote: > On Sat, Jun 02 2007, Tejun Heo wrote: >> Hello, >> >> Jens Axboe wrote: >>>> Would that be very different from issuing barrier and not waiting for >>>> its completion? For ATA and SCSI, we'll have to flush write back cache >>>> anyway, so I don't see how we can get performance advantage by >>>> implementing separate WRITE_ORDERED. I think zero-length barrier >>>> (haven't looked at the code yet, still recovering from jet lag :-) can >>>> serve as genuine barrier without the extra write tho. >>> As always, it depends :-) >>> >>> If you are doing pure flush barriers, then there's no difference. Unless >>> you only guarantee ordering wrt previously submitted requests, in which >>> case you can eliminate the post flush. >>> >>> If you are doing ordered tags, then just setting the ordered bit is >>> enough. That is different from the barrier in that we don't need a flush >>> of FUA bit set. >> Hmmm... I'm feeling dense. Zero-length barrier also requires only one >> flush to separate requests before and after it (haven't looked at the >> code yet, will soon). Can you enlighten me? > > Yeah, that's what the zero-length barrier implementation I posted does. > Not sure if you have a question beyond that, if so fire away :-) I thought you were talking about adding BIO_RW_ORDERED instead of exposing zero length BIO_RW_BARRIER. Sorry about the confusion. :-) -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 7:03 ` David Chinner 2007-05-31 7:06 ` Jens Axboe @ 2007-05-31 18:31 ` Phillip Susi 2007-05-31 19:00 ` Jens Axboe 2007-05-31 23:34 ` David Chinner 1 sibling, 2 replies; 102+ messages in thread From: Phillip Susi @ 2007-05-31 18:31 UTC (permalink / raw) To: David Chinner Cc: Jens Axboe, david, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo David Chinner wrote: > That sounds like a good idea - we can leave the existing > WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > behaviour that only guarantees ordering. The filesystem can then > choose which to use where appropriate.... So what if you want a synchronous write, but DON'T care about the order? They need to be two completely different flags which you can choose to combine, or use individually. ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 18:31 ` Phillip Susi @ 2007-05-31 19:00 ` Jens Axboe 2007-05-31 19:21 ` david 2007-05-31 23:34 ` David Chinner 1 sibling, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-05-31 19:00 UTC (permalink / raw) To: Phillip Susi Cc: David Chinner, david, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31 2007, Phillip Susi wrote: > David Chinner wrote: > >That sounds like a good idea - we can leave the existing > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >behaviour that only guarantees ordering. The filesystem can then > >choose which to use where appropriate.... > > So what if you want a synchronous write, but DON'T care about the order? > They need to be two completely different flags which you can choose > to combine, or use individually. If you have a use case for that, we can easily support it as well... Depending on the drive capabilities (FUA support or not), it may be nearly as slow as a "real" barrier write. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 19:00 ` Jens Axboe @ 2007-05-31 19:21 ` david 2007-05-31 19:40 ` Jens Axboe 0 siblings, 1 reply; 102+ messages in thread From: david @ 2007-05-31 19:21 UTC (permalink / raw) To: Jens Axboe Cc: Tejun Heo, David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel, Andreas Dilger On Thu, 31 May 2007, Jens Axboe wrote: > On Thu, May 31 2007, Phillip Susi wrote: >> David Chinner wrote: >>> That sounds like a good idea - we can leave the existing >>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED >>> behaviour that only guarantees ordering. The filesystem can then >>> choose which to use where appropriate.... >> >> So what if you want a synchronous write, but DON'T care about the order? >> They need to be two completely different flags which you can choose >> to combine, or use individually. > > If you have a use case for that, we can easily support it as well... > Depending on the drive capabilities (FUA support or not), it may be > nearly as slow as a "real" barrier write. true, but a "real" barrier write could have significant side effects on other writes that wouldn't happen with a synchronous write (a sync write can have other, unrelated writes re-ordered around it, a barrier write can't) David Lang ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 19:21 ` david @ 2007-05-31 19:40 ` Jens Axboe 0 siblings, 0 replies; 102+ messages in thread From: Jens Axboe @ 2007-05-31 19:40 UTC (permalink / raw) To: david Cc: Phillip Susi, David Chinner, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31 2007, david@lang.hm wrote: > On Thu, 31 May 2007, Jens Axboe wrote: > > >On Thu, May 31 2007, Phillip Susi wrote: > >>David Chinner wrote: > >>>That sounds like a good idea - we can leave the existing > >>>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >>>behaviour that only guarantees ordering. The filesystem can then > >>>choose which to use where appropriate.... > >> > >>So what if you want a synchronous write, but DON'T care about the order? > >> They need to be two completely different flags which you can choose > >>to combine, or use individually. > > > >If you have a use case for that, we can easily support it as well... > >Depending on the drive capabilities (FUA support or not), it may be > >nearly as slow as a "real" barrier write. > > true, but a "real" barrier write could have significant side effects on > other writes that wouldn't happen with a synchronous wrote (a sync wrote > can have other, unrelated writes re-ordered around it, a barrier write > can't) That is true, the sync write also has side effects at the drive side since it may have a varied cost depending on the workload (eg what already resides in the cache when it is issued), unless FUA is active. That is also true for the barrier of course, but only for previously submitted IO as we don't reorder. I'm not saying that a SYNC write wont be potentially useful, just that it's definitely not free even outside of the write itself. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 18:31 ` Phillip Susi 2007-05-31 19:00 ` Jens Axboe @ 2007-05-31 23:34 ` David Chinner 2007-06-01 5:59 ` Neil Brown 1 sibling, 1 reply; 102+ messages in thread From: David Chinner @ 2007-05-31 23:34 UTC (permalink / raw) To: Phillip Susi Cc: David Chinner, Jens Axboe, david, Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote: > David Chinner wrote: > >That sounds like a good idea - we can leave the existing > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >behaviour that only guarantees ordering. The filesystem can then > >choose which to use where appropriate.... > > So what if you want a synchronous write, but DON'T care about the order? submit_bio(WRITE_SYNC, bio); Already there, already used by XFS, JFS and direct I/O. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 23:34 ` David Chinner @ 2007-06-01 5:59 ` Neil Brown 2007-06-01 6:11 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 102+ messages in thread From: Neil Brown @ 2007-06-01 5:59 UTC (permalink / raw) To: David Chinner Cc: Phillip Susi, Jens Axboe, david, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Friday June 1, dgc@sgi.com wrote: > On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote: > > David Chinner wrote: > > >That sounds like a good idea - we can leave the existing > > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > > >behaviour that only guarantees ordering. The filesystem can then > > >choose which to use where appropriate.... > > > > So what if you want a synchronous write, but DON'T care about the order? > > submit_bio(WRITE_SYNC, bio); > > Already there, already used by XFS, JFS and direct I/O. Are you sure? You seem to be saying that WRITE_SYNC causes the write to be safe on media before the request returns. That isn't my understanding. I think (from comments near the definition and a quick grep through the code) that WRITE_SYNC expedites the delivery of the request through the elevator, but doesn't do anything special about getting it onto the media. It essentially say "Submit this request now, don't wait for more request to bundle with it for better bandwidth utilisation" NeilBrown ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 5:59 ` Neil Brown @ 2007-06-01 6:11 ` Jens Axboe 2007-06-01 7:53 ` David Chinner 2007-06-01 23:56 ` Bill Davidsen 2 siblings, 0 replies; 102+ messages in thread From: Jens Axboe @ 2007-06-01 6:11 UTC (permalink / raw) To: Neil Brown Cc: David Chinner, Phillip Susi, david, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Fri, Jun 01 2007, Neil Brown wrote: > On Friday June 1, dgc@sgi.com wrote: > > On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote: > > > David Chinner wrote: > > > >That sounds like a good idea - we can leave the existing > > > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > > > >behaviour that only guarantees ordering. The filesystem can then > > > >choose which to use where appropriate.... > > > > > > So what if you want a synchronous write, but DON'T care about the order? > > > > submit_bio(WRITE_SYNC, bio); > > > > Already there, already used by XFS, JFS and direct I/O. > > Are you sure? > > You seem to be saying that WRITE_SYNC causes the write to be safe on > media before the request returns. That isn't my understanding. > I think (from comments near the definition and a quick grep through > the code) that WRITE_SYNC expedites the delivery of the request > through the elevator, but doesn't do anything special about getting it > onto the media. > It essentially say "Submit this request now, don't wait for more > request to bundle with it for better bandwidth utilisation" That is exactly right. WRITE_SYNC doesn't give any integrity guarentees, it's just makes sure it goes straight through the io scheduler. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
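So a caller that wants both the expedited submission and an on-media guarantee has to combine the two pieces itself; a minimal sketch, assuming blkdev_issue_flush() with the (bdev, error_sector) signature of this era -- wait_on_my_bio() is a stand-in for whatever completion mechanism the caller already has, not a real interface:

    #include <linux/fs.h>
    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static int sync_write_and_make_stable(struct block_device *bdev,
                                          struct bio *bio)
    {
            /* fast path through the io scheduler, no integrity guarantee */
            submit_bio(WRITE_SYNC, bio);
            wait_on_my_bio(bio);            /* hypothetical: wait for bi_end_io */

            /* durability is a separate step: push the write cache out */
            return blkdev_issue_flush(bdev, NULL);
    }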
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 5:59 ` Neil Brown 2007-06-01 6:11 ` Jens Axboe @ 2007-06-01 7:53 ` David Chinner 2007-06-01 23:56 ` Bill Davidsen 2 siblings, 0 replies; 102+ messages in thread From: David Chinner @ 2007-06-01 7:53 UTC (permalink / raw) To: Neil Brown Cc: David Chinner, Phillip Susi, Jens Axboe, david, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo On Fri, Jun 01, 2007 at 03:59:51PM +1000, Neil Brown wrote: > On Friday June 1, dgc@sgi.com wrote: > > On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote: > > > David Chinner wrote: > > > >That sounds like a good idea - we can leave the existing > > > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > > > >behaviour that only guarantees ordering. The filesystem can then > > > >choose which to use where appropriate.... > > > > > > So what if you want a synchronous write, but DON'T care about the order? > > > > submit_bio(WRITE_SYNC, bio); > > > > Already there, already used by XFS, JFS and direct I/O. > > Are you sure? > > You seem to be saying that WRITE_SYNC causes the write to be safe on > media before the request returns. Sorry, I wasn't really all that clear :/ What I'm saying the *interface* for higher layer to tell the block layers that a sync write is being executed is already there. i.e. we can already tell the block layer that we are doing a synchronous I/O. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 5:59 ` Neil Brown 2007-06-01 6:11 ` Jens Axboe 2007-06-01 7:53 ` David Chinner @ 2007-06-01 23:56 ` Bill Davidsen 2 siblings, 0 replies; 102+ messages in thread From: Bill Davidsen @ 2007-06-01 23:56 UTC (permalink / raw) To: Neil Brown Cc: David Chinner, Phillip Susi, Jens Axboe, david, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Stefan Bader, Andreas Dilger, Tejun Heo Neil Brown wrote: > On Friday June 1, dgc@sgi.com wrote: > >> On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote: >> >>> David Chinner wrote: >>> >>>> That sounds like a good idea - we can leave the existing >>>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED >>>> behaviour that only guarantees ordering. The filesystem can then >>>> choose which to use where appropriate.... >>>> >>> So what if you want a synchronous write, but DON'T care about the order? >>> >> submit_bio(WRITE_SYNC, bio); >> >> Already there, already used by XFS, JFS and direct I/O. >> > > Are you sure? > > You seem to be saying that WRITE_SYNC causes the write to be safe on > media before the request returns. That isn't my understanding. > I think (from comments near the definition and a quick grep through > the code) that WRITE_SYNC expedites the delivery of the request > through the elevator, but doesn't do anything special about getting it > onto the media. My impression is that the sync will return when the i/o has been delivered to the device, and will get special treatment by the elevator code (I looked quickly, more is needed). I'm sure someone will tell me if I misread this. ;-) -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 0:20 ` David Chinner 2007-05-31 6:26 ` Jens Axboe @ 2007-05-31 18:24 ` Phillip Susi 1 sibling, 0 replies; 102+ messages in thread From: Phillip Susi @ 2007-05-31 18:24 UTC (permalink / raw) To: David Chinner Cc: david, Tejun Heo, linux-kernel, linux-raid, dm-devel, Jens Axboe, linux-fsdevel, Andreas Dilger David Chinner wrote: >> you are understanding barriers to be the same as syncronous writes. (and >> therefor the data is on persistant media before the call returns) > > No, I'm describing the high level behaviour that is expected by > a filesystem. The reasons for this are below.... You say no, but then you go on to contradict yourself below. > Ok, that's my understanding of how *device based barriers* can work, > but there's more to it than that. As far as the filesystem is > concerned the barrier write needs to *behave* exactly like a sync > write because of the guarantees the filesystem has to provide > userspace. Specifically - sync, sync writes and fsync. There, you just ascribed the synchronous property to barrier requests. This is false. Barriers are about ordering, synchronous writes are another thing entirely. The filesystem is supposed to use barriers to maintain ordering for journal data. If you are trying to handle a synchronous write request, that's another flag. > This is the big problem, right? If we use barriers for commit > writes, the filesystem can return to userspace after a sync write or > fsync() and an *ordered barrier device implementation* may not have > written the blocks to persistent media. If we then pull the plug on > the box, we've just lost data that sync or fsync said was > successfully on disk. That's BAD. That's why for synchronous writes, you set the flag to mark the request as synchronous, which has nothing at all to do with barriers. You are trying to use barriers to solve two different problems. Use one flag to indicate ordering, and another to indicate synchronisity. > Right now a barrier write on the last block of the fsync/sync write > is sufficient to prevent that because of the FUA on the barrier > block write. A purely ordered barrier implementation does not > provide this guarantee. This is a side effect of the implementation of the barrier, not part of the semantics of barriers, so you shouldn't rely on this behavior. You don't have to use FUA to handle the barrier request, and if you don't, then the request can be completed while the data is still in the write cache. You just have to make sure to flush it before any subsequent requests. > IOWs, there are two parts to the problem: > > 1 - guaranteeing I/O ordering > 2 - guaranteeing blocks are on persistent storage. > > Right now, a single barrier I/O is used to provide both of these > guarantees. In most cases, all we really need to provide is 1); the > need for 2) is a much rarer condition but still needs to be > provided. Yep... two problems... two flags. > Yes, if we define a barrier to only guarantee 1), then yes this > would be a big win (esp. for XFS). But that requires all filesystems > to handle sync writes differently, and sync_blockdev() needs to > call blkdev_issue_flush() as well.... > > So, what do we do here? Do we define a barrier I/O to only provide > ordering, or do we define it to also provide persistent storage > writeback? Whatever we decide, it needs to be documented.... 
We do the former or we end up in the same boat as O_DIRECT; where you have one flag that means several things, and no way to specify you only need some of those and not the others. ^ permalink raw reply [flat|nested] 102+ messages in thread
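Purely as an illustration of the "two flags" argument (none of these names exist in the tree, and the bit values are picked arbitrarily for the sketch):

    #include <linux/bio.h>

    /* invented bits -- not in the tree, shown only to make the point */
    #define BIO_RW_ORDERED  6
    #define BIO_RW_DURABLE  7

    #define MY_WRITE_ORDERED ((1 << BIO_RW) | (1 << BIO_RW_ORDERED)) /* 1) ordering only */
    #define MY_WRITE_DURABLE ((1 << BIO_RW) | (1 << BIO_RW_DURABLE)) /* 2) on media before completion */
    #define MY_WRITE_BOTH    (MY_WRITE_ORDERED | MY_WRITE_DURABLE)   /* roughly today's WRITE_BARRIER */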
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-29 23:48 ` David Chinner 2007-05-30 0:01 ` david @ 2007-05-30 16:45 ` Phillip Susi 2007-05-30 20:27 ` [dm-devel] " Phillip Susi 1 sibling, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-30 16:45 UTC (permalink / raw) To: David Chinner Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, Stefan Bader, Andreas Dilger, Tejun Heo David Chinner wrote: >> Barrier != synchronous write, > > Of course. FYI, XFS only issues barriers on *async* writes. > > But barrier semantics - as far as they've been described by everyone > but you indicate that the barrier write is guaranteed to be on stable > storage when it returns. Hrm... I may have misunderstood the perspective you were talking from. Yes, when the bio is completed it must be on the media, but the filesystem should issue both requests, and then really not care when they complete. That is to say, the filesystem should not wait for block A to finish before issuing block B; it should issue both, and use barriers to make sure they hit the disk in the correct order. > XFS relies on the block being stable before any other write > goes to disk. That is the semantic that the barrier I/Os currently > have. How that is implemented in the device is irrelevant to me, > but if I issue a barrier I/O, I do not expect *any* I/O to be > reordered around it. Right... it just needs to control the order of the requests, just not wait on one to finish before issuing the next. ^ permalink raw reply [flat|nested] 102+ messages in thread
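The pattern Phillip is describing would look something like this from the filesystem's side; a sketch only, with journal_bios[] and commit_bio standing in for whatever buffers the filesystem has prepared:

    #include <linux/bio.h>
    #include <linux/fs.h>

    static void submit_transaction(struct bio **journal_bios, int nr,
                                   struct bio *commit_bio)
    {
            int i;

            /* journal blocks: plain async writes, no waiting in between */
            for (i = 0; i < nr; i++)
                    submit_bio(WRITE, journal_bios[i]);

            /* commit block: the barrier keeps it behind everything above,
             * without the filesystem blocking on any of the completions */
            submit_bio(WRITE_BARRIER, commit_bio);
    }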
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 16:45 ` Phillip Susi @ 2007-05-30 20:27 ` Phillip Susi 2007-05-31 6:24 ` Jens Axboe 0 siblings, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-30 20:27 UTC (permalink / raw) To: device-mapper development Cc: David Chinner, Tejun Heo, linux-kernel, linux-raid, Jens Axboe, linux-fsdevel, Andreas Dilger, Stefan Bader Phillip Susi wrote: > Hrm... I may have misunderstood the perspective you were talking from. > Yes, when the bio is completed it must be on the media, but the > filesystem should issue both requests, and then really not care when > they complete. That is to say, the filesystem should not wait for block > A to finish before issuing block B; it should issue both, and use > barriers to make sure they hit the disk in the correct order. Actually now that I think about it, that wasn't correct. The request CAN be completed before the data has hit the medium. The barrier just constrains the ordering of the writes, but they can still sit in the disk write back cache for some time. Stefan Bader wrote: > That would be the exactly how I understand Documentation/block/barrier.txt: > > "In other words, I/O barrier requests have the following two properties. > 1. Request ordering > ... > 2. Forced flushing to physical medium" > > "So, I/O barriers need to guarantee that requests actually get written > to non-volatile medium in order." I think you misinterpret this, and it probably could be worded a bit better. The barrier request is about constraining the order. The forced flushing is one means to implement that constraint. The other alternative mentioned there is to use ordered tags. The key part there is "requests actually get written to non-volatile medium _in order_", not "before the request completes", which would be synchronous IO. ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 20:27 ` [dm-devel] " Phillip Susi @ 2007-05-31 6:24 ` Jens Axboe 2007-05-31 18:37 ` [dm-devel] " Phillip Susi 0 siblings, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-05-31 6:24 UTC (permalink / raw) To: Phillip Susi Cc: Tejun Heo, David Chinner, linux-kernel, linux-raid, device-mapper development, linux-fsdevel, Andreas Dilger On Wed, May 30 2007, Phillip Susi wrote: > >That would be the exactly how I understand Documentation/block/barrier.txt: > > > >"In other words, I/O barrier requests have the following two properties. > >1. Request ordering > >... > >2. Forced flushing to physical medium" > > > >"So, I/O barriers need to guarantee that requests actually get written > >to non-volatile medium in order." > > I think you misinterpret this, and it probably could be worded a bit > better. The barrier request is about constraining the order. The > forced flushing is one means to implement that constraint. The other > alternative mentioned there is to use ordered tags. The key part there > is "requests actually get written to non-volatile medium _in order_", > not "before the request completes", which would be synchronous IO. No Stephan is right, the barrier is both an ordering and integrity constraint. If a driver completes a barrier request before that request and previously submitted requests are on STABLE storage, then it violates that principle. Look at the code and the various ordering options. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 6:24 ` Jens Axboe @ 2007-05-31 18:37 ` Phillip Susi 2007-05-31 18:58 ` Jens Axboe 0 siblings, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-31 18:37 UTC (permalink / raw) To: Jens Axboe Cc: device-mapper development, David Chinner, Tejun Heo, linux-kernel, linux-raid, linux-fsdevel, Andreas Dilger, Stefan Bader Jens Axboe wrote: > No Stephan is right, the barrier is both an ordering and integrity > constraint. If a driver completes a barrier request before that request > and previously submitted requests are on STABLE storage, then it > violates that principle. Look at the code and the various ordering > options. I am saying that is the wrong thing to do. Barrier should be about ordering only. So long as the order they hit the media is maintained, the order the requests are completed in can change. barrier.txt bears this out: "Requests in ordered sequence are issued in order, but not required to finish in order. Barrier implementation can handle out-of-order completion of ordered sequence. IOW, the requests MUST be processed in order but the hardware/software completion paths are allowed to reorder completion notifications - eg. current SCSI midlayer doesn't preserve completion order during error handling." ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 18:37 ` [dm-devel] " Phillip Susi @ 2007-05-31 18:58 ` Jens Axboe 2007-06-02 0:04 ` Bill Davidsen 0 siblings, 1 reply; 102+ messages in thread From: Jens Axboe @ 2007-05-31 18:58 UTC (permalink / raw) To: Phillip Susi Cc: device-mapper development, David Chinner, Tejun Heo, linux-kernel, linux-raid, linux-fsdevel, Andreas Dilger, Stefan Bader On Thu, May 31 2007, Phillip Susi wrote: > Jens Axboe wrote: > >No Stephan is right, the barrier is both an ordering and integrity > >constraint. If a driver completes a barrier request before that request > >and previously submitted requests are on STABLE storage, then it > >violates that principle. Look at the code and the various ordering > >options. > > I am saying that is the wrong thing to do. Barrier should be about > ordering only. So long as the order they hit the media is maintained, > the order the requests are completed in can change. barrier.txt bears But you can't guarentee ordering without flushing the data out as well. It all depends on the type of cache on the device, of course. If you look at the ordinary sata/ide drive with write back caching, you can't just issue the requests in order and pray that the drive cache will make it to platter. If you don't have write back caching, or if the cache is battery backed and thus guarenteed to never be lost, maintaining order is naturally enough. Or if the drive can do ordered queued commands, you can relax the flushing (again depending on the cache type, you may need to take different paths). > "Requests in ordered sequence are issued in order, but not required to > finish in order. Barrier implementation can handle out-of-order > completion of ordered sequence. IOW, the requests MUST be processed in > order but the hardware/software completion paths are allowed to reorder > completion notifications - eg. current SCSI midlayer doesn't preserve > completion order during error handling." If you carefully re-read that paragraph, then it just tells you that the software implementation can deal with reordered completions. It doesn't relax the rconstraints on ordering and integrity AT ALL. -- Jens Axboe ^ permalink raw reply [flat|nested] 102+ messages in thread
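Spelled out, the flush-based path Jens describes amounts to the steps below; this is not the actual ordered-sequence machinery in the block layer, just the same sequence written with blkdev_issue_flush() and submit_bio() to show why ordering and flushing end up tied together on a write-back cache without FUA:

    #include <linux/fs.h>
    #include <linux/blkdev.h>

    static int emulate_flush_barrier(struct block_device *bdev,
                                     struct bio *barrier_bio)
    {
            int ret;

            /* pre-flush: everything submitted earlier hits the platter */
            ret = blkdev_issue_flush(bdev, NULL);
            if (ret)
                    return ret;

            /* the barrier write itself (FUA would let us skip the post-flush) */
            submit_bio(WRITE, barrier_bio);
            /* ... wait for barrier_bio to complete ... */

            /* post-flush: now the barrier write itself is on the platter */
            return blkdev_issue_flush(bdev, NULL);
    }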
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 18:58 ` Jens Axboe @ 2007-06-02 0:04 ` Bill Davidsen 0 siblings, 0 replies; 102+ messages in thread From: Bill Davidsen @ 2007-06-02 0:04 UTC (permalink / raw) To: Jens Axboe Cc: Phillip Susi, device-mapper development, David Chinner, Tejun Heo, linux-kernel, linux-raid, linux-fsdevel, Andreas Dilger, Stefan Bader Jens Axboe wrote: > On Thu, May 31 2007, Phillip Susi wrote: > >> Jens Axboe wrote: >> >>> No Stephan is right, the barrier is both an ordering and integrity >>> constraint. If a driver completes a barrier request before that request >>> and previously submitted requests are on STABLE storage, then it >>> violates that principle. Look at the code and the various ordering >>> options. >>> >> I am saying that is the wrong thing to do. Barrier should be about >> ordering only. So long as the order they hit the media is maintained, >> the order the requests are completed in can change. barrier.txt bears >> > > But you can't guarentee ordering without flushing the data out as well. > It all depends on the type of cache on the device, of course. If you > look at the ordinary sata/ide drive with write back caching, you can't > just issue the requests in order and pray that the drive cache will make > it to platter. > > If you don't have write back caching, or if the cache is battery backed > and thus guarenteed to never be lost, maintaining order is naturally > enough. > Do I misread this? If ordered doesn't reach all the way to the platter then there will be failure modes which result in order not being preserved. Battery backed cache doesn't prevent failures between the cache and the platter. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 1:30 ` Neil Brown 2007-05-28 2:45 ` David Chinner @ 2007-05-28 9:29 ` Tejun Heo 2007-05-28 9:43 ` Alasdair G Kergon ` (2 subsequent siblings) 4 siblings, 0 replies; 102+ messages in thread From: Tejun Heo @ 2007-05-28 9:29 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger Hello, Neil Brown wrote: > 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. > > This is certainly a very attractive position - it makes the interface > cleaner and makes life easier for filesystems and other clients of > the block interface. > Currently filesystems handle -EOPNOTSUP by > a/ resubmitting the request without the BARRIER (after waiting for > earlier requests to complete) and > b/ possibly printing an error message to the kernel logs. > > The block layer can do both of these just as easily and it does make > sense to do it there. Yeah, I think doing all the above in the block layer is the cleanest way to solve this. If write back cache && flush doesn't work, barrier is bound to fail but block layer still can write the barrier block as requested (without actual barriering), whine about it to the user, and tell the FS that barrier is failed but the write itself went through, so that FS can go on without caring about it unless it wants to. > md/dm modules could keep count of requests as has been suggested > (though that would be a fairly big change for raid0 as it currently > doesn't know when a request completes - bi_endio goes directly to the > filesystem). > However I think the idea of a zero-length BIO_RW_BARRIER would be a > good option. raid0 could send one of these down each device, and > when they all return, the barrier request can be sent to it's target > device(s). Yeap. > 2/ Maybe barriers provide stronger semantics than are required. > > All write requests are synchronised around a barrier write. This is > often more than is required and apparently can cause a measurable > slowdown. > > Also the FUA for the actual commit write might not be needed. It is > important for consistency that the preceding writes are in safe > storage before the commit write, but it is not so important that the > commit write is immediately safe on storage. That isn't needed until > a 'sync' or 'fsync' or similar. > > One possible alternative is: > - writes can overtake barriers, but barrier cannot overtake writes. > - flush before the barrier, not after. I think we can give this property to zero length barriers. > This is considerably weaker, and hence cheaper. But I think it is > enough for all filesystems (providing it is still an option to call > blkdev_issue_flush on 'fsync'). > > Another alternative would be to tag each bio was being in a > particular barrier-group. Then bio's in different groups could > overtake each other in either direction, but a BARRIER request must > be totally ordered w.r.t. other requests in the barrier group. > This would require an extra bio field, and would give the filesystem > more appearance of control. I'm not yet sure how much it would > really help... > It would allow us to set FUA on all bios with a non-zero > barrier-group. That would mean we don't have to flush the entire > cache, just those blocks that are critical.... but I'm still not sure > it's a good idea. 
Barrier code as it currently stands deals with two colors so there can be only one outstanding barrier at given moment. Expanding it to deal with multiple colors and then to multiple simultaneous groups will take some work but is definitely possible. If FS people can make good use of it, I think it would be worthwhile. > Of course, these weaker rules would only apply inside the elevator. > Once the request goes to the device we need to work with what the > device provides, which probably means total-ordering around the > barrier. Yeah, on device side, the best we can do most of the time is full flush but as long as request queue depth is much deeper than the controller/device one, having multiple barrier groups can be helpful. We need more input from FS people, I think. > 3/ Do we need explicit control of the 'ordered' mode? > > Consider a SCSI device that has NV RAM cache. mode_sense reports > that write-back is enabled, so _FUA or _FLUSH will be used. > But as it is *NV* ram, QUEUE_ORDER_DRAIN is really the best mode. > But it seems there is no way to query this information. > Using _FLUSH causes the NVRAM to be flushed to media which is a > terrible performance problem. If the NV RAM can be reliably detected using one of the inquiry pages, sd driver can switch it to DRAIN automatically. > Setting SYNC_NV doesn't work on the particular device in question. > We currently tell customers to mount with -o nobarriers, but that > really feels like the wrong solution. We should be telling the scsi > device "don't flush". > An advantage of 'nobarriers' is it can go in /etc/fstab. Where > would you record that a SCSI drive should be set to > QUEUE_ORDERD_DRAIN ?? How about exporting ordered mode as sysfs attribute and configuring it using a udev rule? It's a device property after all. Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
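The "switch it to DRAIN automatically" idea might look roughly like this inside sd, assuming the caching mode page has already been parsed into the wce/nv flags below; my_prepare_flush() stands in for sd's own flush-prepare callback, and nothing here is actual sd code:

    #include <linux/blkdev.h>

    static void my_prepare_flush(struct request_queue *q, struct request *rq)
    {
            /* build a SYNCHRONIZE CACHE command for the flush request */
    }

    static void choose_ordered_mode(struct request_queue *q, int wce, int nv)
    {
            if (!wce || nv)
                    /* write-through, or a cache that survives power loss:
                     * draining the queue is enough, no flush needed */
                    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
            else
                    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
                                      my_prepare_flush);
    }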
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 1:30 ` Neil Brown 2007-05-28 2:45 ` David Chinner 2007-05-28 9:29 ` Tejun Heo @ 2007-05-28 9:43 ` Alasdair G Kergon 2007-05-29 9:25 ` [dm-devel] " Stefan Bader 2007-05-29 19:59 ` Phillip Susi 2007-05-30 9:35 ` Jens Axboe 4 siblings, 1 reply; 102+ messages in thread From: Alasdair G Kergon @ 2007-05-28 9:43 UTC (permalink / raw) To: device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: > 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. The device-mapper position has always been that we require > a zero-length BIO_RW_BARRIER (i.e. containing no data to read or write - or emulated, possibly device-specific) before we can provide full barrier support. (Consider multiple active paths - each must see barrier.) Until every device supports barriers -EOPNOTSUP support is required. (Consider reconfiguration of stacks of devices - barrier support is a dynamic block device property that can switch between available and unavailable at any time.) Alasdair -- agk@redhat.com ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 9:43 ` Alasdair G Kergon @ 2007-05-29 9:25 ` Stefan Bader 2007-05-29 22:05 ` Alasdair G Kergon 0 siblings, 1 reply; 102+ messages in thread From: Stefan Bader @ 2007-05-29 9:25 UTC (permalink / raw) To: device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo 2007/5/28, Alasdair G Kergon <agk@redhat.com>: > On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: > > 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. > > The device-mapper position has always been that we require > > > a zero-length BIO_RW_BARRIER > > (i.e. containing no data to read or write - or emulated, possibly > device-specific) > > before we can provide full barrier support. > (Consider multiple active paths - each must see barrier.) > Couldn't the same be achieved by doing a sort of suspend, issuing the barrier request, calling flush on all mapped devices and then waiting for in-flight I/O to go to zero? This certainly has the aspect of performance degradation (but that seems to be a generic problem with barriers not being specific enough). > Until every device supports barriers -EOPNOTSUP support is required. > (Consider reconfiguration of stacks of devices - barrier support is a > dynamic block device property that can switch between available and > unavailable at any time.) > That is only an issue if barrier handling is not done in dm. In that case the support in the devices is helpful but not required. For something else: Alasdair, I am not a hundred percent sure about that but I think that just passing the barrier flag on to mapped devices might in some (maybe they are rare) cases cause a layer above to think all data is on-disk while this isn't necessarily true (see my previous post). What do you think? Stefan ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-29 9:25 ` [dm-devel] " Stefan Bader @ 2007-05-29 22:05 ` Alasdair G Kergon 2007-05-30 9:12 ` [dm-devel] " Stefan Bader 0 siblings, 1 reply; 102+ messages in thread From: Alasdair G Kergon @ 2007-05-29 22:05 UTC (permalink / raw) To: Stefan Bader Cc: Tejun Heo, David Chinner, linux-kernel, linux-raid, device-mapper development, Jens Axboe, linux-fsdevel, Andreas Dilger On Tue, May 29, 2007 at 11:25:42AM +0200, Stefan Bader wrote: > doing a sort of suspend, issuing the > barrier request, calling flush to all mapped devices and then wait for > in-flight I/O to go to zero? Something like that is needed for some dm targets to support barriers. (We needn't always wait for *all* in-flight I/O.) When faced with -EOPNOTSUP, do all callers fall back to a sync in the places a barrier would have been used, or are there any more sophisticated strategies attempting to optimise code without barriers? > I am not a hundred percent sure about > that but I think that just passing the barrier flag on to mapped > devices might in some (maybe they are rare) cases cause a layer above > to think all data is on-disk while this isn't necessarily true (see my > previous post). What do you think? An efficient I/O barrier implementation would not normally involve flushing AFAIK: dm surely wouldn't "cause" a higher layer to assume stronger semantics than are provided. Alasdair -- agk@redhat.com ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-29 22:05 ` Alasdair G Kergon @ 2007-05-30 9:12 ` Stefan Bader 2007-05-30 10:41 ` Alasdair G Kergon 2007-05-30 16:55 ` Phillip Susi 0 siblings, 2 replies; 102+ messages in thread From: Stefan Bader @ 2007-05-30 9:12 UTC (permalink / raw) To: Stefan Bader, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo > > in-flight I/O to go to zero? > > Something like that is needed for some dm targets to support barriers. > (We needn't always wait for *all* in-flight I/O.) > When faced with -EOPNOTSUP, do all callers fall back to a sync in > the places a barrier would have been used, or are there any more > sophisticated strategies attempting to optimise code without barriers? > If I didn't misunderstand the idea is that no caller will face an -EOPNOTSUPP in future. IOW every layer or driver somehow makes sure the right thing happens. > > An efficient I/O barrier implementation would not normally involve > flushing AFAIK: dm surely wouldn't "cause" a higher layer to assume > stronger semantics than are provided. > Seems there are at least two assumptions about what the semantics exactly _are_. Based on Documentation/block/barriers.txt I understand a barrier implies ordering and flushing. But regardless of that, assume the (admittedly constructed) following case: You got a linear target that consists of two disks. One drive (a) supports barriers and the other one (b) doesn't. Device-mapper just maps the requests to the appropriate disk. Now the following sequence happens: 1. block x gets mapped to drive b 2. block y (with barrier) gets mapped to drive a Since drive a supports barrier request we don't get -EOPNOTSUPP but the request with block y might get written before block x since the disk are independent. I guess the chances of this are quite low since at some point a barrier request will also hit drive b but for the time being it might be better to indicate -EOPNOTSUPP right from device-mapper. Stefan ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 9:12 ` [dm-devel] " Stefan Bader @ 2007-05-30 10:41 ` Alasdair G Kergon 2007-05-30 16:55 ` Phillip Susi 1 sibling, 0 replies; 102+ messages in thread From: Alasdair G Kergon @ 2007-05-30 10:41 UTC (permalink / raw) To: device-mapper development Cc: Tejun Heo, Stefan Bader, David Chinner, linux-kernel, linux-raid, Jens Axboe, linux-fsdevel, Andreas Dilger On Wed, May 30, 2007 at 11:12:37AM +0200, Stefan Bader wrote: > it might be better to indicate -EOPNOTSUPP right from > device-mapper. Indeed we should. For support, on receipt of a barrier, dm core should send a zero-length barrier to all active underlying paths, and delay mapping any further I/O. Alasdair -- agk@redhat.com ^ permalink raw reply [flat|nested] 102+ messages in thread
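A sketch of that fan-out, with an invented issue_empty_barrier_async() helper that submits a zero-length BIO_RW_BARRIER bio whose completion calls fanout_end_io(); none of this is dm code, it only illustrates the "send to all paths, wait, then release queued I/O" sequence:

    #include <linux/bio.h>
    #include <linux/completion.h>
    #include <asm/atomic.h>

    struct barrier_fanout {
            atomic_t pending;
            struct completion done;
    };

    /* hypothetical helper: zero-length WRITE_BARRIER to bdev, its bi_end_io
     * calls fanout_end_io(bf) */
    void issue_empty_barrier_async(struct block_device *bdev,
                                   struct barrier_fanout *bf);

    static void fanout_end_io(struct barrier_fanout *bf)
    {
            if (atomic_dec_and_test(&bf->pending))
                    complete(&bf->done);
    }

    static void propagate_barrier(struct block_device **devs, int nr_devs)
    {
            struct barrier_fanout bf;
            int i;

            atomic_set(&bf.pending, nr_devs);
            init_completion(&bf.done);

            for (i = 0; i < nr_devs; i++)
                    issue_empty_barrier_async(devs[i], &bf);

            /* hold back the original barrier and anything mapped after it
             * until every underlying device has completed its barrier */
            wait_for_completion(&bf.done);
    }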
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 9:12 ` [dm-devel] " Stefan Bader 2007-05-30 10:41 ` Alasdair G Kergon @ 2007-05-30 16:55 ` Phillip Susi 2007-05-31 11:14 ` [dm-devel] " Stefan Bader 1 sibling, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-30 16:55 UTC (permalink / raw) To: Stefan Bader Cc: Tejun Heo, Stefan Bader, David Chinner, linux-kernel, linux-raid, device-mapper development, Jens Axboe, linux-fsdevel, Andreas Dilger Stefan Bader wrote: > You got a linear target that consists of two disks. One drive (a) > supports barriers and the other one (b) doesn't. Device-mapper just > maps the requests to the appropriate disk. Now the following sequence > happens: > > 1. block x gets mapped to drive b > 2. block y (with barrier) gets mapped to drive a > > Since drive a supports barrier request we don't get -EOPNOTSUPP but > the request with block y might get written before block x since the > disk are independent. I guess the chances of this are quite low since > at some point a barrier request will also hit drive b but for the time > being it might be better to indicate -EOPNOTSUPP right from > device-mapper. The device mapper needs to ensure that ALL underlying devices get a barrier request when one comes down from above, even if it has to construct zero length barriers to send to most of them. ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 16:55 ` Phillip Susi @ 2007-05-31 11:14 ` Stefan Bader 2007-06-01 3:25 ` Tejun Heo 0 siblings, 1 reply; 102+ messages in thread From: Stefan Bader @ 2007-05-31 11:14 UTC (permalink / raw) To: Phillip Susi Cc: device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger, Tejun Heo 2007/5/30, Phillip Susi <psusi@cfl.rr.com>: > Stefan Bader wrote: > > > > Since drive a supports barrier request we don't get -EOPNOTSUPP but > > the request with block y might get written before block x since the > > disk are independent. I guess the chances of this are quite low since > > at some point a barrier request will also hit drive b but for the time > > being it might be better to indicate -EOPNOTSUPP right from > > device-mapper. > > The device mapper needs to ensure that ALL underlying devices get a > barrier request when one comes down from above, even if it has to > construct zero length barriers to send to most of them. > And somehow also make sure all of the barriers have been processed before returning the barrier that came in. Plus it would have to queue all mapping requests until the barrier is done (if strictly acting according to barrier.txt). But I am wondering a bit whether the requirements to barriers are really that tight as described in Tejun's document (barrier request is only started if everything before is safe, the barrier itself isn't returned until it is safe, too, and all requests after the barrier aren't started before the barrier is done). Is it really necessary to defer any further requests until the barrier has been written to save storage? Or would it be sufficient to guarantee that, if a barrier request returns, everything up to (including the barrier) is on safe storage? Stefan ^ permalink raw reply [flat|nested] 102+ messages in thread
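What Alasdair and Phillip describe, dm fanning a zero-length barrier out to every active underlying device, waiting for all of them, and only then completing the original barrier and resuming mapping, might look roughly like the sketch below. It is only a sketch under assumptions: the flush_ctx structure, the flush_all_paths() name and the bdevs[] array are made up for illustration (the real dm core walks its own table of targets and devices), and the empty-bio mechanics simply mirror the blkdev_issue_flush() rework in Jens' patch further down the thread. As Stefan notes, dm would additionally have to hold back further mapping until this completes.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>
#include <asm/atomic.h>

struct flush_ctx {
	atomic_t		pending;	/* zero-length barriers still in flight */
	int			error;		/* last error seen, e.g. -EOPNOTSUPP */
	struct completion	done;
};

static int empty_barrier_end_io(struct bio *bio, unsigned int bytes_done,
				int err)
{
	struct flush_ctx *ctx = bio->bi_private;

	if (err)
		ctx->error = err;
	if (atomic_dec_and_test(&ctx->pending))
		complete(&ctx->done);
	bio_put(bio);
	return 0;
}

/*
 * Send a zero-length barrier to each underlying device and wait until
 * every one of them has completed.
 */
static int flush_all_paths(struct block_device **bdevs, int nr)
{
	struct flush_ctx ctx;
	int i;

	atomic_set(&ctx.pending, nr);
	ctx.error = 0;
	init_completion(&ctx.done);

	for (i = 0; i < nr; i++) {
		/* mempool-backed GFP_NOIO allocation, does not fail */
		struct bio *bio = bio_alloc(GFP_NOIO, 0);

		bio->bi_bdev = bdevs[i];
		bio->bi_end_io = empty_barrier_end_io;
		bio->bi_private = &ctx;
		/* same empty-barrier submission as blkdev_issue_flush() below */
		submit_bio(1 << BIO_RW_BARRIER, bio);
	}

	wait_for_completion(&ctx.done);
	return ctx.error;
}

The original barrier bio that triggered this would be completed (or remapped and resubmitted) only after flush_all_paths() returns, and any bios queued up in the meantime would be released afterwards.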
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-31 11:14 ` [dm-devel] " Stefan Bader @ 2007-06-01 3:25 ` Tejun Heo 2007-06-01 5:55 ` david 0 siblings, 1 reply; 102+ messages in thread From: Tejun Heo @ 2007-06-01 3:25 UTC (permalink / raw) To: Stefan Bader Cc: David Chinner, linux-kernel, linux-raid, device-mapper development, Jens Axboe, linux-fsdevel, Andreas Dilger Stefan Bader wrote: > 2007/5/30, Phillip Susi <psusi@cfl.rr.com>: >> Stefan Bader wrote: >> > >> > Since drive a supports barrier request we don't get -EOPNOTSUPP but >> > the request with block y might get written before block x since the >> > disk are independent. I guess the chances of this are quite low since >> > at some point a barrier request will also hit drive b but for the time >> > being it might be better to indicate -EOPNOTSUPP right from >> > device-mapper. >> >> The device mapper needs to ensure that ALL underlying devices get a >> barrier request when one comes down from above, even if it has to >> construct zero length barriers to send to most of them. >> > > And somehow also make sure all of the barriers have been processed > before returning the barrier that came in. Plus it would have to queue > all mapping requests until the barrier is done (if strictly acting > according to barrier.txt). > > But I am wondering a bit whether the requirements to barriers are > really that tight as described in Tejun's document (barrier request is > only started if everything before is safe, the barrier itself isn't > returned until it is safe, too, and all requests after the barrier > aren't started before the barrier is done). Is it really necessary to > defer any further requests until the barrier has been written to save > storage? Or would it be sufficient to guarantee that, if a barrier > request returns, everything up to (including the barrier) is on safe > storage? Well, what's described in barrier.txt is the current implemented semantics and what filesystems expect, so we can't change it underneath them but we definitely can introduce new more relaxed variants, but one thing we should bear in mind is that harddisks don't have humongous caches or very smart controller / instruction set. No matter how relaxed interface the block layer provides, in the end, it just has to issue whole-sale FLUSH CACHE on the device to guarantee data ordering on the media. IMHO, we can do better by paying more attention to how we do things in the request queue which can be deeper and more intelligent than the device queue. Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 3:25 ` Tejun Heo @ 2007-06-01 5:55 ` david 2007-06-01 7:16 ` [dm-devel] " Tejun Heo 0 siblings, 1 reply; 102+ messages in thread From: david @ 2007-06-01 5:55 UTC (permalink / raw) To: Tejun Heo Cc: David Chinner, linux-kernel, linux-raid, device-mapper development, Jens Axboe, linux-fsdevel, Andreas Dilger On Fri, 1 Jun 2007, Tejun Heo wrote: > but one > thing we should bear in mind is that harddisks don't have humongous > caches or very smart controller / instruction set. No matter how > relaxed interface the block layer provides, in the end, it just has to > issue whole-sale FLUSH CACHE on the device to guarantee data ordering on > the media. if you are talking about individual drives you may be right for the moment (but 16M cache on drives is a _lot_ larger than people imagined would be there a few years ago) but when you consider the self-contained disk arrays it's an entirely different story. you can easily have a few gig of cache and a complete OS pretending to be a single drive as far as you are concerned. and the price of such devices is plummeting (in large part thanks to Linux moving into this space), you can now readily buy a 10TB array for $10k that looks like a single drive. David Lang ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 5:55 ` david @ 2007-06-01 7:16 ` Tejun Heo 2007-06-01 17:07 ` Valdis.Kletnieks 2007-07-10 18:39 ` Ric Wheeler 0 siblings, 2 replies; 102+ messages in thread From: Tejun Heo @ 2007-06-01 7:16 UTC (permalink / raw) To: david Cc: Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger, ric [ cc'ing Ric Wheeler for storage array thingie. Hi, whole thread is at http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ] Hello, david@lang.hm wrote: > but when you consider the self-contained disk arrays it's an entirely > different story. you can easily have a few gig of cache and a complete > OS pretending to be a single drive as far as you are concerned. > > and the price of such devices is plummeting (in large part thanks to > Linux moving into this space), you can now readily buy a 10TB array for > $10k that looks like a single drive. Don't those thingies usually have NV cache or backed by battery such that ORDERED_DRAIN is enough? The problem is that the interface between the host and a storage device (ATA or SCSI) is not built to communicate that kind of information (grouped flush, relaxed ordering...). I think battery backed ORDERED_DRAIN combined with fine-grained host queue flush would be pretty good. It doesn't require some fancy new interface which isn't gonna be used widely anyway and can achieve most of performance gain if the storage plays it smart. Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 7:16 ` [dm-devel] " Tejun Heo @ 2007-06-01 17:07 ` Valdis.Kletnieks 2007-06-01 18:09 ` Tejun Heo 2007-07-10 18:39 ` Ric Wheeler 1 sibling, 1 reply; 102+ messages in thread From: Valdis.Kletnieks @ 2007-06-01 17:07 UTC (permalink / raw) To: Tejun Heo Cc: david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger, ric [-- Attachment #1: Type: text/plain, Size: 930 bytes --] On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said: > Don't those thingies usually have NV cache or backed by battery such > that ORDERED_DRAIN is enough? Probably *most* do, but do you really want to bet the user's data on it? > The problem is that the interface between the host and a storage device > (ATA or SCSI) is not built to communicate that kind of information > (grouped flush, relaxed ordering...). I think battery backed > ORDERED_DRAIN combined with fine-grained host queue flush would be > pretty good. It doesn't require some fancy new interface which isn't > gonna be used widely anyway and can achieve most of performance gain if > the storage plays it smart. Yes, that would probably be "pretty good". But how do you get the storage device to *reliably* tell the truth about what it actually implements? (Consider the number of devices that downright lie about their implementation of cache flushing....) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 17:07 ` Valdis.Kletnieks @ 2007-06-01 18:09 ` Tejun Heo 0 siblings, 0 replies; 102+ messages in thread From: Tejun Heo @ 2007-06-01 18:09 UTC (permalink / raw) To: Valdis.Kletnieks Cc: david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger, ric Valdis.Kletnieks@vt.edu wrote: > On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said: >> Don't those thingies usually have NV cache or backed by battery such >> that ORDERED_DRAIN is enough? > > Probably *most* do, but do you really want to bet the user's data on it? Thought we were talking about high-end storage stuff. I don't think I'll be too uncomfortable. The reason why we're talking about this at all is because high-end stuff with fancy NV cache and a hunk of battery will unnecessarily suffer from the current barrier implementation. >> The problem is that the interface between the host and a storage device >> (ATA or SCSI) is not built to communicate that kind of information >> (grouped flush, relaxed ordering...). I think battery backed >> ORDERED_DRAIN combined with fine-grained host queue flush would be >> pretty good. It doesn't require some fancy new interface which isn't >> gonna be used widely anyway and can achieve most of performance gain if >> the storage plays it smart. > > Yes, that would probably be "pretty good". But how do you get the storage > device to *reliably* tell the truth about what it actually implements? (Consider > the number of devices that downright lie about their implementation of cache > flushing....) SCSI NV bit or report write through cache? Again, we're talking about large arrays and we already trust the write through thing even on cheap single spindle drives. sd currently doesn't honor NV bit and it's causing some troubles on some arrays. We'll probably have to honor them at least conditionally. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
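For what it's worth, the conditional honouring Tejun mentions would only amount to a few lines once sd actually tracks the bit. In the sketch below, sdkp->WCE is the existing write-cache flag (it is visible in the sd hunks of Jens' patch later in the thread); the cache_nv field is hypothetical, standing in for the NV bit of the caching mode page, which sd does not record today. Such a helper would live in drivers/scsi/sd.c, where struct scsi_disk is already in scope.

/* Would a cache flush actually buy anything on this disk? */
static int sd_cache_needs_flush(struct scsi_disk *sdkp)
{
	if (!sdkp->WCE)
		return 0;	/* write-through: nothing is cached */
	if (sdkp->cache_nv)
		return 0;	/* write-back, but reported non-volatile */
	return 1;		/* volatile write-back cache: flush it */
}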
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-06-01 7:16 ` [dm-devel] " Tejun Heo 2007-06-01 17:07 ` Valdis.Kletnieks @ 2007-07-10 18:39 ` Ric Wheeler 2007-07-10 23:40 ` Valdis.Kletnieks 2007-07-11 2:51 ` Tejun Heo 1 sibling, 2 replies; 102+ messages in thread From: Ric Wheeler @ 2007-07-10 18:39 UTC (permalink / raw) To: Tejun Heo Cc: david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger Tejun Heo wrote: > [ cc'ing Ric Wheeler for storage array thingie. Hi, whole thread is at > http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ] I am actually on the list, just really, really far behind in the thread ;-) > > Hello, > > david@lang.hm wrote: >> but when you consider the self-contained disk arrays it's an entirely >> different story. you can easily have a few gig of cache and a complete >> OS pretending to be a single drive as far as you are concerned. >> >> and the price of such devices is plummeting (in large part thanks to >> Linux moving into this space), you can now readily buy a 10TB array for >> $10k that looks like a single drive. > > Don't those thingies usually have NV cache or backed by battery such > that ORDERED_DRAIN is enough? All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-) The size of the NV cache can run from a few gigabytes up to hundreds of gigabytes, so you really don't want to invoke cache flushes here if you can avoid it. For this class of device, you can get the required in order completion and data integrity semantics as long as we send the IO's to the device in the correct order. > > The problem is that the interface between the host and a storage device > (ATA or SCSI) is not built to communicate that kind of information > (grouped flush, relaxed ordering...). I think battery backed > ORDERED_DRAIN combined with fine-grained host queue flush would be > pretty good. It doesn't require some fancy new interface which isn't > gonna be used widely anyway and can achieve most of performance gain if > the storage plays it smart. > > Thanks. > I am not really sure that you need this ORDERED_DRAIN for big arrays... ric ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-10 18:39 ` Ric Wheeler @ 2007-07-10 23:40 ` Valdis.Kletnieks 2007-07-11 2:49 ` Tejun Heo 2007-07-11 22:44 ` Ric Wheeler 2007-07-11 2:51 ` Tejun Heo 1 sibling, 2 replies; 102+ messages in thread From: Valdis.Kletnieks @ 2007-07-10 23:40 UTC (permalink / raw) To: ric Cc: Tejun Heo, david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger [-- Attachment #1: Type: text/plain, Size: 608 bytes --] On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: > All of the high end arrays have non-volatile cache (read, on power loss, it is a > promise that it will get all of your data out to permanent storage). You don't > need to ask this kind of array to drain the cache. In fact, it might just ignore > you if you send it that kind of request ;-) OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there a "I'm a high-end array" bit in the sense data that I'm unaware of? [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-10 23:40 ` Valdis.Kletnieks @ 2007-07-11 2:49 ` Tejun Heo 2007-07-11 22:44 ` Ric Wheeler 1 sibling, 0 replies; 102+ messages in thread From: Tejun Heo @ 2007-07-11 2:49 UTC (permalink / raw) To: Valdis.Kletnieks Cc: ric, david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger Valdis.Kletnieks@vt.edu wrote: > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: > >> All of the high end arrays have non-volatile cache (read, on power loss, it is a >> promise that it will get all of your data out to permanent storage). You don't >> need to ask this kind of array to drain the cache. In fact, it might just ignore >> you if you send it that kind of request ;-) > > OK, I'll bite - how does the kernel know whether the other end of that > fiberchannel cable is attached to a DMX-3 or to some no-name product that > may not have the same assurances? Is there a "I'm a high-end array" bit > in the sense data that I'm unaware of? Well, the array just has to tell the kernel that it doesn't do write-back caching. The kernel automatically selects ORDERED_DRAIN in that case. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
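Concretely, that selection is the same one the ide-disk hunk in Jens' patch further down already performs; a driver sitting in front of such an array would do something along these lines. The function name and the wcache flag are illustrative, the flag being whatever the transport reports (ATA IDENTIFY, the SCSI caching mode page, or an NV-cache indication from the array):

#include <linux/blkdev.h>

static void select_ordered_mode(request_queue_t *q, int wcache,
				prepare_flush_fn *prep_fn)
{
	if (wcache)
		/* volatile write-back cache: barriers need real flushes */
		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, prep_fn);
	else
		/* write-through or battery-backed: draining the queue is enough */
		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
}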
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-10 23:40 ` Valdis.Kletnieks 2007-07-11 2:49 ` Tejun Heo @ 2007-07-11 22:44 ` Ric Wheeler 2007-07-12 17:34 ` Valdis.Kletnieks 1 sibling, 1 reply; 102+ messages in thread From: Ric Wheeler @ 2007-07-11 22:44 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Tejun Heo, david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger Valdis.Kletnieks@vt.edu wrote: > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: > >> All of the high end arrays have non-volatile cache (read, on power loss, it is a >> promise that it will get all of your data out to permanent storage). You don't >> need to ask this kind of array to drain the cache. In fact, it might just ignore >> you if you send it that kind of request ;-) > > OK, I'll bite - how does the kernel know whether the other end of that > fiberchannel cable is attached to a DMX-3 or to some no-name product that > may not have the same assurances? Is there a "I'm a high-end array" bit > in the sense data that I'm unaware of? > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this. In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs. ric ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-11 22:44 ` Ric Wheeler @ 2007-07-12 17:34 ` Valdis.Kletnieks 2007-07-12 19:43 ` Ric Wheeler 2007-07-12 23:10 ` Guy Watkins 0 siblings, 2 replies; 102+ messages in thread From: Valdis.Kletnieks @ 2007-07-12 17:34 UTC (permalink / raw) To: ric Cc: Tejun Heo, david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger [-- Attachment #1: Type: text/plain, Size: 1654 bytes --] On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said: > Valdis.Kletnieks@vt.edu wrote: > > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: > > > >> All of the high end arrays have non-volatile cache (read, on power loss, it is a > >> promise that it will get all of your data out to permanent storage). You don't > >> need to ask this kind of array to drain the cache. In fact, it might just ignore > >> you if you send it that kind of request ;-) > > > > OK, I'll bite - how does the kernel know whether the other end of that > > fiberchannel cable is attached to a DMX-3 or to some no-name product that > > may not have the same assurances? Is there a "I'm a high-end array" bit > > in the sense data that I'm unaware of? > > > > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI > has similar queries) to see what kind of device you are talking to. I am not > sure it is worth the trouble to do any automatic detection/handling of this. > > In this specific case, it is more a case of when you attach a high end (or > mid-tier) device to a server, you should configure it without barriers for its > exported LUNs. I don't have a problem with the sysadmin *telling* the system "the other end of that fiber cable has characteristics X, Y and Z". What worried me was that it looked like conflating "device reported writeback cache" with "device actually has enough battery/hamster/whatever backup to flush everything on a power loss". (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. That's a lot of battery..) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 102+ messages in thread
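Spelling that back-of-envelope number out (1 GiB of dirty 4 KiB blocks, one worst-case 1 ms seek each; a deliberately pessimistic model, real arrays destage far more cleverly):

#include <stdio.h>

int main(void)
{
	double cache   = 1024.0 * 1024.0 * 1024.0;	/* 1 GiB of dirty cache */
	double block   = 4096.0;			/* 4 KiB per write */
	double ms_each = 1.0;				/* worst-case seek per write */

	double writes  = cache / block;			/* 262144 writes */
	double seconds = writes * ms_each / 1000.0;	/* ~262 s */

	printf("%.0f writes, %.0f s (about %.1f minutes)\n",
	       writes, seconds, seconds / 60.0);	/* about 4.4 minutes */
	return 0;
}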
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-12 17:34 ` Valdis.Kletnieks @ 2007-07-12 19:43 ` Ric Wheeler 2007-07-12 23:10 ` Guy Watkins 1 sibling, 0 replies; 102+ messages in thread From: Ric Wheeler @ 2007-07-12 19:43 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Tejun Heo, david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger Valdis.Kletnieks@vt.edu wrote: > On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said: >> Valdis.Kletnieks@vt.edu wrote: >>> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: >>> >>>> All of the high end arrays have non-volatile cache (read, on power loss, it is a >>>> promise that it will get all of your data out to permanent storage). You don't >>>> need to ask this kind of array to drain the cache. In fact, it might just ignore >>>> you if you send it that kind of request ;-) >>> OK, I'll bite - how does the kernel know whether the other end of that >>> fiberchannel cable is attached to a DMX-3 or to some no-name product that >>> may not have the same assurances? Is there a "I'm a high-end array" bit >>> in the sense data that I'm unaware of? >>> >> There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI >> has similar queries) to see what kind of device you are talking to. I am not >> sure it is worth the trouble to do any automatic detection/handling of this. >> >> In this specific case, it is more a case of when you attach a high end (or >> mid-tier) device to a server, you should configure it without barriers for its >> exported LUNs. > > I don't have a problem with the sysadmin *telling* the system "the other end of > that fiber cable has characteristics X, Y and Z". What worried me was that it > looked like conflating "device reported writeback cache" with "device actually > has enough battery/hamster/whatever backup to flush everything on a power loss". > (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek > for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. That's > a lot of battery..) I think that we are on the same page here - just let the sys admin mount without barriers for big arrays. 1GB of cache, by the way, is really small for some of us ;-) ric ^ permalink raw reply [flat|nested] 102+ messages in thread
* RE: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-12 17:34 ` Valdis.Kletnieks 2007-07-12 19:43 ` Ric Wheeler @ 2007-07-12 23:10 ` Guy Watkins 2007-07-13 11:30 ` Ric Wheeler 1 sibling, 1 reply; 102+ messages in thread From: Guy Watkins @ 2007-07-12 23:10 UTC (permalink / raw) To: Valdis.Kletnieks, ric Cc: 'Tejun Heo', david, 'Stefan Bader', 'Phillip Susi', 'device-mapper development', linux-fsdevel, linux-kernel, linux-raid, 'Jens Axboe', 'David Chinner', 'Andreas Dilger' } -----Original Message----- } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- } owner@vger.kernel.org] On Behalf Of Valdis.Kletnieks@vt.edu } Sent: Thursday, July 12, 2007 1:35 PM } To: ric@emc.com } Cc: Tejun Heo; david@lang.hm; Stefan Bader; Phillip Susi; device-mapper } development; linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; } linux-raid@vger.kernel.org; Jens Axboe; David Chinner; Andreas Dilger } Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for } devices, filesystems, and dm/md. } } On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said: } > Valdis.Kletnieks@vt.edu wrote: } > > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: } > > } > >> All of the high end arrays have non-volatile cache (read, on power } loss, it is a } > >> promise that it will get all of your data out to permanent storage). } You don't } > >> need to ask this kind of array to drain the cache. In fact, it might } just ignore } > >> you if you send it that kind of request ;-) } > > } > > OK, I'll bite - how does the kernel know whether the other end of that } > > fiberchannel cable is attached to a DMX-3 or to some no-name product } that } > > may not have the same assurances? Is there a "I'm a high-end array" } bit } > > in the sense data that I'm unaware of? } > > } > } > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA } drives, SCSI } > has similar queries) to see what kind of device you are talking to. I am } not } > sure it is worth the trouble to do any automatic detection/handling of } this. } > } > In this specific case, it is more a case of when you attach a high end } (or } > mid-tier) device to a server, you should configure it without barriers } for its } > exported LUNs. } } I don't have a problem with the sysadmin *telling* the system "the other } end of } that fiber cable has characteristics X, Y and Z". What worried me was } that it } looked like conflating "device reported writeback cache" with "device } actually } has enough battery/hamster/whatever backup to flush everything on a power } loss". } (My back-of-envelope calculation shows for a worst-case of needing a 1ms } seek } for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. } That's } a lot of battery..) Most hardware RAID devices I know of use the battery to save the cache while the power is off. When the power is restored it flushes the cache to disk. If the power failure lasts longer than the batteries then the cache data is lost, but the batteries last 24+ hours I beleve. A big EMC array we had had enough battery power to power about 400 disks while the 16 Gig of cache was flushed. I think EMC told me the batteries would last about 20 minutes. I don't recall if the array was usable during the 20 minutes. We never tested a power failure. Guy ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-12 23:10 ` Guy Watkins @ 2007-07-13 11:30 ` Ric Wheeler 0 siblings, 0 replies; 102+ messages in thread From: Ric Wheeler @ 2007-07-13 11:30 UTC (permalink / raw) To: Guy Watkins Cc: Valdis.Kletnieks, 'Tejun Heo', david, 'Stefan Bader', 'Phillip Susi', 'device-mapper development', linux-fsdevel, linux-kernel, linux-raid, 'Jens Axboe', 'David Chinner', 'Andreas Dilger' Guy Watkins wrote: > } -----Original Message----- > } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- > } owner@vger.kernel.org] On Behalf Of Valdis.Kletnieks@vt.edu > } Sent: Thursday, July 12, 2007 1:35 PM > } To: ric@emc.com > } Cc: Tejun Heo; david@lang.hm; Stefan Bader; Phillip Susi; device-mapper > } development; linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; > } linux-raid@vger.kernel.org; Jens Axboe; David Chinner; Andreas Dilger > } Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for > } devices, filesystems, and dm/md. > } > } On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said: > } > Valdis.Kletnieks@vt.edu wrote: > } > > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: > } > > > } > >> All of the high end arrays have non-volatile cache (read, on power > } loss, it is a > } > >> promise that it will get all of your data out to permanent storage). > } You don't > } > >> need to ask this kind of array to drain the cache. In fact, it might > } just ignore > } > >> you if you send it that kind of request ;-) > } > > > } > > OK, I'll bite - how does the kernel know whether the other end of that > } > > fiberchannel cable is attached to a DMX-3 or to some no-name product > } that > } > > may not have the same assurances? Is there a "I'm a high-end array" > } bit > } > > in the sense data that I'm unaware of? > } > > > } > > } > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA > } drives, SCSI > } > has similar queries) to see what kind of device you are talking to. I am > } not > } > sure it is worth the trouble to do any automatic detection/handling of > } this. > } > > } > In this specific case, it is more a case of when you attach a high end > } (or > } > mid-tier) device to a server, you should configure it without barriers > } for its > } > exported LUNs. > } > } I don't have a problem with the sysadmin *telling* the system "the other > } end of > } that fiber cable has characteristics X, Y and Z". What worried me was > } that it > } looked like conflating "device reported writeback cache" with "device > } actually > } has enough battery/hamster/whatever backup to flush everything on a power > } loss". > } (My back-of-envelope calculation shows for a worst-case of needing a 1ms > } seek > } for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. > } That's > } a lot of battery..) > > Most hardware RAID devices I know of use the battery to save the cache while > the power is off. When the power is restored it flushes the cache to disk. > If the power failure lasts longer than the batteries then the cache data is > lost, but the batteries last 24+ hours I beleve. Most mid-range and high end arrays actually use that battery to insure that data is all written out to permanent media when the power is lost. I won't go into how that is done, but it clearly would not be a safe assumption to assume that your power outage is only going to be a certain length of time (and if not, you would lose data). 
> > A big EMC array we had had enough battery power to power about 400 disks > while the 16 Gig of cache was flushed. I think EMC told me the batteries > would last about 20 minutes. I don't recall if the array was usable during > the 20 minutes. We never tested a power failure. > > Guy I worked on the team that designed that big array. At one point, we had an array on loan to a partner who tried to put it in a very small data center. A few weeks later, they brought in an electrician who needed to run more power into the center. It was pretty funny - he tried to find a power button to turn it off and then just walked over and dropped power trying to get the Symm to turn off. When that didn't work, he was really, really confused ;-) ric ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-07-10 18:39 ` Ric Wheeler 2007-07-10 23:40 ` Valdis.Kletnieks @ 2007-07-11 2:51 ` Tejun Heo 1 sibling, 0 replies; 102+ messages in thread From: Tejun Heo @ 2007-07-11 2:51 UTC (permalink / raw) To: ric Cc: david, Stefan Bader, Phillip Susi, device-mapper development, linux-fsdevel, linux-kernel, linux-raid, Jens Axboe, David Chinner, Andreas Dilger Ric Wheeler wrote: >> Don't those thingies usually have NV cache or backed by battery such >> that ORDERED_DRAIN is enough? > > All of the high end arrays have non-volatile cache (read, on power loss, > it is a promise that it will get all of your data out to permanent > storage). You don't need to ask this kind of array to drain the cache. > In fact, it might just ignore you if you send it that kind of request ;-) > > The size of the NV cache can run from a few gigabytes up to hundreds of > gigabytes, so you really don't want to invoke cache flushes here if you > can avoid it. > > For this class of device, you can get the required in order completion > and data integrity semantics as long as we send the IO's to the device > in the correct order. Thanks for clarification. >> The problem is that the interface between the host and a storage device >> (ATA or SCSI) is not built to communicate that kind of information >> (grouped flush, relaxed ordering...). I think battery backed >> ORDERED_DRAIN combined with fine-grained host queue flush would be >> pretty good. It doesn't require some fancy new interface which isn't >> gonna be used widely anyway and can achieve most of performance gain if >> the storage plays it smart. > > I am not really sure that you need this ORDERED_DRAIN for big arrays... ORDERED_DRAIN is to properly order requests from host request queue (elevator/iosched). We can make it finer-grained but we do need to put some ordering restrictions. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 1:30 ` Neil Brown ` (2 preceding siblings ...) 2007-05-28 9:43 ` Alasdair G Kergon @ 2007-05-29 19:59 ` Phillip Susi 2007-05-31 0:22 ` Neil Brown 2007-05-30 9:35 ` Jens Axboe 4 siblings, 1 reply; 102+ messages in thread From: Phillip Susi @ 2007-05-29 19:59 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Stefan Bader, Andreas Dilger, Tejun Heo Neil Brown wrote: > md/dm modules could keep count of requests as has been suggested > (though that would be a fairly big change for raid0 as it currently > doesn't know when a request completes - bi_endio goes directly to the > filesystem). Are you sure? I believe that dm handles bi_endio because it waits for all in progress bio to complete before switching tables. > 2/ Maybe barriers provide stronger semantics than are required. > > All write requests are synchronised around a barrier write. This is > often more than is required and apparently can cause a measurable > slowdown. I'm not quite sure I understand this correctly, but the purpose of a barrier request is to prevent the elevator from reordering requests around a barrier. Previous requests must be completed before the barrier, and latter requests must be executed after. That is a sufficiently strong guarantee for careful write or journal filesystems to ensure that a log block hits the disk before the actual transaction blocks, and then the log block is marked as complete only after the actual transaction. This is a weaker guarantee than a flush, and allows for some reordering to improve performance. > Also the FUA for the actual commit write might not be needed. It is > important for consistency that the preceding writes are in safe > storage before the commit write, but it is not so important that the > commit write is immediately safe on storage. That isn't needed until > a 'sync' or 'fsync' or similar. Right, the barrier doesn't need to be flushed right away, so the elevator could complete writes after the barrier if it wishes, then complete the ones before, and finally the barrier itself. Not setting the FUA bit allows the disk to cache the barrier write so it can be completed sooner, but before the queue sends any more requests to the disk, it must be flushed to ensure that the barrier has hit the media before the new requests. > One possible alternative is: > - writes can overtake barriers, but barrier cannot overtake writes. > - flush before the barrier, not after. > > This is considerably weaker, and hence cheaper. But I think it is > enough for all filesystems (providing it is still an option to call > blkdev_issue_flush on 'fsync'). Again I am not sure I quite understand what you mean here, but only writes issued after the barrier can complete before the barrier. Those issued before the barrier can not overtake it in the queue. > Another alternative would be to tag each bio was being in a > particular barrier-group. Then bio's in different groups could > overtake each other in either direction, but a BARRIER request must > be totally ordered w.r.t. other requests in the barrier group. > This would require an extra bio field, and would give the filesystem > more appearance of control. I'm not yet sure how much it would > really help... > It would allow us to set FUA on all bios with a non-zero > barrier-group. That would mean we don't have to flush the entire > cache, just those blocks that are critical.... 
but I'm still not sure > it's a good idea. This all seems unnecessary work. ^ permalink raw reply [flat|nested] 102+ messages in thread
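For reference, the caller-side pattern this back-and-forth keeps circling, try the barrier commit write first and on -EOPNOTSUPP fall back to explicit flushes around a plain write, might look roughly like this. write_commit_block() and its arguments are stand-ins rather than any filesystem's real code; the fallback path uses blkdev_issue_flush() with its existing interface, once before redoing the commit (so the preceding writes are safe first) and once after it.

#include <linux/fs.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

struct commit_wait {
	struct completion done;
	int error;
};

static int commit_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	struct commit_wait *cw = bio->bi_private;

	if (bio->bi_size)
		return 1;		/* partial completion, not done yet */
	cw->error = err;		/* -EOPNOTSUPP if barriers are unsupported */
	complete(&cw->done);
	return 0;
}

static int write_commit_block(struct block_device *bdev, struct page *page,
			      sector_t sector)
{
	struct commit_wait cw;
	struct bio *bio;
	int rw = WRITE | (1 << BIO_RW_BARRIER);

again:
	init_completion(&cw.done);
	cw.error = 0;

	bio = bio_alloc(GFP_NOIO, 1);
	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio_add_page(bio, page, PAGE_SIZE, 0);
	bio->bi_end_io = commit_end_io;
	bio->bi_private = &cw;

	submit_bio(rw, bio);
	wait_for_completion(&cw.done);
	bio_put(bio);

	if (cw.error == -EOPNOTSUPP && (rw & (1 << BIO_RW_BARRIER))) {
		/* no barrier support: get the earlier writes onto the media */
		blkdev_issue_flush(bdev, NULL);
		rw = WRITE;		/* redo the commit as a plain write */
		goto again;
	}
	if (!cw.error && !(rw & (1 << BIO_RW_BARRIER)))
		/* plain-write path: flush again so the commit itself is safe */
		cw.error = blkdev_issue_flush(bdev, NULL);
	return cw.error;
}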
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-29 19:59 ` Phillip Susi @ 2007-05-31 0:22 ` Neil Brown 0 siblings, 0 replies; 102+ messages in thread From: Neil Brown @ 2007-05-31 0:22 UTC (permalink / raw) To: Phillip Susi Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe, David Chinner, Stefan Bader, Andreas Dilger, Tejun Heo On Tuesday May 29, psusi@cfl.rr.com wrote: > Neil Brown wrote: > > md/dm modules could keep count of requests as has been suggested > > (though that would be a fairly big change for raid0 as it currently > > doesn't know when a request completes - bi_endio goes directly to the > > filesystem). > > Are you sure? I believe that dm handles bi_endio because it waits for > all in progress bio to complete before switching tables. I was talking about md/raid0, not dm-stripe. md/raid0 (and md/linear) currently never know that a request has completed. NeilBrown ^ permalink raw reply [flat|nested] 102+ messages in thread
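One way md/raid0 (or md/linear) could start seeing completions without cloning every bio is the usual end_io-hook trick sketched below: remember the caller's bi_end_io, substitute your own, and keep an in-flight counter that barrier handling can wait on. The r0_hook structure and the nr_pending counter are illustrative names rather than real md fields, and real code would need to deal with allocation failure and with forwarding partial completions more carefully than this does.

#include <linux/bio.h>
#include <linux/slab.h>

struct r0_hook {
	bio_end_io_t	*orig_end_io;
	void		*orig_private;
	atomic_t	*nr_pending;
};

static int r0_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	struct r0_hook *h = bio->bi_private;
	bio_end_io_t *orig = h->orig_end_io;

	if (bio->bi_size)
		return 1;			/* not finished yet */

	/* restore the caller's state before chaining to its end_io */
	bio->bi_end_io = orig;
	bio->bi_private = h->orig_private;
	atomic_dec(h->nr_pending);		/* barrier code can wait for zero */
	kfree(h);
	return orig(bio, bytes_done, err);
}

/* Called from make_request before remapping the bio to a component device. */
static void r0_hook_endio(struct bio *bio, atomic_t *nr_pending)
{
	struct r0_hook *h = kmalloc(sizeof(*h), GFP_NOIO);

	h->orig_end_io = bio->bi_end_io;
	h->orig_private = bio->bi_private;
	h->nr_pending = nr_pending;

	atomic_inc(nr_pending);
	bio->bi_end_io = r0_end_io;
	bio->bi_private = h;
}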
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-28 1:30 ` Neil Brown ` (3 preceding siblings ...) 2007-05-29 19:59 ` Phillip Susi @ 2007-05-30 9:35 ` Jens Axboe 2007-07-05 12:28 ` Tejun Heo 2007-07-18 10:56 ` [PATCH] block: cosmetic changes Tejun Heo 4 siblings, 2 replies; 102+ messages in thread From: Jens Axboe @ 2007-05-30 9:35 UTC (permalink / raw) To: Neil Brown Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger, Tejun Heo On Mon, May 28 2007, Neil Brown wrote: > I think the implementation priorities here are: > > 1/ implement a zero-length BIO_RW_BARRIER option. > 2/ Use it (or otherwise) to make all dm and md modules handle > barriers (and loop?). > 3/ Devise and implement appropriate fall-backs with-in the block layer > so that -EOPNOTSUP is never returned. > 4/ Remove unneeded cruft from filesystems (and elsewhere). This is the start of 1/ above. It's very lightly tested, it's verified to DTRT here at least and not crash :-) It gets rid of the ->issue_flush_fn() queue callback, all the driver knowledge resides in ->prepare_flush_fn() anyways. blkdev_issue_flush() then just reuses the empty-bio approach to queue an empty barrier, this should work equally well for stacked and non-stacked devices. While this patch isn't complete yet, it's clearly the right direction to go. I didn't convert drivers/md/* to support this approach, I'm leaving that to you :-) block/elevator.c | 12 ++ block/ll_rw_blk.c | 173 ++++++++++++++++++-------------- drivers/ide/ide-disk.c | 29 ----- drivers/message/i2o/i2o_block.c | 24 ---- drivers/scsi/scsi_lib.c | 17 --- drivers/scsi/sd.c | 15 -- fs/bio.c | 8 - include/linux/bio.h | 18 ++- include/linux/blkdev.h | 3 include/scsi/scsi_driver.h | 1 include/scsi/sd.h | 1 mm/bounce.c | 6 + 12 files changed, 141 insertions(+), 166 deletions(-) diff --git a/block/elevator.c b/block/elevator.c index ce866eb..af5e58d 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -715,6 +715,18 @@ struct request *elv_next_request(request_queue_t *q) int ret; while ((rq = __elv_next_request(q)) != NULL) { + /* + * Kill the empty barrier place holder, the driver must + * not ever see it. + */ + if (blk_fs_request(rq) && blk_barrier_rq(rq) && + !rq->hard_nr_sectors) { + blkdev_dequeue_request(rq); + rq->cmd_flags |= REQ_QUIET; + end_that_request_chunk(rq, 1, 0); + end_that_request_last(rq, 1); + continue; + } if (!(rq->cmd_flags & REQ_STARTED)) { /* * This is the first time the device driver diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index 6b5173a..8680083 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -300,23 +300,6 @@ int blk_queue_ordered(request_queue_t *q, unsigned ordered, EXPORT_SYMBOL(blk_queue_ordered); -/** - * blk_queue_issue_flush_fn - set function for issuing a flush - * @q: the request queue - * @iff: the function to be called issuing the flush - * - * Description: - * If a driver supports issuing a flush command, the support is notified - * to the block layer by defining it through this call. - * - **/ -void blk_queue_issue_flush_fn(request_queue_t *q, issue_flush_fn *iff) -{ - q->issue_flush_fn = iff; -} - -EXPORT_SYMBOL(blk_queue_issue_flush_fn); - /* * Cache flushing for ordered writes handling */ @@ -433,7 +416,8 @@ static inline struct request *start_ordered(request_queue_t *q, rq_init(q, rq); if (bio_data_dir(q->orig_bar_rq->bio) == WRITE) rq->cmd_flags |= REQ_RW; - rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? 
REQ_FUA : 0; + if (q->ordered & QUEUE_ORDERED_FUA) + rq->cmd_flags |= REQ_FUA; rq->elevator_private = NULL; rq->elevator_private2 = NULL; init_request_from_bio(rq, q->orig_bar_rq->bio); @@ -445,7 +429,7 @@ static inline struct request *start_ordered(request_queue_t *q, * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs * request gets inbetween ordered sequence. */ - if (q->ordered & QUEUE_ORDERED_POSTFLUSH) + if ((q->ordered & QUEUE_ORDERED_POSTFLUSH) && rq->hard_nr_sectors) queue_flush(q, QUEUE_ORDERED_POSTFLUSH); else q->ordseq |= QUEUE_ORDSEQ_POSTFLUSH; @@ -469,7 +453,7 @@ static inline struct request *start_ordered(request_queue_t *q, int blk_do_ordered(request_queue_t *q, struct request **rqp) { struct request *rq = *rqp; - int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq); + const int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq); if (!q->ordseq) { if (!is_barrier) @@ -2635,6 +2619,16 @@ int blk_execute_rq(request_queue_t *q, struct gendisk *bd_disk, EXPORT_SYMBOL(blk_execute_rq); +static int bio_end_empty_barrier(struct bio *bio, unsigned int bytes_done, + int err) +{ + if (err) + clear_bit(BIO_UPTODATE, &bio->bi_flags); + + complete(bio->bi_private); + return 0; +} + /** * blkdev_issue_flush - queue a flush * @bdev: blockdev to issue flush for @@ -2647,7 +2641,10 @@ EXPORT_SYMBOL(blk_execute_rq); */ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector) { + DECLARE_COMPLETION_ONSTACK(wait); request_queue_t *q; + struct bio *bio; + int ret; if (bdev->bd_disk == NULL) return -ENXIO; @@ -2655,10 +2652,32 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector) q = bdev_get_queue(bdev); if (!q) return -ENXIO; - if (!q->issue_flush_fn) - return -EOPNOTSUPP; - return q->issue_flush_fn(q, bdev->bd_disk, error_sector); + bio = bio_alloc(GFP_KERNEL, 0); + if (!bio) + return -ENOMEM; + + bio->bi_end_io = bio_end_empty_barrier; + bio->bi_private = &wait; + bio->bi_bdev = bdev; + submit_bio(1 << BIO_RW_BARRIER, bio); + + wait_for_completion(&wait); + + /* + * The driver must store the error location in ->bi_sector, if + * it supports it. For non-stacked drivers, this should be copied + * from rq->sector. + */ + if (error_sector) + *error_sector = bio->bi_sector; + + ret = 0; + if (!bio_flagged(bio, BIO_UPTODATE)) + ret = -EIO; + + bio_put(bio); + return ret; } EXPORT_SYMBOL(blkdev_issue_flush); @@ -3030,7 +3049,7 @@ static inline void blk_partition_remap(struct bio *bio) { struct block_device *bdev = bio->bi_bdev; - if (bdev != bdev->bd_contains) { + if (bio_sectors(bio) && bdev != bdev->bd_contains) { struct hd_struct *p = bdev->bd_part; const int rw = bio_data_dir(bio); @@ -3092,6 +3111,35 @@ static inline int should_fail_request(struct bio *bio) #endif /* CONFIG_FAIL_MAKE_REQUEST */ +/* + * Check whether this bio extends beyond the end of the device. + */ +static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors) +{ + sector_t maxsector; + + if (!nr_sectors) + return 0; + + /* Test device or partition size, when known. */ + maxsector = bio->bi_bdev->bd_inode->i_size >> 9; + if (maxsector) { + sector_t sector = bio->bi_sector; + + if (maxsector < nr_sectors || maxsector - nr_sectors < sector) { + /* + * This may well happen - the kernel calls bread() + * without checking the size of the device, e.g., when + * mounting a device. 
+ */ + handle_bad_sector(bio); + return 1; + } + } + + return 0; +} + /** * generic_make_request: hand a buffer to its device driver for I/O * @bio: The bio describing the location in memory and on the device. @@ -3119,27 +3167,14 @@ static inline int should_fail_request(struct bio *bio) static inline void __generic_make_request(struct bio *bio) { request_queue_t *q; - sector_t maxsector; sector_t old_sector; int ret, nr_sectors = bio_sectors(bio); dev_t old_dev; might_sleep(); - /* Test device or partition size, when known. */ - maxsector = bio->bi_bdev->bd_inode->i_size >> 9; - if (maxsector) { - sector_t sector = bio->bi_sector; - if (maxsector < nr_sectors || maxsector - nr_sectors < sector) { - /* - * This may well happen - the kernel calls bread() - * without checking the size of the device, e.g., when - * mounting a device. - */ - handle_bad_sector(bio); - goto end_io; - } - } + if (bio_check_eod(bio, nr_sectors)) + goto end_io; /* * Resolve the mapping until finished. (drivers are @@ -3166,7 +3201,7 @@ end_io: break; } - if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) { + if (unlikely(nr_sectors > q->max_hw_sectors)) { printk("bio too big device %s (%u > %u)\n", bdevname(bio->bi_bdev, b), bio_sectors(bio), @@ -3187,7 +3222,7 @@ end_io: blk_partition_remap(bio); if (old_sector != -1) - blk_add_trace_remap(q, bio, old_dev, bio->bi_sector, + blk_add_trace_remap(q, bio, old_dev, bio->bi_sector, old_sector); blk_add_trace_bio(q, bio, BLK_TA_QUEUE); @@ -3195,21 +3230,8 @@ end_io: old_sector = bio->bi_sector; old_dev = bio->bi_bdev->bd_dev; - maxsector = bio->bi_bdev->bd_inode->i_size >> 9; - if (maxsector) { - sector_t sector = bio->bi_sector; - - if (maxsector < nr_sectors || - maxsector - nr_sectors < sector) { - /* - * This may well happen - partitions are not - * checked to make sure they are within the size - * of the whole device. - */ - handle_bad_sector(bio); - goto end_io; - } - } + if (bio_check_eod(bio, nr_sectors)) + goto end_io; ret = q->make_request_fn(q, bio); } while (ret); @@ -3282,23 +3304,32 @@ void submit_bio(int rw, struct bio *bio) { int count = bio_sectors(bio); - BIO_BUG_ON(!bio->bi_size); - BIO_BUG_ON(!bio->bi_io_vec); bio->bi_rw |= rw; - if (rw & WRITE) { - count_vm_events(PGPGOUT, count); - } else { - task_io_account_read(bio->bi_size); - count_vm_events(PGPGIN, count); - } - if (unlikely(block_dump)) { - char b[BDEVNAME_SIZE]; - printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n", - current->comm, current->pid, - (rw & WRITE) ? "WRITE" : "READ", - (unsigned long long)bio->bi_sector, - bdevname(bio->bi_bdev,b)); + /* + * If it's a regular read/write or a barrier with data attached, + * go through the normal accounting stuff before submission. + */ + if (!bio_barrier(bio) || count) { + + BIO_BUG_ON(!bio->bi_size); + BIO_BUG_ON(!bio->bi_io_vec); + + if (rw & WRITE) { + count_vm_events(PGPGOUT, count); + } else { + task_io_account_read(bio->bi_size); + count_vm_events(PGPGIN, count); + } + + if (unlikely(block_dump)) { + char b[BDEVNAME_SIZE]; + printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n", + current->comm, current->pid, + (rw & WRITE) ? 
"WRITE" : "READ", + (unsigned long long)bio->bi_sector, + bdevname(bio->bi_bdev,b)); + } } generic_make_request(bio); diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c index 7fff773..23f8181 100644 --- a/drivers/ide/ide-disk.c +++ b/drivers/ide/ide-disk.c @@ -697,32 +697,6 @@ static void idedisk_prepare_flush(request_queue_t *q, struct request *rq) rq->buffer = rq->cmd; } -static int idedisk_issue_flush(request_queue_t *q, struct gendisk *disk, - sector_t *error_sector) -{ - ide_drive_t *drive = q->queuedata; - struct request *rq; - int ret; - - if (!drive->wcache) - return 0; - - rq = blk_get_request(q, WRITE, __GFP_WAIT); - - idedisk_prepare_flush(q, rq); - - ret = blk_execute_rq(q, disk, rq, 0); - - /* - * if we failed and caller wants error offset, get it - */ - if (ret && error_sector) - *error_sector = ide_get_error_location(drive, rq->cmd); - - blk_put_request(rq); - return ret; -} - /* * This is tightly woven into the driver->do_special can not touch. * DON'T do it again until a total personality rewrite is committed. @@ -762,7 +736,6 @@ static void update_ordered(ide_drive_t *drive) struct hd_driveid *id = drive->id; unsigned ordered = QUEUE_ORDERED_NONE; prepare_flush_fn *prep_fn = NULL; - issue_flush_fn *issue_fn = NULL; if (drive->wcache) { unsigned long long capacity; @@ -786,13 +759,11 @@ static void update_ordered(ide_drive_t *drive) if (barrier) { ordered = QUEUE_ORDERED_DRAIN_FLUSH; prep_fn = idedisk_prepare_flush; - issue_fn = idedisk_issue_flush; } } else ordered = QUEUE_ORDERED_DRAIN; blk_queue_ordered(drive->queue, ordered, prep_fn); - blk_queue_issue_flush_fn(drive->queue, issue_fn); } static int write_cache(ide_drive_t *drive, int arg) diff --git a/drivers/message/i2o/i2o_block.c b/drivers/message/i2o/i2o_block.c index b17c4b2..e794074 100644 --- a/drivers/message/i2o/i2o_block.c +++ b/drivers/message/i2o/i2o_block.c @@ -149,29 +149,6 @@ static int i2o_block_device_flush(struct i2o_device *dev) }; /** - * i2o_block_issue_flush - device-flush interface for block-layer - * @queue: the request queue of the device which should be flushed - * @disk: gendisk - * @error_sector: error offset - * - * Helper function to provide flush functionality to block-layer. - * - * Returns 0 on success or negative error code on failure. 
- */ - -static int i2o_block_issue_flush(request_queue_t * queue, struct gendisk *disk, - sector_t * error_sector) -{ - struct i2o_block_device *i2o_blk_dev = queue->queuedata; - int rc = -ENODEV; - - if (likely(i2o_blk_dev)) - rc = i2o_block_device_flush(i2o_blk_dev->i2o_dev); - - return rc; -} - -/** * i2o_block_device_mount - Mount (load) the media of device dev * @dev: I2O device which should receive the mount request * @media_id: Media Identifier @@ -1009,7 +986,6 @@ static struct i2o_block_device *i2o_block_device_alloc(void) } blk_queue_prep_rq(queue, i2o_block_prep_req_fn); - blk_queue_issue_flush_fn(queue, i2o_block_issue_flush); gd->major = I2O_MAJOR; gd->queue = queue; diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 1f5a07b..4712456 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1038,22 +1038,6 @@ static int scsi_init_io(struct scsi_cmnd *cmd) return BLKPREP_KILL; } -static int scsi_issue_flush_fn(request_queue_t *q, struct gendisk *disk, - sector_t *error_sector) -{ - struct scsi_device *sdev = q->queuedata; - struct scsi_driver *drv; - - if (sdev->sdev_state != SDEV_RUNNING) - return -ENXIO; - - drv = *(struct scsi_driver **) disk->private_data; - if (drv->issue_flush) - return drv->issue_flush(&sdev->sdev_gendev, error_sector); - - return -EOPNOTSUPP; -} - static struct scsi_cmnd *scsi_get_cmd_from_req(struct scsi_device *sdev, struct request *req) { @@ -1596,7 +1580,6 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev) return NULL; blk_queue_prep_rq(q, scsi_prep_fn); - blk_queue_issue_flush_fn(q, scsi_issue_flush_fn); blk_queue_softirq_done(q, scsi_softirq_done); return q; } diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 3d8c9cb..19f2655 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -241,7 +241,6 @@ static struct scsi_driver sd_template = { }, .rescan = sd_rescan, .init_command = sd_init_command, - .issue_flush = sd_issue_flush, }; /* @@ -800,20 +799,6 @@ static int sd_sync_cache(struct scsi_disk *sdkp) return 0; } -static int sd_issue_flush(struct device *dev, sector_t *error_sector) -{ - int ret = 0; - struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev); - - if (!sdkp) - return -ENODEV; - - if (sdkp->WCE) - ret = sd_sync_cache(sdkp); - scsi_disk_put(sdkp); - return ret; -} - static void sd_prepare_flush(request_queue_t *q, struct request *rq) { memset(rq->cmd, 0, sizeof(rq->cmd)); diff --git a/fs/bio.c b/fs/bio.c index 093345f..413bb19 100644 --- a/fs/bio.c +++ b/fs/bio.c @@ -109,11 +109,13 @@ static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned lon void bio_free(struct bio *bio, struct bio_set *bio_set) { - const int pool_idx = BIO_POOL_IDX(bio); + if (bio->bi_io_vec) { + const int pool_idx = BIO_POOL_IDX(bio); - BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS); + BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS); + mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]); + } - mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]); mempool_free(bio, bio_set->bio_pool); } diff --git a/include/linux/bio.h b/include/linux/bio.h index 4d85262..82a4420 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -174,14 +174,28 @@ struct bio { #define bio_offset(bio) bio_iovec((bio))->bv_offset #define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx) #define bio_sectors(bio) ((bio)->bi_size >> 9) -#define bio_cur_sectors(bio) (bio_iovec(bio)->bv_len >> 9) -#define bio_data(bio) (page_address(bio_page((bio))) + bio_offset((bio))) #define bio_barrier(bio) ((bio)->bi_rw & (1 << 
BIO_RW_BARRIER)) #define bio_sync(bio) ((bio)->bi_rw & (1 << BIO_RW_SYNC)) #define bio_failfast(bio) ((bio)->bi_rw & (1 << BIO_RW_FAILFAST)) #define bio_rw_ahead(bio) ((bio)->bi_rw & (1 << BIO_RW_AHEAD)) #define bio_rw_meta(bio) ((bio)->bi_rw & (1 << BIO_RW_META)) +static inline unsigned int bio_cur_sectors(struct bio *bio) +{ + if (bio->bi_vcnt) + return bio_iovec(bio)->bv_len >> 9; + + return 0; +} + +static inline void *bio_data(struct bio *bio) +{ + if (bio->bi_vcnt) + return page_address(bio_page(bio)) + bio_offset(bio); + + return NULL; +} + /* * will die */ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index db5b00a..47c8540 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -338,7 +338,6 @@ typedef void (unplug_fn) (request_queue_t *); struct bio_vec; typedef int (merge_bvec_fn) (request_queue_t *, struct bio *, struct bio_vec *); -typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *); typedef void (prepare_flush_fn) (request_queue_t *, struct request *); typedef void (softirq_done_fn)(struct request *); @@ -376,7 +375,6 @@ struct request_queue prep_rq_fn *prep_rq_fn; unplug_fn *unplug_fn; merge_bvec_fn *merge_bvec_fn; - issue_flush_fn *issue_flush_fn; prepare_flush_fn *prepare_flush_fn; softirq_done_fn *softirq_done_fn; @@ -749,7 +747,6 @@ extern void blk_queue_dma_alignment(request_queue_t *, int); extern void blk_queue_softirq_done(request_queue_t *, softirq_done_fn *); extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev); extern int blk_queue_ordered(request_queue_t *, unsigned, prepare_flush_fn *); -extern void blk_queue_issue_flush_fn(request_queue_t *, issue_flush_fn *); extern int blk_do_ordered(request_queue_t *, struct request **); extern unsigned blk_ordered_cur_seq(request_queue_t *); extern unsigned blk_ordered_req_seq(struct request *); diff --git a/include/scsi/scsi_driver.h b/include/scsi/scsi_driver.h index 02e26c1..7017d3e 100644 --- a/include/scsi/scsi_driver.h +++ b/include/scsi/scsi_driver.h @@ -13,7 +13,6 @@ struct scsi_driver { int (*init_command)(struct scsi_cmnd *); void (*rescan)(struct device *); - int (*issue_flush)(struct device *, sector_t *); int (*prepare_flush)(struct request_queue *, struct request *); }; #define to_scsi_driver(drv) \ diff --git a/include/scsi/sd.h b/include/scsi/sd.h index 5261488..607a6a1 100644 --- a/include/scsi/sd.h +++ b/include/scsi/sd.h @@ -56,7 +56,6 @@ static int sd_suspend(struct device *dev, pm_message_t state); static int sd_resume(struct device *dev); static void sd_rescan(struct device *); static int sd_init_command(struct scsi_cmnd *); -static int sd_issue_flush(struct device *, sector_t *); static void sd_prepare_flush(request_queue_t *, struct request *); static void sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer); static void scsi_disk_release(struct class_device *cdev); diff --git a/mm/bounce.c b/mm/bounce.c index ad401fc..95d0127 100644 --- a/mm/bounce.c +++ b/mm/bounce.c @@ -280,6 +280,12 @@ void blk_queue_bounce(request_queue_t *q, struct bio **bio_orig) mempool_t *pool; /* + * Data-less bio, nothing to bounce + */ + if (!bio_sectors(*bio_orig)) + return; + + /* * for non-isa bounce case, just check if the bounce pfn is equal * to or bigger than the highest pfn in the system -- in that case, * don't waste time iterating over bio segments -- Jens Axboe ^ permalink raw reply related [flat|nested] 102+ messages in thread
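As a usage illustration of the interface after this patch: a filesystem's fsync path keeps calling blkdev_issue_flush() exactly as before, it just ends up issuing the empty barrier bio instead of going through ->issue_flush_fn(). foo_fsync() below is a made-up example rather than code from any in-tree filesystem.

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>

static int foo_fsync(struct file *file, struct dentry *dentry, int datasync)
{
	struct inode *inode = dentry->d_inode;
	int ret, err;

	/* push the filesystem's own dirty buffers to the device first */
	ret = sync_mapping_buffers(inode->i_mapping);

	/* then get the device's volatile write cache onto the media */
	err = blkdev_issue_flush(inode->i_sb->s_bdev, NULL);

	return ret ? ret : err;
}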
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md. 2007-05-30 9:35 ` Jens Axboe @ 2007-07-05 12:28 ` Tejun Heo 2007-07-09 12:27 ` Jens Axboe 2007-07-18 10:56 ` [PATCH] block: cosmetic changes Tejun Heo 1 sibling, 1 reply; 102+ messages in thread From: Tejun Heo @ 2007-07-05 12:28 UTC (permalink / raw) To: Jens Axboe Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid, David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger Hello, Jens. Jens Axboe wrote: > On Mon, May 28 2007, Neil Brown wrote: >> I think the implementation priorities here are: >> >> 1/ implement a zero-length BIO_RW_BARRIER option. >> 2/ Use it (or otherwise) to make all dm and md modules handle >> barriers (and loop?). >> 3/ Devise and implement appropriate fall-backs with-in the block layer >> so that -EOPNOTSUP is never returned. >> 4/ Remove unneeded cruft from filesystems (and elsewhere). > > This is the start of 1/ above. It's very lightly tested, it's verified > to DTRT here at least and not crash :-) > > It gets rid of the ->issue_flush_fn() queue callback, all the driver > knowledge resides in ->prepare_flush_fn() anyways. blkdev_issue_flush() > then just reuses the empty-bio approach to queue an empty barrier, this > should work equally well for stacked and non-stacked devices. > > While this patch isn't complete yet, it's clearly the right direction to > go. Finally took a brief look. :-) I think the sequencing for zero-length barrier can be better done by pre-setting QUEUE_ORDSEQ_BAR in start_ordered() rather than short circuiting the request after it's issued. What do you think? Thanks. -- tejun ^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007-07-05 12:28 ` Tejun Heo
@ 2007-07-09 12:27 ` Jens Axboe
0 siblings, 0 replies; 102+ messages in thread
From: Jens Axboe @ 2007-07-09 12:27 UTC (permalink / raw)
To: Tejun Heo
Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid,
    David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger

On Thu, Jul 05 2007, Tejun Heo wrote:
> Hello, Jens.
>
> Jens Axboe wrote:
> > On Mon, May 28 2007, Neil Brown wrote:
> >> I think the implementation priorities here are:
> >>
> >> 1/ implement a zero-length BIO_RW_BARRIER option.
> >> 2/ Use it (or otherwise) to make all dm and md modules handle
> >>    barriers (and loop?).
> >> 3/ Devise and implement appropriate fall-backs with-in the block layer
> >>    so that -EOPNOTSUP is never returned.
> >> 4/ Remove unneeded cruft from filesystems (and elsewhere).
> >
> > This is the start of 1/ above.  It's very lightly tested, it's verified
> > to DTRT here at least and not crash :-)
> >
> > It gets rid of the ->issue_flush_fn() queue callback, all the driver
> > knowledge resides in ->prepare_flush_fn() anyways.  blkdev_issue_flush()
> > then just reuses the empty-bio approach to queue an empty barrier, this
> > should work equally well for stacked and non-stacked devices.
> >
> > While this patch isn't complete yet, it's clearly the right direction to
> > go.
>
> Finally took a brief look. :-) I think the sequencing for zero-length
> barrier can be better done by pre-setting QUEUE_ORDSEQ_BAR in
> start_ordered() rather than short circuiting the request after it's
> issued.  What do you think?

Yeah, that might be cleaner and should achieve the same effect. I'll
test!

-- 
Jens Axboe
* [PATCH] block: cosmetic changes
2007-05-30 9:35 ` Jens Axboe
2007-07-05 12:28 ` Tejun Heo
@ 2007-07-18 10:56 ` Tejun Heo
2007-07-18 10:59 ` [PATCH] block: factor out bio_check_eod() Tejun Heo
1 sibling, 1 reply; 102+ messages in thread
From: Tejun Heo @ 2007-07-18 10:56 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

Cosmetic changes.  This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
---
 block/ll_rw_blk.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -443,7 +443,8 @@ static inline struct request *start_orde
 	rq_init(q, rq);
 	if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
 		rq->cmd_flags |= REQ_RW;
-	rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+	if (q->ordered & QUEUE_ORDERED_FUA)
+		rq->cmd_flags |= REQ_FUA;
 	rq->elevator_private = NULL;
 	rq->elevator_private2 = NULL;
 	init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -3167,7 +3168,7 @@ end_io:
 		break;
 	}
 
-	if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
+	if (unlikely(nr_sectors > q->max_hw_sectors)) {
 		printk("bio too big device %s (%u > %u)\n",
 			bdevname(bio->bi_bdev, b),
 			bio_sectors(bio),
* [PATCH] block: factor out bio_check_eod()
2007-07-18 10:56 ` [PATCH] block: cosmetic changes Tejun Heo
@ 2007-07-18 10:59 ` Tejun Heo
2007-07-18 11:06 ` Jens Axboe
0 siblings, 1 reply; 102+ messages in thread
From: Tejun Heo @ 2007-07-18 10:59 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
---
 block/ll_rw_blk.c |   63 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 33 insertions(+), 30 deletions(-)

Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -3094,6 +3094,35 @@ static inline int should_fail_request(st
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+/*
+ * Check whether this bio extends beyond the end of the device.
+ */
+static int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
+{
+	sector_t maxsector;
+
+	if (!nr_sectors)
+		return 0;
+
+	/* Test device or partition size, when known. */
+	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
+	if (maxsector) {
+		sector_t sector = bio->bi_sector;
+
+		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
+			/*
+			 * This may well happen - the kernel calls bread()
+			 * without checking the size of the device, e.g., when
+			 * mounting a device.
+			 */
+			handle_bad_sector(bio);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -3121,27 +3150,14 @@ static inline int should_fail_request(st
 static inline void __generic_make_request(struct bio *bio)
 {
 	request_queue_t *q;
-	sector_t maxsector;
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 
 	might_sleep();
 
-	/* Test device or partition size, when known. */
-	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-	if (maxsector) {
-		sector_t sector = bio->bi_sector;
-		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-			/*
-			 * This may well happen - the kernel calls bread()
-			 * without checking the size of the device, e.g., when
-			 * mounting a device.
-			 */
-			handle_bad_sector(bio);
-			goto end_io;
-		}
-	}
+	if (bio_check_eod(bio, nr_sectors))
+		goto end_io;
 
 	/*
	 * Resolve the mapping until finished. (drivers are
@@ -3197,21 +3213,8 @@ end_io:
 	old_sector = bio->bi_sector;
 	old_dev = bio->bi_bdev->bd_dev;
 
-	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-	if (maxsector) {
-		sector_t sector = bio->bi_sector;
-
-		if (maxsector < nr_sectors ||
-		    maxsector - nr_sectors < sector) {
-			/*
-			 * This may well happen - partitions are not
-			 * checked to make sure they are within the size
-			 * of the whole device.
-			 */
-			handle_bad_sector(bio);
-			goto end_io;
-		}
-	}
+	if (bio_check_eod(bio, nr_sectors))
+		goto end_io;
 
 	ret = q->make_request_fn(q, bio);
 } while (ret);
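
One property of the factored-out check that is easy to miss: it compares
against maxsector - nr_sectors instead of computing sector + nr_sectors,
because the addition could wrap around for a bogus, very large starting
sector.  The small stand-alone program below demonstrates that behaviour;
the function name and integer types are mine (sector_t is modelled as
uint64_t), not the kernel's.

/*
 * Stand-alone illustration of the overflow-safe end-of-device test.
 */
#include <stdint.h>
#include <stdio.h>

static int check_eod(uint64_t maxsector, uint64_t sector, unsigned int nr_sectors)
{
	if (!nr_sectors || !maxsector)
		return 0;	/* nothing to check */
	if (maxsector < nr_sectors || maxsector - nr_sectors < sector)
		return 1;	/* request runs past the end of the device */
	return 0;
}

int main(void)
{
	uint64_t maxsector = 1000;	/* device is 1000 sectors long */

	printf("%d\n", check_eod(maxsector, 990, 8));		/* 0: fits */
	printf("%d\n", check_eod(maxsector, 996, 8));		/* 1: runs off the end */
	printf("%d\n", check_eod(maxsector, UINT64_MAX - 3, 8));	/* 1: rejected */
	return 0;
}

The third call would slip through a naive "sector + nr_sectors > maxsector"
test, because the addition wraps around to a small value.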
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 10:59 ` [PATCH] block: factor out bio_check_eod() Tejun Heo
@ 2007-07-18 11:06 ` Jens Axboe
2007-07-18 11:18 ` Tejun Heo
0 siblings, 1 reply; 102+ messages in thread
From: Jens Axboe @ 2007-07-18 11:06 UTC (permalink / raw)
To: Tejun Heo
Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid,
    David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger

On Wed, Jul 18 2007, Tejun Heo wrote:
> End of device check is done twice in __generic_make_request() and it's
> fully inlined each time.  Factor out bio_check_eod().

Tejun, yeah I should separate the cleanups and put them in the upstream
branch. Will do so and add your signed-off to both of them.

-- 
Jens Axboe
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:06 ` Jens Axboe
@ 2007-07-18 11:18 ` Tejun Heo
2007-07-18 11:31 ` Jens Axboe
0 siblings, 1 reply; 102+ messages in thread
From: Tejun Heo @ 2007-07-18 11:18 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> End of device check is done twice in __generic_make_request() and it's
>> fully inlined each time.  Factor out bio_check_eod().
>
> Tejun, yeah I should separate the cleanups and put them in the upstream
> branch. Will do so and add your signed-off to both of them.

Would they be different from the one I just posted?  No big deal either
way.  I'm just basing the zero-length barrier on top of these patches.
Oh well, the changes are trivial anyway.

-- 
tejun
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:18 ` Tejun Heo
@ 2007-07-18 11:31 ` Jens Axboe
2007-07-18 11:33 ` Tejun Heo
0 siblings, 1 reply; 102+ messages in thread
From: Jens Axboe @ 2007-07-18 11:31 UTC (permalink / raw)
To: Tejun Heo
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

On Wed, Jul 18 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > On Wed, Jul 18 2007, Tejun Heo wrote:
> >> End of device check is done twice in __generic_make_request() and it's
> >> fully inlined each time.  Factor out bio_check_eod().
> >
> > Tejun, yeah I should separate the cleanups and put them in the upstream
> > branch. Will do so and add your signed-off to both of them.
> >
>
> Would they be different from the one I just posted?  No big deal either
> way.  I'm just basing the zero-length barrier on top of these patches.
> Oh well, the changes are trivial anyway.

This one ended up being the same, but in the first one you missed some
of the cleanups. I ended up splitting the patch some more though, see
the series:

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

-- 
Jens Axboe
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:31 ` Jens Axboe
@ 2007-07-18 11:33 ` Tejun Heo
2007-07-18 11:34 ` Jens Axboe
0 siblings, 1 reply; 102+ messages in thread
From: Tejun Heo @ 2007-07-18 11:33 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> End of device check is done twice in __generic_make_request() and it's
>>>> fully inlined each time.  Factor out bio_check_eod().
>>> Tejun, yeah I should separate the cleanups and put them in the upstream
>>> branch. Will do so and add your signed-off to both of them.
>>>
>> Would they be different from the one I just posted?  No big deal either
>> way.  I'm just basing the zero-length barrier on top of these patches.
>> Oh well, the changes are trivial anyway.
>
> This one ended up being the same, but in the first one you missed some
> of the cleanups. I ended up splitting the patch some more though, see
> the series:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.

-- 
tejun
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:33 ` Tejun Heo
@ 2007-07-18 11:34 ` Jens Axboe
2007-07-18 11:41 ` Tejun Heo
0 siblings, 1 reply; 102+ messages in thread
From: Jens Axboe @ 2007-07-18 11:34 UTC (permalink / raw)
To: Tejun Heo
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

On Wed, Jul 18 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > On Wed, Jul 18 2007, Tejun Heo wrote:
> >> Jens Axboe wrote:
> >>> On Wed, Jul 18 2007, Tejun Heo wrote:
> >>>> End of device check is done twice in __generic_make_request() and it's
> >>>> fully inlined each time.  Factor out bio_check_eod().
> >>> Tejun, yeah I should separate the cleanups and put them in the upstream
> >>> branch. Will do so and add your signed-off to both of them.
> >>>
> >> Would they be different from the one I just posted?  No big deal either
> >> way.  I'm just basing the zero-length barrier on top of these patches.
> >> Oh well, the changes are trivial anyway.
> >
> > This one ended up being the same, but in the first one you missed some
> > of the cleanups. I ended up splitting the patch some more though, see
> > the series:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
>
> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.

1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
it completely :-)

-- 
Jens Axboe
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:34 ` Jens Axboe
@ 2007-07-18 11:41 ` Tejun Heo
2007-07-18 11:45 ` Jens Axboe
0 siblings, 1 reply; 102+ messages in thread
From: Tejun Heo @ 2007-07-18 11:41 UTC (permalink / raw)
To: Jens Axboe
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, linux-fsdevel,
    Phillip Susi, Andreas Dilger

Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> Jens Axboe wrote:
>>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>>>> End of device check is done twice in __generic_make_request() and it's
>>>>>> fully inlined each time.  Factor out bio_check_eod().
>>>>> Tejun, yeah I should separate the cleanups and put them in the upstream
>>>>> branch. Will do so and add your signed-off to both of them.
>>>>>
>>>> Would they be different from the one I just posted?  No big deal either
>>>> way.  I'm just basing the zero-length barrier on top of these patches.
>>>> Oh well, the changes are trivial anyway.
>>> This one ended up being the same, but in the first one you missed some
>>> of the cleanups. I ended up splitting the patch some more though, see
>>> the series:
>>>
>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
>> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
>
> 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
> it completely :-)

I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
like stealing, you know. :-) I think 1781c6a3 also can use splitting -
zero length barrier implementation and issue_flush conversion.

Anyways, how do I pull from git.kernel.dk?
git://git.kernel.dk/linux-2.6-block.git gives me connection reset by server.

Thanks.

-- 
tejun
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:41 ` Tejun Heo
@ 2007-07-18 11:45 ` Jens Axboe
2007-07-18 11:49 ` Jens Axboe
2007-07-18 12:31 ` Jens Axboe
0 siblings, 2 replies; 102+ messages in thread
From: Jens Axboe @ 2007-07-18 11:45 UTC (permalink / raw)
To: Tejun Heo
Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid,
    David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger

On Wed, Jul 18 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > On Wed, Jul 18 2007, Tejun Heo wrote:
> >> Jens Axboe wrote:
> >>> On Wed, Jul 18 2007, Tejun Heo wrote:
> >>>> Jens Axboe wrote:
> >>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
> >>>>>> End of device check is done twice in __generic_make_request() and it's
> >>>>>> fully inlined each time.  Factor out bio_check_eod().
> >>>>> Tejun, yeah I should separate the cleanups and put them in the upstream
> >>>>> branch. Will do so and add your signed-off to both of them.
> >>>>>
> >>>> Would they be different from the one I just posted?  No big deal either
> >>>> way.  I'm just basing the zero-length barrier on top of these patches.
> >>>> Oh well, the changes are trivial anyway.
> >>> This one ended up being the same, but in the first one you missed some
> >>> of the cleanups. I ended up splitting the patch some more though, see
> >>> the series:
> >>>
> >>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
> >> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
> >
> > 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
> > it completely :-)
>
> I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
> like stealing, you know. :-) I think 1781c6a3 also can use splitting -
> zero length barrier implementation and issue_flush conversion.

Yes that's true, I could split that in two as well. Will do so!

> Anyways, how do I pull from git.kernel.dk?
> git://git.kernel.dk/linux-2.6-block.git gives me connection reset by server.

git://git.kernel.dk/data/git/linux-2.6-block.git

somewhat annoying, I'll see if I can prefix it with git-daemon in the
future.

-- 
Jens Axboe
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:45 ` Jens Axboe
@ 2007-07-18 11:49 ` Jens Axboe
2007-07-18 12:34 ` Tejun Heo
0 siblings, 1 reply; 102+ messages in thread
From: Jens Axboe @ 2007-07-18 11:49 UTC (permalink / raw)
To: Tejun Heo
Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid,
    David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger

On Wed, Jul 18 2007, Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
> > Jens Axboe wrote:
> > > On Wed, Jul 18 2007, Tejun Heo wrote:
> > >> Jens Axboe wrote:
> > >>> On Wed, Jul 18 2007, Tejun Heo wrote:
> > >>>> Jens Axboe wrote:
> > >>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
> > >>>>>> End of device check is done twice in __generic_make_request() and it's
> > >>>>>> fully inlined each time.  Factor out bio_check_eod().
> > >>>>> Tejun, yeah I should separate the cleanups and put them in the upstream
> > >>>>> branch. Will do so and add your signed-off to both of them.
> > >>>>>
> > >>>> Would they be different from the one I just posted?  No big deal either
> > >>>> way.  I'm just basing the zero-length barrier on top of these patches.
> > >>>> Oh well, the changes are trivial anyway.
> > >>> This one ended up being the same, but in the first one you missed some
> > >>> of the cleanups. I ended up splitting the patch some more though, see
> > >>> the series:
> > >>>
> > >>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
> > >> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
> > >
> > > 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
> > > it completely :-)
> >
> > I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
> > like stealing, you know. :-) I think 1781c6a3 also can use splitting -
> > zero length barrier implementation and issue_flush conversion.
>
> Yes that's true, I could split that in two as well. Will do so!
>
> > Anyways, how do I pull from git.kernel.dk?
> > git://git.kernel.dk/linux-2.6-block.git gives me connection reset by server.
>
> git://git.kernel.dk/data/git/linux-2.6-block.git
>
> somewhat annoying, I'll see if I can prefix it with git-daemon in the
> future.

OK, now skip the /data/git/ stuff and just use

git://git.kernel.dk/linux-2.6-block.git

:-)

-- 
Jens Axboe
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:49 ` Jens Axboe
@ 2007-07-18 12:34 ` Tejun Heo
0 siblings, 0 replies; 102+ messages in thread
From: Tejun Heo @ 2007-07-18 12:34 UTC (permalink / raw)
To: Jens Axboe
Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid,
    David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger

Jens Axboe wrote:
>> somewhat annoying, I'll see if I can prefix it with git-daemon in the
>> future.
>
> OK, now skip the /data/git/ stuff and just use
>
> git://git.kernel.dk/linux-2.6-block.git

Alright, it works like a charm now.  Thanks.

-- 
tejun
* Re: [PATCH] block: factor out bio_check_eod()
2007-07-18 11:45 ` Jens Axboe
2007-07-18 11:49 ` Jens Axboe
@ 2007-07-18 12:31 ` Jens Axboe
1 sibling, 0 replies; 102+ messages in thread
From: Jens Axboe @ 2007-07-18 12:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Neil Brown, linux-fsdevel, linux-kernel, dm-devel, linux-raid,
    David Chinner, Phillip Susi, Stefan Bader, Andreas Dilger

On Wed, Jul 18 2007, Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
> > Jens Axboe wrote:
> > > On Wed, Jul 18 2007, Tejun Heo wrote:
> > >> Jens Axboe wrote:
> > >>> On Wed, Jul 18 2007, Tejun Heo wrote:
> > >>>> Jens Axboe wrote:
> > >>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
> > >>>>>> End of device check is done twice in __generic_make_request() and it's
> > >>>>>> fully inlined each time.  Factor out bio_check_eod().
> > >>>>> Tejun, yeah I should separate the cleanups and put them in the upstream
> > >>>>> branch. Will do so and add your signed-off to both of them.
> > >>>>>
> > >>>> Would they be different from the one I just posted?  No big deal either
> > >>>> way.  I'm just basing the zero-length barrier on top of these patches.
> > >>>> Oh well, the changes are trivial anyway.
> > >>> This one ended up being the same, but in the first one you missed some
> > >>> of the cleanups. I ended up splitting the patch some more though, see
> > >>> the series:
> > >>>
> > >>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
> > >> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
> > >
> > > 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
> > > it completely :-)
> >
> > I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
> > like stealing, you know. :-) I think 1781c6a3 also can use splitting -
> > zero length barrier implementation and issue_flush conversion.
>
> Yes that's true, I could split that in two as well. Will do so!

Done, result in the same location.

-- 
Jens Axboe
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown
` (5 preceding siblings ...)
2007-05-28 1:30 ` Neil Brown
@ 2007-05-28 11:17 ` Nikita Danilov
2007-05-31 3:31 ` Neil Brown
2007-05-28 14:43 ` Bill Davidsen
7 siblings, 1 reply; 102+ messages in thread
From: Nikita Danilov @ 2007-05-28 11:17 UTC (permalink / raw)
To: device-mapper development
Cc: linux-fsdevel, linux-raid, David Chinner, linux-kernel, Jens Axboe

Neil Brown writes:

[...]

> Thus the general sequence might be:
>
>   a/ issue all "preceding writes".
>   b/ issue the commit write with BIO_RW_BARRIER
>   c/ wait for the commit to complete.
>      If it was successful - done.
>      If it failed other than with EOPNOTSUPP, abort
>      else continue
>   d/ wait for all 'preceding writes' to complete
>   e/ call blkdev_issue_flush
>   f/ issue commit write without BIO_RW_BARRIER
>   g/ wait for commit write to complete
>        if it failed, abort
>   h/ call blkdev_issue_flush
>      DONE
>
> steps b and c can be left out if it is known that the device does not
> support barriers.  The only way to discover this is to try and see if it
> fails.
>
> I don't think any filesystem follows all these steps.

It seems that steps b/ -- h/ are quite generic, and can be implemented
once in generic code (with some synchronization mechanism like
wait-queue at d/).

Nikita.

[...]

>
> Thank you for your attention.
>
> NeilBrown

Nikita.
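
As a rough sketch of what such generic code could look like, here is a
stand-alone model of the b/ through h/ fallback sequence.  Every helper in
it is a stub invented for the illustration (none of them are existing
kernel interfaces), and errors follow the kernel's negative-errno
convention.

/*
 * Stand-alone sketch of the barrier-with-fallback commit sequence.
 * The stubs simply pretend every operation succeeds.
 */
#include <errno.h>
#include <stdio.h>

static int submit_commit_write(int use_barrier) { (void)use_barrier; return 0; }
static int wait_for_preceding_writes(void)      { return 0; }
static int issue_flush(void)                    { return 0; }

static int commit_with_fallback(void)
{
	int err;

	/* b/ + c/: try the commit write as a barrier first */
	err = submit_commit_write(1);
	if (!err)
		return 0;			/* done */
	if (err != -EOPNOTSUPP)
		return err;			/* real failure: abort */

	/* d/ .. h/: emulate the barrier with drain + flush */
	err = wait_for_preceding_writes();	/* d/ */
	if (err)
		return err;
	err = issue_flush();			/* e/ */
	if (err)
		return err;
	err = submit_commit_write(0);		/* f/ + g/ */
	if (err)
		return err;
	return issue_flush();			/* h/ */
}

int main(void)
{
	printf("commit_with_fallback() -> %d\n", commit_with_fallback());
	return 0;
}

A real implementation would also need the synchronization Nikita mentions
at step d/, e.g. a wait queue that the completion path of each preceding
write kicks.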
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007-05-28 11:17 ` [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Nikita Danilov
@ 2007-05-31 3:31 ` Neil Brown
0 siblings, 0 replies; 102+ messages in thread
From: Neil Brown @ 2007-05-31 3:31 UTC (permalink / raw)
To: Nikita Danilov
Cc: device-mapper development, linux-fsdevel, linux-raid,
    David Chinner, linux-kernel, Jens Axboe

On Monday May 28, nikita@clusterfs.com wrote:
> Neil Brown writes:
>
> [...]
>
> > Thus the general sequence might be:
> >
> >   a/ issue all "preceding writes".
> >   b/ issue the commit write with BIO_RW_BARRIER
> >   c/ wait for the commit to complete.
> >      If it was successful - done.
> >      If it failed other than with EOPNOTSUPP, abort
> >      else continue
> >   d/ wait for all 'preceding writes' to complete
> >   e/ call blkdev_issue_flush
> >   f/ issue commit write without BIO_RW_BARRIER
> >   g/ wait for commit write to complete
> >        if it failed, abort
> >   h/ call blkdev_issue_flush
> >      DONE
> >
> > steps b and c can be left out if it is known that the device does not
> > support barriers.  The only way to discover this is to try and see if it
> > fails.
> >
> > I don't think any filesystem follows all these steps.
>
> It seems that steps b/ -- h/ are quite generic, and can be implemented
> once in generic code (with some synchronization mechanism like
> wait-queue at d/).

Yes and no.  It depends on what you mean by "preceding write".

If you implement this in the filesystem, the filesystem can wait only
for those writes where it has an ordering dependency.  If you
implement it in common code, then you have to wait for all writes
that were previously issued.

e.g. If you have two different filesystems on two different
partitions on the one device, why should writes in one filesystem
wait for a barrier issued in the other filesystem?

If you have a single filesystem with one thread doing lots of
over-writes (no metadata changes) and another doing lots of metadata
changes (requiring journalling and barriers), why should the data
write be held up by the metadata updates?

So I'm not actually convinced that doing this in common code is the
best approach.  But it is the easiest.  The common code should
provide the barrier and flushing primitives, and the filesystem gets
to use them however it likes.

NeilBrown
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007-05-25 7:58 [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Neil Brown
` (6 preceding siblings ...)
2007-05-28 11:17 ` [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md Nikita Danilov
@ 2007-05-28 14:43 ` Bill Davidsen
2007-05-31 0:37 ` Neil Brown
7 siblings, 1 reply; 102+ messages in thread
From: Bill Davidsen @ 2007-05-28 14:43 UTC (permalink / raw)
To: Neil Brown
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, Jens Axboe,
    linux-fsdevel

Neil Brown wrote:
> We can think of there being three types of devices:
>
> 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
>       there is it is non-volatile.  Once a write completes it is
>       completely safe.  Such a device does not require barriers
>       or ->issue_flush_fn, and can respond to them either by a
>       no-op or with -EOPNOTSUPP (the former is preferred).
>
> 2/ FLUSHABLE.
>       A FLUSHABLE device may have a volatile write-behind cache.
>       This cache can be flushed with a call to blkdev_issue_flush.
>       It may not support barrier requests.
>
> 3/ BARRIER.
>       A BARRIER device supports both blkdev_issue_flush and
>       BIO_RW_BARRIER.  Either may be used to synchronise any
>       write-behind cache to non-volatile storage (media).
>
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device.  The BARRIER device has the option of more
> efficient handling.
>
There are two things I'm not sure you covered.

First, disks which don't support flush but do have a "cache dirty"
status bit you can poll at times like shutdown. If there are no drivers
which support these, it can be ignored.

Second, NAS (including nbd?). Is there enough information to handle
this "really right?"

Otherwise this looks good as a statement of the issues. It seems to me
that the filesystem should be able to pass the barrier request to the
block layer and have it taken care of, rather than have code in each
f/s to cope with odd behavior.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007-05-28 14:43 ` Bill Davidsen
@ 2007-05-31 0:37 ` Neil Brown
2007-05-31 12:28 ` Bill Davidsen
0 siblings, 1 reply; 102+ messages in thread
From: Neil Brown @ 2007-05-31 0:37 UTC (permalink / raw)
To: Bill Davidsen
Cc: linux-fsdevel, linux-kernel, dm-devel, linux-raid, Jens Axboe,
    David Chinner

On Monday May 28, davidsen@tmr.com wrote:
> There are two things I'm not sure you covered.
>
> First, disks which don't support flush but do have a "cache dirty"
> status bit you can poll at times like shutdown. If there are no drivers
> which support these, it can be ignored.

There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?

That wouldn't be very good for performance....  maybe you just
wouldn't bother with barriers on that sort of device?

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have "-o nobarriers" or "-o barriers=0",
or the inverse.

md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

>
> Second, NAS (including nbd?). Is there enough information to handle
> this "really right?"

NAS means lots of things, including NFS and CIFS where this doesn't
apply.

For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees.

For 'iscsi', I guess it works just the same as SCSI...

NeilBrown
* Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007-05-31 0:37 ` Neil Brown
@ 2007-05-31 12:28 ` Bill Davidsen
0 siblings, 0 replies; 102+ messages in thread
From: Bill Davidsen @ 2007-05-31 12:28 UTC (permalink / raw)
To: Neil Brown
Cc: David Chinner, linux-kernel, linux-raid, dm-devel, Jens Axboe,
    linux-fsdevel

Neil Brown wrote:
> On Monday May 28, davidsen@tmr.com wrote:
>
>> There are two things I'm not sure you covered.
>>
>> First, disks which don't support flush but do have a "cache dirty"
>> status bit you can poll at times like shutdown. If there are no drivers
>> which support these, it can be ignored.
>>
>
> There are really devices like that?  So to implement a flush, you have
> to stop sending writes and wait and poll - maybe poll every
> millisecond?
>
Yes, there really are (or were). But I don't think that there are
drivers, so it's not an issue.

> That wouldn't be very good for performance....  maybe you just
> wouldn't bother with barriers on that sort of device?
>
That is why there are no drivers...

> Which reminds me:  What is the best way to turn off barriers?
> Several filesystems have "-o nobarriers" or "-o barriers=0",
> or the inverse.
>
If they can function usefully without, the admin gets to make that
choice.

> md/raid currently uses barriers to write metadata, and there is no
> way to turn that off.  I'm beginning to wonder if that is best.
>
I don't see how you can have reliable operation without it, particularly
WRT bitmap.

> Maybe barrier support should be a function of the device.  i.e. the
> filesystem or whatever always sends barrier requests where it thinks
> it is appropriate, and the block device tries to honour them to the
> best of its ability, but if you run
>    blockdev --enforce-barriers=no /dev/sda
> then you lose some reliability guarantees, but gain some throughput (a
> bit like the 'async' export option for nfsd).
>
Since this is device dependent, it really should be in the device
driver, and requests should have status of success, failure, or feature
unavailability.

>> Second, NAS (including nbd?). Is there enough information to handle
>> this "really right?"
>>
>
> NAS means lots of things, including NFS and CIFS where this doesn't
> apply.
>
Well, we're really talking about network attached devices rather than
network filesystems. I guess people do lump them together.

> For 'nbd', it is entirely up to the protocol.  If the protocol allows
> a barrier flag to be sent to the server, then barriers should just
> work.  If it doesn't, then either the server disables write-back
> caching, or flushes every request, or you lose all barrier
> guarantees.
>
Pretty much agrees with what I said above, it's at a level closer to the
device, and status should come back from the physical i/o request.

> For 'iscsi', I guess it works just the same as SCSI...
>
Hopefully.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979