* discard and barriers
@ 2010-08-14 11:56 Christoph Hellwig
  2010-08-14 14:14 ` Ted Ts'o
  2010-08-23 16:42 ` Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2010-08-14 11:56 UTC (permalink / raw)
  To: hughd, hirofumi, tytso, chris.mason, swhiteho
  Cc: linux-fsdevel, jaxboe, martin.petersen

Currently all filesystems submit their discard requests as
BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER type requests.  That is, they both
wait for completion synchronously and submit them as a barrier.

For those not part of the current barrier discussion, the barrier flag
has two implications:

 (a) it prevents reordering of the request with any previous or later
     one
 (b) it causes a cache flush before the request and ensures the request
     is on disk when it returns, which at least for typical SATA devices
     means another flush request, as we don't enable the FUA bit (which
     isn't applicable to TRIM or UNMAP anyway)

(a) is something we are planning to get rid of in the block layer
completely, so we'll need to figure out how to deal with it for
discards.  (b) doesn't actually seem to be necessary for discards from
my research - given that discards are an optimization for the storage
device, we don't care whether one actually hits the disk or not in case
of a crash.  And given that the definition of SYNCHRONIZE CACHE only
deals with data, it's not even defined that it affects discard commands.

Can you guys review whether you rely on (a) in your filesystems and, if
yes, help me figure out a good way to replace it?  All the callers look
like they do not actually rely on it, because they seem to wait for the
actual block-freeing metadata I/O beforehand, but I'm not sure.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: discard and barriers
  2010-08-14 11:56 discard and barriers Christoph Hellwig
@ 2010-08-14 14:14 ` Ted Ts'o
  2010-08-14 14:52   ` Christoph Hellwig
  2010-08-23 16:42 ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Ted Ts'o @ 2010-08-14 14:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hughd, hirofumi, chris.mason, swhiteho, linux-fsdevel, jaxboe,
	martin.petersen

On Sat, Aug 14, 2010 at 01:56:25PM +0200, Christoph Hellwig wrote:
> Can you guys review whether you rely on (a) in your filesystems and,
> if yes, help me figure out a good way to replace it?  All the callers
> look like they do not actually rely on it, because they seem to wait
> for the actual block-freeing metadata I/O beforehand, but I'm not
> sure.

In no-journal mode we don't actually wait, but nothing is guaranteed
after a crash without a journal anyway, so I think we're good.  For all
file systems that use journalling, we have to make sure the transaction
which freed the blocks is safely committed to the disk platters before
we can call trim, so what you're asking for should be guaranteed if the
file system is otherwise correct with respect to journalling --- or am
I missing something?

Also, to be clear: the block layer will guarantee that a trim/discard
of block 12345 will not be reordered with respect to a write to block
12345, correct?

And on SATA devices, where discard requests are not queued requests,
the ata layer will have to do a queue flush *before* the discard is
sent, right?  But things should be a tiny bit better even with SATA,
because we won't need to wait for the barrier to be acknowledged before
sending more write requests to the device.  If I understand things
correctly, the main place where this will have a benefit will be more
advanced interfaces like SAS?

						- Ted
* Re: discard and barriers
  2010-08-14 14:14 ` Ted Ts'o
@ 2010-08-14 14:52   ` Christoph Hellwig
  2010-08-14 15:46     ` Chris Mason
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2010-08-14 14:52 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Christoph Hellwig, hughd, hirofumi, chris.mason, swhiteho,
	linux-fsdevel, jaxboe, martin.petersen

On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote:
> Also, to be clear: the block layer will guarantee that a trim/discard
> of block 12345 will not be reordered with respect to a write to block
> 12345, correct?

Right now that is what the hardbarrier does, and that's what we're
trying to get rid of.  For XFS we prevent this with something called
the busy extent list - extents deleted by a transaction are inserted
into it (it's actually an rbtree, not a list, these days), and before
we can reuse blocks from it we need to ensure the transaction is fully
committed.  Discards only happen off that list, and extents are only
removed from it once the discard has finished.  I assume other
filesystems have a similar mechanism.

> And on SATA devices, where discard requests are not queued requests,
> the ata layer will have to do a queue flush *before* the discard is
> sent, right?

Yes.

> But things should be a tiny bit better even with SATA,
> because we won't need to wait for the barrier to be acknowledged
> before sending more write requests to the device.  If I understand
> things correctly, the main place where this will have a benefit will
> be more advanced interfaces like SAS?

The performance improvement is primarily interesting for Fibre Channel
or iSCSI attached arrays with thin provisioning support.  I've not seen
a TP-capable SAS device yet.  The other motivation is that this is the
last piece that relies on the ordering semantics of barriers, which
we're trying to get rid of.
* Re: discard and barriers
  2010-08-14 14:52 ` Christoph Hellwig
@ 2010-08-14 15:46   ` Chris Mason
  2010-08-14 17:22     ` Christoph Hellwig
  2010-08-14 20:11     ` Hugh Dickins
  2010-08-15 17:39   ` Ted Ts'o
  2010-08-16  9:41   ` Steven Whitehouse
  2 siblings, 2 replies; 14+ messages in thread
From: Chris Mason @ 2010-08-14 15:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, hughd, hirofumi, swhiteho, linux-fsdevel, jaxboe,
	martin.petersen

On Sat, Aug 14, 2010 at 04:52:10PM +0200, Christoph Hellwig wrote:
> On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote:
> > Also, to be clear: the block layer will guarantee that a
> > trim/discard of block 12345 will not be reordered with respect to a
> > write to block 12345, correct?
>
> Right now that is what the hardbarrier does, and that's what we're
> trying to get rid of.

So btrfs will wait_on_{page/buffer/bio} to meet all ordering
requirements.  This holds both for transaction commit and for discard.
Reiserfs has the exception you already know about.

> For XFS we prevent this with something called
> the busy extent list - extents deleted by a transaction are inserted
> into it (it's actually an rbtree, not a list, these days), and before
> we can reuse blocks from it we need to ensure the transaction is
> fully committed.  Discards only happen off that list, and extents are
> only removed from it once the discard has finished.  I assume other
> filesystems have a similar mechanism.
>
> > And on SATA devices, where discard requests are not queued requests,
> > the ata layer will have to do a queue flush *before* the discard is
> > sent, right?

Another way to say this is that we have to be 100% sure that if we
write something after a discard, the storage will do that write after
it does the discard.

I'm not actually worried about writes before the discard, because the
worst case for us is that the drive fails to discard something it could
have (this is the drive's problem).  Cache flushes from the FS will
cover the case where transaction commits depend on the data going in
before the discard.

I care a lot about the writes after the discards though.  If the
discards themselves become async, that's ok too, as long as we have
some way to do end_io processing on them.

-chris
* Re: discard and barriers 2010-08-14 15:46 ` Chris Mason @ 2010-08-14 17:22 ` Christoph Hellwig 2010-08-14 20:11 ` Hugh Dickins 1 sibling, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2010-08-14 17:22 UTC (permalink / raw) To: Chris Mason, Christoph Hellwig, Ted Ts'o, hughd, hirofumi, swhiteho, linux-fsdevel On Sat, Aug 14, 2010 at 11:46:36AM -0400, Chris Mason wrote: > Another way to say this is we have to be 100% sure that if we write > something after a discard, that storage will do that write after it does > the discard. Once we don't have barriers anymore the only way to do that is to wait for the discard to finish before submitting that I/O. > I care a lot about the write after the discards though. If the discards > themselves become async, that's ok too as long as we have some way to do > end_io processing on them. You can just submit the discard bio yourself. That's what I do in the current XFS code. It's a bit awkward that you have to do all the size checking yourself currently, but we could make that a common helper, I just don't want to do that now as it would create all kinds of dependencies for merging the trees in the .37 window. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: discard and barriers 2010-08-14 15:46 ` Chris Mason 2010-08-14 17:22 ` Christoph Hellwig @ 2010-08-14 20:11 ` Hugh Dickins 1 sibling, 0 replies; 14+ messages in thread From: Hugh Dickins @ 2010-08-14 20:11 UTC (permalink / raw) To: Chris Mason, Christoph Hellwig, Ted Ts'o, hughd, hirofumi, swhiteho, linux-fsdevel On Sat, 14 Aug 2010, Chris Mason wrote: > On Sat, Aug 14, 2010 at 04:52:10PM +0200, Christoph Hellwig wrote: > > On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote: > > > Also, to be clear, the block layer will guarantee that a trim/discard > > > of block 12345 will not be reordered with respect to a write block > > > 12345, correct? > > > > Right now that is what the hardbarrier does, and that's what we're > > trying to get rid of. > > So btrfs will wait_on_{page/buffer/bio} to meet all ordering > requirements. This holds both for transaction commit and for discard. > Reiserfs has the exception you already know about. > > > For XFS we prevent this by something that is > > called the busy extent list - extents delete by a transaction are > > inserted into it (it's actually a rbtree not a list these days), > > and before we can reuse blocks from it we need to ensure that it > > is fully commited. discards only happen off that list and extents > > are only removed from it once the discard has finished. I assume > > other filesystems have a similar mechanism. Yes, whatever works for XFS and for btrfs will be enough for swap: all it needs is to wait on completion of the discard, just as you enforced with BLKDEV_IFL_WAIT, before issuing more writes to the discarded area - as it already does. > > > > > And on SATA devices, where discard requests are not queued requests, > > > the ata layer will have to do a queue flush *before* the discard is > > > sent, right? > > Another way to say this is we have to be 100% sure that if we write > something after a discard, that storage will do that write after it does > the discard. 
> > I'm not actually worried about writes before the discard, because the > worst case for us is the drive fails to discard something it could have > (this is the drive's problem). Cache flushes from the FS will cover the > case where transaction commits depend on the data going in before the > discard. That is a great point. Swap does not need nor want a queue flush before discard: all that achieves is interfere with the flow to other partitions. Can we reason that that queue flush cannot be necessary in any case - that anything which appears to need it for correctness must actually be already doing serialization that makes it superfluous? > > I care a lot about the write after the discards though. If the discards > themselves become async, that's ok too as long as we have some way to do > end_io processing on them. Yes, same for swap. Hugh ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: discard and barriers
  2010-08-14 14:52 ` Christoph Hellwig
  2010-08-14 15:46   ` Chris Mason
@ 2010-08-15 17:39   ` Ted Ts'o
  2010-08-15 19:02     ` Christoph Hellwig
  2010-08-16  9:41   ` Steven Whitehouse
  2 siblings, 1 reply; 14+ messages in thread
From: Ted Ts'o @ 2010-08-15 17:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hughd, hirofumi, chris.mason, swhiteho, linux-fsdevel, jaxboe,
	martin.petersen

On Sat, Aug 14, 2010 at 04:52:10PM +0200, Christoph Hellwig wrote:
> On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote:
> > Also, to be clear: the block layer will guarantee that a
> > trim/discard of block 12345 will not be reordered with respect to a
> > write to block 12345, correct?
>
> Right now that is what the hardbarrier does, and that's what we're
> trying to get rid of.  For XFS we prevent this with something called
> the busy extent list - extents deleted by a transaction are inserted
> into it (it's actually an rbtree, not a list, these days), and before
> we can reuse blocks from it we need to ensure the transaction is
> fully committed.  Discards only happen off that list, and extents are
> only removed from it once the discard has finished.  I assume other
> filesystems have a similar mechanism.

So ext4 does the transaction commit (which guarantees that the file
delete has hit the disk platters), and *then* issues the discard, and
*then* we zap the busy extent list.  That's the only safe thing to do,
since if we crash before the transaction gets committed, we lose the
data blocks, so I can't issue the discard until after I wait for the
commit block to finish.  This should be the case regardless of anything
we change with respect to how the discard operation works, since if we
discard and then crash before the commit block is written, data blocks
will get lost that should not be discarded.  Am I missing something?

So after the flush/ordering changes that have been proposed, if the
block device layer is free to reorder a discard and a subsequent write
to a discarded block, I will need to add a *new* wait for the discard
to complete before I can free the busy extent list.  And this will be
true for all file systems that are currently issuing discards.  Again,
am I missing something?

This implies that if the changes allowing the reordering of a discard
and subsequent writes to the discarded blocks go in *before* we update
all of the filesystems, then there is the potential for data loss.  And
while most file systems don't do discards by default but require some
mount option, this still might be considered undesirable.  So that
means we need to add the end-io callbacks to the discard operations
*first*, before we remove the implicit flush/ordering guarantees.

I thought you were saying in your earlier messages that it should be
safe to remove the flush/ordering guarantees, but this is leaving me
quite confused.  Did I misunderstand you?

						- Ted
* Re: discard and barriers
  2010-08-15 17:39 ` Ted Ts'o
@ 2010-08-15 19:02   ` Christoph Hellwig
  2010-08-15 21:25     ` Ted Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2010-08-15 19:02 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Christoph Hellwig, hughd, hirofumi, chris.mason, swhiteho,
	linux-fsdevel, jaxboe, martin.petersen

On Sun, Aug 15, 2010 at 01:39:06PM -0400, Ted Ts'o wrote:
> So after the flush/ordering changes that have been proposed, if the
> block device layer is free to reorder a discard and a subsequent
> write to a discarded block, I will need to add a *new* wait for the
> discard to complete before I can free the busy extent list.  And this
> will be true for all file systems that are currently issuing
> discards.  Again, am I missing something?

The above is correct, except for the *new* part.  sb_issue_discard at
the moment is synchronous, so you're already waiting for it to finish.

> So that means we need to add the end-io callbacks to the discard
> operations *first*, before we remove the implicit flush/ordering
> guarantees.

Doing the discard asynchronously with an end_io callback definitely is
an optimization over waiting for it synchronously, and it's in fact
what I'm doing in XFS.  It's however unrelated to getting rid of the
barriers.
* Re: discard and barriers
  2010-08-15 19:02 ` Christoph Hellwig
@ 2010-08-15 21:25   ` Ted Ts'o
  2010-08-15 21:30     ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Ted Ts'o @ 2010-08-15 21:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hughd, hirofumi, chris.mason, swhiteho, linux-fsdevel, jaxboe,
	martin.petersen

On Sun, Aug 15, 2010 at 09:02:30PM +0200, Christoph Hellwig wrote:
> On Sun, Aug 15, 2010 at 01:39:06PM -0400, Ted Ts'o wrote:
> > So after the flush/ordering changes that have been proposed, if the
> > block device layer is free to reorder a discard and a subsequent
> > write to a discarded block, I will need to add a *new* wait for the
> > discard to complete before I can free the busy extent list.  And
> > this will be true for all file systems that are currently issuing
> > discards.  Again, am I missing something?
>
> The above is correct, except for the *new* part.  sb_issue_discard at
> the moment is synchronous, so you're already waiting for it to finish.

OK, now I understand why I'm confused.  I thought the proposal was to
change sb_issue_discard() to make it asynchronous?  Really, what we're
talking about here is eliminating the explicit barrier/SYNCHRONIZE
CACHE from the discard, correct?  The sb_issue_discard() call will
still remain synchronous.

						- Ted
* Re: discard and barriers
  2010-08-15 21:25 ` Ted Ts'o
@ 2010-08-15 21:30   ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2010-08-15 21:30 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Christoph Hellwig, hughd, hirofumi, chris.mason, swhiteho,
	linux-fsdevel, jaxboe, martin.petersen

On Sun, Aug 15, 2010 at 05:25:34PM -0400, Ted Ts'o wrote:
> OK, now I understand why I'm confused.  I thought the proposal was to
> change sb_issue_discard() to make it asynchronous?  Really, what
> we're talking about here is eliminating the explicit
> barrier/SYNCHRONIZE CACHE from the discard, correct?  The
> sb_issue_discard() call will still remain synchronous.

Yes, at least for now.  I don't think keeping it that way over the long
run is a good idea, but for now getting rid of the barrier is all that
*needs* to be done.  The rest is optimizations that can be done later.
* Re: discard and barriers
  2010-08-14 14:52 ` Christoph Hellwig
  2010-08-14 15:46   ` Chris Mason
  2010-08-15 17:39   ` Ted Ts'o
@ 2010-08-16  9:41   ` Steven Whitehouse
  2010-08-16 11:26     ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Steven Whitehouse @ 2010-08-16 9:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, hughd, hirofumi, chris.mason, linux-fsdevel, jaxboe,
	martin.petersen

Hi,

On Sat, 2010-08-14 at 16:52 +0200, Christoph Hellwig wrote:
> On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote:
> > Also, to be clear: the block layer will guarantee that a
> > trim/discard of block 12345 will not be reordered with respect to a
> > write to block 12345, correct?
>
> Right now that is what the hardbarrier does, and that's what we're
> trying to get rid of.  For XFS we prevent this with something called
> the busy extent list - extents deleted by a transaction are inserted
> into it (it's actually an rbtree, not a list, these days), and before
> we can reuse blocks from it we need to ensure the transaction is
> fully committed.  Discards only happen off that list, and extents are
> only removed from it once the discard has finished.  I assume other
> filesystems have a similar mechanism.
>
GFS2 has a similar concept, which compares two bitmaps to generate the
extent list for the discards.  This is done after each resource group
has been committed to the journal, and just before the resource group
bitmap is updated with the newly freed blocks (and marked dirty).

Any remote node wanting to use that new space will cause a further
journal flush when it requests the resource group lock (as well as
in-place write back of that resource group, of course).

If the local node wants to reuse the recently freed space, then that
can happen as soon as the log commit has finished, so in this case the
barrier and the waiting are required.

At the moment it seems to be doing that on every request; however,
there is no reason why we couldn't move the barrier to the end of the
log flush code and have one per log flush, conditional upon a discard
having been issued (or some equivalent construct, bearing in mind the
objective of removing barriers),

Steve.
* Re: discard and barriers
  2010-08-16  9:41 ` Steven Whitehouse
@ 2010-08-16 11:26   ` Christoph Hellwig
  2010-08-17 10:59     ` Steven Whitehouse
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2010-08-16 11:26 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Christoph Hellwig, Ted Ts'o, hughd, hirofumi, chris.mason,
	linux-fsdevel, jaxboe, martin.petersen

On Mon, Aug 16, 2010 at 10:41:51AM +0100, Steven Whitehouse wrote:
> GFS2 has a similar concept, which compares two bitmaps to generate
> the extent list for the discards.  This is done after each resource
> group has been committed to the journal, and just before the resource
> group bitmap is updated with the newly freed blocks (and marked
> dirty).
>
> Any remote node wanting to use that new space will cause a further
> journal flush when it requests the resource group lock (as well as
> in-place write back of that resource group, of course).
>
> If the local node wants to reuse the recently freed space, then that
> can happen as soon as the log commit has finished, so in this case
> the barrier and the waiting are required.

I don't think you need the barrier for that.  The wait means the
discard has finished, and from that point on writes to the blocks
discarded are safe.  There's no need to flush the volatile write cache
after a discard either.  The big question is whether you need the drain
before the discard.  Given that you do a log commit first, I suspect
not, as the log commit waits on all I/Os related to that commit.
* Re: discard and barriers
  2010-08-16 11:26 ` Christoph Hellwig
@ 2010-08-17 10:59   ` Steven Whitehouse
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Whitehouse @ 2010-08-17 10:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, hughd, hirofumi, chris.mason, linux-fsdevel, jaxboe,
	martin.petersen

Hi,

On Mon, 2010-08-16 at 13:26 +0200, Christoph Hellwig wrote:
> On Mon, Aug 16, 2010 at 10:41:51AM +0100, Steven Whitehouse wrote:
> > GFS2 has a similar concept, which compares two bitmaps to generate
> > the extent list for the discards.  This is done after each resource
> > group has been committed to the journal, and just before the
> > resource group bitmap is updated with the newly freed blocks (and
> > marked dirty).
> >
> > Any remote node wanting to use that new space will cause a further
> > journal flush when it requests the resource group lock (as well as
> > in-place write back of that resource group, of course).
> >
> > If the local node wants to reuse the recently freed space, then
> > that can happen as soon as the log commit has finished, so in this
> > case the barrier and the waiting are required.
>
> I don't think you need the barrier for that.  The wait means the
> discard has finished, and from that point on writes to the blocks
> discarded are safe.  There's no need to flush the volatile write
> cache after a discard either.  The big question is whether you need
> the drain before the discard.  Given that you do a log commit first,
> I suspect not, as the log commit waits on all I/Os related to that
> commit.
>
Yes, that sounds reasonable in that case.  I didn't know there were
reordering guarantees for discards vs. normal I/O, so in that case
there is no need for the barrier.  I don't think anything more is
required here, so we should be good to go,

Steve.
* Re: discard and barriers
  2010-08-14 11:56 discard and barriers Christoph Hellwig
  2010-08-14 14:14 ` Ted Ts'o
@ 2010-08-23 16:42 ` Christoph Hellwig
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2010-08-23 16:42 UTC (permalink / raw)
  To: jaxboe
  Cc: hughd, hirofumi, tytso, chris.mason, swhiteho, martin.petersen,
	linux-fsdevel

Jens, do you think the following patch is still okay for 2.6.36?  It
fixes the massive performance regression due to the full barriers for
discards, and also makes the barrier removal in 2.6.37 a lot simpler.

---
From: Christoph Hellwig <hch@lst.de>
Subject: [PATCH] block: do not send discards as barriers

There is no reason to send discards as barriers.  The rationale is
easy: the current barriers do two things, flushing caches and
providing ordering.  There is no reason for a cache flush before a
discard, because a discard doesn't care about ordering versus writes
to regions other than the one it discards, and there is no reason for
a cache flush afterwards, as a discard doesn't store data.  The
ordering semantics aren't used currently, because all discards are
done synchronously and thus we order explicitly.  Even more, until
very late in the 2.6.35 cycle we didn't send DISCARD_BARRIER requests
as real hardbarriers but as elevator-only barriers, which don't
provide the above semantics, and the switch to real barriers caused
massive performance regressions, especially for the swap code.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/block/blk-lib.c
===================================================================
--- linux-2.6.orig/block/blk-lib.c	2010-08-23 15:16:37.656033352 +0200
+++ linux-2.6/block/blk-lib.c	2010-08-23 15:16:45.913012470 +0200
@@ -39,8 +39,7 @@ int blkdev_issue_discard(struct block_de
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
-	int type = flags & BLKDEV_IFL_BARRIER ?
-		DISCARD_BARRIER : DISCARD_NOBARRIER;
+	int type = REQ_WRITE | REQ_DISCARD;
 	unsigned int max_discard_sectors;
 	struct bio *bio;
 	int ret = 0;
@@ -65,7 +64,7 @@ int blkdev_issue_discard(struct block_de
 	if (flags & BLKDEV_IFL_SECURE) {
 		if (!blk_queue_secdiscard(q))
 			return -EOPNOTSUPP;
-		type |= DISCARD_SECURE;
+		type |= REQ_SECURE;
 	}
 
 	while (nr_sects && !ret) {
@@ -162,12 +161,6 @@ int blkdev_issue_zeroout(struct block_de
 	bb.wait = &wait;
 	bb.end_io = NULL;
 
-	if (flags & BLKDEV_IFL_BARRIER) {
-		/* issue async barrier before the data */
-		ret = blkdev_issue_flush(bdev, gfp_mask, NULL, 0);
-		if (ret)
-			return ret;
-	}
 submit:
 	ret = 0;
 	while (nr_sects != 0) {
@@ -199,13 +192,6 @@ submit:
 		issued++;
 		submit_bio(WRITE, bio);
 	}
-	/*
-	 * When all data bios are in flight. Send final barrier if requeted.
-	 */
-	if (nr_sects == 0 && flags & BLKDEV_IFL_BARRIER)
-		ret = blkdev_issue_flush(bdev, gfp_mask, NULL,
-					flags & BLKDEV_IFL_WAIT);
-
 	if (flags & BLKDEV_IFL_WAIT)
 		/* Wait for bios in-flight */
Index: linux-2.6/fs/btrfs/extent-tree.c
===================================================================
--- linux-2.6.orig/fs/btrfs/extent-tree.c	2010-08-23 15:16:37.677004089 +0200
+++ linux-2.6/fs/btrfs/extent-tree.c	2010-08-23 15:16:45.918004368 +0200
@@ -1696,7 +1696,7 @@ static void btrfs_issue_discard(struct b
 			u64 start, u64 len)
 {
 	blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
-			BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+			BLKDEV_IFL_WAIT);
 }
 
 static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
Index: linux-2.6/fs/gfs2/rgrp.c
===================================================================
--- linux-2.6.orig/fs/gfs2/rgrp.c	2010-08-23 15:16:37.684004298 +0200
+++ linux-2.6/fs/gfs2/rgrp.c	2010-08-23 15:16:45.927004298 +0200
@@ -854,8 +854,7 @@ static void gfs2_rgrp_send_discards(stru
 			if ((start + nr_sects) != blk) {
 				rv = blkdev_issue_discard(bdev, start,
 							nr_sects, GFP_NOFS,
-							BLKDEV_IFL_WAIT |
-							BLKDEV_IFL_BARRIER);
+							BLKDEV_IFL_WAIT);
 				if (rv)
 					goto fail;
 				nr_sects = 0;
@@ -870,7 +869,7 @@ start_new_extent:
 	}
 	if (nr_sects) {
 		rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS,
-					 BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+					 BLKDEV_IFL_WAIT);
 		if (rv)
 			goto fail;
 	}
Index: linux-2.6/fs/nilfs2/the_nilfs.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/the_nilfs.c	2010-08-23 15:16:37.692003949 +0200
+++ linux-2.6/fs/nilfs2/the_nilfs.c	2010-08-23 15:16:45.928014774 +0200
@@ -775,8 +775,7 @@ int nilfs_discard_segments(struct the_ni
 						   start * sects_per_block,
 						   nblocks * sects_per_block,
 						   GFP_NOFS,
-						   BLKDEV_IFL_WAIT |
-						   BLKDEV_IFL_BARRIER);
+						   BLKDEV_IFL_WAIT);
 			if (ret < 0)
 				return ret;
 			nblocks = 0;
@@ -787,7 +786,7 @@ int nilfs_discard_segments(struct the_ni
 					  start * sects_per_block,
 					  nblocks * sects_per_block,
 					  GFP_NOFS,
-					  BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+					  BLKDEV_IFL_WAIT);
 
 	return ret;
 }
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-08-23 15:16:37.700013168 +0200
+++ linux-2.6/include/linux/blkdev.h	2010-08-23 15:16:45.931003739 +0200
@@ -921,11 +921,9 @@ static inline struct request *blk_map_qu
 }
 enum{
 	BLKDEV_WAIT,	/* wait for completion */
-	BLKDEV_BARRIER,	/* issue request with barrier */
 	BLKDEV_SECURE,	/* secure discard */
 };
 #define BLKDEV_IFL_WAIT		(1 << BLKDEV_WAIT)
-#define BLKDEV_IFL_BARRIER	(1 << BLKDEV_BARRIER)
 #define BLKDEV_IFL_SECURE	(1 << BLKDEV_SECURE)
 extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *,
 			unsigned long);
@@ -939,7 +937,7 @@ static inline int sb_issue_discard(struc
 	block <<= (sb->s_blocksize_bits - 9);
 	nr_blocks <<= (sb->s_blocksize_bits - 9);
 	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_NOFS,
-				   BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+				   BLKDEV_IFL_WAIT);
 }
 extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-08-23 15:16:37.709012819 +0200
+++ linux-2.6/include/linux/fs.h	2010-08-23 15:16:45.938005835 +0200
@@ -159,14 +159,6 @@ struct inodes_stat_t {
 #define WRITE_BARRIER		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
 				 REQ_HARDBARRIER)
 
-/*
- * These aren't really reads or writes, they pass down information about
- * parts of device that are now unused by the file system.
- */
-#define DISCARD_NOBARRIER	(WRITE | REQ_DISCARD)
-#define DISCARD_BARRIER	(WRITE | REQ_DISCARD | REQ_HARDBARRIER)
-#define DISCARD_SECURE	(DISCARD_NOBARRIER | REQ_SECURE)
-
 #define SEL_IN		1
 #define SEL_OUT		2
 #define SEL_EX		4
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2010-08-23 15:16:37.724015124 +0200
+++ linux-2.6/mm/swapfile.c	2010-08-23 15:16:45.948004996 +0200
@@ -142,7 +142,7 @@ static int discard_swap(struct swap_info
 		if (nr_blocks) {
 			err = blkdev_issue_discard(si->bdev, start_block,
 				nr_blocks, GFP_KERNEL,
-				BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+				BLKDEV_IFL_WAIT);
 			if (err)
 				return err;
 			cond_resched();
@@ -154,7 +154,7 @@ static int discard_swap(struct swap_info
 		err = blkdev_issue_discard(si->bdev, start_block,
 				nr_blocks, GFP_KERNEL,
-				BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+				BLKDEV_IFL_WAIT);
 		if (err)
 			break;
 
@@ -193,8 +193,7 @@ static void discard_swap_cluster(struct
 		start_block <<= PAGE_SHIFT - 9;
 		nr_blocks <<= PAGE_SHIFT - 9;
 		if (blkdev_issue_discard(si->bdev, start_block,
-				nr_blocks, GFP_NOIO, BLKDEV_IFL_WAIT |
-				BLKDEV_IFL_BARRIER))
+				nr_blocks, GFP_NOIO, BLKDEV_IFL_WAIT))
 			break;
 	}