* Proposal to improve filesystem/block snapshot interaction
  [not found] <20070927063113.GD2989@sgi.com>
@ 2007-10-30 1:04 ` Greg Banks
  2007-10-30 1:11 ` Greg Banks
  ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Greg Banks @ 2007-10-30 1:04 UTC (permalink / raw)
To: Linux Filesystem Mailing List
Cc: David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann

G'day,

A number of people have already seen this; I'm posting for wider comment and to move some interesting discussion to a public list. I'll apologise in advance for the talk about SGI technologies (including proprietary ones), but all the problems mentioned apply to in-tree technologies too.

This proposal seeks to solve three problems in our NAS server product due to the interaction of the filesystem (XFS) and the block-based snapshot feature (XVM snapshot). It's based on discussions held with various people over the last few weeks, including Roger Strassburg, Christoph Hellwig, David Chinner, and Donald Douwsma.

a) The first problem is the server's behaviour when a filesystem which is subject to snapshot is written to, and the snapshot repository runs out of room. The failure mode can be quite severe. XFS issues a metadata write to the block device, triggering a Copy-On-Write operation in the XVM snapshot element, which fails with EIO because the repository is full. When XFS sees the failure it shuts down the filesystem. All subsequent attempts to perform IO to the filesystem block indefinitely. In particular any NFS server thread will block and never reply to the NFS client. The NFS client will retry, causing another NFS server thread to block, and repeat until every NFS server thread is blocked. At this point all NFS service for all filesystems ceases. See PV 958220 and PV 958140 for a description of this problem and some of the approaches which have been discussed for resolving it.

b) The second problem is that certain common combinations of filesystem operations can waste large amounts of space in the XVM snapshot repository. Examples include writing the same file twice with dd, or writing a new file and deleting it. The cause is the inability of the XVM snapshot code to free regions in the snapshot repository that are no longer in use by the filesystem; this information is simply not available within the block layer.

Note that problem b) also contributes to problem a) by increasing repository usage and thus making it easier to encounter an out-of-space condition on the repository.

c) The third problem is an unfortunate interaction between an XFS internal log and block snapshots. The log is a fixed region of the block device which is written as a side effect of a great many different filesystem operations. The information written there has no value and is not even read until and unless log recovery needs to be performed after the server has crashed. This means the log does not need to be preserved by the block snapshot feature (because at the point in time when the snapshot is taken, log recovery must have already happened). In fact the correct procedure when mounting a read-only snapshot is to use the "norecovery" option to prevent any attempt to read the log (although the NAS server software actually doesn't do this). However, because the block device layer doesn't have enough information to know any better, the first pass of writes to the log is subjected to Copy-On-Write. This has two undesirable effects.
Firstly, it increases the amount of snapshot repository space used by each snapshot, thus contributing to problem a). Secondly, it puts a significant performance penalty on filesystem metadata operations for some time after each snapshot is taken; given that the NAS server can be configured to take regular frequent snapshots, this may mean all of the time.

An obvious solution is to use an external XFS log, but this is quite inconvenient for the NAS server software to arrange. For one thing, we would need to construct a separate external log device for the main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be encountered by any Linux block-COWing snapshot implementation. For example, the DM snapshot implementation is documented to suffer from problem a). From linux/Documentation/device-mapper/snapshot.txt:

> <COW device> will often be smaller than the origin and if it
> fills up the snapshot will become useless and be disabled,
> returning errors. So it is important to monitor the amount of
> free space and expand the <COW device> before it fills up.

During discussions, it became clear that we could solve all three of these problems by improving the block device interface to allow a filesystem to provide the block device with dynamic block usage hints.

For example, when unlinking a file the filesystem could tell the block device a hint of the form "I'm about to stop using these blocks". Most block devices would silently ignore these hints, but a snapshot COW implementation (the "copy-on-write" XVM element or the "snapshot-origin" dm target) could use them to help avoid these problems. For example, the response to the "I'm about to stop using these blocks" hint could be to free the space used in the snapshot repository for unnecessary copies of those blocks.

Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest way to reuse them is to add new bio flags.

First we need a mechanism to indicate that a bio is a hint rather than a real IO. Perhaps the easiest way is to add a new flag to the bi_rw field:

#define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */

We'll also need a field to tell us which kind of hint the bio represents. Perhaps a new field could be added, or perhaps the top 16 bits of bi_rw (currently used to encode the bio's priority, which has no meaning for hint bios) could be reused. The latter approach may allow hints to be used without modifying the bio structure or any code that uses it, other than the filesystem and the snapshot implementation. Such a property would have obvious advantages for our NAS server software, where XFS and XVM modules are provided but the other users of struct bio are stock SLES code.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE
    The bio's block extent will soon be written by the filesystem
    and any COW that may be necessary to achieve that should begin
    now. If the COW is going to fail, the bio should fail. Note
    that this provides a way for the filesystem to manage when and
    how failures to COW are reported.
BIO_HINT_RELEASE
    The bio's block extent is no longer in use by the filesystem
    and will not be read in the future. Any storage used to back
    the extent may be released without any threat to filesystem
    or data integrity.

BIO_HINT_DONTCOW
    (the Bart Simpson BIO). The bio's block extent is not needed
    in mounted snapshots and does not need to be subjected to COW.

Here's how these proposed hints help solve the abovementioned problems.

Problem a): The filesystem gives the BIO_HINT_ALLOCATE hint to the block device when preparing to write to blocks and when allocating blocks. The snapshot implementation checks whether COW is necessary, and if so performs it immediately. If the COW fails due to a lack of space in the snapshot repository, the bio fails. This can be caught in the filesystem and reported to userspace (or the NFS server) as ENOSPC via the existing mechanisms. Filesystem shutdown is no longer necessary.

Problem b) is solved by the filesystem giving the BIO_HINT_RELEASE hint to the block device every time it unmaps blocks in xfs_bunmapi. The snapshot implementation can then free unnecessary copies of those blocks.

Problem c) is solved by the filesystem giving the block device a BIO_HINT_DONTCOW hint describing the block extent of the internal log, at filesystem mount time. The snapshot implementation marks that extent, and subsequent writes to those blocks do not cause COWs but proceed directly to the origin filesystem.

Comments?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere. Which MPHG character are you?
I don't speak for SGI.

^ permalink raw reply [flat|nested] 20+ messages in thread
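The bi_rw encoding proposed above (a new flag bit, with the hint type carried in the top 16 priority bits) can be sketched in a few lines. This is a hypothetical user-space model of the proposal, not existing kernel code; every helper name other than BIO_RW_HINT and the three hint types is invented for illustration.

/* Sketch only: models the proposed bi_rw packing, not real bio code. */
#include <stdio.h>

#define BIO_RW_HINT        5      /* bio is a hint, not a real IO; no pages */

/* Hypothetical hint types carried in the top 16 bits of bi_rw
 * (the bio priority field, which is meaningless for hint bios). */
#define BIO_HINT_SHIFT     16
#define BIO_HINT_ALLOCATE  1
#define BIO_HINT_RELEASE   2
#define BIO_HINT_DONTCOW   3

static unsigned long bio_make_hint_rw(unsigned int hint_type)
{
        return (1UL << BIO_RW_HINT) | ((unsigned long)hint_type << BIO_HINT_SHIFT);
}

static int bio_rw_is_hint(unsigned long rw)
{
        return (rw >> BIO_RW_HINT) & 1;
}

static unsigned int bio_rw_hint_type(unsigned long rw)
{
        return (unsigned int)(rw >> BIO_HINT_SHIFT) & 0xffff;
}

int main(void)
{
        unsigned long rw = bio_make_hint_rw(BIO_HINT_RELEASE);

        printf("bi_rw = %#lx, is_hint = %d, type = %u\n",
               rw, bio_rw_is_hint(rw), bio_rw_hint_type(rw));
        return 0;
}

A hint bio built this way would carry a sector and length like any other bio, but no pages, so the existing mapping and splitting code in the element tree could route it without change.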
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks @ 2007-10-30 1:11 ` Greg Banks 2007-10-30 4:16 ` Neil Brown 2007-10-30 9:35 ` Dongjun Shin 2 siblings, 0 replies; 20+ messages in thread From: Greg Banks @ 2007-10-30 1:11 UTC (permalink / raw) To: Linux Filesystem Mailing List Cc: David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote: > On Monday 29 October 2007, Christoph Hellwig wrote: > > ----- Forwarded message from Greg Banks <gnb@sgi.com> ----- > > > > Date: Thu, 27 Sep 2007 16:31:13 +1000 > > From: Greg Banks <gnb@sgi.com> > > Subject: Proposal to improve filesystem/block snapshot interaction > > To: David Chinner <dgc@melbourne.sgi.com>, Donald Douwsma <donaldd@sgi.com>, > > Christoph Hellwig <hch@infradead.org>, Roger Strassburg <rls@sgi.com> > > Cc: Mark Goodwin <markgw@sgi.com>, > > Brett Jon Grandbois <brettg@melbourne.sgi.com> > > > > > > > > This proposal seeks to solve three problems in our NAS server product > > due to the interaction of the filesystem (XFS) and the block-based > > snapshot feature (XVM snapshot). It's based on discussions held with > > various people over the last few weeks, including Roger Strassburg, > > Christoph Hellwig, David Chinner, and Donald Douwsma. > > Hi Greg, > > Christoph forwarded me your mail, because I mentioned to him that > I'm trying to come up with a similar change, and it might make sense > to combine our efforts. Excellent, thanks Christoph ;-) > > > For example, when unlinking a file the filesystem could tell the > > block device a hint of the form "I'm about to stop using these > > blocks". Most block devices would silently ignore these hints, but > > a snapshot COW implementation (the "copy-on-write" XVM element or > > the "snapshot-origin" dm target) could use them to help avoid these > > problems. For example, the response to the "I'm about to stop using > > these blocks" hint could be to free the space used in the snapshot > > repository for unnecessary copies of those blocks. > > The case I'm interested in is the more specific case of 'erase', > which is more of a performance optimization than a space optimization. > When you have a flash medium, it's useful to erase a block as soon > as it's becoming unused, so that a subsequent write will be faster. > Moreover, on an MTD medium, you may not even be able to write to > a block unless it has been erased before. Spending the device's time to erase early, when the CPU isn't waiting for it, instead of later, when it adds to effective write latency. Makes sense. > > Of course snapshot cow elements may be part of more generic element > > trees. In general there may be more than one consumer of block usage > > hints in a given filesystem's element tree, and their locations in that > > tree are not predictable. This means the block extents mentioned in > > the usage hints need to be subject to the block mapping algorithms > > provided by the element tree. As those algorithms are currently > > implemented using bio mapping and splitting, the easiest and simplest > > way to reuse those algorithms is to add new bio flags. > > > > First we need a mechanism to indicate that a bio is a hint rather > > than a real IO. 
Perhaps the easiest way is to add a new flag to > > the bi_rw field: > > > > #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ > > My first thought was to do this on the request layer, not already > on bio, but they can easily be combined, I guess. My first thoughts were along similar lines, but I wasn't expecting these hint bios to survive deep enough in the stack to need queuing and thus visibility in struct request; I was expecting their lifetime to be some passage and splitting through a volume manager and then conversion to synchronous metadata operations. Plus, hijacking bios means not having to modify every single DM target to duplicate it's block mapping algorithm. Basically, I was thinking of loopback-like block mapping and not considering flash. I suppose for flash where there's a real erase operation, you'd want to be queuing and that means a new request type. > > > We'll also need a field to tell us which kind of hint the bio > > represents. Perhaps a new field could be added, or perhaps the top > > 16 bits of bi_rw (currently used to encode the bio's priority, which > > has no meaning for hint bios) could be reused. The latter approach > > may allow hints to be used without modifying the bio structure or > > any code that uses it other than the filesystem and the snapshot > > implementation. Such a property would have obvious advantages for > > our NAS server software, where XFS and XVM modules are provided but > > the other users of struct bio are stock SLES code. > > > > > > Next we'll need three bio hints types with the following semantics. > > > > BIO_HINT_ALLOCATE > > The bio's block extent will soon be written by the filesystem > > and any COW that may be necessary to achieve that should begin > > now. If the COW is going to fail, the bio should fail. Note > > that this provides a way for the filesystem to manage when and > > how failures to COW are reported. > > > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > > > BIO_HINT_DONTCOW > > (the Bart Simpson BIO). The bio's block extent is not needed > > in mounted snapshots and does not need to be subjected to COW. > > > > My code currently needs four flags, which don't match yours too much: > > /* > * A number of different actions could be triggered by an erase request, > * depending on the underlying device. Each device specifies its > * capabilities with these flags, while a request specifies the options > * that are acceptable. If the logical AND from these two does not > * have any bits set, the request will result in > * an error. > */ > enum { > /* > * Device may choose to ignore the request, subsequent writes > * may return the original data. This is meant to work on Is this supposed to be "reads" ? > * any block device. When combined with other flags, the driver > * should only perform an actual erase if it makes sense > * from a performance perspective, e.g. speeding up subsequent > * writes. > */ > LB_ERASE_IGNORE = 0x01, > /* > * A subsequent read may return zero data for the erase, > * like on some high-level abstractions for flash memory, > * or a virtual device. > */ > LB_ERASE_ALL_ZERO = 0x02, > /* > * A subsequent read may return a block filled with 0xff, > * which is the typical behavior on raw NAND flash. 
> */ > LB_ERASE_ALL_ONE = 0x04, > /* > * The device may reject a read request for an erased block > * until the block has been written again. This is typical > * for NAND flash with builtin ECC checks, or for optical > * drives. > */ > LB_ERASE_NUKE = 0x08, > /* > * Used by file systems that know that data is no longer > * in use and want to optimize the next write operations. > */ > LB_ERASE_DISCARD = LB_ERASE_IGNORE | LB_ERASE_ALL_ZERO | > LB_ERASE_ALL_ONE | LB_ERASE_NUKE, > /* > * Used when we want the data to be invalidated and make sure > * it is no longer accessible. > */ > LB_ERASE_DESTROY = LB_ERASE_ALL_ZERO | LB_ERASE_ALL_ONE | > LB_ERASE_NUKE, > }; > > I guess BIO_HINT_RELEASE would match LB_ERASE_DISCARD best, Yep. Actually, I'm curious why you'd want to expose, outside the block driver, the semantics of reading a block which has been earlier explicitly discarded. Surely it's an error for a filesystem to do that? How does it help a filesystem to know in advance which error case that will trigger. > and perhaps > there should be some bio flag with LB_ERASE_DESTROY semantics, although > that doesn't really qualify as a hint any more. Yes, that's more of a command ;-) > My release command would be REQ_TYPE_LINUX_BLOCK/REQ_LB_OP_ERASE. Were > you thinking of adding REQ_LB_* operations as well, or just encoding > the hint in a REQ_TYPE_FS request? I wasn't expecting a request to be created for the hint bio at all. > Shall we move the discussion to a public mailing list? Feel free to > forward my mail anywhere you like. Done! Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. ^ permalink raw reply [flat|nested] 20+ messages in thread
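The "logical AND" rule in the LB_ERASE_* flags quoted above amounts to a small capability negotiation. Here is a sketch under that assumption; only the LB_ERASE_* values come from the mail, while the function and its caller are hypothetical.

/* Sketch of the capability check implied by the LB_ERASE_* flags above. */
#include <errno.h>
#include <stdio.h>

enum {
        LB_ERASE_IGNORE   = 0x01,   /* device may ignore the request */
        LB_ERASE_ALL_ZERO = 0x02,   /* subsequent reads may return zeros */
        LB_ERASE_ALL_ONE  = 0x04,   /* subsequent reads may return 0xff */
        LB_ERASE_NUKE     = 0x08,   /* reads may fail until rewritten */

        LB_ERASE_DISCARD  = LB_ERASE_IGNORE | LB_ERASE_ALL_ZERO |
                            LB_ERASE_ALL_ONE | LB_ERASE_NUKE,
        LB_ERASE_DESTROY  = LB_ERASE_ALL_ZERO | LB_ERASE_ALL_ONE |
                            LB_ERASE_NUKE,
};

/* Return 0 and the behaviour actually used, or -EOPNOTSUPP if the
 * device offers no behaviour the caller will accept. */
static int lb_erase_negotiate(unsigned device_caps, unsigned acceptable,
                              unsigned *chosen)
{
        unsigned common = device_caps & acceptable;

        if (!common)
                return -EOPNOTSUPP;
        *chosen = common & ~(common - 1);       /* pick the lowest set bit */
        return 0;
}

int main(void)
{
        unsigned chosen;
        /* e.g. a raw NAND device that can only nuke or return 0xff */
        unsigned caps = LB_ERASE_ALL_ONE | LB_ERASE_NUKE;

        if (lb_erase_negotiate(caps, LB_ERASE_DISCARD, &chosen) == 0)
                printf("discard request satisfied with behaviour %#x\n", chosen);
        if (lb_erase_negotiate(caps, LB_ERASE_ALL_ZERO, &chosen) != 0)
                printf("zero-fill erase not supported by this device\n");
        return 0;
}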
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks 2007-10-30 1:11 ` Greg Banks @ 2007-10-30 4:16 ` Neil Brown 2007-10-30 5:12 ` Greg Banks 2007-10-30 23:56 ` David Chinner 2007-10-30 9:35 ` Dongjun Shin 2 siblings, 2 replies; 20+ messages in thread From: Neil Brown @ 2007-10-30 4:16 UTC (permalink / raw) To: Greg Banks Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Tuesday October 30, gnb@sgi.com wrote: > > Of course snapshot cow elements may be part of more generic element > trees. In general there may be more than one consumer of block usage > hints in a given filesystem's element tree, and their locations in that > tree are not predictable. This means the block extents mentioned in > the usage hints need to be subject to the block mapping algorithms > provided by the element tree. As those algorithms are currently > implemented using bio mapping and splitting, the easiest and simplest > way to reuse those algorithms is to add new bio flags. So are you imagining that you might have a distinct snapshotable elements, and that some of these might be combined by e.g. RAID0 into a larger device, then a filesystem is created on that? I ask because my first thought was that the sort of communication you want seems like it would be just between a filesystem and the block device that it talks directly to, and as you are particularly interested in XFS and XVM, should could come up with whatever protocol you want for those two to talk to either other, prototype it, iron out all the issues, then say "We've got this really cool thing to make snapshots much faster - wanna share?" and thus be presenting from a position of more strength (the old 'code talks' mantra). > > First we need a mechanism to indicate that a bio is a hint rather > than a real IO. Perhaps the easiest way is to add a new flag to > the bi_rw field: > > #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ Reminds me of the new approach to issue_flush_fn which is just to have a zero-length barrier bio (is that implemented yet? I lost track). But different as a zero length barrier has zero length, and your hints have a very meaningful length. > > Next we'll need three bio hints types with the following semantics. > > BIO_HINT_ALLOCATE > The bio's block extent will soon be written by the filesystem > and any COW that may be necessary to achieve that should begin > now. If the COW is going to fail, the bio should fail. Note > that this provides a way for the filesystem to manage when and > how failures to COW are reported. Would it make sense to allow the bi_sector to be changed by the device and to have that change honoured. i.e. "Please allocate 128 blocks, maybe 'here'" "OK, 128 blocks allocated, but they are actually over 'there'". If the device is tracking what space is and isn't used, it might make life easier for it to do the allocation. Maybe even have a variant "Allocate 128 blocks, I don't care where". Is this bio supposed to block until the copy has happened? Or only until the space of the copy has been allocated and possibly committed? Or must it return without doing any IO at all? > > BIO_HINT_RELEASE > The bio's block extent is no longer in use by the filesystem > and will not be read in the future. Any storage used to back > the extent may be released without any threat to filesystem > or data integrity. 
If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB) then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow. i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that. Would this be a burden on the filesystems? Is my imagined disparity between block sizes valid? Would it be just as easy for the storage device to track small allocation/deallocations? > > BIO_HINT_DONTCOW > (the Bart Simpson BIO). The bio's block extent is not needed > in mounted snapshots and does not need to be subjected to COW. This seems like a much more domain-specific function that the other two which themselves could be more generally useful (I'm imagining using hints from them to e.g. accelerate RAID reconstruction). Surely the "correct" thing to do with the log is to put it on a separate device which itself isn't snapshotted. If you have a storage manager that is smart enough to handle these sorts of things, maybe the functionality you want is "Give me a subordinate device which is not snapshotted, size X", then journal to that virtual device. I guess that is equally domain specific, but the difference is that if you try to read from the DONTCOW part of the snapshot, you get bad old data, where as if you try to access the subordinate device of a snapshot, you get an IO error - which is probably safer. > > Comments? On the whole it seems reasonably sane .... providing you are from the school which believes that volume managers and filesystems should be kept separate :-) NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
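Neil's aside about accelerating RAID reconstruction could work roughly as below: if allocate/release hints maintain a coarse in-use map, resync only needs to copy chunks whose contents the filesystem still cares about. This is an invented illustration, not md code; the bitmap and chunk layout are made up.

/* Hypothetical sketch: skip released chunks during RAID resync. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNKS 16

struct array_state {
        bool in_use[CHUNKS];    /* set by ALLOCATE hints, cleared by RELEASE hints */
};

static unsigned resync(const struct array_state *a)
{
        unsigned chunk, synced = 0;

        for (chunk = 0; chunk < CHUNKS; chunk++) {
                if (!a->in_use[chunk])
                        continue;       /* contents are meaningless, skip the copy */
                /* ...read the chunk from the good mirrors and rewrite the target... */
                synced++;
        }
        return synced;
}

int main(void)
{
        struct array_state a;

        memset(&a, 0, sizeof(a));
        a.in_use[0] = a.in_use[3] = a.in_use[4] = true;  /* only 3 chunks ever hinted as allocated */
        printf("resynced %u of %u chunks\n", resync(&a), CHUNKS);
        return 0;
}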
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 4:16 ` Neil Brown @ 2007-10-30 5:12 ` Greg Banks 2007-10-30 7:43 ` Arnd Bergmann 2007-11-20 23:43 ` Roger Strassburg 2007-10-30 23:56 ` David Chinner 1 sibling, 2 replies; 20+ messages in thread From: Greg Banks @ 2007-10-30 5:12 UTC (permalink / raw) To: Neil Brown Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > On Tuesday October 30, gnb@sgi.com wrote: > > > > Of course snapshot cow elements may be part of more generic element > > trees. In general there may be more than one consumer of block usage > > hints in a given filesystem's element tree, and their locations in that > > tree are not predictable. This means the block extents mentioned in > > the usage hints need to be subject to the block mapping algorithms > > provided by the element tree. As those algorithms are currently > > implemented using bio mapping and splitting, the easiest and simplest > > way to reuse those algorithms is to add new bio flags. > > So are you imagining that you might have a distinct snapshotable > elements, and that some of these might be combined by e.g. RAID0 into > a larger device, then a filesystem is created on that? I was thinking more a concatenation than a stripe, but yes you could do such a thing, e.g. to parallelise the COW procedure. We don't do any such thing in our product; the COW element is always inserted at the top of the logical element tree. > I ask because my first thought was that the sort of communication you > want seems like it would be just between a filesystem and the block > device that it talks directly to, and as you are particularly > interested in XFS and XVM, should could come up with whatever protocol > you want for those two to talk to either other, prototype it, iron out > all the issues, then say "We've got this really cool thing to make > snapshots much faster - wanna share?" and thus be presenting from a > position of more strength (the old 'code talks' mantra). Indeed, code talks ;-) I was hoping someone else would do that talking for me, though. > > First we need a mechanism to indicate that a bio is a hint rather > > than a real IO. Perhaps the easiest way is to add a new flag to > > the bi_rw field: > > > > #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ > > Reminds me of the new approach to issue_flush_fn which is just to have > a zero-length barrier bio (is that implemented yet? I lost track). > But different as a zero length barrier has zero length, and your hints > have a very meaningful length. Yes. > > > > Next we'll need three bio hints types with the following semantics. > > > > BIO_HINT_ALLOCATE > > The bio's block extent will soon be written by the filesystem > > and any COW that may be necessary to achieve that should begin > > now. If the COW is going to fail, the bio should fail. Note > > that this provides a way for the filesystem to manage when and > > how failures to COW are reported. > > Would it make sense to allow the bi_sector to be changed by the device > and to have that change honoured. > i.e. "Please allocate 128 blocks, maybe 'here'" > "OK, 128 blocks allocated, but they are actually over 'there'". That wasn't the expectation at all. Perhaps "allocate" is a poor name. "I have just allocated, deal with it" might be more appropriate. Perhaps BIO_HINT_WILLUSE or something. 
> If the device is tracking what space is and isn't used, it might make > life easier for it to do the allocation. Maybe even have a variant > "Allocate 128 blocks, I don't care where". That kind of thing might perhaps be useful for flash, but I think current filesystems would have conniptions. > Is this bio supposed to block until the copy has happened? Or only > until the space of the copy has been allocated and possibly committed? The latter. The writes following will block until the COW has completed, or might be performed sufficiently later that the COW has meanwhile completed (I think this implies an extra state in the snapshot metadata to avoid double-COWing). The point of the hint is to allow the snapshot code to test for running out of repo space and report that failure at a time when the filesystem is able to handle it gracefully. > Or must it return without doing any IO at all? I would expect it would be a useful optimisation to start the IO but not wait for it's completion, but that the first implementation would just do a space check. > > > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > If the allocation unit of the storage device (e.g. a few MB) does not > match the allocation unit of the filesystem (e.g. a few KB) then for > this to be useful either the storage device must start recording tiny > allocations, or the filesystem should re-release areas as they grow. > i.e. when releasing a range of a device, look in the filesystem's usage > records for the largest surrounding free space, and release all of that. Good point. I was planning on ignoring this problem :-/ Given that current snapshot implementations waste *all* the blocks in deleted files, it would be an improvement to scavenge the blocks in large extents. This is especially true for XFS which goes to some effort to achieve large linear extents. > Would this be a burden on the filesystems? I think so. I would hope the hints could be done in a way which minimises the impact on filesystems, so that it would be easier to roll out. That implies pushing the responsibility for being smart about combining partial deallocations down to the block device/snapshot code. Any comments, Roger? > Is my imagined disparity between block sizes valid? Yep, at least for XFS and XVM. If the space was used in lots of little files, this rounding would probably eat a lot of the savings. > Would it be just as easy for the storage device to track small > allocation/deallocations? > > > > > BIO_HINT_DONTCOW > > (the Bart Simpson BIO). The bio's block extent is not needed > > in mounted snapshots and does not need to be subjected to COW. > > This seems like a much more domain-specific function that the other > two which themselves could be more generally useful Agreed, I can't offhand think of a use other than internal logs. > (I'm imagining > using hints from them to e.g. accelerate RAID reconstruction). Ah, interesting idea: delete a file to speed up RAID recovery ;-) > Surely the "correct" thing to do with the log is to put it on a separate > device which itself isn't snapshotted. Indeed. > If you have a storage manager that is smart enough to handle these > sorts of things, maybe the functionality you want is "Give me a > subordinate device which is not snapshotted, size X", then journal to > that virtual device. 
This is usually better, but is not always convenient for a number of reasons. For example, you might not have enough disks to build all of a base, a snapshot repo, and a log device. Also, the log really needs to be safe, so you want it mirrored or RAID5, and you want it fast, and you want it on separate spindles, so it needs several disks; but now you're using terabytes of disk space for 128 MiB of log. > I guess that is equally domain specific, but the difference is that if > you try to read from the DONTCOW part of the snapshot, you get bad > old data, where as if you try to access the subordinate device of a > snapshot, you get an IO error - which is probably safer. I believe (Dave or Roger will correct me here) that XFS needs a log when you mount, and you get to either provide an external one or use the internal one. So when you mount a snapshot of an XFS filesystem which was built with an external log, you need to provide a new external log device. So the storage manager needs to allocate an external log device for each snapshot it allows. > > > > > Comments? > > On the whole it seems reasonably sane .... providing you are from the > school which believes that volume managers and filesystems should be > kept separate :-) Yeah, I'm so old-school :-) Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. ^ permalink raw reply [flat|nested] 20+ messages in thread
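The "just do a space check" behaviour described in the exchange above might look like the following on the snapshot side: the ALLOCATE hint reserves repository space up front and returns ENOSPC while the filesystem can still fail the operation cleanly. All structures and names here are hypothetical, not XVM or dm code.

/* Hypothetical COW-element handling of BIO_HINT_ALLOCATE: check (and
 * reserve) repository space before the real write arrives, so a full
 * repository surfaces as ENOSPC instead of a mid-write EIO. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct cow_repo {
        unsigned long free_chunks;      /* unused space left in the repository */
        unsigned long reserved_chunks;  /* reserved by earlier ALLOCATE hints */
};

/* Returns 0 if the extent can be COWed when the real write arrives,
 * -ENOSPC otherwise so the filesystem can report the failure gracefully. */
static int cow_hint_allocate(struct cow_repo *repo, unsigned long chunks_needed,
                             bool already_copied)
{
        if (already_copied)
                return 0;                       /* no COW needed, nothing to reserve */
        if (repo->free_chunks < chunks_needed)
                return -ENOSPC;
        repo->free_chunks -= chunks_needed;
        repo->reserved_chunks += chunks_needed; /* consumed when the COW actually runs */
        return 0;
}

int main(void)
{
        struct cow_repo repo = { .free_chunks = 2 };

        printf("hint for 1 chunk: %d\n", cow_hint_allocate(&repo, 1, false));
        printf("hint for 4 chunks: %d (expect -ENOSPC=%d)\n",
               cow_hint_allocate(&repo, 4, false), -ENOSPC);
        return 0;
}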
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 5:12 ` Greg Banks
@ 2007-10-30 7:43 ` Arnd Bergmann
  2007-11-20 23:43 ` Roger Strassburg
  1 sibling, 0 replies; 20+ messages in thread
From: Arnd Bergmann @ 2007-10-30 7:43 UTC (permalink / raw)
To: Greg Banks
Cc: Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois

On Tuesday 30 October 2007, Greg Banks wrote:
>
> > If the allocation unit of the storage device (e.g. a few MB) does not
> > match the allocation unit of the filesystem (e.g. a few KB) then for
> > this to be useful either the storage device must start recording tiny
> > allocations, or the filesystem should re-release areas as they grow.
> > i.e. when releasing a range of a device, look in the filesystem's usage
> > records for the largest surrounding free space, and release all of that.
>
> Good point. I was planning on ignoring this problem :-/ Given that
> current snapshot implementations waste *all* the blocks in deleted
> files, it would be an improvement to scavenge the blocks in large
> extents. This is especially true for XFS which goes to some effort
> to achieve large linear extents.
>
Ah, this is an important difference to my idea about an erase operation on the block device. For erase to be meaningful, you need to know the erase block size at the file system or user space, so it would be encoded in the struct block_device, and the user has to issue erase requests at erase block granularity.

	Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 20+ messages in thread
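The erase-block granularity Arnd mentions reduces to alignment arithmetic: only the part of a released extent that covers whole erase blocks can be erased immediately, and the ragged edges must be ignored (or remembered until their neighbours are released too). A minimal sketch with invented types, not an existing API:

/* Trim a released extent to whole erase blocks. */
#include <stdio.h>

typedef unsigned long long sector_t;

struct extent {
        sector_t start;
        sector_t len;
};

/* Returns the erasable sub-extent, which may have zero length. */
static struct extent align_to_erase_blocks(struct extent e, sector_t erase_block)
{
        sector_t first = (e.start + erase_block - 1) / erase_block * erase_block;
        sector_t end   = (e.start + e.len) / erase_block * erase_block;
        struct extent out = { first, end > first ? end - first : 0 };

        return out;
}

int main(void)
{
        struct extent small = { 700, 300 };   /* too small to cover a whole 512-sector erase block */
        struct extent big   = { 700, 2000 };
        struct extent a;

        a = align_to_erase_blocks(small, 512);
        printf("small release: erasable len=%llu\n", a.len);                      /* 0 */
        a = align_to_erase_blocks(big, 512);
        printf("big release:   erasable start=%llu len=%llu\n", a.start, a.len);  /* 1024, 1536 */
        return 0;
}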
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 5:12 ` Greg Banks 2007-10-30 7:43 ` Arnd Bergmann @ 2007-11-20 23:43 ` Roger Strassburg 1 sibling, 0 replies; 20+ messages in thread From: Roger Strassburg @ 2007-11-20 23:43 UTC (permalink / raw) To: Greg Banks Cc: Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann Greg, Sorry I didn't respond sooner - other things have gotten in the way of reading this thread. See comments below. Roger Greg Banks wrote: > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: >> On Tuesday October 30, gnb@sgi.com wrote: >>> Of course snapshot cow elements may be part of more generic element >>> trees. In general there may be more than one consumer of block usage >>> hints in a given filesystem's element tree, and their locations in that >>> tree are not predictable. This means the block extents mentioned in >>> the usage hints need to be subject to the block mapping algorithms >>> provided by the element tree. As those algorithms are currently >>> implemented using bio mapping and splitting, the easiest and simplest >>> way to reuse those algorithms is to add new bio flags. >> So are you imagining that you might have a distinct snapshotable >> elements, and that some of these might be combined by e.g. RAID0 into >> a larger device, then a filesystem is created on that? > > I was thinking more a concatenation than a stripe, but yes you could > do such a thing, e.g. to parallelise the COW procedure. We don't do > any such thing in our product; the COW element is always inserted at > the top of the logical element tree. > >> I ask because my first thought was that the sort of communication you >> want seems like it would be just between a filesystem and the block >> device that it talks directly to, and as you are particularly >> interested in XFS and XVM, should could come up with whatever protocol >> you want for those two to talk to either other, prototype it, iron out >> all the issues, then say "We've got this really cool thing to make >> snapshots much faster - wanna share?" and thus be presenting from a >> position of more strength (the old 'code talks' mantra). > > Indeed, code talks ;-) I was hoping someone else would do that > talking for me, though. > >>> First we need a mechanism to indicate that a bio is a hint rather >>> than a real IO. Perhaps the easiest way is to add a new flag to >>> the bi_rw field: >>> >>> #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ >> Reminds me of the new approach to issue_flush_fn which is just to have >> a zero-length barrier bio (is that implemented yet? I lost track). >> But different as a zero length barrier has zero length, and your hints >> have a very meaningful length. > > Yes. > >>> Next we'll need three bio hints types with the following semantics. >>> >>> BIO_HINT_ALLOCATE >>> The bio's block extent will soon be written by the filesystem >>> and any COW that may be necessary to achieve that should begin >>> now. If the COW is going to fail, the bio should fail. Note >>> that this provides a way for the filesystem to manage when and >>> how failures to COW are reported. >> Would it make sense to allow the bi_sector to be changed by the device >> and to have that change honoured. >> i.e. "Please allocate 128 blocks, maybe 'here'" >> "OK, 128 blocks allocated, but they are actually over 'there'". > > That wasn't the expectation at all. 
Perhaps "allocate" is a poor > name. "I have just allocated, deal with it" might be more appropriate. > Perhaps BIO_HINT_WILLUSE or something. > >> If the device is tracking what space is and isn't used, it might make >> life easier for it to do the allocation. Maybe even have a variant >> "Allocate 128 blocks, I don't care where". > > That kind of thing might perhaps be useful for flash, but I think > current filesystems would have conniptions. > >> Is this bio supposed to block until the copy has happened? Or only >> until the space of the copy has been allocated and possibly committed? > > The latter. The writes following will block until the COW has > completed, or might be performed sufficiently later that the COW > has meanwhile completed (I think this implies an extra state in the > snapshot metadata to avoid double-COWing). The point of the hint is > to allow the snapshot code to test for running out of repo space and > report that failure at a time when the filesystem is able to handle > it gracefully. > >> Or must it return without doing any IO at all? > > I would expect it would be a useful optimisation to start the IO but > not wait for it's completion, but that the first implementation would > just do a space check. > >>> BIO_HINT_RELEASE >>> The bio's block extent is no longer in use by the filesystem >>> and will not be read in the future. Any storage used to back >>> the extent may be released without any threat to filesystem >>> or data integrity. >> If the allocation unit of the storage device (e.g. a few MB) does not >> match the allocation unit of the filesystem (e.g. a few KB) then for >> this to be useful either the storage device must start recording tiny >> allocations, or the filesystem should re-release areas as they grow. >> i.e. when releasing a range of a device, look in the filesystem's usage >> records for the largest surrounding free space, and release all of that. > > Good point. I was planning on ignoring this problem :-/ Given that > current snapshot implementations waste *all* the blocks in deleted > files, it would be an improvement to scavenge the blocks in large > extents. This is especially true for XFS which goes to some effort > to achieve large linear extents. > >> Would this be a burden on the filesystems? > > I think so. I would hope the hints could be done in a way which > minimises the impact on filesystems, so that it would be easier to roll > out. That implies pushing the responsibility for being smart about > combining partial deallocations down to the block device/snapshot code. > Any comments, Roger? I'm not sure how snapshot can really use a dealloc hint. Whatever you're deallocating is in the base, but you want it to stay in the snapshot, since the purpose of a snapshot is to keep track of what was there before. What makes more sense is to somehow pass a hint saying that the data being written is to space that wasn't allocated at the time the snapshot was created, but that would require the filesystem to have knowledge of the snapshot. This would prevent copying data that doesn't contain meaningful data in the first place. >> Is my imagined disparity between block sizes valid? > > Yep, at least for XFS and XVM. If the space was used in lots of > little files, this rounding would probably eat a lot of the savings. > >> Would it be just as easy for the storage device to track small >> allocation/deallocations? >> >>> BIO_HINT_DONTCOW >>> (the Bart Simpson BIO). 
The bio's block extent is not needed >>> in mounted snapshots and does not need to be subjected to COW. >> This seems like a much more domain-specific function that the other >> two which themselves could be more generally useful > > Agreed, I can't offhand think of a use other than internal logs. > >> (I'm imagining >> using hints from them to e.g. accelerate RAID reconstruction). > > Ah, interesting idea: delete a file to speed up RAID recovery ;-) > >> Surely the "correct" thing to do with the log is to put it on a separate >> device which itself isn't snapshotted. > > Indeed. > >> If you have a storage manager that is smart enough to handle these >> sorts of things, maybe the functionality you want is "Give me a >> subordinate device which is not snapshotted, size X", then journal to >> that virtual device. > > This is usually better, but is not always convenient for a number of > reasons. For example, you might not have enough disks to build all > of a base, a snapshot repo, and a log device. Also, the log really > needs to be safe, so you want it mirrored or RAID5, and you want it > fast, and you want it on separate spindles, so it needs several disks; > but now you're using terabytes of disk space for 128 MiB of log. The log doesn't need to be on a separate disk, just a separate logical volume. Also, you don't have to mirror the whole disk in order to mirror the log volume. Snapshots are done per logical volume, not per physical disk. >> I guess that is equally domain specific, but the difference is that if >> you try to read from the DONTCOW part of the snapshot, you get bad >> old data, where as if you try to access the subordinate device of a >> snapshot, you get an IO error - which is probably safer. > > I believe (Dave or Roger will correct me here) that XFS needs a log > when you mount, and you get to either provide an external one or use > the internal one. So when you mount a snapshot of an XFS filesystem > which was built with an external log, you need to provide a new > external log device. So the storage manager needs to allocate an > external log device for each snapshot it allows. That's correct. >>> Comments? >> On the whole it seems reasonably sane .... providing you are from the >> school which believes that volume managers and filesystems should be >> kept separate :-) > > Yeah, I'm so old-school :-) > > Greg. -- Roger Strassburg SGI Storage Systems Software +49-89-46108-142 ^ permalink raw reply [flat|nested] 20+ messages in thread
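Roger's alternative above, telling the snapshot that a write targets space that was free when the snapshot was created so there is nothing worth copying, could be modelled as below. Everything here is hypothetical, and as he notes it presumes the filesystem (or something else) can supply the allocation state as of snapshot time.

/* Hypothetical sketch: decide whether a write needs COW by consulting a
 * bitmap of blocks that were allocated when the snapshot was taken.
 * Blocks that were free at snapshot time hold no data worth preserving. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define ORIGIN_BLOCKS 64

struct snapshot {
        bool allocated_at_snap[ORIGIN_BLOCKS];  /* captured when the snapshot is taken */
        bool copied[ORIGIN_BLOCKS];             /* chunks already COWed */
};

static bool write_needs_cow(struct snapshot *s, unsigned block)
{
        if (!s->allocated_at_snap[block])
                return false;    /* was free at snapshot time: old contents are garbage */
        if (s->copied[block])
                return false;    /* already preserved */
        s->copied[block] = true; /* caller performs the copy before overwriting */
        return true;
}

int main(void)
{
        struct snapshot s;

        memset(&s, 0, sizeof(s));
        s.allocated_at_snap[10] = true;

        printf("write to block 10: COW=%d\n", write_needs_cow(&s, 10)); /* 1: must copy */
        printf("write to block 20: COW=%d\n", write_needs_cow(&s, 20)); /* 0: was free */
        return 0;
}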
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 4:16 ` Neil Brown
  2007-10-30 5:12 ` Greg Banks
@ 2007-10-30 23:56 ` David Chinner
  2007-10-31 4:01 ` Greg Banks
  1 sibling, 1 reply; 20+ messages in thread
From: David Chinner @ 2007-10-30 23:56 UTC (permalink / raw)
To: Neil Brown
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann

On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
> On Tuesday October 30, gnb@sgi.com wrote:
> > BIO_HINT_RELEASE
> > The bio's block extent is no longer in use by the filesystem
> > and will not be read in the future. Any storage used to back
> > the extent may be released without any threat to filesystem
> > or data integrity.
>
> If the allocation unit of the storage device (e.g. a few MB) does not
> match the allocation unit of the filesystem (e.g. a few KB) then for
> this to be useful either the storage device must start recording tiny
> allocations, or the filesystem should re-release areas as they grow.
> i.e. when releasing a range of a device, look in the filesystem's usage
> records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space extents, not the amount actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k in free space at block 14.

If we free file A, we report that we've released an extent of 4k @ block 10.
If we then free file B, we report we've released an extent of 12k @ block 10.
If we then free file C, we report a release of 1024k @ block 10.

Then the underlying device knows what the aggregated free space regions are and can easily release large regions without needing to track tiny allocations and frees done by the filesystem.

> I guess that is equally domain specific, but the difference is that if
> you try to read from the DONTCOW part of the snapshot, you get bad
> old data, where as if you try to access the subordinate device of a
> snapshot, you get an IO error - which is probably safer.

If you read from a DONTCOW region you should get zeros back - it's a hole in the snapshot.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 20+ messages in thread
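Dave's worked example corresponds to a small free-space lookup: when an extent is freed, report the whole surrounding free region rather than just the blocks freed. The toy sketch below reproduces the 4k/12k/1024k progression from his mail; it is plain C for illustration, not the XFS free-space btrees.

/* Sketch of reporting aggregated free-space extents, following the
 * example above (4k blocks; files A, B and C at blocks 10, 11 and 13). */
#include <stdio.h>

#define MAX_BLOCK 1024

static int block_free[MAX_BLOCK];       /* 1 = free, 0 = allocated */

/* Free [start, start+len) and return the surrounding free extent, which
 * is what would be reported downstream in a release hint. */
static void free_and_report(unsigned start, unsigned len,
                            unsigned *rstart, unsigned *rlen)
{
        unsigned lo = start, hi = start + len, i;

        for (i = start; i < start + len; i++)
                block_free[i] = 1;
        while (lo > 0 && block_free[lo - 1])
                lo--;
        while (hi < MAX_BLOCK && block_free[hi])
                hi++;
        *rstart = lo;
        *rlen = hi - lo;
}

int main(void)
{
        unsigned i, s, l;

        /* Everything allocated except the free space in the example:
         * 4k free at block 12 and 1008k (252 blocks) free from block 14. */
        for (i = 0; i < MAX_BLOCK; i++)
                block_free[i] = 0;
        block_free[12] = 1;
        for (i = 14; i < 14 + 252; i++)
                block_free[i] = 1;

        free_and_report(10, 1, &s, &l);               /* free file A */
        printf("report %uk @ block %u\n", l * 4, s);  /* 4k @ 10 */
        free_and_report(11, 1, &s, &l);               /* free file B */
        printf("report %uk @ block %u\n", l * 4, s);  /* 12k @ 10 */
        free_and_report(13, 1, &s, &l);               /* free file C */
        printf("report %uk @ block %u\n", l * 4, s);  /* 1024k @ 10 */
        return 0;
}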
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 23:56 ` David Chinner @ 2007-10-31 4:01 ` Greg Banks 2007-10-31 7:04 ` David Chinner 0 siblings, 1 reply; 20+ messages in thread From: Greg Banks @ 2007-10-31 4:01 UTC (permalink / raw) To: David Chinner Cc: Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > > On Tuesday October 30, gnb@sgi.com wrote: > > > BIO_HINT_RELEASE > > > The bio's block extent is no longer in use by the filesystem > > > and will not be read in the future. Any storage used to back > > > the extent may be released without any threat to filesystem > > > or data integrity. > > > > If the allocation unit of the storage device (e.g. a few MB) does not > > match the allocation unit of the filesystem (e.g. a few KB) then for > > this to be useful either the storage device must start recording tiny > > allocations, or the filesystem should re-release areas as they grow. > > i.e. when releasing a range of a device, look in the filesystem's usage > > records for the largest surrounding free space, and release all of that. > > I figured that the easiest way around this is reporting free space > extents, not the amoutn actually freed. e.g. > > 4k in file A @ block 10 > 4k in file B @ block 11 > 4k free space @ block 12 > 4k in file C @ block 13 > 1008k in free space at block 14. > > If we free file A, we report that we've released an extent of 4k @ block 10. > if we then free file B, we report we've released an extent of 12k @ block 10. > If we then free file C, we report a release of 1024k @ block 10. > > Then the underlying device knows what the aggregated free space regions > are and can easily release large regions without needing to track tiny > allocations and frees done by the filesystem. If you could do that in the filesystem, it certainly solve the problem. In which case I'll explicitly allow for the hint's extent to overlap extents previous extents thus hinted, and define the semantics for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED, I think that will make the semantics a little clearer. Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-31 4:01 ` Greg Banks @ 2007-10-31 7:04 ` David Chinner 0 siblings, 0 replies; 20+ messages in thread From: David Chinner @ 2007-10-31 7:04 UTC (permalink / raw) To: Greg Banks Cc: David Chinner, Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Wed, Oct 31, 2007 at 03:01:58PM +1100, Greg Banks wrote: > On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: > > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > > > On Tuesday October 30, gnb@sgi.com wrote: > > > > BIO_HINT_RELEASE > > > > The bio's block extent is no longer in use by the filesystem > > > > and will not be read in the future. Any storage used to back > > > > the extent may be released without any threat to filesystem > > > > or data integrity. > > > > > > If the allocation unit of the storage device (e.g. a few MB) does not > > > match the allocation unit of the filesystem (e.g. a few KB) then for > > > this to be useful either the storage device must start recording tiny > > > allocations, or the filesystem should re-release areas as they grow. > > > i.e. when releasing a range of a device, look in the filesystem's usage > > > records for the largest surrounding free space, and release all of that. > > > > I figured that the easiest way around this is reporting free space > > extents, not the amoutn actually freed. e.g. > > > > 4k in file A @ block 10 > > 4k in file B @ block 11 > > 4k free space @ block 12 > > 4k in file C @ block 13 > > 1008k in free space at block 14. > > > > If we free file A, we report that we've released an extent of 4k @ block 10. > > if we then free file B, we report we've released an extent of 12k @ block 10. > > If we then free file C, we report a release of 1024k @ block 10. > > > > Then the underlying device knows what the aggregated free space regions > > are and can easily release large regions without needing to track tiny > > allocations and frees done by the filesystem. > > If you could do that in the filesystem, it certainly solve the problem. > In which case I'll explicitly allow for the hint's extent to overlap > extents previous extents thus hinted, and define the semantics > for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED, > I think that will make the semantics a little clearer. I think that can be done - i wouldn't have mentioned it if I didn't think it was possible to implement ;). It will require a further btree lookup once the free transaction hits the disk, but I think that's pretty easy to do. I'd probably hook xfs_alloc_clear_busy() to do this. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks 2007-10-30 1:11 ` Greg Banks 2007-10-30 4:16 ` Neil Brown @ 2007-10-30 9:35 ` Dongjun Shin 2007-10-30 10:15 ` Arnd Bergmann ` (2 more replies) 2 siblings, 3 replies; 20+ messages in thread From: Dongjun Shin @ 2007-10-30 9:35 UTC (permalink / raw) To: Greg Banks Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On 10/30/07, Greg Banks <gnb@sgi.com> wrote: > > BIO_HINT_RELEASE > The bio's block extent is no longer in use by the filesystem > and will not be read in the future. Any storage used to back > the extent may be released without any threat to filesystem > or data integrity. > I'd like to second the proposal, but it would be more useful to bring the hint down to the physical devices. There is an ongoing discussion about adding 'Trim' ATA command for notifying the drive about the deleted blocks. http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf This is especially useful for the storage device like Solid State Drive (SSD). Dongjun ^ permalink raw reply [flat|nested] 20+ messages in thread
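For context, the Data Set Management ("Trim") command carries the deleted ranges as a data payload rather than as command parameters: 8-byte entries holding a 48-bit starting LBA and a 16-bit range length, padded out to a 512-byte block. The sketch below packs release extents into that layout; it follows my reading of the format that ended up in ACS-2, so verify against the draft linked above before relying on the details.

/* Hedged sketch: pack (lba, length) extents into TRIM-style 8-byte
 * entries (48-bit LBA, 16-bit range length) in a 512-byte payload. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TRIM_BLOCK_SIZE   512
#define TRIM_MAX_ENTRIES  (TRIM_BLOCK_SIZE / 8)
#define TRIM_MAX_RANGE    0xffffULL

struct extent {
        uint64_t lba;
        uint64_t len;           /* in sectors */
};

/* Returns the number of entries written, or -1 if they don't fit. */
static int trim_pack(const struct extent *ext, int n, uint8_t payload[TRIM_BLOCK_SIZE])
{
        int i, entry = 0;

        memset(payload, 0, TRIM_BLOCK_SIZE);
        for (i = 0; i < n; i++) {
                uint64_t lba = ext[i].lba, left = ext[i].len;

                while (left) {
                        uint64_t chunk = left > TRIM_MAX_RANGE ? TRIM_MAX_RANGE : left;
                        uint64_t word = (lba & 0xffffffffffffULL) | (chunk << 48);
                        int b;

                        if (entry == TRIM_MAX_ENTRIES)
                                return -1;
                        for (b = 0; b < 8; b++)         /* little-endian on the wire */
                                payload[entry * 8 + b] = (uint8_t)(word >> (8 * b));
                        entry++;
                        lba += chunk;
                        left -= chunk;
                }
        }
        return entry;
}

int main(void)
{
        uint8_t payload[TRIM_BLOCK_SIZE];
        struct extent released[] = { { 4096, 8 }, { 1u << 20, 200000 } };

        printf("entries used: %d\n", trim_pack(released, 2, payload));
        return 0;
}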
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 9:35 ` Dongjun Shin
@ 2007-10-30 10:15 ` Arnd Bergmann
  2007-10-30 10:49 ` Dongjun Shin
  2007-10-30 23:42 ` Kyungmin Park
  1 sibling, 2 replies; 20+ messages in thread
From: Arnd Bergmann @ 2007-10-30 10:15 UTC (permalink / raw)
To: Dongjun Shin
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois

On Tuesday 30 October 2007, Dongjun Shin wrote:
> There is an ongoing discussion about adding a 'Trim' ATA command for notifying
> the drive about the deleted blocks.
>
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
>
> This is especially useful for the storage device like Solid State Drive (SSD).
>
This makes me curious: why would t13 want to invent a new command when there is already the erase command from CFA?

It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

	Arnd <><

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 10:15 ` Arnd Bergmann
@ 2007-10-30 10:49 ` Dongjun Shin
  2007-10-30 12:38 ` Arnd Bergmann
  1 sibling, 1 reply; 20+ messages in thread
From: Dongjun Shin @ 2007-10-30 10:49 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois

On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> This makes me curious: why would t13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf
>
I'm not sure about the background. However, it's definitely a sign that passing the deleted block info to the flash-based storage is useful.

Anyway, BIO_HINT_RELEASE could destroy the content of the blocks after being passed to the device. I think that other bios should not be reordered across that hint (just like a barrier).

Dongjun

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 10:49 ` Dongjun Shin @ 2007-10-30 12:38 ` Arnd Bergmann 2007-10-30 14:19 ` Dongjun Shin 0 siblings, 1 reply; 20+ messages in thread From: Arnd Bergmann @ 2007-10-30 12:38 UTC (permalink / raw) To: Dongjun Shin Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois On Tuesday 30 October 2007, Dongjun Shin wrote: > Anyway, BIO_HINT_RELEASE could destroy the content of the blocks > after being passed to the device. I think that other bio should not be > reordered accross that hint (just like barrier). Not sure. Why shouldn't you be able to reorder the hints provided that they don't overlap with read/write bios for the same block? Arnd <>< ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 12:38 ` Arnd Bergmann @ 2007-10-30 14:19 ` Dongjun Shin 2007-10-30 15:37 ` Jörn Engel 0 siblings, 1 reply; 20+ messages in thread From: Dongjun Shin @ 2007-10-30 14:19 UTC (permalink / raw) To: Arnd Bergmann Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote: > > Not sure. Why shouldn't you be able to reorder the hints provided that > they don't overlap with read/write bios for the same block? > You're right. The bios can be reordered if they don't overlap with hint. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 14:19 ` Dongjun Shin @ 2007-10-30 15:37 ` Jörn Engel 2007-10-30 16:37 ` Arnd Bergmann 0 siblings, 1 reply; 20+ messages in thread From: Jörn Engel @ 2007-10-30 15:37 UTC (permalink / raw) To: Dongjun Shin Cc: Arnd Bergmann, Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: > On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote: > > > > Not sure. Why shouldn't you be able to reorder the hints provided that > > they don't overlap with read/write bios for the same block? > > You're right. The bios can be reordered if they don't overlap with hint. I would keep things simpler. Bios can be reordered, full stop. If an erase and a write overlap, the caller (filesystem?) has to add a barrier. Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 15:37 ` Jörn Engel
@ 2007-10-30 16:37 ` Arnd Bergmann
  2007-10-30 23:19   ` Kyungmin Park
  0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2007-10-30 16:37 UTC (permalink / raw)
To: Jörn Engel
Cc: Dongjun Shin, Greg Banks, Linux Filesystem Mailing List,
    David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg,
    Mark Goodwin, Brett Jon Grandbois

On Tuesday 30 October 2007, Jörn Engel wrote:
> On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > Not sure. Why shouldn't you be able to reorder the hints, provided
> > > that they don't overlap with read/write bios for the same blocks?
> >
> > You're right. The bios can be reordered if they don't overlap with
> > the hint.
>
> I would keep things simpler. Bios can be reordered, full stop. If an
> erase and a write overlap, the caller (the filesystem?) has to add a
> barrier.

I thought bios were already ordered if they affect the same blocks.
Either way, I agree that an erase should not be treated specially at
the bio layer; its ordering should be handled the same way we handle
it for writes.

	Arnd <><
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 16:37 ` Arnd Bergmann
@ 2007-10-30 23:19 ` Kyungmin Park
  0 siblings, 0 replies; 20+ messages in thread
From: Kyungmin Park @ 2007-10-30 23:19 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jörn Engel, Dongjun Shin, Greg Banks, Linux Filesystem Mailing List,
    David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg,
    Mark Goodwin, Brett Jon Grandbois

On 10/31/07, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 30 October 2007, Jörn Engel wrote:
> > On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > > On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > Not sure. Why shouldn't you be able to reorder the hints, provided
> > > > that they don't overlap with read/write bios for the same blocks?
> > >
> > > You're right. The bios can be reordered if they don't overlap with
> > > the hint.
> >
> > I would keep things simpler. Bios can be reordered, full stop. If an
> > erase and a write overlap, the caller (the filesystem?) has to add a
> > barrier.
>
> I thought bios were already ordered if they affect the same blocks.
> Either way, I agree that an erase should not be treated specially at
> the bio layer; its ordering should be handled the same way we handle
> it for writes.

To support the new ATA command (Trim, or Data Set Management), the
suggested hint is not enough. We have to send the bio with data (at
least one sector or more), since the new ATA command carries the data
set information as a payload. We also have to strictly follow the
ordering, using a barrier or other methods, at the filesystem level.

For example, consider the delete operation in ext3:

1. Some file is deleted.
2. ext3_delete_inode() is called.
3. ... -> ext3_free_blocks_sb() releases the free blocks.
4. If the hint is sent here, it breaks the ext3 power-off recovery
   scheme, because the device trims the data described by the hint and
   the blocks cannot be recovered after a reboot.
5. Only after the transaction, when all dirty pages have been flushed,
   can we trim the free blocks safely.

Another approach is to modify the block framework so that the I/O
scheduler does not merge a hint bio (in my terminology, bio control
info) with a general bio. In this case we must also consider the
reordering problem. I'm not sure this is possible at this time.

Thank you,
Kyungmin Park
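As a rough illustration of what Kyungmin means by "sending the bio with
data", here is a hedged sketch of a release hint that also carries one
sector of payload holding a single LBA range entry. BIO_RW_HINT and
BIO_HINT_RELEASE are the flags proposed in this thread, hint_end_io() is
a hypothetical completion handler that frees the page, and the entry
layout (bits 47:0 starting LBA, bits 63:48 range length) is an assumption
that should be checked against the T13 draft.

/*
 * Hedged sketch: a hint bio with a one-sector Data Set Management
 * payload.  Error unwinding is omitted for brevity.
 */
static int submit_release_hint(struct block_device *bdev,
			       sector_t start, unsigned int nsectors)
{
	struct bio *bio = bio_alloc(GFP_NOFS, 1);
	struct page *page = alloc_page(GFP_NOFS | __GFP_ZERO);
	__le64 *entry;

	if (!bio || !page)
		return -ENOMEM;		/* real code would unwind properly */

	/* assumed entry layout: low 48 bits LBA, high 16 bits range length */
	entry = page_address(page);
	entry[0] = cpu_to_le64(((u64)nsectors << 48) |
			       (start & ((1ULL << 48) - 1)));

	bio->bi_bdev = bdev;
	bio->bi_sector = start;
	bio->bi_end_io = hint_end_io;	/* hypothetical: frees the page */
	bio_add_page(bio, page, 512, 0);

	/* hint type carried in the priority bits, as proposed earlier */
	bio_set_prio(bio, BIO_HINT_RELEASE);
	submit_bio(1 << BIO_RW_HINT, bio);
	return 0;
}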
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 10:15 ` Arnd Bergmann
  2007-10-30 10:49   ` Dongjun Shin
@ 2007-10-30 23:42   ` Kyungmin Park
  1 sibling, 0 replies; 20+ messages in thread
From: Kyungmin Park @ 2007-10-30 23:42 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Dongjun Shin, Greg Banks, Linux Filesystem Mailing List,
    David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg,
    Mark Goodwin, Brett Jon Grandbois

On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 30 October 2007, Dongjun Shin wrote:
> > There is an ongoing discussion about adding a 'Trim' ATA command for
> > notifying the drive about deleted blocks.
> >
> > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
> >
> > This is especially useful for storage devices like Solid State
> > Drives (SSDs).
>
> This makes me curious: why would T13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

IMHO, the main difference is whether a physical operation is required.
CFA_ERASE erases the free blocks, so it requires a physical erase
operation. In the Trim case, the free blocks are just unmapped at the
FTL level; no physical operation is required. That saves time, and a
lot of work can then be done internally at the FTL level.

Thank you,
Kyungmin Park
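A toy illustration, not from the thread, of the distinction Kyungmin
draws: a CFA-style erase touches the flash itself, while a Trim-style
deallocation only drops a logical-to-physical mapping inside the FTL and
leaves the space for garbage collection to reclaim later. The mapping
table and physically_erase_block() are hypothetical.

struct toy_ftl {
	int *l2p;		/* logical page -> physical page, -1 = unmapped */
	unsigned int npages;
};

/* Trim-style: O(1) bookkeeping, no flash operation at all */
static void ftl_unmap(struct toy_ftl *ftl, unsigned int lpage)
{
	if (lpage < ftl->npages)
		ftl->l2p[lpage] = -1;
}

/* CFA-style: the mapped physical block really is erased, which is slow */
static void ftl_erase(struct toy_ftl *ftl, unsigned int lpage)
{
	if (lpage < ftl->npages && ftl->l2p[lpage] >= 0) {
		physically_erase_block(ftl->l2p[lpage]);	/* hypothetical */
		ftl->l2p[lpage] = -1;
	}
}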
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30  9:35 ` Dongjun Shin
  2007-10-30 10:15   ` Arnd Bergmann
@ 2007-10-30 14:06   ` Jörn Engel
  2007-10-31  3:44   ` Greg Banks
  2 siblings, 0 replies; 20+ messages in thread
From: Jörn Engel @ 2007-10-30 14:06 UTC (permalink / raw)
To: Dongjun Shin
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner,
    Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin,
    Brett Jon Grandbois, Arnd Bergmann

On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote:
> On 10/30/07, Greg Banks <gnb@sgi.com> wrote:
> >
> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future. Any storage used to back
> >     the extent may be released without any threat to filesystem
> >     or data integrity.
>
> I'd like to second the proposal, but it would be more useful to bring
> the hint down to the physical devices.

Absolutely. Logfs would love to have an erase operation for block
devices as well. However, the above doesn't quite match my needs,
because the blocks _will_ be read in the future.

There are two reasons for reading things back later. The good one is to
determine whether the segment was erased or not. Reads should return
either valid data or one of (all-0xff, all-0x00, -ESOMETHING). Having a
dedicated error code would be best.

And getting the device erasesize would be useful as well, for obvious
reasons.

Jörn

--
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon
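A small sketch of the read-back check Jörn describes, under the
assumption that reading a previously erased region returns either
surviving data, a uniformly all-0x00 or all-0xff buffer, or a dedicated
error code. No such interface exists yet, so the error convention and the
helper are hypothetical; a real filesystem would additionally compare
against the data it expects to find there.

/*
 * Hedged sketch: decide whether a segment read back from the device
 * looks erased.  read_err would be the (hypothetical) dedicated error
 * code Jörn asks for.
 */
static int segment_was_erased(const unsigned char *buf, size_t len, int read_err)
{
	size_t i;
	unsigned char first;

	if (read_err)			/* dedicated "erased" error code */
		return 1;

	first = buf[0];
	if (first != 0x00 && first != 0xff)
		return 0;		/* ordinary data survived */

	for (i = 1; i < len; i++)
		if (buf[i] != first)
			return 0;	/* mixed content: not erased */

	return 1;			/* uniform 0x00 or 0xff: erased */
}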
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30  9:35 ` Dongjun Shin
  2007-10-30 10:15   ` Arnd Bergmann
  2007-10-30 14:06   ` Jörn Engel
@ 2007-10-31  3:44   ` Greg Banks
  2 siblings, 0 replies; 20+ messages in thread
From: Greg Banks @ 2007-10-31 3:44 UTC (permalink / raw)
To: Dongjun Shin
Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma,
    Christoph Hellwig, Roger Strassburg, Mark Goodwin,
    Brett Jon Grandbois, Arnd Bergmann

On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote:
> On 10/30/07, Greg Banks <gnb@sgi.com> wrote:
> >
> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future. Any storage used to back
> >     the extent may be released without any threat to filesystem
> >     or data integrity.
>
> I'd like to second the proposal, but it would be more useful to bring
> the hint down to the physical devices.
>
> There is an ongoing discussion about adding a 'Trim' ATA command for
> notifying the drive about deleted blocks.
>
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

What an interesting document. Am I reading the change markup correctly?
Did it get *simpler* in the last revision? Wow.

I agree that BIO_HINT_RELEASE would be a good match for the proposed
Trim command. But I don't think we'll ever be issuing Trims with more
than a single LBA Range Entry; that feature seems unhelpful.

The Trim proposal doesn't specify what happens when a sector which is
already deallocated is deallocated again; presumably this is supposed
to be harmless?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
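For reference, a hedged sketch of decoding the single LBA Range Entry
Greg mentions, assuming the layout in the Data Set Management draft (low
48 bits starting LBA, high 16 bits range length, with a zero length
meaning the entry is unused). The exact layout should be verified against
the final T13 document; the struct and helper are illustrative only.

#include <stdint.h>
#include <stdbool.h>

struct lba_range {
	uint64_t lba;		/* starting LBA */
	uint16_t count;		/* number of sectors, 0 = entry not in use */
};

/* Decode one 8-byte range entry; returns false for padding entries. */
static bool decode_range_entry(uint64_t raw, struct lba_range *out)
{
	out->lba = raw & ((1ULL << 48) - 1);
	out->count = (uint16_t)(raw >> 48);
	return out->count != 0;
}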
Thread overview: 20+ messages (newest: 2007-11-20 23:41 UTC)
[not found] <20070927063113.GD2989@sgi.com>
2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks
2007-10-30 1:11 ` Greg Banks
2007-10-30 4:16 ` Neil Brown
2007-10-30 5:12 ` Greg Banks
2007-10-30 7:43 ` Arnd Bergmann
2007-11-20 23:43 ` Roger Strassburg
2007-10-30 23:56 ` David Chinner
2007-10-31 4:01 ` Greg Banks
2007-10-31 7:04 ` David Chinner
2007-10-30 9:35 ` Dongjun Shin
2007-10-30 10:15 ` Arnd Bergmann
2007-10-30 10:49 ` Dongjun Shin
2007-10-30 12:38 ` Arnd Bergmann
2007-10-30 14:19 ` Dongjun Shin
2007-10-30 15:37 ` Jörn Engel
2007-10-30 16:37 ` Arnd Bergmann
2007-10-30 23:19 ` Kyungmin Park
2007-10-30 23:42 ` Kyungmin Park
2007-10-30 14:06 ` Jörn Engel
2007-10-31 3:44 ` Greg Banks