Re: Proposal to improve filesystem/block snapshot interaction


From: Neil Brown <neilb@suse.de>
To: Greg Banks <gnb@sgi.com>
Cc: Linux Filesystem Mailing List <linux-fsdevel@vger.kernel.org>,
	David Chinner <dgc@melbourne.sgi.com>,
	Donald Douwsma <donaldd@melbourne.sgi.com>,
	Christoph Hellwig <hch@infradead.org>,
	Roger Strassburg <rls@sgi.com>, Mark Goodwin <markgw@sgi.com>,
	Brett Jon Grandbois <brettg@melbourne.sgi.com>,
	Arnd Bergmann <arnd@arndb.de>
Subject: Re: Proposal to improve filesystem/block snapshot interaction
Date: Tue, 30 Oct 2007 15:16:06 +1100
Message-ID: <18214.45062.754722.885137@notabene.brown>
In-Reply-To: message from Greg Banks on Tuesday October 30

On Tuesday October 30, gnb@sgi.com wrote:
> 
> Of course snapshot cow elements may be part of more generic element
> trees.  In general there may be more than one consumer of block usage
> hints in a given filesystem's element tree, and their locations in that
> tree are not predictable.  This means the block extents mentioned in
> the usage hints need to be subject to the block mapping algorithms
> provided by the element tree.  As those algorithms are currently
> implemented using bio mapping and splitting, the easiest and simplest
> way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshottable
elements, and that some of these might be combined by e.g. RAID0 into
a larger device, with a filesystem then created on that?
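
To make the layering question concrete, here is a minimal userspace
sketch of how a hint's block extent would have to be split when it
passes through a RAID0 element on its way to the snapshot elements
underneath.  The struct and function names are hypothetical, and the
layout (fixed chunk size, round-robin members) is the simplest
possible striping, not a description of XVM or md:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not a kernel structure. */
struct extent {
	unsigned dev;       /* index of the RAID0 member device */
	uint64_t sector;    /* start sector on that member */
	uint64_t nsectors;  /* length in sectors */
};

/*
 * Split the logical extent [sector, sector + nsectors) of a striped
 * array with `ndevs` members and `chunk` sectors per chunk into
 * per-member pieces.  At most `max` pieces are stored in `out`; the
 * return value is the number of pieces the extent needs in total.
 */
size_t raid0_map_extent(uint64_t sector, uint64_t nsectors,
			unsigned ndevs, uint64_t chunk,
			struct extent *out, size_t max)
{
	size_t n = 0;

	assert(ndevs > 0 && chunk > 0);
	while (nsectors > 0) {
		uint64_t chunk_no = sector / chunk;  /* logical chunk index */
		uint64_t offset = sector % chunk;    /* offset inside chunk */
		uint64_t len = chunk - offset;       /* stop at chunk end */

		if (len > nsectors)
			len = nsectors;
		if (n < max) {
			out[n].dev = (unsigned)(chunk_no % ndevs);
			out[n].sector = (chunk_no / ndevs) * chunk + offset;
			out[n].nsectors = len;
		}
		n++;
		sector += len;
		nsectors -= len;
	}
	return n;
}
```

With two members and 8-sector chunks, a 12-sector hint starting at
sector 6 becomes three pieces: 2 sectors on member 0, 8 on member 1,
and 2 more on member 0.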

I ask because my first thought was that the sort of communication you
want seems like it would be just between a filesystem and the block
device that it talks directly to.  As you are particularly
interested in XFS and XVM, you could come up with whatever protocol
you want for those two to talk to each other, prototype it, iron out
all the issues, then say "We've got this really cool thing to make
snapshots much faster - wanna share?" and thus be presenting from a
position of more strength (the old 'code talks' mantra).

> 
> First we need a mechanism to indicate that a bio is a hint rather
> than a real IO.  Perhaps the easiest way is to add a new flag to
> the bi_rw field:
> 
> #define BIO_RW_HINT 	5   	/* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn, which is just to have
a zero-length barrier bio (is that implemented yet? I lost track).
It is different, though: a zero-length barrier has zero length, whereas
your hints have a very meaningful length.
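
The distinction can be sketched in userspace C.  The struct below is
a toy model that only borrows the bio field names; BIO_RW_HINT is the
bit from the proposal, the barrier bit number is an assumption made
for the example, and the helper functions are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIO_RW_BARRIER	2	/* assumed bit number, for illustration */
#define BIO_RW_HINT	5	/* bit number taken from the proposal */

/* Toy stand-in for struct bio; only the fields the sketch needs. */
struct toy_bio {
	unsigned long bi_rw;	/* flag bits */
	uint64_t bi_sector;	/* start of the extent */
	uint32_t bi_size;	/* extent length in bytes */
	unsigned bi_vcnt;	/* number of attached pages */
};

static bool toy_bio_is_hint(const struct toy_bio *b)
{
	return (b->bi_rw & (1UL << BIO_RW_HINT)) != 0;
}

/* A flush-style barrier: barrier bit set and no extent at all. */
static bool toy_bio_is_empty_barrier(const struct toy_bio *b)
{
	return (b->bi_rw & (1UL << BIO_RW_BARRIER)) != 0 && b->bi_size == 0;
}

/* A hint carries no pages, yet its length is what gives it meaning. */
static bool toy_bio_hint_valid(const struct toy_bio *b)
{
	return toy_bio_is_hint(b) && b->bi_vcnt == 0 && b->bi_size > 0;
}
```

Both kinds of bio carry no data pages, so any code that currently
treats "no pages" as "zero length" would need to tell them apart.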

> 
> Next we'll need three bio hints types with the following semantics.
> 
> BIO_HINT_ALLOCATE
>     The bio's block extent will soon be written by the filesystem
>     and any COW that may be necessary to achieve that should begin
>     now.  If the COW is going to fail, the bio should fail.  Note
>     that this provides a way for the filesystem to manage when and
>     how failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device
and to have that change honoured?
i.e. "Please allocate 128 blocks, maybe 'here'"
     "OK, 128 blocks allocated, but they are actually over 'there'".

If the device is tracking what space is and isn't used, it might make
life easier for it to do the allocation.  Maybe even have a variant
"Allocate 128 blocks, I don't care where".

Is this bio supposed to block until the copy has happened?  Or only
until the space for the copy has been allocated and possibly committed?
Or must it return without doing any IO at all?
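
A rough userspace sketch of the "maybe here / actually over there"
exchange, assuming a toy device that tracks usage with one flag per
block and allocates first-fit.  Every name here is hypothetical,
including the TOY_ANYWHERE sentinel for the "I don't care where"
variant:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NBLOCKS	64
#define TOY_ANYWHERE	UINT64_MAX	/* hypothetical "don't care where" */

struct toy_dev {
	bool used[TOY_NBLOCKS];		/* device-side usage tracking */
};

static bool run_is_free(const struct toy_dev *d, uint64_t start, uint64_t len)
{
	if (len == 0 || start > TOY_NBLOCKS || len > TOY_NBLOCKS - start)
		return false;
	for (uint64_t i = start; i < start + len; i++)
		if (d->used[i])
			return false;
	return true;
}

/*
 * Try the caller's preferred location first, then fall back to
 * first-fit.  On success returns 0 and rewrites *sector with where
 * the blocks really are; returns -1 if no free run is long enough.
 */
int toy_hint_allocate(struct toy_dev *d, uint64_t *sector, uint64_t len)
{
	uint64_t start = 0;
	bool found = false;

	if (*sector != TOY_ANYWHERE && run_is_free(d, *sector, len)) {
		start = *sector;
		found = true;
	}
	for (uint64_t s = 0; !found && s + len <= TOY_NBLOCKS; s++) {
		if (run_is_free(d, s, len)) {
			start = s;
			found = true;
		}
	}
	if (!found)
		return -1;
	for (uint64_t i = start; i < start + len; i++)
		d->used[i] = true;
	*sector = start;
	return 0;
}
```

The filesystem would have to read the sector back on completion and
use that, which is exactly the "have that change honoured" part.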

> 
> BIO_HINT_RELEASE
>     The bio's block extent is no longer in use by the filesystem
>     and will not be read in the future.  Any storage used to back
>     the extent may be released without any threat to filesystem
>     or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not
match the allocation unit of the filesystem (e.g. a few KB) then for
this to be useful either the storage device must start recording tiny
allocations, or the filesystem should re-release areas as they grow.
i.e. when releasing a range of a device, look in the filesystem's usage
records for the largest surrounding free space, and release all of that.

Would this be a burden on the filesystems?
Is my imagined disparity between block sizes valid?
Would it be just as easy for the storage device to track small
allocation/deallocations?
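
The filesystem-side option could look something like the sketch
below: on each release, grow the range to the largest surrounding
free extent, then round inward to the device's allocation unit and
release only that.  This is a userspace illustration with
hypothetical names, assuming the filesystem keeps a simple
free-block map and knows the device chunk size in filesystem blocks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the range to hand to the device. */
struct range {
	uint64_t start;
	uint64_t len;	/* 0 means nothing worth releasing yet */
};

/*
 * Mark [start, start + len) free in the filesystem's map, expand to
 * the surrounding free extent, and return the part of that extent
 * that covers whole device chunks of `chunk` blocks.
 */
struct range toy_release(bool *fs_free, uint64_t nblocks,
			 uint64_t start, uint64_t len, uint64_t chunk)
{
	struct range r = { 0, 0 };
	uint64_t lo = start, hi = start + len;	/* half-open [lo, hi) */
	uint64_t alo, ahi;

	assert(chunk > 0 && len > 0);
	assert(start < nblocks && len <= nblocks - start);

	for (uint64_t i = lo; i < hi; i++)
		fs_free[i] = true;
	while (lo > 0 && fs_free[lo - 1])	/* grow downwards */
		lo--;
	while (hi < nblocks && fs_free[hi])	/* grow upwards */
		hi++;

	alo = (lo + chunk - 1) / chunk * chunk;	/* round start up */
	ahi = hi / chunk * chunk;		/* round end down */
	if (alo < ahi) {
		r.start = alo;
		r.len = ahi - alo;
	}
	return r;
}
```

Note that the same chunk can be released more than once as the free
area around it grows, so the device would need to tolerate repeated
releases of a range.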

> 
> BIO_HINT_DONTCOW
>     (the Bart Simpson BIO).  The bio's block extent is not needed
>     in mounted snapshots and does not need to be subjected to COW.

This seems like a much more domain-specific function than the other
two, which themselves could be more generally useful (I'm imagining
using hints from them to e.g. accelerate RAID reconstruction).

Surely the "correct" thing to do with the log is to put it on a separate
device which itself isn't snapshotted.

If you have a storage manager that is smart enough to handle these
sorts of things, maybe the functionality you want is "Give me a
subordinate device which is not snapshotted, size X", then journal to
that virtual device.
I guess that is equally domain-specific, but the difference is that if
you try to read from the DONTCOW part of the snapshot, you get bad
old data, whereas if you try to access the subordinate device of a
snapshot, you get an IO error - which is probably safer.
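
The difference in failure mode can be shown with a toy model.  This
is a userspace sketch with hypothetical names: a tiny origin device
with one snapshot, where the `dontcow` flag marks the region that is
exempt from COW in the first variant and stands in for the separate
subordinate device in the second:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NBLOCKS 8

struct toy_origin {
	uint32_t data[TOY_NBLOCKS];
	bool dontcow[TOY_NBLOCKS];	/* e.g. the log region */
};

struct toy_snap {
	const struct toy_origin *origin;
	uint32_t saved[TOY_NBLOCKS];	/* COW copies */
	bool copied[TOY_NBLOCKS];
};

/* Write to the origin, preserving the old block unless it is exempt. */
void toy_origin_write(struct toy_origin *o, struct toy_snap *s,
		      unsigned blk, uint32_t val)
{
	assert(blk < TOY_NBLOCKS);
	if (!o->dontcow[blk] && !s->copied[blk]) {
		s->saved[blk] = o->data[blk];
		s->copied[blk] = true;
	}
	o->data[blk] = val;
}

/* Variant A: exempt blocks read through to whatever the origin holds. */
int toy_snap_read_dontcow(const struct toy_snap *s, unsigned blk,
			  uint32_t *out)
{
	assert(blk < TOY_NBLOCKS);
	*out = s->copied[blk] ? s->saved[blk] : s->origin->data[blk];
	return 0;
}

/* Variant B: the exempt region is not part of the snapshot at all. */
int toy_snap_read_subordinate(const struct toy_snap *s, unsigned blk,
			      uint32_t *out)
{
	assert(blk < TOY_NBLOCKS);
	if (s->origin->dontcow[blk])
		return -EIO;
	return toy_snap_read_dontcow(s, blk, out);
}
```

In variant A a read of the exempt block succeeds but returns contents
that do not match the snapshot; in variant B the same read fails
loudly.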

> 
> Comments?

On the whole it seems reasonably sane .... providing you are from the
school which believes that volume managers and filesystems should be
kept separate :-)

NeilBrown
