Re: Q: Why subvolumes?

linux-btrfs.vger.kernel.org archive mirror

From: Hugo Mills <hugo@carfax.org.uk>
To: Jerome Haltom <wasabi@cogito.cx>
Cc: Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Q: Why subvolumes?
Date: Tue, 23 Jul 2013 16:06:20 +0100
Message-ID: <20130723150620.GG20517@carfax.org.uk>
In-Reply-To: <CA+V+5QrNAo_RVEiONHRqkN5O89jgtoFDecuWnu41_ovJmLVhuA@mail.gmail.com>

On Tue, Jul 23, 2013 at 06:59:35AM -0500, Jerome Haltom wrote:
> May I ask why the decision to implement snapshotting through
> subvolumes? I've been very curious about why the design wasn't to
> simply allow snapshotting of any directory or file.

   tl;dr: Snapshots just don't work that way, and it's hard to make
them work on arbitrary directories while keeping them atomic.

   It's down to the way that snapshots are implemented (btrfs being a
copy-on-write filesystem). A snapshot is an (atomic) copy of the FS
tree for a subvolume, where the FS tree is the metadata tree which
holds the inode information, filenames, directory structure,
permissions and so forth. Because btrfs is a CoW FS, we can do this
cheaply by copying only the root block of the tree -- a matter of a
few KiB. Running ls -R on a snapshot and its original will read
exactly the same blocks on the disk, except for the single top-level
block in each case. As the snapshot is modified, the metadata changes,
and parts of the FS tree for the snapshot are CoWed, leaving the
original blocks in place. There is a reference-counting mechanism here
as well, to ensure that we don't leave unused blocks lying around the
place.
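
   As a rough illustration of the paragraph above, here is a toy model
in Python. It is a sketch under simplifying assumptions, not btrfs
code: the names Node, build, snapshot and cow_set are made up for this
example, and real btrfs reference counting is more involved. It shows
a snapshot copying only the root block, and a later write copying just
the blocks on the path to the change.

```python
class Node:
    """One block of a toy tree: a leaf with a value, or an interior
    block with named children. All names here are hypothetical."""
    def __init__(self, children=None, value=None):
        self.children = dict(children or {})
        self.value = value
        self.refs = 0  # number of parent blocks pointing at this block


def build(spec):
    """Build a tree from nested dicts; non-dict values become leaves."""
    if isinstance(spec, dict):
        node = Node({name: build(sub) for name, sub in spec.items()})
        for child in node.children.values():
            child.refs += 1
        return node
    return Node(value=spec)


def snapshot(root):
    """Copy only the root block. Every child is now shared by two
    roots, so its reference count goes up by one."""
    new_root = Node(root.children, root.value)
    for child in new_root.children.values():
        child.refs += 1
    return new_root


def cow_set(root, path, value):
    """Set the leaf at `path`, copying any shared block on the way
    down so the other tree keeps seeing the original blocks."""
    node = root
    for name in path:
        child = node.children[name]
        if child.refs > 1:
            private = Node(child.children, child.value)
            for grandchild in private.children.values():
                grandchild.refs += 1  # now referenced by the copy too
            child.refs -= 1           # this tree lets go of the original
            private.refs = 1
            node.children[name] = private
            child = private
        node = child
    node.value = value
```

   Taking a snapshot touches one block however large the tree is; the
cost of divergence is paid later, one modified path at a time.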

   Now... since the snapshot's FS tree is a direct duplicate of the
original FS tree (actually, it's the same tree, but they look like
different things to the outside world), they share everything --
including things like inode numbers. This is OK for a snapshot of a
whole subvolume, because each subvolume has its own distinct
inode-number space. If we could snapshot arbitrary subsections of the
FS, we'd end up having to fix up inode numbers to ensure that they
were unique -- which can't really be an atomic operation (unless you
want to have the FS locked while the kernel updates the inodes of the
billion files you just snapshotted).
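
   The inode-number argument can be sketched the same way. This is a
toy model with assumed names (whole_subvolume_snapshot,
in_place_directory_snapshot, and a dict standing in for the FS tree),
not how the kernel does it. A file is identified by the pair
(subvolume id, inode number), so a whole-subvolume snapshot needs no
renumbering, while a copy inside the same subvolume needs one fix-up
per inode.

```python
def whole_subvolume_snapshot(subvols, src, dst):
    """Register a new subvolume id that points at the same inode
    table. One step, regardless of how many files there are."""
    subvols[dst] = subvols[src]
    return 1  # number of steps taken


def in_place_directory_snapshot(subvols, subvol, inodes_under_dir):
    """Copy a set of inodes within one subvolume. Each copy needs a
    fresh inode number, so the work grows with the number of files."""
    table = subvols[subvol]
    next_free = max(table) + 1
    mapping = {}
    steps = 0
    for ino in sorted(inodes_under_dir):
        mapping[ino] = next_free
        table[next_free] = table[ino]
        next_free += 1
        steps += 1
    return mapping, steps


def identities(subvols, subvol):
    """The (subvolume id, inode number) pairs visible in a subvolume."""
    return {(subvol, ino) for ino in subvols[subvol]}
```

   With a billion files, the first function still takes one step; the
second takes a billion, and nothing else may change the directory
while it runs if the result is to be atomic.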

   The other thing to talk about here is that while the FS tree is a
tree structure, it's not a direct one-to-one map to the directory tree
structure. In fact, it looks more like a list of inodes, in inode
order, with some extra info for easily tracking through the list. The
B-tree structure of the FS tree is just a fast indexing method. So
snapshotting a directory entry within the FS tree would require
(somehow) making an atomic copy, or CoW copy, of only the parts of the
FS tree that fall under the directory in question -- so you'd end up
trying to take a sequence of records in the FS tree, of arbitrary size
(proportional roughly to the number of entries in the directory) and
copying them to somewhere else in the same tree in such a way that you
can automatically dereference the copies when you modify them. So,
ultimately, it boils down to being able to do CoW operations at the
byte level, which is going to introduce huge quantities of extra
metadata, and it all starts looking really awkward to implement (plus
having to deal with the long time taken to copy the directory entries
for the thing you're snapshotting).
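
   To make the "list of inodes, in inode order" point concrete, here
is a small sketch. The item layout is an assumption for illustration
only: tuples of (inode, kind, payload) kept sorted, with made-up kinds
"INODE" and "DIRENT"; real btrfs items are keyed differently in
detail. It collects the positions of every record that belongs to one
directory's subtree.

```python
import bisect


def items_for_inode(items, ino):
    """Index range of all records for one inode in the sorted list."""
    lo = bisect.bisect_left(items, (ino,))
    hi = bisect.bisect_left(items, (ino + 1,))
    return range(lo, hi)


def subtree_positions(items, dir_ino):
    """Positions of every record reachable from directory `dir_ino`.

    A "DIRENT" payload is "name=child_inode" in this toy format."""
    positions = []
    pending = [dir_ino]
    while pending:
        ino = pending.pop()
        for i in items_for_inode(items, ino):
            positions.append(i)
            _, kind, payload = items[i]
            if kind == "DIRENT":
                pending.append(int(payload.split("=")[1]))
    return sorted(positions)


ITEMS = sorted([
    (256, "INODE", ""), (256, "DIRENT", "docs=257"), (256, "DIRENT", "src=258"),
    (257, "INODE", ""), (257, "DIRENT", "a.txt=259"),
    (258, "INODE", ""), (258, "DIRENT", "main.c=260"),
    (259, "INODE", ""), (260, "INODE", ""),
])

print(subtree_positions(ITEMS, 257))
```

   In this toy layout the records for "docs" and the file inside it
are separated by records belonging to "src", so the directory is not
one contiguous run that a single shared block pointer could cover.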

   I doubt it would be possible to retrofit btrfs to do it without
more or less a ground-up rewrite, if even then. I would further doubt
that you'd end up with something that would run with any kind of
acceptable performance, or with sane bounds on the amount of metadata
used.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- I am but mad north-north-west:  when the wind is southerly, I ---  
                       know a hawk from a handsaw.                       
