From: Chris Mason <chris.mason@oracle.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	adilger@sun.com, sfr@canb.auug.org.au,
	linux-kernel@vger.kernel.org, linux-btrfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: Notes on support for multiple devices for a single filesystem
Date: Wed, 17 Dec 2008 15:58:05 -0500
Message-ID: <1229547485.27170.77.camel@think.oraclecorp.com>
In-Reply-To: <20081217115325.3312858a.akpm@linux-foundation.org>

On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:
> On Wed, 17 Dec 2008 08:23:44 -0500
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > FYI: here's a little writeup I did this summer on support for
> > filesystems spanning multiple block devices:
> > 
> > 
> > -- 
> > 
> > === Notes on support for multiple devices for a single filesystem ===
> > 
> > == Intro ==
> > 
> > Btrfs (and an experimental XFS version) can support multiple underlying block
> > devices for a single filesystem instance in a generalized and flexible way.
> > 
> > Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and
> > the special real-time device in XFS, all data and metadata may be spread over a
> > potentially large number of block devices, and not just one (or two).
> > 
> > 
> > == Requirements ==
> > 
> > We want a scheme to support these complex filesystem topologies in a way
> > that is
> > 
> >  a) easy to set up and non-fragile for the users
> >  b) scalable to a large number of disks in the system
> >  c) recoverable without requiring user space running first
> >  d) generic enough to work for multiple filesystems or other consumers
> > 
> > Requirement a) means that a multiple-device filesystem should be mountable
> > by a simple fstab entry (UUID/LABEL or some other cookie) which continues
> > to work when the filesystem topology changes.
> 
> "device topology"?
> 
> > Requirement b) implies we must not do a scan over all available block devices
> > in large systems, but use an event-based callout on detection of new block
> > devices.
> > 
> > Requirement c) means there must be some way to add devices to a filesystem
> > by kernel command lines, even if this is not the default way, and might require
> > additional knowledge from the user / system administrator.
> > 
> > Requirement d) means that we should not implement this mechanism inside a
> > single filesystem.
> > 
> 
> One thing I've never seen comprehensively addressed is: why do this in
> the filesystem at all?  Why not let MD take care of all this and
> present a single block device to the fs layer?
> 
> Lots of filesystems are violating this, and I'm sure the reasons for
> this are good, but this document seems like a suitable place in which to
> briefly describe those reasons.

I'd almost rather see this doc stick to the device topology interface in
hopes of describing something that RAID and MD can use too.  But just to
toss some information into the pool:

* When moving data around (raid rebuild, restripe, pvmove etc), we want
to make sure the data read off the disk is correct before writing it to
the new location (checksum verification).

* When moving data around, we don't want to move data that isn't
actually used by the filesystem.  This could be solved via new APIs, but
keeping it crash safe would be very tricky.

* When checksum verification fails on read, the FS should be able to ask
the raid implementation for another copy.  This could be solved via new
APIs.

* Different parts of the filesystem might want different underlying raid
parameters.  The easiest example is metadata vs data, where a 4k
stripesize for data might be a bad idea and a 64k stripesize for
metadata would result in many more read-modify-write (rmw) cycles.

* Sharing the filesystem transaction layer.  LVM and MD have to pretend
they are a single consistent array of bytes all the time, for each and
every write they return as complete to the FS.
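
As a rough illustration of the stripesize point above, here is a minimal
sketch. It is not code from btrfs or MD. It assumes a parity raid layout in
which any write that does not cover whole stripes forces a
read-modify-write cycle. The function names, the 16k metadata block size,
and the four data disks are all hypothetical values picked for the example.

```python
def needs_rmw(offset, length, stripe_unit, data_disks):
    """Return True if a write covers some stripe only partially.

    Assumption: parity is computed over a full stripe of
    stripe_unit * data_disks bytes, so a write that is not aligned to
    and a multiple of that size has to read old data or parity first.
    """
    full_stripe = stripe_unit * data_disks
    return offset % full_stripe != 0 or length % full_stripe != 0


def count_rmw(writes, stripe_unit, data_disks):
    """Count how many (offset, length) writes need a read-modify-write."""
    return sum(
        1 for offset, length in writes
        if needs_rmw(offset, length, stripe_unit, data_disks)
    )


# Eight hypothetical 16k metadata block writes at 16k-aligned offsets.
metadata_writes = [(i * 16384, 16384) for i in range(8)]

# 4k stripe unit over 4 data disks: each 16k write is one full stripe.
small_stripe_rmw = count_rmw(metadata_writes, 4096, 4)    # 0

# 64k stripe unit over 4 data disks: each 16k write is a partial stripe.
large_stripe_rmw = count_rmw(metadata_writes, 65536, 4)   # 8
```

Under these assumptions the small metadata writes are all full-stripe with
a 4k stripesize and all partial-stripe with a 64k one, which is the extra
rmw cost described in the bullet on raid parameters.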

By pushing the multiple device support up into the filesystem, I can
share the filesystem's transaction layer.  Work can be done in larger
atomic units, and the filesystem will stay consistent because it is all
coordinated.
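
To make the checksum and "only move used data" points above concrete, here
is a minimal sketch, not the btrfs implementation. Devices are modelled as
plain dicts mapping block number to bytes, CRC32 from zlib stands in for
whatever checksum the filesystem stores, and the names read_verified and
relocate are hypothetical.

```python
import zlib


class ChecksumError(Exception):
    """Raised when no copy of a block matches its stored checksum."""


def read_verified(mirrors, block_nr, expected_crc):
    """Return the first copy of block_nr whose CRC32 matches expected_crc.

    mirrors is a list of dicts (block number -> bytes), one per copy.
    A copy that is missing or fails verification is skipped and the
    next copy is tried, which is the "ask for another copy" step.
    """
    for dev in mirrors:
        data = dev.get(block_nr)
        if data is not None and zlib.crc32(data) == expected_crc:
            return data
    raise ChecksumError(f"no valid copy of block {block_nr}")


def relocate(mirrors, used_blocks, checksums, target):
    """Copy only the blocks the filesystem uses, verifying each one
    before it is written to the new location."""
    for block_nr in sorted(used_blocks):
        target[block_nr] = read_verified(
            mirrors, block_nr, checksums[block_nr]
        )


# The first copy has a corrupted block 1 and stale, unused data in block 2.
bad = {0: b"meta", 1: b"dXta", 2: b"junk"}
good = {0: b"meta", 1: b"data"}
checksums = {0: zlib.crc32(b"meta"), 1: zlib.crc32(b"data")}

new_device = {}
relocate([bad, good], {0, 1}, checksums, new_device)
# new_device now holds verified copies of blocks 0 and 1 only.
```

The sketch only works because the caller knows both the checksums and
which blocks are in use. A block layer sitting under the filesystem has
neither without new APIs, which is the argument made in the bullets.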

There are other bits and pieces like high speed front end caching
devices that would be difficult in MD/LVM, but since I don't have that
coded yet I suppose they don't really count...

-chris


