Re: Mis-Design of Btrfs?


From: Ric Wheeler <rwheeler@redhat.com>
To: NeilBrown <neilb@suse.de>
Cc: Nico Schottelius <nico-lkml-20110623@schottelius.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Chris Mason <chris.mason@oracle.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Alasdair G Kergon <agk@redhat.com>
Subject: Re: Mis-Design of Btrfs?
Date: Thu, 14 Jul 2011 07:02:22 +0100
Message-ID: <4E1E866E.2050405@redhat.com>
In-Reply-To: <20110714155620.6e9ac2cc@notabene.brown>

On 07/14/2011 06:56 AM, NeilBrown wrote:
> On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler<rwheeler@redhat.com>  wrote:
>
>> On 06/27/2011 07:46 AM, NeilBrown wrote:
>>> On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
>>> <nico-lkml-20110623@schottelius.org>   wrote:
>>>
>>>> Good morning devs,
>>>>
>>>> I'm wondering whether the raid- and volume-management-builtin of btrfs is
>>>> actually a sane idea or not.
>>>> Currently we do have md/device-mapper support for raid
>>>> already, btrfs lacks raid5 support and re-implements stuff that
>>>> has already been done.
>>>>
>>>> I'm aware of the fact that it is very useful to know on which devices
>>>> we are in a filesystem. But I'm wondering, whether it wouldn't be
>>>> smarter to generalise the information exposure through the VFS layer
>>>> instead of replicating functionality:
>>>>
>>>> Physical:   USB-HD   SSD   USB-Flash          | Exposes information to
>>>> Raid:       Raid1, Raid5, Raid10, etc.        | higher levels
>>>> Crypto:     Luks                              |
>>>> LVM:        Groups/Volumes                    |
>>>> FS:         xfs/jfs/reiser/ext3               v
>>>>
>>>> Thus a filesystem like ext3 could be aware that it is running
>>>> on a USB HD and enable -o sync by default, or rewrite blocks
>>>> when running on crypto, or optimise for an SSD, ...
>>> I would certainly agree that exposing information to higher levels is a good
>>> idea.  To some extent we do.  But it isn't always as easy as it might sound.
>>> Choosing exactly what information to expose is the challenge.  If you lack
>>> sufficient foresight you might expose something which turns out to be
>>> very specific to just one device, so all those upper levels which make use of
>>> the information find they are really special-casing one specific device,
>>> which isn't a good idea.
>>>
>>>
>>> However it doesn't follow that RAID5 should not be implemented in BTRFS.
>>> The levels that you have drawn are just one perspective.  While that has
>>> value, it may not be universal.
>>> I could easily argue that the LVM layer is a mistake and that filesystems
>>> should provide that functionality directly.
>>> I could almost argue the same for crypto.
>>> RAID1 can make a lot of sense to be tightly integrated with the FS.
>>> RAID5 ... I'm less convinced, but then I have a vested interest there so that
>>> isn't an objective assessment.
>>>
>>> Part of "the way Linux works" is that s/he who writes the code gets to make
>>> the design decisions.   The BTRFS developers might create something truly
>>> awesome, or might end up having to support a RAID feature that they
>>> subsequently think is a bad idea.  But it really is their decision to make.
>>>
>>> NeilBrown
>>>
>> One more thing to add here is that I think that we still have a chance to
>> increase the sharing between btrfs and the MD stack if we can get those changes
>> made. No one likes to duplicate code, but we will need a richer interface
>> between the block and file system layer to help close that gap.
>>
>> Ric
>>
> I'm certainly open to suggestions and collaboration.  Do you have in mind any
> particular way to make the interface richer?
>
> NeilBrown

Hi Neil,

I know that Chris has a very specific set of use cases for btrfs, and I think
that Alasdair and others have started to look at what is doable.

The obvious use case is the following:

If a file system uses checksumming or other data corruption detection bits, it
can detect that it got bad data back on a read. If that data was protected by
RAID, it would like to ask the block layer to try to read from another mirror
(for raid1) or try to validate/rebuild from parity.

Today, I think that a retry has no way to pick the copy: it basically just
gives us a random chance of being served from a different mirror, or from the
same one that returned the data on the first go.
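
To make the request concrete, here is a rough user-space sketch of the
behaviour a file system would want, assuming it could address each mirror
individually. Everything in it is hypothetical: the Mirror class, the
read_verified() helper and the use of zlib.crc32 as the checksum are stand-ins
for illustration, not an existing block layer or btrfs interface.

```python
import zlib
from typing import List, Optional


class Mirror:
    """One copy of the data in a simulated RAID1 set (hypothetical model)."""

    def __init__(self, blocks: List[bytes]):
        self.blocks = list(blocks)

    def read(self, block_no: int) -> bytes:
        return self.blocks[block_no]


def read_verified(mirrors: List[Mirror], block_no: int,
                  expected_crc: int) -> Optional[bytes]:
    """Read a block, choosing the mirror explicitly on each attempt.

    Each mirror is tried at most once, in order, so a retry never goes
    back to a copy that already failed verification. Returns None if no
    copy matches the checksum the file system stored for this block.
    """
    for mirror in mirrors:
        data = mirror.read(block_no)
        if zlib.crc32(data) == expected_crc:
            return data
    return None


good = b"hello btrfs"
crc = zlib.crc32(good)  # checksum recorded by the file system at write time

# The first mirror holds a silently corrupted copy, the second a good one.
mirrors = [Mirror([b"hellX btrfs"]), Mirror([good])]
result = read_verified(mirrors, 0, crc)
```

The point of the sketch is only that the caller chooses which copy to read
next; with the interface we have today that choice is left to the RAID layer.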

Chris, Alasdair, was that a good summary of one concern?

Thanks!

Ric
