Re: Mis-Design of Btrfs?

From: NeilBrown <neilb@suse.de>
To: Ric Wheeler <rwheeler@redhat.com>
Cc: Nico Schottelius <nico-lkml-20110623@schottelius.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Chris Mason <chris.mason@oracle.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Alasdair G Kergon <agk@redhat.com>
Subject: Re: Mis-Design of Btrfs?
Date: Thu, 14 Jul 2011 16:38:36 +1000
Message-ID: <20110714163836.35a729c1@notabene.brown>
In-Reply-To: <4E1E866E.2050405@redhat.com>

On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler <rwheeler@redhat.com> wrote:

> > I'm certainly open to suggestions and collaboration.  Do you have in mind any
> > particular way to make the interface richer??
> >
> > NeilBrown
> 
> Hi Neil,
> 
> I know that Chris has a very specific set of use cases for btrfs and think that 
> Alasdair and others have started to look at what is doable.
> 
> The obvious use case is the following:
> 
> If a file system uses checksumming or other data corruption detection bits, it
> can detect that it got bad data on a read. If that data was protected by RAID,
> it would like to ask the block layer to try to read from another mirror (for 
> raid1) or try to validate/rebuild from parity.
> 
> Today, I think that a retry will basically just give us back a random chance of 
> getting data from a different mirror or the same one that we got data from on 
> the first go.
> 
> Chris, Alasdair, was that a good summary of one concern?
> 
> Thanks!
> 
> Ric

I imagine a new field in 'struct bio' which is normally zero but could be
some small integer.  It is only meaningful for reads.
When 0 it means "get this data any way you like".
When non-zero it means "get this data using method N", where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means "use other data blocks and parity
blocks to reconstruct data".

The default for non-RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.
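
To make the numbering concrete, here is a rough userspace sketch of how a
two-way mirror and a plain device might interpret the method number.  It is
not kernel code: 'struct mirror_set', mirror_read() and plain_read() are
hypothetical names invented for illustration, and the "field" is just a
function argument.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define BLOCK_SIZE 8
#define NR_MIRRORS 2

/* Hypothetical in-memory model of a two-way mirror (not a kernel type). */
struct mirror_set {
	unsigned char copy[NR_MIRRORS][BLOCK_SIZE];
};

/*
 * method 0: the device picks a copy itself (here: always the first).
 * method N: read from copy N-1.
 * method beyond the last copy: -EINVAL.
 */
static int mirror_read(const struct mirror_set *ms, unsigned int method,
		       unsigned char *out)
{
	unsigned int idx;

	assert(ms && out);
	if (method == 0)
		idx = 0;
	else if (method <= NR_MIRRORS)
		idx = method - 1;
	else
		return -EINVAL;

	memcpy(out, ms->copy[idx], BLOCK_SIZE);
	return 0;
}

/* A non-RAID device has only one way to read: any N > 0 is -EINVAL. */
static int plain_read(const unsigned char *data, unsigned int method,
		      unsigned char *out)
{
	assert(data && out);
	if (method > 0)
		return -EINVAL;

	memcpy(out, data, BLOCK_SIZE);
	return 0;
}
```

A pass-through target would simply call the lower device's read with the
same method value.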

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means "I cannot work with that number, but try the next one".
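
The retry loop described above could look roughly like the following
userspace sketch.  Everything here is an assumption made for illustration:
the 'read_fn' callback stands in for submitting a read bio carrying the
method number, toy_csum() is a stand-in for a real checksum, MAX_METHOD is
an arbitrary safety cap, and 'struct toy_mirror' exists only so the loop has
something to read from.  Errors other than EINVAL and EAGAIN are simply
passed up, which is a choice the proposal does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8
#define MAX_METHOD 16	/* hypothetical cap so the loop always terminates */

/* Hypothetical stand-in for "submit a read with this method number". */
typedef int (*read_fn)(void *dev, unsigned int method, unsigned char *out);

/* Toy checksum (byte sum); a real file system would use crc32c or similar. */
static uint32_t toy_csum(const unsigned char *buf)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < BLOCK_SIZE; i++)
		sum += buf[i];
	return sum;
}

/*
 * Read with method 0, then 1, 2, ... until the checksum matches.
 * -EAGAIN: this number is unusable, try the next one.
 * -EINVAL: no more methods, so give up with -EIO.
 */
static int read_verified(void *dev, read_fn rd, uint32_t want,
			 unsigned char *out)
{
	assert(rd && out);
	for (unsigned int method = 0; method <= MAX_METHOD; method++) {
		int ret = rd(dev, method, out);

		if (ret == -EAGAIN)
			continue;
		if (ret == -EINVAL)
			return -EIO;
		if (ret < 0)
			return ret;
		if (toy_csum(out) == want)
			return 0;
	}
	return -EIO;
}

/* Toy two-way mirror used only to exercise the loop. */
struct toy_mirror {
	unsigned char copy[2][BLOCK_SIZE];
};

static int toy_mirror_read(void *dev, unsigned int method, unsigned char *out)
{
	const struct toy_mirror *m = dev;

	if (method > 2)
		return -EINVAL;
	memcpy(out, m->copy[method ? method - 1 : 0], BLOCK_SIZE);
	return 0;
}
```

With a corrupted first copy and a good second copy, read_verified() fails
the checksum for methods 0 and 1 and succeeds with method 2.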

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...
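
As one possible illustration of why a reported maximum could help stacking,
here is a sketch of a mirror that splits its own method numbers across its
children.  This numbering scheme is purely an assumption for the sake of
example, not something worked out above: each child gets (max + 1) slots,
covering its method 0 plus its methods 1..max.  'struct child',
mirror_max_method() and mirror_map_method() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical: the largest non-zero method a child reports (0 = plain). */
struct child {
	unsigned int max_method;
};

/* The mirror's own maximum: one slot per way of reading each child. */
static unsigned int mirror_max_method(const struct child *kids, size_t n)
{
	unsigned int total = 0;

	assert(kids);
	for (size_t i = 0; i < n; i++)
		total += kids[i].max_method + 1;
	return total;
}

/*
 * Translate a non-zero top-level method into (child index, child method).
 * Returns 0 on success or -EINVAL if the number is out of range.
 */
static int mirror_map_method(const struct child *kids, size_t n,
			     unsigned int method, size_t *child,
			     unsigned int *child_method)
{
	unsigned int base = 0;

	assert(kids && child && child_method);
	if (method == 0)
		return -EINVAL;	/* 0 means "any way"; nothing to map */

	for (size_t i = 0; i < n; i++) {
		unsigned int slots = kids[i].max_method + 1;

		if (method <= base + slots) {
			*child = i;
			*child_method = method - base - 1;
			return 0;
		}
		base += slots;
	}
	return -EINVAL;
}
```

For plain children this reduces to "method N reads device N-1".  For RAID1
over two RAID5 arrays that each report a maximum of 1, it yields four
methods: a normal read and a parity reconstruction on each leg.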

NeilBrown
