* RAID[56] status
@ 2009-08-06 10:17 David Woodhouse
2009-08-07 9:43 ` Roy Sigurd Karlsbakk
2009-11-10 19:51 ` RAID[56] status Dan Williams
0 siblings, 2 replies; 11+ messages in thread
From: David Woodhouse @ 2009-08-06 10:17 UTC (permalink / raw)
To: chris.mason; +Cc: linux-btrfs
If we've abandoned the idea of putting the number of redundant blocks
into the top bits of the type bitmask (and I hope we have), then we're
fairly much there. Current code is at:
git://, http://git.infradead.org/users/dwmw2/btrfs-raid56.git
git://, http://git.infradead.org/users/dwmw2/btrfs-progs-raid56.git
We have recovery working, as well as both full-stripe writes and a
temporary hack to allow smaller writes to work (with the 'write hole'
problem, of course). The main thing we need to do is ensure that we
_always_ do full-stripe writes, and then we can ditch the partial write
support.
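
To make the "full-stripe write" point concrete: in RAID5 the parity block of a
stripe is the XOR of its data blocks, so a write that covers the whole stripe
can compute parity from the new data alone, with no read-modify-write and
hence no window in which data and parity disagree (the write hole). The sketch
below is a minimal generic illustration of that idea, not the btrfs code; the
4 KiB block size and the helper name are arbitrary assumptions.

	#include <stdint.h>
	#include <stddef.h>
	#include <string.h>

	#define BLOCK_SIZE 4096		/* arbitrary, for illustration only */

	/*
	 * Compute the RAID5 parity block for one full stripe: parity is the
	 * byte-wise XOR of all data blocks.  Because every data block in the
	 * stripe is being written, no old data or old parity has to be read
	 * back first -- which is what makes full-stripe writes immune to the
	 * read-modify-write "write hole".
	 */
	static void raid5_full_stripe_parity(uint8_t *const data[], size_t ndata,
					     uint8_t parity[BLOCK_SIZE])
	{
		memset(parity, 0, BLOCK_SIZE);
		for (size_t d = 0; d < ndata; d++)
			for (size_t i = 0; i < BLOCK_SIZE; i++)
				parity[i] ^= data[d][i];
	}
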
I want to do a few other things, but AFAICT none of that needs to delay
the merge:
 - Better rebuild support -- if we lose a disk and add a replacement,
   we want to recreate only the contents of that disk, rather than
   allocating a new chunk elsewhere and then rewriting _everything_.

 - Support for more than 2 redundant blocks per stripe (RAID[789] or
   RAID6[³⁴⁵] or whatever we'll call it).

 - RAID[56789]0 support.

 - Clean up the discard support to do the right thing.
--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
* Re: RAID[56] status
  2009-08-06 10:17 RAID[56] status David Woodhouse
@ 2009-08-07  9:43 ` Roy Sigurd Karlsbakk
  2009-08-07 15:22   ` David Woodhouse
  2009-11-10 19:51 ` RAID[56] status Dan Williams
  1 sibling, 1 reply; 11+ messages in thread
From: Roy Sigurd Karlsbakk @ 2009-08-07 9:43 UTC (permalink / raw)
To: linux-btrfs

Hi

This is great. How does the current code handle corruption on a drive,
or two drives with RAID-6 in a stripe? Is the checksumming done per
drive or for the whole stripe?

roy

On 6. aug. 2009, at 12.17, David Woodhouse wrote:
> [...]

--
Roy Sigurd Karlsbakk
(+47) 97542685
roy@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for every pedagogue to
avoid excessive use of idioms of foreign origin. In most cases,
adequate and relevant synonyms exist in Norwegian.
* Re: RAID[56] status
  2009-08-07  9:43 ` Roy Sigurd Karlsbakk
@ 2009-08-07 15:22   ` David Woodhouse
  2009-09-02 16:32     ` [PATCH] don't OOPs when we are not raid56 jim owens
  0 siblings, 1 reply; 11+ messages in thread
From: David Woodhouse @ 2009-08-07 15:22 UTC (permalink / raw)
To: linux-btrfs

On Fri, 2009-08-07 at 11:43 +0200, Roy Sigurd Karlsbakk wrote:
> This is great. How does the current code handle corruption on a drive,
> or two drives with RAID-6 in a stripe? Is the checksumming done per
> drive or for the whole stripe?

http://git.infradead.org/users/dwmw2/btrfs-raid56.git

--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
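
The repository above is the authoritative answer to Roy's question; purely as
a generic illustration of the idea, per-block checksums let the reader of a
stripe decide which specific block is bad and rebuild only that block from the
surviving blocks plus parity. The sketch below shows that pattern for
single-failure XOR parity; the names, the verify callback, and the block size
are illustrative assumptions, not btrfs interfaces.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stddef.h>
	#include <string.h>

	#define BLOCK_SIZE 4096		/* illustrative block size */

	/* Caller-supplied verifier, e.g. a crc32c check against stored checksums. */
	typedef bool (*verify_fn)(const uint8_t *block, size_t index);

	/*
	 * Rebuild a single corrupt data block in a RAID5 stripe.  XOR of all
	 * the *other* data blocks with the parity block yields the bad one.
	 * Returns the index of the block that was repaired, or -1 if every
	 * block already passed its checksum.
	 */
	static int repair_one_block(uint8_t *data[], size_t ndata,
				    const uint8_t parity[BLOCK_SIZE],
				    verify_fn verify)
	{
		for (size_t bad = 0; bad < ndata; bad++) {
			if (verify(data[bad], bad))
				continue;	/* checksum OK, leave it alone */

			memcpy(data[bad], parity, BLOCK_SIZE);
			for (size_t d = 0; d < ndata; d++) {
				if (d == bad)
					continue;
				for (size_t i = 0; i < BLOCK_SIZE; i++)
					data[bad][i] ^= data[d][i];
			}
			return (int)bad;
		}
		return -1;
	}
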
* [PATCH] don't OOPs when we are not raid56
  2009-08-07 15:22 ` David Woodhouse
@ 2009-09-02 16:32   ` jim owens
  2009-09-08  9:15     ` David Woodhouse
  0 siblings, 1 reply; 11+ messages in thread
From: jim owens @ 2009-09-02 16:32 UTC (permalink / raw)
To: David Woodhouse; +Cc: linux-btrfs

David Woodhouse wrote:
> http://git.infradead.org/users/dwmw2/btrfs-raid56.git

Signed-off-by: jim owens <jowens@hp.com>
---
 fs/btrfs/volumes.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 95babc1..913c29f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2895,7 +2895,7 @@ again:
 		multi->num_stripes = num_stripes;
 		multi->max_errors = max_errors;
 	}
-	if (raid_map_ret) {
+	if (raid_map) {
 		sort_parity_stripes(multi, raid_map);
 		*raid_map_ret = raid_map;
 	}
--
1.5.6.3
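
The patch carries no changelog, so the following is an inferred reading rather
than the author's own description: raid_map_ret is the caller's output pointer
(non-NULL whenever the caller would like a parity map back), while the local
raid_map is presumably only allocated for RAID5/6 chunks. Testing the output
pointer therefore lets sort_parity_stripes() run with a NULL raid_map on a
non-RAID56 chunk and oops; testing the map itself avoids that. A simplified
sketch of the corrected branch, with the surrounding mapping code omitted:

	/*
	 * Only sort and hand back a parity map if one was actually built
	 * for this chunk.  raid_map is non-NULL only for RAID5/6 chunks;
	 * raid_map_ret merely says the caller wants the map if there is
	 * one (so, presumably, raid_map is never built without it and the
	 * store below is safe).
	 */
	if (raid_map) {			/* previously: if (raid_map_ret) */
		sort_parity_stripes(multi, raid_map);
		*raid_map_ret = raid_map;
	}
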
* Re: [PATCH] don't OOPs when we are not raid56
  2009-09-02 16:32 ` [PATCH] don't OOPs when we are not raid56 jim owens
@ 2009-09-08  9:15   ` David Woodhouse
  2009-09-08 13:48     ` Chris Mason
  0 siblings, 1 reply; 11+ messages in thread
From: David Woodhouse @ 2009-09-08 9:15 UTC (permalink / raw)
To: jim owens; +Cc: linux-btrfs

On Wed, 2009-09-02 at 12:32 -0400, jim owens wrote:
> @@ -2895,7 +2895,7 @@ again:
> 		multi->num_stripes = num_stripes;
> 		multi->max_errors = max_errors;
> 	}
> -	if (raid_map_ret) {
> +	if (raid_map) {
> 		sort_parity_stripes(multi, raid_map);
> 		*raid_map_ret = raid_map;
> 	}

Applied (manually, because I think your mail was whitespace-damaged).
Thanks.

Chris, where do we stand with getting this merged? You were going to
sort out the upper layers to handle the minimum write size, weren't you?

I'm also going to do RAID50/60 support, and with hpa's help I'll extend
it to do RAID7/70 too -- but you're not waiting for that, are you?

--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
* Re: [PATCH] don't OOPs when we are not raid56
  2009-09-08  9:15 ` David Woodhouse
@ 2009-09-08 13:48   ` Chris Mason
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Mason @ 2009-09-08 13:48 UTC (permalink / raw)
To: David Woodhouse; +Cc: jim owens, linux-btrfs

On Tue, Sep 08, 2009 at 10:15:29AM +0100, David Woodhouse wrote:
> Applied (manually, because I think your mail was whitespace-damaged).
> Thanks.
>
> Chris, where do we stand with getting this merged? You were going to
> sort out the upper layers to handle the minimum write size, weren't you?

Yes, I'm working on sorting that out.  Jens distracted me with some
depressing benchmarks, but now that those are fixed I can move on ;)

> I'm also going to do RAID50/60 support, and with hpa's help I'll extend
> it to do RAID7/70 too -- but you're not waiting for that, are you?

Great news, no I'm not waiting for that.

-chris
* Re: RAID[56] status
  2009-08-06 10:17 RAID[56] status David Woodhouse
  2009-08-07  9:43 ` Roy Sigurd Karlsbakk
@ 2009-11-10 19:51 ` RAID[56] status Dan Williams
  2009-11-10 20:05   ` Tomasz Torcz
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Dan Williams @ 2009-11-10 19:51 UTC (permalink / raw)
To: David Woodhouse; +Cc: chris.mason, linux-btrfs, NeilBrown

On Thu, Aug 6, 2009 at 3:17 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> [...]

A few comments/questions from the brief look I had at this:

1/ The btrfs_multi_bio struct bears a resemblance to the md
stripe_head struct, to the point where it makes me wonder if the
generic raid functionality could be shared between md and btrfs via a
common 'libraid'.  I hope to follow up this wondering with code, but
wanted to get the question out in the open lest someone else already
determined it was a non-starter.

2/ I question why subvolumes are actively avoiding the device model.
They are in essence virtual block devices with different lifetime
rules specific to btrfs.  The current behavior of specifying all
members on the mount command line eliminates the ability to query,
via sysfs, whether a btrfs subvolume is degraded/failed, or to
assemble the subvolume(s) prior to activating the filesystem.  One
scenario that comes to mind is handling a 4-disk btrfs filesystem
with both raid10 and raid6 subvolumes.  Depending on the device
discovery order the user may be able to start all subvolumes in the
filesystem in degraded mode once the right two disks are available,
or maybe it's ok to start the raid6 subvolume early even if that
means the raid10 is failed.

Basically, the current model precludes those possibilities and mimics
the dmraid "assume all members are available, auto-assemble everything
at once, and hide virtual block device details from sysfs" model.

3/ The md-raid6 recovery code assumes that there are always at least
two good blocks to perform recovery.  That makes the current minimum
number of raid6 members 4, not 3.  (Small nit: the btrfs code calls
members 'stripes'; in md a stripe of data is a collection of blocks
from all members.)

4/ A small issue: there appears to be no way to specify different
raid10/5/6 data layouts; maybe I missed it.  See the --layout option
to mdadm.  It appears the only layout option is the raid level.

Regards,
Dan
* Re: RAID[56] status
  2009-11-10 19:51 ` RAID[56] status Dan Williams
@ 2009-11-10 20:05   ` Tomasz Torcz
  2009-11-10 20:11   ` Chris Mason
  2009-11-10 21:06   ` tsuraan
  2 siblings, 0 replies; 11+ messages in thread
From: Tomasz Torcz @ 2009-11-10 20:05 UTC (permalink / raw)
To: linux-btrfs

On Tue, Nov 10, 2009 at 12:51:06PM -0700, Dan Williams wrote:
> 4/ A small issue: there appears to be no way to specify different
> raid10/5/6 data layouts; maybe I missed it.  See the --layout option
> to mdadm.  It appears the only layout option is the raid level.

Is this really important?  In all my experience, mdadm is the only
place where I have ever been asked about RAID layout.  No other RAID
system known to me exposes such a design decision to the user.  Why
would the user need to bother with such a detail?

--
Tomasz Torcz                That which is unreal -- here it is normal.
xmpp: zdzichubg@chrome.pl   Homies here hold special patents on life.
* Re: RAID[56] status
  2009-11-10 19:51 ` RAID[56] status Dan Williams
  2009-11-10 20:05   ` Tomasz Torcz
@ 2009-11-10 20:11   ` Chris Mason
  2009-11-10 21:06   ` tsuraan
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Mason @ 2009-11-10 20:11 UTC (permalink / raw)
To: Dan Williams; +Cc: David Woodhouse, linux-btrfs, NeilBrown

On Tue, Nov 10, 2009 at 12:51:06PM -0700, Dan Williams wrote:
> On Thu, Aug 6, 2009 at 3:17 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> > [...]
>
> A few comments/questions from the brief look I had at this:
>
> 1/ The btrfs_multi_bio struct bears a resemblance to the md
> stripe_head struct, to the point where it makes me wonder if the
> generic raid functionality could be shared between md and btrfs via a
> common 'libraid'.  I hope to follow up this wondering with code, but
> wanted to get the question out in the open lest someone else already
> determined it was a non-starter.

I'm not opposed to this, but I expect things are different enough in the
guts of the implementations to make it awkward.  It would be nice to
factor out the parts that split a bio up and send it down to the lower
devices, which is something that btrfs doesn't currently do in its
raid1,0,10 code.

> 2/ I question why subvolumes are actively avoiding the device model.
> They are in essence virtual block devices with different lifetime
> rules specific to btrfs.  The current behavior of specifying all
> members on the mount command line eliminates the ability to query,
> via sysfs, whether a btrfs subvolume is degraded/failed, or to
> assemble the subvolume(s) prior to activating the filesystem.

Today we have an ioctl to scan for btrfs devices and assemble the FS
prior to activating it.  There is also code Kay Sievers has been working
on to integrate the scanning into udev and sysfs.  A later version of
the btrfs code will just assemble based on what udev has already scanned
for us.

Subvolumes aren't quite virtual block devices because they share
storage, and in the case of snapshots or clones they can share
individual blocks.

> One scenario that comes to mind is handling a 4-disk btrfs filesystem
> with both raid10 and raid6 subvolumes.  Depending on the device
> discovery order the user may be able to start all subvolumes in the
> filesystem in degraded mode once the right two disks are available, or
> maybe it's ok to start the raid6 subvolume early even if that means
> the raid10 is failed.
>
> Basically, the current model precludes those possibilities and mimics
> the dmraid "assume all members are available, auto-assemble everything
> at once, and hide virtual block device details from sysfs" model.

From a btrfs point of view the FS will mount as long as the metadata
required is there.  Some day the subvolumes will have the ability to
store different raid profiles for different subvolumes, but that doesn't
happen right now (just the metadata vs data split).

> 3/ The md-raid6 recovery code assumes that there are always at least
> two good blocks to perform recovery.  That makes the current minimum
> number of raid6 members 4, not 3.  (Small nit: the btrfs code calls
> members 'stripes'; in md a stripe of data is a collection of blocks
> from all members.)
>
> 4/ A small issue: there appears to be no way to specify different
> raid10/5/6 data layouts; maybe I missed it.  See the --layout option
> to mdadm.  It appears the only layout option is the raid level.

Correct, we're not as flexible as we could be right now.

-chris
* Re: RAID[56] status
  2009-11-10 19:51 ` RAID[56] status Dan Williams
  2009-11-10 20:05   ` Tomasz Torcz
  2009-11-10 20:11   ` Chris Mason
@ 2009-11-10 21:06   ` tsuraan
  2009-11-10 21:20     ` Gregory Maxwell
  2 siblings, 1 reply; 11+ messages in thread
From: tsuraan @ 2009-11-10 21:06 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-btrfs

> 3/ The md-raid6 recovery code assumes that there are always at least
> two good blocks to perform recovery.  That makes the current minimum
> number of raid6 members 4, not 3.  (Small nit: the btrfs code calls
> members 'stripes'; in md a stripe of data is a collection of blocks
> from all members.)

Why would you use RAID6 on three drives instead of mirroring across
all of them?  I agree it's an artificial limitation, but would anybody
use a RAID6 with fewer than 4 drives?
* Re: RAID[56] status
  2009-11-10 21:06 ` tsuraan
@ 2009-11-10 21:20   ` Gregory Maxwell
  0 siblings, 0 replies; 11+ messages in thread
From: Gregory Maxwell @ 2009-11-10 21:20 UTC (permalink / raw)
To: tsuraan; +Cc: Dan Williams, linux-btrfs

On Tue, Nov 10, 2009 at 4:06 PM, tsuraan <tsuraan@gmail.com> wrote:
>> 3/ The md-raid6 recovery code assumes that there are always at least
>> two good blocks to perform recovery.  That makes the current minimum
>> number of raid6 members 4, not 3.  (Small nit: the btrfs code calls
>> members 'stripes'; in md a stripe of data is a collection of blocks
>> from all members.)
>
> Why would you use RAID6 on three drives instead of mirroring across
> all of them?  I agree it's an artificial limitation, but would anybody
> use a RAID6 with fewer than 4 drives?

Here is some text I wrote on a local linux-users-group list a few
months ago, on a thread talking about the cost/reliability trade-off on
small arrays.  (It doesn't seem to be in a public archive.)

Let's also consider another configuration:

Raid 0: 4 * 1TB WD RE3s = $640; 4TB; $0.160/GB

The WD1002FBYS (1TB WD RE3) has a spec MTBF of 1.2 million hours.
Let's assume a mean time to replace for each drive of 72 hours, which I
think is a reasonably prompt response for a disk at home.

Raid 0
1.2million hours / 4 = 34.22313483 yrs MTBF
$4.675/TB/MTBF_YEAR

Raid 5
1.2million_hrs * (1.2million_hrs / (4*3*72)) = 190,128 yrs MTBF
$0.00112/TB/MTBF_YEAR

Raid 0+1
(1.2million_hrs * 1.2million_hrs / (2*72)) / 2 = 570,386 yrs MTBF
$0.00056102/TB/MTBF_YEAR

Raid 6
1.2million_hrs * 1.2million_hrs * 1.2million_hrs / (4*3*2*72*72)
  = 1,584,404,390 yrs MTBF
$0.00000020/TB/MTBF_YEAR
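
For anyone who wants to re-run or tweak these figures, here is a small
stand-alone sketch that reproduces the back-of-the-envelope arithmetic above
(per-drive MTBF of 1.2 million hours, a 72-hour mean time to replace, four
drives, 8766 hours per year). The series/parallel approximations mirror the
formulas in the mail; they are rough estimates, not a rigorous reliability
model, and ignore correlated failures and unrecoverable read errors.

	#include <stdio.h>

	int main(void)
	{
		const double mtbf = 1.2e6;	/* per-drive MTBF, hours */
		const double mttr = 72.0;	/* mean time to replace, hours */
		const int    n    = 4;		/* drives in the array */
		const double hpy  = 8766.0;	/* hours per year */

		/* RAID0: any single drive failure kills the array. */
		double raid0  = mtbf / n;

		/* RAID5: a first failure, then a second among the n-1
		 * survivors within the repair window. */
		double raid5  = (mtbf / n) * (mtbf / ((n - 1) * mttr));

		/* RAID0+1 (two 2-disk mirrors striped): the rough pairwise
		 * estimate used in the mail above. */
		double raid01 = (mtbf * mtbf / (2 * mttr)) / 2;

		/* RAID6: three failures must pile up inside overlapping
		 * repair windows. */
		double raid6  = (mtbf / n) * (mtbf / ((n - 1) * mttr))
					   * (mtbf / ((n - 2) * mttr));

		printf("RAID0  : %16.2f years MTBF\n", raid0  / hpy);
		printf("RAID5  : %16.2f years MTBF\n", raid5  / hpy);
		printf("RAID0+1: %16.2f years MTBF\n", raid01 / hpy);
		printf("RAID6  : %16.2f years MTBF\n", raid6  / hpy);
		return 0;
	}
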
end of thread, other threads: [~2009-11-10 21:20 UTC | newest]

Thread overview: 11+ messages
2009-08-06 10:17 RAID[56] status David Woodhouse
2009-08-07  9:43 ` Roy Sigurd Karlsbakk
2009-08-07 15:22   ` David Woodhouse
2009-09-02 16:32     ` [PATCH] don't OOPs when we are not raid56 jim owens
2009-09-08  9:15       ` David Woodhouse
2009-09-08 13:48         ` Chris Mason
2009-11-10 19:51 ` RAID[56] status Dan Williams
2009-11-10 20:05   ` Tomasz Torcz
2009-11-10 20:11   ` Chris Mason
2009-11-10 21:06   ` tsuraan
2009-11-10 21:20     ` Gregory Maxwell