RAID1 3+ drives

* RAID1 3+ drives
@ 2014-06-28  0:30 Zack Coffey
  2014-06-28  0:51 ` Russell Coker
  0 siblings, 1 reply; 11+ messages in thread
From: Zack Coffey @ 2014-06-28  0:30 UTC (permalink / raw)
  To: linux-btrfs

Can I get more protection by using more than 2 drives?

I had an onboard RAID a few years back that would let me use RAID1
across up to 4 drives.

Apologies if this has been covered already, I don't recall seeing
anything saying yay or nay.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RAID1 3+ drives
  2014-06-28  0:30 RAID1 3+ drives Zack Coffey
@ 2014-06-28  0:51 ` Russell Coker
  2014-06-28  4:26   ` Duncan
  0 siblings, 1 reply; 11+ messages in thread
From: Russell Coker @ 2014-06-28  0:51 UTC (permalink / raw)
  To: Zack Coffey; +Cc: linux-btrfs

On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> Can I get more protection by using more than 2 drives?
> 
> I had an onboard RAID a few years back that would let me use RAID1
> across up to 4 drives.

Currently the only RAID level that fully works in BTRFS is RAID-1 with data on 
2 disks.  If you have 4 disks in the array then each block will be on 2 of the 
disks.  RAID-5/6 code mostly works but the last report I read indicated that 
some situations for recovery and disk replacement didn't work - presumably 
anyone who's afraid of multiple disks failing isn't going to want to trust 
BTRFS RAID-6 code at the moment.

If you want to have 4 disks in a fully redundant configuration (IE you could 
lose 3 disks without losing any data) then the thing to do is to have 2 RAID-1 
arrays with Linux software RAID and then run BTRFS RAID-1 on top of that.
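
As a rough illustration of the two layouts being compared, the sketch below
counts how many whole-disk failures each one can always absorb.  It is a toy
model, not anything btrfs or md actually runs: the disk names and the function
names are made up, and it assumes that btrfs raid1 chunk pairs eventually land
on every combination of two disks.

```python
from itertools import combinations

DISKS = ("a", "b", "c", "d")  # hypothetical disk names


def btrfs_raid1_survives(failed):
    # Toy model: btrfs raid1 keeps 2 copies of each chunk, and over time
    # some chunk ends up on every possible pair of disks (an assumption).
    # Data is lost as soon as both disks of any pair have failed.
    return all(not set(pair) <= failed for pair in combinations(DISKS, 2))


def layered_survives(failed):
    # Toy model: md0 = a+b and md1 = c+d are md RAID-1 mirrors, with btrfs
    # raid1 across md0 and md1.  A mirror survives while one member lives,
    # and btrfs needs at least one surviving side.
    md0_alive = not {"a", "b"} <= failed
    md1_alive = not {"c", "d"} <= failed
    return md0_alive or md1_alive


def max_failures_tolerated(survives):
    # Largest k such that every possible k-disk failure is survivable.
    tolerated = 0
    for k in range(1, len(DISKS) + 1):
        if all(survives(set(f)) for f in combinations(DISKS, k)):
            tolerated = k
        else:
            break
    return tolerated


print(max_failures_tolerated(btrfs_raid1_survives))  # 1
print(max_failures_tolerated(layered_survives))      # 3
```

This only models whole-disk loss; it says nothing about silent corruption.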

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

* Re: RAID1 3+ drives
  2014-06-28  0:51 ` Russell Coker
@ 2014-06-28  4:26   ` Duncan
  2014-06-28  6:28     ` Russell Coker
  2014-06-28 10:13     ` Roman Mamedov
  0 siblings, 2 replies; 11+ messages in thread
From: Duncan @ 2014-06-28  4:26 UTC (permalink / raw)
  To: linux-btrfs

Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:

> On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
>> Can I get more protection by using more than 2 drives?
>> 
>> I had an onboard RAID a few years back that would let me use RAID1
>> across up to 4 drives.
>> 
> Currently the only RAID level that fully works in BTRFS is RAID-1 with
> data on 2 disks.

Not /quite/ correct.  Raid0 works, but of course that isn't exactly 
"RAID" as it's not "redundant".  And raid10 works.  But that's simply 
raid0 over raid1.  So depending on whether you consider raid0 actually 
"RAID" or not, which in turn depends on how strict you are with the 
"redundant" part, there is or is not more than btrfs raid1 working.

> If you have 4 disks in the array then each block will
> be on 2 of the disks.

Correct.

FWIW I'm told that the paper that laid out the original definition of 
RAID (which was linked on this list in a similar discussion some months 
ago) defined RAID-1 as paired redundancy, no matter the number of 
devices.  Various implementations (including Linux' own mdraid soft-raid, 
and I believe dmraid as well) feature multi-way-mirroring aka N-way-
mirroring such that N devices equals N way mirroring, but that's an 
implementation extension and isn't actually necessary to claim RAID-1 
support.

So look for N-way-mirroring when you go RAID shopping, and no, btrfs does 
not have it at this time, altho it is roadmapped for implementation after 
completion of the raid5/6 code.

FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for 
device redundancy, but to take full advantage of btrfs data integrity 
features, allowing btrfs to "scrub" a checksum-mismatch copy with the content 
of a checksum-validated copy if available.  That's currently possible, 
but due to the pair-mirroring-only restriction, there's only one 
additional copy, and if it happens to be bad as well, there's no 
possibility of a third copy to scrub from.  As it happens my personal 
sweet-spot between cost/performance and reliability would be 3-way 
mirroring, but once they code beyond N=2, N should go unlimited, so N=3, 
N=4, N=50 if you have a way to hook them all up... should all be possible.
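
The scrub idea above can be sketched in a few lines.  This is only a minimal
model under stated assumptions: scrub_block is a hypothetical helper, not a
btrfs function, and zlib.crc32 stands in for the checksum btrfs really uses.

```python
import zlib


def scrub_block(copies, expected_crc):
    """Rewrite every copy from the first copy that matches the checksum.

    Returns the repaired list of copies, or None when no copy validates
    (the "no third copy to scrub from" case).
    """
    good = next((c for c in copies if zlib.crc32(c) == expected_crc), None)
    if good is None:
        return None
    return [good for _ in copies]


data = b"important block"
crc = zlib.crc32(data)

# Two-way mirroring with both copies bad: nothing left to repair from.
print(scrub_block([b"bitrot A", b"bitrot B"], crc))

# Three-way mirroring with the same two bad copies: the third one saves it.
print(scrub_block([b"bitrot A", b"bitrot B", data], crc) == [data] * 3)
```

The point is simply that each extra checksum-validated copy is one more
chance for the first branch not to be taken.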

But...

> RAID-5/6 code mostly works but the last report I
> read indicated that some situations for recovery and disk replacement
> didn't work - presumably anyone who's afraid of multiple disks failing
> isn't going to want to trust BTRFS RAID-6 code at the moment.

The raid5/6 code was on the list to be introduced in the next kernel or 
two something like two years ago, when I originally looked into it, and 
likely before that.  Like many of the btrfs features, it actually took 
rather longer to cook than was in the original plan -- it's actually 
rather more complicated than anticipated, and additionally it has been 
put off a few times to work on bugfixing currently supported feature 
bugs.  An incomplete raid56 implementation, normal runtime but not scrub 
or recovery, was introduced several kernels ago now, but it's still not 
complete.

So N-way-mirroring, which is supposed to build on several bits of the 
raid5/6 implementation and therefore is roadmapped for after it, 
continues to look about the same 3-5 kernels off, after raid5/6, as it 
did two years ago.  Except, having seen the raid5/6 timing, and having 
looked back at btrfs feature history going back rather longer, even if 
raid5/6 was declared finished for kernel 3.17 (since 3.16 is past the 
commit window), I'd guess it'd probably take another five kernels (a 
year's worth) or so, at /least/, for N-way-mirroring to properly cook.

So in actuality I'd be surprised to see any N-way-mirroring code at all 
before next spring, and would /not/ be surprised at all to see it take 
all of next year to fully cook to "completion".

Not that I'm complaining /too/ much.  We work with what we have and btrfs 
as it is is quite beyond the features of most filesystems (just the data 
integrity and multi-device filesystem stuff at all, is great to work 
with, besides the stuff like subvolumes and snapshotting that doesn't fit 
my use-case that well =:^), even if it /is/ all presently limited to two-
way-mirroring! =:^\ ).  But it will sure be nice when I /can/ count on 
that third copy to scrub two bad copies, if two copies /do/ happen to be 
bad.

> If you want to have 4 disks in a fully redundant configuration (IE you
> could lose 3 disks without losing any data) then the thing to do is to
> have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1
> on top of that.

The caveat with that is that at least mdraid1/dmraid1 has no verified 
data integrity, and while mdraid5/6 does have 1/2-way-parity calculation, 
it's only used in recovery, NOT cross-verified in ordinary use.

So it's not a proper substitute, tho I guess some big-money hardware 
raids might do it.

In fact, with md/dmraid and its reasonable possibility of silent 
corruption since at that level any of the copies could be returned and 
there's no data integrity checking, if whatever md/dmraid level copy /is/ 
returned ends up being bad, then btrfs will consider that side of the 
pair bad, without any way to check additional copies at the underlying md/
dmraid level.  Effectively you only have two verified copies no matter 
how many ways the dm/mdraid level is mirrored, since there's no 
verification at the dm/mdraid level at all.

Tho if you ran a md/dmraid level scrub often enough, and then ran a btrfs 
scrub on top, one could be /reasonably/ assured of freedom from lower 
level corruption.  But with both levels of scrub together very possibly 
taking a couple days, and various ongoing write activity in the mean 
time, by the time one run was done it'd be time to start the next one, so 
you'd effectively be running scrub at one level or the other *ALL* the 
time!

So... I'd suggest either forgetting about data integrity for the time 
being and just running md/dmraid without worrying about it, or just 
running btrfs with pairs, and backing up to another btrfs of pairs.  
Btrfs send/receive could even be used as the primary syncing method 
between the main and backup set, altho I'd suggest having a fallback such 
as rsync setup and tested to work as well, in case there's a bug in send/
receive that stalls that method for awhile.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: RAID1 3+ drives
  2014-06-28  4:26   ` Duncan
@ 2014-06-28  6:28     ` Russell Coker
  2014-06-28  7:38       ` Martin Steigerwald
                         ` (2 more replies)
  2014-06-28 10:13     ` Roman Mamedov
  1 sibling, 3 replies; 11+ messages in thread
From: Russell Coker @ 2014-06-28  6:28 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Sat, 28 Jun 2014 04:26:43 Duncan wrote:
> Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:
> > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> >> Can I get more protection by using more than 2 drives?
> >> 
> >> I had an onboard RAID a few years back that would let me use RAID1
> >> across up to 4 drives.
> > 
> > Currently the only RAID level that fully works in BTRFS is RAID-1 with
> > data on 2 disks.
> 
> Not /quite/ correct.  Raid0 works, but of course that isn't exactly
> "RAID" as it's not "redundant".  And raid10 works.  But that's simply
> raid0 over raid1.  So depending on whether you consider raid0 actually

http://en.wikipedia.org/wiki/Linux_MD_RAID_10#LINUX-MD-RAID-10

There are a number of ways of doing RAID-0 over RAID-1, but BTRFS doesn't do 
any of them.  When you have more than 2 disks and tell BTRFS to do RAID-1 you 
get a result that might be somewhat comparable to Linux software RAID-10, 
except for the issue of having disks of different sizes and adding more disks 
after creating the "RAID".

> "RAID" or not, which in turn depends on how strict you are with the
> "redundant" part, there is or is not more than btrfs raid1 working.

The way BTRFS, ZFS, and WAFL work is quite different to anything described in 
any of the original papers on RAID.  One could make a case that what these 
filesystems do shouldn't be called RAID, but then we would be searching for 
another term for it.

> > If you have 4 disks in the array then each block will
> > be on 2 of the disks.
> 
> Correct.
> 
> FWIW I'm told that the paper that laid out the original definition of
> RAID (which was linked on this list in a similar discussion some months
> ago) defined RAID-1 as paired redundancy, no matter the number of
> devices.  Various implementations (including Linux' own mdraid soft-raid,
> and I believe dmraid as well) feature multi-way-mirroring aka N-way-
> mirroring such that N devices equals N way mirroring, but that's an
> implementation extension and isn't actually necessary to claim RAID-1
> support.

The paper is a little ambiguous as to whether a 3 disk mirror can be RAID-1.

> So look for N-way-mirroring when you go RAID shopping, and no, btrfs does
> not have it at this time, altho it is roadmapped for implementation after
> completion of the raid5/6 code.
> 
> FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for
> device redundancy, but to take full advantage of btrfs data integrity
> features, allowing to "scrub" a checksum-mismatch copy with the content
> of a checksum-validated copy if available.  That's currently possible,
> but due to the pair-mirroring-only restriction, there's only one
> additional copy, and if it happens to be bad as well, there's no
> possibility of a third copy to scrub from.  As it happens my personal
> sweet-spot between cost/performance and reliability would be 3-way
> mirroring, but once they code beyond N=2, N should go unlimited, so N=3,
> N=4, N=50 if you have a way to hook them all up... should all be possible.

What I want is the ZFS copies= feature.

> > If you want to have 4 disks in a fully redundant configuration (IE you
> > could lose 3 disks without losing any data) then the thing to do is to
> > have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1
> > on top of that.
> 
> The caveat with that is that at least mdraid1/dmraid1 has no verified
> data integrity, and while mdraid5/6 does have 1/2-way-parity calculation,
> it's only used in recovery, NOT cross-verified in ordinary use.

Linux Software RAID-6 only uses the parity when you have a hard read error.  
If you have a disk return bad data and say it's good then you just lose.

That said the rate of disks returning such bad data is very low.  If you had a 
hypothetical array of 4 disks as I suggested then to lose data you need to 
have one pair of disks entirely fail and another disk return corrupt data or 
have 2 disks in separate RAID-1 pairs return corrupt data on matching sectors 
(according to BTRFS data copies) such that Linux software RAID copies the 
corrupt data to the good disk.

That sort of thing is much less likely than having a regular BTRFS RAID-1 
array of 2 disks failing.

Also if you were REALLY paranoid you could have 2 BTRFS RAID-1 filesystems 
that each contain a single large file.  Those 2 large files could be run via 
losetup and used for another BTRFS RAID-1 filesystem.  That gets you 
redundancy at both levels.  Of course if you had 2 disks in one pair fail then 
the loopback BTRFS filesystem would still be OK.

How does the BTRFS kernel code handle a loopback device read failure?

> In fact, with md/dmraid and its reasonable possibility of silent
> corruption since at that level any of the copies could be returned and
> there's no data integrity checking, if whatever md/dmraid level copy /is/
> returned ends up being bad, then btrfs will consider that side of the
> pair bad, without any way to check additional copies at the underlying md/
> dmraid level.  Effectively you only have two verified copies no matter
> how many ways the dm/mdraid level is mirrored, since there's no
> verification at the dm/mdraid level at all.

BTRFS doesn't consider a side of the pair to be bad, just the block that was 
read.  Usually disk corruption is in the order of dozens of blocks and the 
rest of the disk will be good.

> Tho if you ran a md/dmraid level scrub often enough, and then ran a btrfs
> scrub on top, one could be /reasonably/ assured of freedom from lower
> level corruption.

Not at all.  Linux software RAID scrub will copy data from one disk to the 
other.  It may copy from the good disk to the bad or from the bad disk to the 
good - and it won't know which it's doing.

Also last time I checked a scrub of Linux software RAID-1 still reported large 
multiples of 128 sectors mismatching in normal operation.  So you won't even 
know if a disk is returning bogus data unless the bad data is copied to the 
good disk and exposed to BTRFS.

> But with both levels of scrub together very possibly
> taking a couple days, and various ongoing write activity in the mean
> time, by the time one run was done it'd be time to start the next one, so
> you'd effectively be running scrub at one level or the other *ALL* the
> time!

No.  I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub every 
Sunday night.  If I had an array of 4 disks then I could do scrubs on Saturday 
night as well.

> So... I'd suggest either forgetting about data integrity for the time
> being and just running md/dmraid without worrying about it, or just
> running btrfs with pairs, and backing up to another btrfs of pairs.
> Btrfs send/receive could even be used as the primary syncing method
> between the main and backup set, altho I'd suggest having a fallback such
> as rsync setup and tested to work as well, in case there's a bug in send/
> receive that stalls that method for awhile.

One advantage of BTRFS backup is that you know if the data is corrupt.  If you 
make several backups that end up with different blocks on disk then Linux 
knows which one has the correct file data.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

* Re: RAID1 3+ drives
  2014-06-28  6:28     ` Russell Coker
@ 2014-06-28  7:38       ` Martin Steigerwald
  2014-06-28  7:43         ` Hugo Mills
  2014-06-28 11:38       ` Duncan
  2014-06-28 18:15       ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2014-06-28  7:38 UTC (permalink / raw)
  To: russell; +Cc: Duncan, linux-btrfs

Am Samstag, 28. Juni 2014, 16:28:23 schrieb Russell Coker:
> > So look for N-way-mirroring when you go RAID shopping, and no, btrfs does
> > not have it at this time, altho it is roadmapped for implementation after
> > completion of the raid5/6 code.
> >
> > 
> >
> > FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for
> > device redundancy, but to take full advantage of btrfs data integrity
> > features, allowing to "scrub" a checksum-mismatch copy with the content
> > of a checksum-validated copy if available.  That's currently possible,
> > but due to the pair-mirroring-only restriction, there's only one
> > additional copy, and if it happens to be bad as well, there's no
> > possibility of a third copy to scrub from.  As it happens my personal
> > sweet-spot between cost/performance and reliability would be 3-way
> > mirroring, but once they code beyond N=2, N should go unlimited, so N=3,
> > N=4, N=50 if you have a way to hook them all up... should all be possible.
> 
> What I want is the ZFS copies= feature.

Something like this, but even more flexible, was planned.  There was some 
discussion on how to specify complex redundancy patterns flexibly: exactly 
how much redundancy, how many spares, and so on.

I haven't read anything about this in a long time.  I wonder what happened 
to the idea.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

* Re: RAID1 3+ drives
  2014-06-28  7:38       ` Martin Steigerwald
@ 2014-06-28  7:43         ` Hugo Mills
  0 siblings, 0 replies; 11+ messages in thread
From: Hugo Mills @ 2014-06-28  7:43 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: russell, Duncan, linux-btrfs

On Sat, Jun 28, 2014 at 09:38:00AM +0200, Martin Steigerwald wrote:
> Am Samstag, 28. Juni 2014, 16:28:23 schrieb Russell Coker:
> > > So look for N-way-mirroring when you go RAID shopping, and no, btrfs does
> > > not have it at this time, altho it is roadmapped for implementation after
> > > completion of the raid5/6 code.
> > >
> > > 
> > >
> > > FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for
> > > device redundancy, but to take full advantage of btrfs data integrity
> > > features, allowing to "scrub" a checksum-mismatch copy with the content
> > > of a checksum-validated copy if available.  That's currently possible,
> > > but due to the pair-mirroring-only restriction, there's only one
> > > additional copy, and if it happens to be bad as well, there's no
> > > possibility of a third copy to scrub from.  As it happens my personal
> > > sweet-spot between cost/performance and reliability would be 3-way
> > > mirroring, but once they code beyond N=2, N should go unlimited, so N=3,
> > > N=4, N=50 if you have a way to hook them all up... should all be possible.
> > 
> > What I want is the ZFS copies= feature.
> 
> Something like this, but even more flexible, was planned.  There was some 
> discussion on how to specify complex redundancy patterns flexibly: exactly 
> how much redundancy, how many spares, and so on.
> 
> I haven't read anything about this in a long time.  I wonder what happened 
> to the idea.

   It's moving slowly in fits and starts. I haven't forgotten it.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- But people have always eaten people,  / what else is there to ---  
         eat?  / If the Juju had meant us not to eat people / he         
                     wouldn't have made us of meat.                      

* Re: RAID1 3+ drives
  2014-06-28  4:26   ` Duncan
  2014-06-28  6:28     ` Russell Coker
@ 2014-06-28 10:13     ` Roman Mamedov
  2014-06-29  2:30       ` Duncan
  1 sibling, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2014-06-28 10:13 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Sat, 28 Jun 2014 04:26:43 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:
> 
> > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> >> Can I get more protection by using more than 2 drives?
> >> 
> >> I had an onboard RAID a few years back that would let me use RAID1
> >> across up to 4 drives.
> >> 
> > Currently the only RAID level that fully works in BTRFS is RAID-1 with
> > data on 2 disks.
> 
> Not /quite/ correct.  Raid0 works, but of course that isn't exactly 
> "RAID" as it's not "redundant".  And raid10 works.  But that's simply 
> raid0 over raid1.  So depending on whether you consider raid0 actually 
> "RAID" or not, which in turn depends on how strict you are with the 
> "redundant" part, there is or is not more than btrfs raid1 working.

Also, depending on what you consider "fully works", RAID1 may not qualify either,
as neither the read-balancing, nor write-submission algorithms are ready for
production use, performance-wise.

(RAID1 writes to two disks sequentially, not at the same time; and reads are
satisfied from in effect a random device, not from the least-busy device).

-- 
With respect,
Roman

* Re: RAID1 3+ drives
  2014-06-28  6:28     ` Russell Coker
  2014-06-28  7:38       ` Martin Steigerwald
@ 2014-06-28 11:38       ` Duncan
  2014-06-28 13:40         ` Russell Coker
  2014-06-28 18:15       ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Duncan @ 2014-06-28 11:38 UTC (permalink / raw)
  To: linux-btrfs

Russell Coker posted on Sat, 28 Jun 2014 16:28:23 +1000 as excerpted:

> On Sat, 28 Jun 2014 04:26:43 Duncan wrote:
>> Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:
>> > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
>> >> Can I get more protection by using more than 2 drives?
>> >> 
>> >> I had an onboard RAID a few years back that would let me use RAID1
>> >> across up to 4 drives.
>> > 
>> > Currently the only RAID level that fully works in BTRFS is RAID-1
>> > with data on 2 disks.
>> 
>> Not /quite/ correct.  Raid0 works, but of course that isn't exactly
>> "RAID" as it's not "redundant".  And raid10 works.  But that's simply
>> raid0 over raid1.  So depending on whether you consider raid0 actually
> 
> http://en.wikipedia.org/wiki/Linux_MD_RAID_10#LINUX-MD-RAID-10
> 
> There are a number of ways of doing RAID-0 over RAID-1,

Yes...

> but BTRFS doesn't do any of them.

It does...

> When you have more than 2 disks and
> tell BTRFS to do RAID-1 you get a result that might be somewhat
> comparable to Linux software RAID-10, except for the issue of having
> disks of different sizes and adding more disks after creating the
> "RAID".

What about when you tell btrfs to do raid10?

Unless you're going to argue that btrfs raid10 mode isn't "real" raid10, 
or that like raid5/6 it's not complete, but you haven't mentioned it at 
all, so that doesn't seem to be what you're saying.

Which was my point when I mentioned raid10 in the first place, it's 
there, and unlike raid5/6, I've never seen any indication that it's not 
complete or supported.

>> "RAID" or not, which in turn depends on how strict you are with the
>> "redundant" part, there is or is not more than btrfs raid1 working.
> 
> The way BTRFS, ZFS, and WAFL work is quite different to anything
> described in any of the original papers on RAID.  One could make a case
> that what these filesystems do shouldn't be called RAID, but then we
> would be searching for another term for it.

The FAQ admits that some people call it a layering violation... =8^0  
Which in a way it is, as it combines a below-filesystem virtual device 
layer (where raid is normally found) with the filesystem layer.  But the 
argument is, it's a /useful/ layering violation.  Which it is, as that's 
what gives btrfs the ability to do what it does with some of its features.

But the flip side of that is that since it includes so much that is 
normally strictly isolated into other layers, it's intensely complex, far 
more so than most other filesystems, which is why it's taking so horribly 
long to introduce some of these features, and why some of the scaling 
bugs in particular have been so nasty -- it's just /dealing/ with that 
much more than the ordinary filesystem.

The nearest competitor that I'm aware of is zfs.  But (1) zfs made some 
compromises that btrfs is trying to avoid, and (2) AFAIK, zfs had a LOT 
more real resources sunk into it.  I'm sure there's people that know way 
more about its development than I do.

And of course zfs isn't GPLv2 compatible, the reason it'll never be 
mainline Linux unless the zfs owners wish it so, but it's very obvious 
they wish it NOT so, which is why it remains as it is.  That's not 
important to everyone, but it's a big reason I can't/won't seriously 
consider zfs here.

> What I want is the ZFS copies= feature.

As others have mentioned, the discussed idea is multi-axis 
configurability, N-mirror, S-stripe, P-parity (tho I don't believe that's 
the letters used).  It's possible strip-size could be added to that as 
well.  Hugo is the guy that has been working most directly on defining 
that.

*BUT*, at this point that's all pie-in-the-sky for btrfs, while I guess 
zfs copies= "just works".  If the licensing issues weren't there, I 
imagine I'd be using zfs today, and if btrfs took another decade or 
whatever to mature, no big deal.  But the licensing issues are there and 
zfs is thus not an option for me, so... as I said earlier, we work with 
what we have.

>> The caveat with that is that at least mdraid1/dmraid1 has no verified
>> data integrity, and while mdraid5/6 does have 1/2-way-parity
>> calculation, it's only used in recovery, NOT cross-verified in ordinary
>> use.
> 
> Linux Software RAID-6 only uses the parity when you have a hard read
> error. If you have a disk return bad data and say it's good then you
> just lose.

Which is basically restating what I was saying.

> That said the rate of disks returning such bad data is very low.  If you
> had a hypothetical array of 4 disks as I suggested then to lose data you
> need to have one pair of disks entirely fail and another disk return
> corrupt data or have 2 disks in separate RAID-1 pairs return corrupt
> data on matching sectors (according to BTRFS data copies) such that
> Linux software RAID copies the corrupt data to the good disk.

Well, it's a bit more complex than that, and the details can definitely 
come back to bite you in certain corner cases, but I agree with the 
general idea.

> That sort of thing is much less likely than having a regular BTRFS
> RAID-1 array of 2 disks failing.

The problem is that there's little or no control of it at the mdraid 
level.  In md/raid1 mode, a "scrub" simply copies the data, good or bad, 
from the first device to the others.  There's no data integrity checking 
and not even a majority vote, it simply dumbly copies what's on one 
device to the others, as long as what's on the first device is readable 
at all.  In theory raid6 with its two-way-parity could be better, since 
it /does/ have the two-way-parity data it /could/ check, but the 
frustrating part of it is that it /doesn't/!  It only reads the data 
strip not the entire stripe, and doesn't do any cross-checking unless it 
has to make up for a dropped device.
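
To make the contrast concrete, here is a toy model of the md/raid1 repair
behaviour as described above (copy the first readable member over the
others, no checksum, no vote).  md_style_repair is a made-up name for
illustration and is not the kernel's code.

```python
def md_style_repair(members):
    # Blindly trust the first member and overwrite the rest with it.
    # Nothing here can tell a good copy from a bad one.
    first = members[0]
    return [first for _ in members]


good = b"good data"
bad = b"bad data!"

# Corruption on the second member: the repair happens to fix it.
print(md_style_repair([good, bad]) == [good, good])

# Corruption on the first member: the repair destroys the good copy.
print(md_style_repair([bad, good]) == [bad, bad])
```

Which of the two outcomes you get depends only on where the corruption
happens to sit, which is the frustrating part.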

And with the size of disks we have today, the statistics on multiple 
whole device reliability are NOT good to us!  There's a VERY REAL chance, 
even likelihood, that at least one block on the device is going to be 
bad, and not be caught by its own error detection!

There's some serious study and work going into this, and it's why people 
working on modern filesystems are pretty much all adding data integrity 
features, etc.  Btrfs and zfs aren't alone in that.  And it's really 
because there's no choice.  As TB scale to PB, the chances are that 
there /will/ be one or possibly more device-undetected errors somewhere 
on that device.  One in a billion or whatever (IDR the real number and 
I'm too lazy to do the math ATM) chance, but once you have numbers 
nearing a billion...
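
For anyone who does want to do the math, a minimal sketch follows.  The
error rate used is purely a placeholder assumption, not a measured or
vendor-quoted figure, and the function name is made up; plug in whatever
rate you trust for your own devices.

```python
import math


def p_at_least_one_error(capacity_bytes, per_bit_error_rate):
    # Probability that at least one bit out of the whole device is silently
    # wrong, assuming independent errors: 1 - (1 - p)^bits.
    # log1p/expm1 keep the result accurate when p is tiny.
    bits = capacity_bytes * 8
    return -math.expm1(bits * math.log1p(-per_bit_error_rate))


HYPOTHETICAL_RATE = 1e-15  # assumed undetected-error rate per bit

for terabytes in (1, 4, 100, 1000):
    p = p_at_least_one_error(terabytes * 10**12, HYPOTHETICAL_RATE)
    print(f"{terabytes:5d} TB: {p:.4f}")
```

Whatever rate is assumed, the probability climbs toward 1 as capacity
grows, which is the argument for checksumming at the filesystem level.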

> Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
> filesystems that each contain a single large file.  Those 2 large files
> could be run via losetup and used for another BTRFS RAID-1 filesystem. 
> That gets you redundancy at both levels.  Of course if you had 2 disks
> in one pair fail then the loopback BTRFS filesystem would still be OK.

But the COW and fragmentation issues on the bottom level... OUCH!  And 
you can't simply set NOCOW, because that turns off the checksumming as 
well, leaving you right back where you were without the integrity 
checking!

IOW, it might work for filesystems to a quarter TiB or so, but don't 
expect it to scale to TiB plus without getting MASSIVELY slow.  I used to 
mention that theoretical option too, but once I saw the problems btrfs 
has with fragmentation on internal-write files, which is what loop-file 
would be... let's just say when I thought about mentioning it I shuddered 
and decided to forget I even considered it.

Tho for the sub-100-GiB filesystems I'm dealing with here, on fast SSD 
with near 100% over-provisioning (hey, the size I wanted wasn't available 
at a good price so I took what I could get, and the overprovisioning 
certainly doesn't hurt!), it might actually be somewhat practical...

> How does the BTRFS kernel code handle a loopback device read failure?
> 
>> In fact, with md/dmraid and its reasonable possibility of silent
>> corruption since at that level any of the copies could be returned and
>> there's no data integrity checking, if whatever md/dmraid level copy
>> /is/ returned ends up being bad, then btrfs will consider that side of
>> the pair bad, without any way to check additional copies at the
>> underlying md/dmraid level.  Effectively you only have two verified
>> copies no matter how many ways the dm/mdraid level is mirrored, since
>> there's no verification at the dm/mdraid level at all.
> 
> BTRFS doesn't consider a side of the pair to be bad, just the block that
> was read.  Usually disk corruption is in the order of dozens of blocks
> and the rest of the disk will be good.

I didn't word that well, primarily because I didn't even think of the 
whole-device-bad case.

What I meant was that in the context of a btrfs scrub, btrfs will only be 
aware of the two "sides" for every block, no matter how many devices the 
underlying mdraid on that "side" is actually composed of.  At the btrfs 
level, then, it'll only have one chance to present good data, and the 
mdraid level will effectively pick a candidate randomly.  If the picked 
candidate happens to return a block that fails the btrfs checksum, it'll 
reject that block from that side, regardless of how many good copies 
there might also be.  If it /does/ reject that block, you better *HOPE* 
that the copy it picks from the mdraid on the /other/ side happens to be 
valid, because if it's not...

If it's not, then btrfs will show both sides as failing the checksum, 
which means as far as btrfs is concerned that block (not the whole btrfs 
device "side", just that block, but that's bad enough) is dead.  There 
are no good copies for it to use, regardless of the number of good copies 
on the other devices composing the underlying mdraids on each side.

It's simply a matter of chance, over which the admin has very little 
control.  That's the frustrating part, and the point I was trying to get 
across.
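
As a rough illustration of that chance (a sketch only, assuming md picks 
uniformly at random among the mirrors on each side and that the two picks 
are independent; the function names are made up for this example):

```python
from fractions import Fraction


def p_side_returns_bad(bad_copies: int, mirrors: int) -> Fraction:
    """Chance that one md "side" hands btrfs a bad copy of a block.

    Assumption: md returns any one of its mirrors with equal probability.
    """
    if mirrors < 1 or not 0 <= bad_copies <= mirrors:
        raise ValueError("need 0 <= bad_copies <= mirrors and mirrors >= 1")
    return Fraction(bad_copies, mirrors)


def p_block_lost(bad_a: int, mirrors_a: int, bad_b: int, mirrors_b: int) -> Fraction:
    """Chance that btrfs sees a checksum failure on BOTH sides for one block.

    Assumption: the picks on the two sides are independent.
    """
    return p_side_returns_bad(bad_a, mirrors_a) * p_side_returns_bad(bad_b, mirrors_b)


# Two-way md mirrors on each side, one bad copy per side: three of the four
# physical copies could be paired up to give good data, yet btrfs still has
# a one-in-four chance of seeing nothing but bad data.
print(p_block_lost(1, 2, 1, 2))  # 1/4
```

Under those assumptions, adding md mirrors lowers the odds but never gives 
btrfs a way to ask for "the other copy" within a side.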

But I agree (now that you made me aware of that read of what I wrote in 
the first place) that the way I wrote it did sound like I was saying that 
btrfs would drop that whole underlying mdraid, composing that "side".  
But while that's what I appeared to write, that's not what I had in 
mind...

>> Tho if you ran a md/dmraid level scrub often enough, and then ran a
>> btrfs scrub on top, one could be /reasonably/ assured of freedom from
>> lower level corruption.
> 
> Not at all.  Linux software RAID scrub will copy data from one disk to
> the other.  It may copy from the good disk to the bad or from the bad
> disk to the good - and it won't know which it's doing.

Which was my point.

But assume that you do an mdraid scrub and it finds and copies a bad 
version.  At that point, if you've been both-layer scrubbing regularly, 
the chances of the /other/ side being bad are relatively low, so if you 
do a btrfs scrub as soon as you finish the mdraid scrub, it should 
catch that bad copy and rewrite it from the other, good copy at the btrfs 
level.  The rewrite will then be propagated down to all the devices on 
the underlying mdraid on the bad side of the btrfs, and with a bit of 
luck, that will rewrite all the bad copies, or at least the bad copy on 
the first mdraid device so that the next mdraid scrub will propagate it 
to the bad device.

If you constantly scrub the underlying mdraids and it sometimes 
propagates a bad block at that level, followed by a scrub at the btrfs 
level to (hopefully) force rewrites of any bad copies that the mdraid 
scrub propagated, then back to the mdraid level, then back to the btrfs 
level, basically constantly scrubbing at one level or the other, then in 
theory anyway, the chances of bitrot appearing on both sides of the btrfs 
at the same time are rather lowered...

*BUT* at a cost of essentially *CONSTANT* scrubbing.  Constant because at 
the multi-TBs we're talking, just completing a single scrub cycle could 
well take more than a standard 8-hour work-day, so by the time you 
finish, it's already about time to start the next scrub cycle.

That sort of constant scrubbing is going to take its toll both on device 
life and on I/O thruput for whatever data you're actually storing on the 
device, since a good share of the time it's going to be scrubbing as 
well, slowing down the speed of the real I/O.

And I just don't see that as realistic.  At least not for spinning rust, 
which is where people talking about multi-TB capacities are likely to be 
at this point.  For SSD it could be feasible as the scrubs should go fast 
enough that most of the time will be /between/ scrubs instead of /doing/ 
scrubs, and even during the scrubs, normal I/O shouldn't be /too/ held up 
on SSD, altho higher capacity I/O certainly would be, but of course SSD 
limits you to the lower capacities and higher costs of SSD.

> Also last time I checked a scrub of Linux software RAID-1 still reported
> large multiples of 128 sectors mismatching in normal operation.

Ouch!  That I hadn't even considered.

> So you won't even know if a disk is returning bogus data unless the bad
> data is copied to the good disk and exposed to BTRFS.
> 
>> But with both levels of scrub together very possibly taking a couple
>> days, and various ongoing write activity in the mean time, by the time
>> one run was done it'd be time to start the next one, so you'd
>> effectively be running scrub at one level or the other *ALL* the time!
> 
> No.  I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub
> every Sunday night.  If I had an array of 4 disks then I could do scrubs
> on Saturday night as well.

But are you scrubbing at both the btrfs and the md/dmraid level?  That'll 
effectively double the scrub-time.

And the idea was to scrub say every other day if not daily, so the chance 
of developing further bitrot and thus of getting it on both sides of the 
btrfs at the same time, is reduced as much as possible because the bitrot 
is caught and btrfs-scrub-corrected as soon as possible.

And while that might not take a full 24 hours, it's likely to take a 
significant enough portion of 24 hours, that if you're doing a full mdraid 
and btrfs level both scrub every two days, some significant fraction (say 
a third to a half) of the time will be spent scrubbing, during which 
normal I/O speeds will be significantly reduced, while also reducing 
device lifetime due to the relatively high duty cycle seek activity.

>> So... I'd suggest either forgetting about data integrity for the time
>> being and just running md/dmraid without worrying about it, or just
>> running btrfs with pairs, and backing up to another btrfs of pairs.
>> Btrfs send/receive could even be used as the primary syncing method
>> between the main and backup set, altho I'd suggest having a fallback
>> such as rsync setup and tested to work as well, in case there's a bug
>> in send/ receive that stalls that method for awhile.
> 
> One advantage of BTRFS backup is that you know if the data is corrupt. 
> If you make several backups that end up with different blocks on disk
> then Linux knows which one has the correct file data.

Absolutely agreed. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: RAID1 3+ drives
  2014-06-28 11:38       ` Duncan
@ 2014-06-28 13:40         ` Russell Coker
  0 siblings, 0 replies; 11+ messages in thread
From: Russell Coker @ 2014-06-28 13:40 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Sat, 28 Jun 2014 11:38:47 Duncan wrote:
> And with the size of disks we have today, the statistics on multiple
> whole device reliability are NOT good to us!  There's a VERY REAL chance,
> even likelihood, that at least one block on the device is going to be
> bad, and not be caught by its own error detection!

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The above paper suggests that about 10% of SATA disks get such errors 
per year, and that a disk that has such a problem typically has it for ~50 
sectors.  The probability of 2 disks both getting such errors in a year 
(if they are truly random and independent) would be something like 1%.  
The probability that the ~50 sectors on each of 2*3TB disks happen to 
match up is much lower.
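
A back-of-the-envelope version of that arithmetic, as a sketch only.  The 
512-byte sector size, the decimal 3TB capacity, and the idea that each 
disk's bad region is one contiguous run at a uniformly random position are 
my assumptions for illustration, not figures from the paper:

```python
SECTOR_BYTES = 512          # assumption: 512-byte logical sectors
DISK_BYTES = 3 * 10**12     # assumption: "3TB" in decimal bytes
P_DISK_PER_YEAR = 0.10      # ~10% of SATA disks per year, as quoted above
RUN_SECTORS = 50            # ~50 affected sectors per affected disk


def p_both_disks_affected(p: float = P_DISK_PER_YEAR) -> float:
    """Chance that two independent disks both develop such errors in a year."""
    return p * p


def p_runs_overlap(disk_bytes: int = DISK_BYTES,
                   run: int = RUN_SECTORS,
                   sector: int = SECTOR_BYTES) -> float:
    """Approximate chance that two random runs cover a common sector.

    Two runs of `run` sectors overlap when their start offsets differ by
    fewer than `run` sectors, which is roughly (2*run - 1) / total_sectors
    when the runs are tiny compared to the disk.
    """
    total_sectors = disk_bytes // sector
    return (2 * run - 1) / total_sectors


both = p_both_disks_affected()
overlap = p_runs_overlap()
print(f"both disks affected in a year: {both:.2%}")
print(f"bad runs overlap, given both affected: {overlap:.2e}")
print(f"combined, per year: {both * overlap:.2e}")
```

With these assumptions the combined figure is many orders of magnitude 
below the 1% for "both disks affected at all".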

> > Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
> > filesystems that each contain a single large file.  Those 2 large files
> > could be run via losetup and used for another BTRFS RAID-1 filesystem.
> > That gets you redundancy at both levels.  Of course if you had 2 disks
> > in one pair fail then the loopback BTRFS filesystem would still be OK.
> 
> But the COW and fragmentation issues on the bottom level... OUCH!  And
> you can't simply set NOCOW, because that turns off the checksumming as
> well, leaving you right back where you were without the integrity
> checking!

It really depends on how much performance you need.  I've got some virtual 
servers running BTRFS within BTRFS and with modern hardware and a light load 
it works OK.

> *BUT* at a cost of essentially *CONSTANT* scrubbing.  Constant because at
> the multi-TBs we're talking, just completing a single scrub cycle could
> well take more than a standard 8-hour work-day, so by the time you
> finish, it's already about time to start the next scrub cycle.

Scrubbing my BTRFS RAID-1 filesystem with 2.4TB of data stored on a pair of 
3TB disks takes 5 hours.
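
For scale, a small sketch of what that figure implies.  The function names 
are hypothetical, and the assumption that an md-level scrub would take the 
same 5 hours is only a guess for illustration (md reads whole devices 
rather than just allocated data, so it could well take longer):

```python
def scrub_rate_mb_s(data_bytes: float, hours: float) -> float:
    """Average scrub throughput in MB/s (decimal megabytes) of stored data."""
    return data_bytes / (hours * 3600) / 1e6


def duty_cycle(scrub_hours: float, interval_hours: float) -> float:
    """Fraction of wall-clock time spent scrubbing."""
    return scrub_hours / interval_hours


BTRFS_SCRUB_HOURS = 5.0
MD_SCRUB_HOURS = 5.0  # assumption: same duration as the btrfs scrub

print(f"rate: {scrub_rate_mb_s(2.4e12, BTRFS_SCRUB_HOURS):.0f} MB/s")
print(f"weekly btrfs scrub: {duty_cycle(BTRFS_SCRUB_HOURS, 7 * 24):.1%}")
print(f"both layers every 2 days: "
      f"{duty_cycle(BTRFS_SCRUB_HOURS + MD_SCRUB_HOURS, 2 * 24):.1%}")
```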

> That sort of constant scrubbing is going to take its toll both on device
> life and on I/O thruput for whatever data you're actually storing on the
> device, since a good share of the time it's going to be scrubbing as
> well, slowing down the speed of the real I/O.

Some years ago I asked an executive from a company that manufactured hard 
drives about this.  The engineering manager who was directed to answer my 
question told me that the drives were designed to perform any sequence of 
legal operations continually for the warranty period.  So if a disk had a 3 
year warranty then it should be able to survive a scrubbing loop for 3 years.

But scrubbing a system that runs 24*7 is a problem.  Hopefully we will get a 
speed limit feature for BTRFS scrubbing as there is for Linux software RAID 
rebuild/scrub.

> > No.  I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub
> > every Sunday night.  If I had an array of 4 disks then I could do scrubs
> > on Saturday night as well.
> 
> But are you scrubbing at both the btrfs and the md/dmraid level?  That'll
> effectively double the scrub-time.

It's a BTRFS RAID-1, there is no mdadm on that system.

> And while that might not take a full 24 hours, it's likely to take a
> significant enough portion of 24 hours, that if you're doing a full mdraid
> and btrfs level both scrub every two days, some significant fraction (say
> a third to a half) of the time will be spent scrubbing, during which
> normal I/O speeds will be significantly reduced, while also reducing
> device lifetime due to the relatively high duty cycle seek activity.

When the expected error rate for SATA disks is ~10% of disks having errors per 
year a scrub every second day seems rather paranoid.

But if you are that paranoid then the wisc.edu paper suggests that you should 
be buying "enterprise" disks that have a much lower error rate.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: RAID1 3+ drives
  2014-06-28  6:28     ` Russell Coker
  2014-06-28  7:38       ` Martin Steigerwald
  2014-06-28 11:38       ` Duncan
@ 2014-06-28 18:15       ` Chris Murphy
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2014-06-28 18:15 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jun 28, 2014, at 12:28 AM, Russell Coker <russell@coker.com.au> wrote:
> 
>> Tho if you ran a md/dmraid level scrub often enough, and then ran a btrfs
>> scrub on top, one could be /reasonably/ assured of freedom from lower
>> level corruption.
> 
> Not at all.  Linux software RAID scrub will copy data from one disk to the 
> other.  

md supports two kinds of scrub: check and repair. Check is the same as a btrfs read-only scrub with the -r option.

> It may copy from the good disk to the bad or from the bad disk to the 
> good - and it won't know which it's doing.

Yes.

> Also last time I checked a scrub of Linux software RAID-1 still reported large 
> multiples of 128 sectors mismatching in normal operation.  So you won't even 
> know if a disk is returning bogus data unless the bad data is copied to the 
> good disk and exposed to BTRFS.

For raid1 and raid10 you need to zero the drives or you will get mismatches. A swap partition or swap file on an md device will also cause mismatches. Mismatches on raid1 and raid10 are much less likely for other types of files, but man 4 md does say they're possible, so mismatch_cnt isn't perfectly reliable on raid1 and raid10.
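
A minimal sketch of driving that from a script, using the md sysfs 
attributes sync_action and mismatch_cnt.  The helper names and the 
sysfs_root parameter (there only so the code can be pointed at a scratch 
directory instead of a live system) are assumptions of this example; on a 
real array it needs root:

```python
from pathlib import Path


def md_attr(md_name: str, attr: str, sysfs_root: str = "/sys") -> Path:
    """Path of an md attribute, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs_root) / "block" / md_name / "md" / attr


def start_md_scrub(md_name: str, action: str = "check",
                   sysfs_root: str = "/sys") -> None:
    """Request a scrub: "check" only counts mismatches, "repair" rewrites."""
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported action: {action!r}")
    md_attr(md_name, "sync_action", sysfs_root).write_text(action + "\n")


def read_mismatch_cnt(md_name: str, sysfs_root: str = "/sys") -> int:
    """Mismatch count reported by the most recent check or repair."""
    return int(md_attr(md_name, "mismatch_cnt", sysfs_root).read_text().strip())
```

Something like start_md_scrub("md0") followed, once the check has 
finished, by read_mismatch_cnt("md0") gives the number to be suspicious of, 
with the raid1/raid10 caveats above.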


Chris Murphy


* Re: RAID1 3+ drives
  2014-06-28 10:13     ` Roman Mamedov
@ 2014-06-29  2:30       ` Duncan
  0 siblings, 0 replies; 11+ messages in thread
From: Duncan @ 2014-06-29  2:30 UTC (permalink / raw)
  To: linux-btrfs

Roman Mamedov posted on Sat, 28 Jun 2014 16:13:47 +0600 as excerpted:

> Also depending on what you consider "fully works", RAID1 may not qualify
> too,
> as neither the read-balancing, nor write-submission algorithms are ready
> for production use, performance-wise.
> 
> (RAID1 writes to two disks sequentially, not at the same time; and reads
> are satisfied from in effect a random device, not from the least-busy
> device).

Good point.  The current algorithms were designed as "good enough" stand-
ins for testing.  They were /not/ designed as highly efficient
parallel-I/O on parallel devices and cores implementations, as that was 
to come later.

Of course part of /that/ problem is that often enough, the I/O channel 
is /not/ the bottleneck.  The bottleneck is still the horrible scaling 
issues due to calculating the interplay between all those snapshots and 
quotas and massive internal-rewrite-pattern VM images, which is why we 
have snapshot-aware-defrag disabled ATM.  So arguably, focusing on the 
most efficient I/O queuing algorithm at this point would be premature 
optimization, which would mean it's a /good/ thing they haven't focused 
on updating them yet.  Once these horrible scaling issues are addressed 
and snapshot-aware-defrag and the like can be enabled again without 
triggering week-going-and-it's-still-not-half-done issues, /then/ perhaps 
it's time to look at the parallel I/O queuing and balancing algorithms.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

