bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
@ 2012-09-24 13:37 Jakub Husák
  2012-09-25  4:19 ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Husák @ 2012-09-24 13:37 UTC (permalink / raw)
  To: linux-raid

Hi, I have found a serious bug, that affects at least 4-disk md raid10 
with far2 layout. The kernel allows it to run with two failed drives 
silently without failing the whole array, despite it's not possible for 
it to work correctly because of chunks distribution with far2 layout.
The worst thing about it is that the write IO errors are invisible for 
the file system and running processes, the written data are just lost, 
only with IO errors reported in dmesg. Even force-reassembling ends up 
with clean,degraded array, with TWO disks, ignoring that I've tried to 
assemble it with all four devices. Recreating the array with 
--assume-clean is the only way to put it together.

System:

Ubuntu 12.04
Linux version 3.2.0-30-generic (buildd@batsu) (gcc version 4.6.3 
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #48-Ubuntu SMP Fri Aug 24 16:52:48 UTC 
2012
mdadm - v3.2.5 - 18th May 2012

and

Debian 6.0
Linux version 2.6.32-5-xen-amd64 (Debian 2.6.32-35) (dannf@debian.org) 
(gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Tue Jun 14 12:46:30 UTC 2011
mdadm - v3.1.4 - 31st August 2010

and

Centos 6.3


How to repeat:

dd if=/dev/zero of=d0 bs=1M count=100
dd if=/dev/zero of=d1 bs=1M count=100
dd if=/dev/zero of=d2 bs=1M count=100
dd if=/dev/zero of=d3 bs=1M count=100
losetup -f d0
losetup -f d1
losetup -f d2
losetup -f d3

mdadm -C /dev/md0 --level=10 --raid-devices=4 --layout=f2 /dev/loop[0-3]

dd if=/dev/zero of=/dev/md0 bs=512K count=10
10+0 records in
10+0 records out
5242880 bytes (5,2 MB) copied, 0,0409824 s, 128 MB/s

OK

mdadm /dev/md0 --fail /dev/loop0
mdadm /dev/md0 --fail /dev/loop3

mdadm -D /dev/md0
/dev/md0:
         Version : 1.2
   Creation Time : Mon Sep 24 08:47:10 2012
      Raid Level : raid10
      Array Size : 202752 (198.03 MiB 207.62 MB)
   Used Dev Size : 101376 (99.02 MiB 103.81 MB)
    Raid Devices : 4
   Total Devices : 4
     Persistence : Superblock is persistent

     Update Time : Mon Sep 24 08:48:55 2012
           State : clean, degraded                 <<< !!!!!!!!
  Active Devices : 2
Working Devices : 2
  Failed Devices : 2
   Spare Devices : 0

          Layout : far=2
      Chunk Size : 512K

            Name : koubas-desktop:0  (local to host koubas-desktop)
            UUID : 3ea4ded7:c10b1778:dc9f92aa:6e7cb196
          Events : 21

     Number   Major   Minor   RaidDevice State
        0       0        0        0 removed <<< !!!!!!!!
        1       7        1        1      active sync   /dev/loop1
        2       7        2        2      active sync   /dev/loop2
        3       0        0        3 removed <<< !!!!!!!!

        0       7        0        -      faulty spare /dev/loop0
        3       7        3        -      faulty spare /dev/loop3

dd if=/dev/zero of=/dev/md0 bs=512K count=10
10+0 records in
10+0 records out
5242880 bytes (5,2 MB) copied, 0,0245752 s, 213 MB/s
echo $?
0    <<< !!!!!!!

dmesg:
[883011.442366] md/raid10:md0: Disk failure on loop0, disabling device.
[883011.442367] md/raid10:md0: Operation continuing on 3 devices.
[883011.473292] RAID10 conf printout:
[883011.473296]  --- wd:3 rd:4
[883011.473299]  disk 0, wo:1, o:0, dev:loop0
[883011.473301]  disk 1, wo:0, o:1, dev:loop1
[883011.473302]  disk 2, wo:0, o:1, dev:loop2
[883011.473304]  disk 3, wo:0, o:1, dev:loop3
[883011.492046] RAID10 conf printout:
[883011.492051]  --- wd:3 rd:4
[883011.492054]  disk 1, wo:0, o:1, dev:loop1
[883011.492056]  disk 2, wo:0, o:1, dev:loop2
[883011.492058]  disk 3, wo:0, o:1, dev:loop3
[883015.875089] md/raid10:md0: Disk failure on loop3, disabling device.
[883015.875090] md/raid10:md0: Operation continuing on 2 devices. <<< !!!!!
[883015.886686] RAID10 conf printout:
[883015.886692]  --- wd:2 rd:4
[883015.886695]  disk 1, wo:0, o:1, dev:loop1
[883015.886697]  disk 2, wo:0, o:1, dev:loop2
[883015.886699]  disk 3, wo:1, o:0, dev:loop3
[883015.900018] RAID10 conf printout:
[883015.900023]  --- wd:2 rd:4
[883015.900025]  disk 1, wo:0, o:1, dev:loop1
[883015.900027]  disk 2, wo:0, o:1, dev:loop2
************* "successful" dd follows: *******************
[883015.903622] quiet_error: 6 callbacks suppressed
[883015.903624] Buffer I/O error on device md0, logical block 50672
[883015.903628] Buffer I/O error on device md0, logical block 50672
[883015.903635] Buffer I/O error on device md0, logical block 50686
[883015.903638] Buffer I/O error on device md0, logical block 50686
[883015.903669] Buffer I/O error on device md0, logical block 50687
[883015.903672] Buffer I/O error on device md0, logical block 50687
[883015.903706] Buffer I/O error on device md0, logical block 50687
[883015.903710] Buffer I/O error on device md0, logical block 50687
[883015.903714] Buffer I/O error on device md0, logical block 50687
[883015.903717] Buffer I/O error on device md0, logical block 50687
[883052.136435] quiet_error: 6 callbacks suppressed
[883052.136439] Buffer I/O error on device md0, logical block 384
[883052.136442] lost page write due to I/O error on md0
[883052.136448] Buffer I/O error on device md0, logical block 385
[883052.136450] lost page write due to I/O error on md0
[883052.136454] Buffer I/O error on device md0, logical block 386
[883052.136456] lost page write due to I/O error on md0
[883052.136460] Buffer I/O error on device md0, logical block 387
[883052.136462] lost page write due to I/O error on md0
[883052.136466] Buffer I/O error on device md0, logical block 388
[883052.136468] lost page write due to I/O error on md0
[883052.136472] Buffer I/O error on device md0, logical block 389
[883052.136474] lost page write due to I/O error on md0
[883052.136478] Buffer I/O error on device md0, logical block 390
[883052.136480] lost page write due to I/O error on md0
[883052.136484] Buffer I/O error on device md0, logical block 391
[883052.136486] lost page write due to I/O error on md0
[883052.136492] Buffer I/O error on device md0, logical block 392
[883052.136494] lost page write due to I/O error on md0
[883052.136498] Buffer I/O error on device md0, logical block 393
[883052.136500] lost page write due to I/O error on md0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-24 13:37 bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption Jakub Husák
@ 2012-09-25  4:19 ` NeilBrown
  2012-09-25  5:00   ` Mikael Abrahamsson
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-09-25  4:19 UTC (permalink / raw)
  To: Jakub Husák; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 7197 bytes --]

On Mon, 24 Sep 2012 15:37:11 +0200 Jakub Husák <jakub@gooseman.cz> wrote:

> Hi, I have found a serious bug, that affects at least 4-disk md raid10 
> with far2 layout. The kernel allows it to run with two failed drives 
> silently without failing the whole array, despite it's not possible for 
> it to work correctly because of chunks distribution with far2 layout.
> The worst thing about it is that the write IO errors are invisible for 
> the file system and running processes, the written data are just lost, 
> only with IO errors reported in dmesg. Even force-reassembling ends up 
> with clean,degraded array, with TWO disks, ignoring that I've tried to 
> assemble it with all four devices. Recreating the array with 
> --assume-clean is the only way to put it together.

Why do you say that "the write IO errors are invisible for the filesystem"?
They are certainly reported in the kernel logs that you should and I'm sure
an application would see them if it checked return status properly.

md is behaving as designed here.  It deliberately does not fail the whole
array, it just fails those blocks which are no longer accessible.

NeilBrown



> 
> System:
> 
> Ubuntu 12.04
> Linux version 3.2.0-30-generic (buildd@batsu) (gcc version 4.6.3 
> (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #48-Ubuntu SMP Fri Aug 24 16:52:48 UTC 
> 2012
> mdadm - v3.2.5 - 18th May 2012
> 
> and
> 
> Debian 6.0
> Linux version 2.6.32-5-xen-amd64 (Debian 2.6.32-35) (dannf@debian.org) 
> (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Tue Jun 14 12:46:30 UTC 2011
> mdadm - v3.1.4 - 31st August 2010
> 
> and
> 
> Centos 6.3
> 
> 
> How to repeat:
> 
> dd if=/dev/zero of=d0 bs=1M count=100
> dd if=/dev/zero of=d1 bs=1M count=100
> dd if=/dev/zero of=d2 bs=1M count=100
> dd if=/dev/zero of=d3 bs=1M count=100
> losetup -f d0
> losetup -f d1
> losetup -f d2
> losetup -f d3
> 
> mdadm -C /dev/md0 --level=10 --raid-devices=4 --layout=f2 /dev/loop[0-3]
> 
> dd if=/dev/zero of=/dev/md0 bs=512K count=10
> 10+0 records in
> 10+0 records out
> 5242880 bytes (5,2 MB) copied, 0,0409824 s, 128 MB/s
> 
> OK
> 
> mdadm /dev/md0 --fail /dev/loop0
> mdadm /dev/md0 --fail /dev/loop3
> 
> mdadm -D /dev/md0
> /dev/md0:
>          Version : 1.2
>    Creation Time : Mon Sep 24 08:47:10 2012
>       Raid Level : raid10
>       Array Size : 202752 (198.03 MiB 207.62 MB)
>    Used Dev Size : 101376 (99.02 MiB 103.81 MB)
>     Raid Devices : 4
>    Total Devices : 4
>      Persistence : Superblock is persistent
> 
>      Update Time : Mon Sep 24 08:48:55 2012
>            State : clean, degraded                 <<< !!!!!!!!
>   Active Devices : 2
> Working Devices : 2
>   Failed Devices : 2
>    Spare Devices : 0
> 
>           Layout : far=2
>       Chunk Size : 512K
> 
>             Name : koubas-desktop:0  (local to host koubas-desktop)
>             UUID : 3ea4ded7:c10b1778:dc9f92aa:6e7cb196
>           Events : 21
> 
>      Number   Major   Minor   RaidDevice State
>         0       0        0        0 removed <<< !!!!!!!!
>         1       7        1        1      active sync   /dev/loop1
>         2       7        2        2      active sync   /dev/loop2
>         3       0        0        3 removed <<< !!!!!!!!
> 
>         0       7        0        -      faulty spare /dev/loop0
>         3       7        3        -      faulty spare /dev/loop3
> 
> dd if=/dev/zero of=/dev/md0 bs=512K count=10
> 10+0 records in
> 10+0 records out
> 5242880 bytes (5,2 MB) copied, 0,0245752 s, 213 MB/s
> echo $?
> 0    <<< !!!!!!!
> 
> dmesg:
> [883011.442366] md/raid10:md0: Disk failure on loop0, disabling device.
> [883011.442367] md/raid10:md0: Operation continuing on 3 devices.
> [883011.473292] RAID10 conf printout:
> [883011.473296]  --- wd:3 rd:4
> [883011.473299]  disk 0, wo:1, o:0, dev:loop0
> [883011.473301]  disk 1, wo:0, o:1, dev:loop1
> [883011.473302]  disk 2, wo:0, o:1, dev:loop2
> [883011.473304]  disk 3, wo:0, o:1, dev:loop3
> [883011.492046] RAID10 conf printout:
> [883011.492051]  --- wd:3 rd:4
> [883011.492054]  disk 1, wo:0, o:1, dev:loop1
> [883011.492056]  disk 2, wo:0, o:1, dev:loop2
> [883011.492058]  disk 3, wo:0, o:1, dev:loop3
> [883015.875089] md/raid10:md0: Disk failure on loop3, disabling device.
> [883015.875090] md/raid10:md0: Operation continuing on 2 devices. <<< !!!!!
> [883015.886686] RAID10 conf printout:
> [883015.886692]  --- wd:2 rd:4
> [883015.886695]  disk 1, wo:0, o:1, dev:loop1
> [883015.886697]  disk 2, wo:0, o:1, dev:loop2
> [883015.886699]  disk 3, wo:1, o:0, dev:loop3
> [883015.900018] RAID10 conf printout:
> [883015.900023]  --- wd:2 rd:4
> [883015.900025]  disk 1, wo:0, o:1, dev:loop1
> [883015.900027]  disk 2, wo:0, o:1, dev:loop2
> ************* "successful" dd follows: *******************
> [883015.903622] quiet_error: 6 callbacks suppressed
> [883015.903624] Buffer I/O error on device md0, logical block 50672
> [883015.903628] Buffer I/O error on device md0, logical block 50672
> [883015.903635] Buffer I/O error on device md0, logical block 50686
> [883015.903638] Buffer I/O error on device md0, logical block 50686
> [883015.903669] Buffer I/O error on device md0, logical block 50687
> [883015.903672] Buffer I/O error on device md0, logical block 50687
> [883015.903706] Buffer I/O error on device md0, logical block 50687
> [883015.903710] Buffer I/O error on device md0, logical block 50687
> [883015.903714] Buffer I/O error on device md0, logical block 50687
> [883015.903717] Buffer I/O error on device md0, logical block 50687
> [883052.136435] quiet_error: 6 callbacks suppressed
> [883052.136439] Buffer I/O error on device md0, logical block 384
> [883052.136442] lost page write due to I/O error on md0
> [883052.136448] Buffer I/O error on device md0, logical block 385
> [883052.136450] lost page write due to I/O error on md0
> [883052.136454] Buffer I/O error on device md0, logical block 386
> [883052.136456] lost page write due to I/O error on md0
> [883052.136460] Buffer I/O error on device md0, logical block 387
> [883052.136462] lost page write due to I/O error on md0
> [883052.136466] Buffer I/O error on device md0, logical block 388
> [883052.136468] lost page write due to I/O error on md0
> [883052.136472] Buffer I/O error on device md0, logical block 389
> [883052.136474] lost page write due to I/O error on md0
> [883052.136478] Buffer I/O error on device md0, logical block 390
> [883052.136480] lost page write due to I/O error on md0
> [883052.136484] Buffer I/O error on device md0, logical block 391
> [883052.136486] lost page write due to I/O error on md0
> [883052.136492] Buffer I/O error on device md0, logical block 392
> [883052.136494] lost page write due to I/O error on md0
> [883052.136498] Buffer I/O error on device md0, logical block 393
> [883052.136500] lost page write due to I/O error on md0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-25  4:19 ` NeilBrown
@ 2012-09-25  5:00   ` Mikael Abrahamsson
  2012-09-25  9:48     ` jakub
  0 siblings, 1 reply; 14+ messages in thread
From: Mikael Abrahamsson @ 2012-09-25  5:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jakub Husák, linux-raid

On Tue, 25 Sep 2012, NeilBrown wrote:

> Why do you say that "the write IO errors are invisible for the 
> filesystem"? They are certainly reported in the kernel logs that you 
> should and I'm sure an application would see them if it checked return 
> status properly.
>
> md is behaving as designed here.  It deliberately does not fail the 
> whole array, it just fails those blocks which are no longer accessible.

I would imagine OP would be helped by mounting filesystem with 
"errors=remount-ro" to make sure the filesystem stops writing to the 
drives upon errors.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-25  5:00   ` Mikael Abrahamsson
@ 2012-09-25  9:48     ` jakub
  2012-09-25 11:14       ` keld
  2012-09-25 12:32       ` NeilBrown
  0 siblings, 2 replies; 14+ messages in thread
From: jakub @ 2012-09-25  9:48 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: NeilBrown, linux-raid

On Tue, 25 Sep 2012 07:00:44 +0200 (CEST), Mikael Abrahamsson
<swmike@swm.pp.se> wrote:
> On Tue, 25 Sep 2012, NeilBrown wrote:
> 
>> Why do you say that "the write IO errors are invisible for the 
>> filesystem"? They are certainly reported in the kernel logs that you 
>> should and I'm sure an application would see them if it checked return 
>> status properly.
>>
>> md is behaving as designed here.  It deliberately does not fail the 
>> whole array, it just fails those blocks which are no longer accessible.
> 

Would you please refer to some documentation that this behaviour is
correct? I now tried to fail several disks in raid5, raid0 and raid10-near,
in case of r0 and r10n, mdadm didn't even allow me to remove more disks
than is sufficient to access all the data. In case of r5 I was able to fail
2 out of 3, but the array was correctly marked as FAILED and couldn't be
accessed at all. I'd expect that behaviour even in my case of raid10-far. I
can't even assmenble and run it with less than required count of disks.

> I would imagine OP would be helped by mounting filesystem with 
> "errors=remount-ro" to make sure the filesystem stops writing to the 
> drives upon errors.

Yes it's a good point.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-25  9:48     ` jakub
@ 2012-09-25 11:14       ` keld
  2012-09-25 11:47         ` John Robinson
  2012-09-25 12:32       ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: keld @ 2012-09-25 11:14 UTC (permalink / raw)
  To: jakub; +Cc: Mikael Abrahamsson, NeilBrown, linux-raid

On Tue, Sep 25, 2012 at 11:48:34AM +0200, jakub@gooseman.cz wrote:
> 
> On Tue, 25 Sep 2012 07:00:44 +0200 (CEST), Mikael Abrahamsson
> <swmike@swm.pp.se> wrote:
> > On Tue, 25 Sep 2012, NeilBrown wrote:
> > 
> >> Why do you say that "the write IO errors are invisible for the 
> >> filesystem"? They are certainly reported in the kernel logs that you 
> >> should and I'm sure an application would see them if it checked return 
> >> status properly.
> >>
> >> md is behaving as designed here.  It deliberately does not fail the 
> >> whole array, it just fails those blocks which are no longer accessible.
> > 
> 
> Would you please refer to some documentation that this behaviour is
> correct? I now tried to fail several disks in raid5, raid0 and raid10-near,
> in case of r0 and r10n, mdadm didn't even allow me to remove more disks
> than is sufficient to access all the data. In case of r5 I was able to fail
> 2 out of 3, but the array was correctly marked as FAILED and couldn't be
> accessed at all. I'd expect that behaviour even in my case of raid10-far. I
> can't even assmenble and run it with less than required count of disks.
> 
> 
> > I would imagine OP would be helped by mounting filesystem with 
> > "errors=remount-ro" to make sure the filesystem stops writing to the 
> > drives upon errors.
> 
> Yes it's a good point.

It would be be against the whole purpose of mirrored raid to put the raid in read-only 
mode when one disk fails.

A  mirrored raid should be able to still function also for writes,
when this is still possible. A raid10,far with 4 disks should be able to function with 2 
failed disks (in the best case). As long as all data is available and the 2 remaining functional
disks are OK, they should be able to function fully normally. Of cause there needs to
be warnings of the fact that 2 disks have failed. But it should not be recorded 
in the log for each write failed for the non-functioning disks. 

Best regards
keld 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-25 11:14       ` keld
@ 2012-09-25 11:47         ` John Robinson
  0 siblings, 0 replies; 14+ messages in thread
From: John Robinson @ 2012-09-25 11:47 UTC (permalink / raw)
  To: keld; +Cc: jakub, Mikael Abrahamsson, NeilBrown, linux-raid

On 25/09/2012 12:14, keld@keldix.com wrote:
[...]
> A  mirrored raid should be able to still function also for writes,
> when this is still possible. A raid10,far with 4 disks should be able to function with 2
> failed disks (in the best case). As long as all data is available and the 2 remaining functional
> disks are OK, they should be able to function fully normally. Of cause there needs to
> be warnings of the fact that 2 disks have failed. But it should not be recorded
> in the log for each write failed for the non-functioning disks.

Yes but in thise case, two adjacent devices have failed, so we cannot 
read from or write to a quarter of the array.

I think I agree that the OP's dd command, which would certainly have 
tried to write to an area of the array there was no backing store for, 
ought to have failed with an error message, and that the failure 
behaviour ought to be the same for f2 as it is for n2.

Cheers,

John.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-25  9:48     ` jakub
  2012-09-25 11:14       ` keld
@ 2012-09-25 12:32       ` NeilBrown
       [not found]         ` <50628B39.90205@gooseman.cz>
  1 sibling, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-09-25 12:32 UTC (permalink / raw)
  To: jakub; +Cc: Mikael Abrahamsson, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1405 bytes --]

On Tue, 25 Sep 2012 11:48:34 +0200 <jakub@gooseman.cz> wrote:

> 
> On Tue, 25 Sep 2012 07:00:44 +0200 (CEST), Mikael Abrahamsson
> <swmike@swm.pp.se> wrote:
> > On Tue, 25 Sep 2012, NeilBrown wrote:
> > 
> >> Why do you say that "the write IO errors are invisible for the 
> >> filesystem"? They are certainly reported in the kernel logs that you 
> >> should and I'm sure an application would see them if it checked return 
> >> status properly.
> >>
> >> md is behaving as designed here.  It deliberately does not fail the 
> >> whole array, it just fails those blocks which are no longer accessible.
> > 
> 
> Would you please refer to some documentation that this behaviour is
> correct? I now tried to fail several disks in raid5, raid0 and raid10-near,
> in case of r0 and r10n, mdadm didn't even allow me to remove more disks
> than is sufficient to access all the data. In case of r5 I was able to fail
> 2 out of 3, but the array was correctly marked as FAILED and couldn't be
> accessed at all. I'd expect that behaviour even in my case of raid10-far. I
> can't even assmenble and run it with less than required count of disks.
> 

Could you please be explicit about exactly how the behaviour that you think
of as "correct" would differ from the current behaviour?  Because I cannot
really see what point you are making - I need a little help.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <50628B39.90205@gooseman.cz>]

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
       [not found]         ` <50628B39.90205@gooseman.cz>
@ 2012-09-26  5:41           ` NeilBrown
  2012-09-26  8:28             ` keld
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-09-26  5:41 UTC (permalink / raw)
  To: Jakub Husák; +Cc: Mikael Abrahamsson, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2055 bytes --]

On Wed, 26 Sep 2012 06:57:29 +0200 Jakub Husák <jakub@gooseman.cz> wrote:

> On 25.9.2012 14:32, NeilBrown wrote:
> > On Tue, 25 Sep 2012 11:48:34 +0200 <jakub@gooseman.cz> wrote:
> >
> >>
> >> Would you please refer to some documentation that this behaviour is
> >> correct? I now tried to fail several disks in raid5, raid0 and raid10-near,
> >> in case of r0 and r10n, mdadm didn't even allow me to remove more disks
> >> than is sufficient to access all the data. In case of r5 I was able to fail
> >> 2 out of 3, but the array was correctly marked as FAILED and couldn't be
> >> accessed at all. I'd expect that behaviour even in my case of raid10-far. I
> >> can't even assmenble and run it with less than required count of disks.
> >>
> > Could you please be explicit about exactly how the behaviour that you think
> > of as "correct" would differ from the current behaviour?  Because I cannot
> > really see what point you are making - I need a little help.
> >
> > Thanks,
> > NeilBrown
> I think that when two adjacent drives fail, or the array is being 
> assembled with two adjacent drives missing, the status wouldn't be 
> "clean, degraded", the array "running"  and reporting some inaccessible 
> blocks when you try to use it - as it happens in my case of R10F.
> Instead, the array status would be "FAILED " and won't be allowed to 
> run. R0, R5, R10N behave in that manner (if i tested well), which I 
> consider correct.
> 
> The "degraded" status means, at lest for me, that the array is fully 
> functional, only with limited redundancy.
> R10 with far2 layout and four disks can't be only "degraded" when any 
> two disks are missing, unlike R10 near2 in some cases.
> 
> If something is still not clear, please be patient, i'll try to squeeze 
> maximum out of my torturous English ;)
> 
> Thaks

Ahh.... I see it now.
There is a bug in the 'enough' function in mdadm and in drivers/md/raid10.c
It doesn't handle 'far' layouts properly.

I'll sort out some patches.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-26  5:41           ` NeilBrown
@ 2012-09-26  8:28             ` keld
  2012-09-26  8:59               ` John Robinson
  0 siblings, 1 reply; 14+ messages in thread
From: keld @ 2012-09-26  8:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jakub Hus?k, Mikael Abrahamsson, linux-raid

On Wed, Sep 26, 2012 at 03:41:07PM +1000, NeilBrown wrote:
> On Wed, 26 Sep 2012 06:57:29 +0200 Jakub Hus?k <jakub@gooseman.cz> wrote:
> 
> > On 25.9.2012 14:32, NeilBrown wrote:
> > > On Tue, 25 Sep 2012 11:48:34 +0200 <jakub@gooseman.cz> wrote:
> > >
> > >>
> > >> Would you please refer to some documentation that this behaviour is
> > >> correct? I now tried to fail several disks in raid5, raid0 and raid10-near,
> > >> in case of r0 and r10n, mdadm didn't even allow me to remove more disks
> > >> than is sufficient to access all the data. In case of r5 I was able to fail
> > >> 2 out of 3, but the array was correctly marked as FAILED and couldn't be
> > >> accessed at all. I'd expect that behaviour even in my case of raid10-far. I
> > >> can't even assmenble and run it with less than required count of disks.
> > >>
> > > Could you please be explicit about exactly how the behaviour that you think
> > > of as "correct" would differ from the current behaviour?  Because I cannot
> > > really see what point you are making - I need a little help.
> > >
> > > Thanks,
> > > NeilBrown
> > I think that when two adjacent drives fail, or the array is being 
> > assembled with two adjacent drives missing, the status wouldn't be 
> > "clean, degraded", the array "running"  and reporting some inaccessible 
> > blocks when you try to use it - as it happens in my case of R10F.
> > Instead, the array status would be "FAILED " and won't be allowed to 
> > run. R0, R5, R10N behave in that manner (if i tested well), which I 
> > consider correct.
> > 
> > The "degraded" status means, at lest for me, that the array is fully 
> > functional, only with limited redundancy.
> > R10 with far2 layout and four disks can't be only "degraded" when any 
> > two disks are missing, unlike R10 near2 in some cases.
> > 
> > If something is still not clear, please be patient, i'll try to squeeze 
> > maximum out of my torturous English ;)
> > 
> > Thaks
> 
> Ahh.... I see it now.
> There is a bug in the 'enough' function in mdadm and in drivers/md/raid10.c
> It doesn't handle 'far' layouts properly.
> 
> I'll sort out some patches.

I also understand it now, I think. raid10,f2 with 4 disks cannot in the current implementation
survive 2 failing disks. We have discussed earlier how to implement raid10,far that would mean
better survival chances with more disks failing. This is not implemented yet.

Best regards
Keld

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-26  8:28             ` keld
@ 2012-09-26  8:59               ` John Robinson
  2012-09-26  9:08                 ` keld
  0 siblings, 1 reply; 14+ messages in thread
From: John Robinson @ 2012-09-26  8:59 UTC (permalink / raw)
  To: keld; +Cc: NeilBrown, Jakub Hus?k, Mikael Abrahamsson, linux-raid

On 26/09/2012 09:28, keld@keldix.com wrote:
[...]
> I also understand it now, I think. raid10,f2 with 4 disks cannot in the current implementation
> survive 2 failing disks.

It can, but only two non-adjacent discs. With 4 drives sda-sdd, than 
means you can only lose both sda and sdc, or sdb and sdd.

> We have discussed earlier how to implement raid10,far that would mean
> better survival chances with more disks failing. This is not implemented yet.

No, but even if/when it is, there will still be some combinations of two 
discs that you cannot afford to lose. The layout change to try to 
improve redundancy will not be generic, as it doesn't work for an odd 
number of discs, so the existing layout would have to be retained as an 
option.

Cheers,

John.

-- 
John Robinson, yuiop IT services
0131 557 9577 / 07771 784 058
46/12 Broughton Road, Edinburgh EH7 4EE

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-26  8:59               ` John Robinson
@ 2012-09-26  9:08                 ` keld
  2012-09-26  9:23                   ` keld
  0 siblings, 1 reply; 14+ messages in thread
From: keld @ 2012-09-26  9:08 UTC (permalink / raw)
  To: John Robinson; +Cc: NeilBrown, Jakub Hus?k, Mikael Abrahamsson, linux-raid

On Wed, Sep 26, 2012 at 09:59:42AM +0100, John Robinson wrote:
> On 26/09/2012 09:28, keld@keldix.com wrote:
> [...]
> >I also understand it now, I think. raid10,f2 with 4 disks cannot in the 
> >current implementation
> >survive 2 failing disks.
> 
> It can, but only two non-adjacent discs. With 4 drives sda-sdd, than 
> means you can only lose both sda and sdc, or sdb and sdd.

Agree

> >We have discussed earlier how to implement raid10,far that would mean
> >better survival chances with more disks failing. This is not implemented 
> >yet.
> 
> No, but even if/when it is, there will still be some combinations of two 
> discs that you cannot afford to lose. The layout change to try to 
> improve redundancy will not be generic, as it doesn't work for an odd 
> number of discs, so the existing layout would have to be retained as an 
> option.

Well, at least for backward compatibility we need an option for the current layout.

For odd number of disks, I do think we can improve the chances for more failing disks, as
discussed earlier.


best regards
keld

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
  2012-09-26  9:08                 ` keld
@ 2012-09-26  9:23                   ` keld
       [not found]                     ` <5067F014.5020600@gooseman.cz>
  0 siblings, 1 reply; 14+ messages in thread
From: keld @ 2012-09-26  9:23 UTC (permalink / raw)
  To: John Robinson; +Cc: NeilBrown, Jakub Hus?k, Mikael Abrahamsson, linux-raid

On Wed, Sep 26, 2012 at 11:08:03AM +0200, keld@keldix.com wrote:
> On Wed, Sep 26, 2012 at 09:59:42AM +0100, John Robinson wrote:
> > On 26/09/2012 09:28, keld@keldix.com wrote:
> > [...]
> > >We have discussed earlier how to implement raid10,far that would mean
> > >better survival chances with more disks failing. This is not implemented 
> > >yet.
> > 
> > No, but even if/when it is, there will still be some combinations of two 
> > discs that you cannot afford to lose. The layout change to try to 
> > improve redundancy will not be generic, as it doesn't work for an odd 
> > number of discs, so the existing layout would have to be retained as an 
> > option.
> 
> Well, at least for backward compatibility we need an option for the current layout.
> 
> For odd number of disks, I do think we can improve the chances for more failing disks, as
> discussed earlier.

To sum up: for raid10,f2 with odd numbers of disks, you can have a group of 3 disks and then
the rest of the disks ordered in pairs. Thus one disk in each of the pairs and one
disk in the 3-group all could fail and the raid would still be functional. 
This is almost the same improvement as for the even numbered raid10,f2.
The scheme is easily generalised to raids with more than 2 copies.

Best regards
Keld

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <5067F014.5020600@gooseman.cz>]

* Re: bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption
       [not found]                     ` <5067F014.5020600@gooseman.cz>
@ 2012-09-30 10:24                       ` keld
  0 siblings, 0 replies; 14+ messages in thread
From: keld @ 2012-09-30 10:24 UTC (permalink / raw)
  To: Jakub Hus?k; +Cc: John Robinson, NeilBrown, Mikael Abrahamsson, linux-raid

On Sun, Sep 30, 2012 at 09:09:08AM +0200, Jakub Hus?k wrote:
> Are you sure it's possible to design r10 f2 with the same performance 
> benefits of current implementation, which will be (in some cases) able 
> to survive 2-disk failure, when we talk about 4-disk array? IMHO it's 
> always a trade of between performance and redundancy. When you want to 
> survive with more failed drives, you can't spread the data and iops 
> among so many drives. But maybe it's just me trying to use the law of 
> conservation of energy in a wrong place :), I haven't studied raid 
> layouts so much.

Yes, I am sure we can design a better layout for raid10,far,
with the same performance characteristics..
The current layout can already survive 2 disk failures
for a 4 disk array, but the chances are only 33 % for surviving the 2 disk crash.
The new layout can survive in 66 % of the cases.
And the performance is completely the same.
That is why the new layout should be the default layout for raid10,far.
No performance difference but greatly improved survival chances.

> Just to make things clear: I have no problem with the redundancy level 
> of current r10f2, it was all just about the wrong handling of array failure.

Yes, that is understood. Anyway, my take is that both layouts need to be
supported by the kernel, for backwards compatibility. We need to have
kernel code that can handle all the existing raid10,far arrays already in production.

best regards
keld


> 
> Thanks
> Jakub Hus?k
> 
> On 26.9.2012 11:23, keld@keldix.com wrote:
> >On Wed, Sep 26, 2012 at 11:08:03AM +0200, keld@keldix.com wrote:
> >>On Wed, Sep 26, 2012 at 09:59:42AM +0100, John Robinson wrote:
> >>>On 26/09/2012 09:28, keld@keldix.com wrote:
> >>>[...]
> >>>>We have discussed earlier how to implement raid10,far that would mean
> >>>>better survival chances with more disks failing. This is not implemented
> >>>>yet.
> >>>No, but even if/when it is, there will still be some combinations of two
> >>>discs that you cannot afford to lose. The layout change to try to
> >>>improve redundancy will not be generic, as it doesn't work for an odd
> >>>number of discs, so the existing layout would have to be retained as an
> >>>option.
> >>Well, at least for backward compatibility we need an option for the 
> >>current layout.
> >>
> >>For odd number of disks, I do think we can improve the chances for more 
> >>failing disks, as
> >>discussed earlier.
> >To sum up: for raid10,f2 with odd numbers of disks, you can have a group 
> >of 3 disks and then
> >the rest of the disks ordered in pairs. Thus one disk in each of the pairs 
> >and one
> >disk in the 3-group all could fail and the raid would still be functional.
> >This is almost the same improvement as for the even numbered raid10,f2.
> >The scheme is easily generalised to raids with more than 2 copies.
> >
> >Best regards
> >Keld
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <50601CED.1050607@gooseman.cz>]

* bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing "silent" data corruption
       [not found] <50601CED.1050607@gooseman.cz>
@ 2012-09-24  8:46 ` Jakub Husák
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Husák @ 2012-09-24  8:46 UTC (permalink / raw)
  To: linux-raid

Hi, I have found a serious bug, that affects at least 4-disk md raid10 
with far2 layout. The kernel allows it to run with two failed drives 
silently without failing the whole array, despite it's not possible for 
it to work correctly because of chunks distribution with far2 layout.
The worst thing about it is that the write IO errors are invisible for 
the file system and running processes, the written data are just lost, 
only with IO errors reported in dmesg. Even force-reassembling ends up 
with clean,degraded array, with TWO disks, ignoring that I've tried to 
assemble it with all four devices. Recreating the array with 
--assume-clean is the only way to put it together.

System:

Ubuntu 12.04
Linux version 3.2.0-30-generic (buildd@batsu) (gcc version 4.6.3 
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #48-Ubuntu SMP Fri Aug 24 16:52:48 UTC 2012
mdadm - v3.2.5 - 18th May 2012

and

Debian 6.0
Linux version 2.6.32-5-xen-amd64 (Debian 2.6.32-35) (dannf@debian.org) 
(gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Tue Jun 14 12:46:30 UTC 2011
mdadm - v3.1.4 - 31st August 2010

and

Centos 6.3


How to repeat:

dd if=/dev/zero of=d0 bs=1M count=100
dd if=/dev/zero of=d1 bs=1M count=100
dd if=/dev/zero of=d2 bs=1M count=100
dd if=/dev/zero of=d3 bs=1M count=100
losetup -f d0
losetup -f d1
losetup -f d2
losetup -f d3

mdadm -C /dev/md0 --level=10 --raid-devices=4 --layout=f2 /dev/loop[0-3]

dd if=/dev/zero of=/dev/md0 bs=512K count=10
10+0 records in
10+0 records out
5242880 bytes (5,2 MB) copied, 0,0409824 s, 128 MB/s

OK

mdadm /dev/md0 --fail /dev/loop0
mdadm /dev/md0 --fail /dev/loop3

mdadm -D /dev/md0
/dev/md0:
         Version : 1.2
   Creation Time : Mon Sep 24 08:47:10 2012
      Raid Level : raid10
      Array Size : 202752 (198.03 MiB 207.62 MB)
   Used Dev Size : 101376 (99.02 MiB 103.81 MB)
    Raid Devices : 4
   Total Devices : 4
     Persistence : Superblock is persistent

     Update Time : Mon Sep 24 08:48:55 2012
           State : clean, degraded                 <<< !!!!!!!!
  Active Devices : 2
Working Devices : 2
  Failed Devices : 2
   Spare Devices : 0

          Layout : far=2
      Chunk Size : 512K

            Name : koubas-desktop:0  (local to host koubas-desktop)
            UUID : 3ea4ded7:c10b1778:dc9f92aa:6e7cb196
          Events : 21

     Number   Major   Minor   RaidDevice State
        0       0        0        0 removed                   <<< !!!!!!!!
        1       7        1        1      active sync   /dev/loop1
        2       7        2        2      active sync   /dev/loop2
        3       0        0        3 removed                   <<< !!!!!!!!

        0       7        0        -      faulty spare /dev/loop0
        3       7        3        -      faulty spare /dev/loop3

dd if=/dev/zero of=/dev/md0 bs=512K count=10
10+0 records in
10+0 records out
5242880 bytes (5,2 MB) copied, 0,0245752 s, 213 MB/s
echo $?
0    <<< !!!!!!!

dmesg:
[883011.442366] md/raid10:md0: Disk failure on loop0, disabling device.
[883011.442367] md/raid10:md0: Operation continuing on 3 devices.
[883011.473292] RAID10 conf printout:
[883011.473296]  --- wd:3 rd:4
[883011.473299]  disk 0, wo:1, o:0, dev:loop0
[883011.473301]  disk 1, wo:0, o:1, dev:loop1
[883011.473302]  disk 2, wo:0, o:1, dev:loop2
[883011.473304]  disk 3, wo:0, o:1, dev:loop3
[883011.492046] RAID10 conf printout:
[883011.492051]  --- wd:3 rd:4
[883011.492054]  disk 1, wo:0, o:1, dev:loop1
[883011.492056]  disk 2, wo:0, o:1, dev:loop2
[883011.492058]  disk 3, wo:0, o:1, dev:loop3
[883015.875089] md/raid10:md0: Disk failure on loop3, disabling device.
[883015.875090] md/raid10:md0: Operation continuing on 2 devices.  <<< !!!!!
[883015.886686] RAID10 conf printout:
[883015.886692]  --- wd:2 rd:4
[883015.886695]  disk 1, wo:0, o:1, dev:loop1
[883015.886697]  disk 2, wo:0, o:1, dev:loop2
[883015.886699]  disk 3, wo:1, o:0, dev:loop3
[883015.900018] RAID10 conf printout:
[883015.900023]  --- wd:2 rd:4
[883015.900025]  disk 1, wo:0, o:1, dev:loop1
[883015.900027]  disk 2, wo:0, o:1, dev:loop2
************* "successful" dd follows: *******************
[883015.903622] quiet_error: 6 callbacks suppressed
[883015.903624] Buffer I/O error on device md0, logical block 50672
[883015.903628] Buffer I/O error on device md0, logical block 50672
[883015.903635] Buffer I/O error on device md0, logical block 50686
[883015.903638] Buffer I/O error on device md0, logical block 50686
[883015.903669] Buffer I/O error on device md0, logical block 50687
[883015.903672] Buffer I/O error on device md0, logical block 50687
[883015.903706] Buffer I/O error on device md0, logical block 50687
[883015.903710] Buffer I/O error on device md0, logical block 50687
[883015.903714] Buffer I/O error on device md0, logical block 50687
[883015.903717] Buffer I/O error on device md0, logical block 50687
[883052.136435] quiet_error: 6 callbacks suppressed
[883052.136439] Buffer I/O error on device md0, logical block 384
[883052.136442] lost page write due to I/O error on md0
[883052.136448] Buffer I/O error on device md0, logical block 385
[883052.136450] lost page write due to I/O error on md0
[883052.136454] Buffer I/O error on device md0, logical block 386
[883052.136456] lost page write due to I/O error on md0
[883052.136460] Buffer I/O error on device md0, logical block 387
[883052.136462] lost page write due to I/O error on md0
[883052.136466] Buffer I/O error on device md0, logical block 388
[883052.136468] lost page write due to I/O error on md0
[883052.136472] Buffer I/O error on device md0, logical block 389
[883052.136474] lost page write due to I/O error on md0
[883052.136478] Buffer I/O error on device md0, logical block 390
[883052.136480] lost page write due to I/O error on md0
[883052.136484] Buffer I/O error on device md0, logical block 391
[883052.136486] lost page write due to I/O error on md0
[883052.136492] Buffer I/O error on device md0, logical block 392
[883052.136494] lost page write due to I/O error on md0
[883052.136498] Buffer I/O error on device md0, logical block 393
[883052.136500] lost page write due to I/O error on md0




^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-09-30 10:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-24 13:37 bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing silent data corruption Jakub Husák
2012-09-25  4:19 ` NeilBrown
2012-09-25  5:00   ` Mikael Abrahamsson
2012-09-25  9:48     ` jakub
2012-09-25 11:14       ` keld
2012-09-25 11:47         ` John Robinson
2012-09-25 12:32       ` NeilBrown
     [not found]         ` <50628B39.90205@gooseman.cz>
2012-09-26  5:41           ` NeilBrown
2012-09-26  8:28             ` keld
2012-09-26  8:59               ` John Robinson
2012-09-26  9:08                 ` keld
2012-09-26  9:23                   ` keld
     [not found]                     ` <5067F014.5020600@gooseman.cz>
2012-09-30 10:24                       ` keld
     [not found] <50601CED.1050607@gooseman.cz>
2012-09-24  8:46 ` bug: 4-disk md raid10 far2 can be assembled clean with only two disks, causing "silent" " Jakub Husák

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).