* want-replacement got stuck?
@ 2012-11-20 22:11 George Spelvin
2012-11-21 16:33 ` George Spelvin
2012-11-22 2:10 ` NeilBrown
0 siblings, 2 replies; 16+ messages in thread
From: George Spelvin @ 2012-11-20 22:11 UTC (permalink / raw)
To: linux-raid; +Cc: linux
I have a RAID10 array with 4 active + 1 spare.
Kernel is 3.6.5, x86-64 but running 32-bit userland.
After a recent failure on sdd2, the spare sdc2 was
activated and things looked something like (manual edit,
may not be perfectly faithful):
md5 : active raid10 sdd2[4](F) sdb2[1] sde2[2] sdc2[3] sda2[0]
725591552 blocks 256K chunks 2 near-copies [4/4] [UUUU]
bitmap: 50/173 pages [200KB], 2048KB chunk
smartctl -A showed 1 pending sector, but badblocks didn't
find it, so I decided to play with moving things back:
# badblocks -s -v /dev/sdd2
# mdadm /dev/md5 -r /dev/sdd2 -a /dev/sdd2
# echo want_replacement > /sys/block/md5/md/dev-sdc2/state
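For anyone following along, a minimal sketch of how to watch a replacement
like this in flight (assuming the same md5/sdc2/sdd2 names; these are all
standard md sysfs files) is:
# cat /proc/mdstat
(progress bar; the replacement shows up with an (R) marker)
# cat /sys/block/md5/md/sync_action
(reads "recovery" while the copy is running)
# cat /sys/block/md5/md/sync_completed
(sectors copied / total)
# cat /sys/block/md5/md/dev-sd?2/state
(per-device flags such as want_replacement and replacement)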
This ran for a while, but now it has stopped, with the following
configuration:
md5 : active raid10 sdd2[3](R) sdb2[1] sde2[2] sdc2[4](F) sda2[0]
725591552 blocks 256K chunks 2 near-copies [4/4] [UUU_]
bitmap: 50/173 pages [200KB], 2048KB chunk
# cat /sys/block/md5/md/dev-sd?2/state
in_sync
in_sync
faulty,want_replacement
in_sync,replacement
in_sync
I'm not quite sure how to interpret this state, and why it is showing
"4/4" good drives but [UUU_].
Unlike the failures that caused sdd2 to drop out, which were quite
verbose in the syslog, I can't see what caused the resync to stop.
Here's the initial failover:
Nov 20 11:49:06 science kernel: ata4: EH complete
Nov 20 11:49:06 science kernel: md/raid10:md5: read error corrected (8 sectors at 40 on sdd2)
Nov 20 11:49:06 science kernel: md/raid10:md5: sdd2: Raid device exceeded read_error threshold [cur 21:max 20]
Nov 20 11:49:06 science kernel: md/raid10:md5: sdd2: Failing raid device
Nov 20 11:49:06 science kernel: md/raid10:md5: Disk failure on sdd2, disabling device.
Nov 20 11:49:06 science kernel: md/raid10:md5: Operation continuing on 3 devices.
Nov 20 11:49:06 science kernel: RAID10 conf printout:
Nov 20 11:49:06 science kernel: --- wd:3 rd:4
Nov 20 11:49:06 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 11:49:06 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 11:49:06 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 11:49:06 science kernel: disk 3, wo:1, o:0, dev:sdd2
Nov 20 11:49:06 science kernel: RAID10 conf printout:
Nov 20 11:49:06 science kernel: --- wd:3 rd:4
Nov 20 11:49:06 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 11:49:06 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 11:49:06 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 11:49:06 science kernel: disk 3, wo:1, o:0, dev:sdd2
Nov 20 11:49:06 science kernel: RAID10 conf printout:
Nov 20 11:49:06 science kernel: --- wd:3 rd:4
Nov 20 11:49:06 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 11:49:06 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 11:49:06 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 11:49:06 science kernel: RAID10 conf printout:
Nov 20 11:49:06 science kernel: --- wd:3 rd:4
Nov 20 11:49:06 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 11:49:06 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 11:49:06 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 11:49:06 science kernel: disk 3, wo:1, o:1, dev:sdc2
Nov 20 11:49:06 science kernel: md: recovery of RAID array md5
Nov 20 11:49:06 science kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 20 11:49:06 science kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Nov 20 11:49:06 science kernel: md: using 128k window, over a total of 362795776k.
And its completion:
Nov 20 13:50:47 science kernel: md: md5: recovery done.
Nov 20 13:50:47 science kernel: RAID10 conf printout:
Nov 20 13:50:47 science kernel: --- wd:4 rd:4
Nov 20 13:50:47 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 13:50:47 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 13:50:47 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 13:50:47 science kernel: disk 3, wo:0, o:1, dev:sdc2
Here's where I remove and re-add sdd2:
Nov 20 16:34:01 science kernel: md: unbind<sdd2>
Nov 20 16:34:01 science kernel: md: export_rdev(sdd2)
Nov 20 16:34:11 science kernel: md: bind<sdd2>
Nov 20 16:34:12 science kernel: RAID10 conf printout:
Nov 20 16:34:12 science kernel: --- wd:4 rd:4
Nov 20 16:34:12 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 16:34:12 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 16:34:12 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 16:34:12 science kernel: disk 3, wo:0, o:1, dev:sdc2
And do the want_replacement:
Nov 20 16:38:07 science kernel: RAID10 conf printout:
Nov 20 16:38:07 science kernel: --- wd:4 rd:4
Nov 20 16:38:07 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 16:38:07 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 16:38:07 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 16:38:07 science kernel: disk 3, wo:0, o:1, dev:sdc2
Nov 20 16:38:07 science kernel: md: recovery of RAID array md5
Nov 20 16:38:07 science kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 20 16:38:07 science kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Nov 20 16:38:07 science kernel: md: using 128k window, over a total of 362795776k.
It appears to have completed:
Nov 20 18:40:01 science kernel: md: md5: recovery done.
Nov 20 18:40:01 science kernel: RAID10 conf printout:
Nov 20 18:40:01 science kernel: --- wd:4 rd:4
Nov 20 18:40:01 science kernel: disk 0, wo:0, o:1, dev:sda2
Nov 20 18:40:01 science kernel: disk 1, wo:0, o:1, dev:sdb2
Nov 20 18:40:01 science kernel: disk 2, wo:0, o:1, dev:sde2
Nov 20 18:40:01 science kernel: disk 3, wo:1, o:0, dev:sdc2
But as mentioned, the RAID state is a bit odd. sdc2 is still in the
array and sdd2 is not.
Can anyone suggest what is going on? Thank you!
* Re: want-replacement got stuck?
2012-11-20 22:11 want-replacement got stuck? George Spelvin
@ 2012-11-21 16:33 ` George Spelvin
2012-11-21 16:41 ` Roman Mamedov
` (2 more replies)
2012-11-22 2:10 ` NeilBrown
1 sibling, 3 replies; 16+ messages in thread
From: George Spelvin @ 2012-11-21 16:33 UTC (permalink / raw)
To: linux-raid, linux
Just to follow up to that earlier complaint, ext4 is now noticing some errors:
Nov 21 06:21:53 science kernel: EXT4-fs error (device md5): ext4_find_entry:1234: inode #5881516: comm rsync: checksumming directory block 0
Nov 21 07:57:03 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 4206: bad block bitmap checksum
Nov 21 08:41:37 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 3960: bad block bitmap checksum
Nov 21 08:45:18 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 4737: bad block bitmap checksum
Nov 21 08:50:16 science kernel: EXT4-fs error (device md5): ext4_mb_generate_buddy:741: group 4206, 5621 clusters in bitmap, 6888 in gd
Nov 21 08:50:16 science kernel: JBD2: Spotted dirty metadata buffer (dev = md5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
Nov 21 15:50:29 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm python: bg 4138: bad block bitmap checksum
Nov 21 16:21:00 science kernel: UDP: bad checksum. From 187.194.52.187:65535 to 71.41.210.146:6881 ulen 70
I also experienced transient corruption of the last few K of my incoming mailbox. (I.e. the last
couple of messages were overwritten with some other text file. This morning, it's fine.)
Something is definitely wonky here... I'm leaving it in the "stuck" state for a while
in case there's useful debugging info to be extracted, but I'm getting very alarmed by these
messages and want to reboot soon.
* Re: want-replacement got stuck?
2012-11-21 16:33 ` George Spelvin
@ 2012-11-21 16:41 ` Roman Mamedov
2012-11-21 18:08 ` George Spelvin
2012-11-21 19:21 ` joystick
2012-11-22 2:15 ` NeilBrown
2 siblings, 1 reply; 16+ messages in thread
From: Roman Mamedov @ 2012-11-21 16:41 UTC (permalink / raw)
To: George Spelvin; +Cc: linux-raid
On 21 Nov 2012 11:33:00 -0500
"George Spelvin" <linux@horizon.com> wrote:
> Nov 21 16:21:00 science kernel: UDP: bad checksum. From 187.194.52.187:65535 to 71.41.210.146:6881 ulen 70
This message means that your system got a checksum error when verifying a
UDP packet received from the network -- something completely unrelated to
Ext4, MD or disks in general.
I'd say you just have bad RAM/motherboard/CPU or some related general hardware
failure.
--
With respect,
Roman
~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."
* Re: want-replacement got stuck?
2012-11-21 16:41 ` Roman Mamedov
@ 2012-11-21 18:08 ` George Spelvin
0 siblings, 0 replies; 16+ messages in thread
From: George Spelvin @ 2012-11-21 18:08 UTC (permalink / raw)
To: linux, rm; +Cc: linux-raid
> This message means that your system got a checksum error when verifying a
> UDP packet received from the network -- something completely unrelated to
> Ext4, MD or disks in general.
>
> I'd say you just have bad RAM/motherboard/CPU or some related general hardware
> failure.
Please think for a second and realize how ridiculous a suggestion that is.
Network packet checksums exist precisely because networks sometimes
corrupt packets. It is not surprising to see one every few hours.
The system is running on ECC memory, with few errors logged (but not zero;
scrubbing logs work!) and has been very stable for years, including
lengthy mprime (prime95) runs.
How about the more likely alternative that the packet was actually
corrupted in the network? Or even sent with a bad checksum as part
of some sort of vulnerability scan?
Look at the packet:
UDP: bad checksum. From 187.194.52.187:65535 to 71.41.210.146:6881 ulen 70
Source: port 65535 is not a typical OS-chosen port number.
The source host appears to be an ISP in Mexico.
The destination port is a common bittorrent port, which is not in use.
This is some bot-net port-scanning.
The packet isn't in RAM for more than a few microseconds before the
checksum is verified and rejected. Any error in my machine would have
to be in the network card (in which case it wouldn't affect the disk),
or the RAM error rate would have to be so high it would insta-crash.
* Re: want-replacement got stuck?
2012-11-21 16:33 ` George Spelvin
2012-11-21 16:41 ` Roman Mamedov
@ 2012-11-21 19:21 ` joystick
2012-11-21 21:19 ` George Spelvin
2012-11-22 2:15 ` NeilBrown
2 siblings, 1 reply; 16+ messages in thread
From: joystick @ 2012-11-21 19:21 UTC (permalink / raw)
To: George Spelvin, linux-raid
On 11/21/12 17:33, George Spelvin wrote:
> Just to follow up to that earlier complaint, ext4 is now noticing some errors:
>
The following procedure MIGHT provide additional information (but it might
also change the state of the array, so I'm not 100% sure I'm suggesting
the best thing to do):
cat /sys/block/md5/md/mismatch_cnt
---> record it somewhere because it will be cleared
echo check > /sys/block/md5/md/sync_action
(wait for resync to finish)
(note that this should not alter the data on the disks, it's a read-only
procedure)
cat /sys/block/md5/md/mismatch_cnt
---> record it again now
this is relevant because if the hot replace procedure wrote wrong data
on one disk, it should be possible to determine that from parity as
described.
Another thing you can try, less invasive, is:
for i in /dev/sd[abcde]2 ; do echo $i ; mdadm -X $i ; done
This should tell you various things, including which flags are in the
metadata of each disk, and which disks believe themselves to be in the
array and which ones believe themselves to be out.
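Put together, the whole procedure would look roughly like this (a sketch
using the device names from this thread):
# cat /sys/block/md5/md/mismatch_cnt
(record it; the check overwrites it)
# echo check > /sys/block/md5/md/sync_action
# cat /proc/mdstat
(repeat until the data-check finishes)
# cat /sys/block/md5/md/mismatch_cnt
(compare with the recorded value)
# for i in /dev/sd[abcde]2 ; do echo $i ; mdadm -X $i ; done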
* Re: want-replacement got stuck?
2012-11-21 19:21 ` joystick
@ 2012-11-21 21:19 ` George Spelvin
2012-11-21 22:56 ` joystick
2012-11-22 3:25 ` George Spelvin
0 siblings, 2 replies; 16+ messages in thread
From: George Spelvin @ 2012-11-21 21:19 UTC (permalink / raw)
To: joystick, linux-raid, linux
Here are the results from your suggestions. The check produced something
interesting: it halted almost instantly, rather than doing anything.
# for i in /dev/sd[a-e]2 ; do echo ; mdadm -X /dev/$i ; done
Filename : /dev/sda2
Magic : 6d746962
Version : 4
UUID : 69952341:376cf679:a23623b9:31f68afb
Events : 8617657
Events Cleared : 8617657
State : OK
Chunksize : 2 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 725591552 (691.98 GiB 743.01 GB)
Bitmap : 354293 bits (chunks), 7421 dirty (2.1%)
Filename : /dev/sdb2
Magic : 6d746962
Version : 4
UUID : 69952341:376cf679:a23623b9:31f68afb
Events : 8617657
Events Cleared : 8617657
State : OK
Chunksize : 2 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 725591552 (691.98 GiB 743.01 GB)
Bitmap : 354293 bits (chunks), 7421 dirty (2.1%)
Filename : /dev/sdc2
Magic : 6d746962
Version : 4
UUID : 69952341:376cf679:a23623b9:31f68afb
Events : 8617655
Events Cleared : 8617653
State : OK
Chunksize : 2 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 725591552 (691.98 GiB 743.01 GB)
Bitmap : 354293 bits (chunks), 13 dirty (0.0%)
Filename : /dev/sdd2
Magic : 6d746962
Version : 4
UUID : 69952341:376cf679:a23623b9:31f68afb
Events : 8617657
Events Cleared : 8617657
State : OK
Chunksize : 2 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 725591552 (691.98 GiB 743.01 GB)
Bitmap : 354293 bits (chunks), 7421 dirty (2.1%)
Filename : /dev/sde2
Magic : 6d746962
Version : 4
UUID : 69952341:376cf679:a23623b9:31f68afb
Events : 8617657
Events Cleared : 8617657
State : OK
Chunksize : 2 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 725591552 (691.98 GiB 743.01 GB)
Bitmap : 354293 bits (chunks), 7421 dirty (2.1%)
# cat /sys/block/md5/md/mismatch_cnt
128
# cat /sys/block/md5/md/sync_action
idle
# echo check > /sys/block/md5/md/sync_action
Unexpected news: it does not appear to start! Immediately after:
# cat /sys/block/md5/md/sync_action
idle
# !echo ; !cat
check
# tail /var/log/kern.log
Nov 21 20:45:09 science kernel: md: data-check of RAID array md5
Nov 21 20:45:09 science kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 21 20:45:09 science kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov 21 20:45:09 science kernel: md: using 128k window, over a total of 725591552k.
Nov 21 20:45:09 science kernel: md: md5: data-check done.
Nov 21 20:45:45 science kernel: md: data-check of RAID array md5
Nov 21 20:45:45 science kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 21 20:45:45 science kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov 21 20:45:45 science kernel: md: using 128k window, over a total of 725591552k.
Nov 21 20:45:45 science kernel: md: md5: data-check done.
# echo check > /sys/block/md5/md/sync_action ; cat /sys/block/md5/md/sync_action
check
# echo check > /sys/block/md5/md/sync_action ; sleep 0.1 ; cat /sys/block/md5/md/sync_action
idle
# cat /sys/block/md5/md/mismatch_cnt
0
# echo check > /sys/block/md5/md/sync_action ; cat /proc/mdstat /sys/block/md5/md/sync_action
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md5 : active raid10 sdd2[3](R) sdb2[1] sde2[2] sdc2[4](F) sda2[0]
725591552 blocks 256K chunks 2 near-copies [4/4] [UUU_]
[===================>.] check = 99.9% (725591296/725591552) finish=0.2min speed=0K/sec
bitmap: 99/173 pages [396KB], 2048KB chunk
unused devices: <none>
check
I read sectors 725591296+/-1000 with hdparm --read-sector on sdc2 and sdd2 (they were
actually reads of /dev/sdc and /dev/sdd with the appropriate partition offset, obviously),
and saw no errors.
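For reference, the offset arithmetic for those reads was roughly the
following sketch; it ignores any md data offset and the 1K-block versus
512-byte-sector conversion, so treat it as approximate:
# start=$(cat /sys/block/sdd/sdd2/start)
(partition start, in 512-byte sectors)
# hdparm --read-sector $((start + 725591296)) /dev/sdd
# start=$(cat /sys/block/sdc/sdc2/start)
# hdparm --read-sector $((start + 725591296)) /dev/sdc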
* Re: want-replacement got stuck?
2012-11-21 21:19 ` George Spelvin
@ 2012-11-21 22:56 ` joystick
2012-11-22 3:25 ` George Spelvin
1 sibling, 0 replies; 16+ messages in thread
From: joystick @ 2012-11-21 22:56 UTC (permalink / raw)
To: George Spelvin; +Cc: linux-raid
On 11/21/12 22:19, George Spelvin wrote:
> Here are the results from your suggestions. The check produced something
> interesting: it halted almost instantly, rather than doing anything.
>
> # for i in /dev/sd[a-e]2 ; do echo ; mdadm -X /dev/$i ; done
>
> Filename : /dev/sda2
> Magic : 6d746962
> Version : 4
> UUID : 69952341:376cf679:a23623b9:31f68afb
> Events : 8617657
> Events Cleared : 8617657
> State : OK
> Chunksize : 2 MB
> Daemon : 5s flush period
> Write Mode : Normal
> Sync Size : 725591552 (691.98 GiB 743.01 GB)
> Bitmap : 354293 bits (chunks), 7421 dirty (2.1%)
>
Just this?
I think there should have been additional fields like "Device Role",
"Array State", "layout"...
try with --verbose maybe?
The Events count is extremely high; I don't have it higher than 25000 on
very active servers, so I'm not sure what it means. Also, one of your
devices has a slightly lower count, which would confirm that it has failed
(spares follow the count continuously). I don't know the MD code well;
you might look in the driver to see what kinds of events actually
increase the count.
Another test:
cat /sys/block/md5/md/degraded
returns 1 I suppose?
The fact that check returns immediately might indicate that the array is
indeed degraded. In that case it is correct that check cannot be
performed, because on a degraded array there is no parity/mirroring to
check/compare. The fact that for a brief instant you can see progress is
strange, though (you might have a look at the code in the driver to
understand that, but it's probably not so meaningful).
But the ext4 errors must come from elsewhere. The fact that they became
apparent only after a rebuild (to sdc2) might indicate that the source
disk (the mirror of sdd; I don't know precisely which drive that is in a
near-copies raid10) contained bad data, which was perhaps previously
masked because sdd was available and reads may have gone preferentially
to sdd (the algorithm usually chooses the nearest disk head, but who
knows...). In general your disks were in bad shape; you can tell that
from:
> Nov 20 11:49:06 science kernel: md/raid10:md5: sdd2: Raid device
exceeded read_error threshold [cur 21:max 20]
I would have replaced the disk at the 2nd or 3rd error at most; you got up
to 21. But even considering this, MD should probably have behaved
differently anyway.
My guess is that the hot-replace to sdd failed (sdd failed during
hot-replace), but this error was not properly handled by MD (*). This is
the first time somebody has reported on the ML a failure of the
destination drive during hot-replace, so there is not much experience;
you are a pioneer.
(*) It might have erroneously failed sdc instead of failing sdd, for
example, which would look as if the hot-replace had succeeded even though
sdd wouldn't actually contain correct data...
For the rest I don't really know what to say, except that it doesn't
look right. Let's hope Neil pops up.
* Re: want-replacement got stuck?
2012-11-21 21:19 ` George Spelvin
2012-11-21 22:56 ` joystick
@ 2012-11-22 3:25 ` George Spelvin
2012-11-22 4:22 ` NeilBrown
1 sibling, 1 reply; 16+ messages in thread
From: George Spelvin @ 2012-11-22 3:25 UTC (permalink / raw)
To: neilb; +Cc: joystick, linux-raid, linux
Some more information...
From the "stuck" state, I rebooted the machine. It came up with
md5 : active raid10 sde2[2] sdd2[3] sda2[0] sdb2[1]
725591552 blocks 256K chunks 2 near-copies [4/4] [UUUU]
bitmap: 172/173 pages [688KB], 2048KB chunk
and e2fsck found severe problems, like multiply-referenced blocks.
I compared sdd2 and sde2 with cmp, and it found tons of
differences. So I knew what the problem was. All I had to do
was pick the right one to fail.
Fortunately, I had the last RAID config on the screen of the
machine I had sshed in from, and decided I trusted sdd2 less,
so I failed it.
After flushing the device cache (hdparm -f /dev/md5), the errors
went away! I was left with only what the original e2fsck -p had done
before halting. (Namely, some updates to i_blocks.)
Now I've zeroed sdd2's superblock and added it back, and things seem
to be working okay.
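In command form, the sdd2 part of that boiled down to roughly this sketch
(the explicit --remove step is an assumption on my part; the e2fsck runs
are as described above):
# mdadm /dev/md5 --fail /dev/sdd2
# hdparm -f /dev/md5
(flush the block-device cache holding the stale reads)
# mdadm /dev/md5 --remove /dev/sdd2
# mdadm --zero-superblock /dev/sdd2
# mdadm /dev/md5 --add /dev/sdd2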
NeilBrown <neilb@suse.de> wrote:
> Yes.... this is a real worry. Fortunately I know what is causing it.
Yay! Tell me when you have a patch to test.
> Meanwhile you have a corrupted filesystem. Sorry.
> The nature of the corruption is that since the replacement finished
> no writes have gone to slot-3 at all. So if md ever decides to read
> from slot 3 it will get stale data.
That's sort of what the pattern of errors looked like.
> I suggest you fail sdd2, reboot, make sure only sda2, sdb2 and sde2 are
> in the array, run fsck, and then if it seems happy enough, add sdc2
> and/or sdd2 back in so they rebuild completely.
I did this in a sort of bass-ackward way, but I accomplished it in
the end. And no data loss. Yippee!
> Thanks for helping to make md better by risking your data :-)
I'm just glad I suffered less damage than my recent ext4 resizing
experiments, which were.... not completely successful.
Anyway, thanks for the help, and all the hard work.
* Re: want-replacement got stuck?
2012-11-22 3:25 ` George Spelvin
@ 2012-11-22 4:22 ` NeilBrown
2012-11-22 5:27 ` George Spelvin
0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2012-11-22 4:22 UTC (permalink / raw)
To: George Spelvin; +Cc: joystick, linux-raid
On 21 Nov 2012 22:25:04 -0500 "George Spelvin" <linux@horizon.com> wrote:
...
> Now I've zeroed sdd2's superblock and added it back, and things seem
> to be working okay.
All's well that ends well :-)
>
>
> NeilBrown <neilb@suse.de> wrote:
> > Yes.... this is a real worry. Fortunately I know what is causing it.
>
> Yay! Tell me when you have a patch to test.
OK, below are two patches.
The first fixes the data corruption race. You'll almost never hit it without
the bug that the second patch fixes, but in theory you could. Without the
second patch you can hit it easily.
The second patch fixes the problem where the replacement device stayed a
replacement even after the replacement operation completed.
The symptoms are easy to reproduce - once you know what to look for - and the
patches work for me so I'll be sending them to Linus tomorrow. But extra
testing always helps.
Thanks,
NeilBrown
commit e7c0c3fa29280d62aa5e11101a674bb3064bd791
Author: NeilBrown <neilb@suse.de>
Date: Thu Nov 22 14:42:49 2012 +1100
md/raid10: close race that loses writes when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty device should be removed
and the replacement moved into place very quickly, but it isn't instant.
So the code that writes out to the array must handle the possibility
that the only working device for some slot is the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither is present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d1295af..ad03251 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1334,18 +1334,21 @@ retry_write:
blocked_rdev = rrdev;
break;
}
+ if (rdev && (test_bit(Faulty, &rdev->flags)
+ || test_bit(Unmerged, &rdev->flags)))
+ rdev = NULL;
if (rrdev && (test_bit(Faulty, &rrdev->flags)
|| test_bit(Unmerged, &rrdev->flags)))
rrdev = NULL;
r10_bio->devs[i].bio = NULL;
r10_bio->devs[i].repl_bio = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags)) {
+
+ if (!rdev && !rrdev) {
set_bit(R10BIO_Degraded, &r10_bio->state);
continue;
}
- if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
sector_t first_bad;
sector_t dev_sector = r10_bio->devs[i].addr;
int bad_sectors;
@@ -1387,8 +1390,10 @@ retry_write:
max_sectors = good_sectors;
}
}
- r10_bio->devs[i].bio = bio;
- atomic_inc(&rdev->nr_pending);
+ if (rdev) {
+ r10_bio->devs[i].bio = bio;
+ atomic_inc(&rdev->nr_pending);
+ }
if (rrdev) {
r10_bio->devs[i].repl_bio = bio;
atomic_inc(&rrdev->nr_pending);
@@ -1444,69 +1449,71 @@ retry_write:
for (i = 0; i < conf->copies; i++) {
struct bio *mbio;
int d = r10_bio->devs[i].devnum;
- if (!r10_bio->devs[i].bio)
- continue;
+ if (r10_bio->devs[i].bio) {
+ struct md_rdev *rdev = conf->mirrors[d].rdev;
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
+ r10_bio->devs[i].bio = mbio;
+
+ mbio->bi_sector = (r10_bio->devs[i].addr+
+ choose_data_offset(r10_bio,
+ rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ mbio->bi_rw = WRITE | do_sync | do_fua | do_discard;
+ mbio->bi_private = r10_bio;
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
- max_sectors);
- r10_bio->devs[i].bio = mbio;
+ atomic_inc(&r10_bio->remaining);
- mbio->bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio,
- conf->mirrors[d].rdev));
- mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync | do_fua | do_discard;
- mbio->bi_private = r10_bio;
+ cb = blk_check_plugged(raid10_unplug, mddev,
+ sizeof(*plug));
+ if (cb)
+ plug = container_of(cb, struct raid10_plug_cb,
+ cb);
+ else
+ plug = NULL;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (plug) {
+ bio_list_add(&plug->pending, mbio);
+ plug->pending_cnt++;
+ } else {
+ bio_list_add(&conf->pending_bio_list, mbio);
+ conf->pending_count++;
+ }
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!plug)
+ md_wakeup_thread(mddev->thread);
+ }
- atomic_inc(&r10_bio->remaining);
+ if (r10_bio->devs[i].repl_bio) {
+ struct md_rdev *rdev = conf->mirrors[d].replacement;
+ if (rdev == NULL) {
+ /* Replacement just got moved to main 'rdev' */
+ smp_mb();
+ rdev = conf->mirrors[d].rdev;
+ }
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
+ r10_bio->devs[i].repl_bio = mbio;
+
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(
+ r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ mbio->bi_rw = WRITE | do_sync | do_fua | do_discard;
+ mbio->bi_private = r10_bio;
- cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
- if (cb)
- plug = container_of(cb, struct raid10_plug_cb, cb);
- else
- plug = NULL;
- spin_lock_irqsave(&conf->device_lock, flags);
- if (plug) {
- bio_list_add(&plug->pending, mbio);
- plug->pending_cnt++;
- } else {
+ atomic_inc(&r10_bio->remaining);
+ spin_lock_irqsave(&conf->device_lock, flags);
bio_list_add(&conf->pending_bio_list, mbio);
conf->pending_count++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!mddev_check_plugged(mddev))
+ md_wakeup_thread(mddev->thread);
}
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!plug)
- md_wakeup_thread(mddev->thread);
-
- if (!r10_bio->devs[i].repl_bio)
- continue;
-
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
- max_sectors);
- r10_bio->devs[i].repl_bio = mbio;
-
- /* We are actively writing to the original device
- * so it cannot disappear, so the replacement cannot
- * become NULL here
- */
- mbio->bi_sector = (r10_bio->devs[i].addr +
- choose_data_offset(
- r10_bio,
- conf->mirrors[d].replacement));
- mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync | do_fua | do_discard;
- mbio->bi_private = r10_bio;
-
- atomic_inc(&r10_bio->remaining);
- spin_lock_irqsave(&conf->device_lock, flags);
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!mddev_check_plugged(mddev))
- md_wakeup_thread(mddev->thread);
}
/* Don't remove the bias on 'remaining' (one_write_done) until
commit 884162df2aadd7414bef4935e1a54976fd4e3988
Author: NeilBrown <neilb@suse.de>
Date: Thu Nov 22 15:12:09 2012 +1100
md/raid10: decrement correct pending counter when writing to replacement.
When a write to a replacement device completes, we carefully
and correctly found the rdev that the write actually went to
and then blithely called rdev_dec_pending on the primary rdev,
even if this write was to the replacement.
This means that any writes to an array while a replacement
was ongoing would cause the nr_pending count for the primary
device to go negative, so it could never be removed.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ad03251..0d5d0ff 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -499,7 +499,7 @@ static void raid10_end_write_request(struct bio *bio, int error)
*/
one_write_done(r10_bio);
if (dec_rdev)
- rdev_dec_pending(conf->mirrors[dev].rdev, conf->mddev);
+ rdev_dec_pending(rdev, conf->mddev);
}
/*
* Re: want-replacement got stuck?
2012-11-22 4:22 ` NeilBrown
@ 2012-11-22 5:27 ` George Spelvin
2012-11-22 5:39 ` George Spelvin
0 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2012-11-22 5:27 UTC (permalink / raw)
To: linux, neilb; +Cc: joystick, linux-raid
NeilBrown <neilb@suse.de> wrote:
> OK, below are two patches.
> The first fixes the data corruption race. You'll almost never hit it without
> the bug that the second patch fixes, but in theory you could. Without the
> second patch you can hit it easily.
This one doesn't apply cleanly to 3.6.7. In particular, the large third hunk where
you wrap some stuff in "if" that used to be handled by "continue".
The first difference is "mbio->bi_rw = WRITE | do_sync | do_fua | do_discard;",
which is missing the "do_discard" part in my tree. Not a biggie.
The second difference, which I'm less confident of, is the entire
"cb = blk_check_plugged(...); if (cb) plug = container_of(...)" block,
which is also missing from 3.6.7. Instead, I just have
if (!mddev_check_plugged(mddev))
md_wakeup_thread(mddev->thread);
Is it safe to just indent that inside the if() and leave it?
Here's the patch I ended up applying. First, the -b version (ignore
whitespace changes), which makes it easier to read, then the real thing.
(I also un-line-wrapped two instances of choose_data_offset that no
longer needed to be broken to fit in 80 columns.)
Does this look okay to you? I'll reboot and test if you say so.
commit 1611c6944449fff9cfbf2a96db30d6d9e81f1979
Author: NeilBrown <neilb@suse.de>
Date: Thu Nov 22 14:42:49 2012 +1100
md/raid10: close race that loses writes when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty device should be removed
and the replacement moved into place very quickly, but it isn't instant.
So the code that writes out to the array must handle the possibility
that the only working device for some slot is the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither is present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
diff -b --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a48c215..60e5a65 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1287,18 +1287,21 @@ retry_write:
blocked_rdev = rrdev;
break;
}
+ if (rdev && (test_bit(Faulty, &rdev->flags)
+ || test_bit(Unmerged, &rdev->flags)))
+ rdev = NULL;
if (rrdev && (test_bit(Faulty, &rrdev->flags)
|| test_bit(Unmerged, &rrdev->flags)))
rrdev = NULL;
r10_bio->devs[i].bio = NULL;
r10_bio->devs[i].repl_bio = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags)) {
+
+ if (!rdev && !rrdev) {
set_bit(R10BIO_Degraded, &r10_bio->state);
continue;
}
- if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
sector_t first_bad;
sector_t dev_sector = r10_bio->devs[i].addr;
int bad_sectors;
@@ -1340,8 +1343,10 @@ retry_write:
max_sectors = good_sectors;
}
}
+ if (rdev) {
r10_bio->devs[i].bio = bio;
atomic_inc(&rdev->nr_pending);
+ }
if (rrdev) {
r10_bio->devs[i].repl_bio = bio;
atomic_inc(&rrdev->nr_pending);
@@ -1400,15 +1405,16 @@ retry_write:
if (!r10_bio->devs[i].bio)
continue;
+ if (r10_bio->devs[i].bio) {
+ struct md_rdev *rdev = conf->mirrors[d].rdev;
mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
max_sectors);
r10_bio->devs[i].bio = mbio;
- mbio->bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio,
- conf->mirrors[d].rdev));
- mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;
@@ -1420,24 +1426,23 @@ retry_write:
spin_unlock_irqrestore(&conf->device_lock, flags);
if (!mddev_check_plugged(mddev))
md_wakeup_thread(mddev->thread);
+ }
- if (!r10_bio->devs[i].repl_bio)
- continue;
-
+ if (r10_bio->devs[i].repl_bio) {
+ struct md_rdev *rdev = conf->mirrors[d].replacement;
+ if (rdev == NULL) {
+ /* Replacement just got moved to main 'rdev' */
+ smp_mb();
+ rdev = conf->mirrors[d].rdev;
+ }
mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
max_sectors);
r10_bio->devs[i].repl_bio = mbio;
- /* We are actively writing to the original device
- * so it cannot disappear, so the replacement cannot
- * become NULL here
- */
mbio->bi_sector = (r10_bio->devs[i].addr +
- choose_data_offset(
- r10_bio,
- conf->mirrors[d].replacement));
- mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;
@@ -1450,6 +1455,7 @@ retry_write:
if (!mddev_check_plugged(mddev))
md_wakeup_thread(mddev->thread);
}
+ }
/* Don't remove the bias on 'remaining' (one_write_done) until
* after checking if we need to go around again.
commit 1611c6944449fff9cfbf2a96db30d6d9e81f1979
Author: NeilBrown <neilb@suse.de>
Date: Thu Nov 22 14:42:49 2012 +1100
md/raid10: close race that loses writes when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty device should be removed
and the replacement moved into place very quickly, but it isn't instant.
So the code that writes out to the array must handle the possibility
that the only working device for some slot is the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither is present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a48c215..60e5a65 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1287,18 +1287,21 @@ retry_write:
blocked_rdev = rrdev;
break;
}
+ if (rdev && (test_bit(Faulty, &rdev->flags)
+ || test_bit(Unmerged, &rdev->flags)))
+ rdev = NULL;
if (rrdev && (test_bit(Faulty, &rrdev->flags)
|| test_bit(Unmerged, &rrdev->flags)))
rrdev = NULL;
r10_bio->devs[i].bio = NULL;
r10_bio->devs[i].repl_bio = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags)) {
+
+ if (!rdev && !rrdev) {
set_bit(R10BIO_Degraded, &r10_bio->state);
continue;
}
- if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
sector_t first_bad;
sector_t dev_sector = r10_bio->devs[i].addr;
int bad_sectors;
@@ -1340,8 +1343,10 @@ retry_write:
max_sectors = good_sectors;
}
}
- r10_bio->devs[i].bio = bio;
- atomic_inc(&rdev->nr_pending);
+ if (rdev) {
+ r10_bio->devs[i].bio = bio;
+ atomic_inc(&rdev->nr_pending);
+ }
if (rrdev) {
r10_bio->devs[i].repl_bio = bio;
atomic_inc(&rrdev->nr_pending);
@@ -1400,55 +1405,56 @@ retry_write:
if (!r10_bio->devs[i].bio)
continue;
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
- max_sectors);
- r10_bio->devs[i].bio = mbio;
-
- mbio->bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio,
- conf->mirrors[d].rdev));
- mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync | do_fua;
- mbio->bi_private = r10_bio;
-
- atomic_inc(&r10_bio->remaining);
- spin_lock_irqsave(&conf->device_lock, flags);
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!mddev_check_plugged(mddev))
- md_wakeup_thread(mddev->thread);
-
- if (!r10_bio->devs[i].repl_bio)
- continue;
+ if (r10_bio->devs[i].bio) {
+ struct md_rdev *rdev = conf->mirrors[d].rdev;
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
+ r10_bio->devs[i].bio = mbio;
+
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
+ mbio->bi_private = r10_bio;
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
- max_sectors);
- r10_bio->devs[i].repl_bio = mbio;
+ atomic_inc(&r10_bio->remaining);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio_list_add(&conf->pending_bio_list, mbio);
+ conf->pending_count++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!mddev_check_plugged(mddev))
+ md_wakeup_thread(mddev->thread);
+ }
- /* We are actively writing to the original device
- * so it cannot disappear, so the replacement cannot
- * become NULL here
- */
- mbio->bi_sector = (r10_bio->devs[i].addr +
- choose_data_offset(
- r10_bio,
- conf->mirrors[d].replacement));
- mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync | do_fua;
- mbio->bi_private = r10_bio;
+ if (r10_bio->devs[i].repl_bio) {
+ struct md_rdev *rdev = conf->mirrors[d].replacement;
+ if (rdev == NULL) {
+ /* Replacement just got moved to main 'rdev' */
+ smp_mb();
+ rdev = conf->mirrors[d].rdev;
+ }
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
+ r10_bio->devs[i].repl_bio = mbio;
+
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
+ mbio->bi_private = r10_bio;
- atomic_inc(&r10_bio->remaining);
- spin_lock_irqsave(&conf->device_lock, flags);
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!mddev_check_plugged(mddev))
- md_wakeup_thread(mddev->thread);
+ atomic_inc(&r10_bio->remaining);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio_list_add(&conf->pending_bio_list, mbio);
+ conf->pending_count++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!mddev_check_plugged(mddev))
+ md_wakeup_thread(mddev->thread);
+ }
}
/* Don't remove the bias on 'remaining' (one_write_done) until
* Re: want-replacement got stuck?
2012-11-22 5:27 ` George Spelvin
@ 2012-11-22 5:39 ` George Spelvin
2012-11-22 5:47 ` NeilBrown
0 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2012-11-22 5:39 UTC (permalink / raw)
To: linux, neilb; +Cc: joystick, linux-raid
Oops! My manual patch application forgot a few lines. (I left the
"if (!r10_bio->devs[i].repl_bio) continue;" code in place while adding
its replacement.)
Here's a corrected patch. (Again, -b, then the real thing.)
commit 0fbb6ffcf9485ac94c57759849ba415352c131da
Author: NeilBrown <neilb@suse.de>
Date: Thu Nov 22 14:42:49 2012 +1100
md/raid10: close race that loses writes when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty device should be removed
and the replacement moved into place very quickly, but it isn't instant.
So the code that writes out to the array must handle the possibility
that the only working device for some slot is the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither is present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a48c215..06a359b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1287,18 +1287,21 @@ retry_write:
blocked_rdev = rrdev;
break;
}
+ if (rdev && (test_bit(Faulty, &rdev->flags)
+ || test_bit(Unmerged, &rdev->flags)))
+ rdev = NULL;
if (rrdev && (test_bit(Faulty, &rrdev->flags)
|| test_bit(Unmerged, &rrdev->flags)))
rrdev = NULL;
r10_bio->devs[i].bio = NULL;
r10_bio->devs[i].repl_bio = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags)) {
+
+ if (!rdev && !rrdev) {
set_bit(R10BIO_Degraded, &r10_bio->state);
continue;
}
- if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
sector_t first_bad;
sector_t dev_sector = r10_bio->devs[i].addr;
int bad_sectors;
@@ -1340,8 +1343,10 @@ retry_write:
max_sectors = good_sectors;
}
}
+ if (rdev) {
r10_bio->devs[i].bio = bio;
atomic_inc(&rdev->nr_pending);
+ }
if (rrdev) {
r10_bio->devs[i].repl_bio = bio;
atomic_inc(&rrdev->nr_pending);
@@ -1397,18 +1402,17 @@ retry_write:
for (i = 0; i < conf->copies; i++) {
struct bio *mbio;
int d = r10_bio->devs[i].devnum;
- if (!r10_bio->devs[i].bio)
- continue;
+ if (r10_bio->devs[i].bio) {
+ struct md_rdev *rdev = conf->mirrors[d].rdev;
mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
max_sectors);
r10_bio->devs[i].bio = mbio;
- mbio->bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio,
- conf->mirrors[d].rdev));
- mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;
@@ -1420,24 +1424,23 @@ retry_write:
spin_unlock_irqrestore(&conf->device_lock, flags);
if (!mddev_check_plugged(mddev))
md_wakeup_thread(mddev->thread);
+ }
- if (!r10_bio->devs[i].repl_bio)
- continue;
-
+ if (r10_bio->devs[i].repl_bio) {
+ struct md_rdev *rdev = conf->mirrors[d].replacement;
+ if (rdev == NULL) {
+ /* Replacement just got moved to main 'rdev' */
+ smp_mb();
+ rdev = conf->mirrors[d].rdev;
+ }
mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
max_sectors);
r10_bio->devs[i].repl_bio = mbio;
- /* We are actively writing to the original device
- * so it cannot disappear, so the replacement cannot
- * become NULL here
- */
mbio->bi_sector = (r10_bio->devs[i].addr +
- choose_data_offset(
- r10_bio,
- conf->mirrors[d].replacement));
- mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;
@@ -1450,6 +1453,7 @@ retry_write:
if (!mddev_check_plugged(mddev))
md_wakeup_thread(mddev->thread);
}
+ }
/* Don't remove the bias on 'remaining' (one_write_done) until
* after checking if we need to go around again.
commit 0fbb6ffcf9485ac94c57759849ba415352c131da
Author: NeilBrown <neilb@suse.de>
Date: Thu Nov 22 14:42:49 2012 +1100
md/raid10: close race that loses writes when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty device should be removed
and the replacement moved into place very quickly, but it isn't instant.
So the code that writes out to the array must handle the possibility
that the only working device for some slot is the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither is present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a48c215..06a359b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1287,18 +1287,21 @@ retry_write:
blocked_rdev = rrdev;
break;
}
+ if (rdev && (test_bit(Faulty, &rdev->flags)
+ || test_bit(Unmerged, &rdev->flags)))
+ rdev = NULL;
if (rrdev && (test_bit(Faulty, &rrdev->flags)
|| test_bit(Unmerged, &rrdev->flags)))
rrdev = NULL;
r10_bio->devs[i].bio = NULL;
r10_bio->devs[i].repl_bio = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags)) {
+
+ if (!rdev && !rrdev) {
set_bit(R10BIO_Degraded, &r10_bio->state);
continue;
}
- if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
sector_t first_bad;
sector_t dev_sector = r10_bio->devs[i].addr;
int bad_sectors;
@@ -1340,8 +1343,10 @@ retry_write:
max_sectors = good_sectors;
}
}
- r10_bio->devs[i].bio = bio;
- atomic_inc(&rdev->nr_pending);
+ if (rdev) {
+ r10_bio->devs[i].bio = bio;
+ atomic_inc(&rdev->nr_pending);
+ }
if (rrdev) {
r10_bio->devs[i].repl_bio = bio;
atomic_inc(&rrdev->nr_pending);
@@ -1397,58 +1402,57 @@ retry_write:
for (i = 0; i < conf->copies; i++) {
struct bio *mbio;
int d = r10_bio->devs[i].devnum;
- if (!r10_bio->devs[i].bio)
- continue;
-
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
- max_sectors);
- r10_bio->devs[i].bio = mbio;
- mbio->bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio,
- conf->mirrors[d].rdev));
- mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync | do_fua;
- mbio->bi_private = r10_bio;
-
- atomic_inc(&r10_bio->remaining);
- spin_lock_irqsave(&conf->device_lock, flags);
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!mddev_check_plugged(mddev))
- md_wakeup_thread(mddev->thread);
-
- if (!r10_bio->devs[i].repl_bio)
- continue;
+ if (r10_bio->devs[i].bio) {
+ struct md_rdev *rdev = conf->mirrors[d].rdev;
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
+ r10_bio->devs[i].bio = mbio;
+
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
+ mbio->bi_private = r10_bio;
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
- max_sectors);
- r10_bio->devs[i].repl_bio = mbio;
+ atomic_inc(&r10_bio->remaining);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio_list_add(&conf->pending_bio_list, mbio);
+ conf->pending_count++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!mddev_check_plugged(mddev))
+ md_wakeup_thread(mddev->thread);
+ }
- /* We are actively writing to the original device
- * so it cannot disappear, so the replacement cannot
- * become NULL here
- */
- mbio->bi_sector = (r10_bio->devs[i].addr +
- choose_data_offset(
- r10_bio,
- conf->mirrors[d].replacement));
- mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync | do_fua;
- mbio->bi_private = r10_bio;
+ if (r10_bio->devs[i].repl_bio) {
+ struct md_rdev *rdev = conf->mirrors[d].replacement;
+ if (rdev == NULL) {
+ /* Replacement just got moved to main 'rdev' */
+ smp_mb();
+ rdev = conf->mirrors[d].rdev;
+ }
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+ max_sectors);
+ r10_bio->devs[i].repl_bio = mbio;
+
+ mbio->bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
+ mbio->bi_private = r10_bio;
- atomic_inc(&r10_bio->remaining);
- spin_lock_irqsave(&conf->device_lock, flags);
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!mddev_check_plugged(mddev))
- md_wakeup_thread(mddev->thread);
+ atomic_inc(&r10_bio->remaining);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio_list_add(&conf->pending_bio_list, mbio);
+ conf->pending_count++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!mddev_check_plugged(mddev))
+ md_wakeup_thread(mddev->thread);
+ }
}
/* Don't remove the bias on 'remaining' (one_write_done) until
* Re: want-replacement got stuck?
2012-11-22 5:39 ` George Spelvin
@ 2012-11-22 5:47 ` NeilBrown
2012-11-22 6:45 ` George Spelvin
0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2012-11-22 5:47 UTC (permalink / raw)
To: George Spelvin; +Cc: joystick, linux-raid
On 22 Nov 2012 00:39:03 -0500 "George Spelvin" <linux@horizon.com> wrote:
> Oops! My manual patch application forgot a few lines. (I left the
> "if (!r10_bio->devs[i].repl_bio) continue;" code in place while adding
> its replacement.)
>
> Here's a corrected patch. (Again, -b, then the real thing.)
Yes, that looks right.
NeilBrown
* Re: want-replacement got stuck?
2012-11-22 5:47 ` NeilBrown
@ 2012-11-22 6:45 ` George Spelvin
2012-11-22 11:30 ` George Spelvin
0 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2012-11-22 6:45 UTC (permalink / raw)
To: linux, neilb; +Cc: joystick, linux-raid
> Yes, that looks right.
Well, testing now (replacing sde)...
md5 : active raid10 sdc2[4](R) sdd2[3] sda2[0] sde2[2] sdb2[1]
725591552 blocks 256K chunks 2 near-copies [4/4] [UUUU]
[>....................] recovery = 0.6% (2505344/362795776) finish=107.4min speed=55909K/sec
bitmap: 5/173 pages [20KB], 2048KB chunk
One noteworthy thing is that the array didn't come up completely on boot...
Nov 22 06:14:54 science kernel: md: adding sda2 ...
Nov 22 06:14:54 science kernel: md: sda1 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: adding sde2 ...
Nov 22 06:14:54 science kernel: md: sde1 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: sdd3 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: adding sdd2 ...
Nov 22 06:14:54 science kernel: md: sdd1 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: sdc3 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: adding sdb2 ...
Nov 22 06:14:54 science kernel: md: sdb1 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: adding sdc2 ...
Nov 22 06:14:54 science kernel: md: sdc1 has different UUID to sda2
Nov 22 06:14:54 science kernel: md: created md5
Nov 22 06:14:54 science kernel: md: bind<sdc2>
Nov 22 06:14:54 science kernel: md: bind<sdb2>
Nov 22 06:14:54 science kernel: md: export_rdev(sdd2)
Nov 22 06:14:54 science kernel: md: bind<sde2>
Nov 22 06:14:54 science kernel: md: bind<sda2>
Nov 22 06:14:54 science kernel: md: running: <sda2><sde2><sdb2><sdc2>
Nov 22 06:14:54 science kernel: md: kicking non-fresh sdc2 from array!
Nov 22 06:14:54 science kernel: md: unbind<sdc2>
Nov 22 06:14:54 science kernel: md: export_rdev(sdc2)
Nov 22 06:14:54 science kernel: md/raid10:md5: active with 3 out of 4 devices
Nov 22 06:14:54 science kernel: created bitmap (173 pages) for device md5
Nov 22 06:14:54 science kernel: md5: bitmap initialized from disk: read 11 pages, set 0 of 354293 bits
Nov 22 06:14:54 science kernel: md5: detected capacity change from 0 to 743005749248
It came up with neither sdd2 nor sdc2 in the array, even though the rebuild of sdd2 (from
--zero-superblock) had finished before.
Adding sdd2 was a re-add and went smoothly, though.
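For completeness, getting it back in was roughly the following (whether
spelled as --re-add or as a plain --add that mdadm turns into a re-add):
# mdadm /dev/md5 --re-add /dev/sdd2
# cat /proc/mdstat
(should show a short bitmap-based resync rather than a full rebuild)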
Oh, if you think you need my S-o-b to send that 3.6.x patch to stable@
(personally, I don't; all the original work is yours and all I did was
mechanical), feel free to add it.
* Re: want-replacement got stuck?
2012-11-21 16:33 ` George Spelvin
2012-11-21 16:41 ` Roman Mamedov
2012-11-21 19:21 ` joystick
@ 2012-11-22 2:15 ` NeilBrown
2 siblings, 0 replies; 16+ messages in thread
From: NeilBrown @ 2012-11-22 2:15 UTC (permalink / raw)
To: George Spelvin; +Cc: linux-raid
On 21 Nov 2012 11:33:00 -0500 "George Spelvin" <linux@horizon.com> wrote:
> Just to follow up to that earlier complaint, ext4 is now noticing some errors:
>
> Nov 21 06:21:53 science kernel: EXT4-fs error (device md5): ext4_find_entry:1234: inode #5881516: comm rsync: checksumming directory block 0
> Nov 21 07:57:03 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 4206: bad block bitmap checksum
> Nov 21 08:41:37 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 3960: bad block bitmap checksum
> Nov 21 08:45:18 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm flush-9:5: bg 4737: bad block bitmap checksum
> Nov 21 08:50:16 science kernel: EXT4-fs error (device md5): ext4_mb_generate_buddy:741: group 4206, 5621 clusters in bitmap, 6888 in gd
> Nov 21 08:50:16 science kernel: JBD2: Spotted dirty metadata buffer (dev = md5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> Nov 21 15:50:29 science kernel: EXT4-fs error (device md5): ext4_validate_block_bitmap:353: comm python: bg 4138: bad block bitmap checksum
> Nov 21 16:21:00 science kernel: UDP: bad checksum. From 187.194.52.187:65535 to 71.41.210.146:6881 ulen 70
>
> I also experienced transient corruption of the last few K of my incoming mailbox. (I.e. the last
> couple of messages were overwritten with some other text file. This morning, it's fine.)
>
> Something is definitely wonky here... I'm leaving it in the "stuck" state for a while
> in case there's useful debugging info to be extracted, but I'm getting very alarmed by these
> messages and want to reboot soon.
Yes.... this is a real worry. Fortunately I know what is causing it.
The code for writing to a RAID10 naively assumes that if the 'main' device in
a slot is faulty, then there isn't any replacement device to write to either.
This is normally the case, as a faulty device will be promptly removed - or it
should be at least. As you've already discovered, sometimes it isn't prompt.
But even if it were, there could be races so that the main device fails just
as we look at it, and then the replacement couldn't possibly have been moved
down yet.
Meanwhile you have a corrupted filesystem. Sorry.
The nature of the corruption is that since the replacement finished no writes
have gone to slot-3 at all. So if md ever decides to read from slot 3 it
will get stale data.
I suggest you fail sdd2, reboot, make sure only sda2, sdb2 and sde2 are in the
array, run fsck, and then if it seems happy enough, add sdc2 and/or sdd2 back
in so they rebuild completely.
Thanks for helping to make md better by risking your data :-)
NeilBrown
* Re: want-replacement got stuck?
2012-11-20 22:11 want-replacement got stuck? George Spelvin
2012-11-21 16:33 ` George Spelvin
@ 2012-11-22 2:10 ` NeilBrown
1 sibling, 0 replies; 16+ messages in thread
From: NeilBrown @ 2012-11-22 2:10 UTC (permalink / raw)
To: George Spelvin; +Cc: linux-raid
On 20 Nov 2012 17:11:45 -0500 "George Spelvin" <linux@horizon.com> wrote:
> I have a RAID10 array with 4 active + 1 spare.
> Kernel is 3.6.5, x86-64 but running 32-bit userland.
>
> After a recent failure on sdd2, the spare sdc2 was
> activated and things looked something like (manual edit,
> may not be perfectly faithful):
>
> md5 : active raid10 sdd2[4](F) sdb2[1] sde2[2] sdc2[3] sda2[0]
> 725591552 blocks 256K chunks 2 near-copies [4/4] [UUUU]
> bitmap: 50/173 pages [200KB], 2048KB chunk
>
> smartctl -A showed 1 pending sector, but badblocks didn't
> find it, so I decided to play with moving things back:
>
> # badblocks -s -v /dev/sdd2
> # mdadm /dev/md5 -r /dev/sdd2 -a /dev/sdd2
> # echo want_replacement > /sys/block/md5/md/dev-sdc2/state
>
> This ran for a while, but now it has stopped, with the following
> configuration:
>
> md5 : active raid10 sdd2[3](R) sdb2[1] sde2[2] sdc2[4](F) sda2[0]
> 725591552 blocks 256K chunks 2 near-copies [4/4] [UUU_]
> bitmap: 50/173 pages [200KB], 2048KB chunk
>
> # cat /sys/block/md5/md/dev-sd?2/state
> in_sync
> in_sync
> faulty,want_replacement
> in_sync,replacement
> in_sync
>
> I'm not quite sure how to interpret this state, and why it is showing
> "4/4" good drives but [UUU_].
"4/4" means the array is not degraded.
[UUU_] means that the drive in slot 3 is faulty.
The way this can happen without the array being degraded is that the
replacement is fully in-sync.
What has happened is that the replacement finished perfectly and the
want-replace device was marked as faulty, but when md tried to remove that
faulty device it found that it was still active: some request that had
previously been sent hadn't completed yet, so it couldn't remove it
immediately.
Unfortunately it doesn't retry in any great hurry .. or possibly at all.
I'll have to look into that and figure out the best fix.
...
> It appears to have completed:
> Nov 20 18:40:01 science kernel: md: md5: recovery done.
> Nov 20 18:40:01 science kernel: RAID10 conf printout:
> Nov 20 18:40:01 science kernel: --- wd:4 rd:4
> Nov 20 18:40:01 science kernel: disk 0, wo:0, o:1, dev:sda2
> Nov 20 18:40:01 science kernel: disk 1, wo:0, o:1, dev:sdb2
> Nov 20 18:40:01 science kernel: disk 2, wo:0, o:1, dev:sde2
> Nov 20 18:40:01 science kernel: disk 3, wo:1, o:0, dev:sdc2
>
> But as mentioned, the RAID state is a bit odd. sdc2 is still in the
> array and sdd2 is not.
Yes, it completed. The "conf printout" doesn't mention replacement devices
yet. I guess it should..
NeilBrown