From: Bart Kus <me@bartk.us>
To: linux-raid@vger.kernel.org
Subject: Re: (help!) MD RAID6 won't --re-add devices?
Date: Sat, 15 Jan 2011 11:50:58 -0800
Message-ID: <4D31FAA2.2080202@bartk.us>
In-Reply-To: <4D31DE07.1000507@bartk.us>

Some research has revealed a frightening solution:

http://forums.gentoo.org/viewtopic-t-716757-start-0.html

That thread calls upon mdadm --create with the --assume-clean flag.  It 
also seems to reinforce my suspicion that MD lost my device order 
numbers when it marked the drives as spares (thanks, MD!  Remind me to 
get you a nice Christmas present next year.).  I know the slots of 5 out 
of 10 devices, so the other 5 can go in 5! = 120 possible orders.  I've 
whipped up some software to generate all the permuted mdadm --create 
commands.
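
A minimal sketch of such a generator, not the actual software mentioned 
above.  It assumes the five known slots are the ones shown as "active 
sync" in the --detail output quoted below, that the five re-attached 
drives are still named sdn..sdr, and that level, chunk size and metadata 
version should match the quoted superblock.  Whether a fresh --create 
also reproduces the original data offset (272 sectors) is not checked 
here.  The script only prints commands; it runs nothing.

```python
from itertools import permutations

# Slots still known from the quoted `mdadm --detail` output.
KNOWN = {
    0: "/dev/sda1",
    2: "/dev/sdc1",
    3: "/dev/sdd1",
    5: "/dev/sdm1",
    9: "/dev/sdb1",
}
# Drives whose slot was lost (assumed names; re-check after any reboot).
UNKNOWN = ["/dev/sdn1", "/dev/sdo1", "/dev/sdp1", "/dev/sdq1", "/dev/sdr1"]
RAID_DEVICES = 10


def candidate_orders(known, unknown, raid_devices):
    """Yield every full device order consistent with the known slots."""
    free_slots = [i for i in range(raid_devices) if i not in known]
    for perm in permutations(unknown):
        order = [None] * raid_devices
        for slot, dev in known.items():
            order[slot] = dev
        for slot, dev in zip(free_slots, perm):
            order[slot] = dev
        yield order


def create_command(order, md="/dev/md4"):
    """Build one mdadm --create command line as a string (not executed)."""
    return " ".join([
        "mdadm", "--create", md, "--assume-clean",
        "--metadata=1.2", "--level=6",
        f"--raid-devices={len(order)}", "--chunk=64",
        *order,
    ])


commands = [create_command(o)
            for o in candidate_orders(KNOWN, UNKNOWN, RAID_DEVICES)]
print(len(commands))
print(commands[0])
```

Each candidate differs only in which of the five unknown drives lands in 
slots 1, 4, 6, 7 and 8; the known drives stay put in every command.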

The question now: how do I test if I've got the right combination?  Can 
I dd a meg off the assembled array and check for errors somewhere?
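
One cheap first check, sketched under assumptions: the quoted log shows 
XFS sitting directly on md4, and an XFS filesystem starts with the magic 
bytes "XFSB" at offset 0.  The function names are made up for 
illustration.  A match only shows that the chunk holding the first 
sector is in the right place, so this can reject wrong orders but cannot 
confirm a right one.

```python
XFS_MAGIC = b"XFSB"  # first four bytes of an XFS superblock


def looks_like_xfs(first_bytes: bytes) -> bool:
    """Return True if the buffer starts with the XFS superblock magic."""
    return first_bytes[:4] == XFS_MAGIC


def check_device(path: str, nbytes: int = 512) -> bool:
    """Open the device read-only and test its first bytes for the magic."""
    with open(path, "rb") as dev:
        return looks_like_xfs(dev.read(nbytes))
```

Usage would be something like check_device("/dev/md4") as root after each 
candidate assembly.  For orders that pass, a no-modify filesystem check 
such as xfs_repair -n is the more thorough follow-up.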

The other question: Is testing incorrect combinations destructive to any 
data on the drives?  Like, would RAID6 kick in and start "fixing" parity 
errors, even if I'm just reading?

--Bart

On 1/15/2011 9:48 AM, Bart Kus wrote:
> Things seem to have gone from bad to worse.  I upgraded to the latest 
> mdadm, and it actually let me do an --add operation, but --re-add was 
> still failing.  It added all the devices as spares though.  I stopped 
> the array and tried to re-assemble it, but it's not starting.
>
> jo ~ # mdadm -A /dev/md4 -f -u da14eb85:00658f24:80f7a070:b9026515
> mdadm: /dev/md4 assembled from 5 drives and 5 spares - not enough to 
> start the array.
>
> How do I promote these "spares" to being the active devices they once 
> were?  Yes, they're behind a few events, so there will be some data loss.
>
> --Bart
>
> On 1/13/2011 5:03 AM, Bart Kus wrote:
>> Hello,
>>
>> I had a Port Multiplier failure overnight.  This put 5 out of 10 
>> drives offline, degrading my RAID6 array.  The file system is still 
>> mounted (and failing to write):
>>
>> Buffer I/O error on device md4, logical block 3907023608
>> Filesystem "md4": xfs_log_force: error 5 returned.
>> etc...
>>
>> The array is in the following state:
>>
>> /dev/md4:
>>         Version : 1.02
>>   Creation Time : Sun Aug 10 23:41:49 2008
>>      Raid Level : raid6
>>      Array Size : 15628094464 (14904.11 GiB 16003.17 GB)
>>   Used Dev Size : 1953511808 (1863.01 GiB 2000.40 GB)
>>    Raid Devices : 10
>>   Total Devices : 11
>>     Persistence : Superblock is persistent
>>
>>     Update Time : Wed Jan 12 05:32:14 2011
>>           State : clean, degraded
>>  Active Devices : 5
>> Working Devices : 5
>>  Failed Devices : 6
>>   Spare Devices : 0
>>
>>      Chunk Size : 64K
>>
>>            Name : 4
>>            UUID : da14eb85:00658f24:80f7a070:b9026515
>>          Events : 4300692
>>
>>     Number   Major   Minor   RaidDevice State
>>       15       8        1        0      active sync   /dev/sda1
>>        1       0        0        1      removed
>>       12       8       33        2      active sync   /dev/sdc1
>>       16       8       49        3      active sync   /dev/sdd1
>>        4       0        0        4      removed
>>       20       8      193        5      active sync   /dev/sdm1
>>        6       0        0        6      removed
>>        7       0        0        7      removed
>>        8       0        0        8      removed
>>       13       8       17        9      active sync   /dev/sdb1
>>
>>       10       8       97        -      faulty spare
>>       11       8      129        -      faulty spare
>>       14       8      113        -      faulty spare
>>       17       8       81        -      faulty spare
>>       18       8       65        -      faulty spare
>>       19       8      145        -      faulty spare
>>
>> I have replaced the faulty PM and the drives have registered back 
>> with the system, under new names:
>>
>> sd 3:0:0:0: [sdn] Attached SCSI disk
>> sd 3:1:0:0: [sdo] Attached SCSI disk
>> sd 3:2:0:0: [sdp] Attached SCSI disk
>> sd 3:4:0:0: [sdr] Attached SCSI disk
>> sd 3:3:0:0: [sdq] Attached SCSI disk
>>
>> But I can't seem to --re-add them into the array now!
>>
>> # mdadm /dev/md4 --re-add /dev/sdn1 --re-add /dev/sdo1 --re-add 
>> /dev/sdp1 --re-add /dev/sdr1 --re-add /dev/sdq1
>> mdadm: add new device failed for /dev/sdn1 as 21: Device or resource 
>> busy
>>
>> I haven't unmounted the file system and/or stopped the /dev/md4 
>> device, since I think that would drop any buffers either layer might 
>> be holding.  I'd of course prefer to lose as little data as 
>> possible.  How can I get this array going again?
>>
>> PS: I think the reason "Failed Devices" shows 6 and not 5 is because 
>> I had a single HD failure a couple weeks back.  I replaced the drive 
>> and the array re-built A-OK.  I guess it still counted the failure 
>> since the array wasn't stopped during the repair.
>>
>> Thanks for any guidance,
>>
>> --Bart
>>
>> PPS: mdadm - v3.0 - 2nd June 2009
>> PPS: Linux jo.bartk.us 2.6.35-gentoo-r9 #1 SMP Sat Oct 2 21:22:14 PDT 
>> 2010 x86_64 Intel(R) Core(TM)2 Quad CPU @ 2.40GHz GenuineIntel GNU/Linux
>> PPS:  # mdadm --examine /dev/sdn1
>> /dev/sdn1:
>>           Magic : a92b4efc
>>         Version : 1.2
>>     Feature Map : 0x0
>>      Array UUID : da14eb85:00658f24:80f7a070:b9026515
>>            Name : 4
>>   Creation Time : Sun Aug 10 23:41:49 2008
>>      Raid Level : raid6
>>    Raid Devices : 10
>>
>>  Avail Dev Size : 3907023730 (1863.01 GiB 2000.40 GB)
>>      Array Size : 31256188928 (14904.11 GiB 16003.17 GB)
>>   Used Dev Size : 3907023616 (1863.01 GiB 2000.40 GB)
>>     Data Offset : 272 sectors
>>    Super Offset : 8 sectors
>>           State : clean
>>     Device UUID : c0cf419f:4c33dc64:84bc1c1a:7e9778ba
>>
>>     Update Time : Wed Jan 12 05:39:55 2011
>>        Checksum : bdb14e66 - correct
>>          Events : 4300672
>>
>>      Chunk Size : 64K
>>
>>    Device Role : spare
>>    Array State : A.AA.A...A ('A' == active, '.' == missing)
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

