From: Richard Herd <2001oddity@gmail.com>
To: Phil Turmel <philip@turmel.org>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Please Help! RAID5 -> 6 reshape gone bad
Date: Tue, 7 Feb 2012 14:10:40 +1100
Message-ID: <CAOANJV8MJRLu6+BC6HdBLqhNZsMrdhZpYN_Wc+8FObMJ2++O4w@mail.gmail.com>
In-Reply-To: <4F309313.2070406@turmel.org>
Thanks again Phil.
To confirm:
root@raven:/# mdadm -Avv --force --backup-file=/usb/md0.backup
/dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1
Results in the output below, so even with --force it doesn't want to accept
the 'non-fresh' sdc.
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/md0 has an active reshape - checking if critical section
needs to be restored
mdadm: accepting backup with timestamp 1328559119 for array with
timestamp 1328567549
mdadm: restoring critical section
mdadm: no uptodate device for slot 0 of /dev/md0
mdadm: added /dev/sda1 to /dev/md0 as 2
mdadm: added /dev/sdc1 to /dev/md0 as 3
mdadm: added /dev/sdf1 to /dev/md0 as 4
mdadm: added /dev/sdd1 to /dev/md0 as 5
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
And dmesg shows:
[11595.863451] md: bind<sda1>
[11595.863972] md: bind<sdc1>
[11595.865341] md: bind<sdf1>
[11595.869893] md: bind<sdd1>
[11595.870891] md: bind<sdb1>
[11595.871357] md: kicking non-fresh sdc1 from array!
[11595.871370] md: unbind<sdc1>
[11595.880072] md: export_rdev(sdc1)
[11595.882513] raid5: reshape will continue
[11595.882538] raid5: device sdb1 operational as raid disk 1
[11595.882542] raid5: device sdf1 operational as raid disk 4
[11595.882546] raid5: device sda1 operational as raid disk 2
[11595.883544] raid5: allocated 6308kB for md0
[11595.883627] 1: w=1 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0
[11595.883633] 5: w=1 pa=18 pr=6 m=2 a=2 r=6 op1=1 op2=0
[11595.883637] 4: w=2 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0
[11595.883642] 2: w=3 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0
[11595.883645] raid5: not enough operational devices for md0 (3/6 failed)
[11595.891968] RAID5 conf printout:
[11595.891971] --- rd:6 wd:3
[11595.891976] disk 1, o:1, dev:sdb1
[11595.891979] disk 2, o:1, dev:sda1
[11595.891983] disk 4, o:1, dev:sdf1
[11595.891986] disk 5, o:1, dev:sdd1
[11595.892520] raid5: failed to run raid set md0
[11595.900726] md: pers->run() failed ...
Cheers
On Tue, Feb 7, 2012 at 1:57 PM, Phil Turmel <philip@turmel.org> wrote:
> Hi Richard,
>
> [restored CC list... please use reply-to-all on kernel.org lists]
>
> On 02/06/2012 09:40 PM, Richard Herd wrote:
>> Hi Phil,
>>
>> Thanks for the swift response :-) Also I'm in (what I'd like to say
>> but can't - sunny) Sydney...
>>
>> OK, without slathering this thread in SMART reports, I can quite
>> definitely say you've hit the nail on the head with regard to the
>> read errors escalating into link timeouts.  This is exactly what is
>> happening.  I had thought this was actually a pretty common setup for
>> home users (eg mdadm and drives such as WD20EARS/ST2000s) - I have the
>> luxury of budgets for Netapp kit at work - unfortunately my personal
>> finances only stretch to an ITX case and a bunch of cheap HDs!
>
> I understand the constraints, as I pinch pennies at home and at the
> office (I own my engineering firm). I've made do with cheap desktop
> drives that do support ERC. I got burned when Seagate dropped ERC on
> their latest desktop drives.  The Hitachi Deskstar is the only affordable
> model on the market that still supports ERC.
>
>> I understand it's the ERC causing disks to get kicked, and fully
>> understand if you can't help further.
>
> Not that I won't help, as there's no risk to me :-)
>
>> I'm not sure assembling without sdg will do it, as what we have is 4
>> disks with the same event count (3 active sync (sda/sdb/sdf), 1
>> spare rebuilding (sdd)), and 2 (sdg/sdc) removed with older event
>> counters. Leaving out sdg leaves us with sdc which has an event
>> counter of 1848333. As the 3 active sync (sda/sdb/sdf) + 1 spare
>> (sdd) have an event counter of 1848341, mdadm doesn't want to let me
>> use sdc in the array even with --force.
>
> This surprises me. The purpose of "--force" with assemble is to
> ignore the event count. Have you tried this with the newer mdadm
> you compiled?
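>
> (A quick sanity check, in case the distro's 2.6.7 is still the one being
> picked up first on the PATH; just a sketch:
>
>     which mdadm
>     mdadm --version
>
> If that still reports 2.6.7, re-run the assemble with the full path to the
> 3.2.3 binary you built.)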
>
>> As you say, since it's in the middle of a reshape, a recreate is out.
>>
>> I'm treating data loss as a given at this point, but even being
>> able to bring the array online degraded and pull out whatever is still
>> intact would help.
>>
>> If you have any further suggestions that would be great, but I do
>> understand your position on ERC and thank you for your input :-)
>
> Please do retry the --assemble --force with /dev/sdg left out?
>
> I'll leave the balance of your response untrimmed for the list to see.
>
> Phil
>
>
>> Feb 7 01:07:16 raven kernel: [18891.989330] ata8: hard resetting link
>> Feb 7 01:07:22 raven kernel: [18897.356104] ata8: link is slow to
>> respond, please be patient (ready=0)
>> Feb 7 01:07:26 raven kernel: [18902.004280] ata8: hard resetting link
>> Feb 7 01:07:32 raven kernel: [18907.372104] ata8: link is slow to
>> respond, please be patient (ready=0)
>> Feb 7 01:07:36 raven kernel: [18912.020097] ata8: SATA link up 6.0
>> Gbps (SStatus 133 SControl 300)
>> Feb 7 01:07:41 raven kernel: [18917.020093] ata8.00: qc timeout (cmd 0xec)
>> Feb 7 01:07:41 raven kernel: [18917.028074] ata8.00: failed to
>> IDENTIFY (I/O error, err_mask=0x4)
>> Feb 7 01:07:41 raven kernel: [18917.028310] ata8: hard resetting link
>> Feb 7 01:07:47 raven kernel: [18922.396089] ata8: link is slow to
>> respond, please be patient (ready=0)
>> Feb 7 01:07:51 raven kernel: [18927.044313] ata8: hard resetting link
>> Feb 7 01:07:56 raven kernel: [18932.020099] ata8: SATA link up 6.0
>> Gbps (SStatus 133 SControl 300)
>> Feb 7 01:08:06 raven kernel: [18942.020048] ata8.00: qc timeout (cmd 0xec)
>> Feb 7 01:08:06 raven kernel: [18942.028075] ata8.00: failed to
>> IDENTIFY (I/O error, err_mask=0x4)
>> Feb 7 01:08:06 raven kernel: [18942.028307] ata8: limiting SATA link
>> speed to 3.0 Gbps
>> Feb 7 01:08:06 raven kernel: [18942.028321] ata8: hard resetting link
>> Feb 7 01:08:12 raven kernel: [18947.396108] ata8: link is slow to
>> respond, please be patient (ready=0)
>> Feb 7 01:08:16 raven kernel: [18951.988069] ata8: SATA link up 6.0
>> Gbps (SStatus 133 SControl 320)
>> Feb 7 01:08:46 raven kernel: [18981.988104] ata8.00: qc timeout (cmd 0xec)
>> Feb 7 01:08:46 raven kernel: [18981.996070] ata8.00: failed to
>> IDENTIFY (I/O error, err_mask=0x4)
>> Feb 7 01:08:46 raven kernel: [18981.996302] ata8.00: disabled
>> Feb 7 01:08:46 raven kernel: [18981.996324] ata8.00: device reported
>> invalid CHS sector 0
>> Feb 7 01:08:46 raven kernel: [18981.996348] ata8: hard resetting link
>> Feb 7 01:08:52 raven kernel: [18987.364104] ata8: link is slow to
>> respond, please be patient (ready=0)
>> Feb 7 01:08:56 raven kernel: [18992.012050] ata8: SATA link up 6.0
>> Gbps (SStatus 133 SControl 320)
>> Feb 7 01:08:56 raven kernel: [18992.012114] ata8: EH complete
>> Feb 7 01:08:56 raven kernel: [18992.012158] sd 8:0:0:0: [sdg]
>> Unhandled error code
>> Feb 7 01:08:56 raven kernel: [18992.012165] sd 8:0:0:0: [sdg] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 01:08:56 raven kernel: [18992.012176] sd 8:0:0:0: [sdg] CDB:
>> Write(10): 2a 00 e8 e0 74 3f 00 00 08 00
>> Feb 7 01:08:56 raven kernel: [18992.012696] md: super_written gets
>> error=-5, uptodate=0
>> Feb 7 01:08:56 raven kernel: [18992.013169] sd 8:0:0:0: [sdg]
>> Unhandled error code
>> Feb 7 01:08:56 raven kernel: [18992.013176] sd 8:0:0:0: [sdg] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 01:08:56 raven kernel: [18992.013186] sd 8:0:0:0: [sdg] CDB:
>> Read(10): 28 00 04 9d bd bf 00 00 80 00
>> Feb 7 01:08:56 raven kernel: [18992.276986] sd 8:0:0:0: [sdg]
>> Unhandled error code
>> Feb 7 01:08:56 raven kernel: [18992.276999] sd 8:0:0:0: [sdg] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 01:08:56 raven kernel: [18992.277012] sd 8:0:0:0: [sdg] CDB:
>> Read(10): 28 00 04 9d be 3f 00 00 80 00
>> Feb 7 01:08:56 raven kernel: [18992.316919] sd 8:0:0:0: [sdg]
>> Unhandled error code
>> Feb 7 01:08:56 raven kernel: [18992.316930] sd 8:0:0:0: [sdg] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 01:08:56 raven kernel: [18992.316942] sd 8:0:0:0: [sdg] CDB:
>> Read(10): 28 00 04 9d be bf 00 00 80 00
>> Feb 7 01:08:56 raven kernel: [18992.326906] sd 8:0:0:0: [sdg]
>> Unhandled error code
>> Feb 7 01:08:56 raven kernel: [18992.326920] sd 8:0:0:0: [sdg] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 01:08:56 raven kernel: [18992.326932] sd 8:0:0:0: [sdg] CDB:
>> Read(10): 28 00 04 9d bf 3f 00 00 80 00
>> Feb 7 01:08:56 raven kernel: [18992.327944] sd 8:0:0:0: [sdg]
>> Unhandled error code
>> Feb 7 01:08:56 raven kernel: [18992.327956] sd 8:0:0:0: [sdg] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 01:08:56 raven kernel: [18992.327968] sd 8:0:0:0: [sdg] CDB:
>> Read(10): 28 00 04 9d bf bf 00 00 80 00
>> Feb 7 01:08:57 raven kernel: [18992.555093] md: md0: reshape done.
>> Feb 7 01:08:57 raven kernel: [18992.607595] md: reshape of RAID array md0
>> Feb 7 01:08:57 raven kernel: [18992.607606] md: minimum _guaranteed_
>> speed: 200000 KB/sec/disk.
>> Feb 7 01:08:57 raven kernel: [18992.607614] md: using maximum
>> available idle IO bandwidth (but not more than 200000 KB/sec) for
>> reshape.
>> Feb 7 01:08:57 raven kernel: [18992.607628] md: using 128k window,
>> over a total of 1953511936 blocks.
>> Feb 7 06:41:02 raven rsyslogd: [origin software="rsyslogd"
>> swVersion="4.2.0" x-pid="911" x-info="http://www.rsyslog.com"]
>> rsyslogd was HUPed, type 'lightweight'.
>> Feb 7 07:12:32 raven kernel: [40807.989092] ata5: hard resetting link
>> Feb 7 07:12:38 raven kernel: [40813.524074] ata5: SATA link up 6.0
>> Gbps (SStatus 133 SControl 300)
>> Feb 7 07:12:43 raven kernel: [40818.524106] ata5.00: qc timeout (cmd 0xec)
>> Feb 7 07:12:43 raven kernel: [40818.524126] ata5.00: failed to
>> IDENTIFY (I/O error, err_mask=0x4)
>> Feb 7 07:12:43 raven kernel: [40818.532788] ata5: hard resetting link
>> Feb 7 07:12:48 raven kernel: [40824.058039] ata5: SATA link up 6.0
>> Gbps (SStatus 133 SControl 300)
>> Feb 7 07:12:58 raven kernel: [40834.056101] ata5.00: qc timeout (cmd 0xec)
>> Feb 7 07:12:58 raven kernel: [40834.056121] ata5.00: failed to
>> IDENTIFY (I/O error, err_mask=0x4)
>> Feb 7 07:12:58 raven kernel: [40834.064203] ata5: limiting SATA link
>> speed to 3.0 Gbps
>> Feb 7 07:12:58 raven kernel: [40834.064217] ata5: hard resetting link
>> Feb 7 07:13:04 raven kernel: [40839.592095] ata5: SATA link up 3.0
>> Gbps (SStatus 123 SControl 320)
>> Feb 7 07:13:34 raven kernel: [40869.592088] ata5.00: qc timeout (cmd 0xec)
>> Feb 7 07:13:34 raven kernel: [40869.592110] ata5.00: failed to
>> IDENTIFY (I/O error, err_mask=0x4)
>> Feb 7 07:13:34 raven kernel: [40869.599676] ata5.00: disabled
>> Feb 7 07:13:34 raven kernel: [40869.599700] ata5.00: device reported
>> invalid CHS sector 0
>> Feb 7 07:13:34 raven kernel: [40869.599724] ata5: hard resetting link
>> Feb 7 07:13:39 raven kernel: [40875.124128] ata5: SATA link up 3.0
>> Gbps (SStatus 123 SControl 320)
>> Feb 7 07:13:39 raven kernel: [40875.124201] ata5: EH complete
>> Feb 7 07:13:39 raven kernel: [40875.124243] sd 4:0:0:0: [sdd]
>> Unhandled error code
>> Feb 7 07:13:39 raven kernel: [40875.124251] sd 4:0:0:0: [sdd] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 07:13:39 raven kernel: [40875.124262] sd 4:0:0:0: [sdd] CDB:
>> Write(10): 2a 00 e8 e0 74 3f 00 00 08 00
>> Feb 7 07:13:39 raven kernel: [40875.135544] md: super_written gets
>> error=-5, uptodate=0
>> Feb 7 07:13:39 raven kernel: [40875.152171] sd 4:0:0:0: [sdd]
>> Unhandled error code
>> Feb 7 07:13:39 raven kernel: [40875.152179] sd 4:0:0:0: [sdd] Result:
>> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Feb 7 07:13:39 raven kernel: [40875.152189] sd 4:0:0:0: [sdd] CDB:
>> Read(10): 28 00 09 2b f2 3f 00 00 80 00
>> Feb 7 07:13:41 raven kernel: [40876.734504] md: md0: reshape done.
>> Feb 7 07:13:41 raven kernel: [40876.736298] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.743529] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.750009] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.755143] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.760126] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.765070] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.769890] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.774759] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.779456] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.784166] lost page write due to
>> I/O error on md0
>> Feb 7 07:13:41 raven kernel: [40876.788773] JBD: Detected IO errors
>> while flushing file data on md0
>> Feb 7 07:13:41 raven kernel: [40876.796386] JBD: Detected IO errors
>> while flushing file data on md0
>>
>> On Tue, Feb 7, 2012 at 1:15 PM, Phil Turmel <philip@turmel.org> wrote:
>>> Hi Richard,
>>>
>>> On 02/06/2012 08:34 PM, Richard Herd wrote:
>>>> Hey guys,
>>>>
>>>> I'm in a bit of a pickle here and if any mdadm kings could step in and
>>>> throw some advice my way I'd be very grateful :-)
>>>>
>>>> Quick bit of background - little NAS based on an AMD E350 running
>>>> Ubuntu 10.04. Running a software RAID 5 from 5x2TB disks. Every few
>>>> months one of the drives would fail a request and get kicked from the
>>>> array (as is becoming common for these larger multi-TB drives, which
>>>> tolerate the occasional bad sector by reallocating from a pool of
>>>> spares, but that's a whole other story).  This happened across a
>>>> variety of brands and two different controllers.  I'd simply add the
>>>> disk that got popped back in and let it re-sync.  SMART tests always
>>>> came back in good health.
>>>
>>> Some more detail on the actual devices would help, especially the
>>> output of lsdrv [1] to document what device serial numbers are which,
>>> for future reference.
>>>
>>> I also suspect you have problems with your drive's error recovery
>>> control, also known as time-limited error recovery. Simple sector
>>> errors should *not* be kicking out your drives. Mdadm knows to
>>> reconstruct from parity and rewrite when a read error is encountered.
>>> That either succeeds directly, or causes the drive to remap.
>>>
>>> You say that the SMART tests are good, so read errors are probably
>>> escalating into link timeouts, and the drive ignores the attempt to
>>> reconstruct. *That* kicks the drive out.
>>>
>>> "smartctl -x" reports for all of your drives would help identify if
>>> you have this problem. You *cannot* safely run raid arrays with drives
>>> that don't (or won't) report errors in a timely fashion (a few seconds).
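>>>
>>> (For drives that do support SCT ERC, it can usually be checked and set
>>> with smartctl, roughly like this, with /dev/sdX as a placeholder:
>>>
>>>     smartctl -l scterc /dev/sdX        # show current read/write ERC timeouts
>>>     smartctl -l scterc,70,70 /dev/sdX  # set both to 7.0 seconds (units of 100ms)
>>>
>>> For drives that won't do ERC at all, the usual workaround is to raise the
>>> kernel's command timeout instead, e.g.:
>>>
>>>     echo 180 > /sys/block/sdX/device/timeout
>>>
>>> Neither setting survives a reboot, so they belong in a boot script.)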
>>>
>>>> It did make me nervous though. So I decided I'd add a second disk for
>>>> a bit of extra redundancy, making the array a RAID 6 - the thinking
>>>> was the occasional disk getting kicked and re-added from a RAID 6
>>>> array wouldn't present as much risk as a single disk getting kicked
>>>> from a RAID 5.
>>>>
>>>> So first off, I added the 6th disk as a hotspare to the RAID5 array.
>>>> So I now had my 5 disk RAID 5 + hotspare.
>>>>
>>>> I then found that mdadm 2.6.7 (in the repositories) isn't actually
>>>> capable of a 5->6 reshape. So I pulled the latest 3.2.3 sources and
>>>> compiled myself a new version of mdadm.
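>>>>
>>>> (Roughly, and assuming a build straight from the kernel.org git tree; a
>>>> release tarball works the same way:
>>>>
>>>>     git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>>>>     cd mdadm && make && sudo make install
>>>> )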
>>>>
>>>> With the newer version of mdadm, it was happy to do the reshape - so I
>>>> set it off on its merry way, using an eSATA HD (mounted at /usb :-P)
>>>> for the backup file:
>>>>
>>>> root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6
>>>> --backup-file=/usb/md0.backup
>>>>
>>>> It would take a week to reshape, but it was on a UPS & happily ticking
>>>> along. The array would be online the whole time so I was in no rush.
>>>> Content, I went to get some shut-eye.
>>>>
>>>> I got up this morning and took a quick look in /proc/mdstat to see how
>>>> things were going and saw things had failed spectacularly. At least
>>>> two disks had been kicked from the array and the whole thing had
>>>> crumbled.
>>>
>>> Do you still have the dmesg for this?
>>>
>>>> Ouch.
>>>>
>>>> I tried to assemble the array, to see if it would continue the reshape:
>>>>
>>>> root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0
>>>> /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1
>>>>
>>>> Unfortunately mdadm had decided that the backup-file was out of date
>>>> (timestamps didn't match) and was erroring with: Failed to restore
>>>> critical section for reshape, sorry..
>>>>
>>>> Chances are things were in such a mess that the backup file wasn't going
>>>> to be used anyway, so I blocked the timestamp check with: export
>>>> MDADM_GROW_ALLOW_OLD=1
>>>>
>>>> That allowed me to assemble the array, but not run it as there were
>>>> not enough disks to start it.
>>>>
>>>> This is the current state of the array:
>>>>
>>>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
>>>> [raid4] [raid10]
>>>> md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2]
>>>> 7814047744 blocks super 0.91
>>>>
>>>> unused devices: <none>
>>>>
>>>> root@raven:/# mdadm --detail /dev/md0
>>>> /dev/md0:
>>>> Version : 0.91
>>>> Creation Time : Tue Jul 12 23:05:01 2011
>>>> Raid Level : raid6
>>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>>> Raid Devices : 6
>>>> Total Devices : 4
>>>> Preferred Minor : 0
>>>> Persistence : Superblock is persistent
>>>>
>>>> Update Time : Tue Feb 7 09:32:29 2012
>>>> State : active, FAILED, Not Started
>>>> Active Devices : 3
>>>> Working Devices : 4
>>>> Failed Devices : 0
>>>> Spare Devices : 1
>>>>
>>>> Layout : left-symmetric-6
>>>> Chunk Size : 64K
>>>>
>>>> New Layout : left-symmetric
>>>>
>>>> UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
>>>> Events : 0.1848341
>>>>
>>>> Number Major Minor RaidDevice State
>>>> 0 0 0 0 removed
>>>> 1 8 17 1 active sync /dev/sdb1
>>>> 2 8 1 2 active sync /dev/sda1
>>>> 3 0 0 3 removed
>>>> 4 8 81 4 active sync /dev/sdf1
>>>> 5 8 49 5 spare rebuilding /dev/sdd1
>>>>
>>>> The two removed disks:
>>>> [ 3020.998529] md: kicking non-fresh sdc1 from array!
>>>> [ 3021.012672] md: kicking non-fresh sdg1 from array!
>>>>
>>>> Attempted to re-add the disks (same for both):
>>>> root@raven:/# mdadm /dev/md0 --add /dev/sdg1
>>>> mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a
>>>> --re-add fails.
>>>> mdadm: not performing --add as that would convert /dev/sdg1 in to a spare.
>>>> mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first.
>>>>
>>>> With a failed array the last thing we want to do is add spares and
>>>> trigger a resync so obviously I haven't zeroed the superblocks and
>>>> added yet.
>>>
>>> That would be catastrophic.
>>>
>>>> Checked and two disks really are out of sync:
>>>> root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event
>>>> Events : 1848341
>>>> Events : 1848341
>>>> Events : 1848333
>>>> Events : 1848341
>>>> Events : 1848341
>>>> Events : 1772921
>>>
>>> So /dev/sdg1 dropped out first, and /dev/sdc1 followed and killed the
>>> array.
>>>
>>>> I'll post the output of --examine on all the disks below - if anyone
>>>> has any advice I'd really appreciate it (Neil Brown doesn't read these
>>>> forums does he?!?). I would usually move next to recreating the array
>>>> and using assume-clean but since it's right in the middle of a reshape
>>>> I'm not inclined to try.
>>>
>>> Neil absolutely reads this mailing list, and is likely to pitch in if
>>> I don't offer precisely correct advice :-)
>>>
>>> He's in an Australian time zone though, so latency might vary. I'm on the
>>> U.S. east coast, fwiw.
>>>
>>> In any case, with a re-shape in progress, "--create --assume-clean" is
>>> not an option.
>>>
>>>> Critical stuff is of course backed up, but there is some user data not
>>>> covered by backups that I'd like to try and restore if at all
>>>> possible.
>>>
>>> Hope is not all lost. If we can get your ERC adjusted, the next step
>>> would be to disconnect /dev/sdg from the system, and assemble with
>>> --force and MDADM_GROW_ALLOW_OLD=1
>>>
>>> That'll let the reshape finish, leaving you with a single-degraded
>>> raid6. Then you fsck and make critical backups. Then you --zero- and
>>> --add /dev/sdg.
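>>>
>>> (Sketching that sequence with your current device names and backup file;
>>> adjust the letters if they shift once sdg is physically removed:
>>>
>>>     export MDADM_GROW_ALLOW_OLD=1
>>>     mdadm -Avv --force --backup-file=/usb/md0.backup /dev/md0 \
>>>         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1
>>>     cat /proc/mdstat                  # watch the reshape run to completion
>>>     fsck -n /dev/md0                  # read-only check, then take backups
>>>     mdadm --zero-superblock /dev/sdg1
>>>     mdadm /dev/md0 --add /dev/sdg1
>>> )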
>>>
>>> If your drives don't support ERC, I can't recommend you continue until
>>> you've ddrescue'd your drives onto new ones that do support ERC.
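>>>
>>> (GNU ddrescue usage is along the lines of:
>>>
>>>     ddrescue -f /dev/OLD /dev/NEW /root/old-drive.log
>>>
>>> where the log/map file lets it stop and resume, and -f is needed when the
>>> destination is a block device.  Device names here are placeholders.)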
>>>
>>> HTH,
>>>
>>> Phil
>>>
>>> [1] http://github.com/pturmel/lsdrv
>