* Please Help! RAID5 -> 6 reshape gone bad
@ 2012-02-07  1:34 Richard Herd
  2012-02-07  2:15 ` Phil Turmel
  2012-02-07  2:39 ` NeilBrown
  0 siblings, 2 replies; 27+ messages in thread
From: Richard Herd @ 2012-02-07  1:34 UTC (permalink / raw)
  To: linux-raid

Hey guys,

I'm in a bit of a pickle here and if any mdadm kings could step in and
throw some advice my way I'd be very grateful :-)

Quick bit of background - little NAS based on an AMD E350 running Ubuntu
10.04, with a software RAID 5 across 5x2TB disks.  Every few months one
of the drives would fail a request and get kicked from the array (as is
becoming common for these larger multi-TB drives, they tolerate the
occasional bad sector by reallocating from a pool of spares - but that's
a whole other story).  This happened across a variety of brands and two
different controllers.  I'd simply add the disk that got popped back in
and let it re-sync.  SMART tests were always in good health.

It did make me nervous though, so I decided I'd add a second disk for a
bit of extra redundancy, making the array a RAID 6 - the thinking was
that the occasional disk getting kicked and re-added from a RAID 6 array
wouldn't present as much risk as a single disk getting kicked from a
RAID 5.

So first off, I added the 6th disk as a hot spare to the RAID 5 array.
I now had my 5-disk RAID 5 + hot spare.

I then found that mdadm 2.6.7 (in the repositories) isn't actually
capable of a 5->6 reshape, so I pulled the latest 3.2.3 sources and
compiled myself a new version of mdadm.

The newer version of mdadm was happy to do the reshape, so I set it off
on its merry way, using an eSATA HD (mounted at /usb :-P) for the
backup file:

root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/usb/md0.backup

It would take a week to reshape, but it was on a UPS and happily ticking
along, and the array would stay online the whole time, so I was in no
rush.  Content, I went to get some shut-eye.

I got up this morning, took a quick look at /proc/mdstat to see how
things were going, and saw things had failed spectacularly.  At least
two disks had been kicked from the array and the whole thing had
crumbled.  Ouch.

I tried to assemble the array, to see if it would continue the reshape:

root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1

Unfortunately mdadm decided that the backup file was out of date
(timestamps didn't match) and errored with: "Failed to restore critical
section for reshape, sorry.".

Chances are things were in such a mess that the backup file wasn't going
to be used anyway, so I bypassed the timestamp check with:

export MDADM_GROW_ALLOW_OLD=1

That allowed me to assemble the array, but not run it, as there were not
enough disks to start it.
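Pulled together, the assemble attempt with the timestamp check bypassed
looks roughly like this (same device list as above; the --examine
capture is only a read-only snapshot of the superblocks, written to an
arbitrary file name, taken before retrying anything):

  # read-only snapshot of the member superblocks, for the record
  mdadm --examine /dev/sd[a-h]1 > /root/md0-examine-before.txt

  # bypass the backup-file timestamp check, then retry the assembly
  export MDADM_GROW_ALLOW_OLD=1
  mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1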
This is the current state of the array:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2]
      7814047744 blocks super 0.91

unused devices: <none>

root@raven:/# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.91
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Feb  7 09:32:29 2012
          State : active, FAILED, Not Started
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric-6
     Chunk Size : 64K

     New Layout : left-symmetric

           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
         Events : 0.1848341

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
       2       8        1        2      active sync   /dev/sda1
       3       0        0        3      removed
       4       8       81        4      active sync   /dev/sdf1
       5       8       49        5      spare rebuilding   /dev/sdd1

The two removed disks:

[ 3020.998529] md: kicking non-fresh sdc1 from array!
[ 3021.012672] md: kicking non-fresh sdg1 from array!

Attempted to re-add the disks (same for both):

root@raven:/# mdadm /dev/md0 --add /dev/sdg1
mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sdg1 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first.

With a failed array the last thing we want to do is add spares and
trigger a resync so obviously I haven't zeroed the superblocks and
added yet.

Checked and two disks really are out of sync:

root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event
         Events : 1848341
         Events : 1848341
         Events : 1848333
         Events : 1848341
         Events : 1848341
         Events : 1772921

I'll post the output of --examine on all the disks below - if anyone has
any advice I'd really appreciate it (Neil Brown doesn't read these
forums does he?!?).  I would usually move next to recreating the array
and using assume-clean but since it's right in the middle of a reshape
I'm not inclined to try.

Critical stuff is of course backed up, but there is some user data not
covered by backups that I'd like to try and restore if at all possible.
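One way to line up the fields that matter for re-assembly across all the
members (a rough sketch against the full --examine output pasted below,
using the same shell glob as the grep above):

  # device name, last update, event count and reshape position per member
  mdadm --examine /dev/sd[a-h]1 | egrep '^/dev|Update Time|Events|Reshape'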
Thanks

root@raven:/# mdadm --examine /dev/sd[a-h]1
/dev/sda1:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

  Reshape pos'n : 307740672 (293.48 GiB 315.13 GB)
     New Layout : left-symmetric

    Update Time : Tue Feb  7 09:32:29 2012
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 3c0c8563 - correct
         Events : 1848341

         Layout : left-symmetric-6
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       17        2      active sync   /dev/sdb1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       0        0        3      faulty removed
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       65        5      active   /dev/sde1

/dev/sdb1:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

  Reshape pos'n : 307740672 (293.48 GiB 315.13 GB)
     New Layout : left-symmetric

    Update Time : Tue Feb  7 09:32:29 2012
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 3c0c8571 - correct
         Events : 1848341

         Layout : left-symmetric-6
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       0        0        3      faulty removed
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       65        5      active   /dev/sde1

/dev/sdc1:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

  Reshape pos'n : 307740672 (293.48 GiB 315.13 GB)
     New Layout : left-symmetric

    Update Time : Tue Feb  7 07:12:01 2012
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 3c0c6478 - correct
         Events : 1848333

         Layout : left-symmetric-6
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       49        3      active sync   /dev/sdd1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       65        5      active   /dev/sde1

/dev/sdd1:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

  Reshape pos'n : 307740672 (293.48 GiB 315.13 GB)
     New Layout : left-symmetric

    Update Time : Tue Feb  7 09:32:29 2012
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 3c0c8595 - correct
         Events : 1848341

         Layout : left-symmetric-6
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8       65        5      active   /dev/sde1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       0        0        3      faulty removed
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       65        5      active   /dev/sde1

/dev/sdf1:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

  Reshape pos'n : 307740672 (293.48 GiB 315.13 GB)
     New Layout : left-symmetric

    Update Time : Tue Feb  7 09:32:29 2012
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 3c0c85a7 - correct
         Events : 1848341

         Layout : left-symmetric-6
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8       81        4      active sync   /dev/sdf1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       0        0        3      faulty removed
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       65        5      active   /dev/sde1

/dev/sdg1:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0

  Reshape pos'n : 307740672 (293.48 GiB 315.13 GB)
     New Layout : left-symmetric

    Update Time : Tue Feb  7 01:06:46 2012
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 3c09c1d2 - correct
         Events : 1772921

         Layout : left-symmetric-6
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       97        0      active sync   /dev/sdg1

   0     0       8       97        0      active sync   /dev/sdg1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       65        5      active   /dev/sde1

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 1:34 Please Help! RAID5 -> 6 reshapre gone bad Richard Herd @ 2012-02-07 2:15 ` Phil Turmel [not found] ` <CAOANJV955ZdLexRTjVkQzTMapAaMitq5eqxP0rUvDjjLh4Wgzw@mail.gmail.com> 2012-02-07 2:39 ` NeilBrown 1 sibling, 1 reply; 27+ messages in thread From: Phil Turmel @ 2012-02-07 2:15 UTC (permalink / raw) To: Richard Herd; +Cc: linux-raid Hi Richard, On 02/06/2012 08:34 PM, Richard Herd wrote: > Hey guys, > > I'm in a bit of a pickle here and if any mdadm kings could step in and > throw some advice my way I'd be very grateful :-) > > Quick bit of background - little NAS based on an AMD E350 running > Ubuntu 10.04. Running a software RAID 5 from 5x2TB disks. Every few > months one of the drives would fail a request and get kicked from the > array (as is becoming common for these larger multi TB drives they > tolerate the occasional bad sector by reallocating from a pool of > spares (but that's a whole other story)). This happened across a > variety of brands and two different controllers. I'd simply add the > disk that got popped back in and let it re-sync. SMART tests always > in good health. Some more detail on the actual devices would help, especially the output of lsdrv [1] to document what device serial numbers are which, for future reference. I also suspect you have problems with your drive's error recovery control, also known as time-limited error recovery. Simple sector errors should *not* be kicking out your drives. Mdadm knows to reconstruct from parity and rewrite when a read error is encountered. That either succeeds directly, or causes the drive to remap. You say that the SMART tests are good, so read errors are probably escalating into link timeouts, and the drive ignores the attempt to reconstruct. *That* kicks the drive out. "smartctl -x" reports for all of your drives would help identify if you have this problem. You *cannot* safely run raid arrays with drives that don't (or won't) report errors in a timely fashion (a few seconds). > It did make me nervous though. So I decided I'd add a second disk for > a bit of extra redundancy, making the array a RAID 6 - the thinking > was the occasional disk getting kicked and re-added from a RAID 6 > array wouldn't present as much risk as a single disk getting kicked > from a RAID 5. > > So first off, I added the 6th disk as a hotspare to the RAID5 array. > So I now had my 5 disk RAID 5 + hotspare. > > I then found that mdadm 2.6.7 (in the repositories) isn't actually > capable of a 5->6 reshape. So I pulled the latest 3.2.3 sources and > compiled myself a new version of mdadm. > > With the newer version of mdadm, it was happy to do the reshape - so I > set it off on it's merry way, using an esata HD (mounted at /usb :-P) > for the backupfile: > > root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6 > --backup-file=/usb/md0.backup > > It would take a week to reshape, but it was ona UPS & happily ticking > along. The array would be online the whole time so I was in no rush. > Content, I went to get some shut-eye. > > I got up this morning and took a quick look in /proc/mdstat to see how > things were going and saw things had failed spectacularly. At least > two disks had been kicked from the array and the whole thing had > crumbled. Do you still have the dmesg for this? > Ouch. 
> > I tried to assembe the array, to see if it would continue the reshape: > > root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 > /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 > > Unfortunately mdadm had decided that the backup-file was out of date > (timestamps didn't match) and was erroring with: Failed to restore > critical section for reshape, sorry.. > > Chances are things were in such a mess that backup file wasn't going > to be used anyway, so I blocked the timestamp check with: export > MDADM_GROW_ALLOW_OLD=1 > > That allowed me to assemble the array, but not run it as there were > not enough disks to start it. > > This is the current state of the array: > > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] > [raid4] [raid10] > md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2] > 7814047744 blocks super 0.91 > > unused devices: <none> > > root@raven:/# mdadm --detail /dev/md0 > /dev/md0: > Version : 0.91 > Creation Time : Tue Jul 12 23:05:01 2011 > Raid Level : raid6 > Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB) > Raid Devices : 6 > Total Devices : 4 > Preferred Minor : 0 > Persistence : Superblock is persistent > > Update Time : Tue Feb 7 09:32:29 2012 > State : active, FAILED, Not Started > Active Devices : 3 > Working Devices : 4 > Failed Devices : 0 > Spare Devices : 1 > > Layout : left-symmetric-6 > Chunk Size : 64K > > New Layout : left-symmetric > > UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven) > Events : 0.1848341 > > Number Major Minor RaidDevice State > 0 0 0 0 removed > 1 8 17 1 active sync /dev/sdb1 > 2 8 1 2 active sync /dev/sda1 > 3 0 0 3 removed > 4 8 81 4 active sync /dev/sdf1 > 5 8 49 5 spare rebuilding /dev/sdd1 > > The two removed disks: > [ 3020.998529] md: kicking non-fresh sdc1 from array! > [ 3021.012672] md: kicking non-fresh sdg1 from array! > > Attempted to re-add the disks (same for both): > root@raven:/# mdadm /dev/md0 --add /dev/sdg1 > mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a > --re-add fails. > mdadm: not performing --add as that would convert /dev/sdg1 in to a spare. > mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first. > > With a failed array the last thing we want to do is add spares and > trigger a resync so obviously I haven't zeroed the superblocks and > added yet. That would be catastrophic. > Checked and two disks really are out of sync: > root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event > Events : 1848341 > Events : 1848341 > Events : 1848333 > Events : 1848341 > Events : 1848341 > Events : 1772921 So /dev/sdg1 dropped out first, and /dev/sdc1 followed and killed the array. > I'll post the output of --examine on all the disks below - if anyone > has any advice I'd really appreciate it (Neil Brown doesn't read these > forums does he?!?). I would usually move next to recreating the array > and using assume-clean but since it's right in the middle of a reshape > I'm not inclined to try. Neil absolutely reads this mailing list, and is likely to pitch in if I don't offer precisely correct advice :-) He's in an Australian time zone though, so latency might vary. I'm on the U.S. east coast, fwiw. In any case, with a re-shape in progress, "--create --assume-clean" is not an option. > Critical stuff is of course backed up, but there is some user data not > covered by backups that I'd like to try and restore if at all > possible. Hope is not all lost. 
If we can get your ERC adjusted, the next step would be to disconnect
/dev/sdg from the system, and assemble with --force and
MDADM_GROW_ALLOW_OLD=1.

That'll let the reshape finish, leaving you with a single-degraded
raid6.  Then you fsck and make critical backups.  Then you
--zero-superblock and --add /dev/sdg.

If your drives don't support ERC, I can't recommend you continue until
you've ddrescue'd your drives onto new ones that do support ERC.

HTH,

Phil

[1] http://github.com/pturmel/lsdrv

^ permalink raw reply	[flat|nested] 27+ messages in thread
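For reference, the ERC check/adjustment Phil mentions is usually done
with smartctl along these lines (sdX is a placeholder for each member
disk; whether the scterc commands are accepted at all depends on the
drive, and the values are in tenths of a second):

  # query the drive's SCT error recovery control settings, if supported
  smartctl -l scterc /dev/sdX

  # limit read and write error recovery to 7 seconds each
  smartctl -l scterc,70,70 /dev/sdX

  # if the drive refuses ERC, raising the kernel's command timeout
  # (default 30s) is the usual fallback so the drive isn't reset
  # mid-recovery
  echo 180 > /sys/block/sdX/device/timeout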
[parent not found: <CAOANJV955ZdLexRTjVkQzTMapAaMitq5eqxP0rUvDjjLh4Wgzw@mail.gmail.com>]
* Re: Please Help! RAID5 -> 6 reshapre gone bad [not found] ` <CAOANJV955ZdLexRTjVkQzTMapAaMitq5eqxP0rUvDjjLh4Wgzw@mail.gmail.com> @ 2012-02-07 2:57 ` Phil Turmel 2012-02-07 3:10 ` Richard Herd 2012-02-07 3:24 ` Keith Keller 2012-02-07 3:04 ` Fwd: " Richard Herd 1 sibling, 2 replies; 27+ messages in thread From: Phil Turmel @ 2012-02-07 2:57 UTC (permalink / raw) To: Richard Herd, linux-raid@vger.kernel.org Hi Richard, [restored CC list... please use reply-to-all on kernel.org lists] On 02/06/2012 09:40 PM, Richard Herd wrote: > Hi Phil, > > Thanks for the swift response :-) Also I'm in (what I'd like to say > but can't - sunny) Sydney... > > OK, without slathering this thread is smart reports I can quite > definitely say you are exactly nail-on-the-head with regard to the > read errors escalating into link timeouts. This is exactly what is > happening. I had thought this was actually a pretty common setup for > home users (eg mdadm and drives such as WD20EARS/ST2000s) - I have the > luxury of budgets for Netapp kit at work - unfortunately my personal > finances only stretch to an ITX case and a bunch of cheap HDs! I understand the constraints, as I pinch pennies at home and at the office (I own my engineering firm). I've made do with cheap desktop drives that do support ERC. I got burned when Seagate dropped ERC on their latest desktop drives. Hitachi Deskstar is the only affordable model on the market that still support ERC. > I understand it's the ERC causing disks to get kicked, and fully > understand if you can't help further. Not that I won't help, as there's no risk to me :-) > Assembling without sdg I'm not sure will do it, as what we have is 4 > disks with the same events counter (3 active sync (sda/sdb/sdf), 1 > spare rebuilding (sdd)), and 2 (sdg/sdc) removed with older event > counters. Leaving out sdg leaves us with sdc which has an event > counter of 1848333. As the 3 active sync (sda/sdb/sdf) + 1 spare > (sdd) have an event counter of 1848341, mdadm doesn't want to let me > use sdc in the array even with --force. This surprises me. The purpose of "--force" with assemble is to ignore the event count. Have you tried this with the newer mdadm you compiled? > As you say as it's in the middle of a reshape so a recreate is out. > > I'm considering data loss is a given at this point, but even being > able to bring the array online degraded and pull out whatever is still > intact would help. > > If you have any further suggestions that would be great, but I do > understand your position on ERC and thank you for your input :-) Please do retry the --assemble --force with /dev/sdg left out? I'll leave the balance of your response untrimmed for the list to see. 
Phil > Feb 7 01:07:16 raven kernel: [18891.989330] ata8: hard resetting link > Feb 7 01:07:22 raven kernel: [18897.356104] ata8: link is slow to > respond, please be patient (ready=0) > Feb 7 01:07:26 raven kernel: [18902.004280] ata8: hard resetting link > Feb 7 01:07:32 raven kernel: [18907.372104] ata8: link is slow to > respond, please be patient (ready=0) > Feb 7 01:07:36 raven kernel: [18912.020097] ata8: SATA link up 6.0 > Gbps (SStatus 133 SControl 300) > Feb 7 01:07:41 raven kernel: [18917.020093] ata8.00: qc timeout (cmd 0xec) > Feb 7 01:07:41 raven kernel: [18917.028074] ata8.00: failed to > IDENTIFY (I/O error, err_mask=0x4) > Feb 7 01:07:41 raven kernel: [18917.028310] ata8: hard resetting link > Feb 7 01:07:47 raven kernel: [18922.396089] ata8: link is slow to > respond, please be patient (ready=0) > Feb 7 01:07:51 raven kernel: [18927.044313] ata8: hard resetting link > Feb 7 01:07:56 raven kernel: [18932.020099] ata8: SATA link up 6.0 > Gbps (SStatus 133 SControl 300) > Feb 7 01:08:06 raven kernel: [18942.020048] ata8.00: qc timeout (cmd 0xec) > Feb 7 01:08:06 raven kernel: [18942.028075] ata8.00: failed to > IDENTIFY (I/O error, err_mask=0x4) > Feb 7 01:08:06 raven kernel: [18942.028307] ata8: limiting SATA link > speed to 3.0 Gbps > Feb 7 01:08:06 raven kernel: [18942.028321] ata8: hard resetting link > Feb 7 01:08:12 raven kernel: [18947.396108] ata8: link is slow to > respond, please be patient (ready=0) > Feb 7 01:08:16 raven kernel: [18951.988069] ata8: SATA link up 6.0 > Gbps (SStatus 133 SControl 320) > Feb 7 01:08:46 raven kernel: [18981.988104] ata8.00: qc timeout (cmd 0xec) > Feb 7 01:08:46 raven kernel: [18981.996070] ata8.00: failed to > IDENTIFY (I/O error, err_mask=0x4) > Feb 7 01:08:46 raven kernel: [18981.996302] ata8.00: disabled > Feb 7 01:08:46 raven kernel: [18981.996324] ata8.00: device reported > invalid CHS sector 0 > Feb 7 01:08:46 raven kernel: [18981.996348] ata8: hard resetting link > Feb 7 01:08:52 raven kernel: [18987.364104] ata8: link is slow to > respond, please be patient (ready=0) > Feb 7 01:08:56 raven kernel: [18992.012050] ata8: SATA link up 6.0 > Gbps (SStatus 133 SControl 320) > Feb 7 01:08:56 raven kernel: [18992.012114] ata8: EH complete > Feb 7 01:08:56 raven kernel: [18992.012158] sd 8:0:0:0: [sdg] > Unhandled error code > Feb 7 01:08:56 raven kernel: [18992.012165] sd 8:0:0:0: [sdg] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 01:08:56 raven kernel: [18992.012176] sd 8:0:0:0: [sdg] CDB: > Write(10): 2a 00 e8 e0 74 3f 00 00 08 00 > Feb 7 01:08:56 raven kernel: [18992.012696] md: super_written gets > error=-5, uptodate=0 > Feb 7 01:08:56 raven kernel: [18992.013169] sd 8:0:0:0: [sdg] > Unhandled error code > Feb 7 01:08:56 raven kernel: [18992.013176] sd 8:0:0:0: [sdg] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 01:08:56 raven kernel: [18992.013186] sd 8:0:0:0: [sdg] CDB: > Read(10): 28 00 04 9d bd bf 00 00 80 00 > Feb 7 01:08:56 raven kernel: [18992.276986] sd 8:0:0:0: [sdg] > Unhandled error code > Feb 7 01:08:56 raven kernel: [18992.276999] sd 8:0:0:0: [sdg] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 01:08:56 raven kernel: [18992.277012] sd 8:0:0:0: [sdg] CDB: > Read(10): 28 00 04 9d be 3f 00 00 80 00 > Feb 7 01:08:56 raven kernel: [18992.316919] sd 8:0:0:0: [sdg] > Unhandled error code > Feb 7 01:08:56 raven kernel: [18992.316930] sd 8:0:0:0: [sdg] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 01:08:56 raven kernel: [18992.316942] sd 8:0:0:0: [sdg] CDB: > 
Read(10): 28 00 04 9d be bf 00 00 80 00 > Feb 7 01:08:56 raven kernel: [18992.326906] sd 8:0:0:0: [sdg] > Unhandled error code > Feb 7 01:08:56 raven kernel: [18992.326920] sd 8:0:0:0: [sdg] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 01:08:56 raven kernel: [18992.326932] sd 8:0:0:0: [sdg] CDB: > Read(10): 28 00 04 9d bf 3f 00 00 80 00 > Feb 7 01:08:56 raven kernel: [18992.327944] sd 8:0:0:0: [sdg] > Unhandled error code > Feb 7 01:08:56 raven kernel: [18992.327956] sd 8:0:0:0: [sdg] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 01:08:56 raven kernel: [18992.327968] sd 8:0:0:0: [sdg] CDB: > Read(10): 28 00 04 9d bf bf 00 00 80 00 > Feb 7 01:08:57 raven kernel: [18992.555093] md: md0: reshape done. > Feb 7 01:08:57 raven kernel: [18992.607595] md: reshape of RAID array md0 > Feb 7 01:08:57 raven kernel: [18992.607606] md: minimum _guaranteed_ > speed: 200000 KB/sec/disk. > Feb 7 01:08:57 raven kernel: [18992.607614] md: using maximum > available idle IO bandwidth (but not more than 200000 KB/sec) for > reshape. > Feb 7 01:08:57 raven kernel: [18992.607628] md: using 128k window, > over a total of 1953511936 blocks. > Feb 7 06:41:02 raven rsyslogd: [origin software="rsyslogd" > swVersion="4.2.0" x-pid="911" x-info="http://www.rsyslog.com"] > rsyslogd was HUPed, type 'lightweight'. > Feb 7 07:12:32 raven kernel: [40807.989092] ata5: hard resetting link > Feb 7 07:12:38 raven kernel: [40813.524074] ata5: SATA link up 6.0 > Gbps (SStatus 133 SControl 300) > Feb 7 07:12:43 raven kernel: [40818.524106] ata5.00: qc timeout (cmd 0xec) > Feb 7 07:12:43 raven kernel: [40818.524126] ata5.00: failed to > IDENTIFY (I/O error, err_mask=0x4) > Feb 7 07:12:43 raven kernel: [40818.532788] ata5: hard resetting link > Feb 7 07:12:48 raven kernel: [40824.058039] ata5: SATA link up 6.0 > Gbps (SStatus 133 SControl 300) > Feb 7 07:12:58 raven kernel: [40834.056101] ata5.00: qc timeout (cmd 0xec) > Feb 7 07:12:58 raven kernel: [40834.056121] ata5.00: failed to > IDENTIFY (I/O error, err_mask=0x4) > Feb 7 07:12:58 raven kernel: [40834.064203] ata5: limiting SATA link > speed to 3.0 Gbps > Feb 7 07:12:58 raven kernel: [40834.064217] ata5: hard resetting link > Feb 7 07:13:04 raven kernel: [40839.592095] ata5: SATA link up 3.0 > Gbps (SStatus 123 SControl 320) > Feb 7 07:13:34 raven kernel: [40869.592088] ata5.00: qc timeout (cmd 0xec) > Feb 7 07:13:34 raven kernel: [40869.592110] ata5.00: failed to > IDENTIFY (I/O error, err_mask=0x4) > Feb 7 07:13:34 raven kernel: [40869.599676] ata5.00: disabled > Feb 7 07:13:34 raven kernel: [40869.599700] ata5.00: device reported > invalid CHS sector 0 > Feb 7 07:13:34 raven kernel: [40869.599724] ata5: hard resetting link > Feb 7 07:13:39 raven kernel: [40875.124128] ata5: SATA link up 3.0 > Gbps (SStatus 123 SControl 320) > Feb 7 07:13:39 raven kernel: [40875.124201] ata5: EH complete > Feb 7 07:13:39 raven kernel: [40875.124243] sd 4:0:0:0: [sdd] > Unhandled error code > Feb 7 07:13:39 raven kernel: [40875.124251] sd 4:0:0:0: [sdd] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 07:13:39 raven kernel: [40875.124262] sd 4:0:0:0: [sdd] CDB: > Write(10): 2a 00 e8 e0 74 3f 00 00 08 00 > Feb 7 07:13:39 raven kernel: [40875.135544] md: super_written gets > error=-5, uptodate=0 > Feb 7 07:13:39 raven kernel: [40875.152171] sd 4:0:0:0: [sdd] > Unhandled error code > Feb 7 07:13:39 raven kernel: [40875.152179] sd 4:0:0:0: [sdd] Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Feb 7 07:13:39 raven kernel: [40875.152189] sd 
4:0:0:0: [sdd] CDB: > Read(10): 28 00 09 2b f2 3f 00 00 80 00 > Feb 7 07:13:41 raven kernel: [40876.734504] md: md0: reshape done. > Feb 7 07:13:41 raven kernel: [40876.736298] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.743529] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.750009] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.755143] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.760126] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.765070] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.769890] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.774759] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.779456] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.784166] lost page write due to > I/O error on md0 > Feb 7 07:13:41 raven kernel: [40876.788773] JBD: Detected IO errors > while flushing file data on md0 > Feb 7 07:13:41 raven kernel: [40876.796386] JBD: Detected IO errors > while flushing file data on md0 > > On Tue, Feb 7, 2012 at 1:15 PM, Phil Turmel <philip@turmel.org> wrote: >> Hi Richard, >> >> On 02/06/2012 08:34 PM, Richard Herd wrote: >>> Hey guys, >>> >>> I'm in a bit of a pickle here and if any mdadm kings could step in and >>> throw some advice my way I'd be very grateful :-) >>> >>> Quick bit of background - little NAS based on an AMD E350 running >>> Ubuntu 10.04. Running a software RAID 5 from 5x2TB disks. Every few >>> months one of the drives would fail a request and get kicked from the >>> array (as is becoming common for these larger multi TB drives they >>> tolerate the occasional bad sector by reallocating from a pool of >>> spares (but that's a whole other story)). This happened across a >>> variety of brands and two different controllers. I'd simply add the >>> disk that got popped back in and let it re-sync. SMART tests always >>> in good health. >> >> Some more detail on the actual devices would help, especially the >> output of lsdrv [1] to document what device serial numbers are which, >> for future reference. >> >> I also suspect you have problems with your drive's error recovery >> control, also known as time-limited error recovery. Simple sector >> errors should *not* be kicking out your drives. Mdadm knows to >> reconstruct from parity and rewrite when a read error is encountered. >> That either succeeds directly, or causes the drive to remap. >> >> You say that the SMART tests are good, so read errors are probably >> escalating into link timeouts, and the drive ignores the attempt to >> reconstruct. *That* kicks the drive out. >> >> "smartctl -x" reports for all of your drives would help identify if >> you have this problem. You *cannot* safely run raid arrays with drives >> that don't (or won't) report errors in a timely fashion (a few seconds). >> >>> It did make me nervous though. So I decided I'd add a second disk for >>> a bit of extra redundancy, making the array a RAID 6 - the thinking >>> was the occasional disk getting kicked and re-added from a RAID 6 >>> array wouldn't present as much risk as a single disk getting kicked >>> from a RAID 5. >>> >>> So first off, I added the 6th disk as a hotspare to the RAID5 array. >>> So I now had my 5 disk RAID 5 + hotspare. 
>>> >>> I then found that mdadm 2.6.7 (in the repositories) isn't actually >>> capable of a 5->6 reshape. So I pulled the latest 3.2.3 sources and >>> compiled myself a new version of mdadm. >>> >>> With the newer version of mdadm, it was happy to do the reshape - so I >>> set it off on it's merry way, using an esata HD (mounted at /usb :-P) >>> for the backupfile: >>> >>> root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6 >>> --backup-file=/usb/md0.backup >>> >>> It would take a week to reshape, but it was ona UPS & happily ticking >>> along. The array would be online the whole time so I was in no rush. >>> Content, I went to get some shut-eye. >>> >>> I got up this morning and took a quick look in /proc/mdstat to see how >>> things were going and saw things had failed spectacularly. At least >>> two disks had been kicked from the array and the whole thing had >>> crumbled. >> >> Do you still have the dmesg for this? >> >>> Ouch. >>> >>> I tried to assembe the array, to see if it would continue the reshape: >>> >>> root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 >>> /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 >>> >>> Unfortunately mdadm had decided that the backup-file was out of date >>> (timestamps didn't match) and was erroring with: Failed to restore >>> critical section for reshape, sorry.. >>> >>> Chances are things were in such a mess that backup file wasn't going >>> to be used anyway, so I blocked the timestamp check with: export >>> MDADM_GROW_ALLOW_OLD=1 >>> >>> That allowed me to assemble the array, but not run it as there were >>> not enough disks to start it. >>> >>> This is the current state of the array: >>> >>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] >>> [raid4] [raid10] >>> md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2] >>> 7814047744 blocks super 0.91 >>> >>> unused devices: <none> >>> >>> root@raven:/# mdadm --detail /dev/md0 >>> /dev/md0: >>> Version : 0.91 >>> Creation Time : Tue Jul 12 23:05:01 2011 >>> Raid Level : raid6 >>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB) >>> Raid Devices : 6 >>> Total Devices : 4 >>> Preferred Minor : 0 >>> Persistence : Superblock is persistent >>> >>> Update Time : Tue Feb 7 09:32:29 2012 >>> State : active, FAILED, Not Started >>> Active Devices : 3 >>> Working Devices : 4 >>> Failed Devices : 0 >>> Spare Devices : 1 >>> >>> Layout : left-symmetric-6 >>> Chunk Size : 64K >>> >>> New Layout : left-symmetric >>> >>> UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven) >>> Events : 0.1848341 >>> >>> Number Major Minor RaidDevice State >>> 0 0 0 0 removed >>> 1 8 17 1 active sync /dev/sdb1 >>> 2 8 1 2 active sync /dev/sda1 >>> 3 0 0 3 removed >>> 4 8 81 4 active sync /dev/sdf1 >>> 5 8 49 5 spare rebuilding /dev/sdd1 >>> >>> The two removed disks: >>> [ 3020.998529] md: kicking non-fresh sdc1 from array! >>> [ 3021.012672] md: kicking non-fresh sdg1 from array! >>> >>> Attempted to re-add the disks (same for both): >>> root@raven:/# mdadm /dev/md0 --add /dev/sdg1 >>> mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a >>> --re-add fails. >>> mdadm: not performing --add as that would convert /dev/sdg1 in to a spare. >>> mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first. >>> >>> With a failed array the last thing we want to do is add spares and >>> trigger a resync so obviously I haven't zeroed the superblocks and >>> added yet. >> >> That would be catastrophic. 
>> >>> Checked and two disks really are out of sync: >>> root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event >>> Events : 1848341 >>> Events : 1848341 >>> Events : 1848333 >>> Events : 1848341 >>> Events : 1848341 >>> Events : 1772921 >> >> So /dev/sdg1 dropped out first, and /dev/sdc1 followed and killed the >> array. >> >>> I'll post the output of --examine on all the disks below - if anyone >>> has any advice I'd really appreciate it (Neil Brown doesn't read these >>> forums does he?!?). I would usually move next to recreating the array >>> and using assume-clean but since it's right in the middle of a reshape >>> I'm not inclined to try. >> >> Neil absolutely reads this mailing list, and is likely to pitch in if >> I don't offer precisely correct advice :-) >> >> He's in an Australian time zone though, so latency might vary. I'm on the >> U.S. east coast, fwiw. >> >> In any case, with a re-shape in progress, "--create --assume-clean" is >> not an option. >> >>> Critical stuff is of course backed up, but there is some user data not >>> covered by backups that I'd like to try and restore if at all >>> possible. >> >> Hope is not all lost. If we can get your ERC adjusted, the next step >> would be to disconnect /dev/sdg from the system, and assemble with >> --force and MDADM_GROW_ALLOW_OLD=1 >> >> That'll let the reshape finish, leaving you with a single-degraded >> raid6. Then you fsck and make critical backups. Then you --zero- and >> --add /dev/sdg. >> >> If your drives don't support ERC, I can't recommend you continue until >> you've ddrescue'd your drives onto new ones that do support ERC. >> >> HTH, >> >> Phil >> >> [1] http://github.com/pturmel/lsdrv ^ permalink raw reply [flat|nested] 27+ messages in thread
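Spelled out, the sequence Phil is suggesting would look something like
the following (a sketch only: the initial --stop to clear the earlier
partial assembly is an assumption rather than something Phil stated,
and sdg is deliberately left off the device list):

  # clear the inactive, partially-assembled array from the earlier attempt
  mdadm --stop /dev/md0

  # assemble degraded without sdg, forcing the slightly-stale sdc back in
  export MDADM_GROW_ALLOW_OLD=1
  mdadm --assemble --force --verbose --backup-file=/usb/md0.backup /dev/md0 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1

  # if it starts, let the reshape run to completion before anything else
  cat /proc/mdstat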
* Re: Please Help! RAID5 -> 6 reshape gone bad
  2012-02-07  2:57 ` Phil Turmel
@ 2012-02-07  3:10   ` Richard Herd
  2012-02-07  3:24   ` Keith Keller
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Herd @ 2012-02-07  3:10 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid@vger.kernel.org

Thanks again Phil.

To confirm:

root@raven:/# mdadm -Avv --force --backup-file=/usb/md0.backup /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1

Results in the below, so even with --force it doesn't want to accept
'non-fresh' sdc.

mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/md0 has an active reshape - checking if critical section needs to be restored
mdadm: accepting backup with timestamp 1328559119 for array with timestamp 1328567549
mdadm: restoring critical section
mdadm: no uptodate device for slot 0 of /dev/md0
mdadm: added /dev/sda1 to /dev/md0 as 2
mdadm: added /dev/sdc1 to /dev/md0 as 3
mdadm: added /dev/sdf1 to /dev/md0 as 4
mdadm: added /dev/sdd1 to /dev/md0 as 5
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

And dmesg shows:

[11595.863451] md: bind<sda1>
[11595.863972] md: bind<sdc1>
[11595.865341] md: bind<sdf1>
[11595.869893] md: bind<sdd1>
[11595.870891] md: bind<sdb1>
[11595.871357] md: kicking non-fresh sdc1 from array!
[11595.871370] md: unbind<sdc1>
[11595.880072] md: export_rdev(sdc1)
[11595.882513] raid5: reshape will continue
[11595.882538] raid5: device sdb1 operational as raid disk 1
[11595.882542] raid5: device sdf1 operational as raid disk 4
[11595.882546] raid5: device sda1 operational as raid disk 2
[11595.883544] raid5: allocated 6308kB for md0
[11595.883627] 1: w=1 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0
[11595.883633] 5: w=1 pa=18 pr=6 m=2 a=2 r=6 op1=1 op2=0
[11595.883637] 4: w=2 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0
[11595.883642] 2: w=3 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0
[11595.883645] raid5: not enough operational devices for md0 (3/6 failed)
[11595.891968] RAID5 conf printout:
[11595.891971]  --- rd:6 wd:3
[11595.891976]  disk 1, o:1, dev:sdb1
[11595.891979]  disk 2, o:1, dev:sda1
[11595.891983]  disk 4, o:1, dev:sdf1
[11595.891986]  disk 5, o:1, dev:sdd1
[11595.892520] raid5: failed to run raid set md0
[11595.900726] md: pers->run() failed ...

Cheers

On Tue, Feb 7, 2012 at 1:57 PM, Phil Turmel <philip@turmel.org> wrote:
> Hi Richard,
>
> [restored CC list... please use reply-to-all on kernel.org lists]
>
> On 02/06/2012 09:40 PM, Richard Herd wrote:
>> Hi Phil,
>>
>> Thanks for the swift response :-)  Also I'm in (what I'd like to say
>> but can't - sunny) Sydney...
>>
>> OK, without slathering this thread is smart reports I can quite
>> definitely say you are exactly nail-on-the-head with regard to the
>> read errors escalating into link timeouts.  This is exactly what is
>> happening.  I had thought this was actually a pretty common setup for
>> home users (eg mdadm and drives such as WD20EARS/ST2000s) - I have the
>> luxury of budgets for Netapp kit at work - unfortunately my personal
>> finances only stretch to an ITX case and a bunch of cheap HDs!
>
> I understand the constraints, as I pinch pennies at home and at the
> office (I own my engineering firm).
I've made do with cheap desktop > drives that do support ERC. I got burned when Seagate dropped ERC on > their latest desktop drives. Hitachi Deskstar is the only affordable > model on the market that still support ERC. > >> I understand it's the ERC causing disks to get kicked, and fully >> understand if you can't help further. > > Not that I won't help, as there's no risk to me :-) > >> Assembling without sdg I'm not sure will do it, as what we have is 4 >> disks with the same events counter (3 active sync (sda/sdb/sdf), 1 >> spare rebuilding (sdd)), and 2 (sdg/sdc) removed with older event >> counters. Leaving out sdg leaves us with sdc which has an event >> counter of 1848333. As the 3 active sync (sda/sdb/sdf) + 1 spare >> (sdd) have an event counter of 1848341, mdadm doesn't want to let me >> use sdc in the array even with --force. > > This surprises me. The purpose of "--force" with assemble is to > ignore the event count. Have you tried this with the newer mdadm > you compiled? > >> As you say as it's in the middle of a reshape so a recreate is out. >> >> I'm considering data loss is a given at this point, but even being >> able to bring the array online degraded and pull out whatever is still >> intact would help. >> >> If you have any further suggestions that would be great, but I do >> understand your position on ERC and thank you for your input :-) > > Please do retry the --assemble --force with /dev/sdg left out? > > I'll leave the balance of your response untrimmed for the list to see. > > Phil > > >> Feb 7 01:07:16 raven kernel: [18891.989330] ata8: hard resetting link >> Feb 7 01:07:22 raven kernel: [18897.356104] ata8: link is slow to >> respond, please be patient (ready=0) >> Feb 7 01:07:26 raven kernel: [18902.004280] ata8: hard resetting link >> Feb 7 01:07:32 raven kernel: [18907.372104] ata8: link is slow to >> respond, please be patient (ready=0) >> Feb 7 01:07:36 raven kernel: [18912.020097] ata8: SATA link up 6.0 >> Gbps (SStatus 133 SControl 300) >> Feb 7 01:07:41 raven kernel: [18917.020093] ata8.00: qc timeout (cmd 0xec) >> Feb 7 01:07:41 raven kernel: [18917.028074] ata8.00: failed to >> IDENTIFY (I/O error, err_mask=0x4) >> Feb 7 01:07:41 raven kernel: [18917.028310] ata8: hard resetting link >> Feb 7 01:07:47 raven kernel: [18922.396089] ata8: link is slow to >> respond, please be patient (ready=0) >> Feb 7 01:07:51 raven kernel: [18927.044313] ata8: hard resetting link >> Feb 7 01:07:56 raven kernel: [18932.020099] ata8: SATA link up 6.0 >> Gbps (SStatus 133 SControl 300) >> Feb 7 01:08:06 raven kernel: [18942.020048] ata8.00: qc timeout (cmd 0xec) >> Feb 7 01:08:06 raven kernel: [18942.028075] ata8.00: failed to >> IDENTIFY (I/O error, err_mask=0x4) >> Feb 7 01:08:06 raven kernel: [18942.028307] ata8: limiting SATA link >> speed to 3.0 Gbps >> Feb 7 01:08:06 raven kernel: [18942.028321] ata8: hard resetting link >> Feb 7 01:08:12 raven kernel: [18947.396108] ata8: link is slow to >> respond, please be patient (ready=0) >> Feb 7 01:08:16 raven kernel: [18951.988069] ata8: SATA link up 6.0 >> Gbps (SStatus 133 SControl 320) >> Feb 7 01:08:46 raven kernel: [18981.988104] ata8.00: qc timeout (cmd 0xec) >> Feb 7 01:08:46 raven kernel: [18981.996070] ata8.00: failed to >> IDENTIFY (I/O error, err_mask=0x4) >> Feb 7 01:08:46 raven kernel: [18981.996302] ata8.00: disabled >> Feb 7 01:08:46 raven kernel: [18981.996324] ata8.00: device reported >> invalid CHS sector 0 >> Feb 7 01:08:46 raven kernel: [18981.996348] ata8: hard resetting link >> Feb 7 01:08:52 raven 
kernel: [18987.364104] ata8: link is slow to >> respond, please be patient (ready=0) >> Feb 7 01:08:56 raven kernel: [18992.012050] ata8: SATA link up 6.0 >> Gbps (SStatus 133 SControl 320) >> Feb 7 01:08:56 raven kernel: [18992.012114] ata8: EH complete >> Feb 7 01:08:56 raven kernel: [18992.012158] sd 8:0:0:0: [sdg] >> Unhandled error code >> Feb 7 01:08:56 raven kernel: [18992.012165] sd 8:0:0:0: [sdg] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 01:08:56 raven kernel: [18992.012176] sd 8:0:0:0: [sdg] CDB: >> Write(10): 2a 00 e8 e0 74 3f 00 00 08 00 >> Feb 7 01:08:56 raven kernel: [18992.012696] md: super_written gets >> error=-5, uptodate=0 >> Feb 7 01:08:56 raven kernel: [18992.013169] sd 8:0:0:0: [sdg] >> Unhandled error code >> Feb 7 01:08:56 raven kernel: [18992.013176] sd 8:0:0:0: [sdg] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 01:08:56 raven kernel: [18992.013186] sd 8:0:0:0: [sdg] CDB: >> Read(10): 28 00 04 9d bd bf 00 00 80 00 >> Feb 7 01:08:56 raven kernel: [18992.276986] sd 8:0:0:0: [sdg] >> Unhandled error code >> Feb 7 01:08:56 raven kernel: [18992.276999] sd 8:0:0:0: [sdg] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 01:08:56 raven kernel: [18992.277012] sd 8:0:0:0: [sdg] CDB: >> Read(10): 28 00 04 9d be 3f 00 00 80 00 >> Feb 7 01:08:56 raven kernel: [18992.316919] sd 8:0:0:0: [sdg] >> Unhandled error code >> Feb 7 01:08:56 raven kernel: [18992.316930] sd 8:0:0:0: [sdg] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 01:08:56 raven kernel: [18992.316942] sd 8:0:0:0: [sdg] CDB: >> Read(10): 28 00 04 9d be bf 00 00 80 00 >> Feb 7 01:08:56 raven kernel: [18992.326906] sd 8:0:0:0: [sdg] >> Unhandled error code >> Feb 7 01:08:56 raven kernel: [18992.326920] sd 8:0:0:0: [sdg] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 01:08:56 raven kernel: [18992.326932] sd 8:0:0:0: [sdg] CDB: >> Read(10): 28 00 04 9d bf 3f 00 00 80 00 >> Feb 7 01:08:56 raven kernel: [18992.327944] sd 8:0:0:0: [sdg] >> Unhandled error code >> Feb 7 01:08:56 raven kernel: [18992.327956] sd 8:0:0:0: [sdg] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 01:08:56 raven kernel: [18992.327968] sd 8:0:0:0: [sdg] CDB: >> Read(10): 28 00 04 9d bf bf 00 00 80 00 >> Feb 7 01:08:57 raven kernel: [18992.555093] md: md0: reshape done. >> Feb 7 01:08:57 raven kernel: [18992.607595] md: reshape of RAID array md0 >> Feb 7 01:08:57 raven kernel: [18992.607606] md: minimum _guaranteed_ >> speed: 200000 KB/sec/disk. >> Feb 7 01:08:57 raven kernel: [18992.607614] md: using maximum >> available idle IO bandwidth (but not more than 200000 KB/sec) for >> reshape. >> Feb 7 01:08:57 raven kernel: [18992.607628] md: using 128k window, >> over a total of 1953511936 blocks. >> Feb 7 06:41:02 raven rsyslogd: [origin software="rsyslogd" >> swVersion="4.2.0" x-pid="911" x-info="http://www.rsyslog.com"] >> rsyslogd was HUPed, type 'lightweight'. 
>> Feb 7 07:12:32 raven kernel: [40807.989092] ata5: hard resetting link >> Feb 7 07:12:38 raven kernel: [40813.524074] ata5: SATA link up 6.0 >> Gbps (SStatus 133 SControl 300) >> Feb 7 07:12:43 raven kernel: [40818.524106] ata5.00: qc timeout (cmd 0xec) >> Feb 7 07:12:43 raven kernel: [40818.524126] ata5.00: failed to >> IDENTIFY (I/O error, err_mask=0x4) >> Feb 7 07:12:43 raven kernel: [40818.532788] ata5: hard resetting link >> Feb 7 07:12:48 raven kernel: [40824.058039] ata5: SATA link up 6.0 >> Gbps (SStatus 133 SControl 300) >> Feb 7 07:12:58 raven kernel: [40834.056101] ata5.00: qc timeout (cmd 0xec) >> Feb 7 07:12:58 raven kernel: [40834.056121] ata5.00: failed to >> IDENTIFY (I/O error, err_mask=0x4) >> Feb 7 07:12:58 raven kernel: [40834.064203] ata5: limiting SATA link >> speed to 3.0 Gbps >> Feb 7 07:12:58 raven kernel: [40834.064217] ata5: hard resetting link >> Feb 7 07:13:04 raven kernel: [40839.592095] ata5: SATA link up 3.0 >> Gbps (SStatus 123 SControl 320) >> Feb 7 07:13:34 raven kernel: [40869.592088] ata5.00: qc timeout (cmd 0xec) >> Feb 7 07:13:34 raven kernel: [40869.592110] ata5.00: failed to >> IDENTIFY (I/O error, err_mask=0x4) >> Feb 7 07:13:34 raven kernel: [40869.599676] ata5.00: disabled >> Feb 7 07:13:34 raven kernel: [40869.599700] ata5.00: device reported >> invalid CHS sector 0 >> Feb 7 07:13:34 raven kernel: [40869.599724] ata5: hard resetting link >> Feb 7 07:13:39 raven kernel: [40875.124128] ata5: SATA link up 3.0 >> Gbps (SStatus 123 SControl 320) >> Feb 7 07:13:39 raven kernel: [40875.124201] ata5: EH complete >> Feb 7 07:13:39 raven kernel: [40875.124243] sd 4:0:0:0: [sdd] >> Unhandled error code >> Feb 7 07:13:39 raven kernel: [40875.124251] sd 4:0:0:0: [sdd] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 07:13:39 raven kernel: [40875.124262] sd 4:0:0:0: [sdd] CDB: >> Write(10): 2a 00 e8 e0 74 3f 00 00 08 00 >> Feb 7 07:13:39 raven kernel: [40875.135544] md: super_written gets >> error=-5, uptodate=0 >> Feb 7 07:13:39 raven kernel: [40875.152171] sd 4:0:0:0: [sdd] >> Unhandled error code >> Feb 7 07:13:39 raven kernel: [40875.152179] sd 4:0:0:0: [sdd] Result: >> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK >> Feb 7 07:13:39 raven kernel: [40875.152189] sd 4:0:0:0: [sdd] CDB: >> Read(10): 28 00 09 2b f2 3f 00 00 80 00 >> Feb 7 07:13:41 raven kernel: [40876.734504] md: md0: reshape done. 
>> Feb 7 07:13:41 raven kernel: [40876.736298] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.743529] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.750009] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.755143] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.760126] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.765070] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.769890] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.774759] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.779456] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.784166] lost page write due to >> I/O error on md0 >> Feb 7 07:13:41 raven kernel: [40876.788773] JBD: Detected IO errors >> while flushing file data on md0 >> Feb 7 07:13:41 raven kernel: [40876.796386] JBD: Detected IO errors >> while flushing file data on md0 >> >> On Tue, Feb 7, 2012 at 1:15 PM, Phil Turmel <philip@turmel.org> wrote: >>> Hi Richard, >>> >>> On 02/06/2012 08:34 PM, Richard Herd wrote: >>>> Hey guys, >>>> >>>> I'm in a bit of a pickle here and if any mdadm kings could step in and >>>> throw some advice my way I'd be very grateful :-) >>>> >>>> Quick bit of background - little NAS based on an AMD E350 running >>>> Ubuntu 10.04. Running a software RAID 5 from 5x2TB disks. Every few >>>> months one of the drives would fail a request and get kicked from the >>>> array (as is becoming common for these larger multi TB drives they >>>> tolerate the occasional bad sector by reallocating from a pool of >>>> spares (but that's a whole other story)). This happened across a >>>> variety of brands and two different controllers. I'd simply add the >>>> disk that got popped back in and let it re-sync. SMART tests always >>>> in good health. >>> >>> Some more detail on the actual devices would help, especially the >>> output of lsdrv [1] to document what device serial numbers are which, >>> for future reference. >>> >>> I also suspect you have problems with your drive's error recovery >>> control, also known as time-limited error recovery. Simple sector >>> errors should *not* be kicking out your drives. Mdadm knows to >>> reconstruct from parity and rewrite when a read error is encountered. >>> That either succeeds directly, or causes the drive to remap. >>> >>> You say that the SMART tests are good, so read errors are probably >>> escalating into link timeouts, and the drive ignores the attempt to >>> reconstruct. *That* kicks the drive out. >>> >>> "smartctl -x" reports for all of your drives would help identify if >>> you have this problem. You *cannot* safely run raid arrays with drives >>> that don't (or won't) report errors in a timely fashion (a few seconds). >>> >>>> It did make me nervous though. So I decided I'd add a second disk for >>>> a bit of extra redundancy, making the array a RAID 6 - the thinking >>>> was the occasional disk getting kicked and re-added from a RAID 6 >>>> array wouldn't present as much risk as a single disk getting kicked >>>> from a RAID 5. >>>> >>>> So first off, I added the 6th disk as a hotspare to the RAID5 array. >>>> So I now had my 5 disk RAID 5 + hotspare. >>>> >>>> I then found that mdadm 2.6.7 (in the repositories) isn't actually >>>> capable of a 5->6 reshape. 
So I pulled the latest 3.2.3 sources and >>>> compiled myself a new version of mdadm. >>>> >>>> With the newer version of mdadm, it was happy to do the reshape - so I >>>> set it off on it's merry way, using an esata HD (mounted at /usb :-P) >>>> for the backupfile: >>>> >>>> root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6 >>>> --backup-file=/usb/md0.backup >>>> >>>> It would take a week to reshape, but it was ona UPS & happily ticking >>>> along. The array would be online the whole time so I was in no rush. >>>> Content, I went to get some shut-eye. >>>> >>>> I got up this morning and took a quick look in /proc/mdstat to see how >>>> things were going and saw things had failed spectacularly. At least >>>> two disks had been kicked from the array and the whole thing had >>>> crumbled. >>> >>> Do you still have the dmesg for this? >>> >>>> Ouch. >>>> >>>> I tried to assembe the array, to see if it would continue the reshape: >>>> >>>> root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 >>>> /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 >>>> >>>> Unfortunately mdadm had decided that the backup-file was out of date >>>> (timestamps didn't match) and was erroring with: Failed to restore >>>> critical section for reshape, sorry.. >>>> >>>> Chances are things were in such a mess that backup file wasn't going >>>> to be used anyway, so I blocked the timestamp check with: export >>>> MDADM_GROW_ALLOW_OLD=1 >>>> >>>> That allowed me to assemble the array, but not run it as there were >>>> not enough disks to start it. >>>> >>>> This is the current state of the array: >>>> >>>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] >>>> [raid4] [raid10] >>>> md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2] >>>> 7814047744 blocks super 0.91 >>>> >>>> unused devices: <none> >>>> >>>> root@raven:/# mdadm --detail /dev/md0 >>>> /dev/md0: >>>> Version : 0.91 >>>> Creation Time : Tue Jul 12 23:05:01 2011 >>>> Raid Level : raid6 >>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB) >>>> Raid Devices : 6 >>>> Total Devices : 4 >>>> Preferred Minor : 0 >>>> Persistence : Superblock is persistent >>>> >>>> Update Time : Tue Feb 7 09:32:29 2012 >>>> State : active, FAILED, Not Started >>>> Active Devices : 3 >>>> Working Devices : 4 >>>> Failed Devices : 0 >>>> Spare Devices : 1 >>>> >>>> Layout : left-symmetric-6 >>>> Chunk Size : 64K >>>> >>>> New Layout : left-symmetric >>>> >>>> UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven) >>>> Events : 0.1848341 >>>> >>>> Number Major Minor RaidDevice State >>>> 0 0 0 0 removed >>>> 1 8 17 1 active sync /dev/sdb1 >>>> 2 8 1 2 active sync /dev/sda1 >>>> 3 0 0 3 removed >>>> 4 8 81 4 active sync /dev/sdf1 >>>> 5 8 49 5 spare rebuilding /dev/sdd1 >>>> >>>> The two removed disks: >>>> [ 3020.998529] md: kicking non-fresh sdc1 from array! >>>> [ 3021.012672] md: kicking non-fresh sdg1 from array! >>>> >>>> Attempted to re-add the disks (same for both): >>>> root@raven:/# mdadm /dev/md0 --add /dev/sdg1 >>>> mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a >>>> --re-add fails. >>>> mdadm: not performing --add as that would convert /dev/sdg1 in to a spare. >>>> mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first. >>>> >>>> With a failed array the last thing we want to do is add spares and >>>> trigger a resync so obviously I haven't zeroed the superblocks and >>>> added yet. >>> >>> That would be catastrophic. 
>>> >>>> Checked and two disks really are out of sync: >>>> root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event >>>> Events : 1848341 >>>> Events : 1848341 >>>> Events : 1848333 >>>> Events : 1848341 >>>> Events : 1848341 >>>> Events : 1772921 >>> >>> So /dev/sdg1 dropped out first, and /dev/sdc1 followed and killed the >>> array. >>> >>>> I'll post the output of --examine on all the disks below - if anyone >>>> has any advice I'd really appreciate it (Neil Brown doesn't read these >>>> forums does he?!?). I would usually move next to recreating the array >>>> and using assume-clean but since it's right in the middle of a reshape >>>> I'm not inclined to try. >>> >>> Neil absolutely reads this mailing list, and is likely to pitch in if >>> I don't offer precisely correct advice :-) >>> >>> He's in an Australian time zone though, so latency might vary. I'm on the >>> U.S. east coast, fwiw. >>> >>> In any case, with a re-shape in progress, "--create --assume-clean" is >>> not an option. >>> >>>> Critical stuff is of course backed up, but there is some user data not >>>> covered by backups that I'd like to try and restore if at all >>>> possible. >>> >>> Hope is not all lost. If we can get your ERC adjusted, the next step >>> would be to disconnect /dev/sdg from the system, and assemble with >>> --force and MDADM_GROW_ALLOW_OLD=1 >>> >>> That'll let the reshape finish, leaving you with a single-degraded >>> raid6. Then you fsck and make critical backups. Then you --zero- and >>> --add /dev/sdg. >>> >>> If your drives don't support ERC, I can't recommend you continue until >>> you've ddrescue'd your drives onto new ones that do support ERC. >>> >>> HTH, >>> >>> Phil >>> >>> [1] http://github.com/pturmel/lsdrv > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 2:57 ` Phil Turmel 2012-02-07 3:10 ` Richard Herd @ 2012-02-07 3:24 ` Keith Keller 2012-02-07 3:38 ` Phil Turmel 2012-02-08 7:13 ` Please Help! RAID5 -> 6 reshapre gone bad Stan Hoeppner 1 sibling, 2 replies; 27+ messages in thread From: Keith Keller @ 2012-02-07 3:24 UTC (permalink / raw) To: linux-raid On 2012-02-07, Phil Turmel <philip@turmel.org> wrote: > > I understand the constraints, as I pinch pennies at home and at the > office (I own my engineering firm). I've made do with cheap desktop > drives that do support ERC. I got burned when Seagate dropped ERC on > their latest desktop drives. Hitachi Deskstar is the only affordable > model on the market that still support ERC. I can testify that the EARS/EADS drives can be troublesome (see my recent threads on the list). I also found out that apparently the flooding in Thailand is delaying all drive vendors' enterprise drives-- they seem to be one of the few factories that make an essential part, and their factories are all underwater. Have others had success with mdraid and the Deskstar drives? I wouldn't mind saving a little money if the drives will actually work, especially if I can get them in before April (the earliest one vendor thinks they might be able to start building drives again). --keith -- kkeller@wombat.san-francisco.ca.us ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 3:24 ` Keith Keller @ 2012-02-07 3:38 ` Phil Turmel 2012-01-31 6:31 ` rebuild raid6 after two failures Keith Keller 2012-02-08 7:13 ` Please Help! RAID5 -> 6 reshapre gone bad Stan Hoeppner 1 sibling, 1 reply; 27+ messages in thread From: Phil Turmel @ 2012-02-07 3:38 UTC (permalink / raw) To: Keith Keller; +Cc: linux-raid On 02/06/2012 10:24 PM, Keith Keller wrote: > On 2012-02-07, Phil Turmel <philip@turmel.org> wrote: >> >> I understand the constraints, as I pinch pennies at home and at the >> office (I own my engineering firm). I've made do with cheap desktop >> drives that do support ERC. I got burned when Seagate dropped ERC on >> their latest desktop drives. Hitachi Deskstar is the only affordable >> model on the market that still support ERC. > > I can testify that the EARS/EADS drives can be troublesome (see my > recent threads on the list). I also found out that apparently the > flooding in Thailand is delaying all drive vendors' enterprise drives-- > they seem to be one of the few factories that make an essential part, > and their factories are all underwater. > > Have others had success with mdraid and the Deskstar drives? I wouldn't > mind saving a little money if the drives will actually work, especially > if I can get them in before April (the earliest one vendor thinks > they might be able to start building drives again). Ow. I haven't actually bought any yet... I was hoping the prices would come down before I needed to. Sounds like I'll be waiting longer than I expected. But, I reviewed the OEM documentation for the 7K3000 family, and they clearly document support for the SCT ERC commands (para 9.18.1.2). Phil ^ permalink raw reply [flat|nested] 27+ messages in thread
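For anyone vetting a candidate drive, a quick way to query and set SCT ERC with smartctl looks like the following (sdX is a placeholder; on most drives the setting is volatile and has to be reapplied after every power-on, e.g. from a boot script):

smartctl -l scterc /dev/sdX          # show the current read/write ERC timers, or report that SCT ERC is unsupported
smartctl -l scterc,70,70 /dev/sdX    # set both timers to 7.0 seconds (values are in units of 100 ms)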
* rebuild raid6 after two failures @ 2012-01-31 6:31 ` Keith Keller 2012-02-01 4:42 ` Keith Keller 0 siblings, 1 reply; 27+ messages in thread From: Keith Keller @ 2012-01-31 6:31 UTC (permalink / raw) To: linux-raid Hello list, I recently had a RAID6 lose two drives in quick succession, with one spare already in place. The rebuild started fine with the spare, but now that I've replaced the failed disks, should I expect the current rebuild to finish, then rebuild on another spare? Or do I need to do something special to kick off the rebuilding on another spare? I tried looking for the answer using various web search permutations with no success. My mdadm and uname output is below. (I did not remember to use a newer mdadm to add the spares, so I originally used 2.6.9, but I do have 3.2.3 available on this box.) Thanks for any pointers. --keith # uname -a Linux xxxxxxxxxx 2.6.39-4.1.el5.elrepo #1 SMP PREEMPT Wed Jan 18 13:16:25 EST 2012 x86_64 x86_64 x86_64 GNU/Linux # mdadm -D /dev/md0 /dev/md0: Version : 1.01 Creation Time : Thu Sep 29 21:26:35 2011 Raid Level : raid6 Array Size : 15624911360 (14901.08 GiB 15999.91 GB) Used Dev Size : 1953113920 (1862.63 GiB 1999.99 GB) Raid Devices : 10 Total Devices : 12 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Mon Jan 30 22:07:26 2012 State : clean, degraded, recovering Active Devices : 8 Working Devices : 12 Failed Devices : 0 Spare Devices : 4 Chunk Size : 64K Rebuild Status : 18% complete Name : 0 UUID : 24363b01:90deb9b5:4b51e5df:68b8b6ea Events : 164419 Number Major Minor RaidDevice State 0 8 17 0 active sync /dev/sdb1 13 8 33 1 active sync /dev/sdc1 11 8 145 2 active sync /dev/sdj1 12 8 161 3 active sync /dev/sdk1 4 8 65 4 active sync /dev/sde1 9 8 113 5 active sync /dev/sdh1 10 8 81 6 active sync /dev/sdf1 3 8 49 7 spare rebuilding /dev/sdd1 8 8 129 8 active sync /dev/sdi1 9 0 0 9 removed 14 8 177 - spare /dev/sdl1 15 8 209 - spare /dev/sdn1 16 8 225 - spare /dev/sdo1 -- kkeller@wombat.san-francisco.ca.us (try just my userid to email me) AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt see X- headers for PGP signature information ^ permalink raw reply [flat|nested] 27+ messages in thread
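Not part of the original post, but the usual way to see whether md has actually started rebuilding onto a spare, and how far it has got, is via /proc/mdstat and sysfs:

cat /proc/mdstat                        # shows the running recovery and its progress
cat /sys/block/md0/md/sync_action       # reads "recover" while a spare is being rebuilt, "idle" when nothing is running
mdadm --detail /dev/md0 | grep -E 'State|Rebuild Status'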
* Re: rebuild raid6 after two failures 2012-01-31 6:31 ` rebuild raid6 after two failures Keith Keller @ 2012-02-01 4:42 ` Keith Keller 2012-02-01 5:31 ` NeilBrown 2012-02-03 16:08 ` using dd (or dd_rescue) to salvage array Keith Keller 0 siblings, 2 replies; 27+ messages in thread From: Keith Keller @ 2012-02-01 4:42 UTC (permalink / raw) To: linux-raid On 2012-01-31, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote: > > I recently had a RAID6 lose two drives in quick succession, with one > spare already in place. The rebuild started fine with the spare, but > now that I've replaced the failed disks, should I expect the current > rebuild to finish, then rebuild on another spare? [snip] Well, for better or worse, this is now a moot question--I had another drive kicked out of the array, I believe prematurely by the controller. I was able to --assemble --force the array, and it is now rebuilding two spares instead of one. AFAIR there was no activity on the filesystem at the time, so I am optimistic that the filesystem should be fine after an fsck. Thanks to the advice from last time which suggested --assemble --force instead of --assume-clean in this situation. Could it have been the older version of mdadm that didn't tell the kernel to start rebuilding the added spare? I have made 3.2.3 my default mdadm, which I hope alleviates some of the issues I've had with rebuilds not starting. (As an aside, I've also bitten the bullet and decided to swap out all the WD-EARS drives for real RAID drives; ideally I'd replace the controller, but I don't want to invest the time needed to replace and test all the components properly.) --keith -- kkeller@wombat.san-francisco.ca.us ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: rebuild raid6 after two failures 2012-02-01 4:42 ` Keith Keller @ 2012-02-01 5:31 ` NeilBrown 2012-02-01 5:48 ` Keith Keller 2012-02-03 16:08 ` using dd (or dd_rescue) to salvage array Keith Keller 1 sibling, 1 reply; 27+ messages in thread From: NeilBrown @ 2012-02-01 5:31 UTC (permalink / raw) To: Keith Keller; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1704 bytes --] On Tue, 31 Jan 2012 20:42:28 -0800 Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote: > On 2012-01-31, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote: > > > > I recently had a RAID6 lose two drives in quick succession, with one > > spare already in place. The rebuild started fine with the spare, but > > now that I've replaced the failed disks, should I expect the current > > rebuild to finish, then rebuild on another spare? > > [snip] > > Well, for better or worse, this is now a moot question--I had another > drive kicked out of the array, I believe prematurely by the controller. > I was able to --assemble --force the array, and it is now rebuilding > two spares instead of one. AFAIR there was no activity on the > filesystem at the time, so I am optimistic that the filesystem should be > fine after an fsck. Thanks to the advice from last time which suggested > --assemble --force instead of --assume-clean in this situation. > > Could it have been the older version of mdadm that didn't tell the > kernel to start rebuilding the added spare? I have made 3.2.3 my > default mdadm, which I hope alleviates some of the issues I've had with > rebuilds not starting. (As an aside, I've also bitten the bullet and > decided to swap out all the WD-EARS drives for real RAID drives; ideally > I'd replace the controller, but I don't want to invest the time needed > to replace and test all the components properly.) If a spare is being rebuild when another spare is added, it keeps with the first rebuild rather than restarting from the beginning. This means that you get some redundancy sooner, which is probably a good thing. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: rebuild raid6 after two failures 2012-02-01 5:31 ` NeilBrown @ 2012-02-01 5:48 ` Keith Keller 0 siblings, 0 replies; 27+ messages in thread From: Keith Keller @ 2012-02-01 5:48 UTC (permalink / raw) To: linux-raid On 2012-02-01, NeilBrown <neilb@suse.de> wrote: > > If a spare is being rebuild when another spare is added, it keeps with the > first rebuild rather than restarting from the beginning. > > This means that you get some redundancy sooner, which is probably a good > thing. Great, thanks for the info. I just wanted to check that the behavior I saw earlier was expected. (Yes, it's a good thing!) --keith -- kkeller@wombat.san-francisco.ca.us ^ permalink raw reply [flat|nested] 27+ messages in thread
* using dd (or dd_rescue) to salvage array 2012-02-01 4:42 ` Keith Keller 2012-02-01 5:31 ` NeilBrown @ 2012-02-03 16:08 ` Keith Keller 2012-02-04 18:01 ` Stefan /*St0fF*/ Hübner 1 sibling, 1 reply; 27+ messages in thread From: Keith Keller @ 2012-02-03 16:08 UTC (permalink / raw) To: linux-raid On 2012-02-01, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote: > > Well, for better or worse, this is now a moot question--I had another > drive kicked out of the array, I believe prematurely by the controller. It turns out to be worse--the drive does in fact appear to be failing, which would be the third failure on this RAID6 array. I had what might be a crazy thought--would it be worth the trouble to attempt to use dd (or dd_rescue, a tool I found that claims to continue on bad blocks) to write the disk image to another disk, and attempt a rebuild with the new disk? Or am I just wasting my time? (The array is hosting an rsnapshot backup set, so I can recreate the latest snapshot, but it'll take a while. So it'd be nice to save the array if it's possible and not time- consuming.) Thanks for your help! --keith -- kkeller@wombat.san-francisco.ca.us ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: using dd (or dd_rescue) to salvage array 2012-02-03 16:08 ` using dd (or dd_rescue) to salvage array Keith Keller @ 2012-02-04 18:01 ` Stefan /*St0fF*/ Hübner 2012-02-05 19:10 ` Keith Keller 0 siblings, 1 reply; 27+ messages in thread From: Stefan /*St0fF*/ Hübner @ 2012-02-04 18:01 UTC (permalink / raw) To: Keith Keller; +Cc: linux-raid Am 03.02.2012 17:08, schrieb Keith Keller: > On 2012-02-01, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote: >> >> Well, for better or worse, this is now a moot question--I had another >> drive kicked out of the array, I believe prematurely by the controller. > > It turns out to be worse--the drive does in fact appear to be failing, > which would be the third failure on this RAID6 array. I had what might > be a crazy thought--would it be worth the trouble to attempt to use dd > (or dd_rescue, a tool I found that claims to continue on bad blocks) to > write the disk image to another disk, and attempt a rebuild with the new > disk? Or am I just wasting my time? (The array is hosting an rsnapshot > backup set, so I can recreate the latest snapshot, but it'll take a > while. So it'd be nice to save the array if it's possible and not time- > consuming.) > > Thanks for your help! > > --keith > Hi Keith, actually, ddrescue is THE WAY TO GO in this case. Don't use the old ddrescue, but the GNU version. Some distros call it gddrescue, on gentoo the old one is called dd-rescue and the gnu-one ddrescue. Just check it out: http://www.gnu.org/software/ddrescue/ddrescue.html Good luck, Stefan ^ permalink raw reply [flat|nested] 27+ messages in thread
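A minimal GNU ddrescue run for this kind of disk-to-disk rescue might look like the two passes below. The device names and mapfile path are placeholders; the mapfile (third argument) is what lets an interrupted copy resume where it left off:

ddrescue -f -n /dev/sdX /dev/sdY /root/rescue.map     # quick first pass, skip the difficult areas
ddrescue -f -r3 /dev/sdX /dev/sdY /root/rescue.map    # second pass, retry the bad areas up to 3 times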
* Re: using dd (or dd_rescue) to salvage array 2012-02-04 18:01 ` Stefan /*St0fF*/ Hübner @ 2012-02-05 19:10 ` Keith Keller 2012-02-06 21:37 ` Stefan *St0fF* Huebner 0 siblings, 1 reply; 27+ messages in thread From: Keith Keller @ 2012-02-05 19:10 UTC (permalink / raw) To: linux-raid On 2012-02-04, Stefan /*St0fF*/ Hübner <stefan.huebner@stud.tu-ilmenau.de> wrote: > > actually, ddrescue is THE WAY TO GO in this case. Don't use the old > ddrescue, but the GNU version. Some distros call it gddrescue, on > gentoo the old one is called dd-rescue and the gnu-one ddrescue. Just > check it out: http://www.gnu.org/software/ddrescue/ddrescue.html Thanks for the advice, Stefan. Frustratingly enough, I will get a chance to try GNU ddrescue despite my impatience--I originally used dd_rescue to try to get an image of the failing drive, and while that succeeded just fine (only lost 8k), the target ended up reporting ECC errors during the rebuild! So I've taken a new image with ddrescue to a tested drive (again, losing 8k), and am hoping that it goes better. (At the moment I'm just attempting a one-spare rebuild, which I'm hoping will go faster than a two-disk build, and therefore report any problems sooner.) I realized after reading my initial post that I wasn't 100% clear what I was asking. I knew that some sort of dd would work, but I'd only done it before in a filesystem context, and didn't know how mdraid would react. So I am curious, does anyone know what I might expect when the rebuild gets to the part on the new image where the data was lost? Will it just create a problem on the filesystem, or might something worse happen? Should I run a check if the rebuild completes successfully? And will mismatch_cnt get populated by the rebuild, or would I need a check to expose mismatches? --keith -- kkeller-usenet@wombat.san-francisco.ca.us -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
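As far as I know a plain rebuild never updates mismatch_cnt; only a check/repair scrub does. Assuming the array is md0, that scrub would look like this once the rebuild has finished:

echo check > /sys/block/md0/md/sync_action    # read-only scrub of every stripe
cat /proc/mdstat                              # watch its progress
cat /sys/block/md0/md/mismatch_cnt            # non-zero afterwards means inconsistent stripes were found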
* Re: using dd (or dd_rescue) to salvage array 2012-02-05 19:10 ` Keith Keller @ 2012-02-06 21:37 ` Stefan *St0fF* Huebner 2012-02-07 3:44 ` Keith Keller 2012-02-07 4:24 ` Keith Keller 0 siblings, 2 replies; 27+ messages in thread From: Stefan *St0fF* Huebner @ 2012-02-06 21:37 UTC (permalink / raw) To: Keith Keller; +Cc: linux-raid On 05.02.2012 20:10, Keith Keller wrote: > On 2012-02-04, Stefan /*St0fF*/ Hübner<stefan.huebner@stud.tu-ilmenau.de> wrote: >> actually, ddrescue is THE WAY TO GO in this case. Don't use the old >> ddrescue, but the GNU version. Some distros call it gddrescue, on >> gentoo the old one is called dd-rescue and the gnu-one ddrescue. Just >> check it out: http://www.gnu.org/software/ddrescue/ddrescue.html > Thanks for the advice, Stefan. Frustratingly enough, I will get a > chance to try GNU ddrescue despite my impatience--I originally used > dd_rescue to try to get an image of the failing drive, and while that > succeeded just fine (only lost 8k), the target ended up reporting ECC > errors during the rebuild! So I've taken a new image with ddrescue > to a tested drive (again, losing 8k), and am hoping that it goes better. > (At the moment I'm just attempting a one-spare rebuild, which I'm hoping > will go faster than a two-disk build, and therefore report any problems > sooner.) > > I realized after reading my initial post that I wasn't 100% clear what I > was asking. I knew that some sort of dd would work, but I'd only done > it before in a filesystem context, and didn't know how mdraid would > react. So I am curious, does anyone know what I might expect when the > rebuild gets to the part on the new image where the data was lost? Will > it just create a problem on the filesystem, or might something worse > happen? Should I run a check if the rebuild completes successfully? > And will mismatch_cnt get populated by the rebuild, or would I need a > check to expose mismatches? > > --keith > From the logical point of view those lost 8k would create bad data - i.e. a filesystem problem OR simply corrupted data. That depends on which blocks exactly are bad. If you were using lvm it could even be worse, like broken metadata. It would be good if those 8k were "in a row" - that way at max 3 fs-blocks (when using 4k fs-blocksize) would be corrupted. If you're lucky, you won't even notice - like me: my system SSD broke down lately. I ddrescued as much as I could, but around 250k are gone. I'm dual-booting windows and gentoo and I have not yet encountered a problem from the missing data. Lucky me... Cheers, Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
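To spell out the arithmetic (assuming the lost 8 KiB really is one contiguous run and the filesystem uses 4 KiB blocks): a run that starts on a 4 KiB boundary covers exactly 2 blocks, while one that starts mid-block clips the tail of one block and the head of another, so at most 3 blocks are affected.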
* Re: using dd (or dd_rescue) to salvage array 2012-02-06 21:37 ` Stefan *St0fF* Huebner @ 2012-02-07 3:44 ` Keith Keller 2012-02-07 4:24 ` Keith Keller 1 sibling, 0 replies; 27+ messages in thread From: Keith Keller @ 2012-02-07 3:44 UTC (permalink / raw) To: linux-raid On 2012-02-06, Stefan *St0fF* Huebner <st0ff@gmx.net> wrote: > > From the logical point of view those lost 8k would create bad data - > i.e. a filesystem problem OR simply corrupted data. That depends on > which blocks exactly are bad. If you were using lvm it could even be > worse, like broken metadata. I am using LVM, so I'll just have to hope for the best. I haven't yet done an xfs_repair, but I will do that soon. I just made my volume active, and vgchange didn't complain, so I'm guessing that's a good sign. > It would be good if those 8k were "in a row" - that way at max 3 > fs-blocks (when using 4k fs-blocksize) would be corrupted. It was--it looks like it was really just that one spot on the drive. So I am hopeful that any errors that are a result of the lost 8k will be reparable by xfs_repair. --keith -- kkeller@wombat.san-francisco.ca.us ^ permalink raw reply [flat|nested] 27+ messages in thread
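A cautious ordering for that check, with vg0/backup standing in for the real volume group and logical volume names (xfs_repair needs the filesystem unmounted, and mounting it once first lets XFS replay its log):

vgchange -ay vg0
lvs                                          # confirm the LV came up
mount /dev/vg0/backup /mnt && umount /mnt    # replay the XFS log
xfs_repair -n /dev/vg0/backup                # dry run: report problems, change nothing
xfs_repair /dev/vg0/backup                   # the real repair, once the dry run looks sane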
* Re: using dd (or dd_rescue) to salvage array 2012-02-06 21:37 ` Stefan *St0fF* Huebner 2012-02-07 3:44 ` Keith Keller @ 2012-02-07 4:24 ` Keith Keller 2012-02-07 20:01 ` Stefan *St0fF* Huebner 1 sibling, 1 reply; 27+ messages in thread From: Keith Keller @ 2012-02-07 4:24 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 1499 bytes --] On Mon, Feb 06, 2012 at 10:37:38PM +0100, Stefan *St0fF* Huebner wrote: > From the logical point of view those lost 8k would create bad data - > i.e. a filesystem problem OR simply corrupted data. That depends on > which blocks exactly are bad. If you were using lvm it could even be > worse, like broken metadata. FWIW, xfs_repair has spit out over 100k lines on stderr, but when I mounted before the repair (this is suggested if you need to replay the log), everything seemed intact. So I'm not yet sure what to make of things; perhaps it'll be fine, or perhaps I need to start over. (Alternatively, perhaps the next rsnapshot run will expose problems.) On Mon, Feb 06, 2012 at 10:38:27PM -0500, Phil Turmel wrote: > > But, I reviewed the OEM documentation for the 7K3000 family, and they > clearly document support for the SCT ERC commands (para 9.18.1.2). If I'm reading the docs right, then the 5K3000 also supports them (if you're really cheap and can tolerate the slower speeds). At this point, if we're waiting till April for real ''enterprise'' drives, I can't see anything too bad about getting one or two of these and testing them out in my environment--the EARS drives are bad enough with my configuration that it's hard to imagine being any worse. (To be fair, I have to blame the 3ware 9550 controller a bit too; I have EARS drives on other 3ware controllers without all these issues.) --keith -- kkeller@wombat.san-francisco.ca.us [-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: using dd (or dd_rescue) to salvage array 2012-02-07 4:24 ` Keith Keller @ 2012-02-07 20:01 ` Stefan *St0fF* Huebner 0 siblings, 0 replies; 27+ messages in thread From: Stefan *St0fF* Huebner @ 2012-02-07 20:01 UTC (permalink / raw) To: Keith Keller; +Cc: linux-raid On 07.02.2012 05:24, Keith Keller wrote: > On Mon, Feb 06, 2012 at 10:37:38PM +0100, Stefan *St0fF* Huebner wrote: >> From the logical point of view those lost 8k would create bad data - >> i.e. a filesystem problem OR simply corrupted data. That depends on >> which blocks exactly are bad. If you were using lvm it could even be >> worse, like broken metadata. > FWIW, xfs_repair has spit out over 100k lines on stderr, but when I > mounted before the repair (this is suggested if you need to replay the > log), everything seemed intact. So I'm not yet sure what to make of > things; perhaps it'll be fine, or perhaps I need to start over. > (Alternatively, perhaps the next rsnapshot run will expose problems.) > > > On Mon, Feb 06, 2012 at 10:38:27PM -0500, Phil Turmel wrote: >> But, I reviewed the OEM documentation for the 7K3000 family, and they >> clearly document support for the SCT ERC commands (para 9.18.1.2). > If I'm reading the docs right, then the 5K3000 also supports them (if > you're really cheap and can tolerate the slower speeds). At this point, > if we're waiting till April for real ''enterprise'' drives, I can't see > anything too bad about getting one or two of these and testing them out > in my environment--the EARS drives are bad enough with my configuration > that it's hard to imagine being any worse. (To be fair, I have to blame > the 3ware 9550 controller a bit too; I have EARS drives on other 3ware > controllers without all these issues.) > > --keith > > > Sounds promising. If you're really lucky, the blocks were freed by the fs and nothing has gone... Stefan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 3:24 ` Keith Keller 2012-02-07 3:38 ` Phil Turmel @ 2012-02-08 7:13 ` Stan Hoeppner 1 sibling, 0 replies; 27+ messages in thread From: Stan Hoeppner @ 2012-02-08 7:13 UTC (permalink / raw) To: Keith Keller; +Cc: linux-raid On 2/6/2012 9:24 PM, Keith Keller wrote: > Have others had success with mdraid and the Deskstar drives? I wouldn't > mind saving a little money if the drives will actually work, especially > if I can get them in before April (the earliest one vendor thinks > they might be able to start building drives again). Newegg seems to have nine 7.2k Deskstar models in stock, cond new, from 500GB to 3TB: http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=100007603+50001984+600003340&QksAutoSuggestion=&ShowDeactivatedMark=False&Configurator=&IsNodeId=1&Subcategory=14&description=&hisInDesc=&Ntk=&CFG=&SpeTabStoreType=&AdvancedSearch=1&srchInDesc= As of the date/time of this email, here's a spattering of the 9 available and their order qty limitations: H3IK30003272SW (0S03208) 3TB qty: 20 HUA723020ALA640 (0F12455) 2TB qty: 5 H3IK20003272SP (0S02861) 2TB qty: 20 HDS723020BLA642 (0f12115) 2TB qty: 20 HDS721010DLE630 (0F13180) 1TB qty: 100 7K1000.C 0F10383 1TB qty: 20 The HDS721010DLE630 1TB model looks pretty attractive if one needs spindle count IOPS more than total capacity. And needs more than 20 drives. -- Stan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Fwd: Please Help! RAID5 -> 6 reshapre gone bad [not found] ` <CAOANJV955ZdLexRTjVkQzTMapAaMitq5eqxP0rUvDjjLh4Wgzw@mail.gmail.com> 2012-02-07 2:57 ` Phil Turmel @ 2012-02-07 3:04 ` Richard Herd 1 sibling, 0 replies; 27+ messages in thread From: Richard Herd @ 2012-02-07 3:04 UTC (permalink / raw) To: linux-raid Sorry, sent that directly to Phil instead of back to the list. FYI the below email. Thanks for the response Neil :-) Also, just as a bit of clarification, it may help to understand what was going on 'real-world': # Last night reshape kicked off to go from RAID 5 to 6. # This morning at 1 a disk (sdg) was kicked out of the array basically for timing out on ERC. mdadm stops reshape, then continues the reshape without sdg. # This morning at 7 a second disk (sdc) was kicked out of the array (again ERC). mdadm stops reshape, and does not continue _however_ md0 itself is NOT stopped. As I have vlc recording streams from my security cameras to md0 24/7, I think what happened at 7 this morning was that the array got into a bad state with the two failed disks and stopped the reshape, but didn't stop md0. md0 stayed mounted and vlc will have been doing writes of the cam footage to md0 for a couple of hours until about 9 when I noticed this and manually did mdadm --stop /dev/md0. I would hazard a guess as that's why sdc has an older event counter than the rest of the array - it was kicked out at 7 but the array stayed up without enough disks for another couple of hours until 9 when manually stopped. Hopefully that makes sense and adds a bit of context :-) Cheers ---------- Forwarded message ---------- From: Richard Herd <2001oddity@gmail.com> Date: Tue, Feb 7, 2012 at 1:40 PM Subject: Re: Please Help! RAID5 -> 6 reshapre gone bad To: Phil Turmel Hi Phil, Thanks for the swift response :-) Also I'm in (what I'd like to say but can't - sunny) Sydney... OK, without slathering this thread is smart reports I can quite definitely say you are exactly nail-on-the-head with regard to the read errors escalating into link timeouts. This is exactly what is happening. I had thought this was actually a pretty common setup for home users (eg mdadm and drives such as WD20EARS/ST2000s) - I have the luxury of budgets for Netapp kit at work - unfortunately my personal finances only stretch to an ITX case and a bunch of cheap HDs! I understand it's the ERC causing disks to get kicked, and fully understand if you can't help further. Assembling without sdg I'm not sure will do it, as what we have is 4 disks with the same events counter (3 active sync (sda/sdb/sdf), 1 spare rebuilding (sdd)), and 2 (sdg/sdc) removed with older event counters. Leaving out sdg leaves us with sdc which has an event counter of 1848333. As the 3 active sync (sda/sdb/sdf) + 1 spare (sdd) have an event counter of 1848341, mdadm doesn't want to let me use sdc in the array even with --force. As you say as it's in the middle of a reshape so a recreate is out. I'm considering data loss is a given at this point, but even being able to bring the array online degraded and pull out whatever is still intact would help. 
If you have any further suggestions that would be great, but I do understand your position on ERC and thank you for your input :-) Cheers Feb 7 01:07:16 raven kernel: [18891.989330] ata8: hard resetting link Feb 7 01:07:22 raven kernel: [18897.356104] ata8: link is slow to respond, please be patient (ready=0) Feb 7 01:07:26 raven kernel: [18902.004280] ata8: hard resetting link Feb 7 01:07:32 raven kernel: [18907.372104] ata8: link is slow to respond, please be patient (ready=0) Feb 7 01:07:36 raven kernel: [18912.020097] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Feb 7 01:07:41 raven kernel: [18917.020093] ata8.00: qc timeout (cmd 0xec) Feb 7 01:07:41 raven kernel: [18917.028074] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Feb 7 01:07:41 raven kernel: [18917.028310] ata8: hard resetting link Feb 7 01:07:47 raven kernel: [18922.396089] ata8: link is slow to respond, please be patient (ready=0) Feb 7 01:07:51 raven kernel: [18927.044313] ata8: hard resetting link Feb 7 01:07:56 raven kernel: [18932.020099] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Feb 7 01:08:06 raven kernel: [18942.020048] ata8.00: qc timeout (cmd 0xec) Feb 7 01:08:06 raven kernel: [18942.028075] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Feb 7 01:08:06 raven kernel: [18942.028307] ata8: limiting SATA link speed to 3.0 Gbps Feb 7 01:08:06 raven kernel: [18942.028321] ata8: hard resetting link Feb 7 01:08:12 raven kernel: [18947.396108] ata8: link is slow to respond, please be patient (ready=0) Feb 7 01:08:16 raven kernel: [18951.988069] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Feb 7 01:08:46 raven kernel: [18981.988104] ata8.00: qc timeout (cmd 0xec) Feb 7 01:08:46 raven kernel: [18981.996070] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4) Feb 7 01:08:46 raven kernel: [18981.996302] ata8.00: disabled Feb 7 01:08:46 raven kernel: [18981.996324] ata8.00: device reported invalid CHS sector 0 Feb 7 01:08:46 raven kernel: [18981.996348] ata8: hard resetting link Feb 7 01:08:52 raven kernel: [18987.364104] ata8: link is slow to respond, please be patient (ready=0) Feb 7 01:08:56 raven kernel: [18992.012050] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320) Feb 7 01:08:56 raven kernel: [18992.012114] ata8: EH complete Feb 7 01:08:56 raven kernel: [18992.012158] sd 8:0:0:0: [sdg] Unhandled error code Feb 7 01:08:56 raven kernel: [18992.012165] sd 8:0:0:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 01:08:56 raven kernel: [18992.012176] sd 8:0:0:0: [sdg] CDB: Write(10): 2a 00 e8 e0 74 3f 00 00 08 00 Feb 7 01:08:56 raven kernel: [18992.012696] md: super_written gets error=-5, uptodate=0 Feb 7 01:08:56 raven kernel: [18992.013169] sd 8:0:0:0: [sdg] Unhandled error code Feb 7 01:08:56 raven kernel: [18992.013176] sd 8:0:0:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 01:08:56 raven kernel: [18992.013186] sd 8:0:0:0: [sdg] CDB: Read(10): 28 00 04 9d bd bf 00 00 80 00 Feb 7 01:08:56 raven kernel: [18992.276986] sd 8:0:0:0: [sdg] Unhandled error code Feb 7 01:08:56 raven kernel: [18992.276999] sd 8:0:0:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 01:08:56 raven kernel: [18992.277012] sd 8:0:0:0: [sdg] CDB: Read(10): 28 00 04 9d be 3f 00 00 80 00 Feb 7 01:08:56 raven kernel: [18992.316919] sd 8:0:0:0: [sdg] Unhandled error code Feb 7 01:08:56 raven kernel: [18992.316930] sd 8:0:0:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 01:08:56 raven kernel: [18992.316942] sd 8:0:0:0: [sdg] 
CDB: Read(10): 28 00 04 9d be bf 00 00 80 00 Feb 7 01:08:56 raven kernel: [18992.326906] sd 8:0:0:0: [sdg] Unhandled error code Feb 7 01:08:56 raven kernel: [18992.326920] sd 8:0:0:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 01:08:56 raven kernel: [18992.326932] sd 8:0:0:0: [sdg] CDB: Read(10): 28 00 04 9d bf 3f 00 00 80 00 Feb 7 01:08:56 raven kernel: [18992.327944] sd 8:0:0:0: [sdg] Unhandled error code Feb 7 01:08:56 raven kernel: [18992.327956] sd 8:0:0:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 01:08:56 raven kernel: [18992.327968] sd 8:0:0:0: [sdg] CDB: Read(10): 28 00 04 9d bf bf 00 00 80 00 Feb 7 01:08:57 raven kernel: [18992.555093] md: md0: reshape done. Feb 7 01:08:57 raven kernel: [18992.607595] md: reshape of RAID array md0 Feb 7 01:08:57 raven kernel: [18992.607606] md: minimum _guaranteed_ speed: 200000 KB/sec/disk. Feb 7 01:08:57 raven kernel: [18992.607614] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape. Feb 7 01:08:57 raven kernel: [18992.607628] md: using 128k window, over a total of 1953511936 blocks. Feb 7 06:41:02 raven rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="911" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'. Feb 7 07:12:32 raven kernel: [40807.989092] ata5: hard resetting link Feb 7 07:12:38 raven kernel: [40813.524074] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Feb 7 07:12:43 raven kernel: [40818.524106] ata5.00: qc timeout (cmd 0xec) Feb 7 07:12:43 raven kernel: [40818.524126] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4) Feb 7 07:12:43 raven kernel: [40818.532788] ata5: hard resetting link Feb 7 07:12:48 raven kernel: [40824.058039] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Feb 7 07:12:58 raven kernel: [40834.056101] ata5.00: qc timeout (cmd 0xec) Feb 7 07:12:58 raven kernel: [40834.056121] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4) Feb 7 07:12:58 raven kernel: [40834.064203] ata5: limiting SATA link speed to 3.0 Gbps Feb 7 07:12:58 raven kernel: [40834.064217] ata5: hard resetting link Feb 7 07:13:04 raven kernel: [40839.592095] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 320) Feb 7 07:13:34 raven kernel: [40869.592088] ata5.00: qc timeout (cmd 0xec) Feb 7 07:13:34 raven kernel: [40869.592110] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4) Feb 7 07:13:34 raven kernel: [40869.599676] ata5.00: disabled Feb 7 07:13:34 raven kernel: [40869.599700] ata5.00: device reported invalid CHS sector 0 Feb 7 07:13:34 raven kernel: [40869.599724] ata5: hard resetting link Feb 7 07:13:39 raven kernel: [40875.124128] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 320) Feb 7 07:13:39 raven kernel: [40875.124201] ata5: EH complete Feb 7 07:13:39 raven kernel: [40875.124243] sd 4:0:0:0: [sdd] Unhandled error code Feb 7 07:13:39 raven kernel: [40875.124251] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 07:13:39 raven kernel: [40875.124262] sd 4:0:0:0: [sdd] CDB: Write(10): 2a 00 e8 e0 74 3f 00 00 08 00 Feb 7 07:13:39 raven kernel: [40875.135544] md: super_written gets error=-5, uptodate=0 Feb 7 07:13:39 raven kernel: [40875.152171] sd 4:0:0:0: [sdd] Unhandled error code Feb 7 07:13:39 raven kernel: [40875.152179] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Feb 7 07:13:39 raven kernel: [40875.152189] sd 4:0:0:0: [sdd] CDB: Read(10): 28 00 09 2b f2 3f 00 00 80 00 Feb 7 07:13:41 raven kernel: [40876.734504] md: md0: reshape done. 
Feb 7 07:13:41 raven kernel: [40876.736298] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.743529] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.750009] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.755143] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.760126] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.765070] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.769890] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.774759] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.779456] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.784166] lost page write due to I/O error on md0 Feb 7 07:13:41 raven kernel: [40876.788773] JBD: Detected IO errors while flushing file data on md0 Feb 7 07:13:41 raven kernel: [40876.796386] JBD: Detected IO errors while flushing file data on md0 On Tue, Feb 7, 2012 at 1:15 PM, Phil Turmel <philip@turmel.org> wrote: > Hi Richard, > > On 02/06/2012 08:34 PM, Richard Herd wrote: >> Hey guys, >> >> I'm in a bit of a pickle here and if any mdadm kings could step in and >> throw some advice my way I'd be very grateful :-) >> >> Quick bit of background - little NAS based on an AMD E350 running >> Ubuntu 10.04. Running a software RAID 5 from 5x2TB disks. Every few >> months one of the drives would fail a request and get kicked from the >> array (as is becoming common for these larger multi TB drives they >> tolerate the occasional bad sector by reallocating from a pool of >> spares (but that's a whole other story)). This happened across a >> variety of brands and two different controllers. I'd simply add the >> disk that got popped back in and let it re-sync. SMART tests always >> in good health. > > Some more detail on the actual devices would help, especially the > output of lsdrv [1] to document what device serial numbers are which, > for future reference. > > I also suspect you have problems with your drive's error recovery > control, also known as time-limited error recovery. Simple sector > errors should *not* be kicking out your drives. Mdadm knows to > reconstruct from parity and rewrite when a read error is encountered. > That either succeeds directly, or causes the drive to remap. > > You say that the SMART tests are good, so read errors are probably > escalating into link timeouts, and the drive ignores the attempt to > reconstruct. *That* kicks the drive out. > > "smartctl -x" reports for all of your drives would help identify if > you have this problem. You *cannot* safely run raid arrays with drives > that don't (or won't) report errors in a timely fashion (a few seconds). > >> It did make me nervous though. So I decided I'd add a second disk for >> a bit of extra redundancy, making the array a RAID 6 - the thinking >> was the occasional disk getting kicked and re-added from a RAID 6 >> array wouldn't present as much risk as a single disk getting kicked >> from a RAID 5. >> >> So first off, I added the 6th disk as a hotspare to the RAID5 array. >> So I now had my 5 disk RAID 5 + hotspare. >> >> I then found that mdadm 2.6.7 (in the repositories) isn't actually >> capable of a 5->6 reshape. So I pulled the latest 3.2.3 sources and >> compiled myself a new version of mdadm. 
>> >> With the newer version of mdadm, it was happy to do the reshape - so I >> set it off on it's merry way, using an esata HD (mounted at /usb :-P) >> for the backupfile: >> >> root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6 >> --backup-file=/usb/md0.backup >> >> It would take a week to reshape, but it was ona UPS & happily ticking >> along. The array would be online the whole time so I was in no rush. >> Content, I went to get some shut-eye. >> >> I got up this morning and took a quick look in /proc/mdstat to see how >> things were going and saw things had failed spectacularly. At least >> two disks had been kicked from the array and the whole thing had >> crumbled. > > Do you still have the dmesg for this? > >> Ouch. >> >> I tried to assembe the array, to see if it would continue the reshape: >> >> root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 >> /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 >> >> Unfortunately mdadm had decided that the backup-file was out of date >> (timestamps didn't match) and was erroring with: Failed to restore >> critical section for reshape, sorry.. >> >> Chances are things were in such a mess that backup file wasn't going >> to be used anyway, so I blocked the timestamp check with: export >> MDADM_GROW_ALLOW_OLD=1 >> >> That allowed me to assemble the array, but not run it as there were >> not enough disks to start it. >> >> This is the current state of the array: >> >> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] >> [raid4] [raid10] >> md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2] >> 7814047744 blocks super 0.91 >> >> unused devices: <none> >> >> root@raven:/# mdadm --detail /dev/md0 >> /dev/md0: >> Version : 0.91 >> Creation Time : Tue Jul 12 23:05:01 2011 >> Raid Level : raid6 >> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB) >> Raid Devices : 6 >> Total Devices : 4 >> Preferred Minor : 0 >> Persistence : Superblock is persistent >> >> Update Time : Tue Feb 7 09:32:29 2012 >> State : active, FAILED, Not Started >> Active Devices : 3 >> Working Devices : 4 >> Failed Devices : 0 >> Spare Devices : 1 >> >> Layout : left-symmetric-6 >> Chunk Size : 64K >> >> New Layout : left-symmetric >> >> UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven) >> Events : 0.1848341 >> >> Number Major Minor RaidDevice State >> 0 0 0 0 removed >> 1 8 17 1 active sync /dev/sdb1 >> 2 8 1 2 active sync /dev/sda1 >> 3 0 0 3 removed >> 4 8 81 4 active sync /dev/sdf1 >> 5 8 49 5 spare rebuilding /dev/sdd1 >> >> The two removed disks: >> [ 3020.998529] md: kicking non-fresh sdc1 from array! >> [ 3021.012672] md: kicking non-fresh sdg1 from array! >> >> Attempted to re-add the disks (same for both): >> root@raven:/# mdadm /dev/md0 --add /dev/sdg1 >> mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a >> --re-add fails. >> mdadm: not performing --add as that would convert /dev/sdg1 in to a spare. >> mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first. >> >> With a failed array the last thing we want to do is add spares and >> trigger a resync so obviously I haven't zeroed the superblocks and >> added yet. > > That would be catastrophic. > >> Checked and two disks really are out of sync: >> root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event >> Events : 1848341 >> Events : 1848341 >> Events : 1848333 >> Events : 1848341 >> Events : 1848341 >> Events : 1772921 > > So /dev/sdg1 dropped out first, and /dev/sdc1 followed and killed the > array. 
> >> I'll post the output of --examine on all the disks below - if anyone >> has any advice I'd really appreciate it (Neil Brown doesn't read these >> forums does he?!?). I would usually move next to recreating the array >> and using assume-clean but since it's right in the middle of a reshape >> I'm not inclined to try. > > Neil absolutely reads this mailing list, and is likely to pitch in if > I don't offer precisely correct advice :-) > > He's in an Australian time zone though, so latency might vary. I'm on the > U.S. east coast, fwiw. > > In any case, with a re-shape in progress, "--create --assume-clean" is > not an option. > >> Critical stuff is of course backed up, but there is some user data not >> covered by backups that I'd like to try and restore if at all >> possible. > > Hope is not all lost. If we can get your ERC adjusted, the next step > would be to disconnect /dev/sdg from the system, and assemble with > --force and MDADM_GROW_ALLOW_OLD=1 > > That'll let the reshape finish, leaving you with a single-degraded > raid6. Then you fsck and make critical backups. Then you --zero- and > --add /dev/sdg. > > If your drives don't support ERC, I can't recommend you continue until > you've ddrescue'd your drives onto new ones that do support ERC. > > HTH, > > Phil > > [1] http://github.com/pturmel/lsdrv -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
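Nothing in this thread actually changed it, but the common stop-gap when desktop drives without working ERC sit under md is to raise the kernel's per-device command timeout well above the drive's internal retry time, so a slow sector surfaces as a plain read error instead of the link resets seen in the log above (adjust the glob to the array's member disks, and repeat at each boot):

for t in /sys/block/sd[abcdfg]/device/timeout; do echo 180 > "$t"; done    # default is 30 seconds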
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 1:34 Please Help! RAID5 -> 6 reshapre gone bad Richard Herd 2012-02-07 2:15 ` Phil Turmel @ 2012-02-07 2:39 ` NeilBrown 2012-02-07 3:10 ` NeilBrown 1 sibling, 1 reply; 27+ messages in thread From: NeilBrown @ 2012-02-07 2:39 UTC (permalink / raw) To: Richard Herd; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 7103 bytes --] On Tue, 7 Feb 2012 12:34:48 +1100 Richard Herd <2001oddity@gmail.com> wrote: > Hey guys, > > I'm in a bit of a pickle here and if any mdadm kings could step in and > throw some advice my way I'd be very grateful :-) > > Quick bit of background - little NAS based on an AMD E350 running > Ubuntu 10.04. Running a software RAID 5 from 5x2TB disks. Every few > months one of the drives would fail a request and get kicked from the > array (as is becoming common for these larger multi TB drives they > tolerate the occasional bad sector by reallocating from a pool of > spares (but that's a whole other story)). This happened across a > variety of brands and two different controllers. I'd simply add the > disk that got popped back in and let it re-sync. SMART tests always > in good health. > > It did make me nervous though. So I decided I'd add a second disk for > a bit of extra redundancy, making the array a RAID 6 - the thinking > was the occasional disk getting kicked and re-added from a RAID 6 > array wouldn't present as much risk as a single disk getting kicked > from a RAID 5. > > So first off, I added the 6th disk as a hotspare to the RAID5 array. > So I now had my 5 disk RAID 5 + hotspare. > > I then found that mdadm 2.6.7 (in the repositories) isn't actually > capable of a 5->6 reshape. So I pulled the latest 3.2.3 sources and > compiled myself a new version of mdadm. > > With the newer version of mdadm, it was happy to do the reshape - so I > set it off on it's merry way, using an esata HD (mounted at /usb :-P) > for the backupfile: > > root@raven:/# mdadm --grow /dev/md0 --level=6 --raid-devices=6 > --backup-file=/usb/md0.backup > > It would take a week to reshape, but it was ona UPS & happily ticking > along. The array would be online the whole time so I was in no rush. > Content, I went to get some shut-eye. > > I got up this morning and took a quick look in /proc/mdstat to see how > things were going and saw things had failed spectacularly. At least > two disks had been kicked from the array and the whole thing had > crumbled. > > Ouch. > > I tried to assembe the array, to see if it would continue the reshape: > > root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 > /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 > > Unfortunately mdadm had decided that the backup-file was out of date > (timestamps didn't match) and was erroring with: Failed to restore > critical section for reshape, sorry.. > > Chances are things were in such a mess that backup file wasn't going > to be used anyway, so I blocked the timestamp check with: export > MDADM_GROW_ALLOW_OLD=1 > > That allowed me to assemble the array, but not run it as there were > not enough disks to start it. You probably just need to add "--force" to the assemble line. So stop the array (mdamd -S /dev/md0) and assemble again with --force as well as the other options.... or maybe don't. I just tested that and I didn't do what it should. I've hacked the code a bit and can see what the problem is and think I can fix it. So leave it a bit. I'll let you know when you should grab my latest code and try that. 
> > This is the current state of the array: > > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] > [raid4] [raid10] > md0 : inactive sdb1[1] sdd1[5] sdf1[4] sda1[2] > 7814047744 blocks super 0.91 > > unused devices: <none> > > root@raven:/# mdadm --detail /dev/md0 > /dev/md0: > Version : 0.91 > Creation Time : Tue Jul 12 23:05:01 2011 > Raid Level : raid6 > Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB) > Raid Devices : 6 > Total Devices : 4 > Preferred Minor : 0 > Persistence : Superblock is persistent > > Update Time : Tue Feb 7 09:32:29 2012 > State : active, FAILED, Not Started > Active Devices : 3 > Working Devices : 4 > Failed Devices : 0 > Spare Devices : 1 > > Layout : left-symmetric-6 > Chunk Size : 64K > > New Layout : left-symmetric > > UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven) > Events : 0.1848341 > > Number Major Minor RaidDevice State > 0 0 0 0 removed > 1 8 17 1 active sync /dev/sdb1 > 2 8 1 2 active sync /dev/sda1 > 3 0 0 3 removed > 4 8 81 4 active sync /dev/sdf1 > 5 8 49 5 spare rebuilding /dev/sdd1 > > The two removed disks: > [ 3020.998529] md: kicking non-fresh sdc1 from array! > [ 3021.012672] md: kicking non-fresh sdg1 from array! > > Attempted to re-add the disks (same for both): > root@raven:/# mdadm /dev/md0 --add /dev/sdg1 > mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a > --re-add fails. > mdadm: not performing --add as that would convert /dev/sdg1 in to a spare. Gee I'm glad I put that check in! > mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first. > > With a failed array the last thing we want to do is add spares and > trigger a resync so obviously I haven't zeroed the superblocks and > added yet. Excellent! > > Checked and two disks really are out of sync: > root@raven:/# mdadm --examine /dev/sd[a-h]1 | grep Event > Events : 1848341 > Events : 1848341 > Events : 1848333 > Events : 1848341 > Events : 1848341 > Events : 1772921 sdg1 failed first shortly after 01:06:46. The reshape should have just continued. However every device has the same: > Reshape pos'n : 307740672 (293.48 GiB 315.13 GB) including sdg1. That implied that it didn't continue. Confused. Anyway, around 07:12:01, sdc1 failed. This will definitely have stopped the reshape and everything else. > > I'll post the output of --examine on all the disks below - if anyone > has any advice I'd really appreciate it (Neil Brown doesn't read these > forums does he?!?). I would usually move next to recreating the array > and using assume-clean but since it's right in the middle of a reshape > I'm not inclined to try. Me? No, I don't hang out here much... > > Critical stuff is of course backed up, but there is some user data not > covered by backups that I'd like to try and restore if at all > possible. "backups" - music to my ears. I definitely recommend an 'fsck' after we get it going again and there could be minor corruption, but you will probably have everything back. Of course I cannot promise that it won't just happen again when it hits another read error. Not sure what you can do about that. So - stay tuned. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 2:39 ` NeilBrown @ 2012-02-07 3:10 ` NeilBrown 2012-02-07 3:19 ` Richard Herd 0 siblings, 1 reply; 27+ messages in thread From: NeilBrown @ 2012-02-07 3:10 UTC (permalink / raw) To: NeilBrown; +Cc: Richard Herd, linux-raid [-- Attachment #1: Type: text/plain, Size: 1760 bytes --] On Tue, 7 Feb 2012 13:39:47 +1100 NeilBrown <neilb@suse.de> wrote: > > I tried to assembe the array, to see if it would continue the reshape: > > > > root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 > > /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 > > > > Unfortunately mdadm had decided that the backup-file was out of date > > (timestamps didn't match) and was erroring with: Failed to restore > > critical section for reshape, sorry.. > > > > Chances are things were in such a mess that backup file wasn't going > > to be used anyway, so I blocked the timestamp check with: export > > MDADM_GROW_ALLOW_OLD=1 > > > > That allowed me to assemble the array, but not run it as there were > > not enough disks to start it. > > You probably just need to add "--force" to the assemble line. > So stop the array (mdamd -S /dev/md0) and assemble again with --force as well > as the other options.... or maybe don't. > > I just tested that and I didn't do what it should. I've hacked the code a > bit and can see what the problem is and think I can fix it. > > So leave it a bit. I'll let you know when you should grab my latest code > and try that. Ok, that should work.. If you: git clone git://neil.brown.name/mdadm cd mdadm make export MDADM_GROW_ALLOW_OLD=1 ./mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 ..list.of.devices.. --force it should restart the grow. Once device will be left failed. If you think it is usable then when the grow completes you can add it back in. If you get another failure it will die again and you'll have to restart it. If you get a persistent failure, you might be out of luck. Please let me know how it goes. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 3:10 ` NeilBrown @ 2012-02-07 3:19 ` Richard Herd 2012-02-07 3:39 ` NeilBrown 0 siblings, 1 reply; 27+ messages in thread From: Richard Herd @ 2012-02-07 3:19 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Hi Neil, Thanks. FYI, I've cloned your git repo and compiled and tried using your code. Unfortunately everything looks the same as below (exactly same output, exactly same dmesg - still wants to kick non-fresh sdc from the array at assemble). Cheers Rich On Tue, Feb 7, 2012 at 2:10 PM, NeilBrown <neilb@suse.de> wrote: > On Tue, 7 Feb 2012 13:39:47 +1100 NeilBrown <neilb@suse.de> wrote: > >> > I tried to assembe the array, to see if it would continue the reshape: >> > >> > root@raven:/# mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 >> > /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 >> > >> > Unfortunately mdadm had decided that the backup-file was out of date >> > (timestamps didn't match) and was erroring with: Failed to restore >> > critical section for reshape, sorry.. >> > >> > Chances are things were in such a mess that backup file wasn't going >> > to be used anyway, so I blocked the timestamp check with: export >> > MDADM_GROW_ALLOW_OLD=1 >> > >> > That allowed me to assemble the array, but not run it as there were >> > not enough disks to start it. >> >> You probably just need to add "--force" to the assemble line. >> So stop the array (mdamd -S /dev/md0) and assemble again with --force as well >> as the other options.... or maybe don't. >> >> I just tested that and I didn't do what it should. I've hacked the code a >> bit and can see what the problem is and think I can fix it. >> >> So leave it a bit. I'll let you know when you should grab my latest code >> and try that. > > Ok, that should work.. > If you: > > git clone git://neil.brown.name/mdadm > cd mdadm > make > export MDADM_GROW_ALLOW_OLD=1 > ./mdadm -Avv --backup-file=/usb/md0.backup /dev/md0 ..list.of.devices.. --force > > > it should restart the grow. Once device will be left failed. If you think > it is usable then when the grow completes you can add it back in. > > If you get another failure it will die again and you'll have to restart it. > > If you get a persistent failure, you might be out of luck. > > Please let me know how it goes. > > NeilBrown > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 3:19 ` Richard Herd @ 2012-02-07 3:39 ` NeilBrown 2012-02-07 3:50 ` Richard Herd 0 siblings, 1 reply; 27+ messages in thread From: NeilBrown @ 2012-02-07 3:39 UTC (permalink / raw) To: Richard Herd; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 745 bytes --] On Tue, 7 Feb 2012 14:19:06 +1100 Richard Herd <2001oddity@gmail.com> wrote: > Hi Neil, > > Thanks. > > FYI, I've cloned your git repo and compiled and tried using your code. > Unfortunately everything looks the same as below (exactly same > output, exactly same dmesg - still wants to kick non-fresh sdc from > the array at assemble). Strange. Please report output of git describe HEAD and also run the 'mdadm --assemble --force ....' with -vvv as well, and report all of the output. Also I think some of your devices have changed names a bit. Make sure you list exactly the 6 devices that were recently in the array. i.e. exactly those that report something sensible to "mdadm -E /dev/WHATEVER" NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
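One way to answer that last question before retrying the assemble is a quick survey of which partitions still carry a superblock for this array (the glob is a guess based on the drive letters seen earlier; sde and sdh are the esata and usb-boot disks and are skipped):

for d in /dev/sd[abcdfg]1; do echo "== $d"; mdadm -E "$d" | grep -E 'UUID|Events|Reshape|Update Time'; done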
* Re: Please Help! RAID5 -> 6 reshapre gone bad 2012-02-07 3:39 ` NeilBrown @ 2012-02-07 3:50 ` Richard Herd 2012-02-07 4:25 ` NeilBrown 0 siblings, 1 reply; 27+ messages in thread From: Richard Herd @ 2012-02-07 3:50 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Hi Neil, OK, git head is: mdadm-3.2.3-21-gda8fe5a I have 8 disks. They get muddled about each boot (an issue I have never addressed). Ignore sde (esata HD) and sdh (usb boot). It seems even with --force, dmesg always reports 'kicking non-fresh sdc/g1 from array!'. Leaving sdg out as suggested by Phil doesn't help unfortunately. root@raven:/neil/mdadm# ./mdadm -Avvv --force --backup-file=/usb/md0.backup /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1 mdadm: looking for devices for /dev/md0 mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 2. mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1. mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3. mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 5. mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 4. mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 0. mdadm:/dev/md0 has an active reshape - checking if critical section needs to be restored mdadm: accepting backup with timestamp 1328559119 for array with timestamp 1328567549 mdadm: restoring critical section mdadm: added /dev/sdg1 to /dev/md0 as 0 mdadm: added /dev/sda1 to /dev/md0 as 2 mdadm: added /dev/sdc1 to /dev/md0 as 3 mdadm: added /dev/sdf1 to /dev/md0 as 4 mdadm: added /dev/sdd1 to /dev/md0 as 5 mdadm: added /dev/sdb1 to /dev/md0 as 1 mdadm: failed to RUN_ARRAY /dev/md0: Input/output error and dmesg: [13964.591801] md: bind<sdg1> [13964.595371] md: bind<sda1> [13964.595668] md: bind<sdc1> [13964.595900] md: bind<sdf1> [13964.599084] md: bind<sdd1> [13964.599652] md: bind<sdb1> [13964.600478] md: kicking non-fresh sdc1 from array! [13964.600493] md: unbind<sdc1> [13964.612138] md: export_rdev(sdc1) [13964.612163] md: kicking non-fresh sdg1 from array! [13964.612183] md: unbind<sdg1> [13964.624077] md: export_rdev(sdg1) [13964.628203] raid5: reshape will continue [13964.628243] raid5: device sdb1 operational as raid disk 1 [13964.628252] raid5: device sdf1 operational as raid disk 4 [13964.628260] raid5: device sda1 operational as raid disk 2 [13964.629614] raid5: allocated 6308kB for md0 [13964.629731] 1: w=1 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0 [13964.629742] 5: w=1 pa=18 pr=6 m=2 a=2 r=6 op1=1 op2=0 [13964.629751] 4: w=2 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0 [13964.629760] 2: w=3 pa=18 pr=6 m=2 a=2 r=6 op1=0 op2=0 [13964.629767] raid5: not enough operational devices for md0 (3/6 failed) [13964.640403] RAID5 conf printout: [13964.640409] --- rd:6 wd:3 [13964.640416] disk 1, o:1, dev:sdb1 [13964.640423] disk 2, o:1, dev:sda1 [13964.640429] disk 4, o:1, dev:sdf1 [13964.640436] disk 5, o:1, dev:sdd1 [13964.641621] raid5: failed to run raid set md0 [13964.649886] md: pers->run() failed ... 
root@raven:/neil/mdadm# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.91
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Feb  7 09:32:29 2012
          State : active, FAILED, Not Started
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric-6
     Chunk Size : 64K

     New Layout : left-symmetric

           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
         Events : 0.1848341

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
       2       8        1        2      active sync   /dev/sda1
       3       0        0        3      removed
       4       8       81        4      active sync   /dev/sdf1
       5       8       49        5      spare rebuilding   /dev/sdd1
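Before forcing an assemble in a state like this, it helps to know in advance which members the kernel will treat as "non-fresh".  A minimal sketch (the device list is the one used in this thread; adjust it for however the disks have been renamed on the current boot):

    # Compare event counters and update times across the candidate members.
    # Devices whose Events value lags behind the rest are the ones md will
    # kick as "non-fresh" when the array is assembled.
    for d in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sdg1; do
        echo "== $d"
        mdadm --examine "$d" | grep -E 'Update Time|Events'
    done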
* Re: Please Help! RAID5 -> 6 reshapre gone bad
From: NeilBrown @ 2012-02-07  4:25 UTC
To: Richard Herd; +Cc: linux-raid

On Tue, 7 Feb 2012 14:50:57 +1100 Richard Herd <2001oddity@gmail.com> wrote:

> Hi Neil,
>
> OK, git head is: mdadm-3.2.3-21-gda8fe5a
>
> I have 8 disks.  They get muddled about each boot (an issue I have
> never addressed).  Ignore sde (esata HD) and sdh (usb boot).
>
> It seems even with --force, dmesg always reports 'kicking non-fresh
> sdc/g1 from array!'.  Leaving sdg out as suggested by Phil doesn't
> help unfortunately.
>
> root@raven:/neil/mdadm# ./mdadm -Avvv --force
> --backup-file=/usb/md0.backup /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
> /dev/sdd1 /dev/sdf1 /dev/sdg1
> mdadm: looking for devices for /dev/md0
> mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 2.
> mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
> mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 5.
> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 4.
> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 0.
> mdadm: /dev/md0 has an active reshape - checking if critical section
> needs to be restored
> mdadm: accepting backup with timestamp 1328559119 for array with
> timestamp 1328567549
> mdadm: restoring critical section
> mdadm: added /dev/sdg1 to /dev/md0 as 0
> mdadm: added /dev/sda1 to /dev/md0 as 2
> mdadm: added /dev/sdc1 to /dev/md0 as 3
> mdadm: added /dev/sdf1 to /dev/md0 as 4
> mdadm: added /dev/sdd1 to /dev/md0 as 5
> mdadm: added /dev/sdb1 to /dev/md0 as 1
> mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

Hmmm.... maybe your kernel isn't quite doing the right thing.
commit 674806d62fb02a22eea948c9f1b5e58e0947b728 is important.
It is in 2.6.35.  What kernel are you running?
Definitely something older given the "1: w=1 pa=18...." messages.  They
disappear in 2.6.34.

So I'm afraid you're going to need a new kernel.

NeilBrown
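To check whether a given kernel has the fix Neil is pointing at, one approach (a sketch, assuming a clone of the mainline kernel git tree is available) is:

    # Which kernel is currently running?  Anything older than 2.6.35
    # predates the commit Neil mentions.
    uname -r

    # In a mainline kernel checkout: list the release tags that contain the
    # commit.  If the running version is not among them, it lacks the fix.
    git tag --contains 674806d62fb02a22eea948c9f1b5e58e0947b728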
* Re: Please Help! RAID5 -> 6 reshapre gone bad
From: Richard Herd @ 2012-02-07  5:02 UTC
To: NeilBrown; +Cc: linux-raid

Hi Neil,

Hmm - I see your point about the kernel...

Kernel updated.  I'm now running 2.6.38.

I went to work on it a bit more under 2.6.38.  I'm not sure why, but
while it still wouldn't take all the disks as before, this time it
assembled (with --force) using 4 of the disks.

Trying to re-add the 5th and 6th didn't throw the same warning as
before (failed to re-add and not adding as spare); it said 're-added
/dev/xxx to /dev/md0', but checking --detail shows they were added as
spares, not as part of the array.

Anyway, with the array assembled and running, I have got the
filesystem mounted and am quickly smashing an rsync to mirror what I
can (8TB, how long could it take? lol).

Thanks so much for your help guys - once I got the hint on the kernel
it wasn't too hard to get the array assembled again.  Now it's just a
waiting game I guess to see how much of the data is intact.  Also, at
what point would those two disks now marked as spare be re-synced into
the array?  After the reshape completes?

Really appreciate your help :-)

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sde1[6](S) sdg1[7](S) sdc1[1] sdf1[4] sdd1[3] sdb1[2]
      7814047744 blocks super 0.91 level 6, 64k chunk, algorithm 18 [6/4] [_UUUU_]
      [>....................]  reshape =  3.9% (78086144/1953511936) finish=11710.7min speed=2668K/sec

unused devices: <none>

root@raven:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.91
  Creation Time : Tue Jul 12 23:05:01 2011
     Raid Level : raid6
     Array Size : 7814047744 (7452.06 GiB 8001.58 GB)
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Feb  7 15:52:10 2012
          State : clean, degraded, reshaping
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : left-symmetric-6
     Chunk Size : 64K

 Reshape Status : 3% complete
     New Layout : left-symmetric

           UUID : 9a76d1bd:2aabd685:1fc5fe0e:7751cfd7 (local to host raven)
         Events : 0.1850269

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       33        1      active sync   /dev/sdc1
       2       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       4       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed

       6       8       65        -      spare   /dev/sde1
       7       8       97        -      spare   /dev/sdg1
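The recovery Richard describes amounts to roughly the sequence below - a hedged reconstruction, not the exact commands from the thread; the mount point and rsync destination are invented for illustration, and the device names follow the --detail output above:

    # Force-assemble from the members that still carry current superblocks,
    # using the locally built mdadm.  MDADM_GROW_ALLOW_OLD is only needed
    # if mdadm rejects the backup file as out of date.
    export MDADM_GROW_ALLOW_OLD=1
    ./mdadm -A --force --backup-file=/usb/md0.backup /dev/md0 /dev/sd[bcdf]1

    # Re-add the two kicked disks; on this kernel they come back as spares
    # and will only be rebuilt once the reshape finishes.
    mdadm /dev/md0 --re-add /dev/sde1
    mdadm /dev/md0 --re-add /dev/sdg1

    # Mount (read-only is the cautious choice) and copy off whatever matters
    # before touching anything else.
    mount -o ro /dev/md0 /mnt/md0        # hypothetical mount point
    rsync -a /mnt/md0/ /mnt/backup/      # hypothetical destination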
* Re: Please Help! RAID5 -> 6 reshapre gone bad
From: NeilBrown @ 2012-02-07  5:16 UTC
To: Richard Herd; +Cc: linux-raid

On Tue, 7 Feb 2012 16:02:27 +1100 Richard Herd <2001oddity@gmail.com> wrote:

> Hi Neil,
>
> Hmm - I see your point about the kernel...
>
> Kernel updated.  I'm now running 2.6.38.
>
> I went to work on it a bit more under 2.6.38.  I'm not sure why, but
> while it still wouldn't take all the disks as before, this time it
> assembled (with --force) using 4 of the disks.
>
> Trying to re-add the 5th and 6th didn't throw the same warning as
> before (failed to re-add and not adding as spare); it said 're-added
> /dev/xxx to /dev/md0', but checking --detail shows they were added as
> spares, not as part of the array.

That is expected.  "--force" just gets you enough to keep going and that
is what you have.

Hopefully no more errors (keep the air-con?  or maybe just keep the doors
open, depending where you are :-)

> Anyway, with the array assembled and running, I have got the
> filesystem mounted and am quickly smashing an rsync to mirror what I
> can (8TB, how long could it take? lol).

Good news.

> Thanks so much for your help guys - once I got the hint on the kernel
> it wasn't too hard to get the array assembled again.  Now it's just a
> waiting game I guess to see how much of the data is intact.  Also, at
> what point would those two disks now marked as spare be re-synced into
> the array?  After the reshape completes?

Yes.  When the reshape completes, both the spares will get included into
the array and recovered together.

> Really appreciate your help :-)

And I appreciate nice detailed bug reports - they tend to get more
attention.  Thanks!

NeilBrown
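With the array reshaping again, the remaining work is watching it finish and waiting for the two spares to be pulled back in.  A few commands for keeping an eye on it (a sketch; the sysfs paths assume the standard md layout of kernels of that era, and the speed values are arbitrary examples):

    # Progress, ETA and speed, same information as the mdstat snippet above.
    watch -n 60 cat /proc/mdstat

    # What md is currently doing (reshape, recover, idle) and how far it has got.
    cat /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/reshape_position

    # If the reshape crawls, the global md speed limits can be raised
    # (values are in KB/s per device).
    echo  50000 > /proc/sys/dev/raid/speed_limit_min
    echo 200000 > /proc/sys/dev/raid/speed_limit_max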
Thread overview: 27+ messages (newest: 2012-02-08  7:13 UTC)

2012-02-07  1:34  Please Help! RAID5 -> 6 reshapre gone bad - Richard Herd
2012-02-07  2:15  ` Phil Turmel
[not found]       ` <CAOANJV955ZdLexRTjVkQzTMapAaMitq5eqxP0rUvDjjLh4Wgzw@mail.gmail.com>
2012-02-07  2:57  ` Phil Turmel
2012-02-07  3:10  ` Richard Herd
2012-02-07  3:24  ` Keith Keller
2012-02-07  3:38  ` Phil Turmel
2012-01-31  6:31  ` rebuild raid6 after two failures - Keith Keller
2012-02-01  4:42  ` Keith Keller
2012-02-01  5:31  ` NeilBrown
2012-02-01  5:48  ` Keith Keller
2012-02-03 16:08  ` using dd (or dd_rescue) to salvage array - Keith Keller
2012-02-04 18:01  ` Stefan /*St0fF*/ Hübner
2012-02-05 19:10  ` Keith Keller
2012-02-06 21:37  ` Stefan *St0fF* Huebner
2012-02-07  3:44  ` Keith Keller
2012-02-07  4:24  ` Keith Keller
2012-02-07 20:01  ` Stefan *St0fF* Huebner
2012-02-08  7:13  ` Please Help! RAID5 -> 6 reshapre gone bad - Stan Hoeppner
2012-02-07  3:04  ` Fwd: Please Help! RAID5 -> 6 reshapre gone bad - Richard Herd
2012-02-07  2:39  ` NeilBrown
2012-02-07  3:10  ` NeilBrown
2012-02-07  3:19  ` Richard Herd
2012-02-07  3:39  ` NeilBrown
2012-02-07  3:50  ` Richard Herd
2012-02-07  4:25  ` NeilBrown
2012-02-07  5:02  ` Richard Herd
2012-02-07  5:16  ` NeilBrown