* mdadm raid5 single drive fail, single drive out of sync terror

From: Jon Robison
Date: 2014-11-26 15:08 UTC
To: linux-raid

Hi all!

I upgraded to mdadm-3.3-7.fc20.x86_64, and my raid5 array would no
longer recognize /dev/sdb1 in my raid 5 array (which is normally
/dev/sd[b-f]1). I ran `mdadm --detail --scan`, which resulted in a
degraded array, then added /dev/sdb1, and it started rebuilding happily
until 25% or so, when another failure seemed to occur.

I am convinced the data is fine on /dev/sd[c-f]1, and that somehow I
just need to inform mdadm about that, but they got out of sync and
/dev/sde1 thinks the array is AAAAA while the others think it's AAA.. .
The drives also seem to think e is bad because f said e was bad or some
weird stuff, and sde1 is behind by ~50 events or so. That error hasn't
shown itself recently. I fear sdb is bad and sde is going to go soon.

Results of `mdadm --examine /dev/sd[b-f]1` are here:
http://dpaste.com/2Z7CPVY

I'm scared and alone. Everything is off and sitting as above, though e
is 50 events behind and out of sync. New drives are coming Friday, and
my backup is of course a bit old. I'm petrified to execute `mdadm
--create --assume-clean --level=5 --raid-devices=5 /dev/md0 /dev/sdf1
/dev/sdd1 /dev/sdc1 /dev/sde1 missing`, but that seems my next option
unless y'all know better. I tried `mdadm --assemble -f /dev/md0
/dev/sdf1 /dev/sdd1 /dev/sdc1 /dev/sde1` and it said something like it
can't start with only 3 devices (which I wouldn't expect, because
examine still shows 4, just that they are out of sync, and I thought
that was -f's express purpose in assemble mode). Anyone have any
suggestions? Thanks!
* Re: mdadm raid5 single drive fail, single drive out of sync terror

From: Phil Turmel
Date: 2014-11-26 15:47 UTC
To: Jon Robison, linux-raid

Good morning Jon,

On 11/26/2014 10:08 AM, Jon Robison wrote:
> Hi all!
>
> I upgraded to mdadm-3.3-7.fc20.x86_64, and my raid5 array would no
> longer recognize /dev/sdb1 in my raid 5 array (which is normally
> /dev/sd[b-f]1). I ran `mdadm --detail --scan`, which resulted in a
> degraded array, then added /dev/sdb1, and it started rebuilding
> happily until 25% or so, when another failure seemed to occur.

Well, failures during rebuild of a raid5 are common.  In my experience,
including helping on this list, they're most often due to timeout
mismatch and a failure to regularly scrub.

> I am convinced the data is fine on /dev/sd[c-f]1, and that somehow I
> just need to inform mdadm about that, but they got out of sync and
> /dev/sde1 thinks the array is AAAAA while the others think it's AAA.. .
> The drives also seem to think e is bad because f said e was bad or some
> weird stuff, and sde1 is behind by ~50 events or so. That error hasn't
> shown itself recently. I fear sdb is bad and sde is going to go soon.

Please show your dmesg from the start of the problem.  Also show
"smartctl -x /dev/sdX" for each of the member devices.  Also show an
excerpt from "ls -l /dev/disk/by-id/" that shows the device vs. serial
number relationship for your drives.

> Results of `mdadm --examine /dev/sd[b-f]1` are here:
> http://dpaste.com/2Z7CPVY

Just put the results in the email in the future.  Kernel.org tolerates
relatively large messages.

> I'm scared and alone. Everything is off and sitting as above, though e
> is 50 events behind and out of sync. New drives are coming Friday, and
> my backup is of course a bit old. I'm petrified to execute `mdadm
> --create --assume-clean --level=5 --raid-devices=5 /dev/md0 /dev/sdf1
> /dev/sdd1 /dev/sdc1 /dev/sde1 missing`,

You should be petrified of any '--create' operation.  What you've shown
above would certainly *not* work, thanks to your data offsets.

> but that seems my next option unless y'all know better. I tried
> `mdadm --assemble -f /dev/md0 /dev/sdf1 /dev/sdd1 /dev/sdc1 /dev/sde1`
> and it said something like it can't start with only 3 devices (which I
> wouldn't expect, because examine still shows 4, just that they are out
> of sync, and I thought that was -f's express purpose in assemble mode).
> Anyone have any suggestions? Thanks!

Show the contents of /proc/mdstat, then show the results of:

mdadm --stop /dev/md0
mdadm --assemble --force --verbose /dev/md0 /dev/sd[cdef]1

Phil
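(The timeout mismatch Phil refers to usually means desktop-class drives
whose internal error recovery runs longer than the kernel's default
30-second command timeout. A minimal sketch of the usual check, with
/dev/sdb standing in for each member drive -- these are not commands
from the message above:)

    # Does the drive support SCT Error Recovery Control?
    smartctl -l scterc /dev/sdb

    # If it does, cap internal error recovery at 7.0 seconds
    smartctl -l scterc,70,70 /dev/sdb

    # If it doesn't, raise the kernel's command timeout instead (seconds)
    echo 180 > /sys/block/sdb/device/timeout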
* Re: mdadm raid5 single drive fail, single drive out of sync terror

From: Robin Hill
Date: 2014-11-26 15:49 UTC
To: Jon Robison
Cc: linux-raid

On Wed Nov 26, 2014 at 10:08:12AM -0500, Jon Robison wrote:
> Hi all!
> [full quote of Jon's original message snipped]

It looks like this is a bug in 3.3 (the checkin logs show something
similar anyway). I'd advise getting 3.3.1 or 3.3.2 and retrying the
forced assembly.

If it failed during the rebuild, that would suggest there's an
unreadable block on sde though, which means you'll hit the same issue
again when you try to rebuild sdb. You'll need to:
 - image sde to a new disk (via ddrescue)
 - assemble the array
 - add another new disk in to rebuild
 - once the rebuild has completed, force a fsck on the array
   (fsck -f /dev/md0), as the unreadable block may have caused some
   filesystem corruption. It may also cause some file corruption, but
   that's not something that can be easily checked.

These read errors can be picked up and fixed by running regular array
checks (echo check > /sys/block/md0/md/sync_action). Most distributions
have these set up in cron, so make sure that's in there and enabled.

The failed disks may actually be okay (sde particularly), so I'd advise
checking SMART stats and running full badblocks write tests on them. If
the badblocks tests run okay and there's no increase in reallocated
sectors reported in SMART, they should be perfectly okay for re-use.

Cheers,
    Robin
-- 
    ___
   ( ' }     |   Robin Hill        <robin@robinhill.me.uk>   |
  / / )      | Little Jim says ....                          |
 // !!       |      "He fallen in de water !!"               |
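(A minimal sketch of the recovery sequence Robin outlines, assuming the
replacement disks show up as /dev/sdg and /dev/sdh and have been
partitioned to match -- those device names and the mapfile path are
placeholders, not details from the thread:)

    # 1. Image the suspect member onto the new disk, logging progress
    ddrescue -f -n /dev/sde1 /dev/sdg1 /root/sde1-rescue.map

    # 2. Force-assemble the array from the four in-sync members,
    #    using the rescued copy in place of sde1
    mdadm --stop /dev/md0
    mdadm --assemble --force --verbose /dev/md0 /dev/sd[cdf]1 /dev/sdg1

    # 3. Add another new disk to fill the missing slot and rebuild
    mdadm --add /dev/md0 /dev/sdh1
    cat /proc/mdstat

    # 4. Once the rebuild completes, check the filesystem
    fsck -f /dev/md0

    # 5. Kick off a scrub so latent read errors get found and repaired
    echo check > /sys/block/md0/md/sync_action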
* Re: mdadm raid5 single drive fail, single drive out of sync terror

From: Robison, Jon (CMG-Atlanta)
Date: 2014-11-26 16:13 UTC
To: linux-raid

On 11/26/14 10:49 AM, Robin Hill wrote:
> On Wed Nov 26, 2014 at 10:08:12AM -0500, Jon Robison wrote:
>> [original message snipped]
>
> It looks like this is a bug in 3.3 (the checkin logs show something
> similar anyway). I'd advise getting 3.3.1 or 3.3.2 and retrying the
> forced assembly.
> [rest of Robin's reply snipped]

Thanks, you two -- I'll check the logs on the machine later. I'm hopeful
about the new mdadm; rawhide appears to have 3.3.1 at least... maybe
I'll use a live CD with 3.3.2?

When I checked yesterday, SMART said everything (including sdb and sde)
was okay; I didn't run badblocks though. dmesg didn't seem to have
anything meaningful, though I'll attach it later.

I'm inclined to wait for the disks to come on Friday. When I add them to
the backup machine, it should only be ~500GB off, so it could rsync that
for a few hours in degraded mode. I'd rather have the 500GB and risk
however many bad blocks. Does that sound logical, or will rsyncing with
potential sde bad blocks ruin the whole target filesystem?
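(A sketch of the degraded-mode rsync being considered here --
/mnt/array and /mnt/backup are placeholder mount points, not paths from
the thread:)

    # Copy the ~500GB of changes to the backup machine; leaving out
    # --delete means a failure partway through cannot remove anything
    # that already exists on the target
    rsync -aHv /mnt/array/ /mnt/backup/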
* Re: mdadm raid5 single drive fail, single drive out of sync terror

From: Robin Hill
Date: 2014-11-26 16:38 UTC
To: Robison, Jon (CMG-Atlanta)
Cc: linux-raid

On Wed Nov 26, 2014 at 11:13:02AM -0500, Robison, Jon (CMG-Atlanta) wrote:
> On 11/26/14 10:49 AM, Robin Hill wrote:
>> [earlier messages snipped]
>
> Thanks, you two -- I'll check the logs on the machine later. I'm
> hopeful about the new mdadm; rawhide appears to have 3.3.1 at least...
> maybe I'll use a live CD with 3.3.2? When I checked yesterday, SMART
> said everything (including sdb and sde) was okay; I didn't run
> badblocks though. dmesg didn't seem to have anything meaningful,
> though I'll attach it later.

The full badblocks write test is destructive, so should only be done
once you've got everything recovered from the disks. There is a safe
read-write mode, but that won't do as thorough a test.

> I'm inclined to wait for the disks to come on Friday. When I add them
> to the backup machine, it should only be ~500GB off, so it could rsync
> that for a few hours in degraded mode. I'd rather have the 500GB and
> risk however many bad blocks. Does that sound logical, or will
> rsyncing with potential sde bad blocks ruin the whole target
> filesystem?

If the unreadable block contains filesystem metadata or file data which
needs synching, the array will fail when the processing hits it. I'd
expect that to just cause the rsync process to stop, but I wouldn't want
to count on it. I'd run the rsync without deletions first (which should
be safe -- worst case is that the file being synced gets corrupted),
then run it with deletions only if everything worked the first time.

Cheers,
    Robin
-- 
    ___
   ( ' }     |   Robin Hill        <robin@robinhill.me.uk>   |
  / / )      | Little Jim says ....                          |
 // !!       |      "He fallen in de water !!"               |
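(A sketch of the destructive badblocks test and SMART comparison Robin
describes, with /dev/sdb as an example device. Only run this after
everything has been recovered off the drive:)

    # Record the baseline SMART counters of interest
    smartctl -x /dev/sdb | grep -iE 'reallocated|pending|uncorrect'

    # Full write-mode test: writes four patterns across the whole drive
    # and verifies each one -- this DESTROYS all data on the disk
    badblocks -wsv /dev/sdb

    # Compare afterwards; rising reallocated or pending counts mean the
    # drive should not be reused
    smartctl -x /dev/sdb | grep -iE 'reallocated|pending|uncorrect'

The safer, non-destructive read-write mode Robin mentions would be
`badblocks -nsv`, which preserves existing data but is a less thorough
test.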
* Re: mdadm raid5 single drive fail, single drive out of sync terror

From: Robison, Jon (CMG-Atlanta)
Date: 2014-11-28 17:00 UTC
To: linux-raid

Thanks Robin and Phil -- mdadm 3.3.2 did allow a successful forced
reassembly (I had to run the command twice for whatever reason; the
first execution said 4 aren't enough drives). I am updating my backup
but have already retrieved the things of high value. I consider this
mission accomplished already.

Next steps I will take: backup -> fsck -> backup -> add missing disk ->
add more automation to main and backup -> profit

On 11/26/14 10:49 AM, Robin Hill wrote:
> [Robin's earlier reply quoted in full -- snipped]
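(A sketch of the remaining steps Jon lists, assuming the array is now
assembled degraded as /dev/md0, the replacement disk appears as
/dev/sdg -- a placeholder name -- and the existing members carry plain
MBR partition tables:)

    # Check the filesystem with it unmounted, then remount
    umount /dev/md0
    fsck -f /dev/md0
    mount /dev/md0 /mnt/array             # placeholder mount point

    # Copy the partition layout from a healthy member to the new disk,
    # add it to fill the missing slot, and watch the rebuild
    sfdisk -d /dev/sdc | sfdisk /dev/sdg
    mdadm --add /dev/md0 /dev/sdg1
    watch cat /proc/mdstat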