Subject: What to do about "ignoring %s as it reports %s as failed"?
From: Daniel Browning @ 2013-01-10 19:01 UTC
To: linux-raid
Hello, folks. What should I do about the following error?
mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
I'm building a new replacement array and restoring from backup, but I
would still like to try to salvage the failed one if possible. I was
surprised to find very few results on Google for this particular error
message.
Here is the background. I recently had a 4-disk raid5 array made up of:
/dev/sdb1
/dev/sdc1
/dev/sdd1
/dev/sde1
Wednesday afternoon (yesterday), /dev/sde1 failed, so the array went into
a degraded (no-redundancy) state. I thought I'd give sde another chance,
so I zeroed its superblock and re-added it to the array, which began
rebuilding. But then, when the rebuild had reached 72.4% early this
morning, /dev/sdb1 failed:
md127 : active raid5 sde1[5] sdc1[0] sdb1[1](F) sdd1[4]
5859302400 blocks super 1.2 level 5, 512k chunk,
algorithm 2 [4/2] [U__U]
[==============>......] recovery = 72.4% (1414790348/1953100800)
finish=1192.3min speed=7524K/sec
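For reference, the zero-and-re-add step above was roughly the following
(reconstructed rather than copied from my terminal, so treat the exact
invocation as approximate):
[root@lx4 ~]# mdadm /dev/md127 --remove /dev/sde1
[root@lx4 ~]# mdadm --zero-superblock /dev/sde1
[root@lx4 ~]# mdadm /dev/md127 --add /dev/sde1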
But /dev/sdb1 is working now (same as /dev/sde1), so I tried to
reassemble the array:
[root@lx4 ~]# mdadm --assemble --verbose /dev/md127 /dev/sd[bcde]1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot -1.
mdadm: added /dev/sdb1 to /dev/md127 as 1 (possibly out of date)
mdadm: no uptodate device for slot 2 of /dev/md127
mdadm: added /dev/sdd1 to /dev/md127 as 3
mdadm: added /dev/sde1 to /dev/md127 as -1
mdadm: added /dev/sdc1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 2 drives and 1 spare - not enough to start
the array.
But it rejected /dev/sdb1, so I ran --force to have it update the event
count:
[root@lx4 ~]# mdadm --assemble --force --verbose /dev/md127 /dev/sd[bcde]1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot -1.
mdadm: forcing event count in /dev/sdb1(1) from 905199 upto 905262
mdadm: clearing FAULTY flag for device 0 in /dev/md127 for /dev/sdb1
mdadm: Marking array /dev/md127 as 'clean'
mdadm: added /dev/sdb1 to /dev/md127 as 1
mdadm: no uptodate device for slot 2 of /dev/md127
mdadm: added /dev/sdd1 to /dev/md127 as 3
mdadm: added /dev/sde1 to /dev/md127 as -1
mdadm: added /dev/sdc1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 3 drives and 1 spare - not enough to start
the array.
This surprised me a lot, because I thought 3 drives would have been enough
to start the array. But when I ran it again, I got a different error:
[root@lx4 ~]# mdadm --assemble --force --verbose /dev/md127 /dev/sd[bcde]1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot -1.
mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
mdadm: added /dev/sdb1 to /dev/md127 as 1
mdadm: no uptodate device for slot 2 of /dev/md127
mdadm: no uptodate device for slot 3 of /dev/md127
mdadm: added /dev/sde1 to /dev/md127 as -1
mdadm: added /dev/sdc1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 2 drives and 1 spare - not enough to start
the array.
It appears to be failing because of this:
mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
The mdadm source says this:
/* If this device thinks that 'most_recent' has failed, then
* we must reject this device.
*/
But I can't interpret that into a possible fix. Any ideas?
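As far as I can tell from the --examine output in Appendix C below, the
trigger is that sdd1's metadata still records slot 1 (sdb1) as failed
("A..A" in its Array State), even after --force bumped sdb1's event
count. A quick, read-only way to compare each member's view:
[root@lx4 ~]# for d in /dev/sd[bcde]1; do echo "== $d"; \
    mdadm --examine "$d" | grep -E 'Events|Device Role|Array State'; done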
Thanks in advance,
--
Daniel Browning
Appendix A. Versions
Distro: Fedora Core 16
Kernel: 3.4.4-4.fc16.x86_64 #1 SMP Thu Jul 5 20:01:38 UTC 2012
mdadm: v3.2.5 - 18th May 2012
Appendix B. Contents of mdstat after a failed "--assemble":
md127 : inactive sdc1[0](S) sdb1[1](S)
3906202639 blocks super 1.2
Appendix C. mdadm --examine for all disks, from *before* the
"--assemble --force" was executed:
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 156bc6e0:eaa285fd:8f4ef720:6f2171c2
Update Time : Thu Jan 10 00:50:25 2013
Checksum : f0945b4a - correct
Events : 905199
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 2dbbc5d0:f3deb841:50c7c992:c9abf856
Update Time : Thu Jan 10 09:14:03 2013
Checksum : 2b1b4f88 - correct
Events : 905262
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : A..A ('A' == active, '.' == missing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : bdd8c401:9389bf9b:c80762a2:682b0297
Update Time : Thu Jan 10 09:14:03 2013
Checksum : 5c2d7d3 - correct
Events : 905262
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : A..A ('A' == active, '.' == missing)
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 78c381f7:1447cbd4:6af86729:d4c08320
Update Time : Thu Jan 10 09:14:03 2013
Checksum : 4513061e - correct
Events : 905262
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : A..A ('A' == active, '.' == missing)
Subject: Re: What to do about "ignoring %s as it reports %s as failed"?
From: Daniel Browning @ 2013-01-12 21:18 UTC
To: linux-raid
On Thursday 10 January 2013 11:01:01 am Daniel Browning wrote:
> Hello, folks. What should I do about the following error?
>
> mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
>
I was never able to find a way around that error, so I had completely
written off the failed array. But today I rebooted the server (for a
completely unrelated reason) and when it came back up, the failed array
started working just fine, automatically, with no error like the one
above. The array passed fsck with a clean bill of health, but now I'm
checking for silent corruption by comparing against backups (excluding
files that have a newer modification time).
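For what it's worth, one way to do that kind of comparison is an rsync
dry run along these lines (just a sketch; /backup/array/ and /mnt/array/
are placeholders for the real mount points):
[root@lx4 ~]# rsync -rncui /backup/array/ /mnt/array/
-n keeps it read-only, -c compares by checksum rather than size/mtime,
and -u skips anything with a newer modification time on the array side.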
I would still be interested in knowing what I *should* have done when
encountering that error, and whether there was any solution other than
rebooting. If not, I have to say I'm disappointed that a reboot was
required to fix this kind of issue; I thought that was a Windows thing,
not something to expect from Linux and/or mdadm.
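If there is a non-reboot equivalent, my guess (and it is only a guess,
since I didn't test it before rebooting) is that it would have been to
stop the leftover inactive md127 shown in Appendix B of my first mail and
retry the forced assemble from that clean state:
[root@lx4 ~]# mdadm --stop /dev/md127
[root@lx4 ~]# mdadm --assemble --force --verbose /dev/md127 /dev/sd[bcde]1
I'd be glad to hear from anyone who knows whether that alone would have
been enough here.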
One coincidence that struck me as very funny was that this morning I read
the following comic:
http://thedoghousediaries.com/4822
But I still didn't think that rebooting would help my raid issue, so I
didn't bother to reboot. Later on when I rebooted for a different reason, I
realized just how timely that comic was.
Special thanks to one "frostschutz" in the freenode #linux-raid IRC channel,
who helped me out with all this.
--
Daniel Browning
Kavod Technologies