Subject: What to do about "ignoring %s as it reports %s as failed"?
From: Daniel Browning @ 2013-01-10 19:01 UTC
To: linux-raid
Hello, folks. What should I do about the following error?
mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
I'm building a new replacement array and restoring from backup, but I
would still like to try to salvage the failed one if possible. I was
surprised to find very few results on Google for this particular error
message.
Here is the background. I recently had a 4-disk raid5 array made up of:
/dev/sdb1
/dev/sdc1
/dev/sdd1
/dev/sde1
Wednesday afternoon (yesterday), /dev/sde1 failed, so the array went into
a degraded (no-redundancy) state. I thought I'd give sde another chance,
so I zeroed its superblock and re-added it to the array, which began
rebuilding. But then, when the rebuild had reached 72.4% early this
morning, /dev/sdb1 failed:
md127 : active raid5 sde1[5] sdc1[0] sdb1[1](F) sdd1[4]
5859302400 blocks super 1.2 level 5, 512k chunk,
algorithm 2 [4/2] [U__U]
[==============>......] recovery = 72.4% (1414790348/1953100800)
finish=1192.3min speed=7524K/sec
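For reference, the zero-and-re-add step above was roughly the following
(reconstructed rather than copied from my terminal, so treat the exact
invocation as approximate):
[root@lx4 ~]# mdadm /dev/md127 --remove /dev/sde1
[root@lx4 ~]# mdadm --zero-superblock /dev/sde1
[root@lx4 ~]# mdadm /dev/md127 --add /dev/sde1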
But /dev/sdb1 is working now (same as /dev/sde1), so I tried to
reassemble the array:
[root@lx4 ~]# mdadm --assemble --verbose /dev/md127 /dev/sd[bcde]1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot -1.
mdadm: added /dev/sdb1 to /dev/md127 as 1 (possibly out of date)
mdadm: no uptodate device for slot 2 of /dev/md127
mdadm: added /dev/sdd1 to /dev/md127 as 3
mdadm: added /dev/sde1 to /dev/md127 as -1
mdadm: added /dev/sdc1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 2 drives and 1 spare - not enough to start
the array.
But it rejected /dev/sdb1, so I ran --force to have it update the event
count:
[root@lx4 ~]# mdadm --assemble --force --verbose /dev/md127 /dev/sd[bcde]1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot -1.
mdadm: forcing event count in /dev/sdb1(1) from 905199 upto 905262
mdadm: clearing FAULTY flag for device 0 in /dev/md127 for /dev/sdb1
mdadm: Marking array /dev/md127 as 'clean'
mdadm: added /dev/sdb1 to /dev/md127 as 1
mdadm: no uptodate device for slot 2 of /dev/md127
mdadm: added /dev/sdd1 to /dev/md127 as 3
mdadm: added /dev/sde1 to /dev/md127 as -1
mdadm: added /dev/sdc1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 3 drives and 1 spare - not enough to start
the array.
This surprised me a lot, because I thought 3 drives would have been enough
to start the array. But when I ran it again, I got a different error:
[root@lx4 ~]# mdadm --assemble --force --verbose /dev/md127 /dev/sd[bcde]1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot -1.
mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
mdadm: added /dev/sdb1 to /dev/md127 as 1
mdadm: no uptodate device for slot 2 of /dev/md127
mdadm: no uptodate device for slot 3 of /dev/md127
mdadm: added /dev/sde1 to /dev/md127 as -1
mdadm: added /dev/sdc1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 2 drives and 1 spare - not enough to start
the array.
It appears to be failing because of this:
mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
The mdadm source says this:
/* If this device thinks that 'most_recent' has failed, then
* we must reject this device.
*/
But I can't interpret that into a possible fix. Any ideas?
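As far as I can tell from the --examine output in Appendix C below, the
trigger is that sdd1's metadata still records slot 1 (sdb1) as failed
("A..A" in its Array State), even after --force bumped sdb1's event
count. A quick, read-only way to compare each member's view:
[root@lx4 ~]# for d in /dev/sd[bcde]1; do echo "== $d"; \
    mdadm --examine "$d" | grep -E 'Events|Device Role|Array State'; done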
Thanks in advance,
--
Daniel Browning
Appendix A. Versions
Distro: Fedora Core 16
Kernel: 3.4.4-4.fc16.x86_64 #1 SMP Thu Jul 5 20:01:38 UTC 2012
mdadm: v3.2.5 - 18th May 2012
Appendix B. Contents of mdstat after a failed "--assemble":
md127 : inactive sdc1[0](S) sdb1[1](S)
3906202639 blocks super 1.2
Appendix C. mdadm --examine for all disks, from *before* the
"--assemble --force" was executed:
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 156bc6e0:eaa285fd:8f4ef720:6f2171c2
Update Time : Thu Jan 10 00:50:25 2013
Checksum : f0945b4a - correct
Events : 905199
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 2dbbc5d0:f3deb841:50c7c992:c9abf856
Update Time : Thu Jan 10 09:14:03 2013
Checksum : 2b1b4f88 - correct
Events : 905262
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : A..A ('A' == active, '.' == missing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : bdd8c401:9389bf9b:c80762a2:682b0297
Update Time : Thu Jan 10 09:14:03 2013
Checksum : 5c2d7d3 - correct
Events : 905262
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : A..A ('A' == active, '.' == missing)
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 4ca86345:c28c62be:03c9f77b:6760ef5c
Name : lx4:127
Creation Time : Sun Oct 10 15:46:28 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3906202639 (1862.62 GiB 1999.98 GB)
Array Size : 5859302400 (5587.87 GiB 5999.93 GB)
Used Dev Size : 3906201600 (1862.62 GiB 1999.98 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 78c381f7:1447cbd4:6af86729:d4c08320
Update Time : Thu Jan 10 09:14:03 2013
Checksum : 4513061e - correct
Events : 905262
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : A..A ('A' == active, '.' == missing)
Subject: Re: What to do about "ignoring %s as it reports %s as failed"?
From: Daniel Browning @ 2013-01-12 21:18 UTC
To: linux-raid
On Thursday 10 January 2013 11:01:01 am Daniel Browning wrote:
> Hello, folks. What should I do about the following error?
>
> mdadm: ignoring /dev/sdd1 as it reports /dev/sdb1 as failed
>
I was never able to find a way around that error, so I had completely
written off the failed array. But today I rebooted the server (for a
completely unrelated reason) and when it came back up, the failed array
started working just fine, automatically, with no error like the one
above. The array passed fsck with a clean bill of health, but now I'm
checking for silent corruption by comparing against backups (excluding
files that have a newer modification time).
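For what it's worth, one way to do that kind of comparison is an rsync
dry run along these lines (just a sketch; /backup/array/ and /mnt/array/
are placeholders for the real mount points):
[root@lx4 ~]# rsync -rncui /backup/array/ /mnt/array/
-n keeps it read-only, -c compares by checksum rather than size/mtime,
and -u skips anything with a newer modification time on the array side.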
I would still be interested in knowing what I *should* have done when
encountering that error, and whether there was any solution other than
rebooting. If not, I have to say I'm disappointed that a reboot was
required to fix this kind of issue; I thought that was a Windows thing,
not something to expect from Linux and/or mdadm.
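If there is a non-reboot equivalent, my guess (and it is only a guess,
since I didn't test it before rebooting) is that it would have been to
stop the leftover inactive md127 shown in Appendix B of my first mail and
retry the forced assemble from that clean state:
[root@lx4 ~]# mdadm --stop /dev/md127
[root@lx4 ~]# mdadm --assemble --force --verbose /dev/md127 /dev/sd[bcde]1
I'd be glad to hear from anyone who knows whether that alone would have
been enough here.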
One coincidence that struck me as very funny was that this morning I read
the following comic:
http://thedoghousediaries.com/4822
But I still didn't think that rebooting would help my raid issue, so I
didn't bother to reboot. Later on when I rebooted for a different reason, I
realized just how timely that comic was.
Special thanks to one "frostschutz" in the freenode #linux-raid IRC channel,
who helped me out with all this.
--
Daniel Browning
Kavod Technologies