* degraded raid5 refuses to start
From: Jason Lunz @ 2006-07-01 21:20 UTC
To: linux-raid
I have a 4-disk raid5 (sda3, sdb3, hda1, hdc1). sda and sdb share a
Silicon Image SATA card. sdb died completely, then 20 minutes later,
the sata_sil driver became fatally confused and the machine locked up.
I shut down the machine and waited until I had a replacement for sdb.
I've got a replacement for sdb now, but I can't get the array to start
so that I can add it and resync. When I try to assemble the degraded
array, I get this:
root@orr:~# mdadm -Af /dev/md2 /dev/sda3 /dev/hda1 /dev/hdc1
mdadm: failed to RUN_ARRAY /dev/md2: Input/output error
root@orr:~# dmesg | tail -n 15
md: bind<hda1>
md: bind<hdc1>
md: bind<sda3>
md: md2: raid array is not clean -- starting background reconstruction
raid5: device sda3 operational as raid disk 0
raid5: device hdc1 operational as raid disk 3
raid5: device hda1 operational as raid disk 2
raid5: cannot start dirty degraded array for md2
RAID5 conf printout:
--- rd:4 wd:3 fd:1
disk 0, o:1, dev:sda3
disk 2, o:1, dev:hda1
disk 3, o:1, dev:hdc1
raid5: failed to run raid set md2
md: pers->run() failed ...
How do I convince the array to start? I can add the new disk to the
array, but it simply becomes a spare and the raid5 remains inactive.
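One thing I haven't tried (I'm not sure my kernel is new enough to have it): md apparently has a start_dirty_degraded module parameter that lets a dirty, degraded array start anyway, at the risk of stale parity. Roughly, assuming md is modular so the knob shows up under /sys/module/md_mod:
# echo 1 > /sys/module/md_mod/parameters/start_dirty_degraded
# mdadm -Af /dev/md2 /dev/sda3 /dev/hda1 /dev/hdc1
(or something like md_mod.start_dirty_degraded=1 on the kernel command line if md is built in). Treat that as a sketch; I haven't verified it here.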
The superblock on one of the three drives is a little different from the
other two:
root@orr:~# mdadm -E /dev/hda1 > sb-hda1
root@orr:~# mdadm -E /dev/hdc1 > sb-hdc1
root@orr:~# mdadm -E /dev/sda3 > sb-sda3
root@orr:~# diff -u sb-hda1 sb-hdc1
--- sb-hda1 2006-07-01 17:17:36.000000000 -0400
+++ sb-hdc1 2006-07-01 17:17:41.000000000 -0400
@@ -1,4 +1,4 @@
-/dev/hda1:
+/dev/hdc1:
Magic : a92b4efc
Version : 00.90.00
UUID : 6b8b4567:327b23c6:643c9869:66334873
@@ -16,14 +16,14 @@
Working Devices : 3
Failed Devices : 2
Spare Devices : 0
- Checksum : a2163da6 - correct
+ Checksum : a2163dbb - correct
Events : 0.47575379
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
-this 2 3 1 2 active sync /dev/hda1
+this 3 22 1 3 active sync /dev/hdc1
0 0 8 3 0 active sync /dev/sda3
1 1 0 0 1 faulty removed
root@orr:~# diff -u sb-hda1 sb-sda3
--- sb-hda1 2006-07-01 17:17:36.000000000 -0400
+++ sb-sda3 2006-07-01 17:17:43.000000000 -0400
@@ -1,4 +1,4 @@
-/dev/hda1:
+/dev/sda3:
Magic : a92b4efc
Version : 00.90.00
UUID : 6b8b4567:327b23c6:643c9869:66334873
@@ -10,22 +10,22 @@
Total Devices : 4
Preferred Minor : 2
- Update Time : Mon Jun 26 22:51:12 2006
- State : active
+ Update Time : Mon Jun 26 22:51:06 2006
+ State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 2
Spare Devices : 0
- Checksum : a2163da6 - correct
- Events : 0.47575379
+ Checksum : a4ec2eec - correct
+ Events : 0.47575378
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
-this 2 3 1 2 active sync /dev/hda1
+this 0 8 3 0 active sync /dev/sda3
0 0 8 3 0 active sync /dev/sda3
- 1 1 0 0 1 faulty removed
+ 1 1 0 0 1 spare
2 2 3 1 2 active sync /dev/hda1
3 3 22 1 3 active sync /dev/hdc1
How do I get this array going again? Am I doing something wrong?
Reading the list archives indicates that there could be bugs in this
area, or that I may need to recreate the array with -C (though that
seems heavy-handed to me).
thanks,
Jason
* Re: degraded raid5 refuses to start
From: Jason Lunz @ 2006-07-02 1:37 UTC
To: linux-raid
lunz@falooley.org said:
> How do I get this array going again? Am I doing something wrong?
> Reading the list archives indicates that there could be bugs in this
> area, or that I may need to recreate the array with -C (though that
> seems heavy-handed to me).
This is what I ended up doing. I made backups of the three superblocks,
then recreated the array with:
# mdadm -C /dev/md2 -n4 -l5 /dev/sda3 missing /dev/hda1 /dev/hdc1
(I knew the chunk size and layout would be the same, since I just use
the defaults).
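For anyone in the same spot who isn't sure of the geometry, the -E dumps saved above are an easy way to double-check before running -C; something like:
# grep -E 'UUID|Layout|Chunk Size' sb-sda3 sb-hda1 sb-hdc1
should show the same UUID, left-symmetric layout and 64K chunk size on every member.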
After recreating the superblocks this way, the array works again. I have before and after images of the
three superblocks if anyone wants to look into how they got into this
state.
As far as I can see, the problem was that the broken array got into a
state where the superblock counts were like this:
Raid Devices : 4
Total Devices : 4
Preferred Minor : 2
Update Time : Mon Jun 26 22:51:12 2006
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 2
Spare Devices : 0
Notice how the total of Working + Failed devices (3 + 2 = 5) exceeds the
number of disks in the array (4). Maybe there's a bug to be fixed here
that lets these counters get out of whack somehow?
After reconstructing the array, the Failed count went back down to 1,
and everything started working normally again. I wonder if simply
decrementing that one value in each superblock would have been enough to
get the array going again, rather than rewriting all the superblocks. If
so, maybe that can be safely built into mdadm?
Either that, or the problem was that two disks were marked "State : active"
while one was marked "clean" in the degraded array.
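(The counts in question come straight out of mdadm -E, so re-running it on a member and filtering, e.g.
# mdadm -E /dev/sda3 | grep -E 'Active|Working|Failed|Spare'
is enough to check whether the Devices counters add up after a change like this.)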
Anyway, I have a dead disk but kept all my data, so thanks.
Jason