raid5 recover after a 2 disk failure


* raid5 recover after a 2 disk failure
@ 2007-06-17  6:57 frank jenkins
  2007-06-17  7:14 ` frank jenkins
  0 siblings, 1 reply; 6+ messages in thread
From: frank jenkins @ 2007-06-17  6:57 UTC (permalink / raw)
  To: linux-raid

I have a 5 disk raid5 array that had a disk failure. I removed the disk, 
added a new one (and a spare), and recovery began. Halfway through recovery, 
a second disk failed.

However, while the first disk really was dead, the second seems to have been
a transient error, as the SMART data and disk testing seem to show the disk
is fine.

The question is, how can I tell mdadm to unfail this second disk? From what
I've found in the archives, I think I need to use the --force option, but
I'm concerned about getting the device names in the wrong order (and totally
destroying my array in the process), so I thought I'd ask here first. Here is
my /proc/mdstat when recovery initially began:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdc1[0](S) sdf1[5] sdb1[4] sda1[3] sde1[2] sdd1[1]
      976783616 blocks level 5, 32k chunk, algorithm 2 [5/4] [_UUUU]
      [>....................]  recovery =  0.0% (237952/244195904) finish=427.0min speed=9518K/sec

and here is my current mdstat:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdc1[5](S) sdf1[6](S) sdb1[4] sda1[3] sde1[7](F) sdd1[1]
      976783616 blocks level 5, 32k chunk, algorithm 2 [5/3] [_U_UU]

sde is the disk that is now marked as failed, and which I would like to put 
back into service.

Also, what does the number in []'s mean after each device, and why did that 
number change on sdc, sde, and sdf?

Thanks, Frank
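
On the bracket question, going only by what this thread shows: the number in [] looks like the device's entry number in the array's device table (the "Number" column of the mdadm -E output later in the thread lists sdc1 as 5 and sdf1 as 6, matching the current mdstat), while (S) and (F) mark spare and faulty members. Why the numbers moved is not answered in the thread; my assumption is that md renumbers a device when it fails or comes back as a spare. A small sketch for pulling those fields out of an mdstat line (mdstat_members is a hypothetical helper name, not an mdadm feature):

```shell
# Split an mdstat array line (read from stdin) into one
# "device number flag" row per member. The flag column is "-"
# when the member carries neither (S) nor (F).
mdstat_members() {
    tr ' ' '\n' | awk '
        /^[a-z0-9]+\[[0-9]+\]/ {
            dev = $0; sub(/\[.*/, "", dev)             # text before "["
            num = $0; sub(/^[a-z0-9]+\[/, "", num)     # drop "name["
            sub(/\].*/, "", num)                       # drop "]" and the rest
            flag = "-"
            if ($0 ~ /\(S\)$/) flag = "spare"
            if ($0 ~ /\(F\)$/) flag = "faulty"
            print dev, num, flag
        }'
}

# Example with the current array line quoted above.
printf '%s\n' 'md1 : active raid5 sdc1[5](S) sdf1[6](S) sdb1[4] sda1[3] sde1[7](F) sdd1[1]' |
    mdstat_members
```

On a live system the same function could be fed with `grep '^md1' /proc/mdstat`.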


* RE: raid5 recover after a 2 disk failure
  2007-06-17  6:57 frank jenkins
@ 2007-06-17  7:14 ` frank jenkins
  0 siblings, 0 replies; 6+ messages in thread
From: frank jenkins @ 2007-06-17  7:14 UTC (permalink / raw)
  To: linux-raid

To clarify, all I want to do is temporarily mount the array so I can copy as
much data off as possible, then blow the entire array away and find out if
sdc really is bad, or it's just a bad block or bad cable or whatever. I get
the feeling that sdc is mostly fine, in which case I should be able to
recover most of the data on the array.

Also, is it possible to set these disks read-only so that mdadm won't write
to them no matter what I do? That would make me feel a lot better when
trying various options to force mdadm into mounting them.

Thanks, Frank

* RE: raid5 recover after a 2 disk failure
@ 2007-06-19  7:44 Frank Jenkins
  2007-06-19  8:42 ` David Greaves
  2007-06-19 11:48 ` David Greaves
  0 siblings, 2 replies; 6+ messages in thread
From: Frank Jenkins @ 2007-06-19  7:44 UTC (permalink / raw)
  To: linux-raid

So here's the /proc/mdstat prior to the array failure:
---cut---
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdc1[0](S) sdf1[5] sdb1[4] sda1[3] sde1[2] sdd1[1]
      976783616 blocks level 5, 32k chunk, algorithm 2 [5/4] [_UUUU]
      [>....................]  recovery =  0.0% (237952/244195904) finish=427.0min speed=9518K/sec
---cut---

and here's the /proc/mdstat as it stands currently:
---cut---
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdc1[5](S) sdf1[6](S) sdb1[4] sda1[3] sde1[7](F) sdd1[1]
      976783616 blocks level 5, 32k chunk, algorithm 2 [5/3] [_U_UU]
            
unused devices: <none>
---cut---

I think what I need to do is run:

mdadm -ARf /dev/md1 missing /dev/sd[baed]1

This should force the array back into a useable state, yes?
(assuming that I'm correct and sde isn't really busted).

And more importantly, if anyone can tell me how to lock down the
disks so they are read-only, so I can play around with mdadm
re-assembly options without having to worry about completely
destroying the array, that'd be awesome.

As usual, any help would be greatly appreciated.

I'll include the -E output from the drives in their
current state, if that helps:


nas# for t in a b c d e f ; do mdadm -E /dev/sd${t}1 > sd${t}1.txt ; done
/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8bc0e21c:ef63a964:93ce508e:500a32e2
  Creation Time : Mon Sep 27 11:14:17 2004
     Raid Level : raid5
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 976783616 (931.53 GiB 1000.23 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 1

    Update Time : Sat Jun 16 23:22:03 2007
          State : clean
 Active Devices : 3
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 2
       Checksum : 9e863c27 - correct
         Events : 0.67378

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     3       8        1        3      active sync   /dev/sda1

   0     0       0        0        0      removed
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8        1        3      active sync   /dev/sda1
   4     4       8       17        4      active sync   /dev/sdb1
   5     5       8       33        5      spare   /dev/sdc1
   6     6       8       81        6      spare   /dev/sdf1
/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8bc0e21c:ef63a964:93ce508e:500a32e2
  Creation Time : Mon Sep 27 11:14:17 2004
     Raid Level : raid5
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 976783616 (931.53 GiB 1000.23 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 1

    Update Time : Sat Jun 16 23:22:03 2007
          State : clean
 Active Devices : 3
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 2
       Checksum : 9e863c39 - correct
         Events : 0.67378

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     4       8       17        4      active sync   /dev/sdb1

   0     0       0        0        0      removed
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8        1        3      active sync   /dev/sda1
   4     4       8       17        4      active sync   /dev/sdb1
   5     5       8       33        5      spare   /dev/sdc1
   6     6       8       81        6      spare   /dev/sdf1
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8bc0e21c:ef63a964:93ce508e:500a32e2
  Creation Time : Mon Sep 27 11:14:17 2004
     Raid Level : raid5
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 976783616 (931.53 GiB 1000.23 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 1

    Update Time : Sat Jun 16 23:22:03 2007
          State : clean
 Active Devices : 3
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 2
       Checksum : 9e863c45 - correct
         Events : 0.67378

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     5       8       33        5      spare   /dev/sdc1

   0     0       0        0        0      removed
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8        1        3      active sync   /dev/sda1
   4     4       8       17        4      active sync   /dev/sdb1
   5     5       8       33        5      spare   /dev/sdc1
   6     6       8       81        6      spare   /dev/sdf1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8bc0e21c:ef63a964:93ce508e:500a32e2
  Creation Time : Mon Sep 27 11:14:17 2004
     Raid Level : raid5
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 976783616 (931.53 GiB 1000.23 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 1

    Update Time : Sat Jun 16 23:22:03 2007
          State : clean
 Active Devices : 3
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 2
       Checksum : 9e863c53 - correct
         Events : 0.67378

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     1       8       49        1      active sync   /dev/sdd1

   0     0       0        0        0      removed
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8        1        3      active sync   /dev/sda1
   4     4       8       17        4      active sync   /dev/sdb1
   5     5       8       33        5      spare   /dev/sdc1
   6     6       8       81        6      spare   /dev/sdf1
/dev/sde1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8bc0e21c:ef63a964:93ce508e:500a32e2
  Creation Time : Mon Sep 27 11:14:17 2004
     Raid Level : raid5
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 976783616 (931.53 GiB 1000.23 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 1

    Update Time : Sat Jun 16 19:16:48 2007
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 9e86022f - correct
         Events : 0.67372

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     2       8       65        2      active sync   /dev/sde1

   0     0       0        0        0      removed
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       8       65        2      active sync   /dev/sde1
   3     3       8        1        3      active sync   /dev/sda1
   4     4       8       17        4      active sync   /dev/sdb1
/dev/sdf1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8bc0e21c:ef63a964:93ce508e:500a32e2
  Creation Time : Mon Sep 27 11:14:17 2004
     Raid Level : raid5
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 976783616 (931.53 GiB 1000.23 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 1

    Update Time : Sat Jun 16 23:22:03 2007
          State : clean
 Active Devices : 3
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 2
       Checksum : 9e863c77 - correct
         Events : 0.67378

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     6       8       81        6      spare   /dev/sdf1

   0     0       0        0        0      removed
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8        1        3      active sync   /dev/sda1
   4     4       8       17        4      active sync   /dev/sdb1
   5     5       8       33        5      spare   /dev/sdc1
   6     6       8       81        6      spare   /dev/sdf1



       

* Re: raid5 recover after a 2 disk failure
  2007-06-19  7:44 raid5 recover after a 2 disk failure Frank Jenkins
@ 2007-06-19  8:42 ` David Greaves
  2007-06-19 11:48 ` David Greaves
  1 sibling, 0 replies; 6+ messages in thread
From: David Greaves @ 2007-06-19  8:42 UTC (permalink / raw)
  To: Frank Jenkins; +Cc: linux-raid

Frank Jenkins wrote:
> So here's the /proc/mdstat prior to the array failure:

I'll take a look through this and see if I can see any problems Frank. Bit busy 
now - give me a few minutes.

David



* Re: raid5 recover after a 2 disk failure
  2007-06-19  7:44 raid5 recover after a 2 disk failure Frank Jenkins
  2007-06-19  8:42 ` David Greaves
@ 2007-06-19 11:48 ` David Greaves
  2007-07-08  9:54   ` Frank Jenkins
  1 sibling, 1 reply; 6+ messages in thread
From: David Greaves @ 2007-06-19 11:48 UTC (permalink / raw)
  To: Frank Jenkins; +Cc: linux-raid

All looked OK, a few comments...

Frank Jenkins wrote:
Logically this comes first...
> This should force the array back into a useable state, yes?
> (assuming that I'm correct and sde isn't really busted).
Correct.
Since you have a spare you may want to use ddrescue to transfer data from sde to 
the spare as a 'backup'. That way, if you do get problems with sde you have the 
best chance of recovery.

If you do that, then you can use the backup as your new sde.
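
A rough sketch of that backup step, under assumptions that are mine rather than David's: the array is stopped first so nothing else touches the disks, sdf1 is the spare chosen as the target, and GNU ddrescue is used with -f (needed when the output is a device) plus a map file so the copy can be resumed. The helper only prints the commands; ddrescue_plan and the map file name are made up for this example.

```shell
# Print (do not run) the commands for copying a suspect member onto a spare.
# Arguments: md device, source partition, target partition, map file.
ddrescue_plan() {
    md=$1
    src=$2
    dst=$3
    map=$4
    echo "mdadm --stop $md"              # make sure md is not using either disk
    echo "ddrescue -f $src $dst $map"    # -f: allow writing to a block device
}

ddrescue_plan /dev/md1 /dev/sde1 /dev/sdf1 sde1.map
```

After such a copy both partitions would carry the same md superblock, so presumably only one of them should be passed to mdadm when assembling.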

either way...

> I think what I need to do is run:
> mdadm -ARf /dev/md1 missing /dev/sd[baed]1
I don't think you want -R.
If enough drives are present to provide data then it will run with a -f.

If you get the order wrong here, it won't damage anything *provided you have a 
missing disk and the array is degraded* (which you do) so there is no need to 
worry about having rw devices.
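
A sketch of the assemble step as described above: --force without -R, with the four members that still hold data listed in any order. The device names come from the mdstat output in this thread. Leaving out the word "missing" is my assumption (as far as I know it is a --create keyword, and assemble mode reads each device's slot from its superblock). The wrapper only prints the command unless RUN=1 is set; assemble_md1 and RUN are names invented for this example.

```shell
# Forced assembly of md1 from the members that still hold data.
# Prints the command by default; set RUN=1 to execute it (needs root).
assemble_md1() {
    set -- mdadm --assemble --force /dev/md1 \
        /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

assemble_md1
```

Printing first makes it easy to check the device list against /proc/mdstat before anything is written.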

The event counts from abcdf are the same and sde is only slightly off so you 
should find it assembles just fine.
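
The event-count comparison can be repeated mechanically on the saved -E dumps (the sd?1.txt files from Frank's loop). A minimal sketch, assuming each dump contains a line of the form "Events : 0.67378"; events_of and group_by_events are hypothetical helper names:

```shell
# Print "file events" for each saved `mdadm -E` dump named as an argument.
events_of() {
    for f in "$@"; do
        awk -v name="$f" '$1 == "Events" { print name, $3 }' "$f"
    done
}

# Group the dumps by event count, one output line per distinct count.
group_by_events() {
    events_of "$@" |
        awk '{ who[$2] = who[$2] " " $1 }
             END { for (e in who) print e ":" who[e] }' |
        sort
}

# Usage, in the directory holding the dumps:
#   group_by_events sd?1.txt
```

A member that sits alone on its own line (here that would be sde1.txt at 0.67372) is the one whose superblock is out of date.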

At this point you can copy data off the disks. Bear in mind you may have fs or data
corruption due to the inconsistency in sde.

If you are confident that sde is actually OK, you can now add in the spare:
   mdadm /dev/md1 --add /dev/sdc1
and allow a rebuild.

Then run xfs_repair (or fsck for ext3, or the equivalent for whatever filesystem you use).
You can do this whilst the resync is running.

At that point you should be back up and running - no need for a restore.

> And more importantly, if anyone can tell me how to lock down the
> disks so they are readonly so I can play around with mdadm
> re-assembly options without having to worry about completely
> destroying the array, that'd be awesome.
Not sure you can do this.
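
For anyone wanting to explore this anyway, here is a sketch of options to look into, not a guarantee. It prints (does not run) three commands that I believe exist: blockdev --setro to flag each member read-only, mdadm --assemble with --readonly to start the array without any resync, and mount -o ro. Caveats: --force may still update the superblock on the stale member, I have not verified that md honours the blockdev flag on its members, and a read-only mount of a journalling filesystem may still replay its journal. readonly_plan and the /mnt/rescue mount point are hypothetical.

```shell
# Print (do not run) a cautious read-only bring-up sequence.
# Arguments: md device, mount point, then the member partitions.
readonly_plan() {
    md=$1
    mnt=$2
    shift 2
    for dev in "$@"; do
        echo "blockdev --setro $dev"     # flag each member read-only
    done
    echo "mdadm --assemble --force --readonly $md $*"
    echo "mount -o ro $md $mnt"
}

readonly_plan /dev/md1 /mnt/rescue /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1
```

Whether this really keeps every write away from the disks would need checking on a scratch array first.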

> As usual, any help would be greatly appreciated.
> 
> I'll include the -E output from the drives in their
> current state, if that helps:
It did.


* Re: raid5 recover after a 2 disk failure
  2007-06-19 11:48 ` David Greaves
@ 2007-07-08  9:54   ` Frank Jenkins
  0 siblings, 0 replies; 6+ messages in thread
From: Frank Jenkins @ 2007-07-08  9:54 UTC (permalink / raw)
  To: linux-raid

First, I'd like to extend a HUGE thank you. I was able
to get some irreplaceable family photos off my raid
array before things went south. 

Sadly, another disk seems to have died, so now it
seems completely dead. There is one thing I'm confused
about.

The Events number has gone way up on the disks I was
using for that array. From 0.67378 to 0.78038. Isn't
the event number only updated when the array is
written to? I thought I mounted the array read-only,
so I'm not sure why that would have gone up.

Thanks, Frank
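
On the Events question, which the thread leaves unanswered: as far as I understand md, the counter is bumped whenever the superblock is updated (a device failing, the array starting or stopping, switching between clean and active), not once per data write, so it can climb on an array that sees few writes. Mounting the filesystem read-only also does not by itself make the md device read-only. Both points are my understanding rather than something confirmed here. For scale, the jump quoted above works out as follows (events_delta is a made-up helper, and it assumes the part before the dot stayed 0):

```shell
# Difference between two 0.90-style event counts written as "0.N".
# Only the part after the dot is compared; the leading "0." is assumed
# to be the unchanged high half of the counter.
events_delta() {
    old=${1#*.}
    new=${2#*.}
    echo $(( new - old ))
}

events_delta 0.67378 0.78038   # prints 10660
```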



