linux-raid.vger.kernel.org archive mirror
* Failed RAID 6 array advice
@ 2011-03-02  5:05 jahammonds prost
  2011-03-02  5:26 ` Mikael Abrahamsson
  2011-03-02  5:26 ` NeilBrown
  0 siblings, 2 replies; 3+ messages in thread
From: jahammonds prost @ 2011-03-02  5:05 UTC (permalink / raw)
  To: linux-raid

I've just had a 3rd drive fail on one of my RAID 6 arrays, and I'm looking for 
some advice on how to get the array back far enough that I can recover the data, 
and then replace the other failed drives.


mdadm -V
mdadm - v3.0.3 - 22nd October 2009


Not the most up-to-date release, but it seems to be the latest one available on 
FC12.



The /etc/mdadm.conf file is

ARRAY /dev/md0 uuid=1470c671:4236b155:67287625:899db153


That explains why I didn't get emailed about the drive failures. This isn't my 
standard file, and I don't know how it was changed, but that's another issue for 
another day.
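
For what it's worth, mdadm's monitor mode only sends mail if it has somewhere
to send it, so a more complete mdadm.conf would presumably look something like
the sketch below (the address is just a placeholder):

  DEVICE partitions
  ARRAY /dev/md0 uuid=1470c671:4236b155:67287625:899db153
  MAILADDR root@localhost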



mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
  Used Dev Size : 488383488 (465.76 GiB 500.10 GB)
   Raid Devices : 15
  Total Devices : 12
    Persistence : Superblock is persistent
    Update Time : Tue Mar  1 22:17:41 2011
          State : active, degraded, Not Started
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
     Chunk Size : 512K
           Name : file00bert.woodlea.org.uk:0  (local to host 
file00bert.woodlea.org.uk)
           UUID : 1470c671:4236b155:67287625:899db153
         Events : 254890
    Number   Major   Minor   RaidDevice State
       0       8      113        0      active sync   /dev/sdh1
       1       8       17        1      active sync   /dev/sdb1
       2       8      177        2      active sync   /dev/sdl1
       3       0        0        3      removed
       4       8       33        4      active sync   /dev/sdc1
       5       8      193        5      active sync   /dev/sdm1
       6       0        0        6      removed
       7       8       49        7      active sync   /dev/sdd1
       8       8      209        8      active sync   /dev/sdn1
       9       8      161        9      active sync   /dev/sdk1
      10       0        0       10      removed
      11       8      225       11      active sync   /dev/sdo1
      12       8       81       12      active sync   /dev/sdf1
      13       8      241       13      active sync   /dev/sdp1
      14       8        1       14      active sync   /dev/sda1



The output from the failed drives is as follows.


mdadm --examine /dev/sde1
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 1470c671:4236b155:67287625:899db153
           Name : file00bert.woodlea.org.uk:0  (local to host 
file00bert.woodlea.org.uk)
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
   Raid Devices : 15
 Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
     Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
  Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 3e284f2e:d939fb97:0b74eb88:326e879c
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Mar  1 21:53:31 2011
       Checksum : 768f0f34 - correct
         Events : 254591
     Chunk Size : 512K
   Device Role : Active device 10
   Array State : AAA.AA.AAAAAAAA ('A' == active, '.' == missing)


The above is the drive that failed tonight, and the one I would like to re-add 
back into the array. There have been no writes to the filesystem on the array in 
the last couple of days (other than what ext4 would do on its own).
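
Since sde1's event count (254591) is only a little behind the array's (254890),
it looks re-addable. A quick way to compare all the members is something along
these lines (adjust the glob to match whatever your member partitions actually
are):

  mdadm --examine /dev/sd[a-p]1 | egrep '/dev/|Events'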


 mdadm --examine /dev/sdi1
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 1470c671:4236b155:67287625:899db153
           Name : file00bert.woodlea.org.uk:0  (local to host 
file00bert.woodlea.org.uk)
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
   Raid Devices : 15
 Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
     Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
  Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 8e668e39:06d8281b:b79aa3ab:a1d55fb5
Internal Bitmap : 2 sectors from superblock
    Update Time : Thu Feb 10 18:20:54 2011
       Checksum : 4078396b - correct
         Events : 254075
     Chunk Size : 512K
   Device Role : Active device 3
   Array State : AAAAAA.AAAAAAAA ('A' == active, '.' == missing)


mdadm --examine /dev/sdj1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 1470c671:4236b155:67287625:899db153
           Name : file00bert.woodlea.org.uk:0  (local to host 
file00bert.woodlea.org.uk)
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
   Raid Devices : 15
 Avail Dev Size : 976767730 (465.76 GiB 500.11 GB)
     Array Size : 12697970688 (6054.86 GiB 6501.36 GB)
  Used Dev Size : 976766976 (465.76 GiB 500.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 37d422cc:8436960a:c3c4d11c:81a8e4fa
Internal Bitmap : 2 sectors from superblock
    Update Time : Thu Oct 21 23:45:06 2010
       Checksum : 78950bb5 - correct
         Events : 21435
     Chunk Size : 512K
   Device Role : Active device 6
   Array State : AAAAAAAAAAAAAAA ('A' == active, '.' == missing)


Looks like sdj1 failed way back in October last year (sigh). As I said, I am not 
too bothered about adding these last 2 drives back into the array, since they 
failed so long ago. I have a couple of spare drives sitting here, and I will 
replace these 2 drives with them (once I have completed a badblocks run on them). 
Looking at the output of dmesg, there are no other errors showing for the 3 
drives, other than them being kicked out of the array for being non-fresh.
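
By a badblocks run I mean something along the lines of the command below; -w is
a destructive write-mode test, so it is only for the blank replacement drives,
and /dev/sdq here is just a placeholder name:

  badblocks -wsv /dev/sdq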

I guess I have a couple of questions.

What's the correct process for adding the failed /dev/sde1 back into the array 
so I can start it? I don't want to rush into this and make things worse.

What's the correct process for replacing the 2 other drives?
I am presuming that I need to --fail, then --remove, then --add the drives (one 
at a time?), but I want to make sure.


Thanks for your help.


Graham.


      

* Re: Failed RAID 6 array advice
  2011-03-02  5:05 Failed RAID 6 array advice jahammonds prost
@ 2011-03-02  5:26 ` Mikael Abrahamsson
  2011-03-02  5:26 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: Mikael Abrahamsson @ 2011-03-02  5:26 UTC (permalink / raw)
  To: jahammonds prost; +Cc: linux-raid

On Tue, 1 Mar 2011, jahammonds prost wrote:

> What's the correct process for adding the failed /dev/sde1 back into the 
> array so I can start it. I don't want to rush into this and make things 
> worse.

There are a lot of discussions about this in the archives, but basically I 
recommend the following:

Make sure you're running the latest mdadm; right now that's 3.1.4. Compile 
it yourself if you have to. After that, stop the array and use 
--assemble --force to get the array up and running again with the drives 
you know are good (make sure you don't use the drives that were offlined a 
long time ago).
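
In other words, roughly the sketch below, using the device names from your
--detail/--examine output (they may well change on the next boot). Leave out
sdi1 and sdj1, but sde1 has to be included, since 12 drives aren't enough to
start a 15-device RAID6:

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdh1 \
      /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1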

> What's the correct process for replacing the 2 other drives?
> I am presuming that I need to --fail, then --remove then --add the drives (one
> at a time?), but I want to make sure.

Yes, once you have a working degraded array you just add them, a re-sync 
should happen, and everything should be OK if the resync succeeds.
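
Roughly, once /dev/md0 is running degraded (the replacement partition names
below are just placeholders):

  mdadm /dev/md0 --add /dev/sdq1
  mdadm /dev/md0 --add /dev/sdr1
  cat /proc/mdstat        # watch the recovery progress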

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: Failed RAID 6 array advice
  2011-03-02  5:05 Failed RAID 6 array advice jahammonds prost
  2011-03-02  5:26 ` Mikael Abrahamsson
@ 2011-03-02  5:26 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: NeilBrown @ 2011-03-02  5:26 UTC (permalink / raw)
  To: jahammonds prost; +Cc: linux-raid

On Tue, 1 Mar 2011 21:05:33 -0800 (PST) jahammonds prost <gmitch64@yahoo.com>
wrote:

> What's the correct process for adding the failed /dev/sde1 back into the array 
> so I can start it. I don't want to rush into this and make things worse.

If you think that the drives really are working and that it was a cabling
problem, then stop the array (if it isn't stopped already) and assemble with
--force:

  mdadm --assemble --force /dev/md0 /dev....list of devices

Then find the devices that it chose not to include and add them individually:
  mdadm /dev/md0 --add /dev/something

However, if any device has a bad block that cannot be read, then this won't
work. In that case you need to get a new device, partition it to have a
partition EXACTLY the same size, use dd_rescue to copy all the good data from
the bad drive to the new drive, remove the bad drive from the system, and run
the "--assemble --force" command with the new drive, not the old drive.
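
As a rough sketch of that last path (the new-disk names are placeholders, and
this assumes a plain MBR partition table): sfdisk can clone the partition
layout exactly, and dd_rescue takes the input device first, then the output:

  sfdisk -d /dev/sde | sfdisk /dev/sdq    # copy the partition table to the new disk
  dd_rescue /dev/sde1 /dev/sdq1           # copy whatever still reads from the old partition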


> 
> What's the correct process for replacing the 2 other drives?
> I am presuming that I need to --fail, then --remove then --add the drives (one 
> at a time?), but I want to make sure.

They are already failed and removed, so there is no point in trying to do
that again.

Good luck.

NeilBrown



