Failed Array Rebuild advice Please

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Failed Array Rebuild advice Please
@ 2012-04-10 22:32 jahammonds prost
  2012-04-10 23:02 ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: jahammonds prost @ 2012-04-10 22:32 UTC (permalink / raw)
  To: Linux RAID

For various reasons, the email notifications on my RAID6 array wasn't working, and 2 of the 15 drives failed out. I noticed this last week as I was about to move the server into a new case. As part of the move, I upgraded the OS to the latest CentOS, as I was having issues with the existing install and the new HBA card (a SASLP-MV8).
 
When the server came back up, for some reason it decided to fire up the md array with only 1 drive - and it incremented the Event count on that 1 drive (and since I'm running with 2 failed drives on a RAID6, I couldn't kick the drive out and let it rebuild).
 
The array shows this...
 
 mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Jun  5 10:38:11 2010
     Raid Level : raid6
  Used Dev Size : 488383488 (465.76 GiB 500.10 GB)
   Raid Devices : 15
  Total Devices : 12
    Persistence : Superblock is persistent
    Update Time : Mon Apr  9 13:05:31 2012
          State : active, FAILED, Not Started
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 512K
           Name : file00bert.woodlea.org.uk:0  (local to host file00bert.woodlea.org.uk)
           UUID : 1470c671:4236b155:67287625:899db153
         Events : 1378022
    Number   Major   Minor   RaidDevice State
       0       8      113        0      active sync   /dev/sdh1
       1       8      209        1      active sync   /dev/sdn1
       2       8      225        2      active sync   /dev/sdo1
      15       8       17        3      active sync   /dev/sdb1
       4       8      145        4      active sync   /dev/sdj1
       5       8      161        5      active sync   /dev/sdk1
       6       0        0        6      removed
       7       8       81        7      active sync   /dev/sdf1
       8       8       97        8      active sync   /dev/sdg1
      16       8       65        9      active sync   /dev/sde1
      10       8       33       10      active sync   /dev/sdc1
      11       0        0       11      removed
      12       8      177       12      active sync   /dev/sdl1
      13       8      241       13      active sync   /dev/sdp1
      14       0        0       14      removed

 
Looking at the Event count on all the drives as they currently are, they show this
 
sda1 1378024
sdb1 1378022
sdc1 1378022
sdd1 1362956
sde1 1378022
sdf1 1378022
sdg1 1378022
sdh1 1378022
sdj1 1378022
sdk1 1378022
sdl1 1378022
sdm1  616796
sdn1 1378022
sdo1 1378022
sdp1 1378022
 
So, /dev/sdd1 and /dev/sdm1 are the 2 failed drives. The Event count on all the other drives agree with each other, and with that of the array, except for /dev/sda1, which is a couple of events higher than everything else - and with that I can't start the array.
 
 
Since I know I did nothing with the temp one drive array when the server was booted (and I don't think that the md code did anything either??) would it be safe to 
 
mdadm --assemble /dev/md0 /dev/sd[a-c]1 /dev/sd[e-h]1 /dev/sd[j-l]1 /dev/sd[n-p]1 --force
 
to let the array come back up and get it running?
 
What would then be the correct sequence to replace the 2 failed drives (sdd1 and sdm1) and get the array running fully again?
 
 
Thanks for your help.
 
 
YP.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Failed Array Rebuild advice Please
  2012-04-10 22:32 Failed Array Rebuild advice Please jahammonds prost
@ 2012-04-10 23:02 ` NeilBrown
  2012-04-10 23:46   ` jahammonds prost
  2012-04-11  4:14   ` jahammonds prost
  0 siblings, 2 replies; 7+ messages in thread
From: NeilBrown @ 2012-04-10 23:02 UTC (permalink / raw)
  To: jahammonds prost; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

On Tue, 10 Apr 2012 15:32:44 -0700 (PDT) jahammonds prost
<gmitch64@yahoo.com> wrote:


> Since I know I did nothing with the temp one drive array when the server was booted (and I don't think that the md code did anything either??) would it be safe to 
>  
> mdadm --assemble /dev/md0 /dev/sd[a-c]1 /dev/sd[e-h]1 /dev/sd[j-l]1 /dev/sd[n-p]1 --force
>  
> to let the array come back up and get it running?

Yes.   Though you don't need to exclude the oldest devices.  mdadm will
figure out which ones to use and which ones to avoid.


>  
> What would then be the correct sequence to replace the 2 failed drives (sdd1 and sdm1) and get the array running fully again?

If you trust that sdd and sdm really are working now, you can

 mdadm /dev/md0 --add /dev/sdd1 /dev/sdm1

This will probably start rebuilding just one of the devices. You can check
with
   mdadm -D /dev/md0
which will report "spare building" against one or more devices.
If it is only rebuilding one of them, then

  echo idle > /sys/block/md0/md/sync_action

will cause it to stop and restart the recovery.  It will then recovery both
devices in parallel.

(Next version of mdadm will do this automatically).

NeilBrown


>  
>  
> Thanks for your help.
>  
>  
> YP.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Failed Array Rebuild advice Please
  2012-04-10 23:02 ` NeilBrown
@ 2012-04-10 23:46   ` jahammonds prost
  2012-04-11  4:14   ` jahammonds prost
  1 sibling, 0 replies; 7+ messages in thread
From: jahammonds prost @ 2012-04-10 23:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

Neil,
    Thanks for the fast reply.

>    If you trust that sdd and sdm really are working now, you can
>
>    mdadm /dev/md0 --add /dev/sdd1 /dev/sdm1

I am running a read only badblocks on them at the moment, but I don't really trust them. Once the scan has finished, I am going to replace them with a couple of new drives, and once I've run a full destructive badblocks on those (so that's going to take overnight), I will add them into the array. Thanks for the tip on having the rebuild run in parallel, that should speed up the rebuild process. 

Once all that is done, I will run a destructive badblocks on the original failed drives, and if they still pass, I will add them as additional drives into the array. This box is mainly a media server - how many drives would you suggest is the max in a RAID6 setup?

YP.

----- Original Message -----
From: NeilBrown <neilb@suse.de>
To: jahammonds prost <gmitch64@yahoo.com>
Cc: Linux RAID <linux-raid@vger.kernel.org>
Sent: Tuesday, 10 April 2012, 19:02
Subject: Re: Failed Array Rebuild advice Please

On Tue, 10 Apr 2012 15:32:44 -0700 (PDT) jahammonds prost
<gmitch64@yahoo.com> wrote:

> Since I know I did nothing with the temp one drive array when the server was booted (and I don't think that the md code did anything either??) would it be safe to 
>  
> mdadm --assemble /dev/md0 /dev/sd[a-c]1 /dev/sd[e-h]1 /dev/sd[j-l]1 /dev/sd[n-p]1 --force
>  
> to let the array come back up and get it running?

Yes.   Though you don't need to exclude the oldest devices.  mdadm will
figure out which ones to use and which ones to avoid.

>  
> What would then be the correct sequence to replace the 2 failed drives (sdd1 and sdm1) and get the array running fully again?

If you trust that sdd and sdm really are working now, you can

mdadm /dev/md0 --add /dev/sdd1 /dev/sdm1

This will probably start rebuilding just one of the devices. You can check
with
   mdadm -D /dev/md0
which will report "spare building" against one or more devices.
If it is only rebuilding one of them, then

  echo idle > /sys/block/md0/md/sync_action

will cause it to stop and restart the recovery.  It will then recovery both
devices in parallel.

(Next version of mdadm will do this automatically).

NeilBrown

>  
>  
> Thanks for your help.
>  
>  
> YP.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Failed Array Rebuild advice Please
  2012-04-10 23:02 ` NeilBrown
  2012-04-10 23:46   ` jahammonds prost
@ 2012-04-11  4:14   ` jahammonds prost
  2012-04-11  4:43     ` NeilBrown
  1 sibling, 1 reply; 7+ messages in thread
From: jahammonds prost @ 2012-04-11  4:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

One other question as the badblocks progresses.

>  It will then recovery both devices in parallel.

How many additional devices can be done at the same time? I presume that I am going to have to replace the 2 failed devices before I try and grow the array by adding 3 additional drives? I so, how many additional drives can be rebuilt concurrently with a grow? Could I add 5 devices and not see too much of a performance hit? Or would it be more sensible to add them one at a time?

YP

----- Original Message -----
From: NeilBrown <neilb@suse.de>
To: jahammonds prost <gmitch64@yahoo.com>
Cc: Linux RAID <linux-raid@vger.kernel.org>
Sent: Tuesday, 10 April 2012, 19:02
Subject: Re: Failed Array Rebuild advice Please

On Tue, 10 Apr 2012 15:32:44 -0700 (PDT) jahammonds prost
<gmitch64@yahoo.com> wrote:

> Since I know I did nothing with the temp one drive array when the server was booted (and I don't think that the md code did anything either??) would it be safe to 
>  
> mdadm --assemble /dev/md0 /dev/sd[a-c]1 /dev/sd[e-h]1 /dev/sd[j-l]1 /dev/sd[n-p]1 --force
>  
> to let the array come back up and get it running?

Yes.   Though you don't need to exclude the oldest devices.  mdadm will
figure out which ones to use and which ones to avoid.

>  
> What would then be the correct sequence to replace the 2 failed drives (sdd1 and sdm1) and get the array running fully again?

If you trust that sdd and sdm really are working now, you can

mdadm /dev/md0 --add /dev/sdd1 /dev/sdm1

This will probably start rebuilding just one of the devices. You can check
with
   mdadm -D /dev/md0
which will report "spare building" against one or more devices.
If it is only rebuilding one of them, then

  echo idle > /sys/block/md0/md/sync_action

will cause it to stop and restart the recovery.  It will then recovery both
devices in parallel.

(Next version of mdadm will do this automatically).

NeilBrown

>  
>  
> Thanks for your help.
>  
>  
> YP.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Failed Array Rebuild advice Please
  2012-04-11  4:14   ` jahammonds prost
@ 2012-04-11  4:43     ` NeilBrown
  2012-04-12  2:10       ` jahammonds prost
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2012-04-11  4:43 UTC (permalink / raw)
  To: jahammonds prost; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 1297 bytes --]

On Tue, 10 Apr 2012 21:14:07 -0700 (PDT) jahammonds prost
<gmitch64@yahoo.com> wrote:

> One other question as the badblocks progresses.
>  
> >  It will then recovery both devices in parallel.
> 
> How many additional devices can be done at the same time? I presume that I am going to have to replace the 2 failed devices before I try and grow the array by adding 3 additional drives? I so, how many additional drives can be rebuilt concurrently with a grow? Could I add 5 devices and not see too much of a performance hit? Or would it be more sensible to add them one at a time?
> 

I think it is best to recover, and then reshape later.  I cannot promise that
doing them both at once will work .... it might but I have a feeling that
there might be problems.

Adding three additional drives at once should work well enough in terms of
performance.
However I would only do it if I were very very confident of the drives.
If you hit bad blocks you start losing drives, and if you have 3 drives that
you haven't used before, the chance of losing them all during the reshape -
while still small - becomes a little too high for comfort.

But if you have run heavy bad-blocks tests on them all and they appear to
work, then adding 3 drives at once should be fine.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Failed Array Rebuild advice Please
  2012-04-11  4:43     ` NeilBrown
@ 2012-04-12  2:10       ` jahammonds prost
  2012-04-12  2:25         ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: jahammonds prost @ 2012-04-12  2:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux RAID

> I think it is best to recover, and then reshape later.

Which is what I did. The destructive badblocks ran fine overnight, with no errors on the drives that were failed out of the array. Interestingly one of the additional drives that I want to add took an additional 6 hours to run badblocks, so I need to look into that a bit.

I am having a bit of an issue with the reshape tho... When I try and add one of the 2 additional devices I want to add (I'm going to add them one at a time), I get an error about the bitmap needing to be removed.

        mdadm --grow /dev/md0 --raid-devices=16
        mdadm: Need to backup 93184K of critical section..
        mdadm: Cannot set device shape for /dev/md0: Device or resource busy
               Bitmap must be removed before shape can be changed

Now, the docs (and indeed several websites when I googled) suggest that you can have a bitmap present on a grow, and the help suggests that you can even change it during a grow..

         If the word internal is given, then the bitmap is stored with the metadata on the array, and so is replicated on all devices.  If the word none is given with --grow mode, then any bitmap that is present is removed.

Is there an issue with the array, or just in my understanding and google foo?

I am running on Centos 6 (2.6.32-220.7.1.el6.x86_64) with madam (mdadm - v3.2.2 - 17th June 2011)

Thanks again.

----- Original Message -----
From: NeilBrown <neilb@suse.de>
To: jahammonds prost <gmitch64@yahoo.com>
Cc: Linux RAID <linux-raid@vger.kernel.org>
Sent: Wednesday, 11 April 2012, 0:43
Subject: Re: Failed Array Rebuild advice Please

On Tue, 10 Apr 2012 21:14:07 -0700 (PDT) jahammonds prost
<gmitch64@yahoo.com> wrote:

> One other question as the badblocks progresses.
>  
> >  It will then recovery both devices in parallel.
> 
> How many additional devices can be done at the same time? I presume that I am going to have to replace the 2 failed devices before I try and grow the array by adding 3 additional drives? I so, how many additional drives can be rebuilt concurrently with a grow? Could I add 5 devices and not see too much of a performance hit? Or would it be more sensible to add them one at a time?
> 

I think it is best to recover, and then reshape later.  I cannot promise that
doing them both at once will work .... it might but I have a feeling that
there might be problems.

Adding three additional drives at once should work well enough in terms of
performance.
However I would only do it if I were very very confident of the drives.
If you hit bad blocks you start losing drives, and if you have 3 drives that
you haven't used before, the chance of losing them all during the reshape -
while still small - becomes a little too high for comfort.

But if you have run heavy bad-blocks tests on them all and they appear to
work, then adding 3 drives at once should be fine.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Failed Array Rebuild advice Please
  2012-04-12  2:10       ` jahammonds prost
@ 2012-04-12  2:25         ` NeilBrown
  0 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2012-04-12  2:25 UTC (permalink / raw)
  To: jahammonds prost; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 3315 bytes --]

On Wed, 11 Apr 2012 19:10:47 -0700 (PDT) jahammonds prost
<gmitch64@yahoo.com> wrote:

> > I think it is best to recover, and then reshape later.
>  
> Which is what I did. The destructive badblocks ran fine overnight, with no errors on the drives that were failed out of the array. Interestingly one of the additional drives that I want to add took an additional 6 hours to run badblocks, so I need to look into that a bit.
>  
> I am having a bit of an issue with the reshape tho... When I try and add one of the 2 additional devices I want to add (I'm going to add them one at a time), I get an error about the bitmap needing to be removed.
>  
>         mdadm --grow /dev/md0 --raid-devices=16
>         mdadm: Need to backup 93184K of critical section..
>         mdadm: Cannot set device shape for /dev/md0: Device or resource busy
>                Bitmap must be removed before shape can be changed
> 
> Now, the docs (and indeed several websites when I googled) suggest that you can have a bitmap present on a grow, and the help suggests that you can even change it during a grow..
> 
> 
>          If the word internal is given, then the bitmap is stored with the metadata on the array, and so is replicated on all devices.  If the word none is given with --grow mode, then any bitmap that is present is removed.
> 
> 
> Is there an issue with the array, or just in my understanding and google foo?

You cannot reshape an array while it has a bitmap attached.
This restriction will probably be removed in Linux 3.5

NeilBrown


> 
> I am running on Centos 6 (2.6.32-220.7.1.el6.x86_64) with madam (mdadm - v3.2.2 - 17th June 2011)
> 
> 
> Thanks again.
> 
> 
> 
>  
> ----- Original Message -----
> From: NeilBrown <neilb@suse.de>
> To: jahammonds prost <gmitch64@yahoo.com>
> Cc: Linux RAID <linux-raid@vger.kernel.org>
> Sent: Wednesday, 11 April 2012, 0:43
> Subject: Re: Failed Array Rebuild advice Please
> 
> On Tue, 10 Apr 2012 21:14:07 -0700 (PDT) jahammonds prost
> <gmitch64@yahoo.com> wrote:
> 
> > One other question as the badblocks progresses.
> >  
> > >  It will then recovery both devices in parallel.
> > 
> > How many additional devices can be done at the same time? I presume that I am going to have to replace the 2 failed devices before I try and grow the array by adding 3 additional drives? I so, how many additional drives can be rebuilt concurrently with a grow? Could I add 5 devices and not see too much of a performance hit? Or would it be more sensible to add them one at a time?
> > 
> 
> I think it is best to recover, and then reshape later.  I cannot promise that
> doing them both at once will work .... it might but I have a feeling that
> there might be problems.
> 
> Adding three additional drives at once should work well enough in terms of
> performance.
> However I would only do it if I were very very confident of the drives.
> If you hit bad blocks you start losing drives, and if you have 3 drives that
> you haven't used before, the chance of losing them all during the reshape -
> while still small - becomes a little too high for comfort.
> 
> But if you have run heavy bad-blocks tests on them all and they appear to
> work, then adding 3 drives at once should be fine.
> 
> NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-04-12  2:25 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-04-10 22:32 Failed Array Rebuild advice Please jahammonds prost
2012-04-10 23:02 ` NeilBrown
2012-04-10 23:46   ` jahammonds prost
2012-04-11  4:14   ` jahammonds prost
2012-04-11  4:43     ` NeilBrown
2012-04-12  2:10       ` jahammonds prost
2012-04-12  2:25         ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).