* Safe disk replace
@ 2012-09-04  4:14 Chris Dunlop
  2012-09-04 10:28 ` David Brown
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Dunlop @ 2012-09-04  4:14 UTC (permalink / raw)
  To: linux-raid

G'day,

What is the best way to replace a fully-functional or minimally-failing
(e.g. occasional bad sectors) disk in a live array whilst maintaining as
much redundancy as possible during the process?

It seems the standard way to replace a disk is to fail out the unwanted
disk, add the new disk, then wait for the array to rebuild. However this
means during the rebuild you've lost some or all of your redundancy,
depending on the raid level of the array. This can be a significant issue,
e.g. if you're replacing a 4 TB disk it could mean 10 to 20 hours or much
more of heightened risk, depending on the rebuild bandwidth available.
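
Concretely, with invented device names, I mean the usual:

    mdadm /dev/md0 --fail /dev/sdX1
    mdadm /dev/md0 --remove /dev/sdX1
    mdadm /dev/md0 --add /dev/sdY1
    # the array rebuilds onto sdY1 and is degraded until that finishes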

Another way would be to add in the new disk and grow the array, wait for
the rebuild, then fail out and remove the old disk, shrink the array, and
again wait for the rebuild. However once again you lose (some of) your
redundancy from the time you've failed the old disk till the rebuild
completes; again, potentially many hours. Unless there's some way of
telling md to shrink the array off the unwanted device before removing it,
and md is smart enough to retain full redundancy during the process?
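
For a raid1, say, I imagine that would look something like this (untested,
device names invented):

    mdadm /dev/md0 --add /dev/sdY1
    mdadm --grow /dev/md0 --raid-devices=3     # temporary three-way mirror
    # wait for the resync, then drop the old disk
    mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1
    mdadm --grow /dev/md0 --raid-devices=2

(For a plain raid1 that would in fact keep full redundancy throughout; it's
the raid5/6 case, where both the grow and the shrink mean a full reshape,
that worries me.)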

Another way might be to fail out the old drive, create a raid-1 between
the old and new drives whilst doing some dance with dd and the original
raid metadata and the new raid-1 metadata to make it appear the raid-1 was
the original raid member, "re-add" the raid-1 device to the original raid,
wait for the rebuild of both the raid-1 and the original raid, fail out
the raid-1, do a reverse dd dance to make the new disk look like a primary
member of the original raid, then "re-add" the new disk into the original
raid. This would mean you only lose redundancy for the windows where the
original raid has a failed-out member, i.e. seconds, if done properly.

Is this method possible and, if sufficient care is taken, sensible?

If it's possible, is this something that could or should be built into md
to automate the process and perhaps reduce or completely eliminate the
window of reduced redundancy?

...or, indeed, is this something that's already built into md and I need
to do some significant self-flagellation with the clue bat?

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04  4:14 Safe disk replace Chris Dunlop
@ 2012-09-04 10:28 ` David Brown
  2012-09-04 12:26   ` Mikael Abrahamsson
  0 siblings, 1 reply; 13+ messages in thread
From: David Brown @ 2012-09-04 10:28 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-raid

On 04/09/2012 06:14, Chris Dunlop wrote:
> G'day,
>
> What is the best way to replace a fully-functional or minimally-failing
> (e.g. occasional bad sectors) disk in a live array whilst maintaining as
> much redundancy as possible during the process?
>
> It seems the standard way to replace a disk is to fail out the unwanted
> disk, add the new disk, then wait for the array to rebuild. However this
> means during the rebuild you've lost some or all of your redundancy,
> depending on the raid level of the array. This can be a significant issue,
> e.g. if you're replacing a 4 TB disk it could mean 10 to 20 hours or much
> more of heightened risk, depending on the rebuild bandwidth available.
>
> Another way would be to add in the new disk and grow the array, wait for
> the rebuild, then fail out and remove the old disk, shrink the array, and
> again wait for the rebuild. However once again you lose (some of) your
> redundancy from the time you've failed the old disk till the rebuild
> completes; again, potentially many hours. Unless there's some way of
> telling md to shrink the array off the unwanted device before removing it,
> and md is smart enough to retain full redundancy during the process?
>
> Another way might be to fail out the old drive, create a raid-1 between
> the old and new drives whilst doing some dance with dd and the original
> raid metadata and the new raid-1 metadata to make it appear the raid-1 was
> the original raid member, "re-add" the raid-1 device to the original raid,
> wait for the rebuild of both the raid-1 and the original raid, fail out
> the raid-1, do a reverse dd dance to make the new disk look like a primary
> member of the original raid, then "re-add" the new disk into the original
> raid. This would mean you only lose redundancy for the windows where the
> original raid has a failed-out member, i.e. seconds, if done properly.
>
> Is this method possible and, if sufficient care is taken, sensible?
>
> If it's possible, is this something that could or should be built into md
> to automate the process and perhaps reduce or completely eliminate the
> window of reduced redundancy?
>
> ...or, indeed, is this something that's already built into md and I need
> to do some significant self-flagellation with the clue bat?
>
> Cheers,
>
> Chris.

It looks like you've thought through most of the possibilities here.

I don't think there is a "best" way to do this sort of replacement, as
it depends a bit on the circumstances - what sort of array you are
starting from, whether you have a spare disk slot, and so on.

The "raid1" copy you mention will one day be possible with "hot replace"
<http://neil.brown.name/blog/20110216044002#2>

I don't know how far along this idea is at the moment.

I know that it is possible to get much of that effect today if you use 
single-disk raid1 "mirrors" as the basis for raid5/6/whatever instead of 
building it directly on disks or partitions.  Then it would be easy to 
add a new disk to a "mirror", wait for it to sync, then remove the old disk.
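
Something like this, I think (untested, device names invented):

    # one-disk raid1 per member, then raid5 across the raid1s
    mdadm --create /dev/md10 --level=1 --raid-devices=1 --force /dev/sda1
    mdadm --create /dev/md11 --level=1 --raid-devices=1 --force /dev/sdb1
    # ... and so on for md12, md13 ...
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/md10 /dev/md11 /dev/md12 /dev/md13

    # later, to replace the disk behind md10 without losing redundancy:
    mdadm /dev/md10 --add /dev/sdx1
    mdadm --grow /dev/md10 --raid-devices=2      # mirror old onto new
    # wait for the sync, then
    mdadm /dev/md10 --fail /dev/sda1 --remove /dev/sda1
    mdadm --grow /dev/md10 --raid-devices=1 --force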

It is, I believe, possible to turn an existing drive/partition into part 
of a raid1 without metadata, but I am not sure of the details.  But that 
could be used to deal with an existing raid5/6 array.  First, make sure 
you have a write-intent bitmap.  Then remove a disk, make a no-metadata 
raid1 with it, then put it back into the array.  There are a lot of 
details to get right here, so you would want to practice it first!
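
Very roughly - and treat this as a sketch rather than a recipe; I have not
tested it, the device names are invented, and I am not even certain that
--build accepts "missing" the way --create does:

    mdadm --grow /dev/md0 --bitmap=internal      # write-intent bitmap first
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    mdadm --build /dev/md9 --level=1 --raid-devices=2 /dev/sdc1 missing
    mdadm /dev/md0 --re-add /dev/md9             # bitmap keeps the resync short
    mdadm /dev/md9 --add /dev/sdx1               # raid1 copies the old disk onto the new

Once md9 is in sync you would fail it out of md0, stop it, and re-add the
new disk to md0 directly; since the raid1 has no metadata of its own, the
new disk should already carry the original array's superblock.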

Bad sectors or read failures in the original disk could quickly cause 
complications here.


If you have a raid5 array and want to replace a disk safely, it is 
relatively easy.  Get another extra disk (and this can be a USB disk, a 
networked disk, etc., if you don't mind the slower speed) and grow your 
array to an asymmetric raid6 (layout "left-symmetric-6", I believe). 
This puts the extra parity on the extra disk, and does not change the 
layout of the rest of the array.  Once the grow/rebuild is complete, you 
can remove the old disk, replace it with the new one, and re-sync. 
Convert back to normal raid5 (which does not need to change the rest of 
the array), and remove the extra disk.
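
If I remember the --grow options correctly, then for a four-disk raid5 it
would be roughly this (untested, device names invented - do check the
mdadm man page before trying it):

    mdadm /dev/md0 --add /dev/sde1               # the temporary extra disk
    mdadm --grow /dev/md0 --level=6 --raid-devices=5 --layout=left-symmetric-6
    # wait for the rebuild; all the Q parity lands on the extra disk

    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1    # the suspect disk
    mdadm /dev/md0 --add /dev/sdf1                        # its replacement
    # wait for the resync - the array stays single-failure tolerant throughout

    mdadm --grow /dev/md0 --level=5 --raid-devices=4      # back to raid5
    mdadm /dev/md0 --remove /dev/sde1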

Again, practice this before doing it on live disks - and make sure you 
have a good backup.  Raid can help protect data from disk errors, but 
not from human errors!



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04 10:28 ` David Brown
@ 2012-09-04 12:26   ` Mikael Abrahamsson
  2012-09-04 15:33     ` Robin Hill
  0 siblings, 1 reply; 13+ messages in thread
From: Mikael Abrahamsson @ 2012-09-04 12:26 UTC (permalink / raw)
  To: David Brown; +Cc: Chris Dunlop, linux-raid

On Tue, 4 Sep 2012, David Brown wrote:

> The "raid1" copy you mention will one day be possible with "hot replace"
> <http://neil.brown.name/blog/20110216044002#2>
>
> I don't know how far along this idea is at the moment.

https://lwn.net/Articles/465048/

"hot-replace support for RAID4/5/6:

In order to activate hot-replace you need to mark the device as 
'replaceable'. This happens automatically when a write error is recorded 
in a bad-block log (if you happen to have one).

It can be achieved manually by
    echo replaceable > /sys/block/mdXX/md/dev-YYY/state

This makes YYY, in XX, replaceable."

I don't know if it actually made it into 3.2, I believe I saw somewhere 
that it was available for 3.3, but Neil Brown should know more.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04 12:26   ` Mikael Abrahamsson
@ 2012-09-04 15:33     ` Robin Hill
  2012-09-04 16:34       ` Mikael Abrahamsson
                         ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Robin Hill @ 2012-09-04 15:33 UTC (permalink / raw)
  To: linux-raid

On Tue Sep 04, 2012 at 02:26:24PM +0200, Mikael Abrahamsson wrote:

> On Tue, 4 Sep 2012, David Brown wrote:
> 
> > The "raid1" copy you mention will one day be possible with "hot replace"
> > <http://neil.brown.name/blog/20110216044002#2>
> >
> > I don't know how far along this idea is at the moment.
> 
> https://lwn.net/Articles/465048/
> 
> "hot-replace support for RAID4/5/6:
> 
> In order to activate hot-replace you need to mark the device as 
> 'replaceable'. This happens automatically when a write error is recorded 
> in a bad-block log (if you happen to have one).
> 
> It can be achieved manually by
>     echo replaceable > /sys/block/mdXX/md/dev-YYY/state
> 
> This makes YYY, in XX, replaceable."
> 
> I don't know if it actually made it into 3.2, I believe I saw somewhere 
> that it was available for 3.3, but Neil Brown should know more.
> 
I'm currently upgrading my RAID-6 arrays via hot-replacement. The
process I followed (to replace device YYY in array mdXX) is:
    - add the new disk to the array as a spare
    - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state

That kicks off the recovery (a straight disk-to-disk copy from YYY to
the new disk). After the rebuild is complete, YYY gets failed in the
array, so can be safely removed:
    - mdadm -r /dev/mdXX /dev/mdYYY

That's worked fine so far, and looks to run at the single disk write
speed. There were no errors on the old disks though, so I've not seen
how that gets handled (it _should_ just do a parity-based recovery from
the remaining disks and continue).
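
Spelled out with made-up names (array md0, old disk sdc1, new disk sdh1),
the whole sequence is just:

    mdadm /dev/md0 --add /dev/sdh1
    echo want_replacement > /sys/block/md0/md/dev-sdc1/state
    cat /proc/mdstat     # the new disk shows an (R) flag while it's copied
    # once the copy finishes, sdc1 is marked faulty and can be removed:
    mdadm /dev/md0 --remove /dev/sdc1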

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04 15:33     ` Robin Hill
@ 2012-09-04 16:34       ` Mikael Abrahamsson
  2012-09-04 17:12         ` Robin Hill
  2012-09-05 14:25       ` John Drescher
  2012-09-06  3:28       ` Chris Dunlop
  2 siblings, 1 reply; 13+ messages in thread
From: Mikael Abrahamsson @ 2012-09-04 16:34 UTC (permalink / raw)
  To: Robin Hill; +Cc: linux-raid

On Tue, 4 Sep 2012, Robin Hill wrote:

> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> process I followed (to replace device YYY in array mdXX) is:
>    - add the new disk to the array as a spare
>    - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state

What kernel version are you using?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04 16:34       ` Mikael Abrahamsson
@ 2012-09-04 17:12         ` Robin Hill
  0 siblings, 0 replies; 13+ messages in thread
From: Robin Hill @ 2012-09-04 17:12 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Robin Hill, linux-raid

On Tue Sep 04, 2012 at 06:34:39PM +0200, Mikael Abrahamsson wrote:

> On Tue, 4 Sep 2012, Robin Hill wrote:
> 
> > I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> > process I followed (to replace device YYY in array mdXX) is:
> >    - add the new disk to the array as a spare
> >    - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
> 
> What kernel version are you using?
> 
3.4.9 at the moment. A quick search on the list suggests that this
functionality went in at 3.3 though.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04 15:33     ` Robin Hill
  2012-09-04 16:34       ` Mikael Abrahamsson
@ 2012-09-05 14:25       ` John Drescher
  2012-09-05 19:35         ` John Drescher
  2012-09-06  3:28       ` Chris Dunlop
  2 siblings, 1 reply; 13+ messages in thread
From: John Drescher @ 2012-09-05 14:25 UTC (permalink / raw)
  To: linux-raid

> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> process I followed (to replace device YYY in array mdXX) is:
>     - add the new disk to the array as a spare
>     - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
>
> That kicks off the recovery (a straight disk-to-disk copy from YYY to
> the new disk). After the rebuild is complete, YYY gets failed in the
> array, so can be safely removed:
>     - mdadm -r /dev/mdXX /dev/mdYYY
>

Thanks for the info. I wanted this feature for years at work..

I am testing this now on my test box. Here I have 13 x 250GB SATA 1
drives. Yes these are 8+ years old..

md1 : active raid6 sda2[13](R) sdk2[17] sdj2[18] sdf2[16] sdm2[19]
sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
      2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
[12/12] [UUUUUUUUUUUU]
      [>....................]  recovery =  3.4% (8401408/243147776)
finish=75.9min speed=51540K/sec


Speeds are faster than failing a drive, but I would do this more for
the lower chance of failure than for the improved performance:

md1 : active raid6 sdk2[17] sdj2[18] sdf2[16] sdm2[19] sdl2[14]
sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
      2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
[12/11] [_UUUUUUUUUUU]
      [>....................]  recovery =  1.2% (3134952/243147776)
finish=100.1min speed=39954K/sec
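
(Worth noting that both kinds of rebuild are throttled by the usual md
settings:

    cat /proc/sys/dev/raid/speed_limit_min
    cat /proc/sys/dev/raid/speed_limit_max

so the raw speed numbers depend on those as well.)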

John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-05 14:25       ` John Drescher
@ 2012-09-05 19:35         ` John Drescher
  2012-09-05 19:46           ` John Drescher
  2012-09-05 20:32           ` Robin Hill
  0 siblings, 2 replies; 13+ messages in thread
From: John Drescher @ 2012-09-05 19:35 UTC (permalink / raw)
  To: linux-raid

On Wed, Sep 5, 2012 at 10:25 AM, John Drescher <drescherjm@gmail.com> wrote:
>> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
>> process I followed (to replace device YYY in array mdXX) is:
>>     - add the new disk to the array as a spare
>>     - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
>>
>> That kicks off the recovery (a straight disk-to-disk copy from YYY to
>> the new disk). After the rebuild is complete, YYY gets failed in the
>> array, so can be safely removed:
>>     - mdadm -r /dev/mdXX /dev/mdYYY
>>
>
> Thanks for the info. I wanted this feature for years at work..
>
> I am testing this now on my test box. Here I have 13 x 250GB SATA 1
> drives. Yes these are 8+ years old..
>
> md1 : active raid6 sda2[13](R) sdk2[17] sdj2[18] sdf2[16] sdm2[19]
> sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [12/12] [UUUUUUUUUUUU]
>       [>....................]  recovery =  3.4% (8401408/243147776)
> finish=75.9min speed=51540K/sec
>
>
> Speeds are faster than failing a drive, but I would do this more for
> the lower chance of failure than for the improved performance:
>
> md1 : active raid6 sdk2[17] sdj2[18] sdf2[16] sdm2[19] sdl2[14]
> sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [12/11] [_UUUUUUUUUUU]
>       [>....................]  recovery =  1.2% (3134952/243147776)
> finish=100.1min speed=39954K/sec
>

I found something interesting. I issued want_replacement without spares.

localhost md # echo want_replacement > dev-sdd2/state
localhost md # cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
[linear] [multipath]
md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
      1048512 blocks [10/10] [UUUUUUUUUU]

md1 : active raid6 sdb2[20] sdk2[17] sda2[13] sdj2[18] sdf2[16]
sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
sdc2[1](F)
      2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
[12/11] [UUUUUUUUUUUU]

Then I added the failed disk from a previous round as a spare.

localhost md # mdadm --manage /dev/md1 --remove /dev/sdc2
mdadm: hot removed /dev/sdc2 from /dev/md1
localhost md # mdadm --zero-superblock /dev/sdc2
localhost md # mdadm --manage /dev/md1 --add /dev/sdc2
mdadm: added /dev/sdc2

localhost md # cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
[linear] [multipath]
md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
      1048512 blocks [10/10] [UUUUUUUUUU]

md1 : active raid6 sdc2[22](R) sdb2[20] sdk2[17] sda2[13] sdj2[18]
sdf2[16] sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
      2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
[12/11] [UUUUUUUUUUUU]
      [>....................]  recovery =  0.6% (1592256/243147776)
finish=119.2min speed=33746K/sec


Now it's taking much longer, and it says 12/11 instead of 12/12.

John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-05 19:35         ` John Drescher
@ 2012-09-05 19:46           ` John Drescher
  2012-09-05 20:32           ` Robin Hill
  1 sibling, 0 replies; 13+ messages in thread
From: John Drescher @ 2012-09-05 19:46 UTC (permalink / raw)
  To: linux-raid

On Wed, Sep 5, 2012 at 3:35 PM, John Drescher <drescherjm@gmail.com> wrote:
> On Wed, Sep 5, 2012 at 10:25 AM, John Drescher <drescherjm@gmail.com> wrote:
>>> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
>>> process I followed (to replace device YYY in array mdXX) is:
>>>     - add the new disk to the array as a spare
>>>     - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
>>>
>>> That kicks off the recovery (a straight disk-to-disk copy from YYY to
>>> the new disk). After the rebuild is complete, YYY gets failed in the
>>> array, so can be safely removed:
>>>     - mdadm -r /dev/mdXX /dev/mdYYY
>>>
>>
>> Thanks for the info. I wanted this feature for years at work..
>>
>> I am testing this now on my test box. Here I have 13 x 250GB SATA 1
>> drives. Yes these are 8+ years old..
>>
>> md1 : active raid6 sda2[13](R) sdk2[17] sdj2[18] sdf2[16] sdm2[19]
>> sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
>>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [12/12] [UUUUUUUUUUUU]
>>       [>....................]  recovery =  3.4% (8401408/243147776)
>> finish=75.9min speed=51540K/sec
>>
>>
>> Speeds are faster than failing a drive, but I would do this more for
>> the lower chance of failure than for the improved performance:
>>
>> md1 : active raid6 sdk2[17] sdj2[18] sdf2[16] sdm2[19] sdl2[14]
>> sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
>>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [12/11] [_UUUUUUUUUUU]
>>       [>....................]  recovery =  1.2% (3134952/243147776)
>> finish=100.1min speed=39954K/sec
>>
>
> I found something interesting. I issued want_replacement without spares.
>
> localhost md # echo want_replacement > dev-sdd2/state
> localhost md # cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> [linear] [multipath]
> md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
>       1048512 blocks [10/10] [UUUUUUUUUU]
>
> md1 : active raid6 sdb2[20] sdk2[17] sda2[13] sdj2[18] sdf2[16]
> sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
> sdc2[1](F)
>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [12/11] [UUUUUUUUUUUU]
>
> Then I added the failed disk from a previous round as a spare.
>
> localhost md # mdadm --manage /dev/md1 --remove /dev/sdc2
> mdadm: hot removed /dev/sdc2 from /dev/md1
> localhost md # mdadm --zero-superblock /dev/sdc2
> localhost md # mdadm --manage /dev/md1 --add /dev/sdc2
> mdadm: added /dev/sdc2
>
> localhost md # cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> [linear] [multipath]
> md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
>       1048512 blocks [10/10] [UUUUUUUUUU]
>
> md1 : active raid6 sdc2[22](R) sdb2[20] sdk2[17] sda2[13] sdj2[18]
> sdf2[16] sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [12/11] [UUUUUUUUUUUU]
>       [>....................]  recovery =  0.6% (1592256/243147776)
> finish=119.2min speed=33746K/sec
>
>
> Now it's taking much longer, and it says 12/11 instead of 12/12.
>
I am not sure why it is taking longer this time. However, from the
drive activity lights on the LSI SAS cards it appears that only 2
drives are active in the copy, so the raid appears to be doing the
correct thing apart from the minor difference of 12/11 versus 12/12.
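
(The same discrepancy is visible outside /proc/mdstat too, e.g. via:

    cat /sys/block/md1/md/degraded    # the 12/11 means md counts 1 device as missing
    mdadm --detail /dev/md1

for anyone who wants to poke at it further.)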

John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-05 19:35         ` John Drescher
  2012-09-05 19:46           ` John Drescher
@ 2012-09-05 20:32           ` Robin Hill
  2012-09-06 12:59             ` John Drescher
  2012-09-10  1:01             ` NeilBrown
  1 sibling, 2 replies; 13+ messages in thread
From: Robin Hill @ 2012-09-05 20:32 UTC (permalink / raw)
  To: linux-raid

On Wed Sep 05, 2012 at 03:35:29PM -0400, John Drescher wrote:

> On Wed, Sep 5, 2012 at 10:25 AM, John Drescher <drescherjm@gmail.com> wrote:
> >> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> >> process I followed (to replace device YYY in array mdXX) is:
> >>     - add the new disk to the array as a spare
> >>     - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
> >>
> >> That kicks off the recovery (a straight disk-to-disk copy from YYY to
> >> the new disk). After the rebuild is complete, YYY gets failed in the
> >> array, so can be safely removed:
> >>     - mdadm -r /dev/mdXX /dev/mdYYY
> >>
> >
> > Thanks for the info. I wanted this feature for years at work..
> >
> > I am testing this now on my test box. Here I have 13 x 250GB SATA 1
> > drives. Yes these are 8+ years old..
> >
> > md1 : active raid6 sda2[13](R) sdk2[17] sdj2[18] sdf2[16] sdm2[19]
> > sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
> >       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > [12/12] [UUUUUUUUUUUU]
> >       [>....................]  recovery =  3.4% (8401408/243147776)
> > finish=75.9min speed=51540K/sec
> >
> >
> > Speeds are faster than failing a drive, but I would do this more for
> > the lower chance of failure than for the improved performance:
> >
> > md1 : active raid6 sdk2[17] sdj2[18] sdf2[16] sdm2[19] sdl2[14]
> > sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
> >       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > [12/11] [_UUUUUUUUUUU]
> >       [>....................]  recovery =  1.2% (3134952/243147776)
> > finish=100.1min speed=39954K/sec
> >
> 
> I found something interesting. I issued want_replacement without spares.
> 
> localhost md # echo want_replacement > dev-sdd2/state
> localhost md # cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> [linear] [multipath]
> md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
>       1048512 blocks [10/10] [UUUUUUUUUU]
> 
> md1 : active raid6 sdb2[20] sdk2[17] sda2[13] sdj2[18] sdf2[16]
> sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
> sdc2[1](F)
>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [12/11] [UUUUUUUUUUUU]
>
> Then I added the failed disk from a previous round as a spare.
> 
> localhost md # mdadm --manage /dev/md1 --remove /dev/sdc2
> mdadm: hot removed /dev/sdc2 from /dev/md1
> localhost md # mdadm --zero-superblock /dev/sdc2
> localhost md # mdadm --manage /dev/md1 --add /dev/sdc2
> mdadm: added /dev/sdc2
> 
> localhost md # cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> [linear] [multipath]
> md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
>       1048512 blocks [10/10] [UUUUUUUUUU]
> 
> md1 : active raid6 sdc2[22](R) sdb2[20] sdk2[17] sda2[13] sdj2[18]
> sdf2[16] sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
>       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [12/11] [UUUUUUUUUUUU]
>       [>....................]  recovery =  0.6% (1592256/243147776)
> finish=119.2min speed=33746K/sec
> 
> 
> Now its taking much longer and it says 12/11 instead of 12/12.
> 
The problem's actually at the point it finishes the recovery. When it
fails the replaced disk, it treats it as a failure of an in-array disk.
You get the failure email and the array shows as degraded, even though
it has the full number of working devices. Your 12/11 would have shown
even before you started doing the second replacement. It doesn't seem to
cause any problems in use though, and it gets corrected after a reboot.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-04 15:33     ` Robin Hill
  2012-09-04 16:34       ` Mikael Abrahamsson
  2012-09-05 14:25       ` John Drescher
@ 2012-09-06  3:28       ` Chris Dunlop
  2 siblings, 0 replies; 13+ messages in thread
From: Chris Dunlop @ 2012-09-06  3:28 UTC (permalink / raw)
  To: linux-raid

On Tue, Sep 04, 2012 at 04:33:42PM +0100, Robin Hill wrote:
> On Tue Sep 04, 2012 at 02:26:24PM +0200, Mikael Abrahamsson wrote:
>> On Tue, 4 Sep 2012, David Brown wrote:
>> 
>>> The "raid1" copy you mention will one day be possible with "hot replace"
>>> <http://neil.brown.name/blog/20110216044002#2>
>>>
>>> I don't know how far along this idea is at the moment.
>> 
>> https://lwn.net/Articles/465048/
>> 
>> "hot-replace support for RAID4/5/6:
>> 
>> In order to activate hot-replace you need to mark the device as 
>> 'replaceable'. This happens automatically when a write error is recorded 
>> in a bad-block log (if you happen to have one).
>> 
>> It can be achieved manually by
>>     echo replaceable > /sys/block/mdXX/md/dev-YYY/state
>> 
>> This makes YYY, in XX, replaceable."
>> 
>> I don't know if it actually made it into 3.2, I believe I saw somewhere 
>> that it was available for 3.3, but Neil Brown should know more.
> 
> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> process I followed (to replace device YYY in array mdXX) is:
>     - add the new disk to the array as a spare
>     - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
> 
> That kicks off the recovery (a straight disk-to-disk copy from YYY to
> the new disk). After the rebuild is complete, YYY gets failed in the
> array, so can be safely removed:
>     - mdadm -r /dev/mdXX /dev/mdYYY
> 
> That's worked fine so far, and looks to run at the single disk write
> speed. There were no errors on the old disks though, so I've not seen
> how that gets handled (it _should_ just do a parity-based recovery from
> the remaining disks and continue).

Thanks all, this is exactly what I was looking for!

Cheers,

Chris

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-05 20:32           ` Robin Hill
@ 2012-09-06 12:59             ` John Drescher
  2012-09-10  1:01             ` NeilBrown
  1 sibling, 0 replies; 13+ messages in thread
From: John Drescher @ 2012-09-06 12:59 UTC (permalink / raw)
  To: linux-raid

> The problem's actually at the point it finishes the recovery. When it
> fails the replaced disk, it treats it as a failure of an in-array disk.
> You get the failure email and the array shows as degraded, even though
> it has the full number of working devices. Your 12/11 would have shown
> even before you started doing the second replacement. It doesn't seem to
> cause any problems in use though, and it gets corrected after a reboot.
>

Thanks. You are correct. It did show 12/11 before the replacement
happened and even after it finished.

John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Safe disk replace
  2012-09-05 20:32           ` Robin Hill
  2012-09-06 12:59             ` John Drescher
@ 2012-09-10  1:01             ` NeilBrown
  1 sibling, 0 replies; 13+ messages in thread
From: NeilBrown @ 2012-09-10  1:01 UTC (permalink / raw)
  To: Robin Hill; +Cc: linux-raid

On Wed, 5 Sep 2012 21:32:03 +0100 Robin Hill <robin@robinhill.me.uk> wrote:

> On Wed Sep 05, 2012 at 03:35:29PM -0400, John Drescher wrote:
> 
> > On Wed, Sep 5, 2012 at 10:25 AM, John Drescher <drescherjm@gmail.com> wrote:
> > >> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> > >> process I followed (to replace device YYY in array mdXX) is:
> > >>     - add the new disk to the array as a spare
> > >>     - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
> > >>
> > >> That kicks off the recovery (a straight disk-to-disk copy from YYY to
> > >> the new disk). After the rebuild is complete, YYY gets failed in the
> > >> array, so can be safely removed:
> > >>     - mdadm -r /dev/mdXX /dev/mdYYY
> > >>
> > >
> > > Thanks for the info. I wanted this feature for years at work..
> > >
> > > I am testing this now on my test box. Here I have 13 x 250GB SATA 1
> > > drives. Yes these are 8+ years old..
> > >
> > > md1 : active raid6 sda2[13](R) sdk2[17] sdj2[18] sdf2[16] sdm2[19]
> > > sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
> > >       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > > [12/12] [UUUUUUUUUUUU]
> > >       [>....................]  recovery =  3.4% (8401408/243147776)
> > > finish=75.9min speed=51540K/sec
> > >
> > >
> > > Speeds are faster than failing a drive, but I would do this more for
> > > the lower chance of failure than for the improved performance:
> > >
> > > md1 : active raid6 sdk2[17] sdj2[18] sdf2[16] sdm2[19] sdl2[14]
> > > sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
> > >       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > > [12/11] [_UUUUUUUUUUU]
> > >       [>....................]  recovery =  1.2% (3134952/243147776)
> > > finish=100.1min speed=39954K/sec
> > >
> > 
> > I found something interesting. I issued want_replacement without spares.
> > 
> > localhost md # echo want_replacement > dev-sdd2/state
> > localhost md # cat /proc/mdstat
> > Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> > [linear] [multipath]
> > md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> > sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
> >       1048512 blocks [10/10] [UUUUUUUUUU]
> > 
> > md1 : active raid6 sdb2[20] sdk2[17] sda2[13] sdj2[18] sdf2[16]
> > sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
> > sdc2[1](F)
> >       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > [12/11] [UUUUUUUUUUUU]
> >
> > Then I added the failed disk from a previous round as a spare.
> > 
> > localhost md # mdadm --manage /dev/md1 --remove /dev/sdc2
> > mdadm: hot removed /dev/sdc2 from /dev/md1
> > localhost md # mdadm --zero-superblock /dev/sdc2
> > localhost md # mdadm --manage /dev/md1 --add /dev/sdc2
> > mdadm: added /dev/sdc2
> > 
> > localhost md # cat /proc/mdstat
> > Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> > [linear] [multipath]
> > md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> > sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
> >       1048512 blocks [10/10] [UUUUUUUUUU]
> > 
> > md1 : active raid6 sdc2[22](R) sdb2[20] sdk2[17] sda2[13] sdj2[18]
> > sdf2[16] sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
> >       2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > [12/11] [UUUUUUUUUUUU]
> >       [>....................]  recovery =  0.6% (1592256/243147776)
> > finish=119.2min speed=33746K/sec
> > 
> > 
> > > Now it's taking much longer, and it says 12/11 instead of 12/12.
> > 
> The problem's actually at the point it finishes the recovery. When it
> fails the replaced disk, it treats it as a failure of an in-array disk.
> You get the failure email and the array shows as degraded, even though
> it has the full number of working devices. Your 12/11 would have shown
> even before you started doing the second replacement. It doesn't seem to
> cause any problems in use though, and it gets corrected after a reboot.
> 
> Cheers,
>     Robin

Thanks for the bug report.
This patch should fix it.

NeilBrown

From d72d7b15e100fc0f9ac95999f39360f44e7b875d Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 10 Sep 2012 11:00:32 +1000
Subject: [PATCH] md/raid5: fix calculate of 'degraded' when a replacement
 becomes active.

When a replacement device becomes active, we mark the device that it
replaces as 'faulty' so that it can subsequently get removed.
However 'calc_degraded' only pays attention to the primary device, not
the replacement, so the array appears to become degraded, which is
wrong.

So teach 'calc_degraded' to consider any replacement if a primary
device is faulty.

Reported-by: Robin Hill <robin@robinhill.me.uk>
Reported-by: John Drescher <drescherjm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7c8151a..919327a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -419,6 +419,8 @@ static int calc_degraded(struct r5conf *conf)
 	degraded = 0;
 	for (i = 0; i < conf->previous_raid_disks; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(Faulty, &rdev->flags))
+			rdev = rcu_dereference(conf->disks[i].replacement);
 		if (!rdev || test_bit(Faulty, &rdev->flags))
 			degraded++;
 		else if (test_bit(In_sync, &rdev->flags))
@@ -443,6 +445,8 @@ static int calc_degraded(struct r5conf *conf)
 	degraded2 = 0;
 	for (i = 0; i < conf->raid_disks; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(Faulty, &rdev->flags))
+			rdev = rcu_dereference(conf->disks[i].replacement);
 		if (!rdev || test_bit(Faulty, &rdev->flags))
 			degraded2++;
 		else if (test_bit(In_sync, &rdev->flags))


^ permalink raw reply related	[flat|nested] 13+ messages in thread
