From: NeilBrown <neilb@suse.de>
To: Robin Hill <robin@robinhill.me.uk>
Cc: linux-raid@vger.kernel.org
Subject: Re: Safe disk replace
Date: Mon, 10 Sep 2012 11:01:17 +1000 [thread overview]
Message-ID: <20120910110117.61d1b204@notabene.brown> (raw)
In-Reply-To: <20120905203203.GA4391@cthulhu.home.robinhill.me.uk>
On Wed, 5 Sep 2012 21:32:03 +0100 Robin Hill <robin@robinhill.me.uk> wrote:
> On Wed Sep 05, 2012 at 03:35:29PM -0400, John Drescher wrote:
>
> > On Wed, Sep 5, 2012 at 10:25 AM, John Drescher <drescherjm@gmail.com> wrote:
> > >> I'm currently upgrading my RAID-6 arrays via hot-replacement. The
> > >> process I followed (to replace device YYY in array mdXX) is:
> > >> - add the new disk to the array as a spare
> > >> - echo want_replacement > /sys/block/mdXX/md/dev-YYY/state
> > >>
> > >> That kicks off the recovery (a straight disk-to-disk copy from YYY to
> > >> the new disk). After the rebuild is complete, YYY gets failed in the
> > >> array, so can be safely removed:
> > >> - mdadm -r /dev/mdXX /dev/YYY
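For concreteness, a minimal end-to-end sketch of that procedure, assuming
the array is /dev/md1, the disk being replaced is /dev/sdd2 and the new
disk is /dev/sde2 (all hypothetical names):

    # add the new disk to the array as a spare
    mdadm --manage /dev/md1 --add /dev/sde2
    # mark the old disk for replacement; md copies it block-for-block to the spare
    echo want_replacement > /sys/block/md1/md/dev-sdd2/state
    # watch /proc/mdstat until the copy finishes, then remove the (now failed) old disk
    mdadm --manage /dev/md1 --remove /dev/sdd2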
> > >>
> > >
> > > Thanks for the info. I have wanted this feature for years at work.
> > >
> > > I am testing this now on my test box. Here I have 13 x 250GB SATA 1
> > > drives. Yes, these are 8+ years old.
> > >
> > > md1 : active raid6 sda2[13](R) sdk2[17] sdj2[18] sdf2[16] sdm2[19]
> > > sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
> > > 2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > > [12/12] [UUUUUUUUUUUU]
> > > [>....................] recovery = 3.4% (8401408/243147776)
> > > finish=75.9min speed=51540K/sec
> > >
> > >
> > > Speeds are faster than failing a drive, but I would do this more for
> > > the lower chance of failure than for the improved performance:
> > >
> > > md1 : active raid6 sdk2[17] sdj2[18] sdf2[16] sdm2[19] sdl2[14]
> > > sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21] sdb2[20] sdc2[1]
> > > 2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > > [12/11] [_UUUUUUUUUUU]
> > > [>....................] recovery = 1.2% (3134952/243147776)
> > > finish=100.1min speed=39954K/sec
> > >
> >
> > I found something interesting. I issued want_replacement without spares.
> >
> > localhost md # echo want_replacement > dev-sdd2/state
> > localhost md # cat /proc/mdstat
> > Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> > [linear] [multipath]
> > md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> > sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
> > 1048512 blocks [10/10] [UUUUUUUUUU]
> >
> > md1 : active raid6 sdb2[20] sdk2[17] sda2[13] sdj2[18] sdf2[16]
> > sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
> > sdc2[1](F)
> > 2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > [12/11] [UUUUUUUUUUUU]
> >
> > Then I added the failed disk from a previous round as a spare.
> >
> > localhost md # mdadm --manage /dev/md1 --remove /dev/sdc2
> > mdadm: hot removed /dev/sdc2 from /dev/md1
> > localhost md # mdadm --zero-superblock /dev/sdc2
> > localhost md # mdadm --manage /dev/md1 --add /dev/sdc2
> > mdadm: added /dev/sdc2
> >
> > localhost md # cat /proc/mdstat
> > Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] [raid0]
> > [linear] [multipath]
> > md0 : active raid1 sda1[10](S) sdj1[0] sdk1[2] sdf1[11](S) sdb1[12](S)
> > sdg1[9] sdh1[8] sdl1[7] sdm1[6] sde1[5] sdd1[4] sdi1[3] sdc1[1]
> > 1048512 blocks [10/10] [UUUUUUUUUU]
> >
> > md1 : active raid6 sdc2[22](R) sdb2[20] sdk2[17] sda2[13] sdj2[18]
> > sdf2[16] sdm2[19] sdl2[14] sdi2[12] sdg2[15] sde2[5] sdd2[4] sdh2[21]
> > 2431477760 blocks super 1.2 level 6, 512k chunk, algorithm 2
> > [12/11] [UUUUUUUUUUUU]
> > [>....................] recovery = 0.6% (1592256/243147776)
> > finish=119.2min speed=33746K/sec
> >
> >
> > Now it's taking much longer, and it says 12/11 instead of 12/12.
> >
> The problem's actually at the point it finishes the recovery. When it
> fails the replaced disk, it treats it as a failure of an in-array disk.
> You get the failure email and the array shows as degraded, even though
> it has the full number of working devices. Your 12/11 would have shown
> even before you started doing the second replacement. It doesn't seem to
> cause any problems in use though, and it gets corrected after a reboot.
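One way to confirm that the degraded count is only cosmetic here is to
compare it against the member states; a quick check, assuming the array
is /dev/md1 (hypothetical name):

    cat /sys/block/md1/md/degraded   # reports 1 after the replacement, despite 12 working devices
    mdadm --detail /dev/md1          # yet every member is still listed as active sync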
>
> Cheers,
> Robin
Thanks for the bug report.
This patch should fix it.
NeilBrown
From d72d7b15e100fc0f9ac95999f39360f44e7b875d Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 10 Sep 2012 11:00:32 +1000
Subject: [PATCH] md/raid5: fix calculation of 'degraded' when a replacement
becomes active.
When a replacement device becomes active, we mark the device that it
replaces as 'faulty' so that it can subsequently get removed.
However 'calc_degraded' only pays attention to the primary device, not
the replacement, so the array appears to become degraded, which is
wrong.
So teach 'calc_degraded' to consider any replacement if a primary
device is faulty.
Reported-by: Robin Hill <robin@robinhill.me.uk>
Reported-by: John Drescher <drescherjm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7c8151a..919327a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -419,6 +419,8 @@ static int calc_degraded(struct r5conf *conf)
 	degraded = 0;
 	for (i = 0; i < conf->previous_raid_disks; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(Faulty, &rdev->flags))
+			rdev = rcu_dereference(conf->disks[i].replacement);
 		if (!rdev || test_bit(Faulty, &rdev->flags))
 			degraded++;
 		else if (test_bit(In_sync, &rdev->flags))
@@ -443,6 +445,8 @@ static int calc_degraded(struct r5conf *conf)
 	degraded2 = 0;
 	for (i = 0; i < conf->raid_disks; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(Faulty, &rdev->flags))
+			rdev = rcu_dereference(conf->disks[i].replacement);
 		if (!rdev || test_bit(Faulty, &rdev->flags))
 			degraded2++;
 		else if (test_bit(In_sync, &rdev->flags))
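With this patch applied, a completed replacement should leave the array
looking healthy again; a quick sanity check, reusing the hypothetical
/dev/md1 from the sketch above:

    cat /proc/mdstat                 # expect [12/12] [UUUUUUUUUUUU] once the copy finishes
    cat /sys/block/md1/md/degraded   # expect 0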