Re: [md PATCH 00/16] hot-replace support for RAID4/5/6

From: NeilBrown <neilb@suse.de>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
Date: Thu, 15 Dec 2011 17:18:50 +1100
Message-ID: <20111215171850.335da016@notabene.brown>
In-Reply-To: <CAA9_cmfWEnLnvrShVp3BL+WsEApS+nsWBZfao-qM2_habqFCtg@mail.gmail.com>

On Wed, 14 Dec 2011 14:18:51 -0800 Dan Williams <dan.j.williams@intel.com>
wrote:

> On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@suse.de> wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6.  This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> >   git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
> 
> Some belated comments based on the commit ids at the time:
> 
> 88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL".
> 9c22832 md: take a reference to mddev during sysfs access.
> a7d6ae4 md: remove test for duplicate device when setting slot number.
> 6deecf2 md: change hot_remove_disk to take an rdev rather than a number.
> 
> last 4 reviewed-by.

Thanks.  I've annotated the two that haven't gone upstream yet.


> 
> f248f8c md: create externally visible flags for supporting hot-replace.
> 
> 'replaceable' strikes me as a confusing name: all devices are
> nominally "replaceable", but whether you want a device to be actively
> replaced is a different consideration.  What about "incumbent", to mark
> the disk as currently holding a position we want it to vacate and to
> remove any potential confusion with 'replacement'?

Fair point.  I had wondered whether to drop the flag and just use the
"write_error" flag.  However, the meaning is slightly different.

I don't really like "incumbent" as it gives no indication that there is a
desire to replace the device.  Maybe "want_replacement" ??

> 
> ce8fd05 md/raid5: allow each slot to have an extra replacement device
> fd7557d md/raid5: raid5.h cleanup
> 15e9a58 md/raid5: remove redundant bio initialisations.
> 
> last 3 reviewed-by.

Thanks.


> 
> 37aebb5 md/raid5: preferentially read from replacement device if possible.
> 
> +                       /* This flag does not apply to '.replacement'
> +                        * only to .rdev, so make sure to check that*/
> +                       struct md_rdev *rdev2 = rcu_dereference(
> +                               conf->disks[i].rdev);
> +                       if (rdev2 == rdev)
> +                               clear_bit(R5_Insync, &dev->flags);
> +                       if (!test_bit(Faulty, &rdev2->flags)) {
> 
> can't rdev2 be NULL here?

Uhm... probably.  I've added a test for rdev2 like I have in the "MadeGood"
case below.

Thanks.
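The guard described above can be sketched as a small userspace model: before testing flags on the slot's primary `.rdev`, check that the pointer is non-NULL, because the primary may already have been removed. The `struct rdev_model` type, the bit masks and `write_error_wants_badblocks()` are hypothetical stand-ins invented for this sketch. The kernel uses `test_bit()`/`clear_bit()` on `unsigned long` bitmaps and obtains `rdev2` through `rcu_dereference()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: plain bit masks instead of kernel bitops. */
enum { FAULTY = 1u << 0 };      /* models the rdev 'Faulty' bit      */
enum { R5_INSYNC = 1u << 0 };   /* models the stripe 'R5_Insync' bit */

struct rdev_model {
	unsigned flags;
};

/*
 * 'rdev' is the device the stripe used (possibly the replacement);
 * 'rdev2' is the slot's primary .rdev, which may be NULL if it has been
 * removed.  Returns true when bad-block handling should be done on rdev2.
 */
static bool write_error_wants_badblocks(const struct rdev_model *rdev,
					const struct rdev_model *rdev2,
					unsigned *dev_flags)
{
	/* The flag applies only to .rdev, never to .replacement. */
	if (rdev2 == rdev)
		*dev_flags &= ~R5_INSYNC;
	/* Check for a removed primary before touching its flags. */
	return rdev2 != NULL && !(rdev2->flags & FAULTY);
}
```

Without the `rdev2 != NULL` test, the third line of the function would dereference NULL whenever the primary device had been removed while the replacement was still in use.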


> 
> @@ -4201,7 +4241,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
>                         return handled;
>                 }
> 
> -               set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
>                 if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
>                         release_stripe(sh);
>                         raid5_set_bi_hw_segments(raid_bio, scnt);
> 
> 
> Should this one-liner be broken out for -stable?

Uhmm... maybe.  If the array is degraded we'll hit problems soon anyway, and
if it isn't, the read errors will all soon be fixed up.

Do you see a particular problem that this fixes that is already possible
without hot-replace?

> 
> 8e2c0f9 md/raid5: allow removal for failed replacement devices.
> 17df00a md/raid5: writes should get directed to replacement as well as original.
> 
> last 2 reviewed-by

Thanks.


> 
> dba5a681 md/raid5:  detect and handle replacements during recovery.
> 
> This one got me looking back to recall the rules about when
> rcu_dereference must be used for an rdev (the ones outlined in commit
> 9910f16a "md: fix up some rdev rcu locking in raid5/6").  But the
> casual future reader may have a hard time finding that commit.  Maybe
> we could introduce our own rdev_deref() macro so that sparse and
> lockdep can automatically validate rdev dereferences, as below.
> 
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8d8e139..6023583 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -357,9 +357,14 @@ enum {
> 
> 
>  struct disk_info {
> -       struct md_rdev  *rdev, *replacement;
> +       struct md_rdev __rcu *rdev;
> +       struct md_rdev __rcu *replacement;
>  };
> 
> +#define rdev_deref(p, md, sh) \
> +       rcu_dereference_check((p), ((md) && mddev_is_locked(md)) || \
> +                                  ((sh) && test_bit(STRIPE_SYNCING, &(sh)->state)))
> +
>  struct r5conf {
>         struct hlist_head       *stripe_hashtbl;
>         struct mddev            *mddev;
> 
> ...but not sure if it's worth the code uglification.

No, I'm not sure either... If it comes up again I might...
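A lockdep condition that combines two `?:` tests with `||` is easy to get wrong, because `?:` binds more loosely than `||`. The userspace check below compares the unparenthesised shape with a fully parenthesised "mddev locked, or stripe syncing" form. That second form is one plausible reading of the intent; the thread does not state it. The boolean parameters are hypothetical stand-ins for "md/sh is non-NULL", `mddev_is_locked()` and `test_bit(STRIPE_SYNCING, ...)`.

```c
#include <stdbool.h>

/*
 * Unparenthesised: C parses this as
 *   have_md ? md_locked : ((1 || have_sh) ? sh_syncing : 1)
 * so when have_md is false it reduces to sh_syncing alone, and when
 * have_md is true the stripe state is never consulted.
 */
static bool cond_unparenthesised(bool have_md, bool md_locked,
				 bool have_sh, bool sh_syncing)
{
	return have_md ? md_locked : 1 || have_sh ? sh_syncing : 1;
}

/* Parenthesised: either condition on its own is sufficient. */
static bool cond_parenthesised(bool have_md, bool md_locked,
			       bool have_sh, bool sh_syncing)
{
	return (have_md && md_locked) || (have_sh && sh_syncing);
}
```

When an unlocked `md` is passed together with a syncing stripe, the two forms disagree. The unparenthesised one would make lockdep complain about a dereference that the intended rule allows.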

> 
> 
> Nit, not sure if it's worth fixing, but this one introduces some
> inconsistent line wrapping around logical operators: "at the end of the
> line" vs "at the beginning of the next line".
> 
> +               if (rdev
> +                   && !test_bit(Faulty, &rdev->flags)
> +                   && !test_bit(In_sync, &rdev->flags)
> +                   && !rdev_set_badblocks(rdev, sh->sector,
> +                                          STRIPE_SECTORS, 0))
> +                       abort = 1;
> +               rdev = conf->disks[i].replacement;
> +               if (rdev
> +                   && !test_bit(Faulty, &rdev->flags)
> +                   && !test_bit(In_sync, &rdev->flags)
> +                   && !rdev_set_badblocks(rdev, sh->sector,
> +                                          STRIPE_SECTORS, 0))
>                         abort = 1;
>         }
>         if (abort) {
> @@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
>         }
>  }
> 
> +static int want_replace(struct stripe_head *sh, int disk_idx)
> +{
> +       struct md_rdev *rdev;
> +       int rv = 0;
> +       /* Doing recovery so rcu locking not required */
> +       rdev = sh->raid_conf->disks[disk_idx].replacement;
> +       if (rdev &&
> +           !test_bit(Faulty, &rdev->flags) &&
> +           !test_bit(In_sync, &rdev->flags) &&
> +           (rdev->recovery_offset <= sh->sector ||
> +            rdev->mddev->recovery_cp <= sh->sector))
> +               rv = 1;
> +
> +       return rv;

Thanks.
I almost always prefer 'at the start', as important things should be obvious.
So I have updated 'want_replace'.
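For comparison, here is a userspace model of `want_replace()` written in the operator-first style described above. It illustrates the style only and is not the updated patch. To let the sketch compile outside the kernel, the simplified `struct rdev_model`, the bit masks, and passing `recovery_cp` as a parameter (instead of reading `rdev->mddev->recovery_cp`) are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long sector_t;   /* stand-in for the kernel type */

/* Hypothetical bit masks standing in for the Faulty / In_sync bits. */
enum { FAULTY = 1u << 0, IN_SYNC = 1u << 1 };

struct rdev_model {
	unsigned flags;
	sector_t recovery_offset;   /* how far recovery has progressed */
};

/*
 * True if 'sector' should be rebuilt onto the replacement device.
 * 'replacement' may be NULL; 'recovery_cp' models mddev->recovery_cp.
 */
static bool want_replace(const struct rdev_model *replacement,
			 sector_t recovery_cp, sector_t sector)
{
	return replacement
		&& !(replacement->flags & FAULTY)
		&& !(replacement->flags & IN_SYNC)
		&& (replacement->recovery_offset <= sector
		    || recovery_cp <= sector);
}
```

With each operator leading its line, the reader can see at a glance that every clause is joined by `&&`, except the final pair of bounds, which are joined by `||`.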


> 
> 2693b9e md/raid5: handle activation of replacement device when recovery completes.
> 
> I wondered why no barrier was needed in raid5_end_write_request after
> finding conf->disks[i].replacement == NULL, until I found the note in
> raid5_end_read_request about the rdev being pinned until all i/o
> returns.  Maybe add a similar note to raid5_end_write_request?

I like adding explanatory notes ... but I'm not quite sure what you are
suggesting here.  Could you be a little more explicit?  Thanks.
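One reading of the suggestion is a comment at the point where a completed write to the replacement is mapped back to a device. If `.replacement` is now NULL, recovery finished and the replacement was moved into `.rdev` while the write was in flight. This is safe without a barrier only because of the pinning rule noted in raid5_end_read_request. The userspace model below sketches such a comment. `struct slot_model` and `rdev_for_replacement_write()` are hypothetical names, not kernel code.

```c
#include <stddef.h>

struct rdev_model { int id; };       /* stand-in for struct md_rdev */

struct slot_model {                  /* stand-in for struct disk_info */
	struct rdev_model *rdev;
	struct rdev_model *replacement;
};

/*
 * Called when a write issued to the slot's replacement completes.
 * Sets *was_replacement to 1 if the device is still the replacement.
 */
static struct rdev_model *
rdev_for_replacement_write(const struct slot_model *slot, int *was_replacement)
{
	if (slot->replacement) {
		*was_replacement = 1;
		return slot->replacement;
	}
	/*
	 * .replacement is NULL: recovery completed and the replacement
	 * was moved into .rdev while this write was in flight.  No barrier
	 * is needed because an rdev is pinned until all of its I/O has
	 * returned, so the device this write went to cannot have left the
	 * slot; it is now .rdev.
	 */
	*was_replacement = 0;
	return slot->rdev;
}
```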

> 
> d6db3d0 md/raid5: recognise replacements when assembling array.
> 6cdb4fb md/raid5: If there is a spare and a replaceable device, start replacement.
> 0124565 md/raid5: Mark device replaceable when we see a write error.
> 
> last 3 reviewed-by.

Thanks.


> 
> 058c478..678a66d
> raid10 and raid1 patches not reviewed.

That's what a Christmas break is for, isn't it??

Thanks for all the review - I really appreciate it.

NeilBrown

Thread overview: 31+ messages
2011-10-26  1:43 [md PATCH 00/16] hot-replace support for RAID4/5/6 NeilBrown
2011-10-26  1:43 ` [md PATCH 03/16] md: remove test for duplicate device when setting slot number NeilBrown
2011-10-26  1:43 ` [md PATCH 01/16] md: refine interpretation of "hold_active == UNTIL_IOCTL" NeilBrown
2011-10-26  1:43 ` [md PATCH 02/16] md: take a reference to mddev during sysfs access NeilBrown
2011-10-26  1:43 ` [md PATCH 04/16] md: change hot_remove_disk to take an rdev rather than a number NeilBrown
2011-10-26  1:43 ` [md PATCH 11/16] md/raid5: writes should get directed to replacement as well as original NeilBrown
2011-10-26  1:43 ` [md PATCH 08/16] md/raid5: remove redundant bio initialisations NeilBrown
2011-10-26  1:43 ` [md PATCH 13/16] md/raid5: handle activation of replacement device when recovery completes NeilBrown
2011-10-26  1:43 ` [md PATCH 06/16] md/raid5: allow each slot to have an extra replacement device NeilBrown
2011-10-26  1:43 ` [md PATCH 05/16] md: create externally visible flags for supporting hot-replace NeilBrown
2011-10-26  1:43 ` [md PATCH 10/16] md/raid5: allow removal for failed replacement devices NeilBrown
2011-10-26  1:43 ` [md PATCH 12/16] md/raid5: detect and handle replacements during recovery NeilBrown
2011-10-26  1:43 ` [md PATCH 09/16] md/raid5: preferentially read from replacement device if possible NeilBrown
2011-10-26  1:43 ` [md PATCH 07/16] md/raid5: raid5.h cleanup NeilBrown
2011-10-26  1:43 ` [md PATCH 14/16] md/raid5: recognise replacements when assembling array NeilBrown
2011-10-26  1:43 ` [md PATCH 15/16] md/raid5: If there is a spare and a replaceable device, start replacement NeilBrown
2011-10-26  1:43 ` [md PATCH 16/16] md/raid5: Mark device replaceable when we see a write error NeilBrown
2011-10-26  6:38 ` [md PATCH 00/16] hot-replace support for RAID4/5/6 David Brown
2011-10-26  7:42   ` NeilBrown
2011-10-26  9:01   ` John Robinson
2011-10-26 13:57     ` Peter W. Morreale
2011-10-26 17:27       ` Piergiorgio Sartor
2011-10-27 17:10 ` Peter W. Morreale
2011-10-27 20:44   ` NeilBrown
2011-10-27 20:53     ` Peter W. Morreale
2011-12-14 22:18 ` Dan Williams
2011-12-15  6:18   ` NeilBrown [this message]
2011-12-15  7:14     ` Williams, Dan J
2011-12-20  5:18       ` NeilBrown
2011-12-22 20:54         ` Alexander Kühn
2011-12-22 21:14           ` NeilBrown
