From: NeilBrown <neilb@suse.de>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
Date: Thu, 15 Dec 2011 17:18:50 +1100
Message-ID: <20111215171850.335da016@notabene.brown>
In-Reply-To: <CAA9_cmfWEnLnvrShVp3BL+WsEApS+nsWBZfao-qM2_habqFCtg@mail.gmail.com>
On Wed, 14 Dec 2011 14:18:51 -0800 Dan Williams <dan.j.williams@intel.com>
wrote:
> On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@suse.de> wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> > git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
>
> Some belated comments based on the commit ids at the time:
>
> 88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL".
> 9c22832 md: take a reference to mddev during sysfs access.
> a7d6ae4 md: remove test for duplicate device when setting slot number.
> 6deecf2 md: change hot_remove_disk to take an rdev rather than a number.
>
> last 4 reviewed-by.
Thanks. I've annotated the two that haven't gone upstream yet.
>
> f248f8c md: create externally visible flags for supporting hot-replace.
>
> 'replaceable' just strikes me as a confusing name as all devices are
> nominally "replaceable", but whether you want it to be actively
> replaced is a different consideration. What about "incumbent" to mark
> the disk as currently holding a position we want it to vacate and
> remove any potential confusion with 'replacement'.
Fair point. I had wondered if I should not have the flag and just use the
"write_error" flag. However the meaning is slightly different.
I don't really like "incumbent" as it gives no indication that there is a
desire to replace the device. Maybe "want_replacement" ??
>
> ce8fd05 md/raid5: allow each slot to have an extra replacement device
> fd7557d md/raid5: raid5.h cleanup
> 15e9a58 md/raid5: remove redundant bio initialisations.
>
> last 3 reviewed-by.
Thanks.
>
> 37aebb5 md/raid5: preferentially read from replacement device if possible.
>
> + /* This flag does not apply to '.replacement'
> + * only to .rdev, so make sure to check that*/
> + struct md_rdev *rdev2 = rcu_dereference(
> + conf->disks[i].rdev);
> + if (rdev2 == rdev)
> + clear_bit(R5_Insync, &dev->flags);
> + if (!test_bit(Faulty, &rdev2->flags)) {
>
> can't rdev2 be NULL here?
Uhm... probably. I've added a test for rdev2 like I have in the "MadeGood"
case below.
Thanks.
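
For illustration only, here is a minimal userspace sketch of the guard being discussed. It is not the kernel code: `struct md_rdev` is reduced to a hypothetical one-field stand-in, `FAULTY_BIT`, `rdev_test_bit()` and `slot_rdev_usable()` are made-up names, and the RCU dereference is left out entirely. It only shows the NULL test being placed ahead of the flag test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel types carry far more state. */
enum { FAULTY_BIT = 0 };

struct md_rdev {
	unsigned long flags;
};

/* Simplified, non-atomic replacement for the kernel's test_bit(). */
static bool rdev_test_bit(int bit, const unsigned long *flags)
{
	return (*flags >> bit) & 1UL;
}

/*
 * True only when the slot holds a device and that device is not Faulty.
 * The NULL test comes first, so the flags are never read through a
 * NULL pointer.
 */
static bool slot_rdev_usable(const struct md_rdev *rdev2)
{
	return rdev2 != NULL && !rdev_test_bit(FAULTY_BIT, &rdev2->flags);
}

/* Small usage example: an empty slot and a healthy device. */
static void example_usage(void)
{
	struct md_rdev healthy = { .flags = 0 };

	assert(!slot_rdev_usable(NULL));
	assert(slot_rdev_usable(&healthy));
}
```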
>
> @@ -4201,7 +4241,6 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
> return handled;
> }
>
> - set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
> if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
> release_stripe(sh);
> raid5_set_bi_hw_segments(raid_bio, scnt);
>
>
> Should this one liner be broken out for -stable?
Uhmm... maybe. If the array is degraded we'll hit problems soon anyway, and
if it isn't, the read-errors will all soon be fixed up.
Do you see a particular problem that this fixes that is already possible
without hot-replace?
>
> 8e2c0f9 md/raid5: allow removal for failed replacement devices.
> 17df00a md/raid5: writes should get directed to replacement as well as original.
>
> last 2 reviewed-by
Thanks.
>
> dba5a681 md/raid5: detect and handle replacements during recovery.
>
> This one got me looking back to recall the rules about when
> rcu_dereference must be used for an rdev (the ones outlined in commit
> 9910f16a "md: fix up some rdev rcu locking in raid5/6"). But the
> casual future reader may have a hard time finding that commit. Maybe
> we could introduce our own rdev_deref() macro so that sparse and
> lockdep can automatically validate rdev dereferences like below.
>
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8d8e139..6023583 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -357,9 +357,14 @@ enum {
>
>
> struct disk_info {
> - struct md_rdev *rdev, *replacement;
> + struct md_rdev __rcu *rdev;
> + struct md_rdev __rcu *replacement;
> };
>
> +#define rdev_deref(p, md, sh) \
> + rcu_dereference_check((p), ((md) ? mddev_is_locked(md) : 1) || \
> + ((sh) ? test_bit(STRIPE_SYNCING, &(sh)->state) : 1))
> +
> struct r5conf {
> struct hlist_head *stripe_hashtbl;
> struct mddev *mddev;
>
> ...but not sure if it's worth the code uglification.
No, I'm not sure either... If it comes up again I might...
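
One thing to watch with a condition of this shape: in C, `?:` binds more loosely than `||`, so an ungrouped chain such as `a ? b : 1 || c ? d : 1` parses as `a ? b : ((1 || c) ? d : 1)`. Each conditional operand therefore needs its own parentheses. A rough userspace sketch of the grouped form follows. The structs, their boolean fields and the name `deref_allowed()` are all hypothetical, and the reading that a missing pointer means "nothing to check" is my assumption about the intent, not something taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lock state and the stripe state. */
struct mddev {
	bool locked;
};

struct stripe_head {
	bool syncing;
};

/*
 * Each ?: is parenthesised before the two are joined with ||.
 * Assumed meaning: a NULL pointer contributes "true" (nothing to
 * check); otherwise the corresponding condition must hold.
 */
static bool deref_allowed(const struct mddev *md, const struct stripe_head *sh)
{
	return (md ? md->locked : true) || (sh ? sh->syncing : true);
}

/* Small usage example: only "both given, neither holds" is rejected. */
static void example_usage(void)
{
	struct mddev unlocked = { .locked = false };
	struct stripe_head idle = { .syncing = false };
	struct stripe_head syncing = { .syncing = true };

	assert(!deref_allowed(&unlocked, &idle));
	assert(deref_allowed(&unlocked, &syncing));
}
```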
>
>
> Nit, not sure if it's worth fixing but this one introduces some
> inconsistent line wrapping around logical operators... "at the end" vs
> "beginning of next line"
>
> + if (rdev
> + && !test_bit(Faulty, &rdev->flags)
> + && !test_bit(In_sync, &rdev->flags)
> + && !rdev_set_badblocks(rdev, sh->sector,
> + STRIPE_SECTORS, 0))
> + abort = 1;
> + rdev = conf->disks[i].replacement;
> + if (rdev
> + && !test_bit(Faulty, &rdev->flags)
> + && !test_bit(In_sync, &rdev->flags)
> + && !rdev_set_badblocks(rdev, sh->sector,
> + STRIPE_SECTORS, 0))
> abort = 1;
> }
> if (abort) {
> @@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
> }
> }
>
> +static int want_replace(struct stripe_head *sh, int disk_idx)
> +{
> + struct md_rdev *rdev;
> + int rv = 0;
> + /* Doing recovery so rcu locking not required */
> + rdev = sh->raid_conf->disks[disk_idx].replacement;
> + if (rdev &&
> + !test_bit(Faulty, &rdev->flags) &&
> + !test_bit(In_sync, &rdev->flags) &&
> + (rdev->recovery_offset <= sh->sector ||
> + rdev->mddev->recovery_cp <= sh->sector))
> + rv = 1;
> +
> + return rv;
Thanks.
I almost always prefer 'at the start' as important things should be obvious.
So I have updated 'want_replace'.
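
For anyone following along, a rough userspace sketch of the 'at the start' wrapping applied to the same test. This is not the updated kernel function: the types are cut-down, hypothetical stand-ins, the flags are plain bit masks rather than `test_bit()` calls, and the signature is simplified to take the rdev and sector directly instead of a stripe_head and disk index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long sector_t;

/* Hypothetical flag masks standing in for the Faulty and In_sync bits. */
enum { FAULTY = 1u << 0, IN_SYNC = 1u << 1 };

struct mddev {
	sector_t recovery_cp;
};

struct md_rdev {
	unsigned int flags;
	sector_t recovery_offset;
	struct mddev *mddev;
};

/*
 * Same logic as the quoted want_replace(), with every operator leading
 * its line so the structure of the condition is visible at the left edge.
 */
static bool want_replace(const struct md_rdev *rdev, sector_t sector)
{
	return rdev
		&& !(rdev->flags & FAULTY)
		&& !(rdev->flags & IN_SYNC)
		&& (rdev->recovery_offset <= sector
		    || rdev->mddev->recovery_cp <= sector);
}

/* Small usage example with a replacement that has recovered 50 sectors. */
static void example_usage(void)
{
	struct mddev md = { .recovery_cp = 100 };
	struct md_rdev repl = { .flags = 0, .recovery_offset = 50, .mddev = &md };

	assert(want_replace(&repl, 60));   /* 50 <= 60 */
	assert(!want_replace(&repl, 10));  /* neither offset is <= 10 */
}
```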
>
> 2693b9e md/raid5: handle activation of replacement device when recovery completes.
>
> I questioned not needing a barrier in raid5_end_write_request after
> finding conf->disks[i].replacement == NULL until I found the note in
> raid5_end_read_request about the rdev being pinned until all i/o
> returns. Maybe a similar note to raid5_end_write_request?
I like adding explanatory notes ... but I'm not quite sure what you are
suggesting here. Could you be a little more explicit? Thanks.
>
> d6db3d0 md/raid5: recognise replacements when assembling array.
> 6cdb4fb md/raid5: If there is a spare and a replaceable device, start replacement.
> 0124565 md/raid5: Mark device replaceable when we see a write error.
>
> last 3 reviewed-by.
Thanks.
>
> 058c478..678a66d
> raid10 and raid1 patches not reviewed.
That's what a Christmas break is for, isn't it??
Thanks for all the review - I really appreciate it.
NeilBrown