From: NeilBrown <neilb@suse.de>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
Date: Thu, 15 Dec 2011 17:18:50 +1100
Message-ID: <20111215171850.335da016@notabene.brown>
In-Reply-To: <CAA9_cmfWEnLnvrShVp3BL+WsEApS+nsWBZfao-qM2_habqFCtg@mail.gmail.com>
On Wed, 14 Dec 2011 14:18:51 -0800 Dan Williams <dan.j.williams@intel.com>
wrote:
> On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@suse.de> wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> > git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
>
> Some belated comments based on the commit ids at the time:
>
> 88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL".
> 9c22832 md: take a reference to mddev during sysfs access.
> a7d6ae4 md: remove test for duplicate device when setting slot number.
> 6deecf2 md: change hot_remove_disk to take an rdev rather than a number.
>
> last 4 reviewed-by.
Thanks. I've annotated the two that haven't gone upstream yet.
>
> f248f8c md: create externally visible flags for supporting hot-replace.
>
> 'replaceable' just strikes me as a confusing name as all devices are
> nominally "replaceable", but whether you want it to be actively
> replaced is a different consideration. What about "incumbent" to mark
> the disk as currently holding a position we want it to vacate and
> remove any potential confusion with 'replacement'.
Fair point. I had wondered if I should not have the flag and just use the
"write_error" flag. However the meaning is slightly different.
I don't really like "incumbent" as it gives no indication that there is a
desire to replace the device. Maybe "want_replacement" ??
>
> ce8fd05 md/raid5: allow each slot to have an extra replacement device
> fd7557d md/raid5: raid5.h cleanup
> 15e9a58 md/raid5: remove redundant bio initialisations.
>
> last 3 reviewed-by.
Thanks.
>
> 37aebb5 md/raid5: preferentially read from replacement device if possible.
>
> + /* This flag does not apply to '.replacement'
> + * only to .rdev, so make sure to check that*/
> + struct md_rdev *rdev2 = rcu_dereference(
> + conf->disks[i].rdev);
> + if (rdev2 == rdev)
> + clear_bit(R5_Insync, &dev->flags);
> + if (!test_bit(Faulty, &rdev2->flags)) {
>
> can't rdev2 be NULL here?
Uhm... probably. I've added a test for rdev2 like I have in the "MadeGood"
case below.
Thanks.
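
To illustrate the kind of guard meant here, a minimal userspace sketch follows. The types and the test_bit() helper are simplified stand-ins, not the kernel definitions, and rdev_usable() is a hypothetical name that does not appear in the patch:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum { Faulty = 0, In_sync = 1 };

struct md_rdev {
	unsigned long flags;
};

/* Simplified, non-atomic stand-in for the kernel's test_bit(). */
static int test_bit(int nr, const unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}

/*
 * Returns 1 only when the slot holds a device that is not Faulty.
 * A NULL rdev2 (empty slot) is rejected before any flag is read.
 */
static int rdev_usable(const struct md_rdev *rdev2)
{
	return rdev2 != NULL && !test_bit(Faulty, &rdev2->flags);
}
```

The point is only the ordering: the pointer is tested before its flags are touched, so an empty slot never reaches test_bit().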
>
> @@ -4201,7 +4241,6 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
> return handled;
> }
>
> - set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
> if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
> release_stripe(sh);
> raid5_set_bi_hw_segments(raid_bio, scnt);
>
>
> Should this one liner be broken out for -stable?
Uhmm... maybe. If the array is degraded we'll hit problems soon anyway, and
if it isn't, the read-errors will all soon be fixed up.
Do you see a particular problem that this fixes that is already possible
without hot-replace?
>
> 8e2c0f9 md/raid5: allow removal for failed replacement devices.
> 17df00a md/raid5: writes should get directed to replacement as well as original.
>
> last 2 reviewed-by
Thanks.
>
> dba5a681 md/raid5: detect and handle replacements during recovery.
>
> This one got me looking back to recall the rules about when
> rcu_dereference must be used for an rdev (the ones outlined in commit
> 9910f16a "md: fix up some rdev rcu locking in raid5/6"). But the
> casual future reader may have a hard time finding that commit. Maybe
> we could introduce our own rdev_deref() macro so that sparse and
> lockdep can automatically validate rdev dereferences like below.
>
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8d8e139..6023583 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -357,9 +357,14 @@ enum {
>
>
> struct disk_info {
> - struct md_rdev *rdev, *replacement;
> + struct md_rdev __rcu *rdev;
> + struct md_rdev __rcu *replacement;
> };
>
> +#define rdev_deref(p, md, sh) \
> + rcu_dereference_check((p), (md) ? mddev_is_locked(md) : 1 || \
> + (sh) ? test_bit(STRIPE_SYNCING, &(sh)->state) : 1)
> +
> struct r5conf {
> struct hlist_head *stripe_hashtbl;
> struct mddev *mddev;
>
> ...but not sure if it's worth the code uglification.
No, I'm not sure either... If it comes up again I might...
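
One detail worth noting if such a macro is ever revisited: in C the conditional operator binds more loosely than ||, so the condition as quoted does not group the way its layout suggests. The sketch below models the four inputs as plain ints (hypothetical names, no kernel types) and compares how the expression is parsed with an explicitly parenthesized form. Which grouping was actually intended is an assumption on my part.

```c
/*
 * "a ? b : 1 || c ? d : 1" is parsed as "a ? b : ((1 || c) ? d : 1)".
 * Since (1 || c) is always true, this reduces to "a ? b : d".
 */
static int check_as_parsed(int have_md, int md_locked,
			   int have_sh, int syncing)
{
	return have_md ? md_locked : ((1 || have_sh) ? syncing : 1);
}

/*
 * Assumed intent: access is allowed if either sub-condition holds,
 * and a missing md or sh makes its sub-condition trivially true.
 */
static int check_grouped(int have_md, int md_locked,
			 int have_sh, int syncing)
{
	return (have_md ? md_locked : 1) || (have_sh ? syncing : 1);
}
```

With an unlocked mddev and a stripe that is syncing, check_as_parsed() rejects the access while check_grouped() accepts it, so the parentheses change the result.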
>
>
> Nit, not sure if it's worth fixing but this one introduces some
> inconsistent line wrapping around logical operators... "at the end" vs
> "beginning of next line"
>
> + if (rdev
> + && !test_bit(Faulty, &rdev->flags)
> + && !test_bit(In_sync, &rdev->flags)
> + && !rdev_set_badblocks(rdev, sh->sector,
> + STRIPE_SECTORS, 0))
> + abort = 1;
> + rdev = conf->disks[i].replacement;
> + if (rdev
> + && !test_bit(Faulty, &rdev->flags)
> + && !test_bit(In_sync, &rdev->flags)
> + && !rdev_set_badblocks(rdev, sh->sector,
> + STRIPE_SECTORS, 0))
> abort = 1;
> }
> if (abort) {
> @@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
> }
> }
>
> +static int want_replace(struct stripe_head *sh, int disk_idx)
> +{
> + struct md_rdev *rdev;
> + int rv = 0;
> + /* Doing recovery so rcu locking not required */
> + rdev = sh->raid_conf->disks[disk_idx].replacement;
> + if (rdev &&
> + !test_bit(Faulty, &rdev->flags) &&
> + !test_bit(In_sync, &rdev->flags) &&
> + (rdev->recovery_offset <= sh->sector ||
> + rdev->mddev->recovery_cp <= sh->sector))
> + rv = 1;
> +
> + return rv;
Thanks.
I almost always prefer 'at the start' as important things should be obvious.
So I have updated 'want_replace'.
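
For reference, a sketch of how want_replace reads with the operators moved to the start of each line. The logic is the same as in the quoted hunk. The struct definitions and test_bit() are cut-down userspace stand-ins so the fragment compiles on its own; they are assumptions and not the kernel's definitions:

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel types (assumptions, for illustration). */
typedef unsigned long long sector_t;
enum { Faulty = 0, In_sync = 1 };

struct mddev { sector_t recovery_cp; };
struct md_rdev {
	unsigned long flags;
	sector_t recovery_offset;
	struct mddev *mddev;
};
struct disk_info { struct md_rdev *rdev, *replacement; };
struct r5conf { struct disk_info *disks; };
struct stripe_head { sector_t sector; struct r5conf *raid_conf; };

/* Simplified, non-atomic stand-in for the kernel's test_bit(). */
static int test_bit(int nr, const unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}

static int want_replace(struct stripe_head *sh, int disk_idx)
{
	struct md_rdev *rdev;
	int rv = 0;
	/* Doing recovery so rcu locking not required */
	rdev = sh->raid_conf->disks[disk_idx].replacement;
	if (rdev
	    && !test_bit(Faulty, &rdev->flags)
	    && !test_bit(In_sync, &rdev->flags)
	    && (rdev->recovery_offset <= sh->sector
		|| rdev->mddev->recovery_cp <= sh->sector))
		rv = 1;

	return rv;
}
```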
>
> 2693b9e md/raid5: handle activation of replacement device when recovery completes.
>
> I questioned not needing a barrier in raid5_end_write_request after
> finding conf->disks[i].replacement == NULL until I found the note in
> raid5_end_read_request about the rdev being pinned until all i/o
> returns. Maybe a similar note to raid5_end_write_request?
I like adding explanatory notes ... but I'm not quite sure what you are
suggesting here. Could you be a little more explicit? Thanks.
>
> d6db3d0 md/raid5: recognise replacements when assembling array.
> 6cdb4fb md/raid5: If there is a spare and a replaceable device, start replacement.
> 0124565 md/raid5: Mark device replaceable when we see a write error.
>
> last 3 reviewed-by.
Thanks.
>
> 058c478..678a66d
> raid10 and raid1 patches not reviewed.
That's what a Christmas break is for, isn't it??
Thanks for all the review - I really appreciate it.
NeilBrown