From: NeilBrown <neilb@suse.de>
To: "Peter W. Morreale" <morreale@sgi.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: [md PATCH 00/16] hot-replace support for RAID4/5/6
Date: Fri, 28 Oct 2011 07:44:45 +1100
Message-ID: <20111028074445.7ecfa029@notabene.brown>
In-Reply-To: <1319735434.3930.34.camel@hermosa.site>
On Thu, 27 Oct 2011 11:10:34 -0600 "Peter W. Morreale" <morreale@sgi.com>
wrote:
> On Wed, 2011-10-26 at 12:43 +1100, NeilBrown wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> > git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
> >
> > There is currently no mdadm support, but you can test it out and
> > experiment without mdadm.
> >
> > In order to activate hot-replace you need to mark the device as
> > 'replaceable'.
> > This happens automatically when a write error is recorded in a
> > bad-block log (if you happen to have one).
> > It can be achieved manually by
> >    echo replaceable > /sys/block/mdXX/md/dev-YYY/state
> >
> > This makes YYY, in XX, replaceable.
> >
> > If md notices that there is a replaceable drive and a spare it will
> > attach the spare to the replaceable drive and mark it as a
> > 'replacement'.
> > This word appears in the 'state' file and as (R) in /proc/mdstat.
> >
> > md will then copy data from the replaceable drive to the replacement.
> > If there is a bad block on the replaceable drive, it will get the data
> > from elsewhere. This looks like a "recovery" operation.
> >
> > When the replacement completes the replaceable device will be marked
> > as Failed and will be disconnected from the array (i.e. the 'slot'
> > will be set to 'none') and the replacement drive will take up full
> > possession of that slot.
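
To watch for this from a script, something like the following might do.
It is only a sketch: the function name list_replacements and the
SYSFS_ROOT override are made up for illustration (the override just lets
you point it at a scratch directory), and it assumes the word
'replacement' shows up in the per-device 'state' file as described above.

```shell
# Hypothetical helper: print the dev-* entries of an md array whose state
# file contains the word 'replacement'.
# SYSFS_ROOT is an illustrative override, not something md provides.
list_replacements() {
    md="${SYSFS_ROOT:-/sys/block}/$1/md"

    for state in "$md"/dev-*/state
    do
        [ -e "$state" ] || continue      # glob matched nothing
        # -w keeps 'replaceable' from matching 'replacement'
        if grep -qw replacement "$state"
        then
            basename "$(dirname "$state")"
        fi
    done
}
```

Calling it as "list_replacements mdX" should print one dev-YYY name per
line, and nothing at all once every replacement has taken over its slot.
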
>
> Neil,
>
> Seems to work quite well. Note I have not yet performed a data
> consistency check, just the mechanics of 'replacing' an existing
> drive.
>
> I see in the code that a recovery is kicked immediately after changing
> the state of a drive. One question is whether it will be possible to
> mark multiple drives for replacement, then invoke the recovery one time,
> replacing all disks marked in a single pass?
>
> Right now, changing state on multiple drives kicks off sequential
> recoveries. For larger disks (3TB/etc), recovery takes a long time and
> there is a non-zero performance hit on the live array.
>
> There are two common use cases to think about. First being an array
> disk replacement to (say) larger disks. Second being a new array in use
> for a period of time where the disks are approaching end-of-life, and
> multiple disks are showing signs of possible failure. So we want to
> replace a number of them at one time and incur the performance hit one
> time.
>
> I see where the code limits a recovery to one sync at a time, would it
> be possible to extend this to multiple concurrent replacements?
>
> What would it take to enable this?

    echo frozen > /sys/block/mdX/md/sync_action
    for i in /sys/block/mdX/md/dev-*/state
    do echo replaceable > "$i"
    done
    echo repair > /sys/block/mdX/md/sync_action

should do it. You certainly should be able to replace several devices at the
same time using this approach, though I haven't tried it.
(hmmm... it probably shouldn't accept a 'replaceable' flag on spares - I'll
make a note of that).
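
Until that is fixed, the loop can skip spares itself. A rough sketch,
not tried on a real array: it assumes a spare reports the word 'spare'
in its 'state' file, and both the function name mark_replaceable and the
SYSFS_ROOT override are hypothetical, there only so the loop can be
pointed at a scratch directory first.

```shell
# Hypothetical helper: mark every non-spare member of an md array as
# 'replaceable', then start a single pass for all of them.
# Assumption: a spare reports the word "spare" in its state file.
# SYSFS_ROOT is an illustrative override, not something md provides.
mark_replaceable() {
    md="${SYSFS_ROOT:-/sys/block}/$1/md"

    # Hold off recovery so each state change does not start its own pass.
    echo frozen > "$md/sync_action" || return 1

    for state in "$md"/dev-*/state
    do
        [ -e "$state" ] || continue      # glob matched nothing
        if grep -qw spare "$state"
        then
            continue                     # leave spares alone
        fi
        echo replaceable > "$state"
    done

    # Same final step as the loop above: one pass for all marked devices.
    echo repair > "$md/sync_action"
}
```

Usage would be "mark_replaceable mdX". You still need enough spares
attached for md to have something to replace onto.
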
>
> Thanks again for this effort, this is terrific.
Thanks.
NeilBrown
>
> Best,
> -PWM
>
>
> >
> > It is not possible to assemble an array with replacement with mdadm.
> > To do this by hand:
> >
> >    mknod /dev/md27 b 9 27
> >    < /dev/md27
> >    cd /sys/block/md27/md
> >    echo 1.2 > metadata_version
> >    echo 8:1 > new_dev
> >    echo 8:17 > new_dev
> >    ...
> >    echo active > array_state
> >
> > Replace '27' by the md number you want. Replace 1.2 by the metadata
> > version number (must be 1.x for some x). Replace 8:1, 8:17 etc
> > by the major:minor numbers of each device in the array.
> >
> > Yes: this is clumsy. But then you aren't doing this on live data -
> > only on test devices to experiment.
> >
> > You can still assemble the array without the replacement using mdadm.
> > Just list all the drives except the replacement in the --assemble
> > command.
> > Also once the replacement operation completes you can of course stop
> > and assemble the new array with old mdadm.
> >
> > I hope to submit this together with support for RAID10 (and maybe some
> > minimal support for RAID1) for Linux-3.3. By the time it comes out
> > mdadm-3.3 should exist with full support for hot-replace.
> >
> > Review and testing is very welcome, but please do not try it on live
> > data.
> >
> > NeilBrown
> >
> >
> > ---
> >
> > NeilBrown (16):
> > md/raid5: Mark device replaceable when we see a write error.
> > md/raid5: If there is a spare and a replaceable device, start replacement.
> > md/raid5: recognise replacements when assembling array.
> > md/raid5: handle activation of replacement device when recovery completes.
> > md/raid5: detect and handle replacements during recovery.
> > md/raid5: writes should get directed to replacement as well as original.
> > md/raid5: allow removal for failed replacement devices.
> > md/raid5: preferentially read from replacement device if possible.
> > md/raid5: remove redundant bio initialisations.
> > md/raid5: raid5.h cleanup
> > md/raid5: allow each slot to have an extra replacement device
> > md: create externally visible flags for supporting hot-replace.
> > md: change hot_remove_disk to take an rdev rather than a number.
> > md: remove test for duplicate device when setting slot number.
> > md: take a reference to mddev during sysfs access.
> > md: refine interpretation of "hold_active == UNTIL_IOCTL".
> >
> >
> >  Documentation/md.txt      |   22 ++
> >  drivers/md/md.c           |  132 ++++++++++---
> >  drivers/md/md.h           |   82 +++++---
> >  drivers/md/multipath.c    |    7 -
> >  drivers/md/raid1.c        |    7 -
> >  drivers/md/raid10.c       |    7 -
> >  drivers/md/raid5.c        |  462 +++++++++++++++++++++++++++++++++++----------
> >  drivers/md/raid5.h        |   98 +++++-----
> >  include/linux/raid/md_p.h |    7 -
> >  9 files changed, 599 insertions(+), 225 deletions(-)