From: Bernd Schubert <bs@q-leap.de>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Chris Worley <worleys@gmail.com>, Neil Brown <neilb@suse.de>,
linux-raid@vger.kernel.org
Subject: Re: Roadmap for md/raid ???
Date: Fri, 19 Dec 2008 17:13:34 +0100
Message-ID: <200812191713.35015.bs@q-leap.de>
In-Reply-To: <alpine.DEB.1.10.0812191050500.2397@p34.internal.lan>

But multiple rebuilds are already supported. If you have multiple
arrays on partitions of the same drives, md normally serializes their
resyncs; if the CPU rather than the disks is the limit, you may want
to set /sys/block/mdX/md/sync_force_parallel to 1.
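
For example, to let md0 and md1 rebuild at the same time even though
they share spindles (md0 and md1 standing in for whatever your arrays
are called):

    echo 1 > /sys/block/md0/md/sync_force_parallel
    echo 1 > /sys/block/md1/md/sync_force_parallel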
Cheers,
Bernd

On Friday 19 December 2008 16:51:24 Justin Piszcz wrote:
> Or, before that, allow multiple arrays to rebuild on each core of the
> CPU(s), one per array.
>
> Justin.
>
> On Fri, 19 Dec 2008, Chris Worley wrote:
> > How about "parallelized parity calculation"... given SSD I/O
> > performance, parity calculations are now the performance bottleneck.
> > Most systems have plenty of CPUs to do parity calculations in
> > parallel. Parity calculations are embarrassingly parallel: there is
> > no dependence between stripes, so each can be computed independently.
> >
> > Chris
> >
> > On Thu, Dec 18, 2008 at 9:10 PM, Neil Brown <neilb@suse.de> wrote:
> >> Not really a roadmap, more a few tourist attractions that you might
> >> see on the way if you stick around (and if I stick around)...
> >>
> >> Comments welcome.
> >>
> >> NeilBrown
> >>
> >>
> >> - Bad block list
> >> The idea here is to maintain and store on each device a list of
> >> blocks that are known to be 'bad'. This effectively allows us to
> >> fail a single block rather than a whole device when we get a media
> >> write error. Of course if updating the bad-block-list gives an
> >> error we then have to fail the device.
> >>
> >> We would also record a bad block if we get a read error on a degraded
> >> array. This would e.g. allow recovery for a degraded raid1 where the
> >> sole remaining device has a bad block.
> >>
> >> An array could have multiple errors on different devices and just
> >> those stripes would be considered to be "degraded". As long as no
> >> single stripe had too many bad blocks, the data would still be safe.
> >> Naturally as soon as you get one bad block, the array becomes
> >> susceptible to data loss on a single device failure, so it wouldn't
> >> be advisable to run with non-empty badblock lists for an extended
> >> length of time. However, it might provide breathing space until
> >> drive replacement can be achieved.
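
Just as a sketch of how such a list might be exposed to user space:
say a sysfs file per member device, holding "start-sector length"
pairs (path and format here are pure guesswork, not an existing
interface):

    # show the known-bad regions of one member of md0
    cat /sys/block/md0/md/dev-sdb1/bad_blocks

Monitoring tools could then warn as soon as the list becomes
non-empty, well before a second failure makes it a problem.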
> >>
> >> - hot-device-replace
> >> This is probably the most asked for feature of late. It would allow
> >> a device to be 'recovered' while the original was still in service.
> >> So instead of failing out a device and adding a spare, you can add
> >> the spare, build the data onto it, then fail out the device.
> >>
> >> This meshes well with the bad block list. When we find a bad block,
> >> we start a hot-replace onto a spare (if one exists). If sleeping
> >> bad blocks are discovered during the hot-replace process, we don't
> >> lose the data unless we find two bad blocks in the same stripe.
> >> And then we just lose data in that stripe.
> >>
> >> Recording in the metadata that a hot-replace was happening might be
> >> a little tricky, so it could be that if you reboot in the middle,
> >> you would have to restart from the beginning. Similarly there would
> >> be no 'intent' bitmap involved for this resync.
> >>
> >> Each personality would have to implement much of this independently,
> >> effectively providing a mini raid1 implementation. It would be very
> >> minimal without e.g. read balancing or write-behind etc.
> >>
> >> There would be no point implementing this in raid1. Just
> >> raid456 and raid10.
> >> It could conceivably make sense for raid0 and linear, but that is
> >> very unlikely to be implemented.
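
For comparison, today the replacement has to go through a degraded
state, roughly (device names are placeholders):

    mdadm /dev/md0 --add /dev/sdc1     # add the spare
    mdadm /dev/md0 --fail /dev/sdb1    # array is degraded from here...
    mdadm /dev/md0 --remove /dev/sdb1  # ...until the rebuild finishes

Hot-replace would close exactly that window in which one further
device failure loses data.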
> >>
> >> - split-mirror
> >> This is really a function of mdadm rather than md. It is already
> >> quite possible to break a mirror into two separate single-device
> >> arrays. However it is a sufficiently common operation that it is
> >> probably worth making very easy to do with mdadm.
> >> I'm thinking something like
> >> mdadm --create /dev/md/new --split /dev/md/old
> >>
> >> will create a new raid1 by taking one device off /dev/md/old (which
> >> must be a raid1) and making an array with exactly the right metadata
> >> and size.
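
Until something like that exists, the manual procedure is roughly the
following, assuming metadata at the end of the device (0.90 or 1.0)
so the data survives the re-create; device names are placeholders:

    mdadm /dev/md/old --fail /dev/sdb1
    mdadm /dev/md/old --remove /dev/sdb1
    mdadm --create /dev/md/new --level=1 --raid-devices=2 \
          missing /dev/sdb1

Enough steps to get wrong that a single well-tested option would be a
real improvement.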
> >>
> >> - raid5->raid6 conversion.
> >> This is also a fairly commonly asked for feature.
> >> The first step would be to define a raid6 layout where the Q block
> >> was not rotated around the devices but was always on the last
> >> device. Then we could change a raid5 to a singly-degraded raid6
> >> without moving any data.
> >>
> >> The next step would be to implement in-place restriping.
> >> This involves
> >> - freezing a section of the array (all I/O to that section blocks)
> >> - copying the data out to a safe backup
> >> - copying it back in with the new layout
> >> - updating the metadata to indicate that the restripe has
> >> progressed.
> >> - repeat.
> >>
> >> This would probably be quite slow but it would achieve the desired
> >> result.
> >>
> >> Once we have in-place restriping we could change chunksize as
> >> well.
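
From the command line this could presumably end up looking like the
existing --grow interface, e.g. (purely hypothetical):

    mdadm /dev/md0 --add /dev/sde1     # disk to hold the Q blocks
    mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
          --backup-file=/root/md0-grow.bak

and later perhaps a --chunk option using the same in-place restriping.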
> >>
> >> - raid5 reduce number of devices.
> >> We can currently restripe a raid5 (or 6) over a larger number of
> >> devices but not over a smaller number of devices. That means you
> >> cannot undo an increase that you didn't want.
> >>
> >> It might be nice to allow this to happen at the same time as
> >> increasing --size (if the devices are big enough) to allow the
> >> array to be restriped without changing the available space.
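
Again only a guess at the eventual interface: shrink the used space
first, then the device count, e.g.

    # make sure the filesystem fits the smaller array first!
    mdadm --grow /dev/md0 --array-size=<new size>
    mdadm --grow /dev/md0 --raid-devices=3 \
          --backup-file=/root/md0-shrink.bak

(--array-size is made up here, standing in for however the size
reduction ends up being specified.)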
> >>
> >> - cluster raid1
> >> Allow a raid1 to be assembled on multiple hosts that share some
> >> drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
> >> It requires co-ordination to handle failure events and
> >> resync/recovery. Most of this would probably be done in userspace.
--
Bernd Schubert
Q-Leap Networks GmbH