From: "Keld Jørn Simonsen" <keld@keldix.com>
To: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Wed, 16 Feb 2011 23:50:28 +0100
Message-ID: <20110216225028.GA11472@www2.open-std.org>
In-Reply-To: <20110216212751.51a294aa@notabene.brown>
On Wed, Feb 16, 2011 at 09:27:51PM +1100, NeilBrown wrote:
>
> RAID1, RAID10 and RAID456 should all support bad blocks. Every read
> or write should perform a lookup of the bad block list. If a read
> finds a bad block, that device should be treated as failed for that
> read. This includes reads that are part of resync or recovery.
>
> If a write finds a bad block there are two possible responses. Either
> the block can be ignored as with reads, or we can try to write the
> data in the hope that it will fix the error. Always taking the second
> action would seem best as it allows blocks to be removed from the
> bad-block list, but as a failing write can take a long time, there are
> plenty of cases where it would not be good.
I was thinking of a further refinement: if there is a bad block on one
drive, the corresponding good block should be read from another drive
and written to a bad block recovery area on the erroneous drive. That
way the erroneous drive would still hold the complete data. The bad
block list would then record both the bad block and the location of
the corresponding good block in the recovery area. Given that the
number of bad blocks would be small, this would not really hurt
performance.

The bad block recovery area could be handled like other metadata on
the drive. I think this largely reflects what most disk hardware
already does internally, except that here the corresponding good block
is copied from another drive.
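
To make the idea concrete, here is a minimal sketch in Python (not the
md code; all names, including the recovery-area start sector, are
purely illustrative) of a per-device remap table that pairs each bad
sector with a replacement sector in the recovery area:

# Minimal sketch of a per-device bad-block remap table: each bad
# sector is paired with a replacement sector in a small recovery area
# on the same drive. Not the md implementation; names are illustrative.

RECOVERY_AREA_START = 1_000_000   # hypothetical first sector of the area

class BadBlockRemap:
    def __init__(self):
        self.remap = {}               # bad sector -> replacement sector
        self.next_free = RECOVERY_AREA_START

    def record_bad(self, sector, good_data, write_sector):
        """Copy good data (read from a healthy mirror or rebuilt from
        parity) into the recovery area and remember the mapping."""
        replacement = self.next_free
        self.next_free += 1
        write_sector(replacement, good_data)   # caller-supplied raw write
        self.remap[sector] = replacement
        return replacement

    def resolve(self, sector):
        """Sector to actually read: the replacement if the original is
        known bad, otherwise the original."""
        return self.remap.get(sector, sector)

# Reads are redirected through resolve(); the lookup is O(1) and, since
# the number of bad blocks is expected to stay small, the table stays tiny.
if __name__ == "__main__":
    table = BadBlockRemap()
    fake_disk = {}
    table.record_bad(4711, b"reconstructed data", fake_disk.__setitem__)
    assert table.resolve(4711) == RECOVERY_AREA_START
    assert table.resolve(42) == 42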
> Support reshape of RAID10 arrays.
> ---------------------------------
>
> 6/ changing layout to or from 'far' is nearly impossible...
> With a change in data_offset it might be possible to move one
> stripe at a time, always into the place just vacated.
> However keeping track of where we are and where it is safe to read
> from would be a major headache - unless it falls out with some
> really neat maths, which I don't think it does.
> So this option will be left out.
I think this can easily be done for some of the more common cases of
"far", e.g. a 2- or 4-drive raid10 - possibly all layouts involving an
even number of drives. You can keep one complete copy of the data
intact and then rewrite the whole other copy in the new layout.

Please note that there may be two versions of the "near" and "far"
layouts, one looking like a raid 1+0 and one looking like a raid 0+1,
giving distinctly different survival characteristics when more than
one drive fails. In a 4-drive raid10, one layout has a 66 % chance of
surviving a 2-drive crash, while the other has only a 33 % chance.

I am not sure this can be generalized to all combinations of drives
and layouts. However, the simple cases are common enough and simple
enough to warrant implementing, IMHO.
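
As a quick check of those figures (a throwaway Python sketch, not tied
to any md code), enumerating all two-drive failures of a 4-drive array
gives 4/6 ~ 66 % survival for the 1+0-style grouping and 2/6 ~ 33 %
for the 0+1-style grouping:

from itertools import combinations

def survival(groups, n_drives, n_failures, survives):
    """Fraction of n_failures-drive failure cases the array survives."""
    cases = list(combinations(range(n_drives), n_failures))
    ok = sum(1 for failed in cases if survives(groups, set(failed)))
    return ok / len(cases)

# raid1+0 style: data is striped over mirror pairs; every pair must
# keep at least one working drive.
def survives_10(pairs, failed):
    return all(not failed >= set(pair) for pair in pairs)

# raid0+1 style: two whole stripe sets are mirrored; at least one
# stripe set must remain completely intact.
def survives_01(sets, failed):
    return any(failed.isdisjoint(s) for s in sets)

pairs = [(0, 1), (2, 3)]     # mirror pairs (1+0-like)
sets  = [(0, 1), (2, 3)]     # mirrored stripe sets (0+1-like)

print(survival(pairs, 4, 2, survives_10))  # 0.666... -> ~66 %
print(survival(sets,  4, 2, survives_01))  # 0.333... -> ~33 %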
> So the only 'instant' conversion possible is to increase the device
> size for 'near' and 'offset' array.
>
> 'reshape' conversions can modify chunk size, increase/decrease number of
> devices and swap between 'near' and 'offset' layout providing a
> suitable number of chunks of backup space is available.
>
> The device-size of a 'far' layout can also be changed by a reshape
> providing the number of devices is not increased.
Given that most configurations of "far" can be reshaped into "near",
the addition of drives should then be possible by: reshape far to
near, extend the near array, reshape near back to far.
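
Both points - keeping one copy intact during a reshape, and converting
between far and near - rest on the fact that each copy in a far layout
is a complete, raid0-ordered image of the data. A small sketch of the
documented far=2 chunk placement (function and parameter names are my
own, purely illustrative):

# Where raid10 "far=2" puts the two copies of chunk c on an n-drive
# array: the first copy is a plain raid0 stripe in the front half of
# the devices, the second copy sits in the back half, rotated by one
# device so that both copies of a chunk land on different drives.

def far2_copies(chunk, n_drives, chunks_per_half):
    """Return [(device, chunk_offset), ...] for both copies of a chunk."""
    dev0 = chunk % n_drives
    off0 = chunk // n_drives                   # front half: raid0 order
    dev1 = (chunk + 1) % n_drives              # back half: rotated copy
    off1 = chunks_per_half + chunk // n_drives
    return [(dev0, off0), (dev1, off1)]

# Example with 4 drives: each copy on its own is a complete raid0 image
# of the data, so one copy could stay readable while the other is
# rewritten into the target layout.
for c in range(8):
    print(c, far2_copies(c, n_drives=4, chunks_per_half=100))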
Other improvements
------------------
I would like to hear if you are considering other improvements:
1. A layout variant of raid10,far and raid10,near that has a better
   survival ratio for the failure of 2 disks or more. The current
   layouts only have the survival properties of raid 0+1.
2. Better performance of resync etc. by using bigger buffers, say 20 MB.
best regards
keld