From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Wed, 16 Feb 2011 16:42:26 +0100
Message-ID: <ijgr9p$7v8$1@dough.gmane.org>
In-Reply-To: <20110216212751.51a294aa@notabene.brown>
On 16/02/2011 11:27, NeilBrown wrote:
>
> Hi all,
> I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
>
> I thought it might be worth posting it here too...
>
> NeilBrown
>
The bad block log will be a huge step up for reliability by making
failures fine-grained. Occasional bad blocks are a serious risk,
especially with very large disks. The bad block log, especially
combined with the "hot replace" idea, will make md raid a lot safer
because you avoid running the array in degraded mode (except for a few
stripes).
When a block is marked as bad on a disk, is it possible to inform the
file system that the whole stripe is considered bad? Then the
filesystem will (I hope) add that stripe to its own bad block list and
move the data out to another stripe (or block, from the fs's viewpoint),
thus restoring the raid redundancy for that data.
Can a "hot spare" automatically turn into a "hot replace" based on some
criteria (such as a certain number of bad blocks)? Can the replaced
drive then become a "hot spare" again? It may not be perfect, but it is
still better than nothing, and useful if the admin can't replace the
drive quickly.
It strikes me that "hot replace" is much like taking one of the original
disks out of the array and replacing it with a RAID 1 pair using the original
disk and a missing second. The new disk is then added to the pair and
they are sync'ed. Finally, you remove the old disk from the RAID 1
pair, then re-assign the drive from the RAID 1 "pair" to the original RAID.
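To make the sequence concrete, here is a toy sketch of the steps as I
picture them. It is only a model: the disks are plain Python dicts, and
the name hot_replace and the dict layout are made up for illustration,
not anything from md or mdadm.

```python
# Toy model only: devices are plain dicts, not real md objects.

def hot_replace(array, old, new):
    """Swap `old` for `new` in `array` without dropping the slot."""
    slot = array.index(old)

    # 1. Replace the member with a degraded RAID 1 "pair": old + missing.
    pair = {"level": 1, "members": [old, None]}
    array[slot] = pair

    # 2. Add the new disk to the pair and sync it from the old disk.
    pair["members"][1] = new
    new["data"] = list(old["data"])

    # 3. Remove the old disk from the pair.
    pair["members"][0] = None

    # 4. Re-assign the surviving disk from the pair to the original array.
    array[slot] = pair["members"][1]
    return old


disks = [{"name": n, "data": [i, i + 1]}
         for i, n in enumerate(["sda", "sdb", "sdc"])]
spare = {"name": "sdd", "data": []}
removed = hot_replace(disks, disks[1], spare)
```

In this model the slot is occupied at every step, which is the point:
the outer array never sees a missing member.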
I may be missing something, but I think that, using the bad-block list
and the non-sync bitmaps, the only thing needed to support hot replace
is a way to turn a member drive into a degraded RAID 1 set in an atomic
action, and to reverse this action afterwards. This may also give extra
flexibility - it is conceivable that someone would want to keep the RAID
1 set afterwards as a reshape (turning a RAID 5 into a RAID 1+0, for
example).
For your non-sync bitmap, would it make sense to have a two-level
bitmap? Perhaps a coarse bitmap in blocks of 32 MB, with each entry
showing a state of in sync, out of sync, partially synced, or never
synced. Partially synced coarse blocks would have their own fine bitmap
at the 4K block size (or perhaps a bit bigger - maybe 32K or 64K would
fit well with SSD block sizes). Partially synced and out of sync blocks
would be gradually brought into sync when the disks are otherwise free,
while never synced blocks would not need to be synced at all.
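A quick back-of-the-envelope check of the sizes involved, as a sketch.
The 2 TiB device, the 2 bits per coarse entry (four states), and the
1 bit per fine block are my assumptions for the arithmetic, not figures
from the road-map.

```python
MiB = 1024 ** 2
TiB = 1024 ** 4

device_size = 2 * TiB     # assumed example device
coarse_chunk = 32 * MiB   # coarse bitmap granularity
fine_block = 4 * 1024     # fine bitmap granularity
coarse_bits = 2           # assumption: four states fit in 2 bits per entry

coarse_entries = device_size // coarse_chunk
coarse_bytes = coarse_entries * coarse_bits // 8
# Fine bitmap needed for ONE partially synced coarse chunk, 1 bit per block.
fine_bytes_per_chunk = (coarse_chunk // fine_block) // 8
# For comparison: a single flat bitmap at 4K granularity over the whole device.
flat_bytes = (device_size // fine_block) // 8

print(coarse_entries)        # 65536
print(coarse_bytes)          # 16384 (16 KiB)
print(fine_bytes_per_chunk)  # 1024 (1 KiB)
print(flat_bytes)            # 67108864 (64 MiB)
```

So under these assumptions the coarse level is tiny, and fine bitmaps
only cost space for the chunks that are actually partially synced.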
This would let you efficiently store the state during initial builds
(everything is marked "never synced" until it is used), and rebuilds are
done by marking everything as "out of sync" on the new device. The
two-level structure would let you keep fine-grained sync information
from file system discards without taking up unreasonable space.
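Roughly what I have in mind, as a minimal in-memory sketch. The class
and method names are hypothetical, the fine level uses one byte per flag
rather than real bits to keep the code short, and it does not
distinguish "never synced" from "out of sync" inside a partial chunk.

```python
from enum import Enum


class State(Enum):
    IN_SYNC = 0
    OUT_OF_SYNC = 1
    PARTIAL = 2
    NEVER_SYNCED = 3


COARSE = 32 * 1024 * 1024          # coarse chunk size in bytes
FINE = 4 * 1024                    # fine block size in bytes
FINE_PER_COARSE = COARSE // FINE   # fine blocks per coarse chunk


class TwoLevelBitmap:
    def __init__(self, device_size):
        chunks = -(-device_size // COARSE)  # ceiling division
        # Initial build: everything is "never synced" until it is used.
        self.coarse = [State.NEVER_SYNCED] * chunks
        # Fine flags exist only for PARTIAL chunks; 1 means "in sync".
        self.fine = {}

    def mark_all_out_of_sync(self):
        """Rebuild case: the new device holds nothing valid yet."""
        self.coarse = [State.OUT_OF_SYNC] * len(self.coarse)
        self.fine.clear()

    def mark_fine_synced(self, offset):
        """Record that the fine block containing byte `offset` is in sync."""
        chunk, block = divmod(offset // FINE, FINE_PER_COARSE)
        state = self.coarse[chunk]
        if state is State.IN_SYNC:
            return
        if state is not State.PARTIAL:
            self.coarse[chunk] = State.PARTIAL
            self.fine[chunk] = bytearray(FINE_PER_COARSE)
        flags = self.fine[chunk]
        flags[block] = 1
        if all(flags):
            # Whole chunk is synced: drop the fine bitmap again.
            self.coarse[chunk] = State.IN_SYNC
            del self.fine[chunk]

    def needs_sync(self):
        """Chunks a background resync would visit when the disks are idle."""
        wanted = (State.OUT_OF_SYNC, State.PARTIAL)
        return [i for i, s in enumerate(self.coarse) if s in wanted]
```

Never synced chunks never show up in needs_sync(), so a fresh array has
no resync work at all until something is written to it.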