From: NeilBrown <neilb@suse.de>
To: David Brown <david@westcontrol.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Thu, 17 Feb 2011 08:35:31 +1100
Message-ID: <20110217083531.3090a348@notabene.brown>
In-Reply-To: <ijgr9p$7v8$1@dough.gmane.org>
On Wed, 16 Feb 2011 16:42:26 +0100 David Brown <david@westcontrol.com> wrote:
> On 16/02/2011 11:27, NeilBrown wrote:
> >
> > Hi all,
> > I wrote this today and posted it at
> > http://neil.brown.name/blog/20110216044002
> >
> > I thought it might be worth posting it here too...
> >
> > NeilBrown
> >
>
>
> The bad block log will be a huge step up for reliability by making
> failures fine-grained. Occasional failures are a serious risk,
> especially with very large disks. The bad block log, especially
> combined with the "hot replace" idea, will make md raid a lot safer
> because you avoid running the array in degraded mode (except for a few
> stripes).
>
> When a block is marked as bad on a disk, is it possible to inform the
> file system that the whole stripe is considered bad? Then the
> filesystem will (I hope) add that stripe to its own bad block list, move
> the data out to another stripe (or block, from the fs's viewpoint), thus
> restoring the raid redundancy for that data.
There is no in-kernel mechanism to do this. You could possibly write a tool
which examined the bad-block-lists exported by md, and told a filesystem
about them.
It might be good to have a feature whereby, when the filesystem requests a
'read', it gets told 'here is the data, but I had trouble getting it so you
should try to save it elsewhere and never write here again'. If you can
find a filesystem developer interested in using the information I'd be
interested in trying to provide it.
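As a rough illustration of the user-space tool idea, here is a minimal
Python sketch. It assumes (this is not stated in the mail) that md exports
each member device's bad-block list as text lines of the form
"<start sector> <length in sectors>", and that the array is a RAID5-style
layout where every stripe holds one chunk per device. The function names,
the data_offset parameter and the example numbers are all hypothetical.
Telling a particular filesystem about the resulting ranges is left out,
because that interface does not exist.

```python
from typing import Iterable, List, Set, Tuple


def parse_bad_blocks(text: str) -> List[Tuple[int, int]]:
    """Parse '<start_sector> <length_in_sectors>' lines (assumed format)."""
    ranges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ranges.append((int(parts[0]), int(parts[1])))
    return ranges


def bad_stripes(ranges: Iterable[Tuple[int, int]],
                chunk_sectors: int,
                data_offset: int = 0) -> Set[int]:
    """Return the stripe numbers touched by bad ranges on one member device.

    data_offset is the assumed start of array data on the device, in sectors.
    """
    stripes: Set[int] = set()
    for start, length in ranges:
        if length <= 0:
            continue
        first = (start - data_offset) // chunk_sectors
        last = (start + length - 1 - data_offset) // chunk_sectors
        stripes.update(range(first, last + 1))
    return stripes


def stripe_to_array_range(stripe: int, chunk_sectors: int,
                          data_disks: int) -> Tuple[int, int]:
    """Map a stripe number to (start, length) in array sectors.

    Assumes a parity layout where each stripe carries data_disks data chunks.
    """
    width = chunk_sectors * data_disks
    return stripe * width, width


if __name__ == "__main__":
    listing = "1000 8\n2048 16\n"          # made-up example input
    for s in sorted(bad_stripes(parse_bad_blocks(listing), chunk_sectors=1024)):
        print(s, stripe_to_array_range(s, chunk_sectors=1024, data_disks=3))
```

The tool would then hand each (start, length) array range to whatever
per-filesystem mechanism exists for marking blocks as unusable.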
>
> Can a "hot spare" automatically turn into a "hot replace" based on some
> criteria (such as a certain number of bad blocks)? Can the replaced
> drive then become a "hot spare" again? It may not be perfect, but it is
> still better than nothing, and useful if the admin can't replace the
> drive quickly.
Possibly. This would be a job for user-space though. Maybe "mdadm --monitor"
could be given some policy such as you describe. Then it could activate a
spare as appropriate.
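A policy of that kind could look something like the following Python
sketch. Everything in it is hypothetical: the Policy class, the threshold
value and the function name are invented for illustration, and nothing here
is an existing mdadm option. It only shows the decision step, given a
bad-block count per member device and the number of idle spares.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Policy:
    # Hypothetical threshold: start a hot-replace once a device has
    # accumulated at least this many bad blocks.
    max_bad_blocks: int = 16


def pick_device_to_replace(bad_counts: Dict[str, int],
                           spares_available: int,
                           policy: Policy) -> Optional[str]:
    """Return the member device that should be hot-replaced, or None."""
    if spares_available < 1:
        return None  # nothing to replace it with
    candidates = {dev: n for dev, n in bad_counts.items()
                  if n >= policy.max_bad_blocks}
    if not candidates:
        return None
    # Worst device first; the name breaks ties so the choice is deterministic.
    return max(candidates, key=lambda dev: (candidates[dev], dev))


if __name__ == "__main__":
    counts = {"sda": 3, "sdb": 40, "sdc": 20}   # made-up counts
    print(pick_device_to_replace(counts, spares_available=1, policy=Policy()))
```

Whether the replaced drive goes back into the spare pool afterwards would
be a second policy decision of the same shape.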
>
> It strikes me that "hot replace" is much like taking one of the original disks
> out of the array and replacing it with a RAID 1 pair using the original
> disk and a missing second. The new disk is then added to the pair and
> they are sync'ed. Finally, you remove the old disk from the RAID 1
> pair, then re-assign the drive from the RAID 1 "pair" to the original RAID.
Very much. However if that process finds an unreadable block, there is
nothing it can do. By integrating into the parent array, we can easily find
that data from elsewhere.
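To make the difference concrete, here is a small Python sketch of what
"finding the data from elsewhere" means for a single-parity (RAID5-style)
stripe: the unreadable block is the XOR of all the other blocks in the
stripe. A stand-alone RAID1 copy only sees the old disk, so it has nothing
to XOR against. The function names are made up, and this is an illustration
of the principle rather than the md implementation.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(stripe: List[Optional[bytes]]) -> bytes:
    """Rebuild the one unreadable block (None) in a single-parity stripe.

    The stripe lists every block including parity; exactly one may be None.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one block")
    return xor_blocks([block for block in stripe if block is not None])


if __name__ == "__main__":
    d0, d1 = b"\x01\x02", b"\x10\x20"
    parity = xor_blocks([d0, d1])
    # Pretend d1 sits on the disk being replaced and cannot be read.
    print(recover_block([d0, None, parity]) == d1)
```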
>
> I may be missing something, but I think that, using the bad-block list
> and the non-sync bitmaps, the only thing needed to support hot replace
> is a way to turn a member drive into a degraded RAID 1 set in an atomic
> action, and to reverse this action afterwards. This may also give extra
> flexibility - it is conceivable that someone would want to keep the RAID
> 1 set afterwards as a reshape (turning a RAID 5 into a RAID 1+0, for
> example).
You could do that .... the raid1 resync would need to record bad-blocks in
the new device where badblocks are found in the old device. Then you need
the parent array to find and reconstruct all those bad blocks. It would be
do-able. I'm not sure the complexity of doing it that way is less than the
complexity of directly implementing hot-replace. But I'll keep it in mind if
the code gets too hairy.
>
> For your non-sync bitmap, would it make sense to have a two-level
> bitmap? Perhaps a coarse bitmap in blocks of 32 MB, with each entry
> showing a state of in sync, out of sync, partially synced, or never
> synced. Partially synced coarse blocks would have their own fine bitmap
> at the 4K block size (or perhaps a bit bigger - maybe 32K or 64K would
> fit well with SSD block sizes). Partially synced and out of sync blocks
> would be gradually brought into sync when the disks are otherwise free,
> while never synced blocks would not need to be synced at all.
>
> This would let you efficiently store the state during initial builds
> (everything is marked "never synced" until it is used), and rebuilds are
> done by marking everything as "out of sync" on the new device. The
> two-level structure would let you keep fine-grained sync information
> from file system discards without taking up unreasonable space.
I cannot see that this gains anything.
I need to allocate all the disk space that I might ever need for bitmaps at
the beginning. There is no sense in which I can allocate some when needed
and free it up later (like there might be in a filesystem).
So whatever granularity I need - the space must be pre-allocated.
Certainly a two-level table might be appropriate for the in-memory copy of
the bitmap. Maybe even 3 level. But I think you are talking about storing
data on disk, and I think there - only one bitmap makes sense.
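The pre-allocation argument can be checked with some quick arithmetic. The
Python sketch below uses the 32 MB coarse and 4K fine granularities from
the proposal; the 2 TiB device size, the 2 bits per coarse entry (for four
states) and the function name are assumptions made for the example. Since
any coarse block might become "partially synced", the fine bitmap has to be
reserved for the whole device, so the two-level layout costs the full
single-level bitmap plus the coarse table.

```python
def bitmap_bytes(device_bytes: int, granularity_bytes: int,
                 bits_per_entry: int = 1) -> int:
    """On-disk bytes needed for one bitmap covering the whole device."""
    entries = -(-device_bytes // granularity_bytes)   # ceiling division
    return -(-(entries * bits_per_entry) // 8)


KIB, MIB, TIB = 2**10, 2**20, 2**40
device = 2 * TIB                                      # assumed device size

fine = bitmap_bytes(device, 4 * KIB)                  # 1 bit per 4K block
coarse = bitmap_bytes(device, 32 * MIB, bits_per_entry=2)  # 4 states

print("single-level 4K bitmap:", fine // MIB, "MiB")
print("coarse 32 MB table:    ", coarse // KIB, "KiB")
print("two-level worst case:  ", (fine + coarse) / MIB, "MiB")
```

Under these assumptions the fine bitmap alone is 64 MiB and the coarse
table adds 16 KiB on top, so nothing is saved on disk; the coarse level
only helps as an in-memory index.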
??
NeilBrown