From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: expand raid10
Date: Thu, 14 Apr 2011 10:16:43 +0200
Message-ID: <io6adm$2q5$1@dough.gmane.org>
In-Reply-To: <20110414093657.1e848952@notabene.brown>

On 14/04/2011 01:36, NeilBrown wrote:
> On Wed, 13 Apr 2011 14:34:15 +0200 David Brown<david@westcontrol.com>  wrote:
>
>> On 13/04/2011 13:17, NeilBrown wrote:
>>> On Wed, 13 Apr 2011 13:10:16 +0200 Keld Jørn Simonsen<keld@keldix.com>   wrote:
>>>
>>>> On Wed, Apr 13, 2011 at 07:47:26AM -0300, Roberto Spadim wrote:
>>>>> raid10 with other layout i could expand?
>>>>
>>>> My understanding is that you currently cannot expand raid10.
>>>> but there are things in the works. Expansion of raid10,far
>>>> was not on the list from neil, raid10,near was. But it should be fairly
>>>> easy to expand raid10,far. You can just treat one of the copies as your
>>>> reference data, and copy that data to the other raid0-like parts of the
>>>> array.  I wonder if Neil thinks he could leave that as an exercise for
>>>> me to implement... I would like to be able to combine it with a
>>>> reformat to a more robust layout of raid10,far that in some cases can survive more
>>>> than one disk failure.
>>>>
>>>
>>> I'm very happy for anyone to offer to implement anything.
>>>
>>> I will of course require the code to be of reasonable quality before I accept
>>> it, but I'm also happy to give helpful review comments and guidance.
>>>
>>> So don't wait for permission, if you want to try implementing something, just
>>> do it.
>>>
>>> Equally if there is something that I particularly want done I won't wait for
>>> ever for someone else who says they are working on it.  But RAID10 reshape is
>>> a long way from the top of my list.
>>>
>>
>> I know you have other exciting things on your to-do list - there was
>> lots in your roadmap thread a while back.
>>
>> But I'd like to put in a word for raid10,far - it is an excellent choice
>> of layout for small or medium systems with a combination of redundancy
>> and near-raid0 speed.  It is especially ideal for 2 or 3 disk systems.
>> The only disadvantage is that it can't be resized or re-shaped.  The
>> algorithm suggested by Keld sounds simple to implement, but it would
>> leave the disks in a non-redundant state during the resize/reshape.
>> That would be good enough for some uses (and better than nothing), but
>> not good enough for all uses.  It may also be scalable to include both
>> resizing (replacing each disk with a bigger one) and adding another disk
>> to the array.
>>
>> Currently, it /is/ possible to get an approximate raid10,far layout that
>> is resizeable and reshapeable.  You can divide the member disks into two
>> partitions and pair them off appropriately in mirrors.  Then use these
>> mirrors to form a degraded raid5 with "parity-last" layout and a missing
>> last disk - this is, as far as I can see, equivalent to a raid0 layout
>> but can be re-shaped to more disks and resized to use bigger disks.
>>
>
> There is an interesting idea in here....
>
> Currently if the devices in an md/raid array with redundancy (1,4,5,6,10) are
> of different sizes, they are all treated as being the size of the smallest
> device.
> However this doesn't really make sense for RAID10-far.
>
> For RAID10-far, it would make the offset where the second slab of data
> appeared not be 50% of the smallest device (in the far-2 case), but 50% of
> the current device.
>
> Then replacing all the devices in a RAID10-far with larger devices would mean
> that the size of the array could then be increased with no further data
> rearrangement.
>
> A lot of care would be needed to implement this as the assumption that all
> drives are only as big as the smallest is pretty deep.  But it could be done
> and would be sensible.
>
> That would make point 2 of http://neil.brown.name/blog/20110216044002#11 a
> lot simpler.
>

I'd like to share an idea here for a slight change in the metadata, and 
an algorithm that I think can be used for resizing raid10,far.  I 
apologise if I've got my terminology wrong, or if it sounds like I'm 
teaching my grandmother to suck eggs.

I think you want to make a distinction between the size of the 
underlying device (disk, partition, LVM volume, other md raid), the 
size of the component actually used on each device, and the position 
of the mirror copy in raid10.

I see it as perfectly reasonable to assume that the used component size 
is the same for all devices in an array, and that it only changes when 
you "grow" the array itself (assuming the underlying devices are 
bigger).  That's the way raid 1, 4, 5 and 6 work, and I think that 
assumption would help make raid10 growable.  It is also, AFAIU, the 
reason normal raid0 isn't growable - it doesn't have that restriction. 
(Maybe raid0 could be made growable for the case where the component 
sizes are all the same?)

To make raid10,far resizeable, I think the key is that instead of the 
"position of second copy" being fixed at 50% of the array component 
size, or at 50% of the underlying device size, it should be variable. 
In fact, not only should it be variable - it should be described by 
two (start, length) pairs.
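
To make that concrete, here is roughly what such per-device metadata 
could record.  This is only an illustrative sketch in Python; the 
record and field names (FarCopyGeometry and so on) are invented for 
the example and are not a proposal for the actual superblock format:

from dataclasses import dataclass

@dataclass
class FarCopyGeometry:
    dev_size: int        # size of the underlying device, in blocks
    component_size: int  # blocks of this device actually used by the array
    # Where the "far" mirror copy currently lives on this device.
    # Each pair is (start, length); an unused pair is (0, 0).
    region1: tuple
    region2: tuple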

The issue here is that to do a safe grow after resizing the underlying 
device (this being the most awkward case), the mirror copy has to be 
moved rather than deleted and re-written - otherwise you lose your 
redundancy.  But if you keep track of two valid regions, it becomes 
easier.  In the most common case, growing the disk, you would start at 
the end.  Copy a block from the end of the component part of the mirror 
to the appropriate place near the end of the new underlying device. 
Update the second (start, length) pair to include this block, and the 
first (start, length) pair to remove it.  Repeat the process until you 
have copied over everything valid and then have a device with a first 
data block, then some unused space, then a mirror block, then some 
unused space.  Once every underlying device is in this shape, then a 
"grow" is just a straight sync of the unused space (or you just mark it 
in the non-sync bitmap).
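
As a rough illustration (plain Python, nothing to do with the actual 
md code), the walk just described could look like the sketch below. 
It works on the same picture notation used further down, together 
with the hypothetical FarCopyGeometry record from the sketch above; 
a real implementation would have to commit each (start, length) 
update to stable metadata before retiring the old block:

def move_far_copy(blocks, geom, new_start):
    """Relocate the far copy from geom.region1 to new_start, one block
    at a time, so this device always holds a complete far copy in the
    union of region1 and region2."""
    old_start, length = geom.region1
    assert geom.region2 == (0, 0) and new_start > old_start
    geom.region2 = (new_start + length, 0)       # empty, grows backwards
    for i in range(length - 1, -1, -1):          # walk from the end
        blocks[new_start + i] = blocks[old_start + i]         # write new copy
        geom.region2 = (new_start + i, geom.region2[1] + 1)   # extend region 2
        blocks[old_start + i] = '.'                            # retire old copy
        geom.region1 = (old_start, i)                          # shrink region 1
    geom.region1, geom.region2 = (new_start, length), (0, 0)  # tidy up

The backwards walk matters: because the copy is moving towards the end 
of the (grown) device, each block reaches its new home before the old 
one is released, so no step leaves this device without a complete far 
copy.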

Let me try to put it into a picture.  I'll label the real data blocks 
with letters, and use "." for unused blocks.  Lower-case and upper-case 
letters are the two copies of the same data.  "*" marks blocks covered 
by the non-sync bitmap, or blocks that must be synced normally (if the 
non-sync bitmap functionality is not yet implemented).

The numbers after each disk are:

Size of underlying disk, size of used component, (start, length), (start, length)

where the two (start, length) pairs describe where the far copy 
currently lives on that disk.

We start with a raid10,far layout:

1: acegikBDFHJL		12, 6, (6, 6), (0, 0)
2: bdfhjlACEGIK		12, 6, (6, 6), (0, 0)

Then we assume disk 2 is grown (it might be an LVM logical volume, a 
lower-level raid that has been grown, or whatever).  Thus we have:

1: acegikBDFHJL			12, 6, (6, 6), (0, 0)
2: bdfhjlACEGIK......		18, 6, (6, 6), (0, 0)

Rebalancing disk 2 (which may be done as its own operation, or 
automatically during a "grow" of the whole array - assuming each 
component disk has enough space) goes through steps like this:

2: bdfhjlACEGIK......		18, 6, (6, 6), (0, 0)
2: bdfhjlACEGIK.IK...		18, 6, (6, 6), (13, 2)
2: bdfhjlACEG...IK...		18, 6, (6, 4), (13, 2)
2: bdfhjlACEG.EGIK...		18, 6, (6, 4), (11, 4)
2: bdfhjlAC...EGIK...		18, 6, (6, 2), (11, 4)
2: bdfhjlAC.ACEGIK...		18, 6, (6, 2), (9, 6)
2: bdfhjl...ACEGIK...		18, 6, (6, 0), (9, 6)
2: bdfhjl...ACEGIK...		18, 6, (9, 6), (0, 0)

With the pair now being:

1: acegikBDFHJL			12, 6, (6, 6), (0, 0)
2: bdfhjl...ACEGIK...		18, 6, (9, 6), (0, 0)
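
For what it is worth, feeding disk 2's starting picture into the 
sketch above ends up in exactly this state (again, illustration only):

blocks = list("bdfhjlACEGIK......")     # disk 2 after being grown to 18
geom = FarCopyGeometry(dev_size=18, component_size=6,
                       region1=(6, 6), region2=(0, 0))
move_far_copy(blocks, geom, new_start=geom.dev_size // 2)
print("".join(blocks), geom.region1)    # bdfhjl...ACEGIK... (9, 6)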

After growing disk 1 to 18 as well and running the same process on it, 
we have:

1: acegik...BDFHJL...		18, 6, (9, 6), (0, 0)
2: bdfhjl...ACEGIK...		18, 6, (9, 6), (0, 0)

"Grow" gives you:

1: acegik***BDFHJL***		18, 9, (9, 9), (0, 0)
2: bdfhjl***ACEGIK***		18, 9, (9, 9), (0, 0)
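
Once every device looks like that, the final "grow" is mostly 
bookkeeping plus a sync of the starred areas.  A possible sketch, 
with the same caveats as above (far-2 assumed, and the non-sync 
bitmap is still hypothetical):

def grow_component(geom):
    """Grow the used component to fill the whole device (far-2).
    Returns the (start, length) ranges of brand-new space that must
    either be synced normally or flagged in a non-sync bitmap."""
    old = geom.component_size
    new = geom.dev_size // 2          # one data half, one far-copy half
    geom.component_size = new
    geom.region1 = (new, new)         # far copy now fills the second half
    return [(old, new - old), (new + old, new - old)]   # the '*' areas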


A similar sort of sequence is easy to imagine for shrinking the 
underlying devices.  And when replacing a disk with a new one, this 
re-shape could easily be combined with a hot-replace copy.

As far as I can see, this setup with the extra metadata will hold 
everything consistent, safe and redundant during the whole operation.


mvh.,

David

