* Thoughts on RAID nomenclature
  From: Hugo Mills @ 2014-05-05 21:17 UTC
  To: Btrfs mailing list

A passing remark I made on this list a day or two ago set me to
thinking. You may all want to hide behind your desks or in a similar
safe place away from the danger zone (say, Vladivostok) at this
point...

If we switch to the NcMsPp notation for replication, that
comfortably describes most of the plausible replication methods, and
I'm happy with that. But, there's a wart in the previous proposition,
which is putting "d" for 2cd to indicate that there's a DUP where
replicated chunks can go on the same device. This was the jumping-off
point to consider chunk allocation strategies in general.

At the moment, we have two chunk allocation strategies: "dup" and
"spread" (for want of a better word; not to be confused with the
ssd_spread mount option, which is a whole different kettle of
borscht). The dup allocation strategy is currently only available for
2c replication, and only on single-device filesystems. When a
filesystem with dup allocation has a second device added to it, it's
automatically upgraded to spread.

The general operation of the chunk allocator is that it's asked for
locations for n chunks for a block group, and makes a decision about
where those chunks go. In the case of spread, it sorts the devices in
decreasing order of unchunked space, and allocates the n chunks in
that order. For dup, it allocates both chunks on the same device (or,
generalising, may allocate the chunks on the same device if it has
to).

Now, there are other variations we could consider. For example:

 - linear, which allocates on the n smallest-numbered devices with
   free space. This goes halfway towards some people's goal of
   minimising the file fragments damaged in a device failure on a 1c
   FS (again, see (*)). [There's an open question on this one about
   what happens when holes open up through, say, a balance.]

 - grouped, which allows the administrator to assign groups to the
   devices, and allocates each chunk from a different group. [There's
   a variation here -- we could look instead at ensuring that
   different _copies_ go in different groups.]

Given these four (spread, dup, linear, grouped), I think it's
fairly obvious that spread is a special case of grouped, where each
device is its own group. Then dup is the opposite of grouped (i.e. you
must have one or the other but not both). Finally, linear is a
modifier that changes the sort order.

All of these options run completely independently of the actual
replication level selected, so we could have 3c:spread,linear
(allocates on the first three devices only, until one fills up and
then it moves to the fourth device), or 2c2s:grouped, with a device
mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
different copies on different device controllers.

Does this all make sense? Are there any other options or features
that we might consider for chunk allocation at this point? Having had
a look at the chunk allocator, I think most if not all of this is
fairly easily implementable, given a sufficiently good method of
describing it all, which is what I'm trying to get to the bottom of in
this discussion.

Hugo.

(*) The missing piece here is to deal with extent allocation in a
similar way, which would offer better odds again on the number of
files damaged in a device-loss situation on a 1c FS. This is in
general a much harder problem, though. The only change we have in this
area at the moment is ssd_spread, which doesn't do very much. It also
has the potential for really killing performance and/or file
fragmentation.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- ... one ping(1) to rule them all, and in the darkness bind(2) them. ---
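[To make the four proposed strategies concrete, here is a minimal,
illustrative sketch of how each one might choose devices for a block
group needing n chunks. The device model (a dict with id, free and
group fields) and the function names are invented for illustration;
this is not the btrfs allocator, just the selection rule each strategy
implies.]

#!/usr/bin/python3
# Toy model of the four proposed strategies: spread, dup, linear
# (lowest device ID first) and grouped. Illustrative only.

def spread(devices, n):
    """Pick the n devices with the most free space, one chunk each."""
    avail = [d for d in devices if d["free"] > 0]
    return sorted(avail, key=lambda d: -d["free"])[:n]

def dup(devices, n):
    """Allow several chunks on one device: put all n chunks on the
    single device with the most free space."""
    avail = [d for d in devices if d["free"] >= n]
    return sorted(avail, key=lambda d: -d["free"])[:1] * n

def linear(devices, n):
    """Pick the n lowest-numbered devices that still have free space."""
    avail = [d for d in devices if d["free"] > 0]
    return sorted(avail, key=lambda d: d["id"])[:n]

def grouped(devices, n):
    """Pick at most one device per group, most free space first."""
    chosen, used_groups = [], set()
    for d in sorted(devices, key=lambda d: -d["free"]):
        if d["free"] > 0 and d["group"] not in used_groups:
            chosen.append(d)
            used_groups.add(d["group"])
        if len(chosen) == n:
            break
    return chosen

if __name__ == "__main__":
    devs = [{"id": i, "free": f, "group": g}
            for i, (f, g) in enumerate([(10, 1), (10, 1), (20, 2), (20, 2)])]
    for strategy in (spread, dup, linear, grouped):
        print(strategy.__name__, [d["id"] for d in strategy(devs, 2)])

[Run over four devices, two per group: spread picks the two emptiest
devices, dup stacks both chunks on one device, linear sticks to the
lowest-numbered devices, and grouped takes one device from each group.]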
* Re: Thoughts on RAID nomenclature
  From: Marc MERLIN @ 2014-05-05 21:28 UTC
  To: Hugo Mills, Btrfs mailing list

On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.

I generally like the idea and proposal :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: Thoughts on RAID nomenclature
  From: Brendan Hide @ 2014-05-05 21:47 UTC
  To: Hugo Mills, Btrfs mailing list

On 2014/05/05 11:17 PM, Hugo Mills wrote:
> A passing remark I made on this list a day or two ago set me to
> thinking. You may all want to hide behind your desks or in a similar
> safe place away from the danger zone (say, Vladivostok) at this
> point...

I feel like I can brave some "mild horrors". Of course, my C skills
aren't up to scratch, so it's all just bravado. ;)

> If we switch to the NcMsPp notation for replication, that
> comfortably describes most of the plausible replication methods, and
> I'm happy with that. But, there's a wart in the previous proposition,
> which is putting "d" for 2cd to indicate that there's a DUP where
> replicated chunks can go on the same device. This was the jumping-off
> point to consider chunk allocation strategies in general.
>
> At the moment, we have two chunk allocation strategies: "dup" and
> "spread" (for want of a better word; not to be confused with the
> ssd_spread mount option, which is a whole different kettle of
> borscht). The dup allocation strategy is currently only available for
> 2c replication, and only on single-device filesystems. When a
> filesystem with dup allocation has a second device added to it, it's
> automatically upgraded to spread.

I thought this step was manual - but okay! :)

> The general operation of the chunk allocator is that it's asked for
> locations for n chunks for a block group, and makes a decision about
> where those chunks go. In the case of spread, it sorts the devices in
> decreasing order of unchunked space, and allocates the n chunks in
> that order. For dup, it allocates both chunks on the same device (or,
> generalising, may allocate the chunks on the same device if it has
> to).
>
> Now, there are other variations we could consider. For example:
>
>  - linear, which allocates on the n smallest-numbered devices with
>    free space. This goes halfway towards some people's goal of
>    minimising the file fragments damaged in a device failure on a 1c
>    FS (again, see (*)). [There's an open question on this one about
>    what happens when holes open up through, say, a balance.]
>
>  - grouped, which allows the administrator to assign groups to the
>    devices, and allocates each chunk from a different group. [There's
>    a variation here -- we could look instead at ensuring that
>    different _copies_ go in different groups.]
>
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.
>
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point? Having had
> a look at the chunk allocator, I think most if not all of this is
> fairly easily implementable, given a sufficiently good method of
> describing it all, which is what I'm trying to get to the bottom of in
> this discussion.

I think I get most of what you're saying. If it's not too difficult,
perhaps you could update (or duplicate to another URL) your
/btrfs-usage/ calculator to reflect the idea. It'd definitely make it
easier for everyone (including myself) to know we're on the same page.

I like the idea that the administrator would have more granular control
over where data gets allocated first, or where copies "belong".
"Splicing" data to different controllers as you mentioned can help with
both redundancy and performance.

Note: I've always thought of dup as a special form of "spread" where we
just write things out twice - but yes, there's no need for it to be
compatible with any other allocation type.

> Hugo.
>
> (*) The missing piece here is to deal with extent allocation in a
> similar way, which would offer better odds again on the number of
> files damaged in a device-loss situation on a 1c FS. This is in
> general a much harder problem, though. The only change we have in this
> area at the moment is ssd_spread, which doesn't do very much. It also
> has the potential for really killing performance and/or file
> fragmentation.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
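[For context, the btrfs-usage calculator Brendan mentions estimates how
much usable space a given profile leaves on a set of unequal devices. A
rough, greedy approximation of that calculation for a plain Nc profile
(repeatedly place one chunk on each of the N devices with the most free
space) might look like the following; this is an illustrative sketch,
not the calculator's actual code.]

#!/usr/bin/python3
# Greedy estimate of usable space for an Nc profile over uneven devices:
# keep allocating one chunk to each of the N devices with the most free
# space until fewer than N devices have any space left.

def usable_space(device_sizes, copies, chunk=1):
    free = list(device_sizes)
    usable = 0
    while True:
        # Pick the `copies` devices with the most remaining space
        order = sorted(range(len(free)), key=lambda i: -free[i])[:copies]
        if len(order) < copies or free[order[-1]] < chunk:
            break
        for i in order:
            free[i] -= chunk
        usable += chunk          # one block group's worth of data
    return usable, free

if __name__ == "__main__":
    # 2 copies over 1000 + 1000 + 2000: all 2000 of data space is usable
    print(usable_space([1000, 1000, 2000], copies=2))
    # 2 copies over 100 + 100 + 500: only 200 of data fits, 300 is stranded
    print(usable_space([100, 100, 500], copies=2))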
* Re: Thoughts on RAID nomenclature
  From: Duncan @ 2014-05-06 20:51 UTC
  To: linux-btrfs

Brendan Hide posted on Mon, 05 May 2014 23:47:17 +0200 as excerpted:

>> At the moment, we have two chunk allocation strategies: "dup" and
>> "spread" (for want of a better word; not to be confused with the
>> ssd_spread mount option, which is a whole different kettle of
>> borscht). The dup allocation strategy is currently only available for
>> 2c replication, and only on single-device filesystems. When a
>> filesystem with dup allocation has a second device added to it, it's
>> automatically upgraded to spread.
>
> I thought this step was manual - but okay! :)

AFAIK, the /allocator/ automatically updates to "spread" when a second
device is added. That is, assuming previous dup metadata on a single
device, adding a device will cause new allocations to be in
raid1/spread mode, instead of dup.

What's manual, however, is that /existing/ chunk allocations don't get
automatically updated. For that, a balance must be done. But existing
allocations are by definition already allocated, so the chunk allocator
doesn't do anything with them. (A rebalance allocates new chunks,
rewriting the contents of the old chunks into the new ones before
unmapping the now unused old chunks; so again, existing chunks stay
where they are until unmapped, and it's the NEW chunks that get mapped
by the updated allocation policy.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Thoughts on RAID nomenclature
  From: Goffredo Baroncelli @ 2014-05-06 20:16 UTC
  To: Hugo Mills; +Cc: linux-btrfs

On 05/05/2014 11:17 PM, Hugo Mills wrote:
[...]
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point?

The kind of chunk (DATA, METADATA, MIXED...) and the subvolume (when/if
this possibility comes).

As for how to write this information, I suggest the following options:

  [DATA|METADATA|MIXED|SYSTEM:]NcMsPp[:<driveslist>[:/subvolume/path]]

where <driveslist> is an expression of the disk allocation policy:

a) {sdX1:W1,sdX2:W2...} where sdX is the partition involved and W is
   the weight:
   #1 {sda:1,sdb:1,sdc:1}  means spread over all the disks
   #2 {sda:1,sdb:2,sdc:3}  means linear from sda to sdc
   #3 {sda:1,sdb:1,sdc:2}  means spread on sda and sdb (grouped), then
                           (when full) sdc

or

b) #1 (sda,sdb,sdc)    means spread over all the disks
   #2 [sda,sdb,sdc]    means linear from sda to sdc
   #3 [(sda,sdb),sdc]  means spread on sda and sdb (grouped), then
                       (when full) sdc

or

c) #1 (sda,sdb,sdc)    means spread over all the disks
   #2 sda,sdb,sdc      means linear from sda to sdc
   #3 (sda,sdb),sdc    means spread on sda and sdb (grouped), then
                       (when full) sdc

Some examples:

- 1c2s3b
  Default allocation policy
- DATA:2c3s4b
  Default allocation policy for the DATA
- METADATA:1c4s:(sda,sdb,sdc,sdd)
  Spread over all the 4 disks for metadata
- MIXED:1c4s:sda,sdc,sdb,sdd
  Linear over the 4 disks, ordered as the list, for Data+Metadata
- DATA:1c4s:(sda,sdc),(sdb,sdd)
  Spread over sda and sdc, and then, when these are filled, spread over
  sdb and sdd
- METADATA:1c4s:(sda,sdb,sdc,sdd):/subvolume/path
  Spread over all the 4 disks for metadata belonging to the subvolume
  /subvolume/path

I think it would be interesting to explore some configuration like

- DATA:1c:(sda)
- METADATA:2c:(sdb)

if sda is bigger and sdb is faster.

Some further thoughts:
- the more I think about an allocation policy on a per-subvolume and/or
  per-file basis, the more I think it would be messy to manage

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
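[As a quick sanity check on the proposed grammar, a rough, hypothetical
parser for the string format might look like this. It only handles the
a)-style weighted drive list; the bracketed b)/c) variants and the "b"
suffix used in the first two examples are not covered, and all names
here are illustrative rather than an agreed interface.]

#!/usr/bin/python3
# Sketch parser for [KIND:]NcMsPp[:{dev:weight,...}[:/subvolume/path]]
import re

KINDS = {"DATA", "METADATA", "MIXED", "SYSTEM"}

def parse_policy(spec):
    # Optional chunk-kind prefix
    kind = "ANY"
    head, _, rest = spec.partition(":")
    if head in KINDS:
        kind, spec = head, rest

    # NcMsPp part; s and p are optional, M means "as many as possible"
    m = re.match(r"(\d+)c(?:(\d+|M)s)?(?:(\d+)p)?", spec)
    if m is None:
        raise ValueError("bad NcMsPp expression: " + spec)
    copies = int(m.group(1))
    stripes = m.group(2) or "1"
    parity = int(m.group(3) or 0)
    spec = spec[m.end():].lstrip(":")

    # Optional a)-style weighted drive list: {sda:1,sdb:2,...}
    drives = {}
    dm = re.match(r"\{([^}]*)\}", spec)
    if dm:
        for item in dm.group(1).split(","):
            dev, weight = item.split(":")
            drives[dev] = int(weight)
        spec = spec[dm.end():].lstrip(":")

    # Whatever remains is the optional subvolume path
    subvol = spec or None
    return dict(kind=kind, copies=copies, stripes=stripes,
                parity=parity, drives=drives, subvol=subvol)

if __name__ == "__main__":
    print(parse_policy("METADATA:1c4s:{sda:1,sdb:1,sdc:1,sdd:1}:/subvolume/path"))
    print(parse_policy("2c2s"))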
* Re: Thoughts on RAID nomenclature
  From: Hugo Mills @ 2014-05-08 15:58 UTC
  To: Btrfs mailing list

On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
> A passing remark I made on this list a day or two ago set me to
> thinking. You may all want to hide behind your desks or in a similar
> safe place away from the danger zone (say, Vladivostok) at this
> point...
>
> If we switch to the NcMsPp notation for replication, that
> comfortably describes most of the plausible replication methods, and
> I'm happy with that. But, there's a wart in the previous proposition,
> which is putting "d" for 2cd to indicate that there's a DUP where
> replicated chunks can go on the same device. This was the jumping-off
> point to consider chunk allocation strategies in general.
>
> At the moment, we have two chunk allocation strategies: "dup" and
> "spread" (for want of a better word; not to be confused with the
> ssd_spread mount option, which is a whole different kettle of
> borscht). The dup allocation strategy is currently only available for
> 2c replication, and only on single-device filesystems. When a
> filesystem with dup allocation has a second device added to it, it's
> automatically upgraded to spread.
>
> The general operation of the chunk allocator is that it's asked for
> locations for n chunks for a block group, and makes a decision about
> where those chunks go. In the case of spread, it sorts the devices in
> decreasing order of unchunked space, and allocates the n chunks in
> that order. For dup, it allocates both chunks on the same device (or,
> generalising, may allocate the chunks on the same device if it has
> to).
>
> Now, there are other variations we could consider. For example:
>
>  - linear, which allocates on the n smallest-numbered devices with
>    free space. This goes halfway towards some people's goal of
>    minimising the file fragments damaged in a device failure on a 1c
>    FS (again, see (*)). [There's an open question on this one about
>    what happens when holes open up through, say, a balance.]
>
>  - grouped, which allows the administrator to assign groups to the
>    devices, and allocates each chunk from a different group. [There's
>    a variation here -- we could look instead at ensuring that
>    different _copies_ go in different groups.]
>
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.

Having thought about this some more(*), what I've described above
isn't quite right. We've got two main axes for the algorithm, and one
flag (which modifies the second axis's behaviour).

The first axis is selection of a suitable device from a list of
candidates. I've renamed things from my last email to try to make
things clearer, but example algorithms here could be:

 - first: The old algorithm, which simply selects the available device
          with the smallest device ID.

 - even: The current algorithm, which selects the available device
         with the largest free space.

 - round-robin: A stateful selection, which selects the next available
                device after the last one selected.

 - forward: Like "first", only if a device becomes full, we don't go
            back to look at it until we've exhausted all the other
            devices first. This approximates a ring-buffer type
            structure (only where we only fill in gaps, obviously).

 - seq: Like "first", only with a user-supplied arbitrary ordering of
        devices.

These can be stateful (r-r and forward are), and each function may
be called multiple times for each block group: once per chunk
required. I don't necessarily propose to implement all of these, but
at least even and first, and possibly seq, seem sensible.

After selecting a device on which to create a chunk, we then need
to winnow out the devices which are no longer suitable for selection
as a result of that allocation. This is the other axis, and the
behaviours here are:

 - any: The current behaviour. Only the selected device is removed
        from consideration.

 - fast, slow: Like "any", except that devices which are,
               respectively, slow and fast are put in a special "don't
               use" group which prevents them from being allocated at
               all.

 - grouped: All devices within the group of the selected device are
            removed from consideration. This allows us to specify, for
            example, that different copies in Nc configurations should
            go on different controllers(**). It also allows us to
            specify that metadata chunks should only go on specific
            devices.

 - dup: may be applied to any of the other winnowing functions, and
        simply forces that function to put its discarded devices to
        the back of the queue of possible devices again, allowing them
        to be reused.

So, current (and strongly recommended) behaviour is even,any or
even,any+dup. The linear allocation which is sometimes requested would
be first,any -- this is the same as the old allocator from a few years
ago. Allocating different copies to different controllers might be:
even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that "any",
"fast" and "slow" are all special cases of "grouped". It's worth
noting, as I've just discovered, that it's possible to configure this
system to do some _really_ silly suboptimal things with chunk
allocation.

I've got the algorithms for all this coded in python (mostly --
I've got a couple of things to finish off), to verify that it turns
into a sane implementation at least. It does.

Hugo.

(*) Screams, sirens, sounds of breaking glass...

(**) There's a bit of a twist here to deal with RAID-10- or
RAID-50-like configurations, which is that we actually maintain two
lists for those, and allow different stripes in the same copy to share
a group, but not different copies of the same stripe. It approximately
doubles the size of the data structures we need while running an
allocation, but that's (a) transient, and (b) consists of lists of
pointers to device structures, so it's not exactly large.

> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point? Having had
> a look at the chunk allocator, I think most if not all of this is
> fairly easily implementable, given a sufficiently good method of
> describing it all, which is what I'm trying to get to the bottom of in
> this discussion.
>
> Hugo.
>
> (*) The missing piece here is to deal with extent allocation in a
> similar way, which would offer better odds again on the number of
> files damaged in a device-loss situation on a 1c FS. This is in
> general a much harder problem, though. The only change we have in this
> area at the moment is ssd_spread, which doesn't do very much. It also
> has the potential for really killing performance and/or file
> fragmentation.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Everything simple is false. Everything which is complex is unusable. ---
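[To make the two axes slightly more concrete before the full simulator
below, here is a small, self-contained sketch (heavily simplified from
the description above, and not Hugo's simulator or btrfs code) of a
single allocation pass using the "even" selector and the "grouped"
winnower to put the two copies of a 2c block group on different
controller groups.]

#!/usr/bin/python3
# One "even,grouped" allocation pass for a 2c block group: pick the
# device with the most free space, then winnow out every device in the
# same group before picking the second copy's device.

devices = [  # (name, free GiB, controller group) -- example values
    ("sda", 120, "A"), ("sdb", 100, "A"), ("sdc", 80, "A"),
    ("sdd", 110, "B"), ("sde", 90, "B"), ("sdf", 70, "B"),
]

def allocate_2c_grouped(devs):
    candidates = list(devs)
    placement = []
    for copy in range(2):                       # two copies -> two chunks
        # "even": the device with the largest free space wins
        name, free, group = max(candidates, key=lambda d: d[1])
        placement.append((copy, name))
        # "grouped": drop the whole group of the chosen device
        candidates = [d for d in candidates if d[2] != group]
    return placement

print(allocate_2c_grouped(devices))
# Expected: copy 0 on sda (group A), copy 1 on sdd (group B)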
* Re: Thoughts on RAID nomenclature
  From: Hugo Mills @ 2014-05-08 21:55 UTC
  To: Btrfs mailing list

On Thu, May 08, 2014 at 04:58:34PM +0100, Hugo Mills wrote:
> The first axis is selection of a suitable device from a list of
> candidates. I've renamed things from my last email to try to make
> things clearer, but example algorithms here could be:
>
>  - first: The old algorithm, which simply selects the available device
>           with the smallest device ID.
>
>  - even: The current algorithm, which selects the available device
>          with the largest free space.
>
>  - round-robin: A stateful selection, which selects the next available
>                 device after the last one selected.
>
>  - forward: Like "first", only if a device becomes full, we don't go
>             back to look at it until we've exhausted all the other
>             devices first. This approximates a ring-buffer type
>             structure (only where we only fill in gaps, obviously).
>
>  - seq: Like "first", only with a user-supplied arbitrary ordering of
>         devices.
>
> These can be stateful (r-r and forward are), and each function may
> be called multiple times for each block group: once per chunk
> required. I don't necessarily propose to implement all of these, but
> at least even and first, and possibly seq, seem sensible.
>
> After selecting a device on which to create a chunk, we then need
> to winnow out the devices which are no longer suitable for selection
> as a result of that allocation. This is the other axis, and the
> behaviours here are:
>
>  - any: The current behaviour. Only the selected device is removed
>         from consideration.
>
>  - fast, slow: Like "any", except that devices which are,
>                respectively, slow and fast are put in a special
>                "don't use" group which prevents them from being
>                allocated at all.
>
>  - grouped: All devices within the group of the selected device are
>             removed from consideration. This allows us to specify,
>             for example, that different copies in Nc configurations
>             should go on different controllers(**). It also allows us
>             to specify that metadata chunks should only go on
>             specific devices.
>
>  - dup: may be applied to any of the other winnowing functions, and
>         simply forces that function to put its discarded devices to
>         the back of the queue of possible devices again, allowing
>         them to be reused.
>
> So, current (and strongly recommended) behaviour is even,any or
> even,any+dup. The linear allocation which is sometimes requested would
> be first,any -- this is the same as the old allocator from a few years
> ago. Allocating different copies to different controllers might be:
> even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that "any",
> "fast" and "slow" are all special cases of "grouped". It's worth
> noting, as I've just discovered, that it's possible to configure this
> system to do some _really_ silly suboptimal things with chunk
> allocation.
>
> I've got the algorithms for all this coded in python (mostly --
> I've got a couple of things to finish off), to verify that it turns
> into a sane implementation at least. It does.

... and here's the code. Usage:

   alloc <repl> <allocator> [groups] <dev_size> ...
Examples:

$ ./alloc 2c even 1000 1000 2000              # Current behaviour RAID-1
$ ./alloc 2c even,any,dup 1000 1000 2000      # Current behaviour DUP
$ ./alloc 2cMs first 100 100 100 200 200      # RAID-10, first-found allocator
$ ./alloc 1c3s1p rr,grouped A=0,1:B=2,3 100 100 100 100

Allocators are {even,forward,first,rr},{any,grouped},{distinct,dup}.
If "grouped" is selected, you must supply a group mapping:

   <groupid>=<dev>,<dev>,...:<groupid>=<dev>,<dev>,...:...

A <groupid> of "." prevents use of that group entirely.

Device sizes are scaled to a maximum of 24, for display purposes.

Hugo.

#!/usr/bin/python3

import sys
import itertools

# Device list filters: filter the sequence of eligible devices based
# on the selected device. Return the filtered sequence

# winn_any is the existing algorithm: it disallows a device from
# being used again if it's already been used once
def winn_any(seq, dev, dgrp, dup):
    seq.remove(dev)
    if dup:
        seq.append(dev)
    return seq

# winn_grouped uses group definitions, and disallows a device if any
# device in its group has already been used.
def winn_grouped(seq, dev, dgrp, dup):
    removed = [d for d in seq if dgrp[d.id] == dgrp[dev.id]]
    seq = [d for d in seq if dgrp[d.id] != dgrp[dev.id]]
    if dup:
        seq += removed
    return seq

# Selection algorithms: given a list of devices and a state, return a
# device (or None), and an updated state

# seq_even is the existing algorithm: pick devices with the largest
# amount of free space
def seq_even(devs, state):
    dev = None
    for d in devs:
        if dev is None or d.free > dev.free:
            dev = d
    if dev is not None:
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# seq_rr does a round-robin through the free devices
def seq_rr(devs, state):
    if state is None:
        state = -1
    dev = None
    for d in devs:
        # We pick a device in preference if: We don't have one at all,
        # or we've got something closer to the last selection than we
        # have at the moment (where "closer" is smaller.
        if (dev is None
                or (state < d.id and d.id < dev.id)
                or (dev.id <= state and (d.id < dev.id or d.id > state))):
            dev = d
    if dev is not None:
        state = dev.id
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# This is the old allocator: We pick the first device that has free
# space
def seq_first(devs, state):
    dev = None
    if devs:
        dev = devs[0]
    if dev is not None:
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# seq_forward keeps track of where it had allocated on a device, and
# won't go back to fill in gaps on a device until it's reached the end
# of all the other devices first. If no chunks are freed, this is
# equivalent to seq_first
def seq_forward(devs, state):
    if state is None:
        state = [0] * len(devs)
    dev = None
    for x in range(2):
        # As with seq_first, pick the first eligible device, but ignore
        # any device which doesn't have any chunks after the last-
        # allocated one.
        for d in devs:
            first = d.get_first_from(state[d.id])
            if first is not None:
                dev = d
                break
        if dev is not None:
            break
        for d in devs:
            state[d.id] = 0
    if dev is not None:
        state[dev.id] = first
        return dev, first, state
    else:
        return None, 0, state

def free_devs(devs):
    return len([d for d in devs if d.free > 0])

def allocate(devs, sch, name):
    chunks = count_chunks(sch, devs)
    dev = None
    # Initialise the set of devices that we wish to try: drop full
    # devices immediately as uninteresting, and ones with a group ID
    # of ".", which are forbidden in this scheme
    seq = [d for d in devs if d.free > 0 and sch.dgrp[d.id] != "."]
    # nxt is the list of devices we haven't allocated a chunk to
    nxt = seq[:]
    # rem is the list of devices already allocated: we recycle this
    # when nxt is empty if we're doing dup
    rem = []
    # Generate the sequence of chunk IDs: for each stripe component,
    # we run through all the copies in order. The chid is a 2-tuple,
    # of (data-stripe, copy)
    ids = itertools.product(range(chunks//sch.c), range(sch.c))
    c_and_s = (sch.c > 1 and chunks//sch.c > 1)
    # Now iterate through the sequence, picking devices in turn
    # and modifying the list as appropriate
    chid = (-1, -1)
    while chunks > 0:
        old = chid[0]
        chid = next(ids)
        print(" chid", chid, "old", old)
        # If we're doing multiple copies and multiple data-stripes,
        # and have skipped to the next data-stripe, we need to
        # reinitialise our list of devices with the ones that we
        # haven't used so far.
        if c_and_s and chid[0] != old:
            if sch.dup and len(nxt) == 0:
                nxt = rem
                rem = []
            seq = nxt[:]
        # Select a suitable device and location for the new chunk,
        # based on the current list of candidates and the current
        # state
        print(" Eligible device sequence is",
              " ".join((str(d.id) for d in seq)))
        print(" nxt =", nxt)
        dev, pos, sch.state = sch.sequence(seq, sch.state)
        if dev is None:
            print(" Failed -- no suitable device found")
            break
        print(" Found device", repr(dev), "at", pos,
              "seq", seq, "state", sch.state)
        dev.set_at(pos, name)
        # Filter the remaining candidates based on the selection
        seq = sch.winnower(seq, dev, sch.dgrp, sch.dup)
        nxt.remove(dev)
        print(" Filtered seq", seq, "/", nxt)
        chunks -= 1
    for d in devs:
        if dev is None:
            d.rollback()
        else:
            d.commit()
    return dev is not None

def count_chunks(sch, devs):
    # How many chunks do we want to allocate?
    # First, work out the total number of stripes we have to play with
    if sch.s is None:
        stripe = free_devs(devs) // sch.c
        # Complain if we drop below 1 data stripe with parity,
        if sch.p > 0 and stripe < sch.p + 1:
            return -1
        # Or 2 data stripes without
        if sch.p == 0 and stripe < 2:
            return -1
    else:
        stripe = sch.s + sch.p
    return stripe * sch.c

def all_alpha():
    return itertools.chain(range(ord("A"), ord("Z")+1),
                           range(ord("a"), ord("z")+1))

class Scheme(object):
    # This is just a dumb struct to hold all the details necessary for
    # a single scheme
    pass

def main():
    sch = Scheme()
    sys.argv.pop(0)
    # Parse input:
    sch.c, sch.s, sch.p = parse_csp(sys.argv.pop(0))
    sch.sequence, sch.winnower, sch.dup, grouped = parse_alloc(sys.argv.pop(0))
    sch.state = None
    if grouped:
        sch.dgrp = parse_groups(sys.argv.pop(0))
    else:
        sch.dgrp = [0] * len(sys.argv)
    # Device sizes
    devsz = normalise_devs([int(sz) for sz in sys.argv])
    devs = [Device(i, sz) for i, sz in enumerate(devsz)]
    for d in devs:
        print(d)
    print()
    # List of chunks to allocate
    names = itertools.chain(
        ("{0}{0}".format(chr(c)) for c in all_alpha()),
        ("{0}*".format(chr(c)) for c in all_alpha()),
        ("{0}:".format(chr(c)) for c in all_alpha()),
        ("{0}^".format(chr(c)) for c in all_alpha()),)
    success = True
    while success:
        name = next(names)
        print("Name is", name)
        success = allocate(devs, sch, name)
        for d in devs:
            print(d)
        print()

class Device(object):
    def __init__(self, i, size):
        self.id = i
        self.size = size
        self.free = size
        self.alloc = ["--" for x in range(size)]

    def get_first_from(self, pos):
        for i in range(pos, self.size):
            if self.alloc[i] == "--":
                return i
        return None

    def set_at(self, pos, name):
        self.alloc[pos] = "." + name
        self.free -= 1

    def commit(self):
        for i in range(0, self.size):
            if self.alloc[i][0] == ".":
                self.alloc[i] = self.alloc[i][1:]

    def rollback(self):
        for i in range(0, self.size):
            if self.alloc[i][0] == ".":
                self.alloc[i] = "--"

    def __str__(self):
        return "{0.id:2d} |".format(self) + " ".join(self.alloc) + "|"

    def __repr__(self):
        return "dev{0.id}".format(self)

## Parsing code and assorted infrastructure

def parse_csp(csp):
    c = "1"
    s = "1"
    p = "0"
    c, sp = csp.split("c")
    if sp != "":
        s, pp = sp.split("s")
        if pp != "":
            p, x = pp.split("p")
            if x != "":
                raise ValueError("p must be the last character in the csp string")
    c = int(c)
    if s == "M":
        s = None
    else:
        s = int(s)
    p = int(p)
    return c, s, p

WINNOWER = {"any": winn_any,
            "grouped": winn_grouped,
            }

SEQUENCE = {"even": seq_even,
            "forward": seq_forward,
            "first": seq_first,
            "round-robin": seq_rr,
            "rr": seq_rr,  # Synonym for round-robin
            }

PACKING = {"distinct": "distinct",
           "dup": "dup",
           }

def parse_groups(s):
    gdev = {}
    maxdev = 0
    for g in s.split(":"):
        name, devlist = g.split("=")
        devs = [int(d) for d in devlist.split(",")]
        gdev[name] = devs
        maxdev = max(maxdev, *devs)
    print("Largest device ID is", maxdev)
    dgrp = [0] * (maxdev+1)
    for g, ds in gdev.items():
        for d in ds:
            dgrp[d] = g
    print(dgrp)
    return dgrp

def parse_alloc(s):
    opts = s.split(",")
    aopt = "even"
    wopt = "any"
    popt = "distinct"
    try:
        sopt = opts[0]
        wopt = opts[1]
        popt = opts[2]
    except IndexError:
        pass
    sequence = SEQUENCE[sopt]
    winnower = WINNOWER[wopt]
    packing = PACKING[popt]
    return sequence, winnower, (packing == "dup"), (wopt == "grouped")

def normalise_devs(seq):
    maxs = max(*seq)
    if maxs > 24:
        return [int(s*24/maxs+0.5) for s in seq]
    else:
        return seq

if __name__ == "__main__":
    main()

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- There's an infinite number of monkeys outside who want to talk to us
       about this new script for Hamlet they've worked out! ---