* Thoughts on RAID nomenclature
From: Hugo Mills @ 2014-05-05 21:17 UTC
To: Btrfs mailing list
A passing remark I made on this list a day or two ago set me to
thinking. You may all want to hide behind your desks or in a similar
safe place away from the danger zone (say, Vladivostok) at this
point...
If we switch to the NcMsPp notation for replication, that
comfortably describes most of the plausible replication methods, and
I'm happy with that. But, there's a wart in the previous proposition,
which is putting "d" for 2cd to indicate that there's a DUP where
replicated chunks can go on the same device. This was the jumping-off
point to consider chunk allocation strategies in general.
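As a quick reference: NcMsPp reads as N copies, M stripes, P parities,
with "M" meaning "as many stripes as will fit". A rough, unofficial
mapping to the familiar profile names (a sketch for orientation only,
not part of the proposal itself):
PROFILES = {
    "1c":     "single",
    "2cd":    "dup (copies may share a device)",
    "2c":     "RAID-1",
    "3c":     "3-copy RAID-1",
    "1cMs":   "RAID-0",
    "2cMs":   "RAID-10",
    "1cMs1p": "RAID-5",
    "1cMs2p": "RAID-6",
}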
At the moment, we have two chunk allocation strategies: "dup" and
"spread" (for want of a better word; not to be confused with the
ssd_spread mount option, which is a whole different kettle of
borscht). The dup allocation strategy is currently only available for
2c replication, and only on single-device filesystems. When a
filesystem with dup allocation has a second device added to it, it's
automatically upgraded to spread.
The general operation of the chunk allocator is that it's asked for
locations for n chunks for a block group, and makes a decision about
where those chunks go. In the case of spread, it sorts the devices in
decreasing order of unchunked space, and allocates the n chunks in
that order. For dup, it allocates both chunks on the same device (or,
generalising, may allocate the chunks on the same device if it has
to).
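A minimal sketch of that spread decision, with made-up device names and
sizes (illustrative only; the real allocator works on btrfs device
structures, not tuples):
# Sort candidate devices by decreasing unchunked space and take the
# first n of them for the new block group's chunks.
def spread(devices, n):
    ordered = sorted(devices, key=lambda d: d[1], reverse=True)
    return [name for name, free in ordered[:n]]
print(spread([("sda", 500), ("sdb", 800), ("sdc", 200)], 2))
# -> ['sdb', 'sda']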
Now, there are other variations we could consider. For example:
- linear, which allocates on the n smallest-numbered devices with
free space. This goes halfway towards some people's goal of
minimising the file fragments damaged in a device failure on a 1c
FS (again, see (*)). [There's an open question on this one about
what happens when holes open up through, say, a balance.]
- grouped, which allows the administrator to assign groups to the
devices, and allocates each chunk from a different group. [There's
a variation here -- we could look instead at ensuring that
different _copies_ go in different groups.]
Given these four (spread, dup, linear, grouped), I think it's
fairly obvious that spread is a special case of grouped, where each
device is its own group. Then dup is the opposite of grouped (i.e. you
must have one or the other but not both). Finally, linear is a
modifier that changes the sort order.
All of these options run completely independently of the actual
replication level selected, so we could have 3c:spread,linear
(allocates on the first three devices only, until one fills up and
then it moves to the fourth device), or 2c2s:grouped, with a device
mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
different copies on different device controllers.
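To make the 2c2s:grouped example concrete, here's a toy sketch (it
assumes one device per group per stripe and skips all of the real
allocator's bookkeeping; the device-to-group mapping is the one above):
groups = {"sda": 1, "sdb": 1, "sdc": 1, "sdd": 2, "sde": 2, "sdf": 2}
def place(groups, copies=2, stripes=2):
    by_group = {}
    for dev, g in sorted(groups.items()):
        by_group.setdefault(g, []).append(dev)
    # For each stripe, take one device from each group, so the two
    # copies of that stripe sit on different controllers.
    return [[by_group[g][s] for g in sorted(by_group)][:copies]
            for s in range(stripes)]
print(place(groups))
# -> [['sda', 'sdd'], ['sdb', 'sde']]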
Does this all make sense? Are there any other options or features
that we might consider for chunk allocation at this point? Having had
a look at the chunk allocator, I think most if not all of this is
fairly easily implementable, given a sufficiently good method of
describing it all, which is what I'm trying to get to the bottom of in
this discussion.
Hugo.
(*) The missing piece here is to deal with extent allocation in a
similar way, which would offer better odds again on the number of
files damaged in a device-loss situation on a 1c FS. This is in
general a much harder problem, though. The only change we have in this
area at the moment is ssd_spread, which doesn't do very much. It also
has the potential for really killing performance and/or file
fragmentation.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- ... one ping(1) to rule them all, and in the ---
darkness bind(2) them.
* Re: Thoughts on RAID nomenclature
From: Marc MERLIN @ 2014-05-05 21:28 UTC
To: Hugo Mills, Btrfs mailing list
On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.
I generally like the idea and proposal :)
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: Thoughts on RAID nomenclature
From: Brendan Hide @ 2014-05-05 21:47 UTC
To: Hugo Mills, Btrfs mailing list
On 2014/05/05 11:17 PM, Hugo Mills wrote:
> A passing remark I made on this list a day or two ago set me to
> thinking. You may all want to hide behind your desks or in a similar
> safe place away from the danger zone (say, Vladivostok) at this
> point...
I feel like I can brave some "mild horrors". Of course, my C skills
aren't up to scratch, so it's all just bravado. ;)
> If we switch to the NcMsPp notation for replication, that
> comfortably describes most of the plausible replication methods, and
> I'm happy with that. But, there's a wart in the previous proposition,
> which is putting "d" for 2cd to indicate that there's a DUP where
> replicated chunks can go on the same device. This was the jumping-off
> point to consider chunk allocation strategies in general.
>
> At the moment, we have two chunk allocation strategies: "dup" and
> "spread" (for want of a better word; not to be confused with the
> ssd_spread mount option, which is a whole different kettle of
> borscht). The dup allocation strategy is currently only available for
> 2c replication, and only on single-device filesystems. When a
> filesystem with dup allocation has a second device added to it, it's
> automatically upgraded to spread.
I thought this step was manual - but okay! :)
> The general operation of the chunk allocator is that it's asked for
> locations for n chunks for a block group, and makes a decision about
> where those chunks go. In the case of spread, it sorts the devices in
> decreasing order of unchunked space, and allocates the n chunks in
> that order. For dup, it allocates both chunks on the same device (or,
> generalising, may allocate the chunks on the same device if it has
> to).
>
> Now, there are other variations we could consider. For example:
>
> - linear, which allocates on the n smallest-numbered devices with
> free space. This goes halfway towards some people's goal of
> minimising the file fragments damaged in a device failure on a 1c
> FS (again, see (*)). [There's an open question on this one about
> what happens when holes open up through, say, a balance.]
>
> - grouped, which allows the administrator to assign groups to the
> devices, and allocates each chunk from a different group. [There's
> a variation here -- we could look instead at ensuring that
> different _copies_ go in different groups.]
>
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.
>
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point? Having had
> a look at the chunk allocator, I think most if not all of this is
> fairly easily implementable, given a sufficiently good method of
> describing it all, which is what I'm trying to get to the bottom of in
> this discussion.
I think I get most of what you're saying. If it's not too difficult,
perhaps you could update (or duplicate to another URL) your
/btrfs-usage/ calculator to reflect the idea. It'd definitely make it
easier for everyone (including myself) to know we're on the same page.
I like the idea that the administrator would have more granular control
over where data gets allocated first or where copies "belong".
"Splicing" data to different controllers as you mentioned can help with
both redundancy and performance.
Note: I've always thought of dup as a special form of "spread" where we
just write things out twice - but yes, there's no need for it to be
compatible with any other allocation type.
> Hugo.
>
> (*) The missing piece here is to deal with extent allocation in a
> similar way, which would offer better odds again on the number of
> files damaged in a device-loss situation on a 1c FS. This is in
> general a much harder problem, though. The only change we have in this
> area at the moment is ssd_spread, which doesn't do very much. It also
> has the potential for really killing performance and/or file
> fragmentation.
>
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: Thoughts on RAID nomenclature
From: Goffredo Baroncelli @ 2014-05-06 20:16 UTC
To: Hugo Mills; +Cc: linux-btrfs
On 05/05/2014 11:17 PM, Hugo Mills wrote:
[...]
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point?
The kind of chunk (DATA, METADATA, MIXED...) and the subvolume (if/when that possibility comes).
As for how to write this information, I suggest the following options:
-[DATA|METADATA|MIXED|SYSTEM:]NcMsPp[:<drivelist>[:/subvolume/path]]
where <drivelist> is an expression of the disk allocation policy:
a)
{sdX1:W1,sdX2:W2...}
where sdX is the partition involved and W is the weight:
#1 {sda:1,sdb:1,sdc:1} means spread over all the disks
#2 {sda:1,sdb:2,sdc:3} means linear from sda to sdc
#3 {sda:1,sdb:1,sdc:2} means spread on sda and sdb (grouped),
then (when full) sdc
or
b)
#1 (sda,sdb,sdc) means spread over all the disks
#2 [sda,sdb,sdc] means linear from sda to sdc
#3 [(sda,sdb),sdc] means spread on sda and sdb (grouped)
then (when full) sdc
or
c)
#1 (sda,sdb,sdc) means spread over all the disks
#2 sda,sdb,sdc means linear from sda to sdc
#3 (sda,sdb),sdc means spread on sda and sdb (grouped)
then (when full) sdc
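A quick sketch of how form (c) could be parsed, just to check that the
nesting is unambiguous (illustrative Python only; the names are invented
here):
import re
def parse_drivelist(s):
    # Split into parenthesised spread-groups and bare devices, in order
    groups = []
    for item in re.findall(r'\([^)]*\)|[^,()]+', s):
        item = item.strip()
        if item.startswith("("):
            groups.append([d.strip() for d in item[1:-1].split(",")])
        elif item:
            groups.append([item])
    return groups
print(parse_drivelist("(sda,sdb),sdc"))
# -> [['sda', 'sdb'], ['sdc']]: spread within a group, move to the next
# group only when the previous one is full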
Some examples:
- 1c2s3b
Default allocation policy
- DATA:2c3s4b
Default allocation policy for the DATA
- METADATA:1c4s:(sda,sdb,sdc,sdd)
Spread over all 4 disks for metadata
- MIXED:1c4s:sda,sdc,sdb,sdd
Linear over the 4 disks, in the listed order, for
Data+Metadata
- DATA:1c4s:(sda,sdc),(sdb,sdd)
spread over sda and sdc, and then when these are
filled, spread over sdb and sdd
- METADATA:1c4s:(sda,sdb,sdc,sdd):/subvolume/path
Spread over all 4 disks for metadata belonging to the
subvolume /subvolume/path
I think it would be interesting to explore configurations like
- DATA:1c:(sda)
- METADATA:2c:(sdb)
where sda is bigger and sdb is faster.
Some further thoughts:
- the more I think about a per-subvolume and/or per-file allocation policy, the more I think it would be messy to manage
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Thoughts on RAID nomenclature
From: Duncan @ 2014-05-06 20:51 UTC
To: linux-btrfs
Brendan Hide posted on Mon, 05 May 2014 23:47:17 +0200 as excerpted:
>> At the moment, we have two chunk allocation strategies: "dup" and
>> "spread" (for want of a better word; not to be confused with the
>> ssd_spread mount option, which is a whole different kettle of borscht).
>> The dup allocation strategy is currently only available for 2c
>> replication, and only on single-device filesystems. When a filesystem
>> with dup allocation has a second device added to it, it's automatically
>> upgraded to spread.
>
> I thought this step was manual - but okay! :)
AFAIK, the /allocator/ automatically updates to "spread" when a second
device is added. That is, assuming previous dup metadata on a single
device, adding a device will cause new allocations to be in raid1/spread
mode, instead of dup.
What's manual, however, is that /existing/ chunk allocations don't get
automatically updated. For that, a balance must be done. But existing
allocations are by definition already allocated, so the chunk allocator
doesn't do anything with them. (A rebalance allocates new chunks,
rewriting the contents of the old chunks into the new ones before
unmapping the now-unused old chunks. So again, existing chunks stay where
they are until unmapped; it's the NEW chunks that get mapped according to
the updated allocation policy.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Thoughts on RAID nomenclature
From: Hugo Mills @ 2014-05-08 15:58 UTC
To: Btrfs mailing list
On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
> A passing remark I made on this list a day or two ago set me to
> thinking. You may all want to hide behind your desks or in a similar
> safe place away from the danger zone (say, Vladivostok) at this
> point...
>
> If we switch to the NcMsPp notation for replication, that
> comfortably describes most of the plausible replication methods, and
> I'm happy with that. But, there's a wart in the previous proposition,
> which is putting "d" for 2cd to indicate that there's a DUP where
> replicated chunks can go on the same device. This was the jumping-off
> point to consider chunk allocation strategies in general.
>
> At the moment, we have two chunk allocation strategies: "dup" and
> "spread" (for want of a better word; not to be confused with the
> ssd_spread mount option, which is a whole different kettle of
> borscht). The dup allocation strategy is currently only available for
> 2c replication, and only on single-device filesystems. When a
> filesystem with dup allocation has a second device added to it, it's
> automatically upgraded to spread.
>
> The general operation of the chunk allocator is that it's asked for
> locations for n chunks for a block group, and makes a decision about
> where those chunks go. In the case of spread, it sorts the devices in
> decreasing order of unchunked space, and allocates the n chunks in
> that order. For dup, it allocates both chunks on the same device (or,
> generalising, may allocate the chunks on the same device if it has
> to).
>
> Now, there are other variations we could consider. For example:
>
> - linear, which allocates on the n smallest-numbered devices with
> free space. This goes halfway towards some people's goal of
> minimising the file fragments damaged in a device failure on a 1c
> FS (again, see (*)). [There's an open question on this one about
> what happens when holes open up through, say, a balance.]
>
> - grouped, which allows the administrator to assign groups to the
> devices, and allocates each chunk from a different group. [There's
> a variation here -- we could look instead at ensuring that
> different _copies_ go in different groups.]
>
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.
Having thought about this some more(*), what I've described above
isn't quite right. We've got two main axes for the algorithm, and one
flag (which modifies the second axis's behaviour).
The first axis is selection of a suitable device from a list of
candidates. I've renamed things from my last email to try to make
things clearer, but example algorithms here could be:
- first: The old algorithm, which simply selects the available device
with the smallest device ID.
- even: The current algorithm, which selects the available device
with the largest free space.
- round-robin: A stateful selection, which selects the next available
device after the last one selected.
- forward: Like "first", only if a device becomes full, we don't go
back to look at it until we've exhausted all the other
devices first. This approximates a ring-buffer type
structure (except that we only fill in gaps, obviously).
- seq: Like "first", only with a user-supplied arbitrary ordering of
devices.
These can be stateful (r-r and forward are), and each function may
be called multiple times for each block group: once per chunk
required. I don't necessarily propose to implement all of these, but
at least even and first, and possibly seq, seem sensible.
After selecting a device on which to create a chunk, we then need
to winnow out the devices which are no longer suitable for selection
as a result of that allocation. This is the other axis, and the
behaviours here are:
- any: The current behaviour. Only the selected device is removed
from consideration.
- fast, slow: Like "any", except that devices which are,
respectively, slow and fast are put in a special "don't use"
group which prevents them from being allocated at all.
- grouped: All devices within the group of the selected device are
removed from consideration. This allows us to specify, for
example, that different copies in Nc configurations should go
on different controllers(**). It also allows us to specify
that metadata chunks should only go on specific devices.
- dup: may be applied to any of the other winnowing functions, and
simply forces that function to put its discarded devices to
the back of the queue of possible devices again, allowing them
to be reused.
So, current (and strongly recommended) behaviour is even,any or
even,any+dup. The linear allocation which is sometimes requested would
be first,any -- this is the same as the old allocator from a few years
ago. Allocating different copies to different controllers might be:
even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that "any",
"fast" and "slow" are all special cases of "grouped". It's worth
noting, as I've just discovered, that it's possible to configure this
system to do some _really_ silly suboptimal things with chunk
allocation.
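The shape of the resulting loop, stripped to its bones (a toy sketch of
the two-axis idea only; the full python version, posted in the follow-up
below, also tracks state, groups and dup):
def allocate_block_group(devices, n_chunks, select, winnow):
    candidates = [d for d in devices if d["free"] > 0]
    placed = []
    for _ in range(n_chunks):
        dev = select(candidates)               # first axis: pick a device
        if dev is None:
            return None                        # not enough suitable devices
        placed.append(dev["name"])
        candidates = winnow(candidates, dev)   # second axis: prune the list
    return placed
# "even,any": pick the largest free space; drop only the chosen device.
even = lambda cands: max(cands, key=lambda d: d["free"], default=None)
any_ = lambda cands, dev: [d for d in cands if d is not dev]
devs = [{"name": "sda", "free": 10}, {"name": "sdb", "free": 20},
        {"name": "sdc", "free": 15}]
print(allocate_block_group(devs, 2, even, any_))   # ['sdb', 'sdc']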
I've got the algorithms for all this coded in python (mostly --
I've got a couple of things to finish off), to verify that it turns
into a sane implementation at least. It does.
Hugo.
(*) Screams, sirens, sounds of breaking glass...
(**) There's a bit of a twist here to deal with RAID-10- or
RAID-50-like configurations, which is that we actually maintain two
lists for those, and allow different stripes in the same copy to share
a group, but not different copies of the same stripe. It approximately
doubles the size of the data structures we need while running an
allocation, but that's (a) transient, and (b) consists of lists of
pointers to device structures, so it's not exactly large.
>
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point? Having had
> a look at the chunk allocator, I think most if not all of this is
> fairly easily implementable, given a sufficiently good method of
> describing it all, which is what I'm trying to get to the bottom of in
> this discussion.
>
> Hugo.
>
> (*) The missing piece here is to deal with extent allocation in a
> similar way, which would offer better odds again on the number of
> files damaged in a device-loss situation on a 1c FS. This is in
> general a much harder problem, though. The only change we have in this
> area at the moment is ssd_spread, which doesn't do very much. It also
> has the potential for really killing performance and/or file
> fragmentation.
>
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Everything simple is false. Everything which is ---
complex is unusable.
* Re: Thoughts on RAID nomenclature
From: Hugo Mills @ 2014-05-08 21:55 UTC
To: Btrfs mailing list
On Thu, May 08, 2014 at 04:58:34PM +0100, Hugo Mills wrote:
> The first axis is selection of a suitable device from a list of
> candidates. I've renamed things from my last email to try to make
> things clearer, but example algorithms here could be:
>
> - first: The old algorithm, which simply selects the available device
> with the smallest device ID.
>
> - even: The current algorithm, which selects the available device
> with the largest free space.
>
> - round-robin: A stateful selection, which selects the next available
> device after the last one selected.
>
> - forward: Like "first", only if a device becomes full, we don't go
> back to look at it until we've exhausted all the other
> devices first. This approximates a ring-buffer type
> structure (only where we only fill in gaps, obviously).
>
> - seq: Like "first", only with a user-supplied arbitrary ordering of
> devices.
>
> These can be stateful (r-r and forward are), and each function may
> be called multiple times for each block group: once per chunk
> required. I don't necessarily propose to implement all of these, but
> at least even and first, and possibly seq, seem sensible.
>
> After selecting a device on which to create a chunk, we then need
> to winnow out the devices which are no longer suitable for selection
> as a result of that allocation. This is the other axis, and the
> behaviours here are:
>
> - any: The current behaviour. Only the selected device is removed
> from consideration.
>
> - fast, slow: Like "any", except that devices which are,
> respectively, slow and fast are put in a special "don't use"
> group which prevents them from being allocated at all.
>
> - grouped: All devices within the group of the selected device are
> removed from consideration. This allows us to specify, for
> example, that different copies in Nc configurations should go
> on different controllers(**). It also allows us to specify
> that metadata chunks should only go on specific devices.
>
> - dup: may be applied to any of the other winnowing functions, and
> simply forces that function to put its discarded devices to
> the back of the queue of possible devices again, allowing them
> to be reused.
>
> So, current (and strongly recommended) behaviour is even,any or
> even,any+dup. The linear allocation which is sometimes requested would
> be first,any -- this is the same as the old allocator from a few years
> ago. Allocating different copies to different controllers might be:
> even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that "any",
> "fast" and "slow" are all special cases of "grouped". It's worth
> noting, as I've just discovered, that it's possible to configure this
> system to do some _really_ silly suboptimal things with chunk
> allocation.
>
> I've got the algorithms for all this coded in python (mostly --
> I've got a couple of things to finish off), to verify that it turns
> into a sane implementation at least. It does.
... and here's the code. Usage:
alloc <repl> <allocator> [groups] <dev_size> ...
Examples:
$ ./alloc 2c even 1000 1000 2000 # Current behaviour RAID-1
$ ./alloc 2c even,any,dup 1000 1000 2000 # Current behaviour DUP
$ ./alloc 2cMs first 100 100 100 200 200 # RAID-10, first-found allocator
$ ./alloc 1c3s1p rr,grouped A=0,1:B=2,3 100 100 100 100
Allocators are {even,forward,first,rr},{any,grouped},{distinct,dup}.
If "grouped" is selected, you must supply a group mapping:
<groupid>=<dev>,<dev>,...:<groupid>=<dev>,<dev>,...:... A <groupid> of
"." prevents use of that group entirely.
Device sizes are scaled to a maximum of 24, for display purposes.
Hugo.
#!/usr/bin/python3
import sys
import itertools
# Device list filters: filter the sequence of eligible devices based
# on the selected device. Return the filtered sequence
# winn_any is the existing algorithm: it disallows a device from
# being used again if it's already been used once
def winn_any(seq, dev, dgrp, dup):
seq.remove(dev)
if dup:
seq.append(dev)
return seq
# winn_grouped uses group definitions, and disallows a device if any
# device in its group has already been used.
def winn_grouped(seq, dev, dgrp, dup):
removed = [d for d in seq if dgrp[d.id] == dgrp[dev.id]]
seq = [d for d in seq if dgrp[d.id] != dgrp[dev.id]]
if dup:
seq += removed
return seq
# Selection algorithms: given a list of devices and a state, return a
# device (or None), and an updated state
# seq_even is the existing algorithm: pick devices with the largest
# amount of free space
def seq_even(devs, state):
dev = None
for d in devs:
if dev is None or d.free > dev.free:
dev = d
if dev is not None:
return dev, dev.get_first_from(0), state
else:
return None, 0, state
# seq_rr does a round-robin through the free devices
def seq_rr(devs, state):
if state is None:
state = -1
dev = None
for d in devs:
# We pick a device in preference if: We don't have one at all,
# or we've got something closer to the last selection than we
        # have at the moment (where "closer" is smaller).
if (dev is None
or (state < d.id and d.id < dev.id)
or (dev.id <= state and (d.id < dev.id or d.id > state))):
dev = d
if dev is not None:
state = dev.id
return dev, dev.get_first_from(0), state
else:
return None, 0, state
# This is the old allocator: We pick the first device that has free
# space
def seq_first(devs, state):
dev = None
if devs:
dev = devs[0]
if dev is not None:
return dev, dev.get_first_from(0), state
else:
return None, 0, state
# seq_forward keeps track of where it had allocated on a device, and
# won't go back to fill in gaps on a device until it's reached the end
# of all the other devices first. If no chunks are freed, this is
# equivalent to seq_first
def seq_forward(devs, state):
if state is None:
state = [0] * len(devs)
dev = None
for x in range(2):
# As with seq_first, pick the first eligible device, but ignore
# any device which doesn't have any chunks after the last-
# allocated one.
for d in devs:
first = d.get_first_from(state[d.id])
if first is not None:
dev = d
break
if dev is not None:
break
for d in devs:
state[d.id] = 0
if dev is not None:
state[dev.id] = first
return dev, first, state
else:
return None, 0, state
def free_devs(devs):
return len([d for d in devs if d.free > 0])
def allocate(devs, sch, name):
chunks = count_chunks(sch, devs)
dev = None
# Initialise the set of devices that we wish to try: drop full
# devices immediately as uninteresting, and ones with a group ID
# of ".", which are forbidden in this scheme
seq = [d for d in devs if d.free > 0 and sch.dgrp[d.id] != "."]
# nxt is the list of devices we haven't allocated a chunk to
nxt = seq[:]
# rem is the list of devices already allocated: we recycle this
# when nxt is empty if we're doing dup
rem = []
# Generate the sequence of chunk IDs: for each stripe component,
# we run through all the copies in order. The chid is a 2-tuple,
# of (data-stripe, copy)
ids = itertools.product(range(chunks//sch.c), range(sch.c))
c_and_s = (sch.c > 1 and chunks//sch.c > 1)
# Now iterate through the sequence, picking devices in turn
# and modifying the list as appropriate
chid = (-1, -1)
while chunks > 0:
old = chid[0]
chid = next(ids)
print(" chid", chid, "old", old)
# If we're doing multiple copies and multiple data-stripes,
# and have skipped to the next data-stripe, we need to
# reinitialise our list of devices with the ones that we
# haven't used so far.
if c_and_s and chid[0] != old:
if sch.dup and len(nxt) == 0:
nxt = rem
rem = []
seq = nxt[:]
# Select a suitable device and location for the new chunk,
# based on the current list of candidates and the current
# state
print(" Eligible device sequence is",
" ".join((str(d.id) for d in seq)))
print(" nxt =", nxt)
dev, pos, sch.state = sch.sequence(seq, sch.state)
if dev is None:
print(" Failed -- no suitable device found")
break
print(" Found device", repr(dev), "at", pos, "seq", seq, "state", sch.state)
dev.set_at(pos, name)
# Filter the remaining candidates based on the selection
seq = sch.winnower(seq, dev, sch.dgrp, sch.dup)
        # With dup, the same device can be picked twice in a pass, so it
        # may already have been removed from nxt
        if dev in nxt:
            nxt.remove(dev)
print(" Filtered seq", seq, "/", nxt)
chunks -= 1
for d in devs:
if dev is None:
d.rollback()
else:
d.commit()
return dev is not None
def count_chunks(sch, devs):
# How many chunks do we want to allocate?
# First, work out the total number of stripes we have to play with
if sch.s is None:
stripe = free_devs(devs) // sch.c
# Complain if we drop below 1 data stripe with parity,
if sch.p > 0 and stripe < sch.p + 1:
return -1
# Or 2 data stripes without
if sch.p == 0 and stripe < 2:
return -1
else:
stripe = sch.s + sch.p
return stripe * sch.c
def all_alpha():
return itertools.chain(range(ord("A"), ord("Z")+1),
range(ord("a"), ord("z")+1))
class Scheme(object):
# This is just a dumb struct to hold all the details necessary for
# a single scheme
pass
def main():
sch = Scheme()
sys.argv.pop(0)
# Parse input:
sch.c, sch.s, sch.p = parse_csp(sys.argv.pop(0))
sch.sequence, sch.winnower, sch.dup, grouped = parse_alloc(sys.argv.pop(0))
sch.state = None
if grouped:
sch.dgrp = parse_groups(sys.argv.pop(0))
else:
sch.dgrp = [0] * len(sys.argv)
# Device sizes
devsz = normalise_devs([int(sz) for sz in sys.argv])
devs = [Device(i, sz) for i, sz in enumerate(devsz)]
for d in devs:
print(d)
print()
# List of chunks to allocate
names = itertools.chain(
("{0}{0}".format(chr(c)) for c in all_alpha()),
("{0}*".format(chr(c)) for c in all_alpha()),
("{0}:".format(chr(c)) for c in all_alpha()),
("{0}^".format(chr(c)) for c in all_alpha()),)
success = True
while success:
name = next(names)
print("Name is", name)
success = allocate(devs, sch, name)
for d in devs:
print(d)
print()
class Device(object):
def __init__(self, i, size):
self.id = i
self.size = size
self.free = size
self.alloc = ["--" for x in range(size)]
def get_first_from(self, pos):
for i in range(pos, self.size):
if self.alloc[i] == "--":
return i
return None
def set_at(self, pos, name):
self.alloc[pos] = "." + name
self.free -= 1
def commit(self):
for i in range(0, self.size):
if self.alloc[i][0] == ".":
self.alloc[i] = self.alloc[i][1:]
def rollback(self):
for i in range(0, self.size):
if self.alloc[i][0] == ".":
self.alloc[i] = "--"
def __str__(self):
return "{0.id:2d} |".format(self) + " ".join(self.alloc) + "|"
def __repr__(self):
return "dev{0.id}".format(self)
## Parsing code and assorted infrastructure
def parse_csp(csp):
c = "1"
s = "1"
p = "0"
c, sp = csp.split("c")
if sp != "":
s, pp = sp.split("s")
if pp != "":
p, x = pp.split("p")
if x != "":
raise ValueError("p must be the last character in the csp string")
c = int(c)
if s == "M":
s = None
else:
s = int(s)
p = int(p)
return c, s, p
WINNOWER = {"any": winn_any,
"grouped": winn_grouped,
}
SEQUENCE = {"even": seq_even,
"forward": seq_forward,
"first": seq_first,
"round-robin": seq_rr,
"rr": seq_rr, # Synonym for round-robin
}
PACKING = {"distinct": "distinct",
"dup": "dup",
}
def parse_groups(s):
gdev = {}
maxdev = 0
for g in s.split(":"):
name, devlist = g.split("=")
devs = [int(d) for d in devlist.split(",")]
gdev[name] = devs
maxdev = max(maxdev, *devs)
print("Largest device ID is", maxdev)
dgrp = [0] * (maxdev+1)
for g, ds in gdev.items():
for d in ds:
dgrp[d] = g
print(dgrp)
return dgrp
def parse_alloc(s):
opts = s.split(",")
aopt = "even"
wopt = "any"
popt = "distinct"
try:
sopt = opts[0]
wopt = opts[1]
popt = opts[2]
except IndexError:
pass
sequence = SEQUENCE[sopt]
winnower = WINNOWER[wopt]
packing = PACKING[popt]
return sequence, winnower, (packing == "dup"), (wopt == "grouped")
def normalise_devs(seq):
    maxs = max(seq)  # max() over the list also copes with a single device
if maxs > 24:
return [int(s*24/maxs+0.5) for s in seq]
else:
return seq
if __name__ == "__main__":
main()
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- There's an infinite number of monkeys outside who want to ---
talk to us about this new script for Hamlet
they've worked out!