* Thoughts on RAID nomenclature
  From: Hugo Mills @ 2014-05-05 21:17 UTC
  To: Btrfs mailing list

A passing remark I made on this list a day or two ago set me to
thinking. You may all want to hide behind your desks or in a similar
safe place away from the danger zone (say, Vladivostok) at this
point...

If we switch to the NcMsPp notation for replication, that
comfortably describes most of the plausible replication methods, and
I'm happy with that. But, there's a wart in the previous proposition,
which is putting "d" for 2cd to indicate that there's a DUP where
replicated chunks can go on the same device. This was the jumping-off
point to consider chunk allocation strategies in general.

At the moment, we have two chunk allocation strategies: "dup" and
"spread" (for want of a better word; not to be confused with the
ssd_spread mount option, which is a whole different kettle of
borscht). The dup allocation strategy is currently only available for
2c replication, and only on single-device filesystems. When a
filesystem with dup allocation has a second device added to it, it's
automatically upgraded to spread.

The general operation of the chunk allocator is that it's asked for
locations for n chunks for a block group, and makes a decision about
where those chunks go. In the case of spread, it sorts the devices in
decreasing order of unchunked space, and allocates the n chunks in
that order. For dup, it allocates both chunks on the same device (or,
generalising, may allocate the chunks on the same device if it has
to).

Now, there are other variations we could consider. For example:

 - linear, which allocates on the n smallest-numbered devices with
   free space. This goes halfway towards some people's goal of
   minimising the file fragments damaged in a device failure on a 1c
   FS (again, see (*)). [There's an open question on this one about
   what happens when holes open up through, say, a balance.]

 - grouped, which allows the administrator to assign groups to the
   devices, and allocates each chunk from a different group. [There's
   a variation here -- we could look instead at ensuring that
   different _copies_ go in different groups.]

Given these four (spread, dup, linear, grouped), I think it's
fairly obvious that spread is a special case of grouped, where each
device is its own group. Then dup is the opposite of grouped (i.e. you
must have one or the other but not both). Finally, linear is a
modifier that changes the sort order.

All of these options run completely independently of the actual
replication level selected, so we could have 3c:spread,linear
(allocates on the first three devices only, until one fills up and
then it moves to the fourth device), or 2c2s:grouped, with a device
mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
different copies on different device controllers.

Does this all make sense? Are there any other options or features
that we might consider for chunk allocation at this point? Having had
a look at the chunk allocator, I think most if not all of this is
fairly easily implementable, given a sufficiently good method of
describing it all, which is what I'm trying to get to the bottom of in
this discussion.

Hugo.

(*) The missing piece here is to deal with extent allocation in a
similar way, which would offer better odds again on the number of
files damaged in a device-loss situation on a 1c FS. This is in
general a much harder problem, though. The only change we have in this
area at the moment is ssd_spread, which doesn't do very much. It also
has the potential for really killing performance and/or file
fragmentation.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- ... one ping(1) to rule them all, and in the darkness bind(2) them. ---
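[To make the four proposed strategies concrete, here is a minimal,
illustrative sketch of how each one might choose devices for a block
group needing n chunks. The device model (a dict with id, free and
group fields) and the function names are invented for illustration;
this is not the btrfs allocator, just the selection rule each strategy
implies.]

#!/usr/bin/python3
# Toy model of the four proposed strategies: spread, dup, linear
# (lowest device ID first) and grouped. Illustrative only.

def spread(devices, n):
    """Pick the n devices with the most free space, one chunk each."""
    avail = [d for d in devices if d["free"] > 0]
    return sorted(avail, key=lambda d: -d["free"])[:n]

def dup(devices, n):
    """Allow several chunks on one device: put all n chunks on the
    single device with the most free space."""
    avail = [d for d in devices if d["free"] >= n]
    return sorted(avail, key=lambda d: -d["free"])[:1] * n

def linear(devices, n):
    """Pick the n lowest-numbered devices that still have free space."""
    avail = [d for d in devices if d["free"] > 0]
    return sorted(avail, key=lambda d: d["id"])[:n]

def grouped(devices, n):
    """Pick at most one device per group, most free space first."""
    chosen, used_groups = [], set()
    for d in sorted(devices, key=lambda d: -d["free"]):
        if d["free"] > 0 and d["group"] not in used_groups:
            chosen.append(d)
            used_groups.add(d["group"])
        if len(chosen) == n:
            break
    return chosen

if __name__ == "__main__":
    devs = [{"id": i, "free": f, "group": g}
            for i, (f, g) in enumerate([(10, 1), (10, 1), (20, 2), (20, 2)])]
    for strategy in (spread, dup, linear, grouped):
        print(strategy.__name__, [d["id"] for d in strategy(devs, 2)])

[Run over four devices, two per group: spread picks the two emptiest
devices, dup stacks both chunks on one device, linear sticks to the
lowest-numbered devices, and grouped takes one device from each group.]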
* Re: Thoughts on RAID nomenclature
  From: Marc MERLIN @ 2014-05-05 21:28 UTC
  To: Hugo Mills, Btrfs mailing list

On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.

I generally like the idea and proposal :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: Thoughts on RAID nomenclature
  From: Brendan Hide @ 2014-05-05 21:47 UTC
  To: Hugo Mills, Btrfs mailing list

On 2014/05/05 11:17 PM, Hugo Mills wrote:
> A passing remark I made on this list a day or two ago set me to
> thinking. You may all want to hide behind your desks or in a similar
> safe place away from the danger zone (say, Vladivostok) at this
> point...

I feel like I can brave some "mild horrors". Of course, my C skills
aren't up to scratch, so it's all just bravado. ;)

> If we switch to the NcMsPp notation for replication, that
> comfortably describes most of the plausible replication methods, and
> I'm happy with that. But, there's a wart in the previous proposition,
> which is putting "d" for 2cd to indicate that there's a DUP where
> replicated chunks can go on the same device. This was the jumping-off
> point to consider chunk allocation strategies in general.
>
> At the moment, we have two chunk allocation strategies: "dup" and
> "spread" (for want of a better word; not to be confused with the
> ssd_spread mount option, which is a whole different kettle of
> borscht). The dup allocation strategy is currently only available for
> 2c replication, and only on single-device filesystems. When a
> filesystem with dup allocation has a second device added to it, it's
> automatically upgraded to spread.

I thought this step was manual - but okay! :)

> The general operation of the chunk allocator is that it's asked for
> locations for n chunks for a block group, and makes a decision about
> where those chunks go. In the case of spread, it sorts the devices in
> decreasing order of unchunked space, and allocates the n chunks in
> that order. For dup, it allocates both chunks on the same device (or,
> generalising, may allocate the chunks on the same device if it has
> to).
>
> Now, there are other variations we could consider. For example:
>
>  - linear, which allocates on the n smallest-numbered devices with
>    free space. This goes halfway towards some people's goal of
>    minimising the file fragments damaged in a device failure on a 1c
>    FS (again, see (*)). [There's an open question on this one about
>    what happens when holes open up through, say, a balance.]
>
>  - grouped, which allows the administrator to assign groups to the
>    devices, and allocates each chunk from a different group. [There's
>    a variation here -- we could look instead at ensuring that
>    different _copies_ go in different groups.]
>
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.
>
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point? Having had
> a look at the chunk allocator, I think most if not all of this is
> fairly easily implementable, given a sufficiently good method of
> describing it all, which is what I'm trying to get to the bottom of in
> this discussion.

I think I get most of what you're saying. If it's not too difficult,
perhaps you could update (or duplicate to another URL) your
/btrfs-usage/ calculator to reflect the idea. It'd definitely make it
easier for everyone (including myself) to know we're on the same page.

I like the idea that the administrator would have more granular control
over where data gets allocated first, or where copies "belong".
"Splicing" data to different controllers as you mentioned can help with
both redundancy and performance.

Note: I've always thought of dup as a special form of "spread" where we
just write things out twice - but yes, there's no need for it to be
compatible with any other allocation type.

> Hugo.
>
> (*) The missing piece here is to deal with extent allocation in a
> similar way, which would offer better odds again on the number of
> files damaged in a device-loss situation on a 1c FS. This is in
> general a much harder problem, though. The only change we have in this
> area at the moment is ssd_spread, which doesn't do very much. It also
> has the potential for really killing performance and/or file
> fragmentation.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
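[For context, the btrfs-usage calculator Brendan mentions estimates how
much usable space a given profile leaves on a set of unequal devices. A
rough, greedy approximation of that calculation for a plain Nc profile
(repeatedly place one chunk on each of the N devices with the most free
space) might look like the following; this is an illustrative sketch,
not the calculator's actual code.]

#!/usr/bin/python3
# Greedy estimate of usable space for an Nc profile over uneven devices:
# keep allocating one chunk to each of the N devices with the most free
# space until fewer than N devices have any space left.

def usable_space(device_sizes, copies, chunk=1):
    free = list(device_sizes)
    usable = 0
    while True:
        # Pick the `copies` devices with the most remaining space
        order = sorted(range(len(free)), key=lambda i: -free[i])[:copies]
        if len(order) < copies or free[order[-1]] < chunk:
            break
        for i in order:
            free[i] -= chunk
        usable += chunk          # one block group's worth of data
    return usable, free

if __name__ == "__main__":
    # 2 copies over 1000 + 1000 + 2000: all 2000 of data space is usable
    print(usable_space([1000, 1000, 2000], copies=2))
    # 2 copies over 100 + 100 + 500: only 200 of data fits, 300 is stranded
    print(usable_space([100, 100, 500], copies=2))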
* Re: Thoughts on RAID nomenclature
  From: Duncan @ 2014-05-06 20:51 UTC
  To: linux-btrfs

Brendan Hide posted on Mon, 05 May 2014 23:47:17 +0200 as excerpted:

>> At the moment, we have two chunk allocation strategies: "dup" and
>> "spread" (for want of a better word; not to be confused with the
>> ssd_spread mount option, which is a whole different kettle of
>> borscht). The dup allocation strategy is currently only available for
>> 2c replication, and only on single-device filesystems. When a
>> filesystem with dup allocation has a second device added to it, it's
>> automatically upgraded to spread.
>
> I thought this step was manual - but okay! :)

AFAIK, the /allocator/ automatically updates to "spread" when a second
device is added. That is, assuming previous dup metadata on a single
device, adding a device will cause new allocations to be in
raid1/spread mode, instead of dup.

What's manual, however, is that /existing/ chunk allocations don't get
automatically updated. For that, a balance must be done. But existing
allocations are by definition already allocated, so the chunk allocator
doesn't do anything with them. (A rebalance allocates new chunks,
rewriting the contents of the old chunks into the new ones before
unmapping the now unused old chunks; so again, existing chunks stay
where they are until unmapped, and it's the NEW chunks that get mapped
by the updated allocation policy.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Thoughts on RAID nomenclature
  From: Goffredo Baroncelli @ 2014-05-06 20:16 UTC
  To: Hugo Mills; +Cc: linux-btrfs

On 05/05/2014 11:17 PM, Hugo Mills wrote:
[...]
> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point?

The kind of chunk (DATA, METADATA, MIXED...) and the subvolume (when/if
this possibility comes).

As for how to write this information, I suggest the following options:

  [DATA|METADATA|MIXED|SYSTEM:]NcMsPp[:<driveslist>[:/subvolume/path]]

where <driveslist> is an expression of the disk allocation policy:

a) {sdX1:W1,sdX2:W2...} where sdX is the partition involved and W is
   the weight:
   #1 {sda:1,sdb:1,sdc:1}  means spread over all the disks
   #2 {sda:1,sdb:2,sdc:3}  means linear from sda to sdc
   #3 {sda:1,sdb:1,sdc:2}  means spread on sda and sdb (grouped), then
                           (when full) sdc

or

b) #1 (sda,sdb,sdc)    means spread over all the disks
   #2 [sda,sdb,sdc]    means linear from sda to sdc
   #3 [(sda,sdb),sdc]  means spread on sda and sdb (grouped), then
                       (when full) sdc

or

c) #1 (sda,sdb,sdc)    means spread over all the disks
   #2 sda,sdb,sdc      means linear from sda to sdc
   #3 (sda,sdb),sdc    means spread on sda and sdb (grouped), then
                       (when full) sdc

Some examples:

- 1c2s3b
  Default allocation policy
- DATA:2c3s4b
  Default allocation policy for the DATA
- METADATA:1c4s:(sda,sdb,sdc,sdd)
  Spread over all the 4 disks for metadata
- MIXED:1c4s:sda,sdc,sdb,sdd
  Linear over the 4 disks, ordered as the list, for Data+Metadata
- DATA:1c4s:(sda,sdc),(sdb,sdd)
  Spread over sda and sdc, and then, when these are filled, spread over
  sdb and sdd
- METADATA:1c4s:(sda,sdb,sdc,sdd):/subvolume/path
  Spread over all the 4 disks for metadata belonging to the subvolume
  /subvolume/path

I think it would be interesting to explore some configuration like

- DATA:1c:(sda)
- METADATA:2c:(sdb)

if sda is bigger and sdb is faster.

Some further thoughts:
- the more I think about an allocation policy on a per-subvolume and/or
  per-file basis, the more I think it would be messy to manage

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
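[As a quick sanity check on the proposed grammar, a rough, hypothetical
parser for the string format might look like this. It only handles the
a)-style weighted drive list; the bracketed b)/c) variants and the "b"
suffix used in the first two examples are not covered, and all names
here are illustrative rather than an agreed interface.]

#!/usr/bin/python3
# Sketch parser for [KIND:]NcMsPp[:{dev:weight,...}[:/subvolume/path]]
import re

KINDS = {"DATA", "METADATA", "MIXED", "SYSTEM"}

def parse_policy(spec):
    # Optional chunk-kind prefix
    kind = "ANY"
    head, _, rest = spec.partition(":")
    if head in KINDS:
        kind, spec = head, rest

    # NcMsPp part; s and p are optional, M means "as many as possible"
    m = re.match(r"(\d+)c(?:(\d+|M)s)?(?:(\d+)p)?", spec)
    if m is None:
        raise ValueError("bad NcMsPp expression: " + spec)
    copies = int(m.group(1))
    stripes = m.group(2) or "1"
    parity = int(m.group(3) or 0)
    spec = spec[m.end():].lstrip(":")

    # Optional a)-style weighted drive list: {sda:1,sdb:2,...}
    drives = {}
    dm = re.match(r"\{([^}]*)\}", spec)
    if dm:
        for item in dm.group(1).split(","):
            dev, weight = item.split(":")
            drives[dev] = int(weight)
        spec = spec[dm.end():].lstrip(":")

    # Whatever remains is the optional subvolume path
    subvol = spec or None
    return dict(kind=kind, copies=copies, stripes=stripes,
                parity=parity, drives=drives, subvol=subvol)

if __name__ == "__main__":
    print(parse_policy("METADATA:1c4s:{sda:1,sdb:1,sdc:1,sdd:1}:/subvolume/path"))
    print(parse_policy("2c2s"))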
* Re: Thoughts on RAID nomenclature
  From: Hugo Mills @ 2014-05-08 15:58 UTC
  To: Btrfs mailing list

On Mon, May 05, 2014 at 10:17:38PM +0100, Hugo Mills wrote:
> A passing remark I made on this list a day or two ago set me to
> thinking. You may all want to hide behind your desks or in a similar
> safe place away from the danger zone (say, Vladivostok) at this
> point...
>
> If we switch to the NcMsPp notation for replication, that
> comfortably describes most of the plausible replication methods, and
> I'm happy with that. But, there's a wart in the previous proposition,
> which is putting "d" for 2cd to indicate that there's a DUP where
> replicated chunks can go on the same device. This was the jumping-off
> point to consider chunk allocation strategies in general.
>
> At the moment, we have two chunk allocation strategies: "dup" and
> "spread" (for want of a better word; not to be confused with the
> ssd_spread mount option, which is a whole different kettle of
> borscht). The dup allocation strategy is currently only available for
> 2c replication, and only on single-device filesystems. When a
> filesystem with dup allocation has a second device added to it, it's
> automatically upgraded to spread.
>
> The general operation of the chunk allocator is that it's asked for
> locations for n chunks for a block group, and makes a decision about
> where those chunks go. In the case of spread, it sorts the devices in
> decreasing order of unchunked space, and allocates the n chunks in
> that order. For dup, it allocates both chunks on the same device (or,
> generalising, may allocate the chunks on the same device if it has
> to).
>
> Now, there are other variations we could consider. For example:
>
>  - linear, which allocates on the n smallest-numbered devices with
>    free space. This goes halfway towards some people's goal of
>    minimising the file fragments damaged in a device failure on a 1c
>    FS (again, see (*)). [There's an open question on this one about
>    what happens when holes open up through, say, a balance.]
>
>  - grouped, which allows the administrator to assign groups to the
>    devices, and allocates each chunk from a different group. [There's
>    a variation here -- we could look instead at ensuring that
>    different _copies_ go in different groups.]
>
> Given these four (spread, dup, linear, grouped), I think it's
> fairly obvious that spread is a special case of grouped, where each
> device is its own group. Then dup is the opposite of grouped (i.e. you
> must have one or the other but not both). Finally, linear is a
> modifier that changes the sort order.
>
> All of these options run completely independently of the actual
> replication level selected, so we could have 3c:spread,linear
> (allocates on the first three devices only, until one fills up and
> then it moves to the fourth device), or 2c2s:grouped, with a device
> mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
> different copies on different device controllers.

Having thought about this some more(*), what I've described above
isn't quite right. We've got two main axes for the algorithm, and one
flag (which modifies the second axis's behaviour).

The first axis is selection of a suitable device from a list of
candidates. I've renamed things from my last email to try to make
things clearer, but example algorithms here could be:

 - first: The old algorithm, which simply selects the available device
          with the smallest device ID.

 - even: The current algorithm, which selects the available device
         with the largest free space.

 - round-robin: A stateful selection, which selects the next available
                device after the last one selected.

 - forward: Like "first", only if a device becomes full, we don't go
            back to look at it until we've exhausted all the other
            devices first. This approximates a ring-buffer type
            structure (only where we only fill in gaps, obviously).

 - seq: Like "first", only with a user-supplied arbitrary ordering of
        devices.

These can be stateful (r-r and forward are), and each function may
be called multiple times for each block group: once per chunk
required. I don't necessarily propose to implement all of these, but
at least even and first, and possibly seq, seem sensible.

After selecting a device on which to create a chunk, we then need
to winnow out the devices which are no longer suitable for selection
as a result of that allocation. This is the other axis, and the
behaviours here are:

 - any: The current behaviour. Only the selected device is removed
        from consideration.

 - fast, slow: Like "any", except that devices which are,
               respectively, slow and fast are put in a special "don't
               use" group which prevents them from being allocated at
               all.

 - grouped: All devices within the group of the selected device are
            removed from consideration. This allows us to specify, for
            example, that different copies in Nc configurations should
            go on different controllers(**). It also allows us to
            specify that metadata chunks should only go on specific
            devices.

 - dup: may be applied to any of the other winnowing functions, and
        simply forces that function to put its discarded devices to
        the back of the queue of possible devices again, allowing them
        to be reused.

So, current (and strongly recommended) behaviour is even,any or
even,any+dup. The linear allocation which is sometimes requested would
be first,any -- this is the same as the old allocator from a few years
ago. Allocating different copies to different controllers might be:
even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that "any",
"fast" and "slow" are all special cases of "grouped". It's worth
noting, as I've just discovered, that it's possible to configure this
system to do some _really_ silly suboptimal things with chunk
allocation.

I've got the algorithms for all this coded in python (mostly --
I've got a couple of things to finish off), to verify that it turns
into a sane implementation at least. It does.

Hugo.

(*) Screams, sirens, sounds of breaking glass...

(**) There's a bit of a twist here to deal with RAID-10- or
RAID-50-like configurations, which is that we actually maintain two
lists for those, and allow different stripes in the same copy to share
a group, but not different copies of the same stripe. It approximately
doubles the size of the data structures we need while running an
allocation, but that's (a) transient, and (b) consists of lists of
pointers to device structures, so it's not exactly large.

> Does this all make sense? Are there any other options or features
> that we might consider for chunk allocation at this point? Having had
> a look at the chunk allocator, I think most if not all of this is
> fairly easily implementable, given a sufficiently good method of
> describing it all, which is what I'm trying to get to the bottom of in
> this discussion.
>
> Hugo.
>
> (*) The missing piece here is to deal with extent allocation in a
> similar way, which would offer better odds again on the number of
> files damaged in a device-loss situation on a 1c FS. This is in
> general a much harder problem, though. The only change we have in this
> area at the moment is ssd_spread, which doesn't do very much. It also
> has the potential for really killing performance and/or file
> fragmentation.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Everything simple is false. Everything which is complex is unusable. ---
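[To make the two axes slightly more concrete before the full simulator
below, here is a small, self-contained sketch (heavily simplified from
the description above, and not Hugo's simulator or btrfs code) of a
single allocation pass using the "even" selector and the "grouped"
winnower to put the two copies of a 2c block group on different
controller groups.]

#!/usr/bin/python3
# One "even,grouped" allocation pass for a 2c block group: pick the
# device with the most free space, then winnow out every device in the
# same group before picking the second copy's device.

devices = [  # (name, free GiB, controller group) -- example values
    ("sda", 120, "A"), ("sdb", 100, "A"), ("sdc", 80, "A"),
    ("sdd", 110, "B"), ("sde", 90, "B"), ("sdf", 70, "B"),
]

def allocate_2c_grouped(devs):
    candidates = list(devs)
    placement = []
    for copy in range(2):                       # two copies -> two chunks
        # "even": the device with the largest free space wins
        name, free, group = max(candidates, key=lambda d: d[1])
        placement.append((copy, name))
        # "grouped": drop the whole group of the chosen device
        candidates = [d for d in candidates if d[2] != group]
    return placement

print(allocate_2c_grouped(devices))
# Expected: copy 0 on sda (group A), copy 1 on sdd (group B)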
* Re: Thoughts on RAID nomenclature
  From: Hugo Mills @ 2014-05-08 21:55 UTC
  To: Btrfs mailing list

On Thu, May 08, 2014 at 04:58:34PM +0100, Hugo Mills wrote:
> The first axis is selection of a suitable device from a list of
> candidates. I've renamed things from my last email to try to make
> things clearer, but example algorithms here could be:
>
>  - first: The old algorithm, which simply selects the available device
>           with the smallest device ID.
>
>  - even: The current algorithm, which selects the available device
>          with the largest free space.
>
>  - round-robin: A stateful selection, which selects the next available
>                 device after the last one selected.
>
>  - forward: Like "first", only if a device becomes full, we don't go
>             back to look at it until we've exhausted all the other
>             devices first. This approximates a ring-buffer type
>             structure (only where we only fill in gaps, obviously).
>
>  - seq: Like "first", only with a user-supplied arbitrary ordering of
>         devices.
>
> These can be stateful (r-r and forward are), and each function may
> be called multiple times for each block group: once per chunk
> required. I don't necessarily propose to implement all of these, but
> at least even and first, and possibly seq, seem sensible.
>
> After selecting a device on which to create a chunk, we then need
> to winnow out the devices which are no longer suitable for selection
> as a result of that allocation. This is the other axis, and the
> behaviours here are:
>
>  - any: The current behaviour. Only the selected device is removed
>         from consideration.
>
>  - fast, slow: Like "any", except that devices which are,
>                respectively, slow and fast are put in a special
>                "don't use" group which prevents them from being
>                allocated at all.
>
>  - grouped: All devices within the group of the selected device are
>             removed from consideration. This allows us to specify,
>             for example, that different copies in Nc configurations
>             should go on different controllers(**). It also allows us
>             to specify that metadata chunks should only go on
>             specific devices.
>
>  - dup: may be applied to any of the other winnowing functions, and
>         simply forces that function to put its discarded devices to
>         the back of the queue of possible devices again, allowing
>         them to be reused.
>
> So, current (and strongly recommended) behaviour is even,any or
> even,any+dup. The linear allocation which is sometimes requested would
> be first,any -- this is the same as the old allocator from a few years
> ago. Allocating different copies to different controllers might be:
> even,grouped with groups A=sda,sdb,sdc;B=sdd,sde,sdf. Note that "any",
> "fast" and "slow" are all special cases of "grouped". It's worth
> noting, as I've just discovered, that it's possible to configure this
> system to do some _really_ silly suboptimal things with chunk
> allocation.
>
> I've got the algorithms for all this coded in python (mostly --
> I've got a couple of things to finish off), to verify that it turns
> into a sane implementation at least. It does.

... and here's the code. Usage:

   alloc <repl> <allocator> [groups] <dev_size> ...
Examples:

$ ./alloc 2c even 1000 1000 2000              # Current behaviour RAID-1
$ ./alloc 2c even,any,dup 1000 1000 2000      # Current behaviour DUP
$ ./alloc 2cMs first 100 100 100 200 200      # RAID-10, first-found allocator
$ ./alloc 1c3s1p rr,grouped A=0,1:B=2,3 100 100 100 100

Allocators are {even,forward,first,rr},{any,grouped},{distinct,dup}.
If "grouped" is selected, you must supply a group mapping:

   <groupid>=<dev>,<dev>,...:<groupid>=<dev>,<dev>,...:...

A <groupid> of "." prevents use of that group entirely.

Device sizes are scaled to a maximum of 24, for display purposes.

Hugo.

#!/usr/bin/python3

import sys
import itertools

# Device list filters: filter the sequence of eligible devices based
# on the selected device. Return the filtered sequence

# winn_any is the existing algorithm: it disallows a device from
# being used again if it's already been used once
def winn_any(seq, dev, dgrp, dup):
    seq.remove(dev)
    if dup:
        seq.append(dev)
    return seq

# winn_grouped uses group definitions, and disallows a device if any
# device in its group has already been used.
def winn_grouped(seq, dev, dgrp, dup):
    removed = [d for d in seq if dgrp[d.id] == dgrp[dev.id]]
    seq = [d for d in seq if dgrp[d.id] != dgrp[dev.id]]
    if dup:
        seq += removed
    return seq

# Selection algorithms: given a list of devices and a state, return a
# device (or None), and an updated state

# seq_even is the existing algorithm: pick devices with the largest
# amount of free space
def seq_even(devs, state):
    dev = None
    for d in devs:
        if dev is None or d.free > dev.free:
            dev = d
    if dev is not None:
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# seq_rr does a round-robin through the free devices
def seq_rr(devs, state):
    if state is None:
        state = -1
    dev = None
    for d in devs:
        # We pick a device in preference if: We don't have one at all,
        # or we've got something closer to the last selection than we
        # have at the moment (where "closer" is smaller.
        if (dev is None
                or (state < d.id and d.id < dev.id)
                or (dev.id <= state and (d.id < dev.id or d.id > state))):
            dev = d
    if dev is not None:
        state = dev.id
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# This is the old allocator: We pick the first device that has free
# space
def seq_first(devs, state):
    dev = None
    if devs:
        dev = devs[0]
    if dev is not None:
        return dev, dev.get_first_from(0), state
    else:
        return None, 0, state

# seq_forward keeps track of where it had allocated on a device, and
# won't go back to fill in gaps on a device until it's reached the end
# of all the other devices first. If no chunks are freed, this is
# equivalent to seq_first
def seq_forward(devs, state):
    if state is None:
        state = [0] * len(devs)
    dev = None
    for x in range(2):
        # As with seq_first, pick the first eligible device, but ignore
        # any device which doesn't have any chunks after the last-
        # allocated one.
        for d in devs:
            first = d.get_first_from(state[d.id])
            if first is not None:
                dev = d
                break
        if dev is not None:
            break
        for d in devs:
            state[d.id] = 0
    if dev is not None:
        state[dev.id] = first
        return dev, first, state
    else:
        return None, 0, state

def free_devs(devs):
    return len([d for d in devs if d.free > 0])

def allocate(devs, sch, name):
    chunks = count_chunks(sch, devs)
    dev = None
    # Initialise the set of devices that we wish to try: drop full
    # devices immediately as uninteresting, and ones with a group ID
    # of ".", which are forbidden in this scheme
    seq = [d for d in devs if d.free > 0 and sch.dgrp[d.id] != "."]
    # nxt is the list of devices we haven't allocated a chunk to
    nxt = seq[:]
    # rem is the list of devices already allocated: we recycle this
    # when nxt is empty if we're doing dup
    rem = []
    # Generate the sequence of chunk IDs: for each stripe component,
    # we run through all the copies in order. The chid is a 2-tuple,
    # of (data-stripe, copy)
    ids = itertools.product(range(chunks//sch.c), range(sch.c))
    c_and_s = (sch.c > 1 and chunks//sch.c > 1)
    # Now iterate through the sequence, picking devices in turn
    # and modifying the list as appropriate
    chid = (-1, -1)
    while chunks > 0:
        old = chid[0]
        chid = next(ids)
        print(" chid", chid, "old", old)
        # If we're doing multiple copies and multiple data-stripes,
        # and have skipped to the next data-stripe, we need to
        # reinitialise our list of devices with the ones that we
        # haven't used so far.
        if c_and_s and chid[0] != old:
            if sch.dup and len(nxt) == 0:
                nxt = rem
                rem = []
            seq = nxt[:]
        # Select a suitable device and location for the new chunk,
        # based on the current list of candidates and the current
        # state
        print(" Eligible device sequence is",
              " ".join((str(d.id) for d in seq)))
        print(" nxt =", nxt)
        dev, pos, sch.state = sch.sequence(seq, sch.state)
        if dev is None:
            print(" Failed -- no suitable device found")
            break
        print(" Found device", repr(dev), "at", pos,
              "seq", seq, "state", sch.state)
        dev.set_at(pos, name)
        # Filter the remaining candidates based on the selection
        seq = sch.winnower(seq, dev, sch.dgrp, sch.dup)
        nxt.remove(dev)
        print(" Filtered seq", seq, "/", nxt)
        chunks -= 1
    for d in devs:
        if dev is None:
            d.rollback()
        else:
            d.commit()
    return dev is not None

def count_chunks(sch, devs):
    # How many chunks do we want to allocate?
    # First, work out the total number of stripes we have to play with
    if sch.s is None:
        stripe = free_devs(devs) // sch.c
        # Complain if we drop below 1 data stripe with parity,
        if sch.p > 0 and stripe < sch.p + 1:
            return -1
        # Or 2 data stripes without
        if sch.p == 0 and stripe < 2:
            return -1
    else:
        stripe = sch.s + sch.p
    return stripe * sch.c

def all_alpha():
    return itertools.chain(range(ord("A"), ord("Z")+1),
                           range(ord("a"), ord("z")+1))

class Scheme(object):
    # This is just a dumb struct to hold all the details necessary for
    # a single scheme
    pass

def main():
    sch = Scheme()
    sys.argv.pop(0)
    # Parse input:
    sch.c, sch.s, sch.p = parse_csp(sys.argv.pop(0))
    sch.sequence, sch.winnower, sch.dup, grouped = parse_alloc(sys.argv.pop(0))
    sch.state = None
    if grouped:
        sch.dgrp = parse_groups(sys.argv.pop(0))
    else:
        sch.dgrp = [0] * len(sys.argv)
    # Device sizes
    devsz = normalise_devs([int(sz) for sz in sys.argv])
    devs = [Device(i, sz) for i, sz in enumerate(devsz)]
    for d in devs:
        print(d)
    print()
    # List of chunks to allocate
    names = itertools.chain(
        ("{0}{0}".format(chr(c)) for c in all_alpha()),
        ("{0}*".format(chr(c)) for c in all_alpha()),
        ("{0}:".format(chr(c)) for c in all_alpha()),
        ("{0}^".format(chr(c)) for c in all_alpha()),)
    success = True
    while success:
        name = next(names)
        print("Name is", name)
        success = allocate(devs, sch, name)
        for d in devs:
            print(d)
        print()

class Device(object):
    def __init__(self, i, size):
        self.id = i
        self.size = size
        self.free = size
        self.alloc = ["--" for x in range(size)]

    def get_first_from(self, pos):
        for i in range(pos, self.size):
            if self.alloc[i] == "--":
                return i
        return None

    def set_at(self, pos, name):
        self.alloc[pos] = "." + name
        self.free -= 1

    def commit(self):
        for i in range(0, self.size):
            if self.alloc[i][0] == ".":
                self.alloc[i] = self.alloc[i][1:]

    def rollback(self):
        for i in range(0, self.size):
            if self.alloc[i][0] == ".":
                self.alloc[i] = "--"

    def __str__(self):
        return "{0.id:2d} |".format(self) + " ".join(self.alloc) + "|"

    def __repr__(self):
        return "dev{0.id}".format(self)

## Parsing code and assorted infrastructure

def parse_csp(csp):
    c = "1"
    s = "1"
    p = "0"
    c, sp = csp.split("c")
    if sp != "":
        s, pp = sp.split("s")
        if pp != "":
            p, x = pp.split("p")
            if x != "":
                raise ValueError("p must be the last character in the csp string")
    c = int(c)
    if s == "M":
        s = None
    else:
        s = int(s)
    p = int(p)
    return c, s, p

WINNOWER = {"any": winn_any,
            "grouped": winn_grouped,
            }

SEQUENCE = {"even": seq_even,
            "forward": seq_forward,
            "first": seq_first,
            "round-robin": seq_rr,
            "rr": seq_rr,  # Synonym for round-robin
            }

PACKING = {"distinct": "distinct",
           "dup": "dup",
           }

def parse_groups(s):
    gdev = {}
    maxdev = 0
    for g in s.split(":"):
        name, devlist = g.split("=")
        devs = [int(d) for d in devlist.split(",")]
        gdev[name] = devs
        maxdev = max(maxdev, *devs)
    print("Largest device ID is", maxdev)
    dgrp = [0] * (maxdev+1)
    for g, ds in gdev.items():
        for d in ds:
            dgrp[d] = g
    print(dgrp)
    return dgrp

def parse_alloc(s):
    opts = s.split(",")
    aopt = "even"
    wopt = "any"
    popt = "distinct"
    try:
        sopt = opts[0]
        wopt = opts[1]
        popt = opts[2]
    except IndexError:
        pass
    sequence = SEQUENCE[sopt]
    winnower = WINNOWER[wopt]
    packing = PACKING[popt]
    return sequence, winnower, (packing == "dup"), (wopt == "grouped")

def normalise_devs(seq):
    maxs = max(*seq)
    if maxs > 24:
        return [int(s*24/maxs+0.5) for s in seq]
    else:
        return seq

if __name__ == "__main__":
    main()

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- There's an infinite number of monkeys outside who want to talk to us
       about this new script for Hamlet they've worked out! ---