linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Thoughts on RAID nomenclature
@ 2014-05-05 21:17 Hugo Mills
  2014-05-05 21:28 ` Marc MERLIN
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Hugo Mills @ 2014-05-05 21:17 UTC (permalink / raw)
  To: Btrfs mailing list

[-- Attachment #1: Type: text/plain, Size: 3865 bytes --]

   A passing remark I made on this list a day or two ago set me to
thinking. You may all want to hide behind your desks or in a similar
safe place away from the danger zone (say, Vladivostok) at this
point...

   If we switch to the NcMsPp notation for replication, that
comfortably describes most of the plausible replication methods, and
I'm happy with that. But, there's a wart in the previous proposition,
which is putting "d" for 2cd to indicate that there's a DUP where
replicated chunks can go on the same device. This was the jumping-off
point to consider chunk allocation strategies in general.

   At the moment, we have two chunk allocation strategies: "dup" and
"spread" (for want of a better word; not to be confused with the
ssd_spread mount option, which is a whole different kettle of
borscht). The dup allocation strategy is currently only available for
2c replication, and only on single-device filesystems. When a
filesystem with dup allocation has a second device added to it, it's
automatically upgraded to spread.

   The general operation of the chunk allocator is that it's asked for
locations for n chunks for a block group, and makes a decision about
where those chunks go. In the case of spread, it sorts the devices in
decreasing order of unchunked space, and allocates the n chunks in
that order. For dup, it allocates both chunks on the same device (or,
generalising, may allocate the chunks on the same device if it has
to).

   Now, there are other variations we could consider. For example:

 - linear, which allocates on the n smallest-numbered devices with
   free space. This goes halfway towards some people's goal of
   minimising the file fragments damaged in a device failure on a 1c
   FS (again, see (*)). [There's an open question on this one about
   what happens when holes open up through, say, a balance.]

 - grouped, which allows the administrator to assign groups to the
   devices, and allocates each chunk from a different group. [There's
   a variation here -- we could look instead at ensuring that
   different _copies_ go in different groups.]

   Given these four (spread, dup, linear, grouped), I think it's
fairly obvious that spread is a special case of grouped, where each
device is its own group. Then dup is the opposite of grouped (i.e. you
must have one or the other but not both). Finally, linear is a
modifier that changes the sort order.

   All of these options run completely independently of the actual
replication level selected, so we could have 3c:spread,linear
(allocates on the first three devices only, until one fills up and
then it moves to the fourth device), or 2c2s:grouped, with a device
mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts
different copies on different device controllers.

   Does this all make sense? Are there any other options or features
that we might consider for chunk allocation at this point? Having had
a look at the chunk allocator, I think most if not all of this is
fairly easily implementable, given a sufficiently good method of
describing it all, which is what I'm trying to get to the bottom of in
this discussion.

   Hugo.

(*) The missing piece here is to deal with extent allocation in a
similar way, which would offer better odds again on the number of
files damaged in a device-loss situation on a 1c FS. This is in
general a much harder problem, though. The only change we have in this
area at the moment is ssd_spread, which doesn't do very much. It also
has the potential for really killing performance and/or file
fragmentation.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
          --- ...  one ping(1) to rule them all, and in the ---          
                         darkness bind(2) them.                          

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-05-08 21:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-05 21:17 Thoughts on RAID nomenclature Hugo Mills
2014-05-05 21:28 ` Marc MERLIN
2014-05-05 21:47 ` Brendan Hide
2014-05-06 20:51   ` Duncan
2014-05-06 20:16 ` Goffredo Baroncelli
2014-05-08 15:58 ` Hugo Mills
2014-05-08 21:55   ` Hugo Mills

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).