From: Gabriel <g2p.code@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH][BTRFS-PROGS] Enhance btrfs fi df
Date: Sat, 3 Nov 2012 00:14:53 +0000 (UTC)
Message-ID: <k71nlt$7go$5@ger.gmane.org>
In-Reply-To: <20121102234419.GD28864@carfax.org.uk>

On Fri, 02 Nov 2012 23:44:19 +0000, Hugo Mills wrote:
> On Fri, Nov 02, 2012 at 11:23:14PM +0000, Gabriel wrote:
>> On Fri, 02 Nov 2012 22:06:04 +0000, Hugo Mills wrote:
>> > I've not considered the full semantics of all this yet -- I'll try
>> > to do that tomorrow. However, I note that the "×2" here could become
>> > non-integer with the RAID-5/6 code (which is due Real Soon Now). In
>> > the first RAID-5/6 code drop, it won't even be simple to calculate
>> > where there are different-sized devices in the filesystem. Putting an
>> > exact figure on that number is potentially going to be awkward. I
>> > think we're going to need kernel help for working out what that number
>> > should be, in the general case.
>>
>> DUP can be nested below a device because it represents same-device
>> redundancy (purpose: survive smudges but not device failure).
>>
>> On the other hand raid levels should occupy the same space on all
>> linked devices (a necessary consequence of the guarantee that RAID5
>> can survive the loss of any device and RAID6 any two devices).
>
> No, the multiplier here is variable. Consider:
>
> 1 MiB stored in RAID-5 across 3 devices takes up 1.5 MiB -- multiplier ×1.5
> (1 MiB over 2 devices is 512 KiB, plus an additional 512 KiB for parity)
> 1 MiB stored in RAID-5 across 6 devices takes up 1.2 MiB -- multiplier ×1.2
> (1 MiB over 5 devices is 204.8 KiB, plus an additional 204.8 KiB for parity)
>
> With the (initial) proposed implementation of RAID-5, the
> stripe-width (i.e. the number of devices used for any given chunk
> allocation) will be *as many as can be allocated*. Chris confirmed
> this today on IRC. So if I have a disk array of 2T, 2T, 2T, 1T, 1T,
> 1T, then the first 1T of allocation will stripe across 6 devices,
> giving me 5 data+1 parity, or a multiplier of ×1.2. As soon as the
> smaller devices are full, the stripe width will drop to 3 devices, and
> we'll be using 2 data+1 parity allocation, or a multiplier of ×1.5 for
> any subsequent chunks. So, as more data over the first 5T is stored,
> the multiplier steadily decreases, until we fill the FS, and we get a
> multiplier of ×1.35 overall. This gets more complicated if you have
> devices of many different sizes. (Imagine 6 disks with sizes 500G, 1T,
> 1.5T, 2T, 3T, 3T).
>
> We probably can work out the current RAID overhead and feed it back
> sensibly, but it's (a) not constant as the allocation of the chunks
> increases, and (b) not trivial to compute.

All right, your example does illustrate things better. And I had no
idea about the implementation, but the as-many-stripes-as-possible
logic does make sense.

That doesn't break the sketch I made; I used RAIDn(device list)
as the block heading.

Your first example becomes:

  RAID5(disk[1-6]), up to 6⁄5×5T.

Once that is filled we add a second block:

  RAID5(disk[1-6])
    (the usual grid: free, reserved; data metadata system)
  RAID5(disk[1-3]), 3⁄2×2T more.
    (the usual grid)

For proper reporting of free space we either need the kernel
to reserve all the blocks and tell us about them, or just some
info about the kernel's policy.

RAID5 with maximum stripes and no reduced redundancy is enough
info to compute the rest in userspace. Though the block approach
will be more reliable if the kernel has to make complicated policy
decisions, like the choice to reshape after device failure.
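
To make the userspace computation concrete, here is a rough sketch.
It assumes only the policy described above (every chunk stripes across
all devices that still have free space, with a fixed number of parity
stripes) and ignores chunk granularity and metadata. The names
raid_tiers and overall_multiplier are made up for illustration and are
not part of btrfs-progs.

```python
from fractions import Fraction


def raid_tiers(sizes, parity=1):
    """Split an array into allocation tiers under a max-stripe policy.

    sizes  -- free space per device, as integers in one common unit
    parity -- parity stripes per chunk (1 for RAID5, 2 for RAID6)

    Returns a list of (stripe_width, raw_space, usable_space) tuples,
    one per block heading such as RAID5(disk[1-6]).
    ASSUMPTION: the kernel always uses every device with space left.
    """
    remaining = list(sizes)
    tiers = []
    while True:
        live = [s for s in remaining if s > 0]
        width = len(live)
        if width <= parity:
            # Not enough devices left for at least one data stripe.
            break
        # Allocate until the smallest remaining device is full.
        step = min(live)
        tiers.append((width, width * step, (width - parity) * step))
        remaining = [s - step if s > 0 else 0 for s in remaining]
    return tiers


def overall_multiplier(tiers):
    """Raw space divided by usable space once every tier is full."""
    raw = sum(t[1] for t in tiers)
    usable = sum(t[2] for t in tiers)
    return Fraction(raw, usable)


if __name__ == "__main__":
    # The 2T, 2T, 2T, 1T, 1T, 1T array from the example, sizes in GB.
    tiers = raid_tiers([2000, 2000, 2000, 1000, 1000, 1000])
    for width, raw, usable in tiers:
        print(f"RAID5 over {width} devices: {raw} raw, {usable} usable")
    print("overall multiplier:", overall_multiplier(tiers))
```

For that array the sketch gives a 6-wide tier (6000 raw, 5000 usable)
followed by a 3-wide tier (3000 raw, 2000 usable), matching the two
blocks above. Under these assumptions the full filesystem comes out at
9⁄7, about ×1.29, which is somewhat below the ×1.35 quoted above.
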
>> The two probably won't need to be represented at the same time
>> except during a reshape, because I imagine DUP gets converted to
>> RAID (1 or 5) as soon as the second device is added.
>>
>> A 1→2 reshape would look a bit like this (doing only the data column
>> and skipping totals):
>>
>>   InitialDevice
>>     Reserved 1.21TB
>>     Used     1.21TB
>>   RAID1(InitialDevice, SecondDevice)
>>     Reserved 1.31TB + 100GB
>>     Used     2× 100GB
>>
>> RAID5, RAID6: same with fractions, n+1⁄n and n+2⁄n.
>
> Except that n isn't guaranteed to be constant. That was pretty much
> my only point. Don't assume that it will be (or at the very least, be
> aware that you are assuming it is, and be prepared for inconsistencies).
>
> Hugo.