Re: How to properly and efficiently balance RAID6 after more drives are added?


From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: How to properly and efficiently balance RAID6 after more drives are added?
Date: Thu, 3 Sep 2015 02:22:23 +0000 (UTC)
Message-ID: <pan$45a25$fd021e31$91d1cbe2$220454a1@cox.net>
In-Reply-To: 55E6F51B.3000007@netcologne.de

Christian Rohmann posted on Wed, 02 Sep 2015 15:09:47 +0200 as excerpted:

> Hey Hugo,
> 
> thanks for the quick response.
> 
> On 09/02/2015 01:30 PM, Hugo Mills wrote:
>> You had some data on the first 8 drives with 6 data+2 parity, then
>> added four more. From that point on, you were adding block groups with
>> 10 data+2 parity. At some point, the first 8 drives became full, and
>> then new block groups have been added only to the new drives, using 2
>> data+2 parity.
> 
> Even though the old 8 drive RAID6 was not full yet? Read: There was
> still some terabytes of free space.

At this point we're primarily guessing (unless you want to dive deep into 
btrfs-debug or the like), because the results you posted are from after 
you added the set of four new devices to the existing eight.  We don't 
have the btrfs fi show and df from before you added the new devices.

But what we /do/ know from what you posted (from after the add) is that
the previously existing devices are "100% chunk-allocated": size 3.64 TiB,
used 3.64 TiB, on each of the first eight devices.

I don't know how much of the wiki (the user docs on it) you've read and/or
understood, but for many people it takes a while to really understand a
few major differences between btrfs and most other filesystems.

1) Btrfs separates data and metadata into separate allocations, 
allocating, tracking and reporting them separately.  While some 
filesystems do allocate separately, few expose the separate data and 
metadata allocation detail to the user.

2) Btrfs allocates and uses space in two steps, first allocating/
reserving relatively large "chunks" from free-space into separate data 
and metadata chunks, then using space from these chunk allocations as 
needed, until they're full and more must be allocated.  Nominal[1] chunk 
size is 1 GiB for data, 256 MiB for metadata.

It's worth noting that for striped raid (with or without parity, so 
raid0,5,6, with parity strips taken from what would be the raid0 strips 
as appropriate), btrfs allocates a full chunk strip on each available 
device, so nominal raid6 strip allocation on eight devices would be a 6 
GiB data plus 2 GiB parity stripe (8x1GiB strips per stripe), while 
metadata would be 1.5 GiB metadata (6x256MiB) plus half a GiB parity 
(2x256MiB) (total of 8x256MiB strips per stripe).
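
The arithmetic above can be sketched in a few lines of Python.  This is
only an illustration: the function name and parameters are made up for
the example, and it assumes one strip per device with two of the strips
used for parity, as described above.

```python
def raid6_stripe(num_devices, strip_gib):
    """Return (data_gib, parity_gib, total_gib) for one raid6 stripe.

    Assumes one strip of strip_gib on every device, two of them parity.
    """
    parity_strips = 2
    if num_devices <= parity_strips:
        raise ValueError("raid6 needs more devices than parity strips")
    data_strips = num_devices - parity_strips
    return (data_strips * strip_gib,
            parity_strips * strip_gib,
            num_devices * strip_gib)


# Eight devices, nominal 1 GiB data strips and 256 MiB metadata strips.
data_stripe = raid6_stripe(8, 1.0)
metadata_stripe = raid6_stripe(8, 0.25)
print(data_stripe)      # 6 GiB data + 2 GiB parity, 8 GiB total
print(metadata_stripe)  # 1.5 GiB metadata + 0.5 GiB parity, 2 GiB total
```

The same function gives 10 GiB data plus 2 GiB parity for twelve devices,
and 2 GiB data plus 2 GiB parity for four.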

Again, most filesystems don't allocate in chunks like this, at least for
data (they often will allocate metadata in chunks of some size, in
order to keep it grouped relatively close together, but that level of
detail isn't shown to the user, and because metadata is typically a small
fraction of data, it can simply be included in the used figure as soon as
allocated and still disappear in the rounding error).  What they report
as free space is thus available unallocated space that should, within
rounding error, be available for data.

3) Up until a few kernel cycles ago, btrfs could and would automatically 
allocate chunks as needed, but wouldn't deallocate them when they 
emptied.  Once they were allocated for data or metadata, that's how they 
stayed allocated, unless/until the user did a balance manually, at which 
point the chunk rewrite would consolidate the used space and free any 
unused chunk-space back to the unallocated space pool.

The result was that given normal usage writing and deleting data, over 
time, all unallocated space would typically end up allocated as data 
chunks, such that at some point the filesystem would run out of metadata 
space and need to allocate more metadata chunks, but couldn't, because of 
all those extra partially to entirely empty data chunks that were 
allocated and never freed.

Since IIRC 3.17 or so (kernel cycle from unverified memory, but that 
should be close), btrfs will automatically deallocate chunks if they're 
left entirely empty, so the problem has disappeared to a large extent, 
tho it's still possible to eventually end up with a bunch of not-quite-
empty data chunks, that require a manual balance to consolidate and clean 
up.
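
To illustrate the difference between the automatic cleanup and a manual
balance, here is a toy model.  It is not how the kernel actually does it;
the function names and the example chunk list are invented, and each
chunk is represented only by how many MiB of it are in use.

```python
CHUNK_MIB = 1024  # nominal 1 GiB data chunk, in MiB


def reclaim_empty(chunks):
    """Automatic cleanup: drop only the chunks with nothing in them."""
    return [used for used in chunks if used > 0]


def balance(chunks):
    """Manual balance: repack used space into as few chunks as possible."""
    full, remainder = divmod(sum(chunks), CHUNK_MIB)
    packed = [CHUNK_MIB] * full
    if remainder:
        packed.append(remainder)
    return packed


# Hypothetical usage per data chunk, in MiB: two empty, three nearly
# empty, one full.
chunks = [0, 0, 100, 200, 50, 1024]
after_reclaim = reclaim_empty(chunks)
after_balance = balance(chunks)
print(len(chunks), len(after_reclaim), len(after_balance))
```

In this example the automatic cleanup takes six allocated chunks down to
four, while a balance would take them down to two, returning the rest to
the unallocated pool.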

4) Normal df (as opposed to btrfs fi df) will list free space in existing 
data chunks as free, even after all unallocated space is gone and it's 
all allocated to either data or metadata chunks.  At that point,
whichever one you run out of first, typically metadata, will trigger ENOSPC
errors, despite df often showing quite some free space left -- because 
all the reported free-space is tied up in data chunks, and there's no 
unallocated space left to allocate to new metadata chunks when the 
existing ones get full.
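
A toy model of that situation, as a sketch only: the class, its fields
and the numbers are invented for illustration, sizes are whole GiB, and
real btrfs accounting is considerably more involved.

```python
import errno


class ToyFs:
    """Toy two-stage allocator; all sizes in GiB, not real btrfs logic."""

    def __init__(self, unallocated, data_alloc, data_used,
                 meta_alloc, meta_used):
        self.unallocated = unallocated
        self.data_alloc = data_alloc
        self.data_used = data_used
        self.meta_alloc = meta_alloc
        self.meta_used = meta_used

    def df_free(self):
        """Roughly what plain df counts: unallocated space plus the
        unused space inside already allocated data chunks."""
        return self.unallocated + (self.data_alloc - self.data_used)

    def write_metadata(self, amount, chunk=1):
        """Use metadata space, allocating a new chunk first if needed."""
        while self.meta_used + amount > self.meta_alloc:
            if self.unallocated < chunk:
                raise OSError(errno.ENOSPC,
                              "no unallocated space for a metadata chunk")
            self.unallocated -= chunk
            self.meta_alloc += chunk
        self.meta_used += amount


# Everything is chunk-allocated; data chunks are only 60% used, but the
# metadata chunks are full.
fs = ToyFs(unallocated=0, data_alloc=100, data_used=60,
           meta_alloc=2, meta_used=2)
free_before = fs.df_free()
try:
    fs.write_metadata(1)
    hit_enospc = False
except OSError as err:
    hit_enospc = err.errno == errno.ENOSPC
print(free_before, hit_enospc)
```

Here df-style free space is 40 GiB, yet the metadata write fails, because
none of that 40 GiB is unallocated.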

5) What btrfs fi show reports for "used" in the individual device stats 
is chunk-allocated space.

What your btrfs fi show is saying is that 100% of the capacity of those
first eight devices is chunk-allocated.  It doesn't say whether that's to
data or to metadata chunks, but whichever it is, the space is already
allocated and cannot be reallocated to something else, either to a
different-sized stripe after adding the new devices, or to the other of
data or metadata, until it is rewritten in order to consolidate all the
actually used space into as few chunks as possible, thereby freeing the
unused but currently chunk-allocated space back to the unallocated pool.
This chunk rewrite and consolidation is exactly what balance is designed
to do.
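
If you want to check for fully allocated devices mechanically, something
along these lines might do.  Treat it as a sketch: the sample text is
made up in the style of the per-device lines of btrfs fi show (the device
paths and the 1.20TiB figure are hypothetical, not from your post), and
the regular expression assumes that line layout.

```python
import re

# Hypothetical sample in the style of "btrfs fi show" device lines.
SAMPLE = """\
    devid    1 size 3.64TiB used 3.64TiB path /dev/sda
    devid    2 size 3.64TiB used 3.64TiB path /dev/sdb
    devid    9 size 3.64TiB used 1.20TiB path /dev/sdi
"""

DEVICE_LINE = re.compile(
    r"devid\s+(\d+)\s+size\s+([\d.]+)(\w+)\s+used\s+([\d.]+)(\w+)"
    r"\s+path\s+(\S+)"
)
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def fully_allocated(text):
    """Return paths of devices whose chunk-allocated ("used") space is
    at least their size, at the precision the output is printed with."""
    full = []
    for match in DEVICE_LINE.finditer(text):
        _devid, size, size_unit, used, used_unit, path = match.groups()
        size_bytes = float(size) * UNITS[size_unit]
        used_bytes = float(used) * UNITS[used_unit]
        if used_bytes >= size_bytes:
            full.append(path)
    return full


full_devices = fully_allocated(SAMPLE)
print(full_devices)
```

Since the printed figures are rounded, a device that shows as full here
may still have a few MiB unallocated, but not enough for another nominal
chunk.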

Again, at this point we're guessing to some extent, based on what's 
reported now, after the addition and evident partial use of the four new 
devices to the existing eight.  Thus we don't know for sure when the 
existing eight devices got fully allocated, whether it was before the 
addition of the new devices or after, but full allocation is definitely 
the state they're in now, according to your posted btrfs fi show.

One plausible guess is as Hugo suggested: the original eight were mostly
but not fully allocated before the addition of the new devices, with that
data written as an 8-strip stripe (6+2).  After the addition of the four
new devices, the remaining unallocated space on the original eight was
then filled along with usage from the new four, in a 12-strip stripe
(10+2).  After that, further writes, if any, were down to a 4-strip
stripe (2+2), since the original eight were now fully chunk-allocated and
the new four were the only devices with remaining unallocated space.
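
That sequence can be played through with a small simulation.  This is a
sketch under stated assumptions, not the real allocator: every strip is
1 GiB, a stripe takes one strip from every device that still has room,
and min_devices (the narrowest stripe the model allows) is a parameter I
made up for the example, as are the per-device numbers.

```python
from collections import Counter


def simulate_allocation(unallocated, min_devices=4):
    """Allocate one 1 GiB strip per device with room, until too few
    devices have room left.

    unallocated: whole GiB still unallocated on each device.
    Returns a Counter mapping stripe width to number of stripes.
    """
    free = list(unallocated)
    widths = Counter()
    while True:
        with_room = [i for i, gib in enumerate(free) if gib >= 1]
        if len(with_room) < min_devices:
            break
        for i in with_room:
            free[i] -= 1
        widths[len(with_room)] += 1
    return widths


# Eight old devices with 2 GiB unallocated each, four new with 5 GiB each.
widths = simulate_allocation([2] * 8 + [5] * 4)
print(dict(widths))
```

With those numbers the model allocates two 12-wide stripes (10+2) until
the old devices are full, then three 4-wide stripes (2+2) on the new
devices only.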

Another plausible guess is that the original eight devices were fully 
chunk-allocated before the addition of the four new devices, and that the 
free space that df was reporting was entirely in already allocated but 
not fully used data chunks.  In this case, you would have been perilously 
close to ENOSPC errors, when the existing metadata chunks got full, since 
all space was already allocated so no more metadata chunks could have 
been allocated, and if you didn't actually hit those errors, it was 
simply down to the lucky timing of adding the four new devices.

In either case, that df was and is reporting TiB of free space doesn't
necessarily mean that there was unallocated space left, because df
reports on potential space to write data, including both
data-chunk-allocated-but-not-yet-data-used space and unallocated space.
Btrfs fi show is reporting, for each device, its total space and
allocated space, something totally different from what df reports.
Trying to directly compare the output of the two commands without knowing
exactly what those numbers mean is therefore meaningless, as they're
reporting two entirely different things.

---
[1] Nominal chunk size:  Note the "nominal" qualifier.  While this is the 
normal chunk allocation size, on multi-TiB devices, the first few data 
chunk allocations in particular can be much larger, multiples of a GiB, 
while as unallocated space dwindles, both data and metadata chunks can be 
smaller in order to use up the last available unallocated space.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

