Re: RAID1 fails to recover chunk tree

Linux Btrfs filesystem development

From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: RAID1 fails to recover chunk tree
Date: Mon, 3 Nov 2014 08:00:36 +0000 (UTC)
Message-ID: <pan$87e7c$a6682da9$37ea679d$69f7389f@cox.net>
In-Reply-To: 545610BE.1080508@pobox.com

Robert White posted on Sun, 02 Nov 2014 03:08:46 -0800 as excerpted:

> Confusing bit, for example, from wiki
> 
> [QUOTE]
> If you are getting out of space errors due to metadata being full, try
> 
> btrfs balance start -v -dusage=0 /mnt/btrfs [/QUOTE]
> 
> Combined with "Balances only block groups with usage under the given
> percentage. "
> 
> Which I was reading -dusage=0 means don't bother with data chunks and
> (and so just fix the metadata), otherwise the mention of using a -d
> filter to affect metadata is perverse.
> 
> Blarg... I mean just... blarg...
> 
> But now I know. 8-)

If metadata is full and there's no unallocated space left from which to 
create new metadata chunks, then balancing metadata wouldn't do any good 
anyway.

Which is why you balance data chunks in that case.

The typical scenario is this.  Someone creates a btrfs and starts using 
it, creating files, deleting files, but over time, tending to create more 
files than they delete, so the space starts to fill up.

As they do so, btrfs allocates new data and metadata chunks on demand
from the unallocated space.  Btrfs allocation and usage happen in two
steps: unallocated space gets allocated to chunkspace, either data or
metadata, and then that allocated chunkspace gets actually used for file
data or metadata, depending on the chunk type.  Data chunks are 1 GiB
each by default, while metadata chunks default to a quarter GiB each.

The critical bit to understand here is that while btrfs can automatically 
allocate both chunks and actual usage on demand, when it frees space, it 
can only automatically free actual usage, not the allocated chunks.  And 
it can't switch chunks from one type to the other.  To free the chunks 
back to unallocated so they can again be allocated on-demand to data or 
metadata as necessary, one must run a balance, which rewrites the chunks, 
consolidating as it goes, thereby freeing the excess allocated chunks if 
actual usage fits into fewer chunks than were previously allocated.
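
As a rough illustration of the two-step allocation described above, here
is a toy model in Python.  It is only a sketch: the ToyFs class, its
method names and the 4352 MiB device size are made up for the example,
and it ignores devices, raid profiles and everything else real btrfs
does.  The chunk sizes are the defaults mentioned above.

```python
import errno

# Sizes in MiB. Chunk sizes follow the defaults quoted in the text.
CHUNK_SIZE = {"data": 1024, "metadata": 256}


class ToyFs:
    """Hypothetical model: unallocated -> chunk -> actual usage."""

    def __init__(self, device_size):
        self.unallocated = device_size
        # One entry per allocated chunk, holding the MiB used inside it.
        self.chunks = {"data": [], "metadata": []}

    def write(self, kind, size):
        """Fill free room in existing chunks, then allocate new ones."""
        per_chunk = CHUNK_SIZE[kind]
        chunks = self.chunks[kind]
        for i in range(len(chunks)):
            take = min(size, per_chunk - chunks[i])
            chunks[i] += take
            size -= take
        while size > 0:
            if self.unallocated < per_chunk:
                raise OSError(errno.ENOSPC, "no room for a new %s chunk" % kind)
            self.unallocated -= per_chunk
            take = min(size, per_chunk)
            chunks.append(take)
            size -= take

    def delete(self, kind, size):
        """Free usage only. The chunks themselves stay allocated."""
        chunks = self.chunks[kind]
        for i in range(len(chunks)):
            take = min(size, chunks[i])
            chunks[i] -= take
            size -= take


fs = ToyFs(4352)            # room for four data chunks and one metadata chunk
fs.write("data", 4096)      # allocates and fills four data chunks
fs.write("metadata", 256)   # allocates and fills the only metadata chunk
fs.delete("data", 3072)     # three data chunks are now empty but still allocated

try:
    fs.write("metadata", 1)
    hit_enospc = False
except OSError as exc:
    hit_enospc = exc.errno == errno.ENOSPC

print("data chunks:", fs.chunks["data"])
print("unallocated:", fs.unallocated, "metadata ENOSPC:", hit_enospc)
```

In this model the metadata write fails even though three quarters of the
allocated data space is free, because nothing short of a balance hands
the empty data chunks back to the unallocated pool.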

Picking up our typical scenario...  Then they delete a bunch of files, 
often the bigger ones, but the data tends to be much bigger than the 
metadata, so deleting these files frees up a lot of data chunk space but 
only a relatively little metadata chunk space.

Then they go writing files again, but on average smaller ones.  These 
smaller files take up less data space but the same amount of metadata 
space, so without a manual balance to reclaim allocated but mostly empty 
chunks, the limited metadata space freed by that big deletion gets filled 
up faster than the data space, and suddenly, people are getting ENOSPC 
errors when df says there's LOTS of space, because all that space is 
taken up by mostly empty data chunks, leaving no room to write new 
metadata chunks.

The scenario is similar to that of ext* running out of inodes (a type of 
metadata, after all) since it preallocates them at mkfs time, except that 
over time, the default number of inodes at a particular ext* filesystem 
size has been bumped up so that this seldom happens in practice any 
more.  But btrfs stores quite a bit more metadata per file, including 
checksums, and for small files, perhaps the entire file including the 
data, in which case it won't actually have a data extent, so oversizing 
btrfs metadata by a similar amount would mean wasting MUCH more space for 
the typical user.  And btrfs can automatically allocate data and metadata
chunks on demand -- the catch is that it can't automatically unallocate
chunks on demand[1] (a balance is required for that), nor can it switch
usage types on chunks once allocated.

In that scenario, it's metadata that's out, but to fix it you have to 
balance data, returning unused but allocated data chunks back to the 
unallocated space pool, so they can be allocated as metadata.

Which is why/how the -d (data) filter affects -m (metadata) -- by freeing 
mostly (or with the suggested -dusage=0, entirely[2]) empty data chunks 
back to unallocated so they can be reallocated as metadata chunks.
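
To make the effect of -dusage=0 concrete, here is a minimal sketch in
Python.  The function name balance_dusage0 and the starting numbers are
hypothetical, and the model only tracks per-chunk usage in MiB; it shows
the bookkeeping, not what the kernel actually does.

```python
DATA_CHUNK = 1024   # MiB, default data chunk size per the text
META_CHUNK = 256    # MiB, default metadata chunk size per the text


def balance_dusage0(data_chunks, unallocated):
    """Drop entirely empty data chunks and return their space.

    Returns (remaining_data_chunks, new_unallocated).
    """
    kept = [used for used in data_chunks if used > 0]
    freed = (len(data_chunks) - len(kept)) * DATA_CHUNK
    return kept, unallocated + freed


# Hypothetical starting point: fully allocated, three empty data chunks.
data_chunks = [0, 0, 0, 1024]
unallocated = 0

could_allocate_before = unallocated >= META_CHUNK
data_chunks, unallocated = balance_dusage0(data_chunks, unallocated)
could_allocate_after = unallocated >= META_CHUNK

print("metadata chunk possible before:", could_allocate_before)
print("metadata chunk possible after: ", could_allocate_after)
print("data chunks:", data_chunks, "unallocated:", unallocated)
```

Nothing about the metadata itself changed here.  The data filter simply
put space back into the pool that new metadata chunks are carved from.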

So call it perverse if you want to, but it's an entirely logical 
perversion![3] =:^)

Meanwhile, it's also possible, altho less common, to run into the 
opposite situation, out of data space, with metadata space left.  That's 
actually rather interesting, as you can create files and sometimes even 
write just a small bit of content into them, since small files are 
entirely stored within the metadata leaf and don't require a data 
allocation.  But as soon as you try to write anything of any significant 
size (a few KiB) to the new file, it'll ENOSPC when it tries to allocate 
a data extent and can't.

---
[1] Yet.  There are patches circulating that, once thru discussion and
merged, should let btrfs automatically handle at least the normal cases
of data/metadata chunk imbalance.

[2] If there's actual data in a chunk, a balance needs at least enough
unallocated space left to create one more chunk, so as to be able to do
the rewrite.  But with a bit of luck, there's at least one
chunk that's entirely empty, in which case usage=0 will free it without 
actually requiring space to create a new chunk to rewrite into, since 
there's nothing to rewrite.  That's why the usage=0.  If you're unlucky
and there are no entirely empty chunks available for the balance to simply
delete, then the usage=0 won't help.  That's where the suggestion to
temporarily add another device of at least a few gigs comes in, the idea
being to give balance enough room to rewrite a few chunks on the new
device, thereby freeing the space they would have used on the original
device(s).  Assuming an over-allocation, the balance should correct the
problem, leaving enough space on the original device(s) so there's room
to transfer the chunks back to the original device(s) using btrfs device
delete <tmp-device>, and hopefully still leave some unallocated space
after that.
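
The space requirement in [2] can be sketched with another toy model in
Python.  Everything here is an assumption made for illustration: the
balance function, the single shared pool of unallocated space (real
btrfs tracks it per device), and the "at or below the percentage"
selection rule, which is a simplification of the real usage filter.

```python
import errno

CHUNK = 1024  # MiB, data chunk size


def balance(chunks, unallocated, usage_pct):
    """Rewrite chunks whose usage is at or below usage_pct percent.

    Empty chunks are freed directly. Non-empty ones are packed into
    newly allocated chunks, which needs unallocated space up front.
    Returns (new_chunks, new_unallocated) or raises ENOSPC.
    """
    keep = [u for u in chunks if u * 100 > usage_pct * CHUNK]
    todo = sorted(u for u in chunks if u * 100 <= usage_pct * CHUNK)
    new = []  # chunks written by this balance
    for used in todo:
        remaining = used
        while remaining > 0:
            if not new or new[-1] == CHUNK:
                if unallocated < CHUNK:
                    raise OSError(errno.ENOSPC, "no room to rewrite into")
                unallocated -= CHUNK
                new.append(0)
            take = min(remaining, CHUNK - new[-1])
            new[-1] += take
            remaining -= take
        unallocated += CHUNK  # the source chunk is now empty and freed
    return keep + new, unallocated


# usage=0 with an empty chunk present: works with no spare room at all.
lucky = balance([0, 512], 0, 0)

# No empty chunks and no spare room: even usage=50 cannot get started.
try:
    balance([256, 256, 256, 256], 0, 50)
    stuck = False
except OSError as exc:
    stuck = exc.errno == errno.ENOSPC

# Same chunks with 2048 MiB lent by a hypothetical temporary device.
with_tmp_device = balance([256, 256, 256, 256], 2048, 50)

print(lucky, stuck, with_tmp_device)
```

In the last case four quarter-full chunks collapse into one full chunk,
so after the borrowed 2048 MiB is taken away again there would still be
3072 MiB unallocated in this model.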

[3] Sort of like the (in)famous MS Windows perversion of having to hit 
the start button to stop...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
