From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: File system is oddly full after kernel upgrade, balance doesn't help
Date: Tue, 31 Jan 2017 03:53:40 +0000 (UTC)
Message-ID: <pan$61590$8b4273ea$8cd79913$19b8f7cc@cox.net>
In-Reply-To: CAE8gLhkjgOzRgJXKL=az8Mw8njCg+m6d0zyH8WmDPNx_-Wp9Ow@mail.gmail.com
MegaBrutal posted on Sat, 28 Jan 2017 19:15:01 +0100 as excerpted:
> Of course I can't retrieve the data from before the balance, but here is
> the data from now:
FWIW, if it's available, btrfs fi usage tends to yield the richest
information. It's a (relatively) new addition to btrfs-progs, tho; the
older equivalent is btrfs fi show combined with btrfs fi df, which
together display the same critical information, just without quite as
much multi-device detail. Meanwhile, both btrfs fi usage and btrfs fi
df require a mounted btrfs, so when a filesystem won't mount, btrfs fi
show is about the best that can be done, at least staying within the
normal admin-user targeted commands (there are developer-targeted
diagnostic commands too, but I'm not a dev, just a btrfs list regular
and btrfs user myself, and to date have left those for the devs to
play with).
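For reference, the commands look like this (mountpoint and device are
just placeholders, of course):
# btrfs filesystem usage /mountpoint    (richest output, needs a mounted btrfs)
# btrfs filesystem df /mountpoint       (older chunk-level view, also needs a mount)
# btrfs filesystem show /dev/whatever   (works on an unmounted device too)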
But since usage is available, that's all I'm quoting, here:
> root@vmhost:~# btrfs fi usage /tmp/mnt/curlybrace
> Overall:
> Device size: 2.00GiB
> Device allocated: 1.90GiB
> Device unallocated: 103.38MiB
> Device missing: 0.00B
> Used: 789.94MiB
> Free (estimated): 162.18MiB (min: 110.50MiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:773.62MiB, Used:714.82MiB
> /dev/mapper/vmdata--vg-lxc--curlybrace 773.62MiB
>
> Metadata,DUP: Size:577.50MiB, Used:37.55MiB
> /dev/mapper/vmdata--vg-lxc--curlybrace 1.13GiB
>
> System,DUP: Size:8.00MiB, Used:16.00KiB
> /dev/mapper/vmdata--vg-lxc--curlybrace 16.00MiB
>
> Unallocated:
> /dev/mapper/vmdata--vg-lxc--curlybrace 103.38MiB
>
>
> So... if I sum the data, metadata, and the global reserve, I see why
> only ~170 MB is left. I have no idea, however, why the global reserve
> sneaked up to 512 MB for such a small file system, and how could I
> resolve this situation. Any ideas?
That's an interesting issue I've not seen before, tho my experience is
relatively limited compared to, say, Chris (Murphy)'s or Hugo's: other
than my own systems, all I see is the list, while they do the IRC
channels, etc.
I've no idea how to resolve it, unless per some chance balance removes
excess global reserve as well (I simply don't know, it has never come up
that I've seen before).
But IIRC one of the devs (or possibly Hugo) mentioned something about
global reserve being dynamic, based on... something, IDR what. Given my
far lower global reserve on multiple relatively small btrfs and the fact
that my own use-case doesn't use subvolumes or snapshots, if yours does
and you have quite a few, that /might/ be the explanation.
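If you want a quick count, something like this should do it (the
mountpoint simply taken from your output; -s limits the listing to
snapshots):
# btrfs subvolume list -s /tmp/mnt/curlybrace | wc -l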
FWIW, while I tend to use rather small btrfs as well, in my case they're
nearly all btrfs dual-device raid1. However, a usage comparison based on
my closest sized filesystem can still be useful, particularly the global
reserve. Here's my /: 8 GiB per device, raid1, so one copy on each
device (comparable to single mode on a single device; no dup metadata,
since raid1 already keeps a copy on each device):
# btrfs fi u /
Overall:
Device size: 16.00GiB
Device allocated: 7.06GiB
Device unallocated: 8.94GiB
Device missing: 0.00B
Used: 4.38GiB
Free (estimated): 5.51GiB (min: 5.51GiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 16.00MiB (used: 0.00B)
Data,RAID1: Size:3.00GiB, Used:1.96GiB
/dev/sda5 3.00GiB
/dev/sdb5 3.00GiB
Metadata,RAID1: Size:512.00MiB, Used:232.77MiB
/dev/sda5 512.00MiB
/dev/sdb5 512.00MiB
System,RAID1: Size:32.00MiB, Used:16.00KiB
/dev/sda5 32.00MiB
/dev/sdb5 32.00MiB
Unallocated:
/dev/sda5 4.47GiB
/dev/sdb5 4.47GiB
It is worth noting that global reserve actually comes out of metadata
space. That's why metadata never reports as fully used: the reserve
isn't counted in the used figure, but it can't normally be used for
ordinary metadata either.
Also note that under normal conditions global reserve stays at 0 used,
as btrfs is quite reluctant to touch it for routine metadata storage.
It will normally dip into it only to get out of COW-based jams:
because of COW, even deleting something means temporarily allocating
additional space to write the new metadata, without the deleted stuff,
into. So btrfs will only write to global reserve if metadata space is
all used and it thinks that by doing so it can end up actually freeing
space; in normal operation it will simply see the lack of regular
metadata space and error out, without using the global reserve.
So if at any time btrfs reports more than 0 global reserve used, it
means btrfs thinks it's in pretty serious straits, making non-zero
global reserve usage a primary indicator of a filesystem in trouble,
no matter what else is reported.
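A quick way to keep an eye on that, if you want one:
# btrfs fi usage /tmp/mnt/curlybrace | grep 'Global reserve'
If the (used: ...) figure there ever goes non-zero, that's the red
flag.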
So with all that said, you can see that on that 8-gig per device, pair-
device raid1, btrfs has allocated only 512 MiB of metadata on each
device, of which 232 MiB on each is used, *nominally* leaving 280 MiB
metadata unused on each device, tho global reserve comes from that.
But, there's only 16 MiB of global reserve, counted only once. If we
assume it'd be used equally from each device, that's 8 MiB of global
reserve on each device subtracted from that 280 MiB nominally free,
leaving 272 MiB of metadata free, a reasonably healthy filesystem state,
considering that's more metadata than actually used, plus there's nearly
4.5 GiB entirely unallocated on each device, that can be allocated to
data or metadata as needed.
That's quite a contrast compared to yours, a quarter the size, 2 GiB
instead of 8, and as you have only the single device, the metadata
defaulted to dup, so it uses twice as much space on the single device.
But the *real* contrast is as you said, your global reserve, an entirely
unrealistic half a GiB, on a 2 GiB filesystem!
Of course global reserve is accounted single while your metadata is
dup, so half of it should come from each side of that dup. Your real
metadata situation can then be calculated as 577.5 MiB size (per side
of the dup) - 37.5 MiB actually used - 256 MiB (half of the global
reserve) = basically 284 MiB of usable metadata space (per side of the
dup, but each side should be used equally).
Add to that the ~100 MiB unallocated, tho if used for dup metadata you'd
only have half that usable, and you're not in /horrible/ shape.
But that 512 MiB global reserve, a quarter of the total filesystem size,
is just killing you.
And unless it has something to do with snapshots/subvolumes, I don't have
a clue why, or what to do about it.
But here's what I'd try, based on the answer to the question of whether
you use snapshots/subvolumes (or use any of the btrfs reflink-based dedup
tools as they have many of the same implications as snapshots, tho the
scope is of course a bit different), and how many you have if so:
* Snapshots and reflinks are great, but unfortunately, have limited
scaling ability at this time. While on normal sized btrfs the limit
before scaling becomes an issue seems to be a few hundred (under 1000 and
for most under 500), it /may/ be that on a btrfs as small as your two-
GiB, more than say 10 may be an issue.
As I said, I don't /know/ if it'll help, but if you're over this, I'd
certainly try reducing the number of snapshots/reflinks to under 10 per
subvolume/file and see if it helps at all.
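Something like this would show what's there and let you thin them out
(the snapshot path is purely illustrative, of course; use whatever
yours are actually called):
# btrfs subvolume list -s /tmp/mnt/curlybrace
# btrfs subvolume delete /tmp/mnt/curlybrace/snapshots/some-old-snapshot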
* You /may/ be able to get somewhere with btrfs balance start
-musage=, starting with a relatively low value (you tried 0; it's a
percentage, so try 2, 5, 10... on up toward 100, until you either see
some results or get ENOSPC errors). However, typical metadata chunks
are 256 MiB in size (they should be smaller on a 2 GiB btrfs, tho I'm
not sure by how much), and it's relatively likely you'll hit ENOSPC
before you get anywhere, because balance can't rewrite a metadata
chunk larger than half your unallocated space (dup, so it takes two
chunks of the same size). And that's even if balancing would otherwise
help, which again I'm not sure it will, as I don't know whether it
does anything about a bloated global reserve.
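If you do try it, the invocations would be something like the below
(untested against your filesystem, obviously), stepping the filter up
until it either helps or ENOSPCs:
# btrfs balance start -musage=2 /tmp/mnt/curlybrace
# btrfs balance start -musage=5 /tmp/mnt/curlybrace
# btrfs balance start -musage=10 /tmp/mnt/curlybrace
...and so on, toward 100.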
* If the balance ENOSPCs, you may of course try (temporarily) increasing
the size of the filesystem, possibly by adding a device. There's
discussion of that on the wiki. But I honestly don't know how global
reserve will behave, because something's clearly going on with it and I
have no idea what. For all I know, it'll eat most of the new space
again, and you'll be in an even worse position, as it won't then let you
remove the device you added to try to fix the problem.
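Judging by the device name that btrfs sits on an LVM LV, so growing
the LV and then the filesystem may actually be the least intrusive
route (the VG/LV names below are only my guess from the mapper name,
and the +1G is arbitrary, so adjust to whatever's real):
# lvextend -L +1G /dev/vmdata-vg/lxc-curlybrace
# btrfs filesystem resize max /tmp/mnt/curlybrace
Or, the add-a-device route (/dev/sdX being whatever spare device you
have), removing it again once things are sorted out:
# btrfs device add /dev/sdX /tmp/mnt/curlybrace
# btrfs device remove /dev/sdX /tmp/mnt/curlybrace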
* Similarly, but perhaps less risky with regard to global reserve size,
tho definitely being more risky in terms of data safety in case something
goes wrong (but the data's backed up, right?), you could try doing a
btrfs balance start -mconvert=single, to reduce the metadata usage from
dup to single mode. Tho personally, I'd probably not bother with the
risk; I'd simply double-check my backups, then go ahead with the next
suggestion instead of this one.
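For the record, the invocation would be something like this, the -f
being needed because balance otherwise refuses to reduce metadata
redundancy:
# btrfs balance start -f -mconvert=single /tmp/mnt/curlybrace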
* Since in data admin terms, data without a backup is, by that very
lack of a backup, defined as worth less than the time and trouble
necessary to make one, and that applies even more strongly to a still
under heavy development and not yet fully stable filesystem such as
btrfs, it's relatively safe to assume you either have a backup, or
don't really care about the possibility of losing the data in the
first place.
Certainly that's the case here. I can't honestly say I always keep the
backups fresh, but I equally honestly know that if I lose what's not
backed up, it's purely because my actions defined that data as not
worth the trouble, so in any case I saved what was worth more to me:
either the data, or the time necessary to ensure its safety via
backup.
As such, what I'd be very likely to do here, before spending /too/ much
time or effort or worry trying to fix things with no real guarantee it'll
work anyway, would be to first freshen the backups if necessary, then
simply blow away the existing filesystem and start over, restoring from
backups to a freshly mkfsed btrfs.
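In sketch form, with the backup location purely illustrative and the
mkfs options being the subject of the next point:
# rsync -aHAX /tmp/mnt/curlybrace/ /somewhere/with/space/
# umount /tmp/mnt/curlybrace
# mkfs.btrfs -f [options per the next point] /dev/mapper/vmdata--vg-lxc--curlybrace
# mount /dev/mapper/vmdata--vg-lxc--curlybrace /tmp/mnt/curlybrace
# rsync -aHAX /somewhere/with/space/ /tmp/mnt/curlybrace/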
* But, and this may well be the most practically worthwhile piece of the
entire post, on a redo, I'd /strongly/ consider using the -M/--mixed
mkfs.btrfs option.
What this does is tell btrfs to create mixed data/metadata block-groups
aka chunks, instead of separating data and metadata into their own chunk
types. --mixed used to be default for btrfs under 1 GiB, and is still
extremely strongly recommended for such small btrfs, as managing separate
data and metadata chunks at that size is simply impractical. The general
on-list consensus seems to be that --mixed should be strongly
considered for small btrfs over a gig as well; the disagreement is
mostly over where the line sits, closer to 8 GiB or to 64 GiB, beyond
which the tradeoff between the lower hassle factor of --mixed and its
somewhat lower efficiency compared to separate data/metadata swings
toward favoring efficiency.
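Concretely, and keeping in mind the caveat below that mixed mode
forces data and metadata onto the same profile, the mkfs would look
something like one of these (device taken from your output, -f only
because you'd be overwriting the old filesystem):
# mkfs.btrfs -f -M -d dup -m dup /dev/mapper/vmdata--vg-lxc--curlybrace
# mkfs.btrfs -f -M -d single -m single /dev/mapper/vmdata--vg-lxc--curlybrace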
Tho to a large extent I believe it depends on the installation
(hardware and layout factors), the use-case, and the individual
admin's tolerance for minding tech details. Personally, I run gentoo
and have a rather higher tolerance for the minor tech details than I
suppose most do, and I still run, and believe it's appropriate for me
to run, separate data/metadata on my 8-gig*2-device / (and its primary
backup) btrfs. But I run mixed on the 256-meg*1-device-in-dup-mode
/boot (and its backup /boot on the other device), and would almost
certainly run mixed on a 2 GiB btrfs as well. By 4 GiB, tho, I'd
consider separate data/metadata for myself, while still recommending
mixed, up to probably 16 GiB at least, for those who would prefer that
it "just work" with the least constant fooling with it possible. For
some users I'd recommend it up to 32 GiB or even 64 GiB, tho probably
not above that; in practice, tho, users so averse to tech detail that
I'd consider mixed appropriate at 32 or 64 GiB are users I'd probably
advise to stay off btrfs entirely until it stabilizes a bit further,
because I simply don't think btrfs in general is appropriate for them
yet.
But for 2 GiB, I'd *definitely* be considering mixed mode, here, and
almost certainly using it, tho there's one additional caveat to be aware
of with mixed mode.
* Because mixed mode mixes data and metadata in the same chunks, they
have to have the same redundancy level.
Which means that if you want dup metadata, the normal default and the
recommendation for metadata safety, then in mixed mode you get dup
data as well.
And while dup data does give you a second copy on a single device, and
thus a way for scrub to fix not only metadata (which is usually duped)
but also data (which is usually single and thus error detectable but not
correctable), it *ALSO* generally means far more space usage, and that
you basically only get to use half the space of the filesystem, because
it's keeping a second copy of /everything/ then. And some people
consider only half space availability simply too high a price to pay on
what are already by definition small filesystems with relatively limited
space.
One thing I know for sure. When I did the layout of my current system, I
planned for btrfs raid1 mode for nearly everything, and due to having
some experience over the years, I got nearly everything pretty much
correct in terms of partition and filesystem sizes. But what I did *not*
get quite right was /boot, because I failed to figure in the doubled
space usage of dup for data and metadata both.
So what was supposed to be a /boot with 256 MiB available for use
became 128 MiB available due to dup, which does crimp my style a bit,
particularly when I'm trying to git-bisect a kernel bug and I have to
keep removing the extra kernels every few bisect loops, because I simply
don't have space for them all.
So when I redo the layout, I'll probably make them 384 MiB, 192 MiB usage
due to dup, or possibly even 512/256.
So there is a real down side to mixed mode. For single device btrfs
anyway, you have to choose either single for both data/metadata, or dup
for both data/metadata, the first being a bit more risky than I'm
comfortable with, the second, arguably overkill, considering if the
device dies, both copies are on the same device, so it's gone, and the
same if the btrfs itself dies. Given that and the previously mentioned
no-backup-defines-the-data-as-throwaway rule, meaning there's very likely
another copy of the data anyway, arguably, if both have to be set the
same as they do for mixed, single mode for both makes more sense than dup
mode for both.
Which, given that I /do/ have a /boot backup setup on the other device,
selectable via BIOS if necessary, means I may just go ahead and leave my /
boots at 256 MiB each anyway, and just set them both to single mode for
the mixed data/metadata, to make use of the full 256 MiB and not have to
worry about /boot size constraints like I do now with only 128 MiB
available due to dup. We'll see...
But anyway, do consider --mixed for your 2 GiB btrfs, the next time you
mkfs.btrfs it, whether that's now, or whenever later. I'd almost
certainly be using it here, even if that /does/ mean I have to have the
same mode for data and metadata because they're mixed.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman