From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Filesystem unable to recover from ENOSPC
Date: Fri, 11 Apr 2014 02:09:46 +0000 (UTC)
Message-ID: <pan$18fc6$7005cb64$a96a9a75$5cd6742f@cox.net>
In-Reply-To: CAH3EO-oSzYTd2Q3H3wfg9eV9ii-phuquvtzU2mT5KGKF1_x5iw@mail.gmail.com
Chip Turner posted on Thu, 10 Apr 2014 15:40:22 -0700 as excerpted:
> On Thu, Apr 10, 2014 at 1:34 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> On Thu, Apr 10, 2014 at 01:00:35PM -0700, Chip Turner wrote:
>>> btrfs show:
>>> Label: none uuid: 04283a32-b388-480b-9949-686675fad7df
>>> Total devices 1 FS bytes used 135.58GiB
>>> devid 1 size 238.22GiB used 238.22GiB path /dev/sdb2
>>>
>>> btrfs fi df:
>>> Data, single: total=234.21GiB, used=131.82GiB
>>> System, single: total=4.00MiB, used=48.00KiB
>>> Metadata, single: total=4.01GiB, used=3.76GiB
[Tried all the usual tricks, didn't work.]
>> One thing you could do is btrfs dev add a small new device to the
>> filesystem (say, a USB stick, or a 4 GiB loopback file mounted over NBD
>> or something). Then run the filtered balance. Then btrfs dev del the
>> spare device.
>
> Ah, this worked great. It fixed it in about ten seconds.
>
> I'm curious about the space report; why doesn't Data+System+Metadata add
> up to the total space used on the device?
Actually, it does... *IF* you know how to read it. Unfortunately that's
a *BIG* *IF*, because btrfs show very confusingly reports two very
different numbers using very similar wording, without making it *AT*
*ALL* clear what it's actually reporting.
Try this:
Add up the df totals (which is the space allocated for each category
type). 234.21 gig, 4.01 gig, 4 meg. 238 gig and change, correct? Look
at the show output. What number does that look like there?
Now do the same with the df used (which is the space used out of that
allocated). 131.82 gig, 3.76 gig, (insubstantial). 135 gig and change.
What number from btrfs show does /that/ look like?
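In fact, with the numbers above it works out exactly (within rounding):

  allocated (df totals): 234.21 GiB + 4.01 GiB + 4 MiB  = 238.22 GiB
                         -> show, devid 1: "used 238.22GiB"
  used (df used):        131.82 GiB + 3.76 GiB + 48 KiB = 135.58 GiB
                         -> show: "FS bytes used 135.58GiB"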
Here's what's happening and how to read those numbers. Btrfs uses space
in two stages.
First, it allocates chunks on demand, each dedicated to one usage type.
Data chunks are 1 GiB in size. Metadata chunks are 256 MiB, a quarter
the size of a data chunk, although by default on a single device they
are allocated in pairs (dup mode), so 512 MiB at a time, half a data
chunk. I see your metadata is single mode, however, so in your case
it's still only 256 MiB at a time.
The space consumed by these ALLOCATED chunks appears as the totals in
btrfs filesystem df, and as the per-device used numbers in btrfs
filesystem show. But the show total line comes from somewhere else
*ENTIRELY*, which is why the per-device used numbers (summed across
devices if there's more than one; with a single device that one number
is it) come to so much more than what show reports as total used.
That metadata-single, BTW, probably explains Hugo's observation that
you were able to use more of your metadata than most people can: you're
running single metadata mode instead of the more usual dup. (Either you
set it up that way, or mkfs.btrfs detected an SSD, in which case it
defaults to single metadata for a single-device filesystem.) So you
were able to get closer to full metadata usage. (Btrfs reserves some
metadata for its own use, typically about a chunk, so about two chunks
in dup mode. That reserve is never usable, so it always looks like you
have a bit more free metadata space than you actually do. As long as
there's unallocated space left to allocate additional metadata chunks
from, that doesn't matter. Only when all space is allocated does it
matter, since then it still looks like you have free metadata space to
use when you don't.)
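For reference, the metadata profile is chosen at mkfs time. A
hypothetical example (the device name is made up, and of course mkfs
destroys whatever is already on the target):

  # force dup metadata even where single would be the default (e.g.
  # on an SSD); -m single selects the single profile instead
  mkfs.btrfs -m dup /dev/sdX2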
Anyway, once btrfs has a chunk of the appropriate type, it fills it up.
When necessary, it'll try to allocate another chunk.
The actual usage of the already-allocated chunks appears in btrfs
filesystem df as used, with the total across all types and all devices
also appearing in btrfs filesystem show, on its FS bytes used line.
So data+metadata+system *allocated*, as reported by df's totals, adds
up to what show reports as used on the individual device lines (summed,
if there's more than one device).

And data+metadata+system *actually used* out of that allocation, as
reported by df's used, adds up to what show reports on its FS bytes
used line.
But they are two very different numbers: one is the total space
allocated to chunks, the other is the total actually used *OF* those
allocated chunks. It makes sense *IF* *YOU* *KNOW* *HOW* *TO* *READ*
*IT*, but otherwise it's *ENTIRELY* misleading and confusing!
There has already been discussion, and proposed patches, for adding
more detail to df and show, with the wording changed as well. I half
expected to see that land in btrfs-progs v3.14, although I'm running it
now and don't see a change. FWIW, from the posted examples at least, I
couldn't quite figure out the proposed new output either, so it might
not be much better than what we have. Which might or might not have
anything to do with it not appearing in v3.14 as I expected.
Meanwhile, now that I actually know how to read the current output, it
does provide the needed information, even if the output /is/ rather
confusing to newbies.
Back to btrfs behavior and how it leads to ENOSPC errors, however...
When btrfs deletes files, it frees space in the corresponding chunks, but
since individual files normally use a lot more data space than metadata,
data chunks get emptied faster than the corresponding metadata chunks.
But here's the problem. Btrfs can automatically free space back to the
allocated chunks as files get deleted, but it does *NOT* (yet) know how
to automatically deallocate the now empty or mostly empty chunks,
returning them to the unallocated pool so the space can be reused for
chunks of another type if necessary.
So btrfs uses space in two stages, but can only automatically return
unused space in one of them. Currently, to deallocate those unused
chunks you must run a balance. (That's where the filtered balance comes
in: -dusage=20, for instance, rebalances only data chunks with 20%
usage or less.) Balance rewrites the chunks it processes, consolidating
whatever remains in use as it goes, and returns the chunks it empties
to the unallocated pool.
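As a concrete command (the mountpoint /mnt is hypothetical):

  # rewrite only data chunks no more than 20% full, packing their
  # contents into fewer chunks and returning the rest to unallocated
  btrfs balance start -dusage=20 /mnt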
At some point the devs plan to automate the process, probably by
automatically triggering a balance start -dusage=5 or -musage=5 or the
like as necessary, but that hasn't happened yet. Which is why admins
must currently keep an eye on things and run that balance manually when
necessary (or hack up some sort of script to do it for them; see the
sketch below).
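A minimal sketch of such a script, assuming a filesystem at /mnt and 5%
thresholds (both hypothetical and worth tuning; run it from cron or the
like):

  #!/bin/sh
  # reclaim mostly-empty chunks by rebalancing anything at most 5% full
  MNT=/mnt
  btrfs balance start -dusage=5 "$MNT"   # data chunks
  btrfs balance start -musage=5 "$MNT"   # metadata chunks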
> Was the fs stuck in a state
> where it couldn't clean up because it couldn't write more metadata (and
> hence adding a few gb allowed it to balance)?
Basically, yes. As you can see from the individual device line in the
show output above, 238.22 gig used (that is, allocated to chunks) of a
238.22 gig device. There's no room left to allocate additional chunks,
not even one, in order to rewrite the remaining data from some of those
mostly empty data chunks and so return them to the unallocated pool.
With a bit of luck you would have had at least one *entirely* empty
data chunk, in which case a balance start -dusage=0 would have freed it
(being entirely empty, it had nothing that needed rewriting to a new
chunk, so no new allocation was required), giving you enough space to
allocate a new chunk to write into and free more of them. But if you
tried a balance start -dusage=0 and it couldn't find even one entirely
empty data chunk to free, as apparently happened, then you were stuck,
since all available space was already allocated.
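For reference, that zero-usage attempt looks like this (mountpoint
hypothetical):

  # frees only completely empty data chunks, so unlike higher
  # thresholds it needs no unallocated space to make progress
  btrfs balance start -dusage=0 /mnt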
Temporarily adding another device gave balance enough room to allocate
a few new chunks, letting it rewrite and consolidate some of the mostly
empty ones. That freed enough space that you could then btrfs device
delete the temporary device, which rewrote the chunks on it back to the
newly deallocated space on the original device.
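Spelled out, Hugo's workaround looks something like this sketch, using
a loopback file as the spare device (all names and sizes hypothetical;
a USB stick works the same way):

  # create a temporary 4 GiB file-backed device
  truncate -s 4G /tmp/btrfs-spare.img
  losetup /dev/loop0 /tmp/btrfs-spare.img

  # add it to the full filesystem, providing unallocated space
  btrfs device add /dev/loop0 /mnt

  # now the filtered balance can reclaim the mostly-empty chunks
  btrfs balance start -dusage=20 /mnt

  # migrate everything off the spare, then remove and discard it
  btrfs device delete /dev/loop0 /mnt
  losetup -d /dev/loop0
  rm /tmp/btrfs-spare.img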
> After the balance, the
> used space dropped to around 150GB, roughly what I'd expect.
=:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman