Re: Why is the actual disk usage of btrfs considered unknowable?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Shriramana Sharma <samjnaa@gmail.com>, i@hungrycats.org
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Why is the actual disk usage of btrfs considered unknowable?
Date: Mon, 8 Dec 2014 01:43:17 -0500	[thread overview]
Message-ID: <20141208064315.GA22023@hungrycats.org> (raw)
In-Reply-To: <CAH-HCWU9GEjvZLH=rwYev_O0S4_Cs9FJvRiJgBiOK8gdxqK5CQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4187 bytes --]

On Sun, Dec 07, 2014 at 08:45:59PM +0530, Shriramana Sharma wrote:
> IIUC:
> 
> 1) btrfs fi df already shows the alloc-ed space and the space used out of that.
> 
> 2) Despite snapshots, CoW and compression, the tree knows how many
> extents of data and metadata there are, and how many bytes on disk
> these occcupy, no matter what is the total (uncompressed,
> "unsnapshotted") size of all the directories and files on the disk.
> 
> So this means that btrfs fi df actually shows the real on-disk usage.
> In this case, why do we hear people saying it's not possible to know
> the actual on-disk usage and when a btrfs-formatted disk (or
> partition) will go out of space?

"On-disk usage" is easy--that's about the past, and can be measured
straightforwardly with a single count of bytes.

"When a btrfs filesystem will return ENOSPC" is much more
complicated--that's about the future, and depends heavily on current
structure and upcoming modifications of it.

There were some pretty terrible btrfs bugs and warts that were fixed
only in the last 5 months or so.  Since some of those had been around
for a year or more, they gave btrfs a reputation.

The 'df' command (statvfs(2)) would report raw free space instead of an
estimate based on the current RAID profile.  This confused some badly
designed programs that would use statvfs to determine that N bytes of free
space were avaiable, and be surprised when N bytes were not all available
for their use.  If you had a btrfs using RAID1, it would report double
the amount of space used and available (one for each disk, e.g. 2x1TB
disks 75% full would be reported as 2TB capacity with 1.5TB used and
0.5TB free).  Now statvfs(2) computes more correct values (1TB capacity
with 750GB used and 250GB free).

Some bugs would crash the btrfs cleaner (the thread which removes deleted
snapshots) or balance, and would cause the filesystem to prematurely
report ENOSPC when (in theory) hundreds of gigabytes were available.
These were straight up bugs that are now fixed.

Modifying the filesystem tree requires free metadata blocks into which
to write new CoW nodes for the modified metadata.  When you delete
something, disk usage goes up for a few seconds before it goes down
(if you have snapshots, the "down" part may be delayed until you delete
the snapshots).  This can lead to surprising "No space left on device"
errors from commands like 'rm -rf lots_of_files'.  The GlobalReserve
chunk type was introduced to reserve a few MB of space on the filesystem
to handle such cases.

Thankfully, everything above now seems to be fixed.

There is still an issue with hetergeneous chunk allocation.  The 'df'
command and 'statvfs' syscall only report a single quantity for used
and free space, while in btrfs there are two distinct data types to be
stored in two distinct container types--and for maximum result
irreproducibility, the amount of space allocated to each type is dynamic.

Data (file contents) is allocated 1GB at a time, metadata (directory
structures, inodes, checksums) is allocated 256MB at a time, and
the two types are not interchangeable after allocation.  This can
cause inaccuracies when reporting free space as the last few free GB
are consumed.  256MB might abruptly disappear from free space if you
happen to run out of free metadata space and allocate a new metadata
chunk instead of a data chunk.

The last few KB of a file that does not fill a full 4K block can be
stored 'inline' (next to the inode in the metadata tree).  If you are
low on space in data chunks, you might be able to write a large number
of small files using inline metadata to store the file contents, but
not an equivalent-sized large file using data extent blocks.  If you
have lots of free data space but not enough metadata space, you get the
opposite result (e.g. you can write new large files but not extend small
existing ones).

All that above happens with RAID, compression and quotas turned *off*.
Turning those them on makes space usage even harder to analyze (and
ENOSPC errors harder to predict) with a single-dimension "available
space" metric.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

     prev parent reply	other threads:[~2014-12-08  6:43 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
2014-12-07 15:33 ` Martin Steigerwald
2014-12-07 15:37   ` Shriramana Sharma
2014-12-07 15:40   ` Martin Steigerwald
2014-12-08  5:32     ` Robert White
2014-12-08  6:20       ` ashford
2014-12-08  7:06         ` Robert White
2014-12-08 14:47       ` Martin Steigerwald
2014-12-08 14:57         ` Austin S Hemmelgarn
2014-12-08 15:52           ` Martin Steigerwald
2014-12-08 23:14         ` Zygo Blaxell
2014-12-07 18:20   ` ashford
2014-12-07 18:34     ` Hugo Mills
2014-12-07 18:48       ` Martin Steigerwald
2014-12-07 19:39       ` ashford
2014-12-08  5:17       ` Chris Murphy
2014-12-07 18:38     ` Martin Steigerwald
2014-12-07 19:44       ` ashford
2014-12-07 19:19   ` Goffredo Baroncelli
2014-12-07 20:32     ` ashford
2014-12-07 23:01       ` Goffredo Baroncelli
2014-12-08  0:12         ` ashford
2014-12-08  2:42           ` Qu Wenruo
2014-12-08  8:12             ` ashford
2014-12-08 14:34           ` Goffredo Baroncelli
2014-12-08  8:18       ` Chris Murphy
2014-12-08  4:59 ` Robert White
2014-12-08  6:43 ` Zygo Blaxell [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141208064315.GA22023@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=i@hungrycats.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=samjnaa@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.