From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: kreijack@inwind.it
Cc: Chris Murphy <lists@colorremedies.com>,
	Boris Burkov <boris@bur.io>, "Apostolos B." <barz621@gmail.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	systemd Mailing List <systemd-devel@lists.freedesktop.org>
Subject: Re: No space left errors on shutdown with systemd-homed /home dir
Date: Mon, 31 Jan 2022 23:26:41 -0500
Message-ID: <Yfi2gVf5QOXkaM6+@hungrycats.org>
In-Reply-To: <042e75ab-ded2-009a-d9fc-95887c26d4d2@libero.it>

On Sat, Jan 29, 2022 at 10:53:00AM +0100, Goffredo Baroncelli wrote:
> On 27/01/2022 21.48, Chris Murphy wrote:
> > On Wed, Jan 26, 2022 at 4:19 PM Boris Burkov <boris@bur.io> wrote:
> [...]
> > 
> > systemd-homed by default uses btrfs on LUKS on loop mount, with a
> > backing file. On login, it grows the user home file system with some
> > percentage (I think 80%) of the free space of the underlying file
> > system. And on logout, it does both fstrim and shrinks the fs. I don't
> > know why it does both, it seems adequate to do only fstrim on logout
> > to return unused blocks to the underlying file system; and to do an fs
> > resize on login to either grow or shrink the user home file system.
> > 
> > But also, we don't really have a great estimator of the minimum size a
> > file system can be. `btrfs inspect-internal min-dev-size` is pretty
> > broken right now.
> > https://github.com/kdave/btrfs-progs/issues/271
> 
> I tried the test case, but was unable to get a wrong value. However
> I think that this is due to the fact that btrfs improved the bg reclaiming.
> 
> However tweaking the test case, I was able to trigger the problem (I
> reduced the filesize from 1GB to 256MB, so when some files are
> removed a BG is empty filled)
> 
> 
> 
> > 
> > I'm not sure if systemd folks would use libbtrfsutil facility to
> > determine the minimum device shrink size? But also even the kernel
> > doesn't have a very good idea of how small a file system can be
> > shrunk. Right now it basically has to just start trying, and does it
> > one block group at a time.
> 
> I think that for the systemd use case (single-device FS), a simpler
> approach would be:
> 
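>     struct statfs sfs;
>     fstatfs(fd, &sfs);
>     uint64_t needed = sfs.f_blocks - sfs.f_bavail;
>     needed *= sfs.f_bsize;
> 
>     /* round up to a multiple of 3 GiB; 3ULL avoids int overflow */
>     needed = roundup_64(needed, 3ULL * 1024 * 1024 * 1024);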
> 
> Compared with the original systemd-homed code, I made the following changes:
> - 1) f_bfree is replaced by f_bavail (which seems more consistent with the disk usage; it appears to also account for metadata chunk allocation)
> - 2) the needed value is rounded up to a multiple of 3GB to allow for one further data chunk and two metadata chunks (DUP)
> 
> Comments ?
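
For reference, the quoted estimate can be sketched in Python roughly as
follows. This is only an illustration of the arithmetic in the quote, not
systemd-homed's code: the function names are hypothetical, and the wrapper
uses os.statvfs(), whose f_blocks is counted in f_frsize units (statfs(2),
as used in the quote, counts in f_bsize units; the two are normally equal
on btrfs).

```python
import os

GIB = 1024 ** 3


def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def estimate_needed_bytes(f_blocks, f_bavail, block_size, granule=3 * GIB):
    """Quoted proposal: (f_blocks - f_bavail) * block size, rounded up
    to a multiple of 3 GiB (hypothetical helper, not a systemd API)."""
    used = (f_blocks - f_bavail) * block_size
    return round_up(used, granule)


def estimate_for_path(path):
    """Apply the estimate to a mounted filesystem via statvfs."""
    st = os.statvfs(path)
    return estimate_needed_bytes(st.f_blocks, st.f_bavail, st.f_frsize)
```

Note that rounding up to a multiple of 3 GiB adds anywhere from 0 to just
under 3 GiB of headroom, depending on where the used size happens to fall.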

This is closer to the right answer but not quite there yet.  A summary
of the issues:

 * Discard (called by systemd-homed in the form of trim) locks a block
 group (makes it read-only and removes it from available space) while
 it runs.

 * Relocation (balance or filesystem resize) locks a block group while
 it runs.

 * Scrub in the worst case locks one block group per disk (but we may
 never run scrub on a systemd-homed filesystem, so ignore this for now).

 * Large (>50GB) filesystems have larger block groups than smaller
 filesystems.  Resizing from >50GB to <50GB can require rewriting the
 _entire_ filesystem to make a sensible number of smaller block groups
 (high enough to be able to lock all the above block groups and still
 have enough free space to run a transaction without them).

 * System chunks aren't the same size as other chunks, which will create
 unusable free space holes between block groups, and (after lots of
 balancing/resizing runs) possibly create a lot of unusable free space
 that existing extents cannot be relocated into without temporarily
 increasing the size of the filesystem.

 * Resize is a fairly dumb algorithm as algorithms go.  It works in one
 pass, in a fixed order, and it can't fragment an extent or a block group.
 The minimum size of a filesystem depends not just on how much data there
 is, but how capable the resize algorithm is at arranging the data into
 the space given all the overlapping constraints btrfs has on allocation.
 Resize makes several size-speed tradeoffs in favor of speed (or at least
 not in favor of size).

 * The minimum filesystem size to store the data is different from
 the minimum filesystem size to run specific btrfs data modification
 operations.  Some metadata operations can require significant amounts
 of space to complete.  If the filesystem is resized too small with
 exactly the right amount of free space, it may become impossible to
 perform metadata-intensive operations like orphan inode cleanup or
 snapshot deletion on the next mount.  It's not possible to predict
 these additional space requirements without doing equivalent work to
 performing the operations and measuring the space required.  This means
 that in order to compute the minimum filesystem size, we need to be
 able to predict (or strongly control) the future of the filesystem,
 at least long enough to grow the filesystem back to its service size.

These combine to make it especially challenging to resize a nearly empty
filesystem from 128GB to smaller than somewhere around 8GB (1GB data +
1GB locked by discard + 1GB locked by relocation + 1GB metadata * 2 for
dup + 2GB trapped free dev_extent space from earlier resize operations).
That could be reduced to about 6GB with single metadata, but that's a
significant resiliency trade-off if the host filesystem doesn't implement
data integrity and self-healing.
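
The arithmetic above can be written out as a small sketch. The function
and parameter names are hypothetical, and the per-term sizes are just the
figures from the paragraph above, not values read from a real filesystem.
The itemized terms add up to 7GB with dup metadata (which the "around 8GB"
figure rounds up with some slack) and 6GB with single metadata.

```python
def shrink_floor_gib(data_gib=1, metadata_gib=1, dup_metadata=True,
                     discard_locked_gib=1, relocation_locked_gib=1,
                     trapped_free_gib=2):
    """Sum the itemized space terms, in GiB (illustrative only)."""
    # dup metadata stores two copies of every metadata block group
    metadata_total = metadata_gib * (2 if dup_metadata else 1)
    return (data_gib
            + discard_locked_gib      # block group locked by trim
            + relocation_locked_gib   # block group locked by resize/balance
            + metadata_total
            + trapped_free_gib)       # free dev_extent holes from past resizes


floor_dup = shrink_floor_gib()                        # dup metadata
floor_single = shrink_floor_gib(dup_metadata=False)   # single metadata
```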

It might be doable in multiple resize passes--one to resize the filesystem
to <50GB so that all the data can be moved into smaller block groups,
then again to resize it to the next block group size.  In the worst
case this copies all of the data in the filesystem multiple times, so
it would be faster for systemd-homed to mkfs a new filesystem at the
target size (which would have smaller block groups from the beginning)
and simply copy the data with 'cp -a' to the new filesystem instead of
resizing the old one.

Resizing a filesystem with 1GB of data on it from over 50GB to 16GB is
probably reliable.  8GB is less likely to succeed, and I wouldn't expect
any smaller number to work reliably except on very simple test cases.

It does suck that the kernel handles resizing below the minimum size of
the filesystem so badly; however, even if it rejected the resize request
cleanly with an error, it's not necessarily a good idea to attempt it.
Pushing the lower limits of what is possible in resize to save a handful
of GB is asking for trouble.  It's far better to overestimate generously
than to underestimate the minimum size.
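
One way to express "overestimate generously" is to clamp whatever estimate
is computed to a conservative floor and to skip the shrink entirely when
it would not save anything. The sketch below is a hypothetical policy, not
what systemd-homed or the kernel does; the 16 GiB default is an assumption
borrowed from the "probably reliable" figure above for a filesystem with
about 1GB of data, and would need to grow with the amount of data.

```python
GIB = 1024 ** 3


def choose_shrink_target(estimate_bytes, current_size_bytes,
                         floor_bytes=16 * GIB):
    """Return a shrink target in bytes, or None to leave the size alone.

    estimate_bytes: caller's estimate of the minimum size (assumed input).
    floor_bytes: conservative lower bound (assumed default, see above).
    """
    target = max(estimate_bytes, floor_bytes)
    if target >= current_size_bytes:
        # Shrinking would save nothing; do not attempt a resize.
        return None
    return target
```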

> > 
> > Adding systemd-devel@
> > 
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
