public inbox for linux-btrfs@vger.kernel.org
From: David Sterba <dsterba@suse.cz>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim
Date: Tue, 6 Feb 2024 15:55:24 +0100	[thread overview]
Message-ID: <20240206145524.GQ355@twin.jikos.cz>
In-Reply-To: <cover.1706914865.git.boris@bur.io>

On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> Btrfs's block_group allocator suffers from a well known problem: it
> can eagerly allocate too much space to either data or metadata (most
> often data, absent bugs) and then later be unable to allocate more
> space for the other when needed. When data starves metadata, this
> can, particularly painfully, result in read-only filesystems that
> need careful manual balancing to fix.
> 
> This can be worked around by:
> - enabling automatic reclaim
> - periodically running balance
> 
> Neither of these enjoy widespread use, as far as I know, though the
> former is used at scale at Meta with good results.
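
For concreteness, the two workarounds named above can be sketched as a
shell fragment. This is illustrative, not a recommendation: the mount
point, threshold, and usage filter are made-up values, and the
bg_reclaim_threshold sysfs knob requires a reasonably recent kernel
(it defaults to 0, i.e. disabled):

```shell
# Illustrative workarounds for lopsided data/metadata allocation.
# Values are examples only; adjust to the workload. Requires root.

# 1) Enable automatic reclaim: relocate data block groups once their
#    usage drops below 75%.
FSID=$(btrfs filesystem show /mnt | awk '/uuid:/ {print $NF}')
echo 75 > /sys/fs/btrfs/"$FSID"/allocation/data/bg_reclaim_threshold

# 2) Periodically run a filtered balance: only touch data block groups
#    that are at most 50% used, so the cost of each run stays bounded.
btrfs balance start -dusage=50 /mnt
```
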

https://github.com/kdave/btrfsmaintenance is to my knowledge widely used
and installed on distros.  (Also my most starred project on github.)

The idea is to keep balance separate from the kernel, allowing users
and administrators to easily tweak the parameters and timing. We
haven't added automatic reclaim to the kernel, as it tends to start at
the worst time. The jobs from btrfsmaintenance are scheduled according
to calendar events (systemd.timer).
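
As an illustration of that scheduling model, a minimal timer/service
pair might look like the following. The unit names, usage filter, and
mount point are hypothetical; btrfsmaintenance ships its own, more
complete units:

```ini
# /etc/systemd/system/btrfs-balance-example.service (hypothetical)
[Unit]
Description=Example periodic balance of underused btrfs data block groups

[Service]
Type=oneshot
ExecStart=/usr/bin/btrfs balance start -dusage=30 /

# /etc/systemd/system/btrfs-balance-example.timer (hypothetical)
[Unit]
Description=Run the example btrfs balance weekly

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

Scheduling by calendar event (rather than from inside the kernel)
means the IO cost lands at a time the administrator chose.
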

Also, the jobs don't have to run at all if the package is not
installed.

The problem of balancing the amount of data and metadata chunks is
known, and there are only heuristics; we can't solve it without
knowing the exact usage pattern.

> This patch set expands on automatic reclaim, adding the ability to set a
> dynamic reclaim threshold that appropriately scales with the global file
> system allocation conditions as well as periodic reclaim which runs that
> reclaim sweep in the cleaner thread. Together, I believe they constitute
> a robust and general automatic reclaim system that should avoid
> unfortunate read only filesystems in all but extreme conditions, where
> space is running quite low anyway and failure is more reasonable.
> 
> I ran it on three workloads (described in detail on the dynamic reclaim
> patch) but they are:
> 1. bounce allocations around X% full.
> 2. fill up all the way and introduce full fragmentation.
> 3. write in a fragmented way until the filesystem is just about full.
> The script can be found here:
> https://github.com/boryas/scripts/tree/main/fio/reclaim

A common workload on distros is a regular system update (rolling
distro) with snapshots (snapper) and cleanup. This can create a lot of
underused block groups, both data and metadata. Reclaiming them
periodically was one of the founding ideas of the btrfsmaintenance
project.

The reclaim is needed to make the space more compact, as the randomly
removed extents leave holes for new data, so this is a good example
for either scripted or automatic reclaim.

However, you can also find use cases where this would harm performance
or just waste IO, as the data are short-lived and shuffling around
underused block groups does not help much.

The exact parameters of auto reclaim also depend on the storage type:
an NVMe device would probably be fine with any amount of data, an HDD
not so much.

From your description above I can't tell what the estimated frequency
of the reclaim is. I understand that the urgent reclaim would start as
needed, but otherwise block groups that are, say, 30% used can stay
fine for a few days, as there are usually more new data than
deletions.

Also, with more block groups around, it's more likely that good
candidates for the size classes will be found for placement.

> The important results can be seen here (full results explorable at
> bur.io/dyn-rec/)
> 
> bounce at 30%, much higher relocations with a fixed threshold:
> https://bur.io/dyn-rec/bounce-30/relocs.png
> 
> hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
> https://bur.io/dyn-rec/strict_frag-30/relocs.png
> 
> fill it all the way up, not crazy churn, but saving a buffer:
> https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
> https://bur.io/dyn-rec/last_gig/relocs.png
> https://bur.io/dyn-rec/last_gig/thresh.png
> 
> Boris Burkov (6):
>   btrfs: report reclaim count in sysfs
>   btrfs: store fs_info on space_info
>   btrfs: dynamic block_group reclaim threshold
>   btrfs: periodic block_group reclaim
>   btrfs: urgent periodic reclaim pass
>   btrfs: prevent pathological periodic reclaim loops

So one thing is to have the mechanism for the reclaim; I think that's
the easy part. The tuning will be interesting.


Thread overview: 11+ messages
2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
2024-02-02 23:12 ` [PATCH 1/6] btrfs: report reclaim count in sysfs Boris Burkov
2024-02-02 23:12 ` [PATCH 2/6] btrfs: store fs_info on space_info Boris Burkov
2024-02-02 23:12 ` [PATCH 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
2024-02-02 23:12 ` [PATCH 4/6] btrfs: periodic block_group reclaim Boris Burkov
2024-02-04 18:19   ` kernel test robot
2024-02-02 23:12 ` [PATCH 5/6] btrfs: urgent periodic reclaim pass Boris Burkov
2024-02-02 23:12 ` [PATCH 6/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
2024-02-06 14:55 ` David Sterba [this message]
2024-02-06 22:07   ` [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
2024-02-19 19:38     ` David Sterba
