From: "Boris Burkov" <boris@bur.io>
To: "David Sterba" <dsterba@suse.cz>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim
Date: Tue, 06 Feb 2024 14:07:52 -0800
Message-ID: <ZcKrE0iFnga94kIA@devvm12410.ftw0.facebook.com>
In-Reply-To: <20240206145524.GQ355@twin.jikos.cz>
On Tue, Feb 06, 2024 at 03:55:24PM +0100, David Sterba wrote:
> On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> > Btrfs's block_group allocator suffers from a well known problem, that
> > it is capable of eagerly allocating too much space to either data or
> > metadata (most often data, absent bugs) and then later be unable to
> > allocate more space for the other, when needed. When data starves
> > metadata, this can extra painfully result in read only filesystems that
> > need careful manual balancing to fix.
> >
> > This can be worked around by:
> > - enabling automatic reclaim
> > - periodically running balance
> >
> > Neither of these enjoy widespread use, as far as I know, though the
> > former is used at scale at Meta with good results.
>
> https://github.com/kdave/btrfsmaintenance is to my knowledge widely used
> and installed on distros. (Also my most starred project on github.)
Oh, cool, I'm glad that is out there and being used. I'm sorry for my ignorance.
>
> The idea is to make the balance separate from kernel, allowing users and
> administrators to easily tweak the parameters and timing. We haven't
> added automatic reclaim to kernel as it tends to start at the worst
> time. The jobs from btrfsmaintenance are scheduled according to the
> calendar events (systemd.timer).
Makes sense.
>
> Also the jobs don't have to be run at all, or the package not installed.
>
> The problem with balancing amount of data and metadata chunks is known
> and there are only heuristics, we can't solve that without knowing the
> exact usage pattern.
Agreed.
>
> > This patch set expands on automatic reclaim, adding the ability to set a
> > dynamic reclaim threshold that appropriately scales with the global file
> > system allocation conditions as well as periodic reclaim which runs that
> > reclaim sweep in the cleaner thread. Together, I believe they constitute
> > a robust and general automatic reclaim system that should avoid
> > unfortunate read only filesystems in all but extreme conditions, where
> > space is running quite low anyway and failure is more reasonable.
> >
> > I ran it on three workloads (described in detail on the dynamic reclaim
> > patch) but they are:
> > 1. bounce allocations around X% full.
> > 2. fill up all the way and introduce full fragmentation.
> > 3. write in a fragmented way until the filesystem is just about full.
> > script can be found here:
> > https://github.com/boryas/scripts/tree/main/fio/reclaim
>
> A common workload on distros is regular system update (rolling distro)
> with snapshots (snapper) and cleanup. This can create a lot of under
> used block groups, both data and metadata. Reclaiming that periodically
> was one of the ground ideas for the btrfsmaintenance project.
I believe this is pretty similar to my workload 2 in spirit, except I
haven't done much with snapshots. I would love to run this workload so
I'll try to set it up with a VM. If you have a script for it already, or
even tips for setting it up, I would be quite grateful :)
I think that the "lots of random deletes leave empty block groups"
workload is the most interesting one in general for reclaim, and I
think it's cool that it happens in the real world :)
>
> The reclaim is needed to make the space more compact as the randomly
> removed unused extents create holes for new data so this is a good
> example for either scripted or automatic reclaim.
>
> However you can also find use case where this would harm performance or
> just waste IO as the data are short lived and shuffling around unused
> block groups does not help much.
+1, definitely trying to avoid this.
>
> The exact parameters of auto reclaim also depend on the storage type, an
> NVMe would be probably fine with any amount of data, HDD not so much.
Good point, I have only tested on NVMe. It definitely needs to be
tunable so we don't abuse HDDs.
>
> I don't know from your description above what's the estimated frequency
> of the reclaim? I understand that the urgent reclaim would start as
> needed, but otherwise the frequency of reclaim of say 30% used block
> groups can stay fine for a few days, as there are usually more new data
> than deletions.
>
> Also with more block groups around it's more likely to find good
> candidates for the size classes and then do the placement.
I think talking about my workload 2 here is helpful. Roughly, it writes
out 100G on a ~110G disk, then deletes 70G in perfectly fragmenting
stripes, so if we were way too aggressive, or used the current
autoreclaim with an unlucky threshold, we would reclaim all 100
block_groups. With dynamic reclaim, the threshold spikes up to the max,
we relocate 7 block groups, and that is enough for the negative
feedback loop to bring the threshold back down low, so no further
reclaim happens.
see https://bur.io/dyn-rec/strict_frag-30/thresh.png for the threshold
curve and https://bur.io/dyn-rec/strict_frag-30/relocs.png for the
reclaim counts. (I didn't hack it up perfectly evilly to make the 30%
threshold config relocate 100 block groups in that graph, FWIW)
I will try to more systematically plot threshold curves to get a better
sense for how to cause the most reclaims possible for a worst case
estimate.
In case you were asking more about the period it runs at: as written
right now, it runs with every cleaner thread run, but skips
block_groups that got an allocation since the last cleaner thread run.
I think you make an excellent point that the rate is much better off
being closer to "daily" or "weekly" than "minutely". That gives more
time to reach a quiescent state, fill in gaps with small writes, etc.
At a minimum, I think periodic reclaim ought to have a configurable
period with a relatively long default (this should help with HDD too?)
>
> > The important results can be seen here (full results explorable at
> > bur.io/dyn-rec/)
> >
> > bounce at 30%, much higher relocations with a fixed threshold:
> > https://bur.io/dyn-rec/bounce-30/relocs.png
> >
> > hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> > https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
> > https://bur.io/dyn-rec/strict_frag-30/relocs.png
> >
> > fill it all the way up, not crazy churn, but saving a buffer:
> > https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
> > https://bur.io/dyn-rec/last_gig/relocs.png
> > https://bur.io/dyn-rec/last_gig/thresh.png
> >
> > Boris Burkov (6):
> > btrfs: report reclaim count in sysfs
> > btrfs: store fs_info on space_info
> > btrfs: dynamic block_group reclaim threshold
> > btrfs: periodic block_group reclaim
> > btrfs: urgent periodic reclaim pass
> > btrfs: prevent pathological periodic reclaim loops
>
> So one thing is to have the mechanism for the reclaim, I think that's
> the easy part, the tuning will be interesting.
My 2c, based on what I learned from this effort and from your feedback:
our two goals should be:
1. Avoid unnecessary reclaim; it wastes user resources and can hurt
their system's performance.
2. Prevent unallocated space from bottoming out (unallocated=1MiB)
before it's too late.
I think anything with a fixed threshold is unlikely to fully achieve
either goal, as unlucky workloads will either operate below the
threshold and reclaim too much or above it and never reclaim.
I believe the dynamic threshold with a negative feedback loop is the
right sort of idea for achieving both goals. Ultimately, it is a
continuous function that encodes "reclaim at all costs when things are
really bad, don't reclaim much otherwise". It could also work to drop
the continuous-function model, with its extra layer of abstraction, and
encode the two goals more discretely/directly, i.e.:
very long period, low-threshold periodic maintenance (basically exactly
btrfsmaintenance; doesn't need to be in the kernel), plus "urgent"
conditions under which the kernel reclaims more aggressively, in a
limited way, just to get back to a few gigs of unallocated space.
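To make the feedback idea concrete, here is a minimal sketch of such a
continuous threshold function. All names and constants here are
hypothetical illustrations of the shape of the idea, not the patch
set's actual implementation: the threshold sits at a low base while
unallocated space is healthy, and climbs toward the max as unallocated
space falls toward zero.

```python
def dynamic_threshold(unalloc_bytes, target_unalloc_bytes,
                      base_pct=10, max_pct=90):
    """Hypothetical dynamic reclaim threshold.

    Returns the usage percentage below which a block group is a reclaim
    candidate. At or above the unallocated-space target, stay at a low
    base threshold; as unallocated space falls toward zero, scale
    linearly up to max_pct.
    """
    if unalloc_bytes >= target_unalloc_bytes:
        return base_pct
    # "pressure" rises from 0 to 1 as unallocated space runs out
    pressure = 1.0 - unalloc_bytes / target_unalloc_bytes
    return base_pct + pressure * (max_pct - base_pct)
```

The negative feedback falls out naturally: relocating a mostly-empty
block group frees its chunk back to unallocated space, which raises
unalloc_bytes and pulls the threshold back down toward the base, so a
spike like the one in workload 2 self-limits instead of relocating
every block group.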
I also saw that btrfsmaintenance defaults to dusage=5 then dusage=10,
which is lower than (but similar to!) the quiescent-state thresholds I
have seen in my tests (around 15-20). I may try to tune it to land
around 10% for most healthy fses, as that seems to be the safest number
we know.
By the way, I think the dynamic threshold could be implemented fully in
userspace by using the limit flag of balance and recalculating the threshold
between each reclaim. Would you be more interested in experimenting with
that in btrfsmaintenance? I do think that in the long run, some kind of
"urgent unalloc protection" does belong in the kernel by default, assuming
we can really nail it down perfectly.
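A userspace driver for this could be sketched roughly as below. The
usage= and limit= balance filters are real btrfs-progs syntax; the
loop structure, helper names, and how unallocated space is sampled are
all hypothetical (a real script might parse `btrfs filesystem usage`),
and the sketch only builds the commands rather than executing them:

```python
def balance_cmd(mountpoint, usage_pct, limit=1):
    # usage= and limit= are real balance filters: relocate at most
    # `limit` data block groups that are below `usage_pct` used.
    return ["btrfs", "balance", "start",
            f"-dusage={int(usage_pct)},limit={limit}", mountpoint]

def reclaim_pass(mountpoint, read_unalloc_gib, target_gib=2,
                 base_pct=10, max_pct=90, max_steps=16):
    """Hypothetical driver loop: recompute the threshold between each
    single-block-group balance, so each relocation feeds back into the
    next threshold, mirroring the in-kernel dynamic behavior."""
    cmds = []
    for _ in range(max_steps):
        unalloc = read_unalloc_gib(mountpoint)
        if unalloc >= target_gib:
            break  # back above target: nothing urgent to do
        pressure = 1.0 - unalloc / target_gib
        pct = base_pct + pressure * (max_pct - base_pct)
        cmds.append(balance_cmd(mountpoint, pct))
        # a real script would run the command here, e.g.
        # subprocess.run(cmds[-1], check=True)
    return cmds
```

Because limit=1 relocates a single block group per invocation, the
script gets to re-sample unallocated space and lower the threshold
after every relocation, which is exactly the recalculate-between-each-
reclaim behavior described above.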
Thanks for your feedback,
Boris