* [PATCH] btrfs: make periodic dynamic reclaim the default for data
@ 2025-07-15 18:58 Boris Burkov
2025-07-16 6:24 ` Johannes Thumshirn
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Boris Burkov @ 2025-07-15 18:58 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The explanation of the feature can be found via the link to the
original patches. But tl;dr: dynamic periodic reclaim for data is a way
to get a lot of extra protection from block group mis-allocation ENOSPC
without incurring a lot of reclaims in the happy, steady-state case.
We have tested it extensively in production at Meta and are quite
satisfied with its behavior compared to an edge-triggered
bg_reclaim_threshold set to 25. The latter did well in reducing our
ENOSPCs, but at the cost of a LOT of reclaiming, which was often
excessive and seemingly unbounded.
With dynamic periodic reclaim, if the system has less than 10G of
unallocated space, the cleaner thread will identify the best block
groups to reclaim to get us back to 10G. It gets progressively more
aggressive as unallocated space trends towards 0, and performs no
reclaims at all while unallocated space is above 10G.
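The ramp described above can be modeled roughly as follows. This is a
simplified sketch based purely on this description, not the kernel's
actual implementation; the linear scaling and the fixed 10G target are
assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)
/* The 10G unallocated-space target described above (assumed fixed). */
#define UNALLOC_TARGET (10 * GIB)

/*
 * Toy model of the dynamic threshold: 0% when unallocated space is at
 * or above the target (no reclaim at all), ramping linearly up to 100%
 * as unallocated space approaches 0.
 */
static int dynamic_reclaim_threshold(uint64_t unallocated)
{
	uint64_t want;

	if (unallocated >= UNALLOC_TARGET)
		return 0;
	want = UNALLOC_TARGET - unallocated;
	return (int)(want * 100 / UNALLOC_TARGET);
}
```

So a filesystem with 5G unallocated would reclaim block groups that are
at most half used, while one with 12G unallocated reclaims nothing.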
With its by-design conservative approach to reclaiming and good track
record in datacenter testing, I think it is time to introduce automatic
data block group reclaim to btrfs. This does not conflict with the use
of the tools in btrfs_maintenance. One thing to look out for is that the
bg_reclaim_threshold setting is no longer writable once the dynamic
threshold is enabled; instead it becomes a read-only file reporting the
current snapshot of the dynamic threshold.
To disable either of these features, simply write a 0 to
/sys/fs/btrfs/<uuid>/allocation/data/(dynamic_reclaim|periodic_reclaim)
Link: https://lore.kernel.org/linux-btrfs/cover.1718665689.git.boris@bur.io/#t
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/space-info.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0481c693ac2e..8005483fbfe2 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -306,6 +306,12 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 
 		if (ret)
 			return ret;
+	} else {
+		if ((flags & BTRFS_BLOCK_GROUP_DATA) &&
+		    !(flags & BTRFS_BLOCK_GROUP_METADATA)) {
+			space_info->dynamic_reclaim = 1;
+			space_info->periodic_reclaim = 1;
+		}
 	}
 
 	ret = btrfs_sysfs_add_space_info_type(info, space_info);
--
2.50.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov @ 2025-07-16 6:24 ` Johannes Thumshirn 2025-07-16 15:56 ` Boris Burkov 2025-10-21 18:52 ` Chris Murphy 2025-12-26 3:07 ` Sun Yangkai 2 siblings, 1 reply; 14+ messages in thread From: Johannes Thumshirn @ 2025-07-16 6:24 UTC (permalink / raw) To: Boris Burkov, linux-btrfs@vger.kernel.org, kernel-team@fb.com On 15.07.25 20:57, Boris Burkov wrote: > The explanation of the feature is linked via the original patches. > But tl;dr: dynamic periodic reclaim for data is a way to get a lot of > extra protection from block group mis-allocation ENOSPC without > incurring a lot of reclaims in the happy, steady state case. > > We have tested it extensively in production at Meta and are quite > satisfied with its behavior as opposed to an edge triggered > bg_reclaim_threshold set to 25. The latter did well in reducing our > ENOSPCs but at the cost of a LOT of reclaiming. And often excessive > seemingly unbounded reclaiming. > > With dynamic periodic reclaim, if the system is below 10G unallocated > space, then the cleaner thread will identify the best block groups to > reclaim to get us back to 10G. It will get progressively more aggressive > as unallocated trends towards 0. It will perform no reclaims when > unallocated is above 10G. > > With its by-design conservative approach to reclaiming and good track > record in datacenter testing, I think it is time to introduce automatic > data block group reclaim to btrfs. This does not conflict with the use > of the tools in btrfs_maintenance. One thing to look out for is that the > bg_reclaim_threshold setting is no longer writeable once the dynamic > threshold is enabled, and instead is a read-only file representing the > current snapshot of the dynamic threshold. 
> > To disable either of these features, simply write a 0 to > /sys/fs/btrfs/<uuid>/allocation/data/(dynamic_reclaim|periodic_reclaim) > > Link: https://lore.kernel.org/linux-btrfs/cover.1718665689.git.boris@bur.io/#t > Signed-off-by: Boris Burkov <boris@bur.io> > --- > fs/btrfs/space-info.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c > index 0481c693ac2e..8005483fbfe2 100644 > --- a/fs/btrfs/space-info.c > +++ b/fs/btrfs/space-info.c > @@ -306,6 +306,12 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags) > > if (ret) > return ret; > + } else { Why else? If I'm not completely blind I can't see a reason for it. I'm running it without 'else' part through our perf test because it's stressing reclaim quite a bit. We'll know more in ~7h. > + if ((flags & BTRFS_BLOCK_GROUP_DATA) && > + !(flags & BTRFS_BLOCK_GROUP_METADATA)) { > + space_info->dynamic_reclaim = 1; > + space_info->periodic_reclaim = 1; > + } > } > > ret = btrfs_sysfs_add_space_info_type(info, space_info); ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-07-16 6:24 ` Johannes Thumshirn @ 2025-07-16 15:56 ` Boris Burkov 2025-07-17 12:55 ` Johannes Thumshirn 0 siblings, 1 reply; 14+ messages in thread From: Boris Burkov @ 2025-07-16 15:56 UTC (permalink / raw) To: Johannes Thumshirn; +Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com On Wed, Jul 16, 2025 at 06:24:05AM +0000, Johannes Thumshirn wrote: > On 15.07.25 20:57, Boris Burkov wrote: > > The explanation of the feature is linked via the original patches. > > But tl;dr: dynamic periodic reclaim for data is a way to get a lot of > > extra protection from block group mis-allocation ENOSPC without > > incurring a lot of reclaims in the happy, steady state case. > > > > We have tested it extensively in production at Meta and are quite > > satisfied with its behavior as opposed to an edge triggered > > bg_reclaim_threshold set to 25. The latter did well in reducing our > > ENOSPCs but at the cost of a LOT of reclaiming. And often excessive > > seemingly unbounded reclaiming. > > > > With dynamic periodic reclaim, if the system is below 10G unallocated > > space, then the cleaner thread will identify the best block groups to > > reclaim to get us back to 10G. It will get progressively more aggressive > > as unallocated trends towards 0. It will perform no reclaims when > > unallocated is above 10G. > > > > With its by-design conservative approach to reclaiming and good track > > record in datacenter testing, I think it is time to introduce automatic > > data block group reclaim to btrfs. This does not conflict with the use > > of the tools in btrfs_maintenance. One thing to look out for is that the > > bg_reclaim_threshold setting is no longer writeable once the dynamic > > threshold is enabled, and instead is a read-only file representing the > > current snapshot of the dynamic threshold. 
> > > > To disable either of these features, simply write a 0 to > > /sys/fs/btrfs/<uuid>/allocation/data/(dynamic_reclaim|periodic_reclaim) > > > > Link: https://lore.kernel.org/linux-btrfs/cover.1718665689.git.boris@bur.io/#t > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > fs/btrfs/space-info.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c > > index 0481c693ac2e..8005483fbfe2 100644 > > --- a/fs/btrfs/space-info.c > > +++ b/fs/btrfs/space-info.c > > @@ -306,6 +306,12 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags) > > > > if (ret) > > return ret; > > + } else { > > Why else? If I'm not completely blind I can't see a reason for it. > I'm running it without 'else' part through our perf test because it's > stressing reclaim quite a bit. We'll know more in ~7h. > > > Thank you for running your perf test on it, excited to hear the results! The reason I didn't propose enabling it for zoned is that I assumed the reclaim strategy was too conservative for zoned filesystems. I figured you would be reclaiming block_groups more regularly and that the hard coded 10G headroom wouldn't work in practice. Also, I'm not sure how the flipped threshold works. AFAIK, currently zoned inverts the meaning of bg_reclaim_threshold compared to non-zoned so I wonder if will use a threshold of 90 at 9 unalloc down to 10 at 1 unalloc for dynamic... While we're on the topic, what would the ideal auto reclaim for zoned look like? Maybe we could track "finished" block_groups and trigger reclaim on the smallest ones (perhaps with the full-ness threshold) as that number goes up? Another idea for an extension that I was kicking around that I think would make sense for both zoned and non-zoned was to keep the current logic for the "we're out of unallocated" side of things but to add a slow burn of reclaims metered by reclaim_bytes / reclaim_extents at some slow pace. 
This would try to reasonably keep up with general fragmentation in the sub-critical condition without ever doing a large amount of reclaim. > > + if ((flags & BTRFS_BLOCK_GROUP_DATA) && > > + !(flags & BTRFS_BLOCK_GROUP_METADATA)) { > > + space_info->dynamic_reclaim = 1; > > + space_info->periodic_reclaim = 1; > > + } > > } > > > > ret = btrfs_sysfs_add_space_info_type(info, space_info); > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-07-16 15:56 ` Boris Burkov
@ 2025-07-17 12:55   ` Johannes Thumshirn
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Thumshirn @ 2025-07-17 12:55 UTC (permalink / raw)
To: Boris Burkov
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com, Hans Holmberg, hch

[+Cc Hans and Christoph, who have looked at GC on zoned XFS a lot lately]

On 16.07.25 17:55, Boris Burkov wrote:
> Thank you for running your perf test on it, excited to hear the results!

Net result is, reclaim kicks in earlier but the overwrite phase still
isn't as good as I'd like it to be (kind of expected, as you describe
below).

> The reason I didn't propose enabling it for zoned is that I assumed the
> reclaim strategy was too conservative for zoned filesystems. I figured
> you would be reclaiming block_groups more regularly and that the hard
> coded 10G headroom wouldn't work in practice. Also, I'm not sure how the
> flipped threshold works. AFAIK, currently zoned inverts the meaning of
> bg_reclaim_threshold compared to non-zoned so I wonder if will use a
> threshold of 90 at 9 unalloc down to 10 at 1 unalloc for dynamic...

Yes, on a zoned FS we (at the moment) don't look at un-allocated space
but at space we can't use (zone_unusable), because it is either:
a) an old generation of the data, or
b) the difference between zone_size and zone_capacity on ZNS drives.

But I have the feeling that mixing these two is a problem we didn't
consider back then, as for example on a ZNS drive with a zone size of 2G
and a zone capacity of 1G, 50% of the drive is zone_unusable right after
mkfs.

Not looking at the unallocated space, but at the unusable space, might
be a mistake in hindsight. Especially as btrfs_zoned_should_reclaim()
looks at all the FS used (data + unusable + metadata) vs total size.

> While we're on the topic, what would the ideal auto reclaim for zoned
> look like?

Good question. Unfortunately I've been thinking about this for several
weeks now and haven't found an answer yet.

> Maybe we could track "finished" block_groups and trigger
> reclaim on the smallest ones (perhaps with the full-ness threshold) as
> that number goes up?

That was more or less the idea with the current zoned GC code. If 75% of
the drive is unusable, start cleaning it up. But it's doing it in one
batch, causing latency spikes and/or premature ENOSPC, because it's done
in the cleaner kthread and the ticketing code isn't aware (see my RFC
patches from the last 4-6 weeks on the list, which document my failed
attempts).

> Another idea for an extension that I was kicking around that I think
> would make sense for both zoned and non-zoned was to keep the current
> logic for the "we're out of unallocated" side of things but to add a
> slow burn of reclaims metered by reclaim_bytes / reclaim_extents at some
> slow pace. This would try to reasonably keep up with general
> fragmentation in the sub-critical condition without ever doing a large
> amount of reclaim.

This one sounds like an interesting idea. Give me some more time to
contemplate it.

^ permalink raw reply	[flat|nested] 14+ messages in thread
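The ZNS example above works out as follows; a quick sketch of the
zone_size vs. zone_capacity accounting (the helper name is made up for
illustration, not a btrfs function):

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)

/*
 * On ZNS drives the writable capacity of a zone (zone_capacity) can be
 * smaller than the zone's address-space footprint (zone_size); the gap
 * is accounted as zone_unusable from the moment the block group is
 * created.
 */
static unsigned int unusable_pct(uint64_t zone_size, uint64_t zone_capacity)
{
	return (unsigned int)((zone_size - zone_capacity) * 100 / zone_size);
}
```

With 2G zones and 1G capacity, every freshly created block group starts
out 50% zone_unusable, before a single byte of stale data exists.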
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov
  2025-07-16  6:24 ` Johannes Thumshirn
@ 2025-10-21 18:52 ` Chris Murphy
  2025-10-21 22:39   ` Leo Martins
  2025-12-26  3:07 ` Sun Yangkai
  2 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2025-10-21 18:52 UTC (permalink / raw)
To: Boris Burkov; +Cc: kernel-team, Btrfs BTRFS

> Tue, 15 Jul 2025 11:58:24 -0700
https://lore.kernel.org/linux-btrfs/52b863849f0dd63b3d25a29c8a830a09c748d86b.1752605888.git.boris@bur.io/

Fedora is interested in this enhancement. Any idea when it could be
merged, or if there are any outstanding concerns?

In particular, I like the lack of knobs: it's either on or off. And the
fact that it has no effect until unallocated space drops below 10G
means it's super lightweight, affecting only users likely to end up in
the related corner cases.

Fedora isn't installing btrfsmaintenance by default. We do see
infrequent cases of premature or misallocation out of space. It would
be nice to have this "it does nothing until" type of solution enabled
by default, if it's ready.

Thanks,

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-10-21 18:52 ` Chris Murphy
@ 2025-10-21 22:39 ` Leo Martins
  2025-10-22  0:37   ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Leo Martins @ 2025-10-21 22:39 UTC (permalink / raw)
To: Chris Murphy; +Cc: Boris Burkov, kernel-team, Btrfs BTRFS

On Tue, 21 Oct 2025 14:52:31 -0400 "Chris Murphy" <lists@colorremedies.com> wrote:

> >Tue, 15 Jul 2025 11:58:24 -0700
> https://lore.kernel.org/linux-btrfs/52b863849f0dd63b3d25a29c8a830a09c748d86b.1752605888.git.boris@bur.io/
>
> Fedora is interested in this enhancement. Any idea when it could be
> merged or if there are any outstanding concerns?
>
> In particular, I like the lack of knobs. It's either on or off. And it
> has no effect until unallocated space drops below 10G means it's super
> lightweight, affecting only users likely to end up in related corner
> cases.
>
> Fedora isn't installing btrfsmaintenance by default. We do see
> infrequent cases of premature or misallocation out of space. It would
> be nice to have this "it does nothing until" type solution enabled by
> default, if it's ready.
>
> Thanks,
>
> -- 
> Chris Murphy

Wanted to provide some data from the Meta rollout to give more context
on the decision to enable dynamic+periodic reclaim by default for data.
All the before numbers are with bg_reclaim_threshold set to 30.

Enabling dynamic+periodic reclaim for data block groups dramatically
decreases the number of reclaims per host, going from 150/day to just
5/day (p99), and from 6/day to 0/day (p50). The trade-offs are
increases in fragmentation, and a slight uptick in enospcs.

I currently don't have direct fragmentation metrics, though that is a
work in progress, but I'm tracking FP as a proxy for fragmentation.

FP = (allocated - used) / allocated
So if there are 100G allocated for data and 80G are used,
FP = (100 - 80) / 100 = 20%.

FP has increased from 30% to 45% (p99), and from 5% to 7% (p50).
Enospc rates have gone from around 0.5/day to 1/day per 100k hosts.
This is a doubling in rate, but still a very small absolute number of
enospcs. The unallocated space on disk decreases by ~15G (p99) and ~5G
(p50) after rollout.

Though fragmentation increases and unallocated space decreases, the
very small increase in enospcs suggests that this is a worthwhile
tradeoff.

One concern I still have is that replacing the aggressive
bg_reclaim_threshold with the conservative dynamic+periodic reclaim
will lead filesystems to slowly trend toward an "unhealthy" state of
high fragmentation, where dynamic+periodic reclaim will only do enough
to keep the filesystem alive, but not enough to make it "healthy"
again. So far, the data indicates these concerns are unfounded, as FP
and unallocated space seem to stabilize after their initial changes,
but I'll follow up if anything changes.

That being said, I don't think bg_reclaim_threshold is enabled by
default, and I am comfortable saying dynamic+periodic reclaim is
better than no automatic reclaim!

Thanks,
Leo Martins.

^ permalink raw reply	[flat|nested] 14+ messages in thread
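The FP proxy above is simple to compute from allocated/used space-info
numbers; a sketch (hypothetical helper, reproducing the worked example):

```c
#include <assert.h>
#include <stdint.h>

/* FP = (allocated - used) / allocated, expressed as a percentage. */
static unsigned int fp_pct(uint64_t allocated, uint64_t used)
{
	/* An empty space-info has nothing allocated, so nothing is wasted. */
	if (allocated == 0)
		return 0;
	return (unsigned int)((allocated - used) * 100 / allocated);
}
```

For 100G allocated with 80G used this gives 20%, matching the example;
the p99 shift reported above corresponds to roughly 45G of allocated-
but-unused data space per 100G allocated.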
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-10-21 22:39 ` Leo Martins @ 2025-10-22 0:37 ` Chris Murphy 2025-10-22 1:02 ` Boris Burkov 0 siblings, 1 reply; 14+ messages in thread From: Chris Murphy @ 2025-10-22 0:37 UTC (permalink / raw) To: Leo Martins; +Cc: Boris Burkov, kernel-team, Btrfs BTRFS Thanks for the response. On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote: > > Wanted to provide some data from the Meta rollout to give more context on the > decision to enable dynamic+periodic reclaim by default for data. All the before > numbers are with bg_reclaim_threshold set to 30. > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases > number of reclaims per host, going from 150/day to just 5/day (p99), and from > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a > slight uptick in enospcs. > > I currently don't have direct fragmentation metrics, though that is a > work in progress, but I'm tracking FP as a proxy for fragmentation. > > FP = (allocated - used) / allocated > So if there are 100G allocated for data and 80G are used, FP = (100 - > 80) / 100 = 20%. > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50). > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts. > This is a doubling in rate, but still a very small absolute number > of enospcs. The unallocated space on disk decreases by ~15G (p99) > and ~5G (p50) after rollout. I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc. > That being said I don't think bg_reclaim_threshold is enabled by default, > and I am comfortable saying dynamic+periodic reclaim is better than no > automatic reclaim! 
So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases? ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help. -- Chris Murphy ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-10-22 0:37 ` Chris Murphy @ 2025-10-22 1:02 ` Boris Burkov 2025-10-23 23:27 ` Leo Martins 0 siblings, 1 reply; 14+ messages in thread From: Boris Burkov @ 2025-10-22 1:02 UTC (permalink / raw) To: Chris Murphy; +Cc: Leo Martins, kernel-team, Btrfs BTRFS On Tue, Oct 21, 2025 at 08:37:18PM -0400, Chris Murphy wrote: > Thanks for the response. > > On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote: > > > > > Wanted to provide some data from the Meta rollout to give more context on the > > decision to enable dynamic+periodic reclaim by default for data. All the before > > numbers are with bg_reclaim_threshold set to 30. > > > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases > > number of reclaims per host, going from 150/day to just 5/day (p99), and from > > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a > > slight uptick in enospcs. > > > > I currently don't have direct fragmentation metrics, though that is a > > work in progress, but I'm tracking FP as a proxy for fragmentation. > > > > FP = (allocated - used) / allocated > > So if there are 100G allocated for data and 80G are used, FP = (100 - > > 80) / 100 = 20%. > > > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50). > > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts. Leo, correct me if I'm wrong, but we have yet to investigate a system where unallocated steadily marched down to 0 since the introduction of dynamic reclaim and then it ENOSPC'd, right? If there is a strong, undeniable increase in ENOSPCs we should absolutely look for such systems in those regions to motivate further improvements with full/filling filesystems. 
There is also the confounding variable of the bug fixed here: https://lore.kernel.org/linux-btrfs/22e8b64df3d4984000713433a89cfc14309b75fc.1759430967.git.boris@bur.io/ that has been plaguing our fleet causing ENOSPC issues. > > This is a doubling in rate, but still a very small absolute number > > of enospcs. The unallocated space on disk decreases by ~15G (p99) > > and ~5G (p50) after rollout. > > I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc. > Yeah, we don't have that data unfortunately. > > That being said I don't think bg_reclaim_threshold is enabled by default, > > and I am comfortable saying dynamic+periodic reclaim is better than no > > automatic reclaim! > > So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases? > I would imagine the cases that are tough for dynamic reclaim are: 1. genuinely quite full fs 2. rapidly needs a big hunk of metadata between entering the dynamic reclaim zone but before the cleaner thread / reclaim worker can run. > ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help. 
> > --
> > Chris Murphy

Anyway, I think Leo's forthcoming detailed per-BG fragmentation data
should be the most telling. System-level fragmentation percentage isn't
the most useful IMO.

Thanks,
Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-10-22 1:02 ` Boris Burkov @ 2025-10-23 23:27 ` Leo Martins 2025-12-13 22:09 ` Neal Gompa 0 siblings, 1 reply; 14+ messages in thread From: Leo Martins @ 2025-10-23 23:27 UTC (permalink / raw) To: Boris Burkov; +Cc: Chris Murphy, kernel-team, Btrfs BTRFS On Tue, 21 Oct 2025 18:02:15 -0700 Boris Burkov <boris@bur.io> wrote: > On Tue, Oct 21, 2025 at 08:37:18PM -0400, Chris Murphy wrote: > > Thanks for the response. > > > > On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote: > > > > > > > > Wanted to provide some data from the Meta rollout to give more context on the > > > decision to enable dynamic+periodic reclaim by default for data. All the before > > > numbers are with bg_reclaim_threshold set to 30. > > > > > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases > > > number of reclaims per host, going from 150/day to just 5/day (p99), and from > > > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a > > > slight uptick in enospcs. > > > > > > I currently don't have direct fragmentation metrics, though that is a > > > work in progress, but I'm tracking FP as a proxy for fragmentation. > > > > > > FP = (allocated - used) / allocated > > > So if there are 100G allocated for data and 80G are used, FP = (100 - > > > 80) / 100 = 20%. > > > > > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50). > > > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts. > > Leo, correct me if I'm wrong, but we have yet to investigate a system > where unallocated steadily marched down to 0 since the introduction of > dynamic reclaim and then it ENOSPC'd, right? If there is a strong, > undeniable increase in ENOSPCs we should absolutely look for such > systems in those regions to motivate further improvements with > full/filling filesystems. 
After digging some more the only examples I found of btrfs enospcing from lack of unallocated are true enospcs where either data or metadata were entirely full. > > There is also the confounding variable of the bug fixed here: > https://lore.kernel.org/linux-btrfs/22e8b64df3d4984000713433a89cfc14309b75fc.1759430967.git.boris@bur.io/ > that has been plaguing our fleet causing ENOSPC issues. Yes, a deeper look revealed that the increase in ENOSPCs is due to this bug and not dynamic+periodic reclaim. In fact, the hosts with dynamic+periodic reclaim enabled see a relatively smaller rate of enospc (about 2x less) than the rest of the fleet. > > > > This is a doubling in rate, but still a very small absolute number > > > of enospcs. The unallocated space on disk decreases by ~15G (p99) > > > and ~5G (p50) after rollout. > > > > I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc. > > > > Yeah, we don't have that data unfortunately. > > > > That being said I don't think bg_reclaim_threshold is enabled by default, > > > and I am comfortable saying dynamic+periodic reclaim is better than no > > > automatic reclaim! > > > > So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases? > > > > I would imagine the cases that are tough for dynamic reclaim are: > 1. genuinely quite full fs > 2. rapidly needs a big hunk of metadata between entering the dynamic > reclaim zone but before the cleaner thread / reclaim worker can run. Concerning point 1 it seems like dynamic+periodic reclaim actually does a pretty good job here. I haven't seen any signs of thrashing with low unallocated space. 
> > > ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help. > > > > -- > > Chris Murphy > > Anyway, I think Leo's forthcoming detailed per-BG fragmentation data > should be the most telling. System level fragmentation percentage > isn't the most useful IMO. > > Thanks, > Boris Since the uptick in enospcs is not actually linked to dynamic+periodic reclaim I now feel confident saying that dynamic+periodic reclaim should be enabled by default for data. Thanks, Leo Martins. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-10-23 23:27 ` Leo Martins
@ 2025-12-13 22:09 ` Neal Gompa
  0 siblings, 0 replies; 14+ messages in thread
From: Neal Gompa @ 2025-12-13 22:09 UTC (permalink / raw)
To: Leo Martins; +Cc: Boris Burkov, Chris Murphy, kernel-team, Btrfs BTRFS

On Thu, Oct 23, 2025 at 7:27 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> On Tue, 21 Oct 2025 18:02:15 -0700 Boris Burkov <boris@bur.io> wrote:
>
> > On Tue, Oct 21, 2025 at 08:37:18PM -0400, Chris Murphy wrote:
> > > Thanks for the response.
> > >
> > > On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote:
> > >
> > > >
> > > > Wanted to provide some data from the Meta rollout to give more context on the
> > > > decision to enable dynamic+periodic reclaim by default for data. All the before
> > > > numbers are with bg_reclaim_threshold set to 30.
> > > >
> > > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases
> > > > the number of reclaims per host, going from 150/day to just 5/day (p99), and from
> > > > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a
> > > > slight uptick in enospcs.
> > > >
> > > > I currently don't have direct fragmentation metrics, though that is a
> > > > work in progress, but I'm tracking FP as a proxy for fragmentation.
> > > >
> > > > FP = (allocated - used) / allocated
> > > > So if there are 100G allocated for data and 80G are used, FP = (100 -
> > > > 80) / 100 = 20%.
> > > >
> > > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50).
> > > > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts.
> >
> > Leo, correct me if I'm wrong, but we have yet to investigate a system
> > where unallocated steadily marched down to 0 since the introduction of
> > dynamic reclaim and then it ENOSPC'd, right? If there is a strong,
> > undeniable increase in ENOSPCs we should absolutely look for such
> > systems in those regions to motivate further improvements with
> > full/filling filesystems.
>
> After digging some more, the only examples I found of btrfs enospcing
> from lack of unallocated are true enospcs where either data or metadata
> were entirely full.
>
> > There is also the confounding variable of the bug fixed here:
> > https://lore.kernel.org/linux-btrfs/22e8b64df3d4984000713433a89cfc14309b75fc.1759430967.git.boris@bur.io/
> > that has been plaguing our fleet causing ENOSPC issues.
>
> Yes, a deeper look revealed that the increase in ENOSPCs is
> due to this bug and not dynamic+periodic reclaim. In fact,
> the hosts with dynamic+periodic reclaim enabled see a relatively
> smaller rate of enospc (about 2x less) than the rest of the fleet.
>
> > > > This is a doubling in rate, but still a very small absolute number
> > > > of enospcs. The unallocated space on disk decreases by ~15G (p99)
> > > > and ~5G (p50) after rollout.
> > >
> > > I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc.
> >
> > Yeah, we don't have that data unfortunately.
> >
> > > > That being said I don't think bg_reclaim_threshold is enabled by default,
> > > > and I am comfortable saying dynamic+periodic reclaim is better than no
> > > > automatic reclaim!
> > >
> > > So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases?
> >
> > I would imagine the cases that are tough for dynamic reclaim are:
> > 1. genuinely quite full fs
> > 2. rapidly needs a big hunk of metadata between entering the dynamic
> > reclaim zone but before the cleaner thread / reclaim worker can run.
>
> Concerning point 1, it seems like dynamic+periodic reclaim actually does
> a pretty good job here. I haven't seen any signs of thrashing with low
> unallocated space.
>
> > > ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help.
> > >
> > > --
> > > Chris Murphy
> >
> > Anyway, I think Leo's forthcoming detailed per-BG fragmentation data
> > should be the most telling. System level fragmentation percentage
> > isn't the most useful IMO.
> >
> > Thanks,
> > Boris
>
> Since the uptick in enospcs is not actually linked to dynamic+periodic
> reclaim I now feel confident saying that dynamic+periodic reclaim
> should be enabled by default for data.

So have we done this yet? If not, what's holding this up?

--
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply [flat|nested] 14+ messages in thread
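Leo's FP proxy quoted above is simple enough to reproduce exactly. A minimal sketch of the formula as stated in the thread (illustrative Python, not code from the patch or Meta's tooling; the function name is made up):

```python
def fragmentation_proxy(allocated: int, used: int) -> float:
    """FP = (allocated - used) / allocated, per the definition in the thread."""
    if allocated == 0:
        return 0.0
    return (allocated - used) / allocated

GiB = 1024 ** 3
# Worked example from the thread: 100G allocated for data, 80G used -> 20%.
print(f"{fragmentation_proxy(100 * GiB, 80 * GiB):.0%}")  # -> 20%
```

Note this is a fleet-level proxy: as Boris points out later in the thread, per-block-group fragmentation data is more telling than a system-wide percentage.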
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov
  2025-07-16 6:24 ` Johannes Thumshirn
  2025-10-21 18:52 ` Chris Murphy
@ 2025-12-26 3:07 ` Sun Yangkai
  2025-12-30 0:00 ` Boris Burkov
  2 siblings, 1 reply; 14+ messages in thread
From: Sun Yangkai @ 2025-12-26 3:07 UTC (permalink / raw)
To: boris; +Cc: kernel-team, linux-btrfs

Hi Boris,

Thank you for bringing such a feature to btrfs. I love it and have tried
to enable it on my machine.

But I've run into some unexpected behavior when periodic dynamic reclaim
is enabled and the filesystem is nearly full.

[12月26 10:41] [T20373] BTRFS info (device sda): relocating block group 5214541578240 flags data
[ +0.012446] [T20373] BTRFS error (device sda): error relocating chunk 5214541578240
[ +0.000033] [T20373] BTRFS info (device sda): relocating block group 4540021997568 flags data
[ +0.008927] [T20373] BTRFS error (device sda): error relocating chunk 4540021997568
[ +0.000025] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
[12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 10:52] [ T7643] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +3.960314] [ T7643] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 10:57] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +3.954485] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 11:02] [ T7701] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +4.561796] [ T7701] BTRFS error (device sda): error relocating chunk 5606746750976

I suspect the condition for when periodic reclaim should happen still
needs polishing. I'm still digging further into it.

Thanks,
Sun YangKai
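For anyone triaging similar logs, the repeated failures for a single chunk can be tallied mechanically. A hypothetical helper (the message format is taken from the dmesg output above; the function name and script are made up for illustration):

```python
import re
from collections import Counter

# Count how often each chunk fails to relocate, from dmesg-style lines.
# A chunk that keeps showing up points at a block group that the reclaim
# worker repeatedly selects but cannot actually relocate.
ERR_RE = re.compile(r"BTRFS error \(device \S+\): error relocating chunk (\d+)")

def failed_relocations(lines):
    return Counter(m.group(1) for line in lines if (m := ERR_RE.search(line)))

sample = [
    "[12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976",
    "[12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data",
    "[ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976",
]
print(failed_relocations(sample).most_common(1))  # [('5606746750976', 2)]
```

Run against the full log above, this would show chunk 5606746750976 failing on every pass, which is the repeating pattern Sun is describing.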
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-12-26 3:07 ` Sun Yangkai
@ 2025-12-30 0:00 ` Boris Burkov
  2025-12-30 1:29 ` Sun Yangkai
  2025-12-30 1:41 ` Sun Yangkai
  0 siblings, 2 replies; 14+ messages in thread
From: Boris Burkov @ 2025-12-30 0:00 UTC (permalink / raw)
To: Sun Yangkai; +Cc: kernel-team, linux-btrfs

On Fri, Dec 26, 2025 at 11:07:28AM +0800, Sun Yangkai wrote:
> Hi Boris,

First off, sorry for not replying promptly. I've been in and out of the
office around the holidays.

>
> Thank you for bringing such a feature to btrfs. I love it and have tried
> to enable it on my machine.

I really appreciate your kind words and your interest in the feature.
Thank you!

>
> But I've run into some unexpected behavior when periodic dynamic reclaim
> is enabled and the filesystem is nearly full.

Oops! Let's debug it :)

> [12月26 10:41] [T20373] BTRFS info (device sda): relocating block group 5214541578240 flags data
> [ +0.012446] [T20373] BTRFS error (device sda): error relocating chunk 5214541578240
> [ +0.000033] [T20373] BTRFS info (device sda): relocating block group 4540021997568 flags data
> [ +0.008927] [T20373] BTRFS error (device sda): error relocating chunk 4540021997568
> [ +0.000025] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 10:52] [ T7643] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +3.960314] [ T7643] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 10:57] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +3.954485] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 11:02] [ T7701] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +4.561796] [ T7701] BTRFS error (device sda): error relocating chunk 5606746750976
>
> I suspect the condition for when periodic reclaim should happen still
> needs polishing.

Yeah, it looks like it is triggering too frequently in conditions where
it isn't likely to succeed. Hopefully we can tune up the heuristics (or
just fix the bug you found) so that it works better.

It seems to be triggering every 5 minutes or so, right? Is that the
interval of the cleaner thread running on your system? Or am I
misinterpreting the time stamps? I would normally expect the default of
30s.

>
> I'm still digging further into it.

Were you able to confirm whether that negative reclaimable_bytes bug was
the root cause here?

If you aren't able to reproduce but it is still happening on one of your
systems, we can try to instrument the periodic reclaim lifecycle with
bpftrace to catch calls to the various important functions setting it
reclaimable, etc.

Please let me know if I can assist you with that, or if you do have a
reproducer I could also look at.

Thanks,
Boris

>
> Thanks,
> Sun YangKai
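As an aside for readers following along, the knobs involved live under the sysfs path given in the cover letter (/sys/fs/btrfs/&lt;uuid&gt;/allocation/data/). A small sketch for inspecting them, assuming that layout; `read_reclaim_knobs` is a hypothetical name, not a real tool:

```python
from pathlib import Path

# The three files named in the cover letter: the two on/off switches and the
# threshold, which becomes read-only (a snapshot of the dynamic value) once
# dynamic reclaim is enabled.
KNOBS = ("dynamic_reclaim", "periodic_reclaim", "bg_reclaim_threshold")

def read_reclaim_knobs(data_dir):
    """Read the reclaim settings from a data-allocation sysfs directory."""
    base = Path(data_dir)
    return {k: int((base / k).read_text()) for k in KNOBS}
```

On a live system this would be called as, e.g., `read_reclaim_knobs("/sys/fs/btrfs/<uuid>/allocation/data")`; writing 0 to either _reclaim file disables that feature, per the cover letter.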
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-12-30 0:00 ` Boris Burkov
@ 2025-12-30 1:29 ` Sun Yangkai
  2025-12-30 1:41 ` Sun Yangkai
  1 sibling, 0 replies; 14+ messages in thread
From: Sun Yangkai @ 2025-12-30 1:29 UTC (permalink / raw)
To: Boris Burkov; +Cc: kernel-team, linux-btrfs

On 2025/12/30 08:00, Boris Burkov wrote:
> On Fri, Dec 26, 2025 at 11:07:28AM +0800, Sun Yangkai wrote:
>> Hi Boris,
>
> First off, sorry for not replying promptly. I've been in and out of the
> office around the holidays.
>
>> Thank you for bringing such a feature to btrfs. I love it and have tried
>> to enable it on my machine.
>
> I really appreciate your kind words and your interest in the feature.
> Thank you!
>
>> But I've run into some unexpected behavior when periodic dynamic reclaim
>> is enabled and the filesystem is nearly full.
>
> Oops! Let's debug it :)
>
>> [12月26 10:41] [T20373] BTRFS info (device sda): relocating block group 5214541578240 flags data
>> [ +0.012446] [T20373] BTRFS error (device sda): error relocating chunk 5214541578240
>> [ +0.000033] [T20373] BTRFS info (device sda): relocating block group 4540021997568 flags data
>> [ +0.008927] [T20373] BTRFS error (device sda): error relocating chunk 4540021997568
>> [ +0.000025] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 10:52] [ T7643] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +3.960314] [ T7643] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 10:57] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +3.954485] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 11:02] [ T7701] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +4.561796] [ T7701] BTRFS error (device sda): error relocating chunk 5606746750976
>>
>> I suspect the condition for when periodic reclaim should happen still
>> needs polishing.
>
> Yeah, it looks like it is triggering too frequently in conditions where
> it isn't likely to succeed. Hopefully we can tune up the heuristics (or
> just fix the bug you found) so that it works better.
>
> It seems to be triggering every 5 minutes or so, right? Is that the
> interval of the cleaner thread running on your system? Or am I
> misinterpreting the time stamps? I would normally expect the default of
> 30s.

Yes, my system has commit=300. It was set years ago when I knew almost
nothing about btrfs and hasn't been changed since.

>>
>> I'm still digging further into it.
>
> Were you able to confirm whether that negative reclaimable_bytes bug was
> the root cause here?

Yes. After changing chunk_sz to s64, this is not triggered anymore.
However, periodic reclaim still does not work properly.

> If you aren't able to reproduce but it is still happening on one of your
> systems, we can try to instrument the periodic reclaim lifecycle with
> bpftrace to catch calls to the various important functions setting it
> reclaimable, etc.

Thank you for your advice. That's what I've done and how I found the
unexpected behavior. It's really a good tool for seeing what's happening
in the kernel.

> Please let me know if I can assist you with that, or if you do have a
> reproducer I could also look at.

I've redesigned the logic and iterated through some versions. I'll clean
up my code and send the patches later, maybe later today or tomorrow.
It's not perfect, but I hope it will be better than what we have now.

Thanks,
Sun YangKai
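The chunk_sz-to-s64 fix Sun mentions points at unsigned underflow: in kernel C, a u64 subtraction that "goes negative" wraps around to an enormous value instead, which could make a nearly-empty filesystem look like it has a huge amount to reclaim. A toy model of that failure class (this simulates u64 arithmetic with a 64-bit mask; it is not the actual kernel code):

```python
# Kernel u64 arithmetic is modulo 2^64; model it with an explicit mask.
U64_MASK = (1 << 64) - 1

def u64_sub(a: int, b: int) -> int:
    """Subtract the way a C u64 would: no negatives, just wraparound."""
    return (a - b) & U64_MASK

GiB = 1024 ** 3
# "1 GiB minus 2 GiB" should be -1 GiB, but as u64 it wraps to ~16 EiB:
print(u64_sub(1 * GiB, 2 * GiB))  # 18446744072635809792
```

Switching the intermediate to a signed 64-bit type (s64), as Sun describes, lets the negative value be seen and clamped instead of wrapping.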
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-12-30 0:00 ` Boris Burkov
  2025-12-30 1:29 ` Sun Yangkai
@ 2025-12-30 1:41 ` Sun Yangkai
  1 sibling, 0 replies; 14+ messages in thread
From: Sun Yangkai @ 2025-12-30 1:41 UTC (permalink / raw)
To: Boris Burkov; +Cc: kernel-team, linux-btrfs

> Please let me know if I can assist you with that, or if you do have a
> reproducer I could also look at.

I just came across a ... thing I found, and I have no idea why it
happened.

I've written a script to show the 10 least used block groups (used_space
is just calculated from length and used_pct, so please just ignore it)
and before periodic reclaim, I got:

Searching for the 10 least used DATA block groups...
vaddr          length   used_pct  used_space
--------------------------------------------------------
6353387388928  1024MiB  5%        51MiB
6354461130752  1024MiB  30%       307MiB
6295292084224  1024MiB  80%       819MiB
6056900427776  1024MiB  89%       911MiB
4620552634368  1024MiB  97%       993MiB
6050457976832  1024MiB  98%       1003MiB
6122398679040  1024MiB  98%       1003MiB
6270596022272  1024MiB  98%       1003MiB
6350166163456  1024MiB  98%       1003MiB
383347851264   1024MiB  99%       1013MiB

Unallocated space is 3GiB, so with dynamic periodic reclaim the first
two block groups will be reclaimed:

[12月28 21:47] [ T357] BTRFS info (device sda): relocating block group 6353387388928 flags data
[ +0.262467] [ T357] BTRFS info (device sda): found 1970 extents, stage: move data extents
[ +1.334556] [ T357] BTRFS info (device sda): found 1966 extents, stage: update data pointers
[ +0.618457] [ T357] BTRFS info (device sda): relocating block group 6354461130752 flags data
[ +1.009694] [ T357] BTRFS info (device sda): found 166 extents, stage: move data extents
[ +0.388070] [ T357] BTRFS info (device sda): found 166 extents, stage: update data pointers

And after the reclaim I got:

Searching for the 10 least used DATA block groups...
vaddr          length   used_pct  used_space
--------------------------------------------------------
6355534872576  1024MiB  6%        61MiB
6356608614400  1024MiB  16%       163MiB
6295292084224  1024MiB  80%       819MiB
4620552634368  1024MiB  97%       993MiB
6050457976832  1024MiB  98%       1003MiB
6270596022272  1024MiB  98%       1003MiB
3782605471744  1024MiB  99%       1013MiB
4549685673984  1024MiB  99%       1013MiB
5882820034560  1024MiB  99%       1013MiB
5909764243456  1024MiB  99%       1013MiB

These two block groups could be merged into existing chunks, but I have
no idea why that didn't happen. But when I run

  btrfs balance start -dvrange=6355534872576..6355534872576 /mnt

they can be merged, freeing some unallocated space. So I think periodic
reclaim behaves differently from a manual balance?

Thanks,
Sun YangKai
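The selection step Sun's script approximates - relocate the emptiest data block groups first - can be sketched as a greedy loop. This models only the "pick the best block groups first" idea from the cover letter; the in-kernel heuristic (the dynamic threshold, per-pass limits, the escalating aggressiveness near zero unallocated) is more involved, and every name below is hypothetical:

```python
GiB = 1024 ** 3

def pick_reclaim_candidates(block_groups, unallocated, target=10 * GiB):
    """Greedy sketch: reclaim the least-used data block groups until
    unallocated space reaches the target (10G per the cover letter).
    block_groups: iterable of (vaddr, length_bytes, used_bytes) tuples.
    'length - used' is a crude model of the net space handed back, since
    the used data must be relocated into other chunks."""
    picked = []
    for vaddr, length, used in sorted(block_groups, key=lambda bg: bg[2] / bg[1]):
        if unallocated >= target:
            break
        picked.append(vaddr)
        unallocated += length - used
    return picked

# First three rows of Sun's "before" table, unallocated = 3GiB:
bgs = [(6353387388928, GiB, GiB * 5 // 100),
       (6354461130752, GiB, GiB * 30 // 100),
       (6295292084224, GiB, GiB * 80 // 100)]
print(pick_reclaim_candidates(bgs, unallocated=3 * GiB)[:2])
# -> [6353387388928, 6354461130752]
```

This matches the ordering Sun observed (the 5%- and 30%-used block groups go first), but note his question stands: selection is only half the story, and a reclaim pass that relocates extents without merging them back into existing chunks behaves differently from a targeted manual balance.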
end of thread, other threads:[~2025-12-30 1:41 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov
2025-07-16 6:24 ` Johannes Thumshirn
2025-07-16 15:56 ` Boris Burkov
2025-07-17 12:55 ` Johannes Thumshirn
2025-10-21 18:52 ` Chris Murphy
2025-10-21 22:39 ` Leo Martins
2025-10-22 0:37 ` Chris Murphy
2025-10-22 1:02 ` Boris Burkov
2025-10-23 23:27 ` Leo Martins
2025-12-13 22:09 ` Neal Gompa
2025-12-26 3:07 ` Sun Yangkai
2025-12-30 0:00 ` Boris Burkov
2025-12-30 1:29 ` Sun Yangkai
2025-12-30 1:41 ` Sun Yangkai