* [PATCH] btrfs: make periodic dynamic reclaim the default for data
@ 2025-07-15 18:58 Boris Burkov
2025-07-16 6:24 ` Johannes Thumshirn
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Boris Burkov @ 2025-07-15 18:58 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The explanation of the feature can be found via the link to the
original patches. But tl;dr: dynamic periodic reclaim for data is a way
to get a lot of extra protection from block group mis-allocation ENOSPC
without incurring a lot of reclaims in the happy, steady-state case.
We have tested it extensively in production at Meta and are quite
satisfied with its behavior compared to an edge-triggered
bg_reclaim_threshold set to 25. The latter did well in reducing our
ENOSPCs, but at the cost of a LOT of reclaiming, which was often
excessive and seemingly unbounded.
With dynamic periodic reclaim, if the system has less than 10G of
unallocated space, the cleaner thread will identify the best block
groups to reclaim to get us back to 10G. It gets progressively more
aggressive as unallocated space trends towards 0, and performs no
reclaims at all while unallocated space is above 10G.
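The ramp described above can be modeled roughly as follows. This is a
simplified sketch based purely on this description, not the kernel's
actual implementation; the linear scaling and the fixed 10G target are
assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)
/* The 10G unallocated-space target described above (assumed fixed). */
#define UNALLOC_TARGET (10 * GIB)

/*
 * Toy model of the dynamic threshold: 0% when unallocated space is at
 * or above the target (no reclaim at all), ramping linearly up to 100%
 * as unallocated space approaches 0.
 */
static int dynamic_reclaim_threshold(uint64_t unallocated)
{
	uint64_t want;

	if (unallocated >= UNALLOC_TARGET)
		return 0;
	want = UNALLOC_TARGET - unallocated;
	return (int)(want * 100 / UNALLOC_TARGET);
}
```

So a filesystem with 5G unallocated would reclaim block groups that are
at most half used, while one with 12G unallocated reclaims nothing.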
With its by-design conservative approach to reclaiming and good track
record in datacenter testing, I think it is time to introduce automatic
data block group reclaim to btrfs. This does not conflict with the use
of the tools in btrfs_maintenance. One thing to look out for is that the
bg_reclaim_threshold setting is no longer writable once the dynamic
threshold is enabled; instead it becomes a read-only file reporting the
current snapshot of the dynamic threshold.
To disable either of these features, simply write a 0 to
/sys/fs/btrfs/<uuid>/allocation/data/(dynamic_reclaim|periodic_reclaim)
Link: https://lore.kernel.org/linux-btrfs/cover.1718665689.git.boris@bur.io/#t
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/space-info.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0481c693ac2e..8005483fbfe2 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -306,6 +306,12 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 
 		if (ret)
 			return ret;
+	} else {
+		if ((flags & BTRFS_BLOCK_GROUP_DATA) &&
+		    !(flags & BTRFS_BLOCK_GROUP_METADATA)) {
+			space_info->dynamic_reclaim = 1;
+			space_info->periodic_reclaim = 1;
+		}
 	}
 
 	ret = btrfs_sysfs_add_space_info_type(info, space_info);
--
2.50.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov @ 2025-07-16 6:24 ` Johannes Thumshirn 2025-07-16 15:56 ` Boris Burkov 2025-10-21 18:52 ` Chris Murphy 2025-12-26 3:07 ` Sun Yangkai 2 siblings, 1 reply; 14+ messages in thread From: Johannes Thumshirn @ 2025-07-16 6:24 UTC (permalink / raw) To: Boris Burkov, linux-btrfs@vger.kernel.org, kernel-team@fb.com On 15.07.25 20:57, Boris Burkov wrote: > The explanation of the feature is linked via the original patches. > But tl;dr: dynamic periodic reclaim for data is a way to get a lot of > extra protection from block group mis-allocation ENOSPC without > incurring a lot of reclaims in the happy, steady state case. > > We have tested it extensively in production at Meta and are quite > satisfied with its behavior as opposed to an edge triggered > bg_reclaim_threshold set to 25. The latter did well in reducing our > ENOSPCs but at the cost of a LOT of reclaiming. And often excessive > seemingly unbounded reclaiming. > > With dynamic periodic reclaim, if the system is below 10G unallocated > space, then the cleaner thread will identify the best block groups to > reclaim to get us back to 10G. It will get progressively more aggressive > as unallocated trends towards 0. It will perform no reclaims when > unallocated is above 10G. > > With its by-design conservative approach to reclaiming and good track > record in datacenter testing, I think it is time to introduce automatic > data block group reclaim to btrfs. This does not conflict with the use > of the tools in btrfs_maintenance. One thing to look out for is that the > bg_reclaim_threshold setting is no longer writeable once the dynamic > threshold is enabled, and instead is a read-only file representing the > current snapshot of the dynamic threshold. 
> > To disable either of these features, simply write a 0 to > /sys/fs/btrfs/<uuid>/allocation/data/(dynamic_reclaim|periodic_reclaim) > > Link: https://lore.kernel.org/linux-btrfs/cover.1718665689.git.boris@bur.io/#t > Signed-off-by: Boris Burkov <boris@bur.io> > --- > fs/btrfs/space-info.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c > index 0481c693ac2e..8005483fbfe2 100644 > --- a/fs/btrfs/space-info.c > +++ b/fs/btrfs/space-info.c > @@ -306,6 +306,12 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags) > > if (ret) > return ret; > + } else { Why else? If I'm not completely blind I can't see a reason for it. I'm running it without 'else' part through our perf test because it's stressing reclaim quite a bit. We'll know more in ~7h. > + if ((flags & BTRFS_BLOCK_GROUP_DATA) && > + !(flags & BTRFS_BLOCK_GROUP_METADATA)) { > + space_info->dynamic_reclaim = 1; > + space_info->periodic_reclaim = 1; > + } > } > > ret = btrfs_sysfs_add_space_info_type(info, space_info); ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-07-16 6:24 ` Johannes Thumshirn @ 2025-07-16 15:56 ` Boris Burkov 2025-07-17 12:55 ` Johannes Thumshirn 0 siblings, 1 reply; 14+ messages in thread From: Boris Burkov @ 2025-07-16 15:56 UTC (permalink / raw) To: Johannes Thumshirn; +Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com On Wed, Jul 16, 2025 at 06:24:05AM +0000, Johannes Thumshirn wrote: > On 15.07.25 20:57, Boris Burkov wrote: > > The explanation of the feature is linked via the original patches. > > But tl;dr: dynamic periodic reclaim for data is a way to get a lot of > > extra protection from block group mis-allocation ENOSPC without > > incurring a lot of reclaims in the happy, steady state case. > > > > We have tested it extensively in production at Meta and are quite > > satisfied with its behavior as opposed to an edge triggered > > bg_reclaim_threshold set to 25. The latter did well in reducing our > > ENOSPCs but at the cost of a LOT of reclaiming. And often excessive > > seemingly unbounded reclaiming. > > > > With dynamic periodic reclaim, if the system is below 10G unallocated > > space, then the cleaner thread will identify the best block groups to > > reclaim to get us back to 10G. It will get progressively more aggressive > > as unallocated trends towards 0. It will perform no reclaims when > > unallocated is above 10G. > > > > With its by-design conservative approach to reclaiming and good track > > record in datacenter testing, I think it is time to introduce automatic > > data block group reclaim to btrfs. This does not conflict with the use > > of the tools in btrfs_maintenance. One thing to look out for is that the > > bg_reclaim_threshold setting is no longer writeable once the dynamic > > threshold is enabled, and instead is a read-only file representing the > > current snapshot of the dynamic threshold. 
> > > > To disable either of these features, simply write a 0 to > > /sys/fs/btrfs/<uuid>/allocation/data/(dynamic_reclaim|periodic_reclaim) > > > > Link: https://lore.kernel.org/linux-btrfs/cover.1718665689.git.boris@bur.io/#t > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > fs/btrfs/space-info.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c > > index 0481c693ac2e..8005483fbfe2 100644 > > --- a/fs/btrfs/space-info.c > > +++ b/fs/btrfs/space-info.c > > @@ -306,6 +306,12 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags) > > > > if (ret) > > return ret; > > + } else { > > Why else? If I'm not completely blind I can't see a reason for it. > I'm running it without 'else' part through our perf test because it's > stressing reclaim quite a bit. We'll know more in ~7h. > > > Thank you for running your perf test on it, excited to hear the results! The reason I didn't propose enabling it for zoned is that I assumed the reclaim strategy was too conservative for zoned filesystems. I figured you would be reclaiming block_groups more regularly and that the hard coded 10G headroom wouldn't work in practice. Also, I'm not sure how the flipped threshold works. AFAIK, currently zoned inverts the meaning of bg_reclaim_threshold compared to non-zoned so I wonder if will use a threshold of 90 at 9 unalloc down to 10 at 1 unalloc for dynamic... While we're on the topic, what would the ideal auto reclaim for zoned look like? Maybe we could track "finished" block_groups and trigger reclaim on the smallest ones (perhaps with the full-ness threshold) as that number goes up? Another idea for an extension that I was kicking around that I think would make sense for both zoned and non-zoned was to keep the current logic for the "we're out of unallocated" side of things but to add a slow burn of reclaims metered by reclaim_bytes / reclaim_extents at some slow pace. 
This would try to reasonably keep up with general fragmentation in the sub-critical condition without ever doing a large amount of reclaim. > > + if ((flags & BTRFS_BLOCK_GROUP_DATA) && > > + !(flags & BTRFS_BLOCK_GROUP_METADATA)) { > > + space_info->dynamic_reclaim = 1; > > + space_info->periodic_reclaim = 1; > > + } > > } > > > > ret = btrfs_sysfs_add_space_info_type(info, space_info); > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-07-16 15:56 ` Boris Burkov
@ 2025-07-17 12:55   ` Johannes Thumshirn
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Thumshirn @ 2025-07-17 12:55 UTC (permalink / raw)
To: Boris Burkov
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com, Hans Holmberg, hch

[+Cc Hans and Christoph, who have looked at GC on zoned XFS a lot lately]

On 16.07.25 17:55, Boris Burkov wrote:
> Thank you for running your perf test on it, excited to hear the results!

Net result is, reclaim kicks in earlier but the overwrite phase still
isn't as good as I'd like it to be (kind of expected, as you describe
below).

> The reason I didn't propose enabling it for zoned is that I assumed the
> reclaim strategy was too conservative for zoned filesystems. I figured
> you would be reclaiming block_groups more regularly and that the hard
> coded 10G headroom wouldn't work in practice. Also, I'm not sure how the
> flipped threshold works. AFAIK, currently zoned inverts the meaning of
> bg_reclaim_threshold compared to non-zoned so I wonder if will use a
> threshold of 90 at 9 unalloc down to 10 at 1 unalloc for dynamic...

Yes, on a zoned FS we (at the moment) don't look at un-allocated space
but at space we can't use (zone_unusable), because it is either:
a) an old generation of the data, or
b) the difference between zone_size and zone_capacity on ZNS drives.

But I have the feeling that mixing these two is a problem we didn't
consider back then, as for example on a ZNS drive with a zone size of 2G
and a zone capacity of 1G, 50% of the drive is zone_unusable right after
mkfs.

Not looking at the unallocated space, but at the unusable space, might
be a mistake in hindsight. Especially as btrfs_zoned_should_reclaim()
looks at all the FS used (data + unusable + metadata) vs total size.

> While we're on the topic, what would the ideal auto reclaim for zoned
> look like?

Good question. Unfortunately I've been thinking about this for several
weeks now and haven't found an answer yet.

> Maybe we could track "finished" block_groups and trigger
> reclaim on the smallest ones (perhaps with the full-ness threshold) as
> that number goes up?

That was more or less the idea with the current zoned GC code. If 75% of
the drive is unusable, start cleaning it up. But it's doing it in one
batch, causing latency spikes and/or premature ENOSPC, because it's done
in the cleaner kthread and the ticketing code isn't aware (see my RFC
patches from the last 4-6 weeks on the list, which document my failed
attempts).

> Another idea for an extension that I was kicking around that I think
> would make sense for both zoned and non-zoned was to keep the current
> logic for the "we're out of unallocated" side of things but to add a
> slow burn of reclaims metered by reclaim_bytes / reclaim_extents at some
> slow pace. This would try to reasonably keep up with general
> fragmentation in the sub-critical condition without ever doing a large
> amount of reclaim.

This one sounds like an interesting idea. Give me some more time to
contemplate it.

^ permalink raw reply	[flat|nested] 14+ messages in thread
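The ZNS example above works out as follows; a quick sketch of the
zone_size vs. zone_capacity accounting (the helper name is made up for
illustration, not a btrfs function):

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)

/*
 * On ZNS drives the writable capacity of a zone (zone_capacity) can be
 * smaller than the zone's address-space footprint (zone_size); the gap
 * is accounted as zone_unusable from the moment the block group is
 * created.
 */
static unsigned int unusable_pct(uint64_t zone_size, uint64_t zone_capacity)
{
	return (unsigned int)((zone_size - zone_capacity) * 100 / zone_size);
}
```

With 2G zones and 1G capacity, every freshly created block group starts
out 50% zone_unusable, before a single byte of stale data exists.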
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov
  2025-07-16  6:24 ` Johannes Thumshirn
@ 2025-10-21 18:52 ` Chris Murphy
  2025-10-21 22:39   ` Leo Martins
  2025-12-26  3:07 ` Sun Yangkai
  2 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2025-10-21 18:52 UTC (permalink / raw)
To: Boris Burkov; +Cc: kernel-team, Btrfs BTRFS

> Tue, 15 Jul 2025 11:58:24 -0700
https://lore.kernel.org/linux-btrfs/52b863849f0dd63b3d25a29c8a830a09c748d86b.1752605888.git.boris@bur.io/

Fedora is interested in this enhancement. Any idea when it could be
merged, or if there are any outstanding concerns?

In particular, I like the lack of knobs: it's either on or off. And the
fact that it has no effect until unallocated space drops below 10G
means it's super lightweight, affecting only users likely to end up in
the related corner cases.

Fedora isn't installing btrfsmaintenance by default. We do see
infrequent cases of premature or misallocation out of space. It would
be nice to have this "it does nothing until" type of solution enabled
by default, if it's ready.

Thanks,

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-10-21 18:52 ` Chris Murphy
@ 2025-10-21 22:39 ` Leo Martins
  2025-10-22  0:37   ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Leo Martins @ 2025-10-21 22:39 UTC (permalink / raw)
To: Chris Murphy; +Cc: Boris Burkov, kernel-team, Btrfs BTRFS

On Tue, 21 Oct 2025 14:52:31 -0400 "Chris Murphy" <lists@colorremedies.com> wrote:

> >Tue, 15 Jul 2025 11:58:24 -0700
> https://lore.kernel.org/linux-btrfs/52b863849f0dd63b3d25a29c8a830a09c748d86b.1752605888.git.boris@bur.io/
>
> Fedora is interested in this enhancement. Any idea when it could be
> merged or if there are any outstanding concerns?
>
> In particular, I like the lack of knobs. It's either on or off. And it
> has no effect until unallocated space drops below 10G means it's super
> lightweight, affecting only users likely to end up in related corner
> cases.
>
> Fedora isn't installing btrfsmaintenance by default. We do see
> infrequent cases of premature or misallocation out of space. It would
> be nice to have this "it does nothing until" type solution enabled by
> default, if it's ready.
>
> Thanks,
>
> -- 
> Chris Murphy

Wanted to provide some data from the Meta rollout to give more context
on the decision to enable dynamic+periodic reclaim by default for data.
All the before numbers are with bg_reclaim_threshold set to 30.

Enabling dynamic+periodic reclaim for data block groups dramatically
decreases the number of reclaims per host, going from 150/day to just
5/day (p99), and from 6/day to 0/day (p50). The trade-offs are
increases in fragmentation, and a slight uptick in enospcs.

I currently don't have direct fragmentation metrics, though that is a
work in progress, but I'm tracking FP as a proxy for fragmentation.

FP = (allocated - used) / allocated
So if there are 100G allocated for data and 80G are used,
FP = (100 - 80) / 100 = 20%.

FP has increased from 30% to 45% (p99), and from 5% to 7% (p50).
Enospc rates have gone from around 0.5/day to 1/day per 100k hosts.
This is a doubling in rate, but still a very small absolute number of
enospcs. The unallocated space on disk decreases by ~15G (p99) and ~5G
(p50) after rollout.

Though fragmentation increases and unallocated space decreases, the
very small increase in enospcs suggests that this is a worthwhile
tradeoff.

One concern I still have is that replacing the aggressive
bg_reclaim_threshold with the conservative dynamic+periodic reclaim
will lead filesystems to slowly trend toward an "unhealthy" state of
high fragmentation, where dynamic+periodic reclaim will only do enough
to keep the filesystem alive, but not enough to make it "healthy"
again. So far, the data indicates these concerns are unfounded, as FP
and unallocated space seem to stabilize after their initial changes,
but I'll follow up if anything changes.

That being said, I don't think bg_reclaim_threshold is enabled by
default, and I am comfortable saying dynamic+periodic reclaim is
better than no automatic reclaim!

Thanks,
Leo Martins.

^ permalink raw reply	[flat|nested] 14+ messages in thread
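The FP proxy above is simple to compute from allocated/used space-info
numbers; a sketch (hypothetical helper, reproducing the worked example):

```c
#include <assert.h>
#include <stdint.h>

/* FP = (allocated - used) / allocated, expressed as a percentage. */
static unsigned int fp_pct(uint64_t allocated, uint64_t used)
{
	/* An empty space-info has nothing allocated, so nothing is wasted. */
	if (allocated == 0)
		return 0;
	return (unsigned int)((allocated - used) * 100 / allocated);
}
```

For 100G allocated with 80G used this gives 20%, matching the example;
the p99 shift reported above corresponds to roughly 45G of allocated-
but-unused data space per 100G allocated.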
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-10-21 22:39 ` Leo Martins @ 2025-10-22 0:37 ` Chris Murphy 2025-10-22 1:02 ` Boris Burkov 0 siblings, 1 reply; 14+ messages in thread From: Chris Murphy @ 2025-10-22 0:37 UTC (permalink / raw) To: Leo Martins; +Cc: Boris Burkov, kernel-team, Btrfs BTRFS Thanks for the response. On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote: > > Wanted to provide some data from the Meta rollout to give more context on the > decision to enable dynamic+periodic reclaim by default for data. All the before > numbers are with bg_reclaim_threshold set to 30. > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases > number of reclaims per host, going from 150/day to just 5/day (p99), and from > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a > slight uptick in enospcs. > > I currently don't have direct fragmentation metrics, though that is a > work in progress, but I'm tracking FP as a proxy for fragmentation. > > FP = (allocated - used) / allocated > So if there are 100G allocated for data and 80G are used, FP = (100 - > 80) / 100 = 20%. > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50). > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts. > This is a doubling in rate, but still a very small absolute number > of enospcs. The unallocated space on disk decreases by ~15G (p99) > and ~5G (p50) after rollout. I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc. > That being said I don't think bg_reclaim_threshold is enabled by default, > and I am comfortable saying dynamic+periodic reclaim is better than no > automatic reclaim! 
So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases? ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help. -- Chris Murphy ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-10-22 0:37 ` Chris Murphy @ 2025-10-22 1:02 ` Boris Burkov 2025-10-23 23:27 ` Leo Martins 0 siblings, 1 reply; 14+ messages in thread From: Boris Burkov @ 2025-10-22 1:02 UTC (permalink / raw) To: Chris Murphy; +Cc: Leo Martins, kernel-team, Btrfs BTRFS On Tue, Oct 21, 2025 at 08:37:18PM -0400, Chris Murphy wrote: > Thanks for the response. > > On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote: > > > > > Wanted to provide some data from the Meta rollout to give more context on the > > decision to enable dynamic+periodic reclaim by default for data. All the before > > numbers are with bg_reclaim_threshold set to 30. > > > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases > > number of reclaims per host, going from 150/day to just 5/day (p99), and from > > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a > > slight uptick in enospcs. > > > > I currently don't have direct fragmentation metrics, though that is a > > work in progress, but I'm tracking FP as a proxy for fragmentation. > > > > FP = (allocated - used) / allocated > > So if there are 100G allocated for data and 80G are used, FP = (100 - > > 80) / 100 = 20%. > > > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50). > > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts. Leo, correct me if I'm wrong, but we have yet to investigate a system where unallocated steadily marched down to 0 since the introduction of dynamic reclaim and then it ENOSPC'd, right? If there is a strong, undeniable increase in ENOSPCs we should absolutely look for such systems in those regions to motivate further improvements with full/filling filesystems. 
There is also the confounding variable of the bug fixed here: https://lore.kernel.org/linux-btrfs/22e8b64df3d4984000713433a89cfc14309b75fc.1759430967.git.boris@bur.io/ that has been plaguing our fleet causing ENOSPC issues. > > This is a doubling in rate, but still a very small absolute number > > of enospcs. The unallocated space on disk decreases by ~15G (p99) > > and ~5G (p50) after rollout. > > I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc. > Yeah, we don't have that data unfortunately. > > That being said I don't think bg_reclaim_threshold is enabled by default, > > and I am comfortable saying dynamic+periodic reclaim is better than no > > automatic reclaim! > > So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases? > I would imagine the cases that are tough for dynamic reclaim are: 1. genuinely quite full fs 2. rapidly needs a big hunk of metadata between entering the dynamic reclaim zone but before the cleaner thread / reclaim worker can run. > ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help. 
> > --
> > Chris Murphy

Anyway, I think Leo's forthcoming detailed per-BG fragmentation data
should be the most telling. System-level fragmentation percentage isn't
the most useful IMO.

Thanks,
Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data 2025-10-22 1:02 ` Boris Burkov @ 2025-10-23 23:27 ` Leo Martins 2025-12-13 22:09 ` Neal Gompa 0 siblings, 1 reply; 14+ messages in thread From: Leo Martins @ 2025-10-23 23:27 UTC (permalink / raw) To: Boris Burkov; +Cc: Chris Murphy, kernel-team, Btrfs BTRFS On Tue, 21 Oct 2025 18:02:15 -0700 Boris Burkov <boris@bur.io> wrote: > On Tue, Oct 21, 2025 at 08:37:18PM -0400, Chris Murphy wrote: > > Thanks for the response. > > > > On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote: > > > > > > > > Wanted to provide some data from the Meta rollout to give more context on the > > > decision to enable dynamic+periodic reclaim by default for data. All the before > > > numbers are with bg_reclaim_threshold set to 30. > > > > > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases > > > number of reclaims per host, going from 150/day to just 5/day (p99), and from > > > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a > > > slight uptick in enospcs. > > > > > > I currently don't have direct fragmentation metrics, though that is a > > > work in progress, but I'm tracking FP as a proxy for fragmentation. > > > > > > FP = (allocated - used) / allocated > > > So if there are 100G allocated for data and 80G are used, FP = (100 - > > > 80) / 100 = 20%. > > > > > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50). > > > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts. > > Leo, correct me if I'm wrong, but we have yet to investigate a system > where unallocated steadily marched down to 0 since the introduction of > dynamic reclaim and then it ENOSPC'd, right? If there is a strong, > undeniable increase in ENOSPCs we should absolutely look for such > systems in those regions to motivate further improvements with > full/filling filesystems. 
After digging some more the only examples I found of btrfs enospcing from lack of unallocated are true enospcs where either data or metadata were entirely full. > > There is also the confounding variable of the bug fixed here: > https://lore.kernel.org/linux-btrfs/22e8b64df3d4984000713433a89cfc14309b75fc.1759430967.git.boris@bur.io/ > that has been plaguing our fleet causing ENOSPC issues. Yes, a deeper look revealed that the increase in ENOSPCs is due to this bug and not dynamic+periodic reclaim. In fact, the hosts with dynamic+periodic reclaim enabled see a relatively smaller rate of enospc (about 2x less) than the rest of the fleet. > > > > This is a doubling in rate, but still a very small absolute number > > > of enospcs. The unallocated space on disk decreases by ~15G (p99) > > > and ~5G (p50) after rollout. > > > > I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc. > > > > Yeah, we don't have that data unfortunately. > > > > That being said I don't think bg_reclaim_threshold is enabled by default, > > > and I am comfortable saying dynamic+periodic reclaim is better than no > > > automatic reclaim! > > > > So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases? > > > > I would imagine the cases that are tough for dynamic reclaim are: > 1. genuinely quite full fs > 2. rapidly needs a big hunk of metadata between entering the dynamic > reclaim zone but before the cleaner thread / reclaim worker can run. Concerning point 1 it seems like dynamic+periodic reclaim actually does a pretty good job here. I haven't seen any signs of thrashing with low unallocated space. 
> > > ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help. > > > > -- > > Chris Murphy > > Anyway, I think Leo's forthcoming detailed per-BG fragmentation data > should be the most telling. System level fragmentation percentage > isn't the most useful IMO. > > Thanks, > Boris Since the uptick in enospcs is not actually linked to dynamic+periodic reclaim I now feel confident saying that dynamic+periodic reclaim should be enabled by default for data. Thanks, Leo Martins. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-10-23 23:27 ` Leo Martins
@ 2025-12-13 22:09 ` Neal Gompa
  0 siblings, 0 replies; 14+ messages in thread
From: Neal Gompa @ 2025-12-13 22:09 UTC (permalink / raw)
To: Leo Martins; +Cc: Boris Burkov, Chris Murphy, kernel-team, Btrfs BTRFS

On Thu, Oct 23, 2025 at 7:27 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> On Tue, 21 Oct 2025 18:02:15 -0700 Boris Burkov <boris@bur.io> wrote:
>
> > On Tue, Oct 21, 2025 at 08:37:18PM -0400, Chris Murphy wrote:
> > > Thanks for the response.
> > >
> > > On Tue, Oct 21, 2025, at 6:39 PM, Leo Martins wrote:
> > >
> > > >
> > > > Wanted to provide some data from the Meta rollout to give more context on the
> > > > decision to enable dynamic+periodic reclaim by default for data. All the before
> > > > numbers are with bg_reclaim_threshold set to 30.
> > > >
> > > > Enabling dynamic+periodic reclaim for data block groups dramatically decreases
> > > > the number of reclaims per host, going from 150/day to just 5/day (p99), and from
> > > > 6/day to 0/day (p50). The trade-offs are increases in fragmentation, and a
> > > > slight uptick in enospcs.
> > > >
> > > > I currently don't have direct fragmentation metrics, though that is a
> > > > work in progress, but I'm tracking FP as a proxy for fragmentation.
> > > >
> > > > FP = (allocated - used) / allocated
> > > > So if there are 100G allocated for data and 80G are used, FP = (100 -
> > > > 80) / 100 = 20%.
> > > >
> > > > FP has increased from 30% to 45% (p99), and from 5% to 7% (p50).
> > > > Enospc rates have gone from around 0.5/day to 1/day per 100k hosts.
> >
> > Leo, correct me if I'm wrong, but we have yet to investigate a system
> > where unallocated steadily marched down to 0 since the introduction of
> > dynamic reclaim and then it ENOSPC'd, right? If there is a strong,
> > undeniable increase in ENOSPCs we should absolutely look for such
> > systems in those regions to motivate further improvements with
> > full/filling filesystems.
>
> After digging some more, the only examples I found of btrfs enospcing
> from lack of unallocated are true enospcs where either data or metadata
> were entirely full.
>
> > There is also the confounding variable of the bug fixed here:
> > https://lore.kernel.org/linux-btrfs/22e8b64df3d4984000713433a89cfc14309b75fc.1759430967.git.boris@bur.io/
> > that has been plaguing our fleet causing ENOSPC issues.
>
> Yes, a deeper look revealed that the increase in ENOSPCs is
> due to this bug and not dynamic+periodic reclaim. In fact,
> the hosts with dynamic+periodic reclaim enabled see a relatively
> smaller rate of enospc (about 2x less) than the rest of the fleet.
>
> > > > This is a doubling in rate, but still a very small absolute number
> > > > of enospcs. The unallocated space on disk decreases by ~15G (p99)
> > > > and ~5G (p50) after rollout.
> > >
> > > I'm curious how it compares with default btrfsmaintenance btrfs-balance.timer/service - I'm guessing this is a bit harder to test at Meta in production due to the strictly time based trigger. And customization ends up being a choice between even higher reclaim or higher enospc.
> >
> > Yeah, we don't have that data unfortunately.
> >
> > > > That being said I don't think bg_reclaim_threshold is enabled by default,
> > > > and I am comfortable saying dynamic+periodic reclaim is better than no
> > > > automatic reclaim!
> > >
> > > So there are still corner cases occurring even with dynamic periodic reclaim. What do those look like? Is the file system unable to write metadata for arbitrary deletes to back the file system out? Or is it stuck in some cases?
> >
> > I would imagine the cases that are tough for dynamic reclaim are:
> > 1. genuinely quite full fs
> > 2. rapidly needs a big hunk of metadata between entering the dynamic
> > reclaim zone but before the cleaner thread / reclaim worker can run.
>
> Concerning point 1, it seems like dynamic+periodic reclaim actually does
> a pretty good job here. I haven't seen any signs of thrashing with low
> unallocated space.
>
> > > ext4 users are used to 5% of space being held in reserve for root user processes. I'm not sure if xfs has such a concept. Btrfs global reserve is different in that even root can't use it, it's really reserved for the kernel. But sometimes it's still possible to exhaust this metadata space, and be unable to delete files or balance even 1 data bg to back the file system out of the situation. The wedged in file system that keeps going read-only and appears stuck is a big concern since users have no idea what to do. And internet searches tend to produce results that are less help than no help.
> > >
> > > --
> > > Chris Murphy
> >
> > Anyway, I think Leo's forthcoming detailed per-BG fragmentation data
> > should be the most telling. System level fragmentation percentage
> > isn't the most useful IMO.
> >
> > Thanks,
> > Boris
>
> Since the uptick in enospcs is not actually linked to dynamic+periodic
> reclaim I now feel confident saying that dynamic+periodic reclaim
> should be enabled by default for data.

So have we done this yet? If not, what's holding this up?

--
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply [flat|nested] 14+ messages in thread
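Leo's FP proxy quoted above is simple enough to reproduce exactly. A minimal sketch of the formula as stated in the thread (illustrative Python, not code from the patch or Meta's tooling; the function name is made up):

```python
def fragmentation_proxy(allocated: int, used: int) -> float:
    """FP = (allocated - used) / allocated, per the definition in the thread."""
    if allocated == 0:
        return 0.0
    return (allocated - used) / allocated

GiB = 1024 ** 3
# Worked example from the thread: 100G allocated for data, 80G used -> 20%.
print(f"{fragmentation_proxy(100 * GiB, 80 * GiB):.0%}")  # -> 20%
```

Note this is a fleet-level proxy: as Boris points out later in the thread, per-block-group fragmentation data is more telling than a system-wide percentage.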
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov
  2025-07-16 6:24 ` Johannes Thumshirn
  2025-10-21 18:52 ` Chris Murphy
@ 2025-12-26 3:07 ` Sun Yangkai
  2025-12-30 0:00 ` Boris Burkov
  2 siblings, 1 reply; 14+ messages in thread
From: Sun Yangkai @ 2025-12-26 3:07 UTC (permalink / raw)
To: boris; +Cc: kernel-team, linux-btrfs

Hi Boris,

Thank you for bringing such a feature to btrfs. I love it and have tried
to enable it on my machine.

But I've run into some unexpected behavior when periodic dynamic reclaim
is enabled and the filesystem is nearly full.

[12月26 10:41] [T20373] BTRFS info (device sda): relocating block group 5214541578240 flags data
[ +0.012446] [T20373] BTRFS error (device sda): error relocating chunk 5214541578240
[ +0.000033] [T20373] BTRFS info (device sda): relocating block group 4540021997568 flags data
[ +0.008927] [T20373] BTRFS error (device sda): error relocating chunk 4540021997568
[ +0.000025] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
[12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 10:52] [ T7643] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +3.960314] [ T7643] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 10:57] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +3.954485] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
[12月26 11:02] [ T7701] BTRFS info (device sda): relocating block group 5606746750976 flags data
[ +4.561796] [ T7701] BTRFS error (device sda): error relocating chunk 5606746750976

I suspect the condition for when periodic reclaim should happen still
needs polishing. I'm still digging further into it.

Thanks,
Sun YangKai
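For anyone triaging similar logs, the repeated failures for a single chunk can be tallied mechanically. A hypothetical helper (the message format is taken from the dmesg output above; the function name and script are made up for illustration):

```python
import re
from collections import Counter

# Count how often each chunk fails to relocate, from dmesg-style lines.
# A chunk that keeps showing up points at a block group that the reclaim
# worker repeatedly selects but cannot actually relocate.
ERR_RE = re.compile(r"BTRFS error \(device \S+\): error relocating chunk (\d+)")

def failed_relocations(lines):
    return Counter(m.group(1) for line in lines if (m := ERR_RE.search(line)))

sample = [
    "[12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976",
    "[12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data",
    "[ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976",
]
print(failed_relocations(sample).most_common(1))  # [('5606746750976', 2)]
```

Run against the full log above, this would show chunk 5606746750976 failing on every pass, which is the repeating pattern Sun is describing.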
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-12-26 3:07 ` Sun Yangkai
@ 2025-12-30 0:00 ` Boris Burkov
  2025-12-30 1:29 ` Sun Yangkai
  2025-12-30 1:41 ` Sun Yangkai
  0 siblings, 2 replies; 14+ messages in thread
From: Boris Burkov @ 2025-12-30 0:00 UTC (permalink / raw)
To: Sun Yangkai; +Cc: kernel-team, linux-btrfs

On Fri, Dec 26, 2025 at 11:07:28AM +0800, Sun Yangkai wrote:
> Hi Boris,

First off, sorry for not replying promptly. I've been in and out of the
office around the holidays.

>
> Thank you for bringing such a feature to btrfs. I love it and have tried
> to enable it on my machine.

I really appreciate your kind words and your interest in the feature.
Thank you!

>
> But I've run into some unexpected behavior when periodic dynamic reclaim
> is enabled and the filesystem is nearly full.

Oops! Let's debug it :)

> [12月26 10:41] [T20373] BTRFS info (device sda): relocating block group 5214541578240 flags data
> [ +0.012446] [T20373] BTRFS error (device sda): error relocating chunk 5214541578240
> [ +0.000033] [T20373] BTRFS info (device sda): relocating block group 4540021997568 flags data
> [ +0.008927] [T20373] BTRFS error (device sda): error relocating chunk 4540021997568
> [ +0.000025] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 10:52] [ T7643] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +3.960314] [ T7643] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 10:57] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +3.954485] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
> [12月26 11:02] [ T7701] BTRFS info (device sda): relocating block group 5606746750976 flags data
> [ +4.561796] [ T7701] BTRFS error (device sda): error relocating chunk 5606746750976
>
> I suspect the condition for when periodic reclaim should happen still
> needs polishing.

Yeah, it looks like it is triggering too frequently in conditions where
it isn't likely to succeed. Hopefully we can tune up the heuristics (or
just fix the bug you found) so that it works better.

It seems to be triggering every 5 minutes or so, right? Is that the
interval of the cleaner thread running on your system? Or am I
misinterpreting the time stamps? I would normally expect the default of
30s.

>
> I'm still digging further into it.

Were you able to confirm whether that negative reclaimable_bytes bug was
the root cause here?

If you aren't able to reproduce but it is still happening on one of your
systems, we can try to instrument the periodic reclaim lifecycle with
bpftrace to catch calls to the various important functions setting it
reclaimable, etc.

Please let me know if I can assist you with that, or if you do have a
reproducer I could also look at.

Thanks,
Boris

>
> Thanks,
> Sun YangKai
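As an aside for readers following along, the knobs involved live under the sysfs path given in the cover letter (/sys/fs/btrfs/&lt;uuid&gt;/allocation/data/). A small sketch for inspecting them, assuming that layout; `read_reclaim_knobs` is a hypothetical name, not a real tool:

```python
from pathlib import Path

# The three files named in the cover letter: the two on/off switches and the
# threshold, which becomes read-only (a snapshot of the dynamic value) once
# dynamic reclaim is enabled.
KNOBS = ("dynamic_reclaim", "periodic_reclaim", "bg_reclaim_threshold")

def read_reclaim_knobs(data_dir):
    """Read the reclaim settings from a data-allocation sysfs directory."""
    base = Path(data_dir)
    return {k: int((base / k).read_text()) for k in KNOBS}
```

On a live system this would be called as, e.g., `read_reclaim_knobs("/sys/fs/btrfs/<uuid>/allocation/data")`; writing 0 to either _reclaim file disables that feature, per the cover letter.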
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-12-30 0:00 ` Boris Burkov
@ 2025-12-30 1:29 ` Sun Yangkai
  2025-12-30 1:41 ` Sun Yangkai
  1 sibling, 0 replies; 14+ messages in thread
From: Sun Yangkai @ 2025-12-30 1:29 UTC (permalink / raw)
To: Boris Burkov; +Cc: kernel-team, linux-btrfs

On 2025/12/30 08:00, Boris Burkov wrote:
> On Fri, Dec 26, 2025 at 11:07:28AM +0800, Sun Yangkai wrote:
>> Hi Boris,
>
> First off, sorry for not replying promptly. I've been in and out of the
> office around the holidays.
>
>> Thank you for bringing such a feature to btrfs. I love it and have tried
>> to enable it on my machine.
>
> I really appreciate your kind words and your interest in the feature.
> Thank you!
>
>> But I've run into some unexpected behavior when periodic dynamic reclaim
>> is enabled and the filesystem is nearly full.
>
> Oops! Let's debug it :)
>
>> [12月26 10:41] [T20373] BTRFS info (device sda): relocating block group 5214541578240 flags data
>> [ +0.012446] [T20373] BTRFS error (device sda): error relocating chunk 5214541578240
>> [ +0.000033] [T20373] BTRFS info (device sda): relocating block group 4540021997568 flags data
>> [ +0.008927] [T20373] BTRFS error (device sda): error relocating chunk 4540021997568
>> [ +0.000025] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [12月26 10:42] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 10:47] [T12072] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +3.960400] [T12072] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 10:52] [ T7643] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +3.960314] [ T7643] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 10:57] [T20373] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +3.954485] [T20373] BTRFS error (device sda): error relocating chunk 5606746750976
>> [12月26 11:02] [ T7701] BTRFS info (device sda): relocating block group 5606746750976 flags data
>> [ +4.561796] [ T7701] BTRFS error (device sda): error relocating chunk 5606746750976
>>
>> I suspect the condition for when periodic reclaim should happen still
>> needs polishing.
>
> Yeah, it looks like it is triggering too frequently in conditions where
> it isn't likely to succeed. Hopefully we can tune up the heuristics (or
> just fix the bug you found) so that it works better.
>
> It seems to be triggering every 5 minutes or so, right? Is that the
> interval of the cleaner thread running on your system? Or am I
> misinterpreting the time stamps? I would normally expect the default of
> 30s.

Yes, my system has commit=300. It was set years ago when I knew almost
nothing about btrfs and hasn't been changed since.

>>
>> I'm still digging further into it.
>
> Were you able to confirm whether that negative reclaimable_bytes bug was
> the root cause here?

Yes. After changing chunk_sz to s64, this is not triggered anymore.
However, periodic reclaim still does not work properly.

> If you aren't able to reproduce but it is still happening on one of your
> systems, we can try to instrument the periodic reclaim lifecycle with
> bpftrace to catch calls to the various important functions setting it
> reclaimable, etc.

Thank you for your advice. That's what I've done and how I found the
unexpected behavior. It's really a good tool for seeing what's happening
in the kernel.

> Please let me know if I can assist you with that, or if you do have a
> reproducer I could also look at.

I've redesigned the logic and iterated through some versions. I'll clean
up my code and send the patches later, maybe later today or tomorrow.
It's not perfect, but I hope it will be better than what we have now.

Thanks,
Sun YangKai
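The chunk_sz-to-s64 fix Sun mentions points at unsigned underflow: in kernel C, a u64 subtraction that "goes negative" wraps around to an enormous value instead, which could make a nearly-empty filesystem look like it has a huge amount to reclaim. A toy model of that failure class (this simulates u64 arithmetic with a 64-bit mask; it is not the actual kernel code):

```python
# Kernel u64 arithmetic is modulo 2^64; model it with an explicit mask.
U64_MASK = (1 << 64) - 1

def u64_sub(a: int, b: int) -> int:
    """Subtract the way a C u64 would: no negatives, just wraparound."""
    return (a - b) & U64_MASK

GiB = 1024 ** 3
# "1 GiB minus 2 GiB" should be -1 GiB, but as u64 it wraps to ~16 EiB:
print(u64_sub(1 * GiB, 2 * GiB))  # 18446744072635809792
```

Switching the intermediate to a signed 64-bit type (s64), as Sun describes, lets the negative value be seen and clamped instead of wrapping.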
* Re: [PATCH] btrfs: make periodic dynamic reclaim the default for data
  2025-12-30 0:00 ` Boris Burkov
  2025-12-30 1:29 ` Sun Yangkai
@ 2025-12-30 1:41 ` Sun Yangkai
  1 sibling, 0 replies; 14+ messages in thread
From: Sun Yangkai @ 2025-12-30 1:41 UTC (permalink / raw)
To: Boris Burkov; +Cc: kernel-team, linux-btrfs

> Please let me know if I can assist you with that, or if you do have a
> reproducer I could also look at.

I just came across a ... thing I found, and I have no idea why it
happened.

I've written a script to show the 10 least used block groups (used_space
is just calculated from length and used_pct, so please just ignore it)
and before periodic reclaim, I got:

Searching for the 10 least used DATA block groups...
vaddr          length   used_pct  used_space
--------------------------------------------------------
6353387388928  1024MiB  5%        51MiB
6354461130752  1024MiB  30%       307MiB
6295292084224  1024MiB  80%       819MiB
6056900427776  1024MiB  89%       911MiB
4620552634368  1024MiB  97%       993MiB
6050457976832  1024MiB  98%       1003MiB
6122398679040  1024MiB  98%       1003MiB
6270596022272  1024MiB  98%       1003MiB
6350166163456  1024MiB  98%       1003MiB
383347851264   1024MiB  99%       1013MiB

Unallocated space is 3GiB, so with dynamic periodic reclaim the first
two block groups will be reclaimed:

[12月28 21:47] [ T357] BTRFS info (device sda): relocating block group 6353387388928 flags data
[ +0.262467] [ T357] BTRFS info (device sda): found 1970 extents, stage: move data extents
[ +1.334556] [ T357] BTRFS info (device sda): found 1966 extents, stage: update data pointers
[ +0.618457] [ T357] BTRFS info (device sda): relocating block group 6354461130752 flags data
[ +1.009694] [ T357] BTRFS info (device sda): found 166 extents, stage: move data extents
[ +0.388070] [ T357] BTRFS info (device sda): found 166 extents, stage: update data pointers

And after the reclaim I got:

Searching for the 10 least used DATA block groups...
vaddr          length   used_pct  used_space
--------------------------------------------------------
6355534872576  1024MiB  6%        61MiB
6356608614400  1024MiB  16%       163MiB
6295292084224  1024MiB  80%       819MiB
4620552634368  1024MiB  97%       993MiB
6050457976832  1024MiB  98%       1003MiB
6270596022272  1024MiB  98%       1003MiB
3782605471744  1024MiB  99%       1013MiB
4549685673984  1024MiB  99%       1013MiB
5882820034560  1024MiB  99%       1013MiB
5909764243456  1024MiB  99%       1013MiB

These two block groups could be merged into existing chunks, but I have
no idea why that didn't happen. But when I run

  btrfs balance start -dvrange=6355534872576..6355534872576 /mnt

they can be merged, freeing some unallocated space. So I think periodic
reclaim behaves differently from a manual balance?

Thanks,
Sun YangKai
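The selection step Sun's script approximates - relocate the emptiest data block groups first - can be sketched as a greedy loop. This models only the "pick the best block groups first" idea from the cover letter; the in-kernel heuristic (the dynamic threshold, per-pass limits, the escalating aggressiveness near zero unallocated) is more involved, and every name below is hypothetical:

```python
GiB = 1024 ** 3

def pick_reclaim_candidates(block_groups, unallocated, target=10 * GiB):
    """Greedy sketch: reclaim the least-used data block groups until
    unallocated space reaches the target (10G per the cover letter).
    block_groups: iterable of (vaddr, length_bytes, used_bytes) tuples.
    'length - used' is a crude model of the net space handed back, since
    the used data must be relocated into other chunks."""
    picked = []
    for vaddr, length, used in sorted(block_groups, key=lambda bg: bg[2] / bg[1]):
        if unallocated >= target:
            break
        picked.append(vaddr)
        unallocated += length - used
    return picked

# First three rows of Sun's "before" table, unallocated = 3GiB:
bgs = [(6353387388928, GiB, GiB * 5 // 100),
       (6354461130752, GiB, GiB * 30 // 100),
       (6295292084224, GiB, GiB * 80 // 100)]
print(pick_reclaim_candidates(bgs, unallocated=3 * GiB)[:2])
# -> [6353387388928, 6354461130752]
```

This matches the ordering Sun observed (the 5%- and 30%-used block groups go first), but note his question stands: selection is only half the story, and a reclaim pass that relocates extents without merging them back into existing chunks behaves differently from a targeted manual balance.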
end of thread, other threads:[~2025-12-30 1:41 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-07-15 18:58 [PATCH] btrfs: make periodic dynamic reclaim the default for data Boris Burkov
2025-07-16 6:24 ` Johannes Thumshirn
2025-07-16 15:56 ` Boris Burkov
2025-07-17 12:55 ` Johannes Thumshirn
2025-10-21 18:52 ` Chris Murphy
2025-10-21 22:39 ` Leo Martins
2025-10-22 0:37 ` Chris Murphy
2025-10-22 1:02 ` Boris Burkov
2025-10-23 23:27 ` Leo Martins
2025-12-13 22:09 ` Neal Gompa
2025-12-26 3:07 ` Sun Yangkai
2025-12-30 0:00 ` Boris Burkov
2025-12-30 1:29 ` Sun Yangkai
2025-12-30 1:41 ` Sun Yangkai