Re: [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Boris Burkov <boris@bur.io>
To: Sun Yangkai <sunk67188@gmail.com>
Cc: kernel-team@fb.com, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops
Date: Mon, 29 Dec 2025 15:54:59 -0800	[thread overview]
Message-ID: <aVMU08bgkt79Pl/Q@devvm12410.ftw0.facebook.com> (raw)
In-Reply-To: <29f9e528-9dbe-48fb-9353-6ad7177be50f@gmail.com>

On Fri, Dec 26, 2025 at 12:18:51PM +0800, Sun Yangkai wrote:
> > Periodic reclaim runs the risk of getting stuck in a state where it
> > keeps reclaiming the same block group over and over. This can happen if
> > 1. reclaiming that block_group fails
> > 2. reclaiming that block_group fails to move any extents into existing
> >    block_groups and just allocates a fresh chunk and moves everything.
> > 
> > Currently, 1. is a very tight loop inside the reclaim worker. That is
> > critical for edge triggered reclaim or else we risk forgetting about a
> > reclaimable group. On the other hand, with level triggered reclaim we
> > can break out of that loop and get it later.
> > 
> > With that fixed, 2. applies to both failures and "successes" with no
> > progress. If we have done a periodic reclaim on a space_info and nothing
> > has changed in that space_info, there is not much point to trying again,
> > so don't, until enough space gets free, which we capture with a
> > heuristic of needing to net free 1 chunk.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/block-group.c | 12 ++++++---
> >  fs/btrfs/space-info.c  | 56 ++++++++++++++++++++++++++++++++++++------
> >  fs/btrfs/space-info.h  | 14 +++++++++++
> >  3 files changed, 71 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 6bcf24f2ac79..ba9afb94e7ce 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1933,6 +1933,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >  			reclaimed = 0;
> >  			spin_lock(&space_info->lock);
> >  			space_info->reclaim_errors++;
> > +			if (READ_ONCE(space_info->periodic_reclaim))
> > +				space_info->periodic_reclaim_ready = false;
> 
> I wonder why we're not clearing the reclaimble_bytes count here.
> 

As far as I can tell, it's an oversight. I think it ought to use
btrfs_set_reclaim_ready(fs_info, false) here. However, FWIW, the
reclaimable bytes already got reset when we checked whether reclaim was
ready, so this would extra re-clear it after. It honestly might make the
most sense to just get rid of this logic here entirely, as it's pretty
redundant with setting ready=false at the start of each invocation of
periodic_reclaim.

> >  			spin_unlock(&space_info->lock);
> >  		}
> >  		spin_lock(&space_info->lock);
> > @@ -1941,7 +1943,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >  		spin_unlock(&space_info->lock);
> >  
> >  next:
> > -		if (ret) {
> > +		if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> >  			/* Refcount held by the reclaim_bgs list after splice. */
> >  			btrfs_get_block_group(bg);
> >  			list_add_tail(&bg->bg_list, &retry_list);
> > @@ -3677,6 +3679,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >  		space_info->bytes_reserved -= num_bytes;
> >  		space_info->bytes_used += num_bytes;
> >  		space_info->disk_used += num_bytes * factor;
> > +		if (READ_ONCE(space_info->periodic_reclaim))
> > +			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> >  		spin_unlock(&cache->lock);
> >  		spin_unlock(&space_info->lock);
> >  	} else {
> > @@ -3686,8 +3690,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >  		btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
> >  		space_info->bytes_used -= num_bytes;
> >  		space_info->disk_used -= num_bytes * factor;
> > -
> > -		reclaim = should_reclaim_block_group(cache, num_bytes);
> > +		if (READ_ONCE(space_info->periodic_reclaim))
> > +			btrfs_space_info_update_reclaimable(space_info, num_bytes);
> > +		else
> > +			reclaim = should_reclaim_block_group(cache, num_bytes);
> >  
> >  		spin_unlock(&cache->lock);
> >  		spin_unlock(&space_info->lock);
> > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > index ff92ad26ffa2..e7a2aa751f8f 100644
> > --- a/fs/btrfs/space-info.c
> > +++ b/fs/btrfs/space-info.c
> > @@ -1,5 +1,6 @@
> >  // SPDX-License-Identifier: GPL-2.0
> >  
> > +#include "linux/spinlock.h"
> >  #include <linux/minmax.h>
> >  #include "misc.h"
> >  #include "ctree.h"
> > @@ -1899,7 +1900,9 @@ static u64 calc_pct_ratio(u64 x, u64 y)
> >   */
> >  static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
> >  {
> > -	return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
> > +	u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
> > +
> > +	return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
> >  }
> >  
> >  /*
> > @@ -1935,14 +1938,13 @@ static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
> >  	u64 unused = alloc - used;
> >  	u64 want = target > unalloc ? target - unalloc : 0;
> >  	u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
> > -	/* Cast to int is OK because want <= target */
> > -	int ratio = calc_pct_ratio(want, target);
> >  
> > -	/* If we have no unused space, don't bother, it won't work anyway */
> > +	/* If we have no unused space, don't bother, it won't work anyway. */
> >  	if (unused < data_chunk_size)
> >  		return 0;
> >  
> > -	return ratio;
> > +	/* Cast to int is OK because want <= target. */
> > +	return calc_pct_ratio(want, target);
> >  }
> >  
> >  int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
> > @@ -1984,6 +1986,46 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
> >  	return 0;
> >  }
> >  
> > +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> > +{
> > +	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> > +
> > +	assert_spin_locked(&space_info->lock);
> > +	space_info->reclaimable_bytes += bytes;
> > +
> > +	if (space_info->reclaimable_bytes >= chunk_sz)
> 
> We're comparing s64 with u64 here, and it won't work as expected.

Good catch. Since we could do a bunch of allocation and make
reclaimable_bytes negative, we need to check for negative
reclaimable_bytes first, as the coercion to a giant u64 is actually
quite possible in practice.

> 
> Even after fixing this by changing chunk_sz to s64, it will not work as expected
> in the following case:
> 
> - We're filling the disk so reclaimable_bytes is always negative.
> - There's less than 10G unallocated so dynamic_reclaim kicked in.
> - periodic_reclaim will never work since reclaimable_bytes is always negetive.
> 

Yes, I agree that this case will thrash the reclaim if we keep
allocating constantly (without going enospc).

Good catch! Hopefully that's what is happening in your report on the
other email.

> > +		btrfs_set_periodic_reclaim_ready(space_info, true);
> > +}
> > +
> > +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> > +{
> > +	assert_spin_locked(&space_info->lock);
> > +	if (!READ_ONCE(space_info->periodic_reclaim))
> > +		return;
> > +	if (ready != space_info->periodic_reclaim_ready) {
> > +		space_info->periodic_reclaim_ready = ready;
> > +		if (!ready)
> > +			space_info->reclaimable_bytes = 0;
> > +	}
> > +}
> > +
> > +bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> > +{
> > +	bool ret;
> > +
> > +	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> > +		return false;
> > +	if (!READ_ONCE(space_info->periodic_reclaim))
> > +		return false;
> > +
> > +	spin_lock(&space_info->lock);
> > +	ret = space_info->periodic_reclaim_ready;
> > +	btrfs_set_periodic_reclaim_ready(space_info, false);
> > +	spin_unlock(&space_info->lock);
> > +
> > +	return ret;
> > +}
> > +
> >  int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
> >  {
> >  	int ret;
> > @@ -1991,9 +2033,7 @@ int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
> >  	struct btrfs_space_info *space_info;
> >  
> >  	list_for_each_entry(space_info, &fs_info->space_info, list) {
> > -		if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> > -			continue;
> > -		if (!READ_ONCE(space_info->periodic_reclaim))
> > +		if (!btrfs_should_periodic_reclaim(space_info))
> >  			continue;
> >  		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
> >  			ret = do_reclaim_sweep(fs_info, space_info, raid);
> > diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> > index ae4a1f7d5856..4db8a0267c16 100644
> > --- a/fs/btrfs/space-info.h
> > +++ b/fs/btrfs/space-info.h
> > @@ -196,6 +196,17 @@ struct btrfs_space_info {
> >  	 * threshold in the cleaner thread.
> >  	 */
> >  	bool periodic_reclaim;
> > +
> > +	/*
> > +	 * Periodic reclaim should be a no-op if a space_info hasn't
> > +	 * freed any space since the last time we tried.
> > +	 */
> > +	bool periodic_reclaim_ready;
> 
> Also, I wonder why we need this bool flag. I think we care more about if
> reclaimable_bytes' value is more than 1G when calling
> btrfs_should_periodic_reclaim() rather than if it has been more than 1G during
> two calls of btrfs_should_periodic_reclaim().

This is probably an accident of how the logic evolved as I iterated on
the design. The first version was to set ready false on a failure then
set it true on a deallocation, then I "enhanced" it with the
reclaimable_bytes logic to make it more conservative.

I think getting rid of periodic_reclaim_ready and just doing it off of
bytes_reclaimable (which still gets reset after every reclaim attempt)
would make sense.

Thanks,
Boris

> 
> Thanks,
> Sun YangKai
> 
> > +
> > +	/*
> > +	 * Net bytes freed or allocated since the last reclaim pass.
> > +	 */
> > +	s64 reclaimable_bytes;
> >  };
> >  
> >  struct reserve_ticket {
> > @@ -278,6 +289,9 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
> >  void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
> >  u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
> >  
> > +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
> > +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
> > +bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info);
> >  int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
> >  int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);
> >  
> > -- 
> > 2.45.2
>

next prev parent reply	other threads:[~2025-12-29 23:55 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
2024-06-17 23:11 ` [PATCH v2 1/6] btrfs: report reclaim stats in sysfs Boris Burkov
2024-06-17 23:11 ` [PATCH v2 2/6] btrfs: store fs_info on space_info Boris Burkov
2024-06-17 23:11 ` [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
2024-06-25 13:40   ` Naohiro Aota
2024-06-17 23:11 ` [PATCH v2 4/6] btrfs: periodic block_group reclaim Boris Burkov
2024-06-17 23:11 ` [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
2024-06-24 15:23   ` Josef Bacik
2024-06-24 16:05     ` David Sterba
2025-12-26  4:18   ` Sun Yangkai
2025-12-29 23:54     ` Boris Burkov [this message]
2024-06-17 23:11 ` [PATCH v2 6/6] btrfs: urgent periodic reclaim pass Boris Burkov
2024-06-24 15:25 ` [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aVMU08bgkt79Pl/Q@devvm12410.ftw0.facebook.com \
    --to=boris@bur.io \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=sunk67188@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.