[XFS] Question: can xfs_agfl_free_finish_item exhaust AGFL during bnobt/cntbt split?

public inbox for linux-xfs@vger.kernel.org

* [XFS] Question: can xfs_agfl_free_finish_item exhaust AGFL during bnobt/cntbt split?
@ 2026-04-12 16:32 Jinliang Zheng
  2026-04-13  0:14 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Jinliang Zheng @ 2026-04-12 16:32 UTC
  To: djwong, linux-xfs; +Cc: alexjlzheng

Hi,

I have a question about the deferred AGFL block freeing path and whether it is
safe from AGFL exhaustion during bnobt/cntbt btree splits.

When `xfs_agfl_free_finish_item` runs, it calls:

    xfs_agfl_free_finish_item
      -> xfs_free_agfl_block
           -> xfs_free_ag_extent   (fs/xfs/libxfs/xfs_alloc.c)

`xfs_free_ag_extent` directly manipulates bnobt and cntbt, and inserting a new
free extent record can trigger a btree split. A split will try to allocate a
block from the AGFL via `xfs_allocbt_alloc_block` -> `xfs_alloc_get_freelist`.
If the AGFL is empty at that point, `xfs_alloc_get_freelist` returns
NULLAGBLOCK and the split fails.
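
A minimal userspace sketch of that failure mode, assuming a toy ring-buffer
free list. All `toy_*` names and `TOY_AGFL_SIZE` are hypothetical stand-ins
for illustration only; this is not the kernel's AGFL code, it just models
"take a block, or get a null block number when the list is empty":

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NULLAGBLOCK	((uint32_t)-1)	/* stand-in for NULLAGBLOCK */
#define TOY_AGFL_SIZE	8		/* hypothetical, tiny for illustration */

struct toy_agfl {
	uint32_t	blocks[TOY_AGFL_SIZE];
	unsigned int	first;	/* index of the next block to hand out */
	unsigned int	count;	/* models pagf_flcount */
};

/* Add a block to the tail of the toy free list. */
void toy_put_freelist(struct toy_agfl *fl, uint32_t bno)
{
	assert(fl->count < TOY_AGFL_SIZE);
	fl->blocks[(fl->first + fl->count) % TOY_AGFL_SIZE] = bno;
	fl->count++;
}

/* Take a block from the head; TOY_NULLAGBLOCK when the list is empty. */
uint32_t toy_get_freelist(struct toy_agfl *fl)
{
	uint32_t bno;

	if (fl->count == 0)
		return TOY_NULLAGBLOCK;
	bno = fl->blocks[fl->first];
	fl->first = (fl->first + 1) % TOY_AGFL_SIZE;
	fl->count--;
	return bno;
}

/*
 * Model a btree split that needs nblocks new blocks.
 * Returns 0 on success, -1 if the free list ran dry part-way through.
 */
int toy_split(struct toy_agfl *fl, unsigned int nblocks)
{
	while (nblocks--) {
		if (toy_get_freelist(fl) == TOY_NULLAGBLOCK)
			return -1;
	}
	return 0;
}
```

With two blocks on the toy list, `toy_split(&fl, 2)` succeeds and leaves the
list empty; a further `toy_split()` that needs any block then fails.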

The normal extent-freeing path (`__xfs_free_extent`) guards against this by
calling `xfs_free_extent_fix_freelist` first, which invokes
`xfs_alloc_fix_freelist(..., XFS_ALLOC_FLAG_FREEING)` to ensure the AGFL holds
at least `xfs_alloc_min_freelist(mp, pag)` blocks before touching the btrees.

However, `xfs_agfl_free_finish_item` does not call fix_freelist at all.

The deferred AGFL frees are created by `xfs_alloc_fix_freelist` during its
shrink phase: blocks are removed from the AGFL immediately (pagf_flcount is
decremented synchronously by `xfs_alloc_get_freelist`), and their actual return
to the free space btrees is deferred via `xfs_defer_agfl_block`. After the
shrink loop, the AGFL sits at exactly `need` blocks.
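
To make the accounting explicit, here is a small sketch of the shrink phase
as I understand it, reduced to two counters. `struct toy_ag` and
`toy_fix_freelist_shrink` are hypothetical names, and the real function does
much more (locking, busy extents, the grow phase); only the "decrement now,
free later" ordering is modelled:

```c
#include <assert.h>

struct toy_ag {
	unsigned int	flcount;	/* models pagf_flcount */
	unsigned int	deferred;	/* AGFL frees queued but not yet run */
};

/*
 * Shrink the toy free list down to 'need' blocks. Each removed block
 * leaves the list immediately; its insert into the free space btrees
 * is only queued here and would run in a later transaction.
 */
void toy_fix_freelist_shrink(struct toy_ag *ag, unsigned int need)
{
	while (ag->flcount > need) {
		ag->flcount--;		/* synchronous: block is off the AGFL now */
		ag->deferred++;		/* deferred: btree update happens later */
	}
	/* Postcondition: the list is never left longer than 'need'. */
	assert(ag->flcount <= need);
}
```

Starting from `flcount = 12` with `need = 8`, this leaves `flcount = 8` and
four queued frees, none of which has touched a btree yet.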

But before `xfs_agfl_free_finish_item` executes, `xfs_defer_finish_noroll`
rolls the transaction. Other deferred operations that execute in between—for
example, regular extent frees (`XFS_DEFER_OPS_TYPE_FREE`) that themselves call
fix_freelist and may re-shrink the AGFL, or rmapbt operations—could consume
AGFL blocks and leave the count below `need` by the time our deferred AGFL free
runs.

So my question is: is there a guarantee that the AGFL will have enough blocks
to service any btree split needed by `xfs_free_ag_extent` inside
`xfs_agfl_free_finish_item`? If so, what mechanism provides that guarantee?
I don't see a fix_freelist call or any equivalent reservation in that code path.

Thanks,
Jinliang Zheng

* Re: [XFS] Question: can xfs_agfl_free_finish_item exhaust AGFL during bnobt/cntbt split?
  2026-04-12 16:32 [XFS] Question: can xfs_agfl_free_finish_item exhaust AGFL during bnobt/cntbt split? Jinliang Zheng
@ 2026-04-13  0:14 ` Dave Chinner
  2026-04-13  2:48   ` Jinliang Zheng
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2026-04-13  0:14 UTC
  To: Jinliang Zheng; +Cc: djwong, linux-xfs, alexjlzheng

On Mon, Apr 13, 2026 at 12:32:50AM +0800, Jinliang Zheng wrote:
> Hi,
> 
> I have a question about the deferred AGFL block freeing path and whether it is
> safe from AGFL exhaustion during bnobt/cntbt btree splits.
> 
> When `xfs_agfl_free_finish_item` runs, it calls:
> 
>     xfs_agfl_free_finish_item
>       -> xfs_free_agfl_block
>            -> xfs_free_ag_extent   (fs/xfs/libxfs/xfs_alloc.c)
> 
> `xfs_free_ag_extent` directly manipulates bnobt and cntbt, and inserting a new
> free extent record can trigger a btree split. A split will try to allocate a
> block from the AGFL via `xfs_allocbt_alloc_block` -> `xfs_alloc_get_freelist`.
> If the AGFL is empty at that point, `xfs_alloc_get_freelist` returns
> NULLAGBLOCK and the split fails.

And then the filesystem will shut down because of a fatal error in a
dirty transaction.

My first question is this: is this actually happening in practice,
or is this a theoretical concern?

We don't attempt to provide a 100% guarantee of correctness in free
space accounting, allocation or freeing algorithms in XFS. Complexity
is high, and there are lots of corner cases that are extremely
difficult to reason about. The cost of providing absolute guarantees
is ... prohibitive.

Hence I'd pretty much guarantee the logic and/or math isn't
comprehensive or water-tight. We won't try to handle the
really convoluted "many weird things have to align exactly right"
corner cases, and will instead use things like reservation pools to
handle them. The AGFL is a good example of a reservation
pool that we make deep enough to handle real world usage without
needing to care about the complexities of trying to guarantee
behaviour in exceedingly rare corner cases.

For example, we changed the AGFL reservations for the first time in
2023 - they were introduced almost 30 years ago (1994), and had
remained unchanged until then. i.e. commit f63a5b3769ad ("xfs: fix
internal error from AGFL exhaustion") changed the reservation pool
to consider two full height splits instead of just one.

i.e. double allocbt splits of enough height to empty the AGFL pool
are sufficiently rare that it took almost 30 years for a single height
reservation to be found insufficient in practice.
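
For reference, a userspace sketch of the "two full height splits" arithmetic.
This follows the formula in xfs_alloc_min_freelist() as I recall it after
that commit, so treat the exact expression as an assumption and check it
against fs/xfs/libxfs/xfs_alloc.c. The `toy_*` names and the flat parameter
list are hypothetical; the kernel function takes the mount and perag
structures instead:

```c
#include <assert.h>
#include <stdbool.h>

/* Smaller of two unsigned values. */
unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Per-btree worst case: one split of every level that adds a new root,
 * then a second split of every level under the old root while the AGFL
 * is being refilled:
 *
 *   (height + 1) + (height - 1) = 2 * new_height - 2
 *
 * with new_height capped at the maximum btree height.
 */
unsigned int toy_btree_resv(unsigned int level, unsigned int maxlevels)
{
	assert(maxlevels > 0);
	return toy_min(level + 1, maxlevels) * 2 - 2;
}

/* Sum the per-btree reservations for bnobt, cntbt and (optionally) rmapbt. */
unsigned int toy_min_freelist(unsigned int bno_level,
			      unsigned int cnt_level,
			      unsigned int rmap_level,
			      unsigned int alloc_maxlevels,
			      unsigned int rmap_maxlevels,
			      bool has_rmapbt)
{
	unsigned int min_free;

	min_free = toy_btree_resv(bno_level, alloc_maxlevels);
	min_free += toy_btree_resv(cnt_level, alloc_maxlevels);
	if (has_rmapbt)
		min_free += toy_btree_resv(rmap_level, rmap_maxlevels);
	return min_free;
}
```

Under these assumptions, two-level bnobt and cntbt plus a three-level rmapbt
give 4 + 4 + 6 = 14 blocks, well short of the maximum heights used as caps.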

With that change, failing to split during the deferred
AGFL free would now require two full height splits on the rmap
btree + two full height splits on both the bnobt and cntbt in the
-same transaction- to exhaust the AGFL, then the very next
transaction needs to be a deferred AGFL free that requires another
bnobt or cntbt split to occur. Those conditions are not going to
occur very often by chance....

Hence my question: are you actually seeing allocbt splits on
deferred AGFL frees running out of AGFL blocks in practice, or is
this a theoretical concern?

> The normal extent-freeing path (`__xfs_free_extent`) guards against this by
> calling `xfs_free_extent_fix_freelist` first, which invokes
> `xfs_alloc_fix_freelist(..., XFS_ALLOC_FLAG_FREEING)` to ensure the AGFL holds
> at least `xfs_alloc_min_freelist(mp, pag)` blocks before touching the btrees.

Yes, because the AGFL has not been prepared for the extent freeing
operation that is about to take place. xfs_alloc_min_freelist()
makes sure the AGFL is long enough to handle two full allocbt
splits, one of which accounts for a split during AGFL refill/free
operations.

> However, `xfs_agfl_free_finish_item` does not call fix_freelist at all.

Because the AGFL size that xfs_alloc_min_freelist() prepared should
have already taken this into account. And calling fix_freelist() to
fixup the free list for freeing a block that was taken from the
freelist means we now have to consider the possibility of recursive
freeing fixup accounting problems when AGs are near ENOSPC.

> The deferred AGFL frees are created by `xfs_alloc_fix_freelist` during its
> shrink phase: blocks are removed from the AGFL immediately (pagf_flcount is
> decremented synchronously by `xfs_alloc_get_freelist`), and their actual return
> to the free space btrees is deferred via `xfs_defer_agfl_block`. After the
> shrink loop, the AGFL sits at exactly `need` blocks.

Right, so the deferred free should be the first deferred operation
in the allocation intent chain. i.e. the generated EFI will be
the first intent in the alloc chain processed on transaction commit.
It should be run before any other pending CUI, RUI or EFI operation
for that allocation operation is processed. Hence the AGFL
should still be in the same state as the initial extent free
operation left it, and hence should have sufficient remaining blocks
in the AGFL for a split during the deferred AGFL free operation.

> But before `xfs_agfl_free_finish_item` executes, `xfs_defer_finish_noroll`
> rolls the transaction. Other deferred operations that execute in between—for
> example, regular extent frees (`XFS_DEFER_OPS_TYPE_FREE`) that themselves call
> fix_freelist and may re-shrink the AGFL, or rmapbt operations—could consume
> AGFL blocks and leave the count below `need` by the time our deferred AGFL free
> runs.

Even if we run RUIs, CUIs or other EFIs before the AGFL deferred
free, they fix up the freelist for their operations first, and that
reserves space for a deferred AGFL free from the free list.

If any other extent allocation/free on that AG is run between the
two transactions (e.g. high level transactions racing on AGF buffer
access) they will also reserve space in the AGFL for a split during
a deferred free.

IOWs, the deferred frees typically run with an AGFL block
reservation guaranteed by the previous extent alloc/free operation
that was just completed. It is a rare situation where the AGFL
would not have enough blocks on it to perform allocbt splits
successfully, and to have multiple deferred AGFL frees chained
together is quite unlikely because of the order of intent
processing...

-Dave.
-- 
Dave Chinner
dgc@kernel.org

* Re: [XFS] Question: can xfs_agfl_free_finish_item exhaust AGFL during bnobt/cntbt split?
  2026-04-13  0:14 ` Dave Chinner
@ 2026-04-13  2:48   ` Jinliang Zheng
  2026-04-13 22:19     ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Jinliang Zheng @ 2026-04-13  2:48 UTC
  To: dgc; +Cc: alexjlzheng, alexjlzheng, djwong, linux-xfs

On Mon, 13 Apr 2026 10:14:45 +1000, dgc@kernel.org wrote:
> On Mon, Apr 13, 2026 at 12:32:50AM +0800, Jinliang Zheng wrote:
> > Hi,
> > 
> > I have a question about the deferred AGFL block freeing path and whether it is
> > safe from AGFL exhaustion during bnobt/cntbt btree splits.
> > 
> > When `xfs_agfl_free_finish_item` runs, it calls:
> > 
> >     xfs_agfl_free_finish_item
> >       -> xfs_free_agfl_block
> >            -> xfs_free_ag_extent   (fs/xfs/libxfs/xfs_alloc.c)
> > 
> > `xfs_free_ag_extent` directly manipulates bnobt and cntbt, and inserting a new
> > free extent record can trigger a btree split. A split will try to allocate a
> > block from the AGFL via `xfs_allocbt_alloc_block` -> `xfs_alloc_get_freelist`.
> > If the AGFL is empty at that point, `xfs_alloc_get_freelist` returns
> > NULLAGBLOCK and the split fails.
> 
> And then the filesystem will shut down because of a fatal error in a
> dirty transaction.
> 
> My first question is this: is this actually happening in practice,
> or is this a theoretical concern?

Thank you for your reply.

Just a theoretical concern. :)

> 
> We don't attempt to provide a 100% guarantee of correctness in free
> space accounting, allocation or freeing algorithms in XFS. Complexity
> is high, and there are lots of corner cases that are extremely
> difficult to reason about. The cost of providing absolute guarantees
> is ... prohibitive.
> 
> Hence I'd pretty much guarantee the logic and/or math isn't
> comprehensive or water-tight. We won't try to handle the
> really convoluted "many weird things have to align exactly right"
> corner cases, and will instead use things like reservation pools to
> handle them. The AGFL is a good example of a reservation
> pool that we make deep enough to handle real world usage without
> needing to care about the complexities of trying to guarantee
> behaviour in exceedingly rare corner cases.
> 
> For example, we changed the AGFL reservations for the first time in
> 2023 - they were introduced almost 30 years ago (1994), and had
> remained unchanged until then. i.e. commit f63a5b3769ad ("xfs: fix
> internal error from AGFL exhaustion") changed the reservation pool
> to consider two full height splits instead of just one.
> 
> i.e. double allocbt splits of enough height to empty the AGFL pool
> are sufficiently rare that it took almost 30 years for a single height
> reservation to be found insufficient in practice.
> 
> With that change, failing to split during the deferred
> AGFL free would now require two full height splits on the rmap
> btree + two full height splits on both the bnobt and cntbt in the
> -same transaction- to exhaust the AGFL, then the very next
> transaction needs to be a deferred AGFL free that requires another
> bnobt or cntbt split to occur. Those conditions are not going to
> occur very often by chance....
> 
> Hence my question: are you actually seeing allocbt splits on
> deferred AGFL frees running out of AGFL blocks in practice, or is
> this a theoretical concern?
> 
> > The normal extent-freeing path (`__xfs_free_extent`) guards against this by
> > calling `xfs_free_extent_fix_freelist` first, which invokes
> > `xfs_alloc_fix_freelist(..., XFS_ALLOC_FLAG_FREEING)` to ensure the AGFL holds
> > at least `xfs_alloc_min_freelist(mp, pag)` blocks before touching the btrees.
> 
> Yes, because the AGFL has not been prepared for the extent freeing
> operation that is about to take place. xfs_alloc_min_freelist()
> makes sure the AGFL is long enough to handle two full allocbt
> splits, one of which accounts for a split during AGFL refill/free
> operations.
> 
> > However, `xfs_agfl_free_finish_item` does not call fix_freelist at all.
> 
> Because the AGFL size that xfs_alloc_min_freelist() prepared should
> have already taken this into account. And calling fix_freelist() to
> fixup the free list for freeing a block that was taken from the
> freelist means we now have to consider the possibility of recursive
> freeing fixup accounting problems when AGs are near ENOSPC.
> 
> > The deferred AGFL frees are created by `xfs_alloc_fix_freelist` during its
> > shrink phase: blocks are removed from the AGFL immediately (pagf_flcount is
> > decremented synchronously by `xfs_alloc_get_freelist`), and their actual return
> > to the free space btrees is deferred via `xfs_defer_agfl_block`. After the
> > shrink loop, the AGFL sits at exactly `need` blocks.
> 
> Right, so the deferred free should be the first deferred operation
> in the allocation intent chain. i.e. the generated EFI will be
> the first intent in the alloc chain processed on transaction commit.
> It should be run before any other pending CUI, RUI or EFI operation
> for that allocation operation is processed. Hence the AGFL
> should still be in the same state as the initial extent free
> operation left it, and hence should have sufficient remaining blocks
> in the AGFL for a split during the deferred AGFL free operation.
> 
> > But before `xfs_agfl_free_finish_item` executes, `xfs_defer_finish_noroll`
> > rolls the transaction. Other deferred operations that execute in between—for
> > example, regular extent frees (`XFS_DEFER_OPS_TYPE_FREE`) that themselves call
> > fix_freelist and may re-shrink the AGFL, or rmapbt operations—could consume
> > AGFL blocks and leave the count below `need` by the time our deferred AGFL free
> > runs.
> 
> Even if we run RUIs, CUIs or other EFIs before the AGFL deferred
> free, they fix up the freelist for their operations first, and that
> reserves space for a deferred AGFL free from the free list.
> 
> If any other extent allocation/free on that AG is run between the
> two transactions (e.g. high level transactions racing on AGF buffer
> access) they will also reserve space in the AGFL for a split during
> a deferred free.

Thank you for the detailed explanation. I have a follow-up question about
the case where fix_freelist shrinks the AGFL by more than one block.

When `xfs_alloc_fix_freelist` shrinks an oversized AGFL, it may call
`xfs_defer_agfl_block` multiple times in a loop, adding N deferred AGFL
free items into the same `xfs_defer_pending` (up to max_items=16 per
pending entry).

In `xfs_defer_finish_noroll`, all N items within that single
`xfs_defer_pending` are processed consecutively in the same rolled
transaction via `list_for_each_safe`, with no intervening fix_freelist
call between them. Each call to `xfs_agfl_free_finish_item` invokes
`xfs_free_ag_extent`, which may trigger a full-height split of both bnobt
and cntbt, consuming up to `(bnobt_levels + 1) + (cntbt_levels + 1)`
AGFL blocks.

The initial fix_freelist leaves the AGFL at exactly `need` blocks. If
the first deferred free triggers a full-height bnobt+cntbt split and
consumes those blocks, the AGFL may already be below `need` — or even
exhausted — by the time the second deferred free runs and needs to split.
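
To put numbers on the worst case I am asking about, here is a small
arithmetic sketch. `toy_first_failing_free` is a hypothetical helper, and it
assumes (pessimistically) that every deferred free triggers a full-height
split of both btrees and that nothing refills the AGFL in between. It says
nothing about how likely that is:

```c
#include <assert.h>

/*
 * Walk 'nitems' consecutive deferred AGFL frees against a free list
 * holding 'flcount' blocks. Each free is assumed to cost a full-height
 * split of both bnobt and cntbt:
 *
 *   cost = (bno_levels + 1) + (cnt_levels + 1)
 *
 * Returns the index of the first free that cannot be serviced, or
 * 'nitems' if all of them fit.
 */
unsigned int toy_first_failing_free(unsigned int flcount,
				    unsigned int nitems,
				    unsigned int bno_levels,
				    unsigned int cnt_levels)
{
	unsigned int cost;
	unsigned int i;

	/* AG btrees have at least one level. */
	assert(bno_levels >= 1 && cnt_levels >= 1);

	cost = (bno_levels + 1) + (cnt_levels + 1);
	for (i = 0; i < nitems; i++) {
		if (flcount < cost)
			return i;	/* this free would run out of blocks */
		flcount -= cost;	/* no refill between items */
	}
	return nitems;
}
```

For example, with two-level bnobt and cntbt and 8 blocks left on the AGFL,
`toy_first_failing_free(8, 4, 2, 2)` returns 1: the first free fits, the
second does not.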

Is this scenario considered?

thanks,
Jinliang Zheng

> 
> IOWs, the deferred frees typically run with an AGFL block
> reservation guaranteed by the previous extent alloc/free operation
> that was just completed. It is a rare situation where the AGFL
> would not have enough blocks on it to perform allocbt splits
> successfully, and to have multiple deferred AGFL frees chained
> together is quite unlikely because of the order of intent
> processing...
> 
> -Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

* Re: [XFS] Question: can xfs_agfl_free_finish_item exhaust AGFL during bnobt/cntbt split?
  2026-04-13  2:48   ` Jinliang Zheng
@ 2026-04-13 22:19     ` Dave Chinner
  0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2026-04-13 22:19 UTC
  To: Jinliang Zheng; +Cc: alexjlzheng, djwong, linux-xfs

On Mon, Apr 13, 2026 at 10:48:51AM +0800, Jinliang Zheng wrote:
> On Mon, 13 Apr 2026 10:14:45 +1000, dgc@kernel.org wrote:
> > On Mon, Apr 13, 2026 at 12:32:50AM +0800, Jinliang Zheng wrote:
> > > But before `xfs_agfl_free_finish_item` executes, `xfs_defer_finish_noroll`
> > > rolls the transaction. Other deferred operations that execute in between—for
> > > example, regular extent frees (`XFS_DEFER_OPS_TYPE_FREE`) that themselves call
> > > fix_freelist and may re-shrink the AGFL, or rmapbt operations—could consume
> > > AGFL blocks and leave the count below `need` by the time our deferred AGFL free
> > > runs.
> > 
> > Even if we run RUIs, CUIs or other EFIs before the AGFL deferred
> > free, they fix up the freelist for their operations first, and that
> > reserves space for a deferred AGFL free from the free list.
> > 
> > If any other extent allocation/free on that AG is run between the
> > two transactions (e.g. high level transactions racing on AGF buffer
> > access) they will also reserve space in the AGFL for a split during
> > a deferred free.
> 
> Thank you for the detailed explanation. I have a follow-up question about
> the case where fix_freelist shrinks the AGFL by more than one block.
> 
> When `xfs_alloc_fix_freelist` shrinks an oversized AGFL, it may call
> `xfs_defer_agfl_block` multiple times in a loop, adding N deferred AGFL
> free items into the same `xfs_defer_pending` (up to max_items=16 per
> pending entry).
> 
> In `xfs_defer_finish_noroll`, all N items within that single
> `xfs_defer_pending` are processed consecutively in the same rolled
> transaction via `list_for_each_safe`, with no intervening fix_freelist
> call between them. Each call to `xfs_agfl_free_finish_item` invokes
> `xfs_free_ag_extent`, which may trigger a full-height split of both bnobt
> and cntbt, consuming up to `(bnobt_levels + 1) + (cntbt_levels + 1)`
> AGFL blocks.
>
> The initial fix_freelist leaves the AGFL at exactly `need` blocks. If
> the first deferred free triggers a full-height bnobt+cntbt split and
> consumes those blocks, the AGFL may already be below `need` — or even
> exhausted — by the time the second deferred free runs and needs to split.
> 
> Is this scenario considered?

I have considered it, yes.

Remember what I said about considering the probability of something
occurring before adding complexity to guarantee it will never
happen.

The above scenario requires both allocbts and the AGFL to be set up
in such a way that it has multiple full subtrees ready to split in
both trees, and that the same block frees (from the AGFL) will
trigger those subtree splits with the same free space record insert.

And the AGFL has to have exactly the right blocks on it *in excess
of what is required* in order to get them deferred in the right order to
then trigger these subtree splits.

And this has to happen multiple times in consecutive operations.

Can the above happen? Yes.

Will it happen by chance before the heat death of the universe? Probably.

Will it happen in production workloads?  Not very likely at all.

That's the difference between providing theoretical guarantees vs
practical guarantees.

We can keep playing "but what if we had N+1" type theoretical
scenarios endlessly here, but reality has to step in at some point.
We set reservation sizes for "unbound excursions" at a size that is
large enough for normal operation and largely ignore the cases with
10^(large negative N) probabilities.

If, in practice, we find the reservation pools are getting exhausted
in normal production workloads, then we'll either change the
behaviour that is causing the exhaustion or increase the reservation
pool size to handle that case.

-Dave.

-- 
Dave Chinner
dgc@kernel.org
