Linux XFS filesystem development
From: yebin <yebin@huaweicloud.com>
To: Dave Chinner <dgc@kernel.org>
Cc: linux-xfs@vger.kernel.org, djwong@kernel.org, hch@lst.de
Subject: Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
Date: Thu, 14 May 2026 11:16:54 +0800
Message-ID: <6A053EA6.1040707@huaweicloud.com>
In-Reply-To: <agRuByZS15BgUrGX@dread>



On 2026/5/13 20:26, Dave Chinner wrote:
> On Wed, May 13, 2026 at 05:33:44PM +0800, yebin wrote:
>>
>>
>> On 2026/5/13 6:52, Dave Chinner wrote:
>>> On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
>>>> Hello Darrick and all,
>>>>
>>>> Recently, I encountered a problem where a BUG was triggered in the write-back process.
>>>> The detailed problem information is as follows:
>>>> ```
>>>> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
>>>> XFS (sde): Please unmount the filesystem and rectify the problem(s)
>>>> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
>>>> ------------[ cut here ]------------
>>>> kernel BUG at fs/xfs/xfs_message.c:102!
>>>> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
>>>> RIP: 0010:assfail+0x9f/0xb0
>>>
>>> What kernel? You've stripped that line out of the stack dump.
>>
>> The initial issue appeared on the v5.10 kernel and occurred multiple times.
>> The current stack is a reproduction I made on linux-next based on the
>> cc13002a9f98 tag: next-20260402.
>
> So why strip it out of the debug output? It doesn't encourage people
> to look at the problem when things like this have been obviously
> stripped from the output.
>
> ....
>

That was my mistake. I accidentally deleted the version line while
trimming unimportant details from the log.

>>> So your test code is creating a number of fragmented extents to get
>>> to the edge of btree format conversion, then doing a delalloc
>>> write() to create a long delalloc extent range, then is alternating
>>> along the range of the delalloc extent doing:
>>>
>>> loop:
>>> 	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
>>> 	fallocate(PUNCH_HOLE, offset, 4096)
>>> 	offset += 4096;
>>>
>>> And so it is converting a single block at the left edge of the
>>> delalloc extent to a written extent which triggers an extent -> btree
>>> conversion, and then you punch out the newly created written extent
>>> triggering a btree -> extent conversion.
>>>
>>> And each time you do this it removes a reserved block from the
>>> delalloc extent for the btree root block, yes?
>>>
>>
>> Yes, this is just a process I designed to facilitate the construction
>> of this problem. In another case, during the delay extent conversion,
>> B-tree splitting continuously consumes reserved blocks.
>
> The worst_indlen calculation should be taking blocks needed for
> BMBT tree splits into account.
>
> It expects to consume (extent len / BMBT records per block) leaf
> blocks for the delalloc extent. It then walks back up the bmbt tree,
> calculating how many node blocks will be needed to index all those
> leaf blocks.  IOWs, it reserves all the node blocks it will need for
> splits to index the growing number of leaf blocks.
>
> i.e. by calculating the number of BMBT blocks required to index the
> delalloc extent being converted into individual single block
> extents, it should have taken into account all the blocks needed for
> all the BMBT splits needed to index the range.
>
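
A minimal userspace sketch of the calculation described above, not the
kernel source. The name worst_indlen_model and the parameters leaf_maxrecs,
node_maxrecs and maxlevels are hypothetical stand-ins for the per-mount BMBT
geometry, and the early return for a single-block level is an assumption
about how xfs_bmap_worst_indlen() accounts for the remaining levels.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the worst-case indirect block estimate for a delalloc extent
 * of 'len' blocks. All geometry parameters are hypothetical inputs.
 */
uint64_t worst_indlen_model(uint64_t len, uint64_t leaf_maxrecs,
                            uint64_t node_maxrecs, int maxlevels)
{
    uint64_t maxrecs = leaf_maxrecs;
    uint64_t total = 0;

    /* A zero record count would divide by zero below. */
    assert(leaf_maxrecs > 0 && node_maxrecs > 0);

    for (int level = 0; level < maxlevels; level++) {
        /* Blocks needed at this level to index 'len' items below it. */
        len = (len + maxrecs - 1) / maxrecs;
        total += len;

        /* One block here: assume one more block per remaining level. */
        if (len == 1)
            return total + (uint64_t)(maxlevels - level - 1);

        /* Levels above the leaves hold node records. */
        if (level == 0)
            maxrecs = node_maxrecs;
    }
    return total;
}
```

With purely illustrative inputs of 10 blocks, 4 records per block at both
levels and a maximum height of 3, the model returns 5: three leaf blocks,
one node block, and one block for the remaining level.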

I agree with your point, but as I mentioned earlier, xfs_bmap_worst_indlen()
calculates the worst-case number of additional blocks needed to convert a
delalloc extent in a single pass. If the conversion is instead broken into
multiple partial conversions, each one may trigger its own btree split.
For example, take a delalloc extent of 10 blocks with 5 blocks reserved,
converted from the left edge one block at a time, where every conversion
triggers a btree split. Assuming each split consumes one block, converting
the whole 10-block extent consumes 10 additional blocks, but only 5 were
reserved. The calculation of 'da_new' does not account for this exhaustion
scenario:

da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                          startblockval(PREV.br_startblock) -
                          (bma->cur ? bma->cur->bc_bmap.allocated : 0));

By the time the next partial conversion runs, the state of the btree/extent
list may have changed. To be rigorous, 'da_new' should in theory be kept at
the value returned by xfs_bmap_worst_indlen().
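
The 10-block example can be checked with a toy model of the clamp above.
This is a sketch under stated assumptions, not kernel code: da_new_model and
blocks_left_at_exhaustion are hypothetical names, every partial conversion
is assumed to consume split_cost blocks, and the worst-case estimate is held
constant at 'worst' instead of being recomputed from the remaining length.

```c
#include <stdint.h>

/*
 * The da_new clamp reduced to plain integers: the new reservation is the
 * smaller of the worst-case estimate and what is left of the old
 * reservation after this conversion's allocations. The underflow guard
 * is an addition for the model only.
 */
uint64_t da_new_model(uint64_t worst, uint64_t da_old, uint64_t allocated)
{
    uint64_t left = da_old > allocated ? da_old - allocated : 0;

    return worst < left ? worst : left;
}

/*
 * Convert a delalloc extent one block at a time from the left edge.
 * Returns the number of data blocks still unconverted when the
 * reservation can no longer fund a split, or 0 if it never runs out.
 */
uint64_t blocks_left_at_exhaustion(uint64_t extent_len, uint64_t reserved,
                                   uint64_t split_cost, uint64_t worst)
{
    uint64_t indlen = reserved;

    for (uint64_t remaining = extent_len; remaining > 0; remaining--) {
        /* Not enough reserved blocks left for this conversion. */
        if (indlen < split_cost)
            return remaining;

        /* The reservation can only stay level or shrink from here. */
        indlen = da_new_model(worst, indlen, split_cost);
    }
    return 0;
}
```

For the numbers in the example, blocks_left_at_exhaustion(10, 5, 1, 5)
returns 5: the reservation reaches zero with five data blocks still
unconverted.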

The original algorithm assumes that all changes to the tree come from the
continuous conversion of a single delalloc extent. In a production
environment, however, a file may be used in segments, so tree changes come
from more than one source.

>> Essentially,
>> this is because the delay extent conversion process is broken down,
>> which may cause the reserved blocks to be exhausted.
>
> As per above, fragmentation by itself shouldn't cause the indlen
> reservation to be exhausted.
>

Yes, filesystem fragmentation is only a trigger.

>> I think that the scenario of conversion between extents and B-trees
>> may be that the unwritten extents are converted to written extents
>> after the writeback is complete and then the extents are combined,
>> causing the B-tree to be converted to an extent.
>
> Yes, I can see how that could occur - it would need contiguous
> physical extent allocation to keep the number of extents in the file
> at the threshold where:
>
> writeback submission
> -> delalloc
>    -> left contiguous unwritten allocation
>      -> nextents++
>        -> extent_to_btree
>
> IO completion
> -> unwritten conversion
>    -> left merge with written extent
>       -> nextents--
>          -> btree_to_extents
>
> But here's the thing: the extent_to_btree conversion does not
> account blocks allocated to indlen blocks stored in the delalloc
> extent. Yes, it uses blocks that were accounted to the superblock
> as reserved delalloc blocks but the btree root block allocation only
> gets accounted to the superblock and not to the new indlen in the
> remaining delalloc extent.
>
> Hence the data fork can bounce back and forth between extents and
> btree forms across allocation and conversion without having any
> impact on the indlen held in the delalloc extent that is slowly
> being allocated and written.
>
> The problem you are seeing is that indlen is being exhausted
> by something, and that results in passing wasdel = false to the
> extents_to_btree() conversion without a block reservation. We don't
> yet have a plausible explanation of why indlen is being exhausted in
> the first place - it's not format conversion, and it's not "btree
> splits", so how are we getting to indlen = 0 and triggering this
> issue?
>
> e.g. How much of the delalloc extent remains unallocated when da_old
> reaches zero? Is this an off-by-one corner case of having allocated
> the entire delalloc range and so having consumed all the indlen at
> the same time the last allocation needs to convert the data fork to
> btree format?
>
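
The submission/completion cycle above can be sketched as a toy state
machine. Everything here is hypothetical (struct fork_model, max_inline,
the function names); it only models the extent count crossing the format
threshold and says nothing about block accounting.

```c
#include <stdbool.h>

/* Toy model of a data fork that switches format at a record threshold. */
struct fork_model {
    unsigned int nextents;    /* extent records in the data fork */
    unsigned int max_inline;  /* hypothetical: records that fit in extents format */
    bool btree;               /* true while the fork is in btree format */
    unsigned int conversions; /* number of format changes seen so far */
};

/* Switch format whenever the record count crosses the threshold. */
static void fork_update_format(struct fork_model *f)
{
    bool want_btree = f->nextents > f->max_inline;

    if (want_btree != f->btree) {
        f->btree = want_btree;
        f->conversions++;
    }
}

/* Writeback submission: a new unwritten extent is added (nextents++). */
void writeback_submit(struct fork_model *f)
{
    f->nextents++;
    fork_update_format(f);
}

/* IO completion: the converted extent merges left (nextents--). */
void io_complete_merge(struct fork_model *f)
{
    if (f->nextents > 0)
        f->nextents--;
    fork_update_format(f);
}
```

Starting at the threshold, one writeback_submit() followed by one
io_complete_merge() flips the format twice and leaves nextents unchanged.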

I think the example in the second paragraph of my response explains how
indlen is gradually consumed until it reaches 0.

>> This scenario may
>> be triggered by normal service operations. In any case, file system
>> fragmentation is the cause of this problem.
>
> I've not seen any evidence that supports this conclusion yet.
>

What I mean is that file system fragmentation is a contributing factor
to the problem, not the root cause.

>>> How realistic is this scenario in an application/production
>>> environment? I mean, nobody walks through a file syncing data to
>>> disk one fragmented extent at a time only to immediately remove it
>>> before writing the next block.
>>>
>>> We've known that this is possible for a very long time. I've
>>> personally known it can happen in carefully constructed test code
>>> for over 20 years, but I can count on one hand the number of times
>>> I've actually seen this exhaustion occur in a production system.
>>>
>>> The reservation we use here is essentially unchanged since it was
>>> first introduced in 1994, so time in use tells us that the
>>> reservation is largely sufficient for production systems. Can you
>>> describe the situation where your production systems are hitting
>>> this? What is the application actually doing to trigger this
>>> problem?
>>>
>>
>> The extent reduction is not only triggered in the punch hole scenario.
>> In all scenarios where extent merging is triggered, the conversion
>> from B-tree to extent may be performed.
>
> Yes, I know. But in writeback scenarios, only unwritten extent
> conversion can cause merges, and that only happens when we have
> contiguous allocations over the delalloc range.
>
> IOWs, it can't happen when the filesystem is fragmented, as it
> requires repeated contiguous allocation to enable left merging.
> Hence large, uncontested free spaces are required to trigger the
> fork format conversion cycling behaviour, but this is irrelevant
> because I don't think the format cycling is the cause of indlen
> exhaustion....
>

As I said from the beginning, format conversion is just one possible
scenario. I designed that path only to reproduce the issue in a
controlled way. The current algorithm does not take into account that
btree changes are not limited to those caused by a single delalloc
extent. In such cases, if a delalloc extent is converted in multiple
partial steps, the reserved indlen may be exhausted.
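
For reference, the reproduction loop summarised earlier in the thread can
be sketched in userspace C as follows. This is a minimal sketch, not the
actual test program: the function name sync_punch_walk is hypothetical, the
4096-byte step is taken from that loop, and FALLOC_FL_KEEP_SIZE is added
because punching a hole requires it alongside FALLOC_FL_PUNCH_HOLE.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Walk nblocks 4096-byte blocks starting at 'start': push each block to
 * writeback, then punch it out again. Returns 0 on success, or -1 with
 * errno set by the first failing call.
 *
 * Assumes the caller has already fragmented the file and dirtied the
 * range with a buffered write() so that it is one long delalloc extent.
 */
int sync_punch_walk(int fd, off_t start, unsigned int nblocks)
{
    const off_t blksz = 4096;
    off_t offset = start;

    for (unsigned int i = 0; i < nblocks; i++, offset += blksz) {
        /* Start writeback of a single block at the left edge. */
        if (sync_file_range(fd, offset, blksz, SYNC_FILE_RANGE_WRITE) < 0)
            return -1;

        /* Remove the block that was just handed to writeback. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, blksz) < 0)
            return -1;
    }
    return 0;
}
```

The setup steps (fragmenting the file up to the edge of btree format
conversion and issuing the delalloc write) are left out on purpose, since
they depend on the filesystem geometry.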

>>> Which begs the question: we fixed some issues with this code back in
>>> 2024, so does this problem still occur on TOT kernels? e.g. commit
>>> d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
>>> conversions") should help address indlen block consumption for
>>> repeated partial conversions.
>>>
>>
>> This patch can solve the issue where the required extra space may not
>> be reserved in time, leading to a writeback failure. However, it cannot
>> address the problem caused by the continuous consumption of the reserved
>> space.
>
> OK, but therein lies the issue: what is the mechanism that causes
> the excessive consumption of the indlen blocks? Is the calculation
> wrong, does it leak blocks when we split the delalloc extent, or
> something else?
>

I think the core of the problem lies in the following code logic.
```
xfs_bmap_add_extent_delay_real
      da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                               startblockval(PREV.br_startblock) -
                               (bma->cur ? bma->cur->bc_bmap.allocated : 0));
```
As I mentioned before, the B-tree changes are not caused solely by the
continuous conversion of one delalloc extent. A particular delalloc extent
that is partially converted each time may happen to sit at the critical
point for a B-tree split on every conversion. That unlucky extent keeps
giving up its reserved space until its indlen finally reaches 0.
I call fragmentation a contributing cause because in fragmented scenarios
extent merging becomes less likely and B-tree splitting becomes more
frequent.

> -Dave.
>


Thread overview: 6+ messages
2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
2026-05-12 17:19 ` Darrick J. Wong
2026-05-12 22:52 ` Dave Chinner
2026-05-13  9:33   ` yebin
2026-05-13 12:26     ` Dave Chinner
2026-05-14  3:16       ` yebin [this message]
