From: "Darrick J. Wong" <djwong@kernel.org>
To: yebin <yebin@huaweicloud.com>
Cc: linux-xfs@vger.kernel.org, hch@lst.de, dgc@kernel.org
Subject: Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
Date: Tue, 12 May 2026 10:19:31 -0700
Message-ID: <20260512171931.GN9555@frogsfrogsfrogs>
In-Reply-To: <6A031038.9030708@huaweicloud.com>
On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> Hello Darrick and all,
>
> I recently hit a BUG that was triggered in the writeback path.
> The details are as follows:
> ```
> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351). Shutting.
> XFS (sde): Please unmount the filesystem and rectify the problem(s)
> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:102!
> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> RIP: 0010:assfail+0x9f/0xb0
> Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
> RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
> RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
> RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
> R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
> FS: 00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
> Call Trace:
> <TASK>
> xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
> __xfs_trans_commit+0x38b/0xe00
> xfs_trans_commit+0xeb/0x1a0
> xfs_bmapi_convert_one_delalloc+0xbca/0x1270
> xfs_bmapi_convert_delalloc+0x101/0x350
> xfs_writeback_range+0x76c/0x12d0
> iomap_writeback_folio+0x9ed/0x2100
> iomap_writepages+0x13c/0x2a0
> xfs_vm_writepages+0x278/0x330
> do_writepages+0x247/0x5c0
> filemap_writeback+0x22c/0x2e0
> xfs_file_release+0x442/0x580
> __fput+0x407/0xb50
> fput_close_sync+0x114/0x210
> __x64_sys_close+0x94/0x120
> do_syscall_64+0xc4/0xf80
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ```
>
> After analyzing the above, I believe the sequence that triggers it
> is as follows:
> ```
> xfs_bmapi_convert_delalloc
>   xfs_bmapi_convert_one_delalloc
>     xfs_bmapi_allocate
>       xfs_bmap_add_extent_delay_real
>         da_old = startblockval(PREV.br_startblock); // da_old = 5
>         case BMAP_LEFT_FILLING:
>           ifp->if_nextents++; // 21 + 1 = 22
>           if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
>             xfs_bmap_extents_to_btree // convert to btree
>               cur->bc_ino.allocated++;
>           da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
>                       startblockval(PREV.br_startblock) -
>                       (bma->cur ? bma->cur->bc_ino.allocated : 0)); // da_new = 5 - 1 = 4
>           PREV.br_startblock = nullstartblock(da_new); // xfs_bmapi_convert_one_delalloc() returns
>
> xfs_bmap_del_extent_real
>   case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
>     ifp->if_nextents--; // 22 - 1 = 21
>     if (xfs_bmap_needs_btree(ip, whichfork))
>       xfs_bmap_extents_to_btree
>     else
>       xfs_bmap_btree_to_extents // convert to extents
> ... // Alternate a few times in the middle.
> da_old = 4
> da_old = 3
> da_old = 2
> da_old = 1
> ...
> xfs_bmapi_convert_delalloc
>   xfs_bmapi_convert_one_delalloc
>     error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp); // Both blocks and rtextents are 0
>       tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
>       error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>         if (blocks > 0)
>           error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>           tp->t_blk_res += blocks; // The value of blocks is 0, so the value of tp->t_blk_res is 0
>     xfs_bmapi_allocate
>       xfs_bmap_add_extent_delay_real
>         da_old = startblockval(PREV.br_startblock); // da_old = 0
>         case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING: // The current delay extent is just exhausted.
>           ifp->if_nextents++; // 21 + 1 = 22
>
>           if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
>             error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork); // Converted to btree. da_old > 0 is false.
>               args.wasdel = wasdel; // wasdel is false
>               error = xfs_alloc_vextent(&args);
>                 xfs_alloc_ag_vextent(args, 0)
>                   xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
>                     case XFS_AG_RESV_NONE:
>                       field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS : XFS_TRANS_SB_FDBLOCKS; // args->wasdel == false
>                       xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
>                         case XFS_TRANS_SB_FDBLOCKS:
>                           if (delta < 0)
>                             tp->t_blk_res_used += (uint)-delta;
>                             if (tp->t_blk_res_used > tp->t_blk_res) // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
>                               xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> ```
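
The accounting that trips at the end of the trace above can be modeled in
userspace. This is only a minimal sketch, not the kernel code: struct
model_trans, model_alloc_extent() and model_commit_assert_holds() are
hypothetical names, and the model keeps just the fields quoted in the trace
and in the failed assertion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few transaction fields involved. */
struct model_trans {
	uint32_t blk_res;            /* blocks reserved when the transaction was allocated */
	uint32_t blk_res_used;       /* blocks charged against that reservation */
	int64_t  fdblocks_delta;     /* pending free-block change (not delalloc) */
	int64_t  res_fdblocks_delta; /* pending change for blocks reserved by delalloc */
	bool     shut_down;          /* stands in for xfs_force_shutdown() */
};

/*
 * Charge an allocation of 'len' blocks to the transaction.
 * wasdel == true:  assumed to be covered by the delalloc reservation,
 *                  so the transaction's block reservation is not charged.
 * wasdel == false: charged to blk_res; an overrun is flagged the way
 *                  the quoted xfs_trans_mod_sb() check flags it.
 */
static void model_alloc_extent(struct model_trans *tp, uint32_t len, bool wasdel)
{
	if (wasdel) {
		tp->res_fdblocks_delta -= (int64_t)len;
		return;
	}
	tp->blk_res_used += len;
	if (tp->blk_res_used > tp->blk_res)
		tp->shut_down = true;
	tp->fdblocks_delta -= (int64_t)len;
}

/* Mirrors the quoted assertion: t_blk_res || t_fdblocks_delta >= 0. */
static bool model_commit_assert_holds(const struct model_trans *tp)
{
	return tp->blk_res != 0 || tp->fdblocks_delta >= 0;
}
```

With a zeroed model_trans (blk_res == 0), charging one block with
wasdel == false sets shut_down and makes the commit-time condition false,
which matches the two messages in the report. The same call with
wasdel == true leaves both untouched.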
>
> I constructed the sequence above deliberately, to make the problem easier
> to reproduce. Besides the scenario where XFS_DINODE_FMT_BTREE and
> XFS_DINODE_FMT_EXTENTS are converted back and forth, btree splitting can
> trigger it as well.
>
> The root cause is that, in xfs_bmapi_convert_delalloc(), the call to
> xfs_bmap_worst_indlen() calculates the worst-case number of reserved
> blocks, i.e. the number of additional blocks required to convert the
> entire delayed extent in one step. It assumes that the whole conversion
> is atomic, but the current code cannot guarantee that. On a fragmented
> filesystem, the most extreme case is that every single-block conversion
> triggers a full btree split, and then the reserved blocks are far from
> sufficient. The filesystem in the environment where this was triggered
> was indeed severely fragmented.
>
> Further analysis shows that, because the reserved blocks are continuously
> consumed, usage may eventually exceed the reserved amount. When free space
> is nearly exhausted, xfs_bmap_extents_to_btree() may then fail to allocate
> a block and trigger a warning. That failed allocation can in turn cause
> problems for normal block allocation.
>
> Additionally, in xfs_bmap_add_extent_delay_real(), if a delayed extent is split
> into two, xfs_bmap_worst_indlen() is recalculated to reserve blocks. In the case
> of nearly exhausted space, it may be impossible to reserve the newly required
> blocks, leading to a writeback failure.
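
Why a split can need more reserved blocks than the original extent held
shows up in a userspace sketch of the worst-case calculation. This is
modeled on my reading of xfs_bmap_worst_indlen() and should be checked
against the tree. The function name and the parameters (records per leaf,
records per node, maximum btree levels) are stand-ins for the per-mount
values.

```c
#include <stdint.h>

/*
 * Worst-case number of bmbt blocks needed to map 'len' delalloc blocks,
 * assuming every block ends up as its own extent record.
 */
static uint64_t model_worst_indlen(uint64_t len, uint32_t leaf_maxrecs,
				   uint32_t node_maxrecs, uint32_t maxlevels)
{
	uint32_t maxrecs = leaf_maxrecs;
	uint64_t rval = 0;

	for (uint32_t level = 0; level < maxlevels; level++) {
		/* btree blocks needed at this level, rounded up */
		len = (len + maxrecs - 1) / maxrecs;
		rval += len;
		if (len == 1)
			/* one block for each remaining level above */
			return rval + maxlevels - level - 1;
		if (level == 0)
			maxrecs = node_maxrecs;
	}
	return rval;
}
```

With made-up parameters (4 records per block, 3 levels), a 10-block extent
reserves 5 blocks, while two 5-block halves need 4 each, 8 in total. The
3 extra blocks have to come from somewhere at split time.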
>
> Reserving more blocks up front to cover the worst case would tie up a lot
> of extra space, which is not very practical. I was thinking that we could
> instead convert all the delayed extents at once to ensure atomicity, which
> would avoid both of the issues analyzed above. However, I am not sure what
> negative impact this approach might have. The only one I can think of is
> that the reserved space would be repeatedly allocated and released, but I
> believe the current logic already has similar situations.
>
> I haven't thought of a better solution at the moment. I wonder if anyone has any
> good ideas?
I haven't. With EFI-based space freeing in transactions, as a
theoretical last resort you could steal a block from an EFI instead of
scanning the bnobt/cntbt. This would mitigate the nastiness of repeated
split/merge cycles but you'd have to be careful about block reuse.
--D
>
> Thanks,
> Ye Bin
>
>