* [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
@ 2026-05-12 11:34 yebin
2026-05-12 17:19 ` Darrick J. Wong
2026-05-12 22:52 ` Dave Chinner
0 siblings, 2 replies; 5+ messages in thread
From: yebin @ 2026-05-12 11:34 UTC
To: linux-xfs, djwong, hch, dgc
Hello Darrick and all,
Recently, I encountered a problem where a BUG was triggered in the write-back process.
The detailed problem information is as follows:
```
XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351). Shutting.
XFS (sde): Please unmount the filesystem and rectify the problem(s)
XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
RIP: 0010:assfail+0x9f/0xb0
Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
FS: 00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
Call Trace:
<TASK>
xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
__xfs_trans_commit+0x38b/0xe00
xfs_trans_commit+0xeb/0x1a0
xfs_bmapi_convert_one_delalloc+0xbca/0x1270
xfs_bmapi_convert_delalloc+0x101/0x350
xfs_writeback_range+0x76c/0x12d0
iomap_writeback_folio+0x9ed/0x2100
iomap_writepages+0x13c/0x2a0
xfs_vm_writepages+0x278/0x330
do_writepages+0x247/0x5c0
filemap_writeback+0x22c/0x2e0
xfs_file_release+0x442/0x580
__fput+0x407/0xb50
fput_close_sync+0x114/0x210
__x64_sys_close+0x94/0x120
do_syscall_64+0xc4/0xf80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
```
After analyzing the issue, I believe the likely trigger sequence
is as follows:
```
xfs_bmapi_convert_delalloc
xfs_bmapi_convert_one_delalloc
xfs_bmapi_allocate
xfs_bmap_add_extent_delay_real
da_old = startblockval(PREV.br_startblock); // da_old = 5
case BMAP_LEFT_FILLING:
ifp->if_nextents++; // 21 + 1 = 22
if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
xfs_bmap_extents_to_btree // convert to btree
cur->bc_ino.allocated++;
da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
startblockval(PREV.br_startblock) -
(bma->cur ? bma->cur->bc_ino.allocated : 0)); // da_new = 5 - 1 = 4
PREV.br_startblock = nullstartblock(da_new); //xfs_bmapi_convert_one_delalloc() return
xfs_bmap_del_extent_real
case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
ifp->if_nextents--; // 22 - 1 = 21
if (xfs_bmap_needs_btree(ip, whichfork))
xfs_bmap_extents_to_btree
else
xfs_bmap_btree_to_extents // convert to extents
... // Alternate a few times in the middle.
da_old = 4
da_old = 3
da_old = 2
da_old = 1
...
xfs_bmapi_convert_delalloc
xfs_bmapi_convert_one_delalloc
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp); // Both blocks and rtextents are 0
tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
error = xfs_trans_reserve(tp, resp, blocks, rtextents);
if (blocks > 0)
error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
tp->t_blk_res += blocks; // The value of blocks is 0, so the value of tp->t_blk_res is 0
xfs_bmapi_allocate
xfs_bmap_add_extent_delay_real
da_old = startblockval(PREV.br_startblock); // da_old = 0
case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING: // The current delay extent is just exhausted.
ifp->if_nextents++; // 21 + 1 = 22
if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork); // Converted to btree. da_old > 0 is false.
args.wasdel = wasdel; // wasdel is false
error = xfs_alloc_vextent(&args);
xfs_alloc_ag_vextent(args, 0)
xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
case XFS_AG_RESV_NONE:
field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS : XFS_TRANS_SB_FDBLOCKS; //args->wasdel == false
xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
case XFS_TRANS_SB_FDBLOCKS:
if (delta < 0)
tp->t_blk_res_used += (uint)-delta;
if (tp->t_blk_res_used > tp->t_blk_res) // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
```
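The depletion described in the trace above can be modelled in user space. The following is only a simplified sketch of that sequence as the trace describes it, not kernel code: `struct trans_model`, `indlen_after_conversion()`, `trans_charge()` and `cycles_until_empty()` are hypothetical names, and the clamp to zero is an assumption added to keep the model well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the transaction block accounting fields. */
struct trans_model {
	uint32_t blk_res;      /* blocks reserved when the transaction was allocated */
	uint32_t blk_res_used; /* blocks charged against that reservation so far */
};

/*
 * Reservation left on the delalloc extent after one partial conversion:
 * min(worst-case estimate, old reservation minus blocks just allocated).
 * Clamping at zero is an assumption of this model.
 */
uint64_t indlen_after_conversion(uint64_t da_old, uint64_t worst,
				 uint64_t allocated)
{
	uint64_t left = da_old > allocated ? da_old - allocated : 0;

	return worst < left ? worst : left;
}

/*
 * Charge blocks to the transaction. Returns false when usage exceeds the
 * reservation, which is the condition that leads to the shutdown in the trace.
 */
bool trans_charge(struct trans_model *tp, uint32_t blocks)
{
	assert(tp != NULL);
	tp->blk_res_used += blocks;
	return tp->blk_res_used <= tp->blk_res;
}

/* Count convert/punch cycles, one block consumed per cycle, until empty. */
unsigned cycles_until_empty(uint64_t da_old, uint64_t worst)
{
	unsigned cycles = 0;

	while (da_old > 0) {
		da_old = indlen_after_conversion(da_old, worst, 1);
		cycles++;
	}
	return cycles;
}
```

In this model, `cycles_until_empty(5, 5)` returns 5, and `trans_charge()` on a transaction whose `blk_res` is 0 reports an overrun as soon as a single block is charged, which corresponds to the last step of the trace.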
I constructed the trigger sequence above deliberately, to make the problem easy
to reproduce. Besides the scenario where the fork is converted back and forth
between XFS_DINODE_FMT_BTREE and XFS_DINODE_FMT_EXTENTS, btree splitting can
consume the reserved blocks in the same way.
The core reason for the issue is that in xfs_bmapi_convert_delalloc(), the
worst-case reservation calculated by xfs_bmap_worst_indlen() is the number of
additional blocks required after a complete conversion of the entire delalloc
extent. In other words, it assumes that the entire conversion is atomic, but the
current process cannot guarantee such atomicity.
On a fragmented filesystem, the most extreme scenario is that every
single-block conversion triggers a full btree split, in which case the reserved
blocks are far from sufficient. The filesystem in the environment where this
issue was triggered is indeed severely fragmented.
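To make the worst-case reservation concrete, here is a minimal user-space sketch of such an estimate: count the leaf blocks needed to index the extent, then the node blocks needed to index those, level by level. It is an illustration under stated assumptions rather than the kernel implementation: `worst_indlen_sketch()` is a hypothetical name, and `leaf_recs`, `node_recs` and `max_levels` are made-up parameters standing in for the filesystem geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a worst-case indirect block estimate for a delalloc extent of
 * `len` blocks. All three geometry parameters are hypothetical inputs.
 */
uint64_t worst_indlen_sketch(uint64_t len, unsigned leaf_recs,
			     unsigned node_recs, unsigned max_levels)
{
	uint64_t total = 0;
	unsigned maxrecs = leaf_recs;

	assert(leaf_recs > 0 && node_recs > 0);

	for (unsigned level = 0; level < max_levels; level++) {
		/* Blocks needed at this level to index the level below, rounded up. */
		len = (len + maxrecs - 1) / maxrecs;
		total += len;

		/* A single block here means one more block per remaining level. */
		if (len == 1)
			return total + max_levels - level - 1;

		/* Levels above the leaves hold node records. */
		if (level == 0)
			maxrecs = node_recs;
	}
	return total;
}
```

For example, `worst_indlen_sketch(1000, 125, 125, 5)` evaluates to 12: 8 leaf blocks, 1 node block above them, and 3 more for the remaining levels. The estimate covers a one-shot conversion of the whole extent; it says nothing about blocks consumed and freed again across repeated partial conversions.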
Further analysis of this abnormal model shows that because the reserved blocks
are continuously consumed, they may eventually exceed the reserved amount. When
the space is nearly exhausted, xfs_bmap_extents_to_btree() may fail to allocate
blocks, triggering a warning. This failure to allocate additional blocks can lead
to issues with normal block allocation.
Additionally, in xfs_bmap_add_extent_delay_real(), if a delayed extent is split
into two, xfs_bmap_worst_indlen() is recalculated to reserve blocks. In the case
of nearly exhausted space, it may be impossible to reserve the newly required
blocks, leading to a writeback failure.
Reserving more blocks up front to cover this worst case would tie up a lot of
extra space, which is not very practical.
One option I considered is converting all the delalloc extents at once to ensure
atomicity, which would avoid both of the issues analyzed above.
However, I am not sure what negative impacts this approach might have. The only
one I can think of is that the reserved space would be repeatedly allocated and
released, but I believe the current logic already behaves in a similar way.
I haven't thought of a better solution at the moment. I wonder if anyone has any
good ideas?
Thanks,
Ye Bin
* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
@ 2026-05-12 17:19 ` Darrick J. Wong
2026-05-12 22:52 ` Dave Chinner
1 sibling, 0 replies; 5+ messages in thread
From: Darrick J. Wong @ 2026-05-12 17:19 UTC
To: yebin; +Cc: linux-xfs, hch, dgc
On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> [...]
> I haven't thought of a better solution at the moment. I wonder if anyone has any
> good ideas?

I haven't. With EFI-based space freeing in transactions, as a theoretical
last resort you could steal a block from an EFI instead of scanning the
bnobt/cntbt. This would mitigate the nastiness of repeated split/merge
cycles but you'd have to be careful about block reuse.

--D

>
> Thanks,
> Ye Bin
* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
2026-05-12 17:19 ` Darrick J. Wong
@ 2026-05-12 22:52 ` Dave Chinner
2026-05-13 9:33 ` yebin
1 sibling, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2026-05-12 22:52 UTC
To: yebin; +Cc: linux-xfs, djwong, hch
On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> [...]
> kernel BUG at fs/xfs/xfs_message.c:102!
> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> RIP: 0010:assfail+0x9f/0xb0

What kernel? You've stripped that line out of the stack dump.

> [...]
> xfs_bmap_del_extent_real
> case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
> ifp->if_nextents--; // 22 - 1 = 21
> if (xfs_bmap_needs_btree(ip, whichfork))
> xfs_bmap_extents_to_btree
> else
> xfs_bmap_btree_to_extents // convert to extents

So your test code is creating a number of fragmented extents to get
to the edge of btree format conversion, then doing a delalloc
write() to create a long delalloc extent range, then is alternating
along the range of the delalloc extent doing:

loop:
	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
	fallocate(PUNCH_HOLE, offset, 4096)
	offset += 4096;

And so it is converting a single block at the left edge of the
delalloc extent to a written extent which triggers an extent -> btree
conversion, and then you punch out the newly created written extent
triggering a btree -> extent conversion.

And each time you do this it removes a reserved block from the
delalloc extent for the btree root block, yes?

How realistic is this scenario in an application/production
environment? I mean, nobody walks through a file syncing data to
disk one fragmented extent at a time only to immediately remove it
before writing the next block.

We've known that this is possible for a very long time. I've
personally known it can happen in carefully constructed test code
for over 20 years, but I can count on one hand the number of times
I've actually seen this exhaustion occur in a production system.

The reservation we use here is essentially unchanged since it was
first introduced in 1994, so time in use tells us that the
reservation is largely sufficient for production systems. Can you
describe the situation where your production systems are hitting
this? What is the application actually doing to trigger this
problem?

> ... // Alternate a few times in the middle.
> [...]
> if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
> error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork); // Converted to btree. da_old > 0 is false.
> args.wasdel = wasdel; // wasdel is false
> error = xfs_alloc_vextent(&args);
> xfs_alloc_ag_vextent(args, 0)

Ok, that's why you stripped the kernel ID out of the stack dump -
this analysis is from a vendor kernel of some kind. i.e.
xfs_alloc_ag_vextent() went away in 2023...

Which begs the question: we fixed some issues with this code back in
2024, so does this problem still occur on TOT kernels? e.g. commit
d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
conversions") should help address indlen block consumption for
repeated partial conversions.

> Further analysis of this abnormal model shows that because the reserved blocks
> are continuously consumed, they may eventually exceed the reserved amount. When
> the space is nearly exhausted, xfs_bmap_extents_to_btree() may fail to allocate
> blocks, triggering a warning. This failure to allocate additional blocks can lead
> to issues with normal block allocation.

The TOT code should be recalculating the required indlen for the
remaining delalloc extent and accounting for indlen block usage
where it gets depleted. Hence the gradual reduction of the indlen
over repeated left edge conversion and removal triggering repeated
indlen block consumption should no longer be a problem.

If it is a problem, then we need to make sure we account for it
correctly, similar to the fix in the above commit and the series it
was part of.

-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
2026-05-12 22:52 ` Dave Chinner
@ 2026-05-13 9:33 ` yebin
2026-05-13 12:26 ` Dave Chinner
0 siblings, 1 reply; 5+ messages in thread
From: yebin @ 2026-05-13 9:33 UTC
To: Dave Chinner; +Cc: linux-xfs, djwong, hch
On 2026/5/13 6:52, Dave Chinner wrote:
> On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
>> [...]
>> kernel BUG at fs/xfs/xfs_message.c:102!
>> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
>> RIP: 0010:assfail+0x9f/0xb0
>
> What kernel? You've stripped that line out of the stack dump.

The initial issue appeared on the v5.10 kernel and occurred multiple times.
The current stack is a reproduction I made on linux-next based on the
cc13002a9f98 tag: next-20260402.

>> [...]
>> xfs_bmap_del_extent_real
>> case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
>> ifp->if_nextents--; // 22 - 1 = 21
>> if (xfs_bmap_needs_btree(ip, whichfork))
>> xfs_bmap_extents_to_btree
>> else
>> xfs_bmap_btree_to_extents // convert to extents
>
> So your test code is creating a number of fragmented extents to get
> to the edge of btree format conversion, then doing a delalloc
> write() to create a long delalloc extent range, then is alternating
> along the range of the delalloc extent doing:
>
> loop:
> 	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
> 	fallocate(PUNCH_HOLE, offset, 4096)
> 	offset += 4096;
>
> And so it is converting a single block at the left edge of the
> delalloc extent to a written extent which triggers an extent -> btree
> conversion, and then you punch out the newly created written extent
> triggering a btree -> extent conversion.
>
> And each time you do this it removes a reserved block from the
> delalloc extent for the btree root block, yes?
>

Yes, this is just a process I designed to facilitate the construction
of this problem. In another case, during the delay extent conversion,
B-tree splitting continuously consumes reserved blocks. Essentially,
this is because the delay extent conversion process is broken down,
which may cause the reserved blocks to be exhausted.
I think that the scenario of conversion between extents and B-trees
may be that the unwritten extents are converted to written extents
after the writeback is complete and then the extents are combined,
causing the B-tree to be converted to an extent. This scenario may
be triggered by normal service operations.
In any case, file system fragmentation is the cause of this problem.

> How realistic is this scenario in an application/production
> environment? I mean, nobody walks through a file syncing data to
> disk one fragmented extent at a time only to immediately remove it
> before writing the next block.
>
> We've known that this is possible for a very long time. I've
> personally known it can happen in carefully constructed test code
> for over 20 years, but I can count on one hand the number of times
> I've actually seen this exhaustion occur in a production system.
>
> The reservation we use here is essentially unchanged since it was
> first introduced in 1994, so time in use tells us that the
> reservation is largely sufficient for production systems. Can you
> describe the situation where your production systems are hitting
> this? What is the application actually doing to trigger this
> problem?
>

The extent reduction is not only triggered in the punch hole scenario.
In all scenarios where extent merging is triggered, the conversion from
B-tree to extent may be performed.

>> [...]
>> args.wasdel = wasdel; // wasdel is false
>> error = xfs_alloc_vextent(&args);
>> xfs_alloc_ag_vextent(args, 0)
>
> Ok, that's why you stripped the kernel ID out of the stack dump -
> this analysis is from a vendor kernel of some kind. i.e.
> xfs_alloc_ag_vextent() went away in 2023...
>

Sorry, my initial problem analysis was based on the v5.10 kernel version,
and I must have missed updating it during the modification.

> Which begs the question: we fixed some issues with this code back in
> 2024, so does this problem still occur on TOT kernels? e.g. commit
> d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
> conversions") should help address indlen block consumption for
> repeated partial conversions.
>

This patch can solve the issue where the required extra space may not be
reserved in time, leading to a writeback failure. However, it cannot address
the problem caused by the continuous consumption of the reserved space. After
da_old is exhausted, there is currently no replenishment, and I am wondering
whether it is reasonable to maintain the reserved blocks at the value of
xfs_bmap_worst_indlen(). At the same time, da_old does not subtract the
consumed portion, so (da_new - da_old) will be smaller than the actual value,
resulting in less space being reserved than actually needed.

commit d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial conversions")
Here, when there is not enough space, it reserves space from the reserved
block pool. Although the reservation is successful, in theory, the actual
allocation of space may fail due to insufficient space in
xfs_alloc_fix_freelist(). I am not sure whether my understanding is correct.

> [...]
* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
2026-05-13 9:33 ` yebin
@ 2026-05-13 12:26 ` Dave Chinner
0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2026-05-13 12:26 UTC
To: yebin; +Cc: linux-xfs, djwong, hch
On Wed, May 13, 2026 at 05:33:44PM +0800, yebin wrote:
> On 2026/5/13 6:52, Dave Chinner wrote:
> > On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> > > [...]
> > > RIP: 0010:assfail+0x9f/0xb0
> >
> > What kernel? You've stripped that line out of the stack dump.
>
> The initial issue appeared on the v5.10 kernel and occurred multiple times.
> The current stack is a reproduction I made on linux-next based on the
> cc13002a9f98 tag: next-20260402.

So why strip it out of the debug output? It doesn't encourage people
to look at the problem when things like this have been obviously
stripped from the output.

....

> > So your test code is creating a number of fragmented extents to get
> > to the edge of btree format conversion, then doing a delalloc
> > write() to create a long delalloc extent range, then is alternating
> > along the range of the delalloc extent doing:
> >
> > loop:
> >     sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
> >     fallocate(PUNCH_HOLE, offset, 4096)
> >     offset += 4096;
> >
> > And so it is converting a single block at the left edge of the
> > delalloc extent to a written extent which triggers a extent -> btree
> > conversion, and then you punch out the newly created written extent
> > triggering a btree -> extent conversion.
> >
> > And each time you do this it removes a reserved block from the
> > delalloc extent for the btree root block, yes?
> >
>
> Yes, this is just a process I designed to facilitate the construction
> of this problem. In another case, during the delay extent conversion,
> B-tree splitting continuously consumes reserved blocks.

The worst_indlen calculation should be taking blocks needed for BMBT
tree splits into account. It expects to consume (extent len / BMBT
records per block) leaf blocks for the delalloc extent. It then walks
back up the bmbt tree, calculating how many node blocks will be
needed to index all those leaf blocks. IOWs, it reserves all the node
blocks it will need for splits to index the growing number of leaf
blocks.

i.e. by calculating the number of BMBT blocks required to index the
delalloc extent being converted into individual single block
extents, it should have taken into account all the blocks needed for
all the BMBT splits needed to index the range.

> Essentially,
> this is because the delay extent conversion process is broken down,
> which may cause the reserved blocks to be exhausted.

As per above, fragmentation by itself shouldn't cause the indlen
reservation to be exhausted.
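
The leaf-then-node counting described above can be sketched in
userspace C. This is only a simplified model under stated
assumptions, not the kernel's implementation: the function name
worst_indlen_sketch() and the parameters leaf_recs and node_recs are
hypothetical stand-ins for the per-block record counts, and the model
ignores details such as any maximum tree height.

```c
#include <assert.h>
#include <stdint.h>

/* Round-up integer division. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Hypothetical model of a worst-case indirect block estimate.
 *   len:       delalloc extent length in filesystem blocks
 *   leaf_recs: assumed extent records per BMBT leaf block
 *   node_recs: assumed pointers per BMBT node block
 * Returns the number of btree blocks needed if every block in the
 * range ends up as its own single-block extent record.
 */
static uint64_t worst_indlen_sketch(uint64_t len, uint64_t leaf_recs,
				    uint64_t node_recs)
{
	uint64_t blocks, total;

	assert(len > 0 && leaf_recs > 0 && node_recs > 1);

	/* Leaf level: one record per block of the delalloc extent. */
	blocks = div_round_up(len, leaf_recs);
	total = blocks;

	/* Walk up the tree until a single block indexes the level below. */
	while (blocks > 1) {
		blocks = div_round_up(blocks, node_recs);
		total += blocks;
	}
	return total;
}
```

With the assumed fan-outs leaf_recs = 100 and node_recs = 50, a
1000-block delalloc extent needs 10 leaf blocks plus 1 node block, so
the sketch returns 11. The point of the model is that the estimate
already covers every split needed to index the range one block at a
time.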
> I think that the scenario of conversion between extents and B-trees
> may be that the unwritten extents are converted to written extents
> after the writeback is complete and then the extents are combined,
> causing the B-tree to be converted to an extent.

Yes, I can see how that could occur - it would need contiguous
physical extent allocation to keep the number of extents in the file
at the threshold where:

    writeback submission
      -> delalloc -> left contiguous unwritten allocation
        -> nextents++
          -> extent_to_btree

    IO completion
      -> unwritten conversion
        -> left merge with written extent
          -> nextents--
            -> btree_to_extents

But here's the thing: the extent_to_btree conversion does not account
the blocks it allocates against the indlen blocks stored in the
delalloc extent. Yes, it uses blocks that were accounted to the
superblock as reserved delalloc blocks but the btree root block
allocation only gets accounted to the superblock and not to the new
indlen in the remaining delalloc extent.

Hence the data fork can bounce back and forth between extents and
btree forms across allocation and conversion without having any
impact on the indlen held in the delalloc extent that is slowly
being allocated and written.

The problem you are seeing is that indlen is being exhausted by
something, and that results in passing wasdel = false to the
extents_to_btree() conversion without a block reservation. We don't
yet have a plausible explanation of why indlen is being exhausted in
the first place - it's not format conversion, and it's not "btree
splits", so how are we getting to indlen = 0 and triggering this
issue?

e.g. How much of the delalloc extent remains unallocated when da_old
reaches zero? Is this an off-by-one corner case of having allocated
the entire delalloc range and so having consumed all the indlen at
the same time the last allocation needs to convert the data fork to
btree format?

> This scenario may
> be triggered by normal service operations. In any case, file system
> fragmentation is the cause of this problem.

I've not seen any evidence that supports this conclusion yet.

> > How realistic is this scenario in an application/production
> > environment? I mean, nobody walks through a file syncing data to
> > disk one fragmented extent at a time only to immediately remove it
> > before writing the next block.
> >
> > We've known that this is possible for a very long time. I've
> > personally known it can happen in carefully constructed test code
> > for over 20 years, but I can count on one hand the number of times
> > I've actually seen this exhaustion occur in a production system.
> >
> > The reservation we use here is essentially unchanged since if was
> > first introduced in 1994, so time in use tells us that the
> > reservation is largely sufficient for production systems. Can you
> > describe the situation where you production systems are hitting
> > this? What is the application actually doing to trigger this
> > problem?
> >
>
> The extent reduction is not only triggered in the punch hole scenario.
> In all scenarios where extent merging is triggered, the conversion
> from B-tree to extent may be performed.

Yes, I know. But in writeback scenarios, only unwritten extent
conversion can cause merges, and that only happens when we have
contiguous allocations over the delalloc range. IOWs, it can't happen
when the filesystem is fragmented, as it requires repeated contiguous
allocation to enable left merging.

Hence large, uncontested free spaces are required to trigger the fork
format conversion cycling behaviour, but this is irrelevant because I
don't think the format cycling is the cause of indlen exhaustion....

> > Which begs the question: we fixed some issues with this code back in
> > 2024, so does this problem still occur on TOT kernels? e.g. commit
> > d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
> > conversions") should help address indlen block consumption for
> > repeated partial conversions.
> >
>
> This patch can solve the issue where the required extra space may not
> be reserved in time, leading to a writeback failure. However, it cannot
> address the problem caused by the continuous consumption of the reserved
> space.

OK, but therein lies the issue: what is the mechanism that causes the
excessive consumption of the indlen blocks? Is the calculation wrong,
does it leak blocks when we split the delalloc extent, or something
else?

-Dave.
--
Dave Chinner
dgc@kernel.org
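
The claim in this message that the data fork can cycle between extent
and btree formats without touching indlen can be illustrated with a
toy state model. Everything here is an assumption made for
illustration: struct fork_model, its fields, max_inline and both
helper functions are hypothetical, and the model deliberately leaves
out the recalculation of indlen as the delalloc extent shrinks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of an inode data fork; not kernel code. */
enum fork_format { FORK_EXTENTS, FORK_BTREE };

struct fork_model {
	uint32_t nextents;	/* real extents currently in the fork */
	uint32_t max_inline;	/* assumed max extents in extent format */
	enum fork_format format;
	uint64_t indlen;	/* reservation held in the delalloc extent */
	uint64_t sb_charged;	/* root blocks charged to the superblock */
	uint32_t conversions;	/* number of format changes seen */
};

/* Writeback submission: allocate one block at the left edge. */
static void model_alloc_left_edge(struct fork_model *f)
{
	f->nextents++;
	if (f->format == FORK_EXTENTS && f->nextents > f->max_inline) {
		f->format = FORK_BTREE;
		f->sb_charged++;	/* root block: superblock, not indlen */
		f->conversions++;
	}
}

/* IO completion: unwritten conversion merges with the left neighbour. */
static void model_merge_left(struct fork_model *f)
{
	assert(f->nextents > 1);
	f->nextents--;
	if (f->format == FORK_BTREE && f->nextents <= f->max_inline) {
		f->format = FORK_EXTENTS;
		f->conversions++;
	}
}
```

Starting at the threshold (nextents == max_inline) and alternating
the two calls flips the format on every step while indlen never
changes in the model, which is why format cycling alone does not
explain indlen reaching zero.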
end of thread

Thread overview: 5+ messages
2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
2026-05-12 17:19 ` Darrick J. Wong
2026-05-12 22:52 ` Dave Chinner
2026-05-13 9:33 ` yebin
2026-05-13 12:26 ` Dave Chinner