[bug report] kernel BUG at fs/xfs/xfs

Linux XFS filesystem development

* [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
@ 2026-05-12 11:34 yebin
  2026-05-12 17:19 ` Darrick J. Wong
  2026-05-12 22:52 ` Dave Chinner
  0 siblings, 2 replies; 6+ messages in thread
From: yebin @ 2026-05-12 11:34 UTC
  To: linux-xfs, djwong, hch, dgc

Hello Darrick and all,

Recently, I encountered a problem where a BUG was triggered in the write-back process.
The detailed problem information is as follows:
```
XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
XFS (sde): Please unmount the filesystem and rectify the problem(s)
XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
RIP: 0010:assfail+0x9f/0xb0
Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
FS:  00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
Call Trace:
  <TASK>
  xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
  __xfs_trans_commit+0x38b/0xe00
  xfs_trans_commit+0xeb/0x1a0
  xfs_bmapi_convert_one_delalloc+0xbca/0x1270
  xfs_bmapi_convert_delalloc+0x101/0x350
  xfs_writeback_range+0x76c/0x12d0
  iomap_writeback_folio+0x9ed/0x2100
  iomap_writepages+0x13c/0x2a0
  xfs_vm_writepages+0x278/0x330
  do_writepages+0x247/0x5c0
  filemap_writeback+0x22c/0x2e0
  xfs_file_release+0x442/0x580
  __fput+0x407/0xb50
  fput_close_sync+0x114/0x210
  __x64_sys_close+0x94/0x120
  do_syscall_64+0xc4/0xf80
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
```

After analyzing the above issues, the possible triggering process
is as follows:
```
xfs_bmapi_convert_delalloc
  xfs_bmapi_convert_one_delalloc
   xfs_bmapi_allocate
    xfs_bmap_add_extent_delay_real
     da_old = startblockval(PREV.br_startblock); // da_old = 5
     case BMAP_LEFT_FILLING:
      ifp->if_nextents++;  // 21 + 1 = 22
      if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
       xfs_bmap_extents_to_btree     // convert to btree
         cur->bc_ino.allocated++;
       da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                                startblockval(PREV.br_startblock) -
                                (bma->cur ? bma->cur->bc_ino.allocated : 0));  // da_new = 5 - 1 = 4
       PREV.br_startblock = nullstartblock(da_new); //xfs_bmapi_convert_one_delalloc() return

                                                  xfs_bmap_del_extent_real
                                                    case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
                                                     ifp->if_nextents--;  // 22 - 1 = 21
                                                     if (xfs_bmap_needs_btree(ip, whichfork))
                                                       xfs_bmap_extents_to_btree
                                                     else
                                                       xfs_bmap_btree_to_extents  // convert to extents
... // Alternate a few times in the middle.
da_old = 4
da_old = 3
da_old = 2
da_old = 1
...
xfs_bmapi_convert_delalloc
  xfs_bmapi_convert_one_delalloc
   error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp);  // Both blocks and rtextents are 0
    tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
    error = xfs_trans_reserve(tp, resp, blocks, rtextents);
    if (blocks > 0)
     error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
     tp->t_blk_res += blocks;    // The value of blocks is 0, so the value of tp->t_blk_res is 0
    xfs_bmapi_allocate
     xfs_bmap_add_extent_delay_real
      da_old = startblockval(PREV.br_startblock); // da_old = 0
      case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:   // The current delay extent is just exhausted.
       ifp->if_nextents++;  // 21 + 1 = 22

     if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
      error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork);     // Converted to btree. da_old > 0 is false.
       args.wasdel = wasdel;   //  wasdel is false
       error = xfs_alloc_vextent(&args);
        xfs_alloc_ag_vextent(args, 0)
         xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
          case XFS_AG_RESV_NONE:
           field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS : XFS_TRANS_SB_FDBLOCKS; //args->wasdel  == false
           xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
            case XFS_TRANS_SB_FDBLOCKS:
             if (delta < 0)
              tp->t_blk_res_used += (uint)-delta;
              if (tp->t_blk_res_used > tp->t_blk_res)  // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
               xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
```

I constructed the sequence above deliberately to make the problem easy to
reproduce. Besides the scenario where the fork is converted back and forth
between XFS_DINODE_FMT_BTREE and XFS_DINODE_FMT_EXTENTS, btree splitting can
consume the reserved blocks in the same way.
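
To make the sequence above concrete, here is a toy model in plain C. It is
not kernel code: the names toy_trans, toy_delalloc, toy_convert_to_btree and
toy_cycles_until_shutdown are made up for illustration. It assumes, as in the
trace, that every conversion transaction starts with a block reservation of 0
and that each extents -> btree conversion costs exactly one block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the few transaction fields the trace touches. */
struct toy_trans {
	uint32_t blk_res;	/* blocks reserved when the transaction was allocated */
	uint32_t blk_res_used;	/* blocks charged against that reservation */
	bool shutdown;		/* set where the trace calls xfs_force_shutdown() */
};

/* Toy delalloc extent: only the indirect-block reservation matters here. */
struct toy_delalloc {
	uint32_t indlen;	/* what startblockval() returns in the trace */
};

/*
 * One extents -> btree conversion. While the delalloc extent still carries
 * indlen ("da_old > 0"), the btree block is paid for from it. Once indlen is
 * 0, the block is charged to the transaction reservation instead.
 */
void toy_convert_to_btree(struct toy_trans *tp, struct toy_delalloc *da)
{
	assert(tp && da);

	if (da->indlen > 0) {
		da->indlen--;
		return;
	}
	tp->blk_res_used += 1;
	if (tp->blk_res_used > tp->blk_res)
		tp->shutdown = true;
}

/*
 * Repeat the convert/punch cycle. Each cycle gets a fresh transaction with
 * no block reservation. The punch step is not modelled because it returns
 * the btree block to free space, not to the extent's indlen.
 * Returns the 1-based cycle on which the shutdown fires, or 0 if it does
 * not fire within max_cycles.
 */
unsigned int toy_cycles_until_shutdown(uint32_t initial_indlen,
				       unsigned int max_cycles)
{
	struct toy_delalloc da = { .indlen = initial_indlen };

	for (unsigned int cycle = 1; cycle <= max_cycles; cycle++) {
		struct toy_trans tp = { .blk_res = 0 };

		toy_convert_to_btree(&tp, &da);
		if (tp.shutdown)
			return cycle;
	}
	return 0;
}
```

With an initial indlen of 5 this model survives five cycles and trips on the
sixth, which matches the da_old = 5, 4, 3, 2, 1, 0 progression in the trace.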

The core reason for the issue is that in xfs_bmapi_convert_delalloc(), the
call to xfs_bmap_worst_indlen() calculates the worst-case number of reserved
blocks, which is the number of additional blocks required after a complete
conversion of the entire delayed extent. It assumes that the entire conversion
process is atomic. However, the current process cannot guarantee such atomicity.
In the case of a fragmented filesystem, the most extreme scenario is that every
block conversion triggers a full btree split, in which case the reserved blocks
are far from sufficient. When this issue is triggered, the filesystem fragmentation
in the environment is indeed quite severe.
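
As a rough illustration of why a reservation sized once for the whole extent
is small, the sketch below estimates the worst-case number of btree blocks
for a delalloc extent of a given length. It is a simplified model that
loosely follows the idea behind xfs_bmap_worst_indlen(), not a copy of it.
The function name toy_worst_indlen and the records-per-block and
maximum-level values used in the example are assumptions chosen for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst case: every one of the `len` blocks becomes its own extent record.
 * Count the blocks needed at each btree level, bottom up, until one block
 * is enough, then add one block for every remaining level above it.
 */
uint64_t toy_worst_indlen(uint64_t len, uint32_t leaf_maxrecs,
			  uint32_t node_maxrecs, unsigned int maxlevels)
{
	uint64_t total = 0;
	uint32_t maxrecs = leaf_maxrecs;

	assert(leaf_maxrecs > 0 && node_maxrecs > 0);

	for (unsigned int level = 0; level < maxlevels; level++) {
		/* blocks needed at this level, rounded up */
		len = (len + maxrecs - 1) / maxrecs;
		total += len;
		if (len == 1)
			return total + maxlevels - level - 1;
		if (level == 0)
			maxrecs = node_maxrecs;
	}
	return total;
}
```

In this model, toy_worst_indlen(1, 251, 251, 5) gives 5 and
toy_worst_indlen(1000, 251, 251, 5) gives 8. The estimate depends only on the
extent length, so nothing in it accounts for the same extent being converted
in many small steps.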

Further analysis of this failure model shows that, because the reserved blocks
are consumed continuously, consumption may eventually exceed the reserved
amount. When free space is nearly exhausted, xfs_bmap_extents_to_btree() may
then fail to allocate a block and trigger a warning, and that failure can in
turn cause problems for normal block allocation.

Additionally, in xfs_bmap_add_extent_delay_real(), if a delayed extent is split
into two, xfs_bmap_worst_indlen() is recalculated to reserve blocks. In the case
of nearly exhausted space, it may be impossible to reserve the newly required
blocks, leading to a writeback failure.

During the reservation phase, reserving more blocks by considering the worst-case
scenario would require occupying a lot of extra space, which is not very practical.
I was thinking that we could convert all the delay extents at once to ensure
atomicity, which would ensure that the two issues analyzed above do not exist.
However, I am not sure what negative impacts this approach might have. The only
thing I can think of is that the reserved space would be repeatedly allocated and
released, but I believe the current logic already has similar situations.

I haven't thought of a better solution at the moment. I wonder if anyone has any
good ideas?


Thanks,
Ye Bin



* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
  2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
@ 2026-05-12 17:19 ` Darrick J. Wong
  2026-05-12 22:52 ` Dave Chinner
  1 sibling, 0 replies; 6+ messages in thread
From: Darrick J. Wong @ 2026-05-12 17:19 UTC
  To: yebin; +Cc: linux-xfs, hch, dgc

On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> Hello Darrick and all,
> 
> Recently, I encountered a problem where a BUG was triggered in the write-back process.
> The detailed problem information is as follows:
> ```
> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
> XFS (sde): Please unmount the filesystem and rectify the problem(s)
> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:102!
> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> RIP: 0010:assfail+0x9f/0xb0
> Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
> RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
> RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
> RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
> R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
> FS:  00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
> Call Trace:
>  <TASK>
>  xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
>  __xfs_trans_commit+0x38b/0xe00
>  xfs_trans_commit+0xeb/0x1a0
>  xfs_bmapi_convert_one_delalloc+0xbca/0x1270
>  xfs_bmapi_convert_delalloc+0x101/0x350
>  xfs_writeback_range+0x76c/0x12d0
>  iomap_writeback_folio+0x9ed/0x2100
>  iomap_writepages+0x13c/0x2a0
>  xfs_vm_writepages+0x278/0x330
>  do_writepages+0x247/0x5c0
>  filemap_writeback+0x22c/0x2e0
>  xfs_file_release+0x442/0x580
>  __fput+0x407/0xb50
>  fput_close_sync+0x114/0x210
>  __x64_sys_close+0x94/0x120
>  do_syscall_64+0xc4/0xf80
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ```
> 
> After analyzing the above issues, the possible triggering process
> is as follows:
> ```
> xfs_bmapi_convert_delalloc
>  xfs_bmapi_convert_one_delalloc
>   xfs_bmapi_allocate
>    xfs_bmap_add_extent_delay_real
>     da_old = startblockval(PREV.br_startblock); // da_old = 5
>     case BMAP_LEFT_FILLING:
>      ifp->if_nextents++;  // 21 + 1 = 22
>      if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>       xfs_bmap_extents_to_btree     // convert to btree
>         cur->bc_ino.allocated++;
>       da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
>                                startblockval(PREV.br_startblock) -
>                                (bma->cur ? bma->cur->bc_ino.allocated : 0));  // da_new = 5 - 1 = 4
>       PREV.br_startblock = nullstartblock(da_new); //xfs_bmapi_convert_one_delalloc() return
> 
>                                                  xfs_bmap_del_extent_real
>                                                    case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
>                                                     ifp->if_nextents--;  // 22 - 1 = 21
>                                                     if (xfs_bmap_needs_btree(ip, whichfork))
>                                                       xfs_bmap_extents_to_btree
>                                                     else
>                                                       xfs_bmap_btree_to_extents  // convert to extents
> ... // Alternate a few times in the middle.
> da_old = 4
> da_old = 3
> da_old = 2
> da_old = 1
> ...
> xfs_bmapi_convert_delalloc
>  xfs_bmapi_convert_one_delalloc
>   error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp);  // Both blocks and rtextents are 0
>    tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
>    error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>    if (blocks > 0)
>     error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>     tp->t_blk_res += blocks;    // The value of blocks is 0, so the value of tp->t_blk_res is 0
>    xfs_bmapi_allocate
>     xfs_bmap_add_extent_delay_real
>      da_old = startblockval(PREV.br_startblock); // da_old = 0
>      case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:   // The current delay extent is just exhausted.
>       ifp->if_nextents++;  // 21 + 1 = 22
> 
>     if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>      error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork);     // Converted to btree. da_old > 0 is false.
>       args.wasdel = wasdel;   //  wasdel is false
>       error = xfs_alloc_vextent(&args);
>        xfs_alloc_ag_vextent(args, 0)
>         xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
>          case XFS_AG_RESV_NONE:
>           field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS : XFS_TRANS_SB_FDBLOCKS; //args->wasdel  == false
>           xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
>            case XFS_TRANS_SB_FDBLOCKS:
>             if (delta < 0)
>              tp->t_blk_res_used += (uint)-delta;
>              if (tp->t_blk_res_used > tp->t_blk_res)  // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
>               xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> ```
> 
> The logic that triggers the issue above was designed by me to facilitate the
> construction of the problem. Besides the scenario where XFS_DINODE_FMT_BTREE
> and XFS_DINODE_FMT_EXTENTS are converted back and forth, there is also the
> scenario of btree splitting.
> 
> The core reason for the issue is that in xfs_bmapi_convert_delalloc(), the
> call to xfs_bmap_worst_indlen() calculates the worst-case number of reserved
> blocks, which is the number of additional blocks required after a complete
> conversion of the entire delayed extent. It assumes that the entire conversion
> process is atomic. However, the current process cannot guarantee such atomicity.
> In the case of a fragmented filesystem, the most extreme scenario is that every
> block conversion triggers a full btree split, in which case the reserved blocks
> are far from sufficient. When this issue is triggered, the filesystem fragmentation
> in the environment is indeed quite severe.
> 
> Further analysis of this abnormal model shows that because the reserved blocks
> are continuously consumed, they may eventually exceed the reserved amount. When
> the space is nearly exhausted, xfs_bmap_extents_to_btree() may fail to allocate
> blocks, triggering a warning. This failure to allocate additional blocks can lead
> to issues with normal block allocation.
> 
> Additionally, in xfs_bmap_add_extent_delay_real(), if a delayed extent is split
> into two, xfs_bmap_worst_indlen() is recalculated to reserve blocks. In the case
> of nearly exhausted space, it may be impossible to reserve the newly required
> blocks, leading to a writeback failure.
> 
> During the reservation phase, reserving more blocks by considering the worst-case
> scenario would require occupying a lot of extra space, which is not very practical.
> I was thinking that we could convert all the delay extents at once to ensure
> atomicity, which would ensure that the two issues analyzed above do not exist.
> However, I am not sure what negative impacts this approach might have. The only
> thing I can think of is that the reserved space would be repeatedly allocated and
> released, but I believe the current logic already has similar situations.
> 
> I haven't thought of a better solution at the moment. I wonder if anyone has any
> good ideas?

I haven't.  With EFI-based space freeing in transactions, as a
theoretical last resort you could steal a block from an EFI instead of
scanning the bnobt/cntbt.  This would mitigate the nastiness of repeated
split/merge cycles but you'd have to be careful about block reuse.

--D

> 
> Thanks,
> Ye Bin
> 
> 


* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
  2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
  2026-05-12 17:19 ` Darrick J. Wong
@ 2026-05-12 22:52 ` Dave Chinner
  2026-05-13  9:33   ` yebin
  1 sibling, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2026-05-12 22:52 UTC
  To: yebin; +Cc: linux-xfs, djwong, hch

On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> Hello Darrick and all,
> 
> Recently, I encountered a problem where a BUG was triggered in the write-back process.
> The detailed problem information is as follows:
> ```
> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
> XFS (sde): Please unmount the filesystem and rectify the problem(s)
> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:102!
> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> RIP: 0010:assfail+0x9f/0xb0

What kernel? You've stripped that line out of the stack dump.

> Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
> RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
> RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
> RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
> R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
> FS:  00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
> Call Trace:
>  <TASK>
>  xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
>  __xfs_trans_commit+0x38b/0xe00
>  xfs_trans_commit+0xeb/0x1a0
>  xfs_bmapi_convert_one_delalloc+0xbca/0x1270
>  xfs_bmapi_convert_delalloc+0x101/0x350
>  xfs_writeback_range+0x76c/0x12d0
>  iomap_writeback_folio+0x9ed/0x2100
>  iomap_writepages+0x13c/0x2a0
>  xfs_vm_writepages+0x278/0x330
>  do_writepages+0x247/0x5c0
>  filemap_writeback+0x22c/0x2e0
>  xfs_file_release+0x442/0x580
>  __fput+0x407/0xb50
>  fput_close_sync+0x114/0x210
>  __x64_sys_close+0x94/0x120
>  do_syscall_64+0xc4/0xf80
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ```
> 
> After analyzing the above issues, the possible triggering process
> is as follows:
> ```
> xfs_bmapi_convert_delalloc
>  xfs_bmapi_convert_one_delalloc
>   xfs_bmapi_allocate
>    xfs_bmap_add_extent_delay_real
>     da_old = startblockval(PREV.br_startblock); // da_old = 5
>     case BMAP_LEFT_FILLING:
>      ifp->if_nextents++;  // 21 + 1 = 22
>      if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>       xfs_bmap_extents_to_btree     // convert to btree
>         cur->bc_ino.allocated++;
>       da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
>                                startblockval(PREV.br_startblock) -
>                                (bma->cur ? bma->cur->bc_ino.allocated : 0));  // da_new = 5 - 1 = 4
>       PREV.br_startblock = nullstartblock(da_new); //xfs_bmapi_convert_one_delalloc() return
> 
>                                                  xfs_bmap_del_extent_real
>                                                    case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
>                                                     ifp->if_nextents--;  // 22 - 1 = 21
>                                                     if (xfs_bmap_needs_btree(ip, whichfork))
>                                                       xfs_bmap_extents_to_btree
>                                                     else
>                                                       xfs_bmap_btree_to_extents  // convert to extents

So your test code is creating a number of fragmented extents to get
to the edge of btree format conversion, then doing a delalloc
write() to create a long delalloc extent range, then is alternating
along the range of the delalloc extent doing:

loop:
	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
	fallocate(PUNCH_HOLE, offset, 4096)
	offset += 4096;

And so it is converting a single block at the left edge of the
delalloc extent to a written extent which triggers an extent -> btree
conversion, and then you punch out the newly created written extent
triggering a btree -> extent conversion.

And each time you do this it removes a reserved block from the
delalloc extent for the btree root block, yes?
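
For reference, the loop above could be written as the following userspace
sketch. The helper name sync_then_punch is made up here, and the code assumes
the file already has dirty delalloc data in the given range. Setting up the
fragmented extents that put the inode at the edge of btree format is not
shown. Note that fallocate() requires FALLOC_FL_PUNCH_HOLE to be combined
with FALLOC_FL_KEEP_SIZE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Walk [start, start + len) one block at a time: start writeback for the
 * block, then punch it out again before moving on to the next block.
 * Returns the number of blocks processed, or -1 with errno set by the
 * failing system call.
 */
int sync_then_punch(int fd, off_t start, off_t len, off_t blksz)
{
	int done = 0;

	assert(blksz > 0);

	for (off_t off = start; off + blksz <= start + len; off += blksz) {
		/* Initiate writeback of this block only. */
		if (sync_file_range(fd, off, blksz, SYNC_FILE_RANGE_WRITE) < 0)
			return -1;
		/* Remove the block that was just written back. */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      off, blksz) < 0)
			return -1;
		done++;
	}
	return done;
}
```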

How realistic is this scenario in an application/production
environment? I mean, nobody walks through a file syncing data to
disk one fragmented extent at a time only to immediately remove it
before writing the next block.

We've known that this is possible for a very long time. I've
personally known it can happen in carefully constructed test code
for over 20 years, but I can count on one hand the number of times
I've actually seen this exhaustion occur in a production system.

The reservation we use here is essentially unchanged since it was
first introduced in 1994, so time in use tells us that the
reservation is largely sufficient for production systems. Can you
describe the situation where your production systems are hitting
this? What is the application actually doing to trigger this
problem?

> ... // Alternate a few times in the middle.
> da_old = 4
> da_old = 3
> da_old = 2
> da_old = 1
> ...
> xfs_bmapi_convert_delalloc
>  xfs_bmapi_convert_one_delalloc
>   error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp);  // Both blocks and rtextents are 0
>    tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
>    error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>    if (blocks > 0)
>     error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>     tp->t_blk_res += blocks;    // The value of blocks is 0, so the value of tp->t_blk_res is 0
>    xfs_bmapi_allocate
>     xfs_bmap_add_extent_delay_real
>      da_old = startblockval(PREV.br_startblock); // da_old = 0
>      case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:   // The current delay extent is just exhausted.
>       ifp->if_nextents++;  // 21 + 1 = 22
> 
>     if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>      error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork);     // Converted to btree. da_old > 0 is false.
>       args.wasdel = wasdel;   //  wasdel is false
>       error = xfs_alloc_vextent(&args);
>        xfs_alloc_ag_vextent(args, 0)

Ok, that's why you stripped the kernel ID out of the stack dump -
this analysis is from a vendor kernel of some kind. i.e.
xfs_alloc_ag_vextent() went away in 2023...

Which begs the question: we fixed some issues with this code back in
2024, so does this problem still occur on TOT kernels? e.g. commit
d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
conversions") should help address indlen block consumption for
repeated partial conversions.

> Further analysis of this abnormal model shows that because the reserved blocks
> are continuously consumed, they may eventually exceed the reserved amount. When
> the space is nearly exhausted, xfs_bmap_extents_to_btree() may fail to allocate
> blocks, triggering a warning. This failure to allocate additional blocks can lead
> to issues with normal block allocation.

The TOT code should be recalculating the required indlen for the
remaining delalloc extent and accounting for indlen block usage
where it gets depleted. Hence the gradual reduction of the indlen
over repeated left edge conversion and removal triggering repeated
indlen block consumption should no longer be a problem.

If it is a problem, then we need to make sure we account for it
correctly, similar to the fix in the above commit and the series it
was part of.

-Dave.
-- 
Dave Chinner
dgc@kernel.org


* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
  2026-05-12 22:52 ` Dave Chinner
@ 2026-05-13  9:33   ` yebin
  2026-05-13 12:26     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: yebin @ 2026-05-13  9:33 UTC
  To: Dave Chinner; +Cc: linux-xfs, djwong, hch



On 2026/5/13 6:52, Dave Chinner wrote:
> On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
>> Hello Darrick and all,
>>
>> Recently, I encountered a problem where a BUG was triggered in the write-back process.
>> The detailed problem information is as follows:
>> ```
>> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
>> XFS (sde): Please unmount the filesystem and rectify the problem(s)
>> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
>> ------------[ cut here ]------------
>> kernel BUG at fs/xfs/xfs_message.c:102!
>> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
>> RIP: 0010:assfail+0x9f/0xb0
>
> What kernel? You've stripped that line out of the stack dump.

The initial issue appeared on the v5.10 kernel and occurred multiple times.
The current stack is a reproduction I made on linux-next based on the
cc13002a9f98 tag: next-20260402.

>
>> Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
>> RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
>> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
>> RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
>> RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
>> R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
>> R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
>> FS:  00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
>> Call Trace:
>>   <TASK>
>>   xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
>>   __xfs_trans_commit+0x38b/0xe00
>>   xfs_trans_commit+0xeb/0x1a0
>>   xfs_bmapi_convert_one_delalloc+0xbca/0x1270
>>   xfs_bmapi_convert_delalloc+0x101/0x350
>>   xfs_writeback_range+0x76c/0x12d0
>>   iomap_writeback_folio+0x9ed/0x2100
>>   iomap_writepages+0x13c/0x2a0
>>   xfs_vm_writepages+0x278/0x330
>>   do_writepages+0x247/0x5c0
>>   filemap_writeback+0x22c/0x2e0
>>   xfs_file_release+0x442/0x580
>>   __fput+0x407/0xb50
>>   fput_close_sync+0x114/0x210
>>   __x64_sys_close+0x94/0x120
>>   do_syscall_64+0xc4/0xf80
>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> ```
>>
>> After analyzing the above issues, the possible triggering process
>> is as follows:
>> ```
>> xfs_bmapi_convert_delalloc
>>   xfs_bmapi_convert_one_delalloc
>>    xfs_bmapi_allocate
>>     xfs_bmap_add_extent_delay_real
>>      da_old = startblockval(PREV.br_startblock); // da_old = 5
>>      case BMAP_LEFT_FILLING:
>>       ifp->if_nextents++;  // 21 + 1 = 22
>>       if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>>        xfs_bmap_extents_to_btree     // convert to btree
>>          cur->bc_ino.allocated++;
>>        da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
>>                                 startblockval(PREV.br_startblock) -
>>                                 (bma->cur ? bma->cur->bc_ino.allocated : 0));  // da_new = 5 - 1 = 4
>>        PREV.br_startblock = nullstartblock(da_new); //xfs_bmapi_convert_one_delalloc() return
>>
>>                                                   xfs_bmap_del_extent_real
>>                                                     case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
>>                                                      ifp->if_nextents--;  // 22 - 1 = 21
>>                                                      if (xfs_bmap_needs_btree(ip, whichfork))
>>                                                        xfs_bmap_extents_to_btree
>>                                                      else
>>                                                        xfs_bmap_btree_to_extents  // convert to extents
>
> So your test code is creating a number of fragmented extents to get
> to the edge of btree format conversion, then doing a delalloc
> write() to create a long delalloc extent range, then is alternating
> along the range of the delalloc extent doing:
>
> loop:
> 	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
> 	fallocate(PUNCH_HOLE, offset, 4096)
> 	offset += 4096;
>
> And so it is converting a single block at the left edge of the
> delalloc extent to a written extent which triggers an extent -> btree
> conversion, and then you punch out the newly created written extent
> triggering a btree -> extent conversion.
>
> And each time you do this it removes a reserved block from the
> delalloc extent for the btree root block, yes?
>

Yes, this is just a sequence I constructed to make the problem easy to
reproduce. In another case, B-tree splits during delalloc extent
conversion keep consuming reserved blocks. The underlying cause is the
same: the conversion of one delalloc extent is broken into several
steps, so the reserved blocks can run out.
I think conversion between extent and B-tree format can also happen
when unwritten extents are converted to written extents after writeback
completes and are then merged, which converts the B-tree back to
extent format. Normal workloads may trigger this scenario. In any case,
file system fragmentation is the cause of this problem.

> How realistic is this scenario in an application/production
> environment? I mean, nobody walks through a file syncing data to
> disk one fragmented extent at a time only to immediately remove it
> before writing the next block.
>
> We've known that this is possible for a very long time. I've
> personally known it can happen in carefully constructed test code
> for over 20 years, but I can count on one hand the number of times
> I've actually seen this exhaustion occur in a production system.
>
> The reservation we use here is essentially unchanged since it was
> first introduced in 1994, so time in use tells us that the
> reservation is largely sufficient for production systems. Can you
> describe the situation where your production systems are hitting
> this? What is the application actually doing to trigger this
> problem?
>

The extent reduction is not only triggered in the punch hole scenario.
In all scenarios where extent merging is triggered, the conversion
from B-tree to extent may be performed.

>> ... // Alternate a few times in the middle.
>> da_old = 4
>> da_old = 3
>> da_old = 2
>> da_old = 1
>> ...
>> xfs_bmapi_convert_delalloc
>>   xfs_bmapi_convert_one_delalloc
>>    error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp);  // Both blocks and rtextents are 0
>>     tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
>>     error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>>     if (blocks > 0)
>>      error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>>      tp->t_blk_res += blocks;    // The value of blocks is 0, so the value of tp->t_blk_res is 0
>>     xfs_bmapi_allocate
>>      xfs_bmap_add_extent_delay_real
>>       da_old = startblockval(PREV.br_startblock); // da_old = 0
>>       case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:   // The current delay extent is just exhausted.
>>        ifp->if_nextents++;  // 21 + 1 = 22
>>
>>      if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>>       error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork);     // Converted to btree. da_old > 0 is false.
>>        args.wasdel = wasdel;   //  wasdel is false
>>        error = xfs_alloc_vextent(&args);
>>         xfs_alloc_ag_vextent(args, 0)
>
> Ok, that's why you stripped the kernel ID out of the stack dump -
> this analysis is from a vendor kernel of some kind. i.e.
> xfs_alloc_ag_vextent() went away in 2023...
>
Sorry, my initial problem analysis was based on the v5.10 kernel version,
and I must have missed updating it during the modification.

> Which begs the question: we fixed some issues with this code back in
> 2024, so does this problem still occur on TOT kernels? e.g. commit
> d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
> conversions") should help address indlen block consumption for
> repeated partial conversions.
>

This patch can solve the issue where the required extra space may not
be reserved in time, leading to a writeback failure. However, it cannot
address the problem caused by the continuous consumption of the reserved
space. After da_old is exhausted, there is currently no replenishment,
and I am wondering whether it is reasonable to maintain the reserved
blocks at the value of xfs_bmap_worst_indlen(). At the same time, da_old
does not subtract the consumed portion, so (da_new - da_old) will be
smaller than the actual value, resulting in less space being reserved
than actually needed.

With commit d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for
partial conversions"), when there is not enough space, the extra blocks
are taken from the reserved block pool. Although that reservation
succeeds, in theory the actual allocation may still fail in
xfs_alloc_fix_freelist() due to insufficient space. Is my understanding
correct?

>> Further analysis of this abnormal model shows that because the reserved blocks
>> are continuously consumed, they may eventually exceed the reserved amount. When
>> the space is nearly exhausted, xfs_bmap_extents_to_btree() may fail to allocate
>> blocks, triggering a warning. This failure to allocate additional blocks can lead
>> to issues with normal block allocation.
>
> The TOT code should be recalculating the required indlen for the
> remaining delalloc extent and accounting for indlen block usage
> where it gets depleted. hence the gradual reduction of the indlen
> over repeated left edge conversion and removal triggering repeated
> indlen block consumption should no longer be a problem.
>
> If it is a problem, then we need to make sure we account for it
> correctly, similar to the fix in the above commit and the series it
> was part of.
>
> -Dave.
>



* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
  2026-05-13  9:33   ` yebin
@ 2026-05-13 12:26     ` Dave Chinner
  2026-05-14  3:16       ` yebin
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2026-05-13 12:26 UTC (permalink / raw)
  To: yebin; +Cc: linux-xfs, djwong, hch

On Wed, May 13, 2026 at 05:33:44PM +0800, yebin wrote:
> 
> 
> On 2026/5/13 6:52, Dave Chinner wrote:
> > On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> > > Hello Darrick and all,
> > > 
> > > Recently, I encountered a problem where a BUG was triggered in the write-back process.
> > > The detailed problem information is as follows:
> > > ```
> > > XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
> > > XFS (sde): Please unmount the filesystem and rectify the problem(s)
> > > XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
> > > ------------[ cut here ]------------
> > > kernel BUG at fs/xfs/xfs_message.c:102!
> > > Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> > > RIP: 0010:assfail+0x9f/0xb0
> > 
> > What kernel? You've stripped that line out of the stack dump.
> 
> The initial issue appeared on the v5.10 kernel and occurred multiple times.
> The current stack is a reproduction I made on linux-next based on the
> cc13002a9f98 tag: next-20260402.

So why strip it out of the debug output? It doesn't encourage people
to look at the problem when things like this have been obviously
stripped from the output.

....

> > So your test code is creating a number of fragmented extents to get
> > to the edge of btree format conversion, then doing a delalloc
> > write() to create a long delalloc extent range, then is alternating
> > along the range of the delalloc extent doing:
> > 
> > loop:
> > 	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
> > 	fallocate(PUNCH_HOLE, offset, 4096)
> > 	offset += 4096;
> > 
> > And so it is converting a single block at the left edge of the
> > delalloc extent to a written extent which triggers an extent -> btree
> > conversion, and then you punch out the newly created written extent
> > triggering a btree -> extent conversion.
> > 
> > And each time you do this it removes a reserved block from the
> > delalloc extent for the btree root block, yes?
> > 
> 
> Yes, this is just a process I designed to facilitate the construction
> of this problem. In another case, during the delay extent conversion,
> B-tree splitting continuously consumes reserved blocks.

The worst_indlen calculation should be taking blocks needed for
BMBT tree splits into account.

It expects to consume (extent len / BMBT records per block) leaf
blocks for the delalloc extent. It then walks back up the bmbt tree,
calculating how many node blocks will be needed to index all those
leaf blocks.  IOWs, it reserves all the node blocks it will need for
splits to index the growing number of leaf blocks.

i.e. by calculating the number of BMBT blocks required to index the
delalloc extent being converted into individual single block
extents, it should have taken into account all the blocks needed for
all the BMBT splits needed to index the range.
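
The calculation described above can be sketched roughly as follows.
This is a simplified user-space model, not the kernel's
xfs_bmap_worst_indlen(): the function name and the leaf_recs,
node_recs and max_levels parameters are assumptions standing in for
the per-mount BMBT geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of a worst-case indirect block reservation.
 * All names and parameters are hypothetical, not kernel API.
 *
 * len        - delalloc extent length in filesystem blocks
 * leaf_recs  - max extent records per BMBT leaf block (assumed)
 * node_recs  - max pointers per BMBT node block (assumed)
 * max_levels - max BMBT height (assumed)
 */
static uint64_t
worst_indlen_sketch(uint64_t len, unsigned int leaf_recs,
		    unsigned int node_recs, unsigned int max_levels)
{
	uint64_t	blocks = 0;
	unsigned int	recs = leaf_recs;
	unsigned int	level;

	assert(leaf_recs > 0 && node_recs > 0);

	for (level = 0; level < max_levels; level++) {
		/* Blocks needed at this level to index 'len' items. */
		len = (len + recs - 1) / recs;
		blocks += len;

		/*
		 * Only one block at this level: each level above it
		 * needs at most one block as well, so add those and stop.
		 */
		if (len == 1)
			return blocks + max_levels - level - 1;

		/* Levels above the leaves hold pointers, not records. */
		recs = node_recs;
	}
	return blocks;
}
```

With made-up geometry of 4 records per block and a maximum height of
5, a 10-block delalloc extent needs 3 leaf blocks plus one block for
each of the 4 levels above them, so the sketch returns 7.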

> Essentially,
> this is because the delay extent conversion process is broken down,
> which may cause the reserved blocks to be exhausted.

As per above, fragmentation by itself shouldn't cause the indlen
reservation to be exhausted.

> I think that the scenario of conversion between extents and B-trees
> may be that the unwritten extents are converted to written extents
> after the writeback is complete and then the extents are combined,
> causing the B-tree to be converted to an extent.

Yes, I can see how that could occur - it would need contiguous
physical extent allocation to keep the number of extents in the file
at the threshold where:

writeback submission
-> delalloc
  -> left contiguous unwritten allocation
    -> nextents++
      -> extent_to_btree

IO completion
-> unwritten conversion
  -> left merge with written extent
     -> nextents--
        -> btree_to_extents

But here's the thing: the extent_to_btree conversion does not
account blocks allocated to indlen blocks stored in the delalloc
extent. Yes, it uses blocks that were accounting to the superblock
as reserved delalloc blocks but the btree root block allocation only
gets accounted to the superblock and not to the new indlen in the
remaining delalloc extent.

Hence the data fork can bounce back and forth between extents and
btree forms across allocation and conversion without having any
impact on the indlen held in the delalloc extent that is slowly
being allocated and written.
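
As a toy illustration of that claim only (it models the accounting as
described here and does not verify what the kernel does), the
structure and function names below are hypothetical. The threshold of
21 extents is taken from the trace earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a data fork; hypothetical, not kernel structures. */
struct fork_model {
	unsigned int	nextents;	/* extents in the data fork */
	unsigned int	max_inline;	/* most extents extent format holds */
	bool		is_btree;	/* fork is currently in btree format */
	unsigned int	indlen;		/* indlen held by the delalloc extent */
	unsigned int	sb_root_allocs;	/* root blocks charged to the sb only */
};

/* Writeback submission: allocate one extent at the left edge. */
static void
model_submit(struct fork_model *f)
{
	f->nextents++;
	if (!f->is_btree && f->nextents > f->max_inline) {
		/* extents -> btree: root block is not taken from indlen. */
		f->is_btree = true;
		f->sb_root_allocs++;
	}
}

/* IO completion: unwritten conversion merges with the left extent. */
static void
model_complete(struct fork_model *f)
{
	assert(f->nextents > 0);
	f->nextents--;
	if (f->is_btree && f->nextents <= f->max_inline)
		f->is_btree = false;	/* btree -> extents */
}

/* Run 'rounds' submit/complete cycles and report the remaining indlen. */
static unsigned int
model_cycle(struct fork_model *f, unsigned int rounds)
{
	unsigned int	i;

	for (i = 0; i < rounds; i++) {
		model_submit(f);
		model_complete(f);
	}
	return f->indlen;
}
```

Starting at nextents = 21 with max_inline = 21, every cycle flips the
fork to btree format and back, yet indlen is never touched in this
model. If indlen does reach zero on a real system, something outside
this cycle has to be consuming it.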

The problem you are seeing is that indlen is being exhausted
by something, and that results in passing wasdel = false to the
extents_to_btree() conversion without a block reservation. We don't
yet have a plausible explanation of why indlen is being exhausted in
the first place - it's not format conversion, and it's not "btree
splits", so how are we getting to indlen = 0 and triggering this
issue?

e.g. How much of the delalloc extent remains unallocated when da_old
reaches zero? Is this an off-by-one corner case of having allocated
the entire delalloc range and so having consumed all the indlen at
the same time the last allocation needs to convert the data fork to
btree format?

> This scenario may
> be triggered by normal service operations. In any case, file system
> fragmentation is the cause of this problem.

I've not seen any evidence that supports this conclusion yet.

> > How realistic is this scenario in an application/production
> > environment? I mean, nobody walks through a file syncing data to
> > disk one fragmented extent at a time only to immediately remove it
> > before writing the next block.
> > 
> > We've known that this is possible for a very long time. I've
> > personally known it can happen in carefully constructed test code
> > for over 20 years, but I can count on one hand the number of times
> > I've actually seen this exhaustion occur in a production system.
> > 
> > The reservation we use here is essentially unchanged since it was
> > first introduced in 1994, so time in use tells us that the
> > reservation is largely sufficient for production systems. Can you
> > describe the situation where your production systems are hitting
> > this? What is the application actually doing to trigger this
> > problem?
> > 
> 
> The extent reduction is not only triggered in the punch hole scenario.
> In all scenarios where extent merging is triggered, the conversion
> from B-tree to extent may be performed.

Yes, I know. But in writeback scenarios, only unwritten extent
conversion can cause merges, and that only happens when we have
contiguous allocations over the delalloc range.

IOWs, it can't happen when the filesystem is fragmented, as it
requires repeated contiguous allocation to enable left merging.
Hence large, uncontested free spaces are required to trigger the
fork format conversion cycling behaviour, but this is irrelevant
because I don't think the format cycling is the cause of indlen
exhaustion....

> > Which begs the question: we fixed some issues with this code back in
> > 2024, so does this problem still occur on TOT kernels? e.g. commit
> > d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
> > conversions") should help address indlen block consumption for
> > repeated partial conversions.
> > 
> 
> This patch can solve the issue where the required extra space may not
> be reserved in time, leading to a writeback failure. However, it cannot
> address the problem caused by the continuous consumption of the reserved
> space.

OK, but therein lies the issue: what is the mechanism that causes
the excessive consumption of the indlen blocks? Is the calculation
wrong, does it leak blocks when we split the delalloc extent, or
something else?

-Dave.
-- 
Dave Chinner
dgc@kernel.org


* Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
  2026-05-13 12:26     ` Dave Chinner
@ 2026-05-14  3:16       ` yebin
  0 siblings, 0 replies; 6+ messages in thread
From: yebin @ 2026-05-14  3:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong, hch



On 2026/5/13 20:26, Dave Chinner wrote:
> On Wed, May 13, 2026 at 05:33:44PM +0800, yebin wrote:
>>
>>
>> On 2026/5/13 6:52, Dave Chinner wrote:
>>> On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
>>>> Hello Darrick and all,
>>>>
>>>> Recently, I encountered a problem where a BUG was triggered in the write-back process.
>>>> The detailed problem information is as follows:
>>>> ```
>>>> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
>>>> XFS (sde): Please unmount the filesystem and rectify the problem(s)
>>>> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
>>>> ------------[ cut here ]------------
>>>> kernel BUG at fs/xfs/xfs_message.c:102!
>>>> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
>>>> RIP: 0010:assfail+0x9f/0xb0
>>>
>>> What kernel? You've stripped that line out of the stack dump.
>>
>> The initial issue appeared on the v5.10 kernel and occurred multiple times.
>> The current stack is a reproduction I made on linux-next based on the
>> cc13002a9f98 tag: next-20260402.
>
> So why strip it out of the debug output? It doesn't encourage people
> to look at the problem when things like this have been obviously
> stripped from the output.
>
> ....
>

This is my fault. I accidentally deleted the version information when
removing unimportant details.

>>> So your test code is creating a number of fragmented extents to get
>>> to the edge of btree format conversion, then doing a delalloc
>>> write() to create a long delalloc extent range, then is alternating
>>> along the range of the delalloc extent doing:
>>>
>>> loop:
>>> 	sync_file_range(offset, 4096, SYNC_FILE_RANGE_WRITE)
>>> 	fallocate(PUNCH_HOLE, offset, 4096)
>>> 	offset += 4096;
>>>
>>> And so it is converting a single block at the left edge of the
>>> delalloc extent to a written extent which triggers an extent -> btree
>>> conversion, and then you punch out the newly created written extent
>>> triggering a btree -> extent conversion.
>>>
>>> And each time you do this it removes a reserved block from the
>>> delalloc extent for the btree root block, yes?
>>>
>>
>> Yes, this is just a process I designed to facilitate the construction
>> of this problem. In another case, during the delay extent conversion,
>> B-tree splitting continuously consumes reserved blocks.
>
> The worst_indlen calculation should be taking blocks needed for
> BMBT tree splits into account.
>
> It expects to consume (extent len / BMBT records per block) leaf
> blocks for the delalloc extent. It then walks back up the bmbt tree,
> calculating how many node blocks will be needed to index all those
> leaf blocks.  IOWs, it reserves all the node blocks it will need for
> splits to index the growing number of leaf blocks.
>
> i.e. by calculating the number of BMBT blocks required to index the
> delalloc extent being converted into individual single block
> extents, it should have taken into account all the blocks needed for
> all the BMBT splits needed to index the range.
>

I agree with your point, but as I mentioned earlier, xfs_bmap_worst_indlen()
calculates the maximum number of additional blocks required for a single
conversion of a delay extent. If the delay extent is split into multiple
conversions, each converting a portion at a time, it may trigger a btree
split each time. For example, if the delay extent has 10 blocks and 5 blocks
are reserved, and each conversion starts from the beginning and converts one
block at a time, each conversion will trigger a btree split. Assuming each
btree split consumes one block, the conversion process of the initial delay
extent with 10 blocks will actually consume an additional 10 blocks, but
only 5 blocks were initially reserved. The calculation method of 'da_new'
does not even consider this exhaustion scenario.

da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                          startblockval(PREV.br_startblock) -
                          (bma->cur? bma->cur->bc_bmap.allocated : 0));

Each time a re-conversion occurs, the state of the btree/extent may have
changed. In theory, 'da_new' should be maintained at the value of
xfs_bmap_worst_indlen() to be rigorous.
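
The example above can be modelled with the toy simulation below. It
only restates the argument under the example's own assumptions (every
single-block conversion costs cost_per_conv btree blocks) and does not
show that the kernel actually behaves this way. worst_indlen_stub() is
a hypothetical stand-in chosen so that a 10-block extent reserves 5
blocks, as in the example; it is not xfs_bmap_worst_indlen().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the worst-case indlen of 'len' blocks. */
static uint64_t
worst_indlen_stub(uint64_t len)
{
	return (len + 1) / 2;
}

/*
 * Convert a delalloc extent one block at a time from the left edge.
 * After each conversion the remaining reservation is recomputed as
 *	da_new = min(worst_indlen(remaining), da_old - allocated)
 * Returns how many delalloc blocks are still unconverted when the
 * reservation reaches zero (0 if it lasts for the whole extent).
 */
static uint64_t
blocks_left_at_exhaustion(uint64_t len, uint64_t indlen,
			  uint64_t cost_per_conv)
{
	/* A delalloc extent always covers at least one block. */
	assert(len > 0);

	while (len > 0 && indlen > 0) {
		uint64_t	avail;
		uint64_t	worst;

		/* Blocks consumed by this conversion come out of da_old. */
		avail = indlen > cost_per_conv ? indlen - cost_per_conv : 0;

		/* One block converted; recompute for what remains. */
		len--;
		worst = worst_indlen_stub(len);
		indlen = avail < worst ? avail : worst;
	}
	return len;
}
```

With len = 10, indlen = 5 and a cost of one block per conversion, the
reservation reaches zero after five conversions with five delalloc
blocks still unconverted. With a cost of zero it lasts until the
extent is fully converted.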

The original algorithm model is based on the assumption that all changes
in the tree are caused by the continuous conversion of the same delay
extent. However, in a production environment, a file may be used in
segments, and the sources of tree changes are not singular.

>> Essentially,
>> this is because the delay extent conversion process is broken down,
>> which may cause the reserved blocks to be exhausted.
>
> As per above, fragmentation by itself shouldn't cause the indlen
> reservation to be exhausted.
>

Yes, file system fragmentation is just a trigger factor.

>> I think that the scenario of conversion between extents and B-trees
>> may be that the unwritten extents are converted to written extents
>> after the writeback is complete and then the extents are combined,
>> causing the B-tree to be converted to an extent.
>
> Yes, I can see how that could occur - it would need contiguous
> physical extent allocation to keep the number of extents in the file
> at the threshold where:
>
> writeback submission
> -> delalloc
>    -> left contiguous unwritten allocation
>      -> nextents++
>        -> extent_to_btree
>
> IO completion
> -> unwritten conversion
>    -> left merge with written extent
>       -> nextents--
>          -> btree_to_extents
>
> But here's the thing: the extent_to_btree conversion does not
> account blocks allocated to indlen blocks stored in the delalloc
> extent. Yes, it uses blocks that were accounting to the superblock
> as reserved delalloc blocks but the btree root block allocation only
> gets accounted to the superblock and not to the new indlen in the
> remaining delalloc extent.
>
> Hence the data fork can bounce back and forth between extents and
> btree forms across allocation and conversion without having any
> impact on the indlen held in the delalloc extent that is slowly
> being allocated and written.
>
> The problem you are seeing is that indlen is being exhausted
> by something, and that results in passing wasdel = false to the
> extents_to_btree() conversion without a block reservation. We don't
> yet have a plausible explanation of why indlen is being exhausted in
> the first place - it's not format conversion, and it's not "btree
> splits", so how are we getting to indlen = 0 and triggering this
> issue?
>
> e.g. How much of the delalloc extent remains unallocated when da_old
> reaches zero? Is this an off-by-one corner case of having allocated
> the entire delalloc range and so having consumed all the indlen at
> the same time the last allocation needs to convert the data fork to
> btree format?
>

I think the example in the second paragraph of my response explains
how indlen is gradually consumed and eventually becomes 0.

>> This scenario may
>> be triggered by normal service operations. In any case, file system
>> fragmentation is the cause of this problem.
>
> I've not seen any evidence that supports this conclusion yet.
>

What I mean is that file system fragmentation is a contributing factor
to the problem, not the root cause.

>>> How realistic is this scenario in an application/production
>>> environment? I mean, nobody walks through a file syncing data to
>>> disk one fragmented extent at a time only to immediately remove it
>>> before writing the next block.
>>>
>>> We've known that this is possible for a very long time. I've
>>> personally known it can happen in carefully constructed test code
>>> for over 20 years, but I can count on one hand the number of times
>>> I've actually seen this exhaustion occur in a production system.
>>>
>>> The reservation we use here is essentially unchanged since it was
>>> first introduced in 1994, so time in use tells us that the
>>> reservation is largely sufficient for production systems. Can you
>>> describe the situation where your production systems are hitting
>>> this? What is the application actually doing to trigger this
>>> problem?
>>>
>>
>> The extent reduction is not only triggered in the punch hole scenario.
>> In all scenarios where extent merging is triggered, the conversion
>> from B-tree to extent may be performed.
>
> Yes, I know. But in writeback scenarios, only unwritten extent
> conversion can cause merges, and that only happens when we have
> contiguous allocations over the delalloc range.
>
> IOWs, it can't happen when the filesystem is fragmented, as it
> requires repeated contiguous allocation to enable left merging.
> Hence large, uncontested free spaces are required to trigger the
> fork format conversion cycling behaviour, but this is irrelevant
> because I don't think the format cycling is the cause of indlen
> exhaustion....
>

I mentioned from the beginning that format conversion is just one
possible scenario. I designed the path to reproduce this issue in
order to better control the construction of the problem. The current
algorithm does not take into account that the source of btree changes
is not limited to the case of a single delay extent being generated.
In such cases, if a delay extent is split into multiple conversions,
it may lead to the exhaustion of the reserved indlen.

>>> Which begs the question: we fixed some issues with this code back in
>>> 2024, so does this problem still occur on TOT kernels? e.g. commit
>>> d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
>>> conversions") should help address indlen block consumption for
>>> repeated partial conversions.
>>>
>>
>> This patch can solve the issue where the required extra space may not
>> be reserved in time, leading to a writeback failure. However, it cannot
>> address the problem caused by the continuous consumption of the reserved
>> space.
>
> OK, but therein lies the issue: what is the mechanism that causes
> the excessive consumption of the indlen blocks? Is the calculation
> wrong, does it leak blocks when we split the delalloc extent, or
> something else?
>

I think the core of the problem lies in the following code logic.
```
xfs_bmap_add_extent_delay_real
      da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                               startblockval(PREV.br_startblock) -
                               (bma->cur? bma->cur->bc_bmap.allocated : 0));
```
As I mentioned before, the B-tree changes are not caused by the continuous
generation of a delay extent. A specific delay extent, which is partially
converted each time, happens to reach the critical point of B-tree splitting.
This unfortunate delay extent continuously contributes its reserved space,
and finally, the indlen becomes 0.
I say that fragmentation is the cause of this problem because in fragmented
scenarios, the probability of extent merging decreases, and B-tree splitting
becomes more frequent.

> -Dave.
>



Thread overview: 6+ messages
2026-05-12 11:34 [bug report] kernel BUG at fs/xfs/xfs_message.c:102! yebin
2026-05-12 17:19 ` Darrick J. Wong
2026-05-12 22:52 ` Dave Chinner
2026-05-13  9:33   ` yebin
2026-05-13 12:26     ` Dave Chinner
2026-05-14  3:16       ` yebin
