Linux XFS filesystem development
From: "Darrick J. Wong" <djwong@kernel.org>
To: yebin <yebin@huaweicloud.com>
Cc: linux-xfs@vger.kernel.org, hch@lst.de, dgc@kernel.org
Subject: Re: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
Date: Tue, 12 May 2026 10:19:31 -0700
Message-ID: <20260512171931.GN9555@frogsfrogsfrogs>
In-Reply-To: <6A031038.9030708@huaweicloud.com>

On Tue, May 12, 2026 at 07:34:16PM +0800, yebin wrote:
> Hello Darrick and all,
> 
> Recently, I encountered a problem where a BUG was triggered in the write-back process.
> The detailed problem information is as follows:
> ```
> XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
> XFS (sde): Please unmount the filesystem and rectify the problem(s)
> XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:102!
> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> RIP: 0010:assfail+0x9f/0xb0
> Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
> RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
> RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
> RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
> R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
> FS:  00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
> Call Trace:
>  <TASK>
>  xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
>  __xfs_trans_commit+0x38b/0xe00
>  xfs_trans_commit+0xeb/0x1a0
>  xfs_bmapi_convert_one_delalloc+0xbca/0x1270
>  xfs_bmapi_convert_delalloc+0x101/0x350
>  xfs_writeback_range+0x76c/0x12d0
>  iomap_writeback_folio+0x9ed/0x2100
>  iomap_writepages+0x13c/0x2a0
>  xfs_vm_writepages+0x278/0x330
>  do_writepages+0x247/0x5c0
>  filemap_writeback+0x22c/0x2e0
>  xfs_file_release+0x442/0x580
>  __fput+0x407/0xb50
>  fput_close_sync+0x114/0x210
>  __x64_sys_close+0x94/0x120
>  do_syscall_64+0xc4/0xf80
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ```
> 
> After analyzing the above issues, the possible triggering process
> is as follows:
> ```
> xfs_bmapi_convert_delalloc
>  xfs_bmapi_convert_one_delalloc
>   xfs_bmapi_allocate
>    xfs_bmap_add_extent_delay_real
>     da_old = startblockval(PREV.br_startblock); // da_old = 5
>     case BMAP_LEFT_FILLING:
>      ifp->if_nextents++;  // 21 + 1 = 22
>      if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>       xfs_bmap_extents_to_btree     // convert to btree
>         cur->bc_ino.allocated++;
>       da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
>                                startblockval(PREV.br_startblock) -
>                                (bma->cur ? bma->cur->bc_ino.allocated : 0));  // da_new = 5 - 1 = 4
>       PREV.br_startblock = nullstartblock(da_new); // xfs_bmapi_convert_one_delalloc() returns
> 
>                                                  xfs_bmap_del_extent_real
>                                                    case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
>                                                     ifp->if_nextents--;  // 22 - 1 = 21
>                                                     if (xfs_bmap_needs_btree(ip, whichfork))
>                                                       xfs_bmap_extents_to_btree
>                                                     else
>                                                       xfs_bmap_btree_to_extents  // convert to extents
> ... // Alternate a few times in the middle.
> da_old = 4
> da_old = 3
> da_old = 2
> da_old = 1
> ...
> xfs_bmapi_convert_delalloc
>  xfs_bmapi_convert_one_delalloc
>   error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp);  // Both blocks and rtextents are 0
>    tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
>    error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>    if (blocks > 0)
>     error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
>     tp->t_blk_res += blocks;    // The value of blocks is 0, so the value of tp->t_blk_res is 0
>    xfs_bmapi_allocate
>     xfs_bmap_add_extent_delay_real
>      da_old = startblockval(PREV.br_startblock); // da_old = 0
>      case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:   // The current delay extent is just exhausted.
>       ifp->if_nextents++;  // 21 + 1 = 22
> 
>     if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
>      error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork);     // Converted to btree. da_old > 0 is false.
>       args.wasdel = wasdel;   //  wasdel is false
>       error = xfs_alloc_vextent(&args);
>        xfs_alloc_ag_vextent(args, 0)
>         xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
>          case XFS_AG_RESV_NONE:
>           field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS : XFS_TRANS_SB_FDBLOCKS; //args->wasdel  == false
>           xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
>            case XFS_TRANS_SB_FDBLOCKS:
>             if (delta < 0)
>              tp->t_blk_res_used += (uint)-delta;
>              if (tp->t_blk_res_used > tp->t_blk_res)  // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
>               xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> ```
> 
> I constructed the trigger sequence above deliberately so that the problem is
> easy to reproduce. Besides converting back and forth between
> XFS_DINODE_FMT_BTREE and XFS_DINODE_FMT_EXTENTS, btree splits can consume
> the reservation in the same way.
> 
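
A minimal user-space sketch of the drain traced above, not kernel code.
It assumes each conversion needs exactly one bmbt block and that the
transaction always starts with zero reserved blocks, as in the trace.
struct toy_trans and both functions are hypothetical names; only the
comparison of blocks used against blocks reserved mirrors the check
quoted from xfs_trans_mod_sb().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct xfs_trans fields involved. */
struct toy_trans {
	uint32_t blk_res;      /* blocks reserved when the transaction was allocated */
	uint32_t blk_res_used; /* blocks charged against that reservation */
	bool     shutdown;     /* set where the kernel would force a shutdown */
};

/*
 * Model of one conversion call that needs one bmbt block for an
 * extents -> btree conversion.  *indlen is the indirect-block reservation
 * still held by the delalloc extent (da_old in the trace).
 */
void toy_convert_one(struct toy_trans *tp, uint32_t *indlen)
{
	tp->blk_res = 0;       /* assumption: allocated with blocks == 0 */
	tp->blk_res_used = 0;
	tp->shutdown = false;

	if (*indlen > 0) {
		/* wasdel is true: the block comes out of the delalloc reservation. */
		(*indlen)--;
		return;
	}

	/* wasdel is false: the block is charged to the transaction instead. */
	tp->blk_res_used += 1;
	if (tp->blk_res_used > tp->blk_res)
		tp->shutdown = true;
}

/* Returns how many conversions succeed before the shutdown path is hit. */
unsigned int toy_conversions_until_shutdown(uint32_t indlen)
{
	struct toy_trans tp;
	unsigned int n = 0;

	for (;;) {
		toy_convert_one(&tp, &indlen);
		if (tp.shutdown)
			return n;
		n++;
	}
}
```

Starting from da_old = 5 as in the trace, five conversions are absorbed
by the reservation and the sixth lands on the shutdown branch.
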
> The core reason for the issue is that in xfs_bmapi_convert_delalloc(), the
> call to xfs_bmap_worst_indlen() calculates the worst-case number of reserved
> blocks, which is the number of additional blocks required after a complete
> conversion of the entire delayed extent. It assumes that the entire conversion
> process is atomic. However, the current process cannot guarantee such atomicity.
> In the case of a fragmented filesystem, the most extreme scenario is that every
> block conversion triggers a full btree split, in which case the reserved blocks
> are far from sufficient. When this issue is triggered, the filesystem fragmentation
> in the environment is indeed quite severe.
> 
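
To make the "worst case for one complete conversion" point concrete,
here is a rough user-space model of that kind of estimate. It is a
sketch in the spirit of xfs_bmap_worst_indlen(), not a copy of it:
leaf_recs, node_recs and max_levels are hypothetical parameters standing
in for the per-mount bmbt geometry, and all three are assumed to be
nonzero.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate how many bmbt blocks are needed to map `len` blocks if every
 * block ends up as its own extent record.  The result covers a single
 * build of the tree; it says nothing about blocks consumed by repeated
 * splits or format conversions while the extent is converted piecemeal.
 */
uint64_t toy_worst_indlen(uint64_t len, unsigned int leaf_recs,
			  unsigned int node_recs, unsigned int max_levels)
{
	uint64_t rval = 0;
	unsigned int maxrecs = leaf_recs;
	unsigned int level;

	for (level = 0; level < max_levels; level++) {
		/* Blocks needed at this level, rounded up. */
		len = (len + maxrecs - 1) / maxrecs;
		rval += len;
		if (len == 1)
			/* One block for each remaining level up to the root. */
			return rval + max_levels - level - 1;
		if (level == 0)
			maxrecs = node_recs;
	}
	return rval;
}
```

With the toy geometry (10, 10, 3), a 25-block delalloc extent gets a
reservation of 5 blocks, so a handful of extents <-> btree flips is
enough to use it up.
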
> Further analysis of this abnormal model shows that because the reserved blocks
> are continuously consumed, they may eventually exceed the reserved amount. When
> the space is nearly exhausted, xfs_bmap_extents_to_btree() may fail to allocate
> blocks, triggering a warning. This failure to allocate additional blocks can lead
> to issues with normal block allocation.
> 
> Additionally, in xfs_bmap_add_extent_delay_real(), if a delayed extent is split
> into two, xfs_bmap_worst_indlen() is recalculated to reserve blocks. In the case
> of nearly exhausted space, it may be impossible to reserve the newly required
> blocks, leading to a writeback failure.
> 
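
The split case can be put in numbers with the same kind of toy
estimate. This is a sketch under assumptions: sketch_indlen() is a
hypothetical simplification that uses one records-per-block value for
every level, and sketch_split_shortfall() only reports how many extra
blocks the two halves would want beyond the original reservation.

```c
#include <assert.h>
#include <stdint.h>

/* Toy worst-case indirect-block estimate; recs and max_levels must be nonzero. */
uint64_t sketch_indlen(uint64_t len, unsigned int recs, unsigned int max_levels)
{
	uint64_t rval = 0;
	unsigned int level;

	for (level = 0; level < max_levels; level++) {
		len = (len + recs - 1) / recs;  /* blocks at this level, rounded up */
		rval += len;
		if (len == 1)
			return rval + max_levels - level - 1;
	}
	return rval;
}

/*
 * When a delalloc extent holding da_old reserved blocks is split into a
 * left and a right remainder, each half is estimated on its own.  Returns
 * the number of additional blocks that would have to be found, or 0 if
 * the old reservation already covers both halves.
 */
uint64_t sketch_split_shortfall(uint64_t da_old, uint64_t left_len,
				uint64_t right_len, unsigned int recs,
				unsigned int max_levels)
{
	uint64_t need = sketch_indlen(left_len, recs, max_levels) +
			sketch_indlen(right_len, recs, max_levels);

	return need > da_old ? need - da_old : 0;
}
```

With recs = 10 and max_levels = 3, a 25-block extent reserves 5 blocks,
but remainders of 10 and 14 blocks want 3 + 4 = 7, so two more blocks
have to come from somewhere when free space is nearly gone.
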
> Reserving more blocks up front to cover the worst case would tie up a lot of
> extra space, which is not very practical. I was thinking that we could
> convert all of the delayed extents at once to ensure atomicity, which would
> avoid both of the issues analyzed above. However, I am not sure what negative
> impacts this approach might have. The only one I can think of is that the
> reserved space would be repeatedly allocated and released, but I believe the
> current logic already has similar situations.
> 
> I haven't thought of a better solution at the moment. I wonder if anyone has any
> good ideas?

I haven't.  With EFI-based space freeing in transactions, as a
theoretical last resort you could steal a block from an EFI instead of
scanning the bnobt/cntbt.  This would mitigate the nastiness of repeated
split/merge cycles but you'd have to be careful about block reuse.

--D

> 
> Thanks,
> Ye Bin
> 
> 
