[bug report] kernel BUG at fs/xfs/xfs_message.c:102!

Linux XFS filesystem development

From: yebin <yebin@huaweicloud.com>
To: linux-xfs@vger.kernel.org, djwong@kernel.org, hch@lst.de, dgc@kernel.org
Subject: [bug report] kernel BUG at fs/xfs/xfs_message.c:102!
Date: Tue, 12 May 2026 19:34:16 +0800
Message-ID: <6A031038.9030708@huaweicloud.com>

Hello Darrick and all,

Recently I hit a problem where a BUG was triggered during writeback.
The details are as follows:
```
XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351).  Shutting.
XFS (sde): Please unmount the filesystem and rectify the problem(s)
XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
RIP: 0010:assfail+0x9f/0xb0
Code: fe 84 db 75 20 e8 51 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 58 ae 2b 8d e8 08 73 a2 fe eb cc e8 310
RSP: 0018:ffffc9000f6372e0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91a6
RDX: ffff8881a856bb00 RSI: ffffffff838c91cf RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: fffff52001ec6ded
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
FS:  00007f7ee1f5b740(0000) GS:ffff88878bb45000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e632788f0 CR3: 00000001b524a000 CR4: 00000000000006f0
Call Trace:
  <TASK>
  xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
  __xfs_trans_commit+0x38b/0xe00
  xfs_trans_commit+0xeb/0x1a0
  xfs_bmapi_convert_one_delalloc+0xbca/0x1270
  xfs_bmapi_convert_delalloc+0x101/0x350
  xfs_writeback_range+0x76c/0x12d0
  iomap_writeback_folio+0x9ed/0x2100
  iomap_writepages+0x13c/0x2a0
  xfs_vm_writepages+0x278/0x330
  do_writepages+0x247/0x5c0
  filemap_writeback+0x22c/0x2e0
  xfs_file_release+0x442/0x580
  __fput+0x407/0xb50
  fput_close_sync+0x114/0x210
  __x64_sys_close+0x94/0x120
  do_syscall_64+0xc4/0xf80
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
```

After analyzing the above, one possible sequence that triggers the
problem is as follows:
```
xfs_bmapi_convert_delalloc
  xfs_bmapi_convert_one_delalloc
   xfs_bmapi_allocate
    xfs_bmap_add_extent_delay_real
     da_old = startblockval(PREV.br_startblock); // da_old = 5
     case BMAP_LEFT_FILLING:
      ifp->if_nextents++;  // 21 + 1 = 22
      if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
       xfs_bmap_extents_to_btree     // convert to btree
         cur->bc_ino.allocated++;
       da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                                startblockval(PREV.br_startblock) -
                                (bma->cur ? bma->cur->bc_ino.allocated : 0));  // da_new = 5 - 1 = 4
       PREV.br_startblock = nullstartblock(da_new); //xfs_bmapi_convert_one_delalloc() return

                                                  xfs_bmap_del_extent_real
                                                    case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
                                                     ifp->if_nextents--;  // 22 - 1 = 21
                                                     if (xfs_bmap_needs_btree(ip, whichfork))
                                                       xfs_bmap_extents_to_btree
                                                     else
                                                       xfs_bmap_btree_to_extents  // convert to extents
... // Alternate a few times in the middle.
da_old = 4
da_old = 3
da_old = 2
da_old = 1
...
xfs_bmapi_convert_delalloc
  xfs_bmapi_convert_one_delalloc
   error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, XFS_TRANS_RESERVE, &tp);  // Both blocks and rtextents are 0
    tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
    error = xfs_trans_reserve(tp, resp, blocks, rtextents);
    if (blocks > 0)
     error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
     tp->t_blk_res += blocks;    // The value of blocks is 0, so the value of tp->t_blk_res is 0
    xfs_bmapi_allocate
     xfs_bmap_add_extent_delay_real
      da_old = startblockval(PREV.br_startblock); // da_old = 0
      case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:   // The current delayed extent is consumed exactly.
       ifp->if_nextents++;  // 21 + 1 = 22

     if (xfs_bmap_needs_btree(bma->ip, whichfork))  // 22 > 21
      error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur, da_old > 0, &tmp_logflags, whichfork);     // Converted to btree. da_old > 0 is false.
       args.wasdel = wasdel;   //  wasdel is false
       error = xfs_alloc_vextent(&args);
        xfs_alloc_ag_vextent(args, 0)
         xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
          case XFS_AG_RESV_NONE:
           field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS : XFS_TRANS_SB_FDBLOCKS; //args->wasdel  == false
           xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
            case XFS_TRANS_SB_FDBLOCKS:
             if (delta < 0)
              tp->t_blk_res_used += (uint)-delta;
              if (tp->t_blk_res_used > tp->t_blk_res)  // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
               xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
```
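
The last step of the trace can be modeled in user space. The following is a minimal sketch, not kernel code: `struct model_trans`, `model_mod_fdblocks()` and `model_alloc_extent()` are hypothetical names, and only the two counters involved in the check are modeled. It assumes, as in the trace, that an allocation with wasdel == false is charged against tp->t_blk_res, while a wasdel == true allocation is accounted separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the xfs_trans fields involved in the check. */
struct model_trans {
	uint32_t t_blk_res;      /* blocks reserved when the transaction was allocated */
	uint32_t t_blk_res_used; /* blocks charged against that reservation */
	bool     shutdown;       /* set where the kernel would call xfs_force_shutdown() */
};

/* Models the XFS_TRANS_SB_FDBLOCKS case of xfs_trans_mod_sb() for delta < 0. */
static void model_mod_fdblocks(struct model_trans *tp, int64_t delta)
{
	if (delta < 0) {
		tp->t_blk_res_used += (uint32_t)-delta;
		if (tp->t_blk_res_used > tp->t_blk_res)
			tp->shutdown = true;
	}
}

/* Models the field choice in the XFS_AG_RESV_NONE case of the trace. */
static void model_alloc_extent(struct model_trans *tp, bool wasdel, uint32_t len)
{
	assert(len > 0);
	if (wasdel)
		return; /* XFS_TRANS_SB_RES_FDBLOCKS: not charged to t_blk_res */
	model_mod_fdblocks(tp, -(int64_t)len);
}
```

With t_blk_res == 0, a one-block allocation with wasdel == true leaves the model untouched, while the same allocation with wasdel == false sets the shutdown flag, which is the situation reached once da_old has dropped to 0.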

I constructed the sequence above deliberately to make the problem easy to
reproduce. Besides the scenario where the fork is converted back and forth
between XFS_DINODE_FMT_BTREE and XFS_DINODE_FMT_EXTENTS, btree splitting is
another scenario that can trigger it.

The core reason for the issue is that in xfs_bmapi_convert_delalloc(), the
call to xfs_bmap_worst_indlen() calculates the worst-case number of reserved
blocks, i.e. the number of additional blocks required to convert the entire
delayed extent in one go. This assumes that the whole conversion is atomic,
but the current code cannot guarantee that. On a fragmented filesystem, in
the most extreme case every block converted triggers a full btree split, and
the reserved blocks are then far from sufficient. The filesystem in the
environment where this was triggered was indeed severely fragmented.
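
To make the size of that reservation concrete, here is a user-space approximation of a worst-case indirect-block estimate. It is a sketch under stated assumptions, not the kernel source: `model_worst_indlen()` is a hypothetical name, and `leaf_maxrecs`, `node_maxrecs` and `maxlevels` are made-up inputs standing in for the per-mount btree geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the btree blocks needed to map a delayed extent of `len`
 * blocks if every block ended up as its own record. The geometry
 * parameters are hypothetical inputs, not values read from a mount.
 */
static uint64_t model_worst_indlen(uint64_t len, uint32_t leaf_maxrecs,
				   uint32_t node_maxrecs, uint32_t maxlevels)
{
	uint64_t rval = 0;
	uint32_t maxrecs = leaf_maxrecs;
	uint32_t level;

	assert(len > 0 && leaf_maxrecs > 0 && node_maxrecs > 0);

	for (level = 0; level < maxlevels; level++) {
		/* Blocks needed at this level: ceil(len / maxrecs). */
		len = (len + maxrecs - 1) / maxrecs;
		rval += len;
		if (len == 1)
			/* One more block for each remaining level up to the root. */
			return rval + maxlevels - level - 1;
		if (level == 0)
			maxrecs = node_maxrecs;
	}
	return rval;
}
```

Under these assumptions, a one-block delayed extent with maxlevels == 5 gets an estimate of 5 blocks, which only pays for a single pass. Nothing in the estimate accounts for the same extent being converted piecemeal, with each piece consuming blocks on its own.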

Further analysis shows that, because the reserved blocks are consumed a
little at a time, the consumption may eventually exceed the reserved amount.
When free space is nearly exhausted, xfs_bmap_extents_to_btree() may then fail
to allocate a block and trigger a warning. This failure to allocate the
additional block can in turn cause problems for normal block allocation.
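
The gradual drain can be illustrated with a small counting model. This is a sketch only: `model_first_uncovered_conversion()` is a hypothetical helper, and it assumes every extents-to-btree conversion takes a fixed number of blocks from the delayed extent's reservation (one block per conversion in the sequence shown earlier).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the 1-based index of the first conversion that the remaining
 * reservation (da_old) can no longer pay for, assuming each conversion
 * costs per_conv blocks.
 */
static unsigned int model_first_uncovered_conversion(uint32_t da_old,
						     uint32_t per_conv)
{
	unsigned int n = 0;

	assert(per_conv > 0);

	for (;;) {
		n++;
		if (da_old < per_conv)
			return n;       /* this conversion finds too little left */
		da_old -= per_conv;     /* reservation shrinks with each conversion */
	}
}
```

Starting from da_old = 5 and one block per conversion, the first five conversions are covered and the sixth runs with da_old = 0, which matches the point where the traced sequence calls xfs_bmap_extents_to_btree() with wasdel == false.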

Additionally, in xfs_bmap_add_extent_delay_real(), if a delayed extent is split
into two, the reservation is recomputed with xfs_bmap_worst_indlen(). When
space is nearly exhausted, it may be impossible to reserve the newly required
blocks, and writeback then fails.

Reserving more blocks up front to cover the true worst case would tie up a
lot of extra space, which is not very practical. I was thinking that we could
convert all of the delayed extents at once to ensure atomicity, which would
avoid both of the problems analyzed above. However, I am not sure what
negative side effects this approach might have. The only one I can think of is
that reserved space would be repeatedly allocated and released, but I believe
the current logic already behaves similarly in places.

I haven't thought of a better solution so far. Does anyone have any good
ideas?

Thanks,
Ye Bin
