Re: [PATCH v3] btrfs: warn about extent buffer that can not be released

Linux Btrfs filesystem development

From: Qu Wenruo <wqu@suse.com>
To: Glass Su <glass.su@suse.com>
Cc: linux-btrfs@vger.kernel.org, AHN SEOK-YOUNG <iamsyahn@gmail.com>,
	Teng Liu <27rabbitlt@gmail.com>, Su Yue <l@damenly.org>
Subject: Re: [PATCH v3] btrfs: warn about extent buffer that can not be released
Date: Sat, 20 Jun 2026 11:10:23 +0930
Message-ID: <2aeb8d7f-48e8-4e9a-bfb5-086b854bbc27@suse.com>
In-Reply-To: <DC0C775E-13B3-47D9-9AB2-895BB11C029D@suse.com>



On 2026/6/20 10:14, Glass Su wrote:
> 
> 
> 
> On Fri, Apr 17, 2026 at 6:47 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> When we unmount the fs or during mount failures, btrfs will call
>> invalidate_inode_pages2() to release all btree inode folios.
>>
>> However that function can return -EBUSY if any folios can not be
>> invalidated.
>> This can be caused by:
>>
>> - Some extent buffers are still held by btrfs
>>   This is a logic error, as we should release all tree root nodes
>>   during unmount and mount failure handling.
>>
>> - Some extent buffers are under readahead and haven't yet finished
>>   This is a much rarer but valid case.
>>   In that case we should wait for those extent buffers.
>>
>> Introduce a new helper invalidate_btree_folios() which will:
>>
>> - Call invalidate_inode_pages2() and catch its return value
>>   If it returned 0 as expected, that's great and we can call it a day.
>>
>> - Otherwise go through each extent buffer in buffer_tree
>>   Increase the ref by one first for the eb we're checking.
>>   This is to ensure the eb won't be freed after the readahead is
>>   finished.
>>
>>   For eb that still has EXTENT_BUFFER_READING flag, wait for them to
>>   finish first.
>>
>>   After waiting for the readahead, check the refs of the eb and whether
>>   it's still dirty.
>>
>>   If the eb refs is greater than 2 (one for the buffer tree, one held by
>>   us), it means we are still holding the extent buffer somewhere else,
>>   which is a logic bug.
>>
>>   If the eb is still dirty, it means a bug in transaction handling.
>>   Unfortunately there are already test cases triggering this warning, so
>>   our transaction cleanup hasn't done its work reliably.
>>
>>   For either case, show a warning message about the eb, including its
>>   bytenr, owner, refs and flags.
>>   And if it's a debug build, also trigger WARN_ON_ONCE() so that fstests
>>   can properly catch such a situation.
>>
>> Furthermore, to help debugging the unreleased extent buffers, output the
>> transid of the current aborted transaction, so that we can know which
>> transaction the unreleased extent buffers belong to.
>>
>> This will help future debugging as we're already hitting the new
>> warnings from test cases like generic/388.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
>> Reported-by: AHN SEOK-YOUNG <iamsyahn@gmail.com>
>> Cc: Teng Liu <27rabbitlt@gmail.com>
>> Tested-by: Teng Liu <27rabbitlt@gmail.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v3:
>> - Revert the DEBUG_WARN_ON_ONCE() change
>>   As there is only one user, a simple
>>   WARN_ON_ONCE(IS_ENABLED(CONFIG_BTRFS_DEBUG)) is more than enough.
>>
>> - Output the generation of the unreleased eb too
>>   Since it's possible to have 2 transactions (one committing and reached
>>   UNBLOCKED state, one new running), the generation output will help us
>>   to know which transaction the unreleased eb belongs to.
>>
>> - Also output the transid when a transaction is aborted
>>   To co-operate with the above change for debugging.
>>
>> v2:
>> - Add one extra ref before checking the eb
>>   Although readahead has one extra ref, after the readahead finished the
>>   extra ref will be dropped, and memory pressure can kick in to free the
>>   extent buffer.
>>
>> - Use rcu lock with xa_for_each() instead of xas lock and xas_for_each()
>>   Since we're holding one extra eb ref to prevent eb from disappearing,
>>   we no longer need the stricter xas lock nor the extra xas
>>   pause/unlock.
>>
>>   Although xa_for_each() is more time consuming, we're already on the
>>   cold path, so it's not a huge cost.
>>
>> - Remove the temporary void pointer
>>   And pass eb pointer directly into xas_for_each().
>>
>> - Introduce DEBUG_WARN_ON_ONCE() helper
>>   To follow the existing DEBUG_WARN() helper.
>>
>> - Fix a typo
>>
>> - Also fix the checkpatch warning on the existing DEBUG_WARN()
>> ---
>> fs/btrfs/disk-io.c     | 49 ++++++++++++++++++++++++++++++++++++++++--
>> fs/btrfs/extent_io.c   |  6 ------
>> fs/btrfs/extent_io.h   |  6 ++++++
>> fs/btrfs/transaction.h |  8 +++----
>> 4 files changed, 57 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7800a1b20290..241acdc16da1 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3272,6 +3272,51 @@ static bool fs_is_full_ro(const struct btrfs_fs_info *fs_info)
>>         return false;
>> }
>>
>> +static void invalidate_btree_folios(struct btrfs_fs_info *fs_info)
>> +{
>> +       unsigned long index = 0;
>> +       struct extent_buffer *eb;
>> +       int ret;
>> +
>> +       ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
>> +       if (likely(ret == 0))
>> +               return;
>> +
>> +       /*
>> +        * Some btree pages can not be invalidated, this happens when some
>> +        * tree blocks are still held (either by some pointer or readahead).
>> +        */
>> +       rcu_read_lock();
>> +       xa_for_each(&fs_info->buffer_tree, index, eb) {
>> +               /* Increase the ref so that the eb won't disappear. */
>> +               if (!refcount_inc_not_zero(&eb->refs))
>> +                       continue;
>> +               rcu_read_unlock();
>> +
>> +               /* Wait for any readahead first. */
>> +               if (test_bit(EXTENT_BUFFER_READING, &eb->bflags))
>> +                       wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING,
>> +                                      TASK_UNINTERRUPTIBLE);
>> +               /*
>> +                * The refs threshold is 2, one held by us at the beginning
>> +                * of the loop, one for the ownership in the buffer tree.
>> +                */
> 
> However, IIUC, there is still a small window between clear_extent_buffer_reading()
> and free_extent_buffer() in end_bbio_meta_read().

You're right, and that's also one of my existing concerns, but I was
unable to hit it.

One idea is to protect both the EXTENT_BUFFER_READING check and the
later refs check with refs_lock.
But that will require some extra changes which may not be trivial.
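
A rough userspace sketch of that idea, not kernel code. Everything named
eb_model here is a hypothetical stand-in: a plain int for eb->refs, a bool
for EXTENT_BUFFER_READING, and a C11 atomic_flag spinning lock in place of
eb->refs_lock. The point is only that if the read end clears the flag and
drops its ref in one critical section, a checker taking the same lock can
not observe "not reading" together with the readahead ref still counted:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the fields involved; not struct extent_buffer. */
struct eb_model {
	atomic_flag refs_lock;	/* stands in for eb->refs_lock */
	int refs;		/* stands in for eb->refs */
	bool reading;		/* stands in for EXTENT_BUFFER_READING */
};

static void eb_model_init(struct eb_model *eb, int refs, bool reading)
{
	atomic_flag_clear(&eb->refs_lock);
	eb->refs = refs;
	eb->reading = reading;
}

static void eb_model_lock(struct eb_model *eb)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&eb->refs_lock,
						 memory_order_acquire))
		;
}

static void eb_model_unlock(struct eb_model *eb)
{
	atomic_flag_clear_explicit(&eb->refs_lock, memory_order_release);
}

/* Read completion: clear the flag and drop the read's ref atomically. */
static void eb_model_end_read(struct eb_model *eb)
{
	eb_model_lock(eb);
	assert(eb->refs > 0);
	eb->reading = false;
	eb->refs--;
	eb_model_unlock(eb);
}

/*
 * Checker: report a leak only when the eb is not under read and still has
 * more than @threshold refs, with both observed under the same lock.
 */
static bool eb_model_is_leaked(struct eb_model *eb, int threshold)
{
	bool leaked;

	eb_model_lock(eb);
	leaked = !eb->reading && eb->refs > threshold;
	eb_model_unlock(eb);
	return leaked;
}
```

With a threshold of 2, the checker sees either (reading, refs 3) or
(not reading, refs 2) for a readahead eb, never the in-between state.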

For now I'll hide the whole error message behind DEBUG builds.
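
Roughly the shape that could take, again as a userspace model only.
MODEL_BTRFS_DEBUG and model_report_unreleased_eb() are made-up names that
stand in for IS_ENABLED(CONFIG_BTRFS_DEBUG) and the warning branch of
invalidate_btree_folios(); the actual next version of the patch may look
different:

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for IS_ENABLED(CONFIG_BTRFS_DEBUG).
 * Build with -DMODEL_BTRFS_DEBUG=1 to model a debug build.
 */
#ifndef MODEL_BTRFS_DEBUG
#define MODEL_BTRFS_DEBUG 0
#endif

/*
 * Returns true if a warning was printed.  On non-debug builds the whole
 * check is skipped, so the racy refs read can not produce a false alarm.
 */
static bool model_report_unreleased_eb(unsigned long long bytenr,
				       unsigned int refs, bool under_io)
{
	if (!MODEL_BTRFS_DEBUG)
		return false;

	/* Same threshold as the patch: one ref for us, one for the tree. */
	if (refs <= 2 && !under_io)
		return false;

	fprintf(stderr, "unable to release extent buffer %llu refs %u\n",
		bytenr, refs);
	return true;
}
```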

Thanks for the report,
Qu

> 
> btrfs/298 with added DEBUG output:
> 
> [47724.849583] BTRFS info (device sdd): first mount of filesystem f9bf732a-a19b-44b9-99a7-614ddff168e2
> [47724.849597] BTRFS info (device sdd): using crc32c checksum algorithm
> [47724.854471] DEBUG: before clear_extent_buffer_reading on 365985792 refs 3
> [47724.855333] BTRFS error (device sdd): failed to find fsid cb2fdb42-b638-4f2f-badd-4127467ba674 when attempting to open seed devices
> [47724.855349] BTRFS error (device sdd): failed to read chunk tree: -2
> [47724.855403] ------------[ cut here ]------------
> [47724.855405] WARNING: disk-io.c:3342 at invalidate_and_check_btree_folios+0x260/0x3c0 [btrfs], CPU#4: mount/125993
> [47724.855503] Modules linked in: btrfs(OE) xor(E) libblake2b(E) raid6_pq(E) sctp(E) ip6_udp_tunnel(E) udp_tunnel(E) dm_mod(E) virtio_net(E) net_failover(E) arm_smccc_trng(E) failover(E) virtio_balloon(E) vfat(E) fat(E) drm(E) fuse(E) xfs(E) virtio_scsi(E) qemu_fw_cfg(E) virtio_pci(E) virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E) virtio_console(E) virtio_rng(E) rng_core(E) [last unloaded: xor(E)]
> [47724.855549] CPU: 4 UID: 0 PID: 125993 Comm: mount Tainted: G        W  OE       7.1.0-rc7-custom+ #1 PREEMPT(full)
> [47724.855555] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [47724.855558] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250812-19.fc42 08/12/2025
> [47724.855561] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [47724.855564] pc : invalidate_and_check_btree_folios+0x260/0x3c0 [btrfs]
> [47724.855648] lr : invalidate_and_check_btree_folios+0x11c/0x3c0 [btrfs]
> [47724.855728] sp : ffff80008e123b90
> [47724.855730] x29: ffff80008e123ba0 x28: ffff0000d1e41000 x27: ffff000127d14558
> [47724.855736] x26: ffffaa8f7cfc0000 x25: ffff00010734e400 x24: ffff0000d1e44000
> [47724.855741] x23: ffff0000ca10c000 x22: 0000000000001000 x21: ffff000125509000
> [47724.855746] x20: ffff00011d54c000 x19: ffff00011bf5eb58 x18: 000000000000000a
> [47724.855751] x17: 663266342d383336 x16: ffffaa8f7ba36cf0 x15: 0000000000000000
> [47724.855756] x14: 0000000000000000 x13: 322d203a65657274 x12: 206b6e7568632064
> [47724.855761] x11: 0000000000003cd8 x10: 0000000000000000 x9 : ffffaa8f3721d7cc
> [47724.855767] x8 : ffffaa8f7cefe848 x7 : ffff00010bdc8bf0 x6 : 0000000000000009
> [47724.855772] x5 : 0000000000000003 x4 : ffff00010bdc8040 x3 : ffff80008e123b44
> [47724.855777] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000003
> [47724.855782] Call trace:
> [47724.855784]  invalidate_and_check_btree_folios+0x260/0x3c0 [btrfs] (P)
> [47724.855865]  open_ctree+0x1f50/0x23b0 [btrfs]
> [47724.855944]  btrfs_get_tree+0x89c/0xc48 [btrfs]
> [47724.856019]  vfs_get_tree+0x30/0x110
> [47724.856025]  vfs_cmd_create+0x58/0xe8
> [47724.856031]  __arm64_sys_fsconfig+0x39c/0x518
> [47724.856035]  invoke_syscall.constprop.0+0x48/0x120
> [47724.856042]  el0_svc_common.constprop.0+0x40/0xe8
> [47724.856046]  do_el0_svc+0x24/0x38
> [47724.856051]  el0_svc+0x50/0x310
> [47724.856057]  el0t_64_sync_handler+0xa0/0xe8
> [47724.856061]  el0t_64_sync+0x198/0x1a0
> [47724.856065] irq event stamp: 16018
> [47724.856067] hardirqs last  enabled at (16017): [<ffffaa8f7c81029c>] _raw_spin_unlock_irqrestore+0x74/0x80
> [47724.856073] hardirqs last disabled at (16018): [<ffffaa8f7c7f65a0>] el1_brk64+0x20/0x68
> [47724.856077] softirqs last  enabled at (13320): [<ffffaa8f7b8fdeec>] kernel_neon_begin+0x11c/0x178
> [47724.856082] softirqs last disabled at (13318): [<ffffaa8f7b8fde90>] kernel_neon_begin+0xc0/0x178
> [47724.856085] ---[ end trace 0000000000000000 ]---
> [47724.856089] BTRFS warning (device sdd): unable to release extent buffer 365985792 owner 3 gen 17 refs 3 flags 0x5
> [47724.856195] DEBUG: before free_extent_buffer on 365985792 refs 2
> [47724.856200] DEBUG: after free_extent_buffer on 365985792 refs 1
> 
> Standard Output
> 
> Full:
> #setup seed sprout device
> btrfs-progs v7.0
> See https://btrfs.readthedocs.io for more information.
> 
> Performing full device TRIM /dev/sdc (300.00MiB) ...
> NOTE: default settings have changed in version 6.19 (supported since linux 6.1):
>       - enable block-group-tree (-O bgt)
> 
> Label:              (null)
> UUID:               43f147f8-e91b-4306-82ef-4829ce018dae
> Node size:          16384
> Sector size:        4096 (CPU page size: 4096)
> Filesystem size:    300.00MiB
> Block group profiles:
>   Data:             single            8.00MiB
>   Metadata:         DUP              32.00MiB
>   System:           DUP               8.00MiB
> SSD detected:       no
> Zoned device:       no
> Features:           extref, skinny-metadata, no-holes, free-space-tree, block-group-tree
> Checksum:           crc32c
> Number of devices:  1
> Devices:
>    ID        SIZE  PATH
>     1   300.00MiB  /dev/sdc
> 
> mount: /mnt/scratch: WARNING: source write-protected, mounted read-only.
> Performing full device TRIM /dev/sdd (8.00GiB) ...
> #Scan seed device and check using mount
> Scanning for btrfs filesystems on '/dev/sdc'
> #check again, ensures seed device still in kernel
> #Now scan of non-seed device makes kernel forget
> WARNING: seeding flag cleared on /dev/sdc
> Scanning for btrfs filesystems on '/dev/sdc'
> #Sprout mount must fail for missing seed device
> umount: /mnt/scratch: not mounted.
> 
> 
>> +               if (unlikely(refcount_read(&eb->refs) > 2 ||
>> +                            extent_buffer_under_io(eb))) {
>> +                       WARN_ON_ONCE(IS_ENABLED(CONFIG_BTRFS_DEBUG));
>> +                       btrfs_warn(fs_info,
>> +                       "unable to release extent buffer %llu owner %llu gen %llu refs %u flags 0x%lx",
>> +                                  eb->start, btrfs_header_owner(eb),
>> +                                  btrfs_header_generation(eb),
>> +                                  refcount_read(&eb->refs), eb->bflags);
>> +               }
>> +               free_extent_buffer(eb);
>> +               rcu_read_lock();
>> +       }
>> +       rcu_read_unlock();
>> +       invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
>> +}
>> +
>> int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_devices)
>> {
>>         u32 sectorsize;
>> @@ -3702,7 +3747,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>>         if (fs_info->data_reloc_root)
>>                 btrfs_drop_and_free_fs_root(fs_info, fs_info->data_reloc_root);
>>         free_root_pointers(fs_info, true);
>> -       invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
>> +       invalidate_btree_folios(fs_info);
>>
>> fail_sb_buffer:
>>         btrfs_stop_all_workers(fs_info);
>> @@ -4431,7 +4476,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
>>          * We must make sure there is not any read request to
>>          * submit after we stop all workers.
>>          */
>> -       invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
>> +       invalidate_btree_folios(fs_info);
>>         btrfs_stop_all_workers(fs_info);
>>
>>         /*
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 8d241a7a880f..4eab0f9909e3 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2872,12 +2872,6 @@ bool try_release_extent_mapping(struct folio *folio, gfp_t mask)
>>         return try_release_extent_state(io_tree, folio);
>> }
>>
>> -static int extent_buffer_under_io(const struct extent_buffer *eb)
>> -{
>> -       return (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
>> -               test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
>> -}
>> -
>> static bool folio_range_has_eb(struct folio *folio)
>> {
>>         struct btrfs_folio_state *bfs;
>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
>> index fd209233317f..b284aee1bfb0 100644
>> --- a/fs/btrfs/extent_io.h
>> +++ b/fs/btrfs/extent_io.h
>> @@ -326,6 +326,12 @@ static inline bool extent_buffer_uptodate(const struct extent_buffer *eb)
>>         return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
>> }
>>
>> +static inline bool extent_buffer_under_io(const struct extent_buffer *eb)
>> +{
>> +       return (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
>> +               test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
>> +}
>> +
>> int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
>>                          unsigned long start, unsigned long len);
>> void read_extent_buffer(const struct extent_buffer *eb, void *dst,
>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>> index 7d70fe486758..264dcd4b3788 100644
>> --- a/fs/btrfs/transaction.h
>> +++ b/fs/btrfs/transaction.h
>> @@ -255,13 +255,13 @@ do {                                                              \
>>                 __first = true;                                 \
>>                 if (WARN(btrfs_abort_should_print_stack(error), \
>>                         KERN_ERR                                \
>> -                       "BTRFS: Transaction aborted (error %d)\n",      \
>> -                       (error))) {                                     \
>> +                       "BTRFS: Transaction %llu aborted (error %d)\n", \
>> +                       (trans)->transid, (error))) {                   \
>>                         /* Stack trace printed. */                      \
>>                 } else {                                                \
>>                         btrfs_err((trans)->fs_info,                     \
>> -                                 "Transaction aborted (error %d)",     \
>> -                                 (error));                     \
>> +                       "Transaction %llu aborted (error %d)",  \
>> +                                 (trans)->transid, (error));   \
>>                 }                                               \
>>         }                                                       \
>>         __btrfs_abort_transaction((trans), __func__,            \
>> --
>> 2.53.0
>>
>>

Thread overview: 6+ messages
2026-04-16 22:43 [PATCH v3] btrfs: warn about extent buffer that can not be released Qu Wenruo
2026-04-27 15:48 ` David Sterba
2026-04-27 22:01   ` Qu Wenruo
2026-04-28 15:17     ` David Sterba
2026-06-20  0:44 ` Glass Su
2026-06-20  1:40   ` Qu Wenruo [this message]
