[PATCH] btrfs: prevent COW amplification during btrfs_search

public inbox for linux-btrfs@vger.kernel.org

* [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
@ 2026-01-27 20:42 Leo Martins
  2026-01-28 21:48 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Leo Martins @ 2026-01-27 20:42 UTC
  To: linux-btrfs, kernel-team

I've been investigating ENOSPC errors at Meta and have observed a strange
pattern where filesystems hit ENOSPC with lots of unallocated space
(> 100G). Sample dmesg dump at bottom of message.

btrfs_insert_delayed_dir_index is attempting to migrate some reservation
from the transaction block reserve and finding it exhausted, leading to a
warning and ENOSPC. This is a bug, as the reservations are meant to be
worst case. It should be impossible to exhaust the transaction block
reserve.

Some tracing of affected hosts revealed that there were single
btrfs_search_slot calls that were COWing 100s of times. I was able to
reproduce this behavior locally by creating a very constrained cgroup
and producing a lot of concurrent filesystem operations. Here's the
pattern:

 1. btrfs_search_slot() begins tree traversal with cow=1
 2. Node at level N needs COW (old generation or WRITTEN flag set)
 3. btrfs_cow_block() allocates new node, updates parent pointer
 4. Traversal continues, but hits a condition requiring restart (e.g., node
    not cached, lock contention, need higher write_lock_level)
 5. btrfs_release_path() releases all locks and references
 6. Memory pressure triggers writeback on the COW'd node
 7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
    BTRFS_HEADER_FLAG_WRITTEN
 8. goto again - traversal restarts from root
 9. Traversal reaches the freshly COW'd node
 10. should_cow_block() sees WRITTEN flag set, returns true
 11. btrfs_cow_block() allocates another new node - same logical position,
     new physical location, new reservation consumed
 12. Steps 4-11 repeat indefinitely under sustained memory pressure
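
For illustration only, the loop above can be modeled in userspace C. This
is a toy sketch, not kernel code: struct toy_eb, toy_writeback() and
toy_search() are made-up names, writeback is assumed to fire before every
restart, and a fresh COW copy is modeled by resetting the same struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an extent buffer; NOT the kernel's struct extent_buffer. */
struct toy_eb {
	bool dirty;    /* models EXTENT_BUFFER_DIRTY */
	bool written;  /* models BTRFS_HEADER_FLAG_WRITTEN */
	int blockers;  /* models a per-buffer "do not write back yet" count */
};

/* Models step 7: writeback clears dirty and sets WRITTEN unless pinned. */
void toy_writeback(struct toy_eb *eb)
{
	if (eb->blockers == 0 && eb->dirty) {
		eb->dirty = false;
		eb->written = true;
	}
}

/*
 * Count the COWs done by one simulated search that is forced to restart
 * `restarts` times, with writeback running before every restart.
 * With pin == true the buffer is pinned right after its first COW.
 */
int toy_search(int restarts, bool pin)
{
	/* Start with a node that was written in an earlier pass (step 2). */
	struct toy_eb eb = { .dirty = false, .written = true, .blockers = 0 };
	int cows = 0;

	assert(restarts >= 0);

	for (int pass = 0; pass <= restarts; pass++) {
		if (eb.written) {
			/* Steps 3 and 11: new copy is dirty and not yet written. */
			eb.written = false;
			eb.dirty = true;
			cows++;
			if (pin)
				eb.blockers++;
		}
		if (pass < restarts)
			toy_writeback(&eb);  /* steps 5-7, then "goto again" */
	}
	return cows;
}
```

In this model toy_search(100, false) performs one COW per pass, while
toy_search(100, true) performs a single COW no matter how many restarts
occur, which is the effect the pinning is meant to have.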

Note that this behavior should be much harder to trigger since Boris's
AS_KERNEL_FILE changes, which stop extent_buffer pages from being
accounted to user cgroups. However, I believe it would still be an issue
under global memory pressure.
Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/

This COW amplification breaks the idea that transaction reservations are
worst case as any search slot call could find itself in this COW loop and
exhaust its reservation.

My proposed solution is to temporarily pin extent buffers for the
lifetime of btrfs_search_slot. This prevents the massive COW
amplification that can be seen during high memory pressure.

The implementation uses a local xarray to track COW'd buffers for the
duration of the search. The xarray stores extent_buffer pointers without
taking additional references; this is safe because tracked buffers remain
dirty (writeback_blockers prevents the dirty bit from being cleared) and
dirty buffers cannot be reclaimed by memory pressure.

Synchronization is provided by eb->lock: increments in
btrfs_search_slot_track_cow() occur while holding the write lock, and
the check in lock_extent_buffer_for_io() also holds the write lock via
btrfs_tree_lock(). Decrements don't require eb->lock because
writeback_blockers is atomic and merely indicates "don't write yet".
Once we decrement, we're done and don't care if writeback proceeds
immediately.
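
A minimal userspace sketch of the counter protocol described above, using
C11 atomics. The names (struct pin_eb and the pin_eb_*() helpers) are
hypothetical, and the locking the patch relies on (eb->lock, refs_lock) is
deliberately left out; only the pin / check / unpin ordering is shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two fields that matter here. */
struct pin_eb {
	atomic_int writeback_blockers;  /* searches currently pinning the buffer */
	bool dirty;                     /* models EXTENT_BUFFER_DIRTY */
};

void pin_eb_init(struct pin_eb *eb, bool dirty)
{
	atomic_init(&eb->writeback_blockers, 0);
	eb->dirty = dirty;
}

/* Search side: called right after the buffer was COW'd. */
void pin_eb_track(struct pin_eb *eb)
{
	atomic_fetch_add(&eb->writeback_blockers, 1);
}

/*
 * Writeback side: only clear the dirty bit when nobody pins the buffer.
 * Returns true if writeback may proceed.
 */
bool pin_eb_try_start_writeback(struct pin_eb *eb)
{
	if (atomic_load(&eb->writeback_blockers) != 0)
		return false;
	if (!eb->dirty)
		return false;
	eb->dirty = false;
	return true;
}

/* Search side: called once when the search finishes. */
void pin_eb_untrack(struct pin_eb *eb)
{
	int prev = atomic_fetch_sub(&eb->writeback_blockers, 1);

	assert(prev > 0);  /* an untrack without a matching track is a bug */
}
```

While a buffer is tracked, pin_eb_try_start_writeback() refuses and the
buffer stays dirty; after pin_eb_untrack() the next writeback attempt
succeeds.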

Here is pahole output of extent_buffer showing that the new atomic_t
member can slot into an existing 6 byte hole.

Before:
struct extent_buffer {
        u64                        start;                /*     0     8 */
        u32                        len;                  /*     8     4 */
        u32                        folio_size;           /*    12     4 */
        unsigned long              bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        void *                     addr;                 /*    32     8 */
        spinlock_t                 refs_lock;            /*    40     0 */
        refcount_t                 refs;                 /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        s8                         log_index;            /*    48     1 */
        u8                         folio_shift;          /*    49     1 */

        /* XXX 6 bytes hole, try to pack */

        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        struct rw_semaphore        lock;                 /*    72    32 */
        struct folio *             folios[16];           /*   104   128 */

        /* size: 232, cachelines: 4, members: 14 */
        /* sum members: 226, holes: 1, sum holes: 6 */
        /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
        /* last cacheline: 40 bytes */
};

After:
struct extent_buffer {
        u64                        start;                /*     0     8 */
        u32                        len;                  /*     8     4 */
        u32                        folio_size;           /*    12     4 */
        unsigned long              bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        void *                     addr;                 /*    32     8 */
        spinlock_t                 refs_lock;            /*    40     0 */
        refcount_t                 refs;                 /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        s8                         log_index;            /*    48     1 */
        u8                         folio_shift;          /*    49     1 */

        /* XXX 2 bytes hole, try to pack */

        atomic_t                   writeback_blockers;   /*    52     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        struct rw_semaphore        lock;                 /*    72    32 */
        struct folio *             folios[16];           /*   104   128 */

        /* size: 232, cachelines: 4, members: 15 */
        /* sum members: 230, holes: 1, sum holes: 2 */
        /* forced alignments: 1 */
        /* last cacheline: 40 bytes */
};
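
The layout claim can be cross-checked with a userspace mock, assuming an
x86_64-style ABI where uint64_t is 8-byte aligned. The structs below are
stand-ins that only copy the member sizes from the pahole output (the
zero-sized refs_lock is omitted, and rw_semaphore is modeled as 32 opaque
bytes); they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct callback_head: 16 bytes, forced 8-byte alignment. */
struct toy_callback_head {
	uint64_t next;
	uint64_t func;
} __attribute__((aligned(8)));

struct layout_before {
	uint64_t start;
	uint32_t len;
	uint32_t folio_size;
	uint64_t bflags;
	uint64_t fs_info;      /* pointer-sized stand-in */
	uint64_t addr;         /* pointer-sized stand-in */
	uint32_t refs;
	int32_t read_mirror;
	int8_t log_index;
	uint8_t folio_shift;
	/* 6-byte hole here: offsets 50..55 */
	struct toy_callback_head callback_head;
	uint64_t lock[4];      /* 32 opaque bytes for the rw_semaphore */
	uint64_t folios[16];
};

struct layout_after {
	uint64_t start;
	uint32_t len;
	uint32_t folio_size;
	uint64_t bflags;
	uint64_t fs_info;
	uint64_t addr;
	uint32_t refs;
	int32_t read_mirror;
	int8_t log_index;
	uint8_t folio_shift;
	/* 2-byte hole here: offsets 50..51 */
	int32_t writeback_blockers;  /* 4-byte stand-in for atomic_t */
	struct toy_callback_head callback_head;
	uint64_t lock[4];
	uint64_t folios[16];
};

/* The new member lands in the old hole, so the total size is unchanged. */
static_assert(sizeof(struct layout_before) == 232, "size before");
static_assert(sizeof(struct layout_after) == 232, "size after");
static_assert(offsetof(struct layout_after, writeback_blockers) == 52,
	      "new member sits at offset 52");
static_assert(offsetof(struct layout_before, callback_head) ==
	      offsetof(struct layout_after, callback_head),
	      "later members do not move");
```

If the static assertions compile, the mock agrees with the pahole numbers:
both layouts are 232 bytes and the members after the hole keep their
offsets.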

------------[ cut here ]------------
WARNING: CPU: 28 PID: 930807 at fs/btrfs/delayed-inode.c:1547 btrfs_insert_delayed_dir_index+0x346/0x3a0
Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S          E       6.13.2-0_fbk9_0_gb487e362c3df #1
Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
RIP: 0010:btrfs_insert_delayed_dir_index+0x346/0x3a0
Code: 08 48 89 de 48 c7 c2 e8 09 61 82 4d 89 e0 45 31 c9 e8 2e da 73 ff 48 89 df 48 8b 6c 24 08 4c 8b 74 24 10 e9 57 fe ff ff 89 c3 <0f> 0b 4c 89 e7 e8 b0 fb ff ff e9 5a fe ff ff 65 8b 05 50 d9 2b 7e
RSP: 0000:ffffc900047b79f8 EFLAGS: 00010286
RAX: 00000000ffffffe4 RBX: 00000000ffffffe4 RCX: 0000000000000000
RDX: fffffffffffc0000 RSI: ffff8882aaeb7170 RDI: ffff8882aaeb7128
RBP: ffff888348114d68 R08: 000000006f684265 R09: 5f79636e6574616c
R10: 5f726f74696e6f6d R11: 617461646174656d R12: ffff8882d49c4180
R13: ffff8882aaeb7000 R14: 0000000000000045 R15: 0000000000040000
FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0xa4/0x140
 ? btrfs_insert_delayed_dir_index+0x346/0x3a0
 ? report_bug+0xe1/0x140
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? btrfs_insert_delayed_dir_index+0x346/0x3a0
 ? btrfs_insert_delayed_dir_index+0x20c/0x3a0
 btrfs_insert_dir_item+0x1b0/0x210
 ? setup_items_for_insert+0x250/0x480
 btrfs_add_link+0x94/0x3e0
 btrfs_create_new_inode+0x60a/0xb90
 ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
 btrfs_create_common+0x16c/0x1f0
 path_openat+0x20ff/0x4140
 do_filp_open+0xa2/0x130
 ? _raw_spin_lock+0x10/0x20
 __x64_sys_openat+0x114/0x1b0
 do_syscall_64+0x68/0x130
 ? exc_page_fault+0x69/0x130
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fb55a51b592
Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
 </TASK>
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 28 PID: 930807 at fs/btrfs/inode.c:6606 btrfs_add_link+0x3ae/0x3e0
Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S      W   E       6.13.2-0_fbk9_0_gb487e362c3df #1
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [E]=UNSIGNED_MODULE
Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
RIP: 0010:btrfs_add_link+0x3ae/0x3e0
Code: 00 e9 75 ff ff ff 48 c7 c7 9b 10 6a 82 89 de e8 28 c1 36 ff 0f 0b e9 81 fe ff ff 48 c7 c7 9b 10 6a 82 44 89 e6 e8 12 c1 36 ff <0f> 0b e9 cf fe ff ff 48 c7 c7 9b 10 6a 82 89 de e8 fd c0 36 ff 0f
RSP: 0000:ffffc900047b7b00 EFLAGS: 00010282
RAX: 0000000000000026 RBX: 00000000000acc00 RCX: 0000000000000000
RDX: ffff889036630158 RSI: ffff889036621c60 RDI: ffff889036621c60
RBP: 00000000001569d0 R08: ffffffff832692a0 R09: 000000000002fffd
R10: 0000000000000000 R11: ffffffffffffffff R12: 00000000ffffffe4
R13: 0000000000000000 R14: ffff8882a6707400 R15: ffffc900047b7ca8
FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0xa4/0x140
 ? btrfs_add_link+0x3ae/0x3e0
 ? report_bug+0xe1/0x140
 ? btrfs_add_link+0x3ae/0x3e0
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? btrfs_add_link+0x3ae/0x3e0
 btrfs_create_new_inode+0x60a/0xb90
 ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
 btrfs_create_common+0x16c/0x1f0
 path_openat+0x20ff/0x4140
 do_filp_open+0xa2/0x130
 ? _raw_spin_lock+0x10/0x20
 __x64_sys_openat+0x114/0x1b0
 do_syscall_64+0x68/0x130
 ? exc_page_fault+0x69/0x130
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fb55a51b592
Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
 </TASK>
---[ end trace 0000000000000000 ]---
BTRFS info (device nvme0n1p2 state A): dumping space info:
BTRFS info (device nvme0n1p2 state A): space_info DATA has 11715895296 free, is not full
BTRFS info (device nvme0n1p2 state A): space_info total=92350185472, used=80633892864, pinned=0, reserved=241664, may_use=155648, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
BTRFS info (device nvme0n1p2 state A): space_info METADATA has 880836608 free, is not full
BTRFS info (device nvme0n1p2 state A): space_info total=2181038080, used=912293888, pinned=3784704, reserved=557056, may_use=383500288, readonly=65536 zone_unusable=0 delalloc=151552 ordered=8192
BTRFS info (device nvme0n1p2 state A): space_info SYSTEM has 8372224 free, is not full
BTRFS info (device nvme0n1p2 state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
BTRFS info (device nvme0n1p2 state A): global_block_rsv: size 306905088 reserved 306905088
BTRFS info (device nvme0n1p2 state A): trans_block_rsv: size 1310720 reserved 0
BTRFS info (device nvme0n1p2 state A): chunk_block_rsv: size 0 reserved 0
BTRFS info (device nvme0n1p2 state A): delayed_block_rsv: size 4980736 reserved 4980736
BTRFS info (device nvme0n1p2 state A): delayed_refs_rsv: size 47841280 reserved 46858240
BTRFS: error (device nvme0n1p2 state A) in btrfs_add_link:6606: errno=-28 No space left
BTRFS info (device nvme0n1p2 state EA): forced readonly


Signed-off-by: Leo Martins <loemra.dev@gmail.com>
---
 fs/btrfs/ctree.c     | 42 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.c |  4 +++-
 fs/btrfs/extent_io.h |  8 ++++++++
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 7267b2502665..473e78f398b4 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1972,6 +1972,40 @@ static int search_leaf(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Track an extent buffer that was COW'd during btrfs_search_slot.
+ * Prevents the flusher from writing this buffer until the search completes.
+ * This avoids COW amplification where a restart would force an unnecessary
+ * re-COW of the same block.
+ */
+static inline int btrfs_search_slot_track_cow(struct xarray *cowed_buffers,
+					       struct extent_buffer *eb)
+{
+	u32 tmp;
+	int ret = 0;
+
+	lockdep_assert_held_write(&eb->lock);
+
+	ret = xa_alloc(cowed_buffers, &tmp, eb, xa_limit_32b, GFP_NOFS);
+	if (!ret)
+		atomic_inc(&eb->writeback_blockers);
+	return ret;
+}
+
+/*
+ * Clear COW protection from all extent buffers tracked during this search.
+ * Called at the end of btrfs_search_slot to allow normal writeback behavior.
+ */
+static inline void btrfs_search_slot_clear_cow_protection(struct xarray *cowed_buffers)
+{
+	struct extent_buffer *eb;
+	unsigned long index;
+
+	xa_for_each(cowed_buffers, index, eb)
+		atomic_dec(&eb->writeback_blockers);
+	xa_destroy(cowed_buffers);
+}
+
 /*
  * Look for a key in a tree and perform necessary modifications to preserve
  * tree invariants.
@@ -2009,6 +2043,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 {
 	struct btrfs_fs_info *fs_info;
 	struct extent_buffer *b;
+	DEFINE_XARRAY_ALLOC(cowed_buffers);
 	int slot;
 	int ret;
 	int level;
@@ -2121,6 +2156,11 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 				ret = ret2;
 				goto done;
 			}
+			ret2 = btrfs_search_slot_track_cow(&cowed_buffers, b);
+			if (ret2) {
+				ret = ret2;
+				goto done;
+			}
 		}
 cow_done:
 		p->nodes[level] = b;
@@ -2242,6 +2282,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			ret = ret2;
 	}
 
+	btrfs_search_slot_clear_cow_protection(&cowed_buffers);
+
 	return ret;
 }
 ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfc17c292217..5dd7fcaec5a5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1940,7 +1940,8 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
 	 * of time.
 	 */
 	spin_lock(&eb->refs_lock);
-	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
+	if (!atomic_read(&eb->writeback_blockers) &&
+	    test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
 		XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
 		unsigned long flags;
 
@@ -3009,6 +3010,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
 	eb->len = fs_info->nodesize;
 	eb->fs_info = fs_info;
 	init_rwsem(&eb->lock);
+	atomic_set(&eb->writeback_blockers, 0);
 
 	btrfs_leak_debug_add_eb(eb);
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 73571d5d3d5a..da77c4eb9a43 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,6 +102,14 @@ struct extent_buffer {
 	/* >= 0 if eb belongs to a log tree, -1 otherwise */
 	s8 log_index;
 	u8 folio_shift;
+
+	/*
+	 * Active btrfs_search_slot() operations blocking writeback.
+	 * Prevents COW amplification when searches restart under memory
+	 * pressure. Checked under eb->lock in lock_extent_buffer_for_io().
+	 */
+	atomic_t writeback_blockers;
+
 	struct rcu_head rcu_head;
 
 	struct rw_semaphore lock;
-- 
2.47.3



* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-27 20:42 [PATCH] btrfs: prevent COW amplification during btrfs_search_slot Leo Martins
@ 2026-01-28 21:48 ` Qu Wenruo
  2026-01-29 19:30   ` Leo Martins
  2026-01-29 11:52 ` Filipe Manana
  2026-02-10  7:45 ` kernel test robot
  2 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2026-01-28 21:48 UTC
  To: Leo Martins, linux-btrfs, kernel-team



On 2026/1/28 07:12, Leo Martins wrote:
> I've been investigating ENOSPC errors at Meta and have observed a strange
> pattern where filesystems hit ENOSPC with lots of unallocated space
> (> 100G). Sample dmesg dump at bottom of message.
> 
> btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> from the transaction block reserve and finding it exhausted, leading to a
> warning and ENOSPC. This is a bug, as the reservations are meant to be
> worst case. It should be impossible to exhaust the transaction block
> reserve.
> 
> Some tracing of affected hosts revealed that there were single
> btrfs_search_slot calls that were COWing 100s of times. I was able to
> reproduce this behavior locally by creating a very constrained cgroup
> and producing a lot of concurrent filesystem operations. Here's the
> pattern:
> 
>   1. btrfs_search_slot() begins tree traversal with cow=1
>   2. Node at level N needs COW (old generation or WRITTEN flag set)
>   3. btrfs_cow_block() allocates new node, updates parent pointer
>   4. Traversal continues, but hits a condition requiring restart (e.g., node
>      not cached, lock contention, need higher write_lock_level)
>   5. btrfs_release_path() releases all locks and references
>   6. Memory pressure triggers writeback on the COW'd node
>   7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>      BTRFS_HEADER_FLAG_WRITTEN
>   8. goto again - traversal restarts from root
>   9. Traversal reaches the freshly COW'd node
>   10. should_cow_block() sees WRITTEN flag set, returns true
>   11. btrfs_cow_block() allocates another new node - same logical position,
>       new physical location, new reservation consumed
>   12. Steps 4-11 repeat indefinitely under sustained memory pressure
> 
> Note that this behavior should be much harder to trigger since Boris's
> AS_KERNEL_FILE changes, which stop extent_buffer pages from being
> accounted to user cgroups. However, I believe it would still be an issue
> under global memory pressure.
> Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> 
> This COW amplification breaks the idea that transaction reservations are
> worst case as any search slot call could find itself in this COW loop and
> exhaust its reservation.
> 
> My proposed solution is to temporarily pin extent buffers for the
> lifetime of btrfs_search_slot. This prevents the massive COW
> amplification that can be seen during high memory pressure.
> 
> The implementation uses a local xarray to track COW'd buffers for the
> duration of the search. The xarray stores extent_buffer pointers without
> taking additional references; this is safe because tracked buffers remain
> dirty (writeback_blockers prevents the dirty bit from being cleared) and
> dirty buffers cannot be reclaimed by memory pressure.
> 
> Synchronization is provided by eb->lock: increments in
> btrfs_search_slot_track_cow() occur while holding the write lock, and
> the check in lock_extent_buffer_for_io() also holds the write lock via
> btrfs_tree_lock(). Decrements don't require eb->lock because
> writeback_blockers is atomic and merely indicates "don't write yet".
> Once we decrement, we're done and don't care if writeback proceeds
> immediately.
> 
> Here is pahole output of extent_buffer showing that the new atomic_t
> member can slot into an existing 6 byte hole.
> 
> Before:
> struct extent_buffer {
>          u64                        start;                /*     0     8 */
>          u32                        len;                  /*     8     4 */
>          u32                        folio_size;           /*    12     4 */
>          unsigned long              bflags;               /*    16     8 */
>          struct btrfs_fs_info *     fs_info;              /*    24     8 */
>          void *                     addr;                 /*    32     8 */
>          spinlock_t                 refs_lock;            /*    40     0 */
>          refcount_t                 refs;                 /*    40     4 */
>          int                        read_mirror;          /*    44     4 */
>          s8                         log_index;            /*    48     1 */
>          u8                         folio_shift;          /*    49     1 */
> 
>          /* XXX 6 bytes hole, try to pack */
> 
>          struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
>          /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>          struct rw_semaphore        lock;                 /*    72    32 */
>          struct folio *             folios[16];           /*   104   128 */
> 
>          /* size: 232, cachelines: 4, members: 14 */
>          /* sum members: 226, holes: 1, sum holes: 6 */
>          /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
>          /* last cacheline: 40 bytes */
> };
> 
> After:
> struct extent_buffer {
>          u64                        start;                /*     0     8 */
>          u32                        len;                  /*     8     4 */
>          u32                        folio_size;           /*    12     4 */
>          unsigned long              bflags;               /*    16     8 */
>          struct btrfs_fs_info *     fs_info;              /*    24     8 */
>          void *                     addr;                 /*    32     8 */
>          spinlock_t                 refs_lock;            /*    40     0 */
>          refcount_t                 refs;                 /*    40     4 */
>          int                        read_mirror;          /*    44     4 */
>          s8                         log_index;            /*    48     1 */
>          u8                         folio_shift;          /*    49     1 */
> 
>          /* XXX 2 bytes hole, try to pack */
> 
>          atomic_t                   writeback_blockers;   /*    52     4 */
>          struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
>          /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>          struct rw_semaphore        lock;                 /*    72    32 */
>          struct folio *             folios[16];           /*   104   128 */
> 
>          /* size: 232, cachelines: 4, members: 15 */
>          /* sum members: 230, holes: 1, sum holes: 2 */
>          /* forced alignments: 1 */
>          /* last cacheline: 40 bytes */
> };
> 
> ------------[ cut here ]------------
> WARNING: CPU: 28 PID: 930807 at fs/btrfs/delayed-inode.c:1547 btrfs_insert_delayed_dir_index+0x346/0x3a0
> Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
> CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S          E       6.13.2-0_fbk9_0_gb487e362c3df #1
> Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
> RIP: 0010:btrfs_insert_delayed_dir_index+0x346/0x3a0
> Code: 08 48 89 de 48 c7 c2 e8 09 61 82 4d 89 e0 45 31 c9 e8 2e da 73 ff 48 89 df 48 8b 6c 24 08 4c 8b 74 24 10 e9 57 fe ff ff 89 c3 <0f> 0b 4c 89 e7 e8 b0 fb ff ff e9 5a fe ff ff 65 8b 05 50 d9 2b 7e
> RSP: 0000:ffffc900047b79f8 EFLAGS: 00010286
> RAX: 00000000ffffffe4 RBX: 00000000ffffffe4 RCX: 0000000000000000
> RDX: fffffffffffc0000 RSI: ffff8882aaeb7170 RDI: ffff8882aaeb7128
> RBP: ffff888348114d68 R08: 000000006f684265 R09: 5f79636e6574616c
> R10: 5f726f74696e6f6d R11: 617461646174656d R12: ffff8882d49c4180
> R13: ffff8882aaeb7000 R14: 0000000000000045 R15: 0000000000040000
> FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
> CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
> PKRU: 55555554
> Call Trace:
>   <TASK>
>   ? __warn+0xa4/0x140
>   ? btrfs_insert_delayed_dir_index+0x346/0x3a0
>   ? report_bug+0xe1/0x140
>   ? handle_bug+0x5e/0x90
>   ? exc_invalid_op+0x16/0x40
>   ? asm_exc_invalid_op+0x16/0x20
>   ? btrfs_insert_delayed_dir_index+0x346/0x3a0
>   ? btrfs_insert_delayed_dir_index+0x20c/0x3a0
>   btrfs_insert_dir_item+0x1b0/0x210
>   ? setup_items_for_insert+0x250/0x480
>   btrfs_add_link+0x94/0x3e0
>   btrfs_create_new_inode+0x60a/0xb90
>   ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
>   btrfs_create_common+0x16c/0x1f0
>   path_openat+0x20ff/0x4140
>   do_filp_open+0xa2/0x130
>   ? _raw_spin_lock+0x10/0x20
>   __x64_sys_openat+0x114/0x1b0
>   do_syscall_64+0x68/0x130
>   ? exc_page_fault+0x69/0x130
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7fb55a51b592
> Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
> RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
> RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
> RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
> R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
>   </TASK>
> ---[ end trace 0000000000000000 ]---
> ------------[ cut here ]------------
> BTRFS: Transaction aborted (error -28)
> WARNING: CPU: 28 PID: 930807 at fs/btrfs/inode.c:6606 btrfs_add_link+0x3ae/0x3e0
> Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
> CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S      W   E       6.13.2-0_fbk9_0_gb487e362c3df #1
> Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [E]=UNSIGNED_MODULE
> Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
> RIP: 0010:btrfs_add_link+0x3ae/0x3e0
> Code: 00 e9 75 ff ff ff 48 c7 c7 9b 10 6a 82 89 de e8 28 c1 36 ff 0f 0b e9 81 fe ff ff 48 c7 c7 9b 10 6a 82 44 89 e6 e8 12 c1 36 ff <0f> 0b e9 cf fe ff ff 48 c7 c7 9b 10 6a 82 89 de e8 fd c0 36 ff 0f
> RSP: 0000:ffffc900047b7b00 EFLAGS: 00010282
> RAX: 0000000000000026 RBX: 00000000000acc00 RCX: 0000000000000000
> RDX: ffff889036630158 RSI: ffff889036621c60 RDI: ffff889036621c60
> RBP: 00000000001569d0 R08: ffffffff832692a0 R09: 000000000002fffd
> R10: 0000000000000000 R11: ffffffffffffffff R12: 00000000ffffffe4
> R13: 0000000000000000 R14: ffff8882a6707400 R15: ffffc900047b7ca8
> FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
> CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
> PKRU: 55555554
> Call Trace:
>   <TASK>
>   ? __warn+0xa4/0x140
>   ? btrfs_add_link+0x3ae/0x3e0
>   ? report_bug+0xe1/0x140
>   ? btrfs_add_link+0x3ae/0x3e0
>   ? handle_bug+0x5e/0x90
>   ? exc_invalid_op+0x16/0x40
>   ? asm_exc_invalid_op+0x16/0x20
>   ? btrfs_add_link+0x3ae/0x3e0
>   btrfs_create_new_inode+0x60a/0xb90
>   ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
>   btrfs_create_common+0x16c/0x1f0
>   path_openat+0x20ff/0x4140
>   do_filp_open+0xa2/0x130
>   ? _raw_spin_lock+0x10/0x20
>   __x64_sys_openat+0x114/0x1b0
>   do_syscall_64+0x68/0x130
>   ? exc_page_fault+0x69/0x130
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7fb55a51b592
> Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
> RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
> RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
> RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
> R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
>   </TASK>
> ---[ end trace 0000000000000000 ]---
> BTRFS info (device nvme0n1p2 state A): dumping space info:
> BTRFS info (device nvme0n1p2 state A): space_info DATA has 11715895296 free, is not full
> BTRFS info (device nvme0n1p2 state A): space_info total=92350185472, used=80633892864, pinned=0, reserved=241664, may_use=155648, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
> BTRFS info (device nvme0n1p2 state A): space_info METADATA has 880836608 free, is not full
> BTRFS info (device nvme0n1p2 state A): space_info total=2181038080, used=912293888, pinned=3784704, reserved=557056, may_use=383500288, readonly=65536 zone_unusable=0 delalloc=151552 ordered=8192
> BTRFS info (device nvme0n1p2 state A): space_info SYSTEM has 8372224 free, is not full
> BTRFS info (device nvme0n1p2 state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
> BTRFS info (device nvme0n1p2 state A): global_block_rsv: size 306905088 reserved 306905088
> BTRFS info (device nvme0n1p2 state A): trans_block_rsv: size 1310720 reserved 0
> BTRFS info (device nvme0n1p2 state A): chunk_block_rsv: size 0 reserved 0
> BTRFS info (device nvme0n1p2 state A): delayed_block_rsv: size 4980736 reserved 4980736
> BTRFS info (device nvme0n1p2 state A): delayed_refs_rsv: size 47841280 reserved 46858240
> BTRFS: error (device nvme0n1p2 state A) in btrfs_add_link:6606: errno=-28 No space left
> BTRFS info (device nvme0n1p2 state EA): forced readonly
> 
> 
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>

Thanks for the detailed explanation; the premature writeback of a
recently used dirty eb is indeed a problem.

I considered a simpler solution, such as just marking those ebs as
DONT_WRITEBACK, but that would mean no dirty ebs get written back
until the transaction is committed (and that flag is cleared).

Or a more complex solution that makes btree_writepages() aware of the
filemap LRU and writes back the older, untouched ebs first, which
can be super complex.


So the current solution looks like a good compromise to me; at least
it makes sure those ebs can still be written back after
btrfs_search_slot() has finished.
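
To illustrate that compromise, here is a minimal user-space sketch of
the gate, assuming a hypothetical struct gate_eb and model_* helpers
that are not kernel code. It only models the idea from the patch: while
a search holds a blocker, writeback is skipped and the buffer stays
dirty; once the blocker is dropped, writeback proceeds as usual.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct extent_buffer; not the kernel type. */
struct gate_eb {
	atomic_int writeback_blockers;	/* searches currently protecting this eb */
	bool dirty;			/* models EXTENT_BUFFER_DIRTY */
	bool written;			/* models BTRFS_HEADER_FLAG_WRITTEN */
};

/* Models a search tracking a freshly COW'd (and therefore dirty) buffer. */
static void model_track_cow(struct gate_eb *eb)
{
	eb->dirty = true;
	atomic_fetch_add(&eb->writeback_blockers, 1);
}

/* Models the end of the search dropping its protection. */
static void model_untrack_cow(struct gate_eb *eb)
{
	atomic_fetch_sub(&eb->writeback_blockers, 1);
}

/*
 * Models the check added to lock_extent_buffer_for_io(): writeback only
 * proceeds when no search blocks it and the buffer is dirty.
 * Returns true if the buffer was "written".
 */
static bool model_try_writeback(struct gate_eb *eb)
{
	if (atomic_load(&eb->writeback_blockers) != 0)
		return false;
	if (!eb->dirty)
		return false;
	eb->dirty = false;
	eb->written = true;
	return true;
}
```

In this model a writeback attempt made between model_track_cow() and
model_untrack_cow() is a no-op, and the same attempt succeeds right
after the search ends. Locking and the xarray bookkeeping are left out
on purpose.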


But this also makes me wonder what happens for all the other
btrfs_cow_block() callers outside of btrfs_search_slot().

E.g. for relocation we do not use btrfs_search_slot() to handle reloc
trees, but manually call btrfs_cow_block() inside replace_path().

Should they also receive a similar treatment?

Thanks,
Qu

> ---
>   fs/btrfs/ctree.c     | 42 ++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/extent_io.c |  4 +++-
>   fs/btrfs/extent_io.h |  8 ++++++++
>   3 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..473e78f398b4 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1972,6 +1972,40 @@ static int search_leaf(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>   
> +/*
> + * Track an extent buffer that was COW'd during btrfs_search_slot.
> + * Prevents the flusher from writing this buffer until the search completes.
> + * This avoids COW amplification where a restart would force an unnecessary
> + * re-COW of the same block.
> + */
> +static inline int btrfs_search_slot_track_cow(struct xarray *cowed_buffers,
> +					       struct extent_buffer *eb)
> +{
> +	u32 tmp;
> +	int ret = 0;
> +
> +	lockdep_assert_held_write(&eb->lock);
> +
> +	ret = xa_alloc(cowed_buffers, &tmp, eb, xa_limit_32b, GFP_NOFS);
> +	if (!ret)
> +		atomic_inc(&eb->writeback_blockers);
> +	return ret;
> +}
> +
> +/*
> + * Clear COW protection from all extent buffers tracked during this search.
> + * Called at the end of btrfs_search_slot to allow normal writeback behavior.
> + */
> +static inline void btrfs_search_slot_clear_cow_protection(struct xarray *cowed_buffers)
> +{
> +	struct extent_buffer *eb;
> +	unsigned long index;
> +
> +	xa_for_each(cowed_buffers, index, eb)
> +		atomic_dec(&eb->writeback_blockers);
> +	xa_destroy(cowed_buffers);
> +}
> +
>   /*
>    * Look for a key in a tree and perform necessary modifications to preserve
>    * tree invariants.
> @@ -2009,6 +2043,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>   {
>   	struct btrfs_fs_info *fs_info;
>   	struct extent_buffer *b;
> +	DEFINE_XARRAY_ALLOC(cowed_buffers);
>   	int slot;
>   	int ret;
>   	int level;
> @@ -2121,6 +2156,11 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>   				ret = ret2;
>   				goto done;
>   			}
> +			ret2 = btrfs_search_slot_track_cow(&cowed_buffers, b);
> +			if (ret2) {
> +				ret = ret2;
> +				goto done;
> +			}
>   		}
>   cow_done:
>   		p->nodes[level] = b;
> @@ -2242,6 +2282,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>   			ret = ret2;
>   	}
>   
> +	btrfs_search_slot_clear_cow_protection(&cowed_buffers);
> +
>   	return ret;
>   }
>   ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfc17c292217..5dd7fcaec5a5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1940,7 +1940,8 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
>   	 * of time.
>   	 */
>   	spin_lock(&eb->refs_lock);
> -	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> +	if (!atomic_read(&eb->writeback_blockers) &&
> +	    test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
>   		XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
>   		unsigned long flags;
>   
> @@ -3009,6 +3010,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
>   	eb->len = fs_info->nodesize;
>   	eb->fs_info = fs_info;
>   	init_rwsem(&eb->lock);
> +	atomic_set(&eb->writeback_blockers, 0);
>   
>   	btrfs_leak_debug_add_eb(eb);
>   
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 73571d5d3d5a..da77c4eb9a43 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,14 @@ struct extent_buffer {
>   	/* >= 0 if eb belongs to a log tree, -1 otherwise */
>   	s8 log_index;
>   	u8 folio_shift;
> +
> +	/*
> +	 * Active btrfs_search_slot() operations blocking writeback.
> +	 * Prevents COW amplification when searches restart under memory
> +	 * pressure. Checked under eb->lock in lock_extent_buffer_for_io().
> +	 */
> +	atomic_t writeback_blockers;
> +
>   	struct rcu_head rcu_head;
>   
>   	struct rw_semaphore lock;



* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-27 20:42 [PATCH] btrfs: prevent COW amplification during btrfs_search_slot Leo Martins
  2026-01-28 21:48 ` Qu Wenruo
@ 2026-01-29 11:52 ` Filipe Manana
  2026-01-30  0:12   ` Leo Martins
  2026-02-10  7:45 ` kernel test robot
  2 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2026-01-29 11:52 UTC (permalink / raw)
  To: Leo Martins; +Cc: linux-btrfs, kernel-team

On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> I've been investigating enospcs at Meta and have observed a strange
> pattern where filesystems are enospcing with lots of unallocated space
> (> 100G). Sample dmesg dump at bottom of message.
>
> btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> from the transaction block reserve and finding it exhausted leading to a
> warning and enospc. This is a bug as the reservations are meant to be
> worst case. It should be impossible to exhaust the transaction block
> reserve.
>
> Some tracing of affected hosts revealed that there were single
> btrfs_search_slot calls that were COWing 100s of times. I was able to
> reproduce this behavior locally by creating a very constrained cgroup
> and producing a lot of concurrent filesystem operations. Here's the
> pattern:
>
>  1. btrfs_search_slot() begins tree traversal with cow=1
>  2. Node at level N needs COW (old generation or WRITTEN flag set)
>  3. btrfs_cow_block() allocates new node, updates parent pointer
>  4. Traversal continues, but hits a condition requiring restart (e.g., node
>     not cached, lock contention, need higher write_lock_level)
>  5. btrfs_release_path() releases all locks and references
>  6. Memory pressure triggers writeback on the COW'd node
>  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>     BTRFS_HEADER_FLAG_WRITTEN
>  8. goto again - traversal restarts from root
>  9. Traversal reaches the freshly COW'd node
>  10. should_cow_block() sees WRITTEN flag set, returns true
>  11. btrfs_cow_block() allocates another new node - same logical position,
>      new physical location, new reservation consumed
>  12. Steps 4-11 repeat indefinitely under sustained memory pressure
>
> Note this behavior should be much harder to trigger since Boris's
> AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> accounted for in user cgroups. However, I believe it
> would still be an issue under global memory pressure.
> Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
>
> This COW amplification breaks the idea that transaction reservations are
> worst case as any search slot call could find itself in this COW loop and
> exhaust its reservation.
>
> My proposed solution is to temporarily pin extent buffers for the
> lifetime of btrfs_search_slot. This prevents the massive COW
> amplification that can be seen during high memory pressure.
>
> The implementation uses a local xarray to track COW'd buffers for the
> duration of the search. The xarray stores extent_buffer pointers without
> taking additional references; this is safe because tracked buffers remain
> dirty (writeback_blockers prevents the dirty bit from being cleared) and
> dirty buffers cannot be reclaimed by memory pressure.
>
> Synchronization is provided by eb->lock: increments in
> btrfs_search_slot_track_cow() occur while holding the write lock, and
> the check in lock_extent_buffer_for_io() also holds the write lock via
> btrfs_tree_lock(). Decrements don't require eb->lock because
> writeback_blockers is atomic and merely indicates "don't write yet".
> Once we decrement, we're done and don't care if writeback proceeds
> immediately.

This seems too complex to me.

So this problem is very similar to some idea I had a few years ago but
never managed to implement.
It was about avoiding unnecessary COW, not for this space reservation
exhaustion due to sustained memory pressure, but it would solve it
too.

The idea was that we do unnecessary COW in cases like this:

1) We COW a path in some tree and we are at transaction N;

2) Writeback happened for the extent buffers in that path while we are
in the same transaction, because we reached the 32M limit and some
task called btrfs_btree_balance_dirty() or something else triggered
writeback of the btree inode;

3) While still at transaction N, we visit the same path to add an item
to a leaf, or modify an item, whatever. Because the extent buffers
have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
returns true).

So during the lifetime of a transaction we can have a lot of
unnecessary COW - we spend more time allocating extents, allocating
memory, copying extent buffer data, use more space per transaction,
etc.

The idea was to not COW when an extent buffer has
BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
(btrfs_header_generation(eb)) matches the current transaction.
That is safe because there's no committed tree that points to an
extent buffer created in the current transaction.

Any further modification to the extent buffer must be sure that the
EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
transaction's dirty_pages io tree, etc, so that we don't miss writing
the extent buffer to the same location again before the transaction
commits the superblocks.
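
As a rough illustration of that check, here is a simplified user-space
sketch. The struct cow_eb type and the model_* functions are
hypothetical names for this example only; the real should_cow_block()
considers more conditions than the two modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent buffer; not the kernel type. */
struct cow_eb {
	uint64_t generation;	/* models btrfs_header_generation(eb) */
	bool written;		/* models BTRFS_HEADER_FLAG_WRITTEN */
	bool dirty;		/* models EXTENT_BUFFER_DIRTY */
};

/*
 * Simplified model of the behaviour described above: a buffer from an
 * older transaction, or one that has already been written, gets COWed.
 */
static bool model_should_cow_current(const struct cow_eb *eb, uint64_t transid)
{
	return eb->generation != transid || eb->written;
}

/*
 * Simplified model of the idea: a buffer created in the current
 * transaction is reused in place even if it was already written,
 * because no committed tree can point to it yet.
 */
static bool model_should_cow_proposed(const struct cow_eb *eb, uint64_t transid)
{
	return eb->generation != transid;
}

/*
 * Reusing a written buffer in place only works if it is made dirty
 * again, so that it is rewritten to the same location before the
 * transaction commits. What happens to the WRITTEN flag is not decided
 * here, so this model leaves it untouched.
 */
static void model_reuse_in_place(struct cow_eb *eb)
{
	eb->dirty = true;
}
```

For a buffer with generation 10 and the WRITTEN flag set, the current
model asks for a COW in transaction 10 while the proposed one does not;
both ask for a COW once the transaction id moves on to 11.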

Have you considered an approach like this?

It would solve this space reservation exhaustion problem, as well as
unnecessary COW for general optimization, without the need for a
local xarray, which besides being very specific to the
btrfs_search_slot() case (we COW in other places), also requires a
memory allocation which can fail.

Thanks.


>
> Here is pahole output of extent_buffer showing that the new atomic_t
> member can slot into an existing 6 byte hole.
>
> Before:
> struct extent_buffer {
>         u64                        start;                /*     0     8 */
>         u32                        len;                  /*     8     4 */
>         u32                        folio_size;           /*    12     4 */
>         unsigned long              bflags;               /*    16     8 */
>         struct btrfs_fs_info *     fs_info;              /*    24     8 */
>         void *                     addr;                 /*    32     8 */
>         spinlock_t                 refs_lock;            /*    40     0 */
>         refcount_t                 refs;                 /*    40     4 */
>         int                        read_mirror;          /*    44     4 */
>         s8                         log_index;            /*    48     1 */
>         u8                         folio_shift;          /*    49     1 */
>
>         /* XXX 6 bytes hole, try to pack */
>
>         struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
>         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>         struct rw_semaphore        lock;                 /*    72    32 */
>         struct folio *             folios[16];           /*   104   128 */
>
>         /* size: 232, cachelines: 4, members: 14 */
>         /* sum members: 226, holes: 1, sum holes: 6 */
>         /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
>         /* last cacheline: 40 bytes */
> };
>
> After:
> struct extent_buffer {
>         u64                        start;                /*     0     8 */
>         u32                        len;                  /*     8     4 */
>         u32                        folio_size;           /*    12     4 */
>         unsigned long              bflags;               /*    16     8 */
>         struct btrfs_fs_info *     fs_info;              /*    24     8 */
>         void *                     addr;                 /*    32     8 */
>         spinlock_t                 refs_lock;            /*    40     0 */
>         refcount_t                 refs;                 /*    40     4 */
>         int                        read_mirror;          /*    44     4 */
>         s8                         log_index;            /*    48     1 */
>         u8                         folio_shift;          /*    49     1 */
>
>         /* XXX 2 bytes hole, try to pack */
>
>         atomic_t                   writeback_blockers;   /*    52     4 */
>         struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
>         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>         struct rw_semaphore        lock;                 /*    72    32 */
>         struct folio *             folios[16];           /*   104   128 */
>
>         /* size: 232, cachelines: 4, members: 15 */
>         /* sum members: 230, holes: 1, sum holes: 2 */
>         /* forced alignments: 1 */
>         /* last cacheline: 40 bytes */
> };
>
> ------------[ cut here ]------------
> WARNING: CPU: 28 PID: 930807 at fs/btrfs/delayed-inode.c:1547 btrfs_insert_delayed_dir_index+0x346/0x3a0
> Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
> CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S          E       6.13.2-0_fbk9_0_gb487e362c3df #1
> Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
> RIP: 0010:btrfs_insert_delayed_dir_index+0x346/0x3a0
> Code: 08 48 89 de 48 c7 c2 e8 09 61 82 4d 89 e0 45 31 c9 e8 2e da 73 ff 48 89 df 48 8b 6c 24 08 4c 8b 74 24 10 e9 57 fe ff ff 89 c3 <0f> 0b 4c 89 e7 e8 b0 fb ff ff e9 5a fe ff ff 65 8b 05 50 d9 2b 7e
> RSP: 0000:ffffc900047b79f8 EFLAGS: 00010286
> RAX: 00000000ffffffe4 RBX: 00000000ffffffe4 RCX: 0000000000000000
> RDX: fffffffffffc0000 RSI: ffff8882aaeb7170 RDI: ffff8882aaeb7128
> RBP: ffff888348114d68 R08: 000000006f684265 R09: 5f79636e6574616c
> R10: 5f726f74696e6f6d R11: 617461646174656d R12: ffff8882d49c4180
> R13: ffff8882aaeb7000 R14: 0000000000000045 R15: 0000000000040000
> FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
> CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  ? __warn+0xa4/0x140
>  ? btrfs_insert_delayed_dir_index+0x346/0x3a0
>  ? report_bug+0xe1/0x140
>  ? handle_bug+0x5e/0x90
>  ? exc_invalid_op+0x16/0x40
>  ? asm_exc_invalid_op+0x16/0x20
>  ? btrfs_insert_delayed_dir_index+0x346/0x3a0
>  ? btrfs_insert_delayed_dir_index+0x20c/0x3a0
>  btrfs_insert_dir_item+0x1b0/0x210
>  ? setup_items_for_insert+0x250/0x480
>  btrfs_add_link+0x94/0x3e0
>  btrfs_create_new_inode+0x60a/0xb90
>  ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
>  btrfs_create_common+0x16c/0x1f0
>  path_openat+0x20ff/0x4140
>  do_filp_open+0xa2/0x130
>  ? _raw_spin_lock+0x10/0x20
>  __x64_sys_openat+0x114/0x1b0
>  do_syscall_64+0x68/0x130
>  ? exc_page_fault+0x69/0x130
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7fb55a51b592
> Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
> RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
> RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
> RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
> R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
>  </TASK>
> ---[ end trace 0000000000000000 ]---
> ------------[ cut here ]------------
> BTRFS: Transaction aborted (error -28)
> WARNING: CPU: 28 PID: 930807 at fs/btrfs/inode.c:6606 btrfs_add_link+0x3ae/0x3e0
> Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
> CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S      W   E       6.13.2-0_fbk9_0_gb487e362c3df #1
> Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [E]=UNSIGNED_MODULE
> Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
> RIP: 0010:btrfs_add_link+0x3ae/0x3e0
> Code: 00 e9 75 ff ff ff 48 c7 c7 9b 10 6a 82 89 de e8 28 c1 36 ff 0f 0b e9 81 fe ff ff 48 c7 c7 9b 10 6a 82 44 89 e6 e8 12 c1 36 ff <0f> 0b e9 cf fe ff ff 48 c7 c7 9b 10 6a 82 89 de e8 fd c0 36 ff 0f
> RSP: 0000:ffffc900047b7b00 EFLAGS: 00010282
> RAX: 0000000000000026 RBX: 00000000000acc00 RCX: 0000000000000000
> RDX: ffff889036630158 RSI: ffff889036621c60 RDI: ffff889036621c60
> RBP: 00000000001569d0 R08: ffffffff832692a0 R09: 000000000002fffd
> R10: 0000000000000000 R11: ffffffffffffffff R12: 00000000ffffffe4
> R13: 0000000000000000 R14: ffff8882a6707400 R15: ffffc900047b7ca8
> FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
> CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  ? __warn+0xa4/0x140
>  ? btrfs_add_link+0x3ae/0x3e0
>  ? report_bug+0xe1/0x140
>  ? btrfs_add_link+0x3ae/0x3e0
>  ? handle_bug+0x5e/0x90
>  ? exc_invalid_op+0x16/0x40
>  ? asm_exc_invalid_op+0x16/0x20
>  ? btrfs_add_link+0x3ae/0x3e0
>  btrfs_create_new_inode+0x60a/0xb90
>  ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
>  btrfs_create_common+0x16c/0x1f0
>  path_openat+0x20ff/0x4140
>  do_filp_open+0xa2/0x130
>  ? _raw_spin_lock+0x10/0x20
>  __x64_sys_openat+0x114/0x1b0
>  do_syscall_64+0x68/0x130
>  ? exc_page_fault+0x69/0x130
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7fb55a51b592
> Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
> RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
> RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
> RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
> R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
>  </TASK>
> ---[ end trace 0000000000000000 ]---
> BTRFS info (device nvme0n1p2 state A): dumping space info:
> BTRFS info (device nvme0n1p2 state A): space_info DATA has 11715895296 free, is not full
> BTRFS info (device nvme0n1p2 state A): space_info total=92350185472, used=80633892864, pinned=0, reserved=241664, may_use=155648, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
> BTRFS info (device nvme0n1p2 state A): space_info METADATA has 880836608 free, is not full
> BTRFS info (device nvme0n1p2 state A): space_info total=2181038080, used=912293888, pinned=3784704, reserved=557056, may_use=383500288, readonly=65536 zone_unusable=0 delalloc=151552 ordered=8192
> BTRFS info (device nvme0n1p2 state A): space_info SYSTEM has 8372224 free, is not full
> BTRFS info (device nvme0n1p2 state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
> BTRFS info (device nvme0n1p2 state A): global_block_rsv: size 306905088 reserved 306905088
> BTRFS info (device nvme0n1p2 state A): trans_block_rsv: size 1310720 reserved 0
> BTRFS info (device nvme0n1p2 state A): chunk_block_rsv: size 0 reserved 0
> BTRFS info (device nvme0n1p2 state A): delayed_block_rsv: size 4980736 reserved 4980736
> BTRFS info (device nvme0n1p2 state A): delayed_refs_rsv: size 47841280 reserved 46858240
> BTRFS: error (device nvme0n1p2 state A) in btrfs_add_link:6606: errno=-28 No space left
> BTRFS info (device nvme0n1p2 state EA): forced readonly
>
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
>  fs/btrfs/ctree.c     | 42 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/extent_io.c |  4 +++-
>  fs/btrfs/extent_io.h |  8 ++++++++
>  3 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..473e78f398b4 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1972,6 +1972,40 @@ static int search_leaf(struct btrfs_trans_handle *trans,
>         return ret;
>  }
>
> +/*
> + * Track an extent buffer that was COW'd during btrfs_search_slot.
> + * Prevents the flusher from writing this buffer until the search completes.
> + * This avoids COW amplification where a restart would force an unnecessary
> + * re-COW of the same block.
> + */
> +static inline int btrfs_search_slot_track_cow(struct xarray *cowed_buffers,
> +                                              struct extent_buffer *eb)
> +{
> +       u32 tmp;
> +       int ret = 0;
> +
> +       lockdep_assert_held_write(&eb->lock);
> +
> +       ret = xa_alloc(cowed_buffers, &tmp, eb, xa_limit_32b, GFP_NOFS);
> +       if (!ret)
> +               atomic_inc(&eb->writeback_blockers);
> +       return ret;
> +}
> +
> +/*
> + * Clear COW protection from all extent buffers tracked during this search.
> + * Called at the end of btrfs_search_slot to allow normal writeback behavior.
> + */
> +static inline void btrfs_search_slot_clear_cow_protection(struct xarray *cowed_buffers)
> +{
> +       struct extent_buffer *eb;
> +       unsigned long index;
> +
> +       xa_for_each(cowed_buffers, index, eb)
> +               atomic_dec(&eb->writeback_blockers);
> +       xa_destroy(cowed_buffers);
> +}
> +
>  /*
>   * Look for a key in a tree and perform necessary modifications to preserve
>   * tree invariants.
> @@ -2009,6 +2043,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  {
>         struct btrfs_fs_info *fs_info;
>         struct extent_buffer *b;
> +       DEFINE_XARRAY_ALLOC(cowed_buffers);
>         int slot;
>         int ret;
>         int level;
> @@ -2121,6 +2156,11 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>                                 ret = ret2;
>                                 goto done;
>                         }
> +                       ret2 = btrfs_search_slot_track_cow(&cowed_buffers, b);
> +                       if (ret2) {
> +                               ret = ret2;
> +                               goto done;
> +                       }
>                 }
>  cow_done:
>                 p->nodes[level] = b;
> @@ -2242,6 +2282,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>                         ret = ret2;
>         }
>
> +       btrfs_search_slot_clear_cow_protection(&cowed_buffers);
> +
>         return ret;
>  }
>  ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfc17c292217..5dd7fcaec5a5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1940,7 +1940,8 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
>          * of time.
>          */
>         spin_lock(&eb->refs_lock);
> -       if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> +       if (!atomic_read(&eb->writeback_blockers) &&
> +           test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
>                 XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
>                 unsigned long flags;
>
> @@ -3009,6 +3010,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
>         eb->len = fs_info->nodesize;
>         eb->fs_info = fs_info;
>         init_rwsem(&eb->lock);
> +       atomic_set(&eb->writeback_blockers, 0);
>
>         btrfs_leak_debug_add_eb(eb);
>
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 73571d5d3d5a..da77c4eb9a43 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,14 @@ struct extent_buffer {
>         /* >= 0 if eb belongs to a log tree, -1 otherwise */
>         s8 log_index;
>         u8 folio_shift;
> +
> +       /*
> +        * Active btrfs_search_slot() operations blocking writeback.
> +        * Prevents COW amplification when searches restart under memory
> +        * pressure. Checked under eb->lock in lock_extent_buffer_for_io().
> +        */
> +       atomic_t writeback_blockers;
> +
>         struct rcu_head rcu_head;
>
>         struct rw_semaphore lock;
> --
> 2.47.3
>
>


* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-28 21:48 ` Qu Wenruo
@ 2026-01-29 19:30   ` Leo Martins
  0 siblings, 0 replies; 21+ messages in thread
From: Leo Martins @ 2026-01-29 19:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, kernel-team

On Thu, 29 Jan 2026 08:18:12 +1030 Qu Wenruo <wqu@suse.com> wrote:

> 
> 
> On 2026/1/28 07:12, Leo Martins wrote:
> > I've been investigating enospcs at Meta and have observed a strange
> > pattern where filesystems are enospcing with lots of unallocated space
> > (> 100G). Sample dmesg dump at bottom of message.
> > 
> > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > from the transaction block reserve and finding it exhausted leading to a
> > warning and enospc. This is a bug as the reservations are meant to be
> > worst case. It should be impossible to exhaust the transaction block
> > reserve.
> > 
> > Some tracing of affected hosts revealed that there were single
> > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > reproduce this behavior locally by creating a very constrained cgroup
> > and producing a lot of concurrent filesystem operations. Here's the
> > pattern:
> > 
> >   1. btrfs_search_slot() begins tree traversal with cow=1
> >   2. Node at level N needs COW (old generation or WRITTEN flag set)
> >   3. btrfs_cow_block() allocates new node, updates parent pointer
> >   4. Traversal continues, but hits a condition requiring restart (e.g., node
> >      not cached, lock contention, need higher write_lock_level)
> >   5. btrfs_release_path() releases all locks and references
> >   6. Memory pressure triggers writeback on the COW'd node
> >   7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> >      BTRFS_HEADER_FLAG_WRITTEN
> >   8. goto again - traversal restarts from root
> >   9. Traversal reaches the freshly COW'd node
> >   10. should_cow_block() sees WRITTEN flag set, returns true
> >   11. btrfs_cow_block() allocates another new node - same logical position,
> >       new physical location, new reservation consumed
> >   12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > 
> > Note this behavior should be much harder to trigger since Boris's
> > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > accounted for in user cgroups. However, I believe it
> > would still be an issue under global memory pressure.
> > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > 
> > This COW amplification breaks the idea that transaction reservations are
> > worst case as any search slot call could find itself in this COW loop and
> > exhaust its reservation.
> > 
> > My proposed solution is to temporarily pin extent buffers for the
> > lifetime of btrfs_search_slot. This prevents the massive COW
> > amplification that can be seen during high memory pressure.
> > 
> > The implementation uses a local xarray to track COW'd buffers for the
> > duration of the search. The xarray stores extent_buffer pointers without
> > taking additional references; this is safe because tracked buffers remain
> > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > dirty buffers cannot be reclaimed by memory pressure.
> > 
> > Synchronization is provided by eb->lock: increments in
> > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > the check in lock_extent_buffer_for_io() also holds the write lock via
> > btrfs_tree_lock(). Decrements don't require eb->lock because
> > writeback_blockers is atomic and merely indicates "don't write yet".
> > Once we decrement, we're done and don't care if writeback proceeds
> > immediately.
> > 
> > Here is pahole output of extent_buffer showing that the new atomic_t
> > member can slot into an existing 6 byte hole.
> > 
> > Before:
> > struct extent_buffer {
> >          u64                        start;                /*     0     8 */
> >          u32                        len;                  /*     8     4 */
> >          u32                        folio_size;           /*    12     4 */
> >          unsigned long              bflags;               /*    16     8 */
> >          struct btrfs_fs_info *     fs_info;              /*    24     8 */
> >          void *                     addr;                 /*    32     8 */
> >          spinlock_t                 refs_lock;            /*    40     0 */
> >          refcount_t                 refs;                 /*    40     4 */
> >          int                        read_mirror;          /*    44     4 */
> >          s8                         log_index;            /*    48     1 */
> >          u8                         folio_shift;          /*    49     1 */
> > 
> >          /* XXX 6 bytes hole, try to pack */
> > 
> >          struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
> >          /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> >          struct rw_semaphore        lock;                 /*    72    32 */
> >          struct folio *             folios[16];           /*   104   128 */
> > 
> >          /* size: 232, cachelines: 4, members: 14 */
> >          /* sum members: 226, holes: 1, sum holes: 6 */
> >          /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
> >          /* last cacheline: 40 bytes */
> > };
> > 
> > After:
> > struct extent_buffer {
> >          u64                        start;                /*     0     8 */
> >          u32                        len;                  /*     8     4 */
> >          u32                        folio_size;           /*    12     4 */
> >          unsigned long              bflags;               /*    16     8 */
> >          struct btrfs_fs_info *     fs_info;              /*    24     8 */
> >          void *                     addr;                 /*    32     8 */
> >          spinlock_t                 refs_lock;            /*    40     0 */
> >          refcount_t                 refs;                 /*    40     4 */
> >          int                        read_mirror;          /*    44     4 */
> >          s8                         log_index;            /*    48     1 */
> >          u8                         folio_shift;          /*    49     1 */
> > 
> >          /* XXX 2 bytes hole, try to pack */
> > 
> >          atomic_t                   writeback_blockers;   /*    52     4 */
> >          struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
> >          /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> >          struct rw_semaphore        lock;                 /*    72    32 */
> >          struct folio *             folios[16];           /*   104   128 */
> > 
> >          /* size: 232, cachelines: 4, members: 15 */
> >          /* sum members: 230, holes: 1, sum holes: 2 */
> >          /* forced alignments: 1 */
> >          /* last cacheline: 40 bytes */
> > };
> > 
> > ------------[ cut here ]------------
> > WARNING: CPU: 28 PID: 930807 at fs/btrfs/delayed-inode.c:1547 btrfs_insert_delayed_dir_index+0x346/0x3a0
> > Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
> > CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S          E       6.13.2-0_fbk9_0_gb487e362c3df #1
> > Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> > Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
> > RIP: 0010:btrfs_insert_delayed_dir_index+0x346/0x3a0
> > Code: 08 48 89 de 48 c7 c2 e8 09 61 82 4d 89 e0 45 31 c9 e8 2e da 73 ff 48 89 df 48 8b 6c 24 08 4c 8b 74 24 10 e9 57 fe ff ff 89 c3 <0f> 0b 4c 89 e7 e8 b0 fb ff ff e9 5a fe ff ff 65 8b 05 50 d9 2b 7e
> > RSP: 0000:ffffc900047b79f8 EFLAGS: 00010286
> > RAX: 00000000ffffffe4 RBX: 00000000ffffffe4 RCX: 0000000000000000
> > RDX: fffffffffffc0000 RSI: ffff8882aaeb7170 RDI: ffff8882aaeb7128
> > RBP: ffff888348114d68 R08: 000000006f684265 R09: 5f79636e6574616c
> > R10: 5f726f74696e6f6d R11: 617461646174656d R12: ffff8882d49c4180
> > R13: ffff8882aaeb7000 R14: 0000000000000045 R15: 0000000000040000
> > FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
> > CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
> > PKRU: 55555554
> > Call Trace:
> >   <TASK>
> >   ? __warn+0xa4/0x140
> >   ? btrfs_insert_delayed_dir_index+0x346/0x3a0
> >   ? report_bug+0xe1/0x140
> >   ? handle_bug+0x5e/0x90
> >   ? exc_invalid_op+0x16/0x40
> >   ? asm_exc_invalid_op+0x16/0x20
> >   ? btrfs_insert_delayed_dir_index+0x346/0x3a0
> >   ? btrfs_insert_delayed_dir_index+0x20c/0x3a0
> >   btrfs_insert_dir_item+0x1b0/0x210
> >   ? setup_items_for_insert+0x250/0x480
> >   btrfs_add_link+0x94/0x3e0
> >   btrfs_create_new_inode+0x60a/0xb90
> >   ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
> >   btrfs_create_common+0x16c/0x1f0
> >   path_openat+0x20ff/0x4140
> >   do_filp_open+0xa2/0x130
> >   ? _raw_spin_lock+0x10/0x20
> >   __x64_sys_openat+0x114/0x1b0
> >   do_syscall_64+0x68/0x130
> >   ? exc_page_fault+0x69/0x130
> >   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > RIP: 0033:0x7fb55a51b592
> > Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
> > RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
> > RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
> > RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
> > R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
> > R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
> >   </TASK>
> > ---[ end trace 0000000000000000 ]---
> > ------------[ cut here ]------------
> > BTRFS: Transaction aborted (error -28)
> > WARNING: CPU: 28 PID: 930807 at fs/btrfs/inode.c:6606 btrfs_add_link+0x3ae/0x3e0
> > Modules linked in: ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) bpf_preload(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) sch_fq(E) tls(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) mlx5_ib(E) iTCO_vendor_support(E) xhci_pci(E) mlx5_fwctl(E) i2c_i801(E) kvm(E) xhci_hcd(E) ib_uverbs(E) acpi_cpufreq(E) fwctl(E) i2c_smbus(E) wmi(E) ipmi_si(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
> > CPU: 28 UID: 34126 PID: 930807 Comm: CPUThreadPool0 Kdump: loaded Tainted: G S      W   E       6.13.2-0_fbk9_0_gb487e362c3df #1
> > Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [E]=UNSIGNED_MODULE
> > Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL403 06/20/2025
> > RIP: 0010:btrfs_add_link+0x3ae/0x3e0
> > Code: 00 e9 75 ff ff ff 48 c7 c7 9b 10 6a 82 89 de e8 28 c1 36 ff 0f 0b e9 81 fe ff ff 48 c7 c7 9b 10 6a 82 44 89 e6 e8 12 c1 36 ff <0f> 0b e9 cf fe ff ff 48 c7 c7 9b 10 6a 82 89 de e8 fd c0 36 ff 0f
> > RSP: 0000:ffffc900047b7b00 EFLAGS: 00010282
> > RAX: 0000000000000026 RBX: 00000000000acc00 RCX: 0000000000000000
> > RDX: ffff889036630158 RSI: ffff889036621c60 RDI: ffff889036621c60
> > RBP: 00000000001569d0 R08: ffffffff832692a0 R09: 000000000002fffd
> > R10: 0000000000000000 R11: ffffffffffffffff R12: 00000000ffffffe4
> > R13: 0000000000000000 R14: ffff8882a6707400 R15: ffffc900047b7ca8
> > FS:  00007fb5563fd640(0000) GS:ffff889036600000(0000) knlGS:0000000000000000
> > CR2: 000000000ba8bffd CR3: 0000000906d1b002 CR4: 00000000007726f0
> > PKRU: 55555554
> > Call Trace:
> >   <TASK>
> >   ? __warn+0xa4/0x140
> >   ? btrfs_add_link+0x3ae/0x3e0
> >   ? report_bug+0xe1/0x140
> >   ? btrfs_add_link+0x3ae/0x3e0
> >   ? handle_bug+0x5e/0x90
> >   ? exc_invalid_op+0x16/0x40
> >   ? asm_exc_invalid_op+0x16/0x20
> >   ? btrfs_add_link+0x3ae/0x3e0
> >   btrfs_create_new_inode+0x60a/0xb90
> >   ? start_transaction.llvm.5573957049853623343+0x2e4/0x7a0
> >   btrfs_create_common+0x16c/0x1f0
> >   path_openat+0x20ff/0x4140
> >   do_filp_open+0xa2/0x130
> >   ? _raw_spin_lock+0x10/0x20
> >   __x64_sys_openat+0x114/0x1b0
> >   do_syscall_64+0x68/0x130
> >   ? exc_page_fault+0x69/0x130
> >   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > RIP: 0033:0x7fb55a51b592
> > Code: 8b 55 d0 eb b0 0f 1f 00 44 89 55 9c e8 b7 b6 f7 ff 41 89 c0 44 8b 55 9c 44 89 e2 4c 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 46 44 89 c7 89 45 9c e8 eb b6 f7 ff 8b 45 9c
> > RSP: 002b:00007fb5563f70a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > RAX: ffffffffffffffda RBX: f49998db0aa753ff RCX: 00007fb55a51b592
> > RDX: 00000000000000c2 RSI: 00007fb557509cc0 RDI: 00000000ffffff9c
> > RBP: 00007fb5563f7110 R08: 0000000000000000 R09: 0000000000000001
> > R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
> > R13: 00007fb557509cc0 R14: 000000000f595b0b R15: 00007fb55a5e0ca0
> >   </TASK>
> > ---[ end trace 0000000000000000 ]---
> > BTRFS info (device nvme0n1p2 state A): dumping space info:
> > BTRFS info (device nvme0n1p2 state A): space_info DATA has 11715895296 free, is not full
> > BTRFS info (device nvme0n1p2 state A): space_info total=92350185472, used=80633892864, pinned=0, reserved=241664, may_use=155648, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
> > BTRFS info (device nvme0n1p2 state A): space_info METADATA has 880836608 free, is not full
> > BTRFS info (device nvme0n1p2 state A): space_info total=2181038080, used=912293888, pinned=3784704, reserved=557056, may_use=383500288, readonly=65536 zone_unusable=0 delalloc=151552 ordered=8192
> > BTRFS info (device nvme0n1p2 state A): space_info SYSTEM has 8372224 free, is not full
> > BTRFS info (device nvme0n1p2 state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 delalloc=151552 ordered=8192
> > BTRFS info (device nvme0n1p2 state A): global_block_rsv: size 306905088 reserved 306905088
> > BTRFS info (device nvme0n1p2 state A): trans_block_rsv: size 1310720 reserved 0
> > BTRFS info (device nvme0n1p2 state A): chunk_block_rsv: size 0 reserved 0
> > BTRFS info (device nvme0n1p2 state A): delayed_block_rsv: size 4980736 reserved 4980736
> > BTRFS info (device nvme0n1p2 state A): delayed_refs_rsv: size 47841280 reserved 46858240
> > BTRFS: error (device nvme0n1p2 state A) in btrfs_add_link:6606: errno=-28 No space left
> > BTRFS info (device nvme0n1p2 state EA): forced readonly
> > 
> > 
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> 
> Thanks for the detailed explanation; the premature writeback of a
> recently used dirty eb is indeed a problem.
> 
> I considered a simpler solution like just marking those ebs as
> DONT_WRITEBACK, but that would mean no dirty ebs get written back
> until a transaction is committed (and that flag is cleared).
> 
> Or a more complex solution that makes btree_writepages() aware of the
> filemap LRU and writes back the older, untouched ebs first, which
> can be super complex.
> 
> 
> So the current solution looks good to me as a good compromise, at least 
> it makes sure those ebs can still be written back after 
> btrfs_search_slot() finished.
> 
> 
> But this also makes me wonder what will happen for all the other
> btrfs_cow_block() callers outside of btrfs_search_slot().
> 
> E.g. for relocation we do not use btrfs_search_slot() to handle reloc
> trees, but manually call btrfs_cow_block() inside replace_path().
> 
> Should they also receive a similar treatment?

This is a good point. Looking through some of the other
btrfs_cow_block callers, there are places that may be vulnerable
to similar patterns of COW -> release locks -> writeback -> re-COW.

I think that a more general solution like the one Filipe is proposing
would address this as it would not be search_slot specific.

Thanks,
Leo

> 
> Thanks,
> Qu
> 
> > ---
> >   fs/btrfs/ctree.c     | 42 ++++++++++++++++++++++++++++++++++++++++++
> >   fs/btrfs/extent_io.c |  4 +++-
> >   fs/btrfs/extent_io.h |  8 ++++++++
> >   3 files changed, 53 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..473e78f398b4 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -1972,6 +1972,40 @@ static int search_leaf(struct btrfs_trans_handle *trans,
> >   	return ret;
> >   }
> >   
> > +/*
> > + * Track an extent buffer that was COW'd during btrfs_search_slot.
> > + * Prevents the flusher from writing this buffer until the search completes.
> > + * This avoids COW amplification where a restart would force an unnecessary
> > + * re-COW of the same block.
> > + */
> > +static inline int btrfs_search_slot_track_cow(struct xarray *cowed_buffers,
> > +					       struct extent_buffer *eb)
> > +{
> > +	u32 tmp;
> > +	int ret = 0;
> > +
> > +	lockdep_assert_held_write(&eb->lock);
> > +
> > +	ret = xa_alloc(cowed_buffers, &tmp, eb, xa_limit_32b, GFP_NOFS);
> > +	if (!ret)
> > +		atomic_inc(&eb->writeback_blockers);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Clear COW protection from all extent buffers tracked during this search.
> > + * Called at the end of btrfs_search_slot to allow normal writeback behavior.
> > + */
> > +static inline void btrfs_search_slot_clear_cow_protection(struct xarray *cowed_buffers)
> > +{
> > +	struct extent_buffer *eb;
> > +	unsigned long index;
> > +
> > +	xa_for_each(cowed_buffers, index, eb)
> > +		atomic_dec(&eb->writeback_blockers);
> > +	xa_destroy(cowed_buffers);
> > +}
> > +
> >   /*
> >    * Look for a key in a tree and perform necessary modifications to preserve
> >    * tree invariants.
> > @@ -2009,6 +2043,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >   {
> >   	struct btrfs_fs_info *fs_info;
> >   	struct extent_buffer *b;
> > +	DEFINE_XARRAY_ALLOC(cowed_buffers);
> >   	int slot;
> >   	int ret;
> >   	int level;
> > @@ -2121,6 +2156,11 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >   				ret = ret2;
> >   				goto done;
> >   			}
> > +			ret2 = btrfs_search_slot_track_cow(&cowed_buffers, b);
> > +			if (ret2) {
> > +				ret = ret2;
> > +				goto done;
> > +			}
> >   		}
> >   cow_done:
> >   		p->nodes[level] = b;
> > @@ -2242,6 +2282,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >   			ret = ret2;
> >   	}
> >   
> > +	btrfs_search_slot_clear_cow_protection(&cowed_buffers);
> > +
> >   	return ret;
> >   }
> >   ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index dfc17c292217..5dd7fcaec5a5 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -1940,7 +1940,8 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
> >   	 * of time.
> >   	 */
> >   	spin_lock(&eb->refs_lock);
> > -	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> > +	if (!atomic_read(&eb->writeback_blockers) &&
> > +	    test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> >   		XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
> >   		unsigned long flags;
> >   
> > @@ -3009,6 +3010,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
> >   	eb->len = fs_info->nodesize;
> >   	eb->fs_info = fs_info;
> >   	init_rwsem(&eb->lock);
> > +	atomic_set(&eb->writeback_blockers, 0);
> >   
> >   	btrfs_leak_debug_add_eb(eb);
> >   
> > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> > index 73571d5d3d5a..da77c4eb9a43 100644
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -102,6 +102,14 @@ struct extent_buffer {
> >   	/* >= 0 if eb belongs to a log tree, -1 otherwise */
> >   	s8 log_index;
> >   	u8 folio_shift;
> > +
> > +	/*
> > +	 * Active btrfs_search_slot() operations blocking writeback.
> > +	 * Prevents COW amplification when searches restart under memory
> > +	 * pressure. Checked under eb->lock in lock_extent_buffer_for_io().
> > +	 */
> > +	atomic_t writeback_blockers;
> > +
> >   	struct rcu_head rcu_head;
> >   
> >   	struct rw_semaphore lock;


* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-29 11:52 ` Filipe Manana
@ 2026-01-30  0:12   ` Leo Martins
  2026-01-30  4:14     ` Sun YangKai
  2026-01-30 12:49     ` Filipe Manana
  0 siblings, 2 replies; 21+ messages in thread
From: Leo Martins @ 2026-01-30  0:12 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:

> On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > I've been investigating enospcs at Meta and have observed a strange
> > pattern where filesystems are enospcing with lots of unallocated space
> > (> 100G). Sample dmesg dump at bottom of message.
> >
> > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > from the transaction block reserve and finding it exhausted leading to a
> > warning and enospc. This is a bug as the reservations are meant to be
> > worst case. It should be impossible to exhaust the transaction block
> > reserve.
> >
> > Some tracing of affected hosts revealed that there were single
> > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > reproduce this behavior locally by creating a very constrained cgroup
> > and producing a lot of concurrent filesystem operations. Here's the
> > pattern:
> >
> >  1. btrfs_search_slot() begins tree traversal with cow=1
> >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> >  3. btrfs_cow_block() allocates new node, updates parent pointer
> >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> >     not cached, lock contention, need higher write_lock_level)
> >  5. btrfs_release_path() releases all locks and references
> >  6. Memory pressure triggers writeback on the COW'd node
> >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> >     BTRFS_HEADER_FLAG_WRITTEN
> >  8. goto again - traversal restarts from root
> >  9. Traversal reaches the freshly COW'd node
> >  10. should_cow_block() sees WRITTEN flag set, returns true
> >  11. btrfs_cow_block() allocates another new node - same logical position,
> >      new physical location, new reservation consumed
> >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> >
> > Note this behavior should be much harder to trigger since Boris's
> > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > accounted for in user cgroups. However, I believe it
> > would still be an issue under global memory pressure.
> > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> >
> > This COW amplification breaks the idea that transaction reservations are
> > worst case as any search slot call could find itself in this COW loop and
> > exhaust its reservation.
> >
> > My proposed solution is to temporarily pin extent buffers for the
> > lifetime of btrfs_search_slot. This prevents the massive COW
> > amplification that can be seen during high memory pressure.
> >
> > The implementation uses a local xarray to track COW'd buffers for the
> > duration of the search. The xarray stores extent_buffer pointers without
> > taking additional references; this is safe because tracked buffers remain
> > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > dirty buffers cannot be reclaimed by memory pressure.
> >
> > Synchronization is provided by eb->lock: increments in
> > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > the check in lock_extent_buffer_for_io() also holds the write lock via
> > btrfs_tree_lock(). Decrements don't require eb->lock because
> > writeback_blockers is atomic and merely indicates "don't write yet".
> > Once we decrement, we're done and don't care if writeback proceeds
> > immediately.
> 
> This seems too complex to me.
> 
> So this problem is very similar to some idea I had a few years ago but
> never managed to implement.
> It was about avoiding unnecessary COW, not for this space reservation
> exhaustion due to sustained memory pressure, but it would solve it
> too.
> 
> The idea was that we do unnecessary COW in cases like this:
> 
> 1) We COW a path in some tree and we are at transaction N;
> 
> 2) Writeback happened for the extent buffers in that path while we are
> in the same transaction, because we reached the 32M limit and some
> task called btrfs_btree_balance_dirty() or something else triggered
> writeback of the btree inode;
> 
> 3) While still at transaction N, we visit the same path to add an item
> to a leaf, or modify an item, whatever. Because the extent buffers
> have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> returns true).
> 
> So during the lifetime of a transaction we can have a lot of
> unnecessary COW - we spend more time allocating extents, allocating
> memory, copying extent buffer data, use more space per transaction,
> etc.
> 
> The idea was to not COW when an extent buffer has
> BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> (btrfs_header_generation(eb)) matches the current transaction.
> That is safe because there's no committed tree that points to an
> extent buffer created in the current transaction.
> 
> Any further modification to the extent buffer must be sure that the
> EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> transaction's dirty_pages io tree, etc, so that we don't miss writing
> the extent buffer to the same location again before the transaction
> commits the superblocks.
> 
> Have you considered an approach like this?

I had not considered this, but it is a great idea.

My first thought is that implementing this could be as simple
as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
would mess with the assumptions around the log tree. From
btrfs_sync_log():

/*
 * IO has been started, blocks of the log tree have WRITTEN flag set
 * in their headers. new modifications of the log will be written to
 * new positions. so it's safe to allow log writers to go in.
 */

^ Assumes that WRITTEN blocks will be COW'd.

The issue looks like:

 1. fsync A COWs eb
 2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
 3. fsync B does __not__ COW eb and modifies it
 4. fsync A writes modified eb to disk
 5. CRASH; the log tree is corrupted

One way to avoid that is to keep the current behavior for the log
tree, but that leaves the potential for COW amplification...

Another idea is to track the log_transid in the eb in the same way
the transid is tracked. Then, in should_cow_block we have something
like:

if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
    buf->log_transid != root->log_transid)
  return true;

Please let me know if you see any issues with this approach or
if you can think of a better method.

Thanks,
Leo

> 
> It would solve this space reservation exhaustion problem, as well as
> unnecessary COW for general optimization, without the need for a
> local xarray, which besides being very specific for the
> btrfs_search_slot() case (we COW in other places), also requires a
> memory allocation which can fail.
> 
> Thanks.


* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30  0:12   ` Leo Martins
@ 2026-01-30  4:14     ` Sun YangKai
  2026-01-30  9:37       ` Sun YangKai
  2026-01-30 12:49     ` Filipe Manana
  1 sibling, 1 reply; 21+ messages in thread
From: Sun YangKai @ 2026-01-30  4:14 UTC (permalink / raw)
  To: Leo Martins, Filipe Manana; +Cc: linux-btrfs, kernel-team

On 2026/1/30 08:12, Leo Martins wrote:
> On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
>> On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
>>> I've been investigating enospcs at Meta and have observed a strange
>>> pattern where filesystems are enospcing with lots of unallocated space
>>> (> 100G). Sample dmesg dump at bottom of message.
>>>
>>> btrfs_insert_delayed_dir_index is attempting to migrate some reservation
>>> from the transaction block reserve and finding it exhausted leading to a
>>> warning and enospc. This is a bug as the reservations are meant to be
>>> worst case. It should be impossible to exhaust the transaction block
>>> reserve.
>>>
>>> Some tracing of affected hosts revealed that there were single
>>> btrfs_search_slot calls that were COWing 100s of times. I was able to
>>> reproduce this behavior locally by creating a very constrained cgroup
>>> and producing a lot of concurrent filesystem operations. Here's the
>>> pattern:
>>>
>>>   1. btrfs_search_slot() begins tree traversal with cow=1
>>>   2. Node at level N needs COW (old generation or WRITTEN flag set)
>>>   3. btrfs_cow_block() allocates new node, updates parent pointer
>>>   4. Traversal continues, but hits a condition requiring restart (e.g., node
>>>      not cached, lock contention, need higher write_lock_level)
>>>   5. btrfs_release_path() releases all locks and references
>>>   6. Memory pressure triggers writeback on the COW'd node
>>>   7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>>>      BTRFS_HEADER_FLAG_WRITTEN
>>>   8. goto again - traversal restarts from root
>>>   9. Traversal reaches the freshly COW'd node
>>>   10. should_cow_block() sees WRITTEN flag set, returns true
>>>   11. btrfs_cow_block() allocates another new node - same logical position,
>>>       new physical location, new reservation consumed
>>>   12. Steps 4-11 repeat indefinitely under sustained memory pressure
>>>
>>> Note this behavior should be much harder to trigger since Boris's
>>> AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
>>> accounted for in user cgroups. However, I believe it
>>> would still be an issue under global memory pressure.
>>> Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
>>>
>>> This COW amplification breaks the idea that transaction reservations are
>>> worst case as any search slot call could find itself in this COW loop and
>>> exhaust its reservation.
>>>
>>> My proposed solution is to temporarily pin extent buffers for the
>>> lifetime of btrfs_search_slot. This prevents the massive COW
>>> amplification that can be seen during high memory pressure.
>>>
>>> The implementation uses a local xarray to track COW'd buffers for the
>>> duration of the search. The xarray stores extent_buffer pointers without
>>> taking additional references; this is safe because tracked buffers remain
>>> dirty (writeback_blockers prevents the dirty bit from being cleared) and
>>> dirty buffers cannot be reclaimed by memory pressure.
>>>
>>> Synchronization is provided by eb->lock: increments in
>>> btrfs_search_slot_track_cow() occur while holding the write lock, and
>>> the check in lock_extent_buffer_for_io() also holds the write lock via
>>> btrfs_tree_lock(). Decrements don't require eb->lock because
>>> writeback_blockers is atomic and merely indicates "don't write yet".
>>> Once we decrement, we're done and don't care if writeback proceeds
>>> immediately.
>> This seems too complex to me.
>>
>> So this problem is very similar to some idea I had a few years ago but
>> never managed to implement.
>> It was about avoiding unnecessary COW, not for this space reservation
>> exhaustion due to sustained memory pressure, but it would solve it
>> too.
>>
>> The idea was that we do unnecessary COW in cases like this:
>>
>> 1) We COW a path in some tree and we are at transaction N;
>>
>> 2) Writeback happened for the extent buffers in that path while we are
>> in the same transaction, because we reached the 32M limit and some
>> task called btrfs_btree_balance_dirty() or something else triggered
>> writeback of the btree inode;
>>
>> 3) While still at transaction N, we visit the same path to add an item
>> to a leaf, or modify an item, whatever. Because the extent buffers
>> have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
>> returns true).
>>
>> So during the lifetime of a transaction we can have a lot of
>> unnecessary COW - we spend more time allocating extents, allocating
>> memory, copying extent buffer data, use more space per transaction,
>> etc.
>>
>> The idea was to not COW when an extent buffer has
>> BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
>> (btrfs_header_generation(eb)) matches the current transaction.
>> That is safe because there's no committed tree that points to an
>> extent buffer created in the current transaction.
>>
>> Any further modification to the extent buffer must be sure that the
>> EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
>> transaction's dirty_pages io tree, etc, so that we don't miss writing
>> the extent buffer to the same location again before the transaction
>> commits the superblocks.
>>
>> Have you considered an approach like this?
> I had not considered this, but it is a great idea.
>
> My first thought is that implementing this could be as simple
> as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> would mess with the assumptions around the log tree. From
> btrfs_sync_log():
After a quick look and some tests, I found that things might not be that
easy. The problem is not only the log tree.
> /*
>   * IO has been started, blocks of the log tree have WRITTEN flag set
>   * in their headers. new modifications of the log will be written to
>   * new positions. so it's safe to allow log writers to go in.
>   */
>
> ^ Assumes that WRITTEN blocks will be COW'd.
>
> The issue looks like:
>
>   1. fsync A COWs eb
>   2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
>   3. fsync B does __not__ COW eb and modifies it
>   4. fsync A writes modified eb to disk
>   5. CRASH; the log tree is corrupted
>
> One way to avoid that is to keep the current behavior for the log
> tree, but that leaves the potential for COW amplification...
I tested with a patch like this:
@@ -624,14 +624,18 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
         if (btrfs_header_generation(buf) != trans->transid)
                 return true;

-       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
-               return true;
-
         /* Ensure we can see the FORCE_COW bit. */
         smp_mb__before_atomic();
         if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
                 return true;

+       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
+                       return true;
+               btrfs_mark_buffer_dirty(trans, buf);
+               return false;
+       }
+
         if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
                 return false;

And got some errors like this:


[  +0.090163] [ T2589] run fstests btrfs/004 at 2026-01-30 11:53:37
[  +0.432352] [T11685] BTRFS: device fsid 1fb397fc-97a7-44dd-9602-dd38b74bc391 devid 1 transid 8 /dev/loop1 (7:1) scanned by mount (11685)
[  +0.000351] [T11685] BTRFS info (device loop1): first mount of filesystem 1fb397fc-97a7-44dd-9602-dd38b74bc391
[  +0.000014] [T11685] BTRFS info (device loop1): using crc32c (crc32c-lib) checksum algorithm
[  +0.001298] [T11685] BTRFS info (device loop1): checking UUID tree
[  +0.000039] [T11685] BTRFS info (device loop1): enabling ssd optimizations
[  +0.000003] [T11685] BTRFS info (device loop1): turning on async discard
[  +0.000002] [T11685] BTRFS info (device loop1): enabling free space tree
[  +1.051781] [T11703] page: refcount:2 mapcount:0 mapping:00000000eb6d7caa index:0x2348 pfn:0x1caebf
[  +0.000008] [T11703] memcg:ffff9b3300263cc0
[  +0.000003] [T11703] aops:0xffffffffc0354040 ino:1
[  +0.000024] [T11703] flags: 0x4e0000000000423e(referenced|uptodate|dirty|lru|workingset|private|writeback|zone=1)
[  +0.000007] [T11703] raw: 4e0000000000423e fffff74a872bb908 fffff74a84206a88 ffff9b33c6706880
[  +0.000004] [T11703] raw: 0000000000002348 ffff9b334be522d0 00000002ffffffff ffff9b3300263cc0
[  +0.000002] [T11703] page dumped because: eb page dump
[  +0.000003] [T11703] BTRFS critical (device loop1): corrupt leaf: root=5 block=36995072 slot=118 ino=406 file_offset=94208, invalid ram_bytes for file extent, have 8660273067269322872, should be aligned to 4096
[  +0.000013] [T11703] BTRFS info (device loop1): leaf 36995072 gen 33 total ptrs 128 free space 2857 owner 5
[  +0.000006] [T11703]     item 0 key (386 DIR_ITEM 238230307) itemoff 
16249 itemsize 34
[  +0.000004] [T11703]         location key (462 1 0) type 2
[  +0.000003] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000003] [T11703]     item 1 key (386 DIR_ITEM 1473745676) itemoff 
16216 itemsize 33
[  +0.000004] [T11703]         location key (376 1 0) type 3
[  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000003] [T11703]     item 2 key (386 DIR_ITEM 2243137595) itemoff 
16182 itemsize 34
[  +0.000004] [T11703]         location key (413 1 0) type 1
[  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
[  +0.000003] [T11703]     item 3 key (386 DIR_ITEM 2980467489) itemoff 
16148 itemsize 34
[  +0.000003] [T11703]         location key (478 1 0) type 1
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000003] [T11703]     item 4 key (386 DIR_ITEM 3091124746) itemoff 
16115 itemsize 33
[  +0.000002] [T11703]         location key (474 1 0) type 3
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 3
[  +0.000001] [T11703]     item 5 key (386 DIR_ITEM 4127802504) itemoff 
16082 itemsize 33
[  +0.000003] [T11703]         location key (407 1 0) type 2
[  +0.000001] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000002] [T11703]     item 6 key (386 DIR_INDEX 2) itemoff 16049 
itemsize 33
[  +0.000003] [T11703]         location key (407 1 0) type 2
[  +0.000001] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000002] [T11703]     item 7 key (386 DIR_INDEX 4) itemoff 16016 
itemsize 33
[  +0.000002] [T11703]         location key (376 1 0) type 3
[  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000002] [T11703]     item 8 key (386 DIR_INDEX 5) itemoff 15982 
itemsize 34
[  +0.000002] [T11703]         location key (413 1 0) type 1
[  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
[  +0.000002] [T11703]     item 9 key (386 DIR_INDEX 6) itemoff 15948 
itemsize 34
[  +0.000002] [T11703]         location key (462 1 0) type 2
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 10 key (386 DIR_INDEX 7) itemoff 15915 
itemsize 33
[  +0.000003] [T11703]         location key (474 1 0) type 3
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 3
[  +0.000002] [T11703]     item 11 key (386 DIR_INDEX 8) itemoff 15881 
itemsize 34
[  +0.000002] [T11703]         location key (478 1 0) type 1
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000002] [T11703]     item 12 key (387 INODE_ITEM 0) itemoff 15721 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 25 size 0 
nbytes 0
[  +0.000003] [T11703]         block group 0 mode 20444 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 0 flags 0x0
[  +0.000002] [T11703]         atime 1769745218.410248188
[  +0.000002] [T11703]         ctime 1769745218.410248188
[  +0.000002] [T11703]         mtime 1769745218.410248188
[  +0.000002] [T11703]         otime 1769745218.410248188
[  +0.000002] [T11703]     item 13 key (387 INODE_REF 324) itemoff 15708 
itemsize 13
[  +0.000002] [T11703]         index 13 name_len 3
[  +0.000002] [T11703]     item 14 key (389 INODE_ITEM 0) itemoff 15548 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 33 size 
375425 nbytes 8192
[  +0.000003] [T11703]         block group 0 mode 100666 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 4 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.410248188
[  +0.000002] [T11703]         ctime 1769745218.662385845
[  +0.000002] [T11703]         mtime 1769745218.662385845
[  +0.000001] [T11703]         otime 1769745218.410248188
[  +0.000002] [T11703]     item 15 key (389 INODE_REF 334) itemoff 15535 
itemsize 13
[  +0.000002] [T11703]         index 7 name_len 3
[  +0.000002] [T11703]     item 16 key (389 XATTR_ITEM 1939513822) 
itemoff 15435 itemsize 100
[  +0.000003] [T11703]         location key (0 0 0) type 8
[  +0.000002] [T11703]         transid 33 data_len 63 name_len 7
[  +0.000002] [T11703]     item 17 key (389 EXTENT_DATA 368640) itemoff 
15382 itemsize 53
[  +0.000002] [T11703]         generation 31 type 1
[  +0.000002] [T11703]         extent data disk bytenr 17526784 nr 8192
[  +0.000002] [T11703]         extent data offset 0 nr 4096 ram 8192
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 18 key (389 EXTENT_DATA 372736) itemoff 
15329 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 14102528 nr 4096
[  +0.000002] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 19 key (389 EXTENT_DATA 1179648) itemoff 
15276 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000002] [T11703]         extent data disk bytenr 370638848 nr 28672
[  +0.000002] [T11703]         extent data offset 0 nr 28672 ram 28672
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 20 key (390 INODE_ITEM 0) itemoff 15116 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 33 size 0 
nbytes 0
[  +0.000002] [T11703]         block group 0 mode 20444 links 2 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 3 flags 0x0
[  +0.000002] [T11703]         atime 1769745218.414620392
[  +0.000002] [T11703]         ctime 1769745218.892384583
[  +0.000001] [T11703]         mtime 1769745218.414620392
[  +0.000002] [T11703]         otime 1769745218.414620392
[  +0.000002] [T11703]     item 21 key (390 INODE_REF 300) itemoff 15103 
itemsize 13
[  +0.000002] [T11703]         index 13 name_len 3
[  +0.000002] [T11703]     item 22 key (390 INODE_REF 327) itemoff 15089 
itemsize 14
[  +0.000002] [T11703]         index 17 name_len 4
[  +0.000001] [T11703]     item 23 key (391 INODE_ITEM 0) itemoff 14929 
itemsize 160
[  +0.000003] [T11703]         inode generation 25 transid 30 size 12 
nbytes 0
[  +0.000002] [T11703]         block group 0 mode 40777 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 3 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.414620392
[  +0.000002] [T11703]         ctime 1769745218.609052805
[  +0.000002] [T11703]         mtime 1769745218.609052805
[  +0.000001] [T11703]         otime 1769745218.414620392
[  +0.000002] [T11703]     item 24 key (391 INODE_REF 369) itemoff 14916 
itemsize 13
[  +0.000002] [T11703]         index 3 name_len 3
[  +0.000002] [T11703]     item 25 key (391 DIR_ITEM 2351632656) itemoff 
14883 itemsize 33
[  +0.000002] [T11703]         location key (430 1 0) type 3
[  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000002] [T11703]     item 26 key (391 DIR_ITEM 3050776180) itemoff 
14850 itemsize 33
[  +0.000002] [T11703]         location key (392 1 0) type 3
[  +0.000002] [T11703]         transid 25 data_len 0 name_len 3
[  +0.000002] [T11703]     item 27 key (391 DIR_INDEX 2) itemoff 14817 
itemsize 33
[  +0.000002] [T11703]         location key (392 1 0) type 3
[  +0.000002] [T11703]         transid 25 data_len 0 name_len 3
[  +0.000002] [T11703]     item 28 key (391 DIR_INDEX 3) itemoff 14784 
itemsize 33
[  +0.000002] [T11703]         location key (430 1 0) type 3
[  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000001] [T11703]     item 29 key (392 INODE_ITEM 0) itemoff 14624 
itemsize 160
[  +0.000003] [T11703]         inode generation 25 transid 25 size 0 
nbytes 0
[  +0.000001] [T11703]         block group 0 mode 20444 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 2 flags 0x0
[  +0.000002] [T11703]         atime 1769745218.425720477
[  +0.000002] [T11703]         ctime 1769745218.429053792
[  +0.000001] [T11703]         mtime 1769745218.425720477
[  +0.000002] [T11703]         otime 1769745218.425720477
[  +0.000002] [T11703]     item 30 key (392 INODE_REF 391) itemoff 14611 
itemsize 13
[  +0.000002] [T11703]         index 2 name_len 3
[  +0.000001] [T11703]     item 31 key (393 INODE_ITEM 0) itemoff 14451 
itemsize 160
[  +0.000003] [T11703]         inode generation 25 transid 25 size 0 
nbytes 0
[  +0.000001] [T11703]         block group 0 mode 20000 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 19 flags 0x0
[  +0.000002] [T11703]         atime 1769745218.429053792
[  +0.000002] [T11703]         ctime 1769745218.429053792
[  +0.000001] [T11703]         mtime 1769745218.429053792
[  +0.000002] [T11703]         otime 1769745218.429053792
[  +0.000002] [T11703]     item 32 key (393 INODE_REF 377) itemoff 14438 
itemsize 13
[  +0.000002] [T11703]         index 4 name_len 3
[  +0.000002] [T11703]     item 33 key (394 INODE_ITEM 0) itemoff 14278 
itemsize 160
[  +0.000023] [T11703]         inode generation 25 transid 29 size 0 
nbytes 0
[  +0.000002] [T11703]         block group 0 mode 20444 links 1 uid 
116392077 gid 49956004
[  +0.000002] [T11703]         rdev 0 sequence 20 flags 0x0
[  +0.000017] [T11703]         atime 1769745218.435720422
[  +0.000005] [T11703]         ctime 1769745218.575719654
[  +0.000002] [T11703]         mtime 1769745218.435720422
[  +0.000002] [T11703]         otime 1769745218.435720422
[  +0.000002] [T11703]     item 34 key (394 INODE_REF 336) itemoff 14265 
itemsize 13
[  +0.000003] [T11703]         index 11 name_len 3
[  +0.000002] [T11703]     item 35 key (395 INODE_ITEM 0) itemoff 14105 
itemsize 160
[  +0.000003] [T11703]         inode generation 25 transid 33 size 14 
nbytes 0
[  +0.000003] [T11703]         block group 0 mode 40777 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 31 flags 0x0
[  +0.000002] [T11703]         atime 1769745218.436974912
[  +0.000002] [T11703]         ctime 1769745218.892384583
[  +0.000002] [T11703]         mtime 1769745218.892384583
[  +0.000001] [T11703]         otime 1769745218.436974912
[  +0.000002] [T11703]     item 36 key (395 INODE_REF 300) itemoff 14092 
itemsize 13
[  +0.000003] [T11703]         index 14 name_len 3
[  +0.000002] [T11703]     item 37 key (395 DIR_ITEM 1909090157) itemoff 
14059 itemsize 33
[  +0.000002] [T11703]         location key (410 1 0) type 1
[  +0.000002] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000002] [T11703]     item 38 key (395 DIR_ITEM 2010649400) itemoff 
14025 itemsize 34
[  +0.000003] [T11703]         location key (448 1 0) type 1
[  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
[  +0.000001] [T11703]     item 39 key (395 DIR_INDEX 3) itemoff 13992 
itemsize 33
[  +0.000003] [T11703]         location key (410 1 0) type 1
[  +0.000001] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000002] [T11703]     item 40 key (395 DIR_INDEX 7) itemoff 13958 
itemsize 34
[  +0.000002] [T11703]         location key (448 1 0) type 1
[  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
[  +0.000002] [T11703]     item 41 key (396 INODE_ITEM 0) itemoff 13798 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 33 size 
4498598 nbytes 368640
[  +0.000003] [T11703]         block group 0 mode 100666 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 43 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.862384748
[  +0.000002] [T11703]         ctime 1769745218.872384693
[  +0.000002] [T11703]         mtime 1769745218.872384693
[  +0.000002] [T11703]         otime 1769745218.436974912
[  +0.000003] [T11703]     item 42 key (396 INODE_REF 334) itemoff 13785 
itemsize 13
[  +0.000003] [T11703]         index 8 name_len 3
[  +0.000002] [T11703]     item 43 key (396 EXTENT_DATA 1015808) itemoff 
13732 itemsize 53
[  +0.000004] [T11703]         generation 26 type 1
[  +0.000002] [T11703]         extent data disk bytenr 20709376 nr 4096
[  +0.000003] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 44 key (396 EXTENT_DATA 1376256) itemoff 
13679 itemsize 53
[  +0.000003] [T11703]         generation 27 type 1
[  +0.000002] [T11703]         extent data disk bytenr 362196992 nr 131072
[  +0.000003] [T11703]         extent data offset 0 nr 73728 ram 131072
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 45 key (396 EXTENT_DATA 1449984) itemoff 
13626 itemsize 53
[  +0.000003] [T11703]         generation 29 type 1
[  +0.000003] [T11703]         extent data disk bytenr 356175872 nr 114688
[  +0.000002] [T11703]         extent data offset 0 nr 114688 ram 114688
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 46 key (396 EXTENT_DATA 1703936) itemoff 
13573 itemsize 53
[  +0.000004] [T11703]         generation 27 type 1
[  +0.000002] [T11703]         extent data disk bytenr 361512960 nr 102400
[  +0.000002] [T11703]         extent data offset 0 nr 102400 ram 102400
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 47 key (396 EXTENT_DATA 2301952) itemoff 
13520 itemsize 53
[  +0.000003] [T11703]         generation 31 type 1
[  +0.000002] [T11703]         extent data disk bytenr 15396864 nr 32768
[  +0.000003] [T11703]         extent data offset 0 nr 32768 ram 32768
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 48 key (396 EXTENT_DATA 3616768) itemoff 
13467 itemsize 53
[  +0.000002] [T11703]         generation 31 type 1
[  +0.000002] [T11703]         extent data disk bytenr 17121280 nr 49152
[  +0.000001] [T11703]         extent data offset 0 nr 40960 ram 49152
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 49 key (396 EXTENT_DATA 4497408) itemoff 
13414 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 15429632 nr 4096
[  +0.000002] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 50 key (396 EXTENT_DATA 4562944) itemoff 
13361 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000002] [T11703]         extent data disk bytenr 16396288 nr 8192
[  +0.000001] [T11703]         extent data offset 0 nr 8192 ram 8192
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 51 key (397 INODE_ITEM 0) itemoff 13201 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 32 size 
3371008 nbytes 757760
[  +0.000002] [T11703]         block group 0 mode 100666 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 40 flags 0x10
[  +0.000002] [T11703]         atime 1769745218.436974912
[  +0.000002] [T11703]         ctime 1769745218.732385461
[  +0.000001] [T11703]         mtime 1769745218.732385461
[  +0.000002] [T11703]         otime 1769745218.436974912
[  +0.000002] [T11703]     item 52 key (397 INODE_REF 344) itemoff 13188 
itemsize 13
[  +0.000002] [T11703]         index 5 name_len 3
[  +0.000002] [T11703]     item 53 key (397 EXTENT_DATA 204800) itemoff 
13135 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000002] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000001] [T11703]         extent data offset 421888 nr 8192 ram 954368
[  +0.000002] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 54 key (397 EXTENT_DATA 253952) itemoff 
13082 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000002] [T11703]         extent data offset 471040 nr 28672 ram 954368
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 55 key (397 EXTENT_DATA 507904) itemoff 
13029 itemsize 53
[  +0.000003] [T11703]         generation 31 type 1
[  +0.000001] [T11703]         extent data disk bytenr 17121280 nr 49152
[  +0.000002] [T11703]         extent data offset 0 nr 40960 ram 49152
[  +0.000001] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 56 key (397 EXTENT_DATA 643072) itemoff 
12976 itemsize 53
[  +0.000002] [T11703]         generation 27 type 1
[  +0.000002] [T11703]         extent data disk bytenr 362475520 nr 77824
[  +0.000001] [T11703]         extent data offset 0 nr 57344 ram 77824
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 57 key (397 EXTENT_DATA 700416) itemoff 
12923 itemsize 53
[  +0.000003] [T11703]         generation 32 type 1
[  +0.000001] [T11703]         extent data disk bytenr 353513472 nr 16384
[  +0.000002] [T11703]         extent data offset 0 nr 16384 ram 16384
[  +0.000001] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 58 key (397 EXTENT_DATA 716800) itemoff 
12870 itemsize 53
[  +0.000002] [T11703]         generation 27 type 1
[  +0.000002] [T11703]         extent data disk bytenr 362475520 nr 77824
[  +0.000001] [T11703]         extent data offset 73728 nr 4096 ram 77824
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 59 key (397 EXTENT_DATA 720896) itemoff 
12817 itemsize 53
[  +0.000003] [T11703]         generation 27 type 1
[  +0.000001] [T11703]         extent data disk bytenr 19050496 nr 4096
[  +0.000002] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000001] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 60 key (397 EXTENT_DATA 901120) itemoff 
12764 itemsize 53
[  +0.000002] [T11703]         generation 30 type 1
[  +0.000001] [T11703]         extent data disk bytenr 364146688 nr 28672
[  +0.000001] [T11703]         extent data offset 0 nr 28672 ram 28672
[  +0.000001] [T11703]         extent compression 0
[  +0.000002] [T11703]     item 61 key (397 EXTENT_DATA 1875968) itemoff 
12711 itemsize 53
[  +0.000001] [T11703]         generation 24 type 1
[  +0.000002] [T11703]         extent data disk bytenr 357761024 nr 73728
[  +0.000001] [T11703]         extent data offset 40960 nr 28672 ram 73728
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 62 key (397 EXTENT_DATA 1904640) itemoff 
12658 itemsize 53
[  +0.000002] [T11703]         generation 27 type 1
[  +0.000001] [T11703]         extent data disk bytenr 355385344 nr 4096
[  +0.000002] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 63 key (397 EXTENT_DATA 2244608) itemoff 
12605 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 366755840 nr 102400
[  +0.000001] [T11703]         extent data offset 0 nr 102400 ram 102400
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 64 key (397 EXTENT_DATA 2990080) itemoff 
12552 itemsize 53
[  +0.000001] [T11703]         generation 32 type 2
[  +0.000001] [T11703]         extent data disk bytenr 153870336 nr 585728
[  +0.000002] [T11703]         extent data offset 16384 nr 172032 ram 585728
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 65 key (397 EXTENT_DATA 3162112) itemoff 
12499 itemsize 53
[  +0.000002] [T11703]         generation 32 type 1
[  +0.000001] [T11703]         extent data disk bytenr 153870336 nr 585728
[  +0.000001] [T11703]         extent data offset 188416 nr 110592 ram 
585728
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 66 key (397 EXTENT_DATA 3272704) itemoff 
12446 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 369012736 nr 4096
[  +0.000001] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 67 key (397 EXTENT_DATA 3276800) itemoff 
12393 itemsize 53
[  +0.000001] [T11703]         generation 33 type 2
[  +0.000001] [T11703]         extent data disk bytenr 161648640 nr 643072
[  +0.000002] [T11703]         extent data offset 0 nr 643072 ram 643072
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 68 key (398 INODE_ITEM 0) itemoff 12233 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 32 size 
1390453 nbytes 172032
[  +0.000001] [T11703]         block group 0 mode 100666 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 30 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.436974912
[  +0.000002] [T11703]         ctime 1769745218.522719547
[  +0.000001] [T11703]         mtime 1769745218.522719547
[  +0.000001] [T11703]         otime 1769745218.436974912
[  +0.000001] [T11703]     item 69 key (398 INODE_REF 369) itemoff 12220 
itemsize 13
[  +0.000002] [T11703]         index 4 name_len 3
[  +0.000001] [T11703]     item 70 key (398 EXTENT_DATA 40960) itemoff 
12167 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000001] [T11703]         extent data offset 421888 nr 8192 ram 954368
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 71 key (398 EXTENT_DATA 90112) itemoff 
12114 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000001] [T11703]         extent data offset 471040 nr 28672 ram 954368
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 72 key (398 EXTENT_DATA 344064) itemoff 
12061 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000001] [T11703]         extent data offset 421888 nr 8192 ram 954368
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 73 key (398 EXTENT_DATA 393216) itemoff 
12008 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000001] [T11703]         extent data offset 471040 nr 28672 ram 954368
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 74 key (398 EXTENT_DATA 753664) itemoff 
11955 itemsize 53
[  +0.000002] [T11703]         generation 27 type 1
[  +0.000001] [T11703]         extent data disk bytenr 21372928 nr 16384
[  +0.000001] [T11703]         extent data offset 0 nr 16384 ram 16384
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 75 key (398 EXTENT_DATA 806912) itemoff 
11902 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 366972928 nr 102400
[  +0.000001] [T11703]         extent data offset 0 nr 102400 ram 102400
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 76 key (398 EXTENT_DATA 962560) itemoff 
11849 itemsize 53
[  +0.000001] [T11703]         generation 31 type 1
[  +0.000001] [T11703]         extent data disk bytenr 17121280 nr 49152
[  +0.000002] [T11703]         extent data offset 0 nr 40960 ram 49152
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 77 key (398 EXTENT_DATA 1343488) itemoff 
11796 itemsize 53
[  +0.000002] [T11703]         generation 31 type 1
[  +0.000001] [T11703]         extent data disk bytenr 17121280 nr 49152
[  +0.000001] [T11703]         extent data offset 8192 nr 32768 ram 49152
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 78 key (398 EXTENT_DATA 1388544) itemoff 
11743 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 14479360 nr 4096
[  +0.000001] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 79 key (399 INODE_ITEM 0) itemoff 11583 
itemsize 160
[  +0.000001] [T11703]         inode generation 25 transid 32 size 
585093 nbytes 339968
[  +0.000002] [T11703]         block group 0 mode 100666 links 1 uid 0 gid 0
[  +0.000001] [T11703]         rdev 0 sequence 31 flags 0x10
[  +0.000001] [T11703]         atime 1769745218.552386449
[  +0.000002] [T11703]         ctime 1769745218.649404376
[  +0.000001] [T11703]         mtime 1769745218.649404376
[  +0.000001] [T11703]         otime 1769745218.438593902
[  +0.000002] [T11703]     item 80 key (399 INODE_REF 297) itemoff 11570 
itemsize 13
[  +0.000001] [T11703]         index 14 name_len 3
[  +0.000002] [T11703]     item 81 key (399 EXTENT_DATA 49152) itemoff 
11517 itemsize 53
[  +0.000001] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000002] [T11703]         extent data offset 421888 nr 8192 ram 954368
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 82 key (399 EXTENT_DATA 98304) itemoff 
11464 itemsize 53
[  +0.000002] [T11703]         generation 14 type 2
[  +0.000001] [T11703]         extent data disk bytenr 135266304 nr 954368
[  +0.000001] [T11703]         extent data offset 471040 nr 28672 ram 954368
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 83 key (399 EXTENT_DATA 163840) itemoff 
11411 itemsize 53
[  +0.000002] [T11703]         generation 27 type 1
[  +0.000001] [T11703]         extent data disk bytenr 362151936 nr 45056
[  +0.000001] [T11703]         extent data offset 0 nr 45056 ram 45056
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 84 key (399 EXTENT_DATA 229376) itemoff 
11358 itemsize 53
[  +0.000002] [T11703]         generation 30 type 1
[  +0.000001] [T11703]         extent data disk bytenr 363864064 nr 98304
[  +0.000001] [T11703]         extent data offset 0 nr 98304 ram 98304
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 85 key (399 EXTENT_DATA 335872) itemoff 
11305 itemsize 53
[  +0.000001] [T11703]         generation 25 type 2
[  +0.000001] [T11703]         extent data disk bytenr 359723008 nr 94208
[  +0.000002] [T11703]         extent data offset 0 nr 4096 ram 94208
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 86 key (399 EXTENT_DATA 344064) itemoff 
11252 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 14061568 nr 4096
[  +0.000001] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 87 key (399 EXTENT_DATA 356352) itemoff 
11199 itemsize 53
[  +0.000002] [T11703]         generation 15 type 2
[  +0.000001] [T11703]         extent data disk bytenr 138276864 nr 282624
[  +0.000001] [T11703]         extent data offset 69632 nr 16384 ram 282624
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 88 key (399 EXTENT_DATA 479232) itemoff 
11146 itemsize 53
[  +0.000002] [T11703]         generation 31 type 1
[  +0.000001] [T11703]         extent data disk bytenr 17408000 nr 106496
[  +0.000001] [T11703]         extent data offset 0 nr 102400 ram 106496
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 89 key (399 EXTENT_DATA 581632) itemoff 
11093 itemsize 53
[  +0.000001] [T11703]         generation 33 type 1
[  +0.000002] [T11703]         extent data disk bytenr 13647872 nr 4096
[  +0.000001] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 90 key (399 EXTENT_DATA 1445888) itemoff 
11040 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 21803008 nr 32768
[  +0.000002] [T11703]         extent data offset 0 nr 28672 ram 32768
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 91 key (399 EXTENT_DATA 1474560) itemoff 
10987 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 13877248 nr 4096
[  +0.000001] [T11703]         extent data offset 0 nr 4096 ram 4096
[  +0.000001] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 92 key (399 EXTENT_DATA 1892352) itemoff 
10934 itemsize 53
[  +0.000002] [T11703]         generation 33 type 1
[  +0.000001] [T11703]         extent data disk bytenr 369930240 nr 65536
[  +0.000001] [T11703]         extent data offset 0 nr 65536 ram 65536
[  +0.000002] [T11703]         extent compression 0
[  +0.000001] [T11703]     item 93 key (400 INODE_ITEM 0) itemoff 10774 
itemsize 160
[  +0.000001] [T11703]         inode generation 25 transid 30 size 0 
nbytes 0
[  +0.000002] [T11703]         block group 0 mode 20444 links 1 uid 0 gid 0
[  +0.000001] [T11703]         rdev 0 sequence 23 flags 0x0
[  +0.000002] [T11703]         atime 1769745218.439782042
[  +0.000001] [T11703]         ctime 1769745218.632386010
[  +0.000001] [T11703]         mtime 1769745218.439782042
[  +0.000001] [T11703]         otime 1769745218.439782042
[  +0.000002] [T11703]     item 94 key (400 INODE_REF 272) itemoff 10761 
itemsize 13
[  +0.000001] [T11703]         index 44 name_len 3
[  +0.000002] [T11703]     item 95 key (401 INODE_ITEM 0) itemoff 10601 
itemsize 160
[  +0.000001] [T11703]         inode generation 25 transid 33 size 14 
nbytes 0
[  +0.000002] [T11703]         block group 0 mode 40777 links 1 uid 0 gid 0
[  +0.000001] [T11703]         rdev 0 sequence 22 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.444837643
[  +0.000002] [T11703]         ctime 1769745218.839051542
[  +0.000001] [T11703]         mtime 1769745218.839051542
[  +0.000001] [T11703]         otime 1769745218.444837643
[  +0.000002] [T11703]     item 96 key (401 INODE_REF 319) itemoff 10588 
itemsize 13
[  +0.000001] [T11703]         index 10 name_len 3
[  +0.000001] [T11703]     item 97 key (401 DIR_ITEM 1191232959) itemoff 
10554 itemsize 34
[  +0.000002] [T11703]         location key (459 1 0) type 3
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 98 key (401 DIR_ITEM 3224377935) itemoff 
10521 itemsize 33
[  +0.000002] [T11703]         location key (429 1 0) type 1
[  +0.000001] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000001] [T11703]     item 99 key (401 DIR_INDEX 2) itemoff 10488 
itemsize 33
[  +0.000002] [T11703]         location key (429 1 0) type 1
[  +0.000001] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000002] [T11703]     item 100 key (401 DIR_INDEX 3) itemoff 10454 
itemsize 34
[  +0.000001] [T11703]         location key (459 1 0) type 3
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 101 key (402 INODE_ITEM 0) itemoff 10294 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 30 size 0 
nbytes 0
[  +0.000001] [T11703]         block group 0 mode 20000 links 1 uid 10 
gid 25
[  +0.000002] [T11703]         rdev 0 sequence 20 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.446861225
[  +0.000001] [T11703]         ctime 1769745218.609052805
[  +0.000002] [T11703]         mtime 1769745218.446861225
[  +0.000001] [T11703]         otime 1769745218.446861225
[  +0.000001] [T11703]     item 102 key (402 INODE_REF 336) itemoff 
10281 itemsize 13
[  +0.000002] [T11703]         index 12 name_len 3
[  +0.000001] [T11703]     item 103 key (403 INODE_ITEM 0) itemoff 10121 
itemsize 160
[  +0.000002] [T11703]         inode generation 25 transid 33 size 48 
nbytes 0
[  +0.000001] [T11703]         block group 0 mode 40777 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 40 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.446861225
[  +0.000001] [T11703]         ctime 1769745218.802385077
[  +0.000001] [T11703]         mtime 1769745218.802385077
[  +0.000002] [T11703]         otime 1769745218.446861225
[  +0.000001] [T11703]     item 104 key (403 INODE_REF 335) itemoff 
10108 itemsize 13
[  +0.000002] [T11703]         index 8 name_len 3
[  +0.000001] [T11703]     item 105 key (403 DIR_ITEM 383218927) itemoff 
10074 itemsize 34
[  +0.000002] [T11703]         location key (299 1 0) type 3
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 106 key (403 DIR_ITEM 445930716) itemoff 
10041 itemsize 33
[  +0.000002] [T11703]         location key (287 1 0) type 3
[  +0.000001] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000002] [T11703]     item 107 key (403 DIR_ITEM 1238590424) 
itemoff 10008 itemsize 33
[  +0.000001] [T11703]         location key (418 1 0) type 2
[  +0.000002] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000001] [T11703]     item 108 key (403 DIR_ITEM 1779809903) 
itemoff 9974 itemsize 34
[  +0.000002] [T11703]         location key (486 1 0) type 2
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 109 key (403 DIR_ITEM 1827976929) 
itemoff 9941 itemsize 33
[  +0.000002] [T11703]         location key (263 1 0) type 3
[  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000001] [T11703]     item 110 key (403 DIR_ITEM 2397436283) 
itemoff 9907 itemsize 34
[  +0.000002] [T11703]         location key (458 1 0) type 7
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 111 key (403 DIR_ITEM 2479530324) 
itemoff 9873 itemsize 34
[  +0.000002] [T11703]         location key (457 1 0) type 3
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 112 key (403 DIR_ITEM 3110495684) 
itemoff 9839 itemsize 34
[  +0.000002] [T11703]         location key (466 1 0) type 3
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000002] [T11703]     item 113 key (403 DIR_ITEM 3884400930) 
itemoff 9806 itemsize 33
[  +0.000001] [T11703]         location key (416 1 0) type 1
[  +0.000002] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000001] [T11703]     item 114 key (403 DIR_INDEX 4) itemoff 9773 
itemsize 33
[  +0.000002] [T11703]         location key (416 1 0) type 1
[  +0.000001] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000001] [T11703]     item 115 key (403 DIR_INDEX 5) itemoff 9740 
itemsize 33
[  +0.000002] [T11703]         location key (418 1 0) type 2
[  +0.000001] [T11703]         transid 27 data_len 0 name_len 3
[  +0.000002] [T11703]     item 116 key (403 DIR_INDEX 7) itemoff 9707 
itemsize 33
[  +0.000001] [T11703]         location key (287 1 0) type 3
[  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000001] [T11703]     item 117 key (403 DIR_INDEX 8) itemoff 9674 
itemsize 33
[  +0.000002] [T11703]         location key (263 1 0) type 3
[  +0.000001] [T11703]         transid 30 data_len 0 name_len 3
[  +0.000001] [T11703]     item 118 key (403 DIR_INDEX 10) itemoff 9640 
itemsize 34
[  +0.000002] [T11703]         location key (299 1 0) type 3
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000002] [T11703]     item 119 key (403 DIR_INDEX 11) itemoff 9606 
itemsize 34
[  +0.000001] [T11703]         location key (457 1 0) type 3
[  +0.000002] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 120 key (403 DIR_INDEX 12) itemoff 9572 
itemsize 34
[  +0.000002] [T11703]         location key (458 1 0) type 7
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000001] [T11703]     item 121 key (404 INODE_ITEM 0) itemoff 9412 
itemsize 160
[  +0.000002] [T11703]         inode generation 27 transid 32 size 3080 
nbytes 3080
[  +0.000001] [T11703]         block group 0 mode 120777 links 1 uid 0 gid 0
[  +0.000002] [T11703]         rdev 0 sequence 21 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.505720038
[  +0.000001] [T11703]         ctime 1769745218.754448742
[  +0.000002] [T11703]         mtime 1769745218.505720038
[  +0.000001] [T11703]         otime 1769745218.505720038
[  +0.000001] [T11703]     item 122 key (404 INODE_REF 315) itemoff 9399 
itemsize 13
[  +0.000002] [T11703]         index 19 name_len 3
[  +0.000001] [T11703]     item 123 key (404 EXTENT_DATA 0) itemoff 6298 
itemsize 3101
[  +0.000002] [T11703]         generation 27 type 0
[  +0.000001] [T11703]         inline extent data size 3080 ram_bytes 
3080 compression 0
[  +0.000002] [T11703]     item 124 key (405 INODE_ITEM 0) itemoff 6138 
itemsize 160
[  +0.000001] [T11703]         inode generation 27 transid 30 size 20 
nbytes 0
[  +0.000002] [T11703]         block group 0 mode 40777 links 1 uid 31 
gid 901
[  +0.000002] [T11703]         rdev 0 sequence 26 flags 0x0
[  +0.000001] [T11703]         atime 1769745218.505720038
[  +0.000001] [T11703]         ctime 1769745218.662385845
[  +0.000001] [T11703]         mtime 1769745218.662385845
[  +0.000001] [T11703]         otime 1769745218.505720038
[  +0.000002] [T11703]     item 125 key (405 INODE_REF 327) itemoff 6125 
itemsize 13
[  +0.000001] [T11703]         index 12 name_len 3
[  +0.000002] [T11703]     item 126 key (405 DIR_ITEM 695610317) itemoff 
6091 itemsize 34
[  +0.000002] [T11703]         location key (435 1 0) type 2
[  +0.000001] [T11703]         transid 30 data_len 0 name_len 4
[  +0.000001] [T11703]     item 127 key (405 DIR_ITEM 828387202) itemoff 
6057 itemsize 34
[  +0.000002] [T11703]         location key (479 1 0) type 3
[  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
[  +0.000002] [T11703] BTRFS error (device loop1): block=36995072 write time tree block corruption detected
[  +0.003429] [T11703] BTRFS: error (device loop1) in btrfs_commit_transaction:2555: errno=-5 IO failure (Error while writing out transaction)
[  +0.000007] [T11703] BTRFS info (device loop1 state E): forced readonly
[  +0.000002] [T11703] BTRFS warning (device loop1 state E): Skipping commit of aborted transaction.
[  +0.000002] [T11703] BTRFS error (device loop1 state EA): Transaction aborted (error -5)
[  +0.000003] [T11703] BTRFS: error (device loop1 state EA) in cleanup_transaction:2037: errno=-5 IO failure

The reported inode 406 is not even in the printed leaf. This looks like a
data race, possibly caused by the following:

We unlock the eb after setting the WRITTEN flag during writeback, and the
eb should not be modified after that because all future writes will use
the COWed eb. However, with the WRITTEN flag check removed from
should_cow_block(), we might write to an eb that has the WRITTEN flag set
and may still be under IO.

To fix this, we need to check the DIRTY flag again to avoid writing an eb
that has had new data written to it, and lock the eb before actually doing
any IO-related work. I'm not familiar with the IO-related code, so please
correct me if I got anything wrong.
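
The race described above can be sketched with a small user-space model.
This is only an illustration under stated assumptions: struct toy_eb, its
fields, and the toy_* helpers are hypothetical stand-ins, not btrfs types
or functions, and the model ignores the log tree, FORCE_COW, and reloc
cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent buffer; not a real btrfs type. */
struct toy_eb {
	uint64_t generation;	/* models btrfs_header_generation() */
	bool written;		/* models BTRFS_HEADER_FLAG_WRITTEN */
	bool under_io;		/* models a write that is still in flight */
};

/* Strict rule: any WRITTEN buffer is COWed, as is any older generation. */
static bool toy_should_cow_strict(const struct toy_eb *eb, uint64_t transid)
{
	if (eb->generation != transid)
		return true;
	return eb->written;
}

/*
 * Relaxed rule (assumption: models only the removal of the WRITTEN check):
 * a buffer created in the current transaction is reused in place.
 */
static bool toy_should_cow_relaxed(const struct toy_eb *eb, uint64_t transid)
{
	return eb->generation != transid;
}

/*
 * An in-place modification races with writeback when the buffer is not
 * COWed while its previous contents are still being written out.
 */
static bool toy_modify_races_with_io(const struct toy_eb *eb, bool cow)
{
	assert(eb);
	return !cow && eb->under_io;
}
```

For a buffer with generation equal to the transid, written set, and
under_io set, the strict rule asks for a COW and no race is possible,
while the relaxed rule skips the COW and toy_modify_races_with_io()
reports the in-place write as racing with the write in flight.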

Thanks,

Sun Yangkai



* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30  4:14     ` Sun YangKai
@ 2026-01-30  9:37       ` Sun YangKai
  2026-01-30 15:50         ` Sun YangKai
  0 siblings, 1 reply; 21+ messages in thread
From: Sun YangKai @ 2026-01-30  9:37 UTC (permalink / raw)
  To: Leo Martins, Filipe Manana; +Cc: linux-btrfs, kernel-team



On 2026/1/30 12:14, Sun YangKai wrote:
> On 2026/1/30 08:12, Leo Martins wrote:
>> On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana<fdmanana@kernel.org> 
>> wrote:
>>> On Tue, Jan 27, 2026 at 8:43 PM Leo Martins<loemra.dev@gmail.com> wrote:
>>>> I've been investigating enospcs at Meta and have observed a strange
>>>> pattern where filesystems are enospcing with lots of unallocated space
>>>> (> 100G). Sample dmesg dump at bottom of message.
>>>>
>>>> btrfs_insert_delayed_dir_index is attempting to migrate some 
>>>> reservation
>>>> from the transaction block reserve and finding it exhausted leading 
>>>> to a
>>>> warning and enospc. This is a bug as the reservations are meant to be
>>>> worst case. It should be impossible to exhaust the transaction block
>>>> reserve.
>>>>
>>>> Some tracing of affected hosts revealed that there were single
>>>> btrfs_search_slot calls that were COWing 100s of times. I was able to
>>>> reproduce this behavior locally by creating a very constrained cgroup
>>>> and producing a lot of concurrent filesystem operations. Here's the
>>>> pattern:
>>>>
>>>>   1. btrfs_search_slot() begins tree traversal with cow=1
>>>>   2. Node at level N needs COW (old generation or WRITTEN flag set)
>>>>   3. btrfs_cow_block() allocates new node, updates parent pointer
>>>>   4. Traversal continues, but hits a condition requiring restart 
>>>> (e.g., node
>>>>      not cached, lock contention, need higher write_lock_level)
>>>>   5. btrfs_release_path() releases all locks and references
>>>>   6. Memory pressure triggers writeback on the COW'd node
>>>>   7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>>>>      BTRFS_HEADER_FLAG_WRITTEN
>>>>   8. goto again - traversal restarts from root
>>>>   9. Traversal reaches the freshly COW'd node
>>>>   10. should_cow_block() sees WRITTEN flag set, returns true
>>>>   11. btrfs_cow_block() allocates another new node - same logical 
>>>> position,
>>>>       new physical location, new reservation consumed
>>>>   12. Steps 4-11 repeat indefinitely under sustained memory pressure
>>>>
>>>> Note this behavior should be much harder to trigger since Boris's
>>>> AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
>>>> accounted for in user cgroups. However, I believe it
>>>> would still be an issue under global memory pressure.
>>>> Link:https://lore.kernel.org/linux-btrfs/ 
>>>> cover.1755812945.git.boris@bur.io/
>>>>
>>>> This COW amplification breaks the idea that transaction reservations 
>>>> are
>>>> worst case as any search slot call could find itself in this COW 
>>>> loop and
>>>> exhaust its reservation.
>>>>
>>>> My proposed solution is to temporarily pin extent buffers for the
>>>> lifetime of btrfs_search_slot. This prevents the massive COW
>>>> amplification that can be seen during high memory pressure.
>>>>
>>>> The implementation uses a local xarray to track COW'd buffers for the
>>>> duration of the search. The xarray stores extent_buffer pointers 
>>>> without
>>>> taking additional references; this is safe because tracked buffers 
>>>> remain
>>>> dirty (writeback_blockers prevents the dirty bit from being cleared) 
>>>> and
>>>> dirty buffers cannot be reclaimed by memory pressure.
>>>>
>>>> Synchronization is provided by eb->lock: increments in
>>>> btrfs_search_slot_track_cow() occur while holding the write lock, and
>>>> the check in lock_extent_buffer_for_io() also holds the write lock via
>>>> btrfs_tree_lock(). Decrements don't require eb->lock because
>>>> writeback_blockers is atomic and merely indicates "don't write yet".
>>>> Once we decrement, we're done and don't care if writeback proceeds
>>>> immediately.
>>> This seems too complex to me.
>>>
>>> So this problem is very similar to some idea I had a few years ago but
>>> never managed to implement.
>>> It was about avoiding unnecessary COW, not for this space reservation
>>> exhaustion due to sustained memory pressure, but it would solve it
>>> too.
>>>
>>> The idea was that we do unnecessary COW in cases like this:
>>>
>>> 1) We COW a path in some tree and we are at transaction N;
>>>
>>> 2) Writeback happened for the extent buffers in that path while we are
>>> in the same transaction, because we reached the 32M limit and some
>>> task called btrfs_btree_balance_dirty() or something else triggered
>>> writeback of the btree inode;
>>>
>>> 3) While still at transaction N, we visit the same path to add an item
>>> to a leaf, or modify an item, whatever. Because the extent buffers
>>> have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
>>> returns true).
>>>
>>> So during the lifetime of a transaction we can have a lot of
>>> unnecessary COW - we spend more time allocating extents, allocating
>>> memory, copying extent buffer data, use more space per transaction,
>>> etc.
>>>
>>> The idea was to not COW when an extent buffer has
>>> BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
>>> (btrfs_header_generation(eb)) matches the current transaction.
>>> That is safe because there's no committed tree that points to an
>>> extent buffer created in the current transaction.
>>>
>>> Any further modification to the extent buffer must be sure that the
>>> EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
>>> transaction's dirty_pages io tree, etc, so that we don't miss writing
>>> the extent buffer to the same location again before the transaction
>>> commits the superblocks.
>>>
>>> Have you considered an approach like this?
>> I had not considered this, but it is a great idea.
>>
>> My first thought is that implementing this could be as simple
>> as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
>> would mess with the assumptions around the log tree. From
>> btrfs_sync_log():
> After a fast glance and some tests, I found things might not be that 
> easy. The problem is not only the log tree.
>> /*
>>   * IO has been started, blocks of the log tree have WRITTEN flag set
>>   * in their headers. new modifications of the log will be written to
>>   * new positions. so it's safe to allow log writers to go in.
>>   */
>>
>> ^ Assumes that WRITTEN blocks will be COW'd.
>>
>> The issue looks like:
>>
>>   1. fsync A COWs eb
>>   2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
>>   3. fsync B does __not__ COW eb and modifies it
>>   4. fsync A writes modified eb to disk
>>   5. CRASH; the log tree is corrupted
>>
>> One way to avoid that is to keep the current behavior for the log
>> tree, but that leaves the potential for COW amplification...
> I tested with a patch like this:
> @@ -624,14 +624,18 @@ static inline bool should_cow_block(const struct 
> btrfs_trans_handle *trans,
>          if (btrfs_header_generation(buf) != trans->transid)
>                  return true;
> 
> -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> -               return true;
> -
>          /* Ensure we can see the FORCE_COW bit. */
>          smp_mb__before_atomic();
>          if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>                  return true;
> 
> +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
> +                       return true;
> +               btrfs_mark_buffer_dirty(trans, buf);
> +               return false;
> +       }
> +
>          if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> 
>                  return false;
> 
> And get some errors like this:
> 
> 
> [  +0.090163] [ T2589] run fstests btrfs/004 at 2026-01-30 11:53:37
> [  +0.432352] [T11685] BTRFS: device fsid 1fb397fc-97a7-44dd-9602- 
> dd38b74bc391 devid 1 transid 8 /dev/loop1 (7:1) scanned by mount (11685)
> [  +0.000351] [T11685] BTRFS info (device loop1): first mount of 
> filesystem 1fb397fc-97a7-44dd-9602-dd38b74bc391
> [  +0.000014] [T11685] BTRFS info (device loop1): using crc32c (crc32c- 
> lib) checksum algorithm
> [  +0.001298] [T11685] BTRFS info (device loop1): checking UUID tree
> [  +0.000039] [T11685] BTRFS info (device loop1): enabling ssd 
> optimizations
> [  +0.000003] [T11685] BTRFS info (device loop1): turning on async discard
> [  +0.000002] [T11685] BTRFS info (device loop1): enabling free space tree
> [  +1.051781] [T11703] page: refcount:2 mapcount:0 
> mapping:00000000eb6d7caa index:0x2348 pfn:0x1caebf
> [  +0.000008] [T11703] memcg:ffff9b3300263cc0
> [  +0.000003] [T11703] aops:0xffffffffc0354040 ino:1
> [  +0.000024] [T11703] flags: 0x4e0000000000423e(referenced|uptodate| 
> dirty|lru|workingset|private|writeback|zone=1)
> [  +0.000007] [T11703] raw: 4e0000000000423e fffff74a872bb908 
> fffff74a84206a88 ffff9b33c6706880
> [  +0.000004] [T11703] raw: 0000000000002348 ffff9b334be522d0 
> 00000002ffffffff ffff9b3300263cc0
> [  +0.000002] [T11703] page dumped because: eb page dump
> [  +0.000003] [T11703] BTRFS critical (device loop1): corrupt leaf: 
> root=5 block=36995072 slot=118 ino=406 file_offset=94208, invalid 
> ram_bytes for file extent, have 8660273067269322872, should be aligned 
> to 4096
> [  +0.000013] [T11703] BTRFS info (device loop1): leaf 36995072 gen 33 
> total ptrs 128 free space 2857 owner 5
> [  +0.000006] [T11703]     item 0 key (386 DIR_ITEM 238230307) itemoff 
> 16249 itemsize 34
> [  +0.000004] [T11703]         location key (462 1 0) type 2
> [  +0.000003] [T11703]         transid 33 data_len 0 name_len 4
> [  +0.000003] [T11703]     item 1 key (386 DIR_ITEM 1473745676) itemoff 
> 16216 itemsize 33
> [  +0.000004] [T11703]         location key (376 1 0) type 3
> [  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
> [  +0.000003] [T11703]     item 2 key (386 DIR_ITEM 2243137595) itemoff 
> 16182 itemsize 34
> [  +0.000004] [T11703]         location key (413 1 0) type 1
> [  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
> ...
> [  +0.000001] [T11703]     item 127 key (405 DIR_ITEM 828387202) itemoff 
> 6057 itemsize 34
> [  +0.000002] [T11703]         location key (479 1 0) type 3
> [  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
> [  +0.000002] [T11703] BTRFS error (device loop1): block=36995072 write 
> time tree block corruption detected
> [  +0.003429] [T11703] BTRFS: error (device loop1) in 
> btrfs_commit_transaction:2555: errno=-5 IO failure (Error while writing 
> out transaction)
> [  +0.000007] [T11703] BTRFS info (device loop1 state E): forced readonly
> [  +0.000002] [T11703] BTRFS warning (device loop1 state E): Skipping 
> commit of aborted transaction.
> [  +0.000002] [T11703] BTRFS error (device loop1 state EA): Transaction 
> aborted (error -5)
> [  +0.000003] [T11703] BTRFS: error (device loop1 state EA) in 
> cleanup_transaction:2037: errno=-5 IO failure
> 
> The reported 406 inode is even not in the printed leaf. It seems like a 
> data race maybe caused by:
> 
> We unlock the eb after setting the WRITTEN flag during write back, and 
> the eb should not get modified since then because all future writes will 
> use the cowed eb. However, with the WRITTEN flag check removed in 
> should_cow_block, we might write to the eb with WRITTEN flag set which 
> might be under io.

I tried again with this:

@@ -624,14 +624,20 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
         if (btrfs_header_generation(buf) != trans->transid)
                 return true;

-       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
-               return true;
-
         /* Ensure we can see the FORCE_COW bit. */
         smp_mb__before_atomic();
         if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
                 return true;

+       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
+                       return true;
+               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
+                       return true;
+               btrfs_mark_buffer_dirty(trans, buf);
+               return false;
+       }
+
         if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
                 return false;

When WRITEBACK is set, do a normal COW to prevent the data race. This
seems to fix the previous problem. However, I now get this:

[  +0.020843] [T15127] BTRFS error (device loop1): block=30687232 bad generation, have 11 expect > 14
[  +0.000009] [T15127] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 
itemsize 160
[  +0.000004] [T15127] 		inode generation 3 transid 11 size 10 nbytes 16384
[  +0.000003] [T15127] 		block group 0 mode 40755 links 1 uid 0 gid 0
[  +0.000002] [T15127] 		rdev 0 sequence 1 flags 0x0
[  +0.000002] [T15127] 		atime 1769760651.0
[  +0.000002] [T15127] 		ctime 1769760652.250234845
[  +0.000002] [T15127] 		mtime 1769760652.250234845
[  +0.000001] [T15127] 		otime 1769760651.0
[  +0.000002] [T15127] 	item 1 key (256 INODE_REF 256) itemoff 16111 
itemsize 12
[  +0.000003] [T15127] 		index 0 name_len 2
[  +0.000002] [T15127] 	item 2 key (256 DIR_ITEM 2030520461) itemoff 
16076 itemsize 35
[  +0.000002] [T15127] 		location key (257 1 0) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 5
[  +0.000002] [T15127] 	item 3 key (256 DIR_INDEX 2) itemoff 16041 
itemsize 35
[  +0.000002] [T15127] 		location key (257 1 0) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 5
[  +0.000002] [T15127] 	item 4 key (257 INODE_ITEM 0) itemoff 15881 
itemsize 160
[  +0.000002] [T15127] 		inode generation 11 transid 11 size 12 nbytes 0
[  +0.000002] [T15127] 		block group 0 mode 40755 links 1 uid 0 gid 0
[  +0.000002] [T15127] 		rdev 0 sequence 19 flags 0x0
[  +0.000001] [T15127] 		atime 1769760652.250234845
[  +0.000002] [T15127] 		ctime 1769760652.256913323
[  +0.000002] [T15127] 		mtime 1769760652.256913323
[  +0.000001] [T15127] 		otime 1769760652.250234845
[  +0.000002] [T15127] 	item 5 key (257 INODE_REF 256) itemoff 15866 
itemsize 15
[  +0.000002] [T15127] 		index 2 name_len 5
[  +0.000002] [T15127] 	item 6 key (257 DIR_ITEM 247980518) itemoff 
15830 itemsize 36
[  +0.000002] [T15127] 		location key (256 132 18446744073709551615) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 6
[  +0.000002] [T15127] 	item 7 key (257 DIR_INDEX 2) itemoff 15794 
itemsize 36
[  +0.000002] [T15127] 		location key (256 132 18446744073709551615) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 6
[  +0.000001] [T15127] BTRFS error (device loop1): block=30687232 write 
time tree block corruption detected
[  +0.000017] [T15127] BTRFS error (device loop1): block=30703616 bad 
generation, have 11 expect > 14
[  +0.000004] [T15127] 	item 0 key (13631488 BLOCK_GROUP_ITEM 8388608) 
itemoff 16259 itemsize 24
[  +0.000003] [T15127] 		block group used 0 chunk_objectid 256 flags 1
[  +0.000002] [T15127] 	item 1 key (22020096 BLOCK_GROUP_ITEM 8388608) 
itemoff 16235 itemsize 24
[  +0.000002] [T15127] 		block group used 16384 chunk_objectid 256 flags 34
[  +0.000002] [T15127] 	item 2 key (22036480 METADATA_ITEM 0) itemoff 
16202 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 8 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 3
[  +0.000003] [T15127] 	item 3 key (30408704 BLOCK_GROUP_ITEM 268435456) 
itemoff 16178 itemsize 24
[  +0.000002] [T15127] 		block group used 163840 chunk_objectid 256 flags 36
[  +0.000002] [T15127] 	item 4 key (30490624 METADATA_ITEM 0) itemoff 
16145 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 5 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 7
[  +0.000002] [T15127] 	item 5 key (30523392 METADATA_ITEM 0) itemoff 
16112 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 5 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 18446744073709551607
[  +0.000002] [T15127] 	item 6 key (30605312 METADATA_ITEM 0) itemoff 
16079 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 9 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 4
[  +0.000002] [T15127] 	item 7 key (30687232 METADATA_ITEM 0) itemoff 
16046 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 5
[  +0.000002] [T15127] 	item 8 key (30703616 METADATA_ITEM 0) itemoff 
16013 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 2
[  +0.000002] [T15127] 	item 9 key (30720000 METADATA_ITEM 0) itemoff 
15980 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 10
[  +0.000002] [T15127] 	item 10 key (30736384 METADATA_ITEM 0) itemoff 
15947 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 8
[  +0.000002] [T15127] 	item 11 key (30752768 METADATA_ITEM 0) itemoff 
15914 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 256
[  +0.000002] [T15127] 	item 12 key (30769152 METADATA_ITEM 0) itemoff 
15881 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 1
[  +0.000002] [T15127] 	item 13 key (30785536 METADATA_ITEM 0) itemoff 
15848 itemsize 33
[  +0.000002] [T15127] 		extent refs 1 gen 11 flags 2
[  +0.000002] [T15127] 		ref#0: tree block backref root 9
[  +0.000002] [T15127] BTRFS error (device loop1): block=30703616 write 
time tree block corruption detected
[  +0.000012] [T15127] BTRFS error (device loop1): block=30720000 bad 
generation, have 11 expect > 14
[  +0.000004] [T15127] 	item 0 key (13631488 FREE_SPACE_INFO 8388608) 
itemoff 16275 itemsize 8
[  +0.000002] [T15127] 	item 1 key (13631488 FREE_SPACE_EXTENT 8388608) 
itemoff 16275 itemsize 0
[  +0.000002] [T15127] 	item 2 key (22020096 FREE_SPACE_INFO 8388608) 
itemoff 16267 itemsize 8
[  +0.000002] [T15127] 	item 3 key (22020096 FREE_SPACE_EXTENT 16384) 
itemoff 16267 itemsize 0
[  +0.000003] [T15127] 	item 4 key (22052864 FREE_SPACE_EXTENT 8355840) 
itemoff 16267 itemsize 0
[  +0.000002] [T15127] 	item 5 key (30408704 FREE_SPACE_INFO 268435456) 
itemoff 16259 itemsize 8
[  +0.000002] [T15127] 	item 6 key (30408704 FREE_SPACE_EXTENT 81920) 
itemoff 16259 itemsize 0
[  +0.000002] [T15127] 	item 7 key (30507008 FREE_SPACE_EXTENT 16384) 
itemoff 16259 itemsize 0
[  +0.000002] [T15127] 	item 8 key (30539776 FREE_SPACE_EXTENT 65536) 
itemoff 16259 itemsize 0
[  +0.000002] [T15127] 	item 9 key (30621696 FREE_SPACE_EXTENT 65536) 
itemoff 16259 itemsize 0
[  +0.000003] [T15127] 	item 10 key (30801920 FREE_SPACE_EXTENT 
268042240) itemoff 16259 itemsize 0
[  +0.000002] [T15127] BTRFS error (device loop1): block=30720000 write 
time tree block corruption detected
[  +0.000010] [T15127] BTRFS error (device loop1): block=30736384 bad 
generation, have 11 expect > 14
[  +0.000004] [T15127] 	item 0 key (0 QGROUP_STATUS 0) itemoff 16243 
itemsize 40
[  +0.000003] [T15127] 	item 1 key (0 QGROUP_INFO 5) itemoff 16203 
itemsize 40
[  +0.000002] [T15127] 	item 2 key (0 QGROUP_INFO 256) itemoff 16163 
itemsize 40
[  +0.000002] [T15127] 	item 3 key (0 QGROUP_LIMIT 5) itemoff 16123 
itemsize 40
[  +0.000002] [T15127] 	item 4 key (0 QGROUP_LIMIT 256) itemoff 16083 
itemsize 40
[  +0.000003] [T15127] BTRFS error (device loop1): block=30736384 write 
time tree block corruption detected
[  +0.000014] [T15127] BTRFS error (device loop1): block=30769152 bad 
generation, have 11 expect > 14
[  +0.000004] [T15127] 	item 0 key (2 ROOT_ITEM 0) itemoff 15844 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30703616 refs 1
[  +0.000002] [T15127] 	item 1 key (4 ROOT_ITEM 0) itemoff 15405 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30605312 refs 1
[  +0.000001] [T15127] 	item 2 key (5 INODE_REF 6) itemoff 15388 itemsize 17
[  +0.000002] [T15127] 		index 0 name_len 7
[  +0.000002] [T15127] 	item 3 key (5 ROOT_ITEM 0) itemoff 14949 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30687232 refs 1
[  +0.000002] [T15127] 	item 4 key (5 ROOT_REF 256) itemoff 14925 
itemsize 24
[  +0.000002] [T15127] 	item 5 key (6 INODE_ITEM 0) itemoff 14765 
itemsize 160
[  +0.000002] [T15127] 		inode generation 3 transid 0 size 0 nbytes 16384
[  +0.000002] [T15127] 		block group 0 mode 40755 links 1 uid 0 gid 0
[  +0.000002] [T15127] 		rdev 0 sequence 0 flags 0x0
[  +0.000001] [T15127] 		atime 1769760651.0
[  +0.000002] [T15127] 		ctime 1769760651.0
[  +0.000002] [T15127] 		mtime 1769760651.0
[  +0.000001] [T15127] 		otime 1769760651.0
[  +0.000002] [T15127] 	item 6 key (6 INODE_REF 6) itemoff 14753 itemsize 12
[  +0.000002] [T15127] 		index 0 name_len 2
[  +0.000001] [T15127] 	item 7 key (6 DIR_ITEM 2378154706) itemoff 14716 
itemsize 37
[  +0.000003] [T15127] 		location key (5 132 18446744073709551615) type 2
[  +0.000001] [T15127] 		transid 3 data_len 0 name_len 7
[  +0.000002] [T15127] 	item 8 key (7 ROOT_ITEM 0) itemoff 14277 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30490624 refs 1
[  +0.000002] [T15127] 	item 9 key (8 ROOT_ITEM 0) itemoff 13838 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30736384 refs 1
[  +0.000001] [T15127] 	item 10 key (9 ROOT_ITEM 0) itemoff 13399 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30785536 refs 1
[  +0.000002] [T15127] 	item 11 key (10 ROOT_ITEM 0) itemoff 12960 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30720000 refs 1
[  +0.000001] [T15127] 	item 12 key (256 ROOT_ITEM 11) itemoff 12521 
itemsize 439
[  +0.000003] [T15127] 		root data bytenr 30752768 refs 1
[  +0.000001] [T15127] 	item 13 key (256 ROOT_BACKREF 5) itemoff 12497 
itemsize 24
[  +0.000003] [T15127] 	item 14 key (18446744073709551607 ROOT_ITEM 0) 
itemoff 12058 itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30523392 refs 1
[  +0.000001] [T15127] BTRFS error (device loop1): block=30769152 write 
time tree block corruption detected
[  +0.000012] [T15127] BTRFS error (device loop1): block=30801920 bad 
generation, have 12 expect > 14
[  +0.000003] [T15127] 	item 0 key (0 QGROUP_STATUS 0) itemoff 16243 
itemsize 40
[  +0.000003] [T15127] 	item 1 key (0 QGROUP_INFO 5) itemoff 16203 
itemsize 40
[  +0.000002] [T15127] 	item 2 key (0 QGROUP_INFO 256) itemoff 16163 
itemsize 40
[  +0.000002] [T15127] 	item 3 key (0 QGROUP_INFO 257) itemoff 16123 
itemsize 40
[  +0.000002] [T15127] 	item 4 key (0 QGROUP_LIMIT 5) itemoff 16083 
itemsize 40
[  +0.000002] [T15127] 	item 5 key (0 QGROUP_LIMIT 256) itemoff 16043 
itemsize 40
[  +0.000002] [T15127] 	item 6 key (0 QGROUP_LIMIT 257) itemoff 16003 
itemsize 40
[  +0.000002] [T15127] BTRFS error (device loop1): block=30801920 write 
time tree block corruption detected
[  +0.000014] [T15127] BTRFS error (device loop1): block=30818304 bad 
generation, have 12 expect > 14
[  +0.000003] [T15127] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 
itemsize 160
[  +0.000002] [T15127] 		inode generation 3 transid 11 size 10 nbytes 16384
[  +0.000002] [T15127] 		block group 0 mode 40755 links 1 uid 0 gid 0
[  +0.000002] [T15127] 		rdev 0 sequence 1 flags 0x0
[  +0.000002] [T15127] 		atime 1769760651.0
[  +0.000001] [T15127] 		ctime 1769760652.250234845
[  +0.000002] [T15127] 		mtime 1769760652.250234845
[  +0.000001] [T15127] 		otime 1769760651.0
[  +0.000002] [T15127] 	item 1 key (256 INODE_REF 256) itemoff 16111 
itemsize 12
[  +0.000002] [T15127] 		index 0 name_len 2
[  +0.000002] [T15127] 	item 2 key (256 DIR_ITEM 2030520461) itemoff 
16076 itemsize 35
[  +0.000002] [T15127] 		location key (257 1 0) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 5
[  +0.000001] [T15127] 	item 3 key (256 DIR_INDEX 2) itemoff 16041 
itemsize 35
[  +0.000002] [T15127] 		location key (257 1 0) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 5
[  +0.000002] [T15127] 	item 4 key (257 INODE_ITEM 0) itemoff 15881 
itemsize 160
[  +0.000002] [T15127] 		inode generation 11 transid 12 size 24 nbytes 0
[  +0.000002] [T15127] 		block group 0 mode 40755 links 1 uid 0 gid 0
[  +0.000002] [T15127] 		rdev 0 sequence 19 flags 0x0
[  +0.000001] [T15127] 		atime 1769760652.250234845
[  +0.000002] [T15127] 		ctime 1769760652.267621586
[  +0.000001] [T15127] 		mtime 1769760652.267621586
[  +0.000002] [T15127] 		otime 1769760652.250234845
[  +0.000002] [T15127] 	item 5 key (257 INODE_REF 256) itemoff 15866 
itemsize 15
[  +0.000002] [T15127] 		index 2 name_len 5
[  +0.000001] [T15127] 	item 6 key (257 DIR_ITEM 247980518) itemoff 
15830 itemsize 36
[  +0.000002] [T15127] 		location key (256 132 18446744073709551615) type 2
[  +0.000002] [T15127] 		transid 11 data_len 0 name_len 6
[  +0.000002] [T15127] 	item 7 key (257 DIR_ITEM 496439826) itemoff 
15794 itemsize 36
[  +0.000002] [T15127] 		location key (257 132 18446744073709551615) type 2
[  +0.000002] [T15127] 		transid 12 data_len 0 name_len 6
[  +0.000001] [T15127] 	item 8 key (257 DIR_INDEX 2) itemoff 15758 
itemsize 36
[  +0.000003] [T15127] 		location key (256 132 18446744073709551615) type 2
[  +0.000001] [T15127] 		transid 11 data_len 0 name_len 6
[  +0.000002] [T15127] 	item 9 key (257 DIR_INDEX 3) itemoff 15722 
itemsize 36
[  +0.000002] [T15127] 		location key (257 132 18446744073709551615) type 2
[  +0.000002] [T15127] 		transid 12 data_len 0 name_len 6
[  +0.000001] [T15127] BTRFS error (device loop1): block=30818304 write 
time tree block corruption detected
[  +0.000016] [T15127] BTRFS error (device loop1): block=30851072 bad 
generation, have 12 expect > 14
[  +0.000004] [T15127] 	item 0 key (2 ROOT_ITEM 0) itemoff 15844 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30867456 refs 1
[  +0.000001] [T15127] 	item 1 key (4 ROOT_ITEM 0) itemoff 15405 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30605312 refs 1
[  +0.000002] [T15127] 	item 2 key (5 INODE_REF 6) itemoff 15388 itemsize 17
[  +0.000002] [T15127] 		index 0 name_len 7
[  +0.000001] [T15127] 	item 3 key (5 ROOT_ITEM 0) itemoff 14949 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30818304 refs 1
[  +0.000002] [T15127] 	item 4 key (5 ROOT_REF 256) itemoff 14925 
itemsize 24
[  +0.000002] [T15127] 	item 5 key (5 ROOT_REF 257) itemoff 14901 
itemsize 24
[  +0.000002] [T15127] 	item 6 key (6 INODE_ITEM 0) itemoff 14741 
itemsize 160
[  +0.000002] [T15127] 		inode generation 3 transid 0 size 0 nbytes 16384
[  +0.000002] [T15127] 		block group 0 mode 40755 links 1 uid 0 gid 0
[  +0.000003] [T15127] 		rdev 0 sequence 0 flags 0x0
[  +0.000001] [T15127] 		atime 1769760651.0
[  +0.000002] [T15127] 		ctime 1769760651.0
[  +0.000003] [T15127] 		mtime 1769760651.0
[  +0.000002] [T15127] 		otime 1769760651.0
[  +0.000002] [T15127] 	item 7 key (6 INODE_REF 6) itemoff 14729 itemsize 12
[  +0.000003] [T15127] 		index 0 name_len 2
[  +0.000002] [T15127] 	item 8 key (6 DIR_ITEM 2378154706) itemoff 14692 
itemsize 37
[  +0.000003] [T15127] 		location key (5 132 18446744073709551615) type 2
[  +0.000002] [T15127] 		transid 3 data_len 0 name_len 7
[  +0.000002] [T15127] 	item 9 key (7 ROOT_ITEM 0) itemoff 14253 
itemsize 439
[  +0.000003] [T15127] 		root data bytenr 30490624 refs 1
[  +0.000002] [T15127] 	item 10 key (8 ROOT_ITEM 0) itemoff 13814 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30801920 refs 1
[  +0.000003] [T15127] 	item 11 key (9 ROOT_ITEM 0) itemoff 13375 
itemsize 439
[  +0.000002] [T15127] 		root data bytenr 30900224 refs 1
[  +0.000002] [T15127] 	item 12 key (10 ROOT_ITEM 0) itemoff 12936 
itemsize 439
[  +0.000003] [T15127] 		root data bytenr 30883840 refs 1
[  +0.000002] [T15127] 	item 13 key (256 ROOT_ITEM 11) itemoff 12497 
itemsize 439
[  +0.000003] [T15127] 		root data bytenr 30752768 refs 1
[  +0.000002] [T15127] 	item 14 key (256 ROOT_BACKREF 5) itemoff 12473 
itemsize 24
[  +0.000003] [T15127] 	item 15 key (257 ROOT_ITEM 12) itemoff 12034 
itemsize 439
[  +0.000003] [T15127] 		root data bytenr 30834688 refs 1
[  +0.000002] [T15127] 	item 16 key (257 ROOT_BACKREF 5) itemoff 12010 
itemsize 24
[  +0.000003] [T15127] 	item 17 key (18446744073709551607 ROOT_ITEM 0) 
itemoff 11571 itemsize 439
[  +0.000004] [T15127] 		root data bytenr 30523392 refs 1
[  +0.000002] [T15127] BTRFS error (device loop1): block=30851072 write 
time tree block corruption detected

and many more lines with the same generation errors for the btrfs/122,
btrfs/152, btrfs/210, btrfs/224, btrfs/316, btrfs/320 and btrfs/340
fstests cases.

I have no idea why it's trying to write some ebs older than the current
transaction. It seems related to snapshots.

> To fix this, we need to check the DIRTY flag again to prevent writing an
> eb which has had new data written, and lock the eb before we actually
> do any IO-related work. I'm not familiar with the IO-related code, so
> please correct me if I got anything wrong.
> 
> Thanks,
> 
> Sun Yangkai




* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30  0:12   ` Leo Martins
  2026-01-30  4:14     ` Sun YangKai
@ 2026-01-30 12:49     ` Filipe Manana
  2026-01-30 15:43       ` Boris Burkov
  2026-01-30 21:43       ` Leo Martins
  1 sibling, 2 replies; 21+ messages in thread
From: Filipe Manana @ 2026-01-30 12:49 UTC (permalink / raw)
  To: Leo Martins; +Cc: linux-btrfs, kernel-team

On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
>
> On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
>
> > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > >
> > > I've been investigating enospcs at Meta and have observed a strange
> > > pattern where filesystems are enospcing with lots of unallocated space
> > > (> 100G). Sample dmesg dump at bottom of message.
> > >
> > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > from the transaction block reserve and finding it exhausted leading to a
> > > warning and enospc. This is a bug as the reservations are meant to be
> > > worst case. It should be impossible to exhaust the transaction block
> > > reserve.
> > >
> > > Some tracing of affected hosts revealed that there were single
> > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > reproduce this behavior locally by creating a very constrained cgroup
> > > and producing a lot of concurrent filesystem operations. Here's the
> > > pattern:
> > >
> > >  1. btrfs_search_slot() begins tree traversal with cow=1
> > >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> > >  3. btrfs_cow_block() allocates new node, updates parent pointer
> > >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> > >     not cached, lock contention, need higher write_lock_level)
> > >  5. btrfs_release_path() releases all locks and references
> > >  6. Memory pressure triggers writeback on the COW'd node
> > >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > >     BTRFS_HEADER_FLAG_WRITTEN
> > >  8. goto again - traversal restarts from root
> > >  9. Traversal reaches the freshly COW'd node
> > >  10. should_cow_block() sees WRITTEN flag set, returns true
> > >  11. btrfs_cow_block() allocates another new node - same logical position,
> > >      new physical location, new reservation consumed
> > >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > >
> > > Note this behavior should be much harder to trigger since Boris's
> > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > accounted for in user cgroups. However, I believe it
> > > would still be an issue under global memory pressure.
> > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > >
> > > This COW amplification breaks the idea that transaction reservations are
> > > worst case as any search slot call could find itself in this COW loop and
> > > exhaust its reservation.
> > >
> > > My proposed solution is to temporarily pin extent buffers for the
> > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > amplification that can be seen during high memory pressure.
> > >
> > > The implementation uses a local xarray to track COW'd buffers for the
> > > duration of the search. The xarray stores extent_buffer pointers without
> > > taking additional references; this is safe because tracked buffers remain
> > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > dirty buffers cannot be reclaimed by memory pressure.
> > >
> > > Synchronization is provided by eb->lock: increments in
> > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > Once we decrement, we're done and don't care if writeback proceeds
> > > immediately.
> >
> > This seems too complex to me.
> >
> > So this problem is very similar to some idea I had a few years ago but
> > never managed to implement.
> > It was about avoiding unnecessary COW, not for this space reservation
> > exhaustion due to sustained memory pressure, but it would solve it
> > too.
> >
> > The idea was that we do unnecessary COW in cases like this:
> >
> > 1) We COW a path in some tree and we are at transaction N;
> >
> > 2) Writeback happened for the extent buffers in that path while we are
> > in the same transaction, because we reached the 32M limit and some
> > task called btrfs_btree_balance_dirty() or something else triggered
> > writeback of the btree inode;
> >
> > 3) While still at transaction N, we visit the same path to add an item
> > to a leaf, or modify an item, whatever. Because the extent buffers
> > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > returns true).
> >
> > So during the lifetime of a transaction we can have a lot of
> > unnecessary COW - we spend more time allocating extents, allocating
> > memory, copying extent buffer data, use more space per transaction,
> > etc.
> >
> > The idea was to not COW when an extent buffer has
> > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > (btrfs_header_generation(eb)) matches the current transaction.
> > That is safe because there's no committed tree that points to an
> > extent buffer created in the current transaction.
> >
> > Any further modification to the extent buffer must be sure that the
> > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > the extent buffer to the same location again before the transaction
> > commits the superblocks.
> >
> > Have you considered an approach like this?
>
> I had not considered this, but it is a great idea.
>
> My first thought is that implementing this could be as simple
> as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> would mess with the assumptions around the log tree. From
> btrfs_sync_log():
>
> /*
>  * IO has been started, blocks of the log tree have WRITTEN flag set
>  * in their headers. new modifications of the log will be written to
>  * new positions. so it's safe to allow log writers to go in.
>  */
>
> ^ Assumes that WRITTEN blocks will be COW'd.
>
> The issue looks like:
>
>  1. fsync A COWs eb
>  2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
>  3. fsync B does __not__ COW eb and modifies it
>  4. fsync A writes modified eb to disk
>  5. CRASH; the log tree is corrupted
>
> One way to avoid that is to keep the current behavior for the log
> tree, but that leaves the potential for COW amplification...
>
> Another idea is to track the log_transid in the eb in the same way
> the transid is tracked. Then, in should_cow_block we have something
> like:
>
> if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
>     buf->log_transid != root->log_transid)
>   return true;

Log trees are special since their lifetime doesn't span one
transaction, so of course what I suggested doesn't work for log trees,
and I forgot to mention that.

Tracking the log_transid in the extent buffer will not always work -
because it can be evicted and reloaded, so we would lose its value.
We would have to update the on-disk format to store it somewhere or
keep another in-memory structure to track that, or prevent eviction of
log tree buffers - all of those are too complex.
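
To illustrate the eviction point, here is a minimal user-space sketch. It
is not kernel code: struct disk_hdr, struct eb_mem, evict_eb() and
reload_eb() are hypothetical names invented for this example, and
log_transid is the proposed in-memory-only field. Only what lives in the
on-disk header survives an evict/reload cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the fields that are persisted in the block header. */
struct disk_hdr {
	uint64_t generation;
	uint64_t flags;
};

/* Hypothetical in-memory extent buffer: header plus an unpersisted field. */
struct eb_mem {
	struct disk_hdr hdr;
	uint64_t log_transid;	/* assumption: in-memory only, never written */
};

/* Eviction keeps only what is on disk. */
static struct disk_hdr evict_eb(const struct eb_mem *eb)
{
	assert(eb);
	return eb->hdr;
}

/* Reloading rebuilds the buffer from disk; log_transid cannot be recovered. */
static struct eb_mem reload_eb(struct disk_hdr hdr)
{
	struct eb_mem eb = { .hdr = hdr, .log_transid = 0 };

	return eb;
}
```

After reload_eb(evict_eb(&eb)) the generation and flags are intact but
log_transid is back to zero, so a comparison against root->log_transid
would no longer be meaningful.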

So I had this half baked patch from many years ago:

 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
                      *root, struct btrfs_path *path, int level);
@@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
btrfs_trans_handle *trans,
         *    block to ensure the metadata consistency.
         */
        if (btrfs_header_generation(buf) == trans->transid &&
-           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
            !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
              btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
-           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
+           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
+
+               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+                       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
+                               return 1;
+                       return 0;
+               }
+
+               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
+                   test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
+                       return 1;

This was before a recent refactoring of should_cow_block(), but you
should get the idea.
IIRC all fstests were passing back then, except for one or two which I
never spent time debugging.
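
For readers who want to poke at the decision logic, the diff above can be
modelled in user space roughly as follows. This is only a sketch: struct
eb_model, struct root_model and should_cow_model() are hypothetical
stand-ins, not kernel types, and since the diff is cut off after the
WRITEBACK/WRITE_ERR check, the final "reuse the block" return is an
assumption about where it was heading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the state the diff inspects. */
struct eb_model {
	uint64_t generation;	/* btrfs_header_generation(buf) */
	bool hdr_written;	/* BTRFS_HEADER_FLAG_WRITTEN */
	bool hdr_reloc;		/* BTRFS_HEADER_FLAG_RELOC */
	bool under_writeback;	/* EXTENT_BUFFER_WRITEBACK */
	bool write_err;		/* EXTENT_BUFFER_WRITE_ERR */
};

struct root_model {
	bool is_log_tree;	/* objectid == BTRFS_TREE_LOG_OBJECTID */
	bool is_reloc_tree;	/* objectid == BTRFS_TREE_RELOC_OBJECTID */
	bool force_cow;		/* BTRFS_ROOT_FORCE_COW */
};

/* Returns true when the block must be COWed, false when it may be reused. */
static bool should_cow_model(uint64_t transid, const struct root_model *root,
			     const struct eb_model *eb)
{
	assert(root && eb);

	/* Blocks from older transactions may be referenced by a committed tree. */
	if (eb->generation != transid)
		return true;
	if (!root->is_reloc_tree && eb->hdr_reloc)
		return true;
	if (root->force_cow)
		return true;

	/* Log trees keep the existing rule: WRITTEN always forces COW. */
	if (root->is_log_tree)
		return eb->hdr_written;

	/* Never modify a buffer with IO in flight or after a failed write. */
	if (eb->under_writeback || eb->write_err)
		return true;

	/* Assumption: the truncated diff lets the block be reused from here. */
	return false;
}
```

With this model, a WRITTEN block of the current generation is reused in a
regular tree but still COWed in a log tree, and any block with writeback
in flight is COWed regardless.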

And as that attempt was before the tree checker existed, we would need
to make sure we don't change an eb while the tree checker is
verifying it - making sure the tree checker read locks the eb should
be enough.

There's also one problem with this idea: it won't work for zoned
devices as writes are sequential and we can't write twice to the same
location without doing the zone reset thing which only happens around
transaction commit time IIRC.

Thanks.

>
> Please let me know if you see any issues with this approach or
> if you can think of a better method.
>
> Thanks,
> Leo
>
> >
> > It would solve this space reservation exhaustion problem, as well as
> > unnecessary COW for general optimization, without the need for a
> > local xarray, which besides being very specific for the
> > btrfs_search_slot() case (we COW in other places), also requires a
> > memory allocation which can fail.
> >
> > Thanks.


* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 12:49     ` Filipe Manana
@ 2026-01-30 15:43       ` Boris Burkov
  2026-01-30 15:57         ` Filipe Manana
  2026-01-30 21:43       ` Leo Martins
  1 sibling, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2026-01-30 15:43 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Leo Martins, linux-btrfs, kernel-team

On Fri, Jan 30, 2026 at 12:49:55PM +0000, Filipe Manana wrote:
> On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > > >
> > > > I've been investigating enospcs at Meta and have observed a strange
> > > > pattern where filesystems are enospcing with lots of unallocated space
> > > > (> 100G). Sample dmesg dump at bottom of message.
> > > >
> > > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > > from the transaction block reserve and finding it exhausted leading to a
> > > > warning and enospc. This is a bug as the reservations are meant to be
> > > > worst case. It should be impossible to exhaust the transaction block
> > > > reserve.
> > > >
> > > > Some tracing of affected hosts revealed that there were single
> > > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > > reproduce this behavior locally by creating a very constrained cgroup
> > > > and producing a lot of concurrent filesystem operations. Here's the
> > > > pattern:
> > > >
> > > >  1. btrfs_search_slot() begins tree traversal with cow=1
> > > >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> > > >  3. btrfs_cow_block() allocates new node, updates parent pointer
> > > >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> > > >     not cached, lock contention, need higher write_lock_level)
> > > >  5. btrfs_release_path() releases all locks and references
> > > >  6. Memory pressure triggers writeback on the COW'd node
> > > >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > > >     BTRFS_HEADER_FLAG_WRITTEN
> > > >  8. goto again - traversal restarts from root
> > > >  9. Traversal reaches the freshly COW'd node
> > > >  10. should_cow_block() sees WRITTEN flag set, returns true
> > > >  11. btrfs_cow_block() allocates another new node - same logical position,
> > > >      new physical location, new reservation consumed
> > > >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > > >
> > > > Note this behavior should be much harder to trigger since Boris's
> > > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > > accounted for in user cgroups. However, I believe it
> > > > would still be an issue under global memory pressure.
> > > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > > >
> > > > This COW amplification breaks the idea that transaction reservations are
> > > > worst case as any search slot call could find itself in this COW loop and
> > > > exhaust its reservation.
> > > >
> > > > My proposed solution is to temporarily pin extent buffers for the
> > > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > > amplification that can be seen during high memory pressure.
> > > >
> > > > The implementation uses a local xarray to track COW'd buffers for the
> > > > duration of the search. The xarray stores extent_buffer pointers without
> > > > taking additional references; this is safe because tracked buffers remain
> > > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > > dirty buffers cannot be reclaimed by memory pressure.
> > > >
> > > > Synchronization is provided by eb->lock: increments in
> > > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > > Once we decrement, we're done and don't care if writeback proceeds
> > > > immediately.
> > >
> > > This seems too complex to me.
> > >
> > > So this problem is very similar to some idea I had a few years ago but
> > > never managed to implement.
> > > It was about avoiding unnecessary COW, not for this space reservation
> > > exhaustion due to sustained memory pressure, but it would solve it
> > > too.
> > >
> > > The idea was that we do unnecessary COW in cases like this:
> > >
> > > 1) We COW a path in some tree and we are at transaction N;
> > >
> > > 2) Writeback happened for the extent buffers in that path while we are
> > > in the same transaction, because we reached the 32M limit and some
> > > task called btrfs_btree_balance_dirty() or something else triggered
> > > writeback of the btree inode;
> > >
> > > 3) While still at transaction N, we visit the same path to add an item
> > > to a leaf, or modify an item, whatever. Because the extent buffers
> > > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > > returns true).
> > >
> > > So during the lifetime of a transaction we can have a lot of
> > > unnecessary COW - we spend more time allocating extents, allocating
> > > memory, copying extent buffer data, use more space per transaction,
> > > etc.
> > >
> > > The idea was to not COW when an extent buffer has
> > > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > > (btrfs_header_generation(eb)) matches the current transaction.
> > > That is safe because there's no committed tree that points to an
> > > extent buffer created in the current transaction.
> > >
> > > Any further modification to the extent buffer must be sure that the
> > > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > > the extent buffer to the same location again before the transaction
> > > commits the superblocks.
> > >
> > > Have you considered an approach like this?
> >
> > I had not considered this, but it is a great idea.
> >
> > My first thought is that implementing this could be as simple
> > as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> > would mess with the assumptions around the log tree. From
> > btrfs_sync_log():
> >
> > /*
> >  * IO has been started, blocks of the log tree have WRITTEN flag set
> >  * in their headers. new modifications of the log will be written to
> >  * new positions. so it's safe to allow log writers to go in.
> >  */
> >
> > ^ Assumes that WRITTEN blocks will be COW'd.
> >
> > The issue looks like:
> >
> >  1. fsync A COWs eb
> >  2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> >  3. fsync B does __not__ COW eb and modifies it
> >  4. fsync A writes modified eb to disk
> >  5. CRASH; the log tree is corrupted
> >
> > One way to avoid that is to keep the current behavior for the log
> > tree, but that leaves the potential for COW amplification...
> >
> > Another idea is to track the log_transid in the eb in the same way
> > the transid is tracked. Then, in should_cow_block we have something
> > like:
> >
> > if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
> >     buf->log_transid != root->log_transid)
> >   return true;
> 
> Log trees are special since their lifetime doesn't span one
> transaction, so of course what I suggested doesn't work for log trees,
> and I forgot to mention that.
> 
> Tracking the log_transid in the extent buffer will not always work -
> because it can be evicted and reloaded, so we would lose its value.
> We would have to update the on-disk format to store it somewhere or
> keep another in-memory structure to track that, or prevent eviction of
> log tree buffers - all of those are too complex.

Supposing we cannot think of a way to do overwrites on log tree ebs,
but that we can make it work for other ebs (excluding the zoned case
you mentioned below):

What do you think about the problem of space reservation exhaustion
due to COW amplification when narrowed to just log trees? As far as I
can tell, there is nothing special about how logged items consume
transaction reservation so the problem would be reduced but still exist.
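
To put rough numbers on that, here is a toy user-space model of a single
search that keeps getting restarted. It is a deliberate simplification
(one node, one reservation unit per COW), and cows_for_search() and
reservation_exhausted() are hypothetical helpers made up for this sketch,
not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count the COWs of one node during a single search that is restarted
 * `restarts` times. If writeback hits the freshly COWed node between
 * restarts, WRITTEN is set and every further pass COWs it again.
 */
static int cows_for_search(int restarts, bool writeback_between_restarts)
{
	int cows = 1;	/* the first pass COWs the old-generation node */

	for (int i = 0; i < restarts; i++) {
		if (writeback_between_restarts)
			cows++;
	}
	return cows;
}

/* A "worst case" reservation only holds if the COW count stays bounded. */
static bool reservation_exhausted(int reserved, int restarts,
				  bool writeback_between_restarts)
{
	assert(reserved >= 0 && restarts >= 0);
	return cows_for_search(restarts, writeback_between_restarts) > reserved;
}
```

Without writeback between restarts the node is COWed once no matter how
many restarts happen; with it, the count grows with every restart, which
is how a reservation sized for the tree height can run out.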

Do we want to pursue working out the kinks in eb-overwrite (seems super
valuable regardless of motivation) and think of some other final
backstop for log tree ebs? Given that the fsync will be sending the ebs
down to the disk quite soon anyway, I was thinking it might be more
palatable to try to fully prevent premature writeback of log tree ebs.

> 
> So I had this half baked patch from many years ago:
> 
>  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
>                       *root, struct btrfs_path *path, int level);
> @@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
> btrfs_trans_handle *trans,
>          *    block to ensure the metadata consistency.
>          */
>         if (btrfs_header_generation(buf) == trans->transid &&
> -           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>             !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>               btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> -           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> +           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
> +
> +               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> +                       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> +                               return 1;
> +                       return 0;
> +               }
> +
> +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
> +                   test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
> +                       return 1;
> 
> This was before a recent refactoring of should_cow_block(), but you
> should get the idea.
> IIRC all fstests were passing back then, except for one or two which I
> never spent time debugging.
> 
> And as that attempt was before the tree checker existed, we would need
> to make sure we don't change an eb while the tree checker is
> verifying it - making sure the tree checker read locks the eb should
> be enough.

I suspect this is what Sun was hitting in his replies to Leo.

> 
> There's also one problem with this idea: it won't work for zoned
> devices as writes are sequential and we can't write twice to the same
> location without doing the zone reset thing which only happens around
> transaction commit time IIRC.
> 
> Thanks.
> 

Thanks for your input on this, it's really appreciated,
Boris

> >
> > Please let me know if you see any issues with this approach or
> > if you can think of a better method.
> >
> > Thanks,
> > Leo
> >
> > >
> > > It would solve this space reservation exhaustion problem, as well as
> > > unnecessary COW for general optimization, without the need for a
> > > local xarray, which besides being very specific for the
> > > btrfs_search_slot() case (we COW in other places), also requires a
> > > memory allocation which can fail.
> > >
> > > Thanks.


* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30  9:37       ` Sun YangKai
@ 2026-01-30 15:50         ` Sun YangKai
  2026-01-30 16:11           ` Filipe Manana
  0 siblings, 1 reply; 21+ messages in thread
From: Sun YangKai @ 2026-01-30 15:50 UTC (permalink / raw)
  To: Leo Martins, Filipe Manana; +Cc: linux-btrfs, kernel-team

On 2026/1/30 17:37, Sun YangKai wrote:
> 
> 
> On 2026/1/30 12:14, Sun YangKai wrote:
>> On 2026/1/30 08:12, Leo Martins wrote:
>>> On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
>>>> On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
>>>>> I've been investigating enospcs at Meta and have observed a strange
>>>>> pattern where filesystems are enospcing with lots of unallocated space
>>>>> (> 100G). Sample dmesg dump at bottom of message.
>>>>>
>>>>> btrfs_insert_delayed_dir_index is attempting to migrate some reservation
>>>>> from the transaction block reserve and finding it exhausted leading to a
>>>>> warning and enospc. This is a bug as the reservations are meant to be
>>>>> worst case. It should be impossible to exhaust the transaction block
>>>>> reserve.
>>>>>
>>>>> Some tracing of affected hosts revealed that there were single
>>>>> btrfs_search_slot calls that were COWing 100s of times. I was able to
>>>>> reproduce this behavior locally by creating a very constrained cgroup
>>>>> and producing a lot of concurrent filesystem operations. Here's the
>>>>> pattern:
>>>>>
>>>>>   1. btrfs_search_slot() begins tree traversal with cow=1
>>>>>   2. Node at level N needs COW (old generation or WRITTEN flag set)
>>>>>   3. btrfs_cow_block() allocates new node, updates parent pointer
>>>>>   4. Traversal continues, but hits a condition requiring restart (e.g., node
>>>>>      not cached, lock contention, need higher write_lock_level)
>>>>>   5. btrfs_release_path() releases all locks and references
>>>>>   6. Memory pressure triggers writeback on the COW'd node
>>>>>   7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>>>>>      BTRFS_HEADER_FLAG_WRITTEN
>>>>>   8. goto again - traversal restarts from root
>>>>>   9. Traversal reaches the freshly COW'd node
>>>>>   10. should_cow_block() sees WRITTEN flag set, returns true
>>>>>   11. btrfs_cow_block() allocates another new node - same logical position,
>>>>>       new physical location, new reservation consumed
>>>>>   12. Steps 4-11 repeat indefinitely under sustained memory pressure
>>>>>
>>>>> Note this behavior should be much harder to trigger since Boris's
>>>>> AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
>>>>> accounted for in user cgroups. However, I believe it
>>>>> would still be an issue under global memory pressure.
>>>>> Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
>>>>>
>>>>> This COW amplification breaks the idea that transaction reservations are
>>>>> worst case as any search slot call could find itself in this COW loop and
>>>>> exhaust its reservation.
>>>>>
>>>>> My proposed solution is to temporarily pin extent buffers for the
>>>>> lifetime of btrfs_search_slot. This prevents the massive COW
>>>>> amplification that can be seen during high memory pressure.
>>>>>
>>>>> The implementation uses a local xarray to track COW'd buffers for the
>>>>> duration of the search. The xarray stores extent_buffer pointers without
>>>>> taking additional references; this is safe because tracked buffers remain
>>>>> dirty (writeback_blockers prevents the dirty bit from being cleared) and
>>>>> dirty buffers cannot be reclaimed by memory pressure.
>>>>>
>>>>> Synchronization is provided by eb->lock: increments in
>>>>> btrfs_search_slot_track_cow() occur while holding the write lock, and
>>>>> the check in lock_extent_buffer_for_io() also holds the write lock via
>>>>> btrfs_tree_lock(). Decrements don't require eb->lock because
>>>>> writeback_blockers is atomic and merely indicates "don't write yet".
>>>>> Once we decrement, we're done and don't care if writeback proceeds
>>>>> immediately.
>>>> This seems too complex to me.
>>>>
>>>> So this problem is very similar to some idea I had a few years ago but
>>>> never managed to implement.
>>>> It was about avoiding unnecessary COW, not for this space reservation
>>>> exhaustion due to sustained memory pressure, but it would solve it
>>>> too.
>>>>
>>>> The idea was that we do unnecessary COW in cases like this:
>>>>
>>>> 1) We COW a path in some tree and we are at transaction N;
>>>>
>>>> 2) Writeback happened for the extent buffers in that path while we are
>>>> in the same transaction, because we reached the 32M limit and some
>>>> task called btrfs_btree_balance_dirty() or something else triggered
>>>> writeback of the btree inode;
>>>>
>>>> 3) While still at transaction N, we visit the same path to add an item
>>>> to a leaf, or modify an item, whatever. Because the extent buffers
>>>> have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
>>>> returns true).
>>>>
>>>> So during the lifetime of a transaction we can have a lot of
>>>> unnecessary COW - we spend more time allocating extents, allocating
>>>> memory, copying extent buffer data, use more space per transaction,
>>>> etc.
>>>>
>>>> The idea was to not COW when an extent buffer has
>>>> BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
>>>> (btrfs_header_generation(eb)) matches the current transaction.
>>>> That is safe because there's no committed tree that points to an
>>>> extent buffer created in the current transaction.
>>>>
>>>> Any further modification to the extent buffer must be sure that the
>>>> EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
>>>> transaction's dirty_pages io tree, etc, so that we don't miss writing
>>>> the extent buffer to the same location again before the transaction
>>>> commits the superblocks.
>>>>
>>>> Have you considered an approach like this?
>>> I had not considered this, but it is a great idea.
>>>
>>> My first thought is that implementing this could be as simple
>>> as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
>>> would mess with the assumptions around the log tree. From
>>> btrfs_sync_log():
>> After a fast glance and some tests, I found things might not be that 
>> easy. The problem is not only the log tree.
>>> /*
>>>   * IO has been started, blocks of the log tree have WRITTEN flag set
>>>   * in their headers. new modifications of the log will be written to
>>>   * new positions. so it's safe to allow log writers to go in.
>>>   */
>>>
>>> ^ Assumes that WRITTEN blocks will be COW'd.
>>>
>>> The issue looks like:
>>>
>>>   1. fsync A COWs eb
>>>   2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
>>>   3. fsync B does __not__ COW eb and modifies it
>>>   4. fsync A writes modified eb to disk
>>>   5. CRASH; the log tree is corrupted
>>>
>>> One way to avoid that is to keep the current behavior for the log
>>> tree, but that leaves the potential for COW amplification...
>> I tested with a patch like this:
>> @@ -624,14 +624,18 @@ static inline bool should_cow_block(const struct 
>> btrfs_trans_handle *trans,
>>          if (btrfs_header_generation(buf) != trans->transid)
>>                  return true;
>>
>> -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
>> -               return true;
>> -
>>          /* Ensure we can see the FORCE_COW bit. */
>>          smp_mb__before_atomic();
>>          if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>>                  return true;
>>
>> +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
>> +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
>> +                       return true;
>> +               btrfs_mark_buffer_dirty(trans, buf);
>> +               return false;
>> +       }
>> +
>>          if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
>>
>>                  return false;
>>
>> And get some errors like this:
>>
>>
>> [  +0.090163] [ T2589] run fstests btrfs/004 at 2026-01-30 11:53:37
>> [  +0.432352] [T11685] BTRFS: device fsid 1fb397fc-97a7-44dd-9602- 
>> dd38b74bc391 devid 1 transid 8 /dev/loop1 (7:1) scanned by mount (11685)
>> [  +0.000351] [T11685] BTRFS info (device loop1): first mount of 
>> filesystem 1fb397fc-97a7-44dd-9602-dd38b74bc391
>> [  +0.000014] [T11685] BTRFS info (device loop1): using crc32c 
>> (crc32c- lib) checksum algorithm
>> [  +0.001298] [T11685] BTRFS info (device loop1): checking UUID tree
>> [  +0.000039] [T11685] BTRFS info (device loop1): enabling ssd 
>> optimizations
>> [  +0.000003] [T11685] BTRFS info (device loop1): turning on async 
>> discard
>> [  +0.000002] [T11685] BTRFS info (device loop1): enabling free space 
>> tree
>> [  +1.051781] [T11703] page: refcount:2 mapcount:0 
>> mapping:00000000eb6d7caa index:0x2348 pfn:0x1caebf
>> [  +0.000008] [T11703] memcg:ffff9b3300263cc0
>> [  +0.000003] [T11703] aops:0xffffffffc0354040 ino:1
>> [  +0.000024] [T11703] flags: 0x4e0000000000423e(referenced|uptodate| 
>> dirty|lru|workingset|private|writeback|zone=1)
>> [  +0.000007] [T11703] raw: 4e0000000000423e fffff74a872bb908 
>> fffff74a84206a88 ffff9b33c6706880
>> [  +0.000004] [T11703] raw: 0000000000002348 ffff9b334be522d0 
>> 00000002ffffffff ffff9b3300263cc0
>> [  +0.000002] [T11703] page dumped because: eb page dump
>> [  +0.000003] [T11703] BTRFS critical (device loop1): corrupt leaf: 
>> root=5 block=36995072 slot=118 ino=406 file_offset=94208, invalid 
>> ram_bytes for file extent, have 8660273067269322872, should be aligned 
>> to 4096
>> [  +0.000013] [T11703] BTRFS info (device loop1): leaf 36995072 gen 33 
>> total ptrs 128 free space 2857 owner 5
>> [  +0.000006] [T11703]     item 0 key (386 DIR_ITEM 238230307) itemoff 
>> 16249 itemsize 34
>> [  +0.000004] [T11703]         location key (462 1 0) type 2
>> [  +0.000003] [T11703]         transid 33 data_len 0 name_len 4
>> [  +0.000003] [T11703]     item 1 key (386 DIR_ITEM 1473745676) 
>> itemoff 16216 itemsize 33
>> [  +0.000004] [T11703]         location key (376 1 0) type 3
>> [  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
>> [  +0.000003] [T11703]     item 2 key (386 DIR_ITEM 2243137595) 
>> itemoff 16182 itemsize 34
>> [  +0.000004] [T11703]         location key (413 1 0) type 1
>> [  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
>> ...
>> [  +0.000001] [T11703]     item 127 key (405 DIR_ITEM 828387202) 
>> itemoff 6057 itemsize 34
>> [  +0.000002] [T11703]         location key (479 1 0) type 3
>> [  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
>> [  +0.000002] [T11703] BTRFS error (device loop1): block=36995072 
>> write time tree block corruption detected
>> [  +0.003429] [T11703] BTRFS: error (device loop1) in 
>> btrfs_commit_transaction:2555: errno=-5 IO failure (Error while 
>> writing out transaction)
>> [  +0.000007] [T11703] BTRFS info (device loop1 state E): forced readonly
>> [  +0.000002] [T11703] BTRFS warning (device loop1 state E): Skipping 
>> commit of aborted transaction.
>> [  +0.000002] [T11703] BTRFS error (device loop1 state EA): 
>> Transaction aborted (error -5)
>> [  +0.000003] [T11703] BTRFS: error (device loop1 state EA) in 
>> cleanup_transaction:2037: errno=-5 IO failure
>>
>> The reported inode 406 is not even in the printed leaf. It seems like
>> a data race, maybe caused by:
>>
>> We unlock the eb after setting the WRITTEN flag during write back, and 
>> the eb should not get modified since then because all future writes 
>> will use the cowed eb. However, with the WRITTEN flag check removed in 
>> should_cow_block, we might write to the eb with WRITTEN flag set which 
>> might be under io.
> 
> I tried again with this:
> 
> @@ -624,14 +624,20 @@ static inline bool should_cow_block(const struct 
> btrfs_trans_handle *trans,
>          if (btrfs_header_generation(buf) != trans->transid)
>                  return true;
> 
> -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> -               return true;
> -
>          /* Ensure we can see the FORCE_COW bit. */
>          smp_mb__before_atomic();
>          if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>                  return true;
> 
> +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
> +                       return true;
> +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> +                       return true;
> +               btrfs_mark_buffer_dirty(trans, buf);
> +               return false;
> +       }
> +
>          if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
>                  return false;
> 
> When WRITEBACK is set, do a normal cow to prevent the data race. This 
> seems to fix the previous problem. However, I got this:
> 
> [  +0.020843] [T15127] BTRFS error (device loop1): block=30687232 bad 
> generation, have 11 expect > 14
> [  +0.000009] [T15127]     item 0 key (256 INODE_ITEM 0) itemoff 16123 
> itemsize 160
> [  +0.000004] [T15127]         inode generation 3 transid 11 size 10 
> nbytes 16384
> [  +0.000003] [T15127]         block group 0 mode 40755 links 1 uid 0 gid 0
> [  +0.000002] [T15127]         rdev 0 sequence 1 flags 0x0
> [  +0.000002] [T15127]         atime 1769760651.0
> [  +0.000002] [T15127]         ctime 1769760652.250234845
> [  +0.000002] [T15127]         mtime 1769760652.250234845
> [  +0.000001] [T15127]         otime 1769760651.0
> ...
> [  +0.000004] [T15127]         root data bytenr 30523392 refs 1
> [  +0.000002] [T15127] BTRFS error (device loop1): block=30851072 write 
> time tree block corruption detected
> 
> and a lot more lines with the same generation errors for btrfs/122 
> btrfs/152 btrfs/210 btrfs/224 btrfs/316 btrfs/320 btrfs/340 fstest cases.
> 
> I have no idea why it's trying to write some ebs older than the current
> transaction. It seems related to snapshots.

This happens because after an extent buffer (eb) is written to disk, 
subsequent modifications only set the dirty flag without adding those 
pages to the current transaction's dirty list. Consequently, their 
writeback isn't triggered or awaited during transaction commit.

In contrast, newly allocated or COWed extent buffers are explicitly 
added to the transaction's dirty_pages via btrfs_init_new_buffer, which 
ensures they are properly tracked and written back.

Adding the following code could fix this:

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8d683745afd1..3ab89a31f9bb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4450,6 +4450,9 @@ void btrfs_mark_buffer_dirty(struct btrfs_trans_handle *trans,
                            buf->start, transid, fs_info->generation);
         }
         set_extent_buffer_dirty(buf);
+       if (btrfs_header_owner(buf) != BTRFS_TREE_LOG_OBJECTID)
+               btrfs_set_extent_bit(&trans->transaction->dirty_pages, buf->start,
+                                    buf->start + buf->len - 1, EXTENT_DIRTY, NULL);
  }

  static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,


Thanks,
Sun YangKai

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 15:43       ` Boris Burkov
@ 2026-01-30 15:57         ` Filipe Manana
  2026-02-03  1:09           ` Leo Martins
  0 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2026-01-30 15:57 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Leo Martins, linux-btrfs, kernel-team

On Fri, Jan 30, 2026 at 3:44 PM Boris Burkov <boris@bur.io> wrote:
>
> On Fri, Jan 30, 2026 at 12:49:55PM +0000, Filipe Manana wrote:
> > On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
> > >
> > > On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> > >
> > > > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > > > >
> > > > > I've been investigating enospcs at Meta and have observed a strange
> > > > > pattern where filesystems are enospcing with lots of unallocated space
> > > > > (> 100G). Sample dmesg dump at bottom of message.
> > > > >
> > > > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > > > from the transaction block reserve and finding it exhausted leading to a
> > > > > warning and enospc. This is a bug as the reservations are meant to be
> > > > > worst case. It should be impossible to exhaust the transaction block
> > > > > reserve.
> > > > >
> > > > > Some tracing of affected hosts revealed that there were single
> > > > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > > > reproduce this behavior locally by creating a very constrained cgroup
> > > > > and producing a lot of concurrent filesystem operations. Here's the
> > > > > pattern:
> > > > >
> > > > >  1. btrfs_search_slot() begins tree traversal with cow=1
> > > > >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> > > > >  3. btrfs_cow_block() allocates new node, updates parent pointer
> > > > >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> > > > >     not cached, lock contention, need higher write_lock_level)
> > > > >  5. btrfs_release_path() releases all locks and references
> > > > >  6. Memory pressure triggers writeback on the COW'd node
> > > > >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > > > >     BTRFS_HEADER_FLAG_WRITTEN
> > > > >  8. goto again - traversal restarts from root
> > > > >  9. Traversal reaches the freshly COW'd node
> > > > >  10. should_cow_block() sees WRITTEN flag set, returns true
> > > > >  11. btrfs_cow_block() allocates another new node - same logical position,
> > > > >      new physical location, new reservation consumed
> > > > >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > > > >
> > > > > Note this behavior should be much harder to trigger since Boris's
> > > > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > > > accounted for in user cgroups. However, I believe it
> > > > > would still be an issue under global memory pressure.
> > > > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > > > >
> > > > > This COW amplification breaks the idea that transaction reservations are
> > > > > worst case as any search slot call could find itself in this COW loop and
> > > > > exhaust its reservation.
> > > > >
> > > > > My proposed solution is to temporarily pin extent buffers for the
> > > > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > > > amplification that can be seen during high memory pressure.
> > > > >
> > > > > The implementation uses a local xarray to track COW'd buffers for the
> > > > > duration of the search. The xarray stores extent_buffer pointers without
> > > > > taking additional references; this is safe because tracked buffers remain
> > > > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > > > dirty buffers cannot be reclaimed by memory pressure.
> > > > >
> > > > > Synchronization is provided by eb->lock: increments in
> > > > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > > > Once we decrement, we're done and don't care if writeback proceeds
> > > > > immediately.
> > > >
> > > > This seems too complex to me.
> > > >
> > > > So this problem is very similar to some idea I had a few years ago but
> > > > never managed to implement.
> > > > It was about avoiding unnecessary COW, not for this space reservation
> > > > exhaustion due to sustained memory pressure, but it would solve it
> > > > too.
> > > >
> > > > The idea was that we do unnecessary COW in cases like this:
> > > >
> > > > 1) We COW a path in some tree and we are at transaction N;
> > > >
> > > > 2) Writeback happened for the extent buffers in that path while we are
> > > > in the same transaction, because we reached the 32M limit and some
> > > > task called btrfs_btree_balance_dirty() or something else triggered
> > > > writeback of the btree inode;
> > > >
> > > > 3) While still at transaction N, we visit the same path to add an item
> > > > to a leaf, or modify an item, whatever. Because the extent buffers
> > > > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > > > returns true).
> > > >
> > > > So during the lifetime of a transaction we can have a lot of
> > > > unnecessary COW - we spend more time allocating extents, allocating
> > > > memory, copying extent buffer data, use more space per transaction,
> > > > etc.
> > > >
> > > > The idea was to not COW when an extent buffer has
> > > > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > > > (btrfs_header_generation(eb)) matches the current transaction.
> > > > That is safe because there's no committed tree that points to an
> > > > extent buffer created in the current transaction.
> > > >
> > > > Any further modification to the extent buffer must be sure that the
> > > > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > > > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > > > the extent buffer to the same location again before the transaction
> > > > commits the superblocks.
> > > >
> > > > Have you considered an approach like this?
> > >
> > > I had not considered this, but it is a great idea.
> > >
> > > My first thought is that implementing this could be as simple
> > > as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> > > would mess with the assumptions around the log tree. From
> > > btrfs_sync_log():
> > >
> > > /*
> > >  * IO has been started, blocks of the log tree have WRITTEN flag set
> > >  * in their headers. new modifications of the log will be written to
> > >  * new positions. so it's safe to allow log writers to go in.
> > >  */
> > >
> > > ^ Assumes that WRITTEN blocks will be COW'd.
> > >
> > > The issue looks like:
> > >
> > >  1. fsync A COWs eb
> > >  2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> > >  3. fsync B does __not__ COW eb and modifies it
> > >  4. fsync A writes modified eb to disk
> > >  5. CRASH; the log tree is corrupted
> > >
> > > One way to avoid that is to keep the current behavior for the log
> > > tree, but that leaves the potential for COW amplification...
> > >
> > > Another idea is to track the log_transid in the eb in the same way
> > > the transid is tracked. Then, in should_cow_block we have something
> > > like:
> > >
> > > if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
> > >     buf->log_transid != root->log_transid)
> > >   return true;
> >
> > Log trees are special since their lifetime doesn't span a
> > transaction, so what I suggested of course doesn't work for log trees,
> > and I forgot to mention that.
> >
> > Tracking the log_transid in the extent buffer will not always work -
> > because it can be evicted and reloaded, so we would lose its value.
> > We would have to update the on-disk format to store it somewhere or
> > keep another in memory structure to track that, or prevent eviction of
> > log tree buffers - all of those are too complex.
>
> Supposing we cannot think of a way to do overwrites on log tree ebs,
> but that we can make it work for other ebs (excluding the zoned case
> you mentioned below):
>
> What do you think about the problem of space reservation exhaustion
> due to COW amplification when narrowed to just log trees? As far as I
> can tell, there is nothing special about how logged items consume
> transaction reservation so the problem would be reduced but still exist.

Log trees are small to start with, and if they ever hit -ENOSPC, we
just fall back to a transaction commit.

>
> Do we want to pursue working out the kinks in eb-overwrite (seems super
> valuable regardless of motivation) and think of some other final
> backstop for log tree ebs? Given that the fsync will be sending the ebs
> down to the disk quite soon anyway, I was thinking it might be more
> palatable to try to fully prevent premature writeback of log tree ebs.
>
> >
> > So I had this half baked patch from many years ago:
> >
> >  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
> >                       *root, struct btrfs_path *path, int level);
> > @@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
> > btrfs_trans_handle *trans,
> >          *    block to ensure the metadata consistency.
> >          */
> >         if (btrfs_header_generation(buf) == trans->transid &&
> > -           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> >             !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> >               btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> > -           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> > +           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
> > +
> > +               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> > +                       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > +                               return 1;
> > +                       return 0;
> > +               }
> > +
> > +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
> > +                   test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
> > +                       return 1;
> >
> > This was before a recent refactoring of should_cow_block(), but you
> > should get the idea.
> > IIRC all fstests were passing back then, except for one or two which I
> > never spent time debugging.
> >
> > And as that attempt was before the tree checker existed, we would need
> > to make sure we don't change an eb while the tree checker is
> > verifying it - making sure the tree checker read locks the eb should
> > be enough.
>
> I suspect this is what Sun was hitting in his replies to Leo.
>
> >
> > There's also one problem with this idea: it won't work for zoned
> > devices as writes are sequential and we can't write twice to the same
> > location without doing the zone reset thing which only happens around
> > transaction commit time IIRC.
> >
> > Thanks.
> >
>
> Thanks for your input on this, it's really appreciated,
> Boris
>
> > >
> > > Please let me know if you see any issues with this approach or
> > > if you can think of a better method.
> > >
> > > Thanks,
> > > Leo
> > >
> > > >
> > > > It would solve this space reservation exhaustion problem, as well as
> > > > unnecessary COW for general optimization, without the need for a
> > > > local xarray, which besides being very specific for the
> > > > btrfs_search_slot() case (we COW in other places), also requires a
> > > > memory allocation which can fail.
> > > >
> > > > Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 15:50         ` Sun YangKai
@ 2026-01-30 16:11           ` Filipe Manana
  2026-01-31  9:16             ` Sun YangKai
  0 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2026-01-30 16:11 UTC (permalink / raw)
  To: Sun YangKai; +Cc: Leo Martins, linux-btrfs, kernel-team

On Fri, Jan 30, 2026 at 3:50 PM Sun YangKai <sunk67188@gmail.com> wrote:
>
> On 2026/1/30 17:37, Sun YangKai wrote:
> >
> >
> > On 2026/1/30 12:14, Sun YangKai wrote:
> >> On 2026/1/30 08:12, Leo Martins wrote:
> >>> On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana<fdmanana@kernel.org>
> >>> wrote:
> >>>> On Tue, Jan 27, 2026 at 8:43 PM Leo Martins<loemra.dev@gmail.com>
> >>>> wrote:
> >>>>> I've been investigating enospcs at Meta and have observed a strange
> >>>>> pattern where filesystems are enospcing with lots of unallocated space
> >>>>> (> 100G). Sample dmesg dump at bottom of message.
> >>>>>
> >>>>> btrfs_insert_delayed_dir_index is attempting to migrate some
> >>>>> reservation
> >>>>> from the transaction block reserve and finding it exhausted leading
> >>>>> to a
> >>>>> warning and enospc. This is a bug as the reservations are meant to be
> >>>>> worst case. It should be impossible to exhaust the transaction block
> >>>>> reserve.
> >>>>>
> >>>>> Some tracing of affected hosts revealed that there were single
> >>>>> btrfs_search_slot calls that were COWing 100s of times. I was able to
> >>>>> reproduce this behavior locally by creating a very constrained cgroup
> >>>>> and producing a lot of concurrent filesystem operations. Here's the
> >>>>> pattern:
> >>>>>
> >>>>>   1. btrfs_search_slot() begins tree traversal with cow=1
> >>>>>   2. Node at level N needs COW (old generation or WRITTEN flag set)
> >>>>>   3. btrfs_cow_block() allocates new node, updates parent pointer
> >>>>>   4. Traversal continues, but hits a condition requiring restart
> >>>>> (e.g., node
> >>>>>      not cached, lock contention, need higher write_lock_level)
> >>>>>   5. btrfs_release_path() releases all locks and references
> >>>>>   6. Memory pressure triggers writeback on the COW'd node
> >>>>>   7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> >>>>>      BTRFS_HEADER_FLAG_WRITTEN
> >>>>>   8. goto again - traversal restarts from root
> >>>>>   9. Traversal reaches the freshly COW'd node
> >>>>>   10. should_cow_block() sees WRITTEN flag set, returns true
> >>>>>   11. btrfs_cow_block() allocates another new node - same logical
> >>>>> position,
> >>>>>       new physical location, new reservation consumed
> >>>>>   12. Steps 4-11 repeat indefinitely under sustained memory pressure
> >>>>>
> >>>>> Note this behavior should be much harder to trigger since Boris's
> >>>>> AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> >>>>> accounted for in user cgroups. However, I believe it
> >>>>> would still be an issue under global memory pressure.
> >>>>> Link:https://lore.kernel.org/linux-btrfs/
> >>>>> cover.1755812945.git.boris@bur.io/
> >>>>>
> >>>>> This COW amplification breaks the idea that transaction
> >>>>> reservations are
> >>>>> worst case as any search slot call could find itself in this COW
> >>>>> loop and
> >>>>> exhaust its reservation.
> >>>>>
> >>>>> My proposed solution is to temporarily pin extent buffers for the
> >>>>> lifetime of btrfs_search_slot. This prevents the massive COW
> >>>>> amplification that can be seen during high memory pressure.
> >>>>>
> >>>>> The implementation uses a local xarray to track COW'd buffers for the
> >>>>> duration of the search. The xarray stores extent_buffer pointers
> >>>>> without
> >>>>> taking additional references; this is safe because tracked buffers
> >>>>> remain
> >>>>> dirty (writeback_blockers prevents the dirty bit from being
> >>>>> cleared) and
> >>>>> dirty buffers cannot be reclaimed by memory pressure.
> >>>>>
> >>>>> Synchronization is provided by eb->lock: increments in
> >>>>> btrfs_search_slot_track_cow() occur while holding the write lock, and
> >>>>> the check in lock_extent_buffer_for_io() also holds the write lock via
> >>>>> btrfs_tree_lock(). Decrements don't require eb->lock because
> >>>>> writeback_blockers is atomic and merely indicates "don't write yet".
> >>>>> Once we decrement, we're done and don't care if writeback proceeds
> >>>>> immediately.
> >>>> This seems too complex to me.
> >>>>
> >>>> So this problem is very similar to some idea I had a few years ago but
> >>>> never managed to implement.
> >>>> It was about avoiding unnecessary COW, not for this space reservation
> >>>> exhaustion due to sustained memory pressure, but it would solve it
> >>>> too.
> >>>>
> >>>> The idea was that we do unnecessary COW in cases like this:
> >>>>
> >>>> 1) We COW a path in some tree and we are at transaction N;
> >>>>
> >>>> 2) Writeback happened for the extent buffers in that path while we are
> >>>> in the same transaction, because we reached the 32M limit and some
> >>>> task called btrfs_btree_balance_dirty() or something else triggered
> >>>> writeback of the btree inode;
> >>>>
> >>>> 3) While still at transaction N, we visit the same path to add an item
> >>>> to a leaf, or modify an item, whatever. Because the extent buffers
> >>>> have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> >>>> returns true).
> >>>>
> >>>> So during the lifetime of a transaction we can have a lot of
> >>>> unnecessary COW - we spend more time allocating extents, allocating
> >>>> memory, copying extent buffer data, use more space per transaction,
> >>>> etc.
> >>>>
> >>>> The idea was to not COW when an extent buffer has
> >>>> BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> >>>> (btrfs_header_generation(eb)) matches the current transaction.
> >>>> That is safe because there's no committed tree that points to an
> >>>> extent buffer created in the current transaction.
> >>>>
> >>>> Any further modification to the extent buffer must be sure that the
> >>>> EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> >>>> transaction's dirty_pages io tree, etc, so that we don't miss writing
> >>>> the extent buffer to the same location again before the transaction
> >>>> commits the superblocks.
> >>>>
> >>>> Have you considered an approach like this?
> >>> I had not considered this, but it is a great idea.
> >>>
> >>> My first thought is that implementing this could be as simple
> >>> as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> >>> would mess with the assumptions around the log tree. From
> >>> btrfs_sync_log():
> >> After a fast glance and some tests, I found things might not be that
> >> easy. The problem is not only the log tree.
> >>> /*
> >>>   * IO has been started, blocks of the log tree have WRITTEN flag set
> >>>   * in their headers. new modifications of the log will be written to
> >>>   * new positions. so it's safe to allow log writers to go in.
> >>>   */
> >>>
> >>> ^ Assumes that WRITTEN blocks will be COW'd.
> >>>
> >>> The issue looks like:
> >>>
> >>>   1. fsync A COWs eb
> >>>   2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> >>>   3. fsync B does __not__ COW eb and modifies it
> >>>   4. fsync A writes modified eb to disk
> >>>   5. CRASH; the log tree is corrupted
> >>>
> >>> One way to avoid that is to keep the current behavior for the log
> >>> tree, but that leaves the potential for COW amplification...
> >> I tested with a patch like this:
> >> @@ -624,14 +624,18 @@ static inline bool should_cow_block(const struct
> >> btrfs_trans_handle *trans,
> >>          if (btrfs_header_generation(buf) != trans->transid)
> >>                  return true;
> >>
> >> -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> >> -               return true;
> >> -
> >>          /* Ensure we can see the FORCE_COW bit. */
> >>          smp_mb__before_atomic();
> >>          if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> >>                  return true;
> >>
> >> +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> >> +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
> >> +                       return true;
> >> +               btrfs_mark_buffer_dirty(trans, buf);
> >> +               return false;
> >> +       }
> >> +
> >>          if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> >>
> >>                  return false;
> >>
> >> And get some errors like this:
> >>
> >>
> >> [  +0.090163] [ T2589] run fstests btrfs/004 at 2026-01-30 11:53:37
> >> [  +0.432352] [T11685] BTRFS: device fsid 1fb397fc-97a7-44dd-9602-
> >> dd38b74bc391 devid 1 transid 8 /dev/loop1 (7:1) scanned by mount (11685)
> >> [  +0.000351] [T11685] BTRFS info (device loop1): first mount of
> >> filesystem 1fb397fc-97a7-44dd-9602-dd38b74bc391
> >> [  +0.000014] [T11685] BTRFS info (device loop1): using crc32c
> >> (crc32c- lib) checksum algorithm
> >> [  +0.001298] [T11685] BTRFS info (device loop1): checking UUID tree
> >> [  +0.000039] [T11685] BTRFS info (device loop1): enabling ssd
> >> optimizations
> >> [  +0.000003] [T11685] BTRFS info (device loop1): turning on async
> >> discard
> >> [  +0.000002] [T11685] BTRFS info (device loop1): enabling free space
> >> tree
> >> [  +1.051781] [T11703] page: refcount:2 mapcount:0
> >> mapping:00000000eb6d7caa index:0x2348 pfn:0x1caebf
> >> [  +0.000008] [T11703] memcg:ffff9b3300263cc0
> >> [  +0.000003] [T11703] aops:0xffffffffc0354040 ino:1
> >> [  +0.000024] [T11703] flags: 0x4e0000000000423e(referenced|uptodate|dirty|lru|workingset|private|writeback|zone=1)
> >> [  +0.000007] [T11703] raw: 4e0000000000423e fffff74a872bb908
> >> fffff74a84206a88 ffff9b33c6706880
> >> [  +0.000004] [T11703] raw: 0000000000002348 ffff9b334be522d0
> >> 00000002ffffffff ffff9b3300263cc0
> >> [  +0.000002] [T11703] page dumped because: eb page dump
> >> [  +0.000003] [T11703] BTRFS critical (device loop1): corrupt leaf:
> >> root=5 block=36995072 slot=118 ino=406 file_offset=94208, invalid
> >> ram_bytes for file extent, have 8660273067269322872, should be aligned
> >> to 4096
> >> [  +0.000013] [T11703] BTRFS info (device loop1): leaf 36995072 gen 33
> >> total ptrs 128 free space 2857 owner 5
> >> [  +0.000006] [T11703]     item 0 key (386 DIR_ITEM 238230307) itemoff
> >> 16249 itemsize 34
> >> [  +0.000004] [T11703]         location key (462 1 0) type 2
> >> [  +0.000003] [T11703]         transid 33 data_len 0 name_len 4
> >> [  +0.000003] [T11703]     item 1 key (386 DIR_ITEM 1473745676)
> >> itemoff 16216 itemsize 33
> >> [  +0.000004] [T11703]         location key (376 1 0) type 3
> >> [  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
> >> [  +0.000003] [T11703]     item 2 key (386 DIR_ITEM 2243137595)
> >> itemoff 16182 itemsize 34
> >> [  +0.000004] [T11703]         location key (413 1 0) type 1
> >> [  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
> >> ...
> >> [  +0.000001] [T11703]     item 127 key (405 DIR_ITEM 828387202)
> >> itemoff 6057 itemsize 34
> >> [  +0.000002] [T11703]         location key (479 1 0) type 3
> >> [  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
> >> [  +0.000002] [T11703] BTRFS error (device loop1): block=36995072
> >> write time tree block corruption detected
> >> [  +0.003429] [T11703] BTRFS: error (device loop1) in
> >> btrfs_commit_transaction:2555: errno=-5 IO failure (Error while
> >> writing out transaction)
> >> [  +0.000007] [T11703] BTRFS info (device loop1 state E): forced readonly
> >> [  +0.000002] [T11703] BTRFS warning (device loop1 state E): Skipping
> >> commit of aborted transaction.
> >> [  +0.000002] [T11703] BTRFS error (device loop1 state EA):
> >> Transaction aborted (error -5)
> >> [  +0.000003] [T11703] BTRFS: error (device loop1 state EA) in
> >> cleanup_transaction:2037: errno=-5 IO failure
> >>
> >> The reported inode 406 is not even in the printed leaf. It looks like
> >> a data race, possibly caused by the following:
> >>
> >> We unlock the eb after setting the WRITTEN flag during write back, and
> >> the eb should not get modified since then because all future writes
> >> will use the cowed eb. However, with the WRITTEN flag check removed in
> >> should_cow_block, we might write to the eb with WRITTEN flag set which
> >> might be under io.
> >
> > I tried again with this:
> >
> > @@ -624,14 +624,20 @@ static inline bool should_cow_block(const struct
> > btrfs_trans_handle *trans,
> >          if (btrfs_header_generation(buf) != trans->transid)
> >                  return true;
> >
> > -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > -               return true;
> > -
> >          /* Ensure we can see the FORCE_COW bit. */
> >          smp_mb__before_atomic();
> >          if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> >                  return true;
> >
> > +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
> > +                       return true;
> > +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> > +                       return true;
> > +               btrfs_mark_buffer_dirty(trans, buf);
> > +               return false;
> > +       }
> > +
> >          if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> >                  return false;
> >
> > When WRITEBACK is set, do a normal cow to prevent the data race. This
> > seems to fix the previous problem. However, I got this:
> >
> > [  +0.020843] [T15127] BTRFS error (device loop1): block=30687232 bad
> > generation, have 11 expect > 14
> > [  +0.000009] [T15127]     item 0 key (256 INODE_ITEM 0) itemoff 16123
> > itemsize 160
> > [  +0.000004] [T15127]         inode generation 3 transid 11 size 10
> > nbytes 16384
> > [  +0.000003] [T15127]         block group 0 mode 40755 links 1 uid 0 gid 0
> > [  +0.000002] [T15127]         rdev 0 sequence 1 flags 0x0
> > [  +0.000002] [T15127]         atime 1769760651.0
> > [  +0.000002] [T15127]         ctime 1769760652.250234845
> > [  +0.000002] [T15127]         mtime 1769760652.250234845
> > [  +0.000001] [T15127]         otime 1769760651.0
> > ...
> > [  +0.000004] [T15127]         root data bytenr 30523392 refs 1
> > [  +0.000002] [T15127] BTRFS error (device loop1): block=30851072 write
> > time tree block corruption detected
> >
> > and a lot more lines with the same generation errors for btrfs/122
> > btrfs/152 btrfs/210 btrfs/224 btrfs/316 btrfs/320 btrfs/340 fstest cases.
> >
> > I have no idea why it's trying to write some ebs older than the current
> > transaction. It seems related to snapshots.
>
> This happens because after an extent buffer (eb) is written to disk,
> subsequent modifications only set the dirty flag without adding those
> pages to the current transaction's dirty list. Consequently, their
> writeback isn't triggered or awaited during transaction commit.
>
> In contrast, newly allocated or COWed extent buffers are explicitly
> added to the transaction's dirty_pages via btrfs_init_new_buffer, which
> ensures they are properly tracked and written back.
>
> Adding the following code could fix this:
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8d683745afd1..3ab89a31f9bb 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4450,6 +4450,9 @@ void btrfs_mark_buffer_dirty(struct
> btrfs_trans_handle *trans,
>                             buf->start, transid, fs_info->generation);
>          }
>          set_extent_buffer_dirty(buf);
> +       if (btrfs_header_owner(buf) != BTRFS_TREE_LOG_OBJECTID)
> +               btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> buf->start,
> +                                    buf->start + buf->len - 1,

Log tree extent buffers have their own dedicated io tree, they are not
meant to go to ->dirty_pages.
They are meant to be flushed only during fsync and not during a
transaction commit.

I appreciate that you are trying to help, but trying out random things
without having a better understanding of internals is just noise on
the list.

The errors you are getting are very likely because the tree checker
does not lock the eb, since after an eb is written we currently don't
expect changes to it, so we don't lock.
But this idea invalidates that expectation: an eb can be modified again
after being written, so the tree checker would need to read lock the
eb.

Thanks.


> EXTENT_DIRTY, NULL);
>   }
>
>   static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
>
>
> Thanks,
> Sun YangKai

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 12:49     ` Filipe Manana
  2026-01-30 15:43       ` Boris Burkov
@ 2026-01-30 21:43       ` Leo Martins
  2026-01-30 22:34         ` Qu Wenruo
  1 sibling, 1 reply; 21+ messages in thread
From: Leo Martins @ 2026-01-30 21:43 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:

> On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > > >
> > > > I've been investigating enospcs at Meta and have observed a strange
> > > > pattern where filesystems are enospcing with lots of unallocated space
> > > > (> 100G). Sample dmesg dump at bottom of message.
> > > >
> > > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > > from the transaction block reserve and finding it exhausted leading to a
> > > > warning and enospc. This is a bug as the reservations are meant to be
> > > > worst case. It should be impossible to exhaust the transaction block
> > > > reserve.
> > > >
> > > > Some tracing of affected hosts revealed that there were single
> > > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > > reproduce this behavior locally by creating a very constrained cgroup
> > > > and producing a lot of concurrent filesystem operations. Here's the
> > > > pattern:
> > > >
> > > >  1. btrfs_search_slot() begins tree traversal with cow=1
> > > >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> > > >  3. btrfs_cow_block() allocates new node, updates parent pointer
> > > >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> > > >     not cached, lock contention, need higher write_lock_level)
> > > >  5. btrfs_release_path() releases all locks and references
> > > >  6. Memory pressure triggers writeback on the COW'd node
> > > >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > > >     BTRFS_HEADER_FLAG_WRITTEN
> > > >  8. goto again - traversal restarts from root
> > > >  9. Traversal reaches the freshly COW'd node
> > > >  10. should_cow_block() sees WRITTEN flag set, returns true
> > > >  11. btrfs_cow_block() allocates another new node - same logical position,
> > > >      new physical location, new reservation consumed
> > > >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > > >
> > > > Note this behavior should be much harder to trigger since Boris's
> > > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > > accounted for in user cgroups. However, I believe it
> > > > would still be an issue under global memory pressure.
> > > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > > >
> > > > This COW amplification breaks the idea that transaction reservations are
> > > > worst case as any search slot call could find itself in this COW loop and
> > > > exhaust its reservation.
> > > >
> > > > My proposed solution is to temporarily pin extent buffers for the
> > > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > > amplification that can be seen during high memory pressure.
> > > >
> > > > The implementation uses a local xarray to track COW'd buffers for the
> > > > duration of the search. The xarray stores extent_buffer pointers without
> > > > taking additional references; this is safe because tracked buffers remain
> > > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > > dirty buffers cannot be reclaimed by memory pressure.
> > > >
> > > > Synchronization is provided by eb->lock: increments in
> > > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > > Once we decrement, we're done and don't care if writeback proceeds
> > > > immediately.
> > >
> > > This seems too complex to me.
> > >
> > > So this problem is very similar to some idea I had a few years ago but
> > > never managed to implement.
> > > It was about avoiding unnecessary COW, not for this space reservation
> > > exhaustion due to sustained memory pressure, but it would solve it
> > > too.
> > >
> > > The idea was that we do unnecessary COW in cases like this:
> > >
> > > 1) We COW a path in some tree and we are at transaction N;
> > >
> > > 2) Writeback happened for the extent buffers in that path while we are
> > > in the same transaction, because we reached the 32M limit and some
> > > task called btrfs_btree_balance_dirty() or something else triggered
> > > writeback of the btree inode;
> > >
> > > 3) While still at transaction N, we visit the same path to add an item
> > > to a leaf, or modify an item, whatever. Because the extent buffers
> > > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > > returns true).
> > >
> > > So during the lifetime of a transaction we can have a lot of
> > > unnecessary COW - we spend more time allocating extents, allocating
> > > memory, copying extent buffer data, use more space per transaction,
> > > etc.
> > >
> > > The idea was to not COW when an extent buffer has
> > > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > > (btrfs_header_generation(eb)) matches the current transaction.
> > > That is safe because there's no committed tree that points to an
> > > extent buffer created in the current transaction.
> > >
> > > Any further modification to the extent buffer must be sure that the
> > > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > > the extent buffer to the same location again before the transaction
> > > commits the superblocks.
> > >
> > > Have you considered an approach like this?
> >
> > I had not considered this, but it is a great idea.
> >
> > My first thought is that implementing this could be as simple
> > as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> > would mess with the assumptions around the log tree. From
> > btrfs_sync_log():
> >
> > /*
> >  * IO has been started, blocks of the log tree have WRITTEN flag set
> >  * in their headers. new modifications of the log will be written to
> >  * new positions. so it's safe to allow log writers to go in.
> >  */
> >
> > ^ Assumes that WRITTEN blocks will be COW'd.
> >
> > The issue looks like:
> >
> >  1. fsync A COWs eb
> >  2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> >  3. fsync B does __not__ COW eb and modifies it
> >  4. fsync A writes modified eb to disk
> >  5. CRASH; the log tree is corrupted
> >
> > One way to avoid that is to keep the current behavior for the log
> > tree, but that leaves the potential for COW amplification...
> >
> > Another idea is to track the log_transid in the eb in the same way
> > the transid is tracked. Then, in should_cow_block we have something
> > like:
> >
> > if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
> >     buf->log_transid != root->log_transid)
> >   return true;
> 
> Log trees are special since their lifetime doesn't span one
> transaction, so what I suggested of course doesn't work for log trees,
> and I forgot to mention that.
> 
> Tracking the log_transid in the extent buffer will not always work -
> because it can be evicted and reloaded, so we would lose its value.
> We would have to update the on-disk format to store it somewhere or
> keep another in memory structure to track that, or prevent eviction of
> log tree buffers - all of those are too complex.
> 
> So I had this half baked patch from many years ago:
> 
>  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
>                       *root, struct btrfs_path *path, int level);
> @@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
> btrfs_trans_handle *trans,
>          *    block to ensure the metadata consistency.
>          */
>         if (btrfs_header_generation(buf) == trans->transid &&
> -           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>             !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>               btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> -           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> +           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
> +
> +               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> +                       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> +                               return 1;
> +                       return 0;
> +               }
> +
> +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
> +                   test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
> +                       return 1;

This is a great starting point. Will iterate on this and send
out a v2 next week.

Thanks,
Leo

> 
> This was before a recent refactoring of should_cow_block(), but you
> should get the idea.
> IIRC all fstests were passing back then, except for one or two which I
> never spent time debugging.
> 
> And as that attempt was before the tree checker existed, we would need
> to make sure we don't change an eb while the tree checker is
> verifying it - making sure the tree checker read locks the eb should
> be enough.
> 
> There's also one problem with this idea: it won't work for zoned
> devices as writes are sequential and we can't write twice to the same
> location without doing the zone reset thing which only happens around
> transaction commit time IIRC.
> 
> Thanks.
> 
> >
> > Please let me know if you see any issues with this approach or
> > if you can think of a better method.
> >
> > Thanks,
> > Leo
> >
> > >
> > > It would solve this space reservation exhaustion problem, as well as
> > > unnecessary COW for general optimization, without the need for a
> > > local xarray, which besides being very specific for the
> > > btrfs_search_slot() case (we COW in other places), also requires a
> > > memory allocation which can fail.
> > >
> > > Thanks.

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 21:43       ` Leo Martins
@ 2026-01-30 22:34         ` Qu Wenruo
  2026-01-31  0:11           ` Boris Burkov
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Leo Martins, Filipe Manana; +Cc: linux-btrfs, kernel-team



On 2026/1/31 08:13, Leo Martins wrote:
> On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> 
[...]
> 
>>
>> This was before a recent refactoring of should_cow_block(), but you
>> should get the idea.
>> IIRC all fstests were passing back then, except for one or two which I
>> never spent time debugging.
>>
>> And as that attempt was before the tree checker existed, we would need
>> to make sure we don't change an eb while the tree checker is
>> verifying it - making sure the tree checker read locks the eb should
>> be enough.

That may still race, not just with the tree-checker but also with the
extent buffer writeback path.

Even if we lock the eb for the tree-checker, someone could still modify
the eb after the tree-checker runs but before submission, which can
still be very problematic.

Otherwise we have to block all future writers until the eb is fully
written back, which may slow down the whole fs.

>>
>> There's also one problem with this idea: it won't work for zoned
>> devices as writes are sequential and we can't write twice to the same
>> location without doing the zone reset thing which only happens around
>> transaction commit time IIRC.

That's the same concern I have: we would again have to split the
metadata routines into zoned and non-zoned paths.

Before trying any new ideas, though, I'm wondering about the following
points:

- With the AS_KERNEL_FILE flag, how frequently are we re-dirtying COWed
   ebs?
   We need extra benchmarks on this first.

- Is there any pattern to the re-dirtying of COWed ebs?
   E.g. which trees are re-dirtied most frequently? Extent, csum or
   log trees?

   Can we take advantage of such patterns if they exist?

- Are there any less invasive alternatives to changing COW basics?
   E.g. changing btree_writepages() to utilize some LRU so that only the
   oldest/least frequently accessed dirty ebs are written back first.

Thanks,
Qu

>>
>> Thanks.
>>
>>>
>>> Please let me know if you see any issues with this approach or
>>> if you can think of a better method.
>>>
>>> Thanks,
>>> Leo

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 22:34         ` Qu Wenruo
@ 2026-01-31  0:11           ` Boris Burkov
  2026-01-31  1:06             ` Qu Wenruo
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2026-01-31  0:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Leo Martins, Filipe Manana, linux-btrfs, kernel-team

On Sat, Jan 31, 2026 at 09:04:03AM +1030, Qu Wenruo wrote:
> 
> 
> On 2026/1/31 08:13, Leo Martins wrote:
> > On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> > 
> [...]
> > 
> > > 
> > > This was before a recent refactoring of should_cow_block(), but you
> > > should get the idea.
> > > IIRC all fstests were passing back then, except for one or two which I
> > > never spent time debugging.
> > > 
> > > And as that attempt was before the tree checker existed, we would need
> > > to make sure we don't change an eb while the tree checker is
> > > verifying it - making sure the tree checker read locks the eb should
> > > be enough.
> 
> That may still be racy not just to tree-checker, but with the extent buffer
> writeback path.
> 
> Even if we lock the eb for the tree-checker, someone could still modify the
> eb after the tree-checker runs but before submission, which can still be very
> problematic.

Agreed that it feels very fishy, like we can write an eb with bad csum,
much like DIO and unstable pages. But do we think it *actually* matters?

In principle, if I buffered an eb, wrote total garbage to the disk during
the transaction, but then during the commit wrote out the correct eb, I
think that is still OK. If we crash, that bad eb isn't reachable from
any root when we mount again, right?

In the proposed design, the dirty page cache is that buffer.

> 
> Or we have to block all future writers until the eb is fully written back,
> which may slow down the whole fs.
> 
> > > 
> > > There's also one problem with this idea: it won't work for zoned
> > > devices as writes are sequential and we can't write twice to the same
> > > location without doing the zone reset thing which only happens around
> > > transaction commit time IIRC.
> 
> That's also the same concern I have, meaning having to again divide zoned
> and non-zoned metadata routine.
> 
> Although before all the new ideas/attempts, I'm wondering the following two
> points:
> 
> - With the AS_KERNEL_FILE flag, how frequent we're re-dirtying COWed ebs
>   We need extra benchmarks on this first.

As far as I am concerned, any amount more than zero is a bug when you
consider it from the perspective of the transaction block_rsv. If you
had an 8 deep tree doing splitting, then a single re-cow you didn't plan
on will use space not in the block_rsv.

In practice today we see 30x amplification at least. To flip it around,
what amount is "OK"? An amount that doesn't happen to cause ENOSPCs on
most machines?

I don't think it's responsible to let it slide and hope it doesn't happen
too much. There are always systems in global reclaim to worry about as well.

I do, however, completely agree that this argument means we should try
to avoid inventing some really wild and costly solution. Ideally we can
find something tidy to plug up the hole :)

> 
> - Is there any pattern of the re-dirtying COWed ebs
>   E.g. which trees are re-dirtied the most frequently? Extent or csum or
>   log trees?

In the Meta fleet data, the subvol trees and the csum tree see by far
the most lock contention, which I think is likely going to be correlated
with COW amplification. But we don't have detailed COW re-dirtying data
yet.

> 
>   Can we take advantage of such patterns if they exist?
> 

That is why I wasted everyone's time and brain-cells on those very tricky
csum commit root patches I kept messing up for like nine iterations :D

If there are other heuristic improvements, I think they are a good idea,
but I also think they don't change the reality of the transaction
block_rsv over-spend bug.

> - Is there any less invasive alternatives to changing COW basics?
>   E.g. Changing btree_writepages() to utilize some LRU so only the
>   oldest/least frequent accessed dirty ebs are written back first.

With sufficient sustained memory pressure I am not sure that something
like this will work, even to satisfactorily reduce the problem,
but I have not yet reproduced it without resorting to small cgroups
and no AS_KERNEL_FILE (as we have discussed elsewhere).

Ideas we have considered, which I think would fully solve the bug:
(in no particular order, and I'm sure there are others)

- Leo's xarray to block writeback on the shortest scope
- Block writeback on a longer but easier to manage scope (e.g. trans_handle)
- Block writeback to the whole tree while it's being cow-ed.
- Have writeback also take a new type of tree lock which cow paths do not release
- Relax the strictness of the transaction block rsv guarantee 

Personally, I am still quite excited about Filipe's idea and hope we can
make it work. That would be really slick.

Thanks,
Boris

> 
> Thanks,
> Qu
> 
> > > 
> > > Thanks.
> > > 
> > > > 
> > > > Please let me know if you see any issues with this approach or
> > > > if you can think of a better method.
> > > > 
> > > > Thanks,
> > > > Leo

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-31  0:11           ` Boris Burkov
@ 2026-01-31  1:06             ` Qu Wenruo
  2026-01-31 17:16               ` Boris Burkov
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2026-01-31  1:06 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo
  Cc: Leo Martins, Filipe Manana, linux-btrfs, kernel-team



On 2026/1/31 10:41, Boris Burkov wrote:
> On Sat, Jan 31, 2026 at 09:04:03AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/1/31 08:13, Leo Martins wrote:
>>> On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
>>>
>> [...]
>>>
>>>>
>>>> This was before a recent refactoring of should_cow_block(), but you
>>>> should get the idea.
>>>> IIRC all fstests were passing back then, except for one or two which I
>>>> never spent time debugging.
>>>>
>>>> And as that attempt was before the tree checker existed, we would need
>>>> to make sure we don't change an eb while the tree checker is
>>>> verifying it - making sure the tree checker read locks the eb should
>>>> be enough.
>>
>> That may still be racy not just to tree-checker, but with the extent buffer
>> writeback path.
>>
>> Even if we lock the eb for the tree-checker, someone could still modify the
>> eb after the tree-checker runs but before submission, which can still be very
>> problematic.
> 
> Agreed that it feels very fishy, like we can write an eb with bad csum,
> much like DIO and unstable pages. But do we think it *actually* matters?
> 
> In principle, if I buffered an eb, wrote total garbage to the disk during
> the transaction, but then during the commit wrote out the correct eb, I
> think that is still OK. If we crash, that bad eb isn't reachable from
> any root when we mount again, right?

Yes, that's correct. However, I'd still prefer a more consistent
behavior that never introduces a bad csum at any point.

A temporary, unreachable bad csum seems harmless, but it creates a much
bigger opening that may or may not lead to bad csums for valid tree
blocks in the future.

[...]
>> Although before all the new ideas/attempts, I'm wondering the following two
>> points:
>>
>> - With the AS_KERNEL_FILE flag, how frequent we're re-dirtying COWed ebs
>>    We need extra benchmarks on this first.
> 
> As far as I am concerned, any amount more than zero is a bug when you
> consider it from the perspective of the transaction block_rsv. If you
> had an 8 deep tree doing splitting, then a single re-cow you didn't plan
> on will use space not in the block_rsv.

I also have one question related to the exhausted block_rsv.

If we COW a tree block (the old one is eb A, older than the current
trans) and write the new one (eb B) back to disk, the space of eb A will
not be available until a full transaction is committed.

But if we need to re-dirty eb B and COW it to eb C, shouldn't the space
of eb B be available again?

So from the perspective of space usage, re-dirtying a COWed block (in
the current transaction, except log trees) should not cause extra space
usage, apart from the temporary usage while eb B's content is copied to
eb C and before eb B is freed.

It should always be the space of the original old on-disk eb plus the
new COWed eb, no matter how many times we re-dirty it.

Or is the problem exactly the space reservation for such an eb B?

> 
> In practice today we see 30x amplification at least. To flip it around,
> what amount is "OK"? An amount that doesn't happen to cause ENOSPCs on
> most machines?

30x amplification by itself should not be the cause of exhausted space;
what matters is whether we can reuse the space of eb B in the above
example.

I agree that 30x re-dirtying is very bad, but re-using the eb in place
also means we will still see the same 30x write amplification, with
more chances to screw up the writeback handling.

I'd like to know more about the source of such aggressive writeback
first, and especially whether the risk of data races/bad csums is really
worth it now that we have the AS_KERNEL_FILE feature.

Thanks,
Qu

> 
> I don't think it's responsible to let it slide and hope it doesn't happen
> too much. There's always systems in global reclaim to worry about as well..
> 
> I do, however, completely agree that this argument means we should try
> to avoid inventing some really wild and costly solution. Ideally we can
> find something tidy to plug up the hole :)
> 
>>
>> - Is there any pattern of the re-dirtying COWed ebs
>>    E.g. which trees are re-dirtied the most frequently? Extent or csum or
>>    log trees?
> 
> In the Meta fleet data, the subvol trees and the csum tree see by far
> the most lock contention, which I think is likely going to be correlated
> with COW amplification. But we don't have detailed COW re-dirtying data
> yet.
> 
>>
>>    Can we take advantage of such patterns if they exist?
>>
> 
> That is why I wasted everyone's time and brain-cells on those very tricky
> csum commit root patches I kept messing up for like nine iterations :D
> 
> If there are other heuristic improvements I think it's a good idea, but
> I think also doesn't change the reality of the transaction block_rsv
> over-spend bug.
> 
>> - Is there any less invasive alternatives to changing COW basics?
>>    E.g. Changing btree_writepages() to utilize some LRU so only the
>>    oldest/least frequent accessed dirty ebs are written back first.
> 
> With sufficient sustained memory pressure I am not sure that something
> like this will work, even to satisfactorily reduce the problem,
> but I have not yet reproduced it without resorting to small cgroups
> and no AS_KERNEL_FILE (as we have discussed elsewhere)
> 
> Ideas we have considered, which I think would fully solve the bug:
> (in no particular order, and I'm sure there are others)
> 
> - Leo's xarray to block writeback on the shortest scope
> - Block writeback on a longer but easier to manage scope (e.g. trans_handle)
> - Block writeback to the whole tree while it's being cow-ed.
> - Have writeback also take a new type of tree lock which cow paths do not release
> - Relax the strictness of the transaction block rsv guarantee
> 
> Personally, I am still quite excited about Filipe's idea and hope we can
> make it work. That would be really slick.
> 
> Thanks,
> Boris
> 
>>
>> Thanks,
>> Qu
>>
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> Please let me know if you see any issues with this approach or
>>>>> if you can think of a better method.
>>>>>
>>>>> Thanks,
>>>>> Leo
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 16:11           ` Filipe Manana
@ 2026-01-31  9:16             ` Sun YangKai
  0 siblings, 0 replies; 21+ messages in thread
From: Sun YangKai @ 2026-01-31  9:16 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Leo Martins, linux-btrfs, Boris Burkov, Qu Wenruo



On 2026/1/31 00:11, Filipe Manana wrote:
> On Fri, Jan 30, 2026 at 3:50 PM Sun YangKai <sunk67188@gmail.com> wrote:
>>
>> On 2026/1/30 17:37, Sun YangKai wrote:
>>>
>>>
>>> On 2026/1/30 12:14, Sun YangKai wrote:
>>>> On 2026/1/30 08:12, Leo Martins wrote:
>>>>> On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana<fdmanana@kernel.org>
>>>>> wrote:
>>>>>> On Tue, Jan 27, 2026 at 8:43 PM Leo Martins<loemra.dev@gmail.com>
>>>>>> wrote:
>>>>>>> I've been investigating enospcs at Meta and have observed a strange
>>>>>>> pattern where filesystems are enospcing with lots of unallocated space
>>>>>>> (> 100G). Sample dmesg dump at bottom of message.
>>>>>>>
>>>>>>> btrfs_insert_delayed_dir_index is attempting to migrate some
>>>>>>> reservation
>>>>>>> from the transaction block reserve and finding it exhausted leading
>>>>>>> to a
>>>>>>> warning and enospc. This is a bug as the reservations are meant to be
>>>>>>> worst case. It should be impossible to exhaust the transaction block
>>>>>>> reserve.
>>>>>>>
>>>>>>> Some tracing of affected hosts revealed that there were single
>>>>>>> btrfs_search_slot calls that were COWing 100s of times. I was able to
>>>>>>> reproduce this behavior locally by creating a very constrained cgroup
>>>>>>> and producing a lot of concurrent filesystem operations. Here's the
>>>>>>> pattern:
>>>>>>>
>>>>>>>    1. btrfs_search_slot() begins tree traversal with cow=1
>>>>>>>    2. Node at level N needs COW (old generation or WRITTEN flag set)
>>>>>>>    3. btrfs_cow_block() allocates new node, updates parent pointer
>>>>>>>    4. Traversal continues, but hits a condition requiring restart
>>>>>>> (e.g., node
>>>>>>>       not cached, lock contention, need higher write_lock_level)
>>>>>>>    5. btrfs_release_path() releases all locks and references
>>>>>>>    6. Memory pressure triggers writeback on the COW'd node
>>>>>>>    7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>>>>>>>       BTRFS_HEADER_FLAG_WRITTEN
>>>>>>>    8. goto again - traversal restarts from root
>>>>>>>    9. Traversal reaches the freshly COW'd node
>>>>>>>    10. should_cow_block() sees WRITTEN flag set, returns true
>>>>>>>    11. btrfs_cow_block() allocates another new node - same logical
>>>>>>> position,
>>>>>>>        new physical location, new reservation consumed
>>>>>>>    12. Steps 4-11 repeat indefinitely under sustained memory pressure
>>>>>>>
>>>>>>> Note this behavior should be much harder to trigger since Boris's
>>>>>>> AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
>>>>>>> accounted for in user cgroups. However, I believe it
>>>>>>> would still be an issue under global memory pressure.
>>>>>>> Link:https://lore.kernel.org/linux-btrfs/
>>>>>>> cover.1755812945.git.boris@bur.io/
>>>>>>>
>>>>>>> This COW amplification breaks the idea that transaction
>>>>>>> reservations are
>>>>>>> worst case as any search slot call could find itself in this COW
>>>>>>> loop and
>>>>>>> exhaust its reservation.
>>>>>>>
>>>>>>> My proposed solution is to temporarily pin extent buffers for the
>>>>>>> lifetime of btrfs_search_slot. This prevents the massive COW
>>>>>>> amplification that can be seen during high memory pressure.
>>>>>>>
>>>>>>> The implementation uses a local xarray to track COW'd buffers for the
>>>>>>> duration of the search. The xarray stores extent_buffer pointers
>>>>>>> without
>>>>>>> taking additional references; this is safe because tracked buffers
>>>>>>> remain
>>>>>>> dirty (writeback_blockers prevents the dirty bit from being
>>>>>>> cleared) and
>>>>>>> dirty buffers cannot be reclaimed by memory pressure.
>>>>>>>
>>>>>>> Synchronization is provided by eb->lock: increments in
>>>>>>> btrfs_search_slot_track_cow() occur while holding the write lock, and
>>>>>>> the check in lock_extent_buffer_for_io() also holds the write lock via
>>>>>>> btrfs_tree_lock(). Decrements don't require eb->lock because
>>>>>>> writeback_blockers is atomic and merely indicates "don't write yet".
>>>>>>> Once we decrement, we're done and don't care if writeback proceeds
>>>>>>> immediately.
>>>>>> This seems too complex to me.
>>>>>>
>>>>>> So this problem is very similar to some idea I had a few years ago but
>>>>>> never managed to implement.
>>>>>> It was about avoiding unnecessary COW, not for this space reservation
>>>>>> exhaustion due to sustained memory pressure, but it would solve it
>>>>>> too.
>>>>>>
>>>>>> The idea was that we do unnecessary COW in cases like this:
>>>>>>
>>>>>> 1) We COW a path in some tree and we are at transaction N;
>>>>>>
>>>>>> 2) Writeback happened for the extent buffers in that path while we are
>>>>>> in the same transaction, because we reached the 32M limit and some
>>>>>> task called btrfs_btree_balance_dirty() or something else triggered
>>>>>> writeback of the btree inode;
>>>>>>
>>>>>> 3) While still at transaction N, we visit the same path to add an item
>>>>>> to a leaf, or modify an item, whatever. Because the extent buffers
>>>>>> have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
>>>>>> returns true).
>>>>>>
>>>>>> So during the lifetime of a transaction we can have a lot of
>>>>>> unnecessary COW - we spend more time allocating extents, allocating
>>>>>> memory, copying extent buffer data, use more space per transaction,
>>>>>> etc.
>>>>>>
>>>>>> The idea was to not COW when an extent buffer has
>>>>>> BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
>>>>>> (btrfs_header_generation(eb)) matches the current transaction.
>>>>>> That is safe because there's no committed tree that points to an
>>>>>> extent buffer created in the current transaction.
>>>>>>
>>>>>> Any further modification to the extent buffer must be sure that the
>>>>>> EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
>>>>>> transaction's dirty_pages io tree, etc, so that we don't miss writing
>>>>>> the extent buffer to the same location again before the transaction
>>>>>> commits the superblocks.
>>>>>>
>>>>>> Have you considered an approach like this?
>>>>> I had not considered this, but it is a great idea.
>>>>>
>>>>> My first thought is that implementing this could be as simple
>>>>> as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
>>>>> would mess with the assumptions around the log tree. From
>>>>> btrfs_sync_log():
>>>> After a fast glance and some tests, I found things might not be that
>>>> easy. The problem is not only the log tree.
>>>>> /*
>>>>>    * IO has been started, blocks of the log tree have WRITTEN flag set
>>>>>    * in their headers. new modifications of the log will be written to
>>>>>    * new positions. so it's safe to allow log writers to go in.
>>>>>    */
>>>>>
>>>>> ^ Assumes that WRITTEN blocks will be COW'd.
>>>>>
>>>>> The issue looks like:
>>>>>
>>>>>    1. fsync A COWs eb
>>>>>    2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
>>>>>    3. fsync B does __not__ COW eb and modifies it
>>>>>    4. fsync A writes modified eb to disk
>>>>>    5. CRASH; the log tree is corrupted
>>>>>
>>>>> One way to avoid that is to keep the current behavior for the log
>>>>> tree, but that leaves the potential for COW amplification...
>>>> I tested with a patch like this:
>>>> @@ -624,14 +624,18 @@ static inline bool should_cow_block(const struct
>>>> btrfs_trans_handle *trans,
>>>>           if (btrfs_header_generation(buf) != trans->transid)
>>>>                   return true;
>>>>
>>>> -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
>>>> -               return true;
>>>> -
>>>>           /* Ensure we can see the FORCE_COW bit. */
>>>>           smp_mb__before_atomic();
>>>>           if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>>>>                   return true;
>>>>
>>>> +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
>>>> +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
>>>> +                       return true;
>>>> +               btrfs_mark_buffer_dirty(trans, buf);
>>>> +               return false;
>>>> +       }
>>>> +
>>>>           if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
>>>>
>>>>                   return false;
>>>>
>>>> And get some errors like this:
>>>>
>>>>
>>>> [  +0.090163] [ T2589] run fstests btrfs/004 at 2026-01-30 11:53:37
>>>> [  +0.432352] [T11685] BTRFS: device fsid 1fb397fc-97a7-44dd-9602-
>>>> dd38b74bc391 devid 1 transid 8 /dev/loop1 (7:1) scanned by mount (11685)
>>>> [  +0.000351] [T11685] BTRFS info (device loop1): first mount of
>>>> filesystem 1fb397fc-97a7-44dd-9602-dd38b74bc391
>>>> [  +0.000014] [T11685] BTRFS info (device loop1): using crc32c
>>>> (crc32c- lib) checksum algorithm
>>>> [  +0.001298] [T11685] BTRFS info (device loop1): checking UUID tree
>>>> [  +0.000039] [T11685] BTRFS info (device loop1): enabling ssd
>>>> optimizations
>>>> [  +0.000003] [T11685] BTRFS info (device loop1): turning on async
>>>> discard
>>>> [  +0.000002] [T11685] BTRFS info (device loop1): enabling free space
>>>> tree
>>>> [  +1.051781] [T11703] page: refcount:2 mapcount:0
>>>> mapping:00000000eb6d7caa index:0x2348 pfn:0x1caebf
>>>> [  +0.000008] [T11703] memcg:ffff9b3300263cc0
>>>> [  +0.000003] [T11703] aops:0xffffffffc0354040 ino:1
>>>> [  +0.000024] [T11703] flags: 0x4e0000000000423e(referenced|uptodate|
>>>> dirty|lru|workingset|private|writeback|zone=1)
>>>> [  +0.000007] [T11703] raw: 4e0000000000423e fffff74a872bb908
>>>> fffff74a84206a88 ffff9b33c6706880
>>>> [  +0.000004] [T11703] raw: 0000000000002348 ffff9b334be522d0
>>>> 00000002ffffffff ffff9b3300263cc0
>>>> [  +0.000002] [T11703] page dumped because: eb page dump
>>>> [  +0.000003] [T11703] BTRFS critical (device loop1): corrupt leaf:
>>>> root=5 block=36995072 slot=118 ino=406 file_offset=94208, invalid
>>>> ram_bytes for file extent, have 8660273067269322872, should be aligned
>>>> to 4096
>>>> [  +0.000013] [T11703] BTRFS info (device loop1): leaf 36995072 gen 33
>>>> total ptrs 128 free space 2857 owner 5
>>>> [  +0.000006] [T11703]     item 0 key (386 DIR_ITEM 238230307) itemoff
>>>> 16249 itemsize 34
>>>> [  +0.000004] [T11703]         location key (462 1 0) type 2
>>>> [  +0.000003] [T11703]         transid 33 data_len 0 name_len 4
>>>> [  +0.000003] [T11703]     item 1 key (386 DIR_ITEM 1473745676)
>>>> itemoff 16216 itemsize 33
>>>> [  +0.000004] [T11703]         location key (376 1 0) type 3
>>>> [  +0.000002] [T11703]         transid 30 data_len 0 name_len 3
>>>> [  +0.000003] [T11703]     item 2 key (386 DIR_ITEM 2243137595)
>>>> itemoff 16182 itemsize 34
>>>> [  +0.000004] [T11703]         location key (413 1 0) type 1
>>>> [  +0.000002] [T11703]         transid 32 data_len 0 name_len 4
>>>> ...
>>>> [  +0.000001] [T11703]     item 127 key (405 DIR_ITEM 828387202)
>>>> itemoff 6057 itemsize 34
>>>> [  +0.000002] [T11703]         location key (479 1 0) type 3
>>>> [  +0.000001] [T11703]         transid 33 data_len 0 name_len 4
>>>> [  +0.000002] [T11703] BTRFS error (device loop1): block=36995072
>>>> write time tree block corruption detected
>>>> [  +0.003429] [T11703] BTRFS: error (device loop1) in
>>>> btrfs_commit_transaction:2555: errno=-5 IO failure (Error while
>>>> writing out transaction)
>>>> [  +0.000007] [T11703] BTRFS info (device loop1 state E): forced readonly
>>>> [  +0.000002] [T11703] BTRFS warning (device loop1 state E): Skipping
>>>> commit of aborted transaction.
>>>> [  +0.000002] [T11703] BTRFS error (device loop1 state EA):
>>>> Transaction aborted (error -5)
>>>> [  +0.000003] [T11703] BTRFS: error (device loop1 state EA) in
>>>> cleanup_transaction:2037: errno=-5 IO failure
>>>>
>>>> The reported inode 406 is not even in the printed leaf. This looks
>>>> like a data race, possibly caused by the following:
>>>>
>>>> We unlock the eb after setting the WRITTEN flag during write back, and
>>>> the eb should not get modified since then because all future writes
>>>> will use the cowed eb. However, with the WRITTEN flag check removed in
>>>> should_cow_block, we might write to the eb with WRITTEN flag set which
>>>> might be under io.
>>>
>>> I tried again with this:
>>>
>>> @@ -624,14 +624,20 @@ static inline bool should_cow_block(const struct
>>> btrfs_trans_handle *trans,
>>>           if (btrfs_header_generation(buf) != trans->transid)
>>>                   return true;
>>>
>>> -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
>>> -               return true;
>>> -
>>>           /* Ensure we can see the FORCE_COW bit. */
>>>           smp_mb__before_atomic();
>>>           if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>>>                   return true;
>>>
>>> +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
>>> +               if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
>>> +                       return true;
>>> +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
>>> +                       return true;
>>> +               btrfs_mark_buffer_dirty(trans, buf);
>>> +               return false;
>>> +       }
>>> +
>>>           if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
>>>                   return false;
>>>
>>> When WRITEBACK is set, do a normal cow to prevent the data race. This
>>> seems to fix the previous problem. However, I got this:
>>>
>>> [  +0.020843] [T15127] BTRFS error (device loop1): block=30687232 bad
>>> generation, have 11 expect > 14
>>> [  +0.000009] [T15127]     item 0 key (256 INODE_ITEM 0) itemoff 16123
>>> itemsize 160
>>> [  +0.000004] [T15127]         inode generation 3 transid 11 size 10
>>> nbytes 16384
>>> [  +0.000003] [T15127]         block group 0 mode 40755 links 1 uid 0 gid 0
>>> [  +0.000002] [T15127]         rdev 0 sequence 1 flags 0x0
>>> [  +0.000002] [T15127]         atime 1769760651.0
>>> [  +0.000002] [T15127]         ctime 1769760652.250234845
>>> [  +0.000002] [T15127]         mtime 1769760652.250234845
>>> [  +0.000001] [T15127]         otime 1769760651.0
>>> ...
>>> [  +0.000004] [T15127]         root data bytenr 30523392 refs 1
>>> [  +0.000002] [T15127] BTRFS error (device loop1): block=30851072 write
>>> time tree block corruption detected
>>>
>>> and a lot more lines with the same generation errors for btrfs/122
>>> btrfs/152 btrfs/210 btrfs/224 btrfs/316 btrfs/320 btrfs/340 fstest cases.
>>>
>>> I have no idea why it's trying to write some ebs older than the current
>>> transaction. It seems related to snapshots.
>>
>> This happens because after an extent buffer (eb) is written to disk,
>> subsequent modifications only set the dirty flag without adding those
>> pages to the current transaction's dirty list. Consequently, their
>> writeback isn't triggered or awaited during transaction commit.
>>
>> In contrast, newly allocated or COWed extent buffers are explicitly
>> added to the transaction's dirty_pages via btrfs_init_new_buffer, which
>> ensures they are properly tracked and written back.
>>
>> Adding the following code could fix this:
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 8d683745afd1..3ab89a31f9bb 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -4450,6 +4450,9 @@ void btrfs_mark_buffer_dirty(struct
>> btrfs_trans_handle *trans,
>>                              buf->start, transid, fs_info->generation);
>>           }
>>           set_extent_buffer_dirty(buf);
>> +       if (btrfs_header_owner(buf) != BTRFS_TREE_LOG_OBJECTID)

I've excluded log tree extent buffers here. Please correct me if I'm
using the wrong condition.

>> +               btrfs_set_extent_bit(&trans->transaction->dirty_pages,
>> buf->start,
>> +                                    buf->start + buf->len - 1,
> 
> Log tree extent buffers have their own dedicated io tree, they are not
> meant to go to ->dirty_pages.

> They are meant to be flushed only during fsync and not during a
> transaction commit.
> 
> I appreciate that you are trying to help, but trying out random things
> without having a better understanding of internals is just noise on
> the list.
> 
> The errors you are getting are very likely because the tree checker
> does not lock the eb, since after an eb is written we currently don't
> expect changes to it, so we don't lock.
> But the idea makes the expectation no longer valid, since they can be
> modified again after being written, so the tree checker needs to read
> lock an eb.

Yes, the data race is exactly what I ran into when I sent my first
email.

I guess you might have missed my second email. But that's okay, let me
reiterate the issue regarding "avoiding COW for extent buffers with the
WRITTEN flag that were generated in the current transaction". I'm
excited about this idea and want to make it work.

First of all, this optimization does not apply to the log tree and zoned
devices. So let's exclude them from our discussion for now.

The purpose of this optimization is to reduce COW amplification.

Currently, we face two separate implementation issues.

The first issue is a data race. Previously, we assumed that extent
buffers with the WRITTEN flag would not be modified again. However, this
optimization breaks that assumption. If the content of an extent buffer
is modified during writeback, the race causes errors. This is the issue
I discovered in my first email.

Since the extent buffer is locked while the WRITEBACK flag is being set
and is unlocked only afterwards, I think there will be no data race as
long as we don't modify the extent buffer while it has the WRITEBACK
flag set. To achieve this, when should_cow_block() detects that an
extent buffer has the WRITEBACK flag, we have the following options:

- Fall back to the old COW behavior.
- Wait for the extent buffer's writeback to complete, either by locking
   the extent buffer for writeback or by waiting on the WRITEBACK bit,
   which would reduce re-COW to a greater extent. However, this would
   block subsequent writes during writeback, potentially impacting write
   performance.
- I also have an idea of "lightweight COW", which only performs a copy
   in memory without allocating a new location for this extent buffer. I
   guess this is what Qu wants to have. This way, we can avoid blocking
   writes while reducing the overhead of unnecessary COW. But currently I
   have no idea how to achieve this.
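
To make the first option concrete, here is a minimal userspace sketch
of the decision, mirroring the test patch quoted earlier. It is not
kernel code: struct eb_model, enum cow_action and cow_decision() are
hypothetical names made up for illustration, and the FORCE_COW,
relocation and zoned checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of the state the decision depends on. */
struct eb_model {
	uint64_t generation; /* stands in for btrfs_header_generation() */
	bool written;        /* stands in for BTRFS_HEADER_FLAG_WRITTEN */
	bool writeback;      /* stands in for EXTENT_BUFFER_WRITEBACK */
	bool log_tree;       /* eb belongs to the log tree */
};

enum cow_action {
	COW_BLOCK,       /* allocate a new block, as today */
	REUSE_AND_DIRTY, /* skip COW and mark the eb dirty again */
	MODIFY_IN_PLACE, /* eb is new in this transaction and never written */
};

/* FORCE_COW, relocation and zoned checks are deliberately left out. */
static enum cow_action cow_decision(const struct eb_model *eb, uint64_t transid)
{
	assert(eb != NULL);
	if (eb->generation != transid)
		return COW_BLOCK;       /* a committed tree may point here */
	if (!eb->written)
		return MODIFY_IN_PLACE;
	if (eb->log_tree)
		return COW_BLOCK;       /* log tree keeps the old rule */
	if (eb->writeback)
		return COW_BLOCK;       /* IO in flight: do not touch it */
	return REUSE_AND_DIRTY;
}
```

The second option would replace the writeback branch with a wait, and
the third with an in-memory copy; only that one branch changes.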

The second issue is that extent buffers with the WRITTEN flag are not
promptly written at the end of the current transaction when they are
marked DIRTY again. This is because we currently use
transaction->dirty_pages to record which extent buffers need to be 
written at the end of the transaction. An extent buffer is only added to
dirty_pages when it is created. After we skip COW, the extent buffer is
removed from dirty_pages when it is first written. When we re-mark these
extent buffers as dirty and write new content to them, we don't add them
back to dirty_pages. Therefore, these extent buffers are not promptly
written at the end of the transaction. This leads to data inconsistency:
when they are eventually written back, their transaction has already
ended and a new one has started, which triggers a write-time tree block
corruption error:

> [  +0.020843] [T15127] BTRFS error (device loop1): block=30687232 bad
> generation, have 11 expect > 14
This is the issue I reported in my second email.

To fix this, we need to add the extent buffers to the current
transaction's dirty_pages when marking them as dirty. Of course, as you
mentioned, for extent buffers belonging to the log tree, they should not
be added to dirty_pages.
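
A toy model of the bookkeeping described above, only to illustrate the
argument. It takes the description in this email at face value (in
particular that a written eb is no longer recorded in dirty_pages),
which I have not verified against the kernel. All names (trans_model,
model_*) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_EBS 4

/* Hypothetical model: one slot per extent buffer of a transaction. */
struct trans_model {
	bool dirty[MODEL_MAX_EBS];          /* eb has unwritten changes */
	bool in_dirty_pages[MODEL_MAX_EBS]; /* range recorded in dirty_pages */
};

/* New or COWed eb: dirty and recorded for the commit. */
static void model_new_buffer(struct trans_model *t, size_t i)
{
	assert(i < MODEL_MAX_EBS);
	t->dirty[i] = true;
	t->in_dirty_pages[i] = true;
}

/* Writeback before commit, as described above: clean and untracked. */
static void model_writeback(struct trans_model *t, size_t i)
{
	assert(i < MODEL_MAX_EBS);
	t->dirty[i] = false;
	t->in_dirty_pages[i] = false;
}

/* Re-dirty without COW; track=true models the proposed fix. */
static void model_redirty(struct trans_model *t, size_t i, bool track)
{
	assert(i < MODEL_MAX_EBS);
	t->dirty[i] = true;
	if (track)
		t->in_dirty_pages[i] = true;
}

/* Commit writes only recorded ranges; returns dirty ebs left behind. */
static size_t model_commit(struct trans_model *t)
{
	size_t missed = 0;

	for (size_t i = 0; i < MODEL_MAX_EBS; i++) {
		if (t->in_dirty_pages[i]) {
			t->dirty[i] = false;
			t->in_dirty_pages[i] = false;
		} else if (t->dirty[i]) {
			missed++;
		}
	}
	return missed;
}
```

With track=false the commit leaves one dirty eb behind, which is the
eb that later shows up with a stale generation; with track=true it
leaves none.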

Please correct me if I got anything wrong.

Sorry for the disturbance again, thanks for your patient reply, and
have a good weekend :)

> Thanks.
> 
> 
>> EXTENT_DIRTY, NULL);
>>    }
>>
>>    static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
>>
>>
>> Thanks,
>> Sun YangKai


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-31  1:06             ` Qu Wenruo
@ 2026-01-31 17:16               ` Boris Burkov
  2026-01-31 21:59                 ` Qu Wenruo
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2026-01-31 17:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Leo Martins, Filipe Manana, linux-btrfs, kernel-team

On Sat, Jan 31, 2026 at 11:36:54AM +1030, Qu Wenruo wrote:
> 
> 
> On 2026/1/31 10:41, Boris Burkov wrote:
> > On Sat, Jan 31, 2026 at 09:04:03AM +1030, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2026/1/31 08:13, Leo Martins wrote:
> > > > On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> > > > 
> > > [...]
> > > > 
> > > > > 
> > > > > This was before a recent refactoring of should_cow_block(), but you
> > > > > should get the idea.
> > > > > IIRC all fstests were passing back then, except for one or two which I
> > > > > never spent time debugging.
> > > > > 
> > > > > And as that attempt was before the tree checker existed, we would need
> > > > > to make sure we don't change an eb while the tree checker is
> > > > > verifying it - making sure the tree checker read locks the eb should
> > > > > be enough.
> > > 
> > > That may still be racy not just to tree-checker, but with the extent buffer
> > > writeback path.
> > > 
> > > Even if we locked the eb for the tree-checker, if someone still modified
> > > the eb after the tree-checker but before submission, it can still be very
> > > problematic.
> > 
> > Agreed that it feels very fishy, like we can write an eb with bad csum,
> > much like DIO and unstable pages. But do we think it *actually* matters?
> > 
> > In principle, if I buffered an eb, wrote total garbage to the disk during
> > the transaction, but then during the commit wrote out the correct eb, I
> > think that is still OK. If we crash, that bad eb isn't reachable from
> > any root when we mount again, right?
> 
> Yes, that's correct; however, I'd still prefer a more consistent behavior
> that doesn't introduce any bad csum at any time.
> 
> Causing temporary unreachable bad csum seems harmless, but it brings a much
> bigger opening, that may or may not lead to bad csums for valid tree blocks
> in the future.
> 
> [...]
> > > Although before all the new ideas/attempts, I'm wondering the following two
> > > points:
> > > 
> > > - With the AS_KERNEL_FILE flag, how frequently are we re-dirtying COWed ebs?
> > >    We need extra benchmarks on this first.
> > 
> > As far as I am concerned, any amount more than zero is a bug when you
> > consider it from the perspective of the transaction block_rsv. If you
> > had an 8 deep tree doing splitting, then a single re-cow you didn't plan
> > on will use space not in the block_rsv.
> 
> I also have one question related to the exhausted block_rsv.
> 
> If we COWed a tree block (the old is eb A, older than the current trans),
> write the new one (eb B) back to disk, the space of eb A will not be
> available until a full transaction is committed.
> 
> But if we need to re-dirty eb B and COW it to eb C, shouldn't the space of
> eb B be available again?

Eventually, yes. I think the details depend on exactly what happens with
delayed refs. I haven't looked into this angle super closely yet.

Unfortunately, Leo and I could not think of a robust way to wire the
re-dirty cows to the block_rsv in a way that does not double dip from
the block_rsv. (See my reply to your next question for why I am so
obsessed with this...)

How do you tell apart this sequence:
I start a transaction that reserves num_items=1 (reserving for <=16 cows)
I am doing cow in root R (cows=1)
I cow down through node N (cows=2)
I cow down to leaf L (cows=3)
someone in direct reclaim evicts clean page of L
I need to read L so I release the locks and loop
someone in direct reclaim writes back N
I cow down through N and re-dirty it because WRITTEN=true (cows=4, WRONG)

from this one:
someone cows N
someone in direct reclaim writes back N
I start a transaction that reserves num_items=1 (reserving for <=16 cows)
I am doing cow in root R (cows=1)
I cow down through node N because WRITTEN=true (cows=2, CORRECT)
I cow down to leaf L (cows=3)

You can't just save the "last cow-er trans_handle" or whatever,
because after you release the locks and loop, some other writer can cow
down through that path and it could get written back before you get
another shot, so you wouldn't see that you were the last writer.
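
The first sequence above can be replayed with a small userspace model,
just to show how the count charged to one reservation grows with every
restart. Everything here is hypothetical (node_model, model_*): a fixed
3-level path R, N, L, and a writeback of N before each restart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEPTH 3 /* root R, node N, leaf L */

struct node_model {
	bool needs_cow; /* old generation, or WRITTEN set by writeback */
};

/* One cow=1 descent: COW every node that needs it, return the count. */
static unsigned int model_descend(struct node_model path[MODEL_DEPTH])
{
	unsigned int cows = 0;

	for (size_t level = 0; level < MODEL_DEPTH; level++) {
		if (path[level].needs_cow) {
			path[level].needs_cow = false; /* fresh dirty copy */
			cows++;
		}
	}
	return cows;
}

/* Writeback of one level while the searcher holds no locks. */
static void model_writeback(struct node_model path[MODEL_DEPTH], size_t level)
{
	assert(level < MODEL_DEPTH);
	path[level].needs_cow = true; /* WRITTEN is now set */
}

/* Cows charged to one search that restarts `restarts` times. */
static unsigned int model_search(unsigned int restarts)
{
	struct node_model path[MODEL_DEPTH] = { {true}, {true}, {true} };
	unsigned int cows = model_descend(path);

	for (unsigned int i = 0; i < restarts; i++) {
		model_writeback(path, 1); /* reclaim writes back N */
		cows += model_descend(path);
	}
	return cows;
}
```

model_search(1) gives the cows=4 case from the first sequence, and 14
restarts are enough to pass the 16 cows reserved for num_items=1.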

You would have to track which ebs a search_slot has already cowed, but I
fear that these "track the ebs" things run into the same issue Filipe
raised about Leo's log_transid in the eb idea.

Probably worth exploring this family of fixes a bit more though.

We could consider an array/xarray on the block_rsv for ebs that block_rsv
has cow-ed? I will have to think some more about that. (linked list is no
good because one eb might go on multiple such lists) We could
preallocate the storage when we size the block_rsv too, which is a
much better time to get an ENOMEM (start_transaction) than search_slot.
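
Very roughly, the bookkeeping half of that idea could look like the
sketch below. This is only my guess at a shape, in userspace C with
hypothetical names (rsv_model, model_account_cow), a plain
preallocated array instead of an xarray, and no answer to where the
space for the uncharged re-cow comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RSV_SLOTS 16 /* sized together with the reservation */

/* Hypothetical per-reservation record of blocks it has already cowed. */
struct rsv_model {
	uint64_t cowed[MODEL_RSV_SLOTS]; /* bytenr of each eb we created */
	size_t nr;
	unsigned int charged;            /* cows charged to the rsv */
};

/*
 * Account a COW of old_bytenr into new_bytenr. If this reservation
 * created old_bytenr itself, this is a re-cow: swap the entry and do
 * not charge again. Returns true when the reservation was charged.
 */
static bool model_account_cow(struct rsv_model *rsv, uint64_t old_bytenr,
			      uint64_t new_bytenr)
{
	assert(rsv != NULL);
	for (size_t i = 0; i < rsv->nr; i++) {
		if (rsv->cowed[i] == old_bytenr) {
			rsv->cowed[i] = new_bytenr;
			return false;
		}
	}
	assert(rsv->nr < MODEL_RSV_SLOTS); /* storage was preallocated */
	rsv->cowed[rsv->nr++] = new_bytenr;
	rsv->charged++;
	return true;
}
```

In the two sequences above, the re-cow of N in the first one finds N
in the array and is not charged, while the cow of N in the second one
does not and is charged as planned.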

> 
> So from the perspective of space usage, re-dirtying a COWed block (in the
> current transaction, except log trees) should not cause extra space usage
> except the temporary usage before freeing eb B and copy its content to eb C.
> 
> It should always be the space of the original old on-disk eb, and the new
> COWed eb, no matter how many times we redirtied it.
> 
> Or is the problem exactly at the space reservation for such eb B?
> 

Yes, the problem is the reservation done at start_transaction() is not
properly accounted. See Leo's original email for a good explanation of
the details, but tl;dr:
btrfs_insert_delayed_dir_index was seeing an ENOSPC migrating items from
a transaction block_rsv that had been exhausted because re-dirtying
charged an arbitrary number of cows to that block_rsv.

The ENOSPC'd system has 100s of gigs of unallocated space; it's a
reservation accounting ENOSPC, not truly leaked or wasted space.

> > 
> > In practice today we see 30x amplification at least. To flip it around,
> > what amount is "OK"? An amount that doesn't happen to cause ENOSPCs on
> > most machines?
> 
> 30x amplification itself should not be the cause of exhausted space, but
> whether we can reuse the space of eb B of the above example.

The level of amplification needed to hit the bug is based on the tree
depth and whether the cow actually split or not. 16x items is the normal
upper-bound, so if we are running closer to that "naturally", then less
re-dirtying is tolerated. Otherwise more. But ultimately any single cow
more than we reserved accounted to the block_rsv is strictly a bug,
unless we relax the expectations around the transaction block_rsv
reservation.
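
As plain arithmetic, assuming the 16 per item is 8 tree levels times 2
for splits (consistent with the "8 deep tree doing splitting" example
earlier in the thread). The helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_LEVEL 8 /* assumed maximum tree depth */

/* Cows covered by a reservation for num_items items. */
static uint64_t model_reserved_cows(uint64_t num_items)
{
	return (uint64_t)MODEL_MAX_LEVEL * 2 * num_items;
}

/* Unplanned re-cows that fit before the block_rsv is overspent. */
static uint64_t model_recow_headroom(uint64_t num_items, uint64_t natural_cows)
{
	uint64_t reserved = model_reserved_cows(num_items);

	assert(natural_cows <= reserved);
	return reserved - natural_cows;
}
```

A 3-level path with no splits leaves room for 13 re-cows per item; a
worst-case path leaves room for none, so a single re-cow overspends.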

> 
> Although I agree 30x re-dirtying is very bad, but re-using the eb in-place
> also means we will still see the same 30x write amplification, and more
> chance to screw up the writeback handling.

That it doesn't avoid the write-amp, and that it carries high risk, are
good points, IMO.

> 
> I'd like to know more about the source of such aggressive writeback first.
> 

A system going into direct reclaim, typically because of a cgroup
reaching its memory limit, but it can also happen when global memory
gets tight. In particular, this is 1:1 with high memory pressure.


> Especially whether the risk of data races/bad csums is really worth it
> after the AS_KERNEL_FILE feature.

The cgroup direct reclaim issue goes away, but the global direct reclaim
issue remains.

> 
> Thanks,
> Qu
> 
> > 
> > I don't think it's responsible to let it slide and hope it doesn't happen
> > too much. There's always systems in global reclaim to worry about as well..
> > 
> > I do, however, completely agree that this argument means we should try
> > to avoid inventing some really wild and costly solution. Ideally we can
> > find something tidy to plug up the hole :)
> > 
> > > 
> > > - Is there any pattern of the re-dirtying COWed ebs
> > >    E.g. which trees are re-dirtied the most frequently? Extent or csum or
> > >    log trees?
> > 
> > In the Meta fleet data, the subvol trees and the csum tree see by far
> > the most lock contention, which I think is likely going to be correlated
> > with COW amplification. But we don't have detailed COW re-dirtying data
> > yet.
> > 
> > > 
> > >    Can we take advantage of such patterns if they exist?
> > > 
> > 
> > That is why I wasted everyone's time and brain-cells on those very tricky
> > csum commit root patches I kept messing up for like nine iterations :D
> > 
> > If there are other heuristic improvements I think it's a good idea, but
> > I think it also doesn't change the reality of the transaction block_rsv
> > over-spend bug.
> > 
> > > - Are there any less invasive alternatives to changing COW basics?
> > >    E.g. Changing btree_writepages() to utilize some LRU so only the
> > >    oldest/least frequently accessed dirty ebs are written back first.
> > 
> > With sufficient sustained memory pressure I am not sure that something
> > like this will work, even to satisfactorily reduce the problem,
> > but I have not yet reproduced it without resorting to small cgroups
> > and no AS_KERNEL_FILE (as we have discussed elsewhere)
> > 
> > Ideas we have considered, which I think would fully solve the bug:
> > (in no particular order, and I'm sure there are others)
> > 
> > - Leo's xarray to block writeback on the shortest scope
> > - Block writeback on a longer but easier to manage scope (e.g. trans_handle)
> > - Block writeback to the whole tree while it's being cow-ed.
> > - Have writeback also take a new type of tree lock which cow paths do not release
> > - Relax the strictness of the transaction block rsv guarantee
> > 
> > Personally, I am still quite excited about Filipe's idea and hope we can
> > make it work. That would be really slick.
> > 
> > Thanks,
> > Boris
> > 
> > > 
> > > Thanks,
> > > Qu
> > > 
> > > > > 
> > > > > Thanks.
> > > > > 
> > > > > > 
> > > > > > Please let me know if you see any issues with this approach or
> > > > > > if you can think of a better method.
> > > > > > 
> > > > > > Thanks,
> > > > > > Leo
> > 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-31 17:16               ` Boris Burkov
@ 2026-01-31 21:59                 ` Qu Wenruo
  0 siblings, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2026-01-31 21:59 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Qu Wenruo, Leo Martins, Filipe Manana, linux-btrfs, kernel-team



On 2026/2/1 03:46, Boris Burkov wrote:
> On Sat, Jan 31, 2026 at 11:36:54AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/1/31 10:41, Boris Burkov wrote:
>>> On Sat, Jan 31, 2026 at 09:04:03AM +1030, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2026/1/31 08:13, Leo Martins wrote:
>>>>> On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
>>>>>
>>>> [...]
>>>>>
>>>>>>
>>>>>> This was before a recent refactoring of should_cow_block(), but you
>>>>>> should get the idea.
>>>>>> IIRC all fstests were passing back then, except for one or two which I
>>>>>> never spent time debugging.
>>>>>>
>>>>>> And as that attempt was before the tree checker existed, we would need
>>>>>> to make sure we don't change an eb while the tree checker is
>>>>>> verifying it - making sure the tree checker read locks the eb should
>>>>>> be enough.
>>>>
>>>> That may still be racy not just to tree-checker, but with the extent buffer
>>>> writeback path.
>>>>
>>>> Even if we locked the eb for the tree-checker, if someone still modified
>>>> the eb after the tree-checker ran but before submission, it can still be
>>>> very problematic.
>>>
>>> Agreed that it feels very fishy, like we can write an eb with bad csum,
>>> much like DIO and unstable pages. But do we think it *actually* matters?
>>>
>>> In principle, if I buffered an eb, wrote total garbage to the disk during
>>> the transaction, but then during the commit wrote out the correct eb, I
>>> think that is still OK. If we crash, that bad eb isn't reachable from
>>> any root when we mount again, right?
>>
>> Yes, that's correct; however, I'd still prefer a more consistent behavior
>> that doesn't introduce any bad csum at any time.
>>
>> Causing temporary unreachable bad csum seems harmless, but it brings a much
>> bigger opening, that may or may not lead to bad csums for valid tree blocks
>> in the future.
>>
>> [...]
>>>> Before all the new ideas/attempts, though, I'm wondering about the
>>>> following two points:
>>>>
>>>> - With the AS_KERNEL_FILE flag, how frequently are we re-dirtying COWed ebs?
>>>>     We need extra benchmarks on this first.
>>>
>>> As far as I am concerned, any amount more than zero is a bug when you
>>> consider it from the perspective of the transaction block_rsv. If you
>>> had an 8 deep tree doing splitting, then a single re-cow you didn't plan
>>> on will use space not in the block_rsv.
>>
>> I also have one question related to the exhausted block_rsv.
>>
>> If we COWed a tree block (the old is eb A, older than the current trans),
>> write the new one (eb B) back to disk, the space of eb A will not be
>> available until a full transaction is committed.
>>
>> But if we need to re-dirty eb B and COW it to eb C, shouldn't the space of
>> eb B be available again?
> 
> Eventually, yes. I think the details depend on exactly what happens with
> delayed refs. I haven't looked into this angle super closely yet.
> 
> Unfortunately, Leo and I could not think of a robust way to wire the
> re-dirty cows to the block_rsv in a way that does not double dip from
> the block_rsv. (See my reply to your next question for why I am so
> obsessed with this...)
> 
> How do you tell apart this sequence:
> I start a transaction that reserves num_items=1 (reserving for <=16 cows)
> I am doing cow in root R (cows=1)
> I cow down through node N (cows=2)
> I cow down to leaf L (cows=3)
> someone in direct reclaim evicts clean page of L
> I need to read L so I release the locks and loop
> someone in direct reclaim writes back N
> I cow down through N and re-dirty it because WRITTEN=true (cows=4, WRONG)
> 
> from this one:
> someone cows N
> someone in direct reclaim writes back N
> I start a transaction that reserves num_items=1 (reserving for <=16 cows)
> I am doing cow in root R (cows=1)
> I cow down through node N because WRITTEN=true (cows=2, CORRECT)
> I cow down to leaf L (cows=3)
> 
> You can't just save the "last cow-er trans_handle" or whatever,
> because after you release the locks and loop, some other writer can cow
> down through that path and it could get written back before you get
> another shot, so you wouldn't see that you were the last writer.
> 
> You would have to track which ebs a search_slot has already cowed, but I
> fear that these "track the ebs" things run into the same issue Filipe
> raised about Leo's log_transid in the eb idea.
> 
> Probably worth exploring this family of fixes a bit more though.
> 
> We could consider an array/xarray on the block_rsv for ebs that block_rsv
> has cow-ed? I will have to think some more about that. (linked list is no
> good because one eb might go on multiple such lists) We could
> preallocate the storage when we size the block_rsv too, which is a
> much better time to get an ENOMEM (start_transaction) than search_slot.
> 
>>
>> So from the perspective of space usage, re-dirtying a COWed block (in the
>> current transaction, except log trees) should not cause extra space usage
>> except the temporary usage before freeing eb B and copying its content to
>> eb C.
>>
>> It should always be the space of the original old on-disk eb, and the new
>> COWed eb, no matter how many times we redirtied it.
>>
>> Or is the problem exactly at the space reservation for such eb B?
>>
> 
> Yes, the problem is the reservation done at start_transaction() is not
> properly accounted. See Leo's original email for a good explanation of
> the details, but tl;dr:
> btrfs_insert_delayed_dir_index was seeing an ENOSPC when migrating items
> from a transaction block_rsv that had been exhausted, because re-dirtying
> charged an arbitrary number of cows to that block_rsv.

OK, I see the problem in my idea now.

It's focusing on the pinned bytes of the space info (aka, global pinned 
bytes), not helping the block_rsv of the transaction.

And I cannot find any way to improve my current idea to handle such a
situation at all.

In that case I'm afraid breaking metadata COW is the only solution.
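
To make the accounting gap concrete, here is a toy userspace model of the
transaction block_rsv. It is only a sketch, not kernel code: every toy_* name
is hypothetical, the 16 KiB node size is an assumed default, and the
"16 cows per item" budget is taken from Boris's example above. It assumes
every node on the path is written back between attempts, so each restart
re-COWs the whole path.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NODESIZE 16384u        /* assumed metadata node size in bytes */
#define TOY_MAX_COWS_PER_ITEM 16u  /* "<=16 cows" per item, as in the thread */

/* Hypothetical stand-in for a transaction block reserve. */
struct toy_block_rsv {
	uint64_t reserved; /* bytes still available to this transaction */
};

/* Reserve worst-case space for num_items, as start_transaction would. */
static void toy_reserve(struct toy_block_rsv *rsv, unsigned int num_items)
{
	rsv->reserved = (uint64_t)num_items * TOY_MAX_COWS_PER_ITEM * TOY_NODESIZE;
}

/* Charge one COW to the reserve; false means the reserve is exhausted. */
static bool toy_charge_cow(struct toy_block_rsv *rsv)
{
	if (rsv->reserved < TOY_NODESIZE)
		return false;
	rsv->reserved -= TOY_NODESIZE;
	return true;
}

/*
 * Model one search that walks `depth` levels and is forced to restart
 * `restarts` times. Returns the number of COWs charged, or -1 if the
 * reserve ran out before the search could finish.
 */
static int toy_search(struct toy_block_rsv *rsv, unsigned int depth,
		      unsigned int restarts)
{
	int cows = 0;

	for (unsigned int attempt = 0; attempt <= restarts; attempt++) {
		for (unsigned int level = 0; level < depth; level++) {
			if (!toy_charge_cow(rsv))
				return -1;
			cows++;
		}
	}
	return cows;
}
```

With one reserved item and a depth of 3, a search with no restarts charges
3 COWs, while 5 restarts would need 18 and fail. Freeing eb B back to the
space info does nothing for this counter, which is the gap described above.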

Thanks,
Qu



* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-30 15:57         ` Filipe Manana
@ 2026-02-03  1:09           ` Leo Martins
  0 siblings, 0 replies; 21+ messages in thread
From: Leo Martins @ 2026-02-03  1:09 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Boris Burkov, linux-btrfs, kernel-team

On Fri, 30 Jan 2026 15:57:51 +0000 Filipe Manana <fdmanana@kernel.org> wrote:

> On Fri, Jan 30, 2026 at 3:44 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Fri, Jan 30, 2026 at 12:49:55PM +0000, Filipe Manana wrote:
> > > On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
> > > >
> > > > On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> > > >
> > > > > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > > > > >
> > > > > > I've been investigating enospcs at Meta and have observed a strange
> > > > > > pattern where filesystems are enospcing with lots of unallocated space
> > > > > > (> 100G). Sample dmesg dump at bottom of message.
> > > > > >
> > > > > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > > > > from the transaction block reserve and finding it exhausted leading to a
> > > > > > warning and enospc. This is a bug as the reservations are meant to be
> > > > > > worst case. It should be impossible to exhaust the transaction block
> > > > > > reserve.
> > > > > >
> > > > > > Some tracing of affected hosts revealed that there were single
> > > > > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > > > > reproduce this behavior locally by creating a very constrained cgroup
> > > > > > and producing a lot of concurrent filesystem operations. Here's the
> > > > > > pattern:
> > > > > >
> > > > > >  1. btrfs_search_slot() begins tree traversal with cow=1
> > > > > >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> > > > > >  3. btrfs_cow_block() allocates new node, updates parent pointer
> > > > > >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> > > > > >     not cached, lock contention, need higher write_lock_level)
> > > > > >  5. btrfs_release_path() releases all locks and references
> > > > > >  6. Memory pressure triggers writeback on the COW'd node
> > > > > >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > > > > >     BTRFS_HEADER_FLAG_WRITTEN
> > > > > >  8. goto again - traversal restarts from root
> > > > > >  9. Traversal reaches the freshly COW'd node
> > > > > >  10. should_cow_block() sees WRITTEN flag set, returns true
> > > > > >  11. btrfs_cow_block() allocates another new node - same logical position,
> > > > > >      new physical location, new reservation consumed
> > > > > >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > > > > >
> > > > > > Note this behavior should be much harder to trigger since Boris's
> > > > > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > > > > accounted for in user cgroups. However, I believe it
> > > > > > would still be an issue under global memory pressure.
> > > > > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > > > > >
> > > > > > This COW amplification breaks the idea that transaction reservations are
> > > > > > worst case as any search slot call could find itself in this COW loop and
> > > > > > exhaust its reservation.
> > > > > >
> > > > > > My proposed solution is to temporarily pin extent buffers for the
> > > > > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > > > > amplification that can be seen during high memory pressure.
> > > > > >
> > > > > > The implementation uses a local xarray to track COW'd buffers for the
> > > > > > duration of the search. The xarray stores extent_buffer pointers without
> > > > > > taking additional references; this is safe because tracked buffers remain
> > > > > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > > > > dirty buffers cannot be reclaimed by memory pressure.
> > > > > >
> > > > > > Synchronization is provided by eb->lock: increments in
> > > > > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > > > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > > > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > > > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > > > > Once we decrement, we're done and don't care if writeback proceeds
> > > > > > immediately.
> > > > >
> > > > > This seems too complex to me.
> > > > >
> > > > > So this problem is very similar to some idea I had a few years ago but
> > > > > never managed to implement.
> > > > > It was about avoiding unnecessary COW, not for this space reservation
> > > > > exhaustion due to sustained memory pressure, but it would solve it
> > > > > too.
> > > > >
> > > > > The idea was that we do unnecessary COW in cases like this:
> > > > >
> > > > > 1) We COW a path in some tree and we are at transaction N;
> > > > >
> > > > > 2) Writeback happened for the extent buffers in that path while we are
> > > > > in the same transaction, because we reached the 32M limit and some
> > > > > task called btrfs_btree_balance_dirty() or something else triggered
> > > > > writeback of the btree inode;
> > > > >
> > > > > 3) While still at transaction N, we visit the same path to add an item
> > > > > to a leaf, or modify an item, whatever. Because the extent buffers
> > > > > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > > > > returns true).
> > > > >
> > > > > So during the lifetime of a transaction we can have a lot of
> > > > > unnecessary COW - we spend more time allocating extents, allocating
> > > > > memory, copying extent buffer data, use more space per transaction,
> > > > > etc.
> > > > >
> > > > > The idea was to not COW when an extent buffer has
> > > > > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > > > > (btrfs_header_generation(eb)) matches the current transaction.
> > > > > That is safe because there's no committed tree that points to an
> > > > > extent buffer created in the current transaction.
> > > > >
> > > > > Any further modification to the extent buffer must be sure that the
> > > > > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > > > > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > > > > the extent buffer to the same location again before the transaction
> > > > > commits the superblocks.
> > > > >
> > > > > Have you considered an approach like this?
> > > >
> > > > I had not considered this, but it is a great idea.
> > > >
> > > > My first thought is that implementing this could be as simple
> > > > as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> > > > would mess with the assumptions around the log tree. From
> > > > btrfs_sync_log():
> > > >
> > > > /*
> > > >  * IO has been started, blocks of the log tree have WRITTEN flag set
> > > >  * in their headers. new modifications of the log will be written to
> > > >  * new positions. so it's safe to allow log writers to go in.
> > > >  */
> > > >
> > > > ^ Assumes that WRITTEN blocks will be COW'd.
> > > >
> > > > The issue looks like:
> > > >
> > > >  1. fsync A COWs eb
> > > >  2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> > > >  3. fsync B does __not__ COW eb and modifies it
> > > >  4. fsync A writes modified eb to disk
> > > >  5. CRASH; the log tree is corrupted
> > > >
> > > > One way to avoid that is to keep the current behavior for the log
> > > > tree, but that leaves the potential for COW amplification...
> > > >
> > > > Another idea is to track the log_transid in the eb in the same way
> > > > the transid is tracked. Then, in should_cow_block we have something
> > > > like:
> > > >
> > > > if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
> > > >     buf->log_transid != root->log_transid)
> > > >   return true;
> > >
> > > Log trees are special since their lifetime doesn't span a
> > > transaction, so what I suggested of course doesn't work for log trees,
> > > and I forgot to mention that.
> > >
> > > Tracking the log_transid in the extent buffer will not always work -
> > > because it can be evicted and reloaded, so we would lose its value.
> > > We would have to update the on-disk format to store it somewhere or
> > > keep another in memory structure to track that, or prevent eviction of
> > > log tree buffers - all of those are too complex.
> >
> > Supposing we cannot think of a way to do overwrites on log tree ebs,
> > but that we can make it work for other ebs (excluding the zoned case
> > you mentioned below):
> >
> > What do you think about the problem of space reservation exhaustion
> > due to COW amplification when narrowed to just log trees? As far as I
> > can tell, there is nothing special about how logged items consume
> > transaction reservation so the problem would be reduced but still exist.
> 
> Log trees are small to start with, and if they ever hit -ENOSPC, we
> just fallback to transaction commit.
> 
> >
> > Do we want to pursue working out the kinks in eb-overwrite (seems super
> > valuable regardless of motivation) and think of some other final
> > backstop for log tree ebs? Given that the fsync will be sending the ebs
> > down to the disk quite soon anyway, I was thinking it might be more
> > palatable to try to fully prevent premature writeback of log tree ebs.
> >
> > >
> > > So I had this half baked patch from many years ago:
> > >
> > >  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
> > >                       *root, struct btrfs_path *path, int level);
> > > @@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
> > > btrfs_trans_handle *trans,
> > >          *    block to ensure the metadata consistency.
> > >          */
> > >         if (btrfs_header_generation(buf) == trans->transid &&
> > > -           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> > >             !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> > >               btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> > > -           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> > > +           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
> > > +
> > > +               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> > > +                       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > > +                               return 1;
> > > +                       return 0;
> > > +               }
> > > +
> > > +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
> > > +                   test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
> > > +                       return 1;
> > >
> > > This was before a recent refactoring of should_cow_block(), but you
> > > should get the idea.
> > > IIRC all fstests were passing back then, except for one or two which I
> > > never spent time debugging.
> > >
> > > And as that attempt was before the tree checker existed, we would need
> > > to make sure we don't change an eb while the tree checker is
> > > verifying it - making sure the tree checker read locks the eb should
> > > be enough.
> >
> > I suspect this is what Sun was hitting in his replies to Leo.
> >
> > >
> > > There's also one problem with this idea: it won't work for zoned
> > > devices as writes are sequential and we can't write twice to the same
> > > location without doing the zone reset thing which only happens around
> > > transaction commit time IIRC.
> > >
> > > Thanks.
> > >
> >
> > Thanks for your input on this, it's really appreciated,
> > Boris
> >
> > > >
> > > > Please let me know if you see any issues with this approach or
> > > > if you can think of a better method.
> > > >
> > > > Thanks,
> > > > Leo
> > > >
> > > > >
> > > > > It would solve this space reservation exhaustion problem, as well as
> > > > > unnecessary COW for general optimization, without the need for a
> > > > > local xarray, which besides being very specific for the
> > > > > btrfs_search_slot() case (we COW in other places), also requires a
> > > > > memory allocation which can fail.
> > > > >
> > > > > Thanks.

I just want to re-iterate my understanding of the discussion to make
sure everyone is on the same page.

We've discussed two potential strategies: pinning and overwriting.

Overwriting, as suggested by Filipe: we skip COWing if an eb has already
been COW'd in the current transaction generation and instead overwrite
the eb in place.

Pinning, as suggested by me: we skip flushing ebs that have been COW'd
in the current search_slot.

With pinning alone, the class of unlock-retry re-COW loops goes away,
but not all re-COWs. If two different transaction handles in the same
generation touch the same path they would still re-COW.

With overwriting alone, we never re-COW within a generation, we just
overwrite the eb. This is a general optimization that also happens
to fix the amplification bug. But it doesn't work for log trees or
zoned devices, leaving those exposed to the original bug.

So, in my view, the solution is to implement both. Pinning to prevent
the unnecessary writeback that triggers re-COWs during retry loops,
and overwriting to avoid useless COWing when we legitimately revisit
the same path. For log trees and zoned devices where overwriting isn't
possible, pinning alone provides the fix.
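
As a rough sketch of how the two rules could sit side by side, here is a toy
userspace model. It is not the kernel's should_cow_block(): struct toy_eb and
both toy_* functions are hypothetical, and the RELOC, FORCE_COW and
in-flight-writeback checks from Filipe's patch are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an extent buffer. */
struct toy_eb {
	uint64_t generation;    /* transaction that created this buffer */
	bool written;           /* models BTRFS_HEADER_FLAG_WRITTEN */
	bool log_tree;          /* buffer belongs to a log tree */
	bool zoned;             /* filesystem sits on a zoned device */
	int writeback_blockers; /* > 0 while some writer has it pinned */
};

/*
 * Overwrite rule: a buffer created in the current transaction is not
 * reachable from any committed tree, so it can be modified in place even
 * after it was written, except where overwriting is not possible.
 */
static bool toy_should_cow(const struct toy_eb *eb, uint64_t transid)
{
	if (eb->generation != transid)
		return true;   /* may be referenced by a committed tree */
	if (!eb->written)
		return false;  /* still dirty in memory, modify in place */
	if (eb->log_tree || eb->zoned)
		return true;   /* cannot overwrite here, must COW */
	return false;          /* re-dirty and overwrite in place */
}

/* Pin rule: writeback skips a buffer while any writer has it pinned. */
static bool toy_may_write_back(const struct toy_eb *eb)
{
	return eb->writeback_blockers == 0;
}
```

In this model a written log tree eb still answers "COW", so the only thing
protecting it from the retry loop is that toy_may_write_back() keeps WRITTEN
from being set while it is pinned.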

To address the critique that the original pinning approach is too
narrow, only working in btrfs_search_slot(), we can instead tie the
pin to the transaction handle. I also believe that we could address
having to allocate memory mid-transaction by reserving enough space
at start_transaction so we don't risk having to abort the transaction.
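
A minimal sketch of what a handle-scoped pin with preallocated storage might
look like, again as a toy userspace model. All toy_* names are hypothetical;
writeback_blockers is borrowed from the original patch description, and the
real counter would be atomic and synchronized via eb->lock.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified extent buffer: only the pin counter. */
struct toy_eb {
	int writeback_blockers;
};

/* Hypothetical transaction handle carrying preallocated pin slots. */
struct toy_trans_handle {
	struct toy_eb **pinned;
	size_t capacity;
	size_t count;
};

/*
 * Allocate every pin slot up front, so this is the only place that can
 * fail for lack of memory. Returns 0 on success, -1 on failure (a zero
 * budget is treated as an error in this toy).
 */
static int toy_start_transaction(struct toy_trans_handle *trans,
				 size_t max_cows)
{
	if (max_cows == 0)
		return -1;
	trans->pinned = calloc(max_cows, sizeof(*trans->pinned));
	if (!trans->pinned)
		return -1;
	trans->capacity = max_cows;
	trans->count = 0;
	return 0;
}

/* Pin a freshly COW'd buffer; false means the COW budget was exceeded. */
static bool toy_pin(struct toy_trans_handle *trans, struct toy_eb *eb)
{
	if (trans->count == trans->capacity)
		return false;
	eb->writeback_blockers++;
	trans->pinned[trans->count++] = eb;
	return true;
}

/* Drop every pin taken by this handle and release the slots. */
static void toy_end_transaction(struct toy_trans_handle *trans)
{
	for (size_t i = 0; i < trans->count; i++)
		trans->pinned[i]->writeback_blockers--;
	free(trans->pinned);
	trans->pinned = NULL;
	trans->capacity = 0;
	trans->count = 0;
}
```

Sizing the slots from the same worst-case COW count used for the block_rsv
means the pin path itself never allocates, so it cannot fail mid-transaction
in this model.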

Let me know if this all makes sense. Or if you disagree that both
approaches are necessary.

Thanks,
Leo


* Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
  2026-01-27 20:42 [PATCH] btrfs: prevent COW amplification during btrfs_search_slot Leo Martins
  2026-01-28 21:48 ` Qu Wenruo
  2026-01-29 11:52 ` Filipe Manana
@ 2026-02-10  7:45 ` kernel test robot
  2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2026-02-10  7:45 UTC (permalink / raw)
  To: Leo Martins; +Cc: oe-lkp, lkp, linux-btrfs, kernel-team, oliver.sang



Hello,

kernel test robot noticed "INFO:trying_to_register_non-static_key" on:

commit: ba9c6f19149df060c2b7c71eaac21c394c161190 ("[PATCH] btrfs: prevent COW amplification during btrfs_search_slot")
url: https://github.com/intel-lab-lkp/linux/commits/Leo-Martins/btrfs-prevent-COW-amplification-during-btrfs_search_slot/20260128-044526
base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/all/ba53d279b8bb3456f61cb8a4f15d9a4b1e618d0e.1769546089.git.loemra.dev@gmail.com/
patch subject: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot

in testcase: perf-sanity-tests
version: 
with following parameters:

	perf_compiler: gcc
	group: group-01



config: x86_64-rhel-9.4-bpf
compiler: gcc-14
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (Kaby Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202602101504.f113db13-lkp@intel.com



[  125.965833][ T4372] INFO: trying to register non-static key.
[  125.971503][ T4372] The code is fine but needs lockdep annotation, or maybe
[  125.978465][ T4372] you didn't initialize this object before use?
[  125.984560][ T4372] turning off the locking correctness validator.
[  125.990745][ T4372] CPU: 3 UID: 0 PID: 4372 Comm: mount Tainted: G S        I         6.19.0-rc7-00122-gba9c6f19149d #1 PREEMPT(full)
[  125.990755][ T4372] Tainted: [S]=CPU_OUT_OF_SPEC, [I]=FIRMWARE_WORKAROUND
[  125.990758][ T4372] Hardware name: Dell Inc. OptiPlex 7050/062KRH, BIOS 1.2.0 12/22/2016
[  125.990761][ T4372] Call Trace:
[  125.990764][ T4372]  <TASK>
[  125.990767][ T4372]  dump_stack_lvl (lib/dump_stack.c:122)
[  125.990777][ T4372]  register_lock_class (kernel/locking/lockdep.c:985 kernel/locking/lockdep.c:1299)
[  125.990784][ T4372]  ? lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4) kernel/locking/lockdep.c:5875 (discriminator 4))
[  125.990790][ T4372]  ? __rcu_read_unlock (kernel/rcu/tree_plugin.h:441 (discriminator 1))
[  125.990799][ T4372]  __lock_acquire (kernel/locking/lockdep.c:5113)
[  125.990809][ T4372]  lock_acquire (include/linux/preempt.h:469 (discriminator 2) include/trace/events/lock.h:24 (discriminator 2) include/trace/events/lock.h:24 (discriminator 2) kernel/locking/lockdep.c:5831 (discriminator 2))
[  125.990815][ T4372]  ? xa_destroy (lib/xarray.c:2390 (discriminator 1))
[  125.990823][ T4372]  ? xa_find (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897 lib/xarray.c:2202)
[  125.990830][ T4372]  ? rcu_is_watching (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/context_tracking.h:128 kernel/rcu/tree.c:751)
[  125.990836][ T4372]  ? lock_acquire (include/trace/events/lock.h:24 (discriminator 2) kernel/locking/lockdep.c:5831 (discriminator 2))
[  125.990844][ T4372]  _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
[  125.990851][ T4372]  ? xa_destroy (lib/xarray.c:2390 (discriminator 1))
[  125.990858][ T4372]  xa_destroy (lib/xarray.c:2390 (discriminator 1))
[  125.990865][ T4372]  ? xa_find (lib/xarray.c:2204)
[  125.990872][ T4372]  ? __pfx_xa_destroy (lib/xarray.c:2384)
[  125.990884][ T4372]  ? unlock_up (fs/btrfs/ctree.c:1427) btrfs
[  125.991323][ T4372] btrfs_search_slot (fs/btrfs/ctree.c:2043) btrfs
[  125.991720][ T4372]  ? rcu_is_watching (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/context_tracking.h:128 kernel/rcu/tree.c:751)
[  125.991732][ T4372]  ? __pfx_btrfs_search_slot (fs/btrfs/ctree.c:2043) btrfs
[  125.992150][ T4372]  ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4629 (discriminator 4))
[  125.992158][ T4372]  ? ___slab_alloc (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 mm/slub.c:4555)
[  125.992166][ T4372]  ? __pfx___mutex_lock (kernel/locking/mutex.c:775)
[  125.992173][ T4372]  ? trace_preempt_on (include/trace/events/preemptirq.h:53 (discriminator 2) kernel/trace/trace_preemptirq.c:120 (discriminator 2))
[  125.992184][ T4372]  ? kmem_cache_alloc_noprof (include/trace/events/kmem.h:12 (discriminator 2) mm/slub.c:5273 (discriminator 2))
[  125.992191][ T4372]  ? btrfs_read_chunk_tree (fs/btrfs/volumes.c:7639) btrfs
[  125.992599][ T4372] btrfs_read_chunk_tree (fs/btrfs/volumes.c:7680) btrfs
[  125.993066][ T4372]  ? __pfx_btrfs_read_chunk_tree (fs/btrfs/volumes.c:7627) btrfs
[  125.993473][ T4372]  ? __pfx_load_super_root (fs/btrfs/disk-io.c:2617) btrfs
[  125.993868][ T4372]  ? __pfx_btrfs_read_sys_array (fs/btrfs/volumes.c:7494) btrfs
[  125.994309][ T4372]  ? __asan_memcpy (mm/kasan/shadow.c:105 (discriminator 1))
[  125.994317][ T4372]  ? read_extent_buffer (fs/btrfs/extent_io.c:3981) btrfs
[  125.994743][ T4372] open_ctree (fs/btrfs/disk-io.c:3481) btrfs
[  125.995173][ T4372] btrfs_fill_super.cold (fs/btrfs/super.c:981) btrfs
[  125.995570][ T4372] btrfs_get_tree_super (fs/btrfs/super.c:1945) btrfs
[  125.995959][ T4372] btrfs_get_tree_subvol (fs/btrfs/super.c:2087) btrfs
[  125.996378][ T4372]  vfs_get_tree (fs/super.c:1751)
[  125.996387][ T4372]  vfs_cmd_create (fs/fsopen.c:231)
[  125.996394][ T4372]  __do_sys_fsconfig (fs/fsopen.c:474)
[  125.996401][ T4372]  ? __pfx___do_sys_fsconfig (fs/fsopen.c:356)
[  125.996411][ T4372]  ? vfs_read (include/linux/sched/xacct.h:24 fs/read_write.c:579)
[  125.996417][ T4372]  ? do_syscall_64 (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/entry-common.h:108 arch/x86/entry/syscall_64.c:90)
[  125.996425][ T4372]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
[  125.996431][ T4372]  ? kasan_quarantine_put (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 mm/kasan/quarantine.c:234)
[  125.996438][ T4372]  ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4629 (discriminator 4))
[  125.996446][ T4372]  ? kasan_quarantine_put (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 mm/kasan/quarantine.c:234)
[  125.996453][ T4372]  ? kfree (mm/slub.c:6674 (discriminator 3) mm/slub.c:6882 (discriminator 3))
[  125.996459][ T4372]  ? __do_sys_fsconfig (fs/fsopen.c:499)
[  125.996467][ T4372]  ? ksys_read (fs/read_write.c:715)
[  125.996473][ T4372]  ? __pfx_ksys_read (fs/read_write.c:705)
[  125.996479][ T4372]  ? __do_sys_fsconfig (fs/fsopen.c:499)
[  125.996486][ T4372]  ? __pfx___do_sys_fsconfig (fs/fsopen.c:356)
[  125.996492][ T4372]  ? do_syscall_64 (include/linux/irq-entry-common.h:298 include/linux/entry-common.h:196 arch/x86/entry/syscall_64.c:100)
[  125.996499][ T4372]  ? do_syscall_64 (arch/x86/entry/syscall_64.c:113)
[  125.996505][ T4372]  ? __pfx_css_rstat_updated (kernel/cgroup/rstat.c:71)
[  125.996512][ T4372]  ? do_syscall_64 (include/linux/irq-entry-common.h:298 include/linux/entry-common.h:196 arch/x86/entry/syscall_64.c:100)
[  125.996520][ T4372]  ? do_syscall_64 (arch/x86/entry/syscall_64.c:113)
[  125.996525][ T4372]  ? find_held_lock (kernel/locking/lockdep.c:5350 (discriminator 1))
[  125.996531][ T4372]  ? count_memcg_events_mm+0x91/0x170
[  125.996539][ T4372]  ? count_memcg_events_mm+0x91/0x170
[  125.996545][ T4372]  ? __lock_release+0x5d/0x1b0
[  125.996553][ T4372]  ? find_held_lock (kernel/locking/lockdep.c:5350 (discriminator 1))
[  125.996559][ T4372]  ? exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 arch/x86/mm/fault.c:1480 arch/x86/mm/fault.c:1527)
[  125.996567][ T4372]  ? exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 arch/x86/mm/fault.c:1480 arch/x86/mm/fault.c:1527)
[  125.996573][ T4372]  ? __lock_release+0x5d/0x1b0
[  125.996578][ T4372]  ? handle_mm_fault (mm/memory.c:6469 (discriminator 1) mm/memory.c:6609 (discriminator 1))
[  125.996587][ T4372]  ? lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4) kernel/locking/lockdep.c:5875 (discriminator 4))
[  125.996594][ T4372]  ? irqentry_exit (include/linux/irq-entry-common.h:298 include/linux/irq-entry-common.h:341 kernel/entry/common.c:196)
[  125.996600][ T4372]  ? trace_hardirqs_on_prepare (kernel/trace/trace_preemptirq.c:64 (discriminator 4) kernel/trace/trace_preemptirq.c:59 (discriminator 4))
[  125.996605][ T4372]  ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4629 (discriminator 4))
[  125.996612][ T4372]  ? irqentry_exit (arch/x86/include/asm/jump_label.h:37 include/linux/context_tracking_state.h:138 include/linux/context_tracking.h:41 include/linux/irq-entry-common.h:301 include/linux/irq-entry-common.h:341 kernel/entry/common.c:196)
[  125.996619][ T4372]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131)
[  125.996625][ T4372] RIP: 0033:0x7f169cce04aa
[  125.996652][ T4372] Code: 73 01 c3 48 8b 0d 4e 59 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 af 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e 59 0d 00 f7 d8 64 89 01 48
All code
========
   0:	73 01                	jae    0x3
   2:	c3                   	ret
   3:	48 8b 0d 4e 59 0d 00 	mov    0xd594e(%rip),%rcx        # 0xd5958
   a:	f7 d8                	neg    %eax
   c:	64 89 01             	mov    %eax,%fs:(%rcx)
   f:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
  13:	c3                   	ret
  14:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
  1b:	00 00 00 
  1e:	66 90                	xchg   %ax,%ax
  20:	49 89 ca             	mov    %rcx,%r10
  23:	b8 af 01 00 00       	mov    $0x1af,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	ret
  33:	48 8b 0d 1e 59 0d 00 	mov    0xd591e(%rip),%rcx        # 0xd5958
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	ret
   9:	48 8b 0d 1e 59 0d 00 	mov    0xd591e(%rip),%rcx        # 0xd592e
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260210/202602101504.f113db13-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



end of thread, other threads:[~2026-02-10  7:46 UTC | newest]

Thread overview: 21+ messages
2026-01-27 20:42 [PATCH] btrfs: prevent COW amplification during btrfs_search_slot Leo Martins
2026-01-28 21:48 ` Qu Wenruo
2026-01-29 19:30   ` Leo Martins
2026-01-29 11:52 ` Filipe Manana
2026-01-30  0:12   ` Leo Martins
2026-01-30  4:14     ` Sun YangKai
2026-01-30  9:37       ` Sun YangKai
2026-01-30 15:50         ` Sun YangKai
2026-01-30 16:11           ` Filipe Manana
2026-01-31  9:16             ` Sun YangKai
2026-01-30 12:49     ` Filipe Manana
2026-01-30 15:43       ` Boris Burkov
2026-01-30 15:57         ` Filipe Manana
2026-02-03  1:09           ` Leo Martins
2026-01-30 21:43       ` Leo Martins
2026-01-30 22:34         ` Qu Wenruo
2026-01-31  0:11           ` Boris Burkov
2026-01-31  1:06             ` Qu Wenruo
2026-01-31 17:16               ` Boris Burkov
2026-01-31 21:59                 ` Qu Wenruo
2026-02-10  7:45 ` kernel test robot
