From: Filipe Manana <fdmanana@kernel.org>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: fix deadlock between reflink and transaction commit when using flushoncommit
Date: Wed, 25 Mar 2026 18:19:21 +0000
Message-ID: <CAL3q7H5kyrGkkGgyLcUnOCesTQeziNv=WpfcLT7wLYHZoB7Faw@mail.gmail.com>
In-Reply-To: <20260325173329.GA2908386@zen.localdomain>
On Wed, Mar 25, 2026 at 5:33 PM Boris Burkov <boris@bur.io> wrote:
>
> On Mon, Mar 23, 2026 at 05:01:58PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When using the flushoncommit mount option, we can have a deadlock between
> > a transaction commit and a reflink operation that copied an inline extent
> > to an offset beyond the current i_size of the destination inode.
> >
> > The deadlock happens like this:
> >
> > 1) Task A clones an inline extent from inode X to an offset of inode Y
> > that is beyond Y's current i_size. This means we copied the inline
> > extent's data to a folio of inode Y that is beyond its EOF, using a
> > call to copy_inline_to_page();
> >
> > 2) Task B starts a transaction commit and calls
> > btrfs_start_delalloc_flush() to flush delalloc;
> >
> > 3) The delalloc flushing sees the new dirty folio of inode Y and when it
> > attempts to flush it, it ends up at extent_writepage() and sees that
> > the offset of the folio is beyond the i_size of inode Y, so it attempts
> > to invalidate the folio by calling folio_invalidate(), which ends up at
> > btrfs' folio invalidate callback - btrfs_invalidate_folio(). There it
> > tries to lock the folio's range in inode Y's extent io tree, but it
> > blocks since it's currently locked by task A - during a reflink we lock
> > the inodes and the source and destination ranges after flushing all
> > delalloc and waiting for ordered extent completion - after that we
> > don't expect to have dirty folios in the ranges, the exception is if
> > we have to copy an inline extent's data (because the destination offset
> > is not zero);
>
> mentioning the first lock "where it happens" in the sequence would make
> this easier to follow, IMO. With two files and two tasks, time
> travelling backwards while reading is kind of a mental hurdle.
>
> e.g. 1. Task A clones an inline extent ... in btrfs_clone_files we lock
> the destination range ...
Well, it's mentioned in step 1, and extent locking is obvious since we
have to do it for any operation that changes extents in a range. It's
not exclusive to reflinks, we do this everywhere in btrfs. I mentioned
it here in step 3 as a reminder.
>
> >
> > 4) Task A then attempts to start a transaction to update the inode item,
> > and then it's blocked since the current transaction is in the
> > TRANS_STATE_COMMIT_START state. Therefore task A has to wait for the
> > current transaction to become unblocked (its state >=
> > TRANS_STATE_UNBLOCKED).
> >
> > So task A is waiting for the transaction commit done by task B, and
> > the latter is waiting on the extent lock of inode Y that is currently
> > held by task A.
>
> I believe your stack traces below show a slightly different picture with
> three tasks: clone, commit, and writeback worker. The essential lock
> cycle seems correct but that detail is different from your description.
> Have you seen different forms of it where the commit task is directly
> blocked?
I didn't mention the commit task waiting for the writeback worker to
simplify things.
I think anyone is able to realize the commit task blocks waiting for
the writeback worker to complete - the stack traces make that obvious.
And no, I haven't seen any case where the commit task is directly
blocked - I don't see how that could happen unless the implementation
of try_to_writeback_inodes_sb() changes and it directly does the
writeback instead of using a worker.
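
For anyone following the three stack traces, here is a minimal user-space
sketch of the wait chain they show. It is not kernel code: the enum, the
array and both helpers are hypothetical names made up for illustration,
and each edge is only annotated with the function where the corresponding
task is blocked in the traces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three blocked tasks in the syzbot report. */
enum task { TASK_CLONE, TASK_COMMIT, TASK_WB_WORKER, NR_TASKS };

/*
 * waits_for[t] is the task that t is blocked on, or -1 if t can run.
 * Follow the chain from 'start' and report whether it leads back to it.
 */
static bool has_wait_cycle(const int waits_for[NR_TASKS], enum task start)
{
	int cur = start;

	for (size_t steps = 0; steps < NR_TASKS; steps++) {
		cur = waits_for[cur];
		if (cur < 0)
			return false;	/* chain ends at a runnable task */
		if (cur == (int)start)
			return true;	/* back where we started: deadlock */
	}
	return false;
}

/* Fill in the wait edges as they appear in the reported stack traces. */
static void model_reported_hang(int waits_for[NR_TASKS])
{
	/* Reflink task: wait_current_trans(), holding the extent lock. */
	waits_for[TASK_CLONE] = TASK_COMMIT;
	/* Commit task: wb_wait_for_completion() for the writeback worker. */
	waits_for[TASK_COMMIT] = TASK_WB_WORKER;
	/* Worker: btrfs_lock_extent() from btrfs_invalidate_folio(). */
	waits_for[TASK_WB_WORKER] = TASK_CLONE;
}
```

With the edges from model_reported_hang(), has_wait_cycle() returns true
starting from any of the three tasks. Clearing any single edge (setting it
to -1) makes it return false, which is why the commit task being blocked
indirectly through the worker does not change the essential cycle.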
>
> >
> > Syzbot recently reported this with the following stack traces:
> >
> > INFO: task kworker/u8:7:1053 blocked for more than 143 seconds.
> > Not tainted syzkaller #0
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:kworker/u8:7 state:D stack:23520 pid:1053 tgid:1053 ppid:2 task_flags:0x4208060 flags:0x00080000
> > Workqueue: writeback wb_workfn (flush-btrfs-46)
> > Call Trace:
> > <TASK>
> > context_switch kernel/sched/core.c:5298 [inline]
> > __schedule+0x1553/0x5240 kernel/sched/core.c:6911
> > __schedule_loop kernel/sched/core.c:6993 [inline]
> > schedule+0x164/0x360 kernel/sched/core.c:7008
> > wait_extent_bit fs/btrfs/extent-io-tree.c:811 [inline]
> > btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:1914
> > btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline]
> > btrfs_invalidate_folio+0x43d/0xc40 fs/btrfs/inode.c:7704
> > extent_writepage fs/btrfs/extent_io.c:1852 [inline]
> > extent_write_cache_pages fs/btrfs/extent_io.c:2580 [inline]
> > btrfs_writepages+0x12ff/0x2440 fs/btrfs/extent_io.c:2713
> > do_writepages+0x32e/0x550 mm/page-writeback.c:2554
> > __writeback_single_inode+0x133/0x11a0 fs/fs-writeback.c:1750
> > writeback_sb_inodes+0x995/0x19d0 fs/fs-writeback.c:2042
> > wb_writeback+0x456/0xb70 fs/fs-writeback.c:2227
> > wb_do_writeback fs/fs-writeback.c:2374 [inline]
> > wb_workfn+0x41a/0xf60 fs/fs-writeback.c:2414
> > process_one_work kernel/workqueue.c:3276 [inline]
> > process_scheduled_works+0xb6e/0x18c0 kernel/workqueue.c:3359
> > worker_thread+0xa53/0xfc0 kernel/workqueue.c:3440
> > kthread+0x388/0x470 kernel/kthread.c:436
> > ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
> > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> > </TASK>
> > INFO: task syz.4.64:6910 blocked for more than 143 seconds.
> > Not tainted syzkaller #0
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:syz.4.64 state:D stack:22752 pid:6910 tgid:6905 ppid:5944 task_flags:0x400140 flags:0x00080002
> > Call Trace:
> > <TASK>
> > context_switch kernel/sched/core.c:5298 [inline]
> > __schedule+0x1553/0x5240 kernel/sched/core.c:6911
> > __schedule_loop kernel/sched/core.c:6993 [inline]
> > schedule+0x164/0x360 kernel/sched/core.c:7008
> > wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:535
> > start_transaction+0x6a7/0x1650 fs/btrfs/transaction.c:705
> > clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline]
> > btrfs_clone+0x128a/0x24d0 fs/btrfs/reflink.c:529
> > btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:750
> > btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:903
> > vfs_copy_file_range+0xda7/0x1390 fs/read_write.c:1600
> > __do_sys_copy_file_range fs/read_write.c:1683 [inline]
> > __se_sys_copy_file_range+0x2fb/0x480 fs/read_write.c:1650
> > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> > do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7f5f73afc799
> > RSP: 002b:00007f5f7315e028 EFLAGS: 00000246 ORIG_RAX: 0000000000000146
> > RAX: ffffffffffffffda RBX: 00007f5f73d75fa0 RCX: 00007f5f73afc799
> > RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000005
> > RBP: 00007f5f73b92c99 R08: 0000000000000863 R09: 0000000000000000
> > R10: 00002000000000c0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 00007f5f73d76038 R14: 00007f5f73d75fa0 R15: 00007fff138a5068
> > </TASK>
> > INFO: task syz.4.64:6975 blocked for more than 143 seconds.
> > Not tainted syzkaller #0
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:syz.4.64 state:D stack:24736 pid:6975 tgid:6905 ppid:5944 task_flags:0x400040 flags:0x00080002
> > Call Trace:
> > <TASK>
> > context_switch kernel/sched/core.c:5298 [inline]
> > __schedule+0x1553/0x5240 kernel/sched/core.c:6911
> > __schedule_loop kernel/sched/core.c:6993 [inline]
> > schedule+0x164/0x360 kernel/sched/core.c:7008
> > wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227
> > __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2838
> > try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2886
> > btrfs_start_delalloc_flush fs/btrfs/transaction.c:2175 [inline]
> > btrfs_commit_transaction+0x82e/0x31a0 fs/btrfs/transaction.c:2364
> > btrfs_ioctl+0xca7/0xd00 fs/btrfs/ioctl.c:5206
> > vfs_ioctl fs/ioctl.c:51 [inline]
> > __do_sys_ioctl fs/ioctl.c:597 [inline]
> > __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
> > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> > do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7f5f73afc799
> > RSP: 002b:00007f5f7313d028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > RAX: ffffffffffffffda RBX: 00007f5f73d76090 RCX: 00007f5f73afc799
> > RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000004
> > RBP: 00007f5f73b92c99 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > R13: 00007f5f73d76128 R14: 00007f5f73d76090 R15: 00007fff138a5068
> > </TASK>
> >
> > Fix this by updating the i_size of the destination inode of a reflink
> > operation after we copy an inline extent's data to an offset beyond the
> > i_size and before attempting to start a transaction to update the inode's
> > item.
> >
> > Reported-by: syzbot+63056bf627663701bbbf@syzkaller.appspotmail.com
> > Link: https://lore.kernel.org/linux-btrfs/69bba3fe.050a0220.227207.002f.GAE@google.com/
> > Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
>
> The fix LGTM, thank you.
> Reviewed-by: Boris Burkov <boris@bur.io>
Thanks!
>
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/reflink.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 45 insertions(+)
> >
> > diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> > index fca00c0f5387..49865a463780 100644
> > --- a/fs/btrfs/reflink.c
> > +++ b/fs/btrfs/reflink.c
> > @@ -322,6 +322,51 @@ static int clone_copy_inline_extent(struct btrfs_inode *inode,
> >
> > ret = copy_inline_to_page(inode, new_key->offset,
> > inline_data, size, datal, comp_type);
> > +
> > + /*
> > + * If we copied the inline extent data to a page/folio beyond the i_size
> > + * of the destination inode, then we need to increase the i_size before
> > + * we start a transaction to update the inode item. This is to prevent a
> > + * deadlock when the flushoncommit mount option is used, which happens
> > + * like this:
> > + *
> > + * 1) Task A clones an inline extent from inode X to an offset of inode
> > + * Y that is beyond Y's current i_size. This means we copied the
> > + * inline extent's data to a folio of inode Y that is beyond its EOF,
> > + * using the call above to copy_inline_to_page();
> > + *
> > + * 2) Task B starts a transaction commit and calls
> > + * btrfs_start_delalloc_flush() to flush delalloc;
> > + *
> > + * 3) The delalloc flushing sees the new dirty folio of inode Y and when
> > + * it attempts to flush it, it ends up at extent_writepage() and sees
> > + * that the offset of the folio is beyond the i_size of inode Y, so
> > + * it attempts to invalidate the folio by calling folio_invalidate(),
> > + * which ends up at btrfs' folio invalidate callback -
> > + * btrfs_invalidate_folio(). There it tries to lock the folio's range
> > + * in inode Y's extent io tree, but it blocks since it's currently
> > + * locked by task A - during reflink we lock the inodes and the
> > + * source and destination ranges after flushing all delalloc and
> > + * waiting for ordered extent completion - after that we don't expect
> > + * to have dirty folios in the ranges, the exception is if we have to
> > + * copy an inline extent's data (because the destination offset is
> > + * not zero);
> > + *
> > + * 4) Task A then does the 'goto out' below and attempts to start a
> > + * transaction to update the inode item, and then it's blocked since
> > + * the current transaction is in the TRANS_STATE_COMMIT_START state.
> > + * Therefore task A has to wait for the current transaction to become
> > + * unblocked (its state >= TRANS_STATE_UNBLOCKED).
> > + *
> > + * This leads to a deadlock - the task committing the transaction
> > + * waiting for the delalloc flushing which is blocked during folio
> > + * invalidation on the inode's extent lock and the reflink task waiting
> > + * for the current transaction to be unblocked so that it can start a
> > + * new one to update the inode item (while holding the extent lock).
> > + */
> > + if (ret == 0 && new_key->offset + datal > i_size_read(&inode->vfs_inode))
> > + i_size_write(&inode->vfs_inode, new_key->offset + datal);
> > +
> > goto out;
> > }
> >
> > --
> > 2.47.2
> >
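
To illustrate the effect of the i_size check added by the patch, here is
a small user-space model. It is only a sketch: struct model_inode and both
helpers are hypothetical, and the "beyond EOF" test is a simplification
that assumes writeback treats a folio as beyond EOF when its start offset
is at or past i_size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the destination inode; only i_size matters. */
struct model_inode {
	uint64_t i_size;
};

/*
 * Mirrors the condition added by the patch: after a successful copy of
 * the inline data (ret == 0), grow i_size if the copy ended beyond it.
 */
static void update_i_size_after_inline_copy(struct model_inode *inode,
					    int ret, uint64_t dest_offset,
					    uint64_t datal)
{
	if (ret == 0 && dest_offset + datal > inode->i_size)
		inode->i_size = dest_offset + datal;
}

/*
 * Simplified assumption: a folio whose start offset is at or past i_size
 * is the one writeback would try to invalidate instead of writing.
 */
static bool folio_beyond_eof(const struct model_inode *inode,
			     uint64_t folio_start)
{
	return folio_start >= inode->i_size;
}
```

For example, with i_size at 100 and 50 bytes of inline data copied to
offset 8192, the folio starting at 8192 is beyond EOF before the update
and no longer beyond EOF afterwards (i_size becomes 8242). A failed copy
(ret != 0) or a copy that ends within the current i_size leaves i_size
unchanged.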