public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Wang Yugui <wangyugui@e16-tech.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/3] btrfs: do not modify log tree while holding a leaf from fs tree locked
Date: Mon, 28 Nov 2022 13:54:49 +0800	[thread overview]
Message-ID: <20221128135448.49F8.409509F4@e16-tech.com> (raw)
In-Reply-To: <616a85703a25abbcc107b4e83d961d356d1c5463.1669025204.git.fdmanana@suse.com>

Hi,

> From: Filipe Manana <fdmanana@suse.com>
> 
> When logging an inode in full mode, or when logging xattrs or when logging
> the dir index items of a directory, we are modifying the log tree while
> holding a read lock on a leaf from the fs/subvolume tree. This can lead to
> a deadlock in rare circumstances, but it is a real possibility, and it was
> recently reported by syzbot with the following trace from lockdep:

This patch has beed merged into linux upstream.

And this patch can not be applied to 5.15.y without some backport.
do we need a backport of this patch for 5.15.y?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/11/28

>    WARNING: possible circular locking dependency detected
>    6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
>    ------------------------------------------------------
>    syz-executor.1/16154 is trying to acquire lock:
>    ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
> 
>    but task is already holding lock:
>    ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
> 
>    which lock already depends on the new lock.
> 
>    the existing dependency chain (in reverse order) is:
> 
>    -> #2 (btrfs-log-00){++++}-{3:3}:
>           down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
>           __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
>           btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
>           btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
>           btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
>           btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
>           btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
>           btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
>           log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
>           copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
>           copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
>           btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
>           btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
>           btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
>           btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
>           vfs_fsync_range+0x13e/0x230 fs/sync.c:188
>           generic_write_sync include/linux/fs.h:2856 [inline]
>           iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
>           btrfs_direct_write fs/btrfs/file.c:1536 [inline]
>           btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
>           call_write_iter include/linux/fs.h:2160 [inline]
>           do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
>           do_iter_write+0x182/0x700 fs/read_write.c:861
>           vfs_iter_write+0x74/0xa0 fs/read_write.c:902
>           iter_file_splice_write+0x745/0xc90 fs/splice.c:686
>           do_splice_from fs/splice.c:764 [inline]
>           direct_splice_actor+0x114/0x180 fs/splice.c:931
>           splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
>           do_splice_direct+0x1ab/0x280 fs/splice.c:974
>           do_sendfile+0xb19/0x1270 fs/read_write.c:1255
>           __do_sys_sendfile64 fs/read_write.c:1323 [inline]
>           __se_sys_sendfile64 fs/read_write.c:1309 [inline]
>           __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
>           do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>           do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>           entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
>    -> #1 (btrfs-tree-00){++++}-{3:3}:
>           __lock_release kernel/locking/lockdep.c:5382 [inline]
>           lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
>           up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
>           btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
>           btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
>           search_leaf fs/btrfs/ctree.c:1832 [inline]
>           btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
>           btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
>           btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
>           btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
>           __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
>           __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
>           flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
>           btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
>           process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
>           worker_thread+0x669/0x1090 kernel/workqueue.c:2436
>           kthread+0x2e8/0x3a0 kernel/kthread.c:376
>           ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
> 
>    -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
>           check_prev_add kernel/locking/lockdep.c:3097 [inline]
>           check_prevs_add kernel/locking/lockdep.c:3216 [inline]
>           validate_chain kernel/locking/lockdep.c:3831 [inline]
>           __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
>           lock_acquire kernel/locking/lockdep.c:5668 [inline]
>           lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
>           __mutex_lock_common kernel/locking/mutex.c:603 [inline]
>           __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
>           __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
>           __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
>           btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
>           btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
>           btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
>           evict+0x2ed/0x6b0 fs/inode.c:664
>           dispose_list+0x117/0x1e0 fs/inode.c:697
>           prune_icache_sb+0xeb/0x150 fs/inode.c:896
>           super_cache_scan+0x391/0x590 fs/super.c:106
>           do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
>           shrink_slab_memcg mm/vmscan.c:912 [inline]
>           shrink_slab+0x388/0x660 mm/vmscan.c:991
>           shrink_node_memcgs mm/vmscan.c:6088 [inline]
>           shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
>           shrink_zones mm/vmscan.c:6355 [inline]
>           do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
>           try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
>           reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
>           mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
>           try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
>           try_charge mm/memcontrol.c:2827 [inline]
>           charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
>           __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
>           mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
>           __filemap_add_folio+0x615/0xf80 mm/filemap.c:852
>           filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
>           __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
>           pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
>           find_or_create_page include/linux/pagemap.h:612 [inline]
>           alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
>           btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
>           btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
>           __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
>           btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
>           btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
>           btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
>           update_log_root fs/btrfs/tree-log.c:2841 [inline]
>           btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
>           btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
>           vfs_fsync_range+0x13e/0x230 fs/sync.c:188
>           generic_write_sync include/linux/fs.h:2856 [inline]
>           iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
>           btrfs_direct_write fs/btrfs/file.c:1536 [inline]
>           btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
>           call_write_iter include/linux/fs.h:2160 [inline]
>           do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
>           do_iter_write+0x182/0x700 fs/read_write.c:861
>           vfs_iter_write+0x74/0xa0 fs/read_write.c:902
>           iter_file_splice_write+0x745/0xc90 fs/splice.c:686
>           do_splice_from fs/splice.c:764 [inline]
>           direct_splice_actor+0x114/0x180 fs/splice.c:931
>           splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
>           do_splice_direct+0x1ab/0x280 fs/splice.c:974
>           do_sendfile+0xb19/0x1270 fs/read_write.c:1255
>           __do_sys_sendfile64 fs/read_write.c:1323 [inline]
>           __se_sys_sendfile64 fs/read_write.c:1309 [inline]
>           __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
>           do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>           do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>           entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
>    other info that might help us debug this:
> 
>    Chain exists of:
>      &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
> 
>    Possible unsafe locking scenario:
> 
>           CPU0                    CPU1
>           ----                    ----
>      lock(btrfs-log-00);
>                                   lock(btrfs-tree-00);
>                                   lock(btrfs-log-00);
>      lock(&delayed_node->mutex);
> 
> Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
> lock dependency when we are COWing extent buffers for the log tree and we
> have two tasks modifying the log tree, with each one in one of the
> following 2 scenarios:
> 
> 1) Modifying the log tree triggers an extent buffer allocation while
>    holding a write lock on a parent extent buffer from the log tree.
>    Allocating the pages for an extent buffer, or the extent buffer
>    struct, can trigger inode eviction and finally the inode eviction
>    will trigger a release/remove of a delayed node, which requires
>    taking the delayed node's mutex;
> 
> 2) Allocating a metadata extent for a log tree can trigger the async
>    reclaim thread and make us wait for it to release enough space and
>    unblock our reservation ticket. The reclaim thread can start flushing
>    delayed items, and that in turn results in the need to lock delayed
>    node mutexes and in the need to write lock extent buffers of a
>    subvolume tree - all this while holding a write lock on the parent
>    extent buffer in the log tree.
> 
> So one task in scenario 1) running in parallel with another task in
> scenario 2) could lead to a deadlock, one wanting to lock a delayed node
> mutex while having a read lock on a leaf from the subvolume, while the
> other is holding the delayed node's mutex and wants to write lock the same
> subvolume leaf for flushing delayed items.
> 
> Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
> fs/subvolume leaf and use the clone leaf instead.
> 
> Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 59 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 8fcfaf015a70..f7b1bb9c63e4 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3675,15 +3675,29 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans,
>  				  u64 *last_old_dentry_offset)
>  {
>  	struct btrfs_root *log = inode->root->log_root;
> -	struct extent_buffer *src = path->nodes[0];
> -	const int nritems = btrfs_header_nritems(src);
> +	struct extent_buffer *src;
> +	const int nritems = btrfs_header_nritems(path->nodes[0]);
>  	const u64 ino = btrfs_ino(inode);
>  	bool last_found = false;
>  	int batch_start = 0;
>  	int batch_size = 0;
>  	int i;
>  
> -	for (i = path->slots[0]; i < nritems; i++) {
> +	/*
> +	 * We need to clone the leaf, release the read lock on it, and use the
> +	 * clone before modifying the log tree. See the comment at copy_items()
> +	 * about why we need to do this.
> +	 */
> +	src = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!src)
> +		return -ENOMEM;
> +
> +	i = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = src;
> +	path->slots[0] = i;
> +
> +	for (; i < nritems; i++) {
>  		struct btrfs_dir_item *di;
>  		struct btrfs_key key;
>  		int ret;
> @@ -4284,7 +4298,7 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
>  {
>  	struct btrfs_root *log = inode->root->log_root;
>  	struct btrfs_file_extent_item *extent;
> -	struct extent_buffer *src = src_path->nodes[0];
> +	struct extent_buffer *src;
>  	int ret = 0;
>  	struct btrfs_key *ins_keys;
>  	u32 *ins_sizes;
> @@ -4295,6 +4309,43 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
>  	const bool skip_csum = (inode->flags & BTRFS_INODE_NODATASUM);
>  	const u64 i_size = i_size_read(&inode->vfs_inode);
>  
> +	/*
> +	 * To keep lockdep happy and avoid deadlocks, clone the source leaf and
> +	 * use the clone. This is because otherwise we would be changing the log
> +	 * tree, to insert items from the subvolume tree or insert csum items,
> +	 * while holding a read lock on a leaf from the subvolume tree, which
> +	 * creates a nasty lock dependency when COWing log tree nodes/leaves:
> +	 *
> +	 * 1) Modifying the log tree triggers an extent buffer allocation while
> +	 *    holding a write lock on a parent extent buffer from the log tree.
> +	 *    Allocating the pages for an extent buffer, or the extent buffer
> +	 *    struct, can trigger inode eviction and finally the inode eviction
> +	 *    will trigger a release/remove of a delayed node, which requires
> +	 *    taking the delayed node's mutex;
> +	 *
> +	 * 2) Allocating a metadata extent for a log tree can trigger the async
> +	 *    reclaim thread and make us wait for it to release enough space and
> +	 *    unblock our reservation ticket. The reclaim thread can start
> +	 *    flushing delayed items, and that in turn results in the need to
> +	 *    lock delayed node mutexes and in the need to write lock extent
> +	 *    buffers of a subvolume tree - all this while holding a write lock
> +	 *    on the parent extent buffer in the log tree.
> +	 *
> +	 * So one task in scenario 1) running in parallel with another task in
> +	 * scenario 2) could lead to a deadlock, one wanting to lock a delayed
> +	 * node mutex while having a read lock on a leaf from the subvolume,
> +	 * while the other is holding the delayed node's mutex and wants to
> +	 * write lock the same subvolume leaf for flushing delayed items.
> +	 */
> +	src = btrfs_clone_extent_buffer(src_path->nodes[0]);
> +	if (!src)
> +		return -ENOMEM;
> +
> +	i = src_path->slots[0];
> +	btrfs_release_path(src_path);
> +	src_path->nodes[0] = src;
> +	src_path->slots[0] = i;
> +
>  	ins_data = kmalloc(nr * sizeof(struct btrfs_key) +
>  			   nr * sizeof(u32), GFP_NOFS);
>  	if (!ins_data)
> -- 
> 2.35.1



  reply	other threads:[~2022-11-28  5:54 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-21 10:23 [PATCH 0/3] btrfs: fix a rare deadlock case when logging inodes fdmanana
2022-11-21 10:23 ` [PATCH 1/3] btrfs: do not modify log tree while holding a leaf from fs tree locked fdmanana
2022-11-28  5:54   ` Wang Yugui [this message]
2022-11-28 10:03     ` Filipe Manana
2022-11-21 10:23 ` [PATCH 2/3] btrfs: unify overwrite_item() and do_overwrite_item() fdmanana
2022-11-21 10:23 ` [PATCH 3/3] btrfs: remove outdated logic from overwrite_item() and add assertion fdmanana
2022-11-21 16:12 ` [PATCH 0/3] btrfs: fix a rare deadlock case when logging inodes Josef Bacik
2022-11-21 19:23 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221128135448.49F8.409509F4@e16-tech.com \
    --to=wangyugui@e16-tech.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox