From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 340C4C433FE for ; Mon, 28 Nov 2022 05:54:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229603AbiK1Fyz (ORCPT ); Mon, 28 Nov 2022 00:54:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55458 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229551AbiK1Fyy (ORCPT ); Mon, 28 Nov 2022 00:54:54 -0500 Received: from out20-38.mail.aliyun.com (out20-38.mail.aliyun.com [115.124.20.38]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1658B1085 for ; Sun, 27 Nov 2022 21:54:51 -0800 (PST) X-Alimail-AntiSpam: AC=CONTINUE;BC=0.04436282|-1;CH=green;DM=|CONTINUE|false|;DS=CONTINUE|ham_alarm|0.322749-0.00115659-0.676094;FP=0|0|0|0|0|-1|-1|-1;HT=ay29a033018047209;MF=wangyugui@e16-tech.com;NM=1;PH=DS;RN=2;RT=2;SR=0;TI=SMTPD_---.QIYP8Qq_1669614888; Received: from 192.168.2.112(mailfrom:wangyugui@e16-tech.com fp:SMTPD_---.QIYP8Qq_1669614888) by smtp.aliyun-inc.com; Mon, 28 Nov 2022 13:54:49 +0800 Date: Mon, 28 Nov 2022 13:54:49 +0800 From: Wang Yugui To: fdmanana@kernel.org Subject: Re: [PATCH 1/3] btrfs: do not modify log tree while holding a leaf from fs tree locked Cc: linux-btrfs@vger.kernel.org In-Reply-To: <616a85703a25abbcc107b4e83d961d356d1c5463.1669025204.git.fdmanana@suse.com> References: <616a85703a25abbcc107b4e83d961d356d1c5463.1669025204.git.fdmanana@suse.com> Message-Id: <20221128135448.49F8.409509F4@e16-tech.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.81.04 [en] Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Hi, > From: Filipe Manana > > When logging an inode in full mode, or when logging xattrs or when logging > the dir index items of a directory, we are modifying the log tree while > holding a read lock on a leaf from the fs/subvolume tree. This can lead to > a deadlock in rare circumstances, but it is a real possibility, and it was > recently reported by syzbot with the following trace from lockdep: This patch has beed merged into linux upstream. And this patch can not be applied to 5.15.y without some backport. do we need a backport of this patch for 5.15.y? Best Regards Wang Yugui (wangyugui@e16-tech.com) 2022/11/28 > WARNING: possible circular locking dependency detected > 6.1.0-rc5-next-20221116-syzkaller #0 Not tainted > ------------------------------------------------------ > syz-executor.1/16154 is trying to acquire lock: > ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256 > > but task is already holding lock: > ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197 > > which lock already depends on the new lock. > > the existing dependency chain (in reverse order) is: > > -> #2 (btrfs-log-00){++++}-{3:3}: > down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634 > __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135 > btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline] > btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280 > btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline] > btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998 > btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209 > btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021 > log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258 > copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403 > copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873 > btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495 > btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982 > btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083 > btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921 > vfs_fsync_range+0x13e/0x230 fs/sync.c:188 > generic_write_sync include/linux/fs.h:2856 [inline] > iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128 > btrfs_direct_write fs/btrfs/file.c:1536 [inline] > btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668 > call_write_iter include/linux/fs.h:2160 [inline] > do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 > do_iter_write+0x182/0x700 fs/read_write.c:861 > vfs_iter_write+0x74/0xa0 fs/read_write.c:902 > iter_file_splice_write+0x745/0xc90 fs/splice.c:686 > do_splice_from fs/splice.c:764 [inline] > direct_splice_actor+0x114/0x180 fs/splice.c:931 > splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 > do_splice_direct+0x1ab/0x280 fs/splice.c:974 > do_sendfile+0xb19/0x1270 fs/read_write.c:1255 > __do_sys_sendfile64 fs/read_write.c:1323 [inline] > __se_sys_sendfile64 fs/read_write.c:1309 [inline] > __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309 > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 > entry_SYSCALL_64_after_hwframe+0x63/0xcd > > -> #1 (btrfs-tree-00){++++}-{3:3}: > __lock_release kernel/locking/lockdep.c:5382 [inline] > lock_release+0x371/0x810 kernel/locking/lockdep.c:5688 > up_write+0x2a/0x520 kernel/locking/rwsem.c:1614 > btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] > btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238 > search_leaf fs/btrfs/ctree.c:1832 [inline] > btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074 > btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133 > btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746 > btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] > __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline] > __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153 > flush_space+0x147/0xe90 fs/btrfs/space-info.c:728 > btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086 > process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289 > worker_thread+0x669/0x1090 kernel/workqueue.c:2436 > kthread+0x2e8/0x3a0 kernel/kthread.c:376 > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 > > -> #0 (&delayed_node->mutex){+.+.}-{3:3}: > check_prev_add kernel/locking/lockdep.c:3097 [inline] > check_prevs_add kernel/locking/lockdep.c:3216 [inline] > validate_chain kernel/locking/lockdep.c:3831 [inline] > __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055 > lock_acquire kernel/locking/lockdep.c:5668 [inline] > lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633 > __mutex_lock_common kernel/locking/mutex.c:603 [inline] > __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747 > __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256 > __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline] > btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] > btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285 > btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554 > evict+0x2ed/0x6b0 fs/inode.c:664 > dispose_list+0x117/0x1e0 fs/inode.c:697 > prune_icache_sb+0xeb/0x150 fs/inode.c:896 > super_cache_scan+0x391/0x590 fs/super.c:106 > do_shrink_slab+0x464/0xce0 mm/vmscan.c:843 > shrink_slab_memcg mm/vmscan.c:912 [inline] > shrink_slab+0x388/0x660 mm/vmscan.c:991 > shrink_node_memcgs mm/vmscan.c:6088 [inline] > shrink_node+0x93d/0x1f30 mm/vmscan.c:6117 > shrink_zones mm/vmscan.c:6355 [inline] > do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417 > try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732 > reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393 > mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578 > try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816 > try_charge mm/memcontrol.c:2827 [inline] > charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889 > __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910 > mem_cgroup_charge include/linux/memcontrol.h:667 [inline] > __filemap_add_folio+0x615/0xf80 mm/filemap.c:852 > filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934 > __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976 > pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104 > find_or_create_page include/linux/pagemap.h:612 [inline] > alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588 > btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline] > btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988 > __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440 > btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595 > btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038 > btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137 > update_log_root fs/btrfs/tree-log.c:2841 [inline] > btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064 > btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947 > vfs_fsync_range+0x13e/0x230 fs/sync.c:188 > generic_write_sync include/linux/fs.h:2856 [inline] > iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128 > btrfs_direct_write fs/btrfs/file.c:1536 [inline] > btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668 > call_write_iter include/linux/fs.h:2160 [inline] > do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 > do_iter_write+0x182/0x700 fs/read_write.c:861 > vfs_iter_write+0x74/0xa0 fs/read_write.c:902 > iter_file_splice_write+0x745/0xc90 fs/splice.c:686 > do_splice_from fs/splice.c:764 [inline] > direct_splice_actor+0x114/0x180 fs/splice.c:931 > splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 > do_splice_direct+0x1ab/0x280 fs/splice.c:974 > do_sendfile+0xb19/0x1270 fs/read_write.c:1255 > __do_sys_sendfile64 fs/read_write.c:1323 [inline] > __se_sys_sendfile64 fs/read_write.c:1309 [inline] > __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309 > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 > entry_SYSCALL_64_after_hwframe+0x63/0xcd > > other info that might help us debug this: > > Chain exists of: > &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00 > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(btrfs-log-00); > lock(btrfs-tree-00); > lock(btrfs-log-00); > lock(&delayed_node->mutex); > > Holding a read lock on a leaf from a fs/subvolume tree creates a nasty > lock dependency when we are COWing extent buffers for the log tree and we > have two tasks modifying the log tree, with each one in one of the > following 2 scenarios: > > 1) Modifying the log tree triggers an extent buffer allocation while > holding a write lock on a parent extent buffer from the log tree. > Allocating the pages for an extent buffer, or the extent buffer > struct, can trigger inode eviction and finally the inode eviction > will trigger a release/remove of a delayed node, which requires > taking the delayed node's mutex; > > 2) Allocating a metadata extent for a log tree can trigger the async > reclaim thread and make us wait for it to release enough space and > unblock our reservation ticket. The reclaim thread can start flushing > delayed items, and that in turn results in the need to lock delayed > node mutexes and in the need to write lock extent buffers of a > subvolume tree - all this while holding a write lock on the parent > extent buffer in the log tree. > > So one task in scenario 1) running in parallel with another task in > scenario 2) could lead to a deadlock, one wanting to lock a delayed node > mutex while having a read lock on a leaf from the subvolume, while the > other is holding the delayed node's mutex and wants to write lock the same > subvolume leaf for flushing delayed items. > > Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the > fs/subvolume leaf and use the clone leaf instead. > > Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com > Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/ > Signed-off-by: Filipe Manana > --- > fs/btrfs/tree-log.c | 59 ++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 55 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index 8fcfaf015a70..f7b1bb9c63e4 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -3675,15 +3675,29 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans, > u64 *last_old_dentry_offset) > { > struct btrfs_root *log = inode->root->log_root; > - struct extent_buffer *src = path->nodes[0]; > - const int nritems = btrfs_header_nritems(src); > + struct extent_buffer *src; > + const int nritems = btrfs_header_nritems(path->nodes[0]); > const u64 ino = btrfs_ino(inode); > bool last_found = false; > int batch_start = 0; > int batch_size = 0; > int i; > > - for (i = path->slots[0]; i < nritems; i++) { > + /* > + * We need to clone the leaf, release the read lock on it, and use the > + * clone before modifying the log tree. See the comment at copy_items() > + * about why we need to do this. > + */ > + src = btrfs_clone_extent_buffer(path->nodes[0]); > + if (!src) > + return -ENOMEM; > + > + i = path->slots[0]; > + btrfs_release_path(path); > + path->nodes[0] = src; > + path->slots[0] = i; > + > + for (; i < nritems; i++) { > struct btrfs_dir_item *di; > struct btrfs_key key; > int ret; > @@ -4284,7 +4298,7 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, > { > struct btrfs_root *log = inode->root->log_root; > struct btrfs_file_extent_item *extent; > - struct extent_buffer *src = src_path->nodes[0]; > + struct extent_buffer *src; > int ret = 0; > struct btrfs_key *ins_keys; > u32 *ins_sizes; > @@ -4295,6 +4309,43 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, > const bool skip_csum = (inode->flags & BTRFS_INODE_NODATASUM); > const u64 i_size = i_size_read(&inode->vfs_inode); > > + /* > + * To keep lockdep happy and avoid deadlocks, clone the source leaf and > + * use the clone. This is because otherwise we would be changing the log > + * tree, to insert items from the subvolume tree or insert csum items, > + * while holding a read lock on a leaf from the subvolume tree, which > + * creates a nasty lock dependency when COWing log tree nodes/leaves: > + * > + * 1) Modifying the log tree triggers an extent buffer allocation while > + * holding a write lock on a parent extent buffer from the log tree. > + * Allocating the pages for an extent buffer, or the extent buffer > + * struct, can trigger inode eviction and finally the inode eviction > + * will trigger a release/remove of a delayed node, which requires > + * taking the delayed node's mutex; > + * > + * 2) Allocating a metadata extent for a log tree can trigger the async > + * reclaim thread and make us wait for it to release enough space and > + * unblock our reservation ticket. The reclaim thread can start > + * flushing delayed items, and that in turn results in the need to > + * lock delayed node mutexes and in the need to write lock extent > + * buffers of a subvolume tree - all this while holding a write lock > + * on the parent extent buffer in the log tree. > + * > + * So one task in scenario 1) running in parallel with another task in > + * scenario 2) could lead to a deadlock, one wanting to lock a delayed > + * node mutex while having a read lock on a leaf from the subvolume, > + * while the other is holding the delayed node's mutex and wants to > + * write lock the same subvolume leaf for flushing delayed items. > + */ > + src = btrfs_clone_extent_buffer(src_path->nodes[0]); > + if (!src) > + return -ENOMEM; > + > + i = src_path->slots[0]; > + btrfs_release_path(src_path); > + src_path->nodes[0] = src; > + src_path->slots[0] = i; > + > ins_data = kmalloc(nr * sizeof(struct btrfs_key) + > nr * sizeof(u32), GFP_NOFS); > if (!ins_data) > -- > 2.35.1