From: "Zhu, Yi" <chuyee@google.com>
To: bigeasy@linutronix.de
Cc: boqun.feng@gmail.com, cgroups@vger.kernel.org, elver@google.com,
gregkh@linuxfoundation.org, hannes@cmpxchg.org,
hdanton@sina.com, linux-kernel@vger.kernel.org,
mkoutny@suse.com, paulmck@kernel.org, tglx@linutronix.de,
tj@kernel.org
Subject: [PATCH v8 4/6] kernfs: Don't re-lock kernfs_root::kernfs_rwsem in kernfs_fop_readdir().
Date: Mon, 23 Feb 2026 21:54:36 +0000 [thread overview]
Message-ID: <20260223215436.2669161-1-chuyee@google.com> (raw)
In-Reply-To: <20250213145023.2820193-5-bigeasy@linutronix.de>
Hi Sebastian,
We hit a potential ABBA deadlock while running selftests vfio_pci_device_test with this patch.
Thread 1 calls syscall getdents(2) on a kernfs directory: lock B->C->A dependency
kernfs_fop_readdir()
-> mutex_lock(&root->kernfs_rwsem): acquired root->kernfs_rwsem (Lock_A)
-> dir_emit()
-> filldir()
-> unsafe_copy_dirent_name(): <- Hit page fault
-> do_user_addr_fault()
-> lock_mm_and_find_vma()
-> mmap_read_lock_killable(): acquired read mm->mmap_lock (Lock_C): Lock C depends on A
...
-> handle_mm_fault()
-> handle_pte_fault()
-> __do_fault()
-> vm_ops->fault() = vfio_pci_mmap_page_fault()
-> vfio_pci_mmap_huge_fault(): will acquire vdev->memory_lock (Lock_B): Lock B depends on C
Thread 2 runs vfio_pci_device_test kernel selftests: lock A->B dependency
vfio_pci_irq_test__enable_trigger_disable()
-> vfio_pci_irq_enable()
-> vfio_pci_core_ioctl(VFIO_DEVICE_SET_IRQS...)
-> vfio_pci_ioctl_set_irqs()
-> vfio_irq_enable()
-> vfio_pci_memory_lock_and_enable(): acquired vdev->memory_lock (Lock_B)
-> pci_alloc_irq_vectors()
-> msi_setup_device_data()
-> msi_sysfs_create_group()
-> kernfs_create_dir_ns()
-> kernfs_add_one(): will acquire root->kernfs_rwsem (Lock_A): Lock A will depend on B
The full lockdep warning on v6.18 kernel can be found below.
Hardware: Intel Skylake
Kbuild: CONFIG_VFIO=y, CONFIG_IOMMUFD=y
Kernel Parameters: intel_iommu=on
--
[ 1055.224342] ======================================================
[ 1055.224346] WARNING: possible circular locking dependency detected
[ 1055.224349] 6.18.0-dbg-DEV #48 Tainted: G S O
[ 1055.224353] ------------------------------------------------------
[ 1055.224356] vfio_pci_device/15895 is trying to acquire lock:
[ 1055.224359] ffff94f340a64188 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x2d/0x330
[ 1055.224373]
but task is already holding lock:
[ 1055.224376] ffff94f3f376c960 (&vdev->memory_lock){++++}-{4:4}, at: vfio_pci_memory_lock_and_enable+0x2a/0x90
[ 1055.224387]
which lock already depends on the new lock.
[ 1055.224391]
the existing dependency chain (in reverse order) is:
[ 1055.224395]
-> #2 (&vdev->memory_lock){++++}-{4:4}:
[ 1055.224400] down_read+0x3f/0x180
[ 1055.224406] vfio_pci_mmap_huge_fault+0xbc/0x1a0
[ 1055.224410] __do_fault+0x46/0x180
[ 1055.224417] do_pte_missing+0x214/0x1300
[ 1055.224421] handle_mm_fault+0x795/0xb00
[ 1055.224425] do_user_addr_fault+0x4b7/0x6d0
[ 1055.224431] exc_page_fault+0x7d/0xe0
[ 1055.224437] asm_exc_page_fault+0x26/0x30
[ 1055.224452]
-> #1 (&mm->mmap_lock){++++}-{4:4}:
[ 1055.224457] down_read_killable+0x47/0x1b0
[ 1055.224460] mmap_read_lock_killable+0x12/0x50
[ 1055.224463] lock_mm_and_find_vma+0x125/0x140
[ 1055.224470] do_user_addr_fault+0x3d8/0x6d0
[ 1055.224474] exc_page_fault+0x7d/0xe0
[ 1055.224478] asm_exc_page_fault+0x26/0x30
[ 1055.224481] filldir+0xe4/0x1a0
[ 1055.224501] kernfs_fop_readdir+0x1af/0x2d0
[ 1055.224505] iterate_dir+0x84/0x160
[ 1055.224510] __se_sys_getdents+0x6c/0x110
[ 1055.224514] do_syscall_64+0xf0/0x930
[ 1055.224518] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224521]
-> #0 (&root->kernfs_rwsem){++++}-{4:4}:
[ 1055.224526] __lock_acquire+0x15a7/0x2d10
[ 1055.224531] lock_acquire+0xe8/0x2b0
[ 1055.224534] down_write+0x3a/0xb0
[ 1055.224537] kernfs_add_one+0x2d/0x330
[ 1055.224541] kernfs_create_dir_ns+0xab/0xe0
[ 1055.224545] internal_create_group+0x19d/0x4f0
[ 1055.224549] devm_device_add_group+0x49/0x80
[ 1055.224553] msi_setup_device_data+0x5b/0x100
[ 1055.224557] pci_setup_msi_context+0x1a/0xa0
[ 1055.224561] __pci_enable_msix_range+0x19b/0x220
[ 1055.224565] pci_alloc_irq_vectors_affinity+0xa0/0x130
[ 1055.224568] vfio_pci_set_msi_trigger+0x12b/0x300
[ 1055.224571] vfio_pci_core_ioctl+0x84c/0xf20
[ 1055.224575] vfio_device_fops_unl_ioctl+0x18d/0x350
[ 1055.224581] __se_sys_ioctl+0x71/0xc0
[ 1055.224585] do_syscall_64+0xf0/0x930
[ 1055.224588] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224591]
other info that might help us debug this:
[ 1055.224594] Chain exists of:
&root->kernfs_rwsem --> &mm->mmap_lock --> &vdev->memory_lock
[ 1055.224602] Possible unsafe locking scenario:
[ 1055.224605] CPU0 CPU1
[ 1055.224607] ---- ----
[ 1055.224610] lock(&vdev->memory_lock);
[ 1055.224613] lock(&mm->mmap_lock);
[ 1055.224616] lock(&vdev->memory_lock);
[ 1055.224631] lock(&root->kernfs_rwsem);
[ 1055.224633]
*** DEADLOCK ***
[ 1055.224636] 2 locks held by vfio_pci_device/15895:
[ 1055.224639] #0: ffff94f3f376c728 (&vdev->igate){+.+.}-{4:4}, at: vfio_pci_core_ioctl+0x830/0xf20
[ 1055.224646] #1: ffff94f3f376c960 (&vdev->memory_lock){++++}-{4:4}, at: vfio_pci_memory_lock_and_enable+0x2a/0x90
[ 1055.224654]
stack backtrace:
[ 1055.224657] CPU: 90 UID: 0 PID: 15895 Comm: vfio_pci_device Tainted: G S O 6.18.0-dbg-DEV #48 NONE
[ 1055.224660] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
[ 1055.224660] Hardware name: Google LLC Indus/Indus_QC_00, BIOS 30.116.4 08/29/2025
[ 1055.224662] Call Trace:
[ 1055.224663] <TASK>
[ 1055.224664] dump_stack_lvl+0x7d/0xc0
[ 1055.224667] print_circular_bug+0x2e8/0x300
[ 1055.224670] check_noncircular+0xff/0x120
[ 1055.224672] __lock_acquire+0x15a7/0x2d10
[ 1055.224675] ? __kernfs_new_node+0x8c/0x2b0
[ 1055.224679] ? kernfs_root+0x16/0x140
[ 1055.224680] ? kernfs_add_one+0x2d/0x330
[ 1055.224682] lock_acquire+0xe8/0x2b0
[ 1055.224684] ? kernfs_add_one+0x2d/0x330
[ 1055.224687] down_write+0x3a/0xb0
[ 1055.224689] ? kernfs_add_one+0x2d/0x330
[ 1055.224690] kernfs_add_one+0x2d/0x330
[ 1055.224693] kernfs_create_dir_ns+0xab/0xe0
[ 1055.224695] internal_create_group+0x19d/0x4f0
[ 1055.224698] devm_device_add_group+0x49/0x80
[ 1055.224700] msi_setup_device_data+0x5b/0x100
[ 1055.224702] pci_setup_msi_context+0x1a/0xa0
[ 1055.224703] __pci_enable_msix_range+0x19b/0x220
[ 1055.224705] pci_alloc_irq_vectors_affinity+0xa0/0x130
[ 1055.224708] vfio_pci_set_msi_trigger+0x12b/0x300
[ 1055.224710] vfio_pci_core_ioctl+0x84c/0xf20
[ 1055.224713] ? _raw_spin_unlock_irqrestore+0x35/0x50
[ 1055.224717] vfio_device_fops_unl_ioctl+0x18d/0x350
[ 1055.224720] __se_sys_ioctl+0x71/0xc0
[ 1055.224722] do_syscall_64+0xf0/0x930
[ 1055.224724] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224725] ? exc_page_fault+0x9e/0xe0
[ 1055.224727] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224729] RIP: 0033:0x7fd526765887
[ 1055.224731] Code: cc cc cc 48 8b 05 99 39 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 39 07 00 f7 d8 64 89 01 48
[ 1055.224733] RSP: 002b:00007ffff5c63438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1055.224735] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd526765887
[ 1055.224736] RDX: 00007ffff5c63440 RSI: 0000000000003b6e RDI: 0000000000000007
[ 1055.224737] RBP: 00007ffff5c634e0 R08: 0000000020ff67e8 R09: 0000000000000000
[ 1055.224738] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000021a180
[ 1055.224738] R13: 00007fd526836798 R14: 00007ffff5c63c78 R15: 0000000000000001
[ 1055.224743] </TASK>
> The readdir operation iterates over all entries and invokes dir_emit()
> for every entry passing kernfs_node::name as argument.
> Since the name argument can change, and become invalid, the
> kernfs_root::kernfs_rwsem lock should not be dropped to prevent renames
> during the operation.
>
> The lock drop around dir_emit() has been initially introduced in commit
> 1e5289c97bba2 ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2")
>
> to avoid holding a global lock during a page fault. The lock drop is
> wrong since the support of renames and not a big burden since the lock
> is no longer global.
>
> Don't re-acquire kernfs_root::kernfs_rwsem while copying the name to the
> userpace buffer.
>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> fs/kernfs/dir.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 5f0f8b95f44c0..43fbada678381 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1869,10 +1869,10 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
> file->private_data = pos;
> kernfs_get(pos);
>
> - up_read(&root->kernfs_rwsem);
> - if (!dir_emit(ctx, name, len, ino, type))
> + if (!dir_emit(ctx, name, len, ino, type)) {
> + up_read(&root->kernfs_rwsem);
> return 0;
> - down_read(&root->kernfs_rwsem);
> + }
> }
> up_read(&root->kernfs_rwsem);
> file->private_data = NULL;
next prev parent reply other threads:[~2026-02-23 21:54 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-13 14:50 [PATCH v8 0/6] kernfs: Use RCU to access kernfs_node::{parent|name} Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 1/6] kernfs: Acquire kernfs_rwsem in kernfs_notify_workfn() Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 2/6] kernfs: Acquire kernfs_rwsem in kernfs_get_parent_dentry() Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 3/6] kernfs: Acquire kernfs_rwsem in kernfs_node_dentry() Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 4/6] kernfs: Don't re-lock kernfs_root::kernfs_rwsem in kernfs_fop_readdir() Sebastian Andrzej Siewior
2026-02-23 21:54 ` Zhu, Yi [this message]
2025-02-13 14:50 ` [PATCH v8 5/6] kernfs: Use RCU to access kernfs_node::parent Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 6/6] kernfs: Use RCU to access kernfs_node::name Sebastian Andrzej Siewior
2025-04-07 12:46 ` [PATCH] x86/resctrl: Fix rdtgroup_mkdir()'s unlocked use of kernfs_node::name James Morse
2025-04-07 18:08 ` Reinette Chatre
2025-04-07 18:26 ` Ingo Molnar
2025-04-09 10:11 ` [tip: x86/urgent] " tip-bot2 for James Morse
2025-02-13 16:33 ` [PATCH v8 0/6] kernfs: Use RCU to access kernfs_node::{parent|name} Tejun Heo
2025-02-13 18:13 ` Greg Kroah-Hartman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260223215436.2669161-1-chuyee@google.com \
--to=chuyee@google.com \
--cc=bigeasy@linutronix.de \
--cc=boqun.feng@gmail.com \
--cc=cgroups@vger.kernel.org \
--cc=elver@google.com \
--cc=gregkh@linuxfoundation.org \
--cc=hannes@cmpxchg.org \
--cc=hdanton@sina.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mkoutny@suse.com \
--cc=paulmck@kernel.org \
--cc=tglx@linutronix.de \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.