public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed
From: "Zhu, Yi" <chuyee@google.com>
To: bigeasy@linutronix.de
Cc: boqun.feng@gmail.com, cgroups@vger.kernel.org, elver@google.com,
	 gregkh@linuxfoundation.org, hannes@cmpxchg.org,
	hdanton@sina.com,  linux-kernel@vger.kernel.org,
	mkoutny@suse.com, paulmck@kernel.org,  tglx@linutronix.de,
	tj@kernel.org
Subject: [PATCH v8 4/6] kernfs: Don't re-lock kernfs_root::kernfs_rwsem in kernfs_fop_readdir().
Date: Mon, 23 Feb 2026 21:54:36 +0000	[thread overview]
Message-ID: <20260223215436.2669161-1-chuyee@google.com> (raw)
In-Reply-To: <20250213145023.2820193-5-bigeasy@linutronix.de>

Hi Sebastian,

We hit a potential ABBA deadlock while running selftests vfio_pci_device_test with this patch.

Thread 1 calls syscall getdents(2) on a kernfs directory: lock B->C->A dependency
  kernfs_fop_readdir()
  -> mutex_lock(&root->kernfs_rwsem): acquired root->kernfs_rwsem (Lock_A)
  -> dir_emit()
    -> filldir()
      -> unsafe_copy_dirent_name(): <- Hit page fault 
        -> do_user_addr_fault()
          -> lock_mm_and_find_vma()
            -> mmap_read_lock_killable(): acquired read mm->mmap_lock (Lock_C): Lock C depends on A
          ...
          -> handle_mm_fault()
            -> handle_pte_fault()
              -> __do_fault()
                -> vm_ops->fault() = vfio_pci_mmap_page_fault()
                  -> vfio_pci_mmap_huge_fault(): will acquire vdev->memory_lock (Lock_B): Lock B depends on C

Thread 2 runs vfio_pci_device_test kernel selftests: lock A->B dependency
  vfio_pci_irq_test__enable_trigger_disable()
  -> vfio_pci_irq_enable()
      -> vfio_pci_core_ioctl(VFIO_DEVICE_SET_IRQS...)
        -> vfio_pci_ioctl_set_irqs()
          -> vfio_irq_enable()
            -> vfio_pci_memory_lock_and_enable(): acquired vdev->memory_lock (Lock_B)
            -> pci_alloc_irq_vectors()
              -> msi_setup_device_data()
                -> msi_sysfs_create_group()
                  -> kernfs_create_dir_ns()
                    -> kernfs_add_one(): will acquire root->kernfs_rwsem (Lock_A): Lock A will depend on B

The full lockdep warning on v6.18 kernel can be found below.
Hardware: Intel Skylake
Kbuild: CONFIG_VFIO=y, CONFIG_IOMMUFD=y
Kernel Parameters: intel_iommu=on
--

[ 1055.224342] ======================================================
[ 1055.224346] WARNING: possible circular locking dependency detected
[ 1055.224349] 6.18.0-dbg-DEV #48 Tainted: G S         O
[ 1055.224353] ------------------------------------------------------
[ 1055.224356] vfio_pci_device/15895 is trying to acquire lock:
[ 1055.224359] ffff94f340a64188 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x2d/0x330
[ 1055.224373]
               but task is already holding lock:
[ 1055.224376] ffff94f3f376c960 (&vdev->memory_lock){++++}-{4:4}, at: vfio_pci_memory_lock_and_enable+0x2a/0x90
[ 1055.224387]
               which lock already depends on the new lock.

[ 1055.224391]
               the existing dependency chain (in reverse order) is:
[ 1055.224395]
               -> #2 (&vdev->memory_lock){++++}-{4:4}:
[ 1055.224400]        down_read+0x3f/0x180
[ 1055.224406]        vfio_pci_mmap_huge_fault+0xbc/0x1a0
[ 1055.224410]        __do_fault+0x46/0x180
[ 1055.224417]        do_pte_missing+0x214/0x1300
[ 1055.224421]        handle_mm_fault+0x795/0xb00
[ 1055.224425]        do_user_addr_fault+0x4b7/0x6d0
[ 1055.224431]        exc_page_fault+0x7d/0xe0
[ 1055.224437]        asm_exc_page_fault+0x26/0x30
[ 1055.224452]
               -> #1 (&mm->mmap_lock){++++}-{4:4}:
[ 1055.224457]        down_read_killable+0x47/0x1b0
[ 1055.224460]        mmap_read_lock_killable+0x12/0x50
[ 1055.224463]        lock_mm_and_find_vma+0x125/0x140
[ 1055.224470]        do_user_addr_fault+0x3d8/0x6d0
[ 1055.224474]        exc_page_fault+0x7d/0xe0
[ 1055.224478]        asm_exc_page_fault+0x26/0x30
[ 1055.224481]        filldir+0xe4/0x1a0
[ 1055.224501]        kernfs_fop_readdir+0x1af/0x2d0
[ 1055.224505]        iterate_dir+0x84/0x160
[ 1055.224510]        __se_sys_getdents+0x6c/0x110
[ 1055.224514]        do_syscall_64+0xf0/0x930
[ 1055.224518]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224521]
               -> #0 (&root->kernfs_rwsem){++++}-{4:4}:
[ 1055.224526]        __lock_acquire+0x15a7/0x2d10
[ 1055.224531]        lock_acquire+0xe8/0x2b0
[ 1055.224534]        down_write+0x3a/0xb0
[ 1055.224537]        kernfs_add_one+0x2d/0x330
[ 1055.224541]        kernfs_create_dir_ns+0xab/0xe0
[ 1055.224545]        internal_create_group+0x19d/0x4f0
[ 1055.224549]        devm_device_add_group+0x49/0x80
[ 1055.224553]        msi_setup_device_data+0x5b/0x100
[ 1055.224557]        pci_setup_msi_context+0x1a/0xa0
[ 1055.224561]        __pci_enable_msix_range+0x19b/0x220
[ 1055.224565]        pci_alloc_irq_vectors_affinity+0xa0/0x130
[ 1055.224568]        vfio_pci_set_msi_trigger+0x12b/0x300
[ 1055.224571]        vfio_pci_core_ioctl+0x84c/0xf20
[ 1055.224575]        vfio_device_fops_unl_ioctl+0x18d/0x350
[ 1055.224581]        __se_sys_ioctl+0x71/0xc0
[ 1055.224585]        do_syscall_64+0xf0/0x930
[ 1055.224588]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224591]
               other info that might help us debug this:

[ 1055.224594] Chain exists of:
                 &root->kernfs_rwsem --> &mm->mmap_lock --> &vdev->memory_lock

[ 1055.224602]  Possible unsafe locking scenario:

[ 1055.224605]        CPU0                    CPU1
[ 1055.224607]        ----                    ----
[ 1055.224610]   lock(&vdev->memory_lock);
[ 1055.224613]                                lock(&mm->mmap_lock);
[ 1055.224616]                                lock(&vdev->memory_lock);
[ 1055.224631]   lock(&root->kernfs_rwsem);
[ 1055.224633]
                *** DEADLOCK ***

[ 1055.224636] 2 locks held by vfio_pci_device/15895:
[ 1055.224639]  #0: ffff94f3f376c728 (&vdev->igate){+.+.}-{4:4}, at: vfio_pci_core_ioctl+0x830/0xf20
[ 1055.224646]  #1: ffff94f3f376c960 (&vdev->memory_lock){++++}-{4:4}, at: vfio_pci_memory_lock_and_enable+0x2a/0x90
[ 1055.224654]
               stack backtrace:
[ 1055.224657] CPU: 90 UID: 0 PID: 15895 Comm: vfio_pci_device Tainted: G S         O        6.18.0-dbg-DEV #48 NONE
[ 1055.224660] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
[ 1055.224660] Hardware name: Google LLC Indus/Indus_QC_00, BIOS 30.116.4 08/29/2025
[ 1055.224662] Call Trace:
[ 1055.224663]  <TASK>
[ 1055.224664]  dump_stack_lvl+0x7d/0xc0
[ 1055.224667]  print_circular_bug+0x2e8/0x300
[ 1055.224670]  check_noncircular+0xff/0x120
[ 1055.224672]  __lock_acquire+0x15a7/0x2d10
[ 1055.224675]  ? __kernfs_new_node+0x8c/0x2b0
[ 1055.224679]  ? kernfs_root+0x16/0x140
[ 1055.224680]  ? kernfs_add_one+0x2d/0x330
[ 1055.224682]  lock_acquire+0xe8/0x2b0
[ 1055.224684]  ? kernfs_add_one+0x2d/0x330
[ 1055.224687]  down_write+0x3a/0xb0
[ 1055.224689]  ? kernfs_add_one+0x2d/0x330
[ 1055.224690]  kernfs_add_one+0x2d/0x330
[ 1055.224693]  kernfs_create_dir_ns+0xab/0xe0
[ 1055.224695]  internal_create_group+0x19d/0x4f0
[ 1055.224698]  devm_device_add_group+0x49/0x80
[ 1055.224700]  msi_setup_device_data+0x5b/0x100
[ 1055.224702]  pci_setup_msi_context+0x1a/0xa0
[ 1055.224703]  __pci_enable_msix_range+0x19b/0x220
[ 1055.224705]  pci_alloc_irq_vectors_affinity+0xa0/0x130
[ 1055.224708]  vfio_pci_set_msi_trigger+0x12b/0x300
[ 1055.224710]  vfio_pci_core_ioctl+0x84c/0xf20
[ 1055.224713]  ? _raw_spin_unlock_irqrestore+0x35/0x50
[ 1055.224717]  vfio_device_fops_unl_ioctl+0x18d/0x350
[ 1055.224720]  __se_sys_ioctl+0x71/0xc0
[ 1055.224722]  do_syscall_64+0xf0/0x930
[ 1055.224724]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224725]  ? exc_page_fault+0x9e/0xe0
[ 1055.224727]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224729] RIP: 0033:0x7fd526765887
[ 1055.224731] Code: cc cc cc 48 8b 05 99 39 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 39 07 00 f7 d8 64 89 01 48
[ 1055.224733] RSP: 002b:00007ffff5c63438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1055.224735] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd526765887
[ 1055.224736] RDX: 00007ffff5c63440 RSI: 0000000000003b6e RDI: 0000000000000007
[ 1055.224737] RBP: 00007ffff5c634e0 R08: 0000000020ff67e8 R09: 0000000000000000
[ 1055.224738] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000021a180
[ 1055.224738] R13: 00007fd526836798 R14: 00007ffff5c63c78 R15: 0000000000000001
[ 1055.224743]  </TASK>


> The readdir operation iterates over all entries and invokes dir_emit()
> for every entry passing kernfs_node::name as argument.
> Since the name argument can change, and become invalid, the
> kernfs_root::kernfs_rwsem lock should not be dropped to prevent renames
> during the operation.
> 
> The lock drop around dir_emit() has been initially introduced in commit
>    1e5289c97bba2 ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2")
> 
> to avoid holding a global lock during a page fault. The lock drop is
> wrong since the support of renames and not a big burden since the lock
> is no longer global.
> 
> Don't re-acquire kernfs_root::kernfs_rwsem while copying the name to the
> userpace buffer.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  fs/kernfs/dir.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 5f0f8b95f44c0..43fbada678381 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1869,10 +1869,10 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>  		file->private_data = pos;
>  		kernfs_get(pos);
>  
> -		up_read(&root->kernfs_rwsem);
> -		if (!dir_emit(ctx, name, len, ino, type))
> +		if (!dir_emit(ctx, name, len, ino, type)) {
> +			up_read(&root->kernfs_rwsem);
>  			return 0;
> -		down_read(&root->kernfs_rwsem);
> +		}
>  	}
>  	up_read(&root->kernfs_rwsem);
>  	file->private_data = NULL;

  reply	other threads:[~2026-02-23 21:54 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-13 14:50 [PATCH v8 0/6] kernfs: Use RCU to access kernfs_node::{parent|name} Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 1/6] kernfs: Acquire kernfs_rwsem in kernfs_notify_workfn() Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 2/6] kernfs: Acquire kernfs_rwsem in kernfs_get_parent_dentry() Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 3/6] kernfs: Acquire kernfs_rwsem in kernfs_node_dentry() Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 4/6] kernfs: Don't re-lock kernfs_root::kernfs_rwsem in kernfs_fop_readdir() Sebastian Andrzej Siewior
2026-02-23 21:54   ` Zhu, Yi [this message]
2025-02-13 14:50 ` [PATCH v8 5/6] kernfs: Use RCU to access kernfs_node::parent Sebastian Andrzej Siewior
2025-02-13 14:50 ` [PATCH v8 6/6] kernfs: Use RCU to access kernfs_node::name Sebastian Andrzej Siewior
2025-02-13 16:33 ` [PATCH v8 0/6] kernfs: Use RCU to access kernfs_node::{parent|name} Tejun Heo
2025-02-13 18:13   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260223215436.2669161-1-chuyee@google.com \
    --to=chuyee@google.com \
    --cc=bigeasy@linutronix.de \
    --cc=boqun.feng@gmail.com \
    --cc=cgroups@vger.kernel.org \
    --cc=elver@google.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mkoutny@suse.com \
    --cc=paulmck@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox