From: Yi Liu <yi.l.liu@intel.com>
To: fengnan chang <fengnanchang@gmail.com>,
<linux-pci@vger.kernel.org>, <lukas@wunner.de>,
<kvm@vger.kernel.org>, <alex.williamson@redhat.com>,
<bhelgaas@google.com>
Subject: Re: Deadlock during PCIe hot remove and SPDK exit
Date: Fri, 29 Nov 2024 16:05:18 +0800 [thread overview]
Message-ID: <5b2b0211-81d9-4ec9-98d5-b39a84581ac0@intel.com> (raw)
In-Reply-To: <A8CD6F73-CDBC-45D1-A8DF-CB583962DB8C@gmail.com>
On 2024/11/28 11:50, fengnan chang wrote:
>
>
>> 2024年11月27日 14:56,fengnan chang <fengnanchang@gmail.com> 写道:
>>
>> Dear PCI maintainers:
>> I'm having a deadlock issue, somewhat similar to a previous one https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@CS1PR8401MB0728.NAMPRD84.PROD.OUTLOOK.COM/#t, but my kernel (6.6.40) already included the fix f5eff55.
The previous bug was solved by the below commit.
commit f5eff5591b8f9c5effd25c92c758a127765f74c1
Author: Lukas Wunner <lukas@wunner.de>
Date: Tue Apr 11 08:21:02 2023 +0200
PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock
In 2013, commits
2e35afaefe64 ("PCI: pciehp: Add reset_slot() method")
608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()")
amended PCIe hotplug to mask Presence Detect Changed events during a
Secondary Bus Reset. The reset thus no longer causes gratuitous slot
bringdown and bringup.
However the commits neglected to serialize reset with code paths reading
slot registers. For instance, a slot bringup due to an earlier hotplug
event may see the Presence Detect State bit cleared during a concurrent
Secondary Bus Reset.
In 2018, commit
5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset")
retrofitted the missing locking. It introduced a reset_lock which
serializes a Secondary Bus Reset with other parts of pciehp.
Unfortunately the locking turns out to be overzealous: reset_lock is
held for the entire enumeration and de-enumeration of hotplugged devices,
including driver binding and unbinding.
Driver binding and unbinding acquires device_lock while the reset_lock
of the ancestral hotplug port is held. A concurrent Secondary Bus Reset
acquires the ancestral reset_lock while already holding the device_lock.
The asymmetric locking order in the two code paths can lead to AB-BA
deadlocks.
Michael Haeuptle reports such deadlocks on simultaneous hot-removal and
vfio release (the latter implies a Secondary Bus Reset):
pciehp_ist()                                    # down_read(reset_lock)
  pciehp_handle_presence_or_link_change()
    pciehp_disable_slot()
      __pciehp_disable_slot()
        remove_board()
          pciehp_unconfigure_device()
            pci_stop_and_remove_bus_device()
              pci_stop_bus_device()
                pci_stop_dev()
                  device_release_driver()
                    device_release_driver_internal()
                      __device_driver_lock()    # device_lock()

SYS_munmap()
  vfio_device_fops_release()
    vfio_device_group_close()
      vfio_device_close()
        vfio_device_last_close()
>> Here is my test process. I’m running kernel 6.6.40 and SPDK v22.05:
>> 1. SPDK uses the vfio driver to take over two nvme disks and runs some IO on them.
>> 2. Pull out the two nvme disks.
>> 3. Try to kill -9 the SPDK process.
>> Then the deadlock happens. For now I can reproduce this problem 100% of the time. I’m not an expert in PCI, but here is a brief analysis:
>> The irq/149 thread holds the reset_lock of ctrl A and waits for pci_rescan_remove_lock.
>> The irq/148 thread holds pci_rescan_remove_lock and waits for SPDK to release the vfio device.
>> The SPDK process tries to release the vfio device, but waits for the reset_lock of ctrl A.
>>
>>
>> irq/149-pciehp stack, cat /proc/514/stack,
>> [<0>] pciehp_unconfigure_device+0x48/0x160 // wait for pci_rescan_remove_lock
>> [<0>] pciehp_disable_slot+0x6b/0x130 // hold reset_lock of ctrl A
>> [<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
>> [<0>] pciehp_ist+0x236/0x260
>> [<0>] irq_thread_fn+0x1b/0x60
>> [<0>] irq_thread+0xed/0x190
>> [<0>] kthread+0xe4/0x110
>> [<0>] ret_from_fork+0x2d/0x50
>> [<0>] ret_from_fork_asm+0x11/0x20
>>
>>
>> irq/148-pciehp stack, cat /proc/513/stack
>> [<0>] vfio_unregister_group_dev+0x97/0xe0 [vfio] //wait for
>
> My mistake, this is waiting for SPDK to release the vfio device. This problem can also be reproduced on 6.12.
> Besides, a colleague gave me an idea: we can make vfio_device_fops_release() asynchronous, so closing the fd
> won’t block; the actual release of the vfio device then happens in a deferred context, this stack no longer blocks either, and the deadlock disappears.
>
In the hotplug path, vfio needs to notify userspace to stop using the
device and to release its reference on the vfio_device. When the last
refcount is released, the wait in vfio_unregister_group_dev() is
unblocked. vfio_device_fops_release() runs either when userspace exits
or when userspace explicitly closes the vfio device fd. Your test patch
below moves the majority of vfio_device_fops_release() out of the
existing path. So far I don't see why that can work.
As the locking issue was solved by the commit above, there seems to be
no deadlock between the reset_lock and device_lock here. Can you confirm
whether the scenario can be reproduced with a single device? Also, even
with two devices, does killing the process matter or not?
> Here is my test patch, cc some vfio guys:
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 50128da18bca..4ebe154a4ae5 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -19,6 +19,7 @@ struct vfio_container;
> struct vfio_device_file {
> struct vfio_device *device;
> struct vfio_group *group;
> + struct work_struct release_work;
>
> u8 access_granted;
> u32 devid; /* only valid when iommufd is valid */
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index a5a62d9d963f..47e3e3f73d70 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -487,6 +487,22 @@ static bool vfio_assert_device_open(struct vfio_device *device)
> return !WARN_ON_ONCE(!READ_ONCE(device->open_count));
> }
>
> +static void vfio_fops_release_work(struct work_struct *work)
> +{
> + struct vfio_device_file *df =
> + container_of(work, struct vfio_device_file, release_work);
> + struct vfio_device *device = df->device;
> +
> + if (df->group)
> + vfio_df_group_close(df);
> + else
> + vfio_df_unbind_iommufd(df);
> +
> + vfio_device_put_registration(device);
> +
> + kfree(df);
> +}
> +
> struct vfio_device_file *
> vfio_allocate_device_file(struct vfio_device *device)
> {
> @@ -497,6 +513,7 @@ vfio_allocate_device_file(struct vfio_device *device)
> return ERR_PTR(-ENOMEM);
>
> df->device = device;
> + INIT_WORK(&df->release_work, vfio_fops_release_work);
> spin_lock_init(&df->kvm_ref_lock);
>
> return df;
> @@ -628,16 +645,8 @@ static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
> static int vfio_device_fops_release(struct inode *inode, struct file *filep)
> {
> struct vfio_device_file *df = filep->private_data;
> - struct vfio_device *device = df->device;
>
> - if (df->group)
> - vfio_df_group_close(df);
> - else
> - vfio_df_unbind_iommufd(df);
> -
> - vfio_device_put_registration(device);
> -
> - kfree(df);
> + schedule_work(&df->release_work);
>
> return 0;
> }
>
>
>> [<0>] vfio_pci_core_unregister_device+0x19/0x80 [vfio_pci_core]
>> [<0>] vfio_pci_remove+0x15/0x20 [vfio_pci]
>> [<0>] pci_device_remove+0x39/0xb0
>> [<0>] device_release_driver_internal+0xad/0x120
>> [<0>] pci_stop_bus_device+0x5d/0x80
>> [<0>] pci_stop_and_remove_bus_device+0xe/0x20
>> [<0>] pciehp_unconfigure_device+0x91/0x160 //hold pci_rescan_remove_lock, release reset_lock of ctrl B
>> [<0>] pciehp_disable_slot+0x6b/0x130
>> [<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
>> [<0>] pciehp_ist+0x236/0x260 //hold reset_lock of ctrl B
>> [<0>] irq_thread_fn+0x1b/0x60
>> [<0>] irq_thread+0xed/0x190
>> [<0>] kthread+0xe4/0x110
>> [<0>] ret_from_fork+0x2d/0x50
>> [<0>] ret_from_fork_asm+0x11/0x20
>>
>>
>> SPDK stack, cat /proc/166634/task/167181/stack
>> [<0>] down_write_nested+0x1b7/0x1c0 //wait for reset_lock of ctrl A.
>> [<0>] pciehp_reset_slot+0x58/0x160
>> [<0>] pci_reset_hotplug_slot+0x3b/0x60
>> [<0>] pci_reset_bus_function+0x3b/0xb0
>> [<0>] __pci_reset_function_locked+0x3e/0x60
>> [<0>] vfio_pci_core_disable+0x3ce/0x400 [vfio_pci_core]
>> [<0>] vfio_pci_core_close_device+0x67/0xc0 [vfio_pci_core]
>> [<0>] vfio_df_close+0x79/0xd0 [vfio]
>> [<0>] vfio_df_group_close+0x36/0x70 [vfio]
>> [<0>] vfio_device_fops_release+0x20/0x40 [vfio]
>> [<0>] __fput+0xec/0x290
>> [<0>] task_work_run+0x61/0x90
>> [<0>] do_exit+0x39c/0xc40
>> [<0>] do_group_exit+0x33/0xa0
>> [<0>] get_signal+0xd84/0xd90
>> [<0>] arch_do_signal_or_restart+0x2a/0x260
>> [<0>] exit_to_user_mode_prepare+0x1c7/0x240
>> [<0>] syscall_exit_to_user_mode+0x2a/0x60
>> [<0>] do_syscall_64+0x3e/0x90
>>
>
>
--
Regards,
Yi Liu