public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Yi Liu <yi.l.liu@intel.com>
To: fengnan chang <fengnanchang@gmail.com>,
	<linux-pci@vger.kernel.org>, <lukas@wunner.de>,
	<kvm@vger.kernel.org>, <alex.williamson@redhat.com>,
	<bhelgaas@google.com>
Subject: Re: Deadlock during PCIe hot remove and SPDK exit
Date: Fri, 29 Nov 2024 16:05:18 +0800	[thread overview]
Message-ID: <5b2b0211-81d9-4ec9-98d5-b39a84581ac0@intel.com> (raw)
In-Reply-To: <A8CD6F73-CDBC-45D1-A8DF-CB583962DB8C@gmail.com>

On 2024/11/28 11:50, fengnan chang wrote:
> 
> 
>> 2024年11月27日 14:56,fengnan chang <fengnanchang@gmail.com> 写道:
>>
>> Dear PCI maintainers:
>>    I'm having a deadlock issue, somewhat similar to a previous one https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@CS1PR8401MB0728.NAMPRD84.PROD.OUTLOOK.COM/#t, but my kernel (6.6.40) already included the fix f5eff55.

The previous bug was solved by the below commit.

commit f5eff5591b8f9c5effd25c92c758a127765f74c1
Author: Lukas Wunner <lukas@wunner.de>
Date:   Tue Apr 11 08:21:02 2023 +0200

     PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock

     In 2013, commits

       2e35afaefe64 ("PCI: pciehp: Add reset_slot() method")
       608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()")

     amended PCIe hotplug to mask Presence Detect Changed events during a
     Secondary Bus Reset.  The reset thus no longer causes gratuitous slot
     bringdown and bringup.

     However the commits neglected to serialize reset with code paths reading
     slot registers.  For instance, a slot bringup due to an earlier hotplug
     event may see the Presence Detect State bit cleared during a concurrent
     Secondary Bus Reset.

     In 2018, commit

       5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset")

     retrofitted the missing locking.  It introduced a reset_lock which
     serializes a Secondary Bus Reset with other parts of pciehp.

     Unfortunately the locking turns out to be overzealous:  reset_lock is
     held for the entire enumeration and de-enumeration of hotplugged devices,
     including driver binding and unbinding.

     Driver binding and unbinding acquires device_lock while the reset_lock
     of the ancestral hotplug port is held.  A concurrent Secondary Bus Reset
     acquires the ancestral reset_lock while already holding the device_lock.
     The asymmetric locking order in the two code paths can lead to AB-BA
     deadlocks.

     Michael Haeuptle reports such deadlocks on simultaneous hot-removal and
     vfio release (the latter implies a Secondary Bus Reset):

       pciehp_ist()                                    # down_read(reset_lock)
         pciehp_handle_presence_or_link_change()
           pciehp_disable_slot()
             __pciehp_disable_slot()
               remove_board()
                 pciehp_unconfigure_device()
                   pci_stop_and_remove_bus_device()
                     pci_stop_bus_device()
                       pci_stop_dev()
                         device_release_driver()
                           device_release_driver_internal()
                             __device_driver_lock()    # device_lock()

       SYS_munmap()
         vfio_device_fops_release()
           vfio_device_group_close()
             vfio_device_close()
               vfio_device_last_close()


>>    Here is my test process, I’m running kernel with 6.6.40 and SPDK v22.05:
>>    1. SPDK use vfio driver to takeover two nvme disks, running some io in nvme.
>>    2. pull out two nvme disks
>>    3. Try to kill -9 SPDK process.
>>    Then deadlock issue happened. For now I can 100% reproduce this problem. I’m not an export in PCI, but I did a brief analysis:
>>    irq 149 thread take pci_rescan_remove_lock mutex lock, and wait for SPDK to release vfio.
>>    irq 148 thread take reset_lock of ctrl A, and wait for psi_rescan_remove_lock
>>    SPDK process try to release vfio driver, but wait for reset_lock of ctrl A.
>>
>>
>> irq/149-pciehp stack, cat /proc/514/stack,
>> [<0>] pciehp_unconfigure_device+0x48/0x160 // wait for pci_rescan_remove_lock
>> [<0>] pciehp_disable_slot+0x6b/0x130       // hold reset_lock of ctrl A
>> [<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
>> [<0>] pciehp_ist+0x236/0x260
>> [<0>] irq_thread_fn+0x1b/0x60
>> [<0>] irq_thread+0xed/0x190
>> [<0>] kthread+0xe4/0x110
>> [<0>] ret_from_fork+0x2d/0x50
>> [<0>] ret_from_fork_asm+0x11/0x20
>>
>>
>> irq/148-pciehp stack, cat /proc/513/stack
>> [<0>] vfio_unregister_group_dev+0x97/0xe0 [vfio]     //wait for
> 
> My mistake, this is wait for SPDK to release vfio device. This problem can reproduce in 6.12.
> Besides, My college give me an idea, we can make vfio_device_fops_release be async, so when we close fd, it
> won’t block, and when we close another fd, it will release  vfio device, this stack will not block too, then the deadlock disappears.
> 

In the hotplug path, vfio needs to notify userspace to stop the usage of
this device and release reference on the vfio_device. When the last
refcount is released, the wait in the vfio_unregister_group_dev() will be
unblocked. It is the vfio_device_fops_release() either userspace exits or
userspace explicitly close the vfio device fd. Your below test patch moves
the majority of the vfio_device_fops_release() out of the existing path.
I don't see a reason why it can work so far.

As the locking issue has been solved in the above commit, seems there is
no deadlock with the reset_lock and device_lock. Can you confirm if the
scenario can be reproduced with one device? Also, even with two devices,
does killing the process matters or not?

> Here is my test patch, cc some vfio guys:
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 50128da18bca..4ebe154a4ae5 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -19,6 +19,7 @@ struct vfio_container;
>   struct vfio_device_file {
>          struct vfio_device *device;
>          struct vfio_group *group;
> +       struct work_struct      release_work;
> 
>          u8 access_granted;
>          u32 devid; /* only valid when iommufd is valid */
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index a5a62d9d963f..47e3e3f73d70 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -487,6 +487,22 @@ static bool vfio_assert_device_open(struct vfio_device *device)
>          return !WARN_ON_ONCE(!READ_ONCE(device->open_count));
>   }
> 
> +static void vfio_fops_release_work(struct work_struct *work)
> +{
> +       struct vfio_device_file *df =
> +               container_of(work, struct vfio_device_file, release_work);
> +       struct vfio_device *device = df->device;
> +
> +       if (df->group)
> +               vfio_df_group_close(df);
> +       else
> +               vfio_df_unbind_iommufd(df);
> +
> +       vfio_device_put_registration(device);
> +
> +       kfree(df);
> +}
> +
>   struct vfio_device_file *
>   vfio_allocate_device_file(struct vfio_device *device)
>   {
> @@ -497,6 +513,7 @@ vfio_allocate_device_file(struct vfio_device *device)
>                  return ERR_PTR(-ENOMEM);
> 
>          df->device = device;
> +       INIT_WORK(&df->release_work, vfio_fops_release_work);
>          spin_lock_init(&df->kvm_ref_lock);
> 
>          return df;
> @@ -628,16 +645,8 @@ static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
>   static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>   {
>          struct vfio_device_file *df = filep->private_data;
> -       struct vfio_device *device = df->device;
> 
> -       if (df->group)
> -               vfio_df_group_close(df);
> -       else
> -               vfio_df_unbind_iommufd(df);
> -
> -       vfio_device_put_registration(device);
> -
> -       kfree(df);
> +       schedule_work(&df->release_work);
> 
>          return 0;
>   }
> 
> 
>> [<0>] vfio_pci_core_unregister_device+0x19/0x80 [vfio_pci_core]
>> [<0>] vfio_pci_remove+0x15/0x20 [vfio_pci]
>> [<0>] pci_device_remove+0x39/0xb0
>> [<0>] device_release_driver_internal+0xad/0x120
>> [<0>] pci_stop_bus_device+0x5d/0x80
>> [<0>] pci_stop_and_remove_bus_device+0xe/0x20
>> [<0>] pciehp_unconfigure_device+0x91/0x160   //hold pci_rescan_remove_lock, release reset_lock of ctrl B
>> [<0>] pciehp_disable_slot+0x6b/0x130
>> [<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
>> [<0>] pciehp_ist+0x236/0x260             //hold reset_lock of ctrl B
>> [<0>] irq_thread_fn+0x1b/0x60
>> [<0>] irq_thread+0xed/0x190
>> [<0>] kthread+0xe4/0x110
>> [<0>] ret_from_fork+0x2d/0x50
>> [<0>] ret_from_fork_asm+0x11/0x20
>>
>>
>> SPDK stack, cat /proc/166634/task/167181/stack
>> [<0>] down_write_nested+0x1b7/0x1c0            //wait for reset_lock of ctrl A.
>> [<0>] pciehp_reset_slot+0x58/0x160
>> [<0>] pci_reset_hotplug_slot+0x3b/0x60
>> [<0>] pci_reset_bus_function+0x3b/0xb0
>> [<0>] __pci_reset_function_locked+0x3e/0x60
>> [<0>] vfio_pci_core_disable+0x3ce/0x400 [vfio_pci_core]
>> [<0>] vfio_pci_core_close_device+0x67/0xc0 [vfio_pci_core]
>> [<0>] vfio_df_close+0x79/0xd0 [vfio]
>> [<0>] vfio_df_group_close+0x36/0x70 [vfio]
>> [<0>] vfio_device_fops_release+0x20/0x40 [vfio]
>> [<0>] __fput+0xec/0x290
>> [<0>] task_work_run+0x61/0x90
>> [<0>] do_exit+0x39c/0xc40
>> [<0>] do_group_exit+0x33/0xa0
>> [<0>] get_signal+0xd84/0xd90
>> [<0>] arch_do_signal_or_restart+0x2a/0x260
>> [<0>] exit_to_user_mode_prepare+0x1c7/0x240
>> [<0>] syscall_exit_to_user_mode+0x2a/0x60
>> [<0>] do_syscall_64+0x3e/0x90
>>
> 
> 

-- 
Regards,
Yi Liu

  reply	other threads:[~2024-11-29  8:00 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <D0B37524-9444-423B-9E48-406CF9A29A6A@gmail.com>
2024-11-28  3:50 ` Deadlock during PCIe hot remove and SPDK exit fengnan chang
2024-11-29  8:05   ` Yi Liu [this message]
2024-12-02  3:25     ` fengnan chang
2024-12-03  5:36       ` Yi Liu
2024-12-03  9:31         ` fengnan chang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5b2b0211-81d9-4ec9-98d5-b39a84581ac0@intel.com \
    --to=yi.l.liu@intel.com \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=fengnanchang@gmail.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox