[PATCH RFC v3] pci: report surprise removal event

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC v3] pci: report surprise removal event
@ 2025-07-02  7:20 Michael S. Tsirkin
  2025-07-02 17:23 ` Michael S. Tsirkin
  2025-07-02 17:24 ` Michael S. Tsirkin
  0 siblings, 2 replies; 7+ messages in thread
From: Michael S. Tsirkin @ 2025-07-02  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Bjorn Helgaas, linux-pci

At the moment, in case of a surprise removal, the regular remove
callback is invoked, exclusively.  This works well, because mostly, the
cleanup would be the same.

However, there's a race: imagine device removal was initiated by a user
action, such as driver unbind, and it in turn initiated some cleanup and
is now waiting for an interrupt from the device. If the device is now
surprise-removed, that never arrives and the remove callback hangs
forever.

For example, this was reported for virtio-blk:

	1. the graceful removal is ongoing in the remove() callback, where disk
	   deletion del_gendisk() is ongoing, which waits for the requests +to
	   complete,

	2. Now few requests are yet to complete, and surprise removal started.

	At this point, virtio block driver will not get notified by the driver
	core layer, because it is likely serializing remove() happening by
	+user/driver unload and PCI hotplug driver-initiated device removal.  So
	vblk driver doesn't know that device is removed, block layer is waiting
	for requests completions to arrive which it never gets.  So
	del_gendisk() gets stuck.

Drivers can artificially add timeouts to handle that, but it can be
flaky.

Instead, let's add a way for the driver to be notified about the
disconnect. It can then do any necessary cleanup, knowing that the
device is inactive.

Since cleanups can take a long time, this takes an approach
of a work struct that the driver initiates and enables
on probe, and tears down on remove.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---


Compile tested only.

Note: this minimizes core code. I considered a more elaborate API
that would be easier to use, but decided to be conservative until
there are multiple users.

changes from v2
	v2 was corrupted, fat fingers :(

changes from v1:
        switched to a WQ, with APIs to enable/disable
        added motivation


 drivers/pci/pci.h   |  6 ++++++
 include/linux/pci.h | 27 +++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b81e99cd4b62..208b4cab534b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -549,6 +549,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
 	pci_doe_disconnected(dev);
 
+	if (READ_ONCE(dev->disconnect_work_enable)) {
+		/* Make sure work is up to date. */
+		smp_rmb();
+		schedule_work(&dev->disconnect_work);
+	}
+
 	return 0;
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 51e2bd6405cd..b2168c5d0679 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -550,6 +550,10 @@ struct pci_dev {
 	/* These methods index pci_reset_fn_methods[] */
 	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
 
+	/* Report disconnect events */
+	u8 disconnect_work_enable;
+	struct work_struct disconnect_work;
+
 #ifdef CONFIG_PCIE_TPH
 	u16		tph_cap;	/* TPH capability offset */
 	u8		tph_mode;	/* TPH mode */
@@ -2657,6 +2661,29 @@ static inline bool pci_is_dev_assigned(struct pci_dev *pdev)
 	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) == PCI_DEV_FLAGS_ASSIGNED;
 }
 
+/*
+ * Caller must initialize @pdev->disconnect_work before invoking this.
+ * Caller also must check pci_device_is_present afterwards, since
+ * if device is already gone when this is called, work will not run.
+ */
+static inline void pci_set_disconnect_work(struct pci_dev *pdev)
+{
+	/* Make sure WQ has been initialized already */
+	smp_wmb();
+
+	WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
+}
+
+static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
+{
+	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
+
+	/* Make sure to stop using work from now on. */
+	smp_wmb();
+
+	cancel_work_sync(&pdev->disconnect_work);
+}
+
 /**
  * pci_ari_enabled - query ARI forwarding status
  * @bus: the PCI bus
-- 
MST


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH RFC v3] pci: report surprise removal event
  2025-07-02  7:20 [PATCH RFC v3] pci: report surprise removal event Michael S. Tsirkin
@ 2025-07-02 17:23 ` Michael S. Tsirkin
  2025-07-02 17:24 ` Michael S. Tsirkin
  1 sibling, 0 replies; 7+ messages in thread
From: Michael S. Tsirkin @ 2025-07-02 17:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Bjorn Helgaas, linux-pci

On Wed, Jul 02, 2025 at 03:20:52AM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular remove
> callback is invoked, exclusively.  This works well, because mostly, the
> cleanup would be the same.
> 
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.
> 
> For example, this was reported for virtio-blk:
> 
> 	1. the graceful removal is ongoing in the remove() callback, where disk
> 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> 	   complete,
> 
> 	2. Now few requests are yet to complete, and surprise removal started.
> 
> 	At this point, virtio block driver will not get notified by the driver
> 	core layer, because it is likely serializing remove() happening by
> 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> 	vblk driver doesn't know that device is removed, block layer is waiting
> 	for requests completions to arrive which it never gets.  So
> 	del_gendisk() gets stuck.
> 
> Drivers can artificially add timeouts to handle that, but it can be
> flaky.
> 
> Instead, let's add a way for the driver to be notified about the
> disconnect. It can then do any necessary cleanup, knowing that the
> device is inactive.
> 
> Since cleanups can take a long time, this takes an approach
> of a work struct that the driver initiates and enables
> on probe, and tears down on remove.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> 
> Compile tested only.
> 
> Note: this minimizes core code. I considered a more elaborate API
> that would be easier to use, but decided to be conservative until
> there are multiple users.
> 
> changes from v2
> 	v2 was corrupted, fat fingers :(
> 
> changes from v1:
>         switched to a WQ, with APIs to enable/disable
>         added motivation
> 
> 
>  drivers/pci/pci.h   |  6 ++++++
>  include/linux/pci.h | 27 +++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index b81e99cd4b62..208b4cab534b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -549,6 +549,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
>  	pci_doe_disconnected(dev);
>  
> +	if (READ_ONCE(dev->disconnect_work_enable)) {
> +		/* Make sure work is up to date. */
> +		smp_rmb();
> +		schedule_work(&dev->disconnect_work);
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 51e2bd6405cd..b2168c5d0679 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -550,6 +550,10 @@ struct pci_dev {
>  	/* These methods index pci_reset_fn_methods[] */
>  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
>  
> +	/* Report disconnect events */
> +	u8 disconnect_work_enable;
> +	struct work_struct disconnect_work;
> +
>  #ifdef CONFIG_PCIE_TPH
>  	u16		tph_cap;	/* TPH capability offset */
>  	u8		tph_mode;	/* TPH mode */
> @@ -2657,6 +2661,29 @@ static inline bool pci_is_dev_assigned(struct pci_dev *pdev)
>  	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) == PCI_DEV_FLAGS_ASSIGNED;
>  }
>  
> +/*
> + * Caller must initialize @pdev->disconnect_work before invoking this.
> + * Caller also must check pci_device_is_present afterwards, since
> + * if device is already gone when this is called, work will not run.
> + */
> +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> +{
> +	/* Make sure WQ has been initialized already */
> +	smp_wmb();
> +
> +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> +}
> +
> +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> +{
> +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> +
> +	/* Make sure to stop using work from now on. */
> +	smp_wmb();
> +
> +	cancel_work_sync(&pdev->disconnect_work);
> +}
> +
>  /**
>   * pci_ari_enabled - query ARI forwarding status
>   * @bus: the PCI bus
> -- 
> MST


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RFC v3] pci: report surprise removal event
  2025-07-02  7:20 [PATCH RFC v3] pci: report surprise removal event Michael S. Tsirkin
  2025-07-02 17:23 ` Michael S. Tsirkin
@ 2025-07-02 17:24 ` Michael S. Tsirkin
  2025-07-03  5:02   ` Parav Pandit
  1 sibling, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2025-07-02 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Bjorn Helgaas, linux-pci, stefanha, alok.a.tiwari, Parav Pandit,
	virtualization

On Wed, Jul 02, 2025 at 03:20:52AM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular remove
> callback is invoked, exclusively.  This works well, because mostly, the
> cleanup would be the same.
> 
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.
> 
> For example, this was reported for virtio-blk:
> 
> 	1. the graceful removal is ongoing in the remove() callback, where disk
> 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> 	   complete,
> 
> 	2. Now few requests are yet to complete, and surprise removal started.
> 
> 	At this point, virtio block driver will not get notified by the driver
> 	core layer, because it is likely serializing remove() happening by
> 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> 	vblk driver doesn't know that device is removed, block layer is waiting
> 	for requests completions to arrive which it never gets.  So
> 	del_gendisk() gets stuck.
> 
> Drivers can artificially add timeouts to handle that, but it can be
> flaky.
> 
> Instead, let's add a way for the driver to be notified about the
> disconnect. It can then do any necessary cleanup, knowing that the
> device is inactive.
> 
> Since cleanups can take a long time, this takes an approach
> of a work struct that the driver initiates and enables
> on probe, and tears down on remove.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 

Parav what do you think of this patch? This you can try
using this in virtio blk to address the hang you
reported?

> Compile tested only.
> 
> Note: this minimizes core code. I considered a more elaborate API
> that would be easier to use, but decided to be conservative until
> there are multiple users.
> 
> changes from v2
> 	v2 was corrupted, fat fingers :(
> 
> changes from v1:
>         switched to a WQ, with APIs to enable/disable
>         added motivation
> 
> 
>  drivers/pci/pci.h   |  6 ++++++
>  include/linux/pci.h | 27 +++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index b81e99cd4b62..208b4cab534b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -549,6 +549,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
>  	pci_doe_disconnected(dev);
>  
> +	if (READ_ONCE(dev->disconnect_work_enable)) {
> +		/* Make sure work is up to date. */
> +		smp_rmb();
> +		schedule_work(&dev->disconnect_work);
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 51e2bd6405cd..b2168c5d0679 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -550,6 +550,10 @@ struct pci_dev {
>  	/* These methods index pci_reset_fn_methods[] */
>  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
>  
> +	/* Report disconnect events */
> +	u8 disconnect_work_enable;
> +	struct work_struct disconnect_work;
> +
>  #ifdef CONFIG_PCIE_TPH
>  	u16		tph_cap;	/* TPH capability offset */
>  	u8		tph_mode;	/* TPH mode */
> @@ -2657,6 +2661,29 @@ static inline bool pci_is_dev_assigned(struct pci_dev *pdev)
>  	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) == PCI_DEV_FLAGS_ASSIGNED;
>  }
>  
> +/*
> + * Caller must initialize @pdev->disconnect_work before invoking this.
> + * Caller also must check pci_device_is_present afterwards, since
> + * if device is already gone when this is called, work will not run.
> + */
> +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> +{
> +	/* Make sure WQ has been initialized already */
> +	smp_wmb();
> +
> +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> +}
> +
> +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> +{
> +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> +
> +	/* Make sure to stop using work from now on. */
> +	smp_wmb();
> +
> +	cancel_work_sync(&pdev->disconnect_work);
> +}
> +
>  /**
>   * pci_ari_enabled - query ARI forwarding status
>   * @bus: the PCI bus
> -- 
> MST


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH RFC v3] pci: report surprise removal event
  2025-07-02 17:24 ` Michael S. Tsirkin
@ 2025-07-03  5:02   ` Parav Pandit
  2025-07-03  6:24     ` Michael S. Tsirkin
  0 siblings, 1 reply; 7+ messages in thread
From: Parav Pandit @ 2025-07-03  5:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel@vger.kernel.org
  Cc: Bjorn Helgaas, linux-pci@vger.kernel.org, stefanha@redhat.com,
	alok.a.tiwari@oracle.com, virtualization@lists.linux.dev


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: 02 July 2025 10:54 PM
> 
> On Wed, Jul 02, 2025 at 03:20:52AM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively.  This works well, because mostly,
> > the cleanup would be the same.
> >
> > However, there's a race: imagine device removal was initiated by a
> > user action, such as driver unbind, and it in turn initiated some
> > cleanup and is now waiting for an interrupt from the device. If the
> > device is now surprise-removed, that never arrives and the remove
> > callback hangs forever.
> >
> > For example, this was reported for virtio-blk:
> >
> > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > 	   complete,
> >
> > 	2. Now few requests are yet to complete, and surprise removal started.
> >
> > 	At this point, virtio block driver will not get notified by the driver
> > 	core layer, because it is likely serializing remove() happening by
> > 	+user/driver unload and PCI hotplug driver-initiated device removal.
> So
> > 	vblk driver doesn't know that device is removed, block layer is waiting
> > 	for requests completions to arrive which it never gets.  So
> > 	del_gendisk() gets stuck.
> >
> > Drivers can artificially add timeouts to handle that, but it can be
> > flaky.
> >
> > Instead, let's add a way for the driver to be notified about the
> > disconnect. It can then do any necessary cleanup, knowing that the
> > device is inactive.
> >
> > Since cleanups can take a long time, this takes an approach of a work
> > struct that the driver initiates and enables on probe, and tears down
> > on remove.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >
> 
> Parav what do you think of this patch? 
The async notification part without holding the device lock is good part of this patch.

However, large part of the systems and use cases does not involve pci hot plug removal.
An average system that I came across using has 150+ pci devices, and none of them uses hotplug.

So increasing pci dev struct for rare hot unplug, that too for the race condition does not look the best option.

I believe the intent of async notification without device lock can be achieved by adding a non-blocking async notifier callback.
This can go in the pci ops struct.

Such callback scale far better being part of the ops struct instead of pci_dev struct.

> This you can try using this in virtio blk to
> address the hang you reported?
>
The hang I reported was not the race condition between remove() and hotunplug during remove.
It was the simple remove() as hot-unplug issue due to commit 43bb40c5b926.

The race condition hang is hard to reproduce as_is.
I can try to reproduce by adding extra sleep() etc code in remove() with v4 of this version with ops callback.

However, that requires lot more code to be developed on top of current proposed fix [1].

[1] https://lore.kernel.org/linux-block/20250624185622.GB5519@fedora/

I need to re-arrange the hardware with hotplug resources. Will try to arrange on v4.

> > Compile tested only.
> >
> > Note: this minimizes core code. I considered a more elaborate API that
> > would be easier to use, but decided to be conservative until there are
> > multiple users.
> >
> > changes from v2
> > 	v2 was corrupted, fat fingers :(
> >
> > changes from v1:
> >         switched to a WQ, with APIs to enable/disable
> >         added motivation
> >
> >
> >  drivers/pci/pci.h   |  6 ++++++
> >  include/linux/pci.h | 27 +++++++++++++++++++++++++++
> >  2 files changed, 33 insertions(+)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index
> > b81e99cd4b62..208b4cab534b 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -549,6 +549,12 @@ static inline int pci_dev_set_disconnected(struct
> pci_dev *dev, void *unused)
> >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> >  	pci_doe_disconnected(dev);
> >
> > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > +		/* Make sure work is up to date. */
> > +		smp_rmb();
> > +		schedule_work(&dev->disconnect_work);
> > +	}
> > +
> >  	return 0;
> >  }
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h index
> > 51e2bd6405cd..b2168c5d0679 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -550,6 +550,10 @@ struct pci_dev {
> >  	/* These methods index pci_reset_fn_methods[] */
> >  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> >
> > +	/* Report disconnect events */
> > +	u8 disconnect_work_enable;
> > +	struct work_struct disconnect_work;
> > +

> >  #ifdef CONFIG_PCIE_TPH
> >  	u16		tph_cap;	/* TPH capability offset */
> >  	u8		tph_mode;	/* TPH mode */
> > @@ -2657,6 +2661,29 @@ static inline bool pci_is_dev_assigned(struct
> pci_dev *pdev)
> >  	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) ==
> > PCI_DEV_FLAGS_ASSIGNED;  }
> >
> > +/*
> > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > + * Caller also must check pci_device_is_present afterwards, since
> > + * if device is already gone when this is called, work will not run.
> > + */
> > +static inline void pci_set_disconnect_work(struct pci_dev *pdev) {
> > +	/* Make sure WQ has been initialized already */
> > +	smp_wmb();
> > +
> > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1); }
> > +
> > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev) {
> > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > +
> > +	/* Make sure to stop using work from now on. */
> > +	smp_wmb();
> > +
> > +	cancel_work_sync(&pdev->disconnect_work);
> > +}
> > +
> >  /**
> >   * pci_ari_enabled - query ARI forwarding status
> >   * @bus: the PCI bus
> > --
> > MST


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RFC v3] pci: report surprise removal event
  2025-07-03  5:02   ` Parav Pandit
@ 2025-07-03  6:24     ` Michael S. Tsirkin
  2025-07-03  9:51       ` Parav Pandit
  0 siblings, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2025-07-03  6:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: linux-kernel@vger.kernel.org, Bjorn Helgaas,
	linux-pci@vger.kernel.org, stefanha@redhat.com,
	alok.a.tiwari@oracle.com, virtualization@lists.linux.dev

On Thu, Jul 03, 2025 at 05:02:13AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: 02 July 2025 10:54 PM
> > 
> > On Wed, Jul 02, 2025 at 03:20:52AM -0400, Michael S. Tsirkin wrote:
> > > At the moment, in case of a surprise removal, the regular remove
> > > callback is invoked, exclusively.  This works well, because mostly,
> > > the cleanup would be the same.
> > >
> > > However, there's a race: imagine device removal was initiated by a
> > > user action, such as driver unbind, and it in turn initiated some
> > > cleanup and is now waiting for an interrupt from the device. If the
> > > device is now surprise-removed, that never arrives and the remove
> > > callback hangs forever.
> > >
> > > For example, this was reported for virtio-blk:
> > >
> > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > 	   complete,
> > >
> > > 	2. Now few requests are yet to complete, and surprise removal started.
> > >
> > > 	At this point, virtio block driver will not get notified by the driver
> > > 	core layer, because it is likely serializing remove() happening by
> > > 	+user/driver unload and PCI hotplug driver-initiated device removal.
> > So
> > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > 	for requests completions to arrive which it never gets.  So
> > > 	del_gendisk() gets stuck.
> > >
> > > Drivers can artificially add timeouts to handle that, but it can be
> > > flaky.
> > >
> > > Instead, let's add a way for the driver to be notified about the
> > > disconnect. It can then do any necessary cleanup, knowing that the
> > > device is inactive.
> > >
> > > Since cleanups can take a long time, this takes an approach of a work
> > > struct that the driver initiates and enables on probe, and tears down
> > > on remove.
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >
> > 
> > Parav what do you think of this patch? 
> The async notification part without holding the device lock is good part of this patch.
> 
> However, large part of the systems and use cases does not involve pci hot plug removal.
> An average system that I came across using has 150+ pci devices, and none of them uses hotplug.
> 
> So increasing pci dev struct for rare hot unplug, that too for the race condition does not look the best option.
> 
> I believe the intent of async notification without device lock can be achieved by adding a non-blocking async notifier callback.
> This can go in the pci ops struct.
> 
> Such callback scale far better being part of the ops struct instead of pci_dev struct.

Sorry, I don't see a way to achieve that, as the driver can go away
while hotunplug happens.

You would be welcome to try but you mentioned you have no plans to do so.



> > This you can try using this in virtio blk to
> > address the hang you reported?
> >
> The hang I reported was not the race condition between remove() and hotunplug during remove.
> It was the simple remove() as hot-unplug issue due to commit 43bb40c5b926.
> 
> The race condition hang is hard to reproduce as_is.
> I can try to reproduce by adding extra sleep() etc code in remove() with v4 of this version with ops callback.
> 
> However, that requires lot more code to be developed on top of current proposed fix [1].
> 
> [1] https://lore.kernel.org/linux-block/20250624185622.GB5519@fedora/
> 
> I need to re-arrange the hardware with hotplug resources. Will try to arrange on v4.
> 
> > > Compile tested only.
> > >
> > > Note: this minimizes core code. I considered a more elaborate API that
> > > would be easier to use, but decided to be conservative until there are
> > > multiple users.
> > >
> > > changes from v2
> > > 	v2 was corrupted, fat fingers :(
> > >
> > > changes from v1:
> > >         switched to a WQ, with APIs to enable/disable
> > >         added motivation
> > >
> > >
> > >  drivers/pci/pci.h   |  6 ++++++
> > >  include/linux/pci.h | 27 +++++++++++++++++++++++++++
> > >  2 files changed, 33 insertions(+)
> > >
> > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index
> > > b81e99cd4b62..208b4cab534b 100644
> > > --- a/drivers/pci/pci.h
> > > +++ b/drivers/pci/pci.h
> > > @@ -549,6 +549,12 @@ static inline int pci_dev_set_disconnected(struct
> > pci_dev *dev, void *unused)
> > >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > >  	pci_doe_disconnected(dev);
> > >
> > > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > > +		/* Make sure work is up to date. */
> > > +		smp_rmb();
> > > +		schedule_work(&dev->disconnect_work);
> > > +	}
> > > +
> > >  	return 0;
> > >  }
> > >
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h index
> > > 51e2bd6405cd..b2168c5d0679 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -550,6 +550,10 @@ struct pci_dev {
> > >  	/* These methods index pci_reset_fn_methods[] */
> > >  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> > >
> > > +	/* Report disconnect events */
> > > +	u8 disconnect_work_enable;
> > > +	struct work_struct disconnect_work;
> > > +
> 
> > >  #ifdef CONFIG_PCIE_TPH
> > >  	u16		tph_cap;	/* TPH capability offset */
> > >  	u8		tph_mode;	/* TPH mode */
> > > @@ -2657,6 +2661,29 @@ static inline bool pci_is_dev_assigned(struct
> > pci_dev *pdev)
> > >  	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) ==
> > > PCI_DEV_FLAGS_ASSIGNED;  }
> > >
> > > +/*
> > > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > > + * Caller also must check pci_device_is_present afterwards, since
> > > + * if device is already gone when this is called, work will not run.
> > > + */
> > > +static inline void pci_set_disconnect_work(struct pci_dev *pdev) {
> > > +	/* Make sure WQ has been initialized already */
> > > +	smp_wmb();
> > > +
> > > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1); }
> > > +
> > > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev) {
> > > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > > +
> > > +	/* Make sure to stop using work from now on. */
> > > +	smp_wmb();
> > > +
> > > +	cancel_work_sync(&pdev->disconnect_work);
> > > +}
> > > +
> > >  /**
> > >   * pci_ari_enabled - query ARI forwarding status
> > >   * @bus: the PCI bus
> > > --
> > > MST


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH RFC v3] pci: report surprise removal event
  2025-07-03  6:24     ` Michael S. Tsirkin
@ 2025-07-03  9:51       ` Parav Pandit
  2025-07-03 10:20         ` Michael S. Tsirkin
  0 siblings, 1 reply; 7+ messages in thread
From: Parav Pandit @ 2025-07-03  9:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Bjorn Helgaas,
	linux-pci@vger.kernel.org, stefanha@redhat.com,
	alok.a.tiwari@oracle.com, virtualization@lists.linux.dev


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: 03 July 2025 11:54 AM
> 
> On Thu, Jul 03, 2025 at 05:02:13AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: 02 July 2025 10:54 PM
> > >
> > > On Wed, Jul 02, 2025 at 03:20:52AM -0400, Michael S. Tsirkin wrote:
> > > > At the moment, in case of a surprise removal, the regular remove
> > > > callback is invoked, exclusively.  This works well, because
> > > > mostly, the cleanup would be the same.
> > > >
> > > > However, there's a race: imagine device removal was initiated by a
> > > > user action, such as driver unbind, and it in turn initiated some
> > > > cleanup and is now waiting for an interrupt from the device. If
> > > > the device is now surprise-removed, that never arrives and the
> > > > remove callback hangs forever.
> > > >
> > > > For example, this was reported for virtio-blk:
> > > >
> > > > 	1. the graceful removal is ongoing in the remove() callback, where
> disk
> > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > 	   complete,
> > > >
> > > > 	2. Now few requests are yet to complete, and surprise removal
> started.
> > > >
> > > > 	At this point, virtio block driver will not get notified by the driver
> > > > 	core layer, because it is likely serializing remove() happening by
> > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.
> > > So
> > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > 	for requests completions to arrive which it never gets.  So
> > > > 	del_gendisk() gets stuck.
> > > >
> > > > Drivers can artificially add timeouts to handle that, but it can
> > > > be flaky.
> > > >
> > > > Instead, let's add a way for the driver to be notified about the
> > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > device is inactive.
> > > >
> > > > Since cleanups can take a long time, this takes an approach of a
> > > > work struct that the driver initiates and enables on probe, and
> > > > tears down on remove.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >
> > >
> > > Parav what do you think of this patch?
> > The async notification part without holding the device lock is good part of
> this patch.
> >
> > However, large part of the systems and use cases does not involve pci hot
> plug removal.
> > An average system that I came across using has 150+ pci devices, and none
> of them uses hotplug.
> >
> > So increasing pci dev struct for rare hot unplug, that too for the race
> condition does not look the best option.
> >
> > I believe the intent of async notification without device lock can be achieved
> by adding a non-blocking async notifier callback.
> > This can go in the pci ops struct.
> >
> > Such callback scale far better being part of the ops struct instead of pci_dev
> struct.
> 
> Sorry, I don't see a way to achieve that, as the driver can go away while
> hotunplug happens.
> 
Well without device lock, driver can go away anyway.
In other words when schedule_work() is called by the core in this patch, what prevents driver to not get unloaded?

May be driver refcount can be taken conditionally before invoking the callback?

> You would be welcome to try but you mentioned you have no plans to do so.
> 
As I explained you can see that the support is needed at multiple modules.
Presently I couldn't spend cycles on this corner case race condition.
IMHO, if we want to fix, first fix should be for the most common case condition, for which the proposed fix exists.

Followed by that your second patch of device reset should also be done.

Next should be corner case fix that possibly nvme can benefit too. 

But if you have better ideas, should be fine too.

> 
> > > This you can try using this in virtio blk to address the hang you
> > > reported?
> > >
> > The hang I reported was not the race condition between remove() and
> hotunplug during remove.
> > It was the simple remove() as hot-unplug issue due to commit
> 43bb40c5b926.
> >
> > The race condition hang is hard to reproduce as_is.
> > I can try to reproduce by adding extra sleep() etc code in remove() with v4
> of this version with ops callback.
> >
> > However, that requires lot more code to be developed on top of current
> proposed fix [1].
> >
> > [1] https://lore.kernel.org/linux-block/20250624185622.GB5519@fedora/
> >
> > I need to re-arrange the hardware with hotplug resources. Will try to
> arrange on v4.
> >
> > > > Compile tested only.
> > > >
> > > > Note: this minimizes core code. I considered a more elaborate API
> > > > that would be easier to use, but decided to be conservative until
> > > > there are multiple users.
> > > >
> > > > changes from v2
> > > > 	v2 was corrupted, fat fingers :(
> > > >
> > > > changes from v1:
> > > >         switched to a WQ, with APIs to enable/disable
> > > >         added motivation
> > > >
> > > >
> > > >  drivers/pci/pci.h   |  6 ++++++
> > > >  include/linux/pci.h | 27 +++++++++++++++++++++++++++
> > > >  2 files changed, 33 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index
> > > > b81e99cd4b62..208b4cab534b 100644
> > > > --- a/drivers/pci/pci.h
> > > > +++ b/drivers/pci/pci.h
> > > > @@ -549,6 +549,12 @@ static inline int
> > > > pci_dev_set_disconnected(struct
> > > pci_dev *dev, void *unused)
> > > >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > > >  	pci_doe_disconnected(dev);
> > > >
> > > > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > > > +		/* Make sure work is up to date. */
> > > > +		smp_rmb();
> > > > +		schedule_work(&dev->disconnect_work);
> > > > +	}
> > > > +
> > > >  	return 0;
> > > >  }
> > > >
> > > > diff --git a/include/linux/pci.h b/include/linux/pci.h index
> > > > 51e2bd6405cd..b2168c5d0679 100644
> > > > --- a/include/linux/pci.h
> > > > +++ b/include/linux/pci.h
> > > > @@ -550,6 +550,10 @@ struct pci_dev {
> > > >  	/* These methods index pci_reset_fn_methods[] */
> > > >  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order
> */
> > > >
> > > > +	/* Report disconnect events */
> > > > +	u8 disconnect_work_enable;
> > > > +	struct work_struct disconnect_work;
> > > > +
> >
> > > >  #ifdef CONFIG_PCIE_TPH
> > > >  	u16		tph_cap;	/* TPH capability offset */
> > > >  	u8		tph_mode;	/* TPH mode */
> > > > @@ -2657,6 +2661,29 @@ static inline bool
> > > > pci_is_dev_assigned(struct
> > > pci_dev *pdev)
> > > >  	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) ==
> > > > PCI_DEV_FLAGS_ASSIGNED;  }
> > > >
> > > > +/*
> > > > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > > > + * Caller also must check pci_device_is_present afterwards, since
> > > > + * if device is already gone when this is called, work will not run.
> > > > + */
> > > > +static inline void pci_set_disconnect_work(struct pci_dev *pdev) {
> > > > +	/* Make sure WQ has been initialized already */
> > > > +	smp_wmb();
> > > > +
> > > > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1); }
> > > > +
> > > > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev) {
> > > > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > > > +
> > > > +	/* Make sure to stop using work from now on. */
> > > > +	smp_wmb();
> > > > +
> > > > +	cancel_work_sync(&pdev->disconnect_work);
> > > > +}
> > > > +
> > > >  /**
> > > >   * pci_ari_enabled - query ARI forwarding status
> > > >   * @bus: the PCI bus
> > > > --
> > > > MST


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RFC v3] pci: report surprise removal event
  2025-07-03  9:51       ` Parav Pandit
@ 2025-07-03 10:20         ` Michael S. Tsirkin
  0 siblings, 0 replies; 7+ messages in thread
From: Michael S. Tsirkin @ 2025-07-03 10:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: linux-kernel@vger.kernel.org, Bjorn Helgaas,
	linux-pci@vger.kernel.org, stefanha@redhat.com,
	alok.a.tiwari@oracle.com, virtualization@lists.linux.dev

On Thu, Jul 03, 2025 at 09:51:57AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: 03 July 2025 11:54 AM
> > 
> > On Thu, Jul 03, 2025 at 05:02:13AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: 02 July 2025 10:54 PM
> > > >
> > > > On Wed, Jul 02, 2025 at 03:20:52AM -0400, Michael S. Tsirkin wrote:
> > > > > At the moment, in case of a surprise removal, the regular remove
> > > > > callback is invoked, exclusively.  This works well, because
> > > > > mostly, the cleanup would be the same.
> > > > >
> > > > > However, there's a race: imagine device removal was initiated by a
> > > > > user action, such as driver unbind, and it in turn initiated some
> > > > > cleanup and is now waiting for an interrupt from the device. If
> > > > > the device is now surprise-removed, that never arrives and the
> > > > > remove callback hangs forever.
> > > > >
> > > > > For example, this was reported for virtio-blk:
> > > > >
> > > > > 	1. the graceful removal is ongoing in the remove() callback, where
> > disk
> > > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > 	   complete,
> > > > >
> > > > > 	2. Now few requests are yet to complete, and surprise removal
> > started.
> > > > >
> > > > > 	At this point, virtio block driver will not get notified by the driver
> > > > > 	core layer, because it is likely serializing remove() happening by
> > > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.
> > > > So
> > > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > > 	for requests completions to arrive which it never gets.  So
> > > > > 	del_gendisk() gets stuck.
> > > > >
> > > > > Drivers can artificially add timeouts to handle that, but it can
> > > > > be flaky.
> > > > >
> > > > > Instead, let's add a way for the driver to be notified about the
> > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > device is inactive.
> > > > >
> > > > > Since cleanups can take a long time, this takes an approach of a
> > > > > work struct that the driver initiates and enables on probe, and
> > > > > tears down on remove.
> > > > >
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > ---
> > > > >
> > > >
> > > > Parav what do you think of this patch?
> > > The async notification part without holding the device lock is good part of
> > this patch.
> > >
> > > However, large part of the systems and use cases does not involve pci hot
> > plug removal.
> > > An average system that I came across using has 150+ pci devices, and none
> > of them uses hotplug.
> > >
> > > So increasing pci dev struct for rare hot unplug, that too for the race
> > condition does not look the best option.
> > >
> > > I believe the intent of async notification without device lock can be achieved
> > by adding a non-blocking async notifier callback.
> > > This can go in the pci ops struct.
> > >
> > > Such callback scale far better being part of the ops struct instead of pci_dev
> > struct.
> > 
> > Sorry, I don't see a way to achieve that, as the driver can go away while
> > hotunplug happens.
> > 
> Well without device lock, driver can go away anyway.
> In other words when schedule_work() is called by the core in this patch, what prevents driver to not get unloaded?
> May be driver refcount can be taken conditionally before invoking the callback?

The work is flushed on driver unload.
Check out v4 for how it's used.

> 
> > You would be welcome to try but you mentioned you have no plans to do so.
> > 
> As I explained you can see that the support is needed at multiple modules.

Right. Check out v4: I did all the core work: pci, virtio and
virtio-pci, so what's left is just virtio blk.  For which I'm not the
best person, I think you are more familiar with that.


> Presently I couldn't spend cycles on this corner case race condition.
> IMHO, if we want to fix, first fix should be for the most common case condition, for which the proposed fix exists.
> 
> Followed by that your second patch of device reset should also be done.
> 
> Next should be corner case fix that possibly nvme can benefit too. 
> 
> But if you have better ideas, should be fine too.
> 
> > 
> > > > This you can try using this in virtio blk to address the hang you
> > > > reported?
> > > >
> > > The hang I reported was not the race condition between remove() and
> > hotunplug during remove.
> > > It was the simple remove() as hot-unplug issue due to commit
> > 43bb40c5b926.
> > >
> > > The race condition hang is hard to reproduce as_is.
> > > I can try to reproduce by adding extra sleep() etc code in remove() with v4
> > of this version with ops callback.
> > >
> > > However, that requires lot more code to be developed on top of current
> > proposed fix [1].
> > >
> > > [1] https://lore.kernel.org/linux-block/20250624185622.GB5519@fedora/
> > >
> > > I need to re-arrange the hardware with hotplug resources. Will try to
> > arrange on v4.
> > >
> > > > > Compile tested only.
> > > > >
> > > > > Note: this minimizes core code. I considered a more elaborate API
> > > > > that would be easier to use, but decided to be conservative until
> > > > > there are multiple users.
> > > > >
> > > > > changes from v2
> > > > > 	v2 was corrupted, fat fingers :(
> > > > >
> > > > > changes from v1:
> > > > >         switched to a WQ, with APIs to enable/disable
> > > > >         added motivation
> > > > >
> > > > >
> > > > >  drivers/pci/pci.h   |  6 ++++++
> > > > >  include/linux/pci.h | 27 +++++++++++++++++++++++++++
> > > > >  2 files changed, 33 insertions(+)
> > > > >
> > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index
> > > > > b81e99cd4b62..208b4cab534b 100644
> > > > > --- a/drivers/pci/pci.h
> > > > > +++ b/drivers/pci/pci.h
> > > > > @@ -549,6 +549,12 @@ static inline int
> > > > > pci_dev_set_disconnected(struct
> > > > pci_dev *dev, void *unused)
> > > > >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > > > >  	pci_doe_disconnected(dev);
> > > > >
> > > > > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > > > > +		/* Make sure work is up to date. */
> > > > > +		smp_rmb();
> > > > > +		schedule_work(&dev->disconnect_work);
> > > > > +	}
> > > > > +
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > diff --git a/include/linux/pci.h b/include/linux/pci.h index
> > > > > 51e2bd6405cd..b2168c5d0679 100644
> > > > > --- a/include/linux/pci.h
> > > > > +++ b/include/linux/pci.h
> > > > > @@ -550,6 +550,10 @@ struct pci_dev {
> > > > >  	/* These methods index pci_reset_fn_methods[] */
> > > > >  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order
> > */
> > > > >
> > > > > +	/* Report disconnect events */
> > > > > +	u8 disconnect_work_enable;
> > > > > +	struct work_struct disconnect_work;
> > > > > +
> > >
> > > > >  #ifdef CONFIG_PCIE_TPH
> > > > >  	u16		tph_cap;	/* TPH capability offset */
> > > > >  	u8		tph_mode;	/* TPH mode */
> > > > > @@ -2657,6 +2661,29 @@ static inline bool
> > > > > pci_is_dev_assigned(struct
> > > > pci_dev *pdev)
> > > > >  	return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) ==
> > > > > PCI_DEV_FLAGS_ASSIGNED;  }
> > > > >
> > > > > +/*
> > > > > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > > > > + * Caller also must check pci_device_is_present afterwards, since
> > > > > + * if device is already gone when this is called, work will not run.
> > > > > + */
> > > > > +static inline void pci_set_disconnect_work(struct pci_dev *pdev) {
> > > > > +	/* Make sure WQ has been initialized already */
> > > > > +	smp_wmb();
> > > > > +
> > > > > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1); }
> > > > > +
> > > > > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev) {
> > > > > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > > > > +
> > > > > +	/* Make sure to stop using work from now on. */
> > > > > +	smp_wmb();
> > > > > +
> > > > > +	cancel_work_sync(&pdev->disconnect_work);
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * pci_ari_enabled - query ARI forwarding status
> > > > >   * @bus: the PCI bus
> > > > > --
> > > > > MST


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-07-03 10:20 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-02  7:20 [PATCH RFC v3] pci: report surprise removal event Michael S. Tsirkin
2025-07-02 17:23 ` Michael S. Tsirkin
2025-07-02 17:24 ` Michael S. Tsirkin
2025-07-03  5:02   ` Parav Pandit
2025-07-03  6:24     ` Michael S. Tsirkin
2025-07-03  9:51       ` Parav Pandit
2025-07-03 10:20         ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).