[PATCH RFC v5 1/5] pci: report surprise removal event

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC v5 1/5] pci: report surprise removal event
       [not found] <cover.1752094439.git.mst@redhat.com>
@ 2025-07-09 20:55 ` Michael S. Tsirkin
  2025-07-09 23:38   ` Bjorn Helgaas
  2025-07-14  6:11   ` Lukas Wunner
  0 siblings, 2 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

At the moment, in case of a surprise removal, the regular remove
callback is invoked, exclusively.  This works well, because mostly, the
cleanup would be the same.

However, there's a race: imagine device removal was initiated by a user
action, such as driver unbind, and it in turn initiated some cleanup and
is now waiting for an interrupt from the device. If the device is now
surprise-removed, that never arrives and the remove callback hangs
forever.

For example, this was reported for virtio-blk:

	1. the graceful removal is ongoing in the remove() callback, where disk
	   deletion del_gendisk() is ongoing, which waits for the requests +to
	   complete,

	2. Now few requests are yet to complete, and surprise removal started.

	At this point, virtio block driver will not get notified by the driver
	core layer, because it is likely serializing remove() happening by
	+user/driver unload and PCI hotplug driver-initiated device removal.  So
	vblk driver doesn't know that device is removed, block layer is waiting
	for requests completions to arrive which it never gets.  So
	del_gendisk() gets stuck.

Drivers can artificially add timeouts to handle that, but it can be
flaky.

Instead, let's add a way for the driver to be notified about the
disconnect. It can then do any necessary cleanup, knowing that the
device is inactive.

Since cleanups can take a long time, this takes an approach
of a work struct that the driver initiates and enables
on probe, and tears down on remove.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/pci/pci.h   |  6 ++++++
 include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb..3ca4ebfd46be 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
 	pci_doe_disconnected(dev);
 
+	if (READ_ONCE(dev->disconnect_work_enable)) {
+		/* Make sure work is up to date. */
+		smp_rmb();
+		schedule_work(&dev->disconnect_work);
+	}
+
 	return 0;
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f392..723b17145b62 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -548,6 +548,10 @@ struct pci_dev {
 	/* These methods index pci_reset_fn_methods[] */
 	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
 
+	/* Report disconnect events. 0x0 - disable, 0x1 - enable */
+	u8 disconnect_work_enable;
+	struct work_struct disconnect_work;
+
 #ifdef CONFIG_PCIE_TPH
 	u16		tph_cap;	/* TPH capability offset */
 	u8		tph_mode;	/* TPH mode */
@@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
 			    pci_select_bars(pdev, IORESOURCE_MEM));
 }
 
+/*
+ * Run this first thing after getting a disconnect work, to prevent it from
+ * running multiple times.
+ * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
+ */
+static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
+{
+	u8 enable = 0x1;
+	u8 disable = 0x0;
+	return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
+}
+
+/*
+ * Caller must initialize @pdev->disconnect_work before invoking this.
+ * The work function must run and check pci_test_and_clear_disconnect_enable.
+ * Note that device can go away right after this call.
+ */
+static inline void pci_set_disconnect_work(struct pci_dev *pdev)
+{
+	/* Make sure WQ has been initialized already */
+	smp_wmb();
+
+	WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
+
+	/* check the device did not go away meanwhile. */
+	mb();
+
+	if (!pci_device_is_present(pdev))
+		schedule_work(&pdev->disconnect_work);
+}
+
+static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
+{
+	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
+
+	/* Make sure to stop using work from now on. */
+	smp_wmb();
+
+	cancel_work_sync(&pdev->disconnect_work);
+}
+
 #else /* CONFIG_PCI is not enabled */
 
 static inline void pci_set_flags(int flags) { }
-- 
MST


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: report surprise removal event Michael S. Tsirkin
@ 2025-07-09 23:38   ` Bjorn Helgaas
  2025-07-09 23:55     ` Keith Busch
  2025-07-14  6:26     ` Michael S. Tsirkin
  2025-07-14  6:11   ` Lukas Wunner
  1 sibling, 2 replies; 17+ messages in thread
From: Bjorn Helgaas @ 2025-07-09 23:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

Housekeeping: Note subject line convention. Indent with spaces in
commit log.  Remove spurious plus signs.

On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular remove
> callback is invoked, exclusively.  This works well, because mostly, the
> cleanup would be the same.
> 
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.
> 
> For example, this was reported for virtio-blk:
> 
> 	1. the graceful removal is ongoing in the remove() callback, where disk
> 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> 	   complete,
> 
> 	2. Now few requests are yet to complete, and surprise removal started.
> 
> 	At this point, virtio block driver will not get notified by the driver
> 	core layer, because it is likely serializing remove() happening by
> 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> 	vblk driver doesn't know that device is removed, block layer is waiting
> 	for requests completions to arrive which it never gets.  So
> 	del_gendisk() gets stuck.
> 
> Drivers can artificially add timeouts to handle that, but it can be
> flaky.
> 
> Instead, let's add a way for the driver to be notified about the
> disconnect. It can then do any necessary cleanup, knowing that the
> device is inactive.

This relies on somebody (typically pciehp, I guess) calling
pci_dev_set_disconnected() when a surprise remove happens.

Do you think it would be practical for the driver's .remove() method
to recognize that the device may stop responding at any point, even if
no hotplug driver is present to call pci_dev_set_disconnected()?

Waiting forever for an interrupt seems kind of vulnerable in general.
Maybe "artificially adding timeouts" is alluding to *not* waiting
forever for interrupts?  That doesn't seem artificial to me because
it's just a fact of life that devices can disappear at arbitrary
times.

It seems a little fragile to me to depend on some other part of the
system to notice the surprise removal and tell you about it or
schedule your work function.  I think it would be more robust for the
driver to check directly, i.e., assume writes to the device may be
lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
never wait for an interrupt without a timeout.

> Since cleanups can take a long time, this takes an approach
> of a work struct that the driver initiates and enables
> on probe, and tears down on remove.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/pci/pci.h   |  6 ++++++
>  include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 12215ee72afb..3ca4ebfd46be 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
>  	pci_doe_disconnected(dev);
>  
> +	if (READ_ONCE(dev->disconnect_work_enable)) {
> +		/* Make sure work is up to date. */
> +		smp_rmb();
> +		schedule_work(&dev->disconnect_work);
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 05e68f35f392..723b17145b62 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -548,6 +548,10 @@ struct pci_dev {
>  	/* These methods index pci_reset_fn_methods[] */
>  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
>  
> +	/* Report disconnect events. 0x0 - disable, 0x1 - enable */
> +	u8 disconnect_work_enable;
> +	struct work_struct disconnect_work;
> +
>  #ifdef CONFIG_PCIE_TPH
>  	u16		tph_cap;	/* TPH capability offset */
>  	u8		tph_mode;	/* TPH mode */
> @@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
>  			    pci_select_bars(pdev, IORESOURCE_MEM));
>  }
>  
> +/*
> + * Run this first thing after getting a disconnect work, to prevent it from
> + * running multiple times.
> + * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
> + */
> +static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
> +{
> +	u8 enable = 0x1;
> +	u8 disable = 0x0;
> +	return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
> +}
> +
> +/*
> + * Caller must initialize @pdev->disconnect_work before invoking this.
> + * The work function must run and check pci_test_and_clear_disconnect_enable.
> + * Note that device can go away right after this call.
> + */
> +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> +{
> +	/* Make sure WQ has been initialized already */
> +	smp_wmb();
> +
> +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> +
> +	/* check the device did not go away meanwhile. */
> +	mb();
> +
> +	if (!pci_device_is_present(pdev))
> +		schedule_work(&pdev->disconnect_work);
> +}
> +
> +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> +{
> +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> +
> +	/* Make sure to stop using work from now on. */
> +	smp_wmb();
> +
> +	cancel_work_sync(&pdev->disconnect_work);
> +}
> +
>  #else /* CONFIG_PCI is not enabled */
>  
>  static inline void pci_set_flags(int flags) { }
> -- 
> MST
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-09 23:38   ` Bjorn Helgaas
@ 2025-07-09 23:55     ` Keith Busch
  2025-07-14  6:17       ` Michael S. Tsirkin
  2025-07-14  6:26     ` Michael S. Tsirkin
  1 sibling, 1 reply; 17+ messages in thread
From: Keith Busch @ 2025-07-09 23:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michael S. Tsirkin, linux-kernel, Lukas Wunner, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> This relies on somebody (typically pciehp, I guess) calling
> pci_dev_set_disconnected() when a surprise remove happens.
> 
> Do you think it would be practical for the driver's .remove() method
> to recognize that the device may stop responding at any point, even if
> no hotplug driver is present to call pci_dev_set_disconnected()?
> 
> Waiting forever for an interrupt seems kind of vulnerable in general.
> Maybe "artificially adding timeouts" is alluding to *not* waiting
> forever for interrupts?  That doesn't seem artificial to me because
> it's just a fact of life that devices can disappear at arbitrary
> times.

I totally agree here. Every driver's .remove() should be able to
guarantee forward progress some way. I put some work in blk-mq and nvme
to ensure that happens for those devices at least.

That "forward progress" can come slow though, maybe minutes, so we do
have opprotunisitic short cuts sprinkled about the driver. There are
still gaps when waiting for interrupt driven IO that need the longer
timeouts to trigger. It'd be cool if there was a mechansim to kick in
quicker, but this is still an uncommon exceptional condition, right?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: report surprise removal event Michael S. Tsirkin
  2025-07-09 23:38   ` Bjorn Helgaas
@ 2025-07-14  6:11   ` Lukas Wunner
  2025-07-14  6:18     ` Michael S. Tsirkin
                       ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Lukas Wunner @ 2025-07-14  6:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular remove
> callback is invoked, exclusively.  This works well, because mostly, the
> cleanup would be the same.
> 
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.

For PCI devices in a hotplug slot, user space can initiate "safe removal"
by writing "0" to the hotplug slot's "power" file in sysfs.

If the PCI device is yanked from the slot while safe removal is ongoing,
there is likewise no way for the driver to know that the device is
suddenly gone.  That's because pciehp_unconfigure_device() only calls
pci_dev_set_disconnected() in the surprise removal case, not for
safe removal.

The solution proposed here is thus not a complete one:  It may work
if user space initiated *driver* removal, but not if it initiated *safe*
removal of the entire device.  For virtio, that may be sufficient.

> +++ b/drivers/pci/pci.h
> @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
>  	pci_doe_disconnected(dev);
>  
> +	if (READ_ONCE(dev->disconnect_work_enable)) {
> +		/* Make sure work is up to date. */
> +		smp_rmb();
> +		schedule_work(&dev->disconnect_work);
> +	}
> +
>  	return 0;
>  }

Going through all the callers of pci_dev_set_disconnected(),
I suppose the (only) one you're interested in is
pciehp_unconfigure_device().

The other callers are related to runtime resume, resume from
system sleep and ACPI slots.

Instead of amending pci_dev_set_disconnected(), I'd prefer
an approach where pciehp_unconfigure_device() first marks
all devices disconnected, then wakes up some global waitqueue, e.g.:

-	if (!presence)
+	if (!presence) {
		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+		wake_up_all(&pci_disconnected_wq);
+	}

The benefit is that there's no delay when marking devices disconnected.
(Granted, the delay is small for smp_rmb() + schedule_work().)
And just having a global waitqueue is simpler and may be useful
for other use cases.

So instead of adding timeouts when waiting for interrupts, drivers would
be woken via the waitqueue.

But again, it's not a complete solution as it doesn't cover the
"surprise removal during safe removal" case.

I also agree with Bjorn's and Keith's comments that the driver should
use timeouts for robustness, but still wanted to provide additional
(hopefully constructive) thoughts.

Thanks!

Lukas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-09 23:55     ` Keith Busch
@ 2025-07-14  6:17       ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14  6:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, linux-kernel, Lukas Wunner, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Wed, Jul 09, 2025 at 05:55:17PM -0600, Keith Busch wrote:
> On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > This relies on somebody (typically pciehp, I guess) calling
> > pci_dev_set_disconnected() when a surprise remove happens.
> > 
> > Do you think it would be practical for the driver's .remove() method
> > to recognize that the device may stop responding at any point, even if
> > no hotplug driver is present to call pci_dev_set_disconnected()?
> > 
> > Waiting forever for an interrupt seems kind of vulnerable in general.
> > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > forever for interrupts?  That doesn't seem artificial to me because
> > it's just a fact of life that devices can disappear at arbitrary
> > times.
> 
> I totally agree here. Every driver's .remove() should be able to
> guarantee forward progress some way. I put some work in blk-mq and nvme
> to ensure that happens for those devices at least.
> 
> That "forward progress" can come slow though, maybe minutes, so we do
> have opprotunisitic short cuts sprinkled about the driver. There are
> still gaps when waiting for interrupt driven IO that need the longer
> timeouts to trigger. It'd be cool if there was a mechansim to kick in
> quicker, but this is still an uncommon exceptional condition, right?

It's uncommon, yes.

-- 
MST


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-14  6:11   ` Lukas Wunner
@ 2025-07-14  6:18     ` Michael S. Tsirkin
  2025-07-14  6:54     ` Michael S. Tsirkin
  2025-07-17 15:11     ` Michael S. Tsirkin
  2 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14  6:18 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively.  This works well, because mostly, the
> > cleanup would be the same.
> > 
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
> 
> For PCI devices in a hotplug slot, user space can initiate "safe removal"
> by writing "0" to the hotplug slot's "power" file in sysfs.
> 
> If the PCI device is yanked from the slot while safe removal is ongoing,
> there is likewise no way for the driver to know that the device is
> suddenly gone.  That's because pciehp_unconfigure_device() only calls
> pci_dev_set_disconnected() in the surprise removal case, not for
> safe removal.
> 
> The solution proposed here is thus not a complete one:  It may work
> if user space initiated *driver* removal, but not if it initiated *safe*
> removal of the entire device.  For virtio, that may be sufficient.
> 
> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> >  	pci_doe_disconnected(dev);
> >  
> > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > +		/* Make sure work is up to date. */
> > +		smp_rmb();
> > +		schedule_work(&dev->disconnect_work);
> > +	}
> > +
> >  	return 0;
> >  }
> 
> Going through all the callers of pci_dev_set_disconnected(),
> I suppose the (only) one you're interested in is
> pciehp_unconfigure_device().
> 
> The other callers are related to runtime resume, resume from
> system sleep and ACPI slots.
> 
> Instead of amending pci_dev_set_disconnected(), I'd prefer
> an approach where pciehp_unconfigure_device() first marks
> all devices disconnected, then wakes up some global waitqueue, e.g.:
> 
> -	if (!presence)
> +	if (!presence) {
> 		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> +		wake_up_all(&pci_disconnected_wq);
> +	}
> 
> The benefit is that there's no delay when marking devices disconnected.
> (Granted, the delay is small for smp_rmb() + schedule_work().)
> And just having a global waitqueue is simpler and may be useful
> for other use cases.
> 
> So instead of adding timeouts when waiting for interrupts, drivers would
> be woken via the waitqueue.
> 
> But again, it's not a complete solution as it doesn't cover the
> "surprise removal during safe removal" case.

Did not realize. Will look into addressing this, thanks!


> I also agree with Bjorn's and Keith's comments that the driver should
> use timeouts for robustness, but still wanted to provide additional
> (hopefully constructive) thoughts.
> 
> Thanks!
> 
> Lukas

I'll address these comments in the next version.

-- 
MST


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-09 23:38   ` Bjorn Helgaas
  2025-07-09 23:55     ` Keith Busch
@ 2025-07-14  6:26     ` Michael S. Tsirkin
  2025-07-14 21:13       ` Bjorn Helgaas
  1 sibling, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14  6:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> Housekeeping: Note subject line convention. Indent with spaces in
> commit log.  Remove spurious plus signs.


Thanks!

> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively.  This works well, because mostly, the
> > cleanup would be the same.
> > 
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
> > 
> > For example, this was reported for virtio-blk:
> > 
> > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > 	   complete,
> > 
> > 	2. Now few requests are yet to complete, and surprise removal started.
> > 
> > 	At this point, virtio block driver will not get notified by the driver
> > 	core layer, because it is likely serializing remove() happening by
> > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > 	vblk driver doesn't know that device is removed, block layer is waiting
> > 	for requests completions to arrive which it never gets.  So
> > 	del_gendisk() gets stuck.
> > 
> > Drivers can artificially add timeouts to handle that, but it can be
> > flaky.
> > 
> > Instead, let's add a way for the driver to be notified about the
> > disconnect. It can then do any necessary cleanup, knowing that the
> > device is inactive.
> 
> This relies on somebody (typically pciehp, I guess) calling
> pci_dev_set_disconnected() when a surprise remove happens.
> 
> Do you think it would be practical for the driver's .remove() method
> to recognize that the device may stop responding at any point, even if
> no hotplug driver is present to call pci_dev_set_disconnected()?
> 
> Waiting forever for an interrupt seems kind of vulnerable in general.
> Maybe "artificially adding timeouts" is alluding to *not* waiting
> forever for interrupts?  That doesn't seem artificial to me because
> it's just a fact of life that devices can disappear at arbitrary
> times.
> 
> It seems a little fragile to me to depend on some other part of the
> system to notice the surprise removal and tell you about it or
> schedule your work function.  I think it would be more robust for the
> driver to check directly, i.e., assume writes to the device may be
> lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> never wait for an interrupt without a timeout.

virtio is ... kind of special, in that users already take it for
granted that having a device as long as they want to respond
still does not lead to errors and data loss.

Makes it a bit harder as our timeout would have to
check for presence and retry, we can't just fail as a
normal hardware device does.

And there's the overhead thing - poking at the device a lot
puts a high load on the host.

So I can imagine a very long timeout (minutes?), and then something like
the WQ I am trying to propose here as a shortcut.



> > Since cleanups can take a long time, this takes an approach
> > of a work struct that the driver initiates and enables
> > on probe, and tears down on remove.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  drivers/pci/pci.h   |  6 ++++++
> >  include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 51 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 12215ee72afb..3ca4ebfd46be 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> >  	pci_doe_disconnected(dev);
> >  
> > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > +		/* Make sure work is up to date. */
> > +		smp_rmb();
> > +		schedule_work(&dev->disconnect_work);
> > +	}
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 05e68f35f392..723b17145b62 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -548,6 +548,10 @@ struct pci_dev {
> >  	/* These methods index pci_reset_fn_methods[] */
> >  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> >  
> > +	/* Report disconnect events. 0x0 - disable, 0x1 - enable */
> > +	u8 disconnect_work_enable;
> > +	struct work_struct disconnect_work;
> > +
> >  #ifdef CONFIG_PCIE_TPH
> >  	u16		tph_cap;	/* TPH capability offset */
> >  	u8		tph_mode;	/* TPH mode */
> > @@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
> >  			    pci_select_bars(pdev, IORESOURCE_MEM));
> >  }
> >  
> > +/*
> > + * Run this first thing after getting a disconnect work, to prevent it from
> > + * running multiple times.
> > + * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
> > + */
> > +static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
> > +{
> > +	u8 enable = 0x1;
> > +	u8 disable = 0x0;
> > +	return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
> > +}
> > +
> > +/*
> > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > + * The work function must run and check pci_test_and_clear_disconnect_enable.
> > + * Note that device can go away right after this call.
> > + */
> > +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> > +{
> > +	/* Make sure WQ has been initialized already */
> > +	smp_wmb();
> > +
> > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> > +
> > +	/* check the device did not go away meanwhile. */
> > +	mb();
> > +
> > +	if (!pci_device_is_present(pdev))
> > +		schedule_work(&pdev->disconnect_work);
> > +}
> > +
> > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> > +{
> > +	WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > +
> > +	/* Make sure to stop using work from now on. */
> > +	smp_wmb();
> > +
> > +	cancel_work_sync(&pdev->disconnect_work);
> > +}
> > +
> >  #else /* CONFIG_PCI is not enabled */
> >  
> >  static inline void pci_set_flags(int flags) { }
> > -- 
> > MST
> > 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-14  6:11   ` Lukas Wunner
  2025-07-14  6:18     ` Michael S. Tsirkin
@ 2025-07-14  6:54     ` Michael S. Tsirkin
  2025-07-17 15:11     ` Michael S. Tsirkin
  2 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14  6:54 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively.  This works well, because mostly, the
> > cleanup would be the same.
> > 
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
> 
> For PCI devices in a hotplug slot, user space can initiate "safe removal"
> by writing "0" to the hotplug slot's "power" file in sysfs.
> 
> If the PCI device is yanked from the slot while safe removal is ongoing,
> there is likewise no way for the driver to know that the device is
> suddenly gone.  That's because pciehp_unconfigure_device() only calls
> pci_dev_set_disconnected() in the surprise removal case, not for
> safe removal.
> 
> The solution proposed here is thus not a complete one:  It may work
> if user space initiated *driver* removal, but not if it initiated *safe*
> removal of the entire device.  For virtio, that may be sufficient.


No, I just missed this corner case.

> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> >  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> >  	pci_doe_disconnected(dev);
> >  
> > +	if (READ_ONCE(dev->disconnect_work_enable)) {
> > +		/* Make sure work is up to date. */
> > +		smp_rmb();
> > +		schedule_work(&dev->disconnect_work);
> > +	}
> > +
> >  	return 0;
> >  }
> 
> Going through all the callers of pci_dev_set_disconnected(),
> I suppose the (only) one you're interested in is
> pciehp_unconfigure_device().
> 
> The other callers are related to runtime resume, resume from
> system sleep and ACPI slots.
> 
> Instead of amending pci_dev_set_disconnected(), I'd prefer
> an approach where pciehp_unconfigure_device() first marks
> all devices disconnected, then wakes up some global waitqueue, e.g.:
> 
> -	if (!presence)
> +	if (!presence) {
> 		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> +		wake_up_all(&pci_disconnected_wq);
> +	}
> 
> The benefit is that there's no delay when marking devices disconnected.
> (Granted, the delay is small for smp_rmb() + schedule_work().)
> And just having a global waitqueue is simpler and may be useful
> for other use cases.
> 
> So instead of adding timeouts when waiting for interrupts, drivers would
> be woken via the waitqueue.
> 
> But again, it's not a complete solution as it doesn't cover the
> "surprise removal during safe removal" case.
> 
> I also agree with Bjorn's and Keith's comments that the driver should
> use timeouts for robustness,

Yes - we can consider this an optimization, as robust timeouts
are by necessity minutes.

> but still wanted to provide additional
> (hopefully constructive) thoughts.
> 
> Thanks!
> 
> Lukas


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-14  6:26     ` Michael S. Tsirkin
@ 2025-07-14 21:13       ` Bjorn Helgaas
  2025-07-15  6:28         ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2025-07-14 21:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > At the moment, in case of a surprise removal, the regular remove
> > > callback is invoked, exclusively.  This works well, because mostly, the
> > > cleanup would be the same.
> > > 
> > > However, there's a race: imagine device removal was initiated by a user
> > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > is now waiting for an interrupt from the device. If the device is now
> > > surprise-removed, that never arrives and the remove callback hangs
> > > forever.
> > > 
> > > For example, this was reported for virtio-blk:
> > > 
> > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > 	   complete,
> > > 
> > > 	2. Now few requests are yet to complete, and surprise removal started.
> > > 
> > > 	At this point, virtio block driver will not get notified by the driver
> > > 	core layer, because it is likely serializing remove() happening by
> > > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > 	for requests completions to arrive which it never gets.  So
> > > 	del_gendisk() gets stuck.
> > > 
> > > Drivers can artificially add timeouts to handle that, but it can be
> > > flaky.
> > > 
> > > Instead, let's add a way for the driver to be notified about the
> > > disconnect. It can then do any necessary cleanup, knowing that the
> > > device is inactive.
> > 
> > This relies on somebody (typically pciehp, I guess) calling
> > pci_dev_set_disconnected() when a surprise remove happens.
> > 
> > Do you think it would be practical for the driver's .remove() method
> > to recognize that the device may stop responding at any point, even if
> > no hotplug driver is present to call pci_dev_set_disconnected()?
> > 
> > Waiting forever for an interrupt seems kind of vulnerable in general.
> > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > forever for interrupts?  That doesn't seem artificial to me because
> > it's just a fact of life that devices can disappear at arbitrary
> > times.
> > 
> > It seems a little fragile to me to depend on some other part of the
> > system to notice the surprise removal and tell you about it or
> > schedule your work function.  I think it would be more robust for the
> > driver to check directly, i.e., assume writes to the device may be
> > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > never wait for an interrupt without a timeout.
> 
> virtio is ... kind of special, in that users already take it for
> granted that having a device as long as they want to respond
> still does not lead to errors and data loss.
> 
> Makes it a bit harder as our timeout would have to
> check for presence and retry, we can't just fail as a
> normal hardware device does.

Sorry, I don't know enough about virtio to follow what you said about 
"having a device as long as they want to respond".

We started with a graceful remove.  That must mean the user no longer
needs the device.

> And there's the overhead thing - poking at the device a lot
> puts a high load on the host.

Checking for PCI_POSSIBLE_ERROR() doesn't touch the device.  If you
did a config read already, and the result happened to be ~0, *then* we
have the problem of figuring out whether the actual data from the
device was ~0, or if the read failed and the Root Complex synthesized
the ~0.  In many cases a driver knows that ~0 is not a possible
register value.  Otherwise it might have to read another register that
is known not to be ~0.

Bjorn

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-14 21:13       ` Bjorn Helgaas
@ 2025-07-15  6:28         ` Michael S. Tsirkin
  2025-07-16 22:29           ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-15  6:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > At the moment, in case of a surprise removal, the regular remove
> > > > callback is invoked, exclusively.  This works well, because mostly, the
> > > > cleanup would be the same.
> > > > 
> > > > However, there's a race: imagine device removal was initiated by a user
> > > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > > is now waiting for an interrupt from the device. If the device is now
> > > > surprise-removed, that never arrives and the remove callback hangs
> > > > forever.
> > > > 
> > > > For example, this was reported for virtio-blk:
> > > > 
> > > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > 	   complete,
> > > > 
> > > > 	2. Now few requests are yet to complete, and surprise removal started.
> > > > 
> > > > 	At this point, virtio block driver will not get notified by the driver
> > > > 	core layer, because it is likely serializing remove() happening by
> > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > 	for requests completions to arrive which it never gets.  So
> > > > 	del_gendisk() gets stuck.
> > > > 
> > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > flaky.
> > > > 
> > > > Instead, let's add a way for the driver to be notified about the
> > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > device is inactive.
> > > 
> > > This relies on somebody (typically pciehp, I guess) calling
> > > pci_dev_set_disconnected() when a surprise remove happens.
> > > 
> > > Do you think it would be practical for the driver's .remove() method
> > > to recognize that the device may stop responding at any point, even if
> > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > 
> > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > forever for interrupts?  That doesn't seem artificial to me because
> > > it's just a fact of life that devices can disappear at arbitrary
> > > times.
> > > 
> > > It seems a little fragile to me to depend on some other part of the
> > > system to notice the surprise removal and tell you about it or
> > > schedule your work function.  I think it would be more robust for the
> > > driver to check directly, i.e., assume writes to the device may be
> > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > never wait for an interrupt without a timeout.
> > 
> > virtio is ... kind of special, in that users already take it for
> > granted that having a device as long as they want to respond
> > still does not lead to errors and data loss.
> > 
> > Makes it a bit harder as our timeout would have to
> > check for presence and retry, we can't just fail as a
> > normal hardware device does.
> 
> Sorry, I don't know enough about virtio to follow what you said about 
> "having a device as long as they want to respond".
> 
> We started with a graceful remove.  That must mean the user no longer
> needs the device.

I'll try to clarify:

Indeed, the user will not submit new requests,
but users might also not know that there are some old requests
being in progress of being processed by the device.
The driver, currently, waits for that processsing to be complete.
Cancelling that with a reset on a timeout might be a regression,
unless the timeout is very long.

Hope this makes it clear.

> > And there's the overhead thing - poking at the device a lot
> > puts a high load on the host.
> 
> Checking for PCI_POSSIBLE_ERROR() doesn't touch the device.  If you
> did a config read already, and the result happened to be ~0, *then* we
> have the problem of figuring out whether the actual data from the
> device was ~0, or if the read failed and the Root Complex synthesized
> the ~0.  In many cases a driver knows that ~0 is not a possible
> register value.  Otherwise it might have to read another register that
> is known not to be ~0.
> 
> Bjorn

To clarify, virtio generally is designed to operate solely
by means of DMA and interrupt, completely avoiding any PCI
reads. This, due to PCI reads being very expensive in virtualized
scenarios.

The extra overhead I refer to is exactly initiating such a read
where there would not be one in normal operation.

-- 
MST


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-15  6:28         ` Michael S. Tsirkin
@ 2025-07-16 22:29           ` Bjorn Helgaas
  2025-07-17 15:15             ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2025-07-16 22:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Tue, Jul 15, 2025 at 02:28:20AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> > On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > > At the moment, in case of a surprise removal, the regular
> > > > > remove callback is invoked, exclusively.  This works well,
> > > > > because mostly, the cleanup would be the same.
> > > > > 
> > > > > However, there's a race: imagine device removal was
> > > > > initiated by a user action, such as driver unbind, and it in
> > > > > turn initiated some cleanup and is now waiting for an
> > > > > interrupt from the device. If the device is now
> > > > > surprise-removed, that never arrives and the remove callback
> > > > > hangs forever.
> > > > > 
> > > > > For example, this was reported for virtio-blk:
> > > > > 
> > > > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > 	   complete,
> > > > > 
> > > > > 	2. Now few requests are yet to complete, and surprise removal started.
> > > > > 
> > > > > 	At this point, virtio block driver will not get notified by the driver
> > > > > 	core layer, because it is likely serializing remove() happening by
> > > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > > 	for requests completions to arrive which it never gets.  So
> > > > > 	del_gendisk() gets stuck.
> > > > > 
> > > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > > flaky.
> > > > > 
> > > > > Instead, let's add a way for the driver to be notified about the
> > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > device is inactive.
> > > > 
> > > > This relies on somebody (typically pciehp, I guess) calling
> > > > pci_dev_set_disconnected() when a surprise remove happens.
> > > > 
> > > > Do you think it would be practical for the driver's .remove() method
> > > > to recognize that the device may stop responding at any point, even if
> > > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > > 
> > > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > > forever for interrupts?  That doesn't seem artificial to me because
> > > > it's just a fact of life that devices can disappear at arbitrary
> > > > times.
> > > > 
> > > > It seems a little fragile to me to depend on some other part of the
> > > > system to notice the surprise removal and tell you about it or
> > > > schedule your work function.  I think it would be more robust for the
> > > > driver to check directly, i.e., assume writes to the device may be
> > > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > > never wait for an interrupt without a timeout.
> > > 
> > > virtio is ... kind of special, in that users already take it for
> > > granted that having a device as long as they want to respond
> > > still does not lead to errors and data loss.
> > > 
> > > Makes it a bit harder as our timeout would have to
> > > check for presence and retry, we can't just fail as a
> > > normal hardware device does.
> > 
> > Sorry, I don't know enough about virtio to follow what you said about 
> > "having a device as long as they want to respond".
> > 
> > We started with a graceful remove.  That must mean the user no longer
> > needs the device.
> 
> I'll try to clarify:
> 
> Indeed, the user will not submit new requests,
> but users might also not know that there are some old requests
> being in progress of being processed by the device.
> The driver, currently, waits for that processsing to be complete.
> Cancelling that with a reset on a timeout might be a regression,
> unless the timeout is very long.

This seems like a corner case and maybe rare enough that simply making
the timeout very long would be a possibility.

> > > And there's the overhead thing - poking at the device a lot
> > > puts a high load on the host.
> > 
> > Checking for PCI_POSSIBLE_ERROR() doesn't touch the device.  If you
> > did a config read already, and the result happened to be ~0, *then* we
> > have the problem of figuring out whether the actual data from the
> > device was ~0, or if the read failed and the Root Complex synthesized
> > the ~0.  In many cases a driver knows that ~0 is not a possible
> > register value.  Otherwise it might have to read another register that
> > is known not to be ~0.
> 
> To clarify, virtio generally is designed to operate solely
> by means of DMA and interrupt, completely avoiding any PCI
> reads. This, due to PCI reads being very expensive in virtualized
> scenarios.
> 
> The extra overhead I refer to is exactly initiating such a read
> where there would not be one in normal operation.

Thanks, this part is very helpful.  And since config accesses are very
expensive in *all* environments, I expect most drivers for
high-performance devices work the same way and only do config accesses
during at probe time.

If that's true, it will make this more understandable if the commit
log approaches it from that direction and omits virtio specifics.

Bjorn

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-14  6:11   ` Lukas Wunner
  2025-07-14  6:18     ` Michael S. Tsirkin
  2025-07-14  6:54     ` Michael S. Tsirkin
@ 2025-07-17 15:11     ` Michael S. Tsirkin
  2025-07-17 20:12       ` Lukas Wunner
  2 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-17 15:11 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively.  This works well, because mostly, the
> > cleanup would be the same.
> > 
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
> 
> For PCI devices in a hotplug slot, user space can initiate "safe removal"
> by writing "0" to the hotplug slot's "power" file in sysfs.
> 
> If the PCI device is yanked from the slot while safe removal is ongoing,
> there is likewise no way for the driver to know that the device is
> suddenly gone.  That's because pciehp_unconfigure_device() only calls
> pci_dev_set_disconnected() in the surprise removal case, not for
> safe removal.
> 
> The solution proposed here is thus not a complete one:  It may work
> if user space initiated *driver* removal, but not if it initiated *safe*
> removal of the entire device.  For virtio, that may be sufficient.

So just as an idea, something like this can work I guess?  I'm yet to
test this - wrote this on the go - and also I'll need to implement for
other hotplug drivers, I need it at least for ACPI additonally.
WDYT?



diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index bcc938d4420f..46468a1f0244 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -231,6 +231,15 @@ void pciehp_handle_disable_request(struct controller *ctrl)
 void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
 {
 	int present, link_active;
+	/*
+	 * Always mark downstream devices disconnected on Presence Detect Change.
+	 * Covers device yanked during safe removal.
+	 */
+	if (events & PCI_EXP_SLTSTA_PDC) {
+		struct pci_bus *parent = ctrl->pcie->port->subordinate;
+		if (parent)
+			pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+	}
 
 	/*
 	 * If the slot is on and presence or link has changed, turn it off.


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-16 22:29           ` Bjorn Helgaas
@ 2025-07-17 15:15             ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-17 15:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
	Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci

On Wed, Jul 16, 2025 at 05:29:00PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 15, 2025 at 02:28:20AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > > > At the moment, in case of a surprise removal, the regular
> > > > > > remove callback is invoked, exclusively.  This works well,
> > > > > > because mostly, the cleanup would be the same.
> > > > > > 
> > > > > > However, there's a race: imagine device removal was
> > > > > > initiated by a user action, such as driver unbind, and it in
> > > > > > turn initiated some cleanup and is now waiting for an
> > > > > > interrupt from the device. If the device is now
> > > > > > surprise-removed, that never arrives and the remove callback
> > > > > > hangs forever.
> > > > > > 
> > > > > > For example, this was reported for virtio-blk:
> > > > > > 
> > > > > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > > 	   complete,
> > > > > > 
> > > > > > 	2. Now few requests are yet to complete, and surprise removal started.
> > > > > > 
> > > > > > 	At this point, virtio block driver will not get notified by the driver
> > > > > > 	core layer, because it is likely serializing remove() happening by
> > > > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > > > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > > > 	for requests completions to arrive which it never gets.  So
> > > > > > 	del_gendisk() gets stuck.
> > > > > > 
> > > > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > > > flaky.
> > > > > > 
> > > > > > Instead, let's add a way for the driver to be notified about the
> > > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > > device is inactive.
> > > > > 
> > > > > This relies on somebody (typically pciehp, I guess) calling
> > > > > pci_dev_set_disconnected() when a surprise remove happens.
> > > > > 
> > > > > Do you think it would be practical for the driver's .remove() method
> > > > > to recognize that the device may stop responding at any point, even if
> > > > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > > > 
> > > > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > > > forever for interrupts?  That doesn't seem artificial to me because
> > > > > it's just a fact of life that devices can disappear at arbitrary
> > > > > times.
> > > > > 
> > > > > It seems a little fragile to me to depend on some other part of the
> > > > > system to notice the surprise removal and tell you about it or
> > > > > schedule your work function.  I think it would be more robust for the
> > > > > driver to check directly, i.e., assume writes to the device may be
> > > > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > > > never wait for an interrupt without a timeout.
> > > > 
> > > > virtio is ... kind of special, in that users already take it for
> > > > granted that having a device as long as they want to respond
> > > > still does not lead to errors and data loss.
> > > > 
> > > > Makes it a bit harder as our timeout would have to
> > > > check for presence and retry, we can't just fail as a
> > > > normal hardware device does.
> > > 
> > > Sorry, I don't know enough about virtio to follow what you said about 
> > > "having a device as long as they want to respond".
> > > 
> > > We started with a graceful remove.  That must mean the user no longer
> > > needs the device.
> > 
> > I'll try to clarify:
> > 
> > Indeed, the user will not submit new requests,
> > but users might also not know that there are some old requests
> > being in progress of being processed by the device.
> > The driver, currently, waits for that processsing to be complete.
> > Cancelling that with a reset on a timeout might be a regression,
> > unless the timeout is very long.
> 
> This seems like a corner case and maybe rare enough that simply making
> the timeout very long would be a possibility.

Indeed the timeout needs to be very long, and the average would still be
reasonable, but the worst case is terrible and the user can't insert a
replacement card all this time. The system is perceived as flaky.


> > > > And there's the overhead thing - poking at the device a lot
> > > > puts a high load on the host.
> > > 
> > > Checking for PCI_POSSIBLE_ERROR() doesn't touch the device.  If you
> > > did a config read already, and the result happened to be ~0, *then* we
> > > have the problem of figuring out whether the actual data from the
> > > device was ~0, or if the read failed and the Root Complex synthesized
> > > the ~0.  In many cases a driver knows that ~0 is not a possible
> > > register value.  Otherwise it might have to read another register that
> > > is known not to be ~0.
> > 
> > To clarify, virtio generally is designed to operate solely
> > by means of DMA and interrupt, completely avoiding any PCI
> > reads. This, due to PCI reads being very expensive in virtualized
> > scenarios.
> > 
> > The extra overhead I refer to is exactly initiating such a read
> > where there would not be one in normal operation.
> 
> Thanks, this part is very helpful.  And since config accesses are very
> expensive in *all* environments, I expect most drivers for
> high-performance devices work the same way and only do config accesses
> during at probe time.
> 
> If that's true, it will make this more understandable if the commit
> log approaches it from that direction and omits virtio specifics.
> 
> Bjorn

Will do, thanks a lot!

-- 
MST


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-17 15:11     ` Michael S. Tsirkin
@ 2025-07-17 20:12       ` Lukas Wunner
  2025-07-17 23:31         ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Lukas Wunner @ 2025-07-17 20:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Thu, Jul 17, 2025 at 11:11:44AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > At the moment, in case of a surprise removal, the regular remove
> > > callback is invoked, exclusively.  This works well, because mostly, the
> > > cleanup would be the same.
> > > 
> > > However, there's a race: imagine device removal was initiated by a user
> > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > is now waiting for an interrupt from the device. If the device is now
> > > surprise-removed, that never arrives and the remove callback hangs
> > > forever.
> > 
> > For PCI devices in a hotplug slot, user space can initiate "safe removal"
> > by writing "0" to the hotplug slot's "power" file in sysfs.
> > 
> > If the PCI device is yanked from the slot while safe removal is ongoing,
> > there is likewise no way for the driver to know that the device is
> > suddenly gone.  That's because pciehp_unconfigure_device() only calls
> > pci_dev_set_disconnected() in the surprise removal case, not for
> > safe removal.
> > 
> > The solution proposed here is thus not a complete one:  It may work
> > if user space initiated *driver* removal, but not if it initiated *safe*
> > removal of the entire device.  For virtio, that may be sufficient.
> 
> So just as an idea, something like this can work I guess?  I'm yet to
> test this - wrote this on the go -

Don't bother, it won't work:

pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
the IRQ thread.  During safe removal the IRQ thread is busy in
pciehp_unconfigure_device() and waiting for the driver to unbind
from devices being safe-removed.

An IRQ thread is always single-threaded.  There's no second instance
of the IRQ thread being run when another interrupt is signaled.
Rather, the IRQ thread is re-run when it has finished.

In *theory* what would be possible is to plumb this into pciehp_isr().
That's the hardirq handler.  This one will indeed be run when an
interrupt comes in while the IRQ thread is running.  Normally the
hardirq handler would just collect the events for later consumption
by the IRQ thread.  The hardirq handler could *theoretically* mark
devices gone while they're being safe-removed.

I'm saying "theoretically" because in reality I don't think this is
a viable approach either:  pciehp_ist() contains code to *ignore*
link or presence changes if they were caused by a Secondary Bus Reset
or Downstream Port Containment.  In that case we do *not* want to mark
devices disconnected because they're only *temporarily* inaccessible.
This requires waiting for the SBR or DPC to conclude, which can take
several seconds.  We can't wait in the hardirq handler.

So this cannot be solved with the current architecture of pciehp,
at least not easily or in an elegant way.  Sorry!

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-17 20:12       ` Lukas Wunner
@ 2025-07-17 23:31         ` Michael S. Tsirkin
  2025-07-18  4:35           ` Lukas Wunner
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-17 23:31 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> On Thu, Jul 17, 2025 at 11:11:44AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > At the moment, in case of a surprise removal, the regular remove
> > > > callback is invoked, exclusively.  This works well, because mostly, the
> > > > cleanup would be the same.
> > > > 
> > > > However, there's a race: imagine device removal was initiated by a user
> > > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > > is now waiting for an interrupt from the device. If the device is now
> > > > surprise-removed, that never arrives and the remove callback hangs
> > > > forever.
> > > 
> > > For PCI devices in a hotplug slot, user space can initiate "safe removal"
> > > by writing "0" to the hotplug slot's "power" file in sysfs.
> > > 
> > > If the PCI device is yanked from the slot while safe removal is ongoing,
> > > there is likewise no way for the driver to know that the device is
> > > suddenly gone.  That's because pciehp_unconfigure_device() only calls
> > > pci_dev_set_disconnected() in the surprise removal case, not for
> > > safe removal.
> > > 
> > > The solution proposed here is thus not a complete one:  It may work
> > > if user space initiated *driver* removal, but not if it initiated *safe*
> > > removal of the entire device.  For virtio, that may be sufficient.
> > 
> > So just as an idea, something like this can work I guess?  I'm yet to
> > test this - wrote this on the go -
> 
> Don't bother, it won't work:
> 
> pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> the IRQ thread.  During safe removal the IRQ thread is busy in
> pciehp_unconfigure_device() and waiting for the driver to unbind
> from devices being safe-removed.

Confused. I thought safe removal happens in the userspace thread
that wrote into sysfs?

> An IRQ thread is always single-threaded.  There's no second instance
> of the IRQ thread being run when another interrupt is signaled.
> Rather, the IRQ thread is re-run when it has finished.
> 
> In *theory* what would be possible is to plumb this into pciehp_isr().
> That's the hardirq handler.  This one will indeed be run when an
> interrupt comes in while the IRQ thread is running.  Normally the
> hardirq handler would just collect the events for later consumption
> by the IRQ thread.  The hardirq handler could *theoretically* mark
> devices gone while they're being safe-removed.
> 
> I'm saying "theoretically" because in reality I don't think this is
> a viable approach either:  pciehp_ist() contains code to *ignore*
> link or presence changes if they were caused by a Secondary Bus Reset
> or Downstream Port Containment.  In that case we do *not* want to mark
> devices disconnected because they're only *temporarily* inaccessible.
> This requires waiting for the SBR or DPC to conclude, which can take
> several seconds.  We can't wait in the hardirq handler.
> 
> So this cannot be solved with the current architecture of pciehp,
> at least not easily or in an elegant way.  Sorry!
> 
> Thanks,
> 
> Lukas


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-17 23:31         ` Michael S. Tsirkin
@ 2025-07-18  4:35           ` Lukas Wunner
  2025-07-18  8:40             ` Michael S. Tsirkin
  0 siblings, 1 reply; 17+ messages in thread
From: Lukas Wunner @ 2025-07-18  4:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Thu, Jul 17, 2025 at 07:31:57PM -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> > pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> > the IRQ thread.  During safe removal the IRQ thread is busy in
> > pciehp_unconfigure_device() and waiting for the driver to unbind
> > from devices being safe-removed.
> 
> Confused. I thought safe removal happens in the userspace thread
> that wrote into sysfs?

No, the userspace thread synthesizes a DISABLE_SLOT event,
calls irq_wake_thread(), then waits for the IRQ thread to
finish handling that event.  See pciehp_sysfs_disable_slot().

Until 2018 we indeed brought down the slot in the userspace
thread, but that required locking between the workqueue fed
by the interrupt handler on the one hand and the userspace
thread on the other hand.  It was difficult to reason about
the code.

We had bug reports about slots flapping the link or presence
bits on slot bringdown that we could easily address by handling
everything in the IRQ thread, see 3943af9d01e9.  The same was
reported for slot bringup and addressed by 6c35a1ac3da6.

This wouldn't have been possible with the architecture prior
to 2018, at least not this easily.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
  2025-07-18  4:35           ` Lukas Wunner
@ 2025-07-18  8:40             ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2025-07-18  8:40 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
	virtualization, stefanha, alok.a.tiwari, linux-pci

On Fri, Jul 18, 2025 at 06:35:56AM +0200, Lukas Wunner wrote:
> On Thu, Jul 17, 2025 at 07:31:57PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> > > pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> > > the IRQ thread.  During safe removal the IRQ thread is busy in
> > > pciehp_unconfigure_device() and waiting for the driver to unbind
> > > from devices being safe-removed.
> > 
> > Confused. I thought safe removal happens in the userspace thread
> > that wrote into sysfs?
> 
> No, the userspace thread synthesizes a DISABLE_SLOT event,
> calls irq_wake_thread(), then waits for the IRQ thread to
> finish handling that event.  See pciehp_sysfs_disable_slot().
> 
> Until 2018 we indeed brought down the slot in the userspace
> thread, but that required locking between the workqueue fed
> by the interrupt handler on the one hand and the userspace
> thread on the other hand.  It was difficult to reason about
> the code.
> 
> We had bug reports about slots flapping the link or presence
> bits on slot bringdown that we could easily address by handling
> everything in the IRQ thread, see 3943af9d01e9.  The same was
> reported for slot bringup and addressed by 6c35a1ac3da6.
> 
> This wouldn't have been possible with the architecture prior
> to 2018, at least not this easily.
> 
> Thanks,
> 
> Lukas

Got it, thanks!

-- 
MST


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-07-18  8:40 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <cover.1752094439.git.mst@redhat.com>
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: report surprise removal event Michael S. Tsirkin
2025-07-09 23:38   ` Bjorn Helgaas
2025-07-09 23:55     ` Keith Busch
2025-07-14  6:17       ` Michael S. Tsirkin
2025-07-14  6:26     ` Michael S. Tsirkin
2025-07-14 21:13       ` Bjorn Helgaas
2025-07-15  6:28         ` Michael S. Tsirkin
2025-07-16 22:29           ` Bjorn Helgaas
2025-07-17 15:15             ` Michael S. Tsirkin
2025-07-14  6:11   ` Lukas Wunner
2025-07-14  6:18     ` Michael S. Tsirkin
2025-07-14  6:54     ` Michael S. Tsirkin
2025-07-17 15:11     ` Michael S. Tsirkin
2025-07-17 20:12       ` Lukas Wunner
2025-07-17 23:31         ` Michael S. Tsirkin
2025-07-18  4:35           ` Lukas Wunner
2025-07-18  8:40             ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).