* [PATCH RFC v5 0/5] pci,virtio: report surprise removal event
@ 2025-07-09 20:55 Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: " Michael S. Tsirkin
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
To: linux-kernel
Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari
Lukas, Keith, Bjorn, others, would very much appreciate your comments
on whether the pci core changes are acceptable.
Parav, I expect this to be integrated into your work on fixing surprise
removal. As such, I am not queueing these patches yet - please include with your
other patches fixing these issues.
==========
This is an attempt to fix the following race in virtio:
when device removal is initiated by a user
action, such as driver unbind, it in turn initiates driver cleanup and
is then waiting for an interrupt from the device. If the device is now
surprise-removed, that interrupt never arrives and the remove callback hangs
forever.
For example, this was reported for virtio-blk:
1. the graceful removal is ongoing in the remove() callback, where disk
deletion del_gendisk() is ongoing, which waits for the requests +to
complete,
2. Now few requests are yet to complete, and surprise removal started.
At this point, virtio block driver will not get notified by the driver
core layer, because it is likely serializing remove() happening by
+user/driver unload and PCI hotplug driver-initiated device removal. So
vblk driver doesn't know that device is removed, block layer is waiting
for requests completions to arrive which it never gets. So
del_gendisk() gets stuck.
We could add timeouts to handle that, but given virtio blk devices are
implemented in userspace, this makes the device flaky.
Instead, this adds pci core infrastructure, and virtio core
infrastructure, for drivers to be notified of device disconnect.
Actual cleanup in virtio-blk is still TBD.
Compile-tested only.
==========
Notes on the design:
Care was taken to avoid re-introducing the bug fixed by
commit 74ff8864cc84 ("PCI: hotplug: Allow marking devices as disconnected during bind/unbind")
To avoid taking locks on removal path, and to avoid invoking callback
with unpredictable latency, the event is reported through a WQ.
Adding APIs to enable/disable the reporting on probe/remove, helps make
sure the driver won't go away in the middle of the handling, all
without taking any locks.
The benefit is that the resulting API is harder than a callback to
misuse, adding unpredictable latencies to unplug. The WQ can simply do
the cleanup directly, taking any locks it needs for that. The cost is
several extra bytes per device, which seems modest.
Previous discussion:
https://lore.kernel.org/all/11cfcb55b5302999b0e58b94018f92a379196698.1751136072.git.mst@redhat.com
Changes from v4:
fix !CONFIG_PCI, reported by Stephen
Changes from v3:
added documentation to address comments by Parav
support in virtio core
Changes from v2:
v2 was corrupted, fat fingers :(
Changes from v1:
switched to a WQ, with APIs to enable/disable
added motivation
==========
Michael S. Tsirkin (5):
pci: report surprise removal event
virtio: fix comments, readability
virtio: pack config changed flags
virtio: allow transports to suppress config change
virtio: support device disconnect
drivers/pci/pci.h | 6 ++++
drivers/virtio/virtio.c | 23 +++++++++++++--
drivers/virtio/virtio_pci_common.c | 45 ++++++++++++++++++++++++++++++
drivers/virtio/virtio_pci_common.h | 3 ++
drivers/virtio/virtio_pci_legacy.c | 2 ++
drivers/virtio/virtio_pci_modern.c | 2 ++
include/linux/pci.h | 45 ++++++++++++++++++++++++++++++
include/linux/virtio.h | 11 ++++++--
include/linux/virtio_config.h | 32 +++++++++++++++++++++
9 files changed, 163 insertions(+), 6 deletions(-)
--
MST
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-09 20:55 [PATCH RFC v5 0/5] pci,virtio: report surprise removal event Michael S. Tsirkin
@ 2025-07-09 20:55 ` Michael S. Tsirkin
2025-07-09 23:38 ` Bjorn Helgaas
2025-07-14 6:11 ` Lukas Wunner
2025-07-09 20:55 ` [PATCH RFC v5 2/5] virtio: fix comments, readability Michael S. Tsirkin
` (3 subsequent siblings)
4 siblings, 2 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
To: linux-kernel
Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
At the moment, in case of a surprise removal, the regular remove
callback is invoked, exclusively. This works well, because mostly, the
cleanup would be the same.
However, there's a race: imagine device removal was initiated by a user
action, such as driver unbind, and it in turn initiated some cleanup and
is now waiting for an interrupt from the device. If the device is now
surprise-removed, that never arrives and the remove callback hangs
forever.
For example, this was reported for virtio-blk:
1. the graceful removal is ongoing in the remove() callback, where disk
deletion del_gendisk() is ongoing, which waits for the requests +to
complete,
2. Now few requests are yet to complete, and surprise removal started.
At this point, virtio block driver will not get notified by the driver
core layer, because it is likely serializing remove() happening by
+user/driver unload and PCI hotplug driver-initiated device removal. So
vblk driver doesn't know that device is removed, block layer is waiting
for requests completions to arrive which it never gets. So
del_gendisk() gets stuck.
Drivers can artificially add timeouts to handle that, but it can be
flaky.
Instead, let's add a way for the driver to be notified about the
disconnect. It can then do any necessary cleanup, knowing that the
device is inactive.
Since cleanups can take a long time, this takes an approach
of a work struct that the driver initiates and enables
on probe, and tears down on remove.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/pci/pci.h | 6 ++++++
include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb..3ca4ebfd46be 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
pci_doe_disconnected(dev);
+ if (READ_ONCE(dev->disconnect_work_enable)) {
+ /* Make sure work is up to date. */
+ smp_rmb();
+ schedule_work(&dev->disconnect_work);
+ }
+
return 0;
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f392..723b17145b62 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -548,6 +548,10 @@ struct pci_dev {
/* These methods index pci_reset_fn_methods[] */
u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
+ /* Report disconnect events. 0x0 - disable, 0x1 - enable */
+ u8 disconnect_work_enable;
+ struct work_struct disconnect_work;
+
#ifdef CONFIG_PCIE_TPH
u16 tph_cap; /* TPH capability offset */
u8 tph_mode; /* TPH mode */
@@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
pci_select_bars(pdev, IORESOURCE_MEM));
}
+/*
+ * Run this first thing after getting a disconnect work, to prevent it from
+ * running multiple times.
+ * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
+ */
+static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
+{
+ u8 enable = 0x1;
+ u8 disable = 0x0;
+ return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
+}
+
+/*
+ * Caller must initialize @pdev->disconnect_work before invoking this.
+ * The work function must run and check pci_test_and_clear_disconnect_enable.
+ * Note that device can go away right after this call.
+ */
+static inline void pci_set_disconnect_work(struct pci_dev *pdev)
+{
+ /* Make sure WQ has been initialized already */
+ smp_wmb();
+
+ WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
+
+ /* check the device did not go away meanwhile. */
+ mb();
+
+ if (!pci_device_is_present(pdev))
+ schedule_work(&pdev->disconnect_work);
+}
+
+static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
+{
+ WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
+
+ /* Make sure to stop using work from now on. */
+ smp_wmb();
+
+ cancel_work_sync(&pdev->disconnect_work);
+}
+
#else /* CONFIG_PCI is not enabled */
static inline void pci_set_flags(int flags) { }
--
MST
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH RFC v5 2/5] virtio: fix comments, readability
2025-07-09 20:55 [PATCH RFC v5 0/5] pci,virtio: report surprise removal event Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: " Michael S. Tsirkin
@ 2025-07-09 20:55 ` Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 3/5] virtio: pack config changed flags Michael S. Tsirkin
` (2 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
To: linux-kernel
Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, Jason Wang, Xuan Zhuo,
Eugenio Pérez
Fix a couple of comments to match reality.
Initialize config_driver_disabled to be consistent with
other fields (note: the structure is already zero initialized,
so this is not a bugfix as such).
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/virtio/virtio.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 95d5d7993e5b..c441c8cc71ef 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -147,7 +147,7 @@ EXPORT_SYMBOL_GPL(virtio_config_changed);
/**
* virtio_config_driver_disable - disable config change reporting by drivers
- * @dev: the device to reset
+ * @dev: the device to disable
*
* This is only allowed to be called by a driver and disabling can't
* be nested.
@@ -162,7 +162,7 @@ EXPORT_SYMBOL_GPL(virtio_config_driver_disable);
/**
* virtio_config_driver_enable - enable config change reporting by drivers
- * @dev: the device to reset
+ * @dev: the device to enable
*
* This is only allowed to be called by a driver and enabling can't
* be nested.
@@ -530,6 +530,7 @@ int register_virtio_device(struct virtio_device *dev)
goto out_ida_remove;
spin_lock_init(&dev->config_lock);
+ dev->config_driver_disabled = false;
dev->config_core_enabled = false;
dev->config_change_pending = false;
--
MST
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH RFC v5 3/5] virtio: pack config changed flags
2025-07-09 20:55 [PATCH RFC v5 0/5] pci,virtio: report surprise removal event Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: " Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 2/5] virtio: fix comments, readability Michael S. Tsirkin
@ 2025-07-09 20:55 ` Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 4/5] virtio: allow transports to suppress config change Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 5/5] virtio: support device disconnect Michael S. Tsirkin
4 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
To: linux-kernel
Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, Jason Wang, Xuan Zhuo,
Eugenio Pérez
In anticipation of adding more, use bit-fields instead of bool.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/virtio.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 64cb4b04be7a..be0beb16b487 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -149,9 +149,9 @@ struct virtio_admin_cmd {
struct virtio_device {
int index;
bool failed;
- bool config_core_enabled;
- bool config_driver_disabled;
- bool config_change_pending;
+ u8 config_core_enabled:1;
+ u8 config_driver_disabled:1;
+ u8 config_change_pending:1;
spinlock_t config_lock;
spinlock_t vqs_list_lock;
struct device dev;
--
MST
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH RFC v5 4/5] virtio: allow transports to suppress config change
2025-07-09 20:55 [PATCH RFC v5 0/5] pci,virtio: report surprise removal event Michael S. Tsirkin
` (2 preceding siblings ...)
2025-07-09 20:55 ` [PATCH RFC v5 3/5] virtio: pack config changed flags Michael S. Tsirkin
@ 2025-07-09 20:55 ` Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 5/5] virtio: support device disconnect Michael S. Tsirkin
4 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
To: linux-kernel
Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, Jason Wang, Xuan Zhuo,
Eugenio Pérez
Will be used on surprise removal, so we don't need to
re-check.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/virtio/virtio.c | 18 +++++++++++++++++-
include/linux/virtio.h | 2 ++
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index c441c8cc71ef..1dd5cdf68c1d 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -127,7 +127,8 @@ static void __virtio_config_changed(struct virtio_device *dev)
{
struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
- if (!dev->config_core_enabled || dev->config_driver_disabled)
+ if (!dev->config_core_enabled || dev->config_driver_disabled ||
+ dev->config_transport_disabled)
dev->config_change_pending = true;
else if (drv && drv->config_changed) {
drv->config_changed(dev);
@@ -193,6 +194,20 @@ static void virtio_config_core_enable(struct virtio_device *dev)
spin_unlock_irq(&dev->config_lock);
}
+/**
+ * virtio_config_transport_disable - disable config change reporting by transport
+ * @dev: the device in question
+ *
+ * This must only be called by transport and enabling is not allowed.
+ */
+void virtio_config_transport_disable(struct virtio_device *dev)
+{
+ spin_lock_irq(&dev->config_lock);
+ dev->config_transport_disabled = true;
+ spin_unlock_irq(&dev->config_lock);
+}
+EXPORT_SYMBOL_GPL(virtio_config_transport_disable);
+
void virtio_add_status(struct virtio_device *dev, unsigned int status)
{
might_sleep();
@@ -530,6 +545,7 @@ int register_virtio_device(struct virtio_device *dev)
goto out_ida_remove;
spin_lock_init(&dev->config_lock);
+ dev->config_transport_disabled = false;
dev->config_driver_disabled = false;
dev->config_core_enabled = false;
dev->config_change_pending = false;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index be0beb16b487..072a25f6622c 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -151,6 +151,7 @@ struct virtio_device {
bool failed;
u8 config_core_enabled:1;
u8 config_driver_disabled:1;
+ u8 config_transport_disabled:1;
u8 config_change_pending:1;
spinlock_t config_lock;
spinlock_t vqs_list_lock;
@@ -185,6 +186,7 @@ void virtio_config_changed(struct virtio_device *dev);
void virtio_config_driver_disable(struct virtio_device *dev);
void virtio_config_driver_enable(struct virtio_device *dev);
+void virtio_config_transport_disable(struct virtio_device *dev);
#ifdef CONFIG_PM_SLEEP
int virtio_device_freeze(struct virtio_device *dev);
int virtio_device_restore(struct virtio_device *dev);
--
MST
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH RFC v5 5/5] virtio: support device disconnect
2025-07-09 20:55 [PATCH RFC v5 0/5] pci,virtio: report surprise removal event Michael S. Tsirkin
` (3 preceding siblings ...)
2025-07-09 20:55 ` [PATCH RFC v5 4/5] virtio: allow transports to suppress config change Michael S. Tsirkin
@ 2025-07-09 20:55 ` Michael S. Tsirkin
4 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-09 20:55 UTC (permalink / raw)
To: linux-kernel
Cc: Lukas Wunner, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, Jason Wang, Xuan Zhuo,
Eugenio Pérez
This adds support for device disconnect:
upon device surprise removal, virtio core makes sure
to no callbacks are running, and then notifies the driver.
At the moment, virtio pci is the only transport with this
functionality enabled, it does nothing for other transports.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/virtio/virtio_pci_common.c | 45 ++++++++++++++++++++++++++++++
drivers/virtio/virtio_pci_common.h | 3 ++
drivers/virtio/virtio_pci_legacy.c | 2 ++
drivers/virtio/virtio_pci_modern.c | 2 ++
include/linux/virtio.h | 3 ++
include/linux/virtio_config.h | 32 +++++++++++++++++++++
6 files changed, 87 insertions(+)
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index d6d79af44569..a475f47052eb 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -594,6 +594,51 @@ const struct cpumask *vp_get_vq_affinity(struct virtio_device *vdev, int index)
vp_dev->vqs[index]->msix_vector);
}
+/* Report disconnect to the driver. */
+static void virtio_pci_disconnect_work(struct work_struct *work)
+{
+ struct pci_dev *pci_dev = container_of(work, struct pci_dev,
+ disconnect_work);
+ struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+ struct virtio_device *vdev = &vp_dev->vdev;
+ struct virtio_driver *drv = drv_to_virtio(vdev->dev.driver);
+
+ if (!pci_test_and_clear_disconnect_enable(pci_dev))
+ return;
+
+ virtio_config_transport_disable(vdev);
+ virtio_break_device(vdev);
+
+ vp_synchronize_vectors(vdev);
+
+ drv->disconnect(&vp_dev->vdev);
+}
+
+void virtio_pci_enable_disconnect(struct virtio_device *vdev)
+{
+ struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+ struct pci_dev *pci_dev = vp_dev->pci_dev;
+ struct virtio_driver *drv = drv_to_virtio(vdev->dev.driver);
+
+ if (!drv->disconnect)
+ return;
+
+ INIT_WORK(&pci_dev->disconnect_work, virtio_pci_disconnect_work);
+ pci_set_disconnect_work(pci_dev);
+}
+
+void virtio_pci_disable_disconnect(struct virtio_device *vdev)
+{
+ struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+ struct pci_dev *pci_dev = vp_dev->pci_dev;
+ struct virtio_driver *drv = drv_to_virtio(vdev->dev.driver);
+
+ if (!drv->disconnect)
+ return;
+
+ pci_clear_disconnect_work(pci_dev);
+}
+
#ifdef CONFIG_PM_SLEEP
static int virtio_pci_freeze(struct device *dev)
{
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 8cd01de27baf..982c4c8aabc8 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -161,6 +161,9 @@ static inline void virtio_pci_legacy_remove(struct virtio_pci_device *vp_dev)
int virtio_pci_modern_probe(struct virtio_pci_device *);
void virtio_pci_modern_remove(struct virtio_pci_device *);
+void virtio_pci_enable_disconnect(struct virtio_device *);
+void virtio_pci_disable_disconnect(struct virtio_device *);
+
struct virtio_device *virtio_pci_vf_get_pf_dev(struct pci_dev *pdev);
#define VIRTIO_LEGACY_ADMIN_CMD_BITMAP \
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index d9cbb02b35a1..cd424f619b47 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -191,6 +191,8 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
.set = vp_set,
.get_status = vp_get_status,
.set_status = vp_set_status,
+ .enable_disconnect = virtio_pci_enable_disconnect,
+ .disable_disconnect = virtio_pci_disable_disconnect,
.reset = vp_reset,
.find_vqs = vp_find_vqs,
.del_vqs = vp_del_vqs,
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 7182f43ed055..b3dfb403913f 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -1230,6 +1230,8 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.generation = vp_generation,
.get_status = vp_get_status,
.set_status = vp_set_status,
+ .enable_disconnect = virtio_pci_enable_disconnect,
+ .disable_disconnect = virtio_pci_disable_disconnect,
.reset = vp_reset,
.find_vqs = vp_modern_find_vqs,
.del_vqs = vp_del_vqs,
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 072a25f6622c..a091651e3144 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -214,6 +214,8 @@ size_t virtio_max_dma_size(const struct virtio_device *vdev);
* @scan: optional function to call after successful probe; intended
* for virtio-scsi to invoke a scan.
* @remove: the function to call when a device is removed.
+ * @disconnect: the function to call on disconnect (surprise removal),
+ * before remove.
* @config_changed: optional function to call when the device configuration
* changes; may be called in interrupt context.
* @freeze: optional function to call during suspend/hibernation.
@@ -235,6 +237,7 @@ struct virtio_driver {
int (*validate)(struct virtio_device *dev);
int (*probe)(struct virtio_device *dev);
void (*scan)(struct virtio_device *dev);
+ void (*disconnect)(struct virtio_device *dev);
void (*remove)(struct virtio_device *dev);
void (*config_changed)(struct virtio_device *dev);
int (*freeze)(struct virtio_device *dev);
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index b3e1d30c765b..861198a74be2 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -58,6 +58,10 @@ struct virtqueue_info {
* @set_status: write the status byte
* vdev: the virtio_device
* status: the new status byte
+ * @enable_disconnect: driver will get disconnect callbacks
+ * vdev: the virtio_device
+ * @disable_disconnect: driver will not get disconnect callbacks
+ * vdev: the virtio_device
* @reset: reset the device
* vdev: the virtio device
* After this, status and feature negotiation must be done again
@@ -113,6 +117,8 @@ struct virtio_config_ops {
u32 (*generation)(struct virtio_device *vdev);
u8 (*get_status)(struct virtio_device *vdev);
void (*set_status)(struct virtio_device *vdev, u8 status);
+ void (*enable_disconnect)(struct virtio_device *vdev);
+ void (*disable_disconnect)(struct virtio_device *vdev);
void (*reset)(struct virtio_device *vdev);
int (*find_vqs)(struct virtio_device *vdev, unsigned int nvqs,
struct virtqueue *vqs[],
@@ -299,6 +305,32 @@ void virtio_device_ready(struct virtio_device *dev)
dev->config->set_status(dev, status | VIRTIO_CONFIG_S_DRIVER_OK);
}
+/**
+ * virtio_device_enable_disconnect - enable disconnect callback
+ * @dev: the virtio device
+ *
+ * Driver must call this in the probe function.
+ */
+static inline
+void virtio_device_enable_disconnect(struct virtio_device *dev)
+{
+ if (dev->config->enable_disconnect)
+ dev->config->enable_disconnect(dev);
+}
+
+/**
+ * virtio_device_disable_disconnect - enable disconnect callback
+ * @dev: the virtio device
+ *
+ * Driver must call this in the remove function.
+ */
+static inline
+void virtio_device_disable_disconnect(struct virtio_device *dev)
+{
+ if (dev->config->disable_disconnect)
+ dev->config->disable_disconnect(dev);
+}
+
static inline
const char *virtio_bus_name(struct virtio_device *vdev)
{
--
MST
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: " Michael S. Tsirkin
@ 2025-07-09 23:38 ` Bjorn Helgaas
2025-07-09 23:55 ` Keith Busch
2025-07-14 6:26 ` Michael S. Tsirkin
2025-07-14 6:11 ` Lukas Wunner
1 sibling, 2 replies; 22+ messages in thread
From: Bjorn Helgaas @ 2025-07-09 23:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
Housekeeping: Note subject line convention. Indent with spaces in
commit log. Remove spurious plus signs.
On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular remove
> callback is invoked, exclusively. This works well, because mostly, the
> cleanup would be the same.
>
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.
>
> For example, this was reported for virtio-blk:
>
> 1. the graceful removal is ongoing in the remove() callback, where disk
> deletion del_gendisk() is ongoing, which waits for the requests +to
> complete,
>
> 2. Now few requests are yet to complete, and surprise removal started.
>
> At this point, virtio block driver will not get notified by the driver
> core layer, because it is likely serializing remove() happening by
> +user/driver unload and PCI hotplug driver-initiated device removal. So
> vblk driver doesn't know that device is removed, block layer is waiting
> for requests completions to arrive which it never gets. So
> del_gendisk() gets stuck.
>
> Drivers can artificially add timeouts to handle that, but it can be
> flaky.
>
> Instead, let's add a way for the driver to be notified about the
> disconnect. It can then do any necessary cleanup, knowing that the
> device is inactive.
This relies on somebody (typically pciehp, I guess) calling
pci_dev_set_disconnected() when a surprise remove happens.
Do you think it would be practical for the driver's .remove() method
to recognize that the device may stop responding at any point, even if
no hotplug driver is present to call pci_dev_set_disconnected()?
Waiting forever for an interrupt seems kind of vulnerable in general.
Maybe "artificially adding timeouts" is alluding to *not* waiting
forever for interrupts? That doesn't seem artificial to me because
it's just a fact of life that devices can disappear at arbitrary
times.
It seems a little fragile to me to depend on some other part of the
system to notice the surprise removal and tell you about it or
schedule your work function. I think it would be more robust for the
driver to check directly, i.e., assume writes to the device may be
lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
never wait for an interrupt without a timeout.
> Since cleanups can take a long time, this takes an approach
> of a work struct that the driver initiates and enables
> on probe, and tears down on remove.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/pci/pci.h | 6 ++++++
> include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 51 insertions(+)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 12215ee72afb..3ca4ebfd46be 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> pci_doe_disconnected(dev);
>
> + if (READ_ONCE(dev->disconnect_work_enable)) {
> + /* Make sure work is up to date. */
> + smp_rmb();
> + schedule_work(&dev->disconnect_work);
> + }
> +
> return 0;
> }
>
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 05e68f35f392..723b17145b62 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -548,6 +548,10 @@ struct pci_dev {
> /* These methods index pci_reset_fn_methods[] */
> u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
>
> + /* Report disconnect events. 0x0 - disable, 0x1 - enable */
> + u8 disconnect_work_enable;
> + struct work_struct disconnect_work;
> +
> #ifdef CONFIG_PCIE_TPH
> u16 tph_cap; /* TPH capability offset */
> u8 tph_mode; /* TPH mode */
> @@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
> pci_select_bars(pdev, IORESOURCE_MEM));
> }
>
> +/*
> + * Run this first thing after getting a disconnect work, to prevent it from
> + * running multiple times.
> + * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
> + */
> +static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
> +{
> + u8 enable = 0x1;
> + u8 disable = 0x0;
> + return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
> +}
> +
> +/*
> + * Caller must initialize @pdev->disconnect_work before invoking this.
> + * The work function must run and check pci_test_and_clear_disconnect_enable.
> + * Note that device can go away right after this call.
> + */
> +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> +{
> + /* Make sure WQ has been initialized already */
> + smp_wmb();
> +
> + WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> +
> + /* check the device did not go away meanwhile. */
> + mb();
> +
> + if (!pci_device_is_present(pdev))
> + schedule_work(&pdev->disconnect_work);
> +}
> +
> +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> +{
> + WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> +
> + /* Make sure to stop using work from now on. */
> + smp_wmb();
> +
> + cancel_work_sync(&pdev->disconnect_work);
> +}
> +
> #else /* CONFIG_PCI is not enabled */
>
> static inline void pci_set_flags(int flags) { }
> --
> MST
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-09 23:38 ` Bjorn Helgaas
@ 2025-07-09 23:55 ` Keith Busch
2025-07-14 6:17 ` Michael S. Tsirkin
2025-07-14 6:26 ` Michael S. Tsirkin
1 sibling, 1 reply; 22+ messages in thread
From: Keith Busch @ 2025-07-09 23:55 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Michael S. Tsirkin, linux-kernel, Lukas Wunner, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> This relies on somebody (typically pciehp, I guess) calling
> pci_dev_set_disconnected() when a surprise remove happens.
>
> Do you think it would be practical for the driver's .remove() method
> to recognize that the device may stop responding at any point, even if
> no hotplug driver is present to call pci_dev_set_disconnected()?
>
> Waiting forever for an interrupt seems kind of vulnerable in general.
> Maybe "artificially adding timeouts" is alluding to *not* waiting
> forever for interrupts? That doesn't seem artificial to me because
> it's just a fact of life that devices can disappear at arbitrary
> times.
I totally agree here. Every driver's .remove() should be able to
guarantee forward progress some way. I put some work in blk-mq and nvme
to ensure that happens for those devices at least.
That "forward progress" can come slow though, maybe minutes, so we do
have opprotunisitic short cuts sprinkled about the driver. There are
still gaps when waiting for interrupt driven IO that need the longer
timeouts to trigger. It'd be cool if there was a mechansim to kick in
quicker, but this is still an uncommon exceptional condition, right?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: " Michael S. Tsirkin
2025-07-09 23:38 ` Bjorn Helgaas
@ 2025-07-14 6:11 ` Lukas Wunner
2025-07-14 6:18 ` Michael S. Tsirkin
` (2 more replies)
1 sibling, 3 replies; 22+ messages in thread
From: Lukas Wunner @ 2025-07-14 6:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular remove
> callback is invoked, exclusively. This works well, because mostly, the
> cleanup would be the same.
>
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.
For PCI devices in a hotplug slot, user space can initiate "safe removal"
by writing "0" to the hotplug slot's "power" file in sysfs.
If the PCI device is yanked from the slot while safe removal is ongoing,
there is likewise no way for the driver to know that the device is
suddenly gone. That's because pciehp_unconfigure_device() only calls
pci_dev_set_disconnected() in the surprise removal case, not for
safe removal.
The solution proposed here is thus not a complete one: It may work
if user space initiated *driver* removal, but not if it initiated *safe*
removal of the entire device. For virtio, that may be sufficient.
> +++ b/drivers/pci/pci.h
> @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> pci_doe_disconnected(dev);
>
> + if (READ_ONCE(dev->disconnect_work_enable)) {
> + /* Make sure work is up to date. */
> + smp_rmb();
> + schedule_work(&dev->disconnect_work);
> + }
> +
> return 0;
> }
Going through all the callers of pci_dev_set_disconnected(),
I suppose the (only) one you're interested in is
pciehp_unconfigure_device().
The other callers are related to runtime resume, resume from
system sleep and ACPI slots.
Instead of amending pci_dev_set_disconnected(), I'd prefer
an approach where pciehp_unconfigure_device() first marks
all devices disconnected, then wakes up some global waitqueue, e.g.:
- if (!presence)
+ if (!presence) {
pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+ wake_up_all(&pci_disconnected_wq);
+ }
The benefit is that there's no delay when marking devices disconnected.
(Granted, the delay is small for smp_rmb() + schedule_work().)
And just having a global waitqueue is simpler and may be useful
for other use cases.
So instead of adding timeouts when waiting for interrupts, drivers would
be woken via the waitqueue.
But again, it's not a complete solution as it doesn't cover the
"surprise removal during safe removal" case.
I also agree with Bjorn's and Keith's comments that the driver should
use timeouts for robustness, but still wanted to provide additional
(hopefully constructive) thoughts.
Thanks!
Lukas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-09 23:55 ` Keith Busch
@ 2025-07-14 6:17 ` Michael S. Tsirkin
0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14 6:17 UTC (permalink / raw)
To: Keith Busch
Cc: Bjorn Helgaas, linux-kernel, Lukas Wunner, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Wed, Jul 09, 2025 at 05:55:17PM -0600, Keith Busch wrote:
> On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > This relies on somebody (typically pciehp, I guess) calling
> > pci_dev_set_disconnected() when a surprise remove happens.
> >
> > Do you think it would be practical for the driver's .remove() method
> > to recognize that the device may stop responding at any point, even if
> > no hotplug driver is present to call pci_dev_set_disconnected()?
> >
> > Waiting forever for an interrupt seems kind of vulnerable in general.
> > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > forever for interrupts? That doesn't seem artificial to me because
> > it's just a fact of life that devices can disappear at arbitrary
> > times.
>
> I totally agree here. Every driver's .remove() should be able to
> guarantee forward progress some way. I put some work in blk-mq and nvme
> to ensure that happens for those devices at least.
>
> That "forward progress" can come slow though, maybe minutes, so we do
> have opprotunisitic short cuts sprinkled about the driver. There are
> still gaps when waiting for interrupt driven IO that need the longer
> timeouts to trigger. It'd be cool if there was a mechansim to kick in
> quicker, but this is still an uncommon exceptional condition, right?
It's uncommon, yes.
--
MST
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-14 6:11 ` Lukas Wunner
@ 2025-07-14 6:18 ` Michael S. Tsirkin
2025-07-14 6:54 ` Michael S. Tsirkin
2025-07-17 15:11 ` Michael S. Tsirkin
2 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14 6:18 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively. This works well, because mostly, the
> > cleanup would be the same.
> >
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
>
> For PCI devices in a hotplug slot, user space can initiate "safe removal"
> by writing "0" to the hotplug slot's "power" file in sysfs.
>
> If the PCI device is yanked from the slot while safe removal is ongoing,
> there is likewise no way for the driver to know that the device is
> suddenly gone. That's because pciehp_unconfigure_device() only calls
> pci_dev_set_disconnected() in the surprise removal case, not for
> safe removal.
>
> The solution proposed here is thus not a complete one: It may work
> if user space initiated *driver* removal, but not if it initiated *safe*
> removal of the entire device. For virtio, that may be sufficient.
>
> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> > pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > pci_doe_disconnected(dev);
> >
> > + if (READ_ONCE(dev->disconnect_work_enable)) {
> > + /* Make sure work is up to date. */
> > + smp_rmb();
> > + schedule_work(&dev->disconnect_work);
> > + }
> > +
> > return 0;
> > }
>
> Going through all the callers of pci_dev_set_disconnected(),
> I suppose the (only) one you're interested in is
> pciehp_unconfigure_device().
>
> The other callers are related to runtime resume, resume from
> system sleep and ACPI slots.
>
> Instead of amending pci_dev_set_disconnected(), I'd prefer
> an approach where pciehp_unconfigure_device() first marks
> all devices disconnected, then wakes up some global waitqueue, e.g.:
>
> - if (!presence)
> + if (!presence) {
> pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> + wake_up_all(&pci_disconnected_wq);
> + }
>
> The benefit is that there's no delay when marking devices disconnected.
> (Granted, the delay is small for smp_rmb() + schedule_work().)
> And just having a global waitqueue is simpler and may be useful
> for other use cases.
>
> So instead of adding timeouts when waiting for interrupts, drivers would
> be woken via the waitqueue.
>
> But again, it's not a complete solution as it doesn't cover the
> "surprise removal during safe removal" case.
Did not realize. Will look into addressing this, thanks!
> I also agree with Bjorn's and Keith's comments that the driver should
> use timeouts for robustness, but still wanted to provide additional
> (hopefully constructive) thoughts.
>
> Thanks!
>
> Lukas
I'll address these comments in the next version.
--
MST
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-09 23:38 ` Bjorn Helgaas
2025-07-09 23:55 ` Keith Busch
@ 2025-07-14 6:26 ` Michael S. Tsirkin
2025-07-14 21:13 ` Bjorn Helgaas
1 sibling, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14 6:26 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> Housekeeping: Note subject line convention. Indent with spaces in
> commit log. Remove spurious plus signs.
Thanks!
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively. This works well, because mostly, the
> > cleanup would be the same.
> >
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
> >
> > For example, this was reported for virtio-blk:
> >
> > 1. the graceful removal is ongoing in the remove() callback, where disk
> > deletion del_gendisk() is ongoing, which waits for the requests +to
> > complete,
> >
> > 2. Now few requests are yet to complete, and surprise removal started.
> >
> > At this point, virtio block driver will not get notified by the driver
> > core layer, because it is likely serializing remove() happening by
> > +user/driver unload and PCI hotplug driver-initiated device removal. So
> > vblk driver doesn't know that device is removed, block layer is waiting
> > for requests completions to arrive which it never gets. So
> > del_gendisk() gets stuck.
> >
> > Drivers can artificially add timeouts to handle that, but it can be
> > flaky.
> >
> > Instead, let's add a way for the driver to be notified about the
> > disconnect. It can then do any necessary cleanup, knowing that the
> > device is inactive.
>
> This relies on somebody (typically pciehp, I guess) calling
> pci_dev_set_disconnected() when a surprise remove happens.
>
> Do you think it would be practical for the driver's .remove() method
> to recognize that the device may stop responding at any point, even if
> no hotplug driver is present to call pci_dev_set_disconnected()?
>
> Waiting forever for an interrupt seems kind of vulnerable in general.
> Maybe "artificially adding timeouts" is alluding to *not* waiting
> forever for interrupts? That doesn't seem artificial to me because
> it's just a fact of life that devices can disappear at arbitrary
> times.
>
> It seems a little fragile to me to depend on some other part of the
> system to notice the surprise removal and tell you about it or
> schedule your work function. I think it would be more robust for the
> driver to check directly, i.e., assume writes to the device may be
> lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> never wait for an interrupt without a timeout.
virtio is ... kind of special, in that users already take it for
granted that having a device as long as they want to respond
still does not lead to errors and data loss.
Makes it a bit harder as our timeout would have to
check for presence and retry, we can't just fail as a
normal hardware device does.
And there's the overhead thing - poking at the device a lot
puts a high load on the host.
So I can imagine a very long timeout (minutes?), and then something like
the WQ I am trying to propose here as a shortcut.
> > Since cleanups can take a long time, this takes an approach
> > of a work struct that the driver initiates and enables
> > on probe, and tears down on remove.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > drivers/pci/pci.h | 6 ++++++
> > include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 51 insertions(+)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 12215ee72afb..3ca4ebfd46be 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> > pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > pci_doe_disconnected(dev);
> >
> > + if (READ_ONCE(dev->disconnect_work_enable)) {
> > + /* Make sure work is up to date. */
> > + smp_rmb();
> > + schedule_work(&dev->disconnect_work);
> > + }
> > +
> > return 0;
> > }
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 05e68f35f392..723b17145b62 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -548,6 +548,10 @@ struct pci_dev {
> > /* These methods index pci_reset_fn_methods[] */
> > u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> >
> > + /* Report disconnect events. 0x0 - disable, 0x1 - enable */
> > + u8 disconnect_work_enable;
> > + struct work_struct disconnect_work;
> > +
> > #ifdef CONFIG_PCIE_TPH
> > u16 tph_cap; /* TPH capability offset */
> > u8 tph_mode; /* TPH mode */
> > @@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
> > pci_select_bars(pdev, IORESOURCE_MEM));
> > }
> >
> > +/*
> > + * Run this first thing after getting a disconnect work, to prevent it from
> > + * running multiple times.
> > + * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
> > + */
> > +static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
> > +{
> > + u8 enable = 0x1;
> > + u8 disable = 0x0;
> > + return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
> > +}
> > +
> > +/*
> > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > + * The work function must run and check pci_test_and_clear_disconnect_enable.
> > + * Note that device can go away right after this call.
> > + */
> > +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> > +{
> > + /* Make sure WQ has been initialized already */
> > + smp_wmb();
> > +
> > + WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> > +
> > + /* check the device did not go away meanwhile. */
> > + mb();
> > +
> > + if (!pci_device_is_present(pdev))
> > + schedule_work(&pdev->disconnect_work);
> > +}
> > +
> > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> > +{
> > + WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > +
> > + /* Make sure to stop using work from now on. */
> > + smp_wmb();
> > +
> > + cancel_work_sync(&pdev->disconnect_work);
> > +}
> > +
> > #else /* CONFIG_PCI is not enabled */
> >
> > static inline void pci_set_flags(int flags) { }
> > --
> > MST
> >
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-14 6:11 ` Lukas Wunner
2025-07-14 6:18 ` Michael S. Tsirkin
@ 2025-07-14 6:54 ` Michael S. Tsirkin
2025-07-17 15:11 ` Michael S. Tsirkin
2 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-14 6:54 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively. This works well, because mostly, the
> > cleanup would be the same.
> >
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
>
> For PCI devices in a hotplug slot, user space can initiate "safe removal"
> by writing "0" to the hotplug slot's "power" file in sysfs.
>
> If the PCI device is yanked from the slot while safe removal is ongoing,
> there is likewise no way for the driver to know that the device is
> suddenly gone. That's because pciehp_unconfigure_device() only calls
> pci_dev_set_disconnected() in the surprise removal case, not for
> safe removal.
>
> The solution proposed here is thus not a complete one: It may work
> if user space initiated *driver* removal, but not if it initiated *safe*
> removal of the entire device. For virtio, that may be sufficient.
No, I just missed this corner case.
> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> > pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > pci_doe_disconnected(dev);
> >
> > + if (READ_ONCE(dev->disconnect_work_enable)) {
> > + /* Make sure work is up to date. */
> > + smp_rmb();
> > + schedule_work(&dev->disconnect_work);
> > + }
> > +
> > return 0;
> > }
>
> Going through all the callers of pci_dev_set_disconnected(),
> I suppose the (only) one you're interested in is
> pciehp_unconfigure_device().
>
> The other callers are related to runtime resume, resume from
> system sleep and ACPI slots.
>
> Instead of amending pci_dev_set_disconnected(), I'd prefer
> an approach where pciehp_unconfigure_device() first marks
> all devices disconnected, then wakes up some global waitqueue, e.g.:
>
> - if (!presence)
> + if (!presence) {
> pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> + wake_up_all(&pci_disconnected_wq);
> + }
>
> The benefit is that there's no delay when marking devices disconnected.
> (Granted, the delay is small for smp_rmb() + schedule_work().)
> And just having a global waitqueue is simpler and may be useful
> for other use cases.
>
> So instead of adding timeouts when waiting for interrupts, drivers would
> be woken via the waitqueue.
>
> But again, it's not a complete solution as it doesn't cover the
> "surprise removal during safe removal" case.
>
> I also agree with Bjorn's and Keith's comments that the driver should
> use timeouts for robustness,
Yes - we can consider this an optimization, as robust timeouts
are by necessity minutes.
> but still wanted to provide additional
> (hopefully constructive) thoughts.
>
> Thanks!
>
> Lukas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-14 6:26 ` Michael S. Tsirkin
@ 2025-07-14 21:13 ` Bjorn Helgaas
2025-07-15 6:28 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2025-07-14 21:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > At the moment, in case of a surprise removal, the regular remove
> > > callback is invoked, exclusively. This works well, because mostly, the
> > > cleanup would be the same.
> > >
> > > However, there's a race: imagine device removal was initiated by a user
> > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > is now waiting for an interrupt from the device. If the device is now
> > > surprise-removed, that never arrives and the remove callback hangs
> > > forever.
> > >
> > > For example, this was reported for virtio-blk:
> > >
> > > 1. the graceful removal is ongoing in the remove() callback, where disk
> > > deletion del_gendisk() is ongoing, which waits for the requests +to
> > > complete,
> > >
> > > 2. Now few requests are yet to complete, and surprise removal started.
> > >
> > > At this point, virtio block driver will not get notified by the driver
> > > core layer, because it is likely serializing remove() happening by
> > > +user/driver unload and PCI hotplug driver-initiated device removal. So
> > > vblk driver doesn't know that device is removed, block layer is waiting
> > > for requests completions to arrive which it never gets. So
> > > del_gendisk() gets stuck.
> > >
> > > Drivers can artificially add timeouts to handle that, but it can be
> > > flaky.
> > >
> > > Instead, let's add a way for the driver to be notified about the
> > > disconnect. It can then do any necessary cleanup, knowing that the
> > > device is inactive.
> >
> > This relies on somebody (typically pciehp, I guess) calling
> > pci_dev_set_disconnected() when a surprise remove happens.
> >
> > Do you think it would be practical for the driver's .remove() method
> > to recognize that the device may stop responding at any point, even if
> > no hotplug driver is present to call pci_dev_set_disconnected()?
> >
> > Waiting forever for an interrupt seems kind of vulnerable in general.
> > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > forever for interrupts? That doesn't seem artificial to me because
> > it's just a fact of life that devices can disappear at arbitrary
> > times.
> >
> > It seems a little fragile to me to depend on some other part of the
> > system to notice the surprise removal and tell you about it or
> > schedule your work function. I think it would be more robust for the
> > driver to check directly, i.e., assume writes to the device may be
> > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > never wait for an interrupt without a timeout.
>
> virtio is ... kind of special, in that users already take it for
> granted that having a device as long as they want to respond
> still does not lead to errors and data loss.
>
> Makes it a bit harder as our timeout would have to
> check for presence and retry, we can't just fail as a
> normal hardware device does.
Sorry, I don't know enough about virtio to follow what you said about
"having a device as long as they want to respond".
We started with a graceful remove. That must mean the user no longer
needs the device.
> And there's the overhead thing - poking at the device a lot
> puts a high load on the host.
Checking for PCI_POSSIBLE_ERROR() doesn't touch the device. If you
did a config read already, and the result happened to be ~0, *then* we
have the problem of figuring out whether the actual data from the
device was ~0, or if the read failed and the Root Complex synthesized
the ~0. In many cases a driver knows that ~0 is not a possible
register value. Otherwise it might have to read another register that
is known not to be ~0.
Bjorn
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-14 21:13 ` Bjorn Helgaas
@ 2025-07-15 6:28 ` Michael S. Tsirkin
2025-07-16 22:29 ` Bjorn Helgaas
0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-15 6:28 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > At the moment, in case of a surprise removal, the regular remove
> > > > callback is invoked, exclusively. This works well, because mostly, the
> > > > cleanup would be the same.
> > > >
> > > > However, there's a race: imagine device removal was initiated by a user
> > > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > > is now waiting for an interrupt from the device. If the device is now
> > > > surprise-removed, that never arrives and the remove callback hangs
> > > > forever.
> > > >
> > > > For example, this was reported for virtio-blk:
> > > >
> > > > 1. the graceful removal is ongoing in the remove() callback, where disk
> > > > deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > complete,
> > > >
> > > > 2. Now few requests are yet to complete, and surprise removal started.
> > > >
> > > > At this point, virtio block driver will not get notified by the driver
> > > > core layer, because it is likely serializing remove() happening by
> > > > +user/driver unload and PCI hotplug driver-initiated device removal. So
> > > > vblk driver doesn't know that device is removed, block layer is waiting
> > > > for requests completions to arrive which it never gets. So
> > > > del_gendisk() gets stuck.
> > > >
> > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > flaky.
> > > >
> > > > Instead, let's add a way for the driver to be notified about the
> > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > device is inactive.
> > >
> > > This relies on somebody (typically pciehp, I guess) calling
> > > pci_dev_set_disconnected() when a surprise remove happens.
> > >
> > > Do you think it would be practical for the driver's .remove() method
> > > to recognize that the device may stop responding at any point, even if
> > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > >
> > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > forever for interrupts? That doesn't seem artificial to me because
> > > it's just a fact of life that devices can disappear at arbitrary
> > > times.
> > >
> > > It seems a little fragile to me to depend on some other part of the
> > > system to notice the surprise removal and tell you about it or
> > > schedule your work function. I think it would be more robust for the
> > > driver to check directly, i.e., assume writes to the device may be
> > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > never wait for an interrupt without a timeout.
> >
> > virtio is ... kind of special, in that users already take it for
> > granted that having a device as long as they want to respond
> > still does not lead to errors and data loss.
> >
> > Makes it a bit harder as our timeout would have to
> > check for presence and retry, we can't just fail as a
> > normal hardware device does.
>
> Sorry, I don't know enough about virtio to follow what you said about
> "having a device as long as they want to respond".
>
> We started with a graceful remove. That must mean the user no longer
> needs the device.
I'll try to clarify:
Indeed, the user will not submit new requests,
but users might also not know that there are some old requests
being in progress of being processed by the device.
The driver, currently, waits for that processsing to be complete.
Cancelling that with a reset on a timeout might be a regression,
unless the timeout is very long.
Hope this makes it clear.
> > And there's the overhead thing - poking at the device a lot
> > puts a high load on the host.
>
> Checking for PCI_POSSIBLE_ERROR() doesn't touch the device. If you
> did a config read already, and the result happened to be ~0, *then* we
> have the problem of figuring out whether the actual data from the
> device was ~0, or if the read failed and the Root Complex synthesized
> the ~0. In many cases a driver knows that ~0 is not a possible
> register value. Otherwise it might have to read another register that
> is known not to be ~0.
>
> Bjorn
To clarify, virtio generally is designed to operate solely
by means of DMA and interrupt, completely avoiding any PCI
reads. This, due to PCI reads being very expensive in virtualized
scenarios.
The extra overhead I refer to is exactly initiating such a read
where there would not be one in normal operation.
--
MST
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-15 6:28 ` Michael S. Tsirkin
@ 2025-07-16 22:29 ` Bjorn Helgaas
2025-07-17 15:15 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2025-07-16 22:29 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Tue, Jul 15, 2025 at 02:28:20AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> > On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > > At the moment, in case of a surprise removal, the regular
> > > > > remove callback is invoked, exclusively. This works well,
> > > > > because mostly, the cleanup would be the same.
> > > > >
> > > > > However, there's a race: imagine device removal was
> > > > > initiated by a user action, such as driver unbind, and it in
> > > > > turn initiated some cleanup and is now waiting for an
> > > > > interrupt from the device. If the device is now
> > > > > surprise-removed, that never arrives and the remove callback
> > > > > hangs forever.
> > > > >
> > > > > For example, this was reported for virtio-blk:
> > > > >
> > > > > 1. the graceful removal is ongoing in the remove() callback, where disk
> > > > > deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > complete,
> > > > >
> > > > > 2. Now few requests are yet to complete, and surprise removal started.
> > > > >
> > > > > At this point, virtio block driver will not get notified by the driver
> > > > > core layer, because it is likely serializing remove() happening by
> > > > > +user/driver unload and PCI hotplug driver-initiated device removal. So
> > > > > vblk driver doesn't know that device is removed, block layer is waiting
> > > > > for requests completions to arrive which it never gets. So
> > > > > del_gendisk() gets stuck.
> > > > >
> > > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > > flaky.
> > > > >
> > > > > Instead, let's add a way for the driver to be notified about the
> > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > device is inactive.
> > > >
> > > > This relies on somebody (typically pciehp, I guess) calling
> > > > pci_dev_set_disconnected() when a surprise remove happens.
> > > >
> > > > Do you think it would be practical for the driver's .remove() method
> > > > to recognize that the device may stop responding at any point, even if
> > > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > >
> > > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > > forever for interrupts? That doesn't seem artificial to me because
> > > > it's just a fact of life that devices can disappear at arbitrary
> > > > times.
> > > >
> > > > It seems a little fragile to me to depend on some other part of the
> > > > system to notice the surprise removal and tell you about it or
> > > > schedule your work function. I think it would be more robust for the
> > > > driver to check directly, i.e., assume writes to the device may be
> > > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > > never wait for an interrupt without a timeout.
> > >
> > > virtio is ... kind of special, in that users already take it for
> > > granted that having a device as long as they want to respond
> > > still does not lead to errors and data loss.
> > >
> > > Makes it a bit harder as our timeout would have to
> > > check for presence and retry, we can't just fail as a
> > > normal hardware device does.
> >
> > Sorry, I don't know enough about virtio to follow what you said about
> > "having a device as long as they want to respond".
> >
> > We started with a graceful remove. That must mean the user no longer
> > needs the device.
>
> I'll try to clarify:
>
> Indeed, the user will not submit new requests,
> but users might also not know that there are some old requests
> being in progress of being processed by the device.
> The driver, currently, waits for that processsing to be complete.
> Cancelling that with a reset on a timeout might be a regression,
> unless the timeout is very long.
This seems like a corner case and maybe rare enough that simply making
the timeout very long would be a possibility.
> > > And there's the overhead thing - poking at the device a lot
> > > puts a high load on the host.
> >
> > Checking for PCI_POSSIBLE_ERROR() doesn't touch the device. If you
> > did a config read already, and the result happened to be ~0, *then* we
> > have the problem of figuring out whether the actual data from the
> > device was ~0, or if the read failed and the Root Complex synthesized
> > the ~0. In many cases a driver knows that ~0 is not a possible
> > register value. Otherwise it might have to read another register that
> > is known not to be ~0.
>
> To clarify, virtio generally is designed to operate solely
> by means of DMA and interrupt, completely avoiding any PCI
> reads. This, due to PCI reads being very expensive in virtualized
> scenarios.
>
> The extra overhead I refer to is exactly initiating such a read
> where there would not be one in normal operation.
Thanks, this part is very helpful. And since config accesses are very
expensive in *all* environments, I expect most drivers for
high-performance devices work the same way and only do config accesses
during at probe time.
If that's true, it will make this more understandable if the commit
log approaches it from that direction and omits virtio specifics.
Bjorn
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-14 6:11 ` Lukas Wunner
2025-07-14 6:18 ` Michael S. Tsirkin
2025-07-14 6:54 ` Michael S. Tsirkin
@ 2025-07-17 15:11 ` Michael S. Tsirkin
2025-07-17 20:12 ` Lukas Wunner
2 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-17 15:11 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively. This works well, because mostly, the
> > cleanup would be the same.
> >
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
>
> For PCI devices in a hotplug slot, user space can initiate "safe removal"
> by writing "0" to the hotplug slot's "power" file in sysfs.
>
> If the PCI device is yanked from the slot while safe removal is ongoing,
> there is likewise no way for the driver to know that the device is
> suddenly gone. That's because pciehp_unconfigure_device() only calls
> pci_dev_set_disconnected() in the surprise removal case, not for
> safe removal.
>
> The solution proposed here is thus not a complete one: It may work
> if user space initiated *driver* removal, but not if it initiated *safe*
> removal of the entire device. For virtio, that may be sufficient.
So just as an idea, something like this can work I guess? I'm yet to
test this - wrote this on the go - and also I'll need to implement for
other hotplug drivers, I need it at least for ACPI additonally.
WDYT?
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index bcc938d4420f..46468a1f0244 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -231,6 +231,15 @@ void pciehp_handle_disable_request(struct controller *ctrl)
void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
{
int present, link_active;
+ /*
+ * Always mark downstream devices disconnected on Presence Detect Change.
+ * Covers device yanked during safe removal.
+ */
+ if (events & PCI_EXP_SLTSTA_PDC) {
+ struct pci_bus *parent = ctrl->pcie->port->subordinate;
+ if (parent)
+ pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+ }
/*
* If the slot is on and presence or link has changed, turn it off.
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-16 22:29 ` Bjorn Helgaas
@ 2025-07-17 15:15 ` Michael S. Tsirkin
0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-17 15:15 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-kernel, Lukas Wunner, Keith Busch, Bjorn Helgaas,
Parav Pandit, virtualization, stefanha, alok.a.tiwari, linux-pci
On Wed, Jul 16, 2025 at 05:29:00PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 15, 2025 at 02:28:20AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > > > At the moment, in case of a surprise removal, the regular
> > > > > > remove callback is invoked, exclusively. This works well,
> > > > > > because mostly, the cleanup would be the same.
> > > > > >
> > > > > > However, there's a race: imagine device removal was
> > > > > > initiated by a user action, such as driver unbind, and it in
> > > > > > turn initiated some cleanup and is now waiting for an
> > > > > > interrupt from the device. If the device is now
> > > > > > surprise-removed, that never arrives and the remove callback
> > > > > > hangs forever.
> > > > > >
> > > > > > For example, this was reported for virtio-blk:
> > > > > >
> > > > > > 1. the graceful removal is ongoing in the remove() callback, where disk
> > > > > > deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > > complete,
> > > > > >
> > > > > > 2. Now few requests are yet to complete, and surprise removal started.
> > > > > >
> > > > > > At this point, virtio block driver will not get notified by the driver
> > > > > > core layer, because it is likely serializing remove() happening by
> > > > > > +user/driver unload and PCI hotplug driver-initiated device removal. So
> > > > > > vblk driver doesn't know that device is removed, block layer is waiting
> > > > > > for requests completions to arrive which it never gets. So
> > > > > > del_gendisk() gets stuck.
> > > > > >
> > > > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > > > flaky.
> > > > > >
> > > > > > Instead, let's add a way for the driver to be notified about the
> > > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > > device is inactive.
> > > > >
> > > > > This relies on somebody (typically pciehp, I guess) calling
> > > > > pci_dev_set_disconnected() when a surprise remove happens.
> > > > >
> > > > > Do you think it would be practical for the driver's .remove() method
> > > > > to recognize that the device may stop responding at any point, even if
> > > > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > > >
> > > > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > > > forever for interrupts? That doesn't seem artificial to me because
> > > > > it's just a fact of life that devices can disappear at arbitrary
> > > > > times.
> > > > >
> > > > > It seems a little fragile to me to depend on some other part of the
> > > > > system to notice the surprise removal and tell you about it or
> > > > > schedule your work function. I think it would be more robust for the
> > > > > driver to check directly, i.e., assume writes to the device may be
> > > > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > > > never wait for an interrupt without a timeout.
> > > >
> > > > virtio is ... kind of special, in that users already take it for
> > > > granted that having a device as long as they want to respond
> > > > still does not lead to errors and data loss.
> > > >
> > > > Makes it a bit harder as our timeout would have to
> > > > check for presence and retry, we can't just fail as a
> > > > normal hardware device does.
> > >
> > > Sorry, I don't know enough about virtio to follow what you said about
> > > "having a device as long as they want to respond".
> > >
> > > We started with a graceful remove. That must mean the user no longer
> > > needs the device.
> >
> > I'll try to clarify:
> >
> > Indeed, the user will not submit new requests,
> > but users might also not know that there are some old requests
> > being in progress of being processed by the device.
> > The driver, currently, waits for that processsing to be complete.
> > Cancelling that with a reset on a timeout might be a regression,
> > unless the timeout is very long.
>
> This seems like a corner case and maybe rare enough that simply making
> the timeout very long would be a possibility.
Indeed the timeout needs to be very long, and the average would still be
reasonable, but the worst case is terrible and the user can't insert a
replacement card all this time. The system is perceived as flaky.
> > > > And there's the overhead thing - poking at the device a lot
> > > > puts a high load on the host.
> > >
> > > Checking for PCI_POSSIBLE_ERROR() doesn't touch the device. If you
> > > did a config read already, and the result happened to be ~0, *then* we
> > > have the problem of figuring out whether the actual data from the
> > > device was ~0, or if the read failed and the Root Complex synthesized
> > > the ~0. In many cases a driver knows that ~0 is not a possible
> > > register value. Otherwise it might have to read another register that
> > > is known not to be ~0.
> >
> > To clarify, virtio generally is designed to operate solely
> > by means of DMA and interrupt, completely avoiding any PCI
> > reads. This, due to PCI reads being very expensive in virtualized
> > scenarios.
> >
> > The extra overhead I refer to is exactly initiating such a read
> > where there would not be one in normal operation.
>
> Thanks, this part is very helpful. And since config accesses are very
> expensive in *all* environments, I expect most drivers for
> high-performance devices work the same way and only do config accesses
> during at probe time.
>
> If that's true, it will make this more understandable if the commit
> log approaches it from that direction and omits virtio specifics.
>
> Bjorn
Will do, thanks a lot!
--
MST
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-17 15:11 ` Michael S. Tsirkin
@ 2025-07-17 20:12 ` Lukas Wunner
2025-07-17 23:31 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: Lukas Wunner @ 2025-07-17 20:12 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Thu, Jul 17, 2025 at 11:11:44AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > At the moment, in case of a surprise removal, the regular remove
> > > callback is invoked, exclusively. This works well, because mostly, the
> > > cleanup would be the same.
> > >
> > > However, there's a race: imagine device removal was initiated by a user
> > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > is now waiting for an interrupt from the device. If the device is now
> > > surprise-removed, that never arrives and the remove callback hangs
> > > forever.
> >
> > For PCI devices in a hotplug slot, user space can initiate "safe removal"
> > by writing "0" to the hotplug slot's "power" file in sysfs.
> >
> > If the PCI device is yanked from the slot while safe removal is ongoing,
> > there is likewise no way for the driver to know that the device is
> > suddenly gone. That's because pciehp_unconfigure_device() only calls
> > pci_dev_set_disconnected() in the surprise removal case, not for
> > safe removal.
> >
> > The solution proposed here is thus not a complete one: It may work
> > if user space initiated *driver* removal, but not if it initiated *safe*
> > removal of the entire device. For virtio, that may be sufficient.
>
> So just as an idea, something like this can work I guess? I'm yet to
> test this - wrote this on the go -
Don't bother, it won't work:
pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
the IRQ thread. During safe removal the IRQ thread is busy in
pciehp_unconfigure_device() and waiting for the driver to unbind
from devices being safe-removed.
An IRQ thread is always single-threaded. There's no second instance
of the IRQ thread being run when another interrupt is signaled.
Rather, the IRQ thread is re-run when it has finished.
In *theory* what would be possible is to plumb this into pciehp_isr().
That's the hardirq handler. This one will indeed be run when an
interrupt comes in while the IRQ thread is running. Normally the
hardirq handler would just collect the events for later consumption
by the IRQ thread. The hardirq handler could *theoretically* mark
devices gone while they're being safe-removed.
I'm saying "theoretically" because in reality I don't think this is
a viable approach either: pciehp_ist() contains code to *ignore*
link or presence changes if they were caused by a Secondary Bus Reset
or Downstream Port Containment. In that case we do *not* want to mark
devices disconnected because they're only *temporarily* inaccessible.
This requires waiting for the SBR or DPC to conclude, which can take
several seconds. We can't wait in the hardirq handler.
So this cannot be solved with the current architecture of pciehp,
at least not easily or in an elegant way. Sorry!
Thanks,
Lukas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-17 20:12 ` Lukas Wunner
@ 2025-07-17 23:31 ` Michael S. Tsirkin
2025-07-18 4:35 ` Lukas Wunner
0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-17 23:31 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> On Thu, Jul 17, 2025 at 11:11:44AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > At the moment, in case of a surprise removal, the regular remove
> > > > callback is invoked, exclusively. This works well, because mostly, the
> > > > cleanup would be the same.
> > > >
> > > > However, there's a race: imagine device removal was initiated by a user
> > > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > > is now waiting for an interrupt from the device. If the device is now
> > > > surprise-removed, that never arrives and the remove callback hangs
> > > > forever.
> > >
> > > For PCI devices in a hotplug slot, user space can initiate "safe removal"
> > > by writing "0" to the hotplug slot's "power" file in sysfs.
> > >
> > > If the PCI device is yanked from the slot while safe removal is ongoing,
> > > there is likewise no way for the driver to know that the device is
> > > suddenly gone. That's because pciehp_unconfigure_device() only calls
> > > pci_dev_set_disconnected() in the surprise removal case, not for
> > > safe removal.
> > >
> > > The solution proposed here is thus not a complete one: It may work
> > > if user space initiated *driver* removal, but not if it initiated *safe*
> > > removal of the entire device. For virtio, that may be sufficient.
> >
> > So just as an idea, something like this can work I guess? I'm yet to
> > test this - wrote this on the go -
>
> Don't bother, it won't work:
>
> pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> the IRQ thread. During safe removal the IRQ thread is busy in
> pciehp_unconfigure_device() and waiting for the driver to unbind
> from devices being safe-removed.
Confused. I thought safe removal happens in the userspace thread
that wrote into sysfs?
> An IRQ thread is always single-threaded. There's no second instance
> of the IRQ thread being run when another interrupt is signaled.
> Rather, the IRQ thread is re-run when it has finished.
>
> In *theory* what would be possible is to plumb this into pciehp_isr().
> That's the hardirq handler. This one will indeed be run when an
> interrupt comes in while the IRQ thread is running. Normally the
> hardirq handler would just collect the events for later consumption
> by the IRQ thread. The hardirq handler could *theoretically* mark
> devices gone while they're being safe-removed.
>
> I'm saying "theoretically" because in reality I don't think this is
> a viable approach either: pciehp_ist() contains code to *ignore*
> link or presence changes if they were caused by a Secondary Bus Reset
> or Downstream Port Containment. In that case we do *not* want to mark
> devices disconnected because they're only *temporarily* inaccessible.
> This requires waiting for the SBR or DPC to conclude, which can take
> several seconds. We can't wait in the hardirq handler.
>
> So this cannot be solved with the current architecture of pciehp,
> at least not easily or in an elegant way. Sorry!
>
> Thanks,
>
> Lukas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-17 23:31 ` Michael S. Tsirkin
@ 2025-07-18 4:35 ` Lukas Wunner
2025-07-18 8:40 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: Lukas Wunner @ 2025-07-18 4:35 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Thu, Jul 17, 2025 at 07:31:57PM -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> > pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> > the IRQ thread. During safe removal the IRQ thread is busy in
> > pciehp_unconfigure_device() and waiting for the driver to unbind
> > from devices being safe-removed.
>
> Confused. I thought safe removal happens in the userspace thread
> that wrote into sysfs?
No, the userspace thread synthesizes a DISABLE_SLOT event,
calls irq_wake_thread(), then waits for the IRQ thread to
finish handling that event. See pciehp_sysfs_disable_slot().
Until 2018 we indeed brought down the slot in the userspace
thread, but that required locking between the workqueue fed
by the interrupt handler on the one hand and the userspace
thread on the other hand. It was difficult to reason about
the code.
We had bug reports about slots flapping the link or presence
bits on slot bringdown that we could easily address by handling
everything in the IRQ thread, see 3943af9d01e9. The same was
reported for slot bringup and addressed by 6c35a1ac3da6.
This wouldn't have been possible with the architecture prior
to 2018, at least not this easily.
Thanks,
Lukas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC v5 1/5] pci: report surprise removal event
2025-07-18 4:35 ` Lukas Wunner
@ 2025-07-18 8:40 ` Michael S. Tsirkin
0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2025-07-18 8:40 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-kernel, Keith Busch, Bjorn Helgaas, Parav Pandit,
virtualization, stefanha, alok.a.tiwari, linux-pci
On Fri, Jul 18, 2025 at 06:35:56AM +0200, Lukas Wunner wrote:
> On Thu, Jul 17, 2025 at 07:31:57PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> > > pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> > > the IRQ thread. During safe removal the IRQ thread is busy in
> > > pciehp_unconfigure_device() and waiting for the driver to unbind
> > > from devices being safe-removed.
> >
> > Confused. I thought safe removal happens in the userspace thread
> > that wrote into sysfs?
>
> No, the userspace thread synthesizes a DISABLE_SLOT event,
> calls irq_wake_thread(), then waits for the IRQ thread to
> finish handling that event. See pciehp_sysfs_disable_slot().
>
> Until 2018 we indeed brought down the slot in the userspace
> thread, but that required locking between the workqueue fed
> by the interrupt handler on the one hand and the userspace
> thread on the other hand. It was difficult to reason about
> the code.
>
> We had bug reports about slots flapping the link or presence
> bits on slot bringdown that we could easily address by handling
> everything in the IRQ thread, see 3943af9d01e9. The same was
> reported for slot bringup and addressed by 6c35a1ac3da6.
>
> This wouldn't have been possible with the architecture prior
> to 2018, at least not this easily.
>
> Thanks,
>
> Lukas
Got it, thanks!
--
MST
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-07-18 8:40 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-09 20:55 [PATCH RFC v5 0/5] pci,virtio: report surprise removal event Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 1/5] pci: " Michael S. Tsirkin
2025-07-09 23:38 ` Bjorn Helgaas
2025-07-09 23:55 ` Keith Busch
2025-07-14 6:17 ` Michael S. Tsirkin
2025-07-14 6:26 ` Michael S. Tsirkin
2025-07-14 21:13 ` Bjorn Helgaas
2025-07-15 6:28 ` Michael S. Tsirkin
2025-07-16 22:29 ` Bjorn Helgaas
2025-07-17 15:15 ` Michael S. Tsirkin
2025-07-14 6:11 ` Lukas Wunner
2025-07-14 6:18 ` Michael S. Tsirkin
2025-07-14 6:54 ` Michael S. Tsirkin
2025-07-17 15:11 ` Michael S. Tsirkin
2025-07-17 20:12 ` Lukas Wunner
2025-07-17 23:31 ` Michael S. Tsirkin
2025-07-18 4:35 ` Lukas Wunner
2025-07-18 8:40 ` Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 2/5] virtio: fix comments, readability Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 3/5] virtio: pack config changed flags Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 4/5] virtio: allow transports to suppress config change Michael S. Tsirkin
2025-07-09 20:55 ` [PATCH RFC v5 5/5] virtio: support device disconnect Michael S. Tsirkin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).