* [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x
@ 2026-02-17 18:22 Farhan Ali
2026-02-17 18:22 ` [PATCH v9 1/9] PCI: Allow per function PCI slots Farhan Ali
` (8 more replies)
0 siblings, 9 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato
Hi,
This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x).
Background
----------
For PCI devices on s390x an operating system receives platform specific
error events from firmware rather than through AER.Today for
passthrough/userspace devices, we don't attempt any error recovery and
ignore any error events for the devices. The passthrough/userspace devices
are managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected), and on an error trigger an eventfd to
userspace. But we need a mechanism to notify userspace
(QEMU/guest/userspace drivers) about the error event.
Proposal
--------
We can expose this error information (currently only the PCI Error Code)
via a device feature. Userspace can then obtain the error information
via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving
a device reset.
This is how a typical flow for passthrough devices to a VM would work:
For passthrough devices to a VM, the driver bound to the device on the host
is vfio-pci. vfio-pci driver does support the error_detected() callback
(vfio_pci_core_aer_err_detected()), and on an PCI error s390x recovery
code on the host will call the vfio-pci error_detected() callback. The
vfio-pci error_detected() callback will notify userspace/QEMU via an
eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x
error recovery on the host will skip any further action(see patch 6) and
let userspace drive the error recovery.
Once userspace/QEMU is notified, it then injects this error into the VM
so device drivers in the VM can take recovery actions. For example for a
passthrough NVMe device, the VM's OS NVMe driver will access the device.
At this point the VM's NVMe driver's error_detected() will drive the
recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error
recovery in the VM's OS will try to do a reset. Resets are privileged
operations and so the VM will need intervention from QEMU to perform the
reset. QEMU will invoke the VFIO_DEVICE_RESET ioctl to now notify the
host that the VM is requesting a reset of the device. The vfio-pci driver
on the host will then perform the reset on the device to recover it.
Thanks
Farhan
ChangeLog
---------
v8 series https://lore.kernel.org/all/20260122194437.1903-1-alifm@linux.ibm.com/
v8 -> v9
- Avoid saving PCI config space state in reset path (patch 3) (suggested by Bjorn)
- Add explicit version to struct vfio_device_feature_zpci_err (patch 7).
- Rebase on 6.19
v7 series https://lore.kernel.org/all/20260107183217.1365-1-alifm@linux.ibm.com/
v7 -> v8
- Rebase on 6.19-rc4
- Address feedback from Niklas and Julien.
v6 series https://lore.kernel.org/all/2c609e61-1861-4bf3-b019-a11c137d26a5@linux.ibm.com/
v6 -> v7
- Rebase on 6.19-rc4
- Update commit message based on Niklas's suggestion (patch 3).
v5 series https://lore.kernel.org/all/20251113183502.2388-1-alifm@linux.ibm.com/
v5 -> v6
- Rebase on 6.18 + Lukas's PCI: Universal error recoverability of
devices series (https://lore.kernel.org/all/cover.1763483367.git.lukas@wunner.de/)
- Re-work config space accessibility check to pci_dev_save_and_disable() (patch 3).
This avoids saving the config space, in the reset path, if the device's config space is
corrupted or inaccessible.
v4 series https://lore.kernel.org/all/20250924171628.826-1-alifm@linux.ibm.com/
v4 -> v5
- Rebase on 6.18-rc5
- Move bug fixes to the beginning of the series (patch 1 and 2). These patches
were posted as a separate fixes series
https://lore.kernel.org/all/a14936ac-47d6-461b-816f-0fd66f869b0f@linux.ibm.com/
- Add matching pci_put_dev() for pci_get_slot() (patch 6).
v3 series https://lore.kernel.org/all/20250911183307.1910-1-alifm@linux.ibm.com/
v3 -> v4
- Remove warn messages for each PCI capability not restored (patch 1)
- Check PCI_COMMAND and PCI_STATUS register for error value instead of device id
(patch 1)
- Fix kernel crash in patch 3
- Added reviewed by tags
- Address comments from Niklas's (patches 4, 5, 7)
- Fix compilation error non s390x system (patch 8)
- Explicitly align struct vfio_device_feature_zpci_err (patch 8)
v2 series https://lore.kernel.org/all/20250825171226.1602-1-alifm@linux.ibm.com/
v2 -> v3
- Patch 1 avoids saving any config space state if the device is in error
(suggested by Alex)
- Patch 2 adds additional check only for FLR reset to try other function
reset method (suggested by Alex).
- Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
functions. Creates a new flag pci_slot to allow per function slot.
- Patch 4 fixes a bug in s390 for resource to bus address translation.
- Rebase on 6.17-rc5
v1 series https://lore.kernel.org/all/20250813170821.1115-1-alifm@linux.ibm.com/
v1 - > v2
- Patches 1 and 2 adds some additional checks for FLR/PM reset to
try other function reset method (suggested by Alex).
- Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
functions.
- Patch 7 adds a new device feature for zPCI devices for the VFIO_DEVICE_FEATURE
ioctl. The ioctl is used by userspace to retriece any PCI error
information for the device (suggested by Alex).
- Patch 8 adds a reset_done() callback for the vfio-pci driver, to
restore the state of the device after a reset.
- Patch 9 removes the pcie check for triggering VFIO_PCI_ERR_IRQ_INDEX.
Farhan Ali (9):
PCI: Allow per function PCI slots
s390/pci: Add architecture specific resource/bus address translation
PCI: Avoid saving config space state in reset path
PCI: Add additional checks for flr reset
s390/pci: Update the logic for detecting passthrough device
s390/pci: Store PCI error information for passthrough devices
vfio-pci/zdev: Add a device feature for error information
vfio: Add a reset_done callback for vfio-pci driver
vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
arch/s390/include/asm/pci.h | 29 ++++++++
arch/s390/pci/pci.c | 75 +++++++++++++++++++++
arch/s390/pci/pci_event.c | 107 +++++++++++++++++-------------
drivers/pci/host-bridge.c | 8 +--
drivers/pci/pci.c | 44 ++++++------
drivers/pci/slot.c | 25 ++++++-
drivers/vfio/pci/vfio_pci_core.c | 22 ++++--
drivers/vfio/pci/vfio_pci_intrs.c | 3 +-
drivers/vfio/pci/vfio_pci_priv.h | 9 +++
drivers/vfio/pci/vfio_pci_zdev.c | 47 ++++++++++++-
include/linux/pci.h | 1 +
include/uapi/linux/vfio.h | 17 +++++
12 files changed, 305 insertions(+), 82 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v9 1/9] PCI: Allow per function PCI slots
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 21:14 ` Keith Busch
2026-02-17 18:22 ` [PATCH v9 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
` (7 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato
On s390 systems, which use a machine level hypervisor, PCI devices are
always accessed through a form of PCI pass-through which fundamentally
operates on a per PCI function granularity. This is also reflected in the
s390 PCI hotplug driver which creates hotplug slots for individual PCI
functions. Its reset_slot() function, which is a wrapper for
zpci_hot_reset_device(), thus also resets individual functions.
Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
to multifunction devices. This approach worked fine on s390 systems that
only exposed virtual functions as individual PCI domains to the operating
system. Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
s390 supports exposing the topology of multifunction PCI devices by
grouping them in a shared PCI domain. When attempting to reset a function
through the hotplug driver, the shared slot assignment causes the wrong
function to be reset instead of the intended one. It also leaks memory as
we do create a pci_slot object for the function, but don't correctly free
it in pci_slot_release().
Add a flag for struct pci_slot to allow per function PCI slots for
functions managed through a hypervisor, which exposes individual PCI
functions while retaining the topology.
Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
Cc: stable@vger.kernel.org
Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/pci/pci.c | 5 +++--
drivers/pci/slot.c | 25 ++++++++++++++++++++++---
include/linux/pci.h | 1 +
3 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f3244630bfd0..3090c727b76f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4869,8 +4869,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
{
- if (dev->multifunction || dev->subordinate || !dev->slot ||
- dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
+ if (dev->subordinate || !dev->slot ||
+ dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
+ (dev->multifunction && !dev->slot->per_func_slot))
return -ENOTTY;
return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index 50fb3eb595fe..ed10fa3ae727 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -63,6 +63,22 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
return bus_speed_read(slot->bus->cur_bus_speed, buf);
}
+static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
+{
+ if (slot->per_func_slot)
+ return dev->devfn == slot->number;
+
+ return PCI_SLOT(dev->devfn) == slot->number;
+}
+
+static bool pci_slot_enabled_per_func(void)
+{
+ if (IS_ENABLED(CONFIG_S390))
+ return true;
+
+ return false;
+}
+
static void pci_slot_release(struct kobject *kobj)
{
struct pci_dev *dev;
@@ -73,7 +89,7 @@ static void pci_slot_release(struct kobject *kobj)
down_read(&pci_bus_sem);
list_for_each_entry(dev, &slot->bus->devices, bus_list)
- if (PCI_SLOT(dev->devfn) == slot->number)
+ if (pci_dev_matches_slot(dev, slot))
dev->slot = NULL;
up_read(&pci_bus_sem);
@@ -166,7 +182,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
mutex_lock(&pci_slot_mutex);
list_for_each_entry(slot, &dev->bus->slots, list)
- if (PCI_SLOT(dev->devfn) == slot->number)
+ if (pci_dev_matches_slot(dev, slot))
dev->slot = slot;
mutex_unlock(&pci_slot_mutex);
}
@@ -265,6 +281,9 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
slot->bus = pci_bus_get(parent);
slot->number = slot_nr;
+ if (pci_slot_enabled_per_func())
+ slot->per_func_slot = 1;
+
slot->kobj.kset = pci_slots_kset;
slot_name = make_slot_name(name);
@@ -285,7 +304,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
down_read(&pci_bus_sem);
list_for_each_entry(dev, &parent->devices, bus_list)
- if (PCI_SLOT(dev->devfn) == slot_nr)
+ if (pci_dev_matches_slot(dev, slot))
dev->slot = slot;
up_read(&pci_bus_sem);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1c270f1d5123..a9975d0e104f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -78,6 +78,7 @@ struct pci_slot {
struct list_head list; /* Node in list of slots */
struct hotplug_slot *hotplug; /* Hotplug info (move here) */
unsigned char number; /* PCI_SLOT(pci_dev->devfn) */
+ unsigned int per_func_slot:1; /* Allow per function slot */
struct kobject kobj;
};
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 2/9] s390/pci: Add architecture specific resource/bus address translation
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
2026-02-17 18:22 ` [PATCH v9 1/9] PCI: Allow per function PCI slots Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 18:22 ` [PATCH v9 3/9] PCI: Avoid saving config space state in reset path Farhan Ali
` (6 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato
On s390 today we overwrite the PCI BAR resource address to either an
artificial cookie address or MIO address. However this address is different
from the bus address of the BARs programmed by firmware. The artificial
cookie address was created to index into an array of function handles
(zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
but maybe different from the bus addresses. This creates an issue when
trying to convert the BAR resource address to bus address using the generic
pcibios_resource_to_bus().
Implement an architecture specific pcibios_resource_to_bus() function to
correctly translate PCI BAR resource addresses to bus addresses for s390.
Similarly add architecture specific pcibios_bus_to_resource function to do
the reverse translation.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
arch/s390/pci/pci.c | 74 +++++++++++++++++++++++++++++++++++++++
drivers/pci/host-bridge.c | 8 ++---
2 files changed, 78 insertions(+), 4 deletions(-)
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 97bab20bc163..b2e6b53ea8e6 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -272,6 +272,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
return 0;
}
+void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+ struct resource *res)
+{
+ struct zpci_bus *zbus = bus->sysdata;
+ struct zpci_bar_struct *zbar;
+ struct zpci_dev *zdev;
+
+ region->start = res->start;
+ region->end = res->end;
+
+ for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+ int j = 0;
+
+ zbar = NULL;
+ zdev = zbus->function[i];
+ if (!zdev)
+ continue;
+
+ for (j = 0; j < PCI_STD_NUM_BARS; j++) {
+ if (zdev->bars[j].res->start == res->start &&
+ zdev->bars[j].res->end == res->end &&
+ res->flags & IORESOURCE_MEM) {
+ zbar = &zdev->bars[j];
+ break;
+ }
+ }
+
+ if (zbar) {
+ /* only MMIO is supported */
+ region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
+ if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+ region->start |= (u64)zdev->bars[j + 1].val << 32;
+
+ region->end = region->start + (1UL << zbar->size) - 1;
+ return;
+ }
+ }
+}
+
+void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+ struct pci_bus_region *region)
+{
+ struct zpci_bus *zbus = bus->sysdata;
+ struct zpci_dev *zdev;
+ resource_size_t start, end;
+
+ res->start = region->start;
+ res->end = region->end;
+
+ for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+ zdev = zbus->function[i];
+ if (!zdev || !zdev->has_resources)
+ continue;
+
+ for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
+ if (!zdev->bars[j].size)
+ continue;
+
+ /* only MMIO is supported */
+ start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
+ if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+ start |= (u64)zdev->bars[j + 1].val << 32;
+
+ end = start + (1UL << zdev->bars[j].size) - 1;
+
+ if (start == region->start && end == region->end) {
+ res->start = zdev->bars[j].res->start;
+ res->end = zdev->bars[j].res->end;
+ return;
+ }
+ }
+ }
+}
+
void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
pgprot_t prot)
{
diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index be5ef6516cff..aed031b8a9f3 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -49,8 +49,8 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
}
EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
-void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
- struct resource *res)
+void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+ struct resource *res)
{
struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
struct resource_entry *window;
@@ -74,8 +74,8 @@ static bool region_contains(struct pci_bus_region *region1,
return region1->start <= region2->start && region1->end >= region2->end;
}
-void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
- struct pci_bus_region *region)
+void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+ struct pci_bus_region *region)
{
struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
struct resource_entry *window;
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
2026-02-17 18:22 ` [PATCH v9 1/9] PCI: Allow per function PCI slots Farhan Ali
2026-02-17 18:22 ` [PATCH v9 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 19:11 ` Keith Busch
2026-02-17 18:22 ` [PATCH v9 4/9] PCI: Add additional checks for flr reset Farhan Ali
` (5 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato,
Bjorn Helgaas
The current reset process saves the device's config space state before
reset and restores it afterward. However errors may occur unexpectedly and
it may then be impossible to save config space because the device may be
inaccessible (e.g. DPC) or config space may be corrupted. This results in
saving corrupted values that get written back to the device during state
restoration.
Since commit a2f1e22390ac ("PCI/ERR: Ensure error recoverability at all times"),
we now save the state of device at enumeration. On every restore we should
either use the enumeration saved state or driver's intentional saved state,
never a state saved at the unpredictable time of an error recovery reset.
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/pci/pci.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 3090c727b76f..2242b97e7d46 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5012,7 +5012,7 @@ void pci_dev_unlock(struct pci_dev *dev)
}
EXPORT_SYMBOL_GPL(pci_dev_unlock);
-static void pci_dev_save_and_disable(struct pci_dev *dev)
+static void pci_dev_disable(struct pci_dev *dev)
{
const struct pci_error_handlers *err_handler =
dev->driver ? dev->driver->err_handler : NULL;
@@ -5029,13 +5029,11 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
pci_warn(dev, "resetting");
/*
- * Wake-up device prior to save. PM registers default to D0 after
- * reset and a simple register restore doesn't reliably return
- * to a non-D0 state anyway.
+ * PM registers default to D0 after reset and a simple register
+ * restore doesn't reliably return to a non-D0 state.
*/
pci_set_power_state(dev, PCI_D0);
- pci_save_state(dev);
/*
* Disable the device by clearing the Command register, except for
* INTx-disable which is set. This not only disables MMIO and I/O port
@@ -5199,7 +5197,7 @@ int pci_reset_function(struct pci_dev *dev)
pci_dev_lock(bridge);
pci_dev_lock(dev);
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
rc = __pci_reset_function_locked(dev);
@@ -5241,7 +5239,7 @@ int pci_reset_function_locked(struct pci_dev *dev)
if (!pci_reset_supported(dev))
return -ENOTTY;
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
rc = __pci_reset_function_locked(dev);
@@ -5267,7 +5265,7 @@ int pci_try_reset_function(struct pci_dev *dev)
if (!pci_dev_trylock(dev))
return -EAGAIN;
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
rc = __pci_reset_function_locked(dev);
pci_dev_restore(dev);
pci_dev_unlock(dev);
@@ -5441,17 +5439,17 @@ static int pci_slot_trylock(struct pci_slot *slot)
}
/*
- * Save and disable devices from the top of the tree down while holding
+ * Disable devices from the top of the tree down while holding
* the @dev mutex lock for the entire tree.
*/
-static void pci_bus_save_and_disable_locked(struct pci_bus *bus)
+static void pci_bus_disable_locked(struct pci_bus *bus)
{
struct pci_dev *dev;
list_for_each_entry(dev, &bus->devices, bus_list) {
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
if (dev->subordinate)
- pci_bus_save_and_disable_locked(dev->subordinate);
+ pci_bus_disable_locked(dev->subordinate);
}
}
@@ -5477,16 +5475,16 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
* Save and disable devices from the top of the tree down while holding
* the @dev mutex lock for the entire tree.
*/
-static void pci_slot_save_and_disable_locked(struct pci_slot *slot)
+static void pci_slot_disable_locked(struct pci_slot *slot)
{
struct pci_dev *dev;
list_for_each_entry(dev, &slot->bus->devices, bus_list) {
if (!dev->slot || dev->slot != slot)
continue;
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
if (dev->subordinate)
- pci_bus_save_and_disable_locked(dev->subordinate);
+ pci_bus_disable_locked(dev->subordinate);
}
}
@@ -5566,7 +5564,7 @@ static int __pci_reset_slot(struct pci_slot *slot)
return rc;
if (pci_slot_trylock(slot)) {
- pci_slot_save_and_disable_locked(slot);
+ pci_slot_disable_locked(slot);
might_sleep();
rc = pci_reset_hotplug_slot(slot->hotplug, PCI_RESET_DO_RESET);
pci_slot_restore_locked(slot);
@@ -5660,7 +5658,7 @@ int __pci_reset_bus(struct pci_bus *bus)
return rc;
if (pci_bus_trylock(bus)) {
- pci_bus_save_and_disable_locked(bus);
+ pci_bus_disable_locked(bus);
might_sleep();
rc = pci_bridge_secondary_bus_reset(bus->self);
pci_bus_restore_locked(bus);
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 4/9] PCI: Add additional checks for flr reset
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
` (2 preceding siblings ...)
2026-02-17 18:22 ` [PATCH v9 3/9] PCI: Avoid saving config space state in reset path Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 18:22 ` [PATCH v9 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
` (4 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato,
Benjamin Block
If a device is in an error state, then any reads of device registers can
return error value. Add addtional checks to validate if a device is in an
error state before doing an flr reset.
Cc: stable@vger.kernel.org
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/pci/pci.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2242b97e7d46..d867db7188d0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4373,12 +4373,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
*/
int pcie_reset_flr(struct pci_dev *dev, bool probe)
{
+ u32 reg;
+
if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
return -ENOTTY;
if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
return -ENOTTY;
+ if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, ®)) {
+ pci_warn(dev, "Device unable to do an FLR\n");
+ return -ENOTTY;
+ }
+
if (probe)
return 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 5/9] s390/pci: Update the logic for detecting passthrough device
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
` (3 preceding siblings ...)
2026-02-17 18:22 ` [PATCH v9 4/9] PCI: Add additional checks for flr reset Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 18:22 ` [PATCH v9 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
` (3 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato
We can now have userspace drivers (vfio-pci based) on s390x. The userspace
drivers will not have any KVM fd and so no kzdev associated with them. So
we need to update the logic for detecting passthrough devices to not depend
on struct kvm_zdev.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
arch/s390/include/asm/pci.h | 1 +
arch/s390/pci/pci_event.c | 14 ++++----------
drivers/vfio/pci/vfio_pci_zdev.c | 9 ++++++++-
3 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index c0ff19dab580..ec8a772bf526 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -171,6 +171,7 @@ struct zpci_dev {
char res_name[16];
bool mio_capable;
+ bool mediated_recovery;
struct zpci_bar_struct bars[PCI_STD_NUM_BARS];
u64 start_dma; /* Start of available DMA addresses */
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 839bd91c056e..de504925f709 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -60,16 +60,10 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
}
}
-static bool is_passed_through(struct pci_dev *pdev)
+static bool needs_mediated_recovery(struct pci_dev *pdev)
{
struct zpci_dev *zdev = to_zpci(pdev);
- bool ret;
-
- mutex_lock(&zdev->kzdev_lock);
- ret = !!zdev->kzdev;
- mutex_unlock(&zdev->kzdev_lock);
-
- return ret;
+ return zdev->mediated_recovery;
}
static bool is_driver_supported(struct pci_driver *driver)
@@ -194,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
}
pdev->error_state = pci_channel_io_frozen;
- if (is_passed_through(pdev)) {
+ if (needs_mediated_recovery(pdev)) {
pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
pci_name(pdev));
status_str = "failed (pass-through)";
@@ -279,7 +273,7 @@ static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
* we will inject the error event and let the guest recover the device
* itself.
*/
- if (is_passed_through(pdev))
+ if (needs_mediated_recovery(pdev))
goto out;
driver = to_pci_driver(pdev->dev.driver);
if (driver && driver->err_handler && driver->err_handler->error_detected)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 0990fdb146b7..a7bc23ce8483 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -148,6 +148,8 @@ int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
if (!zdev)
return -ENODEV;
+ zdev->mediated_recovery = true;
+
if (!vdev->vdev.kvm)
return 0;
@@ -161,7 +163,12 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
{
struct zpci_dev *zdev = to_zpci(vdev->pdev);
- if (!zdev || !vdev->vdev.kvm)
+ if (!zdev)
+ return;
+
+ zdev->mediated_recovery = false;
+
+ if (!vdev->vdev.kvm)
return;
if (zpci_kvm_hook.kvm_unregister)
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 6/9] s390/pci: Store PCI error information for passthrough devices
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
` (4 preceding siblings ...)
2026-02-17 18:22 ` [PATCH v9 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 18:22 ` [PATCH v9 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
` (2 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato
For a passthrough device we need co-operation from user space to recover
the device. This would require to bubble up any error information to user
space. Let's store this error information for passthrough devices, so it
can be retrieved later.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
arch/s390/include/asm/pci.h | 28 ++++++++++
arch/s390/pci/pci.c | 1 +
arch/s390/pci/pci_event.c | 95 +++++++++++++++++++-------------
drivers/vfio/pci/vfio_pci_zdev.c | 2 +
4 files changed, 88 insertions(+), 38 deletions(-)
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index ec8a772bf526..383f6483b656 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -118,6 +118,31 @@ struct zpci_bus {
enum pci_bus_speed max_bus_speed;
};
+/* Content Code Description for PCI Function Error */
+struct zpci_ccdf_err {
+ u32 reserved1;
+ u32 fh; /* function handle */
+ u32 fid; /* function id */
+ u32 ett : 4; /* expected table type */
+ u32 mvn : 12; /* MSI vector number */
+ u32 dmaas : 8; /* DMA address space */
+ u32 reserved2 : 6;
+ u32 q : 1; /* event qualifier */
+ u32 rw : 1; /* read/write */
+ u64 faddr; /* failing address */
+ u32 reserved3;
+ u16 reserved4;
+ u16 pec; /* PCI event code */
+} __packed;
+
+#define ZPCI_ERR_PENDING_MAX 4
+struct zpci_ccdf_pending {
+ u8 count;
+ u8 head;
+ u8 tail;
+ struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
+};
+
/* Private data per function */
struct zpci_dev {
struct zpci_bus *zbus;
@@ -193,6 +218,8 @@ struct zpci_dev {
struct iommu_domain *s390_domain; /* attached IOMMU domain */
struct kvm_zdev *kzdev;
struct mutex kzdev_lock;
+ struct zpci_ccdf_pending pending_errs;
+ struct mutex pending_errs_lock;
spinlock_t dom_lock; /* protect s390_domain change */
};
@@ -331,6 +358,7 @@ void zpci_debug_exit_device(struct zpci_dev *);
int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
int zpci_clear_error_state(struct zpci_dev *zdev);
int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev);
#ifdef CONFIG_NUMA
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index b2e6b53ea8e6..9f02c2bd29d4 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -915,6 +915,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
mutex_init(&zdev->state_lock);
mutex_init(&zdev->fmb_lock);
mutex_init(&zdev->kzdev_lock);
+ mutex_init(&zdev->pending_errs_lock);
return zdev;
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index de504925f709..9f4ccd79771a 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -17,23 +17,6 @@
#include "pci_bus.h"
#include "pci_report.h"
-/* Content Code Description for PCI Function Error */
-struct zpci_ccdf_err {
- u32 reserved1;
- u32 fh; /* function handle */
- u32 fid; /* function id */
- u32 ett : 4; /* expected table type */
- u32 mvn : 12; /* MSI vector number */
- u32 dmaas : 8; /* DMA address space */
- u32 : 6;
- u32 q : 1; /* event qualifier */
- u32 rw : 1; /* read/write */
- u64 faddr; /* failing address */
- u32 reserved3;
- u16 reserved4;
- u16 pec; /* PCI event code */
-} __packed;
-
/* Content Code Description for PCI Function Availability */
struct zpci_ccdf_avail {
u32 reserved1;
@@ -75,6 +58,42 @@ static bool is_driver_supported(struct pci_driver *driver)
return true;
}
+static void zpci_store_pci_error(struct pci_dev *pdev,
+ struct zpci_ccdf_err *ccdf)
+{
+ struct zpci_dev *zdev = to_zpci(pdev);
+ int i;
+
+ mutex_lock(&zdev->pending_errs_lock);
+ if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
+ pr_err("%s: Maximum number (%d) of pending error events queued",
+ pci_name(pdev), ZPCI_ERR_PENDING_MAX);
+ mutex_unlock(&zdev->pending_errs_lock);
+ return;
+ }
+
+ i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
+ memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
+ zdev->pending_errs.tail++;
+ zdev->pending_errs.count++;
+ mutex_unlock(&zdev->pending_errs_lock);
+}
+
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
+{
+ struct pci_dev *pdev = NULL;
+
+ mutex_lock(&zdev->pending_errs_lock);
+ pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
+ if (zdev->pending_errs.count)
+ pr_info("%s: Unhandled PCI error events count=%d",
+ pci_name(pdev), zdev->pending_errs.count);
+ memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
+ pci_dev_put(pdev);
+ mutex_unlock(&zdev->pending_errs_lock);
+}
+EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
+
static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
struct pci_driver *driver)
{
@@ -169,7 +188,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
* and the platform determines which functions are affected for
* multi-function devices.
*/
-static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
+static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
+ struct zpci_ccdf_err *ccdf)
{
pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
struct zpci_dev *zdev = to_zpci(pdev);
@@ -188,13 +208,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
}
pdev->error_state = pci_channel_io_frozen;
- if (needs_mediated_recovery(pdev)) {
- pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
- pci_name(pdev));
- status_str = "failed (pass-through)";
- goto out_unlock;
- }
-
driver = to_pci_driver(pdev->dev.driver);
if (!is_driver_supported(driver)) {
if (!driver) {
@@ -210,12 +223,23 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
goto out_unlock;
}
+ if (needs_mediated_recovery(pdev))
+ zpci_store_pci_error(pdev, ccdf);
+
ers_res = zpci_event_notify_error_detected(pdev, driver);
if (ers_result_indicates_abort(ers_res)) {
status_str = "failed (abort on detection)";
goto out_unlock;
}
+ if (needs_mediated_recovery(pdev)) {
+ pr_info("%s: Leaving recovery of pass-through device to user-space\n",
+ pci_name(pdev));
+ ers_res = PCI_ERS_RESULT_RECOVERED;
+ status_str = "in progress";
+ goto out_unlock;
+ }
+
if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
ers_res = zpci_event_do_error_state_clear(pdev, driver);
if (ers_result_indicates_abort(ers_res)) {
@@ -260,25 +284,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
* @pdev: PCI function for which to report
* @es: PCI channel failure state to report
*/
-static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
+static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
+ struct zpci_ccdf_err *ccdf)
{
struct pci_driver *driver;
pci_dev_lock(pdev);
pdev->error_state = es;
- /**
- * While vfio-pci's error_detected callback notifies user-space QEMU
- * reacts to this by freezing the guest. In an s390 environment PCI
- * errors are rarely fatal so this is overkill. Instead in the future
- * we will inject the error event and let the guest recover the device
- * itself.
- */
+
if (needs_mediated_recovery(pdev))
- goto out;
+ zpci_store_pci_error(pdev, ccdf);
driver = to_pci_driver(pdev->dev.driver);
if (driver && driver->err_handler && driver->err_handler->error_detected)
driver->err_handler->error_detected(pdev, pdev->error_state);
-out:
+
pci_dev_unlock(pdev);
}
@@ -324,12 +343,12 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
break;
case 0x0040: /* Service Action or Error Recovery Failed */
case 0x003b:
- zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+ zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
break;
default: /* PCI function left in the error state attempt to recover */
- ers_res = zpci_event_attempt_error_recovery(pdev);
+ ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
if (ers_res != PCI_ERS_RESULT_RECOVERED)
- zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+ zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
break;
}
pci_dev_put(pdev);
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index a7bc23ce8483..2be37eab9279 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
zdev->mediated_recovery = false;
+ zpci_cleanup_pending_errors(zdev);
+
if (!vdev->vdev.kvm)
return;
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 7/9] vfio-pci/zdev: Add a device feature for error information
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
` (5 preceding siblings ...)
2026-02-17 18:22 ` [PATCH v9 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 18:22 ` [PATCH v9 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
2026-02-17 18:22 ` [PATCH v9 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato
For zPCI devices, we have platform specific error information. The platform
firmware provides this error information to the operating system in an
architecture specific mechanism. To enable recovery from userspace for
these devices, we want to expose this error information to userspace. Add a
new device feature to expose this information.
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/vfio/pci/vfio_pci_core.c | 2 ++
drivers/vfio/pci/vfio_pci_priv.h | 9 ++++++++
drivers/vfio/pci/vfio_pci_zdev.c | 36 ++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 17 +++++++++++++++
4 files changed, 64 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 72c33b399800..d2cb498c026b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1535,6 +1535,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_DMA_BUF:
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
+ return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
default:
return -ENOTTY;
}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 27ac280f00b9..eed69926d8a1 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -89,6 +89,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps);
int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+ void __user *arg, size_t argsz);
#else
static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps)
@@ -103,6 +105,13 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
{}
+
+static inline int vfio_pci_zdev_feature_err(struct vfio_device *device,
+ u32 flags, void __user *arg,
+ size_t argsz)
+{
+ return -ENODEV;
+}
#endif
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 2be37eab9279..d2748dd67c55 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -141,6 +141,42 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
return ret;
}
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+ void __user *arg, size_t argsz)
+{
+ struct vfio_device_feature_zpci_err err;
+ struct vfio_pci_core_device *vdev;
+ struct zpci_dev *zdev;
+ int head = 0;
+ int ret;
+
+ vdev = container_of(device, struct vfio_pci_core_device, vdev);
+ zdev = to_zpci(vdev->pdev);
+ if (!zdev)
+ return -ENODEV;
+
+ ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+ sizeof(err));
+ if (ret != 1)
+ return ret;
+
+ mutex_lock(&zdev->pending_errs_lock);
+ if (zdev->pending_errs.count) {
+ head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
+ err.pec = zdev->pending_errs.err[head].pec;
+ zdev->pending_errs.head++;
+ zdev->pending_errs.count--;
+ err.pending_errors = zdev->pending_errs.count;
+ }
+ mutex_unlock(&zdev->pending_errs_lock);
+
+ err.version = 1;
+ if (copy_to_user(arg, &err, sizeof(err)))
+ return -EFAULT;
+
+ return 0;
+}
+
int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
{
struct zpci_dev *zdev = to_zpci(vdev->pdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index bb7b89330d35..21b1473e4779 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1510,6 +1510,23 @@ struct vfio_device_feature_dma_buf {
struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
};
+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
+ * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
+ * platform firmware and notification to operating system is done by
+ * architecture specific mechanism. Exposing this information to userspace
+ * allows userspace to take appropriate actions to handle an error on the
+ * device.
+ */
+
+struct vfio_device_feature_zpci_err {
+ __u8 version;
+ __u8 pending_errors;
+ __u16 pec;
+};
+
+#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 12
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 8/9] vfio: Add a reset_done callback for vfio-pci driver
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
` (6 preceding siblings ...)
2026-02-17 18:22 ` [PATCH v9 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
2026-02-17 18:22 ` [PATCH v9 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato,
Julian Ruess
On error recovery for a PCI device bound to vfio-pci driver, we want to
recover the state of the device to its last known saved state. The callback
restores the state of the device to its initial saved state.
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/vfio/pci/vfio_pci_core.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d2cb498c026b..8f7eb3636075 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2258,6 +2258,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
+static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
+{
+ struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+ if (!vdev->pci_saved_state)
+ return;
+
+ pci_load_saved_state(pdev, vdev->pci_saved_state);
+ pci_restore_state(pdev);
+}
+
int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
int nr_virtfn)
{
@@ -2322,6 +2333,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
const struct pci_error_handlers vfio_pci_core_err_handlers = {
.error_detected = vfio_pci_core_aer_err_detected,
+ .reset_done = vfio_pci_core_aer_reset_done,
};
EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v9 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
` (7 preceding siblings ...)
2026-02-17 18:22 ` [PATCH v9 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2026-02-17 18:22 ` Farhan Ali
8 siblings, 0 replies; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 18:22 UTC (permalink / raw)
To: linux-s390, linux-kernel, linux-pci
Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato,
Julian Ruess
We are configuring the error signaling on the vast majority of devices and
it's extremely rare that it fires anyway. This allows userspace to be
notified on errors for legacy PCI devices. The Internal Shared Memory (ISM)
device on s390 is one such device. For PCI devices on IBM s390 error
recovery involves platform firmware and notification to operating system
is done by architecture specific way. So the ISM device can still be
recovered when notified of an error.
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/vfio/pci/vfio_pci_core.c | 8 ++------
drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
2 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 8f7eb3636075..dac0725499ba 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -787,8 +787,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
}
} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
- if (pci_is_pcie(vdev->pdev))
- return 1;
+ return 1;
} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
return 1;
}
@@ -1164,11 +1163,8 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
switch (info.index) {
case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
case VFIO_PCI_REQ_IRQ_INDEX:
- break;
case VFIO_PCI_ERR_IRQ_INDEX:
- if (pci_is_pcie(vdev->pdev))
- break;
- fallthrough;
+ break;
default:
return -EINVAL;
}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index c76e753b3cec..b6cedaf0bcca 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -859,8 +859,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
case VFIO_PCI_ERR_IRQ_INDEX:
switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
case VFIO_IRQ_SET_ACTION_TRIGGER:
- if (pci_is_pcie(vdev->pdev))
- func = vfio_pci_set_err_trigger;
+ func = vfio_pci_set_err_trigger;
break;
}
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-17 18:22 ` [PATCH v9 3/9] PCI: Avoid saving config space state in reset path Farhan Ali
@ 2026-02-17 19:11 ` Keith Busch
2026-02-17 19:55 ` Farhan Ali
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2026-02-17 19:11 UTC (permalink / raw)
To: Farhan Ali
Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, alex, clg,
stable, schnelle, mjrosato, Bjorn Helgaas
On Tue, Feb 17, 2026 at 10:22:51AM -0800, Farhan Ali wrote:
> The current reset process saves the device's config space state before
> reset and restores it afterward. However errors may occur unexpectedly and
> it may then be impossible to save config space because the device may be
> inaccessible (e.g. DPC) or config space may be corrupted. This results in
> saving corrupted values that get written back to the device during state
> restoration.
>
> Since commit a2f1e22390ac ("PCI/ERR: Ensure error recoverability at all times"),
> we now save the state of device at enumeration. On every restore we should
> either use the enumeration saved state or driver's intentional saved state,
> never a state saved at the unpredictable time of an error recovery reset.
The vfio driver calls pci_try_reset_function after pci_enable_device,
but before calling pci_store_saved_state. Won't this change, then, mean
that the PCI Command register will get restored to the wrong state with
the resources disabled?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-17 19:11 ` Keith Busch
@ 2026-02-17 19:55 ` Farhan Ali
2026-02-18 19:02 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 19:55 UTC (permalink / raw)
To: Keith Busch
Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, alex, clg,
stable, schnelle, mjrosato, Bjorn Helgaas
On 2/17/2026 11:11 AM, Keith Busch wrote:
> On Tue, Feb 17, 2026 at 10:22:51AM -0800, Farhan Ali wrote:
>> The current reset process saves the device's config space state before
>> reset and restores it afterward. However errors may occur unexpectedly and
>> it may then be impossible to save config space because the device may be
>> inaccessible (e.g. DPC) or config space may be corrupted. This results in
>> saving corrupted values that get written back to the device during state
>> restoration.
>>
>> Since commit a2f1e22390ac ("PCI/ERR: Ensure error recoverability at all times"),
>> we now save the state of device at enumeration. On every restore we should
>> either use the enumeration saved state or driver's intentional saved state,
>> never a state saved at the unpredictable time of an error recovery reset.
> The vfio driver calls pci_try_reset_function after pci_enable_device,
> but before calling pci_store_saved_state. Won't this change, then, mean
> that the PCI Command register will get restored to the wrong state with
> the resources disabled?
Yes I think you are right, with this change the PCI Command register
gets restored to state at enumeration. So we will lose the updated state
after pci_clear_master() and pci_enable_device(). I think we can update
the vfio driver to call pci_save_state() after pci_enable_device()?
Thanks
Farhan
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 1/9] PCI: Allow per function PCI slots
2026-02-17 18:22 ` [PATCH v9 1/9] PCI: Allow per function PCI slots Farhan Ali
@ 2026-02-17 21:14 ` Keith Busch
2026-02-17 22:26 ` Farhan Ali
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2026-02-17 21:14 UTC (permalink / raw)
To: Farhan Ali
Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, alex, clg,
stable, schnelle, mjrosato
On Tue, Feb 17, 2026 at 10:22:49AM -0800, Farhan Ali wrote:
> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
> to multifunction devices. This approach worked fine on s390 systems that
> only exposed virtual functions as individual PCI domains to the operating
> system. Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> s390 supports exposing the topology of multifunction PCI devices by
> grouping them in a shared PCI domain. When attempting to reset a function
> through the hotplug driver, the shared slot assignment causes the wrong
> function to be reset instead of the intended one. It also leaks memory as
> we do create a pci_slot object for the function, but don't correctly free
> it in pci_slot_release().
I think leakage is because s390 is passing in devfn when pci_create_slot
is expecting devnr, so things are not getting matched and assigned as
expected.
If you're able to make this change, it will clash with one existing
thing, and another proposal I've got at v5 now(*). Specifically, this
would allow all 8 bits to be used for the pci_slot 'number' when it's
currently expected to be limited to 5 bits. 0xff is special, and I'm
proposing another special value. If we are going to allow the slot
numbers to use all 8 bits, I think we need to change the type from u8 to
u16 so that there is space to encode such special values.
* https://lore.kernel.org/linux-pci/20260217160836.2709885-3-kbusch@meta.com/
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 1/9] PCI: Allow per function PCI slots
2026-02-17 21:14 ` Keith Busch
@ 2026-02-17 22:26 ` Farhan Ali
2026-02-19 21:37 ` Niklas Schnelle
0 siblings, 1 reply; 21+ messages in thread
From: Farhan Ali @ 2026-02-17 22:26 UTC (permalink / raw)
To: Keith Busch
Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, alex, clg,
stable, schnelle, mjrosato
On 2/17/2026 1:14 PM, Keith Busch wrote:
> On Tue, Feb 17, 2026 at 10:22:49AM -0800, Farhan Ali wrote:
>> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
>> to multifunction devices. This approach worked fine on s390 systems that
>> only exposed virtual functions as individual PCI domains to the operating
>> system. Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
>> s390 supports exposing the topology of multifunction PCI devices by
>> grouping them in a shared PCI domain. When attempting to reset a function
>> through the hotplug driver, the shared slot assignment causes the wrong
>> function to be reset instead of the intended one. It also leaks memory as
>> we do create a pci_slot object for the function, but don't correctly free
>> it in pci_slot_release().
> I think leakage is because s390 is passing in devfn when pci_create_slot
> is expecting devnr, so things are not getting matched and assigned as
> expected.
>
> If you're able to make this change, it will clash with one existing
> thing, and another proposal I've got at v5 now(*). Specifically, this
> would allow all 8 bits to be used for the pci_slot 'number' when it's
> currently expected to be limited to 5 bits. 0xff is special, and I'm
> proposing another special value. If we are going to allow the slot
> numbers to use all 8 bits, I think we need to change the type from u8 to
> u16 so that there is space to encode such special values.
>
> * https://lore.kernel.org/linux-pci/20260217160836.2709885-3-kbusch@meta.com/
I am open to suggestions on how we can better model a pci_slot per
function. And yeah, I think it makes sense to update the pci_slot
'number' to u16.
Thanks
Farhan
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-17 19:55 ` Farhan Ali
@ 2026-02-18 19:02 ` Keith Busch
2026-02-18 19:35 ` Bjorn Helgaas
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2026-02-18 19:02 UTC (permalink / raw)
To: Farhan Ali
Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, alex, clg,
stable, schnelle, mjrosato, Bjorn Helgaas
On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
>
> Yes I think you are right, with this change the PCI Command register gets
> restored to state at enumeration. So we will lose the updated state after
> pci_clear_master() and pci_enable_device(). I think we can update the vfio
> driver to call pci_save_state() after pci_enable_device()?
Either that, or move the pci_enable_device() call to after the function
reset.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-18 19:02 ` Keith Busch
@ 2026-02-18 19:35 ` Bjorn Helgaas
2026-02-18 21:48 ` Farhan Ali
0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2026-02-18 19:35 UTC (permalink / raw)
To: Keith Busch
Cc: Farhan Ali, linux-s390, linux-kernel, linux-pci, lukas, alex, clg,
stable, schnelle, mjrosato, Bjorn Helgaas
On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
> >
> > Yes I think you are right, with this change the PCI Command
> > register gets restored to state at enumeration. So we will lose
> > the updated state after pci_clear_master() and
> > pci_enable_device(). I think we can update the vfio driver to call
> > pci_save_state() after pci_enable_device()?
>
> Either that, or move the pci_enable_device() call to after the
> function reset.
I kind of like the latter idea because it seems a little simpler for
the rule of thumb to be that a reset done by the PCI core returns the
device to the same state as when the driver first probed the device.
Drivers would generally not use pci_save_state() at all, and they
could share some initialization logic between probe and post-reset
recovery.
But I would really like to have Lukas's take on this. Clearly some
drivers would have to be adapted if we stop saving config space in the
PCI core reset path. We can take care of that for upstream drivers,
but it seems risky for out-of-tree drivers.
Bjorn
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-18 19:35 ` Bjorn Helgaas
@ 2026-02-18 21:48 ` Farhan Ali
2026-02-19 0:20 ` Bjorn Helgaas
0 siblings, 1 reply; 21+ messages in thread
From: Farhan Ali @ 2026-02-18 21:48 UTC (permalink / raw)
To: Bjorn Helgaas, Keith Busch
Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, clg, stable,
schnelle, mjrosato, Bjorn Helgaas
On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
> On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
>> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
>>> Yes I think you are right, with this change the PCI Command
>>> register gets restored to state at enumeration. So we will lose
>>> the updated state after pci_clear_master() and
>>> pci_enable_device(). I think we can update the vfio driver to call
>>> pci_save_state() after pci_enable_device()?
>> Either that, or move the pci_enable_device() call to after the
>> function reset.
> I kind of like the latter idea because it seems a little simpler for
> the rule of thumb to be that a reset done by the PCI core returns the
> device to the same state as when the driver first probed the device.
> Drivers would generally not use pci_save_state() at all, and they
> could share some initialization logic between probe and post-reset
> recovery.
I think the vfio-pci driver was intentionally doing the
pci_enable_device() before doing the reset. As per commit 9a92c5091a42
("vfio-pci: Enable device before attempting reset") it was done to
handle devices using PM reset, that were getting incorrectly identified
not supporting PM reset due to current state of the device not being D0.
It looks like pci_pm_reset() still returns -EINVAL if current power
state is not D0. So I think we can't move pci_enable_device() after
reset. Unless we want to update pci_pm_reset() to not use cached value
of current_state and read it directly from register?
Thanks
Farhan
>
> But I would really like to have Lukas's take on this. Clearly some
> drivers would have to be adapted if we stop saving config space in the
> PCI core reset path. We can take care of that for upstream drivers,
> but it seems risky for out-of-tree drivers.
>
> Bjorn
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-18 21:48 ` Farhan Ali
@ 2026-02-19 0:20 ` Bjorn Helgaas
2026-02-19 18:06 ` Farhan Ali
0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2026-02-19 0:20 UTC (permalink / raw)
To: Farhan Ali
Cc: Keith Busch, linux-s390, linux-kernel, linux-pci, lukas,
Alex Williamson, clg, stable, schnelle, mjrosato, Bjorn Helgaas
On Wed, Feb 18, 2026 at 01:48:57PM -0800, Farhan Ali wrote:
> On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
> > > On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
> > > > Yes I think you are right, with this change the PCI Command
> > > > register gets restored to state at enumeration. So we will
> > > > lose the updated state after pci_clear_master() and
> > > > pci_enable_device(). I think we can update the vfio driver to
> > > > call pci_save_state() after pci_enable_device()?
> > >
> > > Either that, or move the pci_enable_device() call to after the
> > > function reset.
> >
> > I kind of like the latter idea because it seems a little simpler
> > for the rule of thumb to be that a reset done by the PCI core
> > returns the device to the same state as when the driver first
> > probed the device. Drivers would generally not use
> > pci_save_state() at all, and they could share some initialization
> > logic between probe and post-reset recovery.
>
> I think the vfio-pci driver was intentionally doing the
> pci_enable_device() before doing the reset. As per commit
> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") it
> was done to handle devices using PM reset, that were getting
> incorrectly identified not supporting PM reset due to current state
> of the device not being D0. It looks like pci_pm_reset() still
> returns -EINVAL if current power state is not D0. So I think we
> can't move pci_enable_device() after reset. Unless we want to update
> pci_pm_reset() to not use cached value of current_state and read it
> directly from register?
Devices are generally disabled at .probe() time, so that will be the
default saved state. But every driver will expect the device to be
enabled after the reset. Skipping the save state at reset time seems
like it would need a lot of work first and maybe it wouldn't ever be
practical. It wasn't really thought out; I was just hoping we could
simplify the save-state model and maybe unify driver reset and error
recovery paths. I think we need to drop this patch at least for now.
9a92c5091a42 ("vfio-pci: Enable device before attempting reset") was
mostly done to make pci_pm_reset() work, which requires the device to
be in D0. The main purpose of pci_enable_device() is to make device
BARs accessible; it *does* also put the device in D0 because BARs are
only accessible in D0, but pci_pm_reset() itself doesn't need the
BARs.
Other reset methods, e.g., FLR, don't seem to require the device to be
in D0, so I'm not sure why pci_pm_reset() requires that. I think the
critical piece is the D3->D0 transition, and maybe we could arrange
for that to happen even if the device is already in D1/D2/D3hot or
even D3cold.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-19 0:20 ` Bjorn Helgaas
@ 2026-02-19 18:06 ` Farhan Ali
2026-02-20 20:53 ` Alex Williamson
0 siblings, 1 reply; 21+ messages in thread
From: Farhan Ali @ 2026-02-19 18:06 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Keith Busch, linux-s390, linux-kernel, linux-pci, lukas,
Alex Williamson, clg, stable, schnelle, mjrosato, Bjorn Helgaas
On 2/18/2026 4:20 PM, Bjorn Helgaas wrote:
> On Wed, Feb 18, 2026 at 01:48:57PM -0800, Farhan Ali wrote:
>> On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
>>> On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
>>>> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
>>>>> Yes I think you are right, with this change the PCI Command
>>>>> register gets restored to state at enumeration. So we will
>>>>> lose the updated state after pci_clear_master() and
>>>>> pci_enable_device(). I think we can update the vfio driver to
>>>>> call pci_save_state() after pci_enable_device()?
>>>> Either that, or move the pci_enable_device() call to after the
>>>> function reset.
>>> I kind of like the latter idea because it seems a little simpler
>>> for the rule of thumb to be that a reset done by the PCI core
>>> returns the device to the same state as when the driver first
>>> probed the device. Drivers would generally not use
>>> pci_save_state() at all, and they could share some initialization
>>> logic between probe and post-reset recovery.
>> I think the vfio-pci driver was intentionally doing the
>> pci_enable_device() before doing the reset. As per commit
>> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") it
>> was done to handle devices using PM reset, that were getting
>> incorrectly identified not supporting PM reset due to current state
>> of the device not being D0. It looks like pci_pm_reset() still
>> returns -EINVAL if current power state is not D0. So I think we
>> can't move pci_enable_device() after reset. Unless we want to update
>> pci_pm_reset() to not use cached value of current_state and read it
>> directly from register?
> Devices are generally disabled at .probe() time, so that will be the
> default saved state. But every driver will expect the device to be
> enabled after the reset. Skipping the save state at reset time seems
> like it would need a lot of work first and maybe it wouldn't ever be
> practical. It wasn't really thought out; I was just hoping we could
> simplify the save-state model and maybe unify driver reset and error
> recovery paths. I think we need to drop this patch at least for now.
Yeah, I agree this patch might be too disruptive for drivers. In that
case would my previous version [1] to at least prevent saving state in
case of an error be acceptable? Or is there another approach we should
consider?
[1] https://lore.kernel.org/all/20260122194437.1903-4-alifm@linux.ibm.com/
>
> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") was
> mostly done to make pci_pm_reset() work, which requires the device to
> be in D0. The main purpose of pci_enable_device() is to make device
> BARs accessible; it *does* also put the device in D0 because BARs are
> only accessible in D0, but pci_pm_reset() itself doesn't need the
> BARs.
>
> Other reset methods, e.g., FLR, don't seem to require the device to be
> in D0, so I'm not sure why pci_pm_reset() requires that. I think the
> critical piece is the D3->D0 transition, and maybe we could arrange
> for that to happen even if the device is already in D1/D2/D3hot or
> even D3cold.
Looking at the PCI spec (v6.1) I didn't see any requirement for the
device to be in D0 state to perform a power state change. So I think we
should be able to transition from D1/D2/D3hot to D0. But IIUC if a
device is in D3cold, then won't register reads/writes fail till power is
available to the device?
Thanks
Farhan
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 1/9] PCI: Allow per function PCI slots
2026-02-17 22:26 ` Farhan Ali
@ 2026-02-19 21:37 ` Niklas Schnelle
0 siblings, 0 replies; 21+ messages in thread
From: Niklas Schnelle @ 2026-02-19 21:37 UTC (permalink / raw)
To: Farhan Ali, Keith Busch
Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, alex, clg,
stable, mjrosato
On Tue, 2026-02-17 at 14:26 -0800, Farhan Ali wrote:
> On 2/17/2026 1:14 PM, Keith Busch wrote:
> > On Tue, Feb 17, 2026 at 10:22:49AM -0800, Farhan Ali wrote:
> > > Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
> > > to multifunction devices. This approach worked fine on s390 systems that
> > > only exposed virtual functions as individual PCI domains to the operating
> > > system. Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> > > s390 supports exposing the topology of multifunction PCI devices by
> > > grouping them in a shared PCI domain. When attempting to reset a function
> > > through the hotplug driver, the shared slot assignment causes the wrong
> > > function to be reset instead of the intended one. It also leaks memory as
> > > we do create a pci_slot object for the function, but don't correctly free
> > > it in pci_slot_release().
> > I think leakage is because s390 is passing in devfn when pci_create_slot
> > is expecting devnr, so things are not getting matched and assigned as
> > expected.
> >
> > If you're able to make this change, it will clash with one existing
> > thing, and another proposal I've got at v5 now(*). Specifically, this
> > would allow all 8 bits to be used for the pci_slot 'number' when it's
> > currently expected to be limited to 5 bits. 0xff is special, and I'm
> > proposing another special value. If we are going to allow the slot
> > numbers to use all 8 bits, I think we need to change the type from u8 to
> > u16 so that there is space to encode such special values.
> >
> > * https://lore.kernel.org/linux-pci/20260217160836.2709885-3-kbusch@meta.com/
>
> I am open to suggestions on how we can better model a pci_slot per
> function. And yeah, I think it makes sense to update the pci_slot
> 'number' to u16.
>
> Thanks
>
> Farhan
It may give some better context to emphasize that s390 has been using a
per PCI function hotplug slot model since commit 7441b0627e22
("s390/pci: PCI hotplug support via SCLP") from 2012. Quoting the
commit description here:
"The hotplug controller creates a slot for every PCI function in
stand-by or configured state. The PCI functions are named after
the PCI function ID (fid). By writing to the power attribute
in /sys/bus/pci/slots/<fid>/power the PCI function
is moved to stand-by or configured state."
So this isn't a new concept or something we can really change since
users and documentation relies on this for over a decade. It's only
that for the longest time the mismatch between the hotplug slot and the
pci slot went unnoticed because it really only causes trouble in very
specific circumstances even after we've started exposing the devfn part
of the geographic PCI addresses for SR-IOV capable multi-function
devices e.g. for the two PFs of ConnectX series NICs.
Thanks,
Niklas
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v9 3/9] PCI: Avoid saving config space state in reset path
2026-02-19 18:06 ` Farhan Ali
@ 2026-02-20 20:53 ` Alex Williamson
0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2026-02-20 20:53 UTC (permalink / raw)
To: Farhan Ali
Cc: Bjorn Helgaas, Keith Busch, linux-s390, linux-kernel, linux-pci,
lukas, clg, stable, schnelle, mjrosato, Bjorn Helgaas, alex
On Thu, 19 Feb 2026 10:06:05 -0800
Farhan Ali <alifm@linux.ibm.com> wrote:
> On 2/18/2026 4:20 PM, Bjorn Helgaas wrote:
> > On Wed, Feb 18, 2026 at 01:48:57PM -0800, Farhan Ali wrote:
> >> On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
> >>> On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
> >>>> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
> >>>>> Yes I think you are right, with this change the PCI Command
> >>>>> register gets restored to state at enumeration. So we will
> >>>>> lose the updated state after pci_clear_master() and
> >>>>> pci_enable_device(). I think we can update the vfio driver to
> >>>>> call pci_save_state() after pci_enable_device()?
> >>>> Either that, or move the pci_enable_device() call to after the
> >>>> function reset.
> >>> I kind of like the latter idea because it seems a little simpler
> >>> for the rule of thumb to be that a reset done by the PCI core
> >>> returns the device to the same state as when the driver first
> >>> probed the device. Drivers would generally not use
> >>> pci_save_state() at all, and they could share some initialization
> >>> logic between probe and post-reset recovery.
> >> I think the vfio-pci driver was intentionally doing the
> >> pci_enable_device() before doing the reset. As per commit
> >> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") it
> >> was done to handle devices using PM reset, that were getting
> >> incorrectly identified not supporting PM reset due to current state
> >> of the device not being D0. It looks like pci_pm_reset() still
> >> returns -EINVAL if current power state is not D0. So I think we
> >> can't move pci_enable_device() after reset. Unless we want to update
> >> pci_pm_reset() to not use cached value of current_state and read it
> >> directly from register?
> > Devices are generally disabled at .probe() time, so that will be the
> > default saved state. But every driver will expect the device to be
> > enabled after the reset. Skipping the save state at reset time seems
> > like it would need a lot of work first and maybe it wouldn't ever be
> > practical. It wasn't really thought out; I was just hoping we could
> > simplify the save-state model and maybe unify driver reset and error
> > recovery paths. I think we need to drop this patch at least for now.
>
> Yeah, I agree this patch might be too disruptive for drivers. In that
> case would my previous version [1] to at least prevent saving state in
> case of an error be acceptable? Or is there another approach we should
> consider?
>
> [1] https://lore.kernel.org/all/20260122194437.1903-4-alifm@linux.ibm.com/
>
> >
> > 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") was
> > mostly done to make pci_pm_reset() work, which requires the device to
> > be in D0. The main purpose of pci_enable_device() is to make device
> > BARs accessible; it *does* also put the device in D0 because BARs are
> > only accessible in D0, but pci_pm_reset() itself doesn't need the
> > BARs.
> >
> > Other reset methods, e.g., FLR, don't seem to require the device to be
> > in D0, so I'm not sure why pci_pm_reset() requires that. I think the
> > critical piece is the D3->D0 transition, and maybe we could arrange
> > for that to happen even if the device is already in D1/D2/D3hot or
> > even D3cold.
>
> Looking at the PCI spec (v6.1) I didn't see any requirement for the
> device to be in D0 state to perform a power state change. So I think we
> should be able to transition from D1/D2/D3hot to D0. But IIUC if a
> device is in D3cold, then won't register reads/writes fail till power is
> available to the device?
Yes, config space could be inaccessible in D3cold. IIRC, 9a92c5091a42
was specifically addressing that devices are typically provided to the
driver in the PCI_UNKNOWN state and at the time vfio-pci wasn't
changing that in the .probe function, like most drivers would, so we
needed to adjust the ordering of enabling the device versus calling
reset function.
Now that we've gained PM management in vfio-pci, that's no longer an
issue, but pci_pm_reset() does still require the device to arrive in
D0. Accepting devices arriving in D3cold or D3hot (with NoSoftReset-)
might avoid a power state bounce in some circumstances, but would not
have solved the original 9a92c5091a42 scenario where the device was in
PCI_UNKNOWN power state.
Sorry I missed my opportunity to reply to the suggestion for this
approach in the previous revision. I'm not sure if anything
specifically breaks with this approach to restore the initial device
state, but it's certainly not the contract I currently expect as a
user of the reset-function interfaces. I think that contract is
"reset the internal state of the device while saving and restoring
current config space". If we stray from that, what's the expectation
for things like resizable BARs? I don't think we want to reprovision
resources as a result of reset.
Here we seem to be worried about a specific, testable scenario where
config space might be inaccessible after error and applying the
workaround to that regardless whether that specific scenario is preset.
I don't see that a "test if config space is accessible and stuff the
original save state into the buffer rather than creating an invalid
save state" should be so complex as to require this simplification and
associated risk. Thanks,
Alex
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2026-02-20 20:53 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-17 18:22 [PATCH v9 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
2026-02-17 18:22 ` [PATCH v9 1/9] PCI: Allow per function PCI slots Farhan Ali
2026-02-17 21:14 ` Keith Busch
2026-02-17 22:26 ` Farhan Ali
2026-02-19 21:37 ` Niklas Schnelle
2026-02-17 18:22 ` [PATCH v9 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
2026-02-17 18:22 ` [PATCH v9 3/9] PCI: Avoid saving config space state in reset path Farhan Ali
2026-02-17 19:11 ` Keith Busch
2026-02-17 19:55 ` Farhan Ali
2026-02-18 19:02 ` Keith Busch
2026-02-18 19:35 ` Bjorn Helgaas
2026-02-18 21:48 ` Farhan Ali
2026-02-19 0:20 ` Bjorn Helgaas
2026-02-19 18:06 ` Farhan Ali
2026-02-20 20:53 ` Alex Williamson
2026-02-17 18:22 ` [PATCH v9 4/9] PCI: Add additional checks for flr reset Farhan Ali
2026-02-17 18:22 ` [PATCH v9 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
2026-02-17 18:22 ` [PATCH v9 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
2026-02-17 18:22 ` [PATCH v9 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
2026-02-17 18:22 ` [PATCH v9 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
2026-02-17 18:22 ` [PATCH v9 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox