Linux PCI subsystem development
 help / color / mirror / Atom feed
* [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x
@ 2026-05-05 20:05 Farhan Ali
  2026-05-05 20:05 ` [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390 Farhan Ali
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato

Hi,

This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x).

Background
----------
For PCI devices on s390x an operating system receives platform specific
error events from firmware rather than through AER.Today for
passthrough/userspace devices, we don't attempt any error recovery and
ignore any error events for the devices. The passthrough/userspace devices
are managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected), and on an error trigger an eventfd to
userspace.  But we need a mechanism to notify userspace
(QEMU/guest/userspace drivers) about the error event.

Proposal
--------
We can expose this error information (currently only the PCI Error Code)
via a device feature. Userspace can then obtain the error information
via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving
a device reset.

This is how a typical flow for passthrough devices to a VM would work:
For passthrough devices to a VM, the driver bound to the device on the host
is vfio-pci. vfio-pci driver does support the error_detected() callback
(vfio_pci_core_aer_err_detected()), and on an PCI error s390x recovery
code on the host will call the vfio-pci error_detected() callback. The
vfio-pci error_detected() callback will notify userspace/QEMU via an
eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x
error recovery on the host will skip any further action(see patch 4) and
let userspace drive the error recovery.

Once userspace/QEMU is notified, it then injects this error into the VM
so device drivers in the VM can take recovery actions. For example for a
passthrough NVMe device, the VM's OS NVMe driver will access the device.
At this point the VM's NVMe driver's error_detected() will drive the
recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error
recovery in the VM's OS will try to do a reset. Resets are privileged
operations and so the VM will need intervention from QEMU to perform the
reset. QEMU will invoke the VFIO_DEVICE_RESET ioctl to now notify the
host that the VM is requesting a reset of the device. The vfio-pci driver
on the host will then perform the reset on the device to recover it.


Thanks
Farhan

ChangeLog
---------
This only includes change log from last 5 revision. Older change log is
available in v14 cover letter.

v14 https://lore.kernel.org/all/20260421163031.704-1-alifm@linux.ibm.com/
v14 -> v15
   - Fix issues identified by Sashiko (patch 4).

   - Address Niklas feedback (patch 4 and patch 5).

   - Rebase on 7.1-rc2.


v13 series https://lore.kernel.org/all/20260413210608.2912-1-alifm@linux.ibm.com/
v13 -> v14
   - Remove version from vfio uAPI struct. Instead reserve additional space
   and add a flags field. The flags will be used to indicate any usage of
   the reserved space (patch 5).

   - Remove pending_errors from vfio uAPI struct and instead return an
   error to indicate no more pending error for userspace to handle (patch 5).

   - Rebase on recent linux master

v12 series https://lore.kernel.org/all/20260330174011.1161-1-alifm@linux.ibm.com/
v12 -> v13
   - Add the mediated_recovery flag as part of struct zpci_ccdf_pending
   and protect the struct with pending_errs_lock (patch 4).

   - Move dequeing pending error logic to a helper function (patch 5).

   - Update device feature number for VFIO_DEVICE_FEATURE_ZPCI_ERROR (patch 5).

   - Rebase on linux-next with tag next-20260410


v11 series https://lore.kernel.org/all/20260316191544.2279-1-alifm@linux.ibm.com/
   - Address Bjorn's comments from v11 (patches 1-3).

   - Create a common function to check config space accessibility
   (patch 2).

   - Address Alex's comments from v11 (patches 4, 5, 7).

   - Protect the mediated_recovery flag with the pending_errs_lock.
   Doing that it made sense to squash patches 5 and 6 from v11
   (current patch 4). Even though the code didn't change significantly
   I have dropped R-b tags for it. Would appreciate another look at the
   patch (current patch 4).

   - Dropped arch specific pcibios_resource_to_bus and
   pcibios_bus_to_resource as its not needed for this series. Will address
   the issue as a standalone patch separate from this series.

   - Rebased on pci/next, with head at f8a1c947ccc6 ("Merge branch 'pci/misc'")


v10 series https://lore.kernel.org/all/20260302203325.3826-1-alifm@linux.ibm.com/
v10 -> v11
   - Rebase on pci/next to handle merge conflicts with patch 1.

   - Typo fixup in commit message (patch 4) and use guard() for mutex
    (patch 6).



Farhan Ali (7):
  PCI: Allow per function PCI slots to fix slot reset on s390
  PCI: Avoid saving config space state if inaccessible
  PCI: Fail FLR when config space is inaccessible
  s390/pci: Store PCI error information for passthrough devices
  vfio-pci/zdev: Add a device feature for error information
  vfio/pci: Add a reset_done callback for vfio-pci driver
  vfio/pci: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX

 arch/s390/include/asm/pci.h       |  32 +++++++
 arch/s390/pci/pci.c               |   1 +
 arch/s390/pci/pci_event.c         | 133 ++++++++++++++++++------------
 drivers/pci/hotplug/rpaphp_slot.c |   2 +-
 drivers/pci/pci.c                 |  32 ++++++-
 drivers/pci/slot.c                |  33 ++++++--
 drivers/vfio/pci/vfio_pci_core.c  |  22 +++--
 drivers/vfio/pci/vfio_pci_intrs.c |   3 +-
 drivers/vfio/pci/vfio_pci_priv.h  |   9 ++
 drivers/vfio/pci/vfio_pci_zdev.c  |  57 ++++++++++++-
 include/linux/pci.h               |   8 +-
 include/uapi/linux/vfio.h         |  30 +++++++
 12 files changed, 287 insertions(+), 75 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  2026-05-05 21:29   ` sashiko-bot
  2026-05-05 20:05 ` [PATCH v15 2/7] PCI: Avoid saving config space state if inaccessible Farhan Ali
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato, stable

On s390 systems, which use a machine level hypervisor, PCI devices are
always accessed through a form of PCI pass-through which fundamentally
operates on a per PCI function granularity. This is also reflected in the
s390 PCI hotplug driver which creates hotplug slots for individual PCI
functions. Its reset_slot() function, which is a wrapper for
zpci_hot_reset_device(), thus also resets individual functions.

Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
to multifunction devices. This approach worked fine on s390 systems that
only exposed virtual functions as individual PCI domains to the operating
system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
s390 supports exposing the topology of multifunction PCI devices by
grouping them in a shared PCI domain. This creates a problem when resetting
a function through the hotplug driver's slot_reset() interface.

When attempting to reset a function through the hotplug driver, the shared
slot assignment causes the wrong function to be reset instead of the
intended one. It also leaks memory as we do create a pci_slot object for
the function, but don't correctly free it in pci_slot_release().

Add a flag for struct pci_slot to allow per function PCI slots for
functions managed through a hypervisor, which exposes individual PCI
functions while retaining the topology. Since we can use all 8 bits
for slot 'number' (for ARI devices), change slot 'number' u16 to
account for special values -1 and PCI_SLOT_ALL_DEVICES.

Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
Cc: stable@vger.kernel.org
Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/hotplug/rpaphp_slot.c |  2 +-
 drivers/pci/pci.c                 |  5 +++--
 drivers/pci/slot.c                | 33 +++++++++++++++++++++++--------
 include/linux/pci.h               |  8 ++++++--
 4 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/hotplug/rpaphp_slot.c b/drivers/pci/hotplug/rpaphp_slot.c
index 67362e5b9971..92eabf5f61b9 100644
--- a/drivers/pci/hotplug/rpaphp_slot.c
+++ b/drivers/pci/hotplug/rpaphp_slot.c
@@ -84,7 +84,7 @@ int rpaphp_register_slot(struct slot *slot)
 	struct hotplug_slot *php_slot = &slot->hotplug_slot;
 	u32 my_index;
 	int retval;
-	int slotno = -1;
+	int slotno = PCI_SLOT_PLACEHOLDER;
 
 	dbg("%s registering slot:path[%pOF] index[%x], name[%s] pdomain[%x] type[%d]\n",
 		__func__, slot->dn, slot->index, slot->name,
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8f7cfcc00090..d0c9f0166af5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4865,8 +4865,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
 
 static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
 {
-	if (dev->multifunction || dev->subordinate || !dev->slot ||
-	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
+	if (dev->subordinate || !dev->slot ||
+	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
+	    (dev->multifunction && !dev->slot->per_func_slot))
 		return -ENOTTY;
 
 	return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index 6d5cd37bfb1e..894d6213ed30 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -37,7 +37,7 @@ static const struct sysfs_ops pci_slot_sysfs_ops = {
 
 static ssize_t address_read_file(struct pci_slot *slot, char *buf)
 {
-	if (slot->number == 0xff)
+	if (slot->number == (u16)PCI_SLOT_PLACEHOLDER)
 		return sysfs_emit(buf, "%04x:%02x\n",
 				  pci_domain_nr(slot->bus),
 				  slot->bus->number);
@@ -72,6 +72,23 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
 	return bus_speed_read(slot->bus->cur_bus_speed, buf);
 }
 
+static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
+{
+	if (slot->per_func_slot)
+		return dev->devfn == slot->number;
+
+	return slot->number == PCI_SLOT_ALL_DEVICES ||
+		PCI_SLOT(dev->devfn) == slot->number;
+}
+
+static bool pci_slot_enabled_per_func(void)
+{
+	if (IS_ENABLED(CONFIG_S390))
+		return true;
+
+	return false;
+}
+
 static void pci_slot_release(struct kobject *kobj)
 {
 	struct pci_dev *dev;
@@ -82,8 +99,7 @@ static void pci_slot_release(struct kobject *kobj)
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &slot->bus->devices, bus_list)
-		if (slot->number == PCI_SLOT_ALL_DEVICES ||
-		    PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = NULL;
 	up_read(&pci_bus_sem);
 
@@ -187,8 +203,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
 
 	mutex_lock(&pci_slot_mutex);
 	list_for_each_entry(slot, &dev->bus->slots, list)
-		if (slot->number == PCI_SLOT_ALL_DEVICES ||
-		    PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	mutex_unlock(&pci_slot_mutex);
 }
@@ -267,7 +282,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 
 	mutex_lock(&pci_slot_mutex);
 
-	if (slot_nr == -1)
+	if (slot_nr == PCI_SLOT_PLACEHOLDER)
 		goto placeholder;
 
 	/*
@@ -298,6 +313,9 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 	slot->bus = pci_bus_get(parent);
 	slot->number = slot_nr;
 
+	if (pci_slot_enabled_per_func())
+		slot->per_func_slot = 1;
+
 	slot->kobj.kset = pci_slots_kset;
 
 	slot_name = make_slot_name(name);
@@ -318,8 +336,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &parent->devices, bus_list)
-		if (slot_nr == PCI_SLOT_ALL_DEVICES ||
-		    PCI_SLOT(dev->devfn) == slot_nr)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	up_read(&pci_bus_sem);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..d58982aa8730 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -78,14 +78,18 @@
  * and, if ARI Forwarding is enabled, functions may appear to be on multiple
  * devices.
  */
-#define PCI_SLOT_ALL_DEVICES	0xfe
+#define PCI_SLOT_ALL_DEVICES	0xfeff
+
+/* Used to identify a slot as a placeholder */
+#define PCI_SLOT_PLACEHOLDER	-1
 
 /* pci_slot represents a physical slot */
 struct pci_slot {
 	struct pci_bus		*bus;		/* Bus this slot is on */
 	struct list_head	list;		/* Node in list of slots */
 	struct hotplug_slot	*hotplug;	/* Hotplug info (move here) */
-	unsigned char		number;		/* Device nr, or PCI_SLOT_ALL_DEVICES */
+	u16			number;		/* Device nr, or PCI_SLOT_ALL_DEVICES */
+	unsigned int		per_func_slot:1; /* Allow per function slot */
 	struct kobject		kobj;
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v15 2/7] PCI: Avoid saving config space state if inaccessible
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
  2026-05-05 20:05 ` [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390 Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  2026-05-05 21:58   ` sashiko-bot
  2026-05-05 20:05 ` [PATCH v15 3/7] PCI: Fail FLR when config space is inaccessible Farhan Ali
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato, Bjorn Helgaas

The current reset process saves the device's config space state before
reset and restores it afterward. However errors may occur unexpectedly and
it may then be impossible to save config space because the device may be
inaccessible (e.g. DPC). This results in saving invalid values that get
written back to the device during state restoration.

With a reset we want to recover/restore the device into a functional state.
So avoid saving the state of the config space when the device config space
is inaccessible.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d0c9f0166af5..e71b81415392 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -722,6 +722,27 @@ u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
 }
 EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
 
+static bool pci_dev_config_accessible(struct pci_dev *dev, char *msg)
+{
+	u32 val;
+
+	/*
+	 * If device's config space is inaccessible it can return ~0 for
+	 * any reads. Since VFs can also return ~0 for Device and Vendor ID
+	 * check Command and Status registers. Note that this is racy
+	 * because the device may become inaccessible partway through
+	 * next access.
+	 */
+	pci_read_config_dword(dev, PCI_COMMAND, &val);
+	if (PCI_POSSIBLE_ERROR(val)) {
+		pci_warn(dev, "Device config space inaccessible; unable to %s\n",
+				msg);
+		return false;
+	}
+
+	return true;
+}
+
 /**
  * pci_find_parent_resource - return resource region of parent bus of given
  *			      region
@@ -5027,6 +5048,9 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
 	 */
 	pci_set_power_state(dev, PCI_D0);
 
+	if (!pci_dev_config_accessible(dev, "save state"))
+		return;
+
 	pci_save_state(dev);
 	/*
 	 * Disable the device by clearing the Command register, except for
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v15 3/7] PCI: Fail FLR when config space is inaccessible
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
  2026-05-05 20:05 ` [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390 Farhan Ali
  2026-05-05 20:05 ` [PATCH v15 2/7] PCI: Avoid saving config space state if inaccessible Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  2026-05-05 22:20   ` sashiko-bot
  2026-05-05 20:05 ` [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices Farhan Ali
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato, Benjamin Block

If a device is in an error state, then it's config space may not be
accssible. Add additional check to validate if a device's config space is
accessible before doing an FLR reset.

Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e71b81415392..c50418f5e318 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4396,6 +4396,9 @@ int pcie_reset_flr(struct pci_dev *dev, bool probe)
 	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
 		return -ENOTTY;
 
+	if (!pci_dev_config_accessible(dev, "FLR"))
+		return -ENOTTY;
+
 	if (probe)
 		return 0;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (2 preceding siblings ...)
  2026-05-05 20:05 ` [PATCH v15 3/7] PCI: Fail FLR when config space is inaccessible Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  2026-05-05 22:56   ` sashiko-bot
  2026-05-06  9:38   ` Niklas Schnelle
  2026-05-05 20:05 ` [PATCH v15 5/7] vfio-pci/zdev: Add a device feature for error information Farhan Ali
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato

For a passthrough device we need co-operation from user space to recover
the device. This would require to bubble up any error information to user
space.  Let's store this error information for passthrough devices, so it
can be retrieved later.

We can now have userspace drivers (vfio-pci based) on s390x. The userspace
drivers will not have any KVM fd and so no kzdev associated with them. So
we need to update the logic for detecting passthrough devices to not depend
on struct kvm_zdev.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      |  30 ++++++++
 arch/s390/pci/pci.c              |   1 +
 arch/s390/pci/pci_event.c        | 114 +++++++++++++++++--------------
 drivers/vfio/pci/vfio_pci_zdev.c |  18 ++++-
 4 files changed, 109 insertions(+), 54 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 5dcf35f0f325..016386f7ef4a 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -118,6 +118,32 @@ struct zpci_bus {
 	enum pci_bus_speed	max_bus_speed;
 };
 
+/* Content Code Description for PCI Function Error */
+struct zpci_ccdf_err {
+	u32 reserved1;
+	u32 fh;                         /* function handle */
+	u32 fid;                        /* function id */
+	u32 ett         :  4;           /* expected table type */
+	u32 mvn         : 12;           /* MSI vector number */
+	u32 dmaas       :  8;           /* DMA address space */
+	u32 reserved2   :  6;
+	u32 q           :  1;           /* event qualifier */
+	u32 rw          :  1;           /* read/write */
+	u64 faddr;                      /* failing address */
+	u32 reserved3;
+	u16 reserved4;
+	u16 pec;                        /* PCI event code */
+} __packed;
+
+#define ZPCI_ERR_PENDING_MAX 4
+struct zpci_ccdf_pending {
+	bool mediated_recovery;
+	u8 count;
+	u8 head;
+	u8 tail;
+	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
+};
+
 /* Private data per function */
 struct zpci_dev {
 	struct zpci_bus *zbus;
@@ -192,6 +218,8 @@ struct zpci_dev {
 	struct iommu_domain *s390_domain; /* attached IOMMU domain */
 	struct kvm_zdev *kzdev;
 	struct mutex kzdev_lock;
+	struct zpci_ccdf_pending pending_errs;
+	struct mutex pending_errs_lock;
 	spinlock_t dom_lock;		/* protect s390_domain change */
 };
 
@@ -334,6 +362,8 @@ void zpci_debug_exit_device(struct zpci_dev *);
 int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
 int zpci_clear_error_state(struct zpci_dev *zdev);
 int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
+void zpci_start_mediated_recovery(struct zpci_dev *zdev);
+void zpci_stop_mediated_recovery(struct zpci_dev *zdev);
 
 #ifdef CONFIG_NUMA
 
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 39bd2adfc240..2d377c2e194d 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -842,6 +842,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
 	mutex_init(&zdev->state_lock);
 	mutex_init(&zdev->fmb_lock);
 	mutex_init(&zdev->kzdev_lock);
+	mutex_init(&zdev->pending_errs_lock);
 
 	return zdev;
 
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 839bd91c056e..41547c54076f 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -17,23 +17,6 @@
 #include "pci_bus.h"
 #include "pci_report.h"
 
-/* Content Code Description for PCI Function Error */
-struct zpci_ccdf_err {
-	u32 reserved1;
-	u32 fh;				/* function handle */
-	u32 fid;			/* function id */
-	u32 ett		:  4;		/* expected table type */
-	u32 mvn		: 12;		/* MSI vector number */
-	u32 dmaas	:  8;		/* DMA address space */
-	u32		:  6;
-	u32 q		:  1;		/* event qualifier */
-	u32 rw		:  1;		/* read/write */
-	u64 faddr;			/* failing address */
-	u32 reserved3;
-	u16 reserved4;
-	u16 pec;			/* PCI event code */
-} __packed;
-
 /* Content Code Description for PCI Function Availability */
 struct zpci_ccdf_avail {
 	u32 reserved1;
@@ -60,18 +43,6 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
 	}
 }
 
-static bool is_passed_through(struct pci_dev *pdev)
-{
-	struct zpci_dev *zdev = to_zpci(pdev);
-	bool ret;
-
-	mutex_lock(&zdev->kzdev_lock);
-	ret = !!zdev->kzdev;
-	mutex_unlock(&zdev->kzdev_lock);
-
-	return ret;
-}
-
 static bool is_driver_supported(struct pci_driver *driver)
 {
 	if (!driver || !driver->err_handler)
@@ -81,6 +52,46 @@ static bool is_driver_supported(struct pci_driver *driver)
 	return true;
 }
 
+static void zpci_store_pci_error(struct pci_dev *pdev,
+				 struct zpci_ccdf_err *ccdf)
+{
+	struct zpci_dev *zdev = to_zpci(pdev);
+	int i;
+
+	guard(mutex)(&zdev->pending_errs_lock);
+	if (!zdev->pending_errs.mediated_recovery)
+		return;
+
+	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
+		pr_err("%s: Maximum number (%d) of pending error events queued",
+		       pci_name(pdev), ZPCI_ERR_PENDING_MAX);
+		return;
+	}
+
+	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
+	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
+	zdev->pending_errs.tail++;
+	zdev->pending_errs.count++;
+}
+
+void zpci_start_mediated_recovery(struct zpci_dev *zdev)
+{
+	guard(mutex)(&zdev->pending_errs_lock);
+	zdev->pending_errs.mediated_recovery = true;
+}
+EXPORT_SYMBOL_GPL(zpci_start_mediated_recovery);
+
+void zpci_stop_mediated_recovery(struct zpci_dev *zdev)
+{
+	guard(mutex)(&zdev->pending_errs_lock);
+	zdev->pending_errs.mediated_recovery = false;
+	if (zdev->pending_errs.count)
+		pr_info("Unhandled PCI error events count=%d for PCI function 0x%x",
+			zdev->pending_errs.count, zdev->fid);
+	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
+}
+EXPORT_SYMBOL_GPL(zpci_stop_mediated_recovery);
+
 static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
 							 struct pci_driver *driver)
 {
@@ -175,7 +186,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
  * and the platform determines which functions are affected for
  * multi-function devices.
  */
-static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
+static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
+							  struct zpci_ccdf_err *ccdf)
 {
 	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
 	struct zpci_dev *zdev = to_zpci(pdev);
@@ -194,13 +206,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (is_passed_through(pdev)) {
-		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
-			pci_name(pdev));
-		status_str = "failed (pass-through)";
-		goto out_unlock;
-	}
-
 	driver = to_pci_driver(pdev->dev.driver);
 	if (!is_driver_supported(driver)) {
 		if (!driver) {
@@ -216,12 +221,25 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 		goto out_unlock;
 	}
 
+	zpci_store_pci_error(pdev, ccdf);
+
 	ers_res = zpci_event_notify_error_detected(pdev, driver);
 	if (ers_result_indicates_abort(ers_res)) {
 		status_str = "failed (abort on detection)";
 		goto out_unlock;
 	}
 
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.mediated_recovery) {
+		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
+			pci_name(pdev));
+		ers_res = PCI_ERS_RESULT_RECOVERED;
+		status_str = "in progress";
+		mutex_unlock(&zdev->pending_errs_lock);
+		goto out_unlock;
+	}
+	mutex_unlock(&zdev->pending_errs_lock);
+
 	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
 		ers_res = zpci_event_do_error_state_clear(pdev, driver);
 		if (ers_result_indicates_abort(ers_res)) {
@@ -266,25 +284,19 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
  * @pdev: PCI function for which to report
  * @es: PCI channel failure state to report
  */
-static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
+static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
+				  struct zpci_ccdf_err *ccdf)
 {
 	struct pci_driver *driver;
 
 	pci_dev_lock(pdev);
 	pdev->error_state = es;
-	/**
-	 * While vfio-pci's error_detected callback notifies user-space QEMU
-	 * reacts to this by freezing the guest. In an s390 environment PCI
-	 * errors are rarely fatal so this is overkill. Instead in the future
-	 * we will inject the error event and let the guest recover the device
-	 * itself.
-	 */
-	if (is_passed_through(pdev))
-		goto out;
+
+	zpci_store_pci_error(pdev, ccdf);
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
 		driver->err_handler->error_detected(pdev, pdev->error_state);
-out:
+
 	pci_dev_unlock(pdev);
 }
 
@@ -330,12 +342,12 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
 		break;
 	case 0x0040: /* Service Action or Error Recovery Failed */
 	case 0x003b:
-		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
 		break;
 	default: /* PCI function left in the error state attempt to recover */
-		ers_res = zpci_event_attempt_error_recovery(pdev);
+		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
 		if (ers_res != PCI_ERS_RESULT_RECOVERED)
-			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+			zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
 		break;
 	}
 	pci_dev_put(pdev);
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 0990fdb146b7..78a28db00c6d 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -144,24 +144,36 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	int ret;
 
 	if (!zdev)
 		return -ENODEV;
 
+	zpci_start_mediated_recovery(zdev);
+
 	if (!vdev->vdev.kvm)
 		return 0;
 
+	ret = -ENOENT;
 	if (zpci_kvm_hook.kvm_register)
-		return zpci_kvm_hook.kvm_register(zdev, vdev->vdev.kvm);
+		ret = zpci_kvm_hook.kvm_register(zdev, vdev->vdev.kvm);
 
-	return -ENOENT;
+	if (ret)
+		zpci_stop_mediated_recovery(zdev);
+
+	return ret;
 }
 
 void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
 
-	if (!zdev || !vdev->vdev.kvm)
+	if (!zdev)
+		return;
+
+	zpci_stop_mediated_recovery(zdev);
+
+	if (!vdev->vdev.kvm)
 		return;
 
 	if (zpci_kvm_hook.kvm_unregister)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v15 5/7] vfio-pci/zdev: Add a device feature for error information
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (3 preceding siblings ...)
  2026-05-05 20:05 ` [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  2026-05-05 23:27   ` sashiko-bot
  2026-05-05 20:05 ` [PATCH v15 6/7] vfio/pci: Add a reset_done callback for vfio-pci driver Farhan Ali
  2026-05-05 20:05 ` [PATCH v15 7/7] vfio/pci: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  6 siblings, 1 reply; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato

For zPCI devices, we have platform specific error information. The platform
firmware provides this error information to the operating system in an
architecture specific mechanism. To enable recovery from userspace for
these devices, we want to expose this error information to userspace. Add a
new device feature to expose this information.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      |  2 ++
 arch/s390/pci/pci_event.c        | 19 ++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c |  2 ++
 drivers/vfio/pci/vfio_pci_priv.h |  9 ++++++++
 drivers/vfio/pci/vfio_pci_zdev.c | 39 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 30 ++++++++++++++++++++++++
 6 files changed, 101 insertions(+)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 016386f7ef4a..88a125b92bdd 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -364,6 +364,8 @@ int zpci_clear_error_state(struct zpci_dev *zdev);
 int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
 void zpci_start_mediated_recovery(struct zpci_dev *zdev);
 void zpci_stop_mediated_recovery(struct zpci_dev *zdev);
+int zpci_get_pending_error(struct zpci_dev *zdev,
+			   struct zpci_ccdf_err *ccdf);
 
 #ifdef CONFIG_NUMA
 
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 41547c54076f..fc4b51bc4c3c 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -74,6 +74,25 @@ static void zpci_store_pci_error(struct pci_dev *pdev,
 	zdev->pending_errs.count++;
 }
 
+int zpci_get_pending_error(struct zpci_dev *zdev,
+			    struct zpci_ccdf_err *ccdf)
+{
+	int head = 0;
+
+	guard(mutex)(&zdev->pending_errs_lock);
+
+	if (!zdev->pending_errs.count)
+		return -ENOMSG;
+
+	head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
+	memcpy(ccdf, &zdev->pending_errs.err[head],
+		sizeof(struct zpci_ccdf_err));
+	zdev->pending_errs.head++;
+	zdev->pending_errs.count--;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(zpci_get_pending_error);
+
 void zpci_start_mediated_recovery(struct zpci_dev *zdev)
 {
 	guard(mutex)(&zdev->pending_errs_lock);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 3f8d093aacf8..f47a49ca67fa 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1534,6 +1534,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_DMA_BUF:
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
+		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..4e7162234a2e 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -93,6 +93,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				struct vfio_info_cap *caps);
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
 void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 					      struct vfio_info_cap *caps)
@@ -107,6 +109,13 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 
 static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {}
+
+static inline int vfio_pci_zdev_feature_err(struct vfio_device *device,
+					    u32 flags, void __user *arg,
+					    size_t argsz)
+{
+	return -ENOTTY;
+}
 #endif
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 78a28db00c6d..cc148123a97b 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -141,6 +141,45 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz)
+{
+	struct vfio_device_feature_zpci_err err = {};
+	struct vfio_pci_core_device *vdev;
+	struct zpci_ccdf_err ccdf = {};
+	struct zpci_dev *zdev;
+	int ret;
+
+	vdev = container_of(device, struct vfio_pci_core_device, vdev);
+	zdev = to_zpci(vdev->pdev);
+	if (!zdev)
+		return -ENODEV;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(err));
+	if (ret != 1)
+		return ret;
+
+	ret = zpci_get_pending_error(zdev, &ccdf);
+	if (ret)
+		return ret;
+
+	err.fh = ccdf.fh;
+	err.fid = ccdf.fid;
+	err.ett = ccdf.ett;
+	err.mvn = ccdf.mvn;
+	err.dmaas = ccdf.dmaas;
+	err.q = ccdf.q;
+	err.rw = ccdf.rw;
+	err.faddr = ccdf.faddr;
+	err.pec = ccdf.pec;
+
+	if (copy_to_user(arg, &err, sizeof(err)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..3eead86a00ab 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,36 @@ struct vfio_device_feature_dma_buf {
  */
 #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
 
+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
+ * userspace for vfio-pci devices on s390. On s390, PCI error recovery
+ * involves platform firmware and notification to operating systems is done
+ * by architecture specific mechanism. Exposing this information to
+ * userspace allows it to take appropriate actions to handle an
+ * error on the device. The ioctl returns -ENOMSG if there are no pending
+ * PCI errors.
+ */
+
+struct vfio_device_feature_zpci_err {
+	__u32 feature_flags;		/* Indicate future features */
+	__u32 reserved1;
+	__u32 fh;			/* function handle */
+	__u32 fid;			/* function id */
+	__u32 ett		:  4;	/* expected table type */
+	__u32 mvn		: 12;	/* MSI vector number */
+	__u32 dmaas		:  8;	/* DMA address space */
+	__u32 reserved2		:  6;
+	__u32 q			:  1;	/* event qualifier */
+	__u32 rw		:  1;	/* read/write */
+	__u64 faddr;			/* failing address */
+	__u32 reserved3;
+	__u16 reserved4;
+	__u16 pec;			/* PCI event code */
+	__u8 reserved5[28];		/* Allow for future expansion */
+};
+
+#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 13
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v15 6/7] vfio/pci: Add a reset_done callback for vfio-pci driver
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (4 preceding siblings ...)
  2026-05-05 20:05 ` [PATCH v15 5/7] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  2026-05-05 23:56   ` sashiko-bot
  2026-05-05 20:05 ` [PATCH v15 7/7] vfio/pci: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  6 siblings, 1 reply; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato, Julian Ruess

On error recovery for a PCI device bound to vfio-pci driver, we want to
recover the state of the device to its last known saved state. The callback
restores the state of the device to its initial saved state.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f47a49ca67fa..6358b59fab2b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2255,6 +2255,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
+static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+	if (!vdev->pci_saved_state)
+		return;
+
+	pci_load_saved_state(pdev, vdev->pci_saved_state);
+	pci_restore_state(pdev);
+}
+
 int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 				  int nr_virtfn)
 {
@@ -2319,6 +2330,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 const struct pci_error_handlers vfio_pci_core_err_handlers = {
 	.error_detected = vfio_pci_core_aer_err_detected,
+	.reset_done = vfio_pci_core_aer_reset_done,
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v15 7/7] vfio/pci: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (5 preceding siblings ...)
  2026-05-05 20:05 ` [PATCH v15 6/7] vfio/pci: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2026-05-05 20:05 ` Farhan Ali
  6 siblings, 0 replies; 17+ messages in thread
From: Farhan Ali @ 2026-05-05 20:05 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, alifm, schnelle, mjrosato, Julian Ruess

The error signaling is configured for the vast majority of devices and it's
extremely rare that it fires anyway. Removing the pcie check will allow
userspace to be notified on errors for legacy PCI devices. The Internal
Shared Memory (ISM) device on s390 is one such device. For PCI devices on
IBM s390 error recovery involves platform firmware and notification to
operating system is done by architecture specific way. So the ISM device
can still be recovered when notified of an error.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 8 ++------
 drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6358b59fab2b..cd9f31b28160 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -786,8 +786,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
 		}
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
+		return 1;
 	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
 		return 1;
 	}
@@ -1163,11 +1162,8 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
 	switch (info.index) {
 	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
 	case VFIO_PCI_REQ_IRQ_INDEX:
-		break;
 	case VFIO_PCI_ERR_IRQ_INDEX:
-		if (pci_is_pcie(vdev->pdev))
-			break;
-		fallthrough;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 33944d4d9dc4..64f80f64ff57 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -859,8 +859,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 	case VFIO_PCI_ERR_IRQ_INDEX:
 		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
 		case VFIO_IRQ_SET_ACTION_TRIGGER:
-			if (pci_is_pcie(vdev->pdev))
-				func = vfio_pci_set_err_trigger;
+			func = vfio_pci_set_err_trigger;
 			break;
 		}
 		break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390
  2026-05-05 20:05 ` [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390 Farhan Ali
@ 2026-05-05 21:29   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-05-05 21:29 UTC (permalink / raw)
  To: Farhan Ali; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Global enablement of `per_func_slot` on S390 breaks standard PCIe hotplug matching and hardware association.
- [High] Lockless read of `dev->slot` in `pci_dev_reset_slot_function()` leads to a Time-Of-Check to Time-Of-Use (TOCTOU) Use-After-Free.
--

commit e4e35148b2442f92d3db17a20a087722c65c043d
Author: Farhan Ali <alifm@linux.ibm.com>
PCI: Allow per function PCI slots to fix slot reset on s390

This commit introduces a per-function PCI slot flag to fix slot resets for
multifunction devices on s390 systems. It updates the PCI slot number field
to a 16-bit integer to accommodate special placeholder and all-devices values.

> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8f7cfcc000901..d0c9f0166af5c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
[ ... ]
> @@ -4865,8 +4865,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
>  
>  static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
>  {
> -	if (dev->multifunction || dev->subordinate || !dev->slot ||
> -	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> +	if (dev->subordinate || !dev->slot ||
> +	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
> +	    (dev->multifunction && !dev->slot->per_func_slot))

Does this code introduce a time-of-check to time-of-use use-after-free
regression?

When dev->slot is accessed locklessly, it could be freed concurrently if a
hotplug driver module unload invokes pci_destroy_slot() and
pci_slot_release(), which clears dev->slot to NULL under pci_bus_sem and
frees the slot.

Because this function does not hold pci_bus_sem or elevate the slot's
reference count, adding another lockless dereference to
dev->slot->per_func_slot right after the !dev->slot check could access
freed memory if the slot is destroyed in between.

> diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
> index 6d5cd37bfb1e7..894d6213ed306 100644
> --- a/drivers/pci/slot.c
> +++ b/drivers/pci/slot.c
[ ... ]
> @@ -72,6 +72,23 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
>  	return bus_speed_read(slot->bus->cur_bus_speed, buf);
>  }
>  
> +static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
> +{
> +	if (slot->per_func_slot)
> +		return dev->devfn == slot->number;
> +
> +	return slot->number == PCI_SLOT_ALL_DEVICES ||
> +		PCI_SLOT(dev->devfn) == slot->number;
> +}
> +
> +static bool pci_slot_enabled_per_func(void)
> +{
> +	if (IS_ENABLED(CONFIG_S390))
> +		return true;
> +
> +	return false;
> +}

Does this global enablement cause a regression for standard PCIe hotplug
matching on s390 systems?

By unconditionally returning true on s390, slot->per_func_slot is globally
set to 1 for all hotplug slots created. When per_func_slot is true,
pci_dev_matches_slot() prioritizes an exact match of
dev->devfn == slot->number.

However, standard hotplug drivers like pciehp initialize slots using
PCI_SLOT_ALL_DEVICES (now 0xfeff). Since a device's devfn is an 8-bit value,
this will never match, which might completely break pciehp topology on s390
environments such as KVM guests.

Additionally, if generic drivers like shpchp register slots using just the
device number (for example, 5), the strict match will compare devfn to 5
(device 0, function 5) instead of the correct device 5, function 0
(devfn 40). Could this lead to failed associations or cross-device resets?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505200510.2954-1-alifm@linux.ibm.com?part=1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 2/7] PCI: Avoid saving config space state if inaccessible
  2026-05-05 20:05 ` [PATCH v15 2/7] PCI: Avoid saving config space state if inaccessible Farhan Ali
@ 2026-05-05 21:58   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-05-05 21:58 UTC (permalink / raw)
  To: Farhan Ali; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Early return when config space is inaccessible skips device DMA disablement.
- [Low] The 'msg' parameter in pci_dev_config_accessible() lacks a 'const' qualifier.
--

commit 18b574199dc0470f2784ade2f4883a656c60245e
Author: Farhan Ali <alifm@linux.ibm.com>

PCI: Avoid saving config space state if inaccessible

The commit updates the reset process to check if a device's config space
is accessible before attempting to save its state. This prevents saving
invalid values if the device is in an error state like DPC.

> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d0c9f0166af5c..e71b814153921 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -722,6 +722,27 @@ u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
>  }
>  EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
>  
> +static bool pci_dev_config_accessible(struct pci_dev *dev, char *msg)

This isn't a bug, but should the msg parameter be declared as a const
char pointer?

The function is later called with the string literal "save state".

[ ... ]

> @@ -5027,6 +5048,9 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>  	 */
>  	pci_set_power_state(dev, PCI_D0);
>  
> +	if (!pci_dev_config_accessible(dev, "save state"))
> +		return;
> +
>  	pci_save_state(dev);
>  	/*
>  	 * Disable the device by clearing the Command register, except for
>  	 * INTx-disable which is set.  This not only disables MMIO and I/O port
>  	 * BARs, but also prevents the device from being Bus Master, preventing
>  	 * DMA from the device including MSI/MSI-X interrupts.  For PCI 2.3
>  	 * compliant devices, INTx-disable prevents legacy interrupts.
>  	 */
>  	pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);

Does returning early here skip disabling the device DMA engine?

If the configuration space read fails due to a transient error like a
completion timeout instead of the link being fully down, the device might
still be active.

Skipping the write to PCI_COMMAND leaves Bus Master enabled, which could
allow the device to continue performing DMA or raising interrupts during the
reset process.

Could the code conditionally skip pci_save_state() but still attempt the
write to PCI_COMMAND to ensure a best-effort disablement?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505200510.2954-1-alifm@linux.ibm.com?part=2

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 3/7] PCI: Fail FLR when config space is inaccessible
  2026-05-05 20:05 ` [PATCH v15 3/7] PCI: Fail FLR when config space is inaccessible Farhan Ali
@ 2026-05-05 22:20   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-05-05 22:20 UTC (permalink / raw)
  To: Farhan Ali; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Logic flaw: pci_dev_config_accessible() executed before the probe check can permanently disable the FLR capability.
--

commit dc74511c4dd4719d1b362bacdaab4c551ada9e66
Author: Farhan Ali <alifm@linux.ibm.com>

PCI: Fail FLR when config space is inaccessible

This commit adds a check to validate if a device's config space is accessible
before attempting an FLR reset. This is intended to fail the FLR early when
the device is in an error state.

> If a device is in an error state, then it's config space may not be
> accssible.

This isn't a bug, but there are a couple of typos here ("it's" instead of
"its", and "accssible" instead of "accessible").

> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e71b814153921..c50418f5e3181 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4396,6 +4396,9 @@ int pcie_reset_flr(struct pci_dev *dev, bool probe)
>  	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
>  		return -ENOTTY;
>  
> +	if (!pci_dev_config_accessible(dev, "FLR"))
> +		return -ENOTTY;
> +
>  	if (probe)
>  		return 0;

Could this placement cause the FLR capability to be permanently disabled if
the device is temporarily inaccessible during discovery?

pcie_reset_flr() is called with probe == true during device initialization
to discover if the hardware inherently supports FLR. By placing the live
config space check before the probe check, the system will actively read the
PCI_COMMAND register during capability discovery.

If the device happens to be temporarily inaccessible during this probe phase,
it will return -ENOTTY and FLR will be permanently omitted from the device's
reset_methods array.

Would it make sense to move the pci_dev_config_accessible() check after the
if (probe) block so we only check accessibility when actually performing the
reset?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505200510.2954-1-alifm@linux.ibm.com?part=3

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices
  2026-05-05 20:05 ` [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2026-05-05 22:56   ` sashiko-bot
  2026-05-06  9:38   ` Niklas Schnelle
  1 sibling, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-05-05 22:56 UTC (permalink / raw)
  To: Farhan Ali; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] The mediated recovery feature for non-KVM passthrough devices is gated by a KVM-dependent Kconfig option, causing it to be compiled out and fail when KVM is disabled.
- [High] A time-of-check to time-of-use (TOCTOU) race condition exists in `zpci_event_attempt_error_recovery` regarding the `mediated_recovery` state, leading to permanently broken passthrough devices.
- [Low] Missing trailing newline (`\n`) characters in `pr_err()` and `pr_info()` format strings.
--

commit b20dbe013c9c596d29efb7c91a62d8992eaf33b9
Author: Farhan Ali <alifm@linux.ibm.com>

s390/pci: Store PCI error information for passthrough devices

This commit modifies s390 PCI error handling to store error information for
passthrough devices. This allows userspace drivers like vfio-pci to manage
device recovery directly, removing the dependency on a KVM file descriptor.

> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -81,6 +52,46 @@ static bool is_driver_supported(struct pci_driver *driver)
[ ... ]
> +static void zpci_store_pci_error(struct pci_dev *pdev,
> +				 struct zpci_ccdf_err *ccdf)
> +{
> +	struct zpci_dev *zdev = to_zpci(pdev);
> +	int i;
> +
> +	guard(mutex)(&zdev->pending_errs_lock);
> +	if (!zdev->pending_errs.mediated_recovery)
> +		return;
> +
> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
> +		pr_err("%s: Maximum number (%d) of pending error events queued",
> +		       pci_name(pdev), ZPCI_ERR_PENDING_MAX);

Is this missing a trailing newline (\n) in the format string?

> +		return;
> +	}
> +
> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
> +	zdev->pending_errs.tail++;
> +	zdev->pending_errs.count++;
> +}
[ ... ]
> +void zpci_stop_mediated_recovery(struct zpci_dev *zdev)
> +{
> +	guard(mutex)(&zdev->pending_errs_lock);
> +	zdev->pending_errs.mediated_recovery = false;
> +	if (zdev->pending_errs.count)
> +		pr_info("Unhandled PCI error events count=%d for PCI function 0x%x",
> +			zdev->pending_errs.count, zdev->fid);

Does this pr_info() also need a trailing newline (\n)?

> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
> +}
> +EXPORT_SYMBOL_GPL(zpci_stop_mediated_recovery);
[ ... ]
> @@ -216,12 +221,25 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  		goto out_unlock;
>  	}
>  
> +	zpci_store_pci_error(pdev, ccdf);
> +
>  	ers_res = zpci_event_notify_error_detected(pdev, driver);
>  	if (ers_result_indicates_abort(ers_res)) {
>  		status_str = "failed (abort on detection)";
>  		goto out_unlock;
>  	}
>  
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.mediated_recovery) {
> +		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
> +			pci_name(pdev));
> +		ers_res = PCI_ERS_RESULT_RECOVERED;
> +		status_str = "in progress";
> +		mutex_unlock(&zdev->pending_errs_lock);
> +		goto out_unlock;
> +	}
> +	mutex_unlock(&zdev->pending_errs_lock);

Is there a time-of-check to time-of-use race condition here?

zpci_store_pci_error() checks zdev->pending_errs.mediated_recovery and
returns without storing the error if it is false. Later in this function,
we acquire pending_errs_lock again and re-evaluate mediated_recovery to
decide whether to skip host-level recovery.

If a concurrent userspace open of the VFIO device sets mediated_recovery to
true between these two checks, the error is discarded but the host recovery
is still skipped. Could this state be cached once under a single critical
section?

> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -144,24 +144,36 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +	int ret;
>  
>  	if (!zdev)
>  		return -ENODEV;
>  
> +	zpci_start_mediated_recovery(zdev);

The commit message states the goal is to support userspace drivers without
any KVM fd. However, does placing this logic in vfio_pci_zdev.c inadvertently
tie the mediated recovery feature to KVM?

Since vfio_pci_zdev.c is only built if CONFIG_VFIO_PCI_ZDEV_KVM is enabled,
if a kernel is configured without KVM, vfio_pci_zdev_open_device() resolves
to an empty inline stub. This means mediated_recovery would never be set to
true, and the host kernel would still attempt recovery on the passthrough
device.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505200510.2954-1-alifm@linux.ibm.com?part=4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 5/7] vfio-pci/zdev: Add a device feature for error information
  2026-05-05 20:05 ` [PATCH v15 5/7] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2026-05-05 23:27   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-05-05 23:27 UTC (permalink / raw)
  To: Farhan Ali; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] UAPI structure `vfio_device_feature_zpci_err` uses non-portable bitfields and contains implicit padding, violating UAPI design guidelines.
--

commit 9302e2545ccb75aab62ae8f60f8aba469ebfdf04
Author: Farhan Ali <alifm@linux.ibm.com>

vfio-pci/zdev: Add a device feature for error information

This commit adds a new device feature to expose zPCI platform specific error
information to userspace. This allows userspace to take appropriate actions to
handle an error on the device and perform mediated recovery.

> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5eee..3eead86a00ab2 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h

[ ... ]

> +struct vfio_device_feature_zpci_err {
> +	__u32 feature_flags;		/* Indicate future features */
> +	__u32 reserved1;
> +	__u32 fh;			/* function handle */
> +	__u32 fid;			/* function id */
> +	__u32 ett		:  4;	/* expected table type */
> +	__u32 mvn		: 12;	/* MSI vector number */
> +	__u32 dmaas		:  8;	/* DMA address space */
> +	__u32 reserved2		:  6;
> +	__u32 q			:  1;	/* event qualifier */
> +	__u32 rw		:  1;	/* read/write */

Since the memory layout, ordering, and packing of C bitfields are highly
compiler-dependent, could this compromise cross-platform ABI stability or
break compatibility with other languages interacting with the kernel?

Using standard __u32 fields and bitmask macros instead might be a safer
approach for a UAPI header.

> +	__u64 faddr;			/* failing address */

The faddr member requires 8-byte alignment, but it appears to be positioned
after exactly 20 bytes of prior fields. Does this force 4 bytes of implicit
compiler padding before it?

> +	__u32 reserved3;
> +	__u16 reserved4;
> +	__u16 pec;			/* PCI event code */
> +	__u8 reserved5[28];		/* Allow for future expansion */
> +};

The explicit data size of this structure appears to be 68 bytes, which
would force another 4 bytes of implicit trailing padding to align the 72-byte
structure.

Could this implicit padding make ioctl size validation fragile?

Additionally, if the compiler fails to zero the implicit padding during
struct initialization with err = {}, is there a risk that copy_to_user()
will leak uninitialized kernel stack memory to userspace?

Using explicit reserved padding fields to ensure manual, deterministic 8-byte
alignment could avoid these issues.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505200510.2954-1-alifm@linux.ibm.com?part=5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 6/7] vfio/pci: Add a reset_done callback for vfio-pci driver
  2026-05-05 20:05 ` [PATCH v15 6/7] vfio/pci: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2026-05-05 23:56   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-05-05 23:56 UTC (permalink / raw)
  To: Farhan Ali; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] Concurrent execution of `vfio_pci_core_aer_reset_done` and `vfio_pci_core_disable` leads to an exploitable Use-After-Free on `vdev->pci_saved_state`.
- [High] The return value of `pci_load_saved_state()` is ignored, potentially pushing corrupted state to the hardware.
- [Medium] The state recovery logic is marked `static`, depriving vendor-specific VFIO PCI drivers of the feature.
--

commit 50af2d4f1aef440c94cd13a26e090b9661198f8c
Author: Farhan Ali <alifm@linux.ibm.com>

vfio/pci: Add a reset_done callback for vfio-pci driver

On error recovery for a PCI device bound to the vfio-pci driver, this
commit attempts to restore the device state to its initial saved state
by adding a reset_done callback.

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f47a49ca67fae..6358b59fab2b1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c

[ ... ]

> @@ -2255,6 +2255,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
>  
> +static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)

Vendor-specific VFIO PCI drivers sometimes override the .reset_done callback
in their pci_error_handlers. Should this function be exported instead of
static, so that vendor drivers can also invoke the core framework's state
restoration?

> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
> +
> +	if (!vdev->pci_saved_state)
> +		return;
> +

Is there a race condition here between the AER recovery thread and the VFIO
device close path?

The AER thread invokes vfio_pci_core_aer_reset_done() holding
device_lock(&pdev->dev), but without VFIO-specific locks like dev_set->lock.

Concurrently, closing the VFIO device file descriptor invokes the disable
path:

vfio_pci_core_close_device() -> vfio_pci_core_disable()

The disable path runs with dev_set->lock held, but without
device_lock(&pdev->dev), and it frees the saved state:

pci_load_and_free_saved_state(pdev, &vdev->pci_saved_state)

Could vdev->pci_saved_state be freed immediately after the check here,
causing a use-after-free when it is passed to pci_load_saved_state() below?

> +	pci_load_saved_state(pdev, vdev->pci_saved_state);
> +	pci_restore_state(pdev);

Can pci_load_saved_state() fail and return an error?

If it returns an error, such as -EINVAL for capability size mismatches,
dev->state_saved remains false and the capability buffers may only be
partially copied.

Since the return value is ignored here, does calling pci_restore_state(pdev)
immediately afterward risk pushing partially populated or corrupted state back
to the hardware?

> +}
> +

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505200510.2954-1-alifm@linux.ibm.com?part=6

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices
  2026-05-05 20:05 ` [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices Farhan Ali
  2026-05-05 22:56   ` sashiko-bot
@ 2026-05-06  9:38   ` Niklas Schnelle
  2026-05-06 17:20     ` Farhan Ali
  1 sibling, 1 reply; 17+ messages in thread
From: Niklas Schnelle @ 2026-05-06  9:38 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, linux-kernel, linux-pci; +Cc: helgaas, alex, mjrosato

On Tue, 2026-05-05 at 13:05 -0700, Farhan Ali wrote:
> For a passthrough device we need co-operation from user space to recover
> the device. This would require to bubble up any error information to user
> space.  Let's store this error information for passthrough devices, so it
> can be retrieved later.
> 
> We can now have userspace drivers (vfio-pci based) on s390x. The userspace
> drivers will not have any KVM fd and so no kzdev associated with them. So
> we need to update the logic for detecting passthrough devices to not depend
> on struct kvm_zdev.
> 
> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h      |  30 ++++++++
>  arch/s390/pci/pci.c              |   1 +
>  arch/s390/pci/pci_event.c        | 114 +++++++++++++++++--------------
>  drivers/vfio/pci/vfio_pci_zdev.c |  18 ++++-
>  4 files changed, 109 insertions(+), 54 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 5dcf35f0f325..016386f7ef4a 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -118,6 +118,32 @@ struct zpci_bus {
>  	enum pci_bus_speed	max_bus_speed;
>  };
>  
> +/* Content Code Description for PCI Function Error */
> +struct zpci_ccdf_err {
> +	u32 reserved1;
> +	u32 fh;                         /* function handle */
> +	u32 fid;                        /* function id */
> +	u32 ett         :  4;           /* expected table type */
> +	u32 mvn         : 12;           /* MSI vector number */
> +	u32 dmaas       :  8;           /* DMA address space */
> +	u32 reserved2   :  6;
> +	u32 q           :  1;           /* event qualifier */
> +	u32 rw          :  1;           /* read/write */
> +	u64 faddr;                      /* failing address */
> +	u32 reserved3;
> +	u16 reserved4;
> +	u16 pec;                        /* PCI event code */
> +} __packed;
> +
> +#define ZPCI_ERR_PENDING_MAX 4
> +struct zpci_ccdf_pending {
> +	bool mediated_recovery;
> +	u8 count;
> +	u8 head;
> +	u8 tail;
> +	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
> +};
> +
>  /* Private data per function */
>  struct zpci_dev {
>  	struct zpci_bus *zbus;
> @@ -192,6 +218,8 @@ struct zpci_dev {
>  	struct iommu_domain *s390_domain; /* attached IOMMU domain */
>  	struct kvm_zdev *kzdev;
>  	struct mutex kzdev_lock;
> +	struct zpci_ccdf_pending pending_errs;
> +	struct mutex pending_errs_lock;
>  	spinlock_t dom_lock;		/* protect s390_domain change */
>  };
>  
> @@ -334,6 +362,8 @@ void zpci_debug_exit_device(struct zpci_dev *);
>  int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
>  int zpci_clear_error_state(struct zpci_dev *zdev);
>  int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
> +void zpci_start_mediated_recovery(struct zpci_dev *zdev);
> +void zpci_stop_mediated_recovery(struct zpci_dev *zdev);
>  
>  #ifdef CONFIG_NUMA
>  
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 39bd2adfc240..2d377c2e194d 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -842,6 +842,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
>  	mutex_init(&zdev->state_lock);
>  	mutex_init(&zdev->fmb_lock);
>  	mutex_init(&zdev->kzdev_lock);
> +	mutex_init(&zdev->pending_errs_lock);
>  
>  	return zdev;
>  
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index 839bd91c056e..41547c54076f 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -17,23 +17,6 @@
>  #include "pci_bus.h"
>  #include "pci_report.h"
>  
> -/* Content Code Description for PCI Function Error */
> -struct zpci_ccdf_err {
> -	u32 reserved1;
> -	u32 fh;				/* function handle */
> -	u32 fid;			/* function id */
> -	u32 ett		:  4;		/* expected table type */
> -	u32 mvn		: 12;		/* MSI vector number */
> -	u32 dmaas	:  8;		/* DMA address space */
> -	u32		:  6;
> -	u32 q		:  1;		/* event qualifier */
> -	u32 rw		:  1;		/* read/write */
> -	u64 faddr;			/* failing address */
> -	u32 reserved3;
> -	u16 reserved4;
> -	u16 pec;			/* PCI event code */
> -} __packed;
> -
>  /* Content Code Description for PCI Function Availability */
>  struct zpci_ccdf_avail {
>  	u32 reserved1;
> @@ -60,18 +43,6 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
>  	}
>  }
>  
> -static bool is_passed_through(struct pci_dev *pdev)
> -{
> -	struct zpci_dev *zdev = to_zpci(pdev);
> -	bool ret;
> -
> -	mutex_lock(&zdev->kzdev_lock);
> -	ret = !!zdev->kzdev;
> -	mutex_unlock(&zdev->kzdev_lock);
> -
> -	return ret;
> -}
> -
>  static bool is_driver_supported(struct pci_driver *driver)
>  {
>  	if (!driver || !driver->err_handler)
> @@ -81,6 +52,46 @@ static bool is_driver_supported(struct pci_driver *driver)
>  	return true;
>  }
>  
> +static void zpci_store_pci_error(struct pci_dev *pdev,
> +				 struct zpci_ccdf_err *ccdf)
> +{
> +	struct zpci_dev *zdev = to_zpci(pdev);
> +	int i;
> +
> +	guard(mutex)(&zdev->pending_errs_lock);
> +	if (!zdev->pending_errs.mediated_recovery)
> +		return;
> +
> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
> +		pr_err("%s: Maximum number (%d) of pending error events queued",

Missing "\n"

> +		       pci_name(pdev), ZPCI_ERR_PENDING_MAX);
> +		return;
> +	}
> +
> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
> +	zdev->pending_errs.tail++;
> +	zdev->pending_errs.count++;
> +}
> +
> +void zpci_start_mediated_recovery(struct zpci_dev *zdev)
> +{
> +	guard(mutex)(&zdev->pending_errs_lock);
> +	zdev->pending_errs.mediated_recovery = true;
> +}
> +EXPORT_SYMBOL_GPL(zpci_start_mediated_recovery);
> +
> +void zpci_stop_mediated_recovery(struct zpci_dev *zdev)
> +{
> +	guard(mutex)(&zdev->pending_errs_lock);
> +	zdev->pending_errs.mediated_recovery = false;
> +	if (zdev->pending_errs.count)
> +		pr_info("Unhandled PCI error events count=%d for PCI function 0x%x",

Also missing "\n"

> +			zdev->pending_errs.count, zdev->fid);
> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
> +}
> +EXPORT_SYMBOL_GPL(zpci_stop_mediated_recovery);
> +
>  static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
>  							 struct pci_driver *driver)
>  {
> @@ -175,7 +186,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
>   * and the platform determines which functions are affected for
>   * multi-function devices.
>   */
> -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
> +							  struct zpci_ccdf_err *ccdf)
>  {
>  	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
>  	struct zpci_dev *zdev = to_zpci(pdev);
> @@ -194,13 +206,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  	}
>  	pdev->error_state = pci_channel_io_frozen;
>  
> -	if (is_passed_through(pdev)) {
> -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
> -			pci_name(pdev));
> -		status_str = "failed (pass-through)";
> -		goto out_unlock;
> -	}
> -
>  	driver = to_pci_driver(pdev->dev.driver);
>  	if (!is_driver_supported(driver)) {
>  		if (!driver) {
> @@ -216,12 +221,25 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  		goto out_unlock;
>  	}
>  
> +	zpci_store_pci_error(pdev, ccdf);

Sashiko notes that zdev->pendings_errs.mediated_recovery could become
true between the above zpci_store_pci_error() and the below check for
leaving recovery to user-space. I think we could make a general
improvement that also tackles this concern. The ideas is that we could
have zpci_store_pci_error() return true if it did store the error and
we are in mediated recovery mode. Then we use that as the signal to
skip host recovery below. That way we also don't need to retake the
pending_errs_lock which makes the below much simpler and it would be a
win independent of the race. As for the race this would make sure that
we either do the host recovery or store the error and let user-space
recover.

>  	ers_res = zpci_event_notify_error_detected(pdev, driver);
>  	if (ers_result_indicates_abort(ers_res)) {
>  		status_str = "failed (abort on detection)";
>  		goto out_unlock;
>  	}
>  
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.mediated_recovery) {
> +		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
> +			pci_name(pdev));
> +		ers_res = PCI_ERS_RESULT_RECOVERED;
> +		status_str = "in progress";
> +		mutex_unlock(&zdev->pending_errs_lock);
> +		goto out_unlock;
> +	}
> +	mutex_unlock(&zdev->pending_errs_lock);

Other than the missing "\n" and suggestion above this looks good to me.
So I'd hope to R-b the next version.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices
  2026-05-06  9:38   ` Niklas Schnelle
@ 2026-05-06 17:20     ` Farhan Ali
  2026-05-08 19:58       ` Niklas Schnelle
  0 siblings, 1 reply; 17+ messages in thread
From: Farhan Ali @ 2026-05-06 17:20 UTC (permalink / raw)
  To: Niklas Schnelle, linux-s390, linux-kernel, linux-pci
  Cc: helgaas, alex, mjrosato


On 5/6/2026 2:38 AM, Niklas Schnelle wrote:
>> -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>> +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
>> +							  struct zpci_ccdf_err *ccdf)
>>   {
>>   	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
>>   	struct zpci_dev *zdev = to_zpci(pdev);
>> @@ -194,13 +206,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>   	}
>>   	pdev->error_state = pci_channel_io_frozen;
>>   
>> -	if (is_passed_through(pdev)) {
>> -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
>> -			pci_name(pdev));
>> -		status_str = "failed (pass-through)";
>> -		goto out_unlock;
>> -	}
>> -
>>   	driver = to_pci_driver(pdev->dev.driver);
>>   	if (!is_driver_supported(driver)) {
>>   		if (!driver) {
>> @@ -216,12 +221,25 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>   		goto out_unlock;
>>   	}
>>   
>> +	zpci_store_pci_error(pdev, ccdf);
> Sashiko notes that zdev->pendings_errs.mediated_recovery could become
> true between the above zpci_store_pci_error() and the below check for
> leaving recovery to user-space. I think we could make a general
> improvement that also tackles this concern. The ideas is that we could
> have zpci_store_pci_error() return true if it did store the error and
> we are in mediated recovery mode. Then we use that as the signal to
> skip host recovery below. That way we also don't need to retake the
> pending_errs_lock which makes the below much simpler and it would be a
> win independent of the race. As for the race this would make sure that
> we either do the host recovery or store the error and let user-space
> recover.

I did think of the concern about mediated_recovery becoming true after 
zpci_store_pci_error(), but IIUC in that case we won't even be able to 
deliver the error signal to userspace (via error_detected()). And I 
don't think mediated_recovery flag can be set to true. Since we are 
holding the pci device lock, vfio_pci_core_enable() will fail as it will 
fail trying to reset the device.

Thanks

Farhan

>>   	ers_res = zpci_event_notify_error_detected(pdev, driver);
>>   	if (ers_result_indicates_abort(ers_res)) {
>>   		status_str = "failed (abort on detection)";
>>   		goto out_unlock;
>>   	}
>>   
>> +	mutex_lock(&zdev->pending_errs_lock);
>> +	if (zdev->pending_errs.mediated_recovery) {
>> +		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
>> +			pci_name(pdev));
>> +		ers_res = PCI_ERS_RESULT_RECOVERED;
>> +		status_str = "in progress";
>> +		mutex_unlock(&zdev->pending_errs_lock);
>> +		goto out_unlock;
>> +	}
>> +	mutex_unlock(&zdev->pending_errs_lock);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices
  2026-05-06 17:20     ` Farhan Ali
@ 2026-05-08 19:58       ` Niklas Schnelle
  0 siblings, 0 replies; 17+ messages in thread
From: Niklas Schnelle @ 2026-05-08 19:58 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, linux-kernel, linux-pci; +Cc: helgaas, alex, mjrosato

On Wed, 2026-05-06 at 10:20 -0700, Farhan Ali wrote:
> On 5/6/2026 2:38 AM, Niklas Schnelle wrote:
> > > -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> > > +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
> > > +							  struct zpci_ccdf_err *ccdf)
> > >   {
> > >   	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
> > >   	struct zpci_dev *zdev = to_zpci(pdev);
> > > @@ -194,13 +206,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> > >   	}
> > >   	pdev->error_state = pci_channel_io_frozen;
> > >   
> > > -	if (is_passed_through(pdev)) {
> > > -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
> > > -			pci_name(pdev));
> > > -		status_str = "failed (pass-through)";
> > > -		goto out_unlock;
> > > -	}
> > > -
> > >   	driver = to_pci_driver(pdev->dev.driver);
> > >   	if (!is_driver_supported(driver)) {
> > >   		if (!driver) {
> > > @@ -216,12 +221,25 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> > >   		goto out_unlock;
> > >   	}
> > >   
> > > +	zpci_store_pci_error(pdev, ccdf);
> > Sashiko notes that zdev->pendings_errs.mediated_recovery could become
> > true between the above zpci_store_pci_error() and the below check for
> > leaving recovery to user-space. I think we could make a general
> > improvement that also tackles this concern. The ideas is that we could
> > have zpci_store_pci_error() return true if it did store the error and
> > we are in mediated recovery mode. Then we use that as the signal to
> > skip host recovery below. That way we also don't need to retake the
> > pending_errs_lock which makes the below much simpler and it would be a
> > win independent of the race. As for the race this would make sure that
> > we either do the host recovery or store the error and let user-space
> > recover.
> 
> I did think of the concern about mediated_recovery becoming true after 
> zpci_store_pci_error(), but IIUC in that case we won't even be able to 
> deliver the error signal to userspace (via error_detected()). And I 
> don't think mediated_recovery flag can be set to true. Since we are 
> holding the pci device lock, vfio_pci_core_enable() will fail as it will 
> fail trying to reset the device.
> 
> Thanks
> 
> Farhan

We had a detailed internal discussion on this and I think Farhan is
right that mediated_recovery becoming true between zpci_store_error()
and the below mediated_recovery isn't a concern.

On ther other hand one thing we do think could happen is that
zpci_store_pci_error() finds mediated_recovery true and stores the
error. Then a vfio_pci_core_disable() and thus
zpci_stop_mediated_recovery() happens and by the time we check below,
mediated_recovery has become false and so we store the error
information but then still proceed with recovery in the host.

I believe this is related to what Alex raised in the past about a stale
mediated_recovery. Now my thinking however is that we actually do want
to use the same "stale" mediated_recovery value below, as we did in
zpci_store_pci_error(). This is because this makes this case behave the
same as the inherently possible case that we store an error event and
then the guest just doesn't do recovery before the device is hot
unplugged, the guest shuts down, crashes etc and the event is left
stored until zpci_stop_mediated_recovery() and then we get out of the
error state via the next reset. We also investigated whether it would
make sense to handle any left over error events on the call to
zpci_stop_mediated_recovery() but for now that would get rather complex
and we should get a reset before the device is used again. Also while
the device isn't in use it doesn't hurt for it to be in the error
state.

Overall, I think this also makes the reasoning simpler in that any
error event which arrives between zpci_start_mediated_recovery() and
zpci_stop_mediated_recovery() is either recovered by the guest or left
for cleanup after zpci_stop_mediated_recovery(). So I'd like to still
go with the proposal to use the mediated_recovery state as seen by
zpci_store_pci_error().

@Alex if you object to this or see an argument we're missing we can of
course I'm fine with going with the approach this patch took.

Thanks,
Niklas

> 
> > >   	ers_res = zpci_event_notify_error_detected(pdev, driver);
> > >   	if (ers_result_indicates_abort(ers_res)) {
> > >   		status_str = "failed (abort on detection)";
> > >   		goto out_unlock;
> > >   	}
> > >   
> > > +	mutex_lock(&zdev->pending_errs_lock);
> > > +	if (zdev->pending_errs.mediated_recovery) {
> > > +		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
> > > +			pci_name(pdev));
> > > +		ers_res = PCI_ERS_RESULT_RECOVERED;
> > > +		status_str = "in progress";
> > > +		mutex_unlock(&zdev->pending_errs_lock);
> > > +		goto out_unlock;
> > > +	}
> > > +	mutex_unlock(&zdev->pending_errs_lock);

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-05-08 19:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-05 20:05 [PATCH v15 0/7] Error recovery for vfio-pci devices on s390x Farhan Ali
2026-05-05 20:05 ` [PATCH v15 1/7] PCI: Allow per function PCI slots to fix slot reset on s390 Farhan Ali
2026-05-05 21:29   ` sashiko-bot
2026-05-05 20:05 ` [PATCH v15 2/7] PCI: Avoid saving config space state if inaccessible Farhan Ali
2026-05-05 21:58   ` sashiko-bot
2026-05-05 20:05 ` [PATCH v15 3/7] PCI: Fail FLR when config space is inaccessible Farhan Ali
2026-05-05 22:20   ` sashiko-bot
2026-05-05 20:05 ` [PATCH v15 4/7] s390/pci: Store PCI error information for passthrough devices Farhan Ali
2026-05-05 22:56   ` sashiko-bot
2026-05-06  9:38   ` Niklas Schnelle
2026-05-06 17:20     ` Farhan Ali
2026-05-08 19:58       ` Niklas Schnelle
2026-05-05 20:05 ` [PATCH v15 5/7] vfio-pci/zdev: Add a device feature for error information Farhan Ali
2026-05-05 23:27   ` sashiko-bot
2026-05-05 20:05 ` [PATCH v15 6/7] vfio/pci: Add a reset_done callback for vfio-pci driver Farhan Ali
2026-05-05 23:56   ` sashiko-bot
2026-05-05 20:05 ` [PATCH v15 7/7] vfio/pci: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox