[PATCH v11 0/9] Error recovery for vfio-pci devices on s390x

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x
@ 2026-03-16 19:15 Farhan Ali
  2026-03-16 19:15 ` [PATCH v11 1/9] PCI: Allow per function PCI slots Farhan Ali
                   ` (9 more replies)
  0 siblings, 10 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

Hi,

This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x). 

Background
----------
For PCI devices on s390x an operating system receives platform specific
error events from firmware rather than through AER.Today for
passthrough/userspace devices, we don't attempt any error recovery and
ignore any error events for the devices. The passthrough/userspace devices
are managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected), and on an error trigger an eventfd to
userspace.  But we need a mechanism to notify userspace
(QEMU/guest/userspace drivers) about the error event. 

Proposal
--------
We can expose this error information (currently only the PCI Error Code)
via a device feature. Userspace can then obtain the error information 
via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving 
a device reset.

This is how a typical flow for passthrough devices to a VM would work:
For passthrough devices to a VM, the driver bound to the device on the host 
is vfio-pci. vfio-pci driver does support the error_detected() callback 
(vfio_pci_core_aer_err_detected()), and on an PCI error s390x recovery 
code on the host will call the vfio-pci error_detected() callback. The 
vfio-pci error_detected() callback will notify userspace/QEMU via an 
eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x 
error recovery on the host will skip any further action(see patch 6) and 
let userspace drive the error recovery.

Once userspace/QEMU is notified, it then injects this error into the VM 
so device drivers in the VM can take recovery actions. For example for a 
passthrough NVMe device, the VM's OS NVMe driver will access the device. 
At this point the VM's NVMe driver's error_detected() will drive the 
recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error 
recovery in the VM's OS will try to do a reset. Resets are privileged 
operations and so the VM will need intervention from QEMU to perform the 
reset. QEMU will invoke the VFIO_DEVICE_RESET ioctl to now notify the 
host that the VM is requesting a reset of the device. The vfio-pci driver 
on the host will then perform the reset on the device to recover it.


Thanks
Farhan

ChangeLog
---------
v10 series https://lore.kernel.org/all/20260302203325.3826-1-alifm@linux.ibm.com/
v10 -> v11
    - Rebase on pci/next to handle merge conflicts with patch 1.

    - Typo fixup in commit message (patch 4) and use guard() for mutex
    (patch 6).

v9 series https://lore.kernel.org/all/20260217182257.1582-1-alifm@linux.ibm.com/
v9 -> v10
   - Change pci_slot number to u16 (patch 1).

   - Avoid saving invalid config space state if config space is
   inaccessible in the device reset path. It uses the same patch as in v8
   with R-b from Niklas.

   - Rebase on 7.0.0-rc2


v8 series https://lore.kernel.org/all/20260122194437.1903-1-alifm@linux.ibm.com/
v8 -> v9
   - Avoid saving PCI config space state in reset path (patch 3) (suggested by Bjorn)
   
   - Add explicit version to struct vfio_device_feature_zpci_err (patch 7).

   - Rebase on 6.19


v7 series https://lore.kernel.org/all/20260107183217.1365-1-alifm@linux.ibm.com/
v7 -> v8
   - Rebase on 6.19-rc4

   - Address feedback from Niklas and Julien.


v6 series https://lore.kernel.org/all/2c609e61-1861-4bf3-b019-a11c137d26a5@linux.ibm.com/
v6 -> v7
    - Rebase on 6.19-rc4

    - Update commit message based on Niklas's suggestion (patch 3).

v5 series https://lore.kernel.org/all/20251113183502.2388-1-alifm@linux.ibm.com/
v5 -> v6
   - Rebase on 6.18 + Lukas's PCI: Universal error recoverability of
   devices series (https://lore.kernel.org/all/cover.1763483367.git.lukas@wunner.de/)

   - Re-work config space accessibility check to pci_dev_save_and_disable() (patch 3).
   This avoids saving the config space, in the reset path, if the device's config space is
   corrupted or inaccessible.

v4 series https://lore.kernel.org/all/20250924171628.826-1-alifm@linux.ibm.com/
v4 -> v5
    - Rebase on 6.18-rc5

    - Move bug fixes to the beginning of the series (patch 1 and 2). These patches
    were posted as a separate fixes series 
https://lore.kernel.org/all/a14936ac-47d6-461b-816f-0fd66f869b0f@linux.ibm.com/

    - Add matching pci_put_dev() for pci_get_slot() (patch 6).

v3 series https://lore.kernel.org/all/20250911183307.1910-1-alifm@linux.ibm.com/
v3 -> v4
    - Remove warn messages for each PCI capability not restored (patch 1)

    - Check PCI_COMMAND and PCI_STATUS register for error value instead of device id 
    (patch 1)

    - Fix kernel crash in patch 3

    - Added reviewed by tags

    - Address comments from Niklas's (patches 4, 5, 7)

    - Fix compilation error non s390x system (patch 8)

    - Explicitly align struct vfio_device_feature_zpci_err (patch 8)


v2 series https://lore.kernel.org/all/20250825171226.1602-1-alifm@linux.ibm.com/
v2 -> v3
   - Patch 1 avoids saving any config space state if the device is in error
   (suggested by Alex)

   - Patch 2 adds additional check only for FLR reset to try other function 
     reset method (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions. Creates a new flag pci_slot to allow per function slot.

   - Patch 4 fixes a bug in s390 for resource to bus address translation.

   - Rebase on 6.17-rc5


v1 series https://lore.kernel.org/all/20250813170821.1115-1-alifm@linux.ibm.com/
v1 - > v2
   - Patches 1 and 2 adds some additional checks for FLR/PM reset to 
     try other function reset method (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions.

   - Patch 7 adds a new device feature for zPCI devices for the VFIO_DEVICE_FEATURE 
     ioctl. The ioctl is used by userspace to retriece any PCI error
     information for the device (suggested by Alex).

   - Patch 8 adds a reset_done() callback for the vfio-pci driver, to
     restore the state of the device after a reset.

   - Patch 9 removes the pcie check for triggering VFIO_PCI_ERR_IRQ_INDEX.

Farhan Ali (9):
  PCI: Allow per function PCI slots
  s390/pci: Add architecture specific resource/bus address translation
  PCI: Avoid saving config space state if inaccessible
  PCI: Add additional checks for flr reset
  s390/pci: Update the logic for detecting passthrough device
  s390/pci: Store PCI error information for passthrough devices
  vfio-pci/zdev: Add a device feature for error information
  vfio: Add a reset_done callback for vfio-pci driver
  vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX

 arch/s390/include/asm/pci.h       |  29 ++++++++
 arch/s390/pci/pci.c               |  75 +++++++++++++++++++++
 arch/s390/pci/pci_event.c         | 106 +++++++++++++++++-------------
 drivers/pci/host-bridge.c         |   8 +--
 drivers/pci/pci.c                 |  26 +++++++-
 drivers/pci/slot.c                |  31 +++++++--
 drivers/vfio/pci/vfio_pci_core.c  |  22 +++++--
 drivers/vfio/pci/vfio_pci_intrs.c |   3 +-
 drivers/vfio/pci/vfio_pci_priv.h  |   9 +++
 drivers/vfio/pci/vfio_pci_zdev.c  |  47 ++++++++++++-
 include/linux/pci.h               |   5 +-
 include/uapi/linux/vfio.h         |  17 +++++
 12 files changed, 307 insertions(+), 71 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 1/9] PCI: Allow per function PCI slots
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-24 21:55   ` Bjorn Helgaas
  2026-03-16 19:15 ` [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

On s390 systems, which use a machine level hypervisor, PCI devices are
always accessed through a form of PCI pass-through which fundamentally
operates on a per PCI function granularity. This is also reflected in the
s390 PCI hotplug driver which creates hotplug slots for individual PCI
functions. Its reset_slot() function, which is a wrapper for
zpci_hot_reset_device(), thus also resets individual functions.

Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
to multifunction devices. This approach worked fine on s390 systems that
only exposed virtual functions as individual PCI domains to the operating
system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
s390 supports exposing the topology of multifunction PCI devices by
grouping them in a shared PCI domain. When attempting to reset a function
through the hotplug driver, the shared slot assignment causes the wrong
function to be reset instead of the intended one. It also leaks memory as
we do create a pci_slot object for the function, but don't correctly free
it in pci_slot_release().

Add a flag for struct pci_slot to allow per function PCI slots for
functions managed through a hypervisor, which exposes individual PCI
functions while retaining the topology. Since we can use all 8 bits
for slot 'number' (for ARI devices), change slot 'number' u16 to
account for special values -1 and PCI_SLOT_ALL_DEVICES.

Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
Cc: stable@vger.kernel.org
Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c   |  5 +++--
 drivers/pci/slot.c  | 31 ++++++++++++++++++++++++-------
 include/linux/pci.h |  5 +++--
 3 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 115e5c11bab3..a93084053537 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4867,8 +4867,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
 
 static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
 {
-	if (dev->multifunction || dev->subordinate || !dev->slot ||
-	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
+	if (dev->subordinate || !dev->slot ||
+	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
+	    (dev->multifunction && !dev->slot->per_func_slot))
 		return -ENOTTY;
 
 	return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index e0b7fb43423c..8842fa069392 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -37,7 +37,7 @@ static const struct sysfs_ops pci_slot_sysfs_ops = {
 
 static ssize_t address_read_file(struct pci_slot *slot, char *buf)
 {
-	if (slot->number == 0xff)
+	if (slot->number == (u16)-1)
 		return sysfs_emit(buf, "%04x:%02x\n",
 				  pci_domain_nr(slot->bus),
 				  slot->bus->number);
@@ -72,6 +72,23 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
 	return bus_speed_read(slot->bus->cur_bus_speed, buf);
 }
 
+static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
+{
+	if (slot->per_func_slot)
+		return dev->devfn == slot->number;
+
+	return slot->number == PCI_SLOT_ALL_DEVICES ||
+		PCI_SLOT(dev->devfn) == slot->number;
+}
+
+static bool pci_slot_enabled_per_func(void)
+{
+	if (IS_ENABLED(CONFIG_S390))
+		return true;
+
+	return false;
+}
+
 static void pci_slot_release(struct kobject *kobj)
 {
 	struct pci_dev *dev;
@@ -82,8 +99,7 @@ static void pci_slot_release(struct kobject *kobj)
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &slot->bus->devices, bus_list)
-		if (slot->number == PCI_SLOT_ALL_DEVICES ||
-		    PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = NULL;
 	up_read(&pci_bus_sem);
 
@@ -176,8 +192,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
 
 	mutex_lock(&pci_slot_mutex);
 	list_for_each_entry(slot, &dev->bus->slots, list)
-		if (slot->number == PCI_SLOT_ALL_DEVICES ||
-		    PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	mutex_unlock(&pci_slot_mutex);
 }
@@ -287,6 +302,9 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 	slot->bus = pci_bus_get(parent);
 	slot->number = slot_nr;
 
+	if (pci_slot_enabled_per_func())
+		slot->per_func_slot = 1;
+
 	slot->kobj.kset = pci_slots_kset;
 
 	slot_name = make_slot_name(name);
@@ -307,8 +325,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &parent->devices, bus_list)
-		if (slot_nr == PCI_SLOT_ALL_DEVICES ||
-		    PCI_SLOT(dev->devfn) == slot_nr)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	up_read(&pci_bus_sem);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8861eeb4381d..50a84bba3c91 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -78,14 +78,15 @@
  * and, if ARI Forwarding is enabled, functions may appear to be on multiple
  * devices.
  */
-#define PCI_SLOT_ALL_DEVICES	0xfe
+#define PCI_SLOT_ALL_DEVICES	0xfeff
 
 /* pci_slot represents a physical slot */
 struct pci_slot {
 	struct pci_bus		*bus;		/* Bus this slot is on */
 	struct list_head	list;		/* Node in list of slots */
 	struct hotplug_slot	*hotplug;	/* Hotplug info (move here) */
-	unsigned char		number;		/* Device nr, or PCI_SLOT_ALL_DEVICES */
+	u16			number;		/* Device nr, or PCI_SLOT_ALL_DEVICES */
+	unsigned int		per_func_slot:1; /* Allow per function slot */
 	struct kobject		kobj;
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 1/9] PCI: Allow per function PCI slots
  2026-03-16 19:15 ` [PATCH v11 1/9] PCI: Allow per function PCI slots Farhan Ali
@ 2026-03-24 21:55   ` Bjorn Helgaas
  2026-03-24 23:08     ` Farhan Ali
  0 siblings, 1 reply; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 21:55 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato

On Mon, Mar 16, 2026 at 12:15:36PM -0700, Farhan Ali wrote:
> On s390 systems, which use a machine level hypervisor, PCI devices are
> always accessed through a form of PCI pass-through which fundamentally
> operates on a per PCI function granularity. This is also reflected in the
> s390 PCI hotplug driver which creates hotplug slots for individual PCI
> functions. Its reset_slot() function, which is a wrapper for
> zpci_hot_reset_device(), thus also resets individual functions.

I think this "pass-through" is from the hypervisor to Linux, i.e.,
what we think of as the host kernel, right?

> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
> to multifunction devices. This approach worked fine on s390 systems that
> only exposed virtual functions as individual PCI domains to the operating
> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> s390 supports exposing the topology of multifunction PCI devices by
> grouping them in a shared PCI domain. When attempting to reset a function
> through the hotplug driver, the shared slot assignment causes the wrong
> function to be reset instead of the intended one. It also leaks memory as
> we do create a pci_slot object for the function, but don't correctly free
> it in pci_slot_release().

This alludes to the patch fixing a reset issue, but I think it should
be more prominent, e.g., the reset and leak fixes could be a separate
paragraph.  The subject line should also mention at least the reset
fix.

> Add a flag for struct pci_slot to allow per function PCI slots for
> functions managed through a hypervisor, which exposes individual PCI
> functions while retaining the topology. Since we can use all 8 bits
> for slot 'number' (for ARI devices), change slot 'number' u16 to
> account for special values -1 and PCI_SLOT_ALL_DEVICES.
> ...

>  static ssize_t address_read_file(struct pci_slot *slot, char *buf)
>  {
> -	if (slot->number == 0xff)
> +	if (slot->number == (u16)-1)

This "-1" is mentioned in the commit log, but I don't know where it
came from.  I guess we must assign -1 as a default somewhere?  Could
this be a #define to connect that assignment with this test?

>  		return sysfs_emit(buf, "%04x:%02x\n",
>  				  pci_domain_nr(slot->bus),
>  				  slot->bus->number);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 1/9] PCI: Allow per function PCI slots
  2026-03-24 21:55   ` Bjorn Helgaas
@ 2026-03-24 23:08     ` Farhan Ali
  2026-03-24 23:20       ` Bjorn Helgaas
  0 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-24 23:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato


On 3/24/2026 2:55 PM, Bjorn Helgaas wrote:
> On Mon, Mar 16, 2026 at 12:15:36PM -0700, Farhan Ali wrote:
>> On s390 systems, which use a machine level hypervisor, PCI devices are
>> always accessed through a form of PCI pass-through which fundamentally
>> operates on a per PCI function granularity. This is also reflected in the
>> s390 PCI hotplug driver which creates hotplug slots for individual PCI
>> functions. Its reset_slot() function, which is a wrapper for
>> zpci_hot_reset_device(), thus also resets individual functions.
> I think this "pass-through" is from the hypervisor to Linux, i.e.,
> what we think of as the host kernel, right?

Yes, on s390x we have PR/SM hypervisor which would this passthrough to a 
Linux. The Linux would be running in a LPAR (Logical Partition) created 
by the PR/SM hypervisor. For end users an LPAR is the 'host' for all 
practical purposes.


>
>> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
>> to multifunction devices. This approach worked fine on s390 systems that
>> only exposed virtual functions as individual PCI domains to the operating
>> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
>> s390 supports exposing the topology of multifunction PCI devices by
>> grouping them in a shared PCI domain. When attempting to reset a function
>> through the hotplug driver, the shared slot assignment causes the wrong
>> function to be reset instead of the intended one. It also leaks memory as
>> we do create a pci_slot object for the function, but don't correctly free
>> it in pci_slot_release().
> This alludes to the patch fixing a reset issue, but I think it should
> be more prominent, e.g., the reset and leak fixes could be a separate
> paragraph.  The subject line should also mention at least the reset
> fix.

Will fix this. I will move what we fix into a separate paragraph.


>
>> Add a flag for struct pci_slot to allow per function PCI slots for
>> functions managed through a hypervisor, which exposes individual PCI
>> functions while retaining the topology. Since we can use all 8 bits
>> for slot 'number' (for ARI devices), change slot 'number' u16 to
>> account for special values -1 and PCI_SLOT_ALL_DEVICES.
>> ...
>>   static ssize_t address_read_file(struct pci_slot *slot, char *buf)
>>   {
>> -	if (slot->number == 0xff)
>> +	if (slot->number == (u16)-1)
> This "-1" is mentioned in the commit log, but I don't know where it
> came from.  I guess we must assign -1 as a default somewhere?  Could
> this be a #define to connect that assignment with this test?

The -1 is used a placeholder and from what I could tell 
rpaphp_register_slot() would be the only one to use this. Would you 
prefer this to be a #define?

Thanks

Farhan

>>   		return sysfs_emit(buf, "%04x:%02x\n",
>>   				  pci_domain_nr(slot->bus),
>>   				  slot->bus->number);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 1/9] PCI: Allow per function PCI slots
  2026-03-24 23:08     ` Farhan Ali
@ 2026-03-24 23:20       ` Bjorn Helgaas
  0 siblings, 0 replies; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 23:20 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato

On Tue, Mar 24, 2026 at 04:08:28PM -0700, Farhan Ali wrote:
> On 3/24/2026 2:55 PM, Bjorn Helgaas wrote:
> > On Mon, Mar 16, 2026 at 12:15:36PM -0700, Farhan Ali wrote:
> > > On s390 systems, which use a machine level hypervisor, PCI devices are
> > > always accessed through a form of PCI pass-through which fundamentally
> > > operates on a per PCI function granularity. This is also reflected in the
> > > s390 PCI hotplug driver which creates hotplug slots for individual PCI
> > > functions. Its reset_slot() function, which is a wrapper for
> > > zpci_hot_reset_device(), thus also resets individual functions.
> ...

> > >   static ssize_t address_read_file(struct pci_slot *slot, char *buf)
> > >   {
> > > -	if (slot->number == 0xff)
> > > +	if (slot->number == (u16)-1)
> > This "-1" is mentioned in the commit log, but I don't know where it
> > came from.  I guess we must assign -1 as a default somewhere?  Could
> > this be a #define to connect that assignment with this test?
> 
> The -1 is used a placeholder and from what I could tell
> rpaphp_register_slot() would be the only one to use this. Would you prefer
> this to be a #define?

If that's the only place that uses it, I think a #define would help to
match that use with this one.

I hate having to put it in include/linux/pci.h, but it seems like it
would belong next to PCI_SLOT_ALL_DEVICES (which also doesn't need to
be exposed and is only there because it's next to the struct
pci_slot).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
  2026-03-16 19:15 ` [PATCH v11 1/9] PCI: Allow per function PCI slots Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-24 23:06   ` Bjorn Helgaas
  2026-03-16 19:15 ` [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible Farhan Ali
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

On s390 today we overwrite the PCI BAR resource address to either an
artificial cookie address or MIO address. However this address is different
from the bus address of the BARs programmed by firmware. The artificial
cookie address was created to index into an array of function handles
(zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
but maybe different from the bus addresses. This creates an issue when
trying to convert the BAR resource address to bus address using the generic
pcibios_resource_to_bus().

Implement an architecture specific pcibios_resource_to_bus() function to
correctly translate PCI BAR resource addresses to bus addresses for s390.
Similarly add architecture specific pcibios_bus_to_resource function to do
the reverse translation.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
 drivers/pci/host-bridge.c |  8 ++---
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 2a430722cbe4..87077e510266 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -272,6 +272,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 	return 0;
 }
 
+void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+			     struct resource *res)
+{
+	struct zpci_bus *zbus = bus->sysdata;
+	struct zpci_bar_struct *zbar;
+	struct zpci_dev *zdev;
+
+	region->start = res->start;
+	region->end = res->end;
+
+	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+		int j = 0;
+
+		zbar = NULL;
+		zdev = zbus->function[i];
+		if (!zdev)
+			continue;
+
+		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
+			if (zdev->bars[j].res->start == res->start &&
+			    zdev->bars[j].res->end == res->end &&
+			    res->flags & IORESOURCE_MEM) {
+				zbar = &zdev->bars[j];
+				break;
+			}
+		}
+
+		if (zbar) {
+			/* only MMIO is supported */
+			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
+			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				region->start |= (u64)zdev->bars[j + 1].val << 32;
+
+			region->end = region->start + (1UL << zbar->size) - 1;
+			return;
+		}
+	}
+}
+
+void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+			     struct pci_bus_region *region)
+{
+	struct zpci_bus *zbus = bus->sysdata;
+	struct zpci_dev *zdev;
+	resource_size_t start, end;
+
+	res->start = region->start;
+	res->end = region->end;
+
+	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+		zdev = zbus->function[i];
+		if (!zdev || !zdev->has_resources)
+			continue;
+
+		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
+			if (!zdev->bars[j].size)
+				continue;
+
+			/* only MMIO is supported */
+			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
+			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				start |= (u64)zdev->bars[j + 1].val << 32;
+
+			end = start + (1UL << zdev->bars[j].size) - 1;
+
+			if (start == region->start && end == region->end) {
+				res->start = zdev->bars[j].res->start;
+				res->end = zdev->bars[j].res->end;
+				return;
+			}
+		}
+	}
+}
+
 void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
 			   pgprot_t prot)
 {
diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index be5ef6516cff..aed031b8a9f3 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -49,8 +49,8 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
 }
 EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
 
-void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
-			     struct resource *res)
+void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+				    struct resource *res)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
 	struct resource_entry *window;
@@ -74,8 +74,8 @@ static bool region_contains(struct pci_bus_region *region1,
 	return region1->start <= region2->start && region1->end >= region2->end;
 }
 
-void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
-			     struct pci_bus_region *region)
+void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+				    struct pci_bus_region *region)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
 	struct resource_entry *window;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation
  2026-03-16 19:15 ` [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2026-03-24 23:06   ` Bjorn Helgaas
  2026-03-24 23:47     ` Farhan Ali
  2026-03-25 11:58     ` Ilpo Järvinen
  0 siblings, 2 replies; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 23:06 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato, Ilpo Järvinen

[+cc Ilpo just for awareness; I assume there's nothing Linux can
actually *do* with s390 PCI resources?]

On Mon, Mar 16, 2026 at 12:15:37PM -0700, Farhan Ali wrote:
> On s390 today we overwrite the PCI BAR resource address to either an
> artificial cookie address or MIO address. However this address is different
> from the bus address of the BARs programmed by firmware. The artificial
> cookie address was created to index into an array of function handles
> (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> but maybe different from the bus addresses. This creates an issue when
> trying to convert the BAR resource address to bus address using the generic
> pcibios_resource_to_bus().
> 
> Implement an architecture specific pcibios_resource_to_bus() function to
> correctly translate PCI BAR resource addresses to bus addresses for s390.
> Similarly add architecture specific pcibios_bus_to_resource function to do
> the reverse translation.

1) It's not clear to me *why* we need these arch-specific versions.
We went to a lot of trouble to make these interfaces generic, and I'll
be really sad if they have to be arch-specific again.  I don't see any
direct uses of these in the series.  In any case, some reference to
the user and the actual problem this solves would help.

2) I'm kind of concerned that the "unusual" s390 PCI resources will be
unintelligible to people who are used to reading lspci or dmesg logs
from non-s390 systems, and they might confuse the PCI core resource
assignment code.  I guess there's not really a concept of a PCI host
bridge on s390, there's no bus hierarchy, and no visible PCI-to-PCI
bridges, so no windows?  Can you use PCIe switches at all?

3) Maybe this patch should be reordered to be closer to the patch that
needs these?  I don't think it's related to the "PCI: Avoid saving
config space state if inaccessible" and PCI: Add additional checks for
flr reset" patches.

> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
>  drivers/pci/host-bridge.c |  8 ++---
>  2 files changed, 78 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 2a430722cbe4..87077e510266 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -272,6 +272,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
>  	return 0;
>  }
>  
> +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> +			     struct resource *res)
> +{
> +	struct zpci_bus *zbus = bus->sysdata;
> +	struct zpci_bar_struct *zbar;
> +	struct zpci_dev *zdev;
> +
> +	region->start = res->start;
> +	region->end = res->end;
> +
> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> +		int j = 0;
> +
> +		zbar = NULL;
> +		zdev = zbus->function[i];
> +		if (!zdev)
> +			continue;
> +
> +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
> +			if (zdev->bars[j].res->start == res->start &&
> +			    zdev->bars[j].res->end == res->end &&
> +			    res->flags & IORESOURCE_MEM) {
> +				zbar = &zdev->bars[j];
> +				break;
> +			}
> +		}
> +
> +		if (zbar) {
> +			/* only MMIO is supported */
> +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
> +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				region->start |= (u64)zdev->bars[j + 1].val << 32;
> +
> +			region->end = region->start + (1UL << zbar->size) - 1;
> +			return;
> +		}
> +	}
> +}
> +
> +void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> +			     struct pci_bus_region *region)
> +{
> +	struct zpci_bus *zbus = bus->sysdata;
> +	struct zpci_dev *zdev;
> +	resource_size_t start, end;
> +
> +	res->start = region->start;
> +	res->end = region->end;
> +
> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> +		zdev = zbus->function[i];
> +		if (!zdev || !zdev->has_resources)
> +			continue;
> +
> +		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
> +			if (!zdev->bars[j].size)
> +				continue;
> +
> +			/* only MMIO is supported */
> +			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
> +			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				start |= (u64)zdev->bars[j + 1].val << 32;
> +
> +			end = start + (1UL << zdev->bars[j].size) - 1;
> +
> +			if (start == region->start && end == region->end) {
> +				res->start = zdev->bars[j].res->start;
> +				res->end = zdev->bars[j].res->end;
> +				return;
> +			}
> +		}
> +	}
> +}
> +
>  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
>  			   pgprot_t prot)
>  {
> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
> index be5ef6516cff..aed031b8a9f3 100644
> --- a/drivers/pci/host-bridge.c
> +++ b/drivers/pci/host-bridge.c
> @@ -49,8 +49,8 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
>  }
>  EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
>  
> -void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> -			     struct resource *res)
> +void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> +				    struct resource *res)
>  {
>  	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
>  	struct resource_entry *window;
> @@ -74,8 +74,8 @@ static bool region_contains(struct pci_bus_region *region1,
>  	return region1->start <= region2->start && region1->end >= region2->end;
>  }
>  
> -void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> -			     struct pci_bus_region *region)
> +void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> +				    struct pci_bus_region *region)
>  {
>  	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
>  	struct resource_entry *window;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation
  2026-03-24 23:06   ` Bjorn Helgaas
@ 2026-03-24 23:47     ` Farhan Ali
  2026-03-25 11:58     ` Ilpo Järvinen
  1 sibling, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-24 23:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato, Ilpo Järvinen


On 3/24/2026 4:06 PM, Bjorn Helgaas wrote:
> [+cc Ilpo just for awareness; I assume there's nothing Linux can
> actually *do* with s390 PCI resources?]
>
> On Mon, Mar 16, 2026 at 12:15:37PM -0700, Farhan Ali wrote:
>> On s390 today we overwrite the PCI BAR resource address to either an
>> artificial cookie address or MIO address. However this address is different
>> from the bus address of the BARs programmed by firmware. The artificial
>> cookie address was created to index into an array of function handles
>> (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
>> but maybe different from the bus addresses. This creates an issue when
>> trying to convert the BAR resource address to bus address using the generic
>> pcibios_resource_to_bus().
>>
>> Implement an architecture specific pcibios_resource_to_bus() function to
>> correctly translate PCI BAR resource addresses to bus addresses for s390.
>> Similarly add architecture specific pcibios_bus_to_resource function to do
>> the reverse translation.
> 1) It's not clear to me *why* we need these arch-specific versions.
> We went to a lot of trouble to make these interfaces generic, and I'll
> be really sad if they have to be arch-specific again.  I don't see any
> direct uses of these in the series.  In any case, some reference to
> the user and the actual problem this solves would help.

I came across this issue when in one of the previous version of this 
series I was using pci_restore_bars() which calls pci_resource_to_bus() 
to get the bus address. Given how we were updating the resource address 
in s390, we were programming the BARs with invalid addresses. I also 
think a patch from Ilpo, which was later reverted, also exposed this 
issue [1]

[1] 
https://lore.kernel.org/all/62669f67-d53e-2b56-af8c-e02cdff480a8@linux.intel.com/ 




>
> 2) I'm kind of concerned that the "unusual" s390 PCI resources will be
> unintelligible to people who are used to reading lspci or dmesg logs
> from non-s390 systems, and they might confuse the PCI core resource
> assignment code.  I guess there's not really a concept of a PCI host
> bridge on s390, there's no bus hierarchy, and no visible PCI-to-PCI
> bridges, so no windows?  Can you use PCIe switches at all?

On the host kernel we don't expose any of the information such as bus 
hierarchy, topology etc. This is all handled by firmware including the 
use PCIe switches. Host kernel interfaces with firmware through a set of 
instructions. As somewhat alluded to in patch 1, firmware exposes 
individual PCI functions to the host. So the host kernel has no idea 
about actual topology.


>
> 3) Maybe this patch should be reordered to be closer to the patch that
> needs these?  I don't think it's related to the "PCI: Avoid saving
> config space state if inaccessible" and PCI: Add additional checks for
> flr reset" patches.

Currently this patch is not needed by this series. It just fixes a gap 
on s390 on how we handle the resources, I could remove this patch from 
the series and have it as a separate patch.

Thanks

Farhan

>
>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
>>   drivers/pci/host-bridge.c |  8 ++---
>>   2 files changed, 78 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
>> index 2a430722cbe4..87077e510266 100644
>> --- a/arch/s390/pci/pci.c
>> +++ b/arch/s390/pci/pci.c
>> @@ -272,6 +272,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
>>   	return 0;
>>   }
>>   
>> +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
>> +			     struct resource *res)
>> +{
>> +	struct zpci_bus *zbus = bus->sysdata;
>> +	struct zpci_bar_struct *zbar;
>> +	struct zpci_dev *zdev;
>> +
>> +	region->start = res->start;
>> +	region->end = res->end;
>> +
>> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
>> +		int j = 0;
>> +
>> +		zbar = NULL;
>> +		zdev = zbus->function[i];
>> +		if (!zdev)
>> +			continue;
>> +
>> +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
>> +			if (zdev->bars[j].res->start == res->start &&
>> +			    zdev->bars[j].res->end == res->end &&
>> +			    res->flags & IORESOURCE_MEM) {
>> +				zbar = &zdev->bars[j];
>> +				break;
>> +			}
>> +		}
>> +
>> +		if (zbar) {
>> +			/* only MMIO is supported */
>> +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
>> +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
>> +				region->start |= (u64)zdev->bars[j + 1].val << 32;
>> +
>> +			region->end = region->start + (1UL << zbar->size) - 1;
>> +			return;
>> +		}
>> +	}
>> +}
>> +
>> +void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
>> +			     struct pci_bus_region *region)
>> +{
>> +	struct zpci_bus *zbus = bus->sysdata;
>> +	struct zpci_dev *zdev;
>> +	resource_size_t start, end;
>> +
>> +	res->start = region->start;
>> +	res->end = region->end;
>> +
>> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
>> +		zdev = zbus->function[i];
>> +		if (!zdev || !zdev->has_resources)
>> +			continue;
>> +
>> +		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
>> +			if (!zdev->bars[j].size)
>> +				continue;
>> +
>> +			/* only MMIO is supported */
>> +			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
>> +			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
>> +				start |= (u64)zdev->bars[j + 1].val << 32;
>> +
>> +			end = start + (1UL << zdev->bars[j].size) - 1;
>> +
>> +			if (start == region->start && end == region->end) {
>> +				res->start = zdev->bars[j].res->start;
>> +				res->end = zdev->bars[j].res->end;
>> +				return;
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>   void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
>>   			   pgprot_t prot)
>>   {
>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
>> index be5ef6516cff..aed031b8a9f3 100644
>> --- a/drivers/pci/host-bridge.c
>> +++ b/drivers/pci/host-bridge.c
>> @@ -49,8 +49,8 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
>>   }
>>   EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
>>   
>> -void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
>> -			     struct resource *res)
>> +void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
>> +				    struct resource *res)
>>   {
>>   	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
>>   	struct resource_entry *window;
>> @@ -74,8 +74,8 @@ static bool region_contains(struct pci_bus_region *region1,
>>   	return region1->start <= region2->start && region1->end >= region2->end;
>>   }
>>   
>> -void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
>> -			     struct pci_bus_region *region)
>> +void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
>> +				    struct pci_bus_region *region)
>>   {
>>   	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
>>   	struct resource_entry *window;
>> -- 
>> 2.43.0
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation
  2026-03-24 23:06   ` Bjorn Helgaas
  2026-03-24 23:47     ` Farhan Ali
@ 2026-03-25 11:58     ` Ilpo Järvinen
  2026-03-25 17:44       ` Farhan Ali
  1 sibling, 1 reply; 33+ messages in thread
From: Ilpo Järvinen @ 2026-03-25 11:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Farhan Ali, linux-s390, LKML, linux-pci, Lukas Wunner, alex,
	kbusch, clg, stable, schnelle, mjrosato

On Tue, 24 Mar 2026, Bjorn Helgaas wrote:

> [+cc Ilpo just for awareness; I assume there's nothing Linux can
> actually *do* with s390 PCI resources?]

I'm somewhat aware they've this speciality (and besides that, I'm 
waiting for this change in order to proceed with the series to detect 
which resources are properly setup when we enumerate them which got 
reverted earlier).

An additional thought related to this, there's IORESOURCE_PCI_FIXED 
results in skipping most of the resource fitting and assignment code, so 
if nothing really should touch these resources, perhaps that flag might be 
of some help.

-- 
 i.

> On Mon, Mar 16, 2026 at 12:15:37PM -0700, Farhan Ali wrote:
> > On s390 today we overwrite the PCI BAR resource address to either an
> > artificial cookie address or MIO address. However this address is different
> > from the bus address of the BARs programmed by firmware. The artificial
> > cookie address was created to index into an array of function handles
> > (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> > but maybe different from the bus addresses. This creates an issue when
> > trying to convert the BAR resource address to bus address using the generic
> > pcibios_resource_to_bus().
> > 
> > Implement an architecture specific pcibios_resource_to_bus() function to
> > correctly translate PCI BAR resource addresses to bus addresses for s390.
> > Similarly add architecture specific pcibios_bus_to_resource function to do
> > the reverse translation.
> 
> 1) It's not clear to me *why* we need these arch-specific versions.
> We went to a lot of trouble to make these interfaces generic, and I'll
> be really sad if they have to be arch-specific again.  I don't see any
> direct uses of these in the series.  In any case, some reference to
> the user and the actual problem this solves would help.
> 
> 2) I'm kind of concerned that the "unusual" s390 PCI resources will be
> unintelligible to people who are used to reading lspci or dmesg logs
> from non-s390 systems, and they might confuse the PCI core resource
> assignment code.  I guess there's not really a concept of a PCI host
> bridge on s390, there's no bus hierarchy, and no visible PCI-to-PCI
> bridges, so no windows?  Can you use PCIe switches at all?
>
> 3) Maybe this patch should be reordered to be closer to the patch that
> needs these?  I don't think it's related to the "PCI: Avoid saving
> config space state if inaccessible" and PCI: Add additional checks for
> flr reset" patches.
> 
> > Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > ---
> >  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
> >  drivers/pci/host-bridge.c |  8 ++---
> >  2 files changed, 78 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 2a430722cbe4..87077e510266 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -272,6 +272,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
> >  	return 0;
> >  }
> >  
> > +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> > +			     struct resource *res)
> > +{
> > +	struct zpci_bus *zbus = bus->sysdata;
> > +	struct zpci_bar_struct *zbar;
> > +	struct zpci_dev *zdev;
> > +
> > +	region->start = res->start;
> > +	region->end = res->end;
> > +
> > +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> > +		int j = 0;
> > +
> > +		zbar = NULL;
> > +		zdev = zbus->function[i];
> > +		if (!zdev)
> > +			continue;
> > +
> > +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
> > +			if (zdev->bars[j].res->start == res->start &&
> > +			    zdev->bars[j].res->end == res->end &&
> > +			    res->flags & IORESOURCE_MEM) {
> > +				zbar = &zdev->bars[j];
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (zbar) {
> > +			/* only MMIO is supported */
> > +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
> > +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> > +				region->start |= (u64)zdev->bars[j + 1].val << 32;
> > +
> > +			region->end = region->start + (1UL << zbar->size) - 1;
> > +			return;
> > +		}
> > +	}
> > +}
> > +
> > +void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> > +			     struct pci_bus_region *region)
> > +{
> > +	struct zpci_bus *zbus = bus->sysdata;
> > +	struct zpci_dev *zdev;
> > +	resource_size_t start, end;
> > +
> > +	res->start = region->start;
> > +	res->end = region->end;
> > +
> > +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> > +		zdev = zbus->function[i];
> > +		if (!zdev || !zdev->has_resources)
> > +			continue;
> > +
> > +		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
> > +			if (!zdev->bars[j].size)
> > +				continue;
> > +
> > +			/* only MMIO is supported */
> > +			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
> > +			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> > +				start |= (u64)zdev->bars[j + 1].val << 32;
> > +
> > +			end = start + (1UL << zdev->bars[j].size) - 1;
> > +
> > +			if (start == region->start && end == region->end) {
> > +				res->start = zdev->bars[j].res->start;
> > +				res->end = zdev->bars[j].res->end;
> > +				return;
> > +			}
> > +		}
> > +	}
> > +}
> > +
> >  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
> >  			   pgprot_t prot)
> >  {
> > diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
> > index be5ef6516cff..aed031b8a9f3 100644
> > --- a/drivers/pci/host-bridge.c
> > +++ b/drivers/pci/host-bridge.c
> > @@ -49,8 +49,8 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
> >  }
> >  EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
> >  
> > -void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> > -			     struct resource *res)
> > +void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> > +				    struct resource *res)
> >  {
> >  	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
> >  	struct resource_entry *window;
> > @@ -74,8 +74,8 @@ static bool region_contains(struct pci_bus_region *region1,
> >  	return region1->start <= region2->start && region1->end >= region2->end;
> >  }
> >  
> > -void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> > -			     struct pci_bus_region *region)
> > +void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> > +				    struct pci_bus_region *region)
> >  {
> >  	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
> >  	struct resource_entry *window;
> > -- 
> > 2.43.0
> > 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation
  2026-03-25 11:58     ` Ilpo Järvinen
@ 2026-03-25 17:44       ` Farhan Ali
  0 siblings, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-25 17:44 UTC (permalink / raw)
  To: Ilpo Järvinen, Bjorn Helgaas
  Cc: linux-s390, LKML, linux-pci, Lukas Wunner, alex, kbusch, clg,
	stable, schnelle, mjrosato


On 3/25/2026 4:58 AM, Ilpo Järvinen wrote:
> On Tue, 24 Mar 2026, Bjorn Helgaas wrote:
>
>> [+cc Ilpo just for awareness; I assume there's nothing Linux can
>> actually *do* with s390 PCI resources?]
> I'm somewhat aware they've this speciality (and besides that, I'm
> waiting for this change in order to proceed with the series to detect
> which resources are properly setup when we enumerate them which got
> reverted earlier).
>
> An additional thought related to this, there's IORESOURCE_PCI_FIXED
> results in skipping most of the resource fitting and assignment code, so
> if nothing really should touch these resources, perhaps that flag might be
> of some help.

I need to look into IORESOURCE_PCI_FIXED. Looking at briefly I think it 
may work for pci_restore_bars(), though need to check other callers of 
pcibios_resource_to_bus() to see if they will overwrite the BARs. But 
for now will remove this patch as part of this series and send it out as 
a separate patch.

Thanks

Farhan


>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
  2026-03-16 19:15 ` [PATCH v11 1/9] PCI: Allow per function PCI slots Farhan Ali
  2026-03-16 19:15 ` [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-24 21:40   ` Bjorn Helgaas
  2026-03-16 19:15 ` [PATCH v11 4/9] PCI: Add additional checks for flr reset Farhan Ali
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

The current reset process saves the device's config space state before
reset and restores it afterward. However errors may occur unexpectedly and
it may then be impossible to save config space because the device may be
inaccessible (e.g. DPC) or config space may be corrupted. This results in
saving corrupted values that get written back to the device during state
restoration.

With a reset we want to recover/restore the device into a functional state.
So avoid saving the state of the config space when the device config space
is inaccessible.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/pci/pci.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a93084053537..373421f4b9d8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5014,6 +5014,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
 {
 	const struct pci_error_handlers *err_handler =
 			dev->driver ? dev->driver->err_handler : NULL;
+	u32 val;
 
 	/*
 	 * dev->driver->err_handler->reset_prepare() is protected against
@@ -5033,6 +5034,19 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
 	 */
 	pci_set_power_state(dev, PCI_D0);
 
+	/*
+	 * If device's config space is inaccessible it can return ~0 for
+	 * any reads. Since VFs can also return ~0 for Device and Vendor ID
+	 * check Command and Status registers. At the very least we should
+	 * avoid restoring config space for device with error bits set in
+	 * Status register.
+	 */
+	pci_read_config_dword(dev, PCI_COMMAND, &val);
+	if (PCI_POSSIBLE_ERROR(val)) {
+		pci_warn(dev, "Device config space inaccessible\n");
+		return;
+	}
+
 	pci_save_state(dev);
 	/*
 	 * Disable the device by clearing the Command register, except for
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible
  2026-03-16 19:15 ` [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible Farhan Ali
@ 2026-03-24 21:40   ` Bjorn Helgaas
  2026-03-24 22:38     ` Farhan Ali
  0 siblings, 1 reply; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 21:40 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato

On Mon, Mar 16, 2026 at 12:15:38PM -0700, Farhan Ali wrote:
> The current reset process saves the device's config space state before
> reset and restores it afterward. However errors may occur unexpectedly and
> it may then be impossible to save config space because the device may be
> inaccessible (e.g. DPC) or config space may be corrupted. This results in
> saving corrupted values that get written back to the device during state
> restoration.

This patch only addresses the "inaccessible" part, so I'd drop the
"config space may be corrupted" because we aren't checking for that.

> With a reset we want to recover/restore the device into a functional state.
> So avoid saving the state of the config space when the device config space
> is inaccessible.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/pci.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a93084053537..373421f4b9d8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5014,6 +5014,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>  {
>  	const struct pci_error_handlers *err_handler =
>  			dev->driver ? dev->driver->err_handler : NULL;
> +	u32 val;
>  
>  	/*
>  	 * dev->driver->err_handler->reset_prepare() is protected against
> @@ -5033,6 +5034,19 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>  	 */
>  	pci_set_power_state(dev, PCI_D0);
>  
> +	/*
> +	 * If device's config space is inaccessible it can return ~0 for
> +	 * any reads. Since VFs can also return ~0 for Device and Vendor ID
> +	 * check Command and Status registers. At the very least we should
> +	 * avoid restoring config space for device with error bits set in
> +	 * Status register.
> +	 */
> +	pci_read_config_dword(dev, PCI_COMMAND, &val);
> +	if (PCI_POSSIBLE_ERROR(val)) {

Obviously this is still racy because the device may become
inaccessible partway through saving the state, and it might be worth
acknowledging that in the comment.  But I think this is an improvement
over what we do now.

Sashiko complains about this, but I think it's mainly because of that
last sentence of the comment that says "error bits set in Status
register".  This patch has to do with *saving*, not restoring, so I'd
just drop that last sentence.  FWIW, Sashiko said:

  The comment indicates that we should avoid restoring config space
  for a device with error bits set in the Status register, but the
  code only uses PCI_POSSIBLE_ERROR(val).

  Since PCI_POSSIBLE_ERROR() only evaluates whether the entire 32-bit
  value is exactly 0xFFFFFFFF (indicating complete device
  inaccessibility), does this actually check for individual error
  flags in the Status register?

  If a device logs an error but is still accessible, val will reflect
  those bits but will not equal 0xFFFFFFFF, causing the check to
  evaluate to false. Should this code also check specific bits in the
  upper 16 bits of val (such as PCI_STATUS_SIG_SYSTEM_ERROR or
  PCI_STATUS_DETECTED_PARITY) to match the stated intention in the
  comment?

> +		pci_warn(dev, "Device config space inaccessible\n");
> +		return;
> +	}
> +
>  	pci_save_state(dev);
>  	/*
>  	 * Disable the device by clearing the Command register, except for
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible
  2026-03-24 21:40   ` Bjorn Helgaas
@ 2026-03-24 22:38     ` Farhan Ali
  2026-03-24 22:52       ` Bjorn Helgaas
  0 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-24 22:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato


On 3/24/2026 2:40 PM, Bjorn Helgaas wrote:
> On Mon, Mar 16, 2026 at 12:15:38PM -0700, Farhan Ali wrote:
>> The current reset process saves the device's config space state before
>> reset and restores it afterward. However errors may occur unexpectedly and
>> it may then be impossible to save config space because the device may be
>> inaccessible (e.g. DPC) or config space may be corrupted. This results in
>> saving corrupted values that get written back to the device during state
>> restoration.
> This patch only addresses the "inaccessible" part, so I'd drop the
> "config space may be corrupted" because we aren't checking for that.

Sure, I will update the commit message.


>> With a reset we want to recover/restore the device into a functional state.
>> So avoid saving the state of the config space when the device config space
>> is inaccessible.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>

Thanks for reviewing!


>> ---
>>   drivers/pci/pci.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index a93084053537..373421f4b9d8 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5014,6 +5014,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>>   {
>>   	const struct pci_error_handlers *err_handler =
>>   			dev->driver ? dev->driver->err_handler : NULL;
>> +	u32 val;
>>   
>>   	/*
>>   	 * dev->driver->err_handler->reset_prepare() is protected against
>> @@ -5033,6 +5034,19 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>>   	 */
>>   	pci_set_power_state(dev, PCI_D0);
>>   
>> +	/*
>> +	 * If device's config space is inaccessible it can return ~0 for
>> +	 * any reads. Since VFs can also return ~0 for Device and Vendor ID
>> +	 * check Command and Status registers. At the very least we should
>> +	 * avoid restoring config space for device with error bits set in
>> +	 * Status register.
>> +	 */
>> +	pci_read_config_dword(dev, PCI_COMMAND, &val);
>> +	if (PCI_POSSIBLE_ERROR(val)) {
> Obviously this is still racy because the device may become
> inaccessible partway through saving the state, and it might be worth
> acknowledging that in the comment.  But I think this is an improvement
> over what we do now.

Yeah, makes sense. Will update the comment. How about something like:

If device's config space is inaccessible it can return ~0 for
any reads. Since VFs can also return ~0 for Device and Vendor ID
check Command and Status registers. This can still be racy as a device
can become inaccessible partway through saving the state, even after this
check.



>
> Sashiko complains about this, but I think it's mainly because of that
> last sentence of the comment that says "error bits set in Status
> register".  This patch has to do with *saving*, not restoring, so I'd
> just drop that last sentence.  FWIW, Sashiko said:
>
>    The comment indicates that we should avoid restoring config space
>    for a device with error bits set in the Status register, but the
>    code only uses PCI_POSSIBLE_ERROR(val).
>
>    Since PCI_POSSIBLE_ERROR() only evaluates whether the entire 32-bit
>    value is exactly 0xFFFFFFFF (indicating complete device
>    inaccessibility), does this actually check for individual error
>    flags in the Status register?
>
>    If a device logs an error but is still accessible, val will reflect
>    those bits but will not equal 0xFFFFFFFF, causing the check to
>    evaluate to false. Should this code also check specific bits in the
>    upper 16 bits of val (such as PCI_STATUS_SIG_SYSTEM_ERROR or
>    PCI_STATUS_DETECTED_PARITY) to match the stated intention in the
>    comment?

Yup will drop the last line.

Thanks

Farhan


>
>> +		pci_warn(dev, "Device config space inaccessible\n");
>> +		return;
>> +	}
>> +
>>   	pci_save_state(dev);
>>   	/*
>>   	 * Disable the device by clearing the Command register, except for
>> -- 
>> 2.43.0
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible
  2026-03-24 22:38     ` Farhan Ali
@ 2026-03-24 22:52       ` Bjorn Helgaas
  0 siblings, 0 replies; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 22:52 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato

On Tue, Mar 24, 2026 at 03:38:33PM -0700, Farhan Ali wrote:
> On 3/24/2026 2:40 PM, Bjorn Helgaas wrote:
> > On Mon, Mar 16, 2026 at 12:15:38PM -0700, Farhan Ali wrote:
> > > The current reset process saves the device's config space state before
> > > reset and restores it afterward. However errors may occur unexpectedly and
> > > it may then be impossible to save config space because the device may be
> > > inaccessible (e.g. DPC) or config space may be corrupted. This results in
> > > saving corrupted values that get written back to the device during state
> > > restoration.

> > > +	 * If device's config space is inaccessible it can return ~0 for
> > > +	 * any reads. Since VFs can also return ~0 for Device and Vendor ID
> > > +	 * check Command and Status registers. At the very least we should
> > > +	 * avoid restoring config space for device with error bits set in
> > > +	 * Status register.
> > > +	 */
> > > +	pci_read_config_dword(dev, PCI_COMMAND, &val);
> > > +	if (PCI_POSSIBLE_ERROR(val)) {
> >
> > Obviously this is still racy because the device may become
> > inaccessible partway through saving the state, and it might be worth
> > acknowledging that in the comment.  But I think this is an improvement
> > over what we do now.
> 
> Yeah, makes sense. Will update the comment. How about something like:
> 
> If device's config space is inaccessible it can return ~0 for
> any reads. Since VFs can also return ~0 for Device and Vendor ID
> check Command and Status registers. This can still be racy as a device
> can become inaccessible partway through saving the state, even after this
> check.

How about:

  Note that this is racy because the device may become inaccessible
  partway through saving the state.

It's not just "can still be racy"; it's *always* racy unless we detect
PCI errors on every access and recover from them.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 4/9] PCI: Add additional checks for flr reset
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (2 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-24 22:49   ` Bjorn Helgaas
  2026-03-16 19:15 ` [PATCH v11 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato, Benjamin Block

If a device is in an error state, then any reads of device registers can
return error value. Add additional checks to validate if a device is in an
error state before doing an flr reset.

Cc: stable@vger.kernel.org
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 373421f4b9d8..8e4d924f4e88 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4371,12 +4371,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
  */
 int pcie_reset_flr(struct pci_dev *dev, bool probe)
 {
+	u32 reg;
+
 	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
 		return -ENOTTY;
 
 	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
 		return -ENOTTY;
 
+	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
+		pci_warn(dev, "Device unable to do an FLR\n");
+		return -ENOTTY;
+	}
+
 	if (probe)
 		return 0;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 4/9] PCI: Add additional checks for flr reset
  2026-03-16 19:15 ` [PATCH v11 4/9] PCI: Add additional checks for flr reset Farhan Ali
@ 2026-03-24 22:49   ` Bjorn Helgaas
  2026-03-24 23:22     ` Farhan Ali
  2026-03-25 16:25     ` Alex Williamson
  0 siblings, 2 replies; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 22:49 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato, Benjamin Block

On Mon, Mar 16, 2026 at 12:15:39PM -0700, Farhan Ali wrote:
> If a device is in an error state, then any reads of device registers can
> return error value. Add additional checks to validate if a device is in an
> error state before doing an flr reset.

s/flr/FLR/ (also in subject)

Also please extend the subject to say something specific about the
"additional checks".  E.g.,

  PCI: Fail FLR when config space inaccessible

> Cc: stable@vger.kernel.org
> Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 373421f4b9d8..8e4d924f4e88 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4371,12 +4371,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
>   */
>  int pcie_reset_flr(struct pci_dev *dev, bool probe)
>  {
> +	u32 reg;
> +
>  	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
>  		return -ENOTTY;
>  
>  	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
>  		return -ENOTTY;
>  
> +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
> +		pci_warn(dev, "Device unable to do an FLR\n");
> +		return -ENOTTY;
> +	}

I guess the point of this is to detect devices that are inaccessible?
The same sort of thing as in pci_dev_save_and_disable() from patch
3/9?  But we use "dev->error_state == pci_channel_io_perm_failure"
instead?

No matter what we do, this has the same race as in patch 3/9.  And I
think using dev->error_state also depends on AER being enabled, which
cuts out many PCIe devices.

I think using the same exact code as in pci_dev_save_and_disable()
would be more straightforward.  And also the same sort of wording in
the message, e.g., "Device config space inaccessible; unable to FLR"
or similar.  I foresee many of these messages in my future, and it
will be helpful to have a specific clue about why FLR failed :)

>  	if (probe)
>  		return 0;
>  
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 4/9] PCI: Add additional checks for flr reset
  2026-03-24 22:49   ` Bjorn Helgaas
@ 2026-03-24 23:22     ` Farhan Ali
  2026-03-25 16:25     ` Alex Williamson
  1 sibling, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-24 23:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato, Benjamin Block


On 3/24/2026 3:49 PM, Bjorn Helgaas wrote:
> On Mon, Mar 16, 2026 at 12:15:39PM -0700, Farhan Ali wrote:
>> If a device is in an error state, then any reads of device registers can
>> return error value. Add additional checks to validate if a device is in an
>> error state before doing an flr reset.
> s/flr/FLR/ (also in subject)
>
> Also please extend the subject to say something specific about the
> "additional checks".  E.g.,
>
>    PCI: Fail FLR when config space inaccessible

This makes sense, will update the subject.


>
>> Cc: stable@vger.kernel.org
>> Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/pci/pci.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 373421f4b9d8..8e4d924f4e88 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -4371,12 +4371,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
>>    */
>>   int pcie_reset_flr(struct pci_dev *dev, bool probe)
>>   {
>> +	u32 reg;
>> +
>>   	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
>>   		return -ENOTTY;
>>   
>>   	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
>>   		return -ENOTTY;
>>   
>> +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
>> +		pci_warn(dev, "Device unable to do an FLR\n");
>> +		return -ENOTTY;
>> +	}
> I guess the point of this is to detect devices that are inaccessible?
> The same sort of thing as in pci_dev_save_and_disable() from patch
> 3/9?  But we use "dev->error_state == pci_channel_io_perm_failure"
> instead?
>
> No matter what we do, this has the same race as in patch 3/9.  And I
> think using dev->error_state also depends on AER being enabled, which
> cuts out many PCIe devices.
>
> I think using the same exact code as in pci_dev_save_and_disable()
> would be more straightforward.  And also the same sort of wording in
> the message, e.g., "Device config space inaccessible; unable to FLR"
> or similar.  I foresee many of these messages in my future, and it
> will be helpful to have a specific clue about why FLR failed :)

I will update the message :) and update the code to use the same check 
at patch 3 here (will move the check to a common function).

Thanks

Farhan


>
>>   	if (probe)
>>   		return 0;
>>   
>> -- 
>> 2.43.0
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 4/9] PCI: Add additional checks for flr reset
  2026-03-24 22:49   ` Bjorn Helgaas
  2026-03-24 23:22     ` Farhan Ali
@ 2026-03-25 16:25     ` Alex Williamson
  2026-03-25 18:40       ` Farhan Ali
  1 sibling, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2026-03-25 16:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Farhan Ali, linux-s390, linux-kernel, linux-pci, lukas, kbusch,
	clg, stable, schnelle, mjrosato, Benjamin Block, alex

On Tue, 24 Mar 2026 17:49:36 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Mar 16, 2026 at 12:15:39PM -0700, Farhan Ali wrote:
> > If a device is in an error state, then any reads of device registers can
> > return error value. Add additional checks to validate if a device is in an
> > error state before doing an flr reset.  
> 
> s/flr/FLR/ (also in subject)
> 
> Also please extend the subject to say something specific about the
> "additional checks".  E.g.,
> 
>   PCI: Fail FLR when config space inaccessible
> 
> > Cc: stable@vger.kernel.org
> > Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
> > Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > ---
> >  drivers/pci/pci.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 373421f4b9d8..8e4d924f4e88 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -4371,12 +4371,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
> >   */
> >  int pcie_reset_flr(struct pci_dev *dev, bool probe)
> >  {
> > +	u32 reg;
> > +
> >  	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
> >  		return -ENOTTY;
> >  
> >  	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
> >  		return -ENOTTY;
> >  
> > +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
> > +		pci_warn(dev, "Device unable to do an FLR\n");
> > +		return -ENOTTY;
> > +	}  
> 
> I guess the point of this is to detect devices that are inaccessible?
> The same sort of thing as in pci_dev_save_and_disable() from patch
> 3/9?  But we use "dev->error_state == pci_channel_io_perm_failure"
> instead?
> 
> No matter what we do, this has the same race as in patch 3/9.  And I
> think using dev->error_state also depends on AER being enabled, which
> cuts out many PCIe devices.
> 
> I think using the same exact code as in pci_dev_save_and_disable()
> would be more straightforward.  And also the same sort of wording in
> the message, e.g., "Device config space inaccessible; unable to FLR"
> or similar.  I foresee many of these messages in my future, and it
> will be helpful to have a specific clue about why FLR failed :)

I think pci_af_flr() and pci_pm_reset() are all subject to this as
well, some of the device specific resets too.  Maybe we need a common
helper with a string provided so we can differentiate the callers, even
if we don't make all the conversions in this series.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 4/9] PCI: Add additional checks for flr reset
  2026-03-25 16:25     ` Alex Williamson
@ 2026-03-25 18:40       ` Farhan Ali
  0 siblings, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-25 18:40 UTC (permalink / raw)
  To: Alex Williamson, Bjorn Helgaas
  Cc: linux-s390, linux-kernel, linux-pci, lukas, kbusch, clg, stable,
	schnelle, mjrosato, Benjamin Block


On 3/25/2026 9:25 AM, Alex Williamson wrote:
> On Tue, 24 Mar 2026 17:49:36 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
>
>> On Mon, Mar 16, 2026 at 12:15:39PM -0700, Farhan Ali wrote:
>>> If a device is in an error state, then any reads of device registers can
>>> return error value. Add additional checks to validate if a device is in an
>>> error state before doing an flr reset.
>> s/flr/FLR/ (also in subject)
>>
>> Also please extend the subject to say something specific about the
>> "additional checks".  E.g.,
>>
>>    PCI: Fail FLR when config space inaccessible
>>
>>> Cc: stable@vger.kernel.org
>>> Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
>>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>> ---
>>>   drivers/pci/pci.c | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 373421f4b9d8..8e4d924f4e88 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -4371,12 +4371,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
>>>    */
>>>   int pcie_reset_flr(struct pci_dev *dev, bool probe)
>>>   {
>>> +	u32 reg;
>>> +
>>>   	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
>>>   		return -ENOTTY;
>>>   
>>>   	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
>>>   		return -ENOTTY;
>>>   
>>> +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
>>> +		pci_warn(dev, "Device unable to do an FLR\n");
>>> +		return -ENOTTY;
>>> +	}
>> I guess the point of this is to detect devices that are inaccessible?
>> The same sort of thing as in pci_dev_save_and_disable() from patch
>> 3/9?  But we use "dev->error_state == pci_channel_io_perm_failure"
>> instead?
>>
>> No matter what we do, this has the same race as in patch 3/9.  And I
>> think using dev->error_state also depends on AER being enabled, which
>> cuts out many PCIe devices.
>>
>> I think using the same exact code as in pci_dev_save_and_disable()
>> would be more straightforward.  And also the same sort of wording in
>> the message, e.g., "Device config space inaccessible; unable to FLR"
>> or similar.  I foresee many of these messages in my future, and it
>> will be helpful to have a specific clue about why FLR failed :)
> I think pci_af_flr() and pci_pm_reset() are all subject to this as
> well, some of the device specific resets too.  Maybe we need a common
> helper with a string provided so we can differentiate the callers, even
> if we don't make all the conversions in this series.  Thanks,
>
> Alex

Yeah I was planning on implementing a common function in the next 
revision. Will include a string parameter to differentiate the callers. 
And yeah I prefer not to update pci_af_flr() or pci_pm_reset() as part 
of this series since I don't have the devices to test those paths.

Thanks

Farhan



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 5/9] s390/pci: Update the logic for detecting passthrough device
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (3 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 4/9] PCI: Add additional checks for flr reset Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-25 16:46   ` Alex Williamson
  2026-03-16 19:15 ` [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

We can now have userspace drivers (vfio-pci based) on s390x. The userspace
drivers will not have any KVM fd and so no kzdev associated with them. So
we need to update the logic for detecting passthrough devices to not depend
on struct kvm_zdev.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      |  1 +
 arch/s390/pci/pci_event.c        | 14 ++++----------
 drivers/vfio/pci/vfio_pci_zdev.c |  9 ++++++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index c0ff19dab580..ec8a772bf526 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -171,6 +171,7 @@ struct zpci_dev {
 
 	char res_name[16];
 	bool mio_capable;
+	bool mediated_recovery;
 	struct zpci_bar_struct bars[PCI_STD_NUM_BARS];
 
 	u64		start_dma;	/* Start of available DMA addresses */
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 839bd91c056e..de504925f709 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -60,16 +60,10 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
 	}
 }
 
-static bool is_passed_through(struct pci_dev *pdev)
+static bool needs_mediated_recovery(struct pci_dev *pdev)
 {
 	struct zpci_dev *zdev = to_zpci(pdev);
-	bool ret;
-
-	mutex_lock(&zdev->kzdev_lock);
-	ret = !!zdev->kzdev;
-	mutex_unlock(&zdev->kzdev_lock);
-
-	return ret;
+	return zdev->mediated_recovery;
 }
 
 static bool is_driver_supported(struct pci_driver *driver)
@@ -194,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (is_passed_through(pdev)) {
+	if (needs_mediated_recovery(pdev)) {
 		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
 			pci_name(pdev));
 		status_str = "failed (pass-through)";
@@ -279,7 +273,7 @@ static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
 	 * we will inject the error event and let the guest recover the device
 	 * itself.
 	 */
-	if (is_passed_through(pdev))
+	if (needs_mediated_recovery(pdev))
 		goto out;
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 0990fdb146b7..a7bc23ce8483 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -148,6 +148,8 @@ int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 	if (!zdev)
 		return -ENODEV;
 
+	zdev->mediated_recovery = true;
+
 	if (!vdev->vdev.kvm)
 		return 0;
 
@@ -161,7 +163,12 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
 
-	if (!zdev || !vdev->vdev.kvm)
+	if (!zdev)
+		return;
+
+	zdev->mediated_recovery = false;
+
+	if (!vdev->vdev.kvm)
 		return;
 
 	if (zpci_kvm_hook.kvm_unregister)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 5/9] s390/pci: Update the logic for detecting passthrough device
  2026-03-16 19:15 ` [PATCH v11 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
@ 2026-03-25 16:46   ` Alex Williamson
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Williamson @ 2026-03-25 16:46 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, kbusch, clg,
	stable, schnelle, mjrosato, alex

On Mon, 16 Mar 2026 12:15:40 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> We can now have userspace drivers (vfio-pci based) on s390x. The userspace
> drivers will not have any KVM fd and so no kzdev associated with them. So
> we need to update the logic for detecting passthrough devices to not depend
> on struct kvm_zdev.
> 
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h      |  1 +
>  arch/s390/pci/pci_event.c        | 14 ++++----------
>  drivers/vfio/pci/vfio_pci_zdev.c |  9 ++++++++-
>  3 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index c0ff19dab580..ec8a772bf526 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -171,6 +171,7 @@ struct zpci_dev {
>  
>  	char res_name[16];
>  	bool mio_capable;
> +	bool mediated_recovery;
>  	struct zpci_bar_struct bars[PCI_STD_NUM_BARS];
>  
>  	u64		start_dma;	/* Start of available DMA addresses */
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index 839bd91c056e..de504925f709 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -60,16 +60,10 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
>  	}
>  }
>  
> -static bool is_passed_through(struct pci_dev *pdev)
> +static bool needs_mediated_recovery(struct pci_dev *pdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(pdev);
> -	bool ret;
> -
> -	mutex_lock(&zdev->kzdev_lock);
> -	ret = !!zdev->kzdev;
> -	mutex_unlock(&zdev->kzdev_lock);
> -
> -	return ret;
> +	return zdev->mediated_recovery;

Looks like this was always a bit racy, it's not clear the mutex
provided anything except maybe memory ordering.  Maybe the expectation
is that errors don't occur in relation to open/close_device?  I wonder
if READ/WRITE_ONCE would at least provide some ordering assurances.
Otherwise:

Acked-by: Alex Williamson <alex@shazbot.org>

>  }
>  
>  static bool is_driver_supported(struct pci_driver *driver)
> @@ -194,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  	}
>  	pdev->error_state = pci_channel_io_frozen;
>  
> -	if (is_passed_through(pdev)) {
> +	if (needs_mediated_recovery(pdev)) {
>  		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
>  			pci_name(pdev));
>  		status_str = "failed (pass-through)";
> @@ -279,7 +273,7 @@ static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
>  	 * we will inject the error event and let the guest recover the device
>  	 * itself.
>  	 */
> -	if (is_passed_through(pdev))
> +	if (needs_mediated_recovery(pdev))
>  		goto out;
>  	driver = to_pci_driver(pdev->dev.driver);
>  	if (driver && driver->err_handler && driver->err_handler->error_detected)
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index 0990fdb146b7..a7bc23ce8483 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -148,6 +148,8 @@ int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>  	if (!zdev)
>  		return -ENODEV;
>  
> +	zdev->mediated_recovery = true;
> +
>  	if (!vdev->vdev.kvm)
>  		return 0;
>  
> @@ -161,7 +163,12 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(vdev->pdev);
>  
> -	if (!zdev || !vdev->vdev.kvm)
> +	if (!zdev)
> +		return;
> +
> +	zdev->mediated_recovery = false;
> +
> +	if (!vdev->vdev.kvm)
>  		return;
>  
>  	if (zpci_kvm_hook.kvm_unregister)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (4 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-25 17:01   ` Alex Williamson
  2026-03-16 19:15 ` [PATCH v11 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

For a passthrough device we need co-operation from user space to recover
the device. This would require to bubble up any error information to user
space.  Let's store this error information for passthrough devices, so it
can be retrieved later.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      | 28 ++++++++++
 arch/s390/pci/pci.c              |  1 +
 arch/s390/pci/pci_event.c        | 94 +++++++++++++++++++-------------
 drivers/vfio/pci/vfio_pci_zdev.c |  2 +
 4 files changed, 87 insertions(+), 38 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index ec8a772bf526..383f6483b656 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -118,6 +118,31 @@ struct zpci_bus {
 	enum pci_bus_speed	max_bus_speed;
 };
 
+/* Content Code Description for PCI Function Error */
+struct zpci_ccdf_err {
+	u32 reserved1;
+	u32 fh;                         /* function handle */
+	u32 fid;                        /* function id */
+	u32 ett         :  4;           /* expected table type */
+	u32 mvn         : 12;           /* MSI vector number */
+	u32 dmaas       :  8;           /* DMA address space */
+	u32 reserved2   :  6;
+	u32 q           :  1;           /* event qualifier */
+	u32 rw          :  1;           /* read/write */
+	u64 faddr;                      /* failing address */
+	u32 reserved3;
+	u16 reserved4;
+	u16 pec;                        /* PCI event code */
+} __packed;
+
+#define ZPCI_ERR_PENDING_MAX 4
+struct zpci_ccdf_pending {
+	u8 count;
+	u8 head;
+	u8 tail;
+	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
+};
+
 /* Private data per function */
 struct zpci_dev {
 	struct zpci_bus *zbus;
@@ -193,6 +218,8 @@ struct zpci_dev {
 	struct iommu_domain *s390_domain; /* attached IOMMU domain */
 	struct kvm_zdev *kzdev;
 	struct mutex kzdev_lock;
+	struct zpci_ccdf_pending pending_errs;
+	struct mutex pending_errs_lock;
 	spinlock_t dom_lock;		/* protect s390_domain change */
 };
 
@@ -331,6 +358,7 @@ void zpci_debug_exit_device(struct zpci_dev *);
 int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
 int zpci_clear_error_state(struct zpci_dev *zdev);
 int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev);
 
 #ifdef CONFIG_NUMA
 
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 87077e510266..bc253cc52056 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -915,6 +915,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
 	mutex_init(&zdev->state_lock);
 	mutex_init(&zdev->fmb_lock);
 	mutex_init(&zdev->kzdev_lock);
+	mutex_init(&zdev->pending_errs_lock);
 
 	return zdev;
 
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index de504925f709..5b24f3a9fe23 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -17,23 +17,6 @@
 #include "pci_bus.h"
 #include "pci_report.h"
 
-/* Content Code Description for PCI Function Error */
-struct zpci_ccdf_err {
-	u32 reserved1;
-	u32 fh;				/* function handle */
-	u32 fid;			/* function id */
-	u32 ett		:  4;		/* expected table type */
-	u32 mvn		: 12;		/* MSI vector number */
-	u32 dmaas	:  8;		/* DMA address space */
-	u32		:  6;
-	u32 q		:  1;		/* event qualifier */
-	u32 rw		:  1;		/* read/write */
-	u64 faddr;			/* failing address */
-	u32 reserved3;
-	u16 reserved4;
-	u16 pec;			/* PCI event code */
-} __packed;
-
 /* Content Code Description for PCI Function Availability */
 struct zpci_ccdf_avail {
 	u32 reserved1;
@@ -75,6 +58,41 @@ static bool is_driver_supported(struct pci_driver *driver)
 	return true;
 }
 
+static void zpci_store_pci_error(struct pci_dev *pdev,
+				 struct zpci_ccdf_err *ccdf)
+{
+	struct zpci_dev *zdev = to_zpci(pdev);
+	int i;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
+		pr_err("%s: Maximum number (%d) of pending error events queued",
+		       pci_name(pdev), ZPCI_ERR_PENDING_MAX);
+		mutex_unlock(&zdev->pending_errs_lock);
+		return;
+	}
+
+	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
+	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
+	zdev->pending_errs.tail++;
+	zdev->pending_errs.count++;
+	mutex_unlock(&zdev->pending_errs_lock);
+}
+
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
+{
+	struct pci_dev *pdev = NULL;
+
+	guard(mutex)(&zdev->pending_errs_lock);
+	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
+	if (zdev->pending_errs.count)
+		pr_info("%s: Unhandled PCI error events count=%d",
+			pci_name(pdev), zdev->pending_errs.count);
+	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
+
 static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
 							 struct pci_driver *driver)
 {
@@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
  * and the platform determines which functions are affected for
  * multi-function devices.
  */
-static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
+static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
+							  struct zpci_ccdf_err *ccdf)
 {
 	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
 	struct zpci_dev *zdev = to_zpci(pdev);
@@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (needs_mediated_recovery(pdev)) {
-		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
-			pci_name(pdev));
-		status_str = "failed (pass-through)";
-		goto out_unlock;
-	}
-
 	driver = to_pci_driver(pdev->dev.driver);
 	if (!is_driver_supported(driver)) {
 		if (!driver) {
@@ -210,12 +222,23 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 		goto out_unlock;
 	}
 
+	if (needs_mediated_recovery(pdev))
+		zpci_store_pci_error(pdev, ccdf);
+
 	ers_res = zpci_event_notify_error_detected(pdev, driver);
 	if (ers_result_indicates_abort(ers_res)) {
 		status_str = "failed (abort on detection)";
 		goto out_unlock;
 	}
 
+	if (needs_mediated_recovery(pdev)) {
+		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
+			pci_name(pdev));
+		ers_res = PCI_ERS_RESULT_RECOVERED;
+		status_str = "in progress";
+		goto out_unlock;
+	}
+
 	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
 		ers_res = zpci_event_do_error_state_clear(pdev, driver);
 		if (ers_result_indicates_abort(ers_res)) {
@@ -260,25 +283,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
  * @pdev: PCI function for which to report
  * @es: PCI channel failure state to report
  */
-static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
+static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
+				  struct zpci_ccdf_err *ccdf)
 {
 	struct pci_driver *driver;
 
 	pci_dev_lock(pdev);
 	pdev->error_state = es;
-	/**
-	 * While vfio-pci's error_detected callback notifies user-space QEMU
-	 * reacts to this by freezing the guest. In an s390 environment PCI
-	 * errors are rarely fatal so this is overkill. Instead in the future
-	 * we will inject the error event and let the guest recover the device
-	 * itself.
-	 */
+
 	if (needs_mediated_recovery(pdev))
-		goto out;
+		zpci_store_pci_error(pdev, ccdf);
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
 		driver->err_handler->error_detected(pdev, pdev->error_state);
-out:
+
 	pci_dev_unlock(pdev);
 }
 
@@ -324,12 +342,12 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
 		break;
 	case 0x0040: /* Service Action or Error Recovery Failed */
 	case 0x003b:
-		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
 		break;
 	default: /* PCI function left in the error state attempt to recover */
-		ers_res = zpci_event_attempt_error_recovery(pdev);
+		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
 		if (ers_res != PCI_ERS_RESULT_RECOVERED)
-			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+			zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
 		break;
 	}
 	pci_dev_put(pdev);
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index a7bc23ce8483..2be37eab9279 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 
 	zdev->mediated_recovery = false;
 
+	zpci_cleanup_pending_errors(zdev);
+
 	if (!vdev->vdev.kvm)
 		return;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices
  2026-03-16 19:15 ` [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2026-03-25 17:01   ` Alex Williamson
  2026-03-25 18:06     ` Farhan Ali
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2026-03-25 17:01 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, kbusch, clg,
	stable, schnelle, mjrosato, alex

On Mon, 16 Mar 2026 12:15:41 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> For a passthrough device we need co-operation from user space to recover
> the device. This would require to bubble up any error information to user
> space.  Let's store this error information for passthrough devices, so it
> can be retrieved later.
> 
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h      | 28 ++++++++++
>  arch/s390/pci/pci.c              |  1 +
>  arch/s390/pci/pci_event.c        | 94 +++++++++++++++++++-------------
>  drivers/vfio/pci/vfio_pci_zdev.c |  2 +
>  4 files changed, 87 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index ec8a772bf526..383f6483b656 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -118,6 +118,31 @@ struct zpci_bus {
>  	enum pci_bus_speed	max_bus_speed;
>  };
>  
> +/* Content Code Description for PCI Function Error */
> +struct zpci_ccdf_err {
> +	u32 reserved1;
> +	u32 fh;                         /* function handle */
> +	u32 fid;                        /* function id */
> +	u32 ett         :  4;           /* expected table type */
> +	u32 mvn         : 12;           /* MSI vector number */
> +	u32 dmaas       :  8;           /* DMA address space */
> +	u32 reserved2   :  6;
> +	u32 q           :  1;           /* event qualifier */
> +	u32 rw          :  1;           /* read/write */
> +	u64 faddr;                      /* failing address */
> +	u32 reserved3;
> +	u16 reserved4;
> +	u16 pec;                        /* PCI event code */
> +} __packed;
> +
> +#define ZPCI_ERR_PENDING_MAX 4
> +struct zpci_ccdf_pending {
> +	u8 count;
> +	u8 head;
> +	u8 tail;
> +	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
> +};
> +
>  /* Private data per function */
>  struct zpci_dev {
>  	struct zpci_bus *zbus;
> @@ -193,6 +218,8 @@ struct zpci_dev {
>  	struct iommu_domain *s390_domain; /* attached IOMMU domain */
>  	struct kvm_zdev *kzdev;
>  	struct mutex kzdev_lock;
> +	struct zpci_ccdf_pending pending_errs;
> +	struct mutex pending_errs_lock;
>  	spinlock_t dom_lock;		/* protect s390_domain change */
>  };
>  
> @@ -331,6 +358,7 @@ void zpci_debug_exit_device(struct zpci_dev *);
>  int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
>  int zpci_clear_error_state(struct zpci_dev *zdev);
>  int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev);
>  
>  #ifdef CONFIG_NUMA
>  
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 87077e510266..bc253cc52056 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -915,6 +915,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
>  	mutex_init(&zdev->state_lock);
>  	mutex_init(&zdev->fmb_lock);
>  	mutex_init(&zdev->kzdev_lock);
> +	mutex_init(&zdev->pending_errs_lock);
>  
>  	return zdev;
>  
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index de504925f709..5b24f3a9fe23 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -17,23 +17,6 @@
>  #include "pci_bus.h"
>  #include "pci_report.h"
>  
> -/* Content Code Description for PCI Function Error */
> -struct zpci_ccdf_err {
> -	u32 reserved1;
> -	u32 fh;				/* function handle */
> -	u32 fid;			/* function id */
> -	u32 ett		:  4;		/* expected table type */
> -	u32 mvn		: 12;		/* MSI vector number */
> -	u32 dmaas	:  8;		/* DMA address space */
> -	u32		:  6;
> -	u32 q		:  1;		/* event qualifier */
> -	u32 rw		:  1;		/* read/write */
> -	u64 faddr;			/* failing address */
> -	u32 reserved3;
> -	u16 reserved4;
> -	u16 pec;			/* PCI event code */
> -} __packed;
> -
>  /* Content Code Description for PCI Function Availability */
>  struct zpci_ccdf_avail {
>  	u32 reserved1;
> @@ -75,6 +58,41 @@ static bool is_driver_supported(struct pci_driver *driver)
>  	return true;
>  }
>  
> +static void zpci_store_pci_error(struct pci_dev *pdev,
> +				 struct zpci_ccdf_err *ccdf)
> +{
> +	struct zpci_dev *zdev = to_zpci(pdev);
> +	int i;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
> +		pr_err("%s: Maximum number (%d) of pending error events queued",
> +		       pci_name(pdev), ZPCI_ERR_PENDING_MAX);
> +		mutex_unlock(&zdev->pending_errs_lock);
> +		return;
> +	}
> +
> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
> +	zdev->pending_errs.tail++;
> +	zdev->pending_errs.count++;
> +	mutex_unlock(&zdev->pending_errs_lock);
> +}
> +
> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
> +{
> +	struct pci_dev *pdev = NULL;
> +
> +	guard(mutex)(&zdev->pending_errs_lock);
> +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
> +	if (zdev->pending_errs.count)
> +		pr_info("%s: Unhandled PCI error events count=%d",
> +			pci_name(pdev), zdev->pending_errs.count);
> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
> +	pci_dev_put(pdev);
> +}
> +EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
> +
>  static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
>  							 struct pci_driver *driver)
>  {
> @@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
>   * and the platform determines which functions are affected for
>   * multi-function devices.
>   */
> -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
> +							  struct zpci_ccdf_err *ccdf)
>  {
>  	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
>  	struct zpci_dev *zdev = to_zpci(pdev);
> @@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  	}
>  	pdev->error_state = pci_channel_io_frozen;
>  
> -	if (needs_mediated_recovery(pdev)) {
> -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
> -			pci_name(pdev));
> -		status_str = "failed (pass-through)";
> -		goto out_unlock;
> -	}
> -
>  	driver = to_pci_driver(pdev->dev.driver);
>  	if (!is_driver_supported(driver)) {
>  		if (!driver) {
> @@ -210,12 +222,23 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  		goto out_unlock;
>  	}
>  
> +	if (needs_mediated_recovery(pdev))
> +		zpci_store_pci_error(pdev, ccdf);
> +
>  	ers_res = zpci_event_notify_error_detected(pdev, driver);
>  	if (ers_result_indicates_abort(ers_res)) {
>  		status_str = "failed (abort on detection)";
>  		goto out_unlock;
>  	}
>  
> +	if (needs_mediated_recovery(pdev)) {
> +		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
> +			pci_name(pdev));
> +		ers_res = PCI_ERS_RESULT_RECOVERED;
> +		status_str = "in progress";
> +		goto out_unlock;
> +	}
> +
>  	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
>  		ers_res = zpci_event_do_error_state_clear(pdev, driver);
>  		if (ers_result_indicates_abort(ers_res)) {
> @@ -260,25 +283,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>   * @pdev: PCI function for which to report
>   * @es: PCI channel failure state to report
>   */
> -static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
> +static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
> +				  struct zpci_ccdf_err *ccdf)
>  {
>  	struct pci_driver *driver;
>  
>  	pci_dev_lock(pdev);
>  	pdev->error_state = es;
> -	/**
> -	 * While vfio-pci's error_detected callback notifies user-space QEMU
> -	 * reacts to this by freezing the guest. In an s390 environment PCI
> -	 * errors are rarely fatal so this is overkill. Instead in the future
> -	 * we will inject the error event and let the guest recover the device
> -	 * itself.
> -	 */
> +
>  	if (needs_mediated_recovery(pdev))
> -		goto out;
> +		zpci_store_pci_error(pdev, ccdf);
>  	driver = to_pci_driver(pdev->dev.driver);
>  	if (driver && driver->err_handler && driver->err_handler->error_detected)
>  		driver->err_handler->error_detected(pdev, pdev->error_state);
> -out:
> +
>  	pci_dev_unlock(pdev);
>  }
>  
> @@ -324,12 +342,12 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>  		break;
>  	case 0x0040: /* Service Action or Error Recovery Failed */
>  	case 0x003b:
> -		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
> +		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>  		break;
>  	default: /* PCI function left in the error state attempt to recover */
> -		ers_res = zpci_event_attempt_error_recovery(pdev);
> +		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
>  		if (ers_res != PCI_ERS_RESULT_RECOVERED)
> -			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
> +			zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>  		break;
>  	}
>  	pci_dev_put(pdev);
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index a7bc23ce8483..2be37eab9279 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>  
>  	zdev->mediated_recovery = false;
>  
> +	zpci_cleanup_pending_errors(zdev);
> +
>  	if (!vdev->vdev.kvm)
>  		return;
>  

It begins to look here like the mediated_recovery should be protected
by pending_errs_lock and perhaps there should be
zpci_{start,stop}_mediated_recovery() where we set and clear the flag
under mutex, while also clearing pending errors in the latter case.
The various needs_mediated_recovery tests could be pulled in to test
under mutex as well.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices
  2026-03-25 17:01   ` Alex Williamson
@ 2026-03-25 18:06     ` Farhan Ali
  0 siblings, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-25 18:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, kbusch, clg,
	stable, schnelle, mjrosato


On 3/25/2026 10:01 AM, Alex Williamson wrote:
> On Mon, 16 Mar 2026 12:15:41 -0700
> Farhan Ali <alifm@linux.ibm.com> wrote:
>
>> For a passthrough device we need co-operation from user space to recover
>> the device. This would require to bubble up any error information to user
>> space.  Let's store this error information for passthrough devices, so it
>> can be retrieved later.
>>
>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pci.h      | 28 ++++++++++
>>   arch/s390/pci/pci.c              |  1 +
>>   arch/s390/pci/pci_event.c        | 94 +++++++++++++++++++-------------
>>   drivers/vfio/pci/vfio_pci_zdev.c |  2 +
>>   4 files changed, 87 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
>> index ec8a772bf526..383f6483b656 100644
>> --- a/arch/s390/include/asm/pci.h
>> +++ b/arch/s390/include/asm/pci.h
>> @@ -118,6 +118,31 @@ struct zpci_bus {
>>   	enum pci_bus_speed	max_bus_speed;
>>   };
>>   
>> +/* Content Code Description for PCI Function Error */
>> +struct zpci_ccdf_err {
>> +	u32 reserved1;
>> +	u32 fh;                         /* function handle */
>> +	u32 fid;                        /* function id */
>> +	u32 ett         :  4;           /* expected table type */
>> +	u32 mvn         : 12;           /* MSI vector number */
>> +	u32 dmaas       :  8;           /* DMA address space */
>> +	u32 reserved2   :  6;
>> +	u32 q           :  1;           /* event qualifier */
>> +	u32 rw          :  1;           /* read/write */
>> +	u64 faddr;                      /* failing address */
>> +	u32 reserved3;
>> +	u16 reserved4;
>> +	u16 pec;                        /* PCI event code */
>> +} __packed;
>> +
>> +#define ZPCI_ERR_PENDING_MAX 4
>> +struct zpci_ccdf_pending {
>> +	u8 count;
>> +	u8 head;
>> +	u8 tail;
>> +	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
>> +};
>> +
>>   /* Private data per function */
>>   struct zpci_dev {
>>   	struct zpci_bus *zbus;
>> @@ -193,6 +218,8 @@ struct zpci_dev {
>>   	struct iommu_domain *s390_domain; /* attached IOMMU domain */
>>   	struct kvm_zdev *kzdev;
>>   	struct mutex kzdev_lock;
>> +	struct zpci_ccdf_pending pending_errs;
>> +	struct mutex pending_errs_lock;
>>   	spinlock_t dom_lock;		/* protect s390_domain change */
>>   };
>>   
>> @@ -331,6 +358,7 @@ void zpci_debug_exit_device(struct zpci_dev *);
>>   int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
>>   int zpci_clear_error_state(struct zpci_dev *zdev);
>>   int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
>> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev);
>>   
>>   #ifdef CONFIG_NUMA
>>   
>> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
>> index 87077e510266..bc253cc52056 100644
>> --- a/arch/s390/pci/pci.c
>> +++ b/arch/s390/pci/pci.c
>> @@ -915,6 +915,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
>>   	mutex_init(&zdev->state_lock);
>>   	mutex_init(&zdev->fmb_lock);
>>   	mutex_init(&zdev->kzdev_lock);
>> +	mutex_init(&zdev->pending_errs_lock);
>>   
>>   	return zdev;
>>   
>> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
>> index de504925f709..5b24f3a9fe23 100644
>> --- a/arch/s390/pci/pci_event.c
>> +++ b/arch/s390/pci/pci_event.c
>> @@ -17,23 +17,6 @@
>>   #include "pci_bus.h"
>>   #include "pci_report.h"
>>   
>> -/* Content Code Description for PCI Function Error */
>> -struct zpci_ccdf_err {
>> -	u32 reserved1;
>> -	u32 fh;				/* function handle */
>> -	u32 fid;			/* function id */
>> -	u32 ett		:  4;		/* expected table type */
>> -	u32 mvn		: 12;		/* MSI vector number */
>> -	u32 dmaas	:  8;		/* DMA address space */
>> -	u32		:  6;
>> -	u32 q		:  1;		/* event qualifier */
>> -	u32 rw		:  1;		/* read/write */
>> -	u64 faddr;			/* failing address */
>> -	u32 reserved3;
>> -	u16 reserved4;
>> -	u16 pec;			/* PCI event code */
>> -} __packed;
>> -
>>   /* Content Code Description for PCI Function Availability */
>>   struct zpci_ccdf_avail {
>>   	u32 reserved1;
>> @@ -75,6 +58,41 @@ static bool is_driver_supported(struct pci_driver *driver)
>>   	return true;
>>   }
>>   
>> +static void zpci_store_pci_error(struct pci_dev *pdev,
>> +				 struct zpci_ccdf_err *ccdf)
>> +{
>> +	struct zpci_dev *zdev = to_zpci(pdev);
>> +	int i;
>> +
>> +	mutex_lock(&zdev->pending_errs_lock);
>> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
>> +		pr_err("%s: Maximum number (%d) of pending error events queued",
>> +		       pci_name(pdev), ZPCI_ERR_PENDING_MAX);
>> +		mutex_unlock(&zdev->pending_errs_lock);
>> +		return;
>> +	}
>> +
>> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
>> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
>> +	zdev->pending_errs.tail++;
>> +	zdev->pending_errs.count++;
>> +	mutex_unlock(&zdev->pending_errs_lock);
>> +}
>> +
>> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
>> +{
>> +	struct pci_dev *pdev = NULL;
>> +
>> +	guard(mutex)(&zdev->pending_errs_lock);
>> +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
>> +	if (zdev->pending_errs.count)
>> +		pr_info("%s: Unhandled PCI error events count=%d",
>> +			pci_name(pdev), zdev->pending_errs.count);
>> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
>> +	pci_dev_put(pdev);
>> +}
>> +EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
>> +
>>   static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
>>   							 struct pci_driver *driver)
>>   {
>> @@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
>>    * and the platform determines which functions are affected for
>>    * multi-function devices.
>>    */
>> -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>> +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
>> +							  struct zpci_ccdf_err *ccdf)
>>   {
>>   	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
>>   	struct zpci_dev *zdev = to_zpci(pdev);
>> @@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>   	}
>>   	pdev->error_state = pci_channel_io_frozen;
>>   
>> -	if (needs_mediated_recovery(pdev)) {
>> -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
>> -			pci_name(pdev));
>> -		status_str = "failed (pass-through)";
>> -		goto out_unlock;
>> -	}
>> -
>>   	driver = to_pci_driver(pdev->dev.driver);
>>   	if (!is_driver_supported(driver)) {
>>   		if (!driver) {
>> @@ -210,12 +222,23 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>   		goto out_unlock;
>>   	}
>>   
>> +	if (needs_mediated_recovery(pdev))
>> +		zpci_store_pci_error(pdev, ccdf);
>> +
>>   	ers_res = zpci_event_notify_error_detected(pdev, driver);
>>   	if (ers_result_indicates_abort(ers_res)) {
>>   		status_str = "failed (abort on detection)";
>>   		goto out_unlock;
>>   	}
>>   
>> +	if (needs_mediated_recovery(pdev)) {
>> +		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
>> +			pci_name(pdev));
>> +		ers_res = PCI_ERS_RESULT_RECOVERED;
>> +		status_str = "in progress";
>> +		goto out_unlock;
>> +	}
>> +
>>   	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
>>   		ers_res = zpci_event_do_error_state_clear(pdev, driver);
>>   		if (ers_result_indicates_abort(ers_res)) {
>> @@ -260,25 +283,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>    * @pdev: PCI function for which to report
>>    * @es: PCI channel failure state to report
>>    */
>> -static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
>> +static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
>> +				  struct zpci_ccdf_err *ccdf)
>>   {
>>   	struct pci_driver *driver;
>>   
>>   	pci_dev_lock(pdev);
>>   	pdev->error_state = es;
>> -	/**
>> -	 * While vfio-pci's error_detected callback notifies user-space QEMU
>> -	 * reacts to this by freezing the guest. In an s390 environment PCI
>> -	 * errors are rarely fatal so this is overkill. Instead in the future
>> -	 * we will inject the error event and let the guest recover the device
>> -	 * itself.
>> -	 */
>> +
>>   	if (needs_mediated_recovery(pdev))
>> -		goto out;
>> +		zpci_store_pci_error(pdev, ccdf);
>>   	driver = to_pci_driver(pdev->dev.driver);
>>   	if (driver && driver->err_handler && driver->err_handler->error_detected)
>>   		driver->err_handler->error_detected(pdev, pdev->error_state);
>> -out:
>> +
>>   	pci_dev_unlock(pdev);
>>   }
>>   
>> @@ -324,12 +342,12 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>>   		break;
>>   	case 0x0040: /* Service Action or Error Recovery Failed */
>>   	case 0x003b:
>> -		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
>> +		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>>   		break;
>>   	default: /* PCI function left in the error state attempt to recover */
>> -		ers_res = zpci_event_attempt_error_recovery(pdev);
>> +		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
>>   		if (ers_res != PCI_ERS_RESULT_RECOVERED)
>> -			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
>> +			zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>>   		break;
>>   	}
>>   	pci_dev_put(pdev);
>> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
>> index a7bc23ce8483..2be37eab9279 100644
>> --- a/drivers/vfio/pci/vfio_pci_zdev.c
>> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
>> @@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>>   
>>   	zdev->mediated_recovery = false;
>>   
>> +	zpci_cleanup_pending_errors(zdev);
>> +
>>   	if (!vdev->vdev.kvm)
>>   		return;
>>   
> It begins to look here like the mediated_recovery should be protected
> by pending_errs_lock and perhaps there should be
> zpci_{start,stop}_mediated_recovery() where we set and clear the flag
> under mutex, while also clearing pending errors in the latter case.
> The various needs_mediated_recovery tests could be pulled in to test
> under mutex as well.  Thanks,
>
> Alex

Thanks Alex for taking a look at the patches. I agree having the 
mediated_recovery flag being protected by the mutex will be a better 
approach. Will add the interfaces you suggested.

Thanks

Farhan



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 7/9] vfio-pci/zdev: Add a device feature for error information
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (5 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-25 17:18   ` Alex Williamson
  2026-03-16 19:15 ` [PATCH v11 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato

For zPCI devices, we have platform specific error information. The platform
firmware provides this error information to the operating system in an
architecture specific mechanism. To enable recovery from userspace for
these devices, we want to expose this error information to userspace. Add a
new device feature to expose this information.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c |  2 ++
 drivers/vfio/pci/vfio_pci_priv.h |  9 ++++++++
 drivers/vfio/pci/vfio_pci_zdev.c | 36 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 17 +++++++++++++++
 4 files changed, 64 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d43745fe4c84..bbdb625e35ef 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1534,6 +1534,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_DMA_BUF:
 		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
+		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 27ac280f00b9..eed69926d8a1 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -89,6 +89,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				struct vfio_info_cap *caps);
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
 void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 					      struct vfio_info_cap *caps)
@@ -103,6 +105,13 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 
 static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {}
+
+static inline int vfio_pci_zdev_feature_err(struct vfio_device *device,
+					    u32 flags, void __user *arg,
+					    size_t argsz)
+{
+	return -ENODEV;
+}
 #endif
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 2be37eab9279..d2748dd67c55 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -141,6 +141,42 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz)
+{
+	struct vfio_device_feature_zpci_err err;
+	struct vfio_pci_core_device *vdev;
+	struct zpci_dev *zdev;
+	int head = 0;
+	int ret;
+
+	vdev = container_of(device, struct vfio_pci_core_device, vdev);
+	zdev = to_zpci(vdev->pdev);
+	if (!zdev)
+		return -ENODEV;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(err));
+	if (ret != 1)
+		return ret;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.count) {
+		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
+		err.pec = zdev->pending_errs.err[head].pec;
+		zdev->pending_errs.head++;
+		zdev->pending_errs.count--;
+		err.pending_errors = zdev->pending_errs.count;
+	}
+	mutex_unlock(&zdev->pending_errs_lock);
+
+	err.version = 1;
+	if (copy_to_user(arg, &err, sizeof(err)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index bb7b89330d35..21b1473e4779 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1510,6 +1510,23 @@ struct vfio_device_feature_dma_buf {
 	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
 };
 
+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
+ * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
+ * platform firmware and notification to operating system is done by
+ * architecture specific mechanism.  Exposing this information to userspace
+ * allows userspace to take appropriate actions to handle an error on the
+ * device.
+ */
+
+struct vfio_device_feature_zpci_err {
+	__u8 version;
+	__u8 pending_errors;
+	__u16 pec;
+};
+
+#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 12
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 7/9] vfio-pci/zdev: Add a device feature for error information
  2026-03-16 19:15 ` [PATCH v11 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2026-03-25 17:18   ` Alex Williamson
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Williamson @ 2026-03-25 17:18 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, kbusch, clg,
	stable, schnelle, mjrosato, alex

On Mon, 16 Mar 2026 12:15:42 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> For zPCI devices, we have platform specific error information. The platform
> firmware provides this error information to the operating system in an
> architecture specific mechanism. To enable recovery from userspace for
> these devices, we want to expose this error information to userspace. Add a
> new device feature to expose this information.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c |  2 ++
>  drivers/vfio/pci/vfio_pci_priv.h |  9 ++++++++
>  drivers/vfio/pci/vfio_pci_zdev.c | 36 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h        | 17 +++++++++++++++
>  4 files changed, 64 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d43745fe4c84..bbdb625e35ef 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1534,6 +1534,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_DMA_BUF:
>  		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
> +		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 27ac280f00b9..eed69926d8a1 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -89,6 +89,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  				struct vfio_info_cap *caps);
>  int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
>  void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +			      void __user *arg, size_t argsz);
>  #else
>  static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  					      struct vfio_info_cap *caps)
> @@ -103,6 +105,13 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>  
>  static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>  {}
> +
> +static inline int vfio_pci_zdev_feature_err(struct vfio_device *device,
> +					    u32 flags, void __user *arg,
> +					    size_t argsz)
> +{
> +	return -ENODEV;

-ENOTTY

> +}
>  #endif
>  
>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index 2be37eab9279..d2748dd67c55 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -141,6 +141,42 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  	return ret;
>  }
>  
> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +			      void __user *arg, size_t argsz)
> +{
> +	struct vfio_device_feature_zpci_err err;
> +	struct vfio_pci_core_device *vdev;
> +	struct zpci_dev *zdev;
> +	int head = 0;
> +	int ret;
> +
> +	vdev = container_of(device, struct vfio_pci_core_device, vdev);
> +	zdev = to_zpci(vdev->pdev);
> +	if (!zdev)
> +		return -ENODEV;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(err));
> +	if (ret != 1)
> +		return ret;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.count) {
> +		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
> +		err.pec = zdev->pending_errs.err[head].pec;
> +		zdev->pending_errs.head++;
> +		zdev->pending_errs.count--;
> +		err.pending_errors = zdev->pending_errs.count;
> +	}
> +	mutex_unlock(&zdev->pending_errs_lock);
> +
> +	err.version = 1;

Returns uninitialized kernel data for case where there are no pending
errors, initialize err with = {};

> +	if (copy_to_user(arg, &err, sizeof(err)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index bb7b89330d35..21b1473e4779 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1510,6 +1510,23 @@ struct vfio_device_feature_dma_buf {
>  	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
>  };
>  
> +/**
> + * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
> + * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
> + * platform firmware and notification to operating system is done by
> + * architecture specific mechanism.  Exposing this information to userspace
> + * allows userspace to take appropriate actions to handle an error on the
> + * device.

This should include some explicit discussion of how pending_errors in
interpreted, ie. pending _additional_ errors, userspace should read
until zero.  Thanks,

Alex

> + */
> +
> +struct vfio_device_feature_zpci_err {
> +	__u8 version;
> +	__u8 pending_errors;
> +	__u16 pec;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 12
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 8/9] vfio: Add a reset_done callback for vfio-pci driver
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (6 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-25 17:30   ` Alex Williamson
  2026-03-16 19:15 ` [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  2026-03-24 19:34 ` [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato, Julian Ruess

On error recovery for a PCI device bound to vfio-pci driver, we want to
recover the state of the device to its last known saved state. The callback
restores the state of the device to its initial saved state.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index bbdb625e35ef..f1bd1266b88f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2257,6 +2257,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
+static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+	if (!vdev->pci_saved_state)
+		return;
+
+	pci_load_saved_state(pdev, vdev->pci_saved_state);
+	pci_restore_state(pdev);
+}
+
 int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 				  int nr_virtfn)
 {
@@ -2321,6 +2332,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 const struct pci_error_handlers vfio_pci_core_err_handlers = {
 	.error_detected = vfio_pci_core_aer_err_detected,
+	.reset_done = vfio_pci_core_aer_reset_done,
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 8/9] vfio: Add a reset_done callback for vfio-pci driver
  2026-03-16 19:15 ` [PATCH v11 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2026-03-25 17:30   ` Alex Williamson
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Williamson @ 2026-03-25 17:30 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, helgaas, lukas, kbusch, clg,
	stable, schnelle, mjrosato, Julian Ruess, alex

On Mon, 16 Mar 2026 12:15:43 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> On error recovery for a PCI device bound to vfio-pci driver, we want to
> recover the state of the device to its last known saved state. The callback
> restores the state of the device to its initial saved state.
> 
> Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index bbdb625e35ef..f1bd1266b88f 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -2257,6 +2257,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
>  
> +static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
> +
> +	if (!vdev->pci_saved_state)
> +		return;
> +
> +	pci_load_saved_state(pdev, vdev->pci_saved_state);
> +	pci_restore_state(pdev);
> +}
> +
>  int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>  				  int nr_virtfn)
>  {
> @@ -2321,6 +2332,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
>  
>  const struct pci_error_handlers vfio_pci_core_err_handlers = {
>  	.error_detected = vfio_pci_core_aer_err_detected,
> +	.reset_done = vfio_pci_core_aer_reset_done,
>  };
>  EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
>  

It gives me some anxiety to restore the open state of the device here,
but I can't identify any specific issues.  We'll try it and see how it
goes.

Acked-by: Alex Williamson <alex@shazbot.org>

Thanks,
Alex

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (7 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2026-03-16 19:15 ` Farhan Ali
  2026-03-24 21:26   ` Bjorn Helgaas
  2026-03-24 19:34 ` [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
  9 siblings, 1 reply; 33+ messages in thread
From: Farhan Ali @ 2026-03-16 19:15 UTC (permalink / raw)
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, kbusch, clg, stable, alifm, schnelle,
	mjrosato, Julian Ruess

We are configuring the error signaling on the vast majority of devices and
it's extremely rare that it fires anyway. This allows userspace to be
notified on errors for legacy PCI devices. The Internal Shared Memory (ISM)
device on s390 is one such device. For PCI devices on IBM s390 error
recovery involves platform firmware and notification to operating system
is done by architecture specific way. So the ISM device can still be
recovered when notified of an error.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 8 ++------
 drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f1bd1266b88f..cfd9a51cd194 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -786,8 +786,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
 		}
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
+		return 1;
 	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
 		return 1;
 	}
@@ -1163,11 +1162,8 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
 	switch (info.index) {
 	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
 	case VFIO_PCI_REQ_IRQ_INDEX:
-		break;
 	case VFIO_PCI_ERR_IRQ_INDEX:
-		if (pci_is_pcie(vdev->pdev))
-			break;
-		fallthrough;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 33944d4d9dc4..64f80f64ff57 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -859,8 +859,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 	case VFIO_PCI_ERR_IRQ_INDEX:
 		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
 		case VFIO_IRQ_SET_ACTION_TRIGGER:
-			if (pci_is_pcie(vdev->pdev))
-				func = vfio_pci_set_err_trigger;
+			func = vfio_pci_set_err_trigger;
 			break;
 		}
 		break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2026-03-16 19:15 ` [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
@ 2026-03-24 21:26   ` Bjorn Helgaas
  2026-03-24 22:30     ` Farhan Ali
  2026-03-25 17:50     ` Alex Williamson
  0 siblings, 2 replies; 33+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 21:26 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato, Julian Ruess

On Mon, Mar 16, 2026 at 12:15:44PM -0700, Farhan Ali wrote:
> We are configuring the error signaling on the vast majority of devices and

Who is "we"?  If a function configures error signaling, can you
mention the name?

> it's extremely rare that it fires anyway. This allows userspace to
> be notified on errors for legacy PCI devices. The Internal Shared
> Memory (ISM) device on s390 is one such device. 

This commit log talks about things that could be done, but doesn't
actually say what the patch does or what makes it safe and effective,
and I'm not VFIO-literate enough for it to be clear.

These pci_is_pcie() tests were added by dad9f8972e04 ("VFIO-AER:
Vfio-pci driver changes for supporting AER"), so I suppose the
dad9f8972e04 assumption was that AER was the only error reporting
mechanism, and AER only exists on PCIe devices?

But s390 can report errors for conventional PCI devices, and you want
VFIO to support that as well?

Obviously this change needs to be safe for all arches, not just s390.
I suppose it's safe to report the VFIO_PCI_ERR_IRQ_INDEX info
everywhere; it's just that it will never be used except on s390?  And
I guess powerpc, which can get to vfio_pci_core_aer_err_detected() via
eeh_report_failure().

It looks like vfio_pci_driver provides vfio_pci_core_err_handlers
whether the device is conventional PCI or PCIe, and s390 can already
call vfio_pci_core_aer_err_detected() (the .error_detected() hook) via
zpci_event_notify_error_detected(), so this patch makes it possible
for the guest (QEMU, etc) to learn about it?

> For PCI devices on IBM s390 error
> recovery involves platform firmware and notification to operating system
> is done by architecture specific way. So the ISM device can still be
> recovered when notified of an error.

I guess this error recovery part would be done by the guest ISM
driver, triggered when when something like QEMU receives the eventfd
signal from vfio_pci_core_aer_err_detected()?

> Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>

I don't maintain VFIO, so I'm just kibbitzing here.  Hopefully Alex
will chime in.

> ---
>  drivers/vfio/pci/vfio_pci_core.c  | 8 ++------
>  drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
>  2 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f1bd1266b88f..cfd9a51cd194 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -786,8 +786,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
>  		}
>  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> -		if (pci_is_pcie(vdev->pdev))
> -			return 1;
> +		return 1;
>  	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
>  		return 1;
>  	}
> @@ -1163,11 +1162,8 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
>  	switch (info.index) {
>  	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>  	case VFIO_PCI_REQ_IRQ_INDEX:
> -		break;
>  	case VFIO_PCI_ERR_IRQ_INDEX:
> -		if (pci_is_pcie(vdev->pdev))
> -			break;
> -		fallthrough;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 33944d4d9dc4..64f80f64ff57 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -859,8 +859,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  	case VFIO_PCI_ERR_IRQ_INDEX:
>  		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
>  		case VFIO_IRQ_SET_ACTION_TRIGGER:
> -			if (pci_is_pcie(vdev->pdev))
> -				func = vfio_pci_set_err_trigger;
> +			func = vfio_pci_set_err_trigger;
>  			break;
>  		}
>  		break;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2026-03-24 21:26   ` Bjorn Helgaas
@ 2026-03-24 22:30     ` Farhan Ali
  2026-03-25 17:50     ` Alex Williamson
  1 sibling, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-24 22:30 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-s390, linux-kernel, linux-pci, lukas, alex, kbusch, clg,
	stable, schnelle, mjrosato, Julian Ruess

Hi Bjorn,

On 3/24/2026 2:26 PM, Bjorn Helgaas wrote:
> On Mon, Mar 16, 2026 at 12:15:44PM -0700, Farhan Ali wrote:
>> We are configuring the error signaling on the vast majority of devices and
> Who is "we"?  If a function configures error signaling, can you
> mention the name?

Thanks for taking a look. By 'We' it would refer to userspace/QEMU that 
set up error notification for a vfio device. I took the message verbatim 
from Alex's suggestion[1].

>> it's extremely rare that it fires anyway. This allows userspace to
>> be notified on errors for legacy PCI devices. The Internal Shared
>> Memory (ISM) device on s390 is one such device.
> This commit log talks about things that could be done, but doesn't
> actually say what the patch does or what makes it safe and effective,
> and I'm not VFIO-literate enough for it to be clear.
>
> These pci_is_pcie() tests were added by dad9f8972e04 ("VFIO-AER:
> Vfio-pci driver changes for supporting AER"), so I suppose the
> dad9f8972e04 assumption was that AER was the only error reporting
> mechanism, and AER only exists on PCIe devices?

We don't have the context anymore why PCIe device check was added. Alex 
had suggested to remove the check entirely[1].

> But s390 can report errors for conventional PCI devices, and you want
> VFIO to support that as well?
>
> Obviously this change needs to be safe for all arches, not just s390.
> I suppose it's safe to report the VFIO_PCI_ERR_IRQ_INDEX info
> everywhere; it's just that it will never be used except on s390?  And
> I guess powerpc, which can get to vfio_pci_core_aer_err_detected() via
> eeh_report_failure().
>
> It looks like vfio_pci_driver provides vfio_pci_core_err_handlers
> whether the device is conventional PCI or PCIe, and s390 can already
> call vfio_pci_core_aer_err_detected() (the .error_detected() hook) via
> zpci_event_notify_error_detected(), so this patch makes it possible
> for the guest (QEMU, etc) to learn about it?

Yes, with this patch want to notify the userspace/QEMU about 
conventional PCI devices. I think it should be safe to report 
VFIO_PCI_ERR_IRQ_INDEX even for conventional PCI devices for other 
architectures. AFAIK other architectures would use the AER error 
handling mechanism to report the error and call 
vfio_pci_core_aer_err_detected().

>> For PCI devices on IBM s390 error
>> recovery involves platform firmware and notification to operating system
>> is done by architecture specific way. So the ISM device can still be
>> recovered when notified of an error.
> I guess this error recovery part would be done by the guest ISM
> driver, triggered when when something like QEMU receives the eventfd
> signal from vfio_pci_core_aer_err_detected()?

Yes, we wanted to be able to signal QEMU on an error, so that QEMU can 
appropriately notify the guest about the error. Today the ISM driver 
doesn't register error recovery callbacks. However we can still recover 
from the guest by writing to 'reset' sysfs file.

[1] 
https://lore.kernel.org/all/20250815144855.51f2ac24.alex.williamson@redhat.com/

>
>> Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> I don't maintain VFIO, so I'm just kibbitzing here.  Hopefully Alex
> will chime in.
>
>> ---
>>   drivers/vfio/pci/vfio_pci_core.c  | 8 ++------
>>   drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
>>   2 files changed, 3 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index f1bd1266b88f..cfd9a51cd194 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -786,8 +786,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>>   			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
>>   		}
>>   	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>> -		if (pci_is_pcie(vdev->pdev))
>> -			return 1;
>> +		return 1;
>>   	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
>>   		return 1;
>>   	}
>> @@ -1163,11 +1162,8 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
>>   	switch (info.index) {
>>   	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>>   	case VFIO_PCI_REQ_IRQ_INDEX:
>> -		break;
>>   	case VFIO_PCI_ERR_IRQ_INDEX:
>> -		if (pci_is_pcie(vdev->pdev))
>> -			break;
>> -		fallthrough;
>> +		break;
>>   	default:
>>   		return -EINVAL;
>>   	}
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 33944d4d9dc4..64f80f64ff57 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -859,8 +859,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>>   	case VFIO_PCI_ERR_IRQ_INDEX:
>>   		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
>>   		case VFIO_IRQ_SET_ACTION_TRIGGER:
>> -			if (pci_is_pcie(vdev->pdev))
>> -				func = vfio_pci_set_err_trigger;
>> +			func = vfio_pci_set_err_trigger;
>>   			break;
>>   		}
>>   		break;
>> -- 
>> 2.43.0
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2026-03-24 21:26   ` Bjorn Helgaas
  2026-03-24 22:30     ` Farhan Ali
@ 2026-03-25 17:50     ` Alex Williamson
  1 sibling, 0 replies; 33+ messages in thread
From: Alex Williamson @ 2026-03-25 17:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Farhan Ali, linux-s390, linux-kernel, linux-pci, lukas, kbusch,
	clg, stable, schnelle, mjrosato, Julian Ruess, alex

On Tue, 24 Mar 2026 16:26:02 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Mar 16, 2026 at 12:15:44PM -0700, Farhan Ali wrote:
> > We are configuring the error signaling on the vast majority of devices and  
> 
> Who is "we"?  If a function configures error signaling, can you
> mention the name?
> 
> > it's extremely rare that it fires anyway. This allows userspace to
> > be notified on errors for legacy PCI devices. The Internal Shared
> > Memory (ISM) device on s390 is one such device.   
> 
> This commit log talks about things that could be done, but doesn't
> actually say what the patch does or what makes it safe and effective,
> and I'm not VFIO-literate enough for it to be clear.
> 
> These pci_is_pcie() tests were added by dad9f8972e04 ("VFIO-AER:
> Vfio-pci driver changes for supporting AER"), so I suppose the
> dad9f8972e04 assumption was that AER was the only error reporting
> mechanism, and AER only exists on PCIe devices?

Yes, that's the conclusion we came to in previous discussions that
Farhan notes in their reply.

> But s390 can report errors for conventional PCI devices, and you want
> VFIO to support that as well?
> 
> Obviously this change needs to be safe for all arches, not just s390.
> I suppose it's safe to report the VFIO_PCI_ERR_IRQ_INDEX info
> everywhere; it's just that it will never be used except on s390?  And
> I guess powerpc, which can get to vfio_pci_core_aer_err_detected() via
> eeh_report_failure().
> 
> It looks like vfio_pci_driver provides vfio_pci_core_err_handlers
> whether the device is conventional PCI or PCIe, and s390 can already
> call vfio_pci_core_aer_err_detected() (the .error_detected() hook) via
> zpci_event_notify_error_detected(), so this patch makes it possible
> for the guest (QEMU, etc) to learn about it?
> 
> > For PCI devices on IBM s390 error
> > recovery involves platform firmware and notification to operating system
> > is done by architecture specific way. So the ISM device can still be
> > recovered when notified of an error.  
> 
> I guess this error recovery part would be done by the guest ISM
> driver, triggered when when something like QEMU receives the eventfd
> signal from vfio_pci_core_aer_err_detected()?
> 
> > Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
> > Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>  
> 
> I don't maintain VFIO, so I'm just kibbitzing here.  Hopefully Alex
> will chime in.

It's the previous patch about restoring open state of the device on
.reset_done that gives me more anxiety than just reporting that
non-PCIe (non-AER) devices can report errors.  At worst here, I think
userspace might be wiring an interrupt on a conventional device that
cannot fire, whereas most PCIe device have AER.  The number of
conventional device in use with vfio-pci is probably not enough to
worry about though.

For completeness, I'll note that QEMU sets a "pci_aer" flag based on
whether this error IRQ is exposed, where I think we had intended this
might interact with emulated AER.  However, it never made it that far
and just registers a handler that stops the VM.

Farhan, this and the previous patch should use "vfio/pci:" as their
title prefix.  Thanks,

Alex

> > ---
> >  drivers/vfio/pci/vfio_pci_core.c  | 8 ++------
> >  drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
> >  2 files changed, 3 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index f1bd1266b88f..cfd9a51cd194 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -786,8 +786,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
> >  			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
> >  		}
> >  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> > -		if (pci_is_pcie(vdev->pdev))
> > -			return 1;
> > +		return 1;
> >  	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> >  		return 1;
> >  	}
> > @@ -1163,11 +1162,8 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
> >  	switch (info.index) {
> >  	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
> >  	case VFIO_PCI_REQ_IRQ_INDEX:
> > -		break;
> >  	case VFIO_PCI_ERR_IRQ_INDEX:
> > -		if (pci_is_pcie(vdev->pdev))
> > -			break;
> > -		fallthrough;
> > +		break;
> >  	default:
> >  		return -EINVAL;
> >  	}
> > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> > index 33944d4d9dc4..64f80f64ff57 100644
> > --- a/drivers/vfio/pci/vfio_pci_intrs.c
> > +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> > @@ -859,8 +859,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
> >  	case VFIO_PCI_ERR_IRQ_INDEX:
> >  		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
> >  		case VFIO_IRQ_SET_ACTION_TRIGGER:
> > -			if (pci_is_pcie(vdev->pdev))
> > -				func = vfio_pci_set_err_trigger;
> > +			func = vfio_pci_set_err_trigger;
> >  			break;
> >  		}
> >  		break;
> > -- 
> > 2.43.0
> >   


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x
  2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (8 preceding siblings ...)
  2026-03-16 19:15 ` [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
@ 2026-03-24 19:34 ` Farhan Ali
  9 siblings, 0 replies; 33+ messages in thread
From: Farhan Ali @ 2026-03-24 19:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: lukas, alex, kbusch, clg, stable, schnelle, mjrosato, linux-s390,
	LKML, Linux PCI

Hi Bjorn,

Polite ping for the PCI patches in the series. I would appreciate any 
feedback for these patches :) .

Thanks

Farhan

On 3/16/2026 12:15 PM, Farhan Ali wrote:
> Hi,
>
> This Linux kernel patch series introduces support for error recovery for
> passthrough PCI devices on System Z (s390x).
>
> Background
> ----------
> For PCI devices on s390x an operating system receives platform specific
> error events from firmware rather than through AER.Today for
> passthrough/userspace devices, we don't attempt any error recovery and
> ignore any error events for the devices. The passthrough/userspace devices
> are managed by the vfio-pci driver. The driver does register error handling
> callbacks (error_detected), and on an error trigger an eventfd to
> userspace.  But we need a mechanism to notify userspace
> (QEMU/guest/userspace drivers) about the error event.
>
> Proposal
> --------
> We can expose this error information (currently only the PCI Error Code)
> via a device feature. Userspace can then obtain the error information
> via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving
> a device reset.
>
> This is how a typical flow for passthrough devices to a VM would work:
> For passthrough devices to a VM, the driver bound to the device on the host
> is vfio-pci. vfio-pci driver does support the error_detected() callback
> (vfio_pci_core_aer_err_detected()), and on an PCI error s390x recovery
> code on the host will call the vfio-pci error_detected() callback. The
> vfio-pci error_detected() callback will notify userspace/QEMU via an
> eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x
> error recovery on the host will skip any further action(see patch 6) and
> let userspace drive the error recovery.
>
> Once userspace/QEMU is notified, it then injects this error into the VM
> so device drivers in the VM can take recovery actions. For example for a
> passthrough NVMe device, the VM's OS NVMe driver will access the device.
> At this point the VM's NVMe driver's error_detected() will drive the
> recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error
> recovery in the VM's OS will try to do a reset. Resets are privileged
> operations and so the VM will need intervention from QEMU to perform the
> reset. QEMU will invoke the VFIO_DEVICE_RESET ioctl to now notify the
> host that the VM is requesting a reset of the device. The vfio-pci driver
> on the host will then perform the reset on the device to recover it.
>
>
> Thanks
> Farhan
>
> ChangeLog
> ---------
> v10 series https://lore.kernel.org/all/20260302203325.3826-1-alifm@linux.ibm.com/
> v10 -> v11
>      - Rebase on pci/next to handle merge conflicts with patch 1.
>
>      - Typo fixup in commit message (patch 4) and use guard() for mutex
>      (patch 6).
>
> v9 series https://lore.kernel.org/all/20260217182257.1582-1-alifm@linux.ibm.com/
> v9 -> v10
>     - Change pci_slot number to u16 (patch 1).
>
>     - Avoid saving invalid config space state if config space is
>     inaccessible in the device reset path. It uses the same patch as in v8
>     with R-b from Niklas.
>
>     - Rebase on 7.0.0-rc2
>
>
> v8 series https://lore.kernel.org/all/20260122194437.1903-1-alifm@linux.ibm.com/
> v8 -> v9
>     - Avoid saving PCI config space state in reset path (patch 3) (suggested by Bjorn)
>     
>     - Add explicit version to struct vfio_device_feature_zpci_err (patch 7).
>
>     - Rebase on 6.19
>
>
> v7 series https://lore.kernel.org/all/20260107183217.1365-1-alifm@linux.ibm.com/
> v7 -> v8
>     - Rebase on 6.19-rc4
>
>     - Address feedback from Niklas and Julien.
>
>
> v6 series https://lore.kernel.org/all/2c609e61-1861-4bf3-b019-a11c137d26a5@linux.ibm.com/
> v6 -> v7
>      - Rebase on 6.19-rc4
>
>      - Update commit message based on Niklas's suggestion (patch 3).
>
> v5 series https://lore.kernel.org/all/20251113183502.2388-1-alifm@linux.ibm.com/
> v5 -> v6
>     - Rebase on 6.18 + Lukas's PCI: Universal error recoverability of
>     devices series (https://lore.kernel.org/all/cover.1763483367.git.lukas@wunner.de/)
>
>     - Re-work config space accessibility check to pci_dev_save_and_disable() (patch 3).
>     This avoids saving the config space, in the reset path, if the device's config space is
>     corrupted or inaccessible.
>
> v4 series https://lore.kernel.org/all/20250924171628.826-1-alifm@linux.ibm.com/
> v4 -> v5
>      - Rebase on 6.18-rc5
>
>      - Move bug fixes to the beginning of the series (patch 1 and 2). These patches
>      were posted as a separate fixes series
> https://lore.kernel.org/all/a14936ac-47d6-461b-816f-0fd66f869b0f@linux.ibm.com/
>
>      - Add matching pci_put_dev() for pci_get_slot() (patch 6).
>
> v3 series https://lore.kernel.org/all/20250911183307.1910-1-alifm@linux.ibm.com/
> v3 -> v4
>      - Remove warn messages for each PCI capability not restored (patch 1)
>
>      - Check PCI_COMMAND and PCI_STATUS register for error value instead of device id
>      (patch 1)
>
>      - Fix kernel crash in patch 3
>
>      - Added reviewed by tags
>
>      - Address comments from Niklas's (patches 4, 5, 7)
>
>      - Fix compilation error non s390x system (patch 8)
>
>      - Explicitly align struct vfio_device_feature_zpci_err (patch 8)
>
>
> v2 series https://lore.kernel.org/all/20250825171226.1602-1-alifm@linux.ibm.com/
> v2 -> v3
>     - Patch 1 avoids saving any config space state if the device is in error
>     (suggested by Alex)
>
>     - Patch 2 adds additional check only for FLR reset to try other function
>       reset method (suggested by Alex).
>
>     - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
>       functions. Creates a new flag pci_slot to allow per function slot.
>
>     - Patch 4 fixes a bug in s390 for resource to bus address translation.
>
>     - Rebase on 6.17-rc5
>
>
> v1 series https://lore.kernel.org/all/20250813170821.1115-1-alifm@linux.ibm.com/
> v1 - > v2
>     - Patches 1 and 2 adds some additional checks for FLR/PM reset to
>       try other function reset method (suggested by Alex).
>
>     - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
>       functions.
>
>     - Patch 7 adds a new device feature for zPCI devices for the VFIO_DEVICE_FEATURE
>       ioctl. The ioctl is used by userspace to retriece any PCI error
>       information for the device (suggested by Alex).
>
>     - Patch 8 adds a reset_done() callback for the vfio-pci driver, to
>       restore the state of the device after a reset.
>
>     - Patch 9 removes the pcie check for triggering VFIO_PCI_ERR_IRQ_INDEX.
>
> Farhan Ali (9):
>    PCI: Allow per function PCI slots
>    s390/pci: Add architecture specific resource/bus address translation
>    PCI: Avoid saving config space state if inaccessible
>    PCI: Add additional checks for flr reset
>    s390/pci: Update the logic for detecting passthrough device
>    s390/pci: Store PCI error information for passthrough devices
>    vfio-pci/zdev: Add a device feature for error information
>    vfio: Add a reset_done callback for vfio-pci driver
>    vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
>
>   arch/s390/include/asm/pci.h       |  29 ++++++++
>   arch/s390/pci/pci.c               |  75 +++++++++++++++++++++
>   arch/s390/pci/pci_event.c         | 106 +++++++++++++++++-------------
>   drivers/pci/host-bridge.c         |   8 +--
>   drivers/pci/pci.c                 |  26 +++++++-
>   drivers/pci/slot.c                |  31 +++++++--
>   drivers/vfio/pci/vfio_pci_core.c  |  22 +++++--
>   drivers/vfio/pci/vfio_pci_intrs.c |   3 +-
>   drivers/vfio/pci/vfio_pci_priv.h  |   9 +++
>   drivers/vfio/pci/vfio_pci_zdev.c  |  47 ++++++++++++-
>   include/linux/pci.h               |   5 +-
>   include/uapi/linux/vfio.h         |  17 +++++
>   12 files changed, 307 insertions(+), 71 deletions(-)
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2026-03-25 18:40 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-16 19:15 [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali
2026-03-16 19:15 ` [PATCH v11 1/9] PCI: Allow per function PCI slots Farhan Ali
2026-03-24 21:55   ` Bjorn Helgaas
2026-03-24 23:08     ` Farhan Ali
2026-03-24 23:20       ` Bjorn Helgaas
2026-03-16 19:15 ` [PATCH v11 2/9] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
2026-03-24 23:06   ` Bjorn Helgaas
2026-03-24 23:47     ` Farhan Ali
2026-03-25 11:58     ` Ilpo Järvinen
2026-03-25 17:44       ` Farhan Ali
2026-03-16 19:15 ` [PATCH v11 3/9] PCI: Avoid saving config space state if inaccessible Farhan Ali
2026-03-24 21:40   ` Bjorn Helgaas
2026-03-24 22:38     ` Farhan Ali
2026-03-24 22:52       ` Bjorn Helgaas
2026-03-16 19:15 ` [PATCH v11 4/9] PCI: Add additional checks for flr reset Farhan Ali
2026-03-24 22:49   ` Bjorn Helgaas
2026-03-24 23:22     ` Farhan Ali
2026-03-25 16:25     ` Alex Williamson
2026-03-25 18:40       ` Farhan Ali
2026-03-16 19:15 ` [PATCH v11 5/9] s390/pci: Update the logic for detecting passthrough device Farhan Ali
2026-03-25 16:46   ` Alex Williamson
2026-03-16 19:15 ` [PATCH v11 6/9] s390/pci: Store PCI error information for passthrough devices Farhan Ali
2026-03-25 17:01   ` Alex Williamson
2026-03-25 18:06     ` Farhan Ali
2026-03-16 19:15 ` [PATCH v11 7/9] vfio-pci/zdev: Add a device feature for error information Farhan Ali
2026-03-25 17:18   ` Alex Williamson
2026-03-16 19:15 ` [PATCH v11 8/9] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
2026-03-25 17:30   ` Alex Williamson
2026-03-16 19:15 ` [PATCH v11 9/9] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
2026-03-24 21:26   ` Bjorn Helgaas
2026-03-24 22:30     ` Farhan Ali
2026-03-25 17:50     ` Alex Williamson
2026-03-24 19:34 ` [PATCH v11 0/9] Error recovery for vfio-pci devices on s390x Farhan Ali

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox