public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x
@ 2025-09-11 18:32 Farhan Ali
  2025-09-11 18:32 ` [PATCH v3 01/10] PCI: Avoid saving error values for config space Farhan Ali
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:32 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

Hi,

This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x). 

Background
----------
For PCI devices on s390x an operating system receives platform specific
error events from firmware rather than through AER.Today for
passthrough/userspace devices, we don't attempt any error recovery and
ignore any error events for the devices. The passthrough/userspace devices
are managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected), and on an error trigger an eventfd to
userspace.  But we need a mechanism to notify userspace
(QEMU/guest/userspace drivers) about the error event. 

Proposal
--------
We can expose this error information (currently only the PCI Error Code)
via a device feature. Userspace can then obtain the error information 
via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving 
a device reset.

I would appreciate some feedback on this series.

Thanks
Farhan

ChangeLog
---------
v2 series https://lore.kernel.org/all/20250825171226.1602-1-alifm@linux.ibm.com/
v2 -> v3
   - Patch 1 avoids saving any config space state if the device is in error
   (suggested by Alex)

   - Patch 2 adds additional check only for FLR reset to try other function 
     reset method (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions. Creates a new flag pci_slot to allow per function slot.

   - Patch 4 fixes a bug in s390 for resource to bus address translation.

   - Rebase on 6.17-rc5


v1 series https://lore.kernel.org/all/20250813170821.1115-1-alifm@linux.ibm.com/
v1 - > v2
   - Patches 1 and 2 adds some additional checks for FLR/PM reset to 
     try other function reset method (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions.

   - Patch 7 adds a new device feature for zPCI devices for the VFIO_DEVICE_FEATURE 
     ioctl. The ioctl is used by userspace to retriece any PCI error
     information for the device (suggested by Alex).

   - Patch 8 adds a reset_done() callback for the vfio-pci driver, to
     restore the state of the device after a reset.

   - Patch 9 removes the pcie check for triggering VFIO_PCI_ERR_IRQ_INDEX.

Farhan Ali (10):
  PCI: Avoid saving error values for config space
  PCI: Add additional checks for flr reset
  PCI: Allow per function PCI slots
  s390/pci: Add architecture specific resource/bus address translation
  s390/pci: Restore IRQ unconditionally for the zPCI device
  s390/pci: Update the logic for detecting passthrough device
  s390/pci: Store PCI error information for passthrough devices
  vfio-pci/zdev: Add a device feature for error information
  vfio: Add a reset_done callback for vfio-pci driver
  vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX

 arch/s390/include/asm/pci.h        |  30 +++++++-
 arch/s390/pci/pci.c                |  74 ++++++++++++++++++++
 arch/s390/pci/pci_event.c          | 107 ++++++++++++++++-------------
 arch/s390/pci/pci_irq.c            |   9 +--
 drivers/pci/host-bridge.c          |   4 +-
 drivers/pci/hotplug/s390_pci_hpc.c |  10 ++-
 drivers/pci/pci.c                  |  40 +++++++++--
 drivers/pci/pcie/aer.c             |   5 ++
 drivers/pci/pcie/dpc.c             |   5 ++
 drivers/pci/pcie/ptm.c             |   5 ++
 drivers/pci/slot.c                 |  14 +++-
 drivers/pci/tph.c                  |   5 ++
 drivers/pci/vc.c                   |   5 ++
 drivers/vfio/pci/vfio_pci_core.c   |  20 ++++--
 drivers/vfio/pci/vfio_pci_intrs.c  |   3 +-
 drivers/vfio/pci/vfio_pci_priv.h   |   8 +++
 drivers/vfio/pci/vfio_pci_zdev.c   |  45 +++++++++++-
 include/linux/pci.h                |   1 +
 include/uapi/linux/vfio.h          |  14 ++++
 19 files changed, 330 insertions(+), 74 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 01/10] PCI: Avoid saving error values for config space
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
@ 2025-09-11 18:32 ` Farhan Ali
  2025-09-13  8:27   ` Alex Williamson
  2025-09-16 18:09   ` Bjorn Helgaas
  2025-09-11 18:32 ` [PATCH v3 02/10] PCI: Add additional checks for flr reset Farhan Ali
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:32 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

The current reset process saves the device's config space state before
reset and restores it afterward. However, when a device is in an error
state before reset, config space reads may return error values instead of
valid data. This results in saving corrupted values that get written back
to the device during state restoration.

Avoid saving the state of the config space when the device is in error.
While restoring we only restorei the state that can be restored through
kernel data such as BARs or doesn't depend on the saved state.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c      | 29 ++++++++++++++++++++++++++---
 drivers/pci/pcie/aer.c |  5 +++++
 drivers/pci/pcie/dpc.c |  5 +++++
 drivers/pci/pcie/ptm.c |  5 +++++
 drivers/pci/tph.c      |  5 +++++
 drivers/pci/vc.c       |  5 +++++
 6 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0f4d98036cd..4b67d22faf0a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1720,6 +1720,11 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
 	struct pci_cap_saved_state *save_state;
 	u16 *cap;
 
+	if (!dev->state_saved) {
+		pci_warn(dev, "Not restoring pcie state, no saved state");
+		return;
+	}
+
 	/*
 	 * Restore max latencies (in the LTR capability) before enabling
 	 * LTR itself in PCI_EXP_DEVCTL2.
@@ -1775,6 +1780,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 	struct pci_cap_saved_state *save_state;
 	u16 *cap;
 
+	if (!dev->state_saved) {
+		pci_warn(dev, "Not restoring pcix state, no saved state");
+		return;
+	}
+
 	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_PCIX);
 	pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
 	if (!save_state || !pos)
@@ -1792,6 +1802,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 int pci_save_state(struct pci_dev *dev)
 {
 	int i;
+	u16 val;
+
+	pci_read_config_word(dev, PCI_DEVICE_ID, &val);
+	if (PCI_POSSIBLE_ERROR(val)) {
+		pci_warn(dev, "Device in error, not saving config space state\n");
+		return -EIO;
+	}
+
 	/* XXX: 100% dword access ok here? */
 	for (i = 0; i < 16; i++) {
 		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
@@ -1854,6 +1872,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
 
 static void pci_restore_config_space(struct pci_dev *pdev)
 {
+	if (!pdev->state_saved) {
+		pci_warn(pdev, "No saved config space, restoring BARs\n");
+		pci_restore_bars(pdev);
+		pci_write_config_word(pdev, PCI_COMMAND,
+				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
+		return;
+	}
+
 	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
 		pci_restore_config_space_range(pdev, 10, 15, 0, false);
 		/* Restore BARs before the command register. */
@@ -1906,9 +1932,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
  */
 void pci_restore_state(struct pci_dev *dev)
 {
-	if (!dev->state_saved)
-		return;
-
 	pci_restore_pcie_state(dev);
 	pci_restore_pasid_state(dev);
 	pci_restore_pri_state(dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d716..dca3502ef669 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -361,6 +361,11 @@ void pci_restore_aer_state(struct pci_dev *dev)
 	if (!aer)
 		return;
 
+	if (!dev->state_saved) {
+		pci_warn(dev, "Not restoring aer state, no saved state");
+		return;
+	}
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..62c520af71a7 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -67,6 +67,11 @@ void pci_restore_dpc_state(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return;
 
+	if (!dev->state_saved) {
+		pci_warn(dev, "Not restoring dpc state, no saved state");
+		return;
+	}
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/pcie/ptm.c b/drivers/pci/pcie/ptm.c
index 65e4b008be00..7b5bcc23000d 100644
--- a/drivers/pci/pcie/ptm.c
+++ b/drivers/pci/pcie/ptm.c
@@ -112,6 +112,11 @@ void pci_restore_ptm_state(struct pci_dev *dev)
 	if (!ptm)
 		return;
 
+	if (!dev->state_saved) {
+		pci_warn(dev, "Not restoring ptm state, no saved state");
+		return;
+	}
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_PTM);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index cc64f93709a4..f0f1bae46736 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -435,6 +435,11 @@ void pci_restore_tph_state(struct pci_dev *pdev)
 	if (!pdev->tph_enabled)
 		return;
 
+	if (!pdev->state_saved) {
+		pci_warn(pdev, "Not restoring tph state, no saved state");
+		return;
+	}
+
 	save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_TPH);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index a4ff7f5f66dd..fda435cd49c1 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -391,6 +391,11 @@ void pci_restore_vc_state(struct pci_dev *dev)
 {
 	int i;
 
+	if (!dev->state_saved) {
+		pci_warn(dev, "Not restoring vc state, no saved state");
+		return;
+	}
+
 	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
 		int pos;
 		struct pci_cap_saved_state *save_state;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 02/10] PCI: Add additional checks for flr reset
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
  2025-09-11 18:32 ` [PATCH v3 01/10] PCI: Avoid saving error values for config space Farhan Ali
@ 2025-09-11 18:32 ` Farhan Ali
  2025-09-11 18:33 ` [PATCH v3 03/10] PCI: Allow per function PCI slots Farhan Ali
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:32 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

If a device is in an error state, then any reads of device registers can
return error value. Add addtional checks to validate if a device is in an
error state before doing an flr reset.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4b67d22faf0a..3994fa82df68 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4580,12 +4580,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
  */
 int pcie_reset_flr(struct pci_dev *dev, bool probe)
 {
+	u32 reg;
+
 	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
 		return -ENOTTY;
 
 	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
 		return -ENOTTY;
 
+	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
+		pci_warn(dev, "Device unable to do an FLR\n");
+		return -ENOTTY;
+	}
+
 	if (probe)
 		return 0;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
  2025-09-11 18:32 ` [PATCH v3 01/10] PCI: Avoid saving error values for config space Farhan Ali
  2025-09-11 18:32 ` [PATCH v3 02/10] PCI: Add additional checks for flr reset Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-12 12:23   ` Benjamin Block
  2025-09-16  6:52   ` Cédric Le Goater
  2025-09-11 18:33 ` [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

On s390 systems, which use a machine level hypervisor, PCI devices are
always accessed through a form of PCI pass-through which fundamentally
operates on a per PCI function granularity. This is also reflected in the
s390 PCI hotplug driver which creates hotplug slots for individual PCI
functions. Its reset_slot() function, which is a wrapper for
zpci_hot_reset_device(), thus also resets individual functions.

Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
to multifunction devices. This approach worked fine on s390 systems that
only exposed virtual functions as individual PCI domains to the operating
system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
s390 supports exposing the topology of multifunction PCI devices by
grouping them in a shared PCI domain. When attempting to reset a function
through the hotplug driver, the shared slot assignment causes the wrong
function to be reset instead of the intended one. It also leaks memory as
we do create a pci_slot object for the function, but don't correctly free
it in pci_slot_release().

Add a flag for struct pci_slot to allow per function PCI slots for
functions managed through a hypervisor, which exposes individual PCI
functions while retaining the topology.

Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/hotplug/s390_pci_hpc.c | 10 ++++++++--
 drivers/pci/pci.c                  |  4 +++-
 drivers/pci/slot.c                 | 14 +++++++++++---
 include/linux/pci.h                |  1 +
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/hotplug/s390_pci_hpc.c b/drivers/pci/hotplug/s390_pci_hpc.c
index d9996516f49e..8b547de464bf 100644
--- a/drivers/pci/hotplug/s390_pci_hpc.c
+++ b/drivers/pci/hotplug/s390_pci_hpc.c
@@ -126,14 +126,20 @@ static const struct hotplug_slot_ops s390_hotplug_slot_ops = {
 
 int zpci_init_slot(struct zpci_dev *zdev)
 {
+	int ret;
 	char name[SLOT_NAME_SIZE];
 	struct zpci_bus *zbus = zdev->zbus;
 
 	zdev->hotplug_slot.ops = &s390_hotplug_slot_ops;
 
 	snprintf(name, SLOT_NAME_SIZE, "%08x", zdev->fid);
-	return pci_hp_register(&zdev->hotplug_slot, zbus->bus,
-			       zdev->devfn, name);
+	ret = pci_hp_register(&zdev->hotplug_slot, zbus->bus,
+				zdev->devfn, name);
+	if (ret)
+		return ret;
+
+	zdev->hotplug_slot.pci_slot->per_func_slot = 1;
+	return 0;
 }
 
 void zpci_exit_slot(struct zpci_dev *zdev)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 3994fa82df68..70296d3b1cfc 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5061,7 +5061,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
 
 static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
 {
-	if (dev->multifunction || dev->subordinate || !dev->slot ||
+	if (dev->multifunction && !dev->slot->per_func_slot)
+		return -ENOTTY;
+	if (dev->subordinate || !dev->slot ||
 	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
 		return -ENOTTY;
 
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index 50fb3eb595fe..51ee59e14393 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -63,6 +63,14 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
 	return bus_speed_read(slot->bus->cur_bus_speed, buf);
 }
 
+static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
+{
+	if (slot->per_func_slot)
+		return dev->devfn == slot->number;
+
+	return PCI_SLOT(dev->devfn) == slot->number;
+}
+
 static void pci_slot_release(struct kobject *kobj)
 {
 	struct pci_dev *dev;
@@ -73,7 +81,7 @@ static void pci_slot_release(struct kobject *kobj)
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &slot->bus->devices, bus_list)
-		if (PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = NULL;
 	up_read(&pci_bus_sem);
 
@@ -166,7 +174,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
 
 	mutex_lock(&pci_slot_mutex);
 	list_for_each_entry(slot, &dev->bus->slots, list)
-		if (PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	mutex_unlock(&pci_slot_mutex);
 }
@@ -285,7 +293,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &parent->devices, bus_list)
-		if (PCI_SLOT(dev->devfn) == slot_nr)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	up_read(&pci_bus_sem);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 59876de13860..9265f32d9786 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -78,6 +78,7 @@ struct pci_slot {
 	struct list_head	list;		/* Node in list of slots */
 	struct hotplug_slot	*hotplug;	/* Hotplug info (move here) */
 	unsigned char		number;		/* PCI_SLOT(pci_dev->devfn) */
+	unsigned int		per_func_slot:1; /* Allow per function slot */
 	struct kobject		kobj;
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (2 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 03/10] PCI: Allow per function PCI slots Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-17 14:48   ` Niklas Schnelle
  2025-09-11 18:33 ` [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

On s390 today we overwrite the PCI BAR resource address to either an
artificial cookie address or MIO address. However this address is different
from the bus address of the BARs programmed by firmware. The artificial
cookie address was created to index into an array of function handles
(zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
but maybe different from the bus address. This creates an issue when trying
to convert the BAR resource address to bus address using the generic
pcibios_resource_to_bus.

Implement an architecture specific pcibios_resource_to_bus function to
correctly translate PCI BAR resource address to bus address for s390.
Similarly add architecture specific pcibios_bus_to_resource function to do
the reverse translation.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/pci/pci.c       | 73 +++++++++++++++++++++++++++++++++++++++
 drivers/pci/host-bridge.c |  4 +--
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index cd6676c2d602..5baeb5f6f674 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -264,6 +264,79 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 	return 0;
 }
 
+void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+			     struct resource *res)
+{
+	struct zpci_bus *zbus = bus->sysdata;
+	struct zpci_bar_struct *zbar;
+	struct zpci_dev *zdev;
+
+	region->start = res->start;
+	region->end = res->end;
+
+	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+		int j = 0;
+
+		zbar = NULL;
+		zdev = zbus->function[i];
+		if (!zdev)
+			continue;
+
+		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
+			if (zdev->bars[j].res->start == res->start &&
+			    zdev->bars[j].res->end == res->end) {
+				zbar = &zdev->bars[j];
+				break;
+			}
+		}
+
+		if (zbar) {
+			/* only MMIO is supported */
+			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
+			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				region->start |= (u64)zdev->bars[j + 1].val << 32;
+
+			region->end = region->start + (1UL << zbar->size) - 1;
+			return;
+		}
+	}
+}
+
+void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+			     struct pci_bus_region *region)
+{
+	struct zpci_bus *zbus = bus->sysdata;
+	struct zpci_dev *zdev;
+	resource_size_t start, end;
+
+	res->start = region->start;
+	res->end = region->end;
+
+	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+		zdev = zbus->function[i];
+		if (!zdev || !zdev->has_resources)
+			continue;
+
+		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
+			if (!zdev->bars[j].val && !zdev->bars[j].size)
+				continue;
+
+			/* only MMIO is supported */
+			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
+			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				start |= (u64)zdev->bars[j + 1].val << 32;
+
+			end = start + (1UL << zdev->bars[j].size) - 1;
+
+			if (start == region->start && end == region->end) {
+				res->start = zdev->bars[j].res->start;
+				res->end = zdev->bars[j].res->end;
+				return;
+			}
+		}
+	}
+}
+
 void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
 			   pgprot_t prot)
 {
diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index afa50b446567..56d62afb3afe 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -48,7 +48,7 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
 }
 EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
 
-void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
 			     struct resource *res)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
@@ -73,7 +73,7 @@ static bool region_contains(struct pci_bus_region *region1,
 	return region1->start <= region2->start && region1->end >= region2->end;
 }
 
-void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
 			     struct pci_bus_region *region)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (3 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-15  8:39   ` Niklas Schnelle
  2025-09-11 18:33 ` [PATCH v3 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

Commit c1e18c17bda6 ("s390/pci: add zpci_set_irq()/zpci_clear_irq()"),
introduced the zpci_set_irq() and zpci_clear_irq(), to be used while
resetting a zPCI device.

Commit da995d538d3a ("s390/pci: implement reset_slot for hotplug slot"),
mentions zpci_clear_irq() being called in the path for zpci_hot_reset_device().
But that is not the case anymore and these functions are not called
outside of this file.

However after a CLP disable/enable reset, the device's IRQ are
unregistered, but the flag zdev->irq_registered does not get cleared. It
creates an inconsistent state and so arch_restore_msi_irqs() doesn't
correctly restore the device's IRQ. This becomes a problem when a PCI
driver tries to restore the state of the device through
pci_restore_state(). Restore IRQ unconditionally for the device and remove
the irq_registered flag as its redundant.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h | 1 -
 arch/s390/pci/pci_irq.c     | 9 +--------
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 41f900f693d9..aed19a1aa9d7 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -145,7 +145,6 @@ struct zpci_dev {
 	u8		has_resources	: 1;
 	u8		is_physfn	: 1;
 	u8		util_str_avail	: 1;
-	u8		irqs_registered	: 1;
 	u8		tid_avail	: 1;
 	u8		rtr_avail	: 1; /* Relaxed translation allowed */
 	unsigned int	devfn;		/* DEVFN part of the RID*/
diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
index 84482a921332..e73be96ce5fe 100644
--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -107,9 +107,6 @@ static int zpci_set_irq(struct zpci_dev *zdev)
 	else
 		rc = zpci_set_airq(zdev);
 
-	if (!rc)
-		zdev->irqs_registered = 1;
-
 	return rc;
 }
 
@@ -123,9 +120,6 @@ static int zpci_clear_irq(struct zpci_dev *zdev)
 	else
 		rc = zpci_clear_airq(zdev);
 
-	if (!rc)
-		zdev->irqs_registered = 0;
-
 	return rc;
 }
 
@@ -427,8 +421,7 @@ bool arch_restore_msi_irqs(struct pci_dev *pdev)
 {
 	struct zpci_dev *zdev = to_zpci(pdev);
 
-	if (!zdev->irqs_registered)
-		zpci_set_irq(zdev);
+	zpci_set_irq(zdev);
 	return true;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 06/10] s390/pci: Update the logic for detecting passthrough device
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (4 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-15  9:22   ` Niklas Schnelle
  2025-09-11 18:33 ` [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

We can now have userspace drivers (vfio-pci based) on s390x. The userspace
drivers will not have any KVM fd and so no kzdev associated with them. So
we need to update the logic for detecting passthrough devices to not depend
on struct kvm_zdev.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      |  1 +
 arch/s390/pci/pci_event.c        | 14 ++++----------
 drivers/vfio/pci/vfio_pci_zdev.c |  9 ++++++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index aed19a1aa9d7..f47f62fc3bfd 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -169,6 +169,7 @@ struct zpci_dev {
 
 	char res_name[16];
 	bool mio_capable;
+	bool mediated_recovery;
 	struct zpci_bar_struct bars[PCI_STD_NUM_BARS];
 
 	u64		start_dma;	/* Start of available DMA addresses */
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index d930416d4c90..541d536be052 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -61,16 +61,10 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
 	}
 }
 
-static bool is_passed_through(struct pci_dev *pdev)
+static bool needs_mediated_recovery(struct pci_dev *pdev)
 {
 	struct zpci_dev *zdev = to_zpci(pdev);
-	bool ret;
-
-	mutex_lock(&zdev->kzdev_lock);
-	ret = !!zdev->kzdev;
-	mutex_unlock(&zdev->kzdev_lock);
-
-	return ret;
+	return zdev->mediated_recovery;
 }
 
 static bool is_driver_supported(struct pci_driver *driver)
@@ -194,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (is_passed_through(pdev)) {
+	if (needs_mediated_recovery(pdev)) {
 		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
 			pci_name(pdev));
 		status_str = "failed (pass-through)";
@@ -277,7 +271,7 @@ static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
 	 * we will inject the error event and let the guest recover the device
 	 * itself.
 	 */
-	if (is_passed_through(pdev))
+	if (needs_mediated_recovery(pdev))
 		goto out;
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 0990fdb146b7..a7bc23ce8483 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -148,6 +148,8 @@ int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 	if (!zdev)
 		return -ENODEV;
 
+	zdev->mediated_recovery = true;
+
 	if (!vdev->vdev.kvm)
 		return 0;
 
@@ -161,7 +163,12 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
 
-	if (!zdev || !vdev->vdev.kvm)
+	if (!zdev)
+		return;
+
+	zdev->mediated_recovery = false;
+
+	if (!vdev->vdev.kvm)
 		return;
 
 	if (zpci_kvm_hook.kvm_unregister)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (5 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-15 11:42   ` Niklas Schnelle
  2025-09-11 18:33 ` [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

For a passthrough device we need co-operation from user space to recover
the device. This would require to bubble up any error information to user
space.  Let's store this error information for passthrough devices, so it
can be retrieved later.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      | 28 ++++++++++
 arch/s390/pci/pci.c              |  1 +
 arch/s390/pci/pci_event.c        | 95 +++++++++++++++++++-------------
 drivers/vfio/pci/vfio_pci_zdev.c |  2 +
 4 files changed, 88 insertions(+), 38 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index f47f62fc3bfd..72e05af90e08 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -116,6 +116,31 @@ struct zpci_bus {
 	enum pci_bus_speed	max_bus_speed;
 };
 
+/* Content Code Description for PCI Function Error */
+struct zpci_ccdf_err {
+	u32 reserved1;
+	u32 fh;                         /* function handle */
+	u32 fid;                        /* function id */
+	u32 ett         :  4;           /* expected table type */
+	u32 mvn         : 12;           /* MSI vector number */
+	u32 dmaas       :  8;           /* DMA address space */
+	u32 reserved2   :  6;
+	u32 q           :  1;           /* event qualifier */
+	u32 rw          :  1;           /* read/write */
+	u64 faddr;                      /* failing address */
+	u32 reserved3;
+	u16 reserved4;
+	u16 pec;                        /* PCI event code */
+} __packed;
+
+#define ZPCI_ERR_PENDING_MAX 16
+struct zpci_ccdf_pending {
+	size_t count;
+	int head;
+	int tail;
+	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
+};
+
 /* Private data per function */
 struct zpci_dev {
 	struct zpci_bus *zbus;
@@ -191,6 +216,8 @@ struct zpci_dev {
 	struct iommu_domain *s390_domain; /* attached IOMMU domain */
 	struct kvm_zdev *kzdev;
 	struct mutex kzdev_lock;
+	struct zpci_ccdf_pending pending_errs;
+	struct mutex pending_errs_lock;
 	spinlock_t dom_lock;		/* protect s390_domain change */
 };
 
@@ -316,6 +343,7 @@ void zpci_debug_exit_device(struct zpci_dev *);
 int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
 int zpci_clear_error_state(struct zpci_dev *zdev);
 int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev);
 
 #ifdef CONFIG_NUMA
 
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 5baeb5f6f674..02c7b06cfd0e 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -896,6 +896,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
 	mutex_init(&zdev->state_lock);
 	mutex_init(&zdev->fmb_lock);
 	mutex_init(&zdev->kzdev_lock);
+	mutex_init(&zdev->pending_errs_lock);
 
 	return zdev;
 
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 541d536be052..ac527410812b 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -18,23 +18,6 @@
 #include "pci_bus.h"
 #include "pci_report.h"
 
-/* Content Code Description for PCI Function Error */
-struct zpci_ccdf_err {
-	u32 reserved1;
-	u32 fh;				/* function handle */
-	u32 fid;			/* function id */
-	u32 ett		:  4;		/* expected table type */
-	u32 mvn		: 12;		/* MSI vector number */
-	u32 dmaas	:  8;		/* DMA address space */
-	u32		:  6;
-	u32 q		:  1;		/* event qualifier */
-	u32 rw		:  1;		/* read/write */
-	u64 faddr;			/* failing address */
-	u32 reserved3;
-	u16 reserved4;
-	u16 pec;			/* PCI event code */
-} __packed;
-
 /* Content Code Description for PCI Function Availability */
 struct zpci_ccdf_avail {
 	u32 reserved1;
@@ -76,6 +59,41 @@ static bool is_driver_supported(struct pci_driver *driver)
 	return true;
 }
 
+static void zpci_store_pci_error(struct pci_dev *pdev,
+				 struct zpci_ccdf_err *ccdf)
+{
+	struct zpci_dev *zdev = to_zpci(pdev);
+	int i;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
+		pr_err("%s: Cannot store PCI error info for device",
+				pci_name(pdev));
+		mutex_unlock(&zdev->pending_errs_lock);
+		return;
+	}
+
+	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
+	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
+	zdev->pending_errs.tail++;
+	zdev->pending_errs.count++;
+	mutex_unlock(&zdev->pending_errs_lock);
+}
+
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
+{
+	struct pci_dev *pdev = NULL;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
+	if (zdev->pending_errs.count)
+		pr_err("%s: Unhandled PCI error events count=%zu",
+				pci_name(pdev), zdev->pending_errs.count);
+	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
+	mutex_unlock(&zdev->pending_errs_lock);
+}
+EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
+
 static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
 							 struct pci_driver *driver)
 {
@@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
  * and the platform determines which functions are affected for
  * multi-function devices.
  */
-static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
+static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
+							  struct zpci_ccdf_err *ccdf)
 {
 	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
 	struct zpci_dev *zdev = to_zpci(pdev);
@@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (needs_mediated_recovery(pdev)) {
-		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
-			pci_name(pdev));
-		status_str = "failed (pass-through)";
-		goto out_unlock;
-	}
-
 	driver = to_pci_driver(pdev->dev.driver);
 	if (!is_driver_supported(driver)) {
 		if (!driver) {
@@ -210,12 +222,22 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 		goto out_unlock;
 	}
 
+	if (needs_mediated_recovery(pdev))
+		zpci_store_pci_error(pdev, ccdf);
+
 	ers_res = zpci_event_notify_error_detected(pdev, driver);
 	if (ers_result_indicates_abort(ers_res)) {
 		status_str = "failed (abort on detection)";
 		goto out_unlock;
 	}
 
+	if (needs_mediated_recovery(pdev)) {
+		pr_info("%s: Recovering passthrough device\n", pci_name(pdev));
+		ers_res = PCI_ERS_RESULT_RECOVERED;
+		status_str = "in progress";
+		goto out_unlock;
+	}
+
 	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
 		ers_res = zpci_event_do_error_state_clear(pdev, driver);
 		if (ers_result_indicates_abort(ers_res)) {
@@ -258,25 +280,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
  * @pdev: PCI function for which to report
  * @es: PCI channel failure state to report
  */
-static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
+static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
+				  struct zpci_ccdf_err *ccdf)
 {
 	struct pci_driver *driver;
 
 	pci_dev_lock(pdev);
 	pdev->error_state = es;
-	/**
-	 * While vfio-pci's error_detected callback notifies user-space QEMU
-	 * reacts to this by freezing the guest. In an s390 environment PCI
-	 * errors are rarely fatal so this is overkill. Instead in the future
-	 * we will inject the error event and let the guest recover the device
-	 * itself.
-	 */
+
 	if (needs_mediated_recovery(pdev))
-		goto out;
+		zpci_store_pci_error(pdev, ccdf);
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
 		driver->err_handler->error_detected(pdev, pdev->error_state);
-out:
+
 	pci_dev_unlock(pdev);
 }
 
@@ -312,6 +329,7 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
 	pr_err("%s: Event 0x%x reports an error for PCI function 0x%x\n",
 	       pdev ? pci_name(pdev) : "n/a", ccdf->pec, ccdf->fid);
 
+
 	if (!pdev)
 		goto no_pdev;
 
@@ -322,12 +340,13 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
 		break;
 	case 0x0040: /* Service Action or Error Recovery Failed */
 	case 0x003b:
-		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
 		break;
 	default: /* PCI function left in the error state attempt to recover */
-		ers_res = zpci_event_attempt_error_recovery(pdev);
+		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
 		if (ers_res != PCI_ERS_RESULT_RECOVERED)
-			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+			zpci_event_io_failure(pdev, pci_channel_io_perm_failure,
+					ccdf);
 		break;
 	}
 	pci_dev_put(pdev);
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index a7bc23ce8483..2be37eab9279 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 
 	zdev->mediated_recovery = false;
 
+	zpci_cleanup_pending_errors(zdev);
+
 	if (!vdev->vdev.kvm)
 		return;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (6 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-13  9:04   ` Alex Williamson
  2025-09-15  6:26   ` Cédric Le Goater
  2025-09-11 18:33 ` [PATCH v3 09/10] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
  2025-09-11 18:33 ` [PATCH v3 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  9 siblings, 2 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

For zPCI devices, we have platform specific error information. The platform
firmware provides this error information to the operating system in an
architecture specific mechanism. To enable recovery from userspace for
these devices, we want to expose this error information to userspace. Add a
new device feature to expose this information.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c |  2 ++
 drivers/vfio/pci/vfio_pci_priv.h |  8 ++++++++
 drivers/vfio/pci/vfio_pci_zdev.c | 34 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 14 +++++++++++++
 4 files changed, 58 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..378adb3226db 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1514,6 +1514,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
+		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index a9972eacb293..a4a7f97fdc2e 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -86,6 +86,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				struct vfio_info_cap *caps);
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
 void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 					      struct vfio_info_cap *caps)
@@ -100,6 +102,12 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 
 static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {}
+
+static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+				     void __user *arg, size_t argsz);
+{
+	return -ENODEV;
+}
 #endif
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 2be37eab9279..261954039aa9 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -141,6 +141,40 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz)
+{
+	struct vfio_device_feature_zpci_err err;
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	int ret;
+	int head = 0;
+
+	if (!zdev)
+		return -ENODEV;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(err));
+	if (ret != 1)
+		return ret;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.count) {
+		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
+		err.pec = zdev->pending_errs.err[head].pec;
+		zdev->pending_errs.head++;
+		zdev->pending_errs.count--;
+		err.pending_errors = zdev->pending_errs.count;
+	}
+	mutex_unlock(&zdev->pending_errs_lock);
+
+	if (copy_to_user(arg, &err, sizeof(err)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009ba..a950c341602d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1478,6 +1478,20 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
+ * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
+ * platform firmware and notification to operating system is done by
+ * architecture specific mechanism.  Exposing this information to userspace
+ * allows userspace to take appropriate actions to handle an error on the
+ * device.
+ */
+struct vfio_device_feature_zpci_err {
+	__u16 pec;
+	int pending_errors;
+};
+#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/10] vfio: Add a reset_done callback for vfio-pci driver
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (7 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  2025-09-11 18:33 ` [PATCH v3 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  9 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

On error recovery for a PCI device bound to vfio-pci driver, we want to
recover the state of the device to its last known saved state. The callback
restores the state of the device to its initial saved state.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 378adb3226db..f2fcb81b3e69 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2241,6 +2241,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
+static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+	if (!vdev->pci_saved_state)
+		return;
+
+	pci_load_saved_state(pdev, vdev->pci_saved_state);
+	pci_restore_state(pdev);
+}
+
 int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 				  int nr_virtfn)
 {
@@ -2305,6 +2316,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 const struct pci_error_handlers vfio_pci_core_err_handlers = {
 	.error_detected = vfio_pci_core_aer_err_detected,
+	.reset_done = vfio_pci_core_aer_reset_done,
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (8 preceding siblings ...)
  2025-09-11 18:33 ` [PATCH v3 09/10] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2025-09-11 18:33 ` Farhan Ali
  9 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-11 18:33 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, alifm, schnelle, mjrosato

We are configuring the error signaling on the vast majority of devices and
it's extremely rare that it fires anyway. This allows userspace to be
notified on errors for legacy PCI devices. The Internal Share Memory (ISM)
device on s390x is one such device. For PCI devices on IBM s390x error
recovery involves platform firmware and notification to operating system
is done by architecture specific way. So the ISM device can still be
recovered when notified of an error.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 6 ++----
 drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f2fcb81b3e69..d125471fd5ea 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -749,8 +749,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
 		}
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
+		return 1;
 	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
 		return 1;
 	}
@@ -1150,8 +1149,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
 	case VFIO_PCI_REQ_IRQ_INDEX:
 		break;
 	case VFIO_PCI_ERR_IRQ_INDEX:
-		if (pci_is_pcie(vdev->pdev))
-			break;
+		break;
 		fallthrough;
 	default:
 		return -EINVAL;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 123298a4dc8f..f2d13b6eb28f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -838,8 +838,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 	case VFIO_PCI_ERR_IRQ_INDEX:
 		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
 		case VFIO_IRQ_SET_ACTION_TRIGGER:
-			if (pci_is_pcie(vdev->pdev))
-				func = vfio_pci_set_err_trigger;
+			func = vfio_pci_set_err_trigger;
 			break;
 		}
 		break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-11 18:33 ` [PATCH v3 03/10] PCI: Allow per function PCI slots Farhan Ali
@ 2025-09-12 12:23   ` Benjamin Block
  2025-09-12 17:19     ` Farhan Ali
  2025-09-16  6:52   ` Cédric Le Goater
  1 sibling, 1 reply; 35+ messages in thread
From: Benjamin Block @ 2025-09-12 12:23 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, schnelle, mjrosato

On Thu, Sep 11, 2025 at 11:33:00AM -0700, Farhan Ali wrote:
> On s390 systems, which use a machine level hypervisor, PCI devices are
> always accessed through a form of PCI pass-through which fundamentally
> operates on a per PCI function granularity. This is also reflected in the
> s390 PCI hotplug driver which creates hotplug slots for individual PCI
> functions. Its reset_slot() function, which is a wrapper for
> zpci_hot_reset_device(), thus also resets individual functions.
> 
> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
> to multifunction devices. This approach worked fine on s390 systems that
> only exposed virtual functions as individual PCI domains to the operating
> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> s390 supports exposing the topology of multifunction PCI devices by
> grouping them in a shared PCI domain. When attempting to reset a function
> through the hotplug driver, the shared slot assignment causes the wrong
> function to be reset instead of the intended one. It also leaks memory as
> we do create a pci_slot object for the function, but don't correctly free
> it in pci_slot_release().
> 
> Add a flag for struct pci_slot to allow per function PCI slots for
> functions managed through a hypervisor, which exposes individual PCI
> functions while retaining the topology.
> 
> Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")

Stable tag?
Reseting the wrong PCI function sounds bad enough.

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-12 12:23   ` Benjamin Block
@ 2025-09-12 17:19     ` Farhan Ali
  0 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-12 17:19 UTC (permalink / raw)
  To: Benjamin Block
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, schnelle, mjrosato


On 9/12/2025 5:23 AM, Benjamin Block wrote:
> On Thu, Sep 11, 2025 at 11:33:00AM -0700, Farhan Ali wrote:
>> On s390 systems, which use a machine level hypervisor, PCI devices are
>> always accessed through a form of PCI pass-through which fundamentally
>> operates on a per PCI function granularity. This is also reflected in the
>> s390 PCI hotplug driver which creates hotplug slots for individual PCI
>> functions. Its reset_slot() function, which is a wrapper for
>> zpci_hot_reset_device(), thus also resets individual functions.
>>
>> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
>> to multifunction devices. This approach worked fine on s390 systems that
>> only exposed virtual functions as individual PCI domains to the operating
>> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
>> s390 supports exposing the topology of multifunction PCI devices by
>> grouping them in a shared PCI domain. When attempting to reset a function
>> through the hotplug driver, the shared slot assignment causes the wrong
>> function to be reset instead of the intended one. It also leaks memory as
>> we do create a pci_slot object for the function, but don't correctly free
>> it in pci_slot_release().
>>
>> Add a flag for struct pci_slot to allow per function PCI slots for
>> functions managed through a hypervisor, which exposes individual PCI
>> functions while retaining the topology.
>>
>> Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> Stable tag?
> Reseting the wrong PCI function sounds bad enough.

That's a fair point. This is definitely broken for NETD devices 
(https://www.ibm.com/docs/en/linux-on-systems?topic=express-direct-mode). 
Will cc stable.

Thanks
Farhan


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] PCI: Avoid saving error values for config space
  2025-09-11 18:32 ` [PATCH v3 01/10] PCI: Avoid saving error values for config space Farhan Ali
@ 2025-09-13  8:27   ` Alex Williamson
  2025-09-15 17:15     ` Farhan Ali
  2025-09-16 18:09   ` Bjorn Helgaas
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2025-09-13  8:27 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, helgaas, schnelle,
	mjrosato

On Thu, 11 Sep 2025 11:32:58 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> The current reset process saves the device's config space state before
> reset and restores it afterward. However, when a device is in an error
> state before reset, config space reads may return error values instead of
> valid data. This results in saving corrupted values that get written back
> to the device during state restoration.
> 
> Avoid saving the state of the config space when the device is in error.
> While restoring we only restorei the state that can be restored through

s/restorei/restore/

> kernel data such as BARs or doesn't depend on the saved state.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/pci/pci.c      | 29 ++++++++++++++++++++++++++---
>  drivers/pci/pcie/aer.c |  5 +++++
>  drivers/pci/pcie/dpc.c |  5 +++++
>  drivers/pci/pcie/ptm.c |  5 +++++
>  drivers/pci/tph.c      |  5 +++++
>  drivers/pci/vc.c       |  5 +++++
>  6 files changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0f4d98036cd..4b67d22faf0a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1720,6 +1720,11 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
>  	struct pci_cap_saved_state *save_state;
>  	u16 *cap;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring pcie state, no saved state");
> +		return;
> +	}
> +
>  	/*
>  	 * Restore max latencies (in the LTR capability) before enabling
>  	 * LTR itself in PCI_EXP_DEVCTL2.
> @@ -1775,6 +1780,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>  	struct pci_cap_saved_state *save_state;
>  	u16 *cap;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring pcix state, no saved state");
> +		return;
> +	}
> +
>  	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_PCIX);
>  	pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
>  	if (!save_state || !pos)
> @@ -1792,6 +1802,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>  int pci_save_state(struct pci_dev *dev)
>  {
>  	int i;
> +	u16 val;
> +
> +	pci_read_config_word(dev, PCI_DEVICE_ID, &val);
> +	if (PCI_POSSIBLE_ERROR(val)) {
> +		pci_warn(dev, "Device in error, not saving config space state\n");
> +		return -EIO;
> +	}
> +

I don't think this works with standard VFs, per the spec the device ID
register returns 0xFFFF.  Likely need to look for a CRS or error status
across both vendor and device ID registers.

We could be a little more formal and specific describing the skipped
states too, ex. "PCIe capability", "PCI-X capability", "PCI AER
capability", etc.  Thanks,

Alex

>  	/* XXX: 100% dword access ok here? */
>  	for (i = 0; i < 16; i++) {
>  		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
> @@ -1854,6 +1872,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
>  
>  static void pci_restore_config_space(struct pci_dev *pdev)
>  {
> +	if (!pdev->state_saved) {
> +		pci_warn(pdev, "No saved config space, restoring BARs\n");
> +		pci_restore_bars(pdev);
> +		pci_write_config_word(pdev, PCI_COMMAND,
> +				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> +		return;
> +	}
> +
>  	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
>  		pci_restore_config_space_range(pdev, 10, 15, 0, false);
>  		/* Restore BARs before the command register. */
> @@ -1906,9 +1932,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
>   */
>  void pci_restore_state(struct pci_dev *dev)
>  {
> -	if (!dev->state_saved)
> -		return;
> -
>  	pci_restore_pcie_state(dev);
>  	pci_restore_pasid_state(dev);
>  	pci_restore_pri_state(dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d716..dca3502ef669 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -361,6 +361,11 @@ void pci_restore_aer_state(struct pci_dev *dev)
>  	if (!aer)
>  		return;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring aer state, no saved state");
> +		return;
> +	}
> +
>  	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR);
>  	if (!save_state)
>  		return;
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index fc18349614d7..62c520af71a7 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -67,6 +67,11 @@ void pci_restore_dpc_state(struct pci_dev *dev)
>  	if (!pci_is_pcie(dev))
>  		return;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring dpc state, no saved state");
> +		return;
> +	}
> +
>  	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
>  	if (!save_state)
>  		return;
> diff --git a/drivers/pci/pcie/ptm.c b/drivers/pci/pcie/ptm.c
> index 65e4b008be00..7b5bcc23000d 100644
> --- a/drivers/pci/pcie/ptm.c
> +++ b/drivers/pci/pcie/ptm.c
> @@ -112,6 +112,11 @@ void pci_restore_ptm_state(struct pci_dev *dev)
>  	if (!ptm)
>  		return;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring ptm state, no saved state");
> +		return;
> +	}
> +
>  	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_PTM);
>  	if (!save_state)
>  		return;
> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
> index cc64f93709a4..f0f1bae46736 100644
> --- a/drivers/pci/tph.c
> +++ b/drivers/pci/tph.c
> @@ -435,6 +435,11 @@ void pci_restore_tph_state(struct pci_dev *pdev)
>  	if (!pdev->tph_enabled)
>  		return;
>  
> +	if (!pdev->state_saved) {
> +		pci_warn(pdev, "Not restoring tph state, no saved state");
> +		return;
> +	}
> +
>  	save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_TPH);
>  	if (!save_state)
>  		return;
> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
> index a4ff7f5f66dd..fda435cd49c1 100644
> --- a/drivers/pci/vc.c
> +++ b/drivers/pci/vc.c
> @@ -391,6 +391,11 @@ void pci_restore_vc_state(struct pci_dev *dev)
>  {
>  	int i;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring vc state, no saved state");
> +		return;
> +	}
> +
>  	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>  		int pos;
>  		struct pci_cap_saved_state *save_state;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-11 18:33 ` [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2025-09-13  9:04   ` Alex Williamson
  2025-09-15 18:27     ` Farhan Ali
  2025-09-15  6:26   ` Cédric Le Goater
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2025-09-13  9:04 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, helgaas, schnelle,
	mjrosato

On Thu, 11 Sep 2025 11:33:05 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> For zPCI devices, we have platform specific error information. The platform
> firmware provides this error information to the operating system in an
> architecture specific mechanism. To enable recovery from userspace for
> these devices, we want to expose this error information to userspace. Add a
> new device feature to expose this information.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c |  2 ++
>  drivers/vfio/pci/vfio_pci_priv.h |  8 ++++++++
>  drivers/vfio/pci/vfio_pci_zdev.c | 34 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h        | 14 +++++++++++++
>  4 files changed, 58 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 7dcf5439dedc..378adb3226db 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1514,6 +1514,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
> +		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index a9972eacb293..a4a7f97fdc2e 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -86,6 +86,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  				struct vfio_info_cap *caps);
>  int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
>  void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +			      void __user *arg, size_t argsz);
>  #else
>  static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  					      struct vfio_info_cap *caps)
> @@ -100,6 +102,12 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>  
>  static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>  {}
> +
> +static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +				     void __user *arg, size_t argsz);
> +{
> +	return -ENODEV;
> +}
>  #endif
>  
>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index 2be37eab9279..261954039aa9 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -141,6 +141,40 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  	return ret;
>  }
>  
> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +			      void __user *arg, size_t argsz)
> +{
> +	struct vfio_device_feature_zpci_err err;
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +	int ret;
> +	int head = 0;
> +
> +	if (!zdev)
> +		return -ENODEV;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(err));
> +	if (ret != 1)
> +		return ret;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.count) {
> +		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
> +		err.pec = zdev->pending_errs.err[head].pec;
> +		zdev->pending_errs.head++;
> +		zdev->pending_errs.count--;
> +		err.pending_errors = zdev->pending_errs.count;
> +	}
> +	mutex_unlock(&zdev->pending_errs_lock);
> +
> +	if (copy_to_user(arg, &err, sizeof(err)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 75100bf009ba..a950c341602d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1478,6 +1478,20 @@ struct vfio_device_feature_bus_master {
>  };
>  #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>  
> +/**
> + * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
> + * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
> + * platform firmware and notification to operating system is done by
> + * architecture specific mechanism.  Exposing this information to userspace
> + * allows userspace to take appropriate actions to handle an error on the
> + * device.
> + */
> +struct vfio_device_feature_zpci_err {
> +	__u16 pec;
> +	int pending_errors;
> +};

This should have some explicit alignment.  Thanks,

Alex

> +#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 11
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-11 18:33 ` [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
  2025-09-13  9:04   ` Alex Williamson
@ 2025-09-15  6:26   ` Cédric Le Goater
  2025-09-15 18:27     ` Farhan Ali
  1 sibling, 1 reply; 35+ messages in thread
From: Cédric Le Goater @ 2025-09-15  6:26 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, schnelle, mjrosato

On 9/11/25 20:33, Farhan Ali wrote:
> For zPCI devices, we have platform specific error information. The platform
> firmware provides this error information to the operating system in an
> architecture specific mechanism. To enable recovery from userspace for
> these devices, we want to expose this error information to userspace. Add a
> new device feature to expose this information.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>   drivers/vfio/pci/vfio_pci_core.c |  2 ++
>   drivers/vfio/pci/vfio_pci_priv.h |  8 ++++++++
>   drivers/vfio/pci/vfio_pci_zdev.c | 34 ++++++++++++++++++++++++++++++++
>   include/uapi/linux/vfio.h        | 14 +++++++++++++
>   4 files changed, 58 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 7dcf5439dedc..378adb3226db 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1514,6 +1514,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>   		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>   	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>   		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
> +		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
>   	default:
>   		return -ENOTTY;
>   	}
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index a9972eacb293..a4a7f97fdc2e 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -86,6 +86,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>   				struct vfio_info_cap *caps);
>   int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
>   void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +			      void __user *arg, size_t argsz);
>   #else
>   static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>   					      struct vfio_info_cap *caps)
> @@ -100,6 +102,12 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>   
>   static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>   {}
> +
> +static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +				     void __user *arg, size_t argsz);

The extra ';' breaks builds on non-Z platforms.

C.

> +{
> +	return -ENODEV;
> +}
>   #endif
>   
>   static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index 2be37eab9279..261954039aa9 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -141,6 +141,40 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>   	return ret;
>   }
>   
> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
> +			      void __user *arg, size_t argsz)
> +{
> +	struct vfio_device_feature_zpci_err err;
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +	int ret;
> +	int head = 0;
> +
> +	if (!zdev)
> +		return -ENODEV;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(err));
> +	if (ret != 1)
> +		return ret;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.count) {
> +		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
> +		err.pec = zdev->pending_errs.err[head].pec;
> +		zdev->pending_errs.head++;
> +		zdev->pending_errs.count--;
> +		err.pending_errors = zdev->pending_errs.count;
> +	}
> +	mutex_unlock(&zdev->pending_errs_lock);
> +
> +	if (copy_to_user(arg, &err, sizeof(err)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>   int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>   {
>   	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 75100bf009ba..a950c341602d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1478,6 +1478,20 @@ struct vfio_device_feature_bus_master {
>   };
>   #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>   
> +/**
> + * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
> + * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
> + * platform firmware and notification to operating system is done by
> + * architecture specific mechanism.  Exposing this information to userspace
> + * allows userspace to take appropriate actions to handle an error on the
> + * device.
> + */
> +struct vfio_device_feature_zpci_err {
> +	__u16 pec;
> +	int pending_errors;
> +};
> +#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 11
> +
>   /* -------- API for Type1 VFIO IOMMU -------- */
>   
>   /**


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device
  2025-09-11 18:33 ` [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
@ 2025-09-15  8:39   ` Niklas Schnelle
  2025-09-15 17:42     ` Farhan Ali
  0 siblings, 1 reply; 35+ messages in thread
From: Niklas Schnelle @ 2025-09-15  8:39 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato

On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
> Commit c1e18c17bda6 ("s390/pci: add zpci_set_irq()/zpci_clear_irq()"),
> introduced the zpci_set_irq() and zpci_clear_irq(), to be used while
> resetting a zPCI device.
> 
> Commit da995d538d3a ("s390/pci: implement reset_slot for hotplug slot"),
> mentions zpci_clear_irq() being called in the path for zpci_hot_reset_device().
> But that is not the case anymore and these functions are not called
> outside of this file.

If you're doing another version I think you could add a bit more
information on why this still works for existing recovery based on my
investigation in
https://lore.kernel.org/lkml/052ebdbb6f2d38025ca4345ee51e4857e19bb0e4.camel@linux.ibm.com/

Even if you don't add more explanations, I'd tend to just drop the
above paragraph as it doesn't seem relevant and sounds like
zpci_hot_reset_device() doesn't clear IRQs. As explained in the linked
mail there really is no need to call zpci_clear_irq() in
zpci_hot_reset_device() as the CLP disable does disable IRQs. It's
really only the state tracking that can get screwed up but is also fine
for drivers which end up doing the tear down. 

> 
> However after a CLP disable/enable reset, the device's IRQ are
> unregistered, but the flag zdev->irq_registered does not get cleared. It
> creates an inconsistent state and so arch_restore_msi_irqs() doesn't
> correctly restore the device's IRQ. This becomes a problem when a PCI
> driver tries to restore the state of the device through
> pci_restore_state(). Restore IRQ unconditionally for the device and remove
> the irq_registered flag as its redundant.

s/its/it's/

> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h | 1 -
>  arch/s390/pci/pci_irq.c     | 9 +--------
>  2 files changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 41f900f693d9..aed19a1aa9d7 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -145,7 +145,6 @@ struct zpci_dev {
>  	u8		has_resources	: 1;
>  	u8		is_physfn	: 1;
>  	u8		util_str_avail	: 1;
> -	u8		irqs_registered	: 1;
>  	u8		tid_avail	: 1;
>  	u8		rtr_avail	: 1; /* Relaxed translation allowed */
>  	unsigned int	devfn;		/* DEVFN part of the RID*/
> diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
> index 84482a921332..e73be96ce5fe 100644
> --- a/arch/s390/pci/pci_irq.c
> +++ b/arch/s390/pci/pci_irq.c
> @@ -107,9 +107,6 @@ static int zpci_set_irq(struct zpci_dev *zdev)
>  	else
>  		rc = zpci_set_airq(zdev);
>  
> -	if (!rc)
> -		zdev->irqs_registered = 1;
> -
>  	return rc;
>  }
>  
> @@ -123,9 +120,6 @@ static int zpci_clear_irq(struct zpci_dev *zdev)
>  	else
>  		rc = zpci_clear_airq(zdev);
>  
> -	if (!rc)
> -		zdev->irqs_registered = 0;
> -
>  	return rc;
>  }
>  
> @@ -427,8 +421,7 @@ bool arch_restore_msi_irqs(struct pci_dev *pdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(pdev);
>  
> -	if (!zdev->irqs_registered)
> -		zpci_set_irq(zdev);
> +	zpci_set_irq(zdev);
>  	return true;
>  }
>  

Code looks good to me. With or without my suggestions for the commit
message:

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 06/10] s390/pci: Update the logic for detecting passthrough device
  2025-09-11 18:33 ` [PATCH v3 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
@ 2025-09-15  9:22   ` Niklas Schnelle
  0 siblings, 0 replies; 35+ messages in thread
From: Niklas Schnelle @ 2025-09-15  9:22 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato

On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
> We can now have userspace drivers (vfio-pci based) on s390x. The userspace
> drivers will not have any KVM fd and so no kzdev associated with them. So
> we need to update the logic for detecting passthrough devices to not depend
> on struct kvm_zdev.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h      |  1 +
>  arch/s390/pci/pci_event.c        | 14 ++++----------
>  drivers/vfio/pci/vfio_pci_zdev.c |  9 ++++++++-
>  3 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index aed19a1aa9d7..f47f62fc3bfd 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -169,6 +169,7 @@ struct zpci_dev {
>  
>  	char res_name[16];
>  	bool mio_capable;
> +	bool mediated_recovery;
>  	struct zpci_bar_struct bars[PCI_STD_NUM_BARS];
>  
>  	u64		start_dma;	/* Start of available DMA addresses */
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index d930416d4c90..541d536be052 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -61,16 +61,10 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
>  	}
>  }
>  
> -static bool is_passed_through(struct pci_dev *pdev)
> +static bool needs_mediated_recovery(struct pci_dev *pdev)
>  {
>  	struct zpci_dev *zdev = to_zpci(pdev);
> -	bool ret;
> -
> -	mutex_lock(&zdev->kzdev_lock);
> -	ret = !!zdev->kzdev;
> -	mutex_unlock(&zdev->kzdev_lock);
> -
> -	return ret;
> +	return zdev->mediated_recovery;
>  }
>  
--- snip ---

Looks good to me.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-11 18:33 ` [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2025-09-15 11:42   ` Niklas Schnelle
  2025-09-15 18:12     ` Farhan Ali
  0 siblings, 1 reply; 35+ messages in thread
From: Niklas Schnelle @ 2025-09-15 11:42 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato

On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
> For a passthrough device we need co-operation from user space to recover
> the device. This would require to bubble up any error information to user
> space.  Let's store this error information for passthrough devices, so it
> can be retrieved later.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h      | 28 ++++++++++
>  arch/s390/pci/pci.c              |  1 +
>  arch/s390/pci/pci_event.c        | 95 +++++++++++++++++++-------------
>  drivers/vfio/pci/vfio_pci_zdev.c |  2 +
>  4 files changed, 88 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index f47f62fc3bfd..72e05af90e08 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -116,6 +116,31 @@ struct zpci_bus {
>  	enum pci_bus_speed	max_bus_speed;
>  };
>  
> +/* Content Code Description for PCI Function Error */
> +struct zpci_ccdf_err {
> +	u32 reserved1;
> +	u32 fh;                         /* function handle */
> +	u32 fid;                        /* function id */
> +	u32 ett         :  4;           /* expected table type */
> +	u32 mvn         : 12;           /* MSI vector number */
> +	u32 dmaas       :  8;           /* DMA address space */
> +	u32 reserved2   :  6;
> +	u32 q           :  1;           /* event qualifier */
> +	u32 rw          :  1;           /* read/write */
> +	u64 faddr;                      /* failing address */
> +	u32 reserved3;
> +	u16 reserved4;
> +	u16 pec;                        /* PCI event code */
> +} __packed;
> +
> +#define ZPCI_ERR_PENDING_MAX 16

16 pending error events sounds like a lot for a single devices. This
also means that the array alone already spans more than 2 cache lines
(256 byte on s390x). I can't imagine that we'd ever have that many
errors pending. This is especially true since a device already in an
error state would be the least likely to cause more errors. We have
seen cases of 2 errors in the past, so maybe 4 would give us good head
room?

> +struct zpci_ccdf_pending {
> +	size_t count;
> +	int head;
> +	int tail;
> +	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
> +};
> +
>  /* Private data per function */
>  struct zpci_dev {
>  	struct zpci_bus *zbus;
> @@ -191,6 +216,8 @@ struct zpci_dev {
>  	struct iommu_domain *s390_domain; /* attached IOMMU domain */
>  	struct kvm_zdev *kzdev;
>  	struct mutex kzdev_lock;
> +	struct zpci_ccdf_pending pending_errs;
> +	struct mutex pending_errs_lock;
>  	spinlock_t dom_lock;		/* protect s390_domain change */
>  };
> 
--- snip ---

> -
>  /* Content Code Description for PCI Function Availability */
>  struct zpci_ccdf_avail {
>  	u32 reserved1;
> @@ -76,6 +59,41 @@ static bool is_driver_supported(struct pci_driver *driver)
>  	return true;
>  }
>  
> +static void zpci_store_pci_error(struct pci_dev *pdev,
> +				 struct zpci_ccdf_err *ccdf)
> +{
> +	struct zpci_dev *zdev = to_zpci(pdev);
> +	int i;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
> +		pr_err("%s: Cannot store PCI error info for device",
> +				pci_name(pdev));
> +		mutex_unlock(&zdev->pending_errs_lock);

I think the error message should state that the maximum number of
pending error events has been queued. As with the ZPI_ERR_PENDING_MAX I
really don't think we would reach this even at 4 vs 16 max pending but
if we do I agree that having the first couple of errors saved is
probably nice for analysis.

> +		return;
> +	}
> +
> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
> +	zdev->pending_errs.tail++;
> +	zdev->pending_errs.count++;

With tail being int this would be undefined behavior if it ever
overflowed. Since the array is of fixed length that is always smaller
than 256 how about making tail, head, and count u8. The memory saving
doesn't matter but at least overflow becomes well defined.

> +	mutex_unlock(&zdev->pending_errs_lock);
> +}
> +
> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
> +{
> +	struct pci_dev *pdev = NULL;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
> +	if (zdev->pending_errs.count)
> +		pr_err("%s: Unhandled PCI error events count=%zu",
> +				pci_name(pdev), zdev->pending_errs.count);

I think this could be a zpci_dbg(). That way you also don't need the
pci_get_slot() which is also buggy as it misses a pci_dev_put(). The
message also doesn't seem useful for the user. As I understand it this
would happen if a vfio-pci user dies without handling all the error
events but then vfio-pci will also reset the slot on closing of the
fds, no? So the device will get reset anyway.

> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));

If this goes wrong and we subsequently crash or take a live memory dump
I'd prefer to have bread crumbs such as the errors that weren't cleaned
up. Wouldn't it be enough to just set the count to zero and for debug
the original count will be in s390dbf. Also maybe it would make sense
to pull the zdev->mediated_recovery clearing in here?

> +	mutex_unlock(&zdev->pending_errs_lock);
> +}
> +EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
> +
>  static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
>  							 struct pci_driver *driver)
>  {
> @@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
>   * and the platform determines which functions are affected for
>   * multi-function devices.
>   */
> -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
> +							  struct zpci_ccdf_err *ccdf)
>  {
>  	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
>  	struct zpci_dev *zdev = to_zpci(pdev);
> @@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  	}
>  	pdev->error_state = pci_channel_io_frozen;
>  
> -	if (needs_mediated_recovery(pdev)) {
> -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
> -			pci_name(pdev));
> -		status_str = "failed (pass-through)";
> -		goto out_unlock;
> -	}
> -
>  	driver = to_pci_driver(pdev->dev.driver);
>  	if (!is_driver_supported(driver)) {
>  		if (!driver) {
> @@ -210,12 +222,22 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  		goto out_unlock;
>  	}
>  
> +	if (needs_mediated_recovery(pdev))
> +		zpci_store_pci_error(pdev, ccdf);
> +
>  	ers_res = zpci_event_notify_error_detected(pdev, driver);
>  	if (ers_result_indicates_abort(ers_res)) {
>  		status_str = "failed (abort on detection)";
>  		goto out_unlock;
>  	}
>  
> +	if (needs_mediated_recovery(pdev)) {
> +		pr_info("%s: Recovering passthrough device\n", pci_name(pdev));

I'd say technically we're not recovering the device here but rather
leaving it alone so user-space can take over the recovery. Maybe this
could be made explicit in the message. Something like:

""%s: Leaving recovery of pass-through device to user-space\n"

> +		ers_res = PCI_ERS_RESULT_RECOVERED;
> +		status_str = "in progress";
> +		goto out_unlock;
> +	}
> +
>  	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
>  		ers_res = zpci_event_do_error_state_clear(pdev, driver);
>  		if (ers_result_indicates_abort(ers_res)) {
> @@ -258,25 +280,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>   * @pdev: PCI function for which to report
>   * @es: PCI channel failure state to report
>   */
> -static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
> +static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
> +				  struct zpci_ccdf_err *ccdf)
>  {
>  	struct pci_driver *driver;
>  
>  	pci_dev_lock(pdev);
>  	pdev->error_state = es;
> -	/**
> -	 * While vfio-pci's error_detected callback notifies user-space QEMU
> -	 * reacts to this by freezing the guest. In an s390 environment PCI
> -	 * errors are rarely fatal so this is overkill. Instead in the future
> -	 * we will inject the error event and let the guest recover the device
> -	 * itself.
> -	 */
> +
>  	if (needs_mediated_recovery(pdev))
> -		goto out;
> +		zpci_store_pci_error(pdev, ccdf);
>  	driver = to_pci_driver(pdev->dev.driver);
>  	if (driver && driver->err_handler && driver->err_handler->error_detected)
>  		driver->err_handler->error_detected(pdev, pdev->error_state);
> -out:
> +
>  	pci_dev_unlock(pdev);
>  }
>  
> @@ -312,6 +329,7 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>  	pr_err("%s: Event 0x%x reports an error for PCI function 0x%x\n",
>  	       pdev ? pci_name(pdev) : "n/a", ccdf->pec, ccdf->fid);
>  
> +
>  	if (!pdev)

Nit, stray empty line.

>  		goto no_pdev;
>  
> @@ -322,12 +340,13 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>  		break;
>  	case 0x0040: /* Service Action or Error Recovery Failed */
>  	case 0x003b:
> -		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
> +		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>  		break;
>  	default: /* PCI function left in the error state attempt to recover */
> -		ers_res = zpci_event_attempt_error_recovery(pdev);
> +		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
>  		if (ers_res != PCI_ERS_RESULT_RECOVERED)
> -			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
> +			zpci_event_io_failure(pdev, pci_channel_io_perm_failure,
> +					ccdf);
>  		break;
>  	}
>  	pci_dev_put(pdev);
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index a7bc23ce8483..2be37eab9279 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>  
>  	zdev->mediated_recovery = false;
>  
> +	zpci_cleanup_pending_errors(zdev);
> +
>  	if (!vdev->vdev.kvm)
>  		return;
>  

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] PCI: Avoid saving error values for config space
  2025-09-13  8:27   ` Alex Williamson
@ 2025-09-15 17:15     ` Farhan Ali
  0 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-15 17:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-s390, kvm, linux-kernel, linux-pci, helgaas, schnelle,
	mjrosato


On 9/13/2025 1:27 AM, Alex Williamson wrote:
> On Thu, 11 Sep 2025 11:32:58 -0700
> Farhan Ali <alifm@linux.ibm.com> wrote:
>
>> The current reset process saves the device's config space state before
>> reset and restores it afterward. However, when a device is in an error
>> state before reset, config space reads may return error values instead of
>> valid data. This results in saving corrupted values that get written back
>> to the device during state restoration.
>>
>> Avoid saving the state of the config space when the device is in error.
>> While restoring we only restorei the state that can be restored through
> s/restorei/restore/

Thanks for catching that, will fix.

>> kernel data such as BARs or doesn't depend on the saved state.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/pci/pci.c      | 29 ++++++++++++++++++++++++++---
>>   drivers/pci/pcie/aer.c |  5 +++++
>>   drivers/pci/pcie/dpc.c |  5 +++++
>>   drivers/pci/pcie/ptm.c |  5 +++++
>>   drivers/pci/tph.c      |  5 +++++
>>   drivers/pci/vc.c       |  5 +++++
>>   6 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index b0f4d98036cd..4b67d22faf0a 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1720,6 +1720,11 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
>>   	struct pci_cap_saved_state *save_state;
>>   	u16 *cap;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring pcie state, no saved state");
>> +		return;
>> +	}
>> +
>>   	/*
>>   	 * Restore max latencies (in the LTR capability) before enabling
>>   	 * LTR itself in PCI_EXP_DEVCTL2.
>> @@ -1775,6 +1780,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>>   	struct pci_cap_saved_state *save_state;
>>   	u16 *cap;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring pcix state, no saved state");
>> +		return;
>> +	}
>> +
>>   	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_PCIX);
>>   	pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
>>   	if (!save_state || !pos)
>> @@ -1792,6 +1802,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>>   int pci_save_state(struct pci_dev *dev)
>>   {
>>   	int i;
>> +	u16 val;
>> +
>> +	pci_read_config_word(dev, PCI_DEVICE_ID, &val);
>> +	if (PCI_POSSIBLE_ERROR(val)) {
>> +		pci_warn(dev, "Device in error, not saving config space state\n");
>> +		return -EIO;
>> +	}
>> +
> I don't think this works with standard VFs, per the spec the device ID
> register returns 0xFFFF.  Likely need to look for a CRS or error status
> across both vendor and device ID registers.

Yes, I missed that. Though the spec also mentions both vendor and device 
id registers can be 0xFFFF for standard VFs. The implementation note in 
the spec mentions legacy software can ignore VFs if both device id and 
vendor id is 0xFFFF. So not sure if checking both would work here?

Also by CRS are you referring to Configuration Request Retry? (In PCIe 
spec v6 I couldn't find reference to CRS, but found RRS so its probably 
been renamed to Request Retry Status). Based on my understanding of the 
spec a function will return CRS after a reset, but in this case we are 
trying to read and save the state before a reset? Based on 
pci_bus_rrs_vendor_id(), on a CRS vendor ID returned would be 0x1, but 
that wouldn't work for s390 as currently reads on error will return 
0xFFFF. Apologies if I misunderstood anything.

I see pci_dev_wait() check for command and status register in case RRS 
is not available, would that be appropriate check here?


>
> We could be a little more formal and specific describing the skipped
> states too, ex. "PCIe capability", "PCI-X capability", "PCI AER
> capability", etc.  Thanks,
>
> Alex

Makes sense, will update the warn messages.

Thanks
Farhan

>
>>   	/* XXX: 100% dword access ok here? */
>>   	for (i = 0; i < 16; i++) {
>>   		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
>> @@ -1854,6 +1872,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
>>   
>>   static void pci_restore_config_space(struct pci_dev *pdev)
>>   {
>> +	if (!pdev->state_saved) {
>> +		pci_warn(pdev, "No saved config space, restoring BARs\n");
>> +		pci_restore_bars(pdev);
>> +		pci_write_config_word(pdev, PCI_COMMAND,
>> +				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
>> +		return;
>> +	}
>> +
>>   	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
>>   		pci_restore_config_space_range(pdev, 10, 15, 0, false);
>>   		/* Restore BARs before the command register. */
>> @@ -1906,9 +1932,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
>>    */
>>   void pci_restore_state(struct pci_dev *dev)
>>   {
>> -	if (!dev->state_saved)
>> -		return;
>> -
>>   	pci_restore_pcie_state(dev);
>>   	pci_restore_pasid_state(dev);
>>   	pci_restore_pri_state(dev);
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index e286c197d716..dca3502ef669 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -361,6 +361,11 @@ void pci_restore_aer_state(struct pci_dev *dev)
>>   	if (!aer)
>>   		return;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring aer state, no saved state");
>> +		return;
>> +	}
>> +
>>   	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR);
>>   	if (!save_state)
>>   		return;
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index fc18349614d7..62c520af71a7 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -67,6 +67,11 @@ void pci_restore_dpc_state(struct pci_dev *dev)
>>   	if (!pci_is_pcie(dev))
>>   		return;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring dpc state, no saved state");
>> +		return;
>> +	}
>> +
>>   	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
>>   	if (!save_state)
>>   		return;
>> diff --git a/drivers/pci/pcie/ptm.c b/drivers/pci/pcie/ptm.c
>> index 65e4b008be00..7b5bcc23000d 100644
>> --- a/drivers/pci/pcie/ptm.c
>> +++ b/drivers/pci/pcie/ptm.c
>> @@ -112,6 +112,11 @@ void pci_restore_ptm_state(struct pci_dev *dev)
>>   	if (!ptm)
>>   		return;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring ptm state, no saved state");
>> +		return;
>> +	}
>> +
>>   	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_PTM);
>>   	if (!save_state)
>>   		return;
>> diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
>> index cc64f93709a4..f0f1bae46736 100644
>> --- a/drivers/pci/tph.c
>> +++ b/drivers/pci/tph.c
>> @@ -435,6 +435,11 @@ void pci_restore_tph_state(struct pci_dev *pdev)
>>   	if (!pdev->tph_enabled)
>>   		return;
>>   
>> +	if (!pdev->state_saved) {
>> +		pci_warn(pdev, "Not restoring tph state, no saved state");
>> +		return;
>> +	}
>> +
>>   	save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_TPH);
>>   	if (!save_state)
>>   		return;
>> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
>> index a4ff7f5f66dd..fda435cd49c1 100644
>> --- a/drivers/pci/vc.c
>> +++ b/drivers/pci/vc.c
>> @@ -391,6 +391,11 @@ void pci_restore_vc_state(struct pci_dev *dev)
>>   {
>>   	int i;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring vc state, no saved state");
>> +		return;
>> +	}
>> +
>>   	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>>   		int pos;
>>   		struct pci_cap_saved_state *save_state;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device
  2025-09-15  8:39   ` Niklas Schnelle
@ 2025-09-15 17:42     ` Farhan Ali
  2025-09-16 10:59       ` Niklas Schnelle
  0 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-15 17:42 UTC (permalink / raw)
  To: Niklas Schnelle, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato


On 9/15/2025 1:39 AM, Niklas Schnelle wrote:
> On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
>> Commit c1e18c17bda6 ("s390/pci: add zpci_set_irq()/zpci_clear_irq()"),
>> introduced the zpci_set_irq() and zpci_clear_irq(), to be used while
>> resetting a zPCI device.
>>
>> Commit da995d538d3a ("s390/pci: implement reset_slot for hotplug slot"),
>> mentions zpci_clear_irq() being called in the path for zpci_hot_reset_device().
>> But that is not the case anymore and these functions are not called
>> outside of this file.
> If you're doing another version I think you could add a bit more
> information on why this still works for existing recovery based on my
> investigation in
> https://lore.kernel.org/lkml/052ebdbb6f2d38025ca4345ee51e4857e19bb0e4.camel@linux.ibm.com/
>
> Even if you don't add more explanations, I'd tend to just drop the
> above paragraph as it doesn't seem relevant and sounds like
> zpci_hot_reset_device() doesn't clear IRQs. As explained in the linked
> mail there really is no need to call zpci_clear_irq() in
> zpci_hot_reset_device() as the CLP disable does disable IRQs. It's
> really only the state tracking that can get screwed up but is also fine
> for drivers which end up doing the tear down.

I referenced commit da995d538d3a as that commit introduced the 
arch_restore_msi_irqs and describes the reasoning as to why we need it. 
It also mentions about zpci_clear_irq being called by 
zpci_hot_reset_device. IMHO the message was confusing as it took me my 
down the path of trying to identify any commit that changed the behavior 
since da995d538d3a. But that wasn't the case and it was an error in the 
commit message. I want to keep a reference here to at least clarify that.

I had tried to clarify that this only becomes an issue if a driver tries 
restoring state through pci_restore_state(), in the paragraph below. But 
should I change it to be more explicit about that it's not an issue for 
driver doing setup and tear down through arch_msi_irq_setup and 
arch_msi_irq_teardown functions?

>
>> However after a CLP disable/enable reset, the device's IRQ are
>> unregistered, but the flag zdev->irq_registered does not get cleared. It
>> creates an inconsistent state and so arch_restore_msi_irqs() doesn't
>> correctly restore the device's IRQ. This becomes a problem when a PCI
>> driver tries to restore the state of the device through
>> pci_restore_state(). Restore IRQ unconditionally for the device and remove
>> the irq_registered flag as its redundant.
> s/its/it's/

Thanks, will fix.

>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pci.h | 1 -
>>   arch/s390/pci/pci_irq.c     | 9 +--------
>>   2 files changed, 1 insertion(+), 9 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
>> index 41f900f693d9..aed19a1aa9d7 100644
>> --- a/arch/s390/include/asm/pci.h
>> +++ b/arch/s390/include/asm/pci.h
>> @@ -145,7 +145,6 @@ struct zpci_dev {
>>   	u8		has_resources	: 1;
>>   	u8		is_physfn	: 1;
>>   	u8		util_str_avail	: 1;
>> -	u8		irqs_registered	: 1;
>>   	u8		tid_avail	: 1;
>>   	u8		rtr_avail	: 1; /* Relaxed translation allowed */
>>   	unsigned int	devfn;		/* DEVFN part of the RID*/
>> diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
>> index 84482a921332..e73be96ce5fe 100644
>> --- a/arch/s390/pci/pci_irq.c
>> +++ b/arch/s390/pci/pci_irq.c
>> @@ -107,9 +107,6 @@ static int zpci_set_irq(struct zpci_dev *zdev)
>>   	else
>>   		rc = zpci_set_airq(zdev);
>>   
>> -	if (!rc)
>> -		zdev->irqs_registered = 1;
>> -
>>   	return rc;
>>   }
>>   
>> @@ -123,9 +120,6 @@ static int zpci_clear_irq(struct zpci_dev *zdev)
>>   	else
>>   		rc = zpci_clear_airq(zdev);
>>   
>> -	if (!rc)
>> -		zdev->irqs_registered = 0;
>> -
>>   	return rc;
>>   }
>>   
>> @@ -427,8 +421,7 @@ bool arch_restore_msi_irqs(struct pci_dev *pdev)
>>   {
>>   	struct zpci_dev *zdev = to_zpci(pdev);
>>   
>> -	if (!zdev->irqs_registered)
>> -		zpci_set_irq(zdev);
>> +	zpci_set_irq(zdev);
>>   	return true;
>>   }
>>   
> Code looks good to me. With or without my suggestions for the commit
> message:
>
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

Thanks for reviewing!

Thanks
Farhan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-15 11:42   ` Niklas Schnelle
@ 2025-09-15 18:12     ` Farhan Ali
  2025-09-16 10:54       ` Niklas Schnelle
  0 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-15 18:12 UTC (permalink / raw)
  To: Niklas Schnelle, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato


On 9/15/2025 4:42 AM, Niklas Schnelle wrote:
> On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
>> For a passthrough device we need co-operation from user space to recover
>> the device. This would require to bubble up any error information to user
>> space.  Let's store this error information for passthrough devices, so it
>> can be retrieved later.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pci.h      | 28 ++++++++++
>>   arch/s390/pci/pci.c              |  1 +
>>   arch/s390/pci/pci_event.c        | 95 +++++++++++++++++++-------------
>>   drivers/vfio/pci/vfio_pci_zdev.c |  2 +
>>   4 files changed, 88 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
>> index f47f62fc3bfd..72e05af90e08 100644
>> --- a/arch/s390/include/asm/pci.h
>> +++ b/arch/s390/include/asm/pci.h
>> @@ -116,6 +116,31 @@ struct zpci_bus {
>>   	enum pci_bus_speed	max_bus_speed;
>>   };
>>   
>> +/* Content Code Description for PCI Function Error */
>> +struct zpci_ccdf_err {
>> +	u32 reserved1;
>> +	u32 fh;                         /* function handle */
>> +	u32 fid;                        /* function id */
>> +	u32 ett         :  4;           /* expected table type */
>> +	u32 mvn         : 12;           /* MSI vector number */
>> +	u32 dmaas       :  8;           /* DMA address space */
>> +	u32 reserved2   :  6;
>> +	u32 q           :  1;           /* event qualifier */
>> +	u32 rw          :  1;           /* read/write */
>> +	u64 faddr;                      /* failing address */
>> +	u32 reserved3;
>> +	u16 reserved4;
>> +	u16 pec;                        /* PCI event code */
>> +} __packed;
>> +
>> +#define ZPCI_ERR_PENDING_MAX 16
> 16 pending error events sounds like a lot for a single devices. This
> also means that the array alone already spans more than 2 cache lines
> (256 byte on s390x). I can't imagine that we'd ever have that many
> errors pending. This is especially true since a device already in an
> error state would be the least likely to cause more errors. We have
> seen cases of 2 errors in the past, so maybe 4 would give us good head
> room?

Yeah, I wasn't sure how much headroom did we need. As you said having 
more than 2 is rare, so 4 should give us enough room.


>> +struct zpci_ccdf_pending {
>> +	size_t count;
>> +	int head;
>> +	int tail;
>> +	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
>> +};
>> +
>>   /* Private data per function */
>>   struct zpci_dev {
>>   	struct zpci_bus *zbus;
>> @@ -191,6 +216,8 @@ struct zpci_dev {
>>   	struct iommu_domain *s390_domain; /* attached IOMMU domain */
>>   	struct kvm_zdev *kzdev;
>>   	struct mutex kzdev_lock;
>> +	struct zpci_ccdf_pending pending_errs;
>> +	struct mutex pending_errs_lock;
>>   	spinlock_t dom_lock;		/* protect s390_domain change */
>>   };
>>
> --- snip ---
>
>> -
>>   /* Content Code Description for PCI Function Availability */
>>   struct zpci_ccdf_avail {
>>   	u32 reserved1;
>> @@ -76,6 +59,41 @@ static bool is_driver_supported(struct pci_driver *driver)
>>   	return true;
>>   }
>>   
>> +static void zpci_store_pci_error(struct pci_dev *pdev,
>> +				 struct zpci_ccdf_err *ccdf)
>> +{
>> +	struct zpci_dev *zdev = to_zpci(pdev);
>> +	int i;
>> +
>> +	mutex_lock(&zdev->pending_errs_lock);
>> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
>> +		pr_err("%s: Cannot store PCI error info for device",
>> +				pci_name(pdev));
>> +		mutex_unlock(&zdev->pending_errs_lock);
> I think the error message should state that the maximum number of
> pending error events has been queued. As with the ZPI_ERR_PENDING_MAX I
> really don't think we would reach this even at 4 vs 16 max pending but
> if we do I agree that having the first couple of errors saved is
> probably nice for analysis.

Ack, will change the error message.


>
>> +		return;
>> +	}
>> +
>> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
>> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
>> +	zdev->pending_errs.tail++;
>> +	zdev->pending_errs.count++;
> With tail being int this would be undefined behavior if it ever
> overflowed. Since the array is of fixed length that is always smaller
> than 256 how about making tail, head, and count u8. The memory saving
> doesn't matter but at least overflow becomes well defined.

Yeah, this makes sense, will update this.


>
>> +	mutex_unlock(&zdev->pending_errs_lock);
>> +}
>> +
>> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
>> +{
>> +	struct pci_dev *pdev = NULL;
>> +
>> +	mutex_lock(&zdev->pending_errs_lock);
>> +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
>> +	if (zdev->pending_errs.count)
>> +		pr_err("%s: Unhandled PCI error events count=%zu",
>> +				pci_name(pdev), zdev->pending_errs.count);
> I think this could be a zpci_dbg(). That way you also don't need the
> pci_get_slot() which is also buggy as it misses a pci_dev_put(). The
> message also doesn't seem useful for the user. As I understand it this
> would happen if a vfio-pci user dies without handling all the error
> events but then vfio-pci will also reset the slot on closing of the
> fds, no? So the device will get reset anyway.

Right, the device will reset anyway. But I wanted to at least give an 
indication to the user that some events were not handled correctly. 
Maybe pr_err is a little extreme, so can convert to a warn? This should 
be rare as well behaving applications shouldn't do this. I am fine with 
zpci_dbg as well, its just the kernel needs to be in debug mode for us 
to get this info.

>
>> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
> If this goes wrong and we subsequently crash or take a live memory dump
> I'd prefer to have bread crumbs such as the errors that weren't cleaned
> up. Wouldn't it be enough to just set the count to zero and for debug
> the original count will be in s390dbf.

I think setting count to zero should be enough, but I am wary about 
keeping stale state around. How about just logging the count that was 
not handled, in s390dbf? I think we already dump the ccdf in s390df if 
we get any error event. So it should be enough for us to trace back the 
unhandled error events?

> Also maybe it would make sense
> to pull the zdev->mediated_recovery clearing in here?

I would like to keep the mediated_recovery flag separate from just 
cleaning up the errors. The flag gets initialized when we open the vfio 
device and so having the flag cleared on close makes it easier to track 
this IMHO.


>
>> +	mutex_unlock(&zdev->pending_errs_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
>> +
>>   static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
>>   							 struct pci_driver *driver)
>>   {
>> @@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
>>    * and the platform determines which functions are affected for
>>    * multi-function devices.
>>    */
>> -static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>> +static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
>> +							  struct zpci_ccdf_err *ccdf)
>>   {
>>   	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
>>   	struct zpci_dev *zdev = to_zpci(pdev);
>> @@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>   	}
>>   	pdev->error_state = pci_channel_io_frozen;
>>   
>> -	if (needs_mediated_recovery(pdev)) {
>> -		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
>> -			pci_name(pdev));
>> -		status_str = "failed (pass-through)";
>> -		goto out_unlock;
>> -	}
>> -
>>   	driver = to_pci_driver(pdev->dev.driver);
>>   	if (!is_driver_supported(driver)) {
>>   		if (!driver) {
>> @@ -210,12 +222,22 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>   		goto out_unlock;
>>   	}
>>   
>> +	if (needs_mediated_recovery(pdev))
>> +		zpci_store_pci_error(pdev, ccdf);
>> +
>>   	ers_res = zpci_event_notify_error_detected(pdev, driver);
>>   	if (ers_result_indicates_abort(ers_res)) {
>>   		status_str = "failed (abort on detection)";
>>   		goto out_unlock;
>>   	}
>>   
>> +	if (needs_mediated_recovery(pdev)) {
>> +		pr_info("%s: Recovering passthrough device\n", pci_name(pdev));
> I'd say technically we're not recovering the device here but rather
> leaving it alone so user-space can take over the recovery. Maybe this
> could be made explicit in the message. Something like:
>
> ""%s: Leaving recovery of pass-through device to user-space\n"
>
Sure, will update

Thanks
Farhan


>
>> +		ers_res = PCI_ERS_RESULT_RECOVERED;
>> +		status_str = "in progress";
>> +		goto out_unlock;
>> +	}
>> +
>>   	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
>>   		ers_res = zpci_event_do_error_state_clear(pdev, driver);
>>   		if (ers_result_indicates_abort(ers_res)) {
>> @@ -258,25 +280,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>>    * @pdev: PCI function for which to report
>>    * @es: PCI channel failure state to report
>>    */
>> -static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
>> +static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
>> +				  struct zpci_ccdf_err *ccdf)
>>   {
>>   	struct pci_driver *driver;
>>   
>>   	pci_dev_lock(pdev);
>>   	pdev->error_state = es;
>> -	/**
>> -	 * While vfio-pci's error_detected callback notifies user-space QEMU
>> -	 * reacts to this by freezing the guest. In an s390 environment PCI
>> -	 * errors are rarely fatal so this is overkill. Instead in the future
>> -	 * we will inject the error event and let the guest recover the device
>> -	 * itself.
>> -	 */
>> +
>>   	if (needs_mediated_recovery(pdev))
>> -		goto out;
>> +		zpci_store_pci_error(pdev, ccdf);
>>   	driver = to_pci_driver(pdev->dev.driver);
>>   	if (driver && driver->err_handler && driver->err_handler->error_detected)
>>   		driver->err_handler->error_detected(pdev, pdev->error_state);
>> -out:
>> +
>>   	pci_dev_unlock(pdev);
>>   }
>>   
>> @@ -312,6 +329,7 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>>   	pr_err("%s: Event 0x%x reports an error for PCI function 0x%x\n",
>>   	       pdev ? pci_name(pdev) : "n/a", ccdf->pec, ccdf->fid);
>>   
>> +
>>   	if (!pdev)
> Nit, stray empty line.
>
>>   		goto no_pdev;
>>   
>> @@ -322,12 +340,13 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>>   		break;
>>   	case 0x0040: /* Service Action or Error Recovery Failed */
>>   	case 0x003b:
>> -		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
>> +		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>>   		break;
>>   	default: /* PCI function left in the error state attempt to recover */
>> -		ers_res = zpci_event_attempt_error_recovery(pdev);
>> +		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
>>   		if (ers_res != PCI_ERS_RESULT_RECOVERED)
>> -			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
>> +			zpci_event_io_failure(pdev, pci_channel_io_perm_failure,
>> +					ccdf);
>>   		break;
>>   	}
>>   	pci_dev_put(pdev);
>> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
>> index a7bc23ce8483..2be37eab9279 100644
>> --- a/drivers/vfio/pci/vfio_pci_zdev.c
>> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
>> @@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>>   
>>   	zdev->mediated_recovery = false;
>>   
>> +	zpci_cleanup_pending_errors(zdev);
>> +
>>   	if (!vdev->vdev.kvm)
>>   		return;
>>   

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-13  9:04   ` Alex Williamson
@ 2025-09-15 18:27     ` Farhan Ali
  0 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-15 18:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-s390, kvm, linux-kernel, linux-pci, helgaas, schnelle,
	mjrosato


On 9/13/2025 2:04 AM, Alex Williamson wrote:
> On Thu, 11 Sep 2025 11:33:05 -0700
> Farhan Ali <alifm@linux.ibm.com> wrote:
>
>> For zPCI devices, we have platform specific error information. The platform
>> firmware provides this error information to the operating system in an
>> architecture specific mechanism. To enable recovery from userspace for
>> these devices, we want to expose this error information to userspace. Add a
>> new device feature to expose this information.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_core.c |  2 ++
>>   drivers/vfio/pci/vfio_pci_priv.h |  8 ++++++++
>>   drivers/vfio/pci/vfio_pci_zdev.c | 34 ++++++++++++++++++++++++++++++++
>>   include/uapi/linux/vfio.h        | 14 +++++++++++++
>>   4 files changed, 58 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 7dcf5439dedc..378adb3226db 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1514,6 +1514,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>>   		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>>   	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>>   		return vfio_pci_core_feature_token(device, flags, arg, argsz);
>> +	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
>> +		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
>>   	default:
>>   		return -ENOTTY;
>>   	}
>> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
>> index a9972eacb293..a4a7f97fdc2e 100644
>> --- a/drivers/vfio/pci/vfio_pci_priv.h
>> +++ b/drivers/vfio/pci/vfio_pci_priv.h
>> @@ -86,6 +86,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>>   				struct vfio_info_cap *caps);
>>   int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
>>   void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
>> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
>> +			      void __user *arg, size_t argsz);
>>   #else
>>   static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>>   					      struct vfio_info_cap *caps)
>> @@ -100,6 +102,12 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>>   
>>   static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>>   {}
>> +
>> +static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
>> +				     void __user *arg, size_t argsz);
>> +{
>> +	return -ENODEV;
>> +}
>>   #endif
>>   
>>   static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
>> index 2be37eab9279..261954039aa9 100644
>> --- a/drivers/vfio/pci/vfio_pci_zdev.c
>> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
>> @@ -141,6 +141,40 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>>   	return ret;
>>   }
>>   
>> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
>> +			      void __user *arg, size_t argsz)
>> +{
>> +	struct vfio_device_feature_zpci_err err;
>> +	struct vfio_pci_core_device *vdev =
>> +		container_of(device, struct vfio_pci_core_device, vdev);
>> +	struct zpci_dev *zdev = to_zpci(vdev->pdev);
>> +	int ret;
>> +	int head = 0;
>> +
>> +	if (!zdev)
>> +		return -ENODEV;
>> +
>> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
>> +				 sizeof(err));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	mutex_lock(&zdev->pending_errs_lock);
>> +	if (zdev->pending_errs.count) {
>> +		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
>> +		err.pec = zdev->pending_errs.err[head].pec;
>> +		zdev->pending_errs.head++;
>> +		zdev->pending_errs.count--;
>> +		err.pending_errors = zdev->pending_errs.count;
>> +	}
>> +	mutex_unlock(&zdev->pending_errs_lock);
>> +
>> +	if (copy_to_user(arg, &err, sizeof(err)))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
>> +
>>   int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>>   {
>>   	struct zpci_dev *zdev = to_zpci(vdev->pdev);
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 75100bf009ba..a950c341602d 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1478,6 +1478,20 @@ struct vfio_device_feature_bus_master {
>>   };
>>   #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>>   
>> +/**
>> + * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
>> + * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
>> + * platform firmware and notification to operating system is done by
>> + * architecture specific mechanism.  Exposing this information to userspace
>> + * allows userspace to take appropriate actions to handle an error on the
>> + * device.
>> + */
>> +struct vfio_device_feature_zpci_err {
>> +	__u16 pec;
>> +	int pending_errors;
>> +};
> This should have some explicit alignment.  Thanks,
>
> Alex
>
Sure, would something like this be sufficient?

struct vfio_device_feature_zpci_err {
	__u16 pec;
	__u8 pending_errors;
	__u8 __pad;
};

Based on some discussion with Niklas (patch 7) we can reduce the 
pending_errors to u8.

Thanks
Farhan

>
>> +#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 11
>> +
>>   /* -------- API for Type1 VFIO IOMMU -------- */
>>   
>>   /**

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-15  6:26   ` Cédric Le Goater
@ 2025-09-15 18:27     ` Farhan Ali
  0 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-15 18:27 UTC (permalink / raw)
  To: Cédric Le Goater, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, schnelle, mjrosato


On 9/14/2025 11:26 PM, Cédric Le Goater wrote:
> On 9/11/25 20:33, Farhan Ali wrote:
>> For zPCI devices, we have platform specific error information. The 
>> platform
>> firmware provides this error information to the operating system in an
>> architecture specific mechanism. To enable recovery from userspace for
>> these devices, we want to expose this error information to userspace. 
>> Add a
>> new device feature to expose this information.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_core.c |  2 ++
>>   drivers/vfio/pci/vfio_pci_priv.h |  8 ++++++++
>>   drivers/vfio/pci/vfio_pci_zdev.c | 34 ++++++++++++++++++++++++++++++++
>>   include/uapi/linux/vfio.h        | 14 +++++++++++++
>>   4 files changed, 58 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c 
>> b/drivers/vfio/pci/vfio_pci_core.c
>> index 7dcf5439dedc..378adb3226db 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1514,6 +1514,8 @@ int vfio_pci_core_ioctl_feature(struct 
>> vfio_device *device, u32 flags,
>>           return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>>       case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>>           return vfio_pci_core_feature_token(device, flags, arg, argsz);
>> +    case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
>> +        return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
>>       default:
>>           return -ENOTTY;
>>       }
>> diff --git a/drivers/vfio/pci/vfio_pci_priv.h 
>> b/drivers/vfio/pci/vfio_pci_priv.h
>> index a9972eacb293..a4a7f97fdc2e 100644
>> --- a/drivers/vfio/pci/vfio_pci_priv.h
>> +++ b/drivers/vfio/pci/vfio_pci_priv.h
>> @@ -86,6 +86,8 @@ int vfio_pci_info_zdev_add_caps(struct 
>> vfio_pci_core_device *vdev,
>>                   struct vfio_info_cap *caps);
>>   int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
>>   void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
>> +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
>> +                  void __user *arg, size_t argsz);
>>   #else
>>   static inline int vfio_pci_info_zdev_add_caps(struct 
>> vfio_pci_core_device *vdev,
>>                             struct vfio_info_cap *caps)
>> @@ -100,6 +102,12 @@ static inline int 
>> vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>>     static inline void vfio_pci_zdev_close_device(struct 
>> vfio_pci_core_device *vdev)
>>   {}
>> +
>> +static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 
>> flags,
>> +                     void __user *arg, size_t argsz);
>
> The extra ';' breaks builds on non-Z platforms.
>
> C.

Thanks for catching this, will fix.

Thanks
Farhan


>
>> +{
>> +    return -ENODEV;
>> +}
>>   #endif
>>     static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c 
>> b/drivers/vfio/pci/vfio_pci_zdev.c
>> index 2be37eab9279..261954039aa9 100644
>> --- a/drivers/vfio/pci/vfio_pci_zdev.c
>> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
>> @@ -141,6 +141,40 @@ int vfio_pci_info_zdev_add_caps(struct 
>> vfio_pci_core_device *vdev,
>>       return ret;
>>   }
>>   +int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
>> +                  void __user *arg, size_t argsz)
>> +{
>> +    struct vfio_device_feature_zpci_err err;
>> +    struct vfio_pci_core_device *vdev =
>> +        container_of(device, struct vfio_pci_core_device, vdev);
>> +    struct zpci_dev *zdev = to_zpci(vdev->pdev);
>> +    int ret;
>> +    int head = 0;
>> +
>> +    if (!zdev)
>> +        return -ENODEV;
>> +
>> +    ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
>> +                 sizeof(err));
>> +    if (ret != 1)
>> +        return ret;
>> +
>> +    mutex_lock(&zdev->pending_errs_lock);
>> +    if (zdev->pending_errs.count) {
>> +        head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
>> +        err.pec = zdev->pending_errs.err[head].pec;
>> +        zdev->pending_errs.head++;
>> +        zdev->pending_errs.count--;
>> +        err.pending_errors = zdev->pending_errs.count;
>> +    }
>> +    mutex_unlock(&zdev->pending_errs_lock);
>> +
>> +    if (copy_to_user(arg, &err, sizeof(err)))
>> +        return -EFAULT;
>> +
>> +    return 0;
>> +}
>> +
>>   int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
>>   {
>>       struct zpci_dev *zdev = to_zpci(vdev->pdev);
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 75100bf009ba..a950c341602d 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1478,6 +1478,20 @@ struct vfio_device_feature_bus_master {
>>   };
>>   #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>>   +/**
>> + * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error 
>> information to
>> + * userspace for vfio-pci devices on s390x. On s390x PCI error 
>> recovery involves
>> + * platform firmware and notification to operating system is done by
>> + * architecture specific mechanism.  Exposing this information to 
>> userspace
>> + * allows userspace to take appropriate actions to handle an error 
>> on the
>> + * device.
>> + */
>> +struct vfio_device_feature_zpci_err {
>> +    __u16 pec;
>> +    int pending_errors;
>> +};
>> +#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 11
>> +
>>   /* -------- API for Type1 VFIO IOMMU -------- */
>>     /**
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-11 18:33 ` [PATCH v3 03/10] PCI: Allow per function PCI slots Farhan Ali
  2025-09-12 12:23   ` Benjamin Block
@ 2025-09-16  6:52   ` Cédric Le Goater
  2025-09-16 18:37     ` Farhan Ali
  1 sibling, 1 reply; 35+ messages in thread
From: Cédric Le Goater @ 2025-09-16  6:52 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, schnelle, mjrosato

Hello Ali,

On 9/11/25 20:33, Farhan Ali wrote:
> On s390 systems, which use a machine level hypervisor, PCI devices are
> always accessed through a form of PCI pass-through which fundamentally
> operates on a per PCI function granularity. This is also reflected in the
> s390 PCI hotplug driver which creates hotplug slots for individual PCI
> functions. Its reset_slot() function, which is a wrapper for
> zpci_hot_reset_device(), thus also resets individual functions.
> 
> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
> to multifunction devices. This approach worked fine on s390 systems that
> only exposed virtual functions as individual PCI domains to the operating
> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> s390 supports exposing the topology of multifunction PCI devices by
> grouping them in a shared PCI domain. When attempting to reset a function
> through the hotplug driver, the shared slot assignment causes the wrong
> function to be reset instead of the intended one. It also leaks memory as
> we do create a pci_slot object for the function, but don't correctly free
> it in pci_slot_release().
> 
> Add a flag for struct pci_slot to allow per function PCI slots for
> functions managed through a hypervisor, which exposes individual PCI
> functions while retaining the topology.
> 
> Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>   drivers/pci/hotplug/s390_pci_hpc.c | 10 ++++++++--
>   drivers/pci/pci.c                  |  4 +++-
>   drivers/pci/slot.c                 | 14 +++++++++++---
>   include/linux/pci.h                |  1 +
>   4 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/s390_pci_hpc.c b/drivers/pci/hotplug/s390_pci_hpc.c
> index d9996516f49e..8b547de464bf 100644
> --- a/drivers/pci/hotplug/s390_pci_hpc.c
> +++ b/drivers/pci/hotplug/s390_pci_hpc.c
> @@ -126,14 +126,20 @@ static const struct hotplug_slot_ops s390_hotplug_slot_ops = {
>   
>   int zpci_init_slot(struct zpci_dev *zdev)
>   {
> +	int ret;
>   	char name[SLOT_NAME_SIZE];
>   	struct zpci_bus *zbus = zdev->zbus;
>   
>   	zdev->hotplug_slot.ops = &s390_hotplug_slot_ops;
>   
>   	snprintf(name, SLOT_NAME_SIZE, "%08x", zdev->fid);
> -	return pci_hp_register(&zdev->hotplug_slot, zbus->bus,
> -			       zdev->devfn, name);
> +	ret = pci_hp_register(&zdev->hotplug_slot, zbus->bus,
> +				zdev->devfn, name);
> +	if (ret)
> +		return ret;
> +
> +	zdev->hotplug_slot.pci_slot->per_func_slot = 1;
> +	return 0;
>   }
>   
>   void zpci_exit_slot(struct zpci_dev *zdev)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 3994fa82df68..70296d3b1cfc 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5061,7 +5061,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
>   
>   static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
>   {
> -	if (dev->multifunction || dev->subordinate || !dev->slot ||
> +	if (dev->multifunction && !dev->slot->per_func_slot)
> +		return -ENOTTY;
> +	if (dev->subordinate || !dev->slot ||
>   	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
>   		return -ENOTTY;
>   
> diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
> index 50fb3eb595fe..51ee59e14393 100644
> --- a/drivers/pci/slot.c
> +++ b/drivers/pci/slot.c
> @@ -63,6 +63,14 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
>   	return bus_speed_read(slot->bus->cur_bus_speed, buf);
>   }
>   
> +static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
> +{
> +	if (slot->per_func_slot)
> +		return dev->devfn == slot->number;
> +
> +	return PCI_SLOT(dev->devfn) == slot->number;
> +}
> +
>   static void pci_slot_release(struct kobject *kobj)
>   {
>   	struct pci_dev *dev;
> @@ -73,7 +81,7 @@ static void pci_slot_release(struct kobject *kobj)
>   
>   	down_read(&pci_bus_sem);
>   	list_for_each_entry(dev, &slot->bus->devices, bus_list)
> -		if (PCI_SLOT(dev->devfn) == slot->number)
> +		if (pci_dev_matches_slot(dev, slot))
>   			dev->slot = NULL;
>   	up_read(&pci_bus_sem);
>   
> @@ -166,7 +174,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
>   
>   	mutex_lock(&pci_slot_mutex);
>   	list_for_each_entry(slot, &dev->bus->slots, list)
> -		if (PCI_SLOT(dev->devfn) == slot->number)
> +		if (pci_dev_matches_slot(dev, slot))
>   			dev->slot = slot;
>   	mutex_unlock(&pci_slot_mutex);
>   }
> @@ -285,7 +293,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
>   
>   	down_read(&pci_bus_sem);
>   	list_for_each_entry(dev, &parent->devices, bus_list)
> -		if (PCI_SLOT(dev->devfn) == slot_nr)
> +		if (pci_dev_matches_slot(dev, slot))
>   			dev->slot = slot;
>   	up_read(&pci_bus_sem);
>   
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 59876de13860..9265f32d9786 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -78,6 +78,7 @@ struct pci_slot {
>   	struct list_head	list;		/* Node in list of slots */
>   	struct hotplug_slot	*hotplug;	/* Hotplug info (move here) */
>   	unsigned char		number;		/* PCI_SLOT(pci_dev->devfn) */
> +	unsigned int		per_func_slot:1; /* Allow per function slot */
>   	struct kobject		kobj;
>   };
>   

This change generates a kernel oops on x86_64. It can be reproduced in a VM.


C.

[    3.073990] BUG: kernel NULL pointer dereference, address: 0000000000000021
[    3.074976] #PF: supervisor read access in kernel mode
[    3.074976] #PF: error_code(0x0000) - not-present page
[    3.074976] PGD 0 P4D 0
[    3.074976] Oops: Oops: 0000 [#1] SMP NOPTI
[    3.074976] CPU: 18 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc6-clg-dirty #8 PREEMPT(voluntary)
[    3.074976] Hardware name: Supermicro Super Server/X13SAE-F, BIOS 4.2 12/17/2024
[    3.074976] RIP: 0010:pci_reset_bus_function+0xdf/0x160
[    3.074976] Code: 4e 08 00 00 40 0f 85 83 00 00 00 48 8b 78 18 e8 27 9d ff ff 83 f8 e7 74 17 48 83 c4 08 5b 5d 41 5c c3 cc cc cc cc 48 8b 43 30 <f6> 40 21 01 75 b6 48 8b 53 10 48 83 7a 10 00 74 5e 48 83 7b 18 00
[    3.074976] RSP: 0000:ffffcd808007b9a8 EFLAGS: 00010202
[    3.074976] RAX: 0000000000000000 RBX: ffff88c4019b8000 RCX: 0000000000000000
[    3.074976] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88c4019b8000
[    3.074976] RBP: 0000000000000001 R08: 0000000000000002 R09: ffffcd808007b99c
[    3.074976] R10: ffffcd808007b950 R11: 0000000000000000 R12: 0000000000000001
[    3.074976] R13: ffff88c4019b80c8 R14: ffff88c401a7e028 R15: ffff88c401a73400
[    3.074976] FS:  0000000000000000(0000) GS:ffff88d38aad5000(0000) knlGS:0000000000000000
[    3.074976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.074976] CR2: 0000000000000021 CR3: 0000000f66222001 CR4: 0000000000770ef0
[    3.074976] PKRU: 55555554
[    3.074976] Call Trace:
[    3.074976]  <TASK>
[    3.074976]  ? pci_pm_reset+0x39/0x180
[    3.074976]  pci_init_reset_methods+0x52/0x80
[    3.074976]  pci_device_add+0x215/0x5d0
[    3.074976]  pci_scan_single_device+0xa2/0xe0
[    3.074976]  pci_scan_slot+0x66/0x1c0
[    3.074976]  ? klist_next+0x145/0x150
[    3.074976]  pci_scan_child_bus_extend+0x3a/0x290
[    3.074976]  acpi_pci_root_create+0x236/0x2a0
[    3.074976]  pci_acpi_scan_root+0x19b/0x1f0
[    3.074976]  acpi_pci_root_add+0x1a5/0x370
[    3.074976]  acpi_bus_attach+0x1a8/0x290
[    3.074976]  ? __pfx_acpi_dev_for_one_check+0x10/0x10
[    3.074976]  device_for_each_child+0x4b/0x80
[    3.074976]  acpi_dev_for_each_child+0x28/0x40
[    3.074976]  ? __pfx_acpi_bus_attach+0x10/0x10
[    3.074976]  acpi_bus_attach+0x7a/0x290
[    3.074976]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[    3.074976]  ? __pfx_acpi_dev_for_one_check+0x10/0x10
[    3.074976]  device_for_each_child+0x4b/0x80
[    3.074976]  acpi_dev_for_each_child+0x28/0x40
[    3.074976]  ? __pfx_acpi_bus_attach+0x10/0x10
[    3.074976]  acpi_bus_attach+0x7a/0x290
[    3.074976]  acpi_bus_scan+0x6a/0x1c0
[    3.074976]  ? __pfx_acpi_init+0x10/0x10
[    3.074976]  acpi_scan_init+0xdc/0x280
[    3.074976]  ? __pfx_acpi_init+0x10/0x10
[    3.074976]  acpi_init+0x218/0x530
[    3.074976]  do_one_initcall+0x40/0x310
[    3.074976]  kernel_init_freeable+0x2fe/0x450
[    3.074976]  ? __pfx_kernel_init+0x10/0x10
[    3.074976]  kernel_init+0x16/0x1d0
[    3.074976]  ret_from_fork+0x1ab/0x1e0
[    3.074976]  ? __pfx_kernel_init+0x10/0x10
[    3.074976]  ret_from_fork_asm+0x1a/0x30
[    3.074976]  </TASK>
[    3.074976] Modules linked in:
[    3.074976] CR2: 0000000000000021
[    3.074976] ---[ end trace 0000000000000000 ]---
[    3.074976] RIP: 0010:pci_reset_bus_function+0xdf/0x160


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-15 18:12     ` Farhan Ali
@ 2025-09-16 10:54       ` Niklas Schnelle
  0 siblings, 0 replies; 35+ messages in thread
From: Niklas Schnelle @ 2025-09-16 10:54 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato

On Mon, 2025-09-15 at 11:12 -0700, Farhan Ali wrote:
> On 9/15/2025 4:42 AM, Niklas Schnelle wrote:
> > On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
> > > For a passthrough device we need co-operation from user space to recover
> > > the device. This would require to bubble up any error information to user
> > > space.  Let's store this error information for passthrough devices, so it
> > > can be retrieved later.
> > > 
> > > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > > ---
> > > 
--- snip ---
> > > +	mutex_unlock(&zdev->pending_errs_lock);
> > > +}
> > > +
> > > +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
> > > +{
> > > +	struct pci_dev *pdev = NULL;
> > > +
> > > +	mutex_lock(&zdev->pending_errs_lock);
> > > +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
> > > +	if (zdev->pending_errs.count)
> > > +		pr_err("%s: Unhandled PCI error events count=%zu",
> > > +				pci_name(pdev), zdev->pending_errs.count);
> > I think this could be a zpci_dbg(). That way you also don't need the
> > pci_get_slot() which is also buggy as it misses a pci_dev_put(). The
> > message also doesn't seem useful for the user. As I understand it this
> > would happen if a vfio-pci user dies without handling all the error
> > events but then vfio-pci will also reset the slot on closing of the
> > fds, no? So the device will get reset anyway.
> 
> Right, the device will reset anyway. But I wanted to at least give an 
> indication to the user that some events were not handled correctly. 
> Maybe pr_err is a little extreme, so can convert to a warn? This should 
> be rare as well behaving applications shouldn't do this. I am fine with 
> zpci_dbg as well, its just the kernel needs to be in debug mode for us 
> to get this info.

No, zpci_dbg() logs to /sys/kernel/debug/s390dbf/pci_msg/sprintf
without need for debug mode. I'm also ok with a pr_warn() or maybe even
pr_info(). I can see your argument that this may be useful to have in
dmesg e.g. when debugging a user-space driver without having to know
about s390 specific debug aids.

> 
> > 
> > > +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
> > If this goes wrong and we subsequently crash or take a live memory dump
> > I'd prefer to have bread crumbs such as the errors that weren't cleaned
> > up. Wouldn't it be enough to just set the count to zero and for debug
> > the original count will be in s390dbf.
> 
> I think setting count to zero should be enough, but I am wary about 
> keeping stale state around. How about just logging the count that was 
> not handled, in s390dbf? I think we already dump the ccdf in s390df if 
> we get any error event. So it should be enough for us to trace back the 
> unhandled error events?
> 
> > Also maybe it would make sense
> > to pull the zdev->mediated_recovery clearing in here?
> 
> I would like to keep the mediated_recovery flag separate from just 
> cleaning up the errors. The flag gets initialized when we open the vfio 
> device and so having the flag cleared on close makes it easier to track 
> this IMHO.

Ok yeah I can see the symmetry argument.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device
  2025-09-15 17:42     ` Farhan Ali
@ 2025-09-16 10:59       ` Niklas Schnelle
  0 siblings, 0 replies; 35+ messages in thread
From: Niklas Schnelle @ 2025-09-16 10:59 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato

On Mon, 2025-09-15 at 10:42 -0700, Farhan Ali wrote:
> On 9/15/2025 1:39 AM, Niklas Schnelle wrote:
> > On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
> > > Commit c1e18c17bda6 ("s390/pci: add zpci_set_irq()/zpci_clear_irq()"),
> > > introduced the zpci_set_irq() and zpci_clear_irq(), to be used while
> > > resetting a zPCI device.
> > > 
> > > Commit da995d538d3a ("s390/pci: implement reset_slot for hotplug slot"),
> > > mentions zpci_clear_irq() being called in the path for zpci_hot_reset_device().
> > > But that is not the case anymore and these functions are not called
> > > outside of this file.
> > If you're doing another version I think you could add a bit more
> > information on why this still works for existing recovery based on my
> > investigation in
> > https://lore.kernel.org/lkml/052ebdbb6f2d38025ca4345ee51e4857e19bb0e4.camel@linux.ibm.com/
> > 
> > Even if you don't add more explanations, I'd tend to just drop the
> > above paragraph as it doesn't seem relevant and sounds like
> > zpci_hot_reset_device() doesn't clear IRQs. As explained in the linked
> > mail there really is no need to call zpci_clear_irq() in
> > zpci_hot_reset_device() as the CLP disable does disable IRQs. It's
> > really only the state tracking that can get screwed up but is also fine
> > for drivers which end up doing the tear down.
> 
> I referenced commit da995d538d3a as that commit introduced the 
> arch_restore_msi_irqs and describes the reasoning as to why we need it. 
> It also mentions about zpci_clear_irq being called by 
> zpci_hot_reset_device. IMHO the message was confusing as it took me my 
> down the path of trying to identify any commit that changed the behavior 
> since da995d538d3a. But that wasn't the case and it was an error in the 
> commit message. I want to keep a reference here to at least clarify that.

Ok that makes sense, maybe you could add asentence like:
"… these functions are not called outside of this file. Instead
zpci_hot_reset_device() relies on zpci_disable_device() also clearing
the IRQs, but misses to reset the zdev->irqs_registered flag.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] PCI: Avoid saving error values for config space
  2025-09-11 18:32 ` [PATCH v3 01/10] PCI: Avoid saving error values for config space Farhan Ali
  2025-09-13  8:27   ` Alex Williamson
@ 2025-09-16 18:09   ` Bjorn Helgaas
  2025-09-16 20:00     ` Farhan Ali
  1 sibling, 1 reply; 35+ messages in thread
From: Bjorn Helgaas @ 2025-09-16 18:09 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	schnelle, mjrosato

On Thu, Sep 11, 2025 at 11:32:58AM -0700, Farhan Ali wrote:
> The current reset process saves the device's config space state before
> reset and restores it afterward. However, when a device is in an error
> state before reset, config space reads may return error values instead of
> valid data. This results in saving corrupted values that get written back
> to the device during state restoration.
> 
> Avoid saving the state of the config space when the device is in error.
> While restoring we only restorei the state that can be restored through
> kernel data such as BARs or doesn't depend on the saved state.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/pci/pci.c      | 29 ++++++++++++++++++++++++++---
>  drivers/pci/pcie/aer.c |  5 +++++
>  drivers/pci/pcie/dpc.c |  5 +++++
>  drivers/pci/pcie/ptm.c |  5 +++++
>  drivers/pci/tph.c      |  5 +++++
>  drivers/pci/vc.c       |  5 +++++
>  6 files changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0f4d98036cd..4b67d22faf0a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1720,6 +1720,11 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
>  	struct pci_cap_saved_state *save_state;
>  	u16 *cap;
>  
> +	if (!dev->state_saved) {
> +		pci_warn(dev, "Not restoring pcie state, no saved state");
> +		return;

Seems like a lot of messages.  If we want to warn about this, why
don't we do it once in pci_restore_state()?

I guess you're making some judgment about what things can be restored
even when !dev->state_saved.  That seems kind of hard to maintain in
the future as other capabilities are added.

Also seems sort of questionable if we restore partial state and keep
using the device as if all is well.  Won't the device be in some kind
of inconsistent, unpredictable state then?

Bjorn

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-16  6:52   ` Cédric Le Goater
@ 2025-09-16 18:37     ` Farhan Ali
  2025-09-17  6:21       ` Cédric Le Goater
  0 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-16 18:37 UTC (permalink / raw)
  To: Cédric Le Goater, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, schnelle, mjrosato


On 9/15/2025 11:52 PM, Cédric Le Goater wrote:
> Hello Ali,
>
> On 9/11/25 20:33, Farhan Ali wrote:
>> On s390 systems, which use a machine level hypervisor, PCI devices are
>> always accessed through a form of PCI pass-through which fundamentally
>> operates on a per PCI function granularity. This is also reflected in 
>> the
>> s390 PCI hotplug driver which creates hotplug slots for individual PCI
>> functions. Its reset_slot() function, which is a wrapper for
>> zpci_hot_reset_device(), thus also resets individual functions.
>>
>> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot 
>> object
>> to multifunction devices. This approach worked fine on s390 systems that
>> only exposed virtual functions as individual PCI domains to the 
>> operating
>> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
>> s390 supports exposing the topology of multifunction PCI devices by
>> grouping them in a shared PCI domain. When attempting to reset a 
>> function
>> through the hotplug driver, the shared slot assignment causes the wrong
>> function to be reset instead of the intended one. It also leaks 
>> memory as
>> we do create a pci_slot object for the function, but don't correctly 
>> free
>> it in pci_slot_release().
>>
>> Add a flag for struct pci_slot to allow per function PCI slots for
>> functions managed through a hypervisor, which exposes individual PCI
>> functions while retaining the topology.
>>
>> Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
>> Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/pci/hotplug/s390_pci_hpc.c | 10 ++++++++--
>>   drivers/pci/pci.c                  |  4 +++-
>>   drivers/pci/slot.c                 | 14 +++++++++++---
>>   include/linux/pci.h                |  1 +
>>   4 files changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/pci/hotplug/s390_pci_hpc.c 
>> b/drivers/pci/hotplug/s390_pci_hpc.c
>> index d9996516f49e..8b547de464bf 100644
>> --- a/drivers/pci/hotplug/s390_pci_hpc.c
>> +++ b/drivers/pci/hotplug/s390_pci_hpc.c
>> @@ -126,14 +126,20 @@ static const struct hotplug_slot_ops 
>> s390_hotplug_slot_ops = {
>>     int zpci_init_slot(struct zpci_dev *zdev)
>>   {
>> +    int ret;
>>       char name[SLOT_NAME_SIZE];
>>       struct zpci_bus *zbus = zdev->zbus;
>>         zdev->hotplug_slot.ops = &s390_hotplug_slot_ops;
>>         snprintf(name, SLOT_NAME_SIZE, "%08x", zdev->fid);
>> -    return pci_hp_register(&zdev->hotplug_slot, zbus->bus,
>> -                   zdev->devfn, name);
>> +    ret = pci_hp_register(&zdev->hotplug_slot, zbus->bus,
>> +                zdev->devfn, name);
>> +    if (ret)
>> +        return ret;
>> +
>> +    zdev->hotplug_slot.pci_slot->per_func_slot = 1;
>> +    return 0;
>>   }
>>     void zpci_exit_slot(struct zpci_dev *zdev)
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 3994fa82df68..70296d3b1cfc 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5061,7 +5061,9 @@ static int pci_reset_hotplug_slot(struct 
>> hotplug_slot *hotplug, bool probe)
>>     static int pci_dev_reset_slot_function(struct pci_dev *dev, bool 
>> probe)
>>   {
>> -    if (dev->multifunction || dev->subordinate || !dev->slot ||
>> +    if (dev->multifunction && !dev->slot->per_func_slot)
>> +        return -ENOTTY;
>> +    if (dev->subordinate || !dev->slot ||
>>           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
>>           return -ENOTTY;
>>   diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
>> index 50fb3eb595fe..51ee59e14393 100644
>> --- a/drivers/pci/slot.c
>> +++ b/drivers/pci/slot.c
>> @@ -63,6 +63,14 @@ static ssize_t cur_speed_read_file(struct pci_slot 
>> *slot, char *buf)
>>       return bus_speed_read(slot->bus->cur_bus_speed, buf);
>>   }
>>   +static bool pci_dev_matches_slot(struct pci_dev *dev, struct 
>> pci_slot *slot)
>> +{
>> +    if (slot->per_func_slot)
>> +        return dev->devfn == slot->number;
>> +
>> +    return PCI_SLOT(dev->devfn) == slot->number;
>> +}
>> +
>>   static void pci_slot_release(struct kobject *kobj)
>>   {
>>       struct pci_dev *dev;
>> @@ -73,7 +81,7 @@ static void pci_slot_release(struct kobject *kobj)
>>         down_read(&pci_bus_sem);
>>       list_for_each_entry(dev, &slot->bus->devices, bus_list)
>> -        if (PCI_SLOT(dev->devfn) == slot->number)
>> +        if (pci_dev_matches_slot(dev, slot))
>>               dev->slot = NULL;
>>       up_read(&pci_bus_sem);
>>   @@ -166,7 +174,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
>>         mutex_lock(&pci_slot_mutex);
>>       list_for_each_entry(slot, &dev->bus->slots, list)
>> -        if (PCI_SLOT(dev->devfn) == slot->number)
>> +        if (pci_dev_matches_slot(dev, slot))
>>               dev->slot = slot;
>>       mutex_unlock(&pci_slot_mutex);
>>   }
>> @@ -285,7 +293,7 @@ struct pci_slot *pci_create_slot(struct pci_bus 
>> *parent, int slot_nr,
>>         down_read(&pci_bus_sem);
>>       list_for_each_entry(dev, &parent->devices, bus_list)
>> -        if (PCI_SLOT(dev->devfn) == slot_nr)
>> +        if (pci_dev_matches_slot(dev, slot))
>>               dev->slot = slot;
>>       up_read(&pci_bus_sem);
>>   diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 59876de13860..9265f32d9786 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -78,6 +78,7 @@ struct pci_slot {
>>       struct list_head    list;        /* Node in list of slots */
>>       struct hotplug_slot    *hotplug;    /* Hotplug info (move here) */
>>       unsigned char        number;        /* PCI_SLOT(pci_dev->devfn) */
>> +    unsigned int        per_func_slot:1; /* Allow per function slot */
>>       struct kobject        kobj;
>>   };
>
> This change generates a kernel oops on x86_64. It can be reproduced in 
> a VM.
>
>
> C.
>
> [    3.073990] BUG: kernel NULL pointer dereference, address: 
> 0000000000000021
> [    3.074976] #PF: supervisor read access in kernel mode
> [    3.074976] #PF: error_code(0x0000) - not-present page
> [    3.074976] PGD 0 P4D 0
> [    3.074976] Oops: Oops: 0000 [#1] SMP NOPTI
> [    3.074976] CPU: 18 UID: 0 PID: 1 Comm: swapper/0 Not tainted 
> 6.17.0-rc6-clg-dirty #8 PREEMPT(voluntary)
> [    3.074976] Hardware name: Supermicro Super Server/X13SAE-F, BIOS 
> 4.2 12/17/2024
> [    3.074976] RIP: 0010:pci_reset_bus_function+0xdf/0x160
> [    3.074976] Code: 4e 08 00 00 40 0f 85 83 00 00 00 48 8b 78 18 e8 
> 27 9d ff ff 83 f8 e7 74 17 48 83 c4 08 5b 5d 41 5c c3 cc cc cc cc 48 
> 8b 43 30 <f6> 40 21 01 75 b6 48 8b 53 10 48 83 7a 10 00 74 5e 48 83 7b 
> 18 00
> [    3.074976] RSP: 0000:ffffcd808007b9a8 EFLAGS: 00010202
> [    3.074976] RAX: 0000000000000000 RBX: ffff88c4019b8000 RCX: 
> 0000000000000000
> [    3.074976] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 
> ffff88c4019b8000
> [    3.074976] RBP: 0000000000000001 R08: 0000000000000002 R09: 
> ffffcd808007b99c
> [    3.074976] R10: ffffcd808007b950 R11: 0000000000000000 R12: 
> 0000000000000001
> [    3.074976] R13: ffff88c4019b80c8 R14: ffff88c401a7e028 R15: 
> ffff88c401a73400
> [    3.074976] FS:  0000000000000000(0000) GS:ffff88d38aad5000(0000) 
> knlGS:0000000000000000
> [    3.074976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    3.074976] CR2: 0000000000000021 CR3: 0000000f66222001 CR4: 
> 0000000000770ef0
> [    3.074976] PKRU: 55555554
> [    3.074976] Call Trace:
> [    3.074976]  <TASK>
> [    3.074976]  ? pci_pm_reset+0x39/0x180
> [    3.074976]  pci_init_reset_methods+0x52/0x80
> [    3.074976]  pci_device_add+0x215/0x5d0
> [    3.074976]  pci_scan_single_device+0xa2/0xe0
> [    3.074976]  pci_scan_slot+0x66/0x1c0
> [    3.074976]  ? klist_next+0x145/0x150
> [    3.074976]  pci_scan_child_bus_extend+0x3a/0x290
> [    3.074976]  acpi_pci_root_create+0x236/0x2a0
> [    3.074976]  pci_acpi_scan_root+0x19b/0x1f0
> [    3.074976]  acpi_pci_root_add+0x1a5/0x370
> [    3.074976]  acpi_bus_attach+0x1a8/0x290
> [    3.074976]  ? __pfx_acpi_dev_for_one_check+0x10/0x10
> [    3.074976]  device_for_each_child+0x4b/0x80
> [    3.074976]  acpi_dev_for_each_child+0x28/0x40
> [    3.074976]  ? __pfx_acpi_bus_attach+0x10/0x10
> [    3.074976]  acpi_bus_attach+0x7a/0x290
> [    3.074976]  ? _raw_spin_unlock_irqrestore+0x23/0x40
> [    3.074976]  ? __pfx_acpi_dev_for_one_check+0x10/0x10
> [    3.074976]  device_for_each_child+0x4b/0x80
> [    3.074976]  acpi_dev_for_each_child+0x28/0x40
> [    3.074976]  ? __pfx_acpi_bus_attach+0x10/0x10
> [    3.074976]  acpi_bus_attach+0x7a/0x290
> [    3.074976]  acpi_bus_scan+0x6a/0x1c0
> [    3.074976]  ? __pfx_acpi_init+0x10/0x10
> [    3.074976]  acpi_scan_init+0xdc/0x280
> [    3.074976]  ? __pfx_acpi_init+0x10/0x10
> [    3.074976]  acpi_init+0x218/0x530
> [    3.074976]  do_one_initcall+0x40/0x310
> [    3.074976]  kernel_init_freeable+0x2fe/0x450
> [    3.074976]  ? __pfx_kernel_init+0x10/0x10
> [    3.074976]  kernel_init+0x16/0x1d0
> [    3.074976]  ret_from_fork+0x1ab/0x1e0
> [    3.074976]  ? __pfx_kernel_init+0x10/0x10
> [    3.074976]  ret_from_fork_asm+0x1a/0x30
> [    3.074976]  </TASK>
> [    3.074976] Modules linked in:
> [    3.074976] CR2: 0000000000000021
> [    3.074976] ---[ end trace 0000000000000000 ]---
> [    3.074976] RIP: 0010:pci_reset_bus_function+0xdf/0x160
>
Hi Cedric,

Thanks for pointing this out. I missed that dev->slot could be NULL and 
so the per_func_slot check should be done after the check for 
!dev->slot. I tried this change on top of the patch in an x86_64 VM and 
was able to boot the VM without the oops.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 70296d3b1cfc..3631f7faa0cf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5061,10 +5061,9 @@ static int pci_reset_hotplug_slot(struct 
hotplug_slot *hotplug, bool probe)

  static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
  {
-       if (dev->multifunction && !dev->slot->per_func_slot)
-               return -ENOTTY;
         if (dev->subordinate || !dev->slot ||
-           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
+           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
+           (dev->multifunction && !dev->slot->per_func_slot))
                 return -ENOTTY;



Thanks
Farhan


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] PCI: Avoid saving error values for config space
  2025-09-16 18:09   ` Bjorn Helgaas
@ 2025-09-16 20:00     ` Farhan Ali
  2025-09-19 18:17       ` Alex Williamson
  0 siblings, 1 reply; 35+ messages in thread
From: Farhan Ali @ 2025-09-16 20:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	schnelle, mjrosato


On 9/16/2025 11:09 AM, Bjorn Helgaas wrote:
> On Thu, Sep 11, 2025 at 11:32:58AM -0700, Farhan Ali wrote:
>> The current reset process saves the device's config space state before
>> reset and restores it afterward. However, when a device is in an error
>> state before reset, config space reads may return error values instead of
>> valid data. This results in saving corrupted values that get written back
>> to the device during state restoration.
>>
>> Avoid saving the state of the config space when the device is in error.
>> While restoring we only restorei the state that can be restored through
>> kernel data such as BARs or doesn't depend on the saved state.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/pci/pci.c      | 29 ++++++++++++++++++++++++++---
>>   drivers/pci/pcie/aer.c |  5 +++++
>>   drivers/pci/pcie/dpc.c |  5 +++++
>>   drivers/pci/pcie/ptm.c |  5 +++++
>>   drivers/pci/tph.c      |  5 +++++
>>   drivers/pci/vc.c       |  5 +++++
>>   6 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index b0f4d98036cd..4b67d22faf0a 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1720,6 +1720,11 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
>>   	struct pci_cap_saved_state *save_state;
>>   	u16 *cap;
>>   
>> +	if (!dev->state_saved) {
>> +		pci_warn(dev, "Not restoring pcie state, no saved state");
>> +		return;
Hi Bjorn

Thanks for taking a look.

> Seems like a lot of messages.  If we want to warn about this, why
> don't we do it once in pci_restore_state()?

I thought providing messages about which state is not restored would be 
better and meaningful as we try to restore some of the state. But if the 
preference is to just have a single warn message in pci_restore_state 
then I can update it. (would also like to hear if Alex has any 
objections to that)

>
> I guess you're making some judgment about what things can be restored
> even when !dev->state_saved.  That seems kind of hard to maintain in
> the future as other capabilities are added.
>
> Also seems sort of questionable if we restore partial state and keep
> using the device as if all is well.  Won't the device be in some kind
> of inconsistent, unpredictable state then?
>
> Bjorn

I tried to avoid restoring state that explicitly needed to save the 
state. For some of the other capabilities, that didn't explicitly store 
the state, I tried to keep the same behavior. This is based on the 
discussion with Alex 
(https://lore.kernel.org/all/20250826094845.517e0fa7.alex.williamson@redhat.com/). 
Also AFAIU currently the dev->state_saved is set to true as long as we 
save the first 64 bytes of config space (pci_save_state), so we could 
for example fail to save the PCIe state, but while restoring can 
continue to restore other capabilities like pasid.

At the very least I would like to avoid corrupting the BAR registers and 
restore msix (arch_restore_msi_irqs) to get devices into a functional 
state after a reset. I am open to suggestions on how we can do this.

Would also like to get your feedback on patch 3 and the approach there 
of having a new flag in struct pci_slot.

Thanks
Farhan


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-16 18:37     ` Farhan Ali
@ 2025-09-17  6:21       ` Cédric Le Goater
  2025-09-17 17:50         ` Farhan Ali
  0 siblings, 1 reply; 35+ messages in thread
From: Cédric Le Goater @ 2025-09-17  6:21 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, schnelle, mjrosato

Hi Farhan,

> Hi Cedric,
> 
> Thanks for pointing this out. I missed that dev->slot could be NULL and so the per_func_slot check should be done after the check for !dev->slot. I tried this change on top of the patch in an x86_64 VM and was able to boot the VM without the oops.
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 70296d3b1cfc..3631f7faa0cf 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5061,10 +5061,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
> 
>   static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
>   {
> -       if (dev->multifunction && !dev->slot->per_func_slot)
> -               return -ENOTTY;
>          if (dev->subordinate || !dev->slot ||
> -           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> +           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
> +           (dev->multifunction && !dev->slot->per_func_slot))
>                  return -ENOTTY;
All good.

I have pushed the Linux branch I use for vfio :
  
    https://github.com/legoater/linux/commits/vfio/

These commits have small changes :

     PCI: Allow per function PCI slots
     vfio-pci/zdev: Add a device feature for error information

Thanks,

C.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-11 18:33 ` [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2025-09-17 14:48   ` Niklas Schnelle
  2025-09-17 17:22     ` Farhan Ali
  0 siblings, 1 reply; 35+ messages in thread
From: Niklas Schnelle @ 2025-09-17 14:48 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato

On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
> On s390 today we overwrite the PCI BAR resource address to either an
> artificial cookie address or MIO address. However this address is different
> from the bus address of the BARs programmed by firmware. The artificial
> cookie address was created to index into an array of function handles
> (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> but maybe different from the bus address. This creates an issue when trying
> to convert the BAR resource address to bus address using the generic
> pcibios_resource_to_bus.
> 

Nit: I'd prefer referring to functions with e.g.
pcibios_resource_to_bus() to make them easier to distinguish. Same also
below.

> Implement an architecture specific pcibios_resource_to_bus function to
> correctly translate PCI BAR resource address to bus address for s390.

Nit: I'd use the plural "addresses" above as we're dealing with a whole
range.

> Similarly add architecture specific pcibios_bus_to_resource function to do
> the reverse translation.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/pci/pci.c       | 73 +++++++++++++++++++++++++++++++++++++++
>  drivers/pci/host-bridge.c |  4 +--
>  2 files changed, 75 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index cd6676c2d602..5baeb5f6f674 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -264,6 +264,79 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
>  	return 0;
>  }
>  
> +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> +			     struct resource *res)
> +{
> +	struct zpci_bus *zbus = bus->sysdata;
> +	struct zpci_bar_struct *zbar;
> +	struct zpci_dev *zdev;
> +
> +	region->start = res->start;
> +	region->end = res->end;
> +
> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> +		int j = 0;
> +
> +		zbar = NULL;
> +		zdev = zbus->function[i];
> +		if (!zdev)
> +			continue;
> +
> +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
> +			if (zdev->bars[j].res->start == res->start &&
> +			    zdev->bars[j].res->end == res->end) {
> +				zbar = &zdev->bars[j];
> +				break;
> +			}
> +		}
> +
> +		if (zbar) {
> +			/* only MMIO is supported */

Should the code that sets zbar check IORESOURCE_MEM on the res->flags
to ensure the above comment? Though zpci_setup_bus_resources() only
creates IORESOURCE_MEM resources so this would only be relevant if
someone uses a resource from some other source.

> +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
> +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				region->start |= (u64)zdev->bars[j + 1].val << 32;
> +
> +			region->end = region->start + (1UL << zbar->size) - 1;
> +			return;
> +		}
> +	}
> +}
> +
> +void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> +			     struct pci_bus_region *region)
> +{
> +	struct zpci_bus *zbus = bus->sysdata;
> +	struct zpci_dev *zdev;
> +	resource_size_t start, end;
> +
> +	res->start = region->start;
> +	res->end = region->end;
> +
> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> +		zdev = zbus->function[i];
> +		if (!zdev || !zdev->has_resources)
> +			continue;
> +
> +		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
> +			if (!zdev->bars[j].val && !zdev->bars[j].size)
> +				continue;

Shouldn't the above be '||'? I think both a 0 size and an unset bars
value would indicate invalid. zpci_setup_bus_resources() only checks 0
size so I think that would be enoug, no?

> +
> +			/* only MMIO is supported */
> +			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
> +			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				start |= (u64)zdev->bars[j + 1].val << 32;
> +
> +			end = start + (1UL << zdev->bars[j].size) - 1;
> +
> +			if (start == region->start && end == region->end) {
> +				res->start = zdev->bars[j].res->start;
> +				res->end = zdev->bars[j].res->end;
> +				return;
> +			}
> +		}
> +	}
> +}
> +
> 

Overall the code makes sense to me. I think this hasn't caused issues
so far only because firmware has usually already set up the BAR
addresses for us.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-17 14:48   ` Niklas Schnelle
@ 2025-09-17 17:22     ` Farhan Ali
  0 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-17 17:22 UTC (permalink / raw)
  To: Niklas Schnelle, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, mjrosato


On 9/17/2025 7:48 AM, Niklas Schnelle wrote:
> On Thu, 2025-09-11 at 11:33 -0700, Farhan Ali wrote:
>> On s390 today we overwrite the PCI BAR resource address to either an
>> artificial cookie address or MIO address. However this address is different
>> from the bus address of the BARs programmed by firmware. The artificial
>> cookie address was created to index into an array of function handles
>> (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
>> but maybe different from the bus address. This creates an issue when trying
>> to convert the BAR resource address to bus address using the generic
>> pcibios_resource_to_bus.
>>
> Nit: I'd prefer referring to functions with e.g.
> pcibios_resource_to_bus() to make them easier to distinguish. Same also
> below.
>
>> Implement an architecture specific pcibios_resource_to_bus function to
>> correctly translate PCI BAR resource address to bus address for s390.
> Nit: I'd use the plural "addresses" above as we're dealing with a whole
> range.
>
>> Similarly add architecture specific pcibios_bus_to_resource function to do
>> the reverse translation.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   arch/s390/pci/pci.c       | 73 +++++++++++++++++++++++++++++++++++++++
>>   drivers/pci/host-bridge.c |  4 +--
>>   2 files changed, 75 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
>> index cd6676c2d602..5baeb5f6f674 100644
>> --- a/arch/s390/pci/pci.c
>> +++ b/arch/s390/pci/pci.c
>> @@ -264,6 +264,79 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
>>   	return 0;
>>   }
>>   
>> +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
>> +			     struct resource *res)
>> +{
>> +	struct zpci_bus *zbus = bus->sysdata;
>> +	struct zpci_bar_struct *zbar;
>> +	struct zpci_dev *zdev;
>> +
>> +	region->start = res->start;
>> +	region->end = res->end;
>> +
>> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
>> +		int j = 0;
>> +
>> +		zbar = NULL;
>> +		zdev = zbus->function[i];
>> +		if (!zdev)
>> +			continue;
>> +
>> +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
>> +			if (zdev->bars[j].res->start == res->start &&
>> +			    zdev->bars[j].res->end == res->end) {
>> +				zbar = &zdev->bars[j];
>> +				break;
>> +			}
>> +		}
>> +
>> +		if (zbar) {
>> +			/* only MMIO is supported */
> Should the code that sets zbar check IORESOURCE_MEM on the res->flags
> to ensure the above comment? Though zpci_setup_bus_resources() only
> creates IORESOURCE_MEM resources so this would only be relevant if
> someone uses a resource from some other source.

I don't think it hurts to add the check. I don't think we support any 
PCI devices on the platform with IORESOURCE_IO.


>
>> +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
>> +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
>> +				region->start |= (u64)zdev->bars[j + 1].val << 32;
>> +
>> +			region->end = region->start + (1UL << zbar->size) - 1;
>> +			return;
>> +		}
>> +	}
>> +}
>> +
>> +void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
>> +			     struct pci_bus_region *region)
>> +{
>> +	struct zpci_bus *zbus = bus->sysdata;
>> +	struct zpci_dev *zdev;
>> +	resource_size_t start, end;
>> +
>> +	res->start = region->start;
>> +	res->end = region->end;
>> +
>> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
>> +		zdev = zbus->function[i];
>> +		if (!zdev || !zdev->has_resources)
>> +			continue;
>> +
>> +		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
>> +			if (!zdev->bars[j].val && !zdev->bars[j].size)
>> +				continue;
> Shouldn't the above be '||'? I think both a 0 size and an unset bars
> value would indicate invalid. zpci_setup_bus_resources() only checks 0
> size so I think that would be enoug, no?

Right, architecturally both size 0 and unset BAR value would indicate 
invalid and this check was meant for that. But I think just changing 
this to !zdev->bars[j].size should also be enough, as we already handle 
the 64bit BAR case below. Will change this.

Thanks Farhan

>
>> +
>> +			/* only MMIO is supported */
>> +			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
>> +			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
>> +				start |= (u64)zdev->bars[j + 1].val << 32;
>> +
>> +			end = start + (1UL << zdev->bars[j].size) - 1;
>> +
>> +			if (start == region->start && end == region->end) {
>> +				res->start = zdev->bars[j].res->start;
>> +				res->end = zdev->bars[j].res->end;
>> +				return;
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>
> Overall the code makes sense to me. I think this hasn't caused issues
> so far only because firmware has usually already set up the BAR
> addresses for us.
>
> Thanks,
> Niklas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/10] PCI: Allow per function PCI slots
  2025-09-17  6:21       ` Cédric Le Goater
@ 2025-09-17 17:50         ` Farhan Ali
  0 siblings, 0 replies; 35+ messages in thread
From: Farhan Ali @ 2025-09-17 17:50 UTC (permalink / raw)
  To: Cédric Le Goater, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, schnelle, mjrosato


On 9/16/2025 11:21 PM, Cédric Le Goater wrote:
> Hi Farhan,
>
>> Hi Cedric,
>>
>> Thanks for pointing this out. I missed that dev->slot could be NULL 
>> and so the per_func_slot check should be done after the check for 
>> !dev->slot. I tried this change on top of the patch in an x86_64 VM 
>> and was able to boot the VM without the oops.
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 70296d3b1cfc..3631f7faa0cf 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5061,10 +5061,9 @@ static int pci_reset_hotplug_slot(struct 
>> hotplug_slot *hotplug, bool probe)
>>
>>   static int pci_dev_reset_slot_function(struct pci_dev *dev, bool 
>> probe)
>>   {
>> -       if (dev->multifunction && !dev->slot->per_func_slot)
>> -               return -ENOTTY;
>>          if (dev->subordinate || !dev->slot ||
>> -           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
>> +           dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
>> +           (dev->multifunction && !dev->slot->per_func_slot))
>>                  return -ENOTTY;
> All good.
>
> I have pushed the Linux branch I use for vfio :
>
>    https://github.com/legoater/linux/commits/vfio/
>
> These commits have small changes :
>
>     PCI: Allow per function PCI slots
>     vfio-pci/zdev: Add a device feature for error information
>
> Thanks,
>
> C.
>
>
Hi Cedric,

Thanks again for your help in reviewing the patches.

Thanks
Farhan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] PCI: Avoid saving error values for config space
  2025-09-16 20:00     ` Farhan Ali
@ 2025-09-19 18:17       ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2025-09-19 18:17 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Bjorn Helgaas, linux-s390, kvm, linux-kernel, linux-pci, schnelle,
	mjrosato

On Tue, 16 Sep 2025 13:00:30 -0700
Farhan Ali <alifm@linux.ibm.com> wrote:

> On 9/16/2025 11:09 AM, Bjorn Helgaas wrote:
> > On Thu, Sep 11, 2025 at 11:32:58AM -0700, Farhan Ali wrote:  
> >> The current reset process saves the device's config space state before
> >> reset and restores it afterward. However, when a device is in an error
> >> state before reset, config space reads may return error values instead of
> >> valid data. This results in saving corrupted values that get written back
> >> to the device during state restoration.
> >>
> >> Avoid saving the state of the config space when the device is in error.
> >> While restoring we only restorei the state that can be restored through
> >> kernel data such as BARs or doesn't depend on the saved state.
> >>
> >> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >> ---
> >>   drivers/pci/pci.c      | 29 ++++++++++++++++++++++++++---
> >>   drivers/pci/pcie/aer.c |  5 +++++
> >>   drivers/pci/pcie/dpc.c |  5 +++++
> >>   drivers/pci/pcie/ptm.c |  5 +++++
> >>   drivers/pci/tph.c      |  5 +++++
> >>   drivers/pci/vc.c       |  5 +++++
> >>   6 files changed, 51 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index b0f4d98036cd..4b67d22faf0a 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -1720,6 +1720,11 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
> >>   	struct pci_cap_saved_state *save_state;
> >>   	u16 *cap;
> >>   
> >> +	if (!dev->state_saved) {
> >> +		pci_warn(dev, "Not restoring pcie state, no saved state");
> >> +		return;  
> Hi Bjorn
> 
> Thanks for taking a look.
> 
> > Seems like a lot of messages.  If we want to warn about this, why
> > don't we do it once in pci_restore_state()?  
> 
> I thought providing messages about which state is not restored would be 
> better and meaningful as we try to restore some of the state. But if the 
> preference is to just have a single warn message in pci_restore_state 
> then I can update it. (would also like to hear if Alex has any 
> objections to that)

I thought it got a bit verbose as well.

> > I guess you're making some judgment about what things can be restored
> > even when !dev->state_saved.  That seems kind of hard to maintain in
> > the future as other capabilities are added.
> >
> > Also seems sort of questionable if we restore partial state and keep
> > using the device as if all is well.  Won't the device be in some kind
> > of inconsistent, unpredictable state then?

To an extent that's always true.  Reset is a lossy process, we're
intentionally throwing away the internal state of the device and
attempting to restore the architected config space as best as we can.
It's hard to guarantee it's complete though.

In this case we're largely just trying to determine whether the
pre-reset config space is already broken, which would mean that some
forms of reset are unavailable and our restore data is bogus.  In
addition to the s390x specific scenario resolved here, I hope this
might eliminate some of the "device stuck in D3" or "device stuck with
pending transaction" errors we currently see trying to do PM or FLR
resets on broken devices.  Failing to actually reset the device in any
way, then trying to write back -1 for restore data is what we'd see
today, which also isn't what we intend.

It probably doesn't make sense to note the specific capabilities that
aren't being restored.  Probably a single pci_warn indicating the
device config space is inaccessible prior to reset and will only be
partially restored is probably sufficient.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-09-19 18:17 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-11 18:32 [PATCH v3 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
2025-09-11 18:32 ` [PATCH v3 01/10] PCI: Avoid saving error values for config space Farhan Ali
2025-09-13  8:27   ` Alex Williamson
2025-09-15 17:15     ` Farhan Ali
2025-09-16 18:09   ` Bjorn Helgaas
2025-09-16 20:00     ` Farhan Ali
2025-09-19 18:17       ` Alex Williamson
2025-09-11 18:32 ` [PATCH v3 02/10] PCI: Add additional checks for flr reset Farhan Ali
2025-09-11 18:33 ` [PATCH v3 03/10] PCI: Allow per function PCI slots Farhan Ali
2025-09-12 12:23   ` Benjamin Block
2025-09-12 17:19     ` Farhan Ali
2025-09-16  6:52   ` Cédric Le Goater
2025-09-16 18:37     ` Farhan Ali
2025-09-17  6:21       ` Cédric Le Goater
2025-09-17 17:50         ` Farhan Ali
2025-09-11 18:33 ` [PATCH v3 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
2025-09-17 14:48   ` Niklas Schnelle
2025-09-17 17:22     ` Farhan Ali
2025-09-11 18:33 ` [PATCH v3 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
2025-09-15  8:39   ` Niklas Schnelle
2025-09-15 17:42     ` Farhan Ali
2025-09-16 10:59       ` Niklas Schnelle
2025-09-11 18:33 ` [PATCH v3 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
2025-09-15  9:22   ` Niklas Schnelle
2025-09-11 18:33 ` [PATCH v3 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
2025-09-15 11:42   ` Niklas Schnelle
2025-09-15 18:12     ` Farhan Ali
2025-09-16 10:54       ` Niklas Schnelle
2025-09-11 18:33 ` [PATCH v3 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
2025-09-13  9:04   ` Alex Williamson
2025-09-15 18:27     ` Farhan Ali
2025-09-15  6:26   ` Cédric Le Goater
2025-09-15 18:27     ` Farhan Ali
2025-09-11 18:33 ` [PATCH v3 09/10] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
2025-09-11 18:33 ` [PATCH v3 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox