[PATCH v4 00/10] Error recovery for vfio-pci devices on s390x

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x
@ 2025-09-24 17:16 Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 01/10] PCI: Avoid saving error values for config space Farhan Ali
                   ` (9 more replies)
  0 siblings, 10 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

Hi,

This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x). 

Background
----------
For PCI devices on s390x an operating system receives platform specific
error events from firmware rather than through AER.Today for
passthrough/userspace devices, we don't attempt any error recovery and
ignore any error events for the devices. The passthrough/userspace devices
are managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected), and on an error trigger an eventfd to
userspace.  But we need a mechanism to notify userspace
(QEMU/guest/userspace drivers) about the error event. 

Proposal
--------
We can expose this error information (currently only the PCI Error Code)
via a device feature. Userspace can then obtain the error information 
via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving 
a device reset.

I would appreciate some feedback on this series. Part of the series touches
PCI common code, so would like to get some feedback on those patches.

Thanks
Farhan

ChangeLog
---------
v3 series https://lore.kernel.org/all/20250911183307.1910-1-alifm@linux.ibm.com/
v3 -> v4
    - Remove warn messages for each PCI capability not restored (patch 1)

    - Check PCI_COMMAND and PCI_STATUS register for error value instead of device id 
    (patch 1)

    - Fix kernel crash in patch 3

    - Added reviewed by tags

    - Address comments from Niklas's (patches 4, 5, 7)

    - Fix compilation error non s390x system (patch 8)

    - Explicitly align struct vfio_device_feature_zpci_err (patch 8)


v2 series https://lore.kernel.org/all/20250825171226.1602-1-alifm@linux.ibm.com/
v2 -> v3
   - Patch 1 avoids saving any config space state if the device is in error
   (suggested by Alex)

   - Patch 2 adds additional check only for FLR reset to try other function 
     reset method (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions. Creates a new flag pci_slot to allow per function slot.

   - Patch 4 fixes a bug in s390 for resource to bus address translation.

   - Rebase on 6.17-rc5


v1 series https://lore.kernel.org/all/20250813170821.1115-1-alifm@linux.ibm.com/
v1 - > v2
   - Patches 1 and 2 adds some additional checks for FLR/PM reset to 
     try other function reset method (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions.

   - Patch 7 adds a new device feature for zPCI devices for the VFIO_DEVICE_FEATURE 
     ioctl. The ioctl is used by userspace to retriece any PCI error
     information for the device (suggested by Alex).

   - Patch 8 adds a reset_done() callback for the vfio-pci driver, to
     restore the state of the device after a reset.

   - Patch 9 removes the pcie check for triggering VFIO_PCI_ERR_IRQ_INDEX.


Farhan Ali (10):
  PCI: Avoid saving error values for config space
  PCI: Add additional checks for flr reset
  PCI: Allow per function PCI slots
  s390/pci: Add architecture specific resource/bus address translation
  s390/pci: Restore IRQ unconditionally for the zPCI device
  s390/pci: Update the logic for detecting passthrough device
  s390/pci: Store PCI error information for passthrough devices
  vfio-pci/zdev: Add a device feature for error information
  vfio: Add a reset_done callback for vfio-pci driver
  vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX

 arch/s390/include/asm/pci.h        |  30 +++++++-
 arch/s390/pci/pci.c                |  75 ++++++++++++++++++++
 arch/s390/pci/pci_event.c          | 107 ++++++++++++++++-------------
 arch/s390/pci/pci_irq.c            |   9 +--
 drivers/pci/host-bridge.c          |   4 +-
 drivers/pci/hotplug/s390_pci_hpc.c |  10 ++-
 drivers/pci/pci.c                  |  37 ++++++++--
 drivers/pci/pcie/aer.c             |   3 +
 drivers/pci/pcie/dpc.c             |   3 +
 drivers/pci/pcie/ptm.c             |   3 +
 drivers/pci/slot.c                 |  14 +++-
 drivers/pci/tph.c                  |   3 +
 drivers/pci/vc.c                   |   3 +
 drivers/vfio/pci/vfio_pci_core.c   |  20 ++++--
 drivers/vfio/pci/vfio_pci_intrs.c  |   3 +-
 drivers/vfio/pci/vfio_pci_priv.h   |   8 +++
 drivers/vfio/pci/vfio_pci_zdev.c   |  45 +++++++++++-
 include/linux/pci.h                |   1 +
 include/uapi/linux/vfio.h          |  15 ++++
 19 files changed, 318 insertions(+), 75 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-10-01 15:15   ` Benjamin Block
  2025-09-24 17:16 ` [PATCH v4 02/10] PCI: Add additional checks for flr reset Farhan Ali
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

The current reset process saves the device's config space state before
reset and restores it afterward. However, when a device is in an error
state before reset, config space reads may return error values instead of
valid data. This results in saving corrupted values that get written back
to the device during state restoration.

Avoid saving the state of the config space when the device is in error.
While restoring we only restore the state that can be restored through
kernel data such as BARs or doesn't depend on the saved state.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c      | 25 ++++++++++++++++++++++---
 drivers/pci/pcie/aer.c |  3 +++
 drivers/pci/pcie/dpc.c |  3 +++
 drivers/pci/pcie/ptm.c |  3 +++
 drivers/pci/tph.c      |  3 +++
 drivers/pci/vc.c       |  3 +++
 6 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0f4d98036cd..a3d93d1baee7 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1720,6 +1720,9 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
 	struct pci_cap_saved_state *save_state;
 	u16 *cap;
 
+	if (!dev->state_saved)
+		return;
+
 	/*
 	 * Restore max latencies (in the LTR capability) before enabling
 	 * LTR itself in PCI_EXP_DEVCTL2.
@@ -1775,6 +1778,9 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 	struct pci_cap_saved_state *save_state;
 	u16 *cap;
 
+	if (!dev->state_saved)
+		return;
+
 	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_PCIX);
 	pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
 	if (!save_state || !pos)
@@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 int pci_save_state(struct pci_dev *dev)
 {
 	int i;
+	u32 val;
+
+	pci_read_config_dword(dev, PCI_COMMAND, &val);
+	if (PCI_POSSIBLE_ERROR(val)) {
+		pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
+		return -EIO;
+	}
+
 	/* XXX: 100% dword access ok here? */
 	for (i = 0; i < 16; i++) {
 		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
@@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
 
 static void pci_restore_config_space(struct pci_dev *pdev)
 {
+	if (!pdev->state_saved) {
+		pci_warn(pdev, "No saved config space, restoring BARs\n");
+		pci_restore_bars(pdev);
+		pci_write_config_word(pdev, PCI_COMMAND,
+				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
+		return;
+	}
+
 	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
 		pci_restore_config_space_range(pdev, 10, 15, 0, false);
 		/* Restore BARs before the command register. */
@@ -1906,9 +1928,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
  */
 void pci_restore_state(struct pci_dev *dev)
 {
-	if (!dev->state_saved)
-		return;
-
 	pci_restore_pcie_state(dev);
 	pci_restore_pasid_state(dev);
 	pci_restore_pri_state(dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d716..e6023ffbf94d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -361,6 +361,9 @@ void pci_restore_aer_state(struct pci_dev *dev)
 	if (!aer)
 		return;
 
+	if (!dev->state_saved)
+		return;
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..e0d7034c2cd8 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -67,6 +67,9 @@ void pci_restore_dpc_state(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return;
 
+	if (!dev->state_saved)
+		return;
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/pcie/ptm.c b/drivers/pci/pcie/ptm.c
index 65e4b008be00..78613000acfb 100644
--- a/drivers/pci/pcie/ptm.c
+++ b/drivers/pci/pcie/ptm.c
@@ -112,6 +112,9 @@ void pci_restore_ptm_state(struct pci_dev *dev)
 	if (!ptm)
 		return;
 
+	if (!dev->state_saved)
+		return;
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_PTM);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index cc64f93709a4..c0fa1b9a7a78 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -435,6 +435,9 @@ void pci_restore_tph_state(struct pci_dev *pdev)
 	if (!pdev->tph_enabled)
 		return;
 
+	if (!pdev->state_saved)
+		return;
+
 	save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_TPH);
 	if (!save_state)
 		return;
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index a4ff7f5f66dd..00609c7e032a 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -391,6 +391,9 @@ void pci_restore_vc_state(struct pci_dev *dev)
 {
 	int i;
 
+	if (!dev->state_saved)
+		return;
+
 	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
 		int pos;
 		struct pci_cap_saved_state *save_state;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-09-24 17:16 ` [PATCH v4 01/10] PCI: Avoid saving error values for config space Farhan Ali
@ 2025-10-01 15:15   ` Benjamin Block
  2025-10-01 17:12     ` Farhan Ali
  0 siblings, 1 reply; 47+ messages in thread
From: Benjamin Block @ 2025-10-01 15:15 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato

On Wed, Sep 24, 2025 at 10:16:19AM -0700, Farhan Ali wrote:
> @@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>  int pci_save_state(struct pci_dev *dev)
>  {
>  	int i;
> +	u32 val;
> +
> +	pci_read_config_dword(dev, PCI_COMMAND, &val);
> +	if (PCI_POSSIBLE_ERROR(val)) {
> +		pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
> +		return -EIO;

Should it set `dev->state_saved` to `false`, to be on the save side?
Not sure whether we run a risk of restoring an old, outdated state otherwise.

> +	}
> +
>  	/* XXX: 100% dword access ok here? */
>  	for (i = 0; i < 16; i++) {
>  		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
> @@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
>  
>  static void pci_restore_config_space(struct pci_dev *pdev)
>  {
> +	if (!pdev->state_saved) {
> +		pci_warn(pdev, "No saved config space, restoring BARs\n");
> +		pci_restore_bars(pdev);
> +		pci_write_config_word(pdev, PCI_COMMAND,
> +				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);

Is this really something that ought to be universally enabled? I thought this
depends on whether attached resources are IO and/or MEM?

	int pci_enable_resources(struct pci_dev *dev, int mask)
	{
		...
		pci_dev_for_each_resource(dev, r, i) {
			...
			if (r->flags & IORESOURCE_IO)
				cmd |= PCI_COMMAND_IO;
			if (r->flags & IORESOURCE_MEM)
				cmd |= PCI_COMMAND_MEMORY;
		}
		...
	}

Also IIRC, especially on s390, we never have IO resources?

	int zpci_setup_bus_resources(struct zpci_dev *zdev)
	{
		...
		for (i = 0; i < PCI_STD_NUM_BARS; i++) {
			...
			/* only MMIO is supported */
			flags = IORESOURCE_MEM;
			if (zdev->bars[i].val & 8)
				flags |= IORESOURCE_PREFETCH;
			if (zdev->bars[i].val & 4)
				flags |= IORESOURCE_MEM_64;
			...
		}
		...
	}

So I guess this would have to have some form of the same logic as in
`pci_enable_resources()`, after restoring the BARs.

Or am I missing something?

> +		return;
> +	}

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-01 15:15   ` Benjamin Block
@ 2025-10-01 17:12     ` Farhan Ali
  2025-10-02  9:16       ` Benjamin Block
  2025-10-04 14:54       ` Lukas Wunner
  0 siblings, 2 replies; 47+ messages in thread
From: Farhan Ali @ 2025-10-01 17:12 UTC (permalink / raw)
  To: Benjamin Block
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato


On 10/1/2025 8:15 AM, Benjamin Block wrote:
> On Wed, Sep 24, 2025 at 10:16:19AM -0700, Farhan Ali wrote:
>> @@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>>   int pci_save_state(struct pci_dev *dev)
>>   {
>>   	int i;
>> +	u32 val;
>> +
>> +	pci_read_config_dword(dev, PCI_COMMAND, &val);
>> +	if (PCI_POSSIBLE_ERROR(val)) {
>> +		pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
>> +		return -EIO;
> Should it set `dev->state_saved` to `false`, to be on the save side?
> Not sure whether we run a risk of restoring an old, outdated state otherwise.

AFAIU if the state_saved flag was set to true then any state that we 
have saved should be valid and should be okay to be restored from. We 
just want to avoid saving any invalid data.


>
>> +	}
>> +
>>   	/* XXX: 100% dword access ok here? */
>>   	for (i = 0; i < 16; i++) {
>>   		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
>> @@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
>>   
>>   static void pci_restore_config_space(struct pci_dev *pdev)
>>   {
>> +	if (!pdev->state_saved) {
>> +		pci_warn(pdev, "No saved config space, restoring BARs\n");
>> +		pci_restore_bars(pdev);
>> +		pci_write_config_word(pdev, PCI_COMMAND,
>> +				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> Is this really something that ought to be universally enabled? I thought this
> depends on whether attached resources are IO and/or MEM?
>
> 	int pci_enable_resources(struct pci_dev *dev, int mask)
> 	{
> 		...
> 		pci_dev_for_each_resource(dev, r, i) {
> 			...
> 			if (r->flags & IORESOURCE_IO)
> 				cmd |= PCI_COMMAND_IO;
> 			if (r->flags & IORESOURCE_MEM)
> 				cmd |= PCI_COMMAND_MEMORY;
> 		}
> 		...
> 	}
>
> Also IIRC, especially on s390, we never have IO resources?
>
> 	int zpci_setup_bus_resources(struct zpci_dev *zdev)
> 	{
> 		...
> 		for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> 			...
> 			/* only MMIO is supported */
> 			flags = IORESOURCE_MEM;
> 			if (zdev->bars[i].val & 8)
> 				flags |= IORESOURCE_PREFETCH;
> 			if (zdev->bars[i].val & 4)
> 				flags |= IORESOURCE_MEM_64;
> 			...
> 		}
> 		...
> 	}
>
> So I guess this would have to have some form of the same logic as in
> `pci_enable_resources()`, after restoring the BARs.
>
> Or am I missing something?

As per my understanding of the spec, setting both I/O Space and Memory 
Space should be safe. The spec also mentions if a function doesn't 
support IO/Memory space access it could hardwire the bit to zero. We 
could add the logic to iterate through all the resources and set the 
bits accordingly, but in this case trying a best effort restoration it 
should be fine?

Also I didn't see any issues testing on s390x with the NVMe, RoCE and 
NETD devices, but I could have missed something.

Thanks

Farhan


>
>> +		return;
>> +	}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-01 17:12     ` Farhan Ali
@ 2025-10-02  9:16       ` Benjamin Block
  2025-10-04 14:54       ` Lukas Wunner
  1 sibling, 0 replies; 47+ messages in thread
From: Benjamin Block @ 2025-10-02  9:16 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato

On Wed, Oct 01, 2025 at 10:12:03AM -0700, Farhan Ali wrote:
> 
> On 10/1/2025 8:15 AM, Benjamin Block wrote:
> > On Wed, Sep 24, 2025 at 10:16:19AM -0700, Farhan Ali wrote:
> >> @@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
> >>   int pci_save_state(struct pci_dev *dev)
> >>   {
> >>   	int i;
> >> +	u32 val;
> >> +
> >> +	pci_read_config_dword(dev, PCI_COMMAND, &val);
> >> +	if (PCI_POSSIBLE_ERROR(val)) {
> >> +		pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
> >> +		return -EIO;
> >
> > Should it set `dev->state_saved` to `false`, to be on the save side?
> > Not sure whether we run a risk of restoring an old, outdated state otherwise.
> 
> AFAIU if the state_saved flag was set to true then any state that we 
> have saved should be valid and should be okay to be restored from. We 
> just want to avoid saving any invalid data.

Hmm, so I dug a bit more, and I see
void pci_restore_state(struct pci_dev *dev) {}
has `dev->state_saved = false` at the end, so I guess if a device is put into
suspend, and then later woken again, the flag gets reset every time.

And then there is also code like this:

	static pci_ers_result_t e1000_io_slot_reset(struct pci_dev *pdev)
	{
		...
		err = pci_enable_device_mem(pdev);
		if (err) {
			...
		} else {
			pdev->state_saved = true;
			pci_restore_state(pdev);

I don't know..

But I see Alex suggested this before.

> >> +	}
> >> +
> >>   	/* XXX: 100% dword access ok here? */
> >>   	for (i = 0; i < 16; i++) {
> >>   		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
> >> @@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
> >>   
> >>   static void pci_restore_config_space(struct pci_dev *pdev)
> >>   {
> >> +	if (!pdev->state_saved) {
> >> +		pci_warn(pdev, "No saved config space, restoring BARs\n");
> >> +		pci_restore_bars(pdev);
> >> +		pci_write_config_word(pdev, PCI_COMMAND,
> >> +				PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> >
> > Is this really something that ought to be universally enabled? I thought this
> > depends on whether attached resources are IO and/or MEM?
> >
> > 	int pci_enable_resources(struct pci_dev *dev, int mask)
> > 	{
> > 		...
> > 		pci_dev_for_each_resource(dev, r, i) {
> > 			...
> > 			if (r->flags & IORESOURCE_IO)
> > 				cmd |= PCI_COMMAND_IO;
> > 			if (r->flags & IORESOURCE_MEM)
> > 				cmd |= PCI_COMMAND_MEMORY;
> > 		}
> > 		...
> > 	}
> >
> > Also IIRC, especially on s390, we never have IO resources?
> >
> > 	int zpci_setup_bus_resources(struct zpci_dev *zdev)
> > 	{
> > 		...
> > 		for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> > 			...
> > 			/* only MMIO is supported */
> > 			flags = IORESOURCE_MEM;
> > 			if (zdev->bars[i].val & 8)
> > 				flags |= IORESOURCE_PREFETCH;
> > 			if (zdev->bars[i].val & 4)
> > 				flags |= IORESOURCE_MEM_64;
> > 			...
> > 		}
> > 		...
> > 	}
> >
> > So I guess this would have to have some form of the same logic as in
> > `pci_enable_resources()`, after restoring the BARs.
> >
> > Or am I missing something?
> 
> As per my understanding of the spec, setting both I/O Space and Memory 
> Space should be safe. The spec also mentions if a function doesn't 
> support IO/Memory space access it could hardwire the bit to zero. We 
> could add the logic to iterate through all the resources and set the 
> bits accordingly, but in this case trying a best effort restoration it 
> should be fine?
> 
> Also I didn't see any issues testing on s390x with the NVMe, RoCE and 
> NETD devices, but I could have missed something.

Well, just taking a coarse look at how some other PCI device drivers use the
Command register (this being non-s390 specific after all); some of them base
decisions on whether either/or these flags are set in config space. Now that
this sets both flags, this might have surprising side-effects.
    On the other hand, iterating over the resources might not even be enough
with some device-drivers, since they base their decision on whether to enable
either/or on other knowledge.
    So I don't know. Enabling both just in case might be a good compromise.

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-01 17:12     ` Farhan Ali
  2025-10-02  9:16       ` Benjamin Block
@ 2025-10-04 14:54       ` Lukas Wunner
  2025-10-06 17:54         ` Farhan Ali
  1 sibling, 1 reply; 47+ messages in thread
From: Lukas Wunner @ 2025-10-04 14:54 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On Wed, Oct 01, 2025 at 10:12:03AM -0700, Farhan Ali wrote:
> AFAIU if the state_saved flag was set to true then any state that we have
> saved should be valid and should be okay to be restored from. We just want
> to avoid saving any invalid data.

The state_saved flag is used by the PCI core to detect whether a driver
has called pci_save_state() in one of its suspend callbacks.  If it did,
the PCI core assumes that the driver has taken on the responsibility to
put the device into a low power state.  The PCI core will thus not put
the device into a low power state itself and it won't (again) call
pci_save_state().

Hence state_saved is cleared before the driver suspend callbacks are
invoked and it is checked afterwards.

Clearing the state_saved flag in pci_restore_state() merely serves the
purpose of ensuring that the flag is cleared ahead of the next suspend
and resume cycle.

It is a fallacy to think that state_saved indicates validity of the
saved state.

Unfortunately pci_restore_state() was amended by c82f63e411f1 to
bail out if state_saved is false.  This has arguably caused more
problems than it solved, so I have prepared this development branch
which essentially reverts the commit and undoes most of the awful
workarounds that it necessitated:

https://github.com/l1k/linux/commits/aer_reset_v1

I intend to submit this after the merge window has closed.

The motivation of c82f63e411f1 was to prevent restoring state if
pci_save_state() hasn't been called before.  I am solving that by
calling pci_save_state() on device addition, hence error
recoverability is ensured at all times.

I believe this also makes patch [01/10] in your series unnecessary.

A lot of drivers call pci_save_state() in their probe hook and
that continues to be correct if they modified Config Space
vis-a-vis what was saved on device addition.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-04 14:54       ` Lukas Wunner
@ 2025-10-06 17:54         ` Farhan Ali
  2025-10-06 19:26           ` Lukas Wunner
  0 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-10-06 17:54 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato


On 10/4/2025 7:54 AM, Lukas Wunner wrote:
> On Wed, Oct 01, 2025 at 10:12:03AM -0700, Farhan Ali wrote:
>> AFAIU if the state_saved flag was set to true then any state that we have
>> saved should be valid and should be okay to be restored from. We just want
>> to avoid saving any invalid data.
> The state_saved flag is used by the PCI core to detect whether a driver
> has called pci_save_state() in one of its suspend callbacks.  If it did,
> the PCI core assumes that the driver has taken on the responsibility to
> put the device into a low power state.  The PCI core will thus not put
> the device into a low power state itself and it won't (again) call
> pci_save_state().
>
> Hence state_saved is cleared before the driver suspend callbacks are
> invoked and it is checked afterwards.
>
> Clearing the state_saved flag in pci_restore_state() merely serves the
> purpose of ensuring that the flag is cleared ahead of the next suspend
> and resume cycle.
>
> It is a fallacy to think that state_saved indicates validity of the
> saved state.

Hi Lukas,

Thanks for the detailed explanation, this was very helpful for me.

> Unfortunately pci_restore_state() was amended by c82f63e411f1 to
> bail out if state_saved is false.  This has arguably caused more
> problems than it solved, so I have prepared this development branch
> which essentially reverts the commit and undoes most of the awful
> workarounds that it necessitated:
>
> https://github.com/l1k/linux/commits/aer_reset_v1
>
> I intend to submit this after the merge window has closed.
>
> The motivation of c82f63e411f1 was to prevent restoring state if
> pci_save_state() hasn't been called before.  I am solving that by
> calling pci_save_state() on device addition, hence error
> recoverability is ensured at all times.
>
> I believe this also makes patch [01/10] in your series unnecessary.

I tested your patches + patches 2-10 of this series. It unfortunately 
didn't completely help with the s390x use case. We still need the check 
to in pci_save_state() from this patch to make sure we are not saving 
error values, which can be written back to the device in 
pci_restore_state().

As part of the error recovery userspace can use the VFIO_DEVICE_RESET to 
reset the device (pci_try_reset_function()). The function call for this is:

pci_dev_save_and_disable 
<https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_save_and_disable>();

__pci_reset_function_locked 
<https://elixir.bootlin.com/linux/v6.17.1/C/ident/__pci_reset_function_locked>();

pci_dev_restore 
<https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_restore>();

So we can end up overwriting the initial saved state (added by you in 
pci_bus_add_device()). Do we need to update the 
pci_dev_save_and_disable() not to save the state?

Thanks

Farhan


>
> A lot of drivers call pci_save_state() in their probe hook and
> that continues to be correct if they modified Config Space
> vis-a-vis what was saved on device addition.
>
> Thanks,
>
> Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-06 17:54         ` Farhan Ali
@ 2025-10-06 19:26           ` Lukas Wunner
  2025-10-06 21:35             ` Farhan Ali
  0 siblings, 1 reply; 47+ messages in thread
From: Lukas Wunner @ 2025-10-06 19:26 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
> > I believe this also makes patch [01/10] in your series unnecessary.
> 
> I tested your patches + patches 2-10 of this series. It unfortunately didn't
> completely help with the s390x use case. We still need the check to in
> pci_save_state() from this patch to make sure we are not saving error
> values, which can be written back to the device in pci_restore_state().

What's the caller of pci_save_state() that needs this?

Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
I think plenty of other callers don't need this, so it adds
extra overhead for them and down the road it'll be difficult
to untangle which caller needs it and which doesn't.

> As part of the error recovery userspace can use the VFIO_DEVICE_RESET to
> reset the device (pci_try_reset_function()). The function call for this is:
> 
> pci_dev_save_and_disable <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_save_and_disable>();
> 
> __pci_reset_function_locked <https://elixir.bootlin.com/linux/v6.17.1/C/ident/__pci_reset_function_locked>();
> 
> pci_dev_restore
> <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_restore>();
> 
> So we can end up overwriting the initial saved state (added by you in
> pci_bus_add_device()). Do we need to update the pci_dev_save_and_disable()
> not to save the state?

The state saved on device addition is just the initial state and
it is fine if later on it gets updated (which is a nicer term than
"overwritten").  E.g. when portdrv.c instantiates port services
and drivers are bound to them, various registers in Config Space
are changed, hence pcie_portdrv_probe() calls pci_save_state()
again.

However we can discuss whether pci_save_state() is still needed
in pci_dev_save_and_disable().

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-06 19:26           ` Lukas Wunner
@ 2025-10-06 21:35             ` Farhan Ali
  2025-10-08 13:34               ` Lukas Wunner
  0 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-10-06 21:35 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato


On 10/6/2025 12:26 PM, Lukas Wunner wrote:
> On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
>> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
>>> I believe this also makes patch [01/10] in your series unnecessary.
>> I tested your patches + patches 2-10 of this series. It unfortunately didn't
>> completely help with the s390x use case. We still need the check to in
>> pci_save_state() from this patch to make sure we are not saving error
>> values, which can be written back to the device in pci_restore_state().
> What's the caller of pci_save_state() that needs this?
>
> Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
> I think plenty of other callers don't need this, so it adds
> extra overhead for them and down the road it'll be difficult
> to untangle which caller needs it and which doesn't.

The caller would be pci_dev_save_and_disable(). Are you suggesting 
moving the PCI_POSSIBLE_ERROR() prior to calling pci_save_state()?

>
>> As part of the error recovery userspace can use the VFIO_DEVICE_RESET to
>> reset the device (pci_try_reset_function()). The function call for this is:
>>
>> pci_dev_save_and_disable <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_save_and_disable>();
>>
>> __pci_reset_function_locked <https://elixir.bootlin.com/linux/v6.17.1/C/ident/__pci_reset_function_locked>();
>>
>> pci_dev_restore
>> <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_restore>();
>>
>> So we can end up overwriting the initial saved state (added by you in
>> pci_bus_add_device()). Do we need to update the pci_dev_save_and_disable()
>> not to save the state?
> The state saved on device addition is just the initial state and
> it is fine if later on it gets updated (which is a nicer term than
> "overwritten").  E.g. when portdrv.c instantiates port services
> and drivers are bound to them, various registers in Config Space
> are changed, hence pcie_portdrv_probe() calls pci_save_state()
> again.
>
> However we can discuss whether pci_save_state() is still needed
> in pci_dev_save_and_disable().

The commit 8dd7f8036c12 ("PCI: add support for function level reset") 
introduced the logic of saving/restoring the device state after an FLR. 
My assumption is it was done to save the most recent state of the device 
(as the state could be updated by drivers). So I think it would still 
make sense to save the device state in pci_dev_save_and_disable() if the 
Config Space is still accessible?

Thanks

Farhan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-06 21:35             ` Farhan Ali
@ 2025-10-08 13:34               ` Lukas Wunner
  2025-10-08 17:56                 ` Farhan Ali
  0 siblings, 1 reply; 47+ messages in thread
From: Lukas Wunner @ 2025-10-08 13:34 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On Mon, Oct 06, 2025 at 02:35:49PM -0700, Farhan Ali wrote:
> On 10/6/2025 12:26 PM, Lukas Wunner wrote:
> > On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
> > > On 10/4/2025 7:54 AM, Lukas Wunner wrote:
> > > > I believe this also makes patch [01/10] in your series unnecessary.
> > > I tested your patches + patches 2-10 of this series. It unfortunately didn't
> > > completely help with the s390x use case. We still need the check to in
> > > pci_save_state() from this patch to make sure we are not saving error
> > > values, which can be written back to the device in pci_restore_state().
> > What's the caller of pci_save_state() that needs this?
> > 
> > Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
> > I think plenty of other callers don't need this, so it adds
> > extra overhead for them and down the road it'll be difficult
> > to untangle which caller needs it and which doesn't.
> 
> The caller would be pci_dev_save_and_disable(). Are you suggesting moving
> the PCI_POSSIBLE_ERROR() prior to calling pci_save_state()?

I'm not sure yet.  Let's back up a little:  I'm missing an
architectural description how you're intending to do error
recovery in the VM.  If I understand correctly, you're
informing the VM of the error via the ->error_detected() callback.

You're saying you need to check for accessibility of the device
prior to resetting it from the VM, does that mean you're attempting
a reset from the ->error_detected() callback?

According to Documentation/PCI/pci-error-recovery.rst, the device
isn't supposed to be considered accessible in ->error_detected().
The first callback which allows access is ->mmio_enabled().

I also don't quite understand why the VM needs to perform a reset.
Why can't you just let the VM tell the host that a reset is needed
(PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
behalf of the VM?

Furthermore, I'm thinking that you should be using pci_channel_offline()
to detect accessibility of the device, rather than reading from
Config Space and checking for PCI_POSSIBLE_ERROR().

> > The state saved on device addition is just the initial state and
> > it is fine if later on it gets updated (which is a nicer term than
> > "overwritten").  E.g. when portdrv.c instantiates port services
> > and drivers are bound to them, various registers in Config Space
> > are changed, hence pcie_portdrv_probe() calls pci_save_state()
> > again.
> > 
> > However we can discuss whether pci_save_state() is still needed
> > in pci_dev_save_and_disable().
> 
> The commit 8dd7f8036c12 ("PCI: add support for function level reset")
> introduced the logic of saving/restoring the device state after an FLR. My
> assumption is it was done to save the most recent state of the device (as
> the state could be updated by drivers). So I think it would still make sense
> to save the device state in pci_dev_save_and_disable() if the Config Space
> is still accessible?

Yes, right now we can't assume that drivers call pci_save_state()
in their probe hook if they modified Config Space.  They may rely
on the state being saved prior to reset or a D3hot/D3cold transition.
So we need to keep the pci_dev_save_and_disable() call for now.

Generally the expectation is that Config Space is accessible when
performing a reset with pci_try_reset_function().  Since that's
apparently not guaranteed for your use case, I'm wondering if you
might be using the function in a context it's not supposed to be used.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-08 13:34               ` Lukas Wunner
@ 2025-10-08 17:56                 ` Farhan Ali
  2025-10-08 18:14                   ` Lukas Wunner
  0 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-10-08 17:56 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato


On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> On Mon, Oct 06, 2025 at 02:35:49PM -0700, Farhan Ali wrote:
>> On 10/6/2025 12:26 PM, Lukas Wunner wrote:
>>> On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
>>>> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
>>>>> I believe this also makes patch [01/10] in your series unnecessary.
>>>> I tested your patches + patches 2-10 of this series. It unfortunately didn't
>>>> completely help with the s390x use case. We still need the check to in
>>>> pci_save_state() from this patch to make sure we are not saving error
>>>> values, which can be written back to the device in pci_restore_state().
>>> What's the caller of pci_save_state() that needs this?
>>>
>>> Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
>>> I think plenty of other callers don't need this, so it adds
>>> extra overhead for them and down the road it'll be difficult
>>> to untangle which caller needs it and which doesn't.
>> The caller would be pci_dev_save_and_disable(). Are you suggesting moving
>> the PCI_POSSIBLE_ERROR() prior to calling pci_save_state()?
> I'm not sure yet.  Let's back up a little:  I'm missing an
> architectural description how you're intending to do error
> recovery in the VM.  If I understand correctly, you're
> informing the VM of the error via the ->error_detected() callback.
>
> You're saying you need to check for accessibility of the device
> prior to resetting it from the VM, does that mean you're attempting
> a reset from the ->error_detected() callback?
>
> According to Documentation/PCI/pci-error-recovery.rst, the device
> isn't supposed to be considered accessible in ->error_detected().
> The first callback which allows access is ->mmio_enabled().
>
> I also don't quite understand why the VM needs to perform a reset.
> Why can't you just let the VM tell the host that a reset is needed
> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> behalf of the VM?

The ->error_detected() callback is used to inform userspace of an error. 
In the case of a VM, using QEMU as a userspace, once notified of an 
error QEMU will inject an error into the guest in s390x architecture 
specific way [1] (probably should have linked the QEMU series in the 
cover letter). Once notified of the error VM's device driver will drive 
the recovery action. The recovery action require a reset of the device 
and on s390x PCI devices are reset using architecture specific 
instructions (zpci_device_hot_reset()). QEMU will intercept any low 
level recovery instructions from the VM and then perform a reset of 
device on behalf of the VM [2]. QEMU performs a reset by invoking the 
VFIO_DEVICE_RESET ioctl, which attempts to reset the device 
using pci_try_reset_function().

Once a device is in an error state, MMIO to the device is blocked and so 
any PCI reads to the Config Space will return -1. Since 
pci_try_reset_function() will try to save the state of device's Config 
Space with pci_dev_save_and_disable(), it will end up saving -1 as the 
state. Later when we try to restore the state after a reset, we end up 
corrupting device registers which can send the device into an error 
state again. I was trying to avoid this with the patch.

Hopefully, this answers your questions.

[1] QEMU series 
https://lore.kernel.org/all/20250925174852.1302-1-alifm@linux.ibm.com/

[2] v1 patch series discussion on some nuances of reset mechanism 
https://lore.kernel.org/all/20250814145743.204ca19a.alex.williamson@redhat.com/

>
> Furthermore, I'm thinking that you should be using pci_channel_offline()
> to detect accessibility of the device, rather than reading from
> Config Space and checking for PCI_POSSIBLE_ERROR().
>
>>> The state saved on device addition is just the initial state and
>>> it is fine if later on it gets updated (which is a nicer term than
>>> "overwritten").  E.g. when portdrv.c instantiates port services
>>> and drivers are bound to them, various registers in Config Space
>>> are changed, hence pcie_portdrv_probe() calls pci_save_state()
>>> again.
>>>
>>> However we can discuss whether pci_save_state() is still needed
>>> in pci_dev_save_and_disable().
>> The commit 8dd7f8036c12 ("PCI: add support for function level reset")
>> introduced the logic of saving/restoring the device state after an FLR. My
>> assumption is it was done to save the most recent state of the device (as
>> the state could be updated by drivers). So I think it would still make sense
>> to save the device state in pci_dev_save_and_disable() if the Config Space
>> is still accessible?
> Yes, right now we can't assume that drivers call pci_save_state()
> in their probe hook if they modified Config Space.  They may rely
> on the state being saved prior to reset or a D3hot/D3cold transition.
> So we need to keep the pci_dev_save_and_disable() call for now.
>
> Generally the expectation is that Config Space is accessible when
> performing a reset with pci_try_reset_function().  Since that's
> apparently not guaranteed for your use case, I'm wondering if you
> might be using the function in a context it's not supposed to be used.

I am open to suggestions on how we can do this.

Thanks

Farhan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-08 17:56                 ` Farhan Ali
@ 2025-10-08 18:14                   ` Lukas Wunner
  2025-10-08 21:55                     ` Farhan Ali
  2025-10-09  9:12                     ` Niklas Schnelle
  0 siblings, 2 replies; 47+ messages in thread
From: Lukas Wunner @ 2025-10-08 18:14 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On Wed, Oct 08, 2025 at 10:56:35AM -0700, Farhan Ali wrote:
> On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > I'm not sure yet.  Let's back up a little:  I'm missing an
> > architectural description how you're intending to do error
> > recovery in the VM.  If I understand correctly, you're
> > informing the VM of the error via the ->error_detected() callback.
> > 
> > You're saying you need to check for accessibility of the device
> > prior to resetting it from the VM, does that mean you're attempting
> > a reset from the ->error_detected() callback?
> > 
> > According to Documentation/PCI/pci-error-recovery.rst, the device
> > isn't supposed to be considered accessible in ->error_detected().
> > The first callback which allows access is ->mmio_enabled().
> > 
> 
> The ->error_detected() callback is used to inform userspace of an error. In
> the case of a VM, using QEMU as a userspace, once notified of an error QEMU
> will inject an error into the guest in s390x architecture specific way [1]
> (probably should have linked the QEMU series in the cover letter). Once
> notified of the error VM's device driver will drive the recovery action. The
> recovery action require a reset of the device and on s390x PCI devices are
> reset using architecture specific instructions (zpci_device_hot_reset()).

According to Documentation/PCI/pci-error-recovery.rst:

   "STEP 1: Notification
    --------------------
    Platform calls the error_detected() callback on every instance of
    every driver affected by the error.
    At this point, the device might not be accessible anymore, [...]
    it gives the driver a chance to cleanup, waiting for pending stuff
    (timers, whatever, etc...) to complete; it can take semaphores,
    schedule, etc... everything but touch the device."
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

And yet you're touching the device by trying to reset it.

The code you're introducing in patch [01/10] only becomes necessary
because you're not following the above-quoted protocol.  If you
follow the protocol, patch [01/10] becomes unnecessary.

> > I also don't quite understand why the VM needs to perform a reset.
> > Why can't you just let the VM tell the host that a reset is needed
> > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > behalf of the VM?

Could you answer this question please?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-08 18:14                   ` Lukas Wunner
@ 2025-10-08 21:55                     ` Farhan Ali
  2025-10-09  4:52                       ` Lukas Wunner
  2025-10-09  9:12                     ` Niklas Schnelle
  1 sibling, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-10-08 21:55 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato


On 10/8/2025 11:14 AM, Lukas Wunner wrote:
> On Wed, Oct 08, 2025 at 10:56:35AM -0700, Farhan Ali wrote:
>> On 10/8/2025 6:34 AM, Lukas Wunner wrote:
>>> I'm not sure yet.  Let's back up a little:  I'm missing an
>>> architectural description how you're intending to do error
>>> recovery in the VM.  If I understand correctly, you're
>>> informing the VM of the error via the ->error_detected() callback.
>>>
>>> You're saying you need to check for accessibility of the device
>>> prior to resetting it from the VM, does that mean you're attempting
>>> a reset from the ->error_detected() callback?
>>>
>>> According to Documentation/PCI/pci-error-recovery.rst, the device
>>> isn't supposed to be considered accessible in ->error_detected().
>>> The first callback which allows access is ->mmio_enabled().
>>>
>> The ->error_detected() callback is used to inform userspace of an error. In
>> the case of a VM, using QEMU as a userspace, once notified of an error QEMU
>> will inject an error into the guest in s390x architecture specific way [1]
>> (probably should have linked the QEMU series in the cover letter). Once
>> notified of the error VM's device driver will drive the recovery action. The
>> recovery action require a reset of the device and on s390x PCI devices are
>> reset using architecture specific instructions (zpci_device_hot_reset()).
> According to Documentation/PCI/pci-error-recovery.rst:
>
>     "STEP 1: Notification
>      --------------------
>      Platform calls the error_detected() callback on every instance of
>      every driver affected by the error.
>      At this point, the device might not be accessible anymore, [...]
>      it gives the driver a chance to cleanup, waiting for pending stuff
>      (timers, whatever, etc...) to complete; it can take semaphores,
>      schedule, etc... everything but touch the device."
>                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> And yet you're touching the device by trying to reset it.
>
> The code you're introducing in patch [01/10] only becomes necessary
> because you're not following the above-quoted protocol.  If you
> follow the protocol, patch [01/10] becomes unnecessary.
>
>>> I also don't quite understand why the VM needs to perform a reset.
>>> Why can't you just let the VM tell the host that a reset is needed
>>> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
>>> behalf of the VM?
> Could you answer this question please?
>
>
The reset is not performed by the VM, reset is still done by the host. 
My approach for a VM to let the host know that reset was needed, was to 
intercept any reset instructions for the PCI device in QEMU. QEMU would 
then drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something, 
but based on what we have today in vfio driver, we don't have a 
mechanism for userspace to reset a device other than VFIO_DEVICE_RESET 
and VFIO_PCI_DEVICE_HOT_RESET ioctls.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-08 21:55                     ` Farhan Ali
@ 2025-10-09  4:52                       ` Lukas Wunner
  2025-10-09 17:02                         ` Farhan Ali
  0 siblings, 1 reply; 47+ messages in thread
From: Lukas Wunner @ 2025-10-09  4:52 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On Wed, Oct 08, 2025 at 02:55:56PM -0700, Farhan Ali wrote:
> > > On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > > > I also don't quite understand why the VM needs to perform a reset.
> > > > Why can't you just let the VM tell the host that a reset is needed
> > > > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > > > behalf of the VM?
> 
> The reset is not performed by the VM, reset is still done by the host. My
> approach for a VM to let the host know that reset was needed, was to
> intercept any reset instructions for the PCI device in QEMU. QEMU would then
> drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something, but based
> on what we have today in vfio driver, we don't have a mechanism for
> userspace to reset a device other than VFIO_DEVICE_RESET and
> VFIO_PCI_DEVICE_HOT_RESET ioctls.

The ask is for the host to notify the VM of the ->error_detected() event
and the VM then responding with one of the "enum pci_ers_result" values.

If the VM returns PCI_ERS_RESULT_NEED_RESET from ->error_detected(),
the host knows that it shall reset the device.  There is no need
for the VM to initiate a reset through an ioctl.  Just integrate
with the existing error handling flow described in
Documentation/PCI/pci-error-recovery.rst and thereby use the
existing mechanism to reset the device.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-09  4:52                       ` Lukas Wunner
@ 2025-10-09 17:02                         ` Farhan Ali
  2025-10-12  6:43                           ` Lukas Wunner
  0 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-10-09 17:02 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On 10/8/2025 9:52 PM, Lukas Wunner wrote:
> On Wed, Oct 08, 2025 at 02:55:56PM -0700, Farhan Ali wrote:
>>>> On 10/8/2025 6:34 AM, Lukas Wunner wrote:
>>>>> I also don't quite understand why the VM needs to perform a reset.
>>>>> Why can't you just let the VM tell the host that a reset is needed
>>>>> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
>>>>> behalf of the VM?
>> The reset is not performed by the VM, reset is still done by the host. My
>> approach for a VM to let the host know that reset was needed, was to
>> intercept any reset instructions for the PCI device in QEMU. QEMU would then
>> drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something, but based
>> on what we have today in vfio driver, we don't have a mechanism for
>> userspace to reset a device other than VFIO_DEVICE_RESET and
>> VFIO_PCI_DEVICE_HOT_RESET ioctls.
> The ask is for the host to notify the VM of the ->error_detected() event
> and the VM then responding with one of the "enum pci_ers_result" values.

Maybe there is some confusion here. Could you clarify what do you mean 
by VM responding with "enum pci_ers_result" values? Is it a device 
driver (for example an NVMe driver) running in the VM that should do 
that? Or is it something else you are suggesting?

Let me try to clarify what I am trying to do with this patch series. For 
passthrough devices to a VM, the driver bound to the device on the host 
is vfio-pci. vfio-pci driver does support the error_detected() callback 
(vfio_pci_core_aer_err_detected()), and on an PCI error s390x recovery 
code on the host will call the vfio-pci error_detected() callback. The 
vfio-pci error_detected() callback will notify userspace/QEMU via an 
eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x 
error recovery on the host will skip any further action(see patch 7) and 
let userspace drive the error recovery.

Once userspace/QEMU is notified, it then inject this error into the VM 
so device drivers in the VM can take recovery actions. For example for a 
passthrough NVMe device, the VM's OS NVMe driver will access the device. 
At this point the VM's NVMe driver's error_detected() will drive the 
recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error 
recovery in the VM's OS will try to do a reset. Resets are privileged 
operations and so the VM will need intervention from QEMU to perform the 
reset. QEMU will invoke the ioctls to now notify the host that the VM is 
requesting a reset of the device. The vfio-pci driver on the host will 
then perform the reset on the device.

Thanks Farhan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-09 17:02                         ` Farhan Ali
@ 2025-10-12  6:43                           ` Lukas Wunner
  0 siblings, 0 replies; 47+ messages in thread
From: Lukas Wunner @ 2025-10-12  6:43 UTC (permalink / raw)
  To: Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, schnelle, mjrosato

On Thu, Oct 09, 2025 at 10:02:12AM -0700, Farhan Ali wrote:
> On 10/8/2025 9:52 PM, Lukas Wunner wrote:
> > On Wed, Oct 08, 2025 at 02:55:56PM -0700, Farhan Ali wrote:
> > > > > On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > > > > > I also don't quite understand why the VM needs to perform a reset.
> > > > > > Why can't you just let the VM tell the host that a reset is needed
> > > > > > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > > > > > behalf of the VM?
> > > The reset is not performed by the VM, reset is still done by the host. My
> > > approach for a VM to let the host know that reset was needed, was to
> > > intercept any reset instructions for the PCI device in QEMU. QEMU would
> > > then drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something,
> > > but based on what we have today in vfio driver, we don't have a mechanism
> > > for userspace to reset a device other than VFIO_DEVICE_RESET and
> > > VFIO_PCI_DEVICE_HOT_RESET ioctls.
> > The ask is for the host to notify the VM of the ->error_detected() event
> > and the VM then responding with one of the "enum pci_ers_result" values.
> 
> Maybe there is some confusion here. Could you clarify what do you mean by VM
> responding with "enum pci_ers_result" values? Is it a device driver (for
> example an NVMe driver) running in the VM that should do that? Or is it
> something else you are suggesting?

My expectation was that the host notifies the VM of the error,
the kernel in the VM notifies the nvme driver of the error,
the nvme driver returns a pci_ers_result return value from
its pci_error_handlers which the VM passes back to the host,
the host drives error recovery normally.

I was missing the high-level architectural overview that Niklas
subsequently provided.  You should provide it as part of your
series because otherwise it's difficult for reviewers to understand
what the individual patches are trying to achieve as a whole.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-08 18:14                   ` Lukas Wunner
  2025-10-08 21:55                     ` Farhan Ali
@ 2025-10-09  9:12                     ` Niklas Schnelle
  2025-10-12  6:34                       ` Lukas Wunner
  1 sibling, 1 reply; 47+ messages in thread
From: Niklas Schnelle @ 2025-10-09  9:12 UTC (permalink / raw)
  To: Lukas Wunner, Farhan Ali
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, mjrosato

On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> On Wed, Oct 08, 2025 at 10:56:35AM -0700, Farhan Ali wrote:
> > On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > > I'm not sure yet.  Let's back up a little:  I'm missing an
> > > architectural description how you're intending to do error
> > > recovery in the VM.  If I understand correctly, you're
> > > informing the VM of the error via the ->error_detected() callback.
> > > 
> > > You're saying you need to check for accessibility of the device
> > > prior to resetting it from the VM, does that mean you're attempting
> > > a reset from the ->error_detected() callback?
> > > 
> > > According to Documentation/PCI/pci-error-recovery.rst, the device
> > > isn't supposed to be considered accessible in ->error_detected().
> > > The first callback which allows access is ->mmio_enabled().
> > > 
> > 
> > The ->error_detected() callback is used to inform userspace of an error. In
> > the case of a VM, using QEMU as a userspace, once notified of an error QEMU
> > will inject an error into the guest in s390x architecture specific way [1]
> > (probably should have linked the QEMU series in the cover letter). Once
> > notified of the error VM's device driver will drive the recovery action. The
> > recovery action require a reset of the device and on s390x PCI devices are
> > reset using architecture specific instructions (zpci_device_hot_reset()).
> 
> According to Documentation/PCI/pci-error-recovery.rst:
> 
>    "STEP 1: Notification
>     --------------------
>     Platform calls the error_detected() callback on every instance of
>     every driver affected by the error.
>     At this point, the device might not be accessible anymore, [...]
>     it gives the driver a chance to cleanup, waiting for pending stuff
>     (timers, whatever, etc...) to complete; it can take semaphores,
>     schedule, etc... everything but touch the device."
>                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> And yet you're touching the device by trying to reset it.
> 
> The code you're introducing in patch [01/10] only becomes necessary
> because you're not following the above-quoted protocol.  If you
> follow the protocol, patch [01/10] becomes unnecessary.
> 

I agree with your point above error_detected() should not touch the
device. My understanding of Farhan's series though is that it follows
that rule. As I understand it error_detected() is only used to inject
the s390 specific PCI error event into the VM using the information
stored in patch 7. As before vfio-pci returns
PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
the pass-through case is detected and this gets turned into
PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
skipped. And yeah, writing it down I'm not super happy with this part,
maybe it would be better to have an explicit
PCI_ERS_RESULT_LEAVE_AS_IS.

Either way this leaves the PCI device in the error state just like for
the host the platform leaves the device in the error state. Up until
this point even if the VM/QEMU tried to do a reset already it would get
blocked on at least the zdev->state_lock until the recovery code is
done. Only after the VM would run its recovery code and with that drive
the reset.

> > > I also don't quite understand why the VM needs to perform a reset.
> > > Why can't you just let the VM tell the host that a reset is needed
> > > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > > behalf of the VM?

The reason is that we want the behavior from the VMs perspective to
follow s390's PCI error event handling architecture. In this model
however there is no mechanism to synchroniously ask the OS "An error
occurred would you want the device reset?" or to tell it that we as
hypervisor already unblocked MMIO/DMA or performed a reset. So instead
our idea was that we just do the error_detected() part in the host's
recovery code and then leave the device as is driving the rest from the
guest.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-09  9:12                     ` Niklas Schnelle
@ 2025-10-12  6:34                       ` Lukas Wunner
  2025-10-14 12:07                         ` Niklas Schnelle
  0 siblings, 1 reply; 47+ messages in thread
From: Lukas Wunner @ 2025-10-12  6:34 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Farhan Ali, Benjamin Block, linux-s390, kvm, linux-kernel,
	linux-pci, alex.williamson, helgaas, clg, mjrosato

On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
> On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> > And yet you're touching the device by trying to reset it.
> > 
> > The code you're introducing in patch [01/10] only becomes necessary
> > because you're not following the above-quoted protocol.  If you
> > follow the protocol, patch [01/10] becomes unnecessary.
> 
> I agree with your point above error_detected() should not touch the
> device. My understanding of Farhan's series though is that it follows
> that rule. As I understand it error_detected() is only used to inject
> the s390 specific PCI error event into the VM using the information
> stored in patch 7. As before vfio-pci returns
> PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
> the pass-through case is detected and this gets turned into
> PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
> skipped. And yeah, writing it down I'm not super happy with this part,
> maybe it would be better to have an explicit
> PCI_ERS_RESULT_LEAVE_AS_IS.

Thanks, that's the high-level overview I was looking for.

It would be good to include something like this at least
in the cover letter or additionally in the commit messages
so that it's easier for reviewers to connect the dots.

I was expecting paravirtualized error handling, i.e. the
VM is aware it's virtualized and vfio essentially proxies
the pci_ers_result return value of the driver (e.g. nvme)
back to the host, thereby allowing the host to drive error
recovery normally.  I'm not sure if there are technical
reasons preventing such an approach.

If you do want to stick with your alternative approach,
maybe doing the error handling in the ->mmio_enabled() phase
instead of ->error_detected() would make more sense.
In that phase you're allowed to access the device,
you can also attempt a local reset and return
PCI_ERS_RESULT_RECOVERED on success.

You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
from the ->error_detected() callback in order to progress
to the ->mmio_enabled() step.

Does that make sense?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-12  6:34                       ` Lukas Wunner
@ 2025-10-14 12:07                         ` Niklas Schnelle
  2025-10-16 21:00                           ` Farhan Ali
  2025-10-19 14:34                           ` Lukas Wunner
  0 siblings, 2 replies; 47+ messages in thread
From: Niklas Schnelle @ 2025-10-14 12:07 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Farhan Ali, Benjamin Block, linux-s390, kvm, linux-kernel,
	linux-pci, alex.williamson, helgaas, clg, mjrosato

On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
> > On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> > > And yet you're touching the device by trying to reset it.
> > > 
> > > The code you're introducing in patch [01/10] only becomes necessary
> > > because you're not following the above-quoted protocol.  If you
> > > follow the protocol, patch [01/10] becomes unnecessary.
> > 
> > I agree with your point above error_detected() should not touch the
> > device. My understanding of Farhan's series though is that it follows
> > that rule. As I understand it error_detected() is only used to inject
> > the s390 specific PCI error event into the VM using the information
> > stored in patch 7. As before vfio-pci returns
> > PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
> > the pass-through case is detected and this gets turned into
> > PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
> > skipped. And yeah, writing it down I'm not super happy with this part,
> > maybe it would be better to have an explicit
> > PCI_ERS_RESULT_LEAVE_AS_IS.
> 
> Thanks, that's the high-level overview I was looking for.
> 
> It would be good to include something like this at least
> in the cover letter or additionally in the commit messages
> so that it's easier for reviewers to connect the dots.
> 
> I was expecting paravirtualized error handling, i.e. the
> VM is aware it's virtualized and vfio essentially proxies
> the pci_ers_result return value of the driver (e.g. nvme)
> back to the host, thereby allowing the host to drive error
> recovery normally.  I'm not sure if there are technical
> reasons preventing such an approach.

It does sound technically feasible but sticking to the already
architected error reporting and recovery has clear advantages. For one
it will work with existing Linux versions without guest changes and it
also has precedent with it working already in the z/VM hypervisor for
years. I agree that there is some level of mismatch with Linux'
recovery support but I don't think that outweighs having a clean
virtualization support where the host and guest use the same interface.

> 
> If you do want to stick with your alternative approach,
> maybe doing the error handling in the ->mmio_enabled() phase
> instead of ->error_detected() would make more sense.
> In that phase you're allowed to access the device,
> you can also attempt a local reset and return
> PCI_ERS_RESULT_RECOVERED on success.
> 
> You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> from the ->error_detected() callback in order to progress
> to the ->mmio_enabled() step.
> 
> Does that make sense?
> 
> Thanks,
> 
> Lukas

The problem with using ->mmio_enabled() is two fold. For one we
sometimes have to do a reset instead of clearing the error state, for
example if the device was not only put in the error state but also
disabled, or if the guest driver wants it, so we would also have to use
->slot_reset() and could end up with two resets. Second and more
importantly this would break the guests assumption that the device will
be in the error state with MMIO and DMA blocked when it gets an error
event. On the other hand, that's exactly the state it is in if we
report the error in the ->error_detected() callback and then skip the
rest of recovery so it can be done in the guest, likely with the exact
same Linux code. I'd assume this should be similar if QEMU/KVM wanted
to virtualize AER+DPC except that there MMIO remains accessible?

Here's an idea. Could it be an option to detect the pass-through in the
vfio-pci driver's ->error_detected() callback, possibly with feedback
from QEMU (@Alex?), and then return PCI_ERS_RESULT_RECOVERED from there
skipping the rest of recovery?

The skipping would be in-line with the below section of the
documentation i.e. "no further intervention":

  - PCI_ERS_RESULT_RECOVERED
      Driver returns this if it thinks the device is usable despite
      the error and does not need further intervention.

It's just that in this case the device really remains with MMIO and DMA
blocked, usable only in the sense that the vfio-pci + guest VM combo
knows how to use a device with MMIO and DMA blocked with the guest
recovery.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-14 12:07                         ` Niklas Schnelle
@ 2025-10-16 21:00                           ` Farhan Ali
  2025-10-19 14:34                           ` Lukas Wunner
  1 sibling, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-10-16 21:00 UTC (permalink / raw)
  To: Niklas Schnelle, Lukas Wunner
  Cc: Benjamin Block, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, mjrosato


On 10/14/2025 5:07 AM, Niklas Schnelle wrote:
> On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
>> On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
>>> On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
>>>> And yet you're touching the device by trying to reset it.
>>>>
>>>> The code you're introducing in patch [01/10] only becomes necessary
>>>> because you're not following the above-quoted protocol.  If you
>>>> follow the protocol, patch [01/10] becomes unnecessary.
>>> I agree with your point above error_detected() should not touch the
>>> device. My understanding of Farhan's series though is that it follows
>>> that rule. As I understand it error_detected() is only used to inject
>>> the s390 specific PCI error event into the VM using the information
>>> stored in patch 7. As before vfio-pci returns
>>> PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
>>> the pass-through case is detected and this gets turned into
>>> PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
>>> skipped. And yeah, writing it down I'm not super happy with this part,
>>> maybe it would be better to have an explicit
>>> PCI_ERS_RESULT_LEAVE_AS_IS.
>> Thanks, that's the high-level overview I was looking for.
>>
>> It would be good to include something like this at least
>> in the cover letter or additionally in the commit messages
>> so that it's easier for reviewers to connect the dots.
>>
>> I was expecting paravirtualized error handling, i.e. the
>> VM is aware it's virtualized and vfio essentially proxies
>> the pci_ers_result return value of the driver (e.g. nvme)
>> back to the host, thereby allowing the host to drive error
>> recovery normally.  I'm not sure if there are technical
>> reasons preventing such an approach.
> It does sound technically feasible but sticking to the already
> architected error reporting and recovery has clear advantages. For one
> it will work with existing Linux versions without guest changes and it
> also has precedent with it working already in the z/VM hypervisor for
> years. I agree that there is some level of mismatch with Linux'
> recovery support but I don't think that outweighs having a clean
> virtualization support where the host and guest use the same interface.
>
>> If you do want to stick with your alternative approach,
>> maybe doing the error handling in the ->mmio_enabled() phase
>> instead of ->error_detected() would make more sense.
>> In that phase you're allowed to access the device,
>> you can also attempt a local reset and return
>> PCI_ERS_RESULT_RECOVERED on success.
>>
>> You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
>> from the ->error_detected() callback in order to progress
>> to the ->mmio_enabled() step.
>>
>> Does that make sense?
>>
>> Thanks,
>>
>> Lukas
> The problem with using ->mmio_enabled() is two fold. For one we
> sometimes have to do a reset instead of clearing the error state, for
> example if the device was not only put in the error state but also
> disabled, or if the guest driver wants it, so we would also have to use
> ->slot_reset() and could end up with two resets. Second and more
> importantly this would break the guests assumption that the device will
> be in the error state with MMIO and DMA blocked when it gets an error
> event. On the other hand, that's exactly the state it is in if we
> report the error in the ->error_detected() callback and then skip the
> rest of recovery so it can be done in the guest, likely with the exact
> same Linux code. I'd assume this should be similar if QEMU/KVM wanted
> to virtualize AER+DPC except that there MMIO remains accessible?
>
> Here's an idea. Could it be an option to detect the pass-through in the
> vfio-pci driver's ->error_detected() callback, possibly with feedback
> from QEMU (@Alex?), and then return PCI_ERS_RESULT_RECOVERED from there
> skipping the rest of recovery?
>
> The skipping would be in-line with the below section of the
> documentation i.e. "no further intervention":
>
>    - PCI_ERS_RESULT_RECOVERED
>        Driver returns this if it thinks the device is usable despite
>        the error and does not need further intervention.
>
> It's just that in this case the device really remains with MMIO and DMA
> blocked, usable only in the sense that the vfio-pci + guest VM combo
> knows how to use a device with MMIO and DMA blocked with the guest
> recovery.
>
> Thanks,
> Niklas

Hi Lukas,

Hope this helps to clarify why we still need patch [01/10] (or at least 
the check in pci_save_state() to see if the device responds with error 
value or not if we move forward with your patch series PCI: Universal 
error recoverability of devices). We can discuss if that check needs to 
be moved somewhere else if there is concern with overhead in 
pci_save_state(). Discussing with Niklas (off mailing list), we were 
thinking if it makes sense if vfio_pci_core_aer_err_detected() returned 
PCI_ERS_RESULT_RECOVERED if it doesn't need any further intervention 
from platform recovery to align closer to pcie-error-recovery 
documentation? One proposal would be to have a flag in struct 
vfio_pci_core_device(eg vdev->mediated_recovery), which can be used to 
return PCI_ERS_RESULT_RECOVERED in vfio_pci_core_aer_err_detected()if 
the flag was set. The flag could be set by userspace using 
VFIO_DEVICE_FEATURE_SET for the device feature 
VFIO_DEVICE_FEATURE_ZPCI_ERROR (would like to hear Alex's thoughts on 
this proposal).

Thanks

Farhan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-14 12:07                         ` Niklas Schnelle
  2025-10-16 21:00                           ` Farhan Ali
@ 2025-10-19 14:34                           ` Lukas Wunner
  2025-10-20  8:59                             ` Niklas Schnelle
  1 sibling, 1 reply; 47+ messages in thread
From: Lukas Wunner @ 2025-10-19 14:34 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Farhan Ali, Benjamin Block, linux-s390, kvm, linux-kernel,
	linux-pci, alex.williamson, helgaas, clg, mjrosato

On Tue, Oct 14, 2025 at 02:07:57PM +0200, Niklas Schnelle wrote:
> On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> > If you do want to stick with your alternative approach,
> > maybe doing the error handling in the ->mmio_enabled() phase
> > instead of ->error_detected() would make more sense.
> > In that phase you're allowed to access the device,
> > you can also attempt a local reset and return
> > PCI_ERS_RESULT_RECOVERED on success.
> > 
> > You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> > from the ->error_detected() callback in order to progress
> > to the ->mmio_enabled() step.
> 
> The problem with using ->mmio_enabled() is two fold. For one we
> sometimes have to do a reset instead of clearing the error state, for
> example if the device was not only put in the error state but also
> disabled, or if the guest driver wants it,

Well in that case you could reset the device in the ->mmio_enabled() step
from the guest using the vfio reset ioctl.

> Second and more
> importantly this would break the guests assumption that the device will
> be in the error state with MMIO and DMA blocked when it gets an error
> event. On the other hand, that's exactly the state it is in if we
> report the error in the ->error_detected() callback

At the risk of continuously talking past each other:

How about this, the host notifies the guest of the error in the
->error_detected() callback.  The guest notifies the driver and
collects the result (whether a reset is requested or not), then
returns PCI_ERS_RESULT_CAN_RECOVER to the host.

The host re-enables I/O to the device, invokes the ->mmio_detected()
callback.  The guest then resets the device based on the result it
collected earlier or invokes the driver's ->mmio_enabled() callback.

If the driver returns PCI_ERS_RESULT_NEED_RESET from the
->mmio_enabled() callback, you can likewise reset the device from
the guest using the ioctl method.

My concern is that by insisting that you handle device recovery
completely in the ->error_detected() phase, you're not complying
with the protocol specified in Documentation/PCI/pci-error-recovery.rst
and as a result, you have to amend the reset code in the PCI core
because it assumes that all arches adheres to the protocol.
In my view, that suggests that the approach needs to be reworked
to comply with the protocol.  Then the workarounds for performing
a reset while I/O is blocked become unnecessary.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-19 14:34                           ` Lukas Wunner
@ 2025-10-20  8:59                             ` Niklas Schnelle
  2025-11-22 10:58                               ` Lukas Wunner
  0 siblings, 1 reply; 47+ messages in thread
From: Niklas Schnelle @ 2025-10-20  8:59 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Farhan Ali, Benjamin Block, linux-s390, kvm, linux-kernel,
	linux-pci, alex.williamson, helgaas, clg, mjrosato

On Sun, 2025-10-19 at 16:34 +0200, Lukas Wunner wrote:
> On Tue, Oct 14, 2025 at 02:07:57PM +0200, Niklas Schnelle wrote:
> > On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> > > If you do want to stick with your alternative approach,
> > > maybe doing the error handling in the ->mmio_enabled() phase
> > > instead of ->error_detected() would make more sense.
> > > In that phase you're allowed to access the device,
> > > you can also attempt a local reset and return
> > > PCI_ERS_RESULT_RECOVERED on success.
> > > 
> > > You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> > > from the ->error_detected() callback in order to progress
> > > to the ->mmio_enabled() step.
> > 
> > The problem with using ->mmio_enabled() is two fold. For one we
> > sometimes have to do a reset instead of clearing the error state, for
> > example if the device was not only put in the error state but also
> > disabled, or if the guest driver wants it,
> 
> Well in that case you could reset the device in the ->mmio_enabled() step
> from the guest using the vfio reset ioctl.
> 
> > Second and more
> > importantly this would break the guests assumption that the device will
> > be in the error state with MMIO and DMA blocked when it gets an error
> > event. On the other hand, that's exactly the state it is in if we
> > report the error in the ->error_detected() callback
> 
> At the risk of continuously talking past each other:
> 
> How about this, the host notifies the guest of the error in the
> ->error_detected() callback.  The guest notifies the driver and
> collects the result (whether a reset is requested or not), then
> returns PCI_ERS_RESULT_CAN_RECOVER to the host.
> 
> The host re-enables I/O to the device, invokes the ->mmio_detected()
> callback.  The guest then resets the device based on the result it
> collected earlier or invokes the driver's ->mmio_enabled() callback.
> 
> If the driver returns PCI_ERS_RESULT_NEED_RESET from the
> ->mmio_enabled() callback, you can likewise reset the device from
> the guest using the ioctl method.
> 
> My concern is that by insisting that you handle device recovery
> completely in the ->error_detected() phase, you're not complying
> with the protocol specified in Documentation/PCI/pci-error-recovery.rst
> and as a result, you have to amend the reset code in the PCI core
> because it assumes that all arches adheres to the protocol.
> In my view, that suggests that the approach needs to be reworked
> to comply with the protocol.  Then the workarounds for performing
> a reset while I/O is blocked become unnecessary.
> 
> Thanks,
> 
> Lukas

Yeah I think we're talking past each other a bit. In my mind we're
really not doing the recovery in ->error_detected() at all. Within that
callback we only do the notify, and then do nothing in the rest of
recovery. Only after will the guest do recovery though I do see your
point that leaving the device in the error state kind of means that
recovery is still ongoing even if we're not in the recovery handler
anymore. But then any driver could also just return
PCI_ERS_RESULT_RECOVERED in error_detected() and land us in the same
situation. And at least on s390 there is also the case that the device
actually enters the error state before we get the error event so we
could already race into a situation where the guest does a reset and
the device is already in the error state but we haven't seen the event
yet. And this seems inherent to hardware blocking I/O on error which by
it's nature can happen at any time.

But let's put that aside, say we want to implement your model where we
do check with the guest and its device driver. How would that work,
somehow error_detected() would have to wait for the guest to proceed
into recovery and since the guest could just not do that we'd have to
have some kind of timeout. Also we can't allow the guest to choose
PCI_ERS_RESULT_RECOVERED because otherwise we'd again be in the
situation where recovery is completed without unblocking I/O. And if we
want to stick to the architecture QEMU/KVM will have to kind of have a
mode where after being informed of ongoing recovery for a device they
intercept attempts to reset / firmware calls for reset and turn that
into the correct return. And somehow also deal with the timeout because
e.g. old Linux guests won't do recovery but there is also no
architected way for a guest to say that it does recovery.

To me this seems worse than accepting that even with PCI error
recovery,  resets can encounter PCI devices with blocked I/O which is
already true now if e.g. user-space happens to drive a reset racing
with hardware I/O blocking.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space
  2025-10-20  8:59                             ` Niklas Schnelle
@ 2025-11-22 10:58                               ` Lukas Wunner
  0 siblings, 0 replies; 47+ messages in thread
From: Lukas Wunner @ 2025-11-22 10:58 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Farhan Ali, Benjamin Block, linux-s390, kvm, linux-kernel,
	linux-pci, alex.williamson, helgaas, clg, mjrosato

On Mon, Oct 20, 2025 at 10:59:48AM +0200, Niklas Schnelle wrote:
> Yeah I think we're talking past each other a bit. In my mind we're
> really not doing the recovery in ->error_detected() at all. Within that
> callback we only do the notify, and then do nothing in the rest of
> recovery. Only after will the guest do recovery though I do see your
> point that leaving the device in the error state kind of means that
> recovery is still ongoing even if we're not in the recovery handler
> anymore. But then any driver could also just return
> PCI_ERS_RESULT_RECOVERED in error_detected() and land us in the same
> situation.

That would be a bug in the driver.  The point of the pci_error_handlers
is to attempt recovery of the device in concert with the driver.
If the driver "fakes" a recovered device towards the PCI core and then
attempts recovery behind the PCI core's back, it gets to keep the pieces...

> But let's put that aside, say we want to implement your model where we
> do check with the guest and its device driver. How would that work,
> somehow error_detected() would have to wait for the guest to proceed
> into recovery and since the guest could just not do that we'd have to
> have some kind of timeout.

Right, a timeout seems reasonable.

> Also we can't allow the guest to choose
> PCI_ERS_RESULT_RECOVERED because otherwise we'd again be in the
> situation where recovery is completed without unblocking I/O.

The guest should only return that if the device has really recovered.
On an architecture which blocks I/O upon an error, by definition the
device cannot already be recovered in the ->error_detected() stage.

> And if we
> want to stick to the architecture QEMU/KVM will have to kind of have a
> mode where after being informed of ongoing recovery for a device they
> intercept attempts to reset / firmware calls for reset and turn that
> into the correct return. And somehow also deal with the timeout because
> e.g. old Linux guests won't do recovery but there is also no
> architected way for a guest to say that it does recovery.

I guess there are gaps in qemu with regards to error recovery,
but I think the solution is to add the missing functionality,
not try to work around the gaps.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 02/10] PCI: Add additional checks for flr reset
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 01/10] PCI: Avoid saving error values for config space Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-30 10:03   ` Benjamin Block
  2025-10-01 14:37   ` Benjamin Block
  2025-09-24 17:16 ` [PATCH v4 03/10] PCI: Allow per function PCI slots Farhan Ali
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

If a device is in an error state, then any reads of device registers can
return error value. Add addtional checks to validate if a device is in an
error state before doing an flr reset.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/pci.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a3d93d1baee7..327fefc6a1eb 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4576,12 +4576,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
  */
 int pcie_reset_flr(struct pci_dev *dev, bool probe)
 {
+	u32 reg;
+
 	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
 		return -ENOTTY;
 
 	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
 		return -ENOTTY;
 
+	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
+		pci_warn(dev, "Device unable to do an FLR\n");
+		return -ENOTTY;
+	}
+
 	if (probe)
 		return 0;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 02/10] PCI: Add additional checks for flr reset
  2025-09-24 17:16 ` [PATCH v4 02/10] PCI: Add additional checks for flr reset Farhan Ali
@ 2025-09-30 10:03   ` Benjamin Block
  2025-09-30 17:04     ` Farhan Ali
  2025-10-01 14:37   ` Benjamin Block
  1 sibling, 1 reply; 47+ messages in thread
From: Benjamin Block @ 2025-09-30 10:03 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato

On Wed, Sep 24, 2025 at 10:16:20AM -0700, Farhan Ali wrote:
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a3d93d1baee7..327fefc6a1eb 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4576,12 +4576,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
>   */
>  int pcie_reset_flr(struct pci_dev *dev, bool probe)
>  {
> +	u32 reg;
> +
>  	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
>  		return -ENOTTY;
>  
>  	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
>  		return -ENOTTY;
>  
> +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
> +		pci_warn(dev, "Device unable to do an FLR\n");
> +		return -ENOTTY;
> +	}

Just thinking out loud, not sure whether it make sense, but since you already
read an up-to-date value from the config space, would it make sense to
pull the check above `dev->devcap & PCI_EXP_DEVCAP_FLR` below this read, and
check on the just read `reg`?

Also wondering whether it makes sense to stable-tag this? We've recently seen
"unpleasant" recovery attempts that look like this in the kernel logs:

    [  663.330053] vfio-pci 0007:00:00.1: timed out waiting for pending transaction; performing function level reset anyway
    [  664.730051] vfio-pci 0007:00:00.1: not ready 1023ms after FLR; waiting
    [  665.830023] vfio-pci 0007:00:00.1: not ready 2047ms after FLR; waiting
    [  667.910023] vfio-pci 0007:00:00.1: not ready 4095ms after FLR; waiting
    [  672.070022] vfio-pci 0007:00:00.1: not ready 8191ms after FLR; waiting
    [  680.550025] vfio-pci 0007:00:00.1: not ready 16383ms after FLR; waiting
    [  697.190023] vfio-pci 0007:00:00.1: not ready 32767ms after FLR; waiting
    [  730.470021] vfio-pci 0007:00:00.1: not ready 65535ms after FLR; giving up

The VF here was already dead in the water at that point, so I suspect
`pci_read_config_dword()` might have failed, and so this check would have
failed, and we wouldn't have "wasted" the minute waiting for a FLR that was
never going to happen anyway.


-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 02/10] PCI: Add additional checks for flr reset
  2025-09-30 10:03   ` Benjamin Block
@ 2025-09-30 17:04     ` Farhan Ali
  2025-10-01  8:33       ` Benjamin Block
  0 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-09-30 17:04 UTC (permalink / raw)
  To: Benjamin Block
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato


On 9/30/2025 3:03 AM, Benjamin Block wrote:
> On Wed, Sep 24, 2025 at 10:16:20AM -0700, Farhan Ali wrote:
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index a3d93d1baee7..327fefc6a1eb 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -4576,12 +4576,19 @@ EXPORT_SYMBOL_GPL(pcie_flr);
>>    */
>>   int pcie_reset_flr(struct pci_dev *dev, bool probe)
>>   {
>> +	u32 reg;
>> +
>>   	if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
>>   		return -ENOTTY;
>>   
>>   	if (!(dev->devcap & PCI_EXP_DEVCAP_FLR))
>>   		return -ENOTTY;
>>   
>> +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
>> +		pci_warn(dev, "Device unable to do an FLR\n");
>> +		return -ENOTTY;
>> +	}
> Just thinking out loud, not sure whether it make sense, but since you already
> read an up-to-date value from the config space, would it make sense to
> pull the check above `dev->devcap & PCI_EXP_DEVCAP_FLR` below this read, and
> check on the just read `reg`?

My thinking was we could exit early if the device never had FLR 
capability (and so was not cached in devcap). This way we avoid an extra 
PCI read.


>
> Also wondering whether it makes sense to stable-tag this? We've recently seen
> "unpleasant" recovery attempts that look like this in the kernel logs:
>
>      [  663.330053] vfio-pci 0007:00:00.1: timed out waiting for pending transaction; performing function level reset anyway
>      [  664.730051] vfio-pci 0007:00:00.1: not ready 1023ms after FLR; waiting
>      [  665.830023] vfio-pci 0007:00:00.1: not ready 2047ms after FLR; waiting
>      [  667.910023] vfio-pci 0007:00:00.1: not ready 4095ms after FLR; waiting
>      [  672.070022] vfio-pci 0007:00:00.1: not ready 8191ms after FLR; waiting
>      [  680.550025] vfio-pci 0007:00:00.1: not ready 16383ms after FLR; waiting
>      [  697.190023] vfio-pci 0007:00:00.1: not ready 32767ms after FLR; waiting
>      [  730.470021] vfio-pci 0007:00:00.1: not ready 65535ms after FLR; giving up
>
> The VF here was already dead in the water at that point, so I suspect
> `pci_read_config_dword()` might have failed, and so this check would have
> failed, and we wouldn't have "wasted" the minute waiting for a FLR that was
> never going to happen anyway.
I think maybe we could? I don't think this patch fixes anything that's 
"broken" but rather improves the behavior to escalate to other reset 
method if the device is already in a bad state. I will cc stable.

Thanks

Farhan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 02/10] PCI: Add additional checks for flr reset
  2025-09-30 17:04     ` Farhan Ali
@ 2025-10-01  8:33       ` Benjamin Block
  0 siblings, 0 replies; 47+ messages in thread
From: Benjamin Block @ 2025-10-01  8:33 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato

On Tue, Sep 30, 2025 at 10:04:25AM -0700, Farhan Ali wrote:
> 
> On 9/30/2025 3:03 AM, Benjamin Block wrote:
> > On Wed, Sep 24, 2025 at 10:16:20AM -0700, Farhan Ali wrote:
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index a3d93d1baee7..327fefc6a1eb 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
--8<--
> >> +	if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &reg)) {
> >> +		pci_warn(dev, "Device unable to do an FLR\n");
> >> +		return -ENOTTY;
> >> +	}
> > Just thinking out loud, not sure whether it make sense, but since you already
> > read an up-to-date value from the config space, would it make sense to
> > pull the check above `dev->devcap & PCI_EXP_DEVCAP_FLR` below this read, and
> > check on the just read `reg`?
> 
> My thinking was we could exit early if the device never had FLR 
> capability (and so was not cached in devcap). This way we avoid an extra 
> PCI read.

That makes sense.

> > Also wondering whether it makes sense to stable-tag this? We've recently seen
> > "unpleasant" recovery attempts that look like this in the kernel logs:
> >
> >      [  663.330053] vfio-pci 0007:00:00.1: timed out waiting for pending transaction; performing function level reset anyway
> >      [  664.730051] vfio-pci 0007:00:00.1: not ready 1023ms after FLR; waiting
> >      [  665.830023] vfio-pci 0007:00:00.1: not ready 2047ms after FLR; waiting
> >      [  667.910023] vfio-pci 0007:00:00.1: not ready 4095ms after FLR; waiting
> >      [  672.070022] vfio-pci 0007:00:00.1: not ready 8191ms after FLR; waiting
> >      [  680.550025] vfio-pci 0007:00:00.1: not ready 16383ms after FLR; waiting
> >      [  697.190023] vfio-pci 0007:00:00.1: not ready 32767ms after FLR; waiting
> >      [  730.470021] vfio-pci 0007:00:00.1: not ready 65535ms after FLR; giving up
> >
> > The VF here was already dead in the water at that point, so I suspect
> > `pci_read_config_dword()` might have failed, and so this check would have
> > failed, and we wouldn't have "wasted" the minute waiting for a FLR that was
> > never going to happen anyway.
>
> I think maybe we could? I don't think this patch fixes anything that's 
> "broken" but rather improves the behavior to escalate to other reset 
> method if the device is already in a bad state. I will cc stable.

Right, adding a Fixes tag doesn't really make sense. But it does help with
accelerating recoveries, so maybe a stable tag will work :)


-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 02/10] PCI: Add additional checks for flr reset
  2025-09-24 17:16 ` [PATCH v4 02/10] PCI: Add additional checks for flr reset Farhan Ali
  2025-09-30 10:03   ` Benjamin Block
@ 2025-10-01 14:37   ` Benjamin Block
  1 sibling, 0 replies; 47+ messages in thread
From: Benjamin Block @ 2025-10-01 14:37 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato

On Wed, Sep 24, 2025 at 10:16:20AM -0700, Farhan Ali wrote:
> If a device is in an error state, then any reads of device registers can
> return error value. Add addtional checks to validate if a device is in an
> error state before doing an flr reset.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 

Apart from the discussion about a stable tag this looks good to me.


Reviewed-by: Benjamin Block <bblock@linux.ibm.com>

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 03/10] PCI: Allow per function PCI slots
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 01/10] PCI: Avoid saving error values for config space Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 02/10] PCI: Add additional checks for flr reset Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-10-01 14:34   ` Benjamin Block
  2025-09-24 17:16 ` [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato, stable

On s390 systems, which use a machine level hypervisor, PCI devices are
always accessed through a form of PCI pass-through which fundamentally
operates on a per PCI function granularity. This is also reflected in the
s390 PCI hotplug driver which creates hotplug slots for individual PCI
functions. Its reset_slot() function, which is a wrapper for
zpci_hot_reset_device(), thus also resets individual functions.

Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
to multifunction devices. This approach worked fine on s390 systems that
only exposed virtual functions as individual PCI domains to the operating
system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
s390 supports exposing the topology of multifunction PCI devices by
grouping them in a shared PCI domain. When attempting to reset a function
through the hotplug driver, the shared slot assignment causes the wrong
function to be reset instead of the intended one. It also leaks memory as
we do create a pci_slot object for the function, but don't correctly free
it in pci_slot_release().

Add a flag for struct pci_slot to allow per function PCI slots for
functions managed through a hypervisor, which exposes individual PCI
functions while retaining the topology.

Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
Cc: <stable@vger.kernel.org>
Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/pci/hotplug/s390_pci_hpc.c | 10 ++++++++--
 drivers/pci/pci.c                  |  5 +++--
 drivers/pci/slot.c                 | 14 +++++++++++---
 include/linux/pci.h                |  1 +
 4 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/hotplug/s390_pci_hpc.c b/drivers/pci/hotplug/s390_pci_hpc.c
index d9996516f49e..8b547de464bf 100644
--- a/drivers/pci/hotplug/s390_pci_hpc.c
+++ b/drivers/pci/hotplug/s390_pci_hpc.c
@@ -126,14 +126,20 @@ static const struct hotplug_slot_ops s390_hotplug_slot_ops = {
 
 int zpci_init_slot(struct zpci_dev *zdev)
 {
+	int ret;
 	char name[SLOT_NAME_SIZE];
 	struct zpci_bus *zbus = zdev->zbus;
 
 	zdev->hotplug_slot.ops = &s390_hotplug_slot_ops;
 
 	snprintf(name, SLOT_NAME_SIZE, "%08x", zdev->fid);
-	return pci_hp_register(&zdev->hotplug_slot, zbus->bus,
-			       zdev->devfn, name);
+	ret = pci_hp_register(&zdev->hotplug_slot, zbus->bus,
+				zdev->devfn, name);
+	if (ret)
+		return ret;
+
+	zdev->hotplug_slot.pci_slot->per_func_slot = 1;
+	return 0;
 }
 
 void zpci_exit_slot(struct zpci_dev *zdev)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 327fefc6a1eb..530a793a332c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5057,8 +5057,9 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, bool probe)
 
 static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
 {
-	if (dev->multifunction || dev->subordinate || !dev->slot ||
-	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
+	if (dev->subordinate || !dev->slot ||
+	    dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET ||
+	    (dev->multifunction && !dev->slot->per_func_slot))
 		return -ENOTTY;
 
 	return pci_reset_hotplug_slot(dev->slot->hotplug, probe);
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index 50fb3eb595fe..51ee59e14393 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -63,6 +63,14 @@ static ssize_t cur_speed_read_file(struct pci_slot *slot, char *buf)
 	return bus_speed_read(slot->bus->cur_bus_speed, buf);
 }
 
+static bool pci_dev_matches_slot(struct pci_dev *dev, struct pci_slot *slot)
+{
+	if (slot->per_func_slot)
+		return dev->devfn == slot->number;
+
+	return PCI_SLOT(dev->devfn) == slot->number;
+}
+
 static void pci_slot_release(struct kobject *kobj)
 {
 	struct pci_dev *dev;
@@ -73,7 +81,7 @@ static void pci_slot_release(struct kobject *kobj)
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &slot->bus->devices, bus_list)
-		if (PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = NULL;
 	up_read(&pci_bus_sem);
 
@@ -166,7 +174,7 @@ void pci_dev_assign_slot(struct pci_dev *dev)
 
 	mutex_lock(&pci_slot_mutex);
 	list_for_each_entry(slot, &dev->bus->slots, list)
-		if (PCI_SLOT(dev->devfn) == slot->number)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	mutex_unlock(&pci_slot_mutex);
 }
@@ -285,7 +293,7 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 
 	down_read(&pci_bus_sem);
 	list_for_each_entry(dev, &parent->devices, bus_list)
-		if (PCI_SLOT(dev->devfn) == slot_nr)
+		if (pci_dev_matches_slot(dev, slot))
 			dev->slot = slot;
 	up_read(&pci_bus_sem);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 59876de13860..9265f32d9786 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -78,6 +78,7 @@ struct pci_slot {
 	struct list_head	list;		/* Node in list of slots */
 	struct hotplug_slot	*hotplug;	/* Hotplug info (move here) */
 	unsigned char		number;		/* PCI_SLOT(pci_dev->devfn) */
+	unsigned int		per_func_slot:1; /* Allow per function slot */
 	struct kobject		kobj;
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 03/10] PCI: Allow per function PCI slots
  2025-09-24 17:16 ` [PATCH v4 03/10] PCI: Allow per function PCI slots Farhan Ali
@ 2025-10-01 14:34   ` Benjamin Block
  0 siblings, 0 replies; 47+ messages in thread
From: Benjamin Block @ 2025-10-01 14:34 UTC (permalink / raw)
  To: Farhan Ali
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, schnelle, mjrosato, stable

On Wed, Sep 24, 2025 at 10:16:21AM -0700, Farhan Ali wrote:
> On s390 systems, which use a machine level hypervisor, PCI devices are
> always accessed through a form of PCI pass-through which fundamentally
> operates on a per PCI function granularity. This is also reflected in the
> s390 PCI hotplug driver which creates hotplug slots for individual PCI
> functions. Its reset_slot() function, which is a wrapper for
> zpci_hot_reset_device(), thus also resets individual functions.
> 
> Currently, the kernel's PCI_SLOT() macro assigns the same pci_slot object
> to multifunction devices. This approach worked fine on s390 systems that
> only exposed virtual functions as individual PCI domains to the operating
> system.  Since commit 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> s390 supports exposing the topology of multifunction PCI devices by
> grouping them in a shared PCI domain. When attempting to reset a function
> through the hotplug driver, the shared slot assignment causes the wrong
> function to be reset instead of the intended one. It also leaks memory as
> we do create a pci_slot object for the function, but don't correctly free
> it in pci_slot_release().
> 
> Add a flag for struct pci_slot to allow per function PCI slots for
> functions managed through a hypervisor, which exposes individual PCI
> functions while retaining the topology.
> 
> Fixes: 44510d6fa0c0 ("s390/pci: Handling multifunctions")
> Cc: <stable@vger.kernel.org>
> Suggested-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/pci/hotplug/s390_pci_hpc.c | 10 ++++++++--
>  drivers/pci/pci.c                  |  5 +++--
>  drivers/pci/slot.c                 | 14 +++++++++++---
>  include/linux/pci.h                |  1 +
>  4 files changed, 23 insertions(+), 7 deletions(-)
> 

Looks good to me; thanks.


Reviewed-by: Benjamin Block <bblock@linux.ibm.com>

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (2 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 03/10] PCI: Allow per function PCI slots Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-25 10:54   ` Niklas Schnelle
  2025-10-02 12:58   ` Niklas Schnelle
  2025-09-24 17:16 ` [PATCH v4 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

On s390 today we overwrite the PCI BAR resource address to either an
artificial cookie address or MIO address. However this address is different
from the bus address of the BARs programmed by firmware. The artificial
cookie address was created to index into an array of function handles
(zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
but maybe different from the bus address. This creates an issue when trying
to convert the BAR resource address to bus address using the generic
pcibios_resource_to_bus().

Implement an architecture specific pcibios_resource_to_bus() function to
correctly translate PCI BAR resource addresses to bus addresses for s390.
Similarly add architecture specific pcibios_bus_to_resource function to do
the reverse translation.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
 drivers/pci/host-bridge.c |  4 +--
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index cd6676c2d602..3992ba44e1cf 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -264,6 +264,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 	return 0;
 }
 
+void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+			     struct resource *res)
+{
+	struct zpci_bus *zbus = bus->sysdata;
+	struct zpci_bar_struct *zbar;
+	struct zpci_dev *zdev;
+
+	region->start = res->start;
+	region->end = res->end;
+
+	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+		int j = 0;
+
+		zbar = NULL;
+		zdev = zbus->function[i];
+		if (!zdev)
+			continue;
+
+		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
+			if (zdev->bars[j].res->start == res->start &&
+			    zdev->bars[j].res->end == res->end &&
+			    res->flags & IORESOURCE_MEM) {
+				zbar = &zdev->bars[j];
+				break;
+			}
+		}
+
+		if (zbar) {
+			/* only MMIO is supported */
+			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
+			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				region->start |= (u64)zdev->bars[j + 1].val << 32;
+
+			region->end = region->start + (1UL << zbar->size) - 1;
+			return;
+		}
+	}
+}
+
+void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+			     struct pci_bus_region *region)
+{
+	struct zpci_bus *zbus = bus->sysdata;
+	struct zpci_dev *zdev;
+	resource_size_t start, end;
+
+	res->start = region->start;
+	res->end = region->end;
+
+	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
+		zdev = zbus->function[i];
+		if (!zdev || !zdev->has_resources)
+			continue;
+
+		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
+			if (!zdev->bars[j].size)
+				continue;
+
+			/* only MMIO is supported */
+			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
+			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				start |= (u64)zdev->bars[j + 1].val << 32;
+
+			end = start + (1UL << zdev->bars[j].size) - 1;
+
+			if (start == region->start && end == region->end) {
+				res->start = zdev->bars[j].res->start;
+				res->end = zdev->bars[j].res->end;
+				return;
+			}
+		}
+	}
+}
+
 void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
 			   pgprot_t prot)
 {
diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index afa50b446567..56d62afb3afe 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -48,7 +48,7 @@ void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
 }
 EXPORT_SYMBOL_GPL(pci_set_host_bridge_release);
 
-void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
+void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
 			     struct resource *res)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
@@ -73,7 +73,7 @@ static bool region_contains(struct pci_bus_region *region1,
 	return region1->start <= region2->start && region1->end >= region2->end;
 }
 
-void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+void __weak pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
 			     struct pci_bus_region *region)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-24 17:16 ` [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2025-09-25 10:54   ` Niklas Schnelle
  2025-10-01 16:04     ` Benjamin Block
  2025-10-02 12:58   ` Niklas Schnelle
  1 sibling, 1 reply; 47+ messages in thread
From: Niklas Schnelle @ 2025-09-25 10:54 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, mjrosato

On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> On s390 today we overwrite the PCI BAR resource address to either an
> artificial cookie address or MIO address. 
> 

I'm not sure "overwrite" fits here. Maybe just "are" and also use the
plural "addresses" and drop the "we" so:
"On s390 today PCI BAR resource addresses are either artificial cookie
addresses or MIO addresses". Then also adjust for the plural below with
"these addresses are".

Backghround: The resource addresses are CPU addresses used for MMIO. On
s390 we either have to adapt the old PCI instructions, which are
distinctly not memory mapped for a memory mapping based API via the
address cookie / zpci_iomap mechanismm, or if we have memory-I/O (MIO)
support, use the MIO addresses + memory mapped PCI instructions. Even
the MIO addresses may not match the bus addresses.

> However this address is different
> from the bus address of the BARs programmed by firmware. The artificial
> cookie address was created to index into an array of function handles
> (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> but maybe different from the bus address. This creates an issue when trying

Nit: "may be different from the corresponding bus addresses."

> to convert the BAR resource address to bus address using the generic
> pcibios_resource_to_bus().
> 
> Implement an architecture specific pcibios_resource_to_bus() function to
> correctly translate PCI BAR resource addresses to bus addresses for s390.
> Similarly add architecture specific pcibios_bus_to_resource function to do
> the reverse translation.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
>  drivers/pci/host-bridge.c |  4 +--
>  2 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index cd6676c2d602..3992ba44e1cf 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -264,6 +264,80 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
>  	return 0;
>  }
>  
> +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> +			     struct resource *res)
> +{
> +	struct zpci_bus *zbus = bus->sysdata;
> +	struct zpci_bar_struct *zbar;
> +	struct zpci_dev *zdev;
> +
> +	region->start = res->start;
> +	region->end = res->end;

When we don't find a BAR matching the resource this would become the
region used. I'm not sure what a better value would be if we don't find
a match though and that should hopefully not happen in sensible uses.
Also this would keep the existing behavior so seems fine.

> +
> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> +		int j = 0;
> +
> +		zbar = NULL;
> +		zdev = zbus->function[i];
> +		if (!zdev)
> +			continue;
> +
> +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
> +			if (zdev->bars[j].res->start == res->start &&
> +			    zdev->bars[j].res->end == res->end &&
> +			    res->flags & IORESOURCE_MEM) {
> +				zbar = &zdev->bars[j];
> +				break;
> +			}
> +		}
> +
> +		if (zbar) {
> +			/* only MMIO is supported */
> +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
> +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				region->start |= (u64)zdev->bars[j + 1].val << 32;
> +
> +			region->end = region->start + (1UL << zbar->size) - 1;
> +			return;
> +		}
> +	}
> +}
> +
> +void pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
> +			     struct pci_bus_region *region)
> +{
> +	struct zpci_bus *zbus = bus->sysdata;
> +	struct zpci_dev *zdev;
> +	resource_size_t start, end;
> +
> +	res->start = region->start;
> +	res->end = region->end;

Similar to above. One thought though, I think we could set res->flags
!= IORESOURCE_UNSET | IORESOURCE_DISABLED. Of course that would have to
be moved after the loop so it only takes effect when we don't find a
match.

> +
> +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> +		zdev = zbus->function[i];
> +		if (!zdev || !zdev->has_resources)
> +			continue;
> +
> +		for (int j = 0; j < PCI_STD_NUM_BARS; j++) {
> +			if (!zdev->bars[j].size)
> +				continue;
> +
> +			/* only MMIO is supported */
> +			start = zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_MASK;
> +			if (zdev->bars[j].val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				start |= (u64)zdev->bars[j + 1].val << 32;
> +
> +			end = start + (1UL << zdev->bars[j].size) - 1;
> +
> +			if (start == region->start && end == region->end) {
> +				res->start = zdev->bars[j].res->start;
> +				res->end = zdev->bars[j].res->end;
> +				return;
> +			}
> +		}
> +	}
> +}
> +
--- snip ---


With or without my suggestion this clearly looks more correct than what
we had so far, even though that hasn't caused issues as far as I'm
aware until your BAR restoration change.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-25 10:54   ` Niklas Schnelle
@ 2025-10-01 16:04     ` Benjamin Block
  2025-10-01 18:01       ` Farhan Ali
  0 siblings, 1 reply; 47+ messages in thread
From: Benjamin Block @ 2025-10-01 16:04 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci,
	alex.williamson, helgaas, clg, mjrosato

On Thu, Sep 25, 2025 at 12:54:07PM +0200, Niklas Schnelle wrote:
> On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> > +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> > +			     struct resource *res)
> > +{
> > +	struct zpci_bus *zbus = bus->sysdata;
> > +	struct zpci_bar_struct *zbar;
> > +	struct zpci_dev *zdev;
> > +
> > +	region->start = res->start;
> > +	region->end = res->end;
> 
> When we don't find a BAR matching the resource this would become the
> region used. I'm not sure what a better value would be if we don't find
> a match though and that should hopefully not happen in sensible uses.
> Also this would keep the existing behavior so seems fine.

I was wondering the same things, but I guess it matches what happens elsewhere
as well, if no match is found

	void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
				     struct resource *res)
	{
		...
		resource_size_t offset = 0;

		resource_list_for_each_entry(window, &bridge->windows) {
			if (resource_contains(window->res, res)) {
				offset = window->offset;
				break;
			}
		}

		region->start = res->start - offset;
		region->end = res->end - offset;
	}

So I guess that is fine.

The thing I'm also unclear on is whether it is OK to "throw out" this whole
logic about `resource_contains(window->res, res)` here and
`region_contains(&bus_region, region)` in the other original function?
I mean, the original function don't search for perfect matches, but also
matches where are contained in a given resource/region, which is different
from what we do here. Are we OK with not doing that at all?

> > +
> > +	for (int i = 0; i < ZPCI_FUNCTIONS_PER_BUS; i++) {
> > +		int j = 0;
> > +
> > +		zbar = NULL;
> > +		zdev = zbus->function[i];
> > +		if (!zdev)
> > +			continue;
> > +
> > +		for (j = 0; j < PCI_STD_NUM_BARS; j++) {
> > +			if (zdev->bars[j].res->start == res->start &&
> > +			    zdev->bars[j].res->end == res->end &&
> > +			    res->flags & IORESOURCE_MEM) {
> > +				zbar = &zdev->bars[j];
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (zbar) {
> > +			/* only MMIO is supported */
> > +			region->start = zbar->val & PCI_BASE_ADDRESS_MEM_MASK;
> > +			if (zbar->val & PCI_BASE_ADDRESS_MEM_TYPE_64)
> > +				region->start |= (u64)zdev->bars[j + 1].val << 32;
> > +
> > +			region->end = region->start + (1UL << zbar->size) - 1;
> > +			return;
> > +		}
> > +	}
> > +}

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-10-01 16:04     ` Benjamin Block
@ 2025-10-01 18:01       ` Farhan Ali
  0 siblings, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-10-01 18:01 UTC (permalink / raw)
  To: Benjamin Block, Niklas Schnelle
  Cc: linux-s390, kvm, linux-kernel, linux-pci, alex.williamson,
	helgaas, clg, mjrosato


On 10/1/2025 9:04 AM, Benjamin Block wrote:
> On Thu, Sep 25, 2025 at 12:54:07PM +0200, Niklas Schnelle wrote:
>> On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
>>> +void pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
>>> +			     struct resource *res)
>>> +{
>>> +	struct zpci_bus *zbus = bus->sysdata;
>>> +	struct zpci_bar_struct *zbar;
>>> +	struct zpci_dev *zdev;
>>> +
>>> +	region->start = res->start;
>>> +	region->end = res->end;
>> When we don't find a BAR matching the resource this would become the
>> region used. I'm not sure what a better value would be if we don't find
>> a match though and that should hopefully not happen in sensible uses.
>> Also this would keep the existing behavior so seems fine.
> I was wondering the same things, but I guess it matches what happens elsewhere
> as well, if no match is found
>
> 	void __weak pcibios_resource_to_bus(struct pci_bus *bus, struct pci_bus_region *region,
> 				     struct resource *res)
> 	{
> 		...
> 		resource_size_t offset = 0;
>
> 		resource_list_for_each_entry(window, &bridge->windows) {
> 			if (resource_contains(window->res, res)) {
> 				offset = window->offset;
> 				break;
> 			}
> 		}
>
> 		region->start = res->start - offset;
> 		region->end = res->end - offset;
> 	}
>
> So I guess that is fine.
>
> The thing I'm also unclear on is whether it is OK to "throw out" this whole
> logic about `resource_contains(window->res, res)` here and
> `region_contains(&bus_region, region)` in the other original function?
> I mean, the original function don't search for perfect matches, but also
> matches where are contained in a given resource/region, which is different
> from what we do here. Are we OK with not doing that at all?

I had thought about doing the range check, similar to 
resource_contains/region_contains rather than doing exact checks. But I 
think the way we expose the topology of the devices, the offset in our 
(s390x) case is zero. So I thought it should be safe to just doing exact 
match and might help us catch any issues if its not exact.

Thanks

Farhan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-09-24 17:16 ` [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
  2025-09-25 10:54   ` Niklas Schnelle
@ 2025-10-02 12:58   ` Niklas Schnelle
  2025-10-02 17:00     ` Bjorn Helgaas
  1 sibling, 1 reply; 47+ messages in thread
From: Niklas Schnelle @ 2025-10-02 12:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Ilpo Järvinen
  Cc: alex.williamson, clg, mjrosato, Farhan Ali, linux-s390, kvm,
	linux-kernel, linux-pci

On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> On s390 today we overwrite the PCI BAR resource address to either an
> artificial cookie address or MIO address. However this address is different
> from the bus address of the BARs programmed by firmware. The artificial
> cookie address was created to index into an array of function handles
> (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> but maybe different from the bus address. This creates an issue when trying
> to convert the BAR resource address to bus address using the generic
> pcibios_resource_to_bus().
> 
> Implement an architecture specific pcibios_resource_to_bus() function to
> correctly translate PCI BAR resource addresses to bus addresses for s390.
> Similarly add architecture specific pcibios_bus_to_resource function to do
> the reverse translation.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
>  drivers/pci/host-bridge.c |  4 +--
>  2 files changed, 76 insertions(+), 2 deletions(-)
> 

@Bjorn, interesting new development. This actually fixes a current
linux-next breakage for us. In linux-next commit 06b77d5647a4 ("PCI:
Mark resources IORESOURCE_UNSET when outside bridge windows") from Ilpo
(added) breaks PCI on s390 because the check he added in
__pci_read_base() doesn't find the resource because the BAR address
does not match our MIO / address cookie addresses. With this patch
added however the pcibios_bus_to_resource() in __pci_read_base()
converts  the region correctly and then Ilpo's check works. I was
looking at this code quite intensely today wondering about Benjamin's
comment if we do need to check for containment rather than exact match.
I concluded that I think it is fine as is and was about to give my R-b
before Gerd had tracked down the linux-next issue and I found that this
fixes it.

So now I wonder if we might want to pick this one already to fix the
linux-next regression? Either way I'd like to add my:

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-10-02 12:58   ` Niklas Schnelle
@ 2025-10-02 17:00     ` Bjorn Helgaas
  2025-10-02 17:16       ` Ilpo Järvinen
  2025-10-02 18:14       ` Niklas Schnelle
  0 siblings, 2 replies; 47+ messages in thread
From: Bjorn Helgaas @ 2025-10-02 17:00 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Ilpo Järvinen, alex.williamson, clg, mjrosato, Farhan Ali,
	linux-s390, kvm, linux-kernel, linux-pci

On Thu, Oct 02, 2025 at 02:58:45PM +0200, Niklas Schnelle wrote:
> On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> > On s390 today we overwrite the PCI BAR resource address to either an
> > artificial cookie address or MIO address. However this address is different
> > from the bus address of the BARs programmed by firmware. The artificial
> > cookie address was created to index into an array of function handles
> > (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> > but maybe different from the bus address. This creates an issue when trying
> > to convert the BAR resource address to bus address using the generic
> > pcibios_resource_to_bus().
> > 
> > Implement an architecture specific pcibios_resource_to_bus() function to
> > correctly translate PCI BAR resource addresses to bus addresses for s390.
> > Similarly add architecture specific pcibios_bus_to_resource function to do
> > the reverse translation.
> > 
> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > ---
> >  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
> >  drivers/pci/host-bridge.c |  4 +--
> >  2 files changed, 76 insertions(+), 2 deletions(-)
> > 
> 
> @Bjorn, interesting new development. This actually fixes a current
> linux-next breakage for us. In linux-next commit 06b77d5647a4 ("PCI:
> Mark resources IORESOURCE_UNSET when outside bridge windows") from Ilpo
> (added) breaks PCI on s390 because the check he added in
> __pci_read_base() doesn't find the resource because the BAR address
> does not match our MIO / address cookie addresses. With this patch
> added however the pcibios_bus_to_resource() in __pci_read_base()
> converts  the region correctly and then Ilpo's check works. I was
> looking at this code quite intensely today wondering about Benjamin's
> comment if we do need to check for containment rather than exact match.
> I concluded that I think it is fine as is and was about to give my R-b
> before Gerd had tracked down the linux-next issue and I found that this
> fixes it.
> 
> So now I wonder if we might want to pick this one already to fix the
> linux-next regression? Either way I'd like to add my:
> 
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

Hmmm, thanks for the report.  I'm about ready to send the pull
request, and I hate to include something that is known to break s390
and would require a fix before v6.18.  At the same time, I hate to add
non-trivial code, including more weak functions, this late in the
window.

06b77d5647a4 ("PCI: Mark resources IORESOURCE_UNSET when outside
bridge windows") fixes some bogus messages, but I'm not sure that it's
actually a functional change.  So maybe the simplest at this point
would be to defer that commit until we can do it and the s390 change
together.

Bjorn

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-10-02 17:00     ` Bjorn Helgaas
@ 2025-10-02 17:16       ` Ilpo Järvinen
  2025-10-02 18:14       ` Niklas Schnelle
  1 sibling, 0 replies; 47+ messages in thread
From: Ilpo Järvinen @ 2025-10-02 17:16 UTC (permalink / raw)
  To: Bjorn Helgaas, Geert Uytterhoeven
  Cc: Niklas Schnelle, alex.williamson, clg, mjrosato, Farhan Ali,
	linux-s390, kvm, LKML, linux-pci

On Thu, 2 Oct 2025, Bjorn Helgaas wrote:

> On Thu, Oct 02, 2025 at 02:58:45PM +0200, Niklas Schnelle wrote:
> > On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> > > On s390 today we overwrite the PCI BAR resource address to either an
> > > artificial cookie address or MIO address. However this address is different
> > > from the bus address of the BARs programmed by firmware. The artificial
> > > cookie address was created to index into an array of function handles
> > > (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> > > but maybe different from the bus address. This creates an issue when trying
> > > to convert the BAR resource address to bus address using the generic
> > > pcibios_resource_to_bus().
> > > 
> > > Implement an architecture specific pcibios_resource_to_bus() function to
> > > correctly translate PCI BAR resource addresses to bus addresses for s390.
> > > Similarly add architecture specific pcibios_bus_to_resource function to do
> > > the reverse translation.
> > > 
> > > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > > ---
> > >  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
> > >  drivers/pci/host-bridge.c |  4 +--
> > >  2 files changed, 76 insertions(+), 2 deletions(-)
> > > 
> > 
> > @Bjorn, interesting new development. This actually fixes a current
> > linux-next breakage for us. In linux-next commit 06b77d5647a4 ("PCI:
> > Mark resources IORESOURCE_UNSET when outside bridge windows") from Ilpo
> > (added) breaks PCI on s390 because the check he added in
> > __pci_read_base() doesn't find the resource because the BAR address
> > does not match our MIO / address cookie addresses. With this patch
> > added however the pcibios_bus_to_resource() in __pci_read_base()
> > converts  the region correctly and then Ilpo's check works. I was
> > looking at this code quite intensely today wondering about Benjamin's
> > comment if we do need to check for containment rather than exact match.
> > I concluded that I think it is fine as is and was about to give my R-b
> > before Gerd had tracked down the linux-next issue and I found that this
> > fixes it.
> > 
> > So now I wonder if we might want to pick this one already to fix the
> > linux-next regression? Either way I'd like to add my:
> > 
> > Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> 
> Hmmm, thanks for the report.  I'm about ready to send the pull
> request, and I hate to include something that is known to break s390
> and would require a fix before v6.18.  At the same time, I hate to add
> non-trivial code, including more weak functions, this late in the
> window.
> 
> 06b77d5647a4 ("PCI: Mark resources IORESOURCE_UNSET when outside
> bridge windows") fixes some bogus messages, but I'm not sure that it's
> actually a functional change.  So maybe the simplest at this point
> would be to defer that commit until we can do it and the s390 change
> together.

Hi,

I didn't notice any issues because of the conflict messages, but then, I 
didn't look very deeply into what those pnp things were as it seemed bug 
in PCI core we want to fix anyway.

Deferring the commit 06b77d5647a4 would be prudent as there seems to be 
another problem in Geert's case discussed in the other thread. Even this 
short time in next has already served us well by exposing things that need 
fixing so better to wait until we've known things resolved.

-- 
 i.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation
  2025-10-02 17:00     ` Bjorn Helgaas
  2025-10-02 17:16       ` Ilpo Järvinen
@ 2025-10-02 18:14       ` Niklas Schnelle
  1 sibling, 0 replies; 47+ messages in thread
From: Niklas Schnelle @ 2025-10-02 18:14 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ilpo Järvinen, alex.williamson, clg, mjrosato, Farhan Ali,
	linux-s390, kvm, linux-kernel, linux-pci

On Thu, 2025-10-02 at 12:00 -0500, Bjorn Helgaas wrote:
> On Thu, Oct 02, 2025 at 02:58:45PM +0200, Niklas Schnelle wrote:
> > On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> > > On s390 today we overwrite the PCI BAR resource address to either an
> > > artificial cookie address or MIO address. However this address is different
> > > from the bus address of the BARs programmed by firmware. The artificial
> > > cookie address was created to index into an array of function handles
> > > (zpci_iomap_start). The MIO (mapped I/O) addresses are provided by firmware
> > > but maybe different from the bus address. This creates an issue when trying
> > > to convert the BAR resource address to bus address using the generic
> > > pcibios_resource_to_bus().
> > > 
> > > Implement an architecture specific pcibios_resource_to_bus() function to
> > > correctly translate PCI BAR resource addresses to bus addresses for s390.
> > > Similarly add architecture specific pcibios_bus_to_resource function to do
> > > the reverse translation.
> > > 
> > > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > > ---
> > >  arch/s390/pci/pci.c       | 74 +++++++++++++++++++++++++++++++++++++++
> > >  drivers/pci/host-bridge.c |  4 +--
> > >  2 files changed, 76 insertions(+), 2 deletions(-)
> > > 
> > 
> > @Bjorn, interesting new development. This actually fixes a current
> > linux-next breakage for us. In linux-next commit 06b77d5647a4 ("PCI:
> > Mark resources IORESOURCE_UNSET when outside bridge windows") from Ilpo
> > (added) breaks PCI on s390 because the check he added in
> > __pci_read_base() doesn't find the resource because the BAR address
> > does not match our MIO / address cookie addresses. With this patch
> > added however the pcibios_bus_to_resource() in __pci_read_base()
> > converts  the region correctly and then Ilpo's check works. I was
> > looking at this code quite intensely today wondering about Benjamin's
> > comment if we do need to check for containment rather than exact match.
> > I concluded that I think it is fine as is and was about to give my R-b
> > before Gerd had tracked down the linux-next issue and I found that this
> > fixes it.
> > 
> > So now I wonder if we might want to pick this one already to fix the
> > linux-next regression? Either way I'd like to add my:
> > 
> > Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> 
> Hmmm, thanks for the report.  I'm about ready to send the pull
> request, and I hate to include something that is known to break s390
> and would require a fix before v6.18.  At the same time, I hate to add
> non-trivial code, including more weak functions, this late in the
> window.
> 
> 06b77d5647a4 ("PCI: Mark resources IORESOURCE_UNSET when outside
> bridge windows") fixes some bogus messages, but I'm not sure that it's
> actually a functional change.  So maybe the simplest at this point
> would be to defer that commit until we can do it and the s390 change
> together.
> 
> Bjorn

Yeah that makes a lot of sense. I agree this change is not completely
trivial. Might have been a little enthusiastic with it reviving PCI
support from the dead on linux-next. If the other patch is not an
important fix and Ilpo seems to agree, then simply reverting it is the
safe solution.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (3 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

Commit c1e18c17bda6 ("s390/pci: add zpci_set_irq()/zpci_clear_irq()"),
introduced the zpci_set_irq() and zpci_clear_irq(), to be used while
resetting a zPCI device.

Commit da995d538d3a ("s390/pci: implement reset_slot for hotplug slot"),
mentions zpci_clear_irq() being called in the path for zpci_hot_reset_device().
But that is not the case anymore and these functions are not called
outside of this file. Instead zpci_hot_reset_device() relies on
zpci_disable_device() also clearing the IRQs, but misses to reset the
zdev->irqs_registered flag.

However after a CLP disable/enable reset, the device's IRQ are
unregistered, but the flag zdev->irq_registered does not get cleared. It
creates an inconsistent state and so arch_restore_msi_irqs() doesn't
correctly restore the device's IRQ. This becomes a problem when a PCI
driver tries to restore the state of the device through
pci_restore_state(). Restore IRQ unconditionally for the device and remove
the irq_registered flag as its redundant.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/include/asm/pci.h | 1 -
 arch/s390/pci/pci_irq.c     | 9 +--------
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 41f900f693d9..aed19a1aa9d7 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -145,7 +145,6 @@ struct zpci_dev {
 	u8		has_resources	: 1;
 	u8		is_physfn	: 1;
 	u8		util_str_avail	: 1;
-	u8		irqs_registered	: 1;
 	u8		tid_avail	: 1;
 	u8		rtr_avail	: 1; /* Relaxed translation allowed */
 	unsigned int	devfn;		/* DEVFN part of the RID*/
diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
index 84482a921332..e73be96ce5fe 100644
--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -107,9 +107,6 @@ static int zpci_set_irq(struct zpci_dev *zdev)
 	else
 		rc = zpci_set_airq(zdev);

-	if (!rc)
-		zdev->irqs_registered = 1;
-
 	return rc;
 }

@@ -123,9 +120,6 @@ static int zpci_clear_irq(struct zpci_dev *zdev)
 	else
 		rc = zpci_clear_airq(zdev);

-	if (!rc)
-		zdev->irqs_registered = 0;
-
 	return rc;
 }

@@ -427,8 +421,7 @@ bool arch_restore_msi_irqs(struct pci_dev *pdev)
 {
 	struct zpci_dev *zdev = to_zpci(pdev);

-	if (!zdev->irqs_registered)
-		zpci_set_irq(zdev);
+	zpci_set_irq(zdev);
 	return true;
 }

-- 
2.43.0

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v4 06/10] s390/pci: Update the logic for detecting passthrough device
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (4 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

We can now have userspace drivers (vfio-pci based) on s390x. The userspace
drivers will not have any KVM fd and so no kzdev associated with them. So
we need to update the logic for detecting passthrough devices to not depend
on struct kvm_zdev.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      |  1 +
 arch/s390/pci/pci_event.c        | 14 ++++----------
 drivers/vfio/pci/vfio_pci_zdev.c |  9 ++++++++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index aed19a1aa9d7..f47f62fc3bfd 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -169,6 +169,7 @@ struct zpci_dev {
 
 	char res_name[16];
 	bool mio_capable;
+	bool mediated_recovery;
 	struct zpci_bar_struct bars[PCI_STD_NUM_BARS];
 
 	u64		start_dma;	/* Start of available DMA addresses */
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index d930416d4c90..541d536be052 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -61,16 +61,10 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
 	}
 }
 
-static bool is_passed_through(struct pci_dev *pdev)
+static bool needs_mediated_recovery(struct pci_dev *pdev)
 {
 	struct zpci_dev *zdev = to_zpci(pdev);
-	bool ret;
-
-	mutex_lock(&zdev->kzdev_lock);
-	ret = !!zdev->kzdev;
-	mutex_unlock(&zdev->kzdev_lock);
-
-	return ret;
+	return zdev->mediated_recovery;
 }
 
 static bool is_driver_supported(struct pci_driver *driver)
@@ -194,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (is_passed_through(pdev)) {
+	if (needs_mediated_recovery(pdev)) {
 		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
 			pci_name(pdev));
 		status_str = "failed (pass-through)";
@@ -277,7 +271,7 @@ static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
 	 * we will inject the error event and let the guest recover the device
 	 * itself.
 	 */
-	if (is_passed_through(pdev))
+	if (needs_mediated_recovery(pdev))
 		goto out;
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 0990fdb146b7..a7bc23ce8483 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -148,6 +148,8 @@ int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 	if (!zdev)
 		return -ENODEV;
 
+	zdev->mediated_recovery = true;
+
 	if (!vdev->vdev.kvm)
 		return 0;
 
@@ -161,7 +163,12 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
 
-	if (!zdev || !vdev->vdev.kvm)
+	if (!zdev)
+		return;
+
+	zdev->mediated_recovery = false;
+
+	if (!vdev->vdev.kvm)
 		return;
 
 	if (zpci_kvm_hook.kvm_unregister)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (5 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-25 14:28   ` Niklas Schnelle
  2025-09-24 17:16 ` [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

For a passthrough device we need co-operation from user space to recover
the device. This would require to bubble up any error information to user
space.  Let's store this error information for passthrough devices, so it
can be retrieved later.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      | 28 ++++++++++
 arch/s390/pci/pci.c              |  1 +
 arch/s390/pci/pci_event.c        | 95 +++++++++++++++++++-------------
 drivers/vfio/pci/vfio_pci_zdev.c |  2 +
 4 files changed, 88 insertions(+), 38 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index f47f62fc3bfd..40bfe5721109 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -116,6 +116,31 @@ struct zpci_bus {
 	enum pci_bus_speed	max_bus_speed;
 };
 
+/* Content Code Description for PCI Function Error */
+struct zpci_ccdf_err {
+	u32 reserved1;
+	u32 fh;                         /* function handle */
+	u32 fid;                        /* function id */
+	u32 ett         :  4;           /* expected table type */
+	u32 mvn         : 12;           /* MSI vector number */
+	u32 dmaas       :  8;           /* DMA address space */
+	u32 reserved2   :  6;
+	u32 q           :  1;           /* event qualifier */
+	u32 rw          :  1;           /* read/write */
+	u64 faddr;                      /* failing address */
+	u32 reserved3;
+	u16 reserved4;
+	u16 pec;                        /* PCI event code */
+} __packed;
+
+#define ZPCI_ERR_PENDING_MAX 4
+struct zpci_ccdf_pending {
+	u8 count;
+	u8 head;
+	u8 tail;
+	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
+};
+
 /* Private data per function */
 struct zpci_dev {
 	struct zpci_bus *zbus;
@@ -191,6 +216,8 @@ struct zpci_dev {
 	struct iommu_domain *s390_domain; /* attached IOMMU domain */
 	struct kvm_zdev *kzdev;
 	struct mutex kzdev_lock;
+	struct zpci_ccdf_pending pending_errs;
+	struct mutex pending_errs_lock;
 	spinlock_t dom_lock;		/* protect s390_domain change */
 };
 
@@ -316,6 +343,7 @@ void zpci_debug_exit_device(struct zpci_dev *);
 int zpci_report_error(struct pci_dev *, struct zpci_report_error_header *);
 int zpci_clear_error_state(struct zpci_dev *zdev);
 int zpci_reset_load_store_blocked(struct zpci_dev *zdev);
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev);
 
 #ifdef CONFIG_NUMA
 
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 3992ba44e1cf..759dfd9e0fb8 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -897,6 +897,7 @@ struct zpci_dev *zpci_create_device(u32 fid, u32 fh, enum zpci_state state)
 	mutex_init(&zdev->state_lock);
 	mutex_init(&zdev->fmb_lock);
 	mutex_init(&zdev->kzdev_lock);
+	mutex_init(&zdev->pending_errs_lock);
 
 	return zdev;
 
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 541d536be052..9f396a2847ab 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -18,23 +18,6 @@
 #include "pci_bus.h"
 #include "pci_report.h"
 
-/* Content Code Description for PCI Function Error */
-struct zpci_ccdf_err {
-	u32 reserved1;
-	u32 fh;				/* function handle */
-	u32 fid;			/* function id */
-	u32 ett		:  4;		/* expected table type */
-	u32 mvn		: 12;		/* MSI vector number */
-	u32 dmaas	:  8;		/* DMA address space */
-	u32		:  6;
-	u32 q		:  1;		/* event qualifier */
-	u32 rw		:  1;		/* read/write */
-	u64 faddr;			/* failing address */
-	u32 reserved3;
-	u16 reserved4;
-	u16 pec;			/* PCI event code */
-} __packed;
-
 /* Content Code Description for PCI Function Availability */
 struct zpci_ccdf_avail {
 	u32 reserved1;
@@ -76,6 +59,41 @@ static bool is_driver_supported(struct pci_driver *driver)
 	return true;
 }
 
+static void zpci_store_pci_error(struct pci_dev *pdev,
+				 struct zpci_ccdf_err *ccdf)
+{
+	struct zpci_dev *zdev = to_zpci(pdev);
+	int i;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
+		pr_err("%s: Maximum number (%d) of pending error events queued",
+				pci_name(pdev), ZPCI_ERR_PENDING_MAX);
+		mutex_unlock(&zdev->pending_errs_lock);
+		return;
+	}
+
+	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
+	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
+	zdev->pending_errs.tail++;
+	zdev->pending_errs.count++;
+	mutex_unlock(&zdev->pending_errs_lock);
+}
+
+void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
+{
+	struct pci_dev *pdev = NULL;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
+	if (zdev->pending_errs.count)
+		pr_info("%s: Unhandled PCI error events count=%d",
+				pci_name(pdev), zdev->pending_errs.count);
+	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
+	mutex_unlock(&zdev->pending_errs_lock);
+}
+EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
+
 static pci_ers_result_t zpci_event_notify_error_detected(struct pci_dev *pdev,
 							 struct pci_driver *driver)
 {
@@ -169,7 +187,8 @@ static pci_ers_result_t zpci_event_do_reset(struct pci_dev *pdev,
  * and the platform determines which functions are affected for
  * multi-function devices.
  */
-static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
+static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev,
+							  struct zpci_ccdf_err *ccdf)
 {
 	pci_ers_result_t ers_res = PCI_ERS_RESULT_DISCONNECT;
 	struct zpci_dev *zdev = to_zpci(pdev);
@@ -188,13 +207,6 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (needs_mediated_recovery(pdev)) {
-		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
-			pci_name(pdev));
-		status_str = "failed (pass-through)";
-		goto out_unlock;
-	}
-
 	driver = to_pci_driver(pdev->dev.driver);
 	if (!is_driver_supported(driver)) {
 		if (!driver) {
@@ -210,12 +222,23 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 		goto out_unlock;
 	}
 
+	if (needs_mediated_recovery(pdev))
+		zpci_store_pci_error(pdev, ccdf);
+
 	ers_res = zpci_event_notify_error_detected(pdev, driver);
 	if (ers_result_indicates_abort(ers_res)) {
 		status_str = "failed (abort on detection)";
 		goto out_unlock;
 	}
 
+	if (needs_mediated_recovery(pdev)) {
+		pr_info("%s: Leaving recovery of pass-through device to user-space\n",
+				pci_name(pdev));
+		ers_res = PCI_ERS_RESULT_RECOVERED;
+		status_str = "in progress";
+		goto out_unlock;
+	}
+
 	if (ers_res != PCI_ERS_RESULT_NEED_RESET) {
 		ers_res = zpci_event_do_error_state_clear(pdev, driver);
 		if (ers_result_indicates_abort(ers_res)) {
@@ -258,25 +281,20 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
  * @pdev: PCI function for which to report
  * @es: PCI channel failure state to report
  */
-static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
+static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es,
+				  struct zpci_ccdf_err *ccdf)
 {
 	struct pci_driver *driver;
 
 	pci_dev_lock(pdev);
 	pdev->error_state = es;
-	/**
-	 * While vfio-pci's error_detected callback notifies user-space QEMU
-	 * reacts to this by freezing the guest. In an s390 environment PCI
-	 * errors are rarely fatal so this is overkill. Instead in the future
-	 * we will inject the error event and let the guest recover the device
-	 * itself.
-	 */
+
 	if (needs_mediated_recovery(pdev))
-		goto out;
+		zpci_store_pci_error(pdev, ccdf);
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
 		driver->err_handler->error_detected(pdev, pdev->error_state);
-out:
+
 	pci_dev_unlock(pdev);
 }
 
@@ -322,12 +340,13 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
 		break;
 	case 0x0040: /* Service Action or Error Recovery Failed */
 	case 0x003b:
-		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
 		break;
 	default: /* PCI function left in the error state attempt to recover */
-		ers_res = zpci_event_attempt_error_recovery(pdev);
+		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
 		if (ers_res != PCI_ERS_RESULT_RECOVERED)
-			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
+			zpci_event_io_failure(pdev, pci_channel_io_perm_failure,
+					ccdf);
 		break;
 	}
 	pci_dev_put(pdev);
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index a7bc23ce8483..2be37eab9279 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 
 	zdev->mediated_recovery = false;
 
+	zpci_cleanup_pending_errors(zdev);
+
 	if (!vdev->vdev.kvm)
 		return;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-24 17:16 ` [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2025-09-25 14:28   ` Niklas Schnelle
  2025-09-25 16:29     ` Farhan Ali
  0 siblings, 1 reply; 47+ messages in thread
From: Niklas Schnelle @ 2025-09-25 14:28 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, mjrosato

On Wed, 2025-09-24 at 10:16 -0700, Farhan Ali wrote:
> For a passthrough device we need co-operation from user space to recover
> the device. This would require to bubble up any error information to user
> space.  Let's store this error information for passthrough devices, so it
> can be retrieved later.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h      | 28 ++++++++++
>  arch/s390/pci/pci.c              |  1 +
>  arch/s390/pci/pci_event.c        | 95 +++++++++++++++++++-------------
>  drivers/vfio/pci/vfio_pci_zdev.c |  2 +
>  4 files changed, 88 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index f47f62fc3bfd..40bfe5721109 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -116,6 +116,31 @@ struct zpci_bus {
>  	enum pci_bus_speed	max_bus_speed;
>  };
>  
> +/* Content Code Description for PCI Function Error */
> +struct zpci_ccdf_err {
> +	u32 reserved1;
> +	u32 fh;                         /* function handle */
> +	u32 fid;                        /* function id */
> +	u32 ett         :  4;           /* expected table type */
> +	u32 mvn         : 12;           /* MSI vector number */
> +	u32 dmaas       :  8;           /* DMA address space */
> +	u32 reserved2   :  6;
> +	u32 q           :  1;           /* event qualifier */
> +	u32 rw          :  1;           /* read/write */
> +	u64 faddr;                      /* failing address */
> +	u32 reserved3;
> +	u16 reserved4;
> +	u16 pec;                        /* PCI event code */
> +} __packed;
> +
> +#define ZPCI_ERR_PENDING_MAX 4
> +struct zpci_ccdf_pending {
> +	u8 count;
> +	u8 head;
> +	u8 tail;
> +	struct zpci_ccdf_err err[ZPCI_ERR_PENDING_MAX];
> +};

Thanks this looks more reasonably sized. 

> +
>  /* Private data per function */
>  struct zpci_dev {
>  	struct zpci_bus *zbus;
> @@ -191,6 +216,8 @@ struct zpci_dev {
>  	struct iommu_domain *s390_domain; /* attached IOMMU domain */
>  	struct kvm_zdev *kzdev;
>  	struct mutex kzdev_lock;
> +	struct zpci_ccdf_pending pending_errs;
> +	struct mutex pending_errs_lock;
>  	spinlock_t dom_lock;		/* protect s390_domain change */
>  };
>  
--- snip ---
> +static void zpci_store_pci_error(struct pci_dev *pdev,
> +				 struct zpci_ccdf_err *ccdf)
> +{
> +	struct zpci_dev *zdev = to_zpci(pdev);
> +	int i;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	if (zdev->pending_errs.count >= ZPCI_ERR_PENDING_MAX) {
> +		pr_err("%s: Maximum number (%d) of pending error events queued",
> +				pci_name(pdev), ZPCI_ERR_PENDING_MAX);

So for a vfio-pci user which doesn't pick up the error information but
does reset on error and thus recovers we would leave just the 4 first
errors that occurred in the pending_errs and get this message once. I
think that is okay and maybe even preferrable since most errors are,
well, errors. And often the first time something went wrong is the
interesting one. So I think this makes sense.

> +		mutex_unlock(&zdev->pending_errs_lock);
> +		return;
> +	}
> +
> +	i = zdev->pending_errs.tail % ZPCI_ERR_PENDING_MAX;
> +	memcpy(&zdev->pending_errs.err[i], ccdf, sizeof(struct zpci_ccdf_err));
> +	zdev->pending_errs.tail++;
> +	zdev->pending_errs.count++;
> +	mutex_unlock(&zdev->pending_errs_lock);
> +}
> +
> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
> +{
> +	struct pci_dev *pdev = NULL;
> +
> +	mutex_lock(&zdev->pending_errs_lock);
> +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);

I think you missed my comment on the previous version. This is missing
the matching pci_dev_put() for the pci_get_slot().

> +	if (zdev->pending_errs.count)
> +		pr_info("%s: Unhandled PCI error events count=%d",
> +				pci_name(pdev), zdev->pending_errs.count);
> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
> +	mutex_unlock(&zdev->pending_errs_lock);
> +}
> +EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
> +
> 
--- snip ---
>  
> @@ -322,12 +340,13 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>  		break;
>  	case 0x0040: /* Service Action or Error Recovery Failed */
>  	case 0x003b:
> -		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
> +		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>  		break;
>  	default: /* PCI function left in the error state attempt to recover */
> -		ers_res = zpci_event_attempt_error_recovery(pdev);
> +		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
>  		if (ers_res != PCI_ERS_RESULT_RECOVERED)
> -			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
> +			zpci_event_io_failure(pdev, pci_channel_io_perm_failure,
> +					ccdf);

Nit: I'd just keep the above on one line. It's still below the 100
columns limit and just cleaner on one line.

>  		break;
>  	}
>  	pci_dev_put(pdev);
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index a7bc23ce8483..2be37eab9279 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>  
>  	zdev->mediated_recovery = false;
>  
> +	zpci_cleanup_pending_errors(zdev);
> +
>  	if (!vdev->vdev.kvm)
>  		return;
>  

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices
  2025-09-25 14:28   ` Niklas Schnelle
@ 2025-09-25 16:29     ` Farhan Ali
  0 siblings, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-25 16:29 UTC (permalink / raw)
  To: Niklas Schnelle, linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, mjrosato

>> +void zpci_cleanup_pending_errors(struct zpci_dev *zdev)
>> +{
>> +	struct pci_dev *pdev = NULL;
>> +
>> +	mutex_lock(&zdev->pending_errs_lock);
>> +	pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
> I think you missed my comment on the previous version. This is missing
> the matching pci_dev_put() for the pci_get_slot().

Ah yes indeed i missed that comment, my apologies. Will fixup.


>
>> +	if (zdev->pending_errs.count)
>> +		pr_info("%s: Unhandled PCI error events count=%d",
>> +				pci_name(pdev), zdev->pending_errs.count);
>> +	memset(&zdev->pending_errs, 0, sizeof(struct zpci_ccdf_pending));
>> +	mutex_unlock(&zdev->pending_errs_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(zpci_cleanup_pending_errors);
>> +
>>
> --- snip ---
>>   
>> @@ -322,12 +340,13 @@ static void __zpci_event_error(struct zpci_ccdf_err *ccdf)
>>   		break;
>>   	case 0x0040: /* Service Action or Error Recovery Failed */
>>   	case 0x003b:
>> -		zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
>> +		zpci_event_io_failure(pdev, pci_channel_io_perm_failure, ccdf);
>>   		break;
>>   	default: /* PCI function left in the error state attempt to recover */
>> -		ers_res = zpci_event_attempt_error_recovery(pdev);
>> +		ers_res = zpci_event_attempt_error_recovery(pdev, ccdf);
>>   		if (ers_res != PCI_ERS_RESULT_RECOVERED)
>> -			zpci_event_io_failure(pdev, pci_channel_io_perm_failure);
>> +			zpci_event_io_failure(pdev, pci_channel_io_perm_failure,
>> +					ccdf);
> Nit: I'd just keep the above on one line. It's still below the 100
> columns limit and just cleaner on one line.

I think I did this for checkpatch warning, but can move it back and see 
if the warning happens.

Thanks
Farhan

>
>>   		break;
>>   	}
>>   	pci_dev_put(pdev);
>> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
>> index a7bc23ce8483..2be37eab9279 100644
>> --- a/drivers/vfio/pci/vfio_pci_zdev.c
>> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
>> @@ -168,6 +168,8 @@ void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
>>   
>>   	zdev->mediated_recovery = false;
>>   
>> +	zpci_cleanup_pending_errors(zdev);
>> +
>>   	if (!vdev->vdev.kvm)
>>   		return;
>>   

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (6 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-25  8:04   ` kernel test robot
  2025-09-24 17:16 ` [PATCH v4 09/10] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  9 siblings, 1 reply; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

For zPCI devices, we have platform specific error information. The platform
firmware provides this error information to the operating system in an
architecture specific mechanism. To enable recovery from userspace for
these devices, we want to expose this error information to userspace. Add a
new device feature to expose this information.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c |  2 ++
 drivers/vfio/pci/vfio_pci_priv.h |  8 ++++++++
 drivers/vfio/pci/vfio_pci_zdev.c | 34 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 15 ++++++++++++++
 4 files changed, 59 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..378adb3226db 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1514,6 +1514,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_ZPCI_ERROR:
+		return vfio_pci_zdev_feature_err(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index a9972eacb293..808d92aab76b 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -86,6 +86,8 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				struct vfio_info_cap *caps);
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev);
 void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev);
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 					      struct vfio_info_cap *caps)
@@ -100,6 +102,12 @@ static inline int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 
 static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
 {}
+
+static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+				     void __user *arg, size_t argsz)
+{
+	return -ENODEV;
+}
 #endif
 
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 2be37eab9279..261954039aa9 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -141,6 +141,40 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
+			      void __user *arg, size_t argsz)
+{
+	struct vfio_device_feature_zpci_err err;
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	int ret;
+	int head = 0;
+
+	if (!zdev)
+		return -ENODEV;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(err));
+	if (ret != 1)
+		return ret;
+
+	mutex_lock(&zdev->pending_errs_lock);
+	if (zdev->pending_errs.count) {
+		head = zdev->pending_errs.head % ZPCI_ERR_PENDING_MAX;
+		err.pec = zdev->pending_errs.err[head].pec;
+		zdev->pending_errs.head++;
+		zdev->pending_errs.count--;
+		err.pending_errors = zdev->pending_errs.count;
+	}
+	mutex_unlock(&zdev->pending_errs_lock);
+
+	if (copy_to_user(arg, &err, sizeof(err)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int vfio_pci_zdev_open_device(struct vfio_pci_core_device *vdev)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009ba..d72177bc3961 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1478,6 +1478,21 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_ERROR feature provides PCI error information to
+ * userspace for vfio-pci devices on s390x. On s390x PCI error recovery involves
+ * platform firmware and notification to operating system is done by
+ * architecture specific mechanism.  Exposing this information to userspace
+ * allows userspace to take appropriate actions to handle an error on the
+ * device.
+ */
+struct vfio_device_feature_zpci_err {
+	__u16 pec;
+	__u8 pending_errors;
+	__u8 pad;
+};
+#define VFIO_DEVICE_FEATURE_ZPCI_ERROR 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information
  2025-09-24 17:16 ` [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2025-09-25  8:04   ` kernel test robot
  0 siblings, 0 replies; 47+ messages in thread
From: kernel test robot @ 2025-09-25  8:04 UTC (permalink / raw)
  To: Farhan Ali, linux-s390, kvm, linux-kernel, linux-pci
  Cc: oe-kbuild-all, alex.williamson, helgaas, clg, alifm, schnelle,
	mjrosato

Hi Farhan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on pci/next]
[also build test WARNING on pci/for-linus awilliam-vfio/next s390/features linus/master v6.17-rc7]
[cannot apply to awilliam-vfio/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Farhan-Ali/PCI-Avoid-saving-error-values-for-config-space/20250925-012105
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link:    https://lore.kernel.org/r/20250924171628.826-9-alifm%40linux.ibm.com
patch subject: [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information
config: csky-randconfig-002-20250925 (https://download.01.org/0day-ci/archive/20250925/202509251506.Z5Ov6pYQ-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250925/202509251506.Z5Ov6pYQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509251506.Z5Ov6pYQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/vfio/pci/vfio_pci_config.c:29:
>> drivers/vfio/pci/vfio_pci_priv.h:106:12: warning: 'vfio_pci_zdev_feature_err' defined but not used [-Wunused-function]
     106 | static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~


vim +/vfio_pci_zdev_feature_err +106 drivers/vfio/pci/vfio_pci_priv.h

   105	
 > 106	static int vfio_pci_zdev_feature_err(struct vfio_device *device, u32 flags,
   107					     void __user *arg, size_t argsz)
   108	{
   109		return -ENODEV;
   110	}
   111	#endif
   112	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v4 09/10] vfio: Add a reset_done callback for vfio-pci driver
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (7 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  2025-09-24 17:16 ` [PATCH v4 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali
  9 siblings, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

On error recovery for a PCI device bound to vfio-pci driver, we want to
recover the state of the device to its last known saved state. The callback
restores the state of the device to its initial saved state.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 378adb3226db..f2fcb81b3e69 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2241,6 +2241,17 @@ pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
+static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+	if (!vdev->pci_saved_state)
+		return;
+
+	pci_load_saved_state(pdev, vdev->pci_saved_state);
+	pci_restore_state(pdev);
+}
+
 int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 				  int nr_virtfn)
 {
@@ -2305,6 +2316,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 const struct pci_error_handlers vfio_pci_core_err_handlers = {
 	.error_detected = vfio_pci_core_aer_err_detected,
+	.reset_done = vfio_pci_core_aer_reset_done,
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v4 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX
  2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
                   ` (8 preceding siblings ...)
  2025-09-24 17:16 ` [PATCH v4 09/10] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
@ 2025-09-24 17:16 ` Farhan Ali
  9 siblings, 0 replies; 47+ messages in thread
From: Farhan Ali @ 2025-09-24 17:16 UTC (permalink / raw)
  To: linux-s390, kvm, linux-kernel, linux-pci
  Cc: alex.williamson, helgaas, clg, alifm, schnelle, mjrosato

We are configuring the error signaling on the vast majority of devices and
it's extremely rare that it fires anyway. This allows userspace to be
notified on errors for legacy PCI devices. The Internal Share Memory (ISM)
device on s390x is one such device. For PCI devices on IBM s390x error
recovery involves platform firmware and notification to operating system
is done by architecture specific way. So the ISM device can still be
recovered when notified of an error.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 6 ++----
 drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f2fcb81b3e69..d125471fd5ea 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -749,8 +749,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
 		}
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
-			return 1;
+		return 1;
 	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
 		return 1;
 	}
@@ -1150,8 +1149,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
 	case VFIO_PCI_REQ_IRQ_INDEX:
 		break;
 	case VFIO_PCI_ERR_IRQ_INDEX:
-		if (pci_is_pcie(vdev->pdev))
-			break;
+		break;
 		fallthrough;
 	default:
 		return -EINVAL;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 123298a4dc8f..f2d13b6eb28f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -838,8 +838,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 	case VFIO_PCI_ERR_IRQ_INDEX:
 		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
 		case VFIO_IRQ_SET_ACTION_TRIGGER:
-			if (pci_is_pcie(vdev->pdev))
-				func = vfio_pci_set_err_trigger;
+			func = vfio_pci_set_err_trigger;
 			break;
 		}
 		break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2025-11-22 10:58 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-24 17:16 [PATCH v4 00/10] Error recovery for vfio-pci devices on s390x Farhan Ali
2025-09-24 17:16 ` [PATCH v4 01/10] PCI: Avoid saving error values for config space Farhan Ali
2025-10-01 15:15   ` Benjamin Block
2025-10-01 17:12     ` Farhan Ali
2025-10-02  9:16       ` Benjamin Block
2025-10-04 14:54       ` Lukas Wunner
2025-10-06 17:54         ` Farhan Ali
2025-10-06 19:26           ` Lukas Wunner
2025-10-06 21:35             ` Farhan Ali
2025-10-08 13:34               ` Lukas Wunner
2025-10-08 17:56                 ` Farhan Ali
2025-10-08 18:14                   ` Lukas Wunner
2025-10-08 21:55                     ` Farhan Ali
2025-10-09  4:52                       ` Lukas Wunner
2025-10-09 17:02                         ` Farhan Ali
2025-10-12  6:43                           ` Lukas Wunner
2025-10-09  9:12                     ` Niklas Schnelle
2025-10-12  6:34                       ` Lukas Wunner
2025-10-14 12:07                         ` Niklas Schnelle
2025-10-16 21:00                           ` Farhan Ali
2025-10-19 14:34                           ` Lukas Wunner
2025-10-20  8:59                             ` Niklas Schnelle
2025-11-22 10:58                               ` Lukas Wunner
2025-09-24 17:16 ` [PATCH v4 02/10] PCI: Add additional checks for flr reset Farhan Ali
2025-09-30 10:03   ` Benjamin Block
2025-09-30 17:04     ` Farhan Ali
2025-10-01  8:33       ` Benjamin Block
2025-10-01 14:37   ` Benjamin Block
2025-09-24 17:16 ` [PATCH v4 03/10] PCI: Allow per function PCI slots Farhan Ali
2025-10-01 14:34   ` Benjamin Block
2025-09-24 17:16 ` [PATCH v4 04/10] s390/pci: Add architecture specific resource/bus address translation Farhan Ali
2025-09-25 10:54   ` Niklas Schnelle
2025-10-01 16:04     ` Benjamin Block
2025-10-01 18:01       ` Farhan Ali
2025-10-02 12:58   ` Niklas Schnelle
2025-10-02 17:00     ` Bjorn Helgaas
2025-10-02 17:16       ` Ilpo Järvinen
2025-10-02 18:14       ` Niklas Schnelle
2025-09-24 17:16 ` [PATCH v4 05/10] s390/pci: Restore IRQ unconditionally for the zPCI device Farhan Ali
2025-09-24 17:16 ` [PATCH v4 06/10] s390/pci: Update the logic for detecting passthrough device Farhan Ali
2025-09-24 17:16 ` [PATCH v4 07/10] s390/pci: Store PCI error information for passthrough devices Farhan Ali
2025-09-25 14:28   ` Niklas Schnelle
2025-09-25 16:29     ` Farhan Ali
2025-09-24 17:16 ` [PATCH v4 08/10] vfio-pci/zdev: Add a device feature for error information Farhan Ali
2025-09-25  8:04   ` kernel test robot
2025-09-24 17:16 ` [PATCH v4 09/10] vfio: Add a reset_done callback for vfio-pci driver Farhan Ali
2025-09-24 17:16 ` [PATCH v4 10/10] vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX Farhan Ali

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).