* [PATCH v7 0/9] Error recovery for vfio-pci devices on s390x
From: Farhan Ali @ 2026-01-07 18:32 UTC
  To: linux-s390, linux-kernel, linux-pci
  Cc: helgaas, lukas, alex, clg, stable, alifm, schnelle, mjrosato

Hi,

This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x). 

Background
----------
For PCI devices on s390x an operating system receives platform specific
error events from firmware rather than through AER. Today for
passthrough/userspace devices, we don't attempt any error recovery and
ignore any error events for the devices. The passthrough/userspace devices
are managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected), and on an error triggers an eventfd to
userspace. But we still need a mechanism to notify userspace
(QEMU/guest/userspace drivers) about the error event.

Proposal
--------
We can expose this error information (currently only the PCI Error Code)
via a device feature. Userspace can then obtain the error information 
via VFIO_DEVICE_FEATURE ioctl and take appropriate actions such as driving 
a device reset.
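
As a rough illustration of the userspace side, the sketch below issues a
generic VFIO_DEVICE_FEATURE "get" request. It uses only struct
vfio_device_feature and the flag macros from the existing <linux/vfio.h>.
The helper name vfio_feature_get() is hypothetical, and the feature index
and payload layout added by this series are deliberately left to the caller
(see patch 7 for the real definitions), so treat this as a sketch and not as
the series' interface.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: read the payload of a VFIO device feature into buf.
 * 'feature' is the feature index; the index for the zPCI error feature is
 * defined by patch 7 and is not assumed here.
 * Returns 0 on success or a negative errno value.
 */
static int vfio_feature_get(int device_fd, uint32_t feature,
                            void *buf, size_t len)
{
    size_t argsz = sizeof(struct vfio_device_feature) + len;
    struct vfio_device_feature *req = calloc(1, argsz);
    int ret;

    if (!req)
        return -ENOMEM;

    req->argsz = argsz;
    req->flags = VFIO_DEVICE_FEATURE_GET |
                 (feature & VFIO_DEVICE_FEATURE_MASK);

    ret = ioctl(device_fd, VFIO_DEVICE_FEATURE, req);
    if (ret < 0)
        ret = -errno;                   /* save errno before free() */
    else
        memcpy(buf, req->data, len);    /* payload follows the header */

    free(req);
    return ret;
}
```

A caller would pass the open vfio device fd, the feature index from the
patched uapi header, and a buffer sized for the feature's payload struct.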

A typical flow for a device passed through to a VM would work as follows:
The driver bound to the device on the host is vfio-pci. The vfio-pci driver
supports the error_detected() callback
(vfio_pci_core_aer_err_detected()), and on a PCI error the s390x recovery
code on the host will call the vfio-pci error_detected() callback. The
vfio-pci error_detected() callback will notify userspace/QEMU via an
eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x
error recovery on the host will skip any further action (see patch 6) and
let userspace drive the error recovery.
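
For reference, userspace typically arms the eventfd mentioned above through
the existing VFIO_DEVICE_SET_IRQS ioctl on VFIO_PCI_ERR_IRQ_INDEX. The
sketch below shows one way to do that. The helper name
vfio_register_err_eventfd() is hypothetical and error handling is minimal,
so it is an illustration under those assumptions and not QEMU's actual code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: create an eventfd and register it as the trigger
 * for the vfio-pci error interrupt index. On success the eventfd is stored
 * in *efd_out and becomes readable when vfio-pci signals an error.
 * Returns 0 on success or a negative errno value.
 */
static int vfio_register_err_eventfd(int device_fd, int *efd_out)
{
    size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
    struct vfio_irq_set *irq_set;
    int32_t efd;
    int ret = 0;

    efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0)
        return -errno;

    irq_set = calloc(1, argsz);
    if (!irq_set) {
        close(efd);
        return -ENOMEM;
    }

    irq_set->argsz = argsz;
    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set->index = VFIO_PCI_ERR_IRQ_INDEX;
    irq_set->start = 0;
    irq_set->count = 1;
    memcpy(irq_set->data, &efd, sizeof(efd));

    if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set) < 0) {
        ret = -errno;       /* save errno before close() can change it */
        close(efd);
    } else {
        *efd_out = efd;
    }

    free(irq_set);
    return ret;
}
```

Userspace can then poll() the returned eventfd and, once it fires, query the
error information and decide how to proceed.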

Once userspace/QEMU is notified, it then injects this error into the VM
so device drivers in the VM can take recovery actions. For example, for a
passthrough NVMe device, the VM's OS NVMe driver will access the device.
At this point the VM's NVMe driver's error_detected() will drive the
recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error
recovery in the VM's OS will try to do a reset. Resets are privileged
operations and so the VM will need intervention from QEMU to perform the
reset. QEMU will invoke the VFIO_DEVICE_RESET ioctl to notify the
host that the VM is requesting a reset of the device. The vfio-pci driver
on the host will then perform the reset on the device to recover it.
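
The final step, userspace asking the host for a reset, can be sketched with
the existing VFIO_DEVICE_GET_INFO and VFIO_DEVICE_RESET ioctls. The helper
name vfio_reset_if_supported() is hypothetical, and checking
VFIO_DEVICE_FLAGS_RESET first is a choice made for this example, not
something the series requires.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: reset the device behind device_fd if the vfio
 * driver advertises reset support.
 * Returns 0 on success or a negative errno value.
 */
static int vfio_reset_if_supported(int device_fd)
{
    struct vfio_device_info info = { .argsz = sizeof(info) };

    if (ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info) < 0)
        return -errno;

    /* The driver sets this flag when VFIO_DEVICE_RESET is usable. */
    if (!(info.flags & VFIO_DEVICE_FLAGS_RESET))
        return -ENOTSUP;

    if (ioctl(device_fd, VFIO_DEVICE_RESET) < 0)
        return -errno;

    return 0;
}
```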


Thanks
Farhan

ChangeLog
---------
v6 series https://lore.kernel.org/all/2c609e61-1861-4bf3-b019-a11c137d26a5@linux.ibm.com/
v6 -> v7
    - Rebase on 6.19-rc4

    - Update commit message based on Niklas's suggestion (patch 3).

v5 series https://lore.kernel.org/all/20251113183502.2388-1-alifm@linux.ibm.com/
v5 -> v6
   - Rebase on 6.18 + Lukas's PCI: Universal error recoverability of
   devices series (https://lore.kernel.org/all/cover.1763483367.git.lukas@wunner.de/)

   - Re-work config space accessibility check to pci_dev_save_and_disable() (patch 3).
   This avoids saving the config space, in the reset path, if the device's config space is
   corrupted or inaccessible.

v4 series https://lore.kernel.org/all/20250924171628.826-1-alifm@linux.ibm.com/
v4 -> v5
    - Rebase on 6.18-rc5

    - Move bug fixes to the beginning of the series (patch 1 and 2). These patches
    were posted as a separate fixes series
    (https://lore.kernel.org/all/a14936ac-47d6-461b-816f-0fd66f869b0f@linux.ibm.com/).

    - Add matching pci_put_dev() for pci_get_slot() (patch 6).

v3 series https://lore.kernel.org/all/20250911183307.1910-1-alifm@linux.ibm.com/
v3 -> v4
    - Remove warn messages for each PCI capability not restored (patch 1)

    - Check PCI_COMMAND and PCI_STATUS register for error value instead of device id 
    (patch 1)

    - Fix kernel crash in patch 3

    - Add Reviewed-by tags

    - Address comments from Niklas (patches 4, 5, 7)

    - Fix compilation error on non-s390x systems (patch 8)

    - Explicitly align struct vfio_device_feature_zpci_err (patch 8)


v2 series https://lore.kernel.org/all/20250825171226.1602-1-alifm@linux.ibm.com/
v2 -> v3
   - Patch 1 avoids saving any config space state if the device is in error
   (suggested by Alex)

   - Patch 2 adds an additional check, only for FLR reset, to try other
     function reset methods (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions. Creates a new flag pci_slot to allow per function slot.

   - Patch 4 fixes a bug in s390 for resource to bus address translation.

   - Rebase on 6.17-rc5


v1 series https://lore.kernel.org/all/20250813170821.1115-1-alifm@linux.ibm.com/
v1 -> v2
   - Patches 1 and 2 add some additional checks for FLR/PM reset to
     try other function reset methods (suggested by Alex).

   - Patch 3 fixes a bug in s390 for resetting PCI devices with multiple
     functions.

   - Patch 7 adds a new device feature for zPCI devices for the VFIO_DEVICE_FEATURE
     ioctl. The ioctl is used by userspace to retrieve any PCI error
     information for the device (suggested by Alex).

   - Patch 8 adds a reset_done() callback for the vfio-pci driver, to
     restore the state of the device after a reset.

   - Patch 9 removes the pcie check for triggering VFIO_PCI_ERR_IRQ_INDEX.


Farhan Ali (9):
  PCI: Allow per function PCI slots
  s390/pci: Add architecture specific resource/bus address translation
  PCI: Avoid saving config space state if inaccessible
  PCI: Add additional checks for flr reset
  s390/pci: Update the logic for detecting passthrough device
  s390/pci: Store PCI error information for passthrough devices
  vfio-pci/zdev: Add a device feature for error information
  vfio: Add a reset_done callback for vfio-pci driver
  vfio: Remove the pcie check for VFIO_PCI_ERR_IRQ_INDEX

 arch/s390/include/asm/pci.h       |  29 ++++++++
 arch/s390/pci/pci.c               |  75 +++++++++++++++++++++
 arch/s390/pci/pci_event.c         | 107 +++++++++++++++++-------------
 drivers/pci/host-bridge.c         |   4 +-
 drivers/pci/pci.c                 |  19 +++++-
 drivers/pci/slot.c                |  25 ++++++-
 drivers/vfio/pci/vfio_pci_core.c  |  20 ++++--
 drivers/vfio/pci/vfio_pci_intrs.c |   3 +-
 drivers/vfio/pci/vfio_pci_priv.h  |   9 +++
 drivers/vfio/pci/vfio_pci_zdev.c  |  45 ++++++++++++-
 include/linux/pci.h               |   1 +
 include/uapi/linux/vfio.h         |  16 +++++
 12 files changed, 292 insertions(+), 61 deletions(-)

-- 
2.43.0

