[PATCH AUTOSEL 6.17-6.12] PCI/ERR: Update device error

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 6.17-6.12] PCI/ERR: Update device error_state already after reset
       [not found] <20251025160905.3857885-1-sashal@kernel.org>
@ 2025-10-25 15:54 ` Sasha Levin
  2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-5.4] powerpc/eeh: Use result of error_detected() in uevent Sasha Levin
  2025-10-25 16:00 ` [PATCH AUTOSEL 6.17] PCI/AER: Fix NULL pointer access by aer_info Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-10-25 15:54 UTC (permalink / raw)
  To: patches, stable
  Cc: Lukas Wunner, Bjorn Helgaas, Sasha Levin, shshaikh, manishc,
	GR-Linux-NIC-Dev, mahesh, njavali, GR-QLogic-Storage-Upstream,
	netdev, linuxppc-dev, linux-scsi

From: Lukas Wunner <lukas@wunner.de>

[ Upstream commit 45bc82563d5505327d97963bc54d3709939fa8f8 ]

After a Fatal Error has been reported by a device and has been recovered
through a Secondary Bus Reset, AER updates the device's error_state to
pci_channel_io_normal before invoking its driver's ->resume() callback.

By contrast, EEH updates the error_state earlier, namely after resetting
the device and before invoking its driver's ->slot_reset() callback.
Commit c58dc575f3c8 ("powerpc/pseries: Set error_state to
pci_channel_io_normal in eeh_report_reset()") explains in great detail
that the earlier invocation is necessitated by various drivers checking
accessibility of the device with pci_channel_offline() and avoiding
accesses if it returns true.  It returns true for any other error_state
than pci_channel_io_normal.

The device should be accessible already after reset, hence the reasoning
is that it's safe to update the error_state immediately afterwards.

This deviation between AER and EEH seems problematic because drivers
behave differently depending on which error recovery mechanism the
platform uses.  Three drivers have gone so far as to update the
error_state themselves, presumably to work around AER's behavior.

For consistency, amend AER to update the error_state at the same recovery
steps as EEH.  Drop the now unnecessary workaround from the three drivers.

Keep updating the error_state before ->resume() in case ->error_detected()
or ->mmio_enabled() return PCI_ERS_RESULT_RECOVERED, which causes
->slot_reset() to be skipped.  There are drivers doing this even for Fatal
Errors, e.g. mhi_pci_error_detected().

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/4517af6359ffb9d66152b827a5d2833459144e3f.1755008151.git.lukas@wunner.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

- Summary
  - Fixes a long-standing AER vs EEH inconsistency by setting
    `dev->error_state = pci_channel_io_normal` immediately after reset
    (before `->slot_reset()`), matching EEH behavior and removing per-
    driver hacks.
  - Small, targeted behavioral fix that improves error recovery
    reliability for drivers that gate hardware access on
    `pci_channel_offline()`.

- Why it’s a bug
  - Under AER, `error_state` was previously restored to normal only
    before `->resume()`, causing `pci_channel_offline()` to return true
    during `->slot_reset()`. Drivers that correctly re-initialize
    hardware in `->slot_reset()` could incorrectly self-gate and skip
    needed accesses.
  - EEH has set `error_state` to normal before `->slot_reset()` since
    c58dc575f3c8 for exactly this reason. The mismatch forces drivers to
    add workarounds under AER.

- What changes (code specifics)
  - Core AER: Set `error_state` early in the slot-reset phase
    - Adds early state transition in `report_slot_reset()` so drivers
      see the device as online during `->slot_reset()`:
      - `drivers/pci/pcie/err.c:156`: `if (!pci_dev_set_io_state(dev,
        pci_channel_io_normal) || !pdrv || !pdrv->err_handler ||
        !pdrv->err_handler->slot_reset) goto out;`
    - Keeps the existing update before `->resume()` to cover flows where
      `->slot_reset()` is skipped (e.g., when `->error_detected()` or
      `->mmio_enabled()` returns RECOVERED):
      - `drivers/pci/pcie/err.c:170`: `if (!pci_dev_set_io_state(dev,
        pci_channel_io_normal) || ... ) goto out;`
    - Transition gating is safe: `pci_dev_set_io_state()` only returns
      false for `pci_channel_io_perm_failure` (see semantics in
      `drivers/pci/pci.h:456`), so we avoid calling `->slot_reset()` on
      permanently failed devices (sensible safety net).
  - Remove driver workarounds that manually forced `error_state =
    normal`
    - QLogic qlcnic:
      - `drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c:4218`:
        remove `pdev->error_state = pci_channel_io_normal;` from
        `qlcnic_83xx_io_slot_reset()`.
      - `drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c:3770`: remove
        `pdev->error_state = pci_channel_io_normal;` from
        `qlcnic_attach_func()` (used in 82xx `->slot_reset()` path at
        `...:3864`).
    - QLogic qla2xxx:
      - `drivers/scsi/qla2xxx/qla_os.c:7902`: remove the workaround and
        comment in `qla2xxx_pci_slot_reset()` that set
        `pdev->error_state = pci_channel_io_normal;` to avoid mailbox
        timeouts.
  - The commit also notes drivers like MHI can return RECOVERED from
    `->error_detected()`, skipping `->slot_reset()`; the resume-path
    normalization remains to handle that path correctly (consistent with
    code in `drivers/pci/pcie/err.c:170`).

- Risk/compatibility assessment
  - Scope is minimal and contained: a single earlier state transition in
    core AER and removal of redundant per-driver hacks.
  - Aligns AER with EEH behavior proven since 2009 (c58dc575f3c8),
    reducing platform-dependent behavioral differences in recovery
    paths.
  - Drivers that previously avoided IO in `->slot_reset()` because
    `pci_channel_offline()` returned true will now proceed as intended
    once the device is reset and accessible. This improves recovery
    success rates rather than risking harm.
  - The core change is guarded by `pci_dev_set_io_state()` semantics; it
    will not “normalize” devices in permanent failure.
  - No new features or architectural changes; no ABI/API changes.

- Backport assessment
  - Fixes real recovery failures/workarounds (e.g., qla2xxx mailbox
    timeouts), affects users, and reduces platform-specific divergence
    in error recovery semantics.
  - Change is small and surgical; drivers touched only remove redundant
    assignments now handled in the core.
  - Even in stable, these driver-line removals are safe once the core
    change is present; alternatively, stable could carry just the core
    change and leave driver workarounds (harmless duplication). As a
    single commit, it remains suitable.
  - While the commit message snippet doesn’t show a “Fixes:” or “Cc:
    stable” tag, the rationale, history, and limited blast radius make
    it an appropriate stable backport candidate.

 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 1 -
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c    | 2 --
 drivers/pci/pcie/err.c                              | 3 ++-
 drivers/scsi/qla2xxx/qla_os.c                       | 5 -----
 4 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index d7cdea8f604d0..91e7b38143ead 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -4215,7 +4215,6 @@ static pci_ers_result_t qlcnic_83xx_io_slot_reset(struct pci_dev *pdev)
 	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
 	int err = 0;
 
-	pdev->error_state = pci_channel_io_normal;
 	err = pci_enable_device(pdev);
 	if (err)
 		goto disconnect;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index 53cdd36c41236..e051d8c7a28d6 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -3766,8 +3766,6 @@ static int qlcnic_attach_func(struct pci_dev *pdev)
 	struct qlcnic_adapter *adapter = pci_get_drvdata(pdev);
 	struct net_device *netdev = adapter->netdev;
 
-	pdev->error_state = pci_channel_io_normal;
-
 	err = pci_enable_device(pdev);
 	if (err)
 		return err;
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index a4990c9ad493a..e85b9cd5fec1b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -141,7 +141,8 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
 
 	device_lock(&dev->dev);
 	pdrv = dev->driver;
-	if (!pdrv || !pdrv->err_handler || !pdrv->err_handler->slot_reset)
+	if (!pci_dev_set_io_state(dev, pci_channel_io_normal) ||
+	    !pdrv || !pdrv->err_handler || !pdrv->err_handler->slot_reset)
 		goto out;
 
 	err_handler = pdrv->err_handler;
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index d4b484c0fd9d7..4460421834cb2 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -7883,11 +7883,6 @@ qla2xxx_pci_slot_reset(struct pci_dev *pdev)
 	       "Slot Reset.\n");
 
 	ha->pci_error_state = QLA_PCI_SLOT_RESET;
-	/* Workaround: qla2xxx driver which access hardware earlier
-	 * needs error state to be pci_channel_io_online.
-	 * Otherwise mailbox command timesout.
-	 */
-	pdev->error_state = pci_channel_io_normal;
 
 	pci_restore_state(pdev);
 
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.17-5.4] powerpc/eeh: Use result of error_detected() in uevent
       [not found] <20251025160905.3857885-1-sashal@kernel.org>
  2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-6.12] PCI/ERR: Update device error_state already after reset Sasha Levin
@ 2025-10-25 15:54 ` Sasha Levin
  2025-10-25 16:00 ` [PATCH AUTOSEL 6.17] PCI/AER: Fix NULL pointer access by aer_info Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-10-25 15:54 UTC (permalink / raw)
  To: patches, stable
  Cc: Niklas Schnelle, Lukas Wunner, Bjorn Helgaas,
	Kuppuswamy Sathyanarayanan, Mahesh Salgaonkar, Sasha Levin,
	linuxppc-dev

From: Niklas Schnelle <schnelle@linux.ibm.com>

[ Upstream commit 704e5dd1c02371dfc7d22e1520102b197a3b628b ]

Ever since uevent support was added for AER and EEH with commit
856e1eb9bdd4 ("PCI/AER: Add uevents in AER and EEH error/resume"), it
reported PCI_ERS_RESULT_NONE as uevent when recovery begins.

Commit 7b42d97e99d3 ("PCI/ERR: Always report current recovery status for
udev") subsequently amended AER to report the actual return value of
error_detected().

Make the same change to EEH to align it with AER and s390.

Suggested-by: Lukas Wunner <lukas@wunner.de>
Link: https://lore.kernel.org/linux-pci/aIp6LiKJor9KLVpv@wunner.de/
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Link: https://patch.msgid.link/20250807-add_err_uevents-v5-3-adf85b0620b0@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

Rationale
- Fixes incorrect uevent status at start of EEH recovery: the code
  currently emits a uevent with `PCI_ERS_RESULT_NONE` regardless of what
  the driver reported via `error_detected()`. This misrepresents the
  actual recovery status to user space.
- The fix makes EEH behave like AER (already fixed by commit
  7b42d97e99d3) and s390, improving cross-arch consistency and user
  space expectations.

Evidence in code
- Current EEH behavior: emits BEGIN_RECOVERY unconditionally at error
  detection
  - `pci_uevent_ers(pdev, PCI_ERS_RESULT_NONE);` is called after
    `error_detected()` even if the driver “votes” differently (e.g.,
    DISCONNECT/NEED_RESET): arch/powerpc/kernel/eeh_driver.c:337
- Proposed change: pass actual driver result
  - Changes the above call to `pci_uevent_ers(pdev, rc);`, where `rc` is
    the result of `driver->err_handler->error_detected()` captured just
    above: arch/powerpc/kernel/eeh_driver.c:337
- uevent mapping semantics (what user space sees) are centralized in
  `pci_uevent_ers()`:
  - NONE/CAN_RECOVER -> `ERROR_EVENT=BEGIN_RECOVERY`, `DEVICE_ONLINE=0`
  - RECOVERED -> `ERROR_EVENT=SUCCESSFUL_RECOVERY`, `DEVICE_ONLINE=1`
  - DISCONNECT -> `ERROR_EVENT=FAILED_RECOVERY`, `DEVICE_ONLINE=0`
  - Others (e.g., NEED_RESET) -> no immediate uevent (consistent with
    AER)
  - drivers/pci/pci-driver.c:1595
- AER already reports actual `error_detected()` return value to udev:
  - `pci_uevent_ers(dev, vote);` after computing `vote` in
    `report_error_detected()`: drivers/pci/pcie/err.c:83
- EEH already emits final-stage uevents correctly (unchanged by this
  patch):
  - Success at resume: `pci_uevent_ers(edev->pdev,
    PCI_ERS_RESULT_RECOVERED);` arch/powerpc/kernel/eeh_driver.c:432
  - Failure path: `pci_uevent_ers(pdev, PCI_ERS_RESULT_DISCONNECT);`
    arch/powerpc/kernel/eeh_driver.c:462

Why this is a bugfix suitable for stable
- User-visible correctness: With the current code, user space always
  sees “BEGIN_RECOVERY” even when drivers have already indicated an
  unrecoverable state (e.g., DISCONNECT). The patch ensures uevents
  reflect the true state immediately, matching AER behavior introduced
  by 7b42d97e99d3.
- Minimal, contained change: One-line change in a single architecture-
  specific file (PowerPC EEH). No API/ABI changes; only corrects the
  parameter passed to an existing helper.
- No architectural change: Keeps existing EEH flow; only adjusts the
  uevent status emitted at a single step.
- Low regression risk: AER has used this semantic for years;
  `pci_uevent_ers()` already handles `rc` values. EEH already emits
  RECOVERED/DISCONNECT at later stages; this makes the initial event
  consistent.
- Aligns cross-arch semantics: Consistent uevent reporting across AER,
  EEH, and s390 reduces user space special-casing and potential errors.

Potential side effects and why acceptable
- For drivers returning `PCI_ERS_RESULT_DISCONNECT` at
  `error_detected()`, user space will now see `FAILED_RECOVERY`
  immediately instead of a misleading `BEGIN_RECOVERY`. This is a
  correctness fix.
- For returns like `PCI_ERS_RESULT_NEED_RESET`, no initial uevent is
  emitted (consistent with AER); user space will still receive final
  RECOVERED/FAILED, as today. Any scripts that strictly expected an
  initial BEGIN_RECOVERY for all cases are already inconsistent with AER
  and should not rely on that behavior.

Historical context
- Uevent support was added by 856e1eb9bdd4 (“PCI/AER: Add uevents in AER
  and EEH error/resume”), initially emitting `NONE` at error detection
  for both AER and EEH.
- AER was corrected by 7b42d97e99d3 (“PCI/ERR: Always report current
  recovery status for udev”) to emit the actual `error_detected()`
  result.
- This patch brings EEH to parity with that established AER behavior.

Conclusion
- This is a targeted, low-risk correctness fix that improves user space
  observability and cross-arch consistency without changing kernel-side
  recovery logic. It fits stable backport rules (important bugfix,
  minimal change, low regression risk, confined to a subsystem).

 arch/powerpc/kernel/eeh_driver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 48ad0116f3590..ef78ff77cf8f2 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -334,7 +334,7 @@ static enum pci_ers_result eeh_report_error(struct eeh_dev *edev,
 	rc = driver->err_handler->error_detected(pdev, pci_channel_io_frozen);
 
 	edev->in_error = true;
-	pci_uevent_ers(pdev, PCI_ERS_RESULT_NONE);
+	pci_uevent_ers(pdev, rc);
 	return rc;
 }
 
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.17] PCI/AER: Fix NULL pointer access by aer_info
       [not found] <20251025160905.3857885-1-sashal@kernel.org>
  2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-6.12] PCI/ERR: Update device error_state already after reset Sasha Levin
  2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-5.4] powerpc/eeh: Use result of error_detected() in uevent Sasha Levin
@ 2025-10-25 16:00 ` Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-10-25 16:00 UTC (permalink / raw)
  To: patches, stable
  Cc: Vernon Yang, Bjorn Helgaas, Sasha Levin, mahesh, linuxppc-dev

From: Vernon Yang <yanglincheng@kylinos.cn>

[ Upstream commit 0a27bdb14b028fed30a10cec2f945c38cb5ca4fa ]

The kzalloc(GFP_KERNEL) may return NULL, so all accesses to aer_info->xxx
will result in kernel panic. Fix it.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20250904182527.67371-1-vernon2gm@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES

**Why It Matters**
- Prevents a NULL pointer dereference and kernel panic during device
  enumeration when `kzalloc(GFP_KERNEL)` fails in AER initialization.
  This is a real bug users can hit under memory pressure and affects any
  kernel with `CONFIG_PCIEAER` enabled.

**Change Details**
- Adds a NULL check after allocating `dev->aer_info` and returns early
  on failure, resetting `dev->aer_cap` to keep state consistent:
  - drivers/pci/pcie/aer.c:395
  - drivers/pci/pcie/aer.c:396
  - drivers/pci/pcie/aer.c:397
- The dereferences that would otherwise panic immediately follow the
  allocation (ratelimit initialization), so without this guard, OOM
  leads to instant crash:
  - drivers/pci/pcie/aer.c:401
  - drivers/pci/pcie/aer.c:403

**Consistency With AER Flows**
- Resetting `dev->aer_cap` to 0 on allocation failure is correct and
  keeps all AER-related code paths coherent:
  - Save/restore explicitly no-op when `aer_cap == 0`, avoiding config
    space accesses:
    - drivers/pci/pcie/aer.c:349
    - drivers/pci/pcie/aer.c:371
  - AER enablement and ECRC setup get skipped because AER is treated as
    unavailable:
    - drivers/pci/pcie/aer.c:417 (enable reporting)
    - drivers/pci/pcie/aer.c:420 (ECRC)
    - ECRC helpers themselves also gate on `aer_cap`:
      - drivers/pci/pcie/aer.c:164
      - drivers/pci/pcie/aer.c:188
- Sysfs attributes that unconditionally dereference `pdev->aer_info` are
  already hidden when `aer_info == NULL`:
  - Visibility gating for stats attrs checks `pdev->aer_info`:
    - drivers/pci/pcie/aer.c:632
  - Visibility gating for ratelimit attrs checks `pdev->aer_info`:
    - drivers/pci/pcie/aer.c:769
- AER initialization is called during capability setup for every device;
  avoiding a panic here is critical:
  - drivers/pci/probe.c:2671

**Risk and Side Effects**
- Impact is limited and defensive:
  - On allocation failure, AER features are disabled for that device
    (graceful degradation) instead of panicking.
  - No architectural changes; no ABI changes; minimal lines touched.
  - All later AER users already handle `aer_info == NULL` and/or
    `aer_cap == 0` via existing guards.
- Side effects are intentional and safe:
  - Port driver IRQ message number programming for AER is skipped if
    `aer_cap == 0`, consistent with AER being unavailable:
    - drivers/pci/pcie/portdrv.c:81
    - drivers/pci/pcie/portdrv.c:242

**Stable Criteria**
- Fixes a real crash bug that can affect users (OOM during enumeration
  or hotplug).
- Small, contained change in a single function.
- No new features or interfaces; no architectural churn.
- Very low regression risk due to consistent gating on
  `aer_cap`/`aer_info`.

Given the clear correctness and robustness benefits with minimal risk,
this is a strong candidate for backporting to stable trees.

 drivers/pci/pcie/aer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9d23294ceb2f6..3dba9c0c6ae11 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -383,6 +383,10 @@ void pci_aer_init(struct pci_dev *dev)
 		return;
 
 	dev->aer_info = kzalloc(sizeof(*dev->aer_info), GFP_KERNEL);
+	if (!dev->aer_info) {
+		dev->aer_cap = 0;
+		return;
+	}
 
 	ratelimit_state_init(&dev->aer_info->correctable_ratelimit,
 			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-10-25 16:28 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-6.12] PCI/ERR: Update device error_state already after reset Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-5.4] powerpc/eeh: Use result of error_detected() in uevent Sasha Levin
2025-10-25 16:00 ` [PATCH AUTOSEL 6.17] PCI/AER: Fix NULL pointer access by aer_info Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).