Linux PCI subsystem development
 help / color / mirror / Atom feed
* [PATCH] pci_crash: capture PCI config space at panic time
@ 2026-05-12 16:24 hangej
  2026-05-13 21:30 ` kernel test robot
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: hangej @ 2026-05-12 16:24 UTC (permalink / raw)
  To: bhelgaas; +Cc: linux-pci, linux-kernel, dwmw2, kexec, hangej

From: Johannes Hange <hangej@amazon.com>

When a system crashes and kexec boots into the crash kernel, PCI
device state is lost.  AER registers are volatile and cleared by
device reset, making post-mortem root cause analysis difficult.

Add CONFIG_PCI_CRASH to capture PCI configuration space (including
AER extended capability registers) during panic, before kexec.  The
data is exported via VMCOREINFO so crash analysis tools can extract
it from the vmcore.

Hook point: crash_save_vmcoreinfo() inside __crash_kexec(), which
runs before machine_kexec() -- the only reliable capture point since
panic notifiers run after kexec by default.

Key design choices:
- No config reads at init or hotplug; buffer pre-allocated, filled
  only at crash time for fresh register state.
- capture=aer (default): quick-scan root port ROOT_STATUS, bail if
  no uncorrectable errors.  capture=always: every panic.
- devices= filter applied at rebuild time (zero crash-path cost).
- 200ms debounced workqueue rebuild for VF enumeration storms.
- kvmalloc buffer + kmalloc pagemap for physical page directory
  (vmalloc addresses need explicit page-to-phys translation).
- dcache flush (ARM64/x86) ensures data survives kexec.

Build-tested on v7.1-rc2 (x86_64).  Runtime-tested on 6.6 with
forced panics: ~2ms for 31 PCIe devices (4096B config each),
sub-microsecond AER bail on non-PCI panics.

See Documentation/PCI/pci-crash-capture.rst for full details.

Signed-off-by: Johannes Hange <hangej@amazon.com>
---
 Documentation/PCI/index.rst             |   1 +
 Documentation/PCI/pci-crash-capture.rst | 134 +++++
 MAINTAINERS                             |   8 +
 include/linux/pci_crash.h               | 110 ++++
 kernel/Kconfig.kexec                    |  17 +
 kernel/Makefile                         |   1 +
 kernel/pci_crash.c                      | 755 ++++++++++++++++++++++++
 kernel/vmcore_info.c                    |  13 +
 8 files changed, 1039 insertions(+)
 create mode 100644 Documentation/PCI/pci-crash-capture.rst
 create mode 100644 include/linux/pci_crash.h
 create mode 100644 kernel/pci_crash.c

diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index 5d720d2a415e..7f499a43ddb4 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -19,4 +19,5 @@ PCI Bus Subsystem
    endpoint/index
    controller/index
    boot-interrupts
+   pci-crash-capture
    tph
diff --git a/Documentation/PCI/pci-crash-capture.rst b/Documentation/PCI/pci-crash-capture.rst
new file mode 100644
index 000000000000..35f676c2724b
--- /dev/null
+++ b/Documentation/PCI/pci-crash-capture.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+PCI Crash Capture Buffer
+========================
+
+Overview
+========
+
+The PCI crash capture module (``CONFIG_PCI_CRASH``) saves PCI configuration
+space for all (or selected) devices at panic time.  The data is written into
+a pre-allocated buffer whose physical pages are exported via VMCOREINFO,
+allowing crash analysis tools to extract device state from the vmcore.
+
+This is useful because AER (Advanced Error Reporting) registers are volatile
+and cleared by device reset during kexec into the crash kernel.  Capturing
+them before kexec preserves the error state that caused or contributed to the
+crash.
+
+Boot Parameters
+===============
+
+``pci_crash.capture=`` (default: ``aer``)
+  When to capture PCI config space.  Comma-separated tokens:
+
+  ``aer``
+    Capture only if a root port reports uncorrectable errors in its
+    AER ROOT_STATUS register.  Non-PCI panics skip capture entirely
+    (a handful of MMIO reads to root ports, sub-microsecond).
+
+  ``always``
+    Capture on every panic regardless of AER state.  Useful for
+    cascading failures where a PCI link-down causes an MCE or NMI
+    watchdog timeout before DPC/AER fires.
+
+``pci_crash.devices=`` (default: ``all``)
+  Which devices to include in the capture buffer.  Comma-separated tokens:
+
+  ``all``
+    Every PCI device in the system.
+
+  ``bridges``
+    PCI-to-PCI bridges (class 0604) and CardBus bridges (class 0607).
+
+  ``root_ports``
+    PCIe root ports only.
+
+  ``XXYY``
+    Hex PCI class code (class byte XX, subclass byte YY).
+    Up to 8 class codes may be specified.
+
+  Bridges are always implicitly included regardless of the filter value
+  because they hold AER registers needed for root cause analysis.  The
+  filter is applied at device enumeration and hotplug rebuild time, not at
+  crash time (zero overhead on the panic path).
+
+Both parameters are writable at runtime via sysfs
+(``/sys/module/pci_crash/parameters/``).
+
+Architecture
+============
+
+::
+
+  late_initcall
+      │
+      ├── enumerate PCI devices (filtered by devices= param)
+      ├── allocate buffer via kvmalloc (may be vmalloc for >4 MiB)
+      ├── build pagemap: kmalloc'd array of per-page physical addresses
+      └── register bus notifier for hotplug
+
+  hotplug (BUS_NOTIFY_ADD_DEVICE / BUS_NOTIFY_DEL_DEVICE)
+      │
+      └── schedule delayed rebuild (200 ms debounce)
+              └── re-enumerate, re-allocate buffer + pagemap
+
+  panic (__crash_kexec → crash_save_vmcoreinfo → pci_crash_save)
+      │
+      ├── quick-scan root port AER ROOT_STATUS (capture=aer)
+      │     └── bail if no uncorrectable errors
+      ├── read config space for each device (MMIO, no locks)
+      ├── flush dcache (buffer + pagemap) to RAM
+      └── VMCOREINFO exports: PCI_CRASH_PAGEMAP, PCI_CRASH_BUF_SZ
+
+Buffer Format
+=============
+
+The buffer consists of a 32-byte header followed by variable-length
+device records:
+
+.. code-block:: c
+
+    struct pci_crash_buffer_header {   /* 32 bytes */
+        __le32 magic;           /* 0x50434943 "PCIC" */
+        __le32 version;         /* 1 */
+        __le32 device_count;
+        __le32 config_size;     /* 0 = variable-length records */
+        __le64 timestamp;       /* ktime_get_real_fast_ns() */
+        __le32 flags;           /* reserved */
+        __le32 reserved;
+    };
+
+    struct pci_crash_device_record {   /* 8 + cfg_size bytes */
+        __le16 domain;
+        __u8   bus;
+        __u8   devfn;
+        __le32 config_size;     /* 256 or 4096 */
+        __u8   config_data[];
+    };
+
+The pagemap (exported via ``PCI_CRASH_PAGEMAP``) allows the parser to
+locate buffer pages without walking page tables:
+
+.. code-block:: c
+
+    struct pci_crash_pagemap {
+        __le32 magic;           /* 0x5043504d "PCPM" */
+        __le32 num_pages;
+        __le64 buf_size;
+        __le64 addrs[];         /* physical address per page */
+    };
+
+Safety Considerations
+=====================
+
+- ``pci_read_config_dword()`` is direct ECAM MMIO at crash time (no locks).
+- ``ktime_get_real_fast_ns()`` is NMI-safe (lockless timekeeper snapshot).
+- ``WRITE_ONCE``/``READ_ONCE`` + memory barriers between rebuild (process
+  context) and ``pci_crash_save()`` (crash context, single CPU, interrupts
+  disabled).
+- Buffer capped at 24 MiB to prevent excessive allocation on systems with
+  thousands of VFs.
+- ``slab_is_available()`` guard in param setters prevents use-before-init
+  when parameters are set via kernel command line.
diff --git a/MAINTAINERS b/MAINTAINERS
index b2040011a386..3e4fbb3011ce 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20622,6 +20622,14 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
 F:	drivers/pci/pwrctrl/*
 F:	include/linux/pci-pwrctrl.h
 
+
+PCI CRASH BUFFER
+M:	Johannes Hange <hangej@amazon.com>
+L:	linux-pci@vger.kernel.org
+S:	Maintained
+F:	include/linux/pci_crash.h
+F:	kernel/pci_crash.c
+
 PCI SUBSYSTEM
 M:	Bjorn Helgaas <bhelgaas@google.com>
 L:	linux-pci@vger.kernel.org
diff --git a/include/linux/pci_crash.h b/include/linux/pci_crash.h
new file mode 100644
index 000000000000..59a46ee26183
--- /dev/null
+++ b/include/linux/pci_crash.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PCI Crash Buffer - Capture PCI config space at panic time
+ *
+ * This module captures PCI configuration space data (including AER
+ * extended capability registers) for all PCI devices at panic time.
+ * The data is stored in a buffer whose pages are captured in the
+ * vmcore for off-site analysis.
+ *
+ * Copyright (c) 2026 Amazon.com, Inc. or its affiliates.
+ */
+#ifndef _LINUX_PCI_CRASH_H
+#define _LINUX_PCI_CRASH_H
+
+#include <linux/types.h>
+
+#define PCI_CRASH_MAGIC         0x50434943  /* "PCIC" in ASCII */
+#define PCI_CRASH_VERSION       1
+
+/**
+ * struct pci_crash_buffer_header - Header for PCI crash buffer
+ * @magic:        Magic number (PCI_CRASH_MAGIC)
+ * @version:      Format version (PCI_CRASH_VERSION)
+ * @device_count: Number of device records following this header
+ * @config_size:  0 -- indicates variable-length records. Each device
+ *                record stores its own config_size (pdev->cfg_size:
+ *                256 for legacy PCI, 4096 for PCIe). Parsers walk
+ *                records sequentially using per-record config_size.
+ * @timestamp:    Capture timestamp from ktime_get_real_fast_ns()
+ * @flags:        Reserved for future use (0 for now)
+ * @reserved:     Padding to align to 32 bytes
+ *
+ * Total size: 32 bytes
+ */
+struct pci_crash_buffer_header {
+	__le32 magic;
+	__le32 version;
+	__le32 device_count;
+	__le32 config_size;
+	__le64 timestamp;
+	__le32 flags;
+	__le32 reserved;
+} __packed;
+
+/**
+ * struct pci_crash_device_record - Per-device record in crash buffer
+ * @domain:      PCI domain number
+ * @bus:         PCI bus number
+ * @devfn:       Device and function number (PCI_DEVFN format)
+ * @config_size: Config space size for this device (pdev->cfg_size:
+ *               256 for legacy PCI, 4096 for PCIe)
+ * @config_data: Raw PCI config space (config_size bytes)
+ *
+ * Records are variable-length: total size per record is
+ * PCI_CRASH_RECORD_META + config_size bytes.
+ */
+struct pci_crash_device_record {
+	__le16 domain;
+	__u8   bus;
+	__u8   devfn;
+	__le32 config_size;
+	__u8   config_data[];
+} __packed;
+
+#define PCI_CRASH_HEADER_SIZE  sizeof(struct pci_crash_buffer_header)
+#define PCI_CRASH_RECORD_META  sizeof(struct pci_crash_device_record)
+
+/**
+ * struct pci_crash_pagemap - Physical page directory for crash buffer
+ *
+ * The PCI crash buffer may be allocated via vmalloc (for buffers
+ * exceeding ~4 MB where the buddy allocator cannot provide contiguous
+ * pages).  virt_to_phys() returns garbage for vmalloc addresses, so
+ * we maintain this small kmalloc'd directory that maps the buffer's
+ * virtual pages to their actual physical addresses.
+ *
+ * At panic time, crash_core.c exports the pagemap's physical address
+ * via VMCOREINFO.  The parser reads the pagemap, then reads each
+ * physical page from the vmcore to reconstruct the full buffer.
+ *
+ * The pagemap itself is always kmalloc'd (direct-mapped), so
+ * virt_to_phys() works correctly on it.
+ *
+ * @magic:     0x5043504d ("PCPM") -- validates this is a pagemap
+ * @num_pages: Number of entries in the addrs[] array
+ * @buf_size:  Exact buffer size in bytes (last page may be partial)
+ * @addrs:     Physical address of each PAGE_SIZE page backing the buffer
+ */
+struct pci_crash_pagemap {
+	__le32 magic;
+	__le32 num_pages;
+	__le64 buf_size;
+	__le64 addrs[];
+} __packed;
+
+#define PCI_CRASH_PAGEMAP_MAGIC  0x5043504d  /* "PCPM" in ASCII */
+
+#ifdef CONFIG_PCI_CRASH
+void pci_crash_save(void);
+extern void *pci_crash_buffer;
+extern size_t pci_crash_buffer_size;
+extern phys_addr_t pci_crash_pagemap_phys;
+#else
+static inline void pci_crash_save(void) {}
+#define pci_crash_buffer        ((void *)NULL)
+#define pci_crash_buffer_size   ((size_t)0)
+#define pci_crash_pagemap_phys  ((phys_addr_t)0)
+#endif
+
+#endif /* _LINUX_PCI_CRASH_H */
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 15632358bcf7..6eb49a934e50 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -179,4 +179,21 @@ config CRASH_MAX_MEMORY_RANGES
 	  the computation behind the value provided through the
 	  /sys/kernel/crash_elfcorehdr_size attribute.
 
+
+config PCI_CRASH
+	bool "Capture PCI config space at panic time"
+	depends on VMCORE_INFO && PCI && (ARM64 || X86)
+	default y
+	help
+	  Capture PCI configuration space (including AER extended capability
+	  registers) for all PCI devices at panic time.  The data is stored
+	  in a buffer whose pages are recorded in VMCOREINFO for off-site
+	  crash analysis.
+
+	  This is useful for diagnosing PCI errors that caused or contributed
+	  to the crash, especially when AER registers are volatile and cleared
+	  by device reset during kexec.
+
+	  If unsure, say Y.
+
 endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..6629755bc5a3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
 obj-$(CONFIG_CRASH_DUMP) += crash_core.o
 obj-$(CONFIG_CRASH_DM_CRYPT) += crash_dump_dm_crypt.o
 obj-$(CONFIG_CRASH_DUMP_KUNIT_TEST) += crash_core_test.o
+obj-$(CONFIG_PCI_CRASH) += pci_crash.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
diff --git a/kernel/pci_crash.c b/kernel/pci_crash.c
new file mode 100644
index 000000000000..8bbc9008fdc8
--- /dev/null
+++ b/kernel/pci_crash.c
@@ -0,0 +1,755 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCI Crash Buffer - Capture PCI config space for crash analysis
+ *
+ * Copyright (c) 2026 Amazon.com, Inc. or its affiliates.
+ *
+ * Captures PCI configuration space at crash time so AER error
+ * registers reflect the crash-time state for off-site analysis.
+ *
+ * Design:
+ * - Init (late_initcall): enumerate devices, allocate buffer.
+ * - Hotplug: bus notifier queues deferred rebuild of device list
+ *   and buffer via workqueue -- no PCI reads.
+ * - Crash: crash_save_vmcoreinfo() calls pci_crash_save() which
+ *   reads config space into buffer, flushes dcache to RAM so
+ *   data survives kexec into crash kernel.
+ *
+ * Records are variable-length: each device's record is exactly
+ * 8 + pdev->cfg_size bytes (264 for legacy PCI, 4104 for PCIe).
+ * The parser walks records sequentially using per-record config_size.
+ *
+ * Buffer pages may be physically scattered (kvmalloc falls back to
+ * vmalloc for buffers exceeding ~4 MB).  A small kmalloc'd pagemap
+ * records each page's physical address so the crash parser can
+ * reconstruct the buffer without page-table walking.
+ *
+ * pci_read_config_dword() is direct MMIO (no locks) -- safe in crash.
+ * for_each_pci_dev() needs pci_bus_sem -- only used at init/hotplug.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/crash_dump.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/moduleparam.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_crash.h>
+#include <linux/slab.h>
+#include <linux/timekeeping.h>
+#include <linux/workqueue.h>
+
+#include <linux/cacheflush.h>
+#include <linux/unaligned.h>
+
+/**
+ * pci_crash_flush_dcache() - Flush a memory region from CPU cache to RAM
+ * @addr: virtual address of region to flush
+ * @size: size in bytes
+ *
+ * Used at crash time to ensure the crash kernel sees our buffer/pagemap
+ * writes after kexec.
+ */
+static inline void pci_crash_flush_dcache(void *addr, size_t size)
+{
+#ifdef CONFIG_ARM64
+	unsigned long start = (unsigned long)addr;
+	unsigned long end = start + size;
+
+	dcache_clean_inval_poc(start, end);
+#elif defined(CONFIG_X86)
+	clflush_cache_range(addr, size);
+#endif
+}
+
+/**
+ * pci_crash_buffer - Virtual address of PCI crash capture buffer
+ *
+ * Contains PCI configuration space data captured at panic time.
+ * Written by pci_crash_save(), read by crash kernel via VMCOREINFO.
+ * May be vmalloc'd -- use pci_crash_pagemap for physical addresses.
+ */
+void *pci_crash_buffer;
+EXPORT_SYMBOL_GPL(pci_crash_buffer);
+
+/**
+ * pci_crash_buffer_size - Size of pci_crash_buffer in bytes
+ */
+size_t pci_crash_buffer_size;
+EXPORT_SYMBOL_GPL(pci_crash_buffer_size);
+
+/**
+ * pci_crash_pagemap_phys - Physical address of page directory
+ *
+ * Points to a struct pci_crash_pagemap (always kmalloc'd, direct-mapped).
+ * Exported via VMCOREINFO so the crash kernel can locate buffer pages
+ * without walking page tables.
+ */
+phys_addr_t pci_crash_pagemap_phys;
+EXPORT_SYMBOL_GPL(pci_crash_pagemap_phys);
+
+static struct pci_crash_pagemap *pci_crash_pagemap;
+static size_t pci_crash_pagemap_size;
+static struct pci_dev **pci_crash_devs;
+static unsigned int pci_crash_num_devs;
+static DEFINE_MUTEX(pci_crash_lock);
+
+/*
+ * Set in pci_crash_init() after delayed_work, PCI bus and notifier are
+ * ready.  Guards parse + rebuild in param setters: at boot (level -1)
+ * the setter just stores the string; pci_crash_init() parses and does
+ * the initial rebuild once PCI is up.
+ */
+static bool pci_crash_ready;
+
+/*
+ * capture -- when to capture PCI config space.
+ * Comma-separated tokens:
+ *   aer    -- root port ROOT_STATUS has uncorrectable errors (default)
+ *   always -- every panic regardless of PCI error state
+ *
+ * Writable at runtime (0644) so operators and tests can toggle without
+ * reboot.  Writes re-parse capture_flags immediately.
+ */
+static char capture[32] = "aer";
+
+#define PCI_CRASH_CAPTURE_AER		BIT(0)
+#define PCI_CRASH_CAPTURE_ALWAYS	BIT(1)
+static unsigned long capture_flags = PCI_CRASH_CAPTURE_AER;
+
+static void pci_crash_parse_capture(void);
+
+static int capture_param_set(const char *val, const struct kernel_param *kp)
+{
+	if (strlen(val) >= sizeof(capture))
+		return -EINVAL;
+	strscpy(capture, val, sizeof(capture));
+	strim(capture);
+	if (READ_ONCE(pci_crash_ready))
+		pci_crash_parse_capture();
+	return 0;
+}
+
+static int capture_param_get(char *buf, const struct kernel_param *kp)
+{
+	return scnprintf(buf, PAGE_SIZE, "%s\n", capture);
+}
+
+static const struct kernel_param_ops capture_param_ops = {
+	.set = capture_param_set,
+	.get = capture_param_get,
+};
+module_param_cb(capture, &capture_param_ops, NULL, 0644);
+MODULE_PARM_DESC(capture, "When to capture: aer, always (default: aer)");
+
+/*
+ * devices -- which devices to capture.
+ * Comma-separated tokens:
+ *   all        -- every PCI device (default)
+ *   bridges    -- PCI bridges (class 0604, 0607)
+ *   root_ports -- PCIe root ports only
+ *   XXYY       -- hex PCI class code (class + subclass)
+ *
+ * Bridges are always implicitly included regardless of filter value
+ * because they hold the AER registers needed for root cause analysis.
+ * Applies at rebuild time only -- zero cost at crash time.  Writable
+ * at runtime (0644); writes re-parse and trigger async rebuild.
+ */
+static char devices[256] = "all";
+
+static void pci_crash_parse_devices(void);
+static struct delayed_work pci_crash_rebuild_dwork;
+
+/* Debounce period for bus notifications (ms).
+ * TRN2 liveupdate enumerates ~3000 VFs in ~1.5s -- this coalesces
+ * the storm into a single rebuild after the last event.
+ */
+#define PCI_CRASH_REBUILD_DELAY_MS	200
+
+static int devices_param_set(const char *val, const struct kernel_param *kp)
+{
+	if (strlen(val) >= sizeof(devices))
+		return -EINVAL;
+
+	mutex_lock(&pci_crash_lock);
+	strscpy(devices, val, sizeof(devices));
+	strim(devices);
+	if (READ_ONCE(pci_crash_ready)) {
+		pci_crash_parse_devices();
+		mod_delayed_work(system_wq, &pci_crash_rebuild_dwork,
+				 msecs_to_jiffies(PCI_CRASH_REBUILD_DELAY_MS));
+	}
+	mutex_unlock(&pci_crash_lock);
+	return 0;
+}
+
+static int devices_param_get(char *buf, const struct kernel_param *kp)
+{
+	return scnprintf(buf, PAGE_SIZE, "%s\n", devices);
+}
+
+static const struct kernel_param_ops devices_param_ops = {
+	.set = devices_param_set,
+	.get = devices_param_get,
+};
+module_param_cb(devices, &devices_param_ops, NULL, 0644);
+MODULE_PARM_DESC(devices,
+	"Which devices: all, bridges, root_ports, XXYY hex class (default: all)");
+
+#define PCI_CRASH_DEVICES_ALL		BIT(0)
+#define PCI_CRASH_DEVICES_BRIDGES	BIT(1)
+#define PCI_CRASH_DEVICES_ROOT_PORTS	BIT(2)
+#define PCI_CRASH_MAX_DEVICE_CLASSES	8
+static unsigned long devices_flags = PCI_CRASH_DEVICES_ALL;
+static u16 device_classes[PCI_CRASH_MAX_DEVICE_CLASSES];
+static unsigned int device_class_count;
+
+static void pci_crash_parse_capture(void)
+{
+	char *buf, *token, *rest;
+	unsigned long flags = 0;
+
+	if (!*capture) {
+		WRITE_ONCE(capture_flags, PCI_CRASH_CAPTURE_AER);
+		return;
+	}
+
+	buf = kstrdup(capture, GFP_KERNEL);
+	if (!buf) {
+		WRITE_ONCE(capture_flags, PCI_CRASH_CAPTURE_AER);
+		return;
+	}
+
+	rest = buf;
+	while ((token = strsep(&rest, ",")) != NULL) {
+		if (strcmp(token, "aer") == 0)
+			flags |= PCI_CRASH_CAPTURE_AER;
+		else if (strcmp(token, "always") == 0)
+			flags |= PCI_CRASH_CAPTURE_ALWAYS;
+		else
+			pr_warn("unknown capture token: %s\n",
+				token);
+	}
+	kfree(buf);
+
+	if (!flags) {
+		pr_warn("no valid capture tokens, defaulting to aer\n");
+		flags = PCI_CRASH_CAPTURE_AER;
+	}
+	WRITE_ONCE(capture_flags, flags);
+}
+
+static void pci_crash_parse_devices(void)
+{
+	char *buf, *token, *rest;
+	unsigned long val;
+
+	devices_flags = 0;
+	device_class_count = 0;
+
+	if (!*devices) {
+		devices_flags = PCI_CRASH_DEVICES_ALL;
+		return;
+	}
+
+	buf = kstrdup(devices, GFP_KERNEL);
+	if (!buf) {
+		devices_flags = PCI_CRASH_DEVICES_ALL;
+		return;
+	}
+
+	rest = buf;
+	while ((token = strsep(&rest, ",")) != NULL) {
+		if (strcmp(token, "all") == 0) {
+			devices_flags |= PCI_CRASH_DEVICES_ALL;
+		} else if (strcmp(token, "bridges") == 0) {
+			devices_flags |= PCI_CRASH_DEVICES_BRIDGES;
+		} else if (strcmp(token, "root_ports") == 0) {
+			devices_flags |= PCI_CRASH_DEVICES_ROOT_PORTS;
+		} else if (kstrtoul(token, 16, &val) == 0 && val <= 0xFFFF) {
+			if (device_class_count < PCI_CRASH_MAX_DEVICE_CLASSES)
+				device_classes[device_class_count++] = (u16)val;
+			else
+				pr_warn("too many device classes (max %d)\n",
+					PCI_CRASH_MAX_DEVICE_CLASSES);
+		} else {
+			pr_warn("unknown devices token: %s\n",
+				token);
+		}
+	}
+	kfree(buf);
+
+	if (!devices_flags && device_class_count == 0) {
+		pr_warn("no valid devices tokens, defaulting to all\n");
+		devices_flags = PCI_CRASH_DEVICES_ALL;
+	}
+}
+
+static bool pci_crash_device_matches(struct pci_dev *pdev)
+{
+	unsigned int i;
+	u16 dev_class = pdev->class >> 8;
+
+	if (devices_flags & PCI_CRASH_DEVICES_ALL)
+		return true;
+
+	/* Bridges always included -- they hold AER registers */
+	if (dev_class == PCI_CLASS_BRIDGE_PCI ||
+	    dev_class == PCI_CLASS_BRIDGE_CARDBUS)
+		return true;
+
+	if ((devices_flags & PCI_CRASH_DEVICES_ROOT_PORTS) &&
+	    pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT)
+		return true;
+
+	for (i = 0; i < device_class_count; i++) {
+		if (dev_class == device_classes[i])
+			return true;
+	}
+
+	return false;
+}
+
+/* Sanity limit -- prevents multi-GB allocations on systems with many VFs */
+#define PCI_CRASH_MAX_BUFFER_SIZE	(24 * 1024 * 1024)
+
+/**
+ * pci_crash_build_pagemap() - Build physical page directory for buffer
+ * @buf: buffer allocated via kvmalloc (may be vmalloc'd)
+ * @buf_size: buffer size in bytes
+ *
+ * Allocates a kmalloc'd directory containing the physical address of
+ * each page backing @buf.  The pagemap is always direct-mapped, so
+ * virt_to_phys() works on it at crash time.
+ *
+ * Returns the new pagemap, or NULL on allocation failure.
+ */
+static struct pci_crash_pagemap *pci_crash_build_pagemap(void *buf,
+							 size_t buf_size)
+{
+	unsigned int num_pages = DIV_ROUND_UP(buf_size, PAGE_SIZE);
+	struct pci_crash_pagemap *pm;
+	unsigned int i;
+
+	pm = kmalloc(struct_size(pm, addrs, num_pages), GFP_KERNEL);
+	if (!pm)
+		return NULL;
+
+	pm->magic = cpu_to_le32(PCI_CRASH_PAGEMAP_MAGIC);
+	pm->num_pages = cpu_to_le32(num_pages);
+	pm->buf_size = cpu_to_le64(buf_size);
+
+	for (i = 0; i < num_pages; i++) {
+		struct page *page;
+		phys_addr_t pa;
+
+		if (is_vmalloc_addr(buf + i * PAGE_SIZE))
+			page = vmalloc_to_page(buf + i * PAGE_SIZE);
+		else
+			page = virt_to_page(buf + i * PAGE_SIZE);
+
+		if (!page) {
+			kfree(pm);
+			return NULL;
+		}
+		pa = page_to_phys(page);
+		pm->addrs[i] = cpu_to_le64(pa);
+	}
+
+	return pm;
+}
+
+/**
+ * pci_crash_read_config_space() - Read config space for one device
+ * @pdev: PCI device to read
+ * @ptr: destination pointer within the crash buffer
+ *
+ * Reads pdev->cfg_size bytes of config space (256 for legacy PCI,
+ * 4096 for PCIe). Each 4-byte dword is read individually via MMIO.
+ * Failed reads write 0xFFFFFFFF -- standard PCI convention for
+ * absent/unreachable devices.
+ */
+static void pci_crash_read_config_space(struct pci_dev *pdev, u8 *ptr)
+{
+	struct pci_crash_device_record *record =
+		(struct pci_crash_device_record *)ptr;
+	u8 *cfg_data = ptr + PCI_CRASH_RECORD_META;
+	int offset;
+	u32 val;
+
+	record->domain = cpu_to_le16(pci_domain_nr(pdev->bus));
+	record->bus = pdev->bus->number;
+	record->devfn = pdev->devfn;
+	record->config_size = cpu_to_le32(pdev->cfg_size);
+
+	for (offset = 0; offset < pdev->cfg_size; offset += 4) {
+		if (pci_read_config_dword(pdev, offset, &val)) {
+			put_unaligned_le32(0xFFFFFFFF, &cfg_data[offset]);
+			continue;
+		}
+		put_unaligned_le32(val, &cfg_data[offset]);
+	}
+}
+
+/**
+ * pci_crash_fill_buffer() - Populate buffer with config space
+ * @buffer: destination buffer (header + variable-length records)
+ * @num_devs: number of devices to read
+ *
+ * Records are variable-length: each is PCI_CRASH_RECORD_META +
+ * pdev->cfg_size bytes. The header's config_size is 0 to indicate
+ * variable-length; parsers walk records using per-record config_size.
+ *
+ * Uses ktime_get_real_fast_ns() for the timestamp -- safe in NMI/panic
+ * context (lockless, reads the NMI-safe timekeeper snapshot).
+ */
+static void pci_crash_fill_buffer(void *buffer, unsigned int num_devs)
+{
+	struct pci_crash_buffer_header *header = buffer;
+	struct pci_dev **devs = READ_ONCE(pci_crash_devs);
+	u8 *ptr;
+	unsigned int i;
+
+	header->magic = cpu_to_le32(PCI_CRASH_MAGIC);
+	header->version = cpu_to_le32(PCI_CRASH_VERSION);
+	header->device_count = cpu_to_le32(num_devs);
+	header->config_size = 0;
+	header->timestamp = cpu_to_le64(ktime_get_real_fast_ns());
+	header->flags = 0;
+	header->reserved = 0;
+
+	ptr = (u8 *)buffer + PCI_CRASH_HEADER_SIZE;
+	for (i = 0; i < num_devs; i++) {
+		struct pci_dev *pdev = devs[i];
+
+		if (unlikely(!pdev))
+			break;
+		pci_crash_read_config_space(pdev, ptr);
+		ptr += PCI_CRASH_RECORD_META + pdev->cfg_size;
+	}
+
+	header->device_count = cpu_to_le32(i);
+}
+
+/**
+ * pci_crash_rebuild_snapshot() - Rebuild device list and allocate buffer
+ *
+ * Two-pass approach:
+ *   Pass 1: count PCI devices
+ *   Pass 2: populate device array (filtered) and compute exact buffer
+ *            size from actual pdev->cfg_size per device (no padding)
+ *
+ * The devices param controls which devices are included.  Bridges are
+ * always included regardless of devices setting (they hold AER registers).
+ * devices=all (default) includes everything.
+ *
+ * Does NOT read PCI config space -- reads happen only at crash time.
+ * This keeps rebuild fast during VF enumeration storms (~6000 ADD
+ * events on TRN2 during liveupdate).
+ *
+ * After allocation, builds the pagemap so the crash parser can
+ * locate the buffer's physical pages in the vmcore.
+ *
+ * Caller must hold pci_crash_lock.
+ */
+static void pci_crash_rebuild_snapshot(void)
+{
+	struct pci_dev *pdev = NULL;
+	unsigned int count = 0, i;
+	void *old_buf;
+	unsigned int old_num_devs = pci_crash_num_devs;
+	struct pci_dev **old_devs = pci_crash_devs;
+	struct pci_crash_pagemap *old_pm = pci_crash_pagemap;
+	struct pci_crash_pagemap *new_pm;
+	struct pci_dev **new_devs;
+	void *new_buf;
+	size_t total_size;
+
+	/* Pass 1: count devices */
+	for_each_pci_dev(pdev)
+		count++;
+
+	if (count == 0) {
+		pr_info("no PCI devices found\n");
+		WRITE_ONCE(pci_crash_num_devs, 0);
+		if (old_devs) {
+			for (i = 0; i < old_num_devs; i++)
+				pci_dev_put(old_devs[i]);
+			kvfree(old_devs);
+			WRITE_ONCE(pci_crash_devs, NULL);
+		}
+		kfree(old_pm);
+		WRITE_ONCE(pci_crash_pagemap, NULL);
+		WRITE_ONCE(pci_crash_pagemap_phys, 0);
+		WRITE_ONCE(pci_crash_pagemap_size, 0);
+		old_buf = pci_crash_buffer;
+		WRITE_ONCE(pci_crash_buffer, NULL);
+		WRITE_ONCE(pci_crash_buffer_size, 0);
+		kvfree(old_buf);
+		return;
+	}
+
+	/*
+	 * Disable capture during rebuild to prevent pci_crash_save()
+	 * from accessing stale device references or partially populated
+	 * arrays.
+	 */
+	WRITE_ONCE(pci_crash_num_devs, 0);
+
+	new_devs = kvmalloc_array(count, sizeof(struct pci_dev *),
+				  GFP_KERNEL | __GFP_ZERO);
+	if (!new_devs) {
+		pr_err("dev array alloc failed (%u)\n", count);
+		goto err_restore;
+	}
+
+	/*
+	 * Pass 2: populate device array (filtered) and compute exact
+	 * buffer size from actual pdev->cfg_size per device.
+	 * count from pass 1 is an upper bound; actual may be smaller.
+	 */
+	total_size = PCI_CRASH_HEADER_SIZE;
+	pdev = NULL;
+	i = 0;
+	for_each_pci_dev(pdev) {
+		if (i >= count) {
+			pci_dev_put(pdev);
+			break;
+		}
+		if (!pci_crash_device_matches(pdev))
+			continue;
+		new_devs[i] = pci_dev_get(pdev);
+		total_size += PCI_CRASH_RECORD_META + pdev->cfg_size;
+		i++;
+	}
+	count = i;
+
+	if (count == 0) {
+		kvfree(new_devs);
+		pr_info("no devices match devices=%s\n", devices);
+		goto err_restore;
+	}
+
+	if (total_size > PCI_CRASH_MAX_BUFFER_SIZE) {
+		for (i = 0; i < count; i++)
+			pci_dev_put(new_devs[i]);
+		kvfree(new_devs);
+		pr_warn("buffer too large (%zu > %d bytes)\n",
+			total_size, PCI_CRASH_MAX_BUFFER_SIZE);
+		goto err_restore;
+	}
+
+	new_buf = kvmalloc(total_size, GFP_KERNEL | __GFP_ZERO);
+	if (!new_buf) {
+		for (i = 0; i < count; i++)
+			pci_dev_put(new_devs[i]);
+		kvfree(new_devs);
+		pr_err("buffer alloc failed (%zu bytes)\n",
+		       total_size);
+		goto err_restore;
+	}
+
+	new_pm = pci_crash_build_pagemap(new_buf, total_size);
+	if (!new_pm) {
+		kvfree(new_buf);
+		for (i = 0; i < count; i++)
+			pci_dev_put(new_devs[i]);
+		kvfree(new_devs);
+		pr_err("pagemap alloc failed\n");
+		goto err_restore;
+	}
+
+	/* Release old device references */
+	for (i = 0; i < old_num_devs; i++)
+		if (old_devs && old_devs[i])
+			pci_dev_put(old_devs[i]);
+
+	WRITE_ONCE(pci_crash_devs, new_devs);
+	kvfree(old_devs);
+
+	/*
+	 * Swap pagemap first (with pre-computed phys addr), then buffer,
+	 * then size and count.  If crash fires mid-swap, num_devs is
+	 * still 0 from above, so pci_crash_save() bails out safely.
+	 */
+	WRITE_ONCE(pci_crash_pagemap, new_pm);
+	WRITE_ONCE(pci_crash_pagemap_phys, virt_to_phys(new_pm));
+	WRITE_ONCE(pci_crash_pagemap_size, struct_size(new_pm, addrs,
+			le32_to_cpu(new_pm->num_pages)));
+	kfree(old_pm);
+
+	old_buf = pci_crash_buffer;
+	WRITE_ONCE(pci_crash_buffer, new_buf);
+	WRITE_ONCE(pci_crash_buffer_size, total_size);
+	/* Ensure buffer/pagemap/size are visible before num_devs enables capture */
+	smp_wmb();
+	WRITE_ONCE(pci_crash_num_devs, count);
+	kvfree(old_buf);
+
+	pr_info("rebuild: %u devices (%zu bytes, %u pages)\n",
+		count, total_size, le32_to_cpu(new_pm->num_pages));
+	return;
+
+err_restore:
+	WRITE_ONCE(pci_crash_num_devs, old_num_devs);
+}
+
+/**
+ * pci_crash_save() - Capture PCI config space at crash time
+ *
+ * Called from crash_save_vmcoreinfo() inside __crash_kexec(), which
+ * runs before machine_kexec() boots the crash kernel.  This is the
+ * only reliable capture point -- panic notifiers run AFTER kexec by
+ * default (crash_kexec_post_notifiers=0).
+ *
+ * Capture check (capture param):
+ *   always  -- capture unconditionally
+ *   aer     -- quick-scan root port AER ROOT_STATUS for uncorrectable
+ *             errors; skip if none found
+ *
+ * When capture=always, captures on every panic.
+ * This is useful for cascading failures: a PCI link-down can cause
+ * an MCE or NMI watchdog timeout before DPC/AER fires, so the crash
+ * reason is UNKNOWN but AER registers may still hold error state.
+ *
+ * Reads config space fresh -- successful reads get current register
+ * state, failed reads (offline devices) write 0xFFFFFFFF.
+ *
+ * Flushes both buffer and pagemap from CPU cache to RAM so data
+ * survives kexec into crash kernel.
+ */
+void pci_crash_save(void)
+{
+	struct pci_crash_pagemap *pm;
+	unsigned long cflags;
+	unsigned int num_devs;
+	size_t pm_size;
+	size_t buf_size;
+	void *buffer;
+
+	num_devs = READ_ONCE(pci_crash_num_devs);
+	if (num_devs == 0)
+		return;
+
+	/* Pairs with smp_wmb() in rebuild -- ensures buffer/pagemap visible */
+	smp_rmb();
+
+	buffer = READ_ONCE(pci_crash_buffer);
+	buf_size = READ_ONCE(pci_crash_buffer_size);
+	pm = READ_ONCE(pci_crash_pagemap);
+	pm_size = READ_ONCE(pci_crash_pagemap_size);
+	if (!buffer || buf_size == 0)
+		return;
+
+	cflags = READ_ONCE(capture_flags);
+	if (!(cflags & PCI_CRASH_CAPTURE_ALWAYS)) {
+		if (cflags & PCI_CRASH_CAPTURE_AER) {
+			struct pci_dev **devs = READ_ONCE(pci_crash_devs);
+			unsigned int i;
+			bool pci_error_found = false;
+
+			if (!devs)
+				return;
+
+			for (i = 0; i < num_devs; i++) {
+				struct pci_dev *pdev = devs[i];
+				u32 status;
+
+				if (!pdev || !pdev->aer_cap)
+					continue;
+
+				if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) {
+					pci_read_config_dword(pdev,
+							     pdev->aer_cap + PCI_ERR_ROOT_STATUS,
+							     &status);
+					if (status & PCI_ERR_ROOT_UNCOR_RCV) {
+						pci_error_found = true;
+						break;
+					}
+				}
+			}
+
+			if (!pci_error_found) {
+				pr_emerg("no PCI errors detected, skipping capture\n");
+				return;
+			}
+		} else {
+			return;
+		}
+	}
+
+	pci_crash_fill_buffer(buffer, num_devs);
+
+	/*
+	 * Flush buffer and pagemap from CPU cache to RAM so the
+	 * crash kernel sees our writes after kexec.
+	 */
+	pci_crash_flush_dcache(buffer, buf_size);
+
+	if (pm && pm_size > 0)
+		pci_crash_flush_dcache(pm, pm_size);
+
+	pr_emerg("CAPTURE: %u devices, %zu bytes\n",
+		 num_devs, buf_size);
+}
+EXPORT_SYMBOL_GPL(pci_crash_save);
+
+static void pci_crash_rebuild_worker(struct work_struct *work)
+{
+	mutex_lock(&pci_crash_lock);
+	pci_crash_rebuild_snapshot();
+	mutex_unlock(&pci_crash_lock);
+}
+
+static int pci_crash_bus_notifier(struct notifier_block *nb,
+				  unsigned long action, void *data)
+{
+	if (action == BUS_NOTIFY_ADD_DEVICE ||
+	    action == BUS_NOTIFY_DEL_DEVICE)
+		mod_delayed_work(system_wq, &pci_crash_rebuild_dwork,
+				 msecs_to_jiffies(PCI_CRASH_REBUILD_DELAY_MS));
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block pci_crash_bus_nb = {
+	.notifier_call = pci_crash_bus_notifier,
+};
+
+static int __init pci_crash_init(void)
+{
+	/* Nothing to do in crash kernel -- the buffer from the first kernel
+	 * is already in RAM (flushed before kexec) and the parser finds it
+	 * via the pagemap in VMCOREINFO.
+	 */
+	if (is_kdump_kernel())
+		return 0;
+
+	INIT_DELAYED_WORK(&pci_crash_rebuild_dwork, pci_crash_rebuild_worker);
+
+	pci_crash_parse_capture();
+	pci_crash_parse_devices();
+
+	mutex_lock(&pci_crash_lock);
+	pci_crash_rebuild_snapshot();
+	mutex_unlock(&pci_crash_lock);
+
+	bus_register_notifier(&pci_bus_type, &pci_crash_bus_nb);
+
+	WRITE_ONCE(pci_crash_ready, true);
+
+	pr_info("ready: %u devices (%zu bytes), capture=%s devices=%s\n",
+		pci_crash_num_devs, pci_crash_buffer_size,
+		capture, devices);
+
+	return 0;
+}
+late_initcall(pci_crash_init);
+
+/* Built-in only: crash infrastructure must outlive all drivers. */
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Capture PCI config space at panic time for crash analysis");
+MODULE_AUTHOR("Amazon.com, Inc.");
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index 8614430ca212..8813f7e2e516 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -14,6 +14,7 @@
 #include <linux/cpuhotplug.h>
 #include <linux/memblock.h>
 #include <linux/kmemleak.h>
+#include <linux/pci_crash.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -91,6 +92,18 @@ void crash_save_vmcoreinfo(void)
 		vmcoreinfo_data = vmcoreinfo_data_safecopy;
 
 	vmcoreinfo_append_str("CRASHTIME=%lld\n", ktime_get_real_seconds());
+
+	/* Capture PCI config space before kexec into crash kernel */
+	pci_crash_save();
+	if (pci_crash_pagemap_phys && pci_crash_buffer_size > 0) {
+		vmcoreinfo_append_str("PCI_CRASH_PAGEMAP=0x%llx\n",
+				      (unsigned long long)pci_crash_pagemap_phys);
+		vmcoreinfo_append_str("PCI_CRASH_VERSION=%d\n",
+				      PCI_CRASH_VERSION);
+		vmcoreinfo_append_str("PCI_CRASH_BUF_SZ=%zu\n",
+				      pci_crash_buffer_size);
+	}
+
 	update_vmcoreinfo_note();
 }
 
-- 
2.47.3




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] pci_crash: capture PCI config space at panic time
  2026-05-12 16:24 [PATCH] pci_crash: capture PCI config space at panic time hangej
@ 2026-05-13 21:30 ` kernel test robot
  2026-05-13 22:23 ` kernel test robot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: kernel test robot @ 2026-05-13 21:30 UTC (permalink / raw)
  To: hangej, bhelgaas
  Cc: oe-kbuild-all, linux-pci, linux-kernel, dwmw2, kexec, hangej

Hi hangej,

kernel test robot noticed the following build errors:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/hangej/pci_crash-capture-PCI-config-space-at-panic-time/20260514-000653
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link:    https://lore.kernel.org/r/20260512162402.1929687-1-hangej%40amazon.com
patch subject: [PATCH] pci_crash: capture PCI config space at panic time
config: i386-randconfig-r071-20260514 (https://download.01.org/0day-ci/archive/20260514/202605140501.1tSG7lxR-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
smatch: v0.5.0-9185-gbcc58b9c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260514/202605140501.1tSG7lxR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605140501.1tSG7lxR-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/pci_crash.c: In function 'pci_crash_save':
>> kernel/pci_crash.c:661:53: error: 'struct pci_dev' has no member named 'aer_cap'; did you mean 'acs_cap'?
     661 |                                 if (!pdev || !pdev->aer_cap)
         |                                                     ^~~~~~~
         |                                                     acs_cap
   kernel/pci_crash.c:666:68: error: 'struct pci_dev' has no member named 'aer_cap'; did you mean 'acs_cap'?
     666 |                                                              pdev->aer_cap + PCI_ERR_ROOT_STATUS,
         |                                                                    ^~~~~~~
         |                                                                    acs_cap


vim +661 kernel/pci_crash.c

   599	
   600	/**
   601	 * pci_crash_save() - Capture PCI config space at crash time
   602	 *
   603	 * Called from crash_save_vmcoreinfo() inside __crash_kexec(), which
   604	 * runs before machine_kexec() boots the crash kernel.  This is the
   605	 * only reliable capture point -- panic notifiers run AFTER kexec by
   606	 * default (crash_kexec_post_notifiers=0).
   607	 *
   608	 * Capture check (capture param):
   609	 *   always  -- capture unconditionally
   610	 *   aer     -- quick-scan root port AER ROOT_STATUS for uncorrectable
   611	 *             errors; skip if none found
   612	 *
   613	 * When capture=always, captures on every panic.
   614	 * This is useful for cascading failures: a PCI link-down can cause
   615	 * an MCE or NMI watchdog timeout before DPC/AER fires, so the crash
   616	 * reason is UNKNOWN but AER registers may still hold error state.
   617	 *
   618	 * Reads config space fresh -- successful reads get current register
   619	 * state, failed reads (offline devices) write 0xFFFFFFFF.
   620	 *
   621	 * Flushes both buffer and pagemap from CPU cache to RAM so data
   622	 * survives kexec into crash kernel.
   623	 */
   624	void pci_crash_save(void)
   625	{
   626		struct pci_crash_pagemap *pm;
   627		unsigned long cflags;
   628		unsigned int num_devs;
   629		size_t pm_size;
   630		size_t buf_size;
   631		void *buffer;
   632	
   633		num_devs = READ_ONCE(pci_crash_num_devs);
   634		if (num_devs == 0)
   635			return;
   636	
   637		/* Pairs with smp_wmb() in rebuild -- ensures buffer/pagemap visible */
   638		smp_rmb();
   639	
   640		buffer = READ_ONCE(pci_crash_buffer);
   641		buf_size = READ_ONCE(pci_crash_buffer_size);
   642		pm = READ_ONCE(pci_crash_pagemap);
   643		pm_size = READ_ONCE(pci_crash_pagemap_size);
   644		if (!buffer || buf_size == 0)
   645			return;
   646	
   647		cflags = READ_ONCE(capture_flags);
   648		if (!(cflags & PCI_CRASH_CAPTURE_ALWAYS)) {
   649			if (cflags & PCI_CRASH_CAPTURE_AER) {
   650				struct pci_dev **devs = READ_ONCE(pci_crash_devs);
   651				unsigned int i;
   652				bool pci_error_found = false;
   653	
   654				if (!devs)
   655					return;
   656	
   657				for (i = 0; i < num_devs; i++) {
   658					struct pci_dev *pdev = devs[i];
   659					u32 status;
   660	
 > 661					if (!pdev || !pdev->aer_cap)
   662						continue;
   663	
   664					if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) {
   665						pci_read_config_dword(pdev,
   666								     pdev->aer_cap + PCI_ERR_ROOT_STATUS,
   667								     &status);
   668						if (status & PCI_ERR_ROOT_UNCOR_RCV) {
   669							pci_error_found = true;
   670							break;
   671						}
   672					}
   673				}
   674	
   675				if (!pci_error_found) {
   676					pr_emerg("no PCI errors detected, skipping capture\n");
   677					return;
   678				}
   679			} else {
   680				return;
   681			}
   682		}
   683	
   684		pci_crash_fill_buffer(buffer, num_devs);
   685	
   686		/*
   687		 * Flush buffer and pagemap from CPU cache to RAM so the
   688		 * crash kernel sees our writes after kexec.
   689		 */
   690		pci_crash_flush_dcache(buffer, buf_size);
   691	
   692		if (pm && pm_size > 0)
   693			pci_crash_flush_dcache(pm, pm_size);
   694	
   695		pr_emerg("CAPTURE: %u devices, %zu bytes\n",
   696			 num_devs, buf_size);
   697	}
   698	EXPORT_SYMBOL_GPL(pci_crash_save);
   699	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] pci_crash: capture PCI config space at panic time
  2026-05-12 16:24 [PATCH] pci_crash: capture PCI config space at panic time hangej
  2026-05-13 21:30 ` kernel test robot
@ 2026-05-13 22:23 ` kernel test robot
  2026-05-13 23:07 ` kernel test robot
  2026-05-13 23:39 ` sashiko-bot
  3 siblings, 0 replies; 5+ messages in thread
From: kernel test robot @ 2026-05-13 22:23 UTC (permalink / raw)
  To: hangej, bhelgaas
  Cc: oe-kbuild-all, linux-pci, linux-kernel, dwmw2, kexec, hangej

Hi hangej,

kernel test robot noticed the following build errors:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/hangej/pci_crash-capture-PCI-config-space-at-panic-time/20260514-000653
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link:    https://lore.kernel.org/r/20260512162402.1929687-1-hangej%40amazon.com
patch subject: [PATCH] pci_crash: capture PCI config space at panic time
config: x86_64-randconfig-161-20260514 (https://download.01.org/0day-ci/archive/20260514/202605140624.Tw71sECt-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
smatch: v0.5.0-9185-gbcc58b9c
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260514/202605140624.Tw71sECt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605140624.Tw71sECt-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/pci_crash.c:661:25: error: no member named 'aer_cap' in 'struct pci_dev'
     661 |                                 if (!pdev || !pdev->aer_cap)
         |                                               ~~~~  ^
   kernel/pci_crash.c:666:19: error: no member named 'aer_cap' in 'struct pci_dev'
     666 |                                                              pdev->aer_cap + PCI_ERR_ROOT_STATUS,
         |                                                              ~~~~  ^
   2 errors generated.


vim +661 kernel/pci_crash.c

   599	
   600	/**
   601	 * pci_crash_save() - Capture PCI config space at crash time
   602	 *
   603	 * Called from crash_save_vmcoreinfo() inside __crash_kexec(), which
   604	 * runs before machine_kexec() boots the crash kernel.  This is the
   605	 * only reliable capture point -- panic notifiers run AFTER kexec by
   606	 * default (crash_kexec_post_notifiers=0).
   607	 *
   608	 * Capture check (capture param):
   609	 *   always  -- capture unconditionally
   610	 *   aer     -- quick-scan root port AER ROOT_STATUS for uncorrectable
   611	 *             errors; skip if none found
   612	 *
   613	 * When capture=always, captures on every panic.
   614	 * This is useful for cascading failures: a PCI link-down can cause
   615	 * an MCE or NMI watchdog timeout before DPC/AER fires, so the crash
   616	 * reason is UNKNOWN but AER registers may still hold error state.
   617	 *
   618	 * Reads config space fresh -- successful reads get current register
   619	 * state, failed reads (offline devices) write 0xFFFFFFFF.
   620	 *
   621	 * Flushes both buffer and pagemap from CPU cache to RAM so data
   622	 * survives kexec into crash kernel.
   623	 */
   624	void pci_crash_save(void)
   625	{
   626		struct pci_crash_pagemap *pm;
   627		unsigned long cflags;
   628		unsigned int num_devs;
   629		size_t pm_size;
   630		size_t buf_size;
   631		void *buffer;
   632	
   633		num_devs = READ_ONCE(pci_crash_num_devs);
   634		if (num_devs == 0)
   635			return;
   636	
   637		/* Pairs with smp_wmb() in rebuild -- ensures buffer/pagemap visible */
   638		smp_rmb();
   639	
   640		buffer = READ_ONCE(pci_crash_buffer);
   641		buf_size = READ_ONCE(pci_crash_buffer_size);
   642		pm = READ_ONCE(pci_crash_pagemap);
   643		pm_size = READ_ONCE(pci_crash_pagemap_size);
   644		if (!buffer || buf_size == 0)
   645			return;
   646	
   647		cflags = READ_ONCE(capture_flags);
   648		if (!(cflags & PCI_CRASH_CAPTURE_ALWAYS)) {
   649			if (cflags & PCI_CRASH_CAPTURE_AER) {
   650				struct pci_dev **devs = READ_ONCE(pci_crash_devs);
   651				unsigned int i;
   652				bool pci_error_found = false;
   653	
   654				if (!devs)
   655					return;
   656	
   657				for (i = 0; i < num_devs; i++) {
   658					struct pci_dev *pdev = devs[i];
   659					u32 status;
   660	
 > 661					if (!pdev || !pdev->aer_cap)
   662						continue;
   663	
   664					if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) {
   665						pci_read_config_dword(pdev,
   666								     pdev->aer_cap + PCI_ERR_ROOT_STATUS,
   667								     &status);
   668						if (status & PCI_ERR_ROOT_UNCOR_RCV) {
   669							pci_error_found = true;
   670							break;
   671						}
   672					}
   673				}
   674	
   675				if (!pci_error_found) {
   676					pr_emerg("no PCI errors detected, skipping capture\n");
   677					return;
   678				}
   679			} else {
   680				return;
   681			}
   682		}
   683	
   684		pci_crash_fill_buffer(buffer, num_devs);
   685	
   686		/*
   687		 * Flush buffer and pagemap from CPU cache to RAM so the
   688		 * crash kernel sees our writes after kexec.
   689		 */
   690		pci_crash_flush_dcache(buffer, buf_size);
   691	
   692		if (pm && pm_size > 0)
   693			pci_crash_flush_dcache(pm, pm_size);
   694	
   695		pr_emerg("CAPTURE: %u devices, %zu bytes\n",
   696			 num_devs, buf_size);
   697	}
   698	EXPORT_SYMBOL_GPL(pci_crash_save);
   699	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] pci_crash: capture PCI config space at panic time
  2026-05-12 16:24 [PATCH] pci_crash: capture PCI config space at panic time hangej
  2026-05-13 21:30 ` kernel test robot
  2026-05-13 22:23 ` kernel test robot
@ 2026-05-13 23:07 ` kernel test robot
  2026-05-13 23:39 ` sashiko-bot
  3 siblings, 0 replies; 5+ messages in thread
From: kernel test robot @ 2026-05-13 23:07 UTC (permalink / raw)
  To: hangej, bhelgaas
  Cc: llvm, oe-kbuild-all, linux-pci, linux-kernel, dwmw2, kexec,
	hangej

Hi hangej,

kernel test robot noticed the following build warnings:

[auto build test WARNING on pci/next]
[also build test WARNING on pci/for-linus linus/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/hangej/pci_crash-capture-PCI-config-space-at-panic-time/20260514-000653
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link:    https://lore.kernel.org/r/20260512162402.1929687-1-hangej%40amazon.com
patch subject: [PATCH] pci_crash: capture PCI config space at panic time
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20260514/202605140714.yFl4KATZ-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260514/202605140714.yFl4KATZ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605140714.yFl4KATZ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: kernel/pci_crash.c:74 cannot understand function prototype: 'void *pci_crash_buffer;'
>> Warning: kernel/pci_crash.c:80 cannot understand function prototype: 'size_t pci_crash_buffer_size;'
>> Warning: kernel/pci_crash.c:90 cannot understand function prototype: 'phys_addr_t pci_crash_pagemap_phys;'

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] pci_crash: capture PCI config space at panic time
  2026-05-12 16:24 [PATCH] pci_crash: capture PCI config space at panic time hangej
                   ` (2 preceding siblings ...)
  2026-05-13 23:07 ` kernel test robot
@ 2026-05-13 23:39 ` sashiko-bot
  3 siblings, 0 replies; 5+ messages in thread
From: sashiko-bot @ 2026-05-13 23:39 UTC (permalink / raw)
  To: hangej; +Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 9 potential issue(s) to consider:
- [Critical] Deadlock in panic path due to global pci_lock acquisition.
- [High] Hardware error and recursive panic from accessing offline or D3cold devices.
- [High] Lockless array teardown and missing memory barriers cause Use-After-Free during panic.
- [High] Loss of intra-page offset in the physical pagemap for kmalloc-backed buffers.
- [Medium] Resource leak when all matched devices are hot-removed.
- [Medium] Uninitialized variable used in AER capture check.
- [Medium] Missing locking in sysfs capture parameter setter leads to data race.
- [Low] Ignoring strim() return value prevents stripping of leading whitespace.
- [Low] Commit message incorrectly claims the use of slab_is_available().
--

commit f80de92901e525b2a8355b2b6d9ab3d784f115db
Author: Johannes Hange <hangej@amazon.com>
Subject: pci_crash: capture PCI config space at panic time

This commit introduces CONFIG_PCI_CRASH to capture PCI configuration space
data during a kernel panic. The data is stored in a buffer that can be
extracted from the vmcore for post-mortem analysis.

> diff --git a/Documentation/PCI/pci-crash-capture.rst b/Documentation/PCI/pci-crash-capture.rst
> index 0000000000000..35f676c2724b7
> --- /dev/null
> +++ b/Documentation/PCI/pci-crash-capture.rst
[ ... ]
> +- slab_is_available() guard in param setters prevents use-before-init
> +  when parameters are set via kernel command line.

This isn't a bug, but the commit message and documentation mention using
a slab_is_available() guard. It looks like the code actually uses
READ_ONCE(pci_crash_ready). Could this documentation be updated to
match the implementation?

> diff --git a/kernel/pci_crash.c b/kernel/pci_crash.c
> new file mode 100644
> index 0000000000000..8bbc9008fdc85
> --- /dev/null
> +++ b/kernel/pci_crash.c
[ ... ]
> +static int capture_param_set(const char *val, const struct kernel_param *kp)
> +{
> +	if (strlen(val) >= sizeof(capture))
> +		return -EINVAL;
> +	strscpy(capture, val, sizeof(capture));
> +	strim(capture);

Could concurrent sysfs writes from userspace cause data races here? The
devices_param_set() function protects the string modification with
pci_crash_lock, but it appears to be missing in this setter.

Also, does this code correctly handle leading whitespace? The strim()
function modifies the string in-place to remove trailing whitespace, but it
returns a pointer to the first non-whitespace character to skip leading
whitespace. Since the return value is ignored here and in
devices_param_set(), any leading whitespace would be preserved and might
break the exact token string comparisons later.

[ ... ]
> +static struct pci_crash_pagemap *pci_crash_build_pagemap(void *buf,
> +							 size_t buf_size)
> +{
[ ... ]
> +	for (i = 0; i < num_pages; i++) {
> +		struct page *page;
> +		phys_addr_t pa;
> +
> +		if (is_vmalloc_addr(buf + i * PAGE_SIZE))
> +			page = vmalloc_to_page(buf + i * PAGE_SIZE);
> +		else
> +			page = virt_to_page(buf + i * PAGE_SIZE);
> +
> +		if (!page) {
> +			kfree(pm);
> +			return NULL;
> +		}
> +		pa = page_to_phys(page);
> +		pm->addrs[i] = cpu_to_le64(pa);

Can this lose the intra-page offset for the buffer? The kvmalloc()
function falls back to kmalloc() for smaller allocations, which are
aligned to slab boundaries rather than page boundaries. The page_to_phys()
macro returns the physical address of the start of the page, so won't the
crash parser read from the page boundary instead of the actual buffer start?

[ ... ]
> +static void pci_crash_read_config_space(struct pci_dev *pdev, u8 *ptr)
> +{
[ ... ]
> +	for (offset = 0; offset < pdev->cfg_size; offset += 4) {
> +		if (pci_read_config_dword(pdev, offset, &val)) {

Could this trigger a recursive panic when accessing offline or D3cold devices?
The code iterates over all captured devices and reads their config space
without checking their power state. If a device is in D3cold or otherwise
unreachable, accessing it will cause a PCIe Master Abort.

On some architectures like ARM64, unhandled Master Aborts trigger an SError,
which would cause an immediate hard hang or reset when raised from within
the panic handler.

Also, the commit message claims that pci_read_config_dword() is lockless.
Is this description accurate? Looking at pci_read_config_dword(), the
generic implementation unconditionally calls pci_bus_read_config_dword(),
which acquires the global raw spinlock pci_lock. If the kernel panics
while this lock is held by another CPU, won't the panicking CPU spin
indefinitely attempting to acquire it and deadlock the panic handler?

> +			put_unaligned_le32(0xFFFFFFFF, &cfg_data[offset]);
> +			continue;
> +		}
> +		put_unaligned_le32(val, &cfg_data[offset]);
> +	}
> +}
[ ... ]
> +static void pci_crash_rebuild_snapshot(void)
> +{
[ ... ]
> +	if (count == 0) {
> +		pr_info("no PCI devices found\n");
> +		WRITE_ONCE(pci_crash_num_devs, 0);
> +		if (old_devs) {
> +			for (i = 0; i < old_num_devs; i++)
> +				pci_dev_put(old_devs[i]);
> +			kvfree(old_devs);
> +			WRITE_ONCE(pci_crash_devs, NULL);
> +		}
> +		kfree(old_pm);
> +		WRITE_ONCE(pci_crash_pagemap, NULL);
> +		WRITE_ONCE(pci_crash_pagemap_phys, 0);

Could this result in a use-after-free during a panic?

On weakly ordered architectures, the CPU store buffer might delay making
pci_crash_num_devs = 0 visible globally while kvfree(old_devs) or
kfree(old_pm) modify allocator metadata in memory.

If an NMI stops this CPU at this exact point, could the panicking CPU
observe a non-zero device count and traverse the freed old_devs array or
export a freed physical address from old_pm? Should there be an smp_wmb()
after writing 0 to pci_crash_num_devs?

> +		old_buf = pci_crash_buffer;
> +		WRITE_ONCE(pci_crash_buffer, NULL);
> +		WRITE_ONCE(pci_crash_buffer_size, 0);
> +		kvfree(old_buf);
> +		return;
> +	}
[ ... ]
> +	if (count == 0) {
> +		kvfree(new_devs);
> +		pr_info("no devices match devices=%s\n", devices);
> +		goto err_restore;
> +	}
[ ... ]
> +err_restore:
> +	WRITE_ONCE(pci_crash_num_devs, old_num_devs);
> +}

Does this code leak the struct pci_dev references for the removed devices?
When jumping to err_restore, it restores pci_crash_num_devs to old_num_devs
but doesn't appear to drop the references on the old_devs array, pinning
their memory indefinitely.

[ ... ]
> +void pci_crash_save(void)
> +{
[ ... ]
> +	if (!(cflags & PCI_CRASH_CAPTURE_ALWAYS)) {
> +		if (cflags & PCI_CRASH_CAPTURE_AER) {
> +			struct pci_dev **devs = READ_ONCE(pci_crash_devs);
> +			unsigned int i;
> +			bool pci_error_found = false;
> +
> +			if (!devs)
> +				return;
> +
> +			for (i = 0; i < num_devs; i++) {
> +				struct pci_dev *pdev = devs[i];
> +				u32 status;
> +
> +				if (!pdev || !pdev->aer_cap)
> +					continue;
> +
> +				if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) {
> +					pci_read_config_dword(pdev,
> +							     pdev->aer_cap + PCI_ERR_ROOT_STATUS,
> +							     &status);
> +					if (status & PCI_ERR_ROOT_UNCOR_RCV) {

Can this evaluate uninitialized stack memory? If pci_read_config_dword()
fails early, it returns an error without populating the pointer, leaving
the status variable with uninitialized stack garbage. The return value isn't
checked before evaluating status here.

> +						pci_error_found = true;
> +						break;
> +					}
> +				}
> +			}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512162402.1929687-1-hangej@amazon.com?part=1

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-13 23:39 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-12 16:24 [PATCH] pci_crash: capture PCI config space at panic time hangej
2026-05-13 21:30 ` kernel test robot
2026-05-13 22:23 ` kernel test robot
2026-05-13 23:07 ` kernel test robot
2026-05-13 23:39 ` sashiko-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox