From: hangej <hangej@amazon.com>
To: <bhelgaas@google.com>
Cc: <linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<dwmw2@infradead.org>, <kexec@lists.infradead.org>,
<hangej@amazon.com>
Subject: [PATCH] pci_crash: capture PCI config space at panic time
Date: Tue, 12 May 2026 18:24:01 +0200
Message-ID: <20260512162402.1929687-1-hangej@amazon.com>
From: Johannes Hange <hangej@amazon.com>

When a system crashes and kexec boots into the crash kernel, PCI
device state is lost. AER registers are volatile and cleared by
device reset, making post-mortem root cause analysis difficult.

Add CONFIG_PCI_CRASH to capture PCI configuration space (including
AER extended capability registers) during panic, before kexec. The
data is exported via VMCOREINFO so crash analysis tools can extract
it from the vmcore.

Hook point: crash_save_vmcoreinfo() inside __crash_kexec(), which
runs before machine_kexec() -- the only reliable capture point since
panic notifiers run after kexec by default.

Key design choices:

- No config reads at init or hotplug; buffer pre-allocated, filled
  only at crash time for fresh register state.
- capture=aer (default): quick-scan root port ROOT_STATUS, bail if
  no uncorrectable errors. capture=always: every panic.
- devices= filter applied at rebuild time (zero crash-path cost).
- 200ms debounced workqueue rebuild for VF enumeration storms.
- kvmalloc buffer + kmalloc pagemap for physical page directory
  (vmalloc addresses need explicit page-to-phys translation).
- dcache flush (ARM64/x86) ensures data survives kexec.

Build-tested on v7.1-rc2 (x86_64). Runtime-tested on 6.6 with
forced panics: ~2ms for 31 PCIe devices (4096B config each),
sub-microsecond AER bail on non-PCI panics.

See Documentation/PCI/pci-crash-capture.rst for full details.

Signed-off-by: Johannes Hange <hangej@amazon.com>
---
Documentation/PCI/index.rst | 1 +
Documentation/PCI/pci-crash-capture.rst | 134 +++++
MAINTAINERS | 8 +
include/linux/pci_crash.h | 110 ++++
kernel/Kconfig.kexec | 17 +
kernel/Makefile | 1 +
kernel/pci_crash.c | 755 ++++++++++++++++++++++++
kernel/vmcore_info.c | 13 +
8 files changed, 1039 insertions(+)
create mode 100644 Documentation/PCI/pci-crash-capture.rst
create mode 100644 include/linux/pci_crash.h
create mode 100644 kernel/pci_crash.c
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index 5d720d2a415e..7f499a43ddb4 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -19,4 +19,5 @@ PCI Bus Subsystem
endpoint/index
controller/index
boot-interrupts
+ pci-crash-capture
tph
diff --git a/Documentation/PCI/pci-crash-capture.rst b/Documentation/PCI/pci-crash-capture.rst
new file mode 100644
index 000000000000..35f676c2724b
--- /dev/null
+++ b/Documentation/PCI/pci-crash-capture.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+PCI Crash Capture Buffer
+========================
+
+Overview
+========
+
+The PCI crash capture module (``CONFIG_PCI_CRASH``) saves PCI configuration
+space for all (or selected) devices at panic time. The data is written into
+a pre-allocated buffer whose physical pages are exported via VMCOREINFO,
+allowing crash analysis tools to extract device state from the vmcore.
+
+This is useful because AER (Advanced Error Reporting) registers are volatile
+and cleared by device reset during kexec into the crash kernel. Capturing
+them before kexec preserves the error state that caused or contributed to the
+crash.
+
+Boot Parameters
+===============
+
+``pci_crash.capture=`` (default: ``aer``)
+ When to capture PCI config space. Comma-separated tokens:
+
+ ``aer``
+ Capture only if a root port reports uncorrectable errors in its
+ AER ROOT_STATUS register. Non-PCI panics skip capture entirely
+ (a handful of MMIO reads to root ports, sub-microsecond).
+
+ ``always``
+ Capture on every panic regardless of AER state. Useful for
+ cascading failures where a PCI link-down causes an MCE or NMI
+ watchdog timeout before DPC/AER fires.
+
+``pci_crash.devices=`` (default: ``all``)
+ Which devices to include in the capture buffer. Comma-separated tokens:
+
+ ``all``
+ Every PCI device in the system.
+
+ ``bridges``
+ PCI-to-PCI bridges (class 0604) and CardBus bridges (class 0607).
+
+ ``root_ports``
+ PCIe root ports only.
+
+ ``XXYY``
+ Hex PCI class code (class byte XX, subclass byte YY).
+ Up to 8 class codes may be specified.
+
+ Bridges are always implicitly included regardless of the filter value
+ because they hold AER registers needed for root cause analysis. The
+ filter is applied at device enumeration and hotplug rebuild time, not at
+ crash time (zero overhead on the panic path).
+
+Both parameters are writable at runtime via sysfs
+(``/sys/module/pci_crash/parameters/``).
+
+Architecture
+============
+
+::
+
+ late_initcall
+ │
+ ├── enumerate PCI devices (filtered by devices= param)
+ ├── allocate buffer via kvmalloc (may be vmalloc for >4 MiB)
+ ├── build pagemap: kmalloc'd array of per-page physical addresses
+ └── register bus notifier for hotplug
+
+ hotplug (BUS_NOTIFY_ADD_DEVICE / BUS_NOTIFY_DEL_DEVICE)
+ │
+ └── schedule delayed rebuild (200 ms debounce)
+ └── re-enumerate, re-allocate buffer + pagemap
+
+ panic (__crash_kexec → crash_save_vmcoreinfo → pci_crash_save)
+ │
+ ├── quick-scan root port AER ROOT_STATUS (capture=aer)
+ │ └── bail if no uncorrectable errors
+ ├── read config space for each device (MMIO, no locks)
+ ├── flush dcache (buffer + pagemap) to RAM
+ └── VMCOREINFO exports: PCI_CRASH_PAGEMAP, PCI_CRASH_BUF_SZ
+
+Buffer Format
+=============
+
+The buffer consists of a 32-byte header followed by variable-length
+device records:
+
+.. code-block:: c
+
+ struct pci_crash_buffer_header { /* 32 bytes */
+ __le32 magic; /* 0x50434943 "PCIC" */
+ __le32 version; /* 1 */
+ __le32 device_count;
+ __le32 config_size; /* 0 = variable-length records */
+ __le64 timestamp; /* ktime_get_real_fast_ns() */
+ __le32 flags; /* reserved */
+ __le32 reserved;
+ };
+
+ struct pci_crash_device_record { /* 8 + cfg_size bytes */
+ __le16 domain;
+ __u8 bus;
+ __u8 devfn;
+ __le32 config_size; /* 256 or 4096 */
+ __u8 config_data[];
+ };
+
+The pagemap (exported via ``PCI_CRASH_PAGEMAP``) allows the parser to
+locate buffer pages without walking page tables:
+
+.. code-block:: c
+
+ struct pci_crash_pagemap {
+ __le32 magic; /* 0x5043504d "PCPM" */
+ __le32 num_pages;
+ __le64 buf_size;
+ __le64 addrs[]; /* physical address per page */
+ };
+
+Safety Considerations
+=====================
+
+- ``pci_read_config_dword()`` is direct ECAM MMIO at crash time (no locks).
+- ``ktime_get_real_fast_ns()`` is NMI-safe (lockless timekeeper snapshot).
+- ``WRITE_ONCE``/``READ_ONCE`` + memory barriers between rebuild (process
+ context) and ``pci_crash_save()`` (crash context, single CPU, interrupts
+ disabled).
+- Buffer capped at 24 MiB to prevent excessive allocation on systems with
+ thousands of VFs.
+- ``slab_is_available()`` guard in param setters prevents use-before-init
+ when parameters are set via kernel command line.
diff --git a/MAINTAINERS b/MAINTAINERS
index b2040011a386..3e4fbb3011ce 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20622,6 +20622,14 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
F: drivers/pci/pwrctrl/*
F: include/linux/pci-pwrctrl.h
+
+PCI CRASH BUFFER
+M: Johannes Hange <hangej@amazon.com>
+L: linux-pci@vger.kernel.org
+S: Maintained
+F: include/linux/pci_crash.h
+F: kernel/pci_crash.c
+
PCI SUBSYSTEM
M: Bjorn Helgaas <bhelgaas@google.com>
L: linux-pci@vger.kernel.org
diff --git a/include/linux/pci_crash.h b/include/linux/pci_crash.h
new file mode 100644
index 000000000000..59a46ee26183
--- /dev/null
+++ b/include/linux/pci_crash.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PCI Crash Buffer - Capture PCI config space at panic time
+ *
+ * This module captures PCI configuration space data (including AER
+ * extended capability registers) for all PCI devices at panic time.
+ * The data is stored in a buffer whose pages are captured in the
+ * vmcore for off-site analysis.
+ *
+ * Copyright (c) 2026 Amazon.com, Inc. or its affiliates.
+ */
+#ifndef _LINUX_PCI_CRASH_H
+#define _LINUX_PCI_CRASH_H
+
+#include <linux/types.h>
+
+#define PCI_CRASH_MAGIC 0x50434943 /* "PCIC" in ASCII */
+#define PCI_CRASH_VERSION 1
+
+/**
+ * struct pci_crash_buffer_header - Header for PCI crash buffer
+ * @magic: Magic number (PCI_CRASH_MAGIC)
+ * @version: Format version (PCI_CRASH_VERSION)
+ * @device_count: Number of device records following this header
+ * @config_size: 0 -- indicates variable-length records. Each device
+ * record stores its own config_size (pdev->cfg_size:
+ * 256 for legacy PCI, 4096 for PCIe). Parsers walk
+ * records sequentially using per-record config_size.
+ * @timestamp: Capture timestamp from ktime_get_real_fast_ns()
+ * @flags: Reserved for future use (0 for now)
+ * @reserved: Padding to align to 32 bytes
+ *
+ * Total size: 32 bytes
+ */
+struct pci_crash_buffer_header {
+ __le32 magic;
+ __le32 version;
+ __le32 device_count;
+ __le32 config_size;
+ __le64 timestamp;
+ __le32 flags;
+ __le32 reserved;
+} __packed;
+
+/**
+ * struct pci_crash_device_record - Per-device record in crash buffer
+ * @domain: PCI domain number
+ * @bus: PCI bus number
+ * @devfn: Device and function number (PCI_DEVFN format)
+ * @config_size: Config space size for this device (pdev->cfg_size:
+ * 256 for legacy PCI, 4096 for PCIe)
+ * @config_data: Raw PCI config space (config_size bytes)
+ *
+ * Records are variable-length: total size per record is
+ * PCI_CRASH_RECORD_META + config_size bytes.
+ */
+struct pci_crash_device_record {
+ __le16 domain;
+ __u8 bus;
+ __u8 devfn;
+ __le32 config_size;
+ __u8 config_data[];
+} __packed;
+
+#define PCI_CRASH_HEADER_SIZE sizeof(struct pci_crash_buffer_header)
+#define PCI_CRASH_RECORD_META sizeof(struct pci_crash_device_record)
+
+/**
+ * struct pci_crash_pagemap - Physical page directory for crash buffer
+ * @magic: 0x5043504d ("PCPM") -- validates this is a pagemap
+ * @num_pages: Number of entries in the addrs[] array
+ * @buf_size: Exact buffer size in bytes (last page may be partial)
+ * @addrs: Physical address of each PAGE_SIZE page backing the buffer
+ *
+ * The PCI crash buffer may be allocated via vmalloc (for buffers
+ * exceeding ~4 MB where the buddy allocator cannot provide contiguous
+ * pages). virt_to_phys() returns garbage for vmalloc addresses, so
+ * we maintain this small kmalloc'd directory that maps the buffer's
+ * virtual pages to their actual physical addresses.
+ *
+ * At panic time, crash_core.c exports the pagemap's physical address
+ * via VMCOREINFO. The parser reads the pagemap, then reads each
+ * physical page from the vmcore to reconstruct the full buffer.
+ *
+ * The pagemap itself is always kmalloc'd (direct-mapped), so
+ * virt_to_phys() works correctly on it.
+ */
+struct pci_crash_pagemap {
+ __le32 magic;
+ __le32 num_pages;
+ __le64 buf_size;
+ __le64 addrs[];
+} __packed;
+
+#define PCI_CRASH_PAGEMAP_MAGIC 0x5043504d /* "PCPM" in ASCII */
+
+#ifdef CONFIG_PCI_CRASH
+void pci_crash_save(void);
+extern void *pci_crash_buffer;
+extern size_t pci_crash_buffer_size;
+extern phys_addr_t pci_crash_pagemap_phys;
+#else
+static inline void pci_crash_save(void) {}
+#define pci_crash_buffer ((void *)NULL)
+#define pci_crash_buffer_size ((size_t)0)
+#define pci_crash_pagemap_phys ((phys_addr_t)0)
+#endif
+
+#endif /* _LINUX_PCI_CRASH_H */
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 15632358bcf7..6eb49a934e50 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -179,4 +179,21 @@ config CRASH_MAX_MEMORY_RANGES
the computation behind the value provided through the
/sys/kernel/crash_elfcorehdr_size attribute.
+
+config PCI_CRASH
+ bool "Capture PCI config space at panic time"
+ depends on VMCORE_INFO && PCI && PCIEAER && (ARM64 || X86)
+ default y
+ help
+ Capture PCI configuration space (including AER extended capability
+ registers) for all PCI devices at panic time. The data is stored
+ in a buffer whose pages are recorded in VMCOREINFO for off-site
+ crash analysis.
+
+ This is useful for diagnosing PCI errors that caused or contributed
+ to the crash, especially when AER registers are volatile and cleared
+ by device reset during kexec.
+
+ If unsure, say Y.
+
endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..6629755bc5a3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
obj-$(CONFIG_CRASH_DUMP) += crash_core.o
obj-$(CONFIG_CRASH_DM_CRYPT) += crash_dump_dm_crypt.o
obj-$(CONFIG_CRASH_DUMP_KUNIT_TEST) += crash_core_test.o
+obj-$(CONFIG_PCI_CRASH) += pci_crash.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
diff --git a/kernel/pci_crash.c b/kernel/pci_crash.c
new file mode 100644
index 000000000000..8bbc9008fdc8
--- /dev/null
+++ b/kernel/pci_crash.c
@@ -0,0 +1,755 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCI Crash Buffer - Capture PCI config space for crash analysis
+ *
+ * Copyright (c) 2026 Amazon.com, Inc. or its affiliates.
+ *
+ * Captures PCI configuration space at crash time so AER error
+ * registers reflect the crash-time state for off-site analysis.
+ *
+ * Design:
+ * - Init (late_initcall): enumerate devices, allocate buffer.
+ * - Hotplug: bus notifier queues deferred rebuild of device list
+ * and buffer via workqueue -- no PCI reads.
+ * - Crash: crash_save_vmcoreinfo() calls pci_crash_save() which
+ * reads config space into buffer, flushes dcache to RAM so
+ * data survives kexec into crash kernel.
+ *
+ * Records are variable-length: each device's record is exactly
+ * 8 + pdev->cfg_size bytes (264 for legacy PCI, 4104 for PCIe).
+ * The parser walks records sequentially using per-record config_size.
+ *
+ * Buffer pages may be physically scattered (kvmalloc falls back to
+ * vmalloc for buffers exceeding ~4 MB). A small kmalloc'd pagemap
+ * records each page's physical address so the crash parser can
+ * reconstruct the buffer without page-table walking.
+ *
+ * pci_read_config_dword() is direct MMIO (no locks) -- safe in crash.
+ * for_each_pci_dev() takes driver-core locks -- only used at init/hotplug.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/crash_dump.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/moduleparam.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_crash.h>
+#include <linux/slab.h>
+#include <linux/timekeeping.h>
+#include <linux/workqueue.h>
+
+#include <linux/cacheflush.h>
+#include <linux/unaligned.h>
+
+/**
+ * pci_crash_flush_dcache() - Flush a memory region from CPU cache to RAM
+ * @addr: virtual address of region to flush
+ * @size: size in bytes
+ *
+ * Used at crash time to ensure the crash kernel sees our buffer/pagemap
+ * writes after kexec.
+ */
+static inline void pci_crash_flush_dcache(void *addr, size_t size)
+{
+#ifdef CONFIG_ARM64
+ unsigned long start = (unsigned long)addr;
+ unsigned long end = start + size;
+
+ dcache_clean_inval_poc(start, end);
+#elif defined(CONFIG_X86)
+ clflush_cache_range(addr, size);
+#endif
+}
+
+/**
+ * pci_crash_buffer - Virtual address of PCI crash capture buffer
+ *
+ * Contains PCI configuration space data captured at panic time.
+ * Written by pci_crash_save(), read by crash kernel via VMCOREINFO.
+ * May be vmalloc'd -- use pci_crash_pagemap for physical addresses.
+ */
+void *pci_crash_buffer;
+EXPORT_SYMBOL_GPL(pci_crash_buffer);
+
+/**
+ * pci_crash_buffer_size - Size of pci_crash_buffer in bytes
+ */
+size_t pci_crash_buffer_size;
+EXPORT_SYMBOL_GPL(pci_crash_buffer_size);
+
+/**
+ * pci_crash_pagemap_phys - Physical address of page directory
+ *
+ * Points to a struct pci_crash_pagemap (always kmalloc'd, direct-mapped).
+ * Exported via VMCOREINFO so the crash kernel can locate buffer pages
+ * without walking page tables.
+ */
+phys_addr_t pci_crash_pagemap_phys;
+EXPORT_SYMBOL_GPL(pci_crash_pagemap_phys);
+
+static struct pci_crash_pagemap *pci_crash_pagemap;
+static size_t pci_crash_pagemap_size;
+static struct pci_dev **pci_crash_devs;
+static unsigned int pci_crash_num_devs;
+static DEFINE_MUTEX(pci_crash_lock);
+
+/*
+ * Set in pci_crash_init() after delayed_work, PCI bus and notifier are
+ * ready. Guards parse + rebuild in param setters: at boot (level -1)
+ * the setter just stores the string; pci_crash_init() parses and does
+ * the initial rebuild once PCI is up.
+ */
+static bool pci_crash_ready;
+
+/*
+ * capture -- when to capture PCI config space.
+ * Comma-separated tokens:
+ * aer -- root port ROOT_STATUS has uncorrectable errors (default)
+ * always -- every panic regardless of PCI error state
+ *
+ * Writable at runtime (0644) so operators and tests can toggle without
+ * reboot. Writes re-parse capture_flags immediately.
+ */
+static char capture[32] = "aer";
+
+#define PCI_CRASH_CAPTURE_AER BIT(0)
+#define PCI_CRASH_CAPTURE_ALWAYS BIT(1)
+static unsigned long capture_flags = PCI_CRASH_CAPTURE_AER;
+
+static void pci_crash_parse_capture(void);
+
+static int capture_param_set(const char *val, const struct kernel_param *kp)
+{
+ if (strlen(val) >= sizeof(capture))
+ return -EINVAL;
+ strscpy(capture, val, sizeof(capture));
+ strim(capture);
+ if (READ_ONCE(pci_crash_ready))
+ pci_crash_parse_capture();
+ return 0;
+}
+
+static int capture_param_get(char *buf, const struct kernel_param *kp)
+{
+ return scnprintf(buf, PAGE_SIZE, "%s\n", capture);
+}
+
+static const struct kernel_param_ops capture_param_ops = {
+ .set = capture_param_set,
+ .get = capture_param_get,
+};
+module_param_cb(capture, &capture_param_ops, NULL, 0644);
+MODULE_PARM_DESC(capture, "When to capture: aer, always (default: aer)");
+
+/*
+ * devices -- which devices to capture.
+ * Comma-separated tokens:
+ * all -- every PCI device (default)
+ * bridges -- PCI bridges (class 0604, 0607)
+ * root_ports -- PCIe root ports only
+ * XXYY -- hex PCI class code (class + subclass)
+ *
+ * Bridges are always implicitly included regardless of filter value
+ * because they hold the AER registers needed for root cause analysis.
+ * Applies at rebuild time only -- zero cost at crash time. Writable
+ * at runtime (0644); writes re-parse and trigger async rebuild.
+ */
+static char devices[256] = "all";
+
+static void pci_crash_parse_devices(void);
+static struct delayed_work pci_crash_rebuild_dwork;
+
+/*
+ * Debounce period for bus notifications (ms).
+ * TRN2 liveupdate enumerates ~3000 VFs in ~1.5s -- this coalesces
+ * the storm into a single rebuild after the last event.
+ */
+#define PCI_CRASH_REBUILD_DELAY_MS 200
+
+static int devices_param_set(const char *val, const struct kernel_param *kp)
+{
+ if (strlen(val) >= sizeof(devices))
+ return -EINVAL;
+
+ mutex_lock(&pci_crash_lock);
+ strscpy(devices, val, sizeof(devices));
+ strim(devices);
+ if (READ_ONCE(pci_crash_ready)) {
+ pci_crash_parse_devices();
+ mod_delayed_work(system_wq, &pci_crash_rebuild_dwork,
+ msecs_to_jiffies(PCI_CRASH_REBUILD_DELAY_MS));
+ }
+ mutex_unlock(&pci_crash_lock);
+ return 0;
+}
+
+static int devices_param_get(char *buf, const struct kernel_param *kp)
+{
+ return scnprintf(buf, PAGE_SIZE, "%s\n", devices);
+}
+
+static const struct kernel_param_ops devices_param_ops = {
+ .set = devices_param_set,
+ .get = devices_param_get,
+};
+module_param_cb(devices, &devices_param_ops, NULL, 0644);
+MODULE_PARM_DESC(devices,
+ "Which devices: all, bridges, root_ports, XXYY hex class (default: all)");
+
+#define PCI_CRASH_DEVICES_ALL BIT(0)
+#define PCI_CRASH_DEVICES_BRIDGES BIT(1)
+#define PCI_CRASH_DEVICES_ROOT_PORTS BIT(2)
+#define PCI_CRASH_MAX_DEVICE_CLASSES 8
+static unsigned long devices_flags = PCI_CRASH_DEVICES_ALL;
+static u16 device_classes[PCI_CRASH_MAX_DEVICE_CLASSES];
+static unsigned int device_class_count;
+
+static void pci_crash_parse_capture(void)
+{
+ char *buf, *token, *rest;
+ unsigned long flags = 0;
+
+ if (!*capture) {
+ WRITE_ONCE(capture_flags, PCI_CRASH_CAPTURE_AER);
+ return;
+ }
+
+ buf = kstrdup(capture, GFP_KERNEL);
+ if (!buf) {
+ WRITE_ONCE(capture_flags, PCI_CRASH_CAPTURE_AER);
+ return;
+ }
+
+ rest = buf;
+ while ((token = strsep(&rest, ",")) != NULL) {
+ if (strcmp(token, "aer") == 0)
+ flags |= PCI_CRASH_CAPTURE_AER;
+ else if (strcmp(token, "always") == 0)
+ flags |= PCI_CRASH_CAPTURE_ALWAYS;
+ else
+ pr_warn("unknown capture token: %s\n",
+ token);
+ }
+ kfree(buf);
+
+ if (!flags) {
+ pr_warn("no valid capture tokens, defaulting to aer\n");
+ flags = PCI_CRASH_CAPTURE_AER;
+ }
+ WRITE_ONCE(capture_flags, flags);
+}
+
+static void pci_crash_parse_devices(void)
+{
+ char *buf, *token, *rest;
+ unsigned long val;
+
+ devices_flags = 0;
+ device_class_count = 0;
+
+ if (!*devices) {
+ devices_flags = PCI_CRASH_DEVICES_ALL;
+ return;
+ }
+
+ buf = kstrdup(devices, GFP_KERNEL);
+ if (!buf) {
+ devices_flags = PCI_CRASH_DEVICES_ALL;
+ return;
+ }
+
+ rest = buf;
+ while ((token = strsep(&rest, ",")) != NULL) {
+ if (strcmp(token, "all") == 0) {
+ devices_flags |= PCI_CRASH_DEVICES_ALL;
+ } else if (strcmp(token, "bridges") == 0) {
+ devices_flags |= PCI_CRASH_DEVICES_BRIDGES;
+ } else if (strcmp(token, "root_ports") == 0) {
+ devices_flags |= PCI_CRASH_DEVICES_ROOT_PORTS;
+ } else if (kstrtoul(token, 16, &val) == 0 && val <= 0xFFFF) {
+ if (device_class_count < PCI_CRASH_MAX_DEVICE_CLASSES)
+ device_classes[device_class_count++] = (u16)val;
+ else
+ pr_warn("too many device classes (max %d)\n",
+ PCI_CRASH_MAX_DEVICE_CLASSES);
+ } else {
+ pr_warn("unknown devices token: %s\n",
+ token);
+ }
+ }
+ kfree(buf);
+
+ if (!devices_flags && device_class_count == 0) {
+ pr_warn("no valid devices tokens, defaulting to all\n");
+ devices_flags = PCI_CRASH_DEVICES_ALL;
+ }
+}
+
+static bool pci_crash_device_matches(struct pci_dev *pdev)
+{
+ unsigned int i;
+ u16 dev_class = pdev->class >> 8;
+
+ if (devices_flags & PCI_CRASH_DEVICES_ALL)
+ return true;
+
+ /* Bridges always included -- they hold AER registers */
+ if (dev_class == PCI_CLASS_BRIDGE_PCI ||
+ dev_class == PCI_CLASS_BRIDGE_CARDBUS)
+ return true;
+
+ if ((devices_flags & PCI_CRASH_DEVICES_ROOT_PORTS) &&
+ pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT)
+ return true;
+
+ for (i = 0; i < device_class_count; i++) {
+ if (dev_class == device_classes[i])
+ return true;
+ }
+
+ return false;
+}
+
+/* Sanity limit -- prevents multi-GB allocations on systems with many VFs */
+#define PCI_CRASH_MAX_BUFFER_SIZE (24 * 1024 * 1024)
+
+/**
+ * pci_crash_build_pagemap() - Build physical page directory for buffer
+ * @buf: buffer allocated via kvmalloc (may be vmalloc'd)
+ * @buf_size: buffer size in bytes
+ *
+ * Allocates a kmalloc'd directory containing the physical address of
+ * each page backing @buf. The pagemap is always direct-mapped, so
+ * virt_to_phys() works on it at crash time.
+ *
+ * Returns the new pagemap, or NULL on allocation failure.
+ */
+static struct pci_crash_pagemap *pci_crash_build_pagemap(void *buf,
+ size_t buf_size)
+{
+ unsigned int num_pages = DIV_ROUND_UP(buf_size, PAGE_SIZE);
+ struct pci_crash_pagemap *pm;
+ unsigned int i;
+
+ pm = kmalloc(struct_size(pm, addrs, num_pages), GFP_KERNEL);
+ if (!pm)
+ return NULL;
+
+ pm->magic = cpu_to_le32(PCI_CRASH_PAGEMAP_MAGIC);
+ pm->num_pages = cpu_to_le32(num_pages);
+ pm->buf_size = cpu_to_le64(buf_size);
+
+ for (i = 0; i < num_pages; i++) {
+ struct page *page;
+ phys_addr_t pa;
+
+ if (is_vmalloc_addr(buf + i * PAGE_SIZE))
+ page = vmalloc_to_page(buf + i * PAGE_SIZE);
+ else
+ page = virt_to_page(buf + i * PAGE_SIZE);
+
+ if (!page) {
+ kfree(pm);
+ return NULL;
+ }
+ pa = page_to_phys(page);
+ pm->addrs[i] = cpu_to_le64(pa);
+ }
+
+ return pm;
+}
+
+/**
+ * pci_crash_read_config_space() - Read config space for one device
+ * @pdev: PCI device to read
+ * @ptr: destination pointer within the crash buffer
+ *
+ * Reads pdev->cfg_size bytes of config space (256 for legacy PCI,
+ * 4096 for PCIe). Each 4-byte dword is read individually via MMIO.
+ * Failed reads write 0xFFFFFFFF -- standard PCI convention for
+ * absent/unreachable devices.
+ */
+static void pci_crash_read_config_space(struct pci_dev *pdev, u8 *ptr)
+{
+ struct pci_crash_device_record *record =
+ (struct pci_crash_device_record *)ptr;
+ u8 *cfg_data = ptr + PCI_CRASH_RECORD_META;
+ int offset;
+ u32 val;
+
+ record->domain = cpu_to_le16(pci_domain_nr(pdev->bus));
+ record->bus = pdev->bus->number;
+ record->devfn = pdev->devfn;
+ record->config_size = cpu_to_le32(pdev->cfg_size);
+
+ for (offset = 0; offset < pdev->cfg_size; offset += 4) {
+ if (pci_read_config_dword(pdev, offset, &val)) {
+ put_unaligned_le32(0xFFFFFFFF, &cfg_data[offset]);
+ continue;
+ }
+ put_unaligned_le32(val, &cfg_data[offset]);
+ }
+}
+
+/**
+ * pci_crash_fill_buffer() - Populate buffer with config space
+ * @buffer: destination buffer (header + variable-length records)
+ * @num_devs: number of devices to read
+ *
+ * Records are variable-length: each is PCI_CRASH_RECORD_META +
+ * pdev->cfg_size bytes. The header's config_size is 0 to indicate
+ * variable-length; parsers walk records using per-record config_size.
+ *
+ * Uses ktime_get_real_fast_ns() for the timestamp -- safe in NMI/panic
+ * context (lockless, reads the NMI-safe timekeeper snapshot).
+ */
+static void pci_crash_fill_buffer(void *buffer, unsigned int num_devs)
+{
+ struct pci_crash_buffer_header *header = buffer;
+ struct pci_dev **devs = READ_ONCE(pci_crash_devs);
+ u8 *ptr;
+ unsigned int i;
+
+ header->magic = cpu_to_le32(PCI_CRASH_MAGIC);
+ header->version = cpu_to_le32(PCI_CRASH_VERSION);
+ header->device_count = cpu_to_le32(num_devs);
+ header->config_size = 0;
+ header->timestamp = cpu_to_le64(ktime_get_real_fast_ns());
+ header->flags = 0;
+ header->reserved = 0;
+
+ ptr = (u8 *)buffer + PCI_CRASH_HEADER_SIZE;
+ for (i = 0; i < num_devs; i++) {
+ struct pci_dev *pdev = devs[i];
+
+ if (unlikely(!pdev))
+ break;
+ pci_crash_read_config_space(pdev, ptr);
+ ptr += PCI_CRASH_RECORD_META + pdev->cfg_size;
+ }
+
+ header->device_count = cpu_to_le32(i);
+}
+
+/**
+ * pci_crash_rebuild_snapshot() - Rebuild device list and allocate buffer
+ *
+ * Two-pass approach:
+ * Pass 1: count PCI devices
+ * Pass 2: populate device array (filtered) and compute exact buffer
+ * size from actual pdev->cfg_size per device (no padding)
+ *
+ * The devices param controls which devices are included. Bridges are
+ * always included regardless of devices setting (they hold AER registers).
+ * devices=all (default) includes everything.
+ *
+ * Does NOT read PCI config space -- reads happen only at crash time.
+ * This keeps rebuild fast during VF enumeration storms (~6000 ADD
+ * events on TRN2 during liveupdate).
+ *
+ * After allocation, builds the pagemap so the crash parser can
+ * locate the buffer's physical pages in the vmcore.
+ *
+ * Caller must hold pci_crash_lock.
+ */
+static void pci_crash_rebuild_snapshot(void)
+{
+ struct pci_dev *pdev = NULL;
+ unsigned int count = 0, i;
+ void *old_buf;
+ unsigned int old_num_devs = pci_crash_num_devs;
+ struct pci_dev **old_devs = pci_crash_devs;
+ struct pci_crash_pagemap *old_pm = pci_crash_pagemap;
+ struct pci_crash_pagemap *new_pm;
+ struct pci_dev **new_devs;
+ void *new_buf;
+ size_t total_size;
+
+ /* Pass 1: count devices */
+ for_each_pci_dev(pdev)
+ count++;
+
+ if (count == 0) {
+ pr_info("no PCI devices found\n");
+ WRITE_ONCE(pci_crash_num_devs, 0);
+ if (old_devs) {
+ for (i = 0; i < old_num_devs; i++)
+ pci_dev_put(old_devs[i]);
+ kvfree(old_devs);
+ WRITE_ONCE(pci_crash_devs, NULL);
+ }
+ kfree(old_pm);
+ WRITE_ONCE(pci_crash_pagemap, NULL);
+ WRITE_ONCE(pci_crash_pagemap_phys, 0);
+ WRITE_ONCE(pci_crash_pagemap_size, 0);
+ old_buf = pci_crash_buffer;
+ WRITE_ONCE(pci_crash_buffer, NULL);
+ WRITE_ONCE(pci_crash_buffer_size, 0);
+ kvfree(old_buf);
+ return;
+ }
+
+ /*
+ * Disable capture during rebuild to prevent pci_crash_save()
+ * from accessing stale device references or partially populated
+ * arrays.
+ */
+ WRITE_ONCE(pci_crash_num_devs, 0);
+
+ new_devs = kvmalloc_array(count, sizeof(struct pci_dev *),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!new_devs) {
+ pr_err("dev array alloc failed (%u)\n", count);
+ goto err_restore;
+ }
+
+ /*
+ * Pass 2: populate device array (filtered) and compute exact
+ * buffer size from actual pdev->cfg_size per device.
+ * count from pass 1 is an upper bound; actual may be smaller.
+ */
+ total_size = PCI_CRASH_HEADER_SIZE;
+ pdev = NULL;
+ i = 0;
+ for_each_pci_dev(pdev) {
+ if (i >= count) {
+ pci_dev_put(pdev);
+ break;
+ }
+ if (!pci_crash_device_matches(pdev))
+ continue;
+ new_devs[i] = pci_dev_get(pdev);
+ total_size += PCI_CRASH_RECORD_META + pdev->cfg_size;
+ i++;
+ }
+ count = i;
+
+ if (count == 0) {
+ kvfree(new_devs);
+ pr_info("no devices match devices=%s\n", devices);
+ goto err_restore;
+ }
+
+ if (total_size > PCI_CRASH_MAX_BUFFER_SIZE) {
+ for (i = 0; i < count; i++)
+ pci_dev_put(new_devs[i]);
+ kvfree(new_devs);
+ pr_warn("buffer too large (%zu > %d bytes)\n",
+ total_size, PCI_CRASH_MAX_BUFFER_SIZE);
+ goto err_restore;
+ }
+
+ new_buf = kvmalloc(total_size, GFP_KERNEL | __GFP_ZERO);
+ if (!new_buf) {
+ for (i = 0; i < count; i++)
+ pci_dev_put(new_devs[i]);
+ kvfree(new_devs);
+ pr_err("buffer alloc failed (%zu bytes)\n",
+ total_size);
+ goto err_restore;
+ }
+
+ new_pm = pci_crash_build_pagemap(new_buf, total_size);
+ if (!new_pm) {
+ kvfree(new_buf);
+ for (i = 0; i < count; i++)
+ pci_dev_put(new_devs[i]);
+ kvfree(new_devs);
+ pr_err("pagemap alloc failed\n");
+ goto err_restore;
+ }
+
+ /* Release old device references */
+ for (i = 0; i < old_num_devs; i++)
+ if (old_devs && old_devs[i])
+ pci_dev_put(old_devs[i]);
+
+ WRITE_ONCE(pci_crash_devs, new_devs);
+ kvfree(old_devs);
+
+ /*
+ * Swap pagemap first (with pre-computed phys addr), then buffer,
+ * then size and count. If crash fires mid-swap, num_devs is
+ * still 0 from above, so pci_crash_save() bails out safely.
+ */
+ WRITE_ONCE(pci_crash_pagemap, new_pm);
+ WRITE_ONCE(pci_crash_pagemap_phys, virt_to_phys(new_pm));
+ WRITE_ONCE(pci_crash_pagemap_size, struct_size(new_pm, addrs,
+ le32_to_cpu(new_pm->num_pages)));
+ kfree(old_pm);
+
+ old_buf = pci_crash_buffer;
+ WRITE_ONCE(pci_crash_buffer, new_buf);
+ WRITE_ONCE(pci_crash_buffer_size, total_size);
+ /* Ensure buffer/pagemap/size are visible before num_devs enables capture */
+ smp_wmb();
+ WRITE_ONCE(pci_crash_num_devs, count);
+ kvfree(old_buf);
+
+ pr_info("rebuild: %u devices (%zu bytes, %u pages)\n",
+ count, total_size, le32_to_cpu(new_pm->num_pages));
+ return;
+
+err_restore:
+ WRITE_ONCE(pci_crash_num_devs, old_num_devs);
+}
+
+/**
+ * pci_crash_save() - Capture PCI config space at crash time
+ *
+ * Called from crash_save_vmcoreinfo() inside __crash_kexec(), which
+ * runs before machine_kexec() boots the crash kernel. This is the
+ * only reliable capture point -- panic notifiers run AFTER kexec by
+ * default (crash_kexec_post_notifiers=0).
+ *
+ * Capture check (capture param):
+ * always -- capture unconditionally
+ * aer -- quick-scan root port AER ROOT_STATUS for uncorrectable
+ * errors; skip if none found
+ *
+ * When capture=always, captures on every panic.
+ * This is useful for cascading failures: a PCI link-down can cause
+ * an MCE or NMI watchdog timeout before DPC/AER fires, so the crash
+ * reason is UNKNOWN but AER registers may still hold error state.
+ *
+ * Reads config space fresh -- successful reads get current register
+ * state, failed reads (offline devices) write 0xFFFFFFFF.
+ *
+ * Flushes both buffer and pagemap from CPU cache to RAM so data
+ * survives kexec into crash kernel.
+ */
+void pci_crash_save(void)
+{
+ struct pci_crash_pagemap *pm;
+ unsigned long cflags;
+ unsigned int num_devs;
+ size_t pm_size;
+ size_t buf_size;
+ void *buffer;
+
+ num_devs = READ_ONCE(pci_crash_num_devs);
+ if (num_devs == 0)
+ return;
+
+ /* Pairs with smp_wmb() in rebuild -- ensures buffer/pagemap visible */
+ smp_rmb();
+
+ buffer = READ_ONCE(pci_crash_buffer);
+ buf_size = READ_ONCE(pci_crash_buffer_size);
+ pm = READ_ONCE(pci_crash_pagemap);
+ pm_size = READ_ONCE(pci_crash_pagemap_size);
+ if (!buffer || buf_size == 0)
+ return;
+
+ cflags = READ_ONCE(capture_flags);
+ if (!(cflags & PCI_CRASH_CAPTURE_ALWAYS)) {
+ if (cflags & PCI_CRASH_CAPTURE_AER) {
+ struct pci_dev **devs = READ_ONCE(pci_crash_devs);
+ unsigned int i;
+ bool pci_error_found = false;
+
+ if (!devs)
+ return;
+
+ for (i = 0; i < num_devs; i++) {
+ struct pci_dev *pdev = devs[i];
+ u32 status;
+
+ if (!pdev || !pdev->aer_cap)
+ continue;
+
+ if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) {
+ pci_read_config_dword(pdev,
+ pdev->aer_cap + PCI_ERR_ROOT_STATUS,
+ &status);
+ if (status & PCI_ERR_ROOT_UNCOR_RCV) {
+ pci_error_found = true;
+ break;
+ }
+ }
+ }
+
+ if (!pci_error_found) {
+ pr_emerg("no PCI errors detected, skipping capture\n");
+ return;
+ }
+ } else {
+ return;
+ }
+ }
+
+ pci_crash_fill_buffer(buffer, num_devs);
+
+ /*
+ * Flush buffer and pagemap from CPU cache to RAM so the
+ * crash kernel sees our writes after kexec.
+ */
+ pci_crash_flush_dcache(buffer, buf_size);
+
+ if (pm && pm_size > 0)
+ pci_crash_flush_dcache(pm, pm_size);
+
+ pr_emerg("CAPTURE: %u devices, %zu bytes\n",
+ num_devs, buf_size);
+}
+EXPORT_SYMBOL_GPL(pci_crash_save);
+
+static void pci_crash_rebuild_worker(struct work_struct *work)
+{
+ mutex_lock(&pci_crash_lock);
+ pci_crash_rebuild_snapshot();
+ mutex_unlock(&pci_crash_lock);
+}
+
+static int pci_crash_bus_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ if (action == BUS_NOTIFY_ADD_DEVICE ||
+ action == BUS_NOTIFY_DEL_DEVICE)
+ mod_delayed_work(system_wq, &pci_crash_rebuild_dwork,
+ msecs_to_jiffies(PCI_CRASH_REBUILD_DELAY_MS));
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block pci_crash_bus_nb = {
+ .notifier_call = pci_crash_bus_notifier,
+};
+
+static int __init pci_crash_init(void)
+{
+ /*
+ * Nothing to do in the crash kernel -- the buffer from the first
+ * kernel is already in RAM (flushed before kexec) and the parser
+ * finds it via the pagemap in VMCOREINFO.
+ */
+ if (is_kdump_kernel())
+ return 0;
+
+ INIT_DELAYED_WORK(&pci_crash_rebuild_dwork, pci_crash_rebuild_worker);
+
+ pci_crash_parse_capture();
+ pci_crash_parse_devices();
+
+ mutex_lock(&pci_crash_lock);
+ pci_crash_rebuild_snapshot();
+ mutex_unlock(&pci_crash_lock);
+
+ bus_register_notifier(&pci_bus_type, &pci_crash_bus_nb);
+
+ WRITE_ONCE(pci_crash_ready, true);
+
+ pr_info("ready: %u devices (%zu bytes), capture=%s devices=%s\n",
+ pci_crash_num_devs, pci_crash_buffer_size,
+ capture, devices);
+
+ return 0;
+}
+late_initcall(pci_crash_init);
+
+/* Built-in only: crash infrastructure must outlive all drivers. */
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Capture PCI config space at panic time for crash analysis");
+MODULE_AUTHOR("Amazon.com, Inc.");
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index 8614430ca212..8813f7e2e516 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -14,6 +14,7 @@
#include <linux/cpuhotplug.h>
#include <linux/memblock.h>
#include <linux/kmemleak.h>
+#include <linux/pci_crash.h>
#include <asm/page.h>
#include <asm/sections.h>
@@ -91,6 +92,18 @@ void crash_save_vmcoreinfo(void)
vmcoreinfo_data = vmcoreinfo_data_safecopy;
vmcoreinfo_append_str("CRASHTIME=%lld\n", ktime_get_real_seconds());
+
+ /* Capture PCI config space before kexec into crash kernel */
+ pci_crash_save();
+ if (pci_crash_pagemap_phys && pci_crash_buffer_size > 0) {
+ vmcoreinfo_append_str("PCI_CRASH_PAGEMAP=0x%llx\n",
+ (unsigned long long)pci_crash_pagemap_phys);
+ vmcoreinfo_append_str("PCI_CRASH_VERSION=%d\n",
+ PCI_CRASH_VERSION);
+ vmcoreinfo_append_str("PCI_CRASH_BUF_SZ=%zu\n",
+ pci_crash_buffer_size);
+ }
+
update_vmcoreinfo_note();
}
--
2.47.3
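(Not part of the patch -- a sketch for reviewers.) A crash-analysis tool would first pull the `PCI_CRASH_*` keys out of the VMCOREINFO note exported above, then follow `PCI_CRASH_PAGEMAP` into the dump. Roughly, in Python, against a made-up sample note (all values below are illustrative, not from a real vmcore):

```python
import re

def parse_pci_crash_keys(vmcoreinfo_text):
    """Collect the PCI_CRASH_* keys appended by crash_save_vmcoreinfo()."""
    keys = {}
    for line in vmcoreinfo_text.splitlines():
        m = re.match(r"(PCI_CRASH_\w+)=(\S+)", line)
        if m:
            keys[m.group(1)] = m.group(2)
    return keys

# Hypothetical VMCOREINFO excerpt -- values are illustrative only.
sample = (
    "CRASHTIME=1234567890\n"
    "PCI_CRASH_PAGEMAP=0x1f2e3000\n"
    "PCI_CRASH_VERSION=1\n"
    "PCI_CRASH_BUF_SZ=126976\n"
)

info = parse_pci_crash_keys(sample)
pagemap_phys = int(info["PCI_CRASH_PAGEMAP"], 16)  # phys addr of page directory
buf_sz = int(info["PCI_CRASH_BUF_SZ"])             # total capture size in bytes
```

Absence of the keys means the capture was skipped (e.g. capture=aer found no uncorrectable errors), which a tool should treat as a normal case, not a parse error.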
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597