* [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC)
@ 2025-10-03 9:00 Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 01/13] pci: pcsc: Add plumbing for the " Evangelos Petrongonas
` (12 more replies)
0 siblings, 13 replies; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Modern virtualization environments, particularly those leveraging SR-IOV
with hundreds or thousands of Virtual Functions, expose a significant
performance bottleneck in PCI configuration space accesses. During VM
initialization, the repeated enumeration and configuration of assigned
VFs creates substantial delays that scale linearly with deployment
density. Each configuration space read triggers a hardware transaction,
leading to bus contention and measurable impact on system startup times.
The fundamental issue stems from the PCI subsystem's approach of
treating every configuration space access as a direct hardware
operation, even when reading registers that contain static,
rarely-changing values. Capability registers, device/vendor IDs, and
many control registers remain constant throughout a device's lifetime,
yet are repeatedly accessed during driver initialization, device
enumeration, and system management operations.
This patch series introduces the PCI Configuration Space Cache (PCSC), a
transparent caching layer that intercepts configuration space operations
and maintains cached copies of register values. The implementation
philosophy centers on complete transparency; existing drivers,
applications, and system management tools continue functioning unchanged
while benefiting from reduced hardware access overhead. Currently, only
endpoint devices are cached. Bridges and Root Complexes utilize a
passthrough approach.
The architecture employs a write-invalidate caching policy. By
invalidating cache entries on writes and allowing subsequent reads to
repopulate the cache with fresh values, the system accommodates complex
initialization sequences like BAR sizing. This design choice
maximizes the number of cacheable registers while maintaining strict
coherency guarantees, ensuring that any software-visible state changes
are immediately reflected.
The implementation attaches per-device cache nodes to `struct pci_dev`,
each containing a cached copy of the configuration space alongside
bitmasks tracking which registers are cacheable and currently valid. The
system dynamically injects custom PCI operations into the bus hierarchy
at multiple integration points - during host bridge registration, child
bus allocation, and dynamic operation changes.
The cacheability of the configuration space registers is deduced by
dynamically traversing device's capability chains during initialization.
The implementation analyses both PCI capabilities and PCIe extended
capabilities, identifying safe-to-cache registers based on the PCI and
PCIe specifications. Capabilities like Power Management, MSI/MSI-X, EA,
VPD, AF and vendor-specific regions are parsed to determine which
fields represent static configuration versus dynamic status information.
The extended capability support covers AER, ACS, ARI, SR-IOV, PRI, DPC,
PASID and PTM.
Device reset handling ensures cache coherency across all reset scenarios
where configuration space values may change. The implementation hooks
into Function Level Resets, Advanced Features FLR, power management
resets (D3hot->D0 transitions), device-specific resets, D3cold power
state transitions, ACPI-based resets, and both bus and slot restore
operations. Secondary bus resets receive special handling - the cache is
recursively invalidated for all devices on the secondary bus and its
subordinate buses, ensuring consistency across the entire hierarchy.
Additionally, the patch addresses cache consistency when bus operations
are dynamically changed via `pci_bus_set_ops()`, as different ops
implementations may return different values for the same registers.
The invalidation mechanism clears the `cached_bitmask` while preserving
the cacheable_bitmask, allowing the cache to repopulate with fresh
values on subsequent accesses.
Paths that do need to access the real Config Space like,
`pci_dev_wait()` are configured to bypass the cache entirely to read
hardware state directly.
Beyond basic caching, the series implements persistence across kexec
operations using the Kernel HandOver (KHO) subsystem. This feature
allows cached PCI configuration data to survive kexec eliminating
redundant configuration space probing in the new kernel. During kexec
preparation, the implementation creates Flattened Device Tree structures
containing device information and physical addresses of preserved cache
data. The new kernel discovers and restores this data during PCI
initialization, achieving up to 50% vm start time improvements on systems
with numerous PCI devices.
The persistence mechanism includes versioning support to handle
evolving cacheability rules. When the incoming kernel detects a version
mismatch with saved data, it re-infers cacheability while preserving
cached values, ensuring compatibility across kernel updates. A
hashtable-based lookup optimization reduces restoration complexity from
O(n^2) to O(n) by building an index during initialization rather than
searching the FDT for each device. In the next iteration of this
patchset, the Live Update Orchestrator (LUO) subsystem will be used
instead.
Performance characteristics demonstrate significant improvements, with
cache hit rates reaching 49% in typical virtualization scenarios without
persistence, and up to 81% when combined with kexec persistence. These
metrics translate to substantial reductions in configuration space
access latency, particularly during bulk VM operations where thousands
of configuration space accesses would otherwise create significant bus
traffic spikes.
Testing has covered diverse hardware configurations including
high-density SR-IOV deployments. The implementation demonstrates
particular effectiveness in cloud environments where rapid VM deployment
and high device density create substantial configuration space access
pressure.
The entire implementation strives to maintain compatibility with the
existing PCI subsystem behaviour. No driver modifications should be
required, and the caching layer can be completely disabled without
functional impact.
Current Limitations
-------------------
- map_bus is not properly handled. This does not cause any issues in
the current upstream linux, as the only users are either bridges that are
not currently being cached, or the
`pci_generic_config_{read,write}{,32}` which is already handled.
- In PowerPC secondary bus resets the architecture-specific
`pcibios_reset_secondary_bus()` can bypass the generic
`pci_reset_secondary_bus()` where our cache invalidation occurs.
- The current Implementation is based on KHO. In the next RFC version
it will be changed to use the Live Update Orchestrator (LUO)
https://lore.kernel.org/lkml/20250807014442.3829950-1-pasha.tatashin@soleen.com/
A branch can be found in:
https://git.infradead.org/?p=users/vpetrog/linux.git;a=shortlog;h=refs/heads/pcsc-rfc-v1
Example Output of `sys/bus/pci/pcsc/stats`
```
Cache Hits: 21063
Cache Misses: 510
Uncachable Reads: 4398
Writes: 1049
Cache Invalidations: 584
Device Resets: 0
Total Reads: 25971
Hardware Reads: 4908
Hit Rate: 81%
Total Cache Access Time: 30952 us
Cache Access Time (without HW reads due to Misses): 16126 us
HW Access Time due to misses: 14826 us
Total Hardware Access Time: 101819 us
KHO Restore Statistics:
Restored Devices: 2819
Total Restore Time: 1362 us
Hashtable Initial Entries: 2819
Hashtable Build Time: 1000 us
```
Evangelos Petrongonas (13):
pci: pcsc: Add plumbing for the PCI Configuration Space Cache (PCSC)
pci: pcsc: implement basic functionality
pci: pcsc: infer cacheability of PCI capabilities
pci: pcsc: infer PCIe extended capabilities
pci: pcsc: control the cache via sysfs and kernel params
pci: pcsc: handle device resets
pci: pcsc: introduce statistic gathering tools
pci: Save only spec-defined configuration space
vfio: pci: Fill only spec-defined configuration space regions
pci: pcsc: Use contiguous pages for the cache data
pci: pcsc: Add kexec persistence support via KHO
pci: pcsc: introduce persistence versioning
pci: pcsc: introduce hashtable lookup to speed up restoration
Documentation/ABI/testing/sysfs-bus-pci-pcsc | 29 +
.../admin-guide/kernel-parameters.txt | 7 +
drivers/pci/Kconfig | 27 +
drivers/pci/Makefile | 1 +
drivers/pci/access.c | 86 +-
drivers/pci/pci-acpi.c | 4 +
drivers/pci/pci-driver.c | 5 +
drivers/pci/pci.c | 121 +-
drivers/pci/pcie/Kconfig | 9 +
drivers/pci/pcsc.c | 1766 +++++++++++++++++
drivers/pci/probe.c | 33 +-
drivers/pci/quirks.c | 7 +-
drivers/vfio/pci/vfio_pci_config.c | 13 +-
include/linux/pci.h | 8 +
include/linux/pcsc.h | 163 ++
15 files changed, 2268 insertions(+), 11 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-pcsc
create mode 100644 drivers/pci/pcsc.c
create mode 100644 include/linux/pcsc.h
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC PATCH 01/13] pci: pcsc: Add plumbing for the PCI Configuration Space Cache (PCSC)
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 12:38 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 02/13] pci: pcsc: implement basic functionality Evangelos Petrongonas
` (11 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Introduce the basic infrastructure for the PCI Configuration Space Cache
(PCSC), a mechanism to cache PCI configuration space accesses to reduce
latency and bus traffic.
The PCSC implements a transparent interception layer for PCI config
space operations by dynamically injecting its own ops into the PCI bus
hierarchy. The design preserves existing PCI ops while allowing PCSC to
intercept and cache accesses:
The` struct pci_bus` is extended to hold the original `pci_ops`, while
the cache ones are injected via `pcsc_inject_bus_ops()`. The cache ops
are injected when new buses are added via registering it to a bus
notifier and integrating it at:
* `pci_register_host_bridge()` - for root buses
* `pci_alloc_child_bus()` - for child buses
* `pci_bus_set_ops()` - when ops are dynamically changed
The implementation includes weak pcsc_hw_config_read/write functions
that handle calling the original op, when access to the actual HW is
required.
This approach ensures complete transparency - existing drivers and
subsystems continue to use standard PCI config access functions while
PCSC can intercept and cache accesses as needed. The weak functions also
allow architecture-specific implementations to override the default
behavior.
The `core` initcall level is chosen so the cache is initialised before
the PCI driver, ensuring that all config space access go through the
cache.
Kconfig options are added for both PCSC and PCIe PCSC support, with the
latter extending the cache to handle 4KB PCIe configuration space.
In this initial patch, the cache simply passes through all accesses to
the hardware via the original ops - actual caching functionality will be
added in subsequent patches.
There is one caveat in this patch. The map_bus operations can
potentially alter the cache, without invalidating / updating the cache.
This is not an issue for the current upstream usages, as it is only
being used in Root complexes and the
`pci_generic_config_{read,write}{,32}`
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
drivers/pci/Kconfig | 10 ++
drivers/pci/Makefile | 1 +
drivers/pci/access.c | 81 ++++++++++++++-
drivers/pci/pcie/Kconfig | 9 ++
drivers/pci/pcsc.c | 208 +++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 24 ++++-
include/linux/pci.h | 3 +
include/linux/pcsc.h | 86 ++++++++++++++++
8 files changed, 419 insertions(+), 3 deletions(-)
create mode 100644 drivers/pci/pcsc.c
create mode 100644 include/linux/pcsc.h
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 9a249c65aedc..c26162b58365 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -40,6 +40,16 @@ config PCI_DOMAINS_GENERIC
config PCI_SYSCALL
bool
+config PCSC
+ bool "PCI Configuration Space Cache"
+ depends on PCI
+ default n
+ help
+ This option enables support for the PCI Configuration Space Cache
+ (PCSC). PCSC is a transparent caching layer that
+ intercepts configuration space operations and maintains cached
+ copies of register values
+
source "drivers/pci/pcie/Kconfig"
config PCI_MSI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 67647f1880fb..012561b97e32 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
obj-$(CONFIG_PCI_NPEM) += npem.o
obj-$(CONFIG_PCIE_TPH) += tph.o
+obj-$(CONFIG_PCSC) += pcsc.o
# Endpoint library must be initialized before its users
obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index b123da16b63b..b89e9210d330 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/pci.h>
+#include <linux/pcsc.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ioport.h>
@@ -189,15 +190,93 @@ EXPORT_SYMBOL_GPL(pci_generic_config_write32);
* @ops: new raw operations
*
* Return previous raw operations
+ *
+ * When PCSC is enabled, this function maintains transparency by:
+ * - Returning the original non-PCSC ops to the caller
+ * - Properly handling the case where PCSC ops are already injected
+ * - Re-injecting PCSC ops after setting new ops when appropriate
*/
struct pci_ops *pci_bus_set_ops(struct pci_bus *bus, struct pci_ops *ops)
{
struct pci_ops *old_ops;
unsigned long flags;
+#ifdef CONFIG_PCSC
+ bool pcsc_was_injected = false;
+ struct pci_ops *pcsc_ops_ptr = NULL;
+#endif
raw_spin_lock_irqsave(&pci_lock, flags);
- old_ops = bus->ops;
+
+#ifdef CONFIG_PCSC
+ /*
+ * Check if PCSC ops are currently injected. If so, we need to:
+ * 1. Return the original (non-PCSC) ops to maintain transparency
+ * 2. Update orig_ops to point to the new ops
+ * 3. Re-inject PCSC ops if the new ops are different from PCSC ops
+ */
+ if (bus->orig_ops) {
+ pcsc_was_injected = true;
+ pcsc_ops_ptr = bus->ops; /* Save current PCSC ops */
+ old_ops = bus->orig_ops; /* Return the real original ops */
+
+ /*
+ * If the caller is trying to restore the PCSC ops themselves,
+ * just keep the current setup and return the original ops
+ */
+ if (ops == pcsc_ops_ptr)
+ goto out_unlock;
+
+ /* Clear orig_ops temporarily to allow re-injection */
+ bus->orig_ops = NULL;
+ } else
+#endif
+ {
+ old_ops = bus->ops;
+ }
+
bus->ops = ops;
+
+#ifdef CONFIG_PCSC
+ /*
+ * Re-inject PCSC ops if they were previously injected and the new ops
+ * are not the PCSC ops themselves. This maintains caching transparency.
+ */
+ if (pcsc_was_injected && ops != pcsc_ops_ptr) {
+ /*
+ * IMPORTANT: Dynamic ops changes after PCSC injection can lead to
+ * cache consistency issues if operations were performed that should
+ * have invalidated the cache. We re-inject PCSC ops here, but the
+ * caller is responsible for ensuring cache consistency if needed.
+ * This will be fixed in a future commit, when PCSC resets are
+ * introduced.
+ */
+
+ pr_warn("PCSC: Dynamic ops change detected on bus %04x:%02x, resetting cache\n",
+ pci_domain_nr(bus), bus->number);
+
+ if (pcsc_inject_bus_ops(bus)) {
+ pr_err("PCSC: Failed to re-inject ops after ops change on bus %04x:%02x\n",
+ pci_domain_nr(bus), bus->number);
+ /*
+ * If re-injection fails, we've lost caching but at least
+ * the caller's requested ops are in place. Log it
+ */
+ pr_warn("PCSC: Cache disabled for bus %04x:%02x after ops change\n",
+ pci_domain_nr(bus), bus->number);
+ } else {
+ pr_debug("PCSC: Successfully re-injected ops after ops change on bus %04x:%02x\n",
+ pci_domain_nr(bus), bus->number);
+ }
+ } else if (!pcsc_was_injected) {
+ /* First-time injection for this bus */
+ if (pcsc_inject_bus_ops(bus)) {
+ pr_err("PCSC: Failed to inject ops on bus %04x:%02x\n",
+ pci_domain_nr(bus), bus->number);
+ }
+ }
+
+out_unlock:
+#endif
raw_spin_unlock_irqrestore(&pci_lock, flags);
return old_ops;
}
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 17919b99fa66..2f1efc41afcc 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -155,3 +155,12 @@ config PCIE_EDR
the PCI Firmware Specification r3.2. Enable this if you want to
support hybrid DPC model which uses both firmware and OS to
implement DPC.
+
+config PCIE_PCSC
+ bool "PCI Configuration Space Cache PCIE Support"
+ depends on PCSC
+ default y
+ help
+ This option adds PCIe support to the PCSC, by expanding the
+ configuration space to 4K and adding support for PCIe Capabilities.
+ For more information check PCSC and `/drivers/pci/pcsc.c`
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
new file mode 100644
index 000000000000..dec7c51b5cfd
--- /dev/null
+++ b/drivers/pci/pcsc.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ * Author: Evangelos Petrongonas <epetron@amazon.de>
+ *
+ * Implementation of the PCI Configuration Space Cache (PCSC)
+ * PCSC is a module which caches the PCI Configuration Space Accesses
+ * It implements a write-invalidate policy, meaning that writes are
+ * propagated to the device and invalidating the cache. The registers that
+ * we are caching are based on the values that are safe to cache and we
+ * are not expecting them to change without OS actions.
+ *
+ */
+
+ #define pr_fmt(fmt) "PCSC: " fmt
+
+#include <linux/pcsc.h>
+
+static bool pcsc_initialised;
+
+static int pcsc_add_bus(struct pci_bus *bus)
+{
+ if (!bus->orig_ops || !bus->orig_ops->add_bus)
+ return 0;
+ return bus->orig_ops->add_bus(bus);
+}
+
+static void pcsc_remove_bus(struct pci_bus *bus)
+{
+ if (bus->orig_ops && bus->orig_ops->remove_bus)
+ bus->orig_ops->remove_bus(bus);
+}
+
+/**
+ * pcsc_map_bus - Map PCI configuration space for memory-mapped access
+ * @bus: PCI bus structure
+ * @devfn: Device and function number
+ * @where: Offset in configuration space
+ *
+ * WARNING: Cache Bypass Issue
+ * This function returns a memory-mapped I/O address that provides direct
+ * access to PCI configuration space, completely bypassing the PCSC cache.
+ *
+ * Any reads or writes performed through the returned MMIO address will NOT:
+ * - Use cached values for reads
+ * - Update cached values on reads
+ * - Invalidate cached values on writes
+ *
+ * This can lead to cache inconsistency where:
+ * 1. PCSC cache contains stale data after MMIO writes
+ * 2. Subsequent cached reads return outdated values
+ * 3. Cache coherency is lost until the next cache invalidation
+ *
+ * Current users include:
+ * - (pci_generic_config_{read,write}{,32}) which are already handled
+ * - operations on RCs that are not supported by PCSC.
+ * Therefore, there is no risk of cache inconsistency here.
+ * However, any future use of map_bus after cache population poses risks.
+ *
+ * IMPORTANT: Callers using the returned MMIO address are responsible for
+ * maintaining cache consistency. Consider invalidating relevant cache entries
+ * after MMIO operations if the device's cache may be active.
+ *
+ * Return: Virtual address for memory-mapped config space access, or NULL
+ */
+static void __iomem *pcsc_map_bus(struct pci_bus *bus, unsigned int devfn,
+ int where)
+{
+ if (!bus->orig_ops || !bus->orig_ops->map_bus)
+ return NULL;
+ return bus->orig_ops->map_bus(bus, devfn, where);
+}
+
+/* Weak references to allow architecture-specific overrides */
+int __weak pcsc_hw_config_read(struct pci_bus *bus, unsigned int devfn,
+ int where, int size, u32 *val)
+{
+ /*
+ * This function is only called from pcsc_cached_config_read,
+ * which means PCSC ops have already been injected and orig_ops
+ * should be valid.
+ */
+ if (bus->orig_ops && bus->orig_ops->read)
+ return bus->orig_ops->read(bus, devfn, where, size, val);
+
+ *val = 0xffffffff;
+ return PCIBIOS_FUNC_NOT_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(pcsc_hw_config_read);
+
+int __weak pcsc_hw_config_write(struct pci_bus *bus, unsigned int devfn,
+ int where, int size, u32 val)
+{
+ /*
+ * This function is only called from pcsc_cached_config_write,
+ * which means PCSC ops have already been injected and orig_ops
+ * should be valid.
+ */
+ if (bus->orig_ops && bus->orig_ops->write)
+ return bus->orig_ops->write(bus, devfn, where, size, val);
+
+ return PCIBIOS_FUNC_NOT_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(pcsc_hw_config_write);
+
+int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 *val)
+{
+ if (!pcsc_initialised)
+ goto read_from_dev;
+
+read_from_dev:
+ return pcsc_hw_config_read(bus, devfn, where, size, val);
+}
+EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
+
+int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 val)
+{
+ if (!pcsc_initialised)
+ goto write_to_dev;
+
+write_to_dev:
+ return pcsc_hw_config_write(bus, devfn, where, size, val);
+}
+EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
+
+static struct pci_ops pcsc_ops = {
+ .add_bus = pcsc_add_bus,
+ .remove_bus = pcsc_remove_bus,
+ .map_bus = pcsc_map_bus,
+ .read = pcsc_cached_config_read,
+ .write = pcsc_cached_config_write,
+};
+
+int pcsc_inject_bus_ops(struct pci_bus *bus)
+{
+ if (!bus)
+ return -EINVAL;
+
+ if (!bus->ops) {
+ WARN_ONCE(
+ 1,
+ "PCSC: Cannot inject ops - bus %04x:%02x ops not defined\n",
+ pci_domain_nr(bus), bus->number);
+ return -EINVAL;
+ }
+
+ if (bus->ops->read == pcsc_cached_config_read || bus->orig_ops)
+ return 0;
+
+ bus->orig_ops = bus->ops;
+ bus->ops = &pcsc_ops;
+
+ pci_dbg(bus, "PCSC: Injected ops for bus");
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pcsc_inject_bus_ops);
+
+static void pcsc_remove_bus_ops(struct pci_bus *bus)
+{
+ if (bus->orig_ops && bus->ops == &pcsc_ops) {
+ bus->ops = bus->orig_ops;
+ bus->orig_ops = NULL;
+ }
+}
+
+static int pcsc_bus_notify(struct notifier_block *nb, unsigned long action,
+ void *data)
+{
+ struct device *dev = data;
+ struct pci_bus *bus;
+
+ bus = to_pci_bus(dev);
+ if (!bus)
+ return NOTIFY_OK;
+
+ switch (action) {
+ case BUS_NOTIFY_ADD_DEVICE:
+ pcsc_inject_bus_ops(bus);
+ break;
+ case BUS_NOTIFY_DEL_DEVICE:
+ /*
+ * Remove on DEL_DEVICE to unhook before device_del() completes.
+ * This ensures caching is disabled before the final cleanup.
+ */
+ pcsc_remove_bus_ops(bus);
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block pcsc_bus_nb = {
+ .notifier_call = pcsc_bus_notify,
+};
+
+static int __init pcsc_init(void)
+{
+ bus_register_notifier(&pci_bus_type, &pcsc_bus_nb);
+
+ pcsc_initialised = true;
+ pr_info("initialised\n");
+
+ return 0;
+}
+
+core_initcall(pcsc_init);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 37f5bd476f39..33a186e4bf1e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -8,6 +8,7 @@
#include <linux/delay.h>
#include <linux/init.h>
#include <linux/pci.h>
+#include <linux/pcsc.h>
#include <linux/msi.h>
#include <linux/of_pci.h>
#include <linux/of_platform.h>
@@ -1039,6 +1040,11 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
}
#endif
+#ifdef CONFIG_PCSC
+ if (pcsc_inject_bus_ops(bus))
+ pci_err(bus, "PCSC: Failed to inject ops\n");
+#endif
+
b = pci_find_bus(pci_domain_nr(bus), bridge->busnr);
if (b) {
/* Ignore it if we already got here via a different bridge */
@@ -1236,10 +1242,24 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
child->bus_flags = parent->bus_flags;
host = pci_find_host_bridge(parent);
- if (host->child_ops)
+ if (host->child_ops) {
child->ops = host->child_ops;
- else
+#ifdef CONFIG_PCSC
+ child->orig_ops = host->child_ops;
+#endif
+ } else {
child->ops = parent->ops;
+#ifdef CONFIG_PCSC
+ child->orig_ops = parent->orig_ops;
+#endif
+ }
+
+#ifdef CONFIG_PCSC
+ if (child->ops) {
+ if (pcsc_inject_bus_ops(child))
+ pci_err(child, "PCSC: Failed to inject ops\n");
+ }
+#endif
/*
* Initialize some portions of the bus device, but don't register
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d1fdf81fbe1e..b6cbf93db644 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -669,6 +669,9 @@ struct pci_bus {
struct resource busn_res; /* Bus numbers routed to this bus */
struct pci_ops *ops; /* Configuration access functions */
+#ifdef CONFIG_PCSC
+ struct pci_ops *orig_ops; /* Original ops before PCSC injection */
+#endif
void *sysdata; /* Hook for sys-specific extension */
struct proc_dir_entry *procdir; /* Directory entry in /proc/bus/pci */
diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
new file mode 100644
index 000000000000..45816eb2b2c8
--- /dev/null
+++ b/include/linux/pcsc.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ * Author: Evangelos Petrongonas <epetron@amazon.de>
+ *
+ */
+
+#ifndef _LINUX_PCSC_H
+#define _LINUX_PCSC_H
+
+#include <linux/pci.h>
+
+/**
+ * pcsc_hw_config_read - Direct hardware PCI config space read
+ * @bus: PCI bus
+ * @devfn: PCI device function
+ * @where: offset in PCI config space
+ * @size: size of data to read
+ * @val: pointer to store read data
+ *
+ * This function performs a direct hardware read from PCI configuration space,
+ * bypassing the PCSC cache. It is a weak function that can be overridden by
+ * architecture-specific implementations.
+ *
+ * Return: 0 on success, non-zero error code on failure
+ */
+int pcsc_hw_config_read(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 *val);
+
+/**
+ * pcsc_hw_config_write - Direct hardware PCI config space write
+ * @bus: PCI bus
+ * @devfn: PCI device function
+ * @where: offset in PCI config space
+ * @size: size of data to write
+ * @val: value to write
+ *
+ * This function performs a direct hardware write to PCI configuration space,
+ * bypassing the PCSC cache. It is a weak function that can be overridden by
+ * architecture-specific implementations.
+ *
+ * Return: 0 on success, non-zero error code on failure
+ */
+int pcsc_hw_config_write(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 val);
+
+/**
+ * pcsc_cached_config_read - Read PCI config space register via PCSC
+ * @bus: PCI bus
+ * @devfn: PCI device function
+ * @where: offset in PCI config space
+ * @size: size of data to read
+ * @val: pointer to store read data
+ *
+ * Reads a register from the PCI configuration space of a device using the
+ * PCSC infrastructure.
+ *
+ * Return: 0 on success, non-zero error code on failure
+ */
+int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 *val);
+
+/**
+ * pcsc_cached_config_write - Write PCI config space register via PCSC
+ * @bus: PCI bus
+ * @devfn: PCI device function
+ * @where: offset in PCI config space
+ * @size: size of data to write
+ * @val: value to write
+ *
+ * Writes a value to a register in the PCI configuration space of a device using
+ * the PCSC infrastructure.
+ *
+ * Return: 0 on success, non-zero error code on failure
+ */
+int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
+ int size, u32 val);
+
+/**
+ * pcsc_inject_bus_ops Inject the pcsc ops into bus pci_ops
+ * @bus: the bus in which to inject the ops
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int pcsc_inject_bus_ops(struct pci_bus *bus);
+#endif /* _LINUX_PCSC_H */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 02/13] pci: pcsc: implement basic functionality
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 01/13] pci: pcsc: Add plumbing for the " Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 13:12 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities Evangelos Petrongonas
` (10 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Implement the core functionality of the PCI Configuration Space Cache
using per-device cache nodes attached to struct pci_dev.
Each cache node stores:
- A 256-byte array (4KB for PCIe) representing the configuration space
- A cacheable bitmask indicating which registers can be cached
- A cached bitmask tracking which bytes are currently cached
The implementation attaches cache nodes directly to pci_dev structures
during `pci_device_add()` and removes them during `pci_device_remove()`.
The cache implements a write-invalidate policy where writes are
propagated to the device while invalidating the cache. This design
choice improves robustness and increases the number of cacheable
registers, particularly for operations like BAR sizing which use
write-read sequences to detect read-only bits.
Currently, the cacheable bitmask is zero-initialized,
effectively disabling the cache. This will be changed in the next
commits.
This implementation only supports endpoint devices; bridges and
root complexes are not cached.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/pci/pci-driver.c | 5 +
drivers/pci/pcsc.c | 244 ++++++++++++++++++++++++++++++++++++++-
drivers/pci/probe.c | 9 ++
include/linux/pci.h | 5 +
include/linux/pcsc.h | 38 ++++++
5 files changed, 299 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 302d61783f6c..7c0cbbd50b32 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -21,6 +21,7 @@
#include <linux/acpi.h>
#include <linux/dma-map-ops.h>
#include <linux/iommu.h>
+#include <linux/pcsc.h>
#include "pci.h"
#include "pcie/portdrv.h"
@@ -497,7 +498,11 @@ static void pci_device_remove(struct device *dev)
* horrible the crap we have to deal with is when we are awake...
*/
+ #ifdef CONFIG_PCSC
+ pcsc_remove_device(pci_dev);
+#endif
pci_dev_put(pci_dev);
+
}
static void pci_device_shutdown(struct device *dev)
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index dec7c51b5cfd..7531217925e8 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -14,9 +14,16 @@
#define pr_fmt(fmt) "PCSC: " fmt
+#include <linux/atomic.h>
#include <linux/pcsc.h>
static bool pcsc_initialised;
+static atomic_t num_nodes = ATOMIC_INIT(0);
+
+inline bool pcsc_is_initialised(void)
+{
+ return pcsc_initialised;
+}
static int pcsc_add_bus(struct pci_bus *bus)
{
@@ -103,13 +110,225 @@ int __weak pcsc_hw_config_write(struct pci_bus *bus, unsigned int devfn,
}
EXPORT_SYMBOL_GPL(pcsc_hw_config_write);
+static inline int _test_bits(int where, int size, const void *addr)
+{
+ int i;
+ int res = 1;
+
+ for (i = 0; i < size; i++)
+ res &= test_bit(where + i, addr);
+ return res;
+}
+
+static int pcsc_is_access_cacheable(struct pci_dev *dev, int where, int size)
+{
+ if (unlikely(!dev || (where + size > PCSC_CFG_SPC_SIZE)))
+ return 0;
+
+ return _test_bits(where, size, dev->pcsc->cachable_bitmask);
+}
+
+static inline bool pcsc_is_cached(struct pci_dev *dev, int where, int size)
+{
+ if (unlikely(!dev || !dev->pcsc || !dev->pcsc->cfg_space ||
+ (where + size > PCSC_CFG_SPC_SIZE)))
+ return 0;
+
+ return _test_bits(where, size, dev->pcsc->cached_bitmask);
+}
+
+static inline void pcsc_set_cached(struct pci_dev *dev, int where, bool cached)
+{
+ if (WARN_ON(!dev))
+ return;
+
+ if (WARN_ON(where >= PCSC_CFG_SPC_SIZE))
+ return;
+
+ if (cached)
+ set_bit(where, dev->pcsc->cached_bitmask);
+ else
+ clear_bit(where, dev->pcsc->cached_bitmask);
+}
+
+static int pcsc_get_byte(struct pci_dev *dev, int where, u8 *val)
+{
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return -EINVAL;
+
+ if (WARN_ON(where >= PCSC_CFG_SPC_SIZE))
+ return -EPERM;
+ *val = dev->pcsc->cfg_space[where];
+ return 0;
+}
+
+static int pcsc_update_byte(struct pci_dev *dev, int where, u8 val)
+{
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return -EINVAL;
+
+ if (WARN_ON(where >= PCSC_CFG_SPC_SIZE))
+ return -EPERM;
+ dev->pcsc->cfg_space[where] = val;
+ pcsc_set_cached(dev, where, true);
+
+ return 0;
+}
+
+int pcsc_add_device(struct pci_dev *dev)
+{
+ struct pcsc_node *node;
+ struct pci_bus *bus;
+
+ if (WARN_ON(!dev))
+ return -EINVAL;
+
+ bus = dev->bus;
+
+ node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+
+ dev->pcsc = node;
+ /* The current version of the PCSC supports only endpoint devices.
+ * Bridges and RCs are not supported, but we are still creating
+ * nodes for these devices, as it simplifies the code flow
+ */
+ if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
+ dev->pcsc->cfg_space = kzalloc(PCSC_CFG_SPC_SIZE, GFP_KERNEL);
+ if (!dev->pcsc->cfg_space)
+ goto err_free_node;
+
+ } else {
+ dev->pcsc->cfg_space = NULL;
+ }
+
+ atomic_inc(&num_nodes);
+ pci_dbg(dev, "PCSC: Created cache node\n");
+
+ return 0;
+
+err_free_node:
+ dev->pcsc = NULL;
+ kfree(node);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pcsc_add_device);
+
+int pcsc_remove_device(struct pci_dev *dev)
+{
+ if (WARN_ON(!dev))
+ return -EINVAL;
+
+ pci_dbg(dev, "PCSC: Removing cache node");
+
+ atomic_dec(&num_nodes);
+
+ if (dev->pcsc && dev->pcsc->cfg_space) {
+ kfree(dev->pcsc->cfg_space);
+ kfree(dev->pcsc);
+ }
+ dev->pcsc = NULL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pcsc_remove_device);
+
+/**
+ * pcsc_get_and_insert_multiple - Read multiple bytes from PCI cache or HW
+ * @dev: PCI device to read from
+ * @bus: PCI bus to read from
+ * @devfn: device and function number
+ * @where: offset in config space
+ * @word: pointer to store read value
+ * @size: number of bytes to read (1, 2 or 4)
+ *
+ * Reads consecutive bytes from PCI cache or hardware. If values are not cached,
+ * reads from hardware and inserts into cache.
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
+ struct pci_bus *bus, unsigned int devfn,
+ int where, u32 *word, int size)
+{
+ u32 word_cached = 0;
+ u8 byte_val;
+ int rc, i;
+
+ if (WARN_ON(!dev || !bus || !word))
+ return -EINVAL;
+
+ if (WARN_ON(size != 1 && size != 2 && size != 4))
+ return -EINVAL;
+
+ /* Check bounds */
+ if (where + size > PCSC_CFG_SPC_SIZE)
+ return -EINVAL;
+
+ if (pcsc_is_cached(dev, where, size)) {
+ /* Read bytes from cache and assemble them into word_cached
+ * in little-endian order (as per PCI spec)
+ */
+ for (i = 0; i < size; i++) {
+ pcsc_get_byte(dev, where + i, &byte_val);
+ word_cached |= ((u32)byte_val << (i * 8));
+ }
+ } else {
+ rc = pcsc_hw_config_read(bus, devfn, where, size, &word_cached);
+ if (rc) {
+ pci_err(dev,
+ "%s: Failed to read CFG Space where=%d size=%d",
+ __func__, where, size);
+ return rc;
+ }
+
+ /* Extract bytes from word_cached in little-endian order
+ * and store them in cache.
+ */
+ for (i = 0; i < size; i++) {
+ byte_val = (word_cached >> (i * 8)) & 0xFF;
+ pcsc_update_byte(dev, where + i, byte_val);
+ }
+ }
+
+ *word = word_cached;
+ return 0;
+}
+
int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
int size, u32 *val)
{
- if (!pcsc_initialised)
+ int rc;
+ struct pci_dev *dev;
+
+ if (unlikely(!pcsc_is_initialised()))
goto read_from_dev;
+ if (WARN_ON(!bus || !val || (size != 1 && size != 2 && size != 4) ||
+ where + size > PCSC_CFG_SPC_SIZE))
+ return -EINVAL;
+
+ dev = pci_get_slot(bus, devfn);
+
+ if (unlikely(!dev || !dev->pcsc))
+ goto read_from_dev;
+
+ if (dev->pcsc->cfg_space &&
+ pcsc_is_access_cacheable(dev, where, size)) {
+ rc = pcsc_get_and_insert_multiple(dev, bus, devfn, where, val,
+ size);
+ if (likely(!rc)) {
+ pci_dev_put(dev);
+ return 0;
+ }
+ /* if reading from the cache failed continue and try reading
+ * from the actual device
+ */
+ }
read_from_dev:
+ if (dev)
+ pci_dev_put(dev);
return pcsc_hw_config_read(bus, devfn, where, size, val);
}
EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
@@ -117,10 +336,31 @@ EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
int size, u32 val)
{
- if (!pcsc_initialised)
+ int i;
+ struct pci_dev *dev;
+
+ if (unlikely(!pcsc_is_initialised()))
goto write_to_dev;
+ if (WARN_ON(!bus || (size != 1 && size != 2 && size != 4) ||
+ where + size > PCSC_CFG_SPC_SIZE))
+ return -EINVAL;
+
+ dev = pci_get_slot(bus, devfn);
+
+ if (unlikely(!dev || !dev->pcsc || !dev->pcsc->cfg_space)) {
+ /* Do not add nodes on arbitrary writes */
+ goto write_to_dev;
+ } else {
+ /* Mark the cache as dirty */
+ if (pcsc_is_access_cacheable(dev, where, size)) {
+ for (i = 0; i < size; i++)
+ pcsc_set_cached(dev, where + i, false);
+ }
+ }
write_to_dev:
+ if (dev)
+ pci_dev_put(dev);
return pcsc_hw_config_write(bus, devfn, where, size, val);
}
EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 33a186e4bf1e..c231e09e5a6e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -23,6 +23,7 @@
#include <linux/irqdomain.h>
#include <linux/pm_runtime.h>
#include <linux/bitfield.h>
+#include <linux/pcsc.h>
#include "pci.h"
#define CARDBUS_LATENCY_TIMER 176 /* secondary latency timer */
@@ -2801,6 +2802,14 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
dev->state_saved = false;
+#ifdef CONFIG_PCSC
+ if (likely(pcsc_is_initialised()))
+ if (!dev->pcsc)
+ if (pcsc_add_device(dev))
+ pci_warn(dev,
+ "Failed to add PCI device to PCSC\n");
+#endif
+
pci_init_capabilities(dev);
/*
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b6cbf93db644..e59b585f96bb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -42,6 +42,7 @@
#include <uapi/linux/pci.h>
#include <linux/pci_ids.h>
+#include <linux/pcsc.h>
#define PCI_STATUS_ERROR_BITS (PCI_STATUS_DETECTED_PARITY | \
PCI_STATUS_SIG_SYSTEM_ERROR | \
@@ -560,6 +561,10 @@ struct pci_dev {
u8 tph_mode; /* TPH mode */
u8 tph_req_type; /* TPH requester type */
#endif
+
+#ifdef CONFIG_PCSC
+ struct pcsc_node *pcsc;
+#endif
};
static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
index 45816eb2b2c8..516d73931608 100644
--- a/include/linux/pcsc.h
+++ b/include/linux/pcsc.h
@@ -9,6 +9,20 @@
#define _LINUX_PCSC_H
#include <linux/pci.h>
+#include <linux/sizes.h>
+#include <linux/bitmap.h>
+
+#ifdef CONFIG_PCIE_PCSC
+#define PCSC_CFG_SPC_SIZE (4 * SZ_1K)
+#else
+#define PCSC_CFG_SPC_SIZE 256
+#endif
+
+struct pcsc_node {
+ u8 *cfg_space;
+ DECLARE_BITMAP(cachable_bitmask, PCSC_CFG_SPC_SIZE);
+ DECLARE_BITMAP(cached_bitmask, PCSC_CFG_SPC_SIZE);
+};
/**
* pcsc_hw_config_read - Direct hardware PCI config space read
@@ -83,4 +97,28 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
* Return: 0 on success, negative error code on failure
*/
int pcsc_inject_bus_ops(struct pci_bus *bus);
+
+/**
+ * pcsc_add_device - Allocate and initialize a new PCSC node
+ * This should only be called once for each device
+ * @dev: PCI device to initialise the cache for
+ *
+ * Returns: 0 on success error code on failure
+ */
+int pcsc_add_device(struct pci_dev *dev);
+
+/**
+ * pcsc_remove_device - Clear up any PCSC data
+ * @dev: PCI device to remove
+ *
+ * Returns: 0 on success, -EINVAL if dev is NULL
+ */
+int pcsc_remove_device(struct pci_dev *dev);
+
+/**
+ * @brief Returns if the PCSC infrastructure is initialised
+ *
+ */
+bool pcsc_is_initialised(void);
+
#endif /* _LINUX_PCSC_H */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 01/13] pci: pcsc: Add plumbing for the " Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 02/13] pci: pcsc: implement basic functionality Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 14:17 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities Evangelos Petrongonas
` (9 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Implement cacheability inference for PCI capabilities to
determine which configuration space registers can be safely cached.
The first 64 bytes of PCI configuration space follow a standardized
format, allowing straightforward cacheability determination. For
capability-specific registers, the implementation traverses the PCI
capability list to identify supported capabilities.
Cacheable registers are identified for the following capabilities:
- Power Management (PM)
- Message Signaled Interrupts (MSI)
- Message Signaled Interrupts Extensions (MSI-X)
- PCI Express
- PCI Advanced Features (AF)
- Enhanced Allocation (EA)
- Vital Product Data (VPD)
- Vendor Specific
The implementation pre-populates the cache with known values including
device/vendor IDs and header type to avoid unnecessary configuration
space reads during initialization.
We are currently not caching the Command/Status registers.
The cacheability of all capabilities apart from MSI, are straightforward
and can be deduced from the spec. Regarding MSI the MSI flags are read
and based on this, the cacheability is inferred.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
drivers/pci/pcsc.c | 261 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 261 insertions(+)
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 7531217925e8..29945eac4190 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -175,6 +175,266 @@ static int pcsc_update_byte(struct pci_dev *dev, int where, u8 val)
return 0;
}
+static const u8 PCSC_SUPPORTED_CAPABILITIES[] = {
+ PCI_CAP_ID_PM, PCI_CAP_ID_VPD, PCI_CAP_ID_MSI, PCI_CAP_ID_VNDR,
+ PCI_CAP_ID_MSIX, PCI_CAP_ID_EXP, PCI_CAP_ID_AF, PCI_CAP_ID_EA
+};
+
+/**
+ * pcsc_handle_msi_cacheability - Set cacheability for MSI capability registers
+ * @dev: PCI device
+ * @cap_pos: Capability position in config space
+ *
+ * The MSI capability has four different shapes (12-24 bytes) depending on:
+ * - 64-bit addressing capability (PCI_MSI_FLAGS_64BIT)
+ * - Per-vector masking capability (PCI_MSI_FLAGS_MASKBIT)
+ *
+ * Cacheable registers:
+ * - PCI_MSI_FLAGS: Control register
+ * - PCI_MSI_ADDRESS_LO: Lower 32 bits of message address
+ * - PCI_MSI_ADDRESS_HI: Upper 32 bits (if 64-bit capable)
+ * - PCI_MSI_DATA_32/64: Message data register
+ * - PCI_MSI_MASK_32/64: Mask bits register (if masking capable)
+ *
+ * Non-cacheable registers:
+ * - PCI_MSI_PENDING_32/64: Pending bits (modified by device)
+ */
+static void pcsc_handle_msi_cacheability(struct pci_dev *dev, int cap_pos)
+{
+ u32 val;
+ u16 msi_flags;
+ bool is_64bit_capable;
+ bool is_mask_capable;
+ int data_offset;
+ int mask_offset;
+
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return;
+
+ /* Read MSI flags to determine capability shape */
+ if (pcsc_hw_config_read(dev->bus, dev->devfn, cap_pos + PCI_MSI_FLAGS,
+ 2, &val) != PCIBIOS_SUCCESSFUL) {
+ pci_warn(dev, "PCSC: Failed to read MSI flags at %#x\n",
+ cap_pos + PCI_MSI_FLAGS);
+ return;
+ }
+
+ msi_flags = val & 0xFFFF;
+ pcsc_update_byte(dev, cap_pos + PCI_MSI_FLAGS, msi_flags & 0xFF);
+ pcsc_update_byte(dev, cap_pos + PCI_MSI_FLAGS + 1, (msi_flags >> 8) & 0xFF);
+
+ /* Mark MSI flags as cacheable */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_MSI_FLAGS, 2);
+ is_64bit_capable = !!(msi_flags & PCI_MSI_FLAGS_64BIT);
+ is_mask_capable = !!(msi_flags & PCI_MSI_FLAGS_MASKBIT);
+
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_MSI_ADDRESS_LO,
+ 4);
+
+ if (is_64bit_capable) {
+ /* PCI_MSI_ADDRESS_HI is cacheable for 64-bit capable devices */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_MSI_ADDRESS_HI, 4);
+
+ data_offset = PCI_MSI_DATA_64;
+ mask_offset = PCI_MSI_MASK_64;
+ } else {
+ /* Message Data register is at different offset for 32-bit */
+ data_offset = PCI_MSI_DATA_32;
+ mask_offset = PCI_MSI_MASK_32;
+ }
+
+ /*
+ * Message Data register is always cacheable
+ * Note: PCI spec defines Extended Message Data Capable (bit 9, 0x0200)
+ * which allows 4-byte message data instead of 2-byte. However, Linux
+ * doesn't currently define or use this capability, so we conservatively
+ * mark only 2 bytes as cacheable for compatibility.
+ */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + data_offset, 2);
+
+ if (is_mask_capable) {
+ /* Mask bits register is cacheable if masking is supported */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + mask_offset,
+ 4);
+ }
+}
+
+static void infer_capability_cacheability(struct pci_dev *dev, int cap_pos,
+ u8 cap_id)
+{
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return;
+
+ switch (cap_id) {
+ case PCI_CAP_ID_PM:
+ /* Power Management Capability */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PM_PMC,
+ 2); /* PCI_PM_PMC */
+ break;
+ case PCI_CAP_ID_MSI:
+ /* Message Signaled Interrupts */
+ pcsc_handle_msi_cacheability(dev, cap_pos);
+ break;
+ case PCI_CAP_ID_VNDR:
+ /* Vendor Specific */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_CAP_FLAGS,
+ 1);
+ /* Only the flag can be cached as the body is opaque */
+ break;
+ case PCI_CAP_ID_MSIX:
+ /* MSI-X - the entire capability is cacheable */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_MSIX_FLAGS, 10);
+ break;
+ case PCI_CAP_ID_EXP:
+ /* PCI Express capability - All except Status registers */
+ bitmap_set(
+ dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_FLAGS,
+ 8); /* PCI_EXP_FLAGS, PCI_EXP_DEVCAP, PCI_EXP_DEVCTL */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_LNKCAP,
+ 6); /* PCI_EXP_LNKCAP, PCI_EXP_LNKCTL */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_SLTCAP,
+ 6); /* PCI_EXP_SLTCAP, PCI_EXP_SLTCTL */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_RTCTL,
+ 4); /* PCI_EXP_RTCTL, PCI_EXP_RTCAP */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_DEVCAP2,
+ 6); /* PCI_EXP_DEVCAP2, PCI_EXP_DEVCTL2 */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_LNKCAP2,
+ 6); /* PCI_EXP_LNKCAP2, PCI_EXP_LNKCTL2 */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_SLTCAP2,
+ 6); /* PCI_EXP_SLTCAP2, PCI_EXP_SLTCTL2 */
+ break;
+ case PCI_CAP_ID_AF:
+ /* PCI Advanced Features */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_AF_LENGTH,
+ 2); /* PCI_AF_LENGTH, PCI_AF_CAP */
+ break;
+ case PCI_CAP_ID_EA:
+ /* Enhanced Allocation Theoretically the entire capability could
+ * be cached, but it is not trivial to deduce its size.
+ */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EA_NUM_ENT, 2);
+ break;
+ case PCI_CAP_ID_VPD:
+ /* Vital Product Data */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_VPD_ADDR,
+ 2); /* PCI_VPD_ADDR */
+ break;
+ default:
+ /* Unsupported capability - We shouldn't reach this point */
+ pr_warn("Something is off when iterating through the supported capabilities.");
+ break;
+ }
+}
+
+static void infer_capabilities_pointers(struct pci_dev *dev)
+{
+ u8 pos, cap_id, next_cap;
+ u32 val;
+ int i;
+
+ if (pcsc_hw_config_read(dev->bus, dev->devfn, PCI_CAPABILITY_LIST, 1,
+ &val) != PCIBIOS_SUCCESSFUL)
+ return;
+
+ pos = (val & 0xFF) & ~0x3;
+
+ while (pos) {
+ if (pos < 0x40 || pos > 0xFE)
+ break;
+
+ pos &= ~0x3;
+ if (pcsc_hw_config_read(dev->bus, dev->devfn, pos, 2, &val) !=
+ PCIBIOS_SUCCESSFUL)
+ break;
+
+ cap_id = val & 0xFF; /* PCI_CAP_LIST_ID */
+ next_cap = (val >> 8) & 0xFF; /* PCI_CAP_LIST_NEXT */
+
+ bitmap_set(dev->pcsc->cachable_bitmask, pos, 2);
+ pcsc_update_byte(dev, pos, cap_id); /* PCI_CAP_LIST_ID */
+ pcsc_update_byte(dev, pos + 1,
+ next_cap); /* PCI_CAP_LIST_NEXT */
+
+ pci_dbg(dev, "Capability ID %#x found at %#x\n", cap_id, pos);
+
+ /* Check if this is a supported capability and infer cacheability */
+ for (i = 0; i < ARRAY_SIZE(PCSC_SUPPORTED_CAPABILITIES); i++) {
+ if (cap_id == PCSC_SUPPORTED_CAPABILITIES[i]) {
+ infer_capability_cacheability(dev, pos, cap_id);
+ break;
+ }
+ }
+
+ /* Move to next capability */
+ pos = next_cap;
+ }
+}
+
+static void infer_cacheability(struct pci_dev *dev)
+{
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return;
+
+ bitmap_zero(dev->pcsc->cachable_bitmask, PCSC_CFG_SPC_SIZE);
+
+ /* Type 0 Configuration Space Header */
+ if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
+ /*
+ * Mark cacheable registers in the PCI configuration space header.
+ * We cache read-only and rarely changing registers:
+ * - PCI_VENDOR_ID, PCI_DEVICE_ID (0x00-0x03)
+ * - PCI_CLASS_REVISION through PCI_CAPABILITY_LIST (0x08-0x34)
+ * Includes: CLASS_REVISION, CACHE_LINE_SIZE, LATENCY_TIMER,
+ * HEADER_TYPE, BIST, BASE_ADDRESS_0-5, CARDBUS_CIS,
+ * SUBSYSTEM_VENDOR_ID, SUBSYSTEM_ID, ROM_ADDRESS, CAPABILITY_LIST
+ * - PCI_INTERRUPT_LINE through PCI_MAX_LAT (0x3c-0x3f)
+ * Includes: INTERRUPT_LINE, INTERRUPT_PIN, MIN_GNT, MAX_LAT
+ */
+ bitmap_set(dev->pcsc->cachable_bitmask, PCI_VENDOR_ID, 4);
+ bitmap_set(dev->pcsc->cachable_bitmask, PCI_CLASS_REVISION, 45);
+ bitmap_set(dev->pcsc->cachable_bitmask, PCI_INTERRUPT_LINE, 4);
+
+ /* Pre populate the cache with the values that we already know */
+ pcsc_update_byte(dev, PCI_HEADER_TYPE,
+ dev->hdr_type |
+ (dev->multifunction ? 0x80 : 0));
+
+ /*
+ * SR-IOV VFs must return 0xFFFF (PCI_ANY_ID) for vendor/device ID
+ * registers per PCIe spec.
+ */
+ if (dev->is_virtfn) {
+ pcsc_update_byte(dev, PCI_VENDOR_ID, 0xFF);
+ pcsc_update_byte(dev, PCI_VENDOR_ID + 1, 0xFF);
+ pcsc_update_byte(dev, PCI_DEVICE_ID, 0xFF);
+ pcsc_update_byte(dev, PCI_DEVICE_ID + 1, 0xFF);
+ } else {
+ if (dev->vendor != PCI_ANY_ID) {
+ pcsc_update_byte(dev, PCI_VENDOR_ID,
+ dev->vendor & 0xFF);
+ pcsc_update_byte(dev, PCI_VENDOR_ID + 1,
+ (dev->vendor >> 8) & 0xFF);
+ }
+ if (dev->device != PCI_ANY_ID) {
+ pcsc_update_byte(dev, PCI_DEVICE_ID,
+ dev->device & 0xFF);
+ pcsc_update_byte(dev, PCI_DEVICE_ID + 1,
+ (dev->device >> 8) & 0xFF);
+ }
+ }
+
+ infer_capabilities_pointers(dev);
+ }
+}
+
int pcsc_add_device(struct pci_dev *dev)
{
struct pcsc_node *node;
@@ -199,6 +459,7 @@ int pcsc_add_device(struct pci_dev *dev)
if (!dev->pcsc->cfg_space)
goto err_free_node;
+ infer_cacheability(dev);
} else {
dev->pcsc->cfg_space = NULL;
}
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (2 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 14:24 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params Evangelos Petrongonas
` (8 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Extend PCSC to support cacheability inference for PCIe extended
capabilities located in the 4KB extended configuration space.
Similar to the capabilities, PCIe extended capabilities require
traversal of the capability list to determine cacheability. The
implementation identifies cacheable registers for capabilities used
by the generic PCIe driver:
- Advanced Error Reporting (AER)
- Access Control Services (ACS)
- Alternative Routing-ID (ARI)
- SR-IOV
- Address Translation Services (ATS)
- Page Request Interface (PRI)
- Process Address Space ID (PASID)
- Downstream Port Containment (DPC)
- Precision Time Measurement (PTM)
The extended capability header (4 bytes) is always cached to enable
efficient capability list traversal.
All the extended capabilities apart from the DPC are static. Regarding
DPC, the DPC capabilities is read and based on its value the
cacheability of RP* registers is inferred.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
drivers/pci/pcsc.c | 203 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 203 insertions(+)
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 29945eac4190..343f8b03831a 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -180,6 +180,65 @@ static const u8 PCSC_SUPPORTED_CAPABILITIES[] = {
PCI_CAP_ID_MSIX, PCI_CAP_ID_EXP, PCI_CAP_ID_AF, PCI_CAP_ID_EA
};
+#ifdef CONFIG_PCIE_PCSC
+static const u16 PCSCS_SUPPORTED_EXT_CAPABILITIES[] = {
+ PCI_EXT_CAP_ID_ERR, PCI_EXT_CAP_ID_ACS, PCI_EXT_CAP_ID_ARI,
+ PCI_EXT_CAP_ID_SRIOV, PCI_EXT_CAP_ID_ATS, PCI_EXT_CAP_ID_PRI,
+ PCI_EXT_CAP_ID_PASID, PCI_EXT_CAP_ID_DPC, PCI_EXT_CAP_ID_PTM
+};
+
+/**
+ * pcsc_handle_dpc_cacheability - Set cacheability for DPC capability registers
+ * @dev: PCI device
+ * @cap_pos: Capability position in config space
+ *
+ * The DPC capability cacheability depends on whether RP extensions are supported:
+ * - PCI_EXP_DPC_CAP_RP_EXT bit indicates RP extension register presence
+ */
+static void pcsc_handle_dpc_cacheability(struct pci_dev *dev, int cap_pos)
+{
+ u32 val;
+ u16 dpc_cap;
+ bool has_rp_extensions;
+
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return;
+
+ if (pcsc_hw_config_read(dev->bus, dev->devfn, cap_pos + PCI_EXP_DPC_CAP,
+ 2, &val) != PCIBIOS_SUCCESSFUL) {
+ pci_warn(dev, "PCSC: Failed to read DPC capability at %#x\n",
+ cap_pos + PCI_EXP_DPC_CAP);
+ return;
+ }
+
+ dpc_cap = val & 0xFFFF;
+ has_rp_extensions = !!(dpc_cap & PCI_EXP_DPC_CAP_RP_EXT);
+
+ /* Cache the DPC capability register */
+ pcsc_update_byte(dev, cap_pos + PCI_EXP_DPC_CAP, dpc_cap & 0xFF);
+ pcsc_update_byte(dev, cap_pos + PCI_EXP_DPC_CAP + 1,
+ (dpc_cap >> 8) & 0xFF);
+
+ /* Always cacheable: main DPC registers */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_DPC_CAP, 2);
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_DPC_CTL, 2);
+
+ /* Conditionally cacheable: RP extension registers PCI_EXP_DPC_RP_PIO_MASK
+ * PCI_EXP_DPC_RP_PIO_SEVERITY , PCI_EXP_DPC_RP_PIO_SYSERROR, PCI_EXP_DPC_RP_PIO_EXCEPTION
+ */
+ if (has_rp_extensions) {
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_DPC_RP_PIO_MASK, 16);
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_DPC_RP_PIO_SEVERITY, 4);
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_DPC_RP_PIO_SYSERROR, 4);
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_EXP_DPC_RP_PIO_EXCEPTION, 4);
+ }
+}
+#endif
+
/**
* pcsc_handle_msi_cacheability - Set cacheability for MSI capability registers
* @dev: PCI device
@@ -378,6 +437,146 @@ static void infer_capabilities_pointers(struct pci_dev *dev)
}
}
+#ifdef CONFIG_PCIE_PCSC
+
+static void infer_extended_capability_cacheability(struct pci_dev *dev,
+ int cap_pos, u16 cap_id)
+{
+ if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+ return;
+
+ switch (cap_id) {
+ case PCI_EXT_CAP_ID_ERR:
+ /* Advanced Error Reporting */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_ERR_UNCOR_MASK,
+ 8); /* PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_ERR_COR_MASK,
+ 4); /* PCI_ERR_COR_MASK only */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_ERR_ROOT_COMMAND,
+ 4); /* PCI_ERR_ROOT_COMMAND */
+ break;
+ case PCI_EXT_CAP_ID_ACS:
+ /* Access Control Services
+ * We only cache PCI_ACS_CAP and PCI_ACS_CTRL (first 4 bytes).
+ * The Egress Control Vector that follows (if present) is not
+ * cached because:
+ * - Determining its size would require reading PCI_ACS_CAP
+ * - These registers are typically only written by the OS during
+ * setup and not read frequently during runtime
+ * - Caching them would provide no performance benefit
+ */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_ACS_CAP,
+ 4); /* PCI_ACS_CAP, PCI_ACS_CTRL */
+ break;
+ case PCI_EXT_CAP_ID_ARI:
+ /* Alternative Routing-ID: */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_ARI_CAP,
+ 4); /* PCI_ARI_CAP, PCI_ARI_CTRL */
+ break;
+ case PCI_EXT_CAP_ID_SRIOV:
+ /* SR-IOV */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_CAP,
+ 6); /* PCI_SRIOV_CAP, PCI_SRIOV_CTRL */
+ /* PCI_SRIOV_INITIAL_VF, PCI_SRIOV_TOTAL_VF,
+ * PCI_SRIOV_NUM_VF,PCI_SRIOV_FUNC_LINK
+ */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_SRIOV_INITIAL_VF, 7);
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_SRIOV_VF_OFFSET,
+ 4); /* PCI_SRIOV_VF_OFFSET, PCI_SRIOV_VF_STRIDE */
+ /* PCI_SRIOV_VF_DID, PCI_SRIOV_SUPPORTED_PAGE_SIZES,PCI_SRIOV_PAGE_SIZE */
+ bitmap_set(
+ dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_VF_DID,
+ 10);
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_BAR,
+ 24); /* PCI_SRIOV_BAR0-5 */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_VFM,
+ 4); /* PCI_SRIOV_VFMM */
+ break;
+ case PCI_EXT_CAP_ID_ATS:
+ /* Address Translation Service: */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_ATS_CAP,
+ 4); /* PCI_ATS_CAP, PCI_ATS_CTRL*/
+ break;
+ case PCI_EXT_CAP_ID_PRI:
+ /* Page Request Interface */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PRI_CTRL,
+ 2); /* PCI_PRI_CTRL */
+ bitmap_set(dev->pcsc->cachable_bitmask,
+ cap_pos + PCI_PRI_MAX_REQ,
+ 8); /* PCI_PRI_MAX_REQ, PCI_PRI_ALLOC_REQ */
+ break;
+ case PCI_EXT_CAP_ID_PASID:
+ /* Process Address Space ID */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PASID_CAP,
+ 4); /* PCI_PASID_CAP, PCI_PASID_CTRL */
+ break;
+ case PCI_EXT_CAP_ID_DPC:
+ /* Downstream Port Containment */
+ pcsc_handle_dpc_cacheability(dev, cap_pos);
+ break;
+ case PCI_EXT_CAP_ID_PTM:
+ /* Precision Time Measurement */
+ bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PTM_CAP,
+ 8); /* PCI_PTM_CAP, PCI_PTM_CTRL */
+ break;
+ default:
+ /* Unknown extended capability - only cache header */
+ break;
+ }
+}
+
+static void infer_extended_capabilities_pointers(struct pci_dev *dev)
+{
+ int pos = 0x100;
+ u32 header;
+ int cap_ver, cap_id;
+ int i;
+
+ while (pos) {
+ if (pos > 0xFFC || pos < 0x100)
+ break;
+
+ pos &= ~0x3;
+
+ if (pcsc_hw_config_read(dev->bus, dev->devfn, pos, 4,
+ &header) != PCIBIOS_SUCCESSFUL)
+ break;
+
+ if (!header)
+ break;
+
+ bitmap_set(dev->pcsc->cachable_bitmask, pos, 4);
+ for (i = 0; i < 4; i++)
+ pcsc_update_byte(dev, pos + i,
+ (header >> (i * 8)) & 0xFF);
+
+ cap_id = PCI_EXT_CAP_ID(header);
+ cap_ver = PCI_EXT_CAP_VER(header);
+
+ pci_dbg(dev,
+ "Extended capability ID %#x (ver %d) found at %#x, next cap at %#x\n",
+ cap_id, cap_ver, pos, PCI_EXT_CAP_NEXT(header));
+
+ /* Check if this is a supported extended capability and infer cacheability */
+ for (i = 0; i < ARRAY_SIZE(PCSCS_SUPPORTED_EXT_CAPABILITIES);
+ i++) {
+ if (cap_id == PCSCS_SUPPORTED_EXT_CAPABILITIES[i]) {
+ infer_extended_capability_cacheability(dev, pos,
+ cap_id);
+ break;
+ }
+ }
+
+ pos = PCI_EXT_CAP_NEXT(header);
+ }
+}
+#endif
+
static void infer_cacheability(struct pci_dev *dev)
{
if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
@@ -432,6 +631,10 @@ static void infer_cacheability(struct pci_dev *dev)
}
infer_capabilities_pointers(dev);
+#ifdef CONFIG_PCIE_PCSC
+ if (pci_is_pcie(dev))
+ infer_extended_capabilities_pointers(dev);
+#endif
}
}
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (3 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 14:41 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 06/13] pci: pcsc: handle device resets Evangelos Petrongonas
` (7 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Add kernel parameters and runtime control mechanisms for the PCSC
A new kernel parameter 'pcsc_enabled' allows enabling or disabling
the cache at boot time. The parameter defaults to disabled.
A sysfs interface at /sys/bus/pci/pcsc/enabled provides:
- Read access to query current cache status (1=enabled, 0=disabled)
- Write access to dynamically enable/disable the cache at runtime
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
Documentation/ABI/testing/sysfs-bus-pci-pcsc | 20 ++++
.../admin-guide/kernel-parameters.txt | 3 +
drivers/pci/pcsc.c | 93 ++++++++++++++++++-
3 files changed, 114 insertions(+), 2 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-pcsc
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-pcsc b/Documentation/ABI/testing/sysfs-bus-pci-pcsc
new file mode 100644
index 000000000000..ee92bf087816
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-pci-pcsc
@@ -0,0 +1,20 @@
+PCI Configuration Space Cache (PCSC)
+-------------------------------------
+
+The PCI Configuration Space Cache (PCSC) is a transparent caching layer
+that intercepts configuration space operations to reduce hardware access
+overhead. This subsystem addresses performance bottlenecks in PCI
+configuration space accesses, particularly in virtualization
+environments with high-density SR-IOV deployments where repeated
+enumeration of Virtual Functions creates substantial delays.
+
+What: /sys/bus/pci/pcsc/enabled
+Date: September 2025
+Contact: Linux PCI developers <linux-pci@vger.kernel.org>
+Description:
+ PCI Configuration Space Cache (PCSC) is a subsystem that
+ caches accesses to the PCI configuration space of PCI
+ functions. When this file contains the "1", the kernel
+ is utilizing the cache, while when on "0" the
+ system bypasses it. This setting can also be controlled
+parameter.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 747a55abf494..08c7a13f107c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5036,6 +5036,9 @@
pcmv= [HW,PCMCIA] BadgePAD 4
+ pcsc_enabled= [PCSC] enable the use of the PCI Configuration Space
+ Cache (PCSC).
+
pd_ignore_unused
[PM]
Keep all power-domains already enabled by bootloader on,
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 343f8b03831a..44d842733230 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -16,13 +16,21 @@
#include <linux/atomic.h>
#include <linux/pcsc.h>
+#include <linux/sysfs.h>
+
+static bool pcsc_enabled;
+static int __init pcsc_enabled_setup(char *str)
+{
+ return kstrtobool(str, &pcsc_enabled) == 0;
+}
+__setup("pcsc_enabled=", pcsc_enabled_setup);
static bool pcsc_initialised;
static atomic_t num_nodes = ATOMIC_INIT(0);
inline bool pcsc_is_initialised(void)
{
- return pcsc_initialised;
+ return pcsc_initialised && pcsc_enabled;
}
static int pcsc_add_bus(struct pci_bus *bus)
@@ -899,14 +907,95 @@ static struct notifier_block pcsc_bus_nb = {
.notifier_call = pcsc_bus_notify,
};
+static ssize_t pcsc_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", pcsc_enabled);
+}
+
+static ssize_t pcsc_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf,
+ size_t count)
+{
+ bool new_value;
+ int ret;
+
+ ret = kstrtobool(buf, &new_value);
+ if (ret < 0)
+ return ret;
+
+ pcsc_enabled = new_value;
+ return count;
+}
+
+static struct kobj_attribute pcsc_enabled_attribute =
+ __ATTR(enabled, 0644, pcsc_enabled_show, pcsc_enabled_store);
+
+static struct attribute *pcsc_attrs[] = {
+ &pcsc_enabled_attribute.attr,
+ NULL,
+};
+
+static struct attribute_group pcsc_attr_group = {
+ .attrs = pcsc_attrs,
+};
+
+static struct kobject *pcsc_kobj;
+
+static void pcsc_create_sysfs(void)
+{
+ struct kset *pci_bus_kset;
+ int ret;
+
+ if (pcsc_kobj)
+ return; /* Already created */
+
+ pci_bus_kset = bus_get_kset(&pci_bus_type);
+ if (!pci_bus_kset) {
+ /* PCI bus kset not ready yet, will be retried later */
+ return;
+ }
+
+ pcsc_kobj = kobject_create_and_add("pcsc", &pci_bus_kset->kobj);
+ if (!pcsc_kobj) {
+ pr_err("Failed to create sysfs kobject\n");
+ return;
+ }
+
+ ret = sysfs_create_group(pcsc_kobj, &pcsc_attr_group);
+ if (ret) {
+ pr_err("Failed to create sysfs group\n");
+ kobject_put(pcsc_kobj);
+ pcsc_kobj = NULL;
+ return;
+ }
+}
+
static int __init pcsc_init(void)
{
bus_register_notifier(&pci_bus_type, &pcsc_bus_nb);
+ /* Try to create sysfs entry, but don't fail if PCI bus isn't ready yet */
+ pcsc_create_sysfs();
+
pcsc_initialised = true;
- pr_info("initialised\n");
+ pr_info("initialised (enabled=%d)\n", pcsc_enabled);
return 0;
}
+/* Late initcall to retry sysfs creation if it failed during core_initcall */
+static int __init pcsc_sysfs_init(void)
+{
+ pcsc_create_sysfs();
+ return 0;
+}
+
core_initcall(pcsc_init);
+
+/*
+ * The PCI subsystem is initialised later, therefore we need to add
+ * our sysfs entries later. This is done to avoid modifying the sysfs
+ * creation of the core pci driver.
+ */
+late_initcall(pcsc_sysfs_init);
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 06/13] pci: pcsc: handle device resets
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (4 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 14:49 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools Evangelos Petrongonas
` (6 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
The PCI Configuration Space Cache (PCSC) maintains cached values of
configuration space registers for performance optimization. When a PCI
device is reset or bus operations are dynamically changed, cached values
become stale and can cause incorrect behavior. This patch ensures cache
coherency by invalidating the PCSC cache in all scenarios where the
underlying configuration space values may have changed.
Device Reset Handling:
----------------------
When PCI devices are reset, their configuration space registers return
to default values. Add pcsc_device_reset() calls after all device reset
operations to invalidate stale cached values:
- Function Level Resets (FLR) in `pcie_flr()`
- Advanced Features FLR in `pci_af_flr()`
- Power Management resets (D3hot->D0 transition) in `pci_pm_reset()`
- Device-specific resets in `pci_dev_specific_reset()`
- D3cold power state transitions in `__pci_set_power_state()`
- ACPI-based resets in `pci_dev_acpi_reset()`
- Bus restore operations in `pci_bus_restore_locked()`
- Slot restore operations in `pci_slot_restore_locked()`
- Secondary bus resets in `pci_bridge_secondary_bus_reset()`
For secondary bus resets, `pcsc_reset_bus_recursively()` invalidates the
cache for all devices on the secondary bus and subordinate buses. This
also covers hotplug slot reset operations since `pciehp_reset_slot()`
calls `pci_bridge_secondary_bus_reset()`.
In addition, functions like `pci_dev_wait` are configured to bypass the
cahce and reads the actual HW values.
Dynamic Ops Changes:
--------------------
The patch also addresses cache consistency issues when bus operations
are dynamically changed via `pci_bus_set_ops()``. Different ops
implementations may return different values for the same registers, and
hardware state may have changed while using the different ops. This
commit resets the cache for all devices on the affected bus
Implementation Details:
-----------------------
The cache invalidation clears the cached_bitmask while preserving the
cacheable_bitmask, as the configuration space layout remains unchanged
after a reset. This allows the cache to be repopulated with fresh values
on subsequent configuration space accesses.
Known Limitations:
------------------
- There is currently a gap in handling PowerPC secondary bus resets, as
the architecture-specific `pcibios_reset_secondary_bus()` can bypass the
generic `pci_reset_secondary_bus()` where our cache invalidation occurs.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
drivers/pci/access.c | 13 ++++++---
drivers/pci/pci-acpi.c | 4 +++
drivers/pci/pci.c | 60 +++++++++++++++++++++++++++++++++++++++++-
drivers/pci/pcsc.c | 17 ++++++++++++
drivers/pci/quirks.c | 7 ++++-
include/linux/pcsc.h | 11 ++++++++
6 files changed, 106 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index b89e9210d330..0a5de8d76bfe 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -245,11 +245,16 @@ struct pci_ops *pci_bus_set_ops(struct pci_bus *bus, struct pci_ops *ops)
/*
* IMPORTANT: Dynamic ops changes after PCSC injection can lead to
* cache consistency issues if operations were performed that should
- * have invalidated the cache. We re-inject PCSC ops here, but the
- * caller is responsible for ensuring cache consistency if needed.
- * This will be fixed in a future commit, when PCSC resets are
- * introduced.
+ * have invalidated the cache. We must reset the cache for all
+ * devices on this bus to ensure consistency. (No need for recursive
+ * reset on subordinate buses)
*/
+ struct pci_dev *dev;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ if (dev->pcsc && dev->pcsc->cached_bitmask)
+ bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE);
+ }
pr_warn("PCSC: Dynamic ops change detected on bus %04x:%02x, resetting cache\n",
pci_domain_nr(bus), bus->number);
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 9369377725fa..0b638115c7c7 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -983,6 +983,10 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
return -ENOTTY;
}
+#ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+#endif
+
return 0;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f518cfa266b5..db940f8fd408 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -26,6 +26,7 @@
#include <linux/device.h>
#include <linux/pm_runtime.h>
#include <linux/pci_hotplug.h>
+#include <linux/pcsc.h>
#include <linux/vmalloc.h>
#include <asm/dma.h>
#include <linux/aer.h>
@@ -1248,11 +1249,19 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
}
if (root && root->config_rrs_sv) {
+#ifdef CONFIG_PCSC
+ pcsc_hw_config_read(dev->bus, dev->devfn, PCI_VENDOR_ID, 4, &id);
+#else
pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
+#endif
if (!pci_bus_rrs_vendor_id(id))
break;
} else {
+#ifdef CONFIG_PCSC
+ pcsc_hw_config_read(dev->bus, dev->devfn, PCI_COMMAND, 4, &id);
+#else
pci_read_config_dword(dev, PCI_COMMAND, &id);
+#endif
if (!PCI_POSSIBLE_ERROR(id))
break;
}
@@ -1564,7 +1573,9 @@ static int __pci_set_power_state(struct pci_dev *dev, pci_power_t state, bool lo
if (pci_platform_power_transition(dev, PCI_D3cold))
return error;
-
+ #ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+ #endif
/* Powering off a bridge may power off the whole hierarchy */
if (dev->current_state == PCI_D3cold)
__pci_bus_set_current_state(dev->subordinate, PCI_D3cold, locked);
@@ -4493,6 +4504,10 @@ int pcie_flr(struct pci_dev *dev)
*/
msleep(100);
+#ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+#endif
+
return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4560,6 +4575,10 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
*/
msleep(100);
+#ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+#endif
+
return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
}
@@ -4605,6 +4624,10 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
pci_dev_d3_sleep(dev);
+#ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+#endif
+
return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
}
@@ -4904,6 +4927,31 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
PCIE_RESET_READY_POLL_MS - delay);
}
+#ifdef CONFIG_PCSC
+/**
+ * pcsc_reset_bus_recursively - Recursively reset PCSC cache for all devices
+ * in bus hierarchy
+ * @bus: PCI bus to process
+ *
+ * Recursively invalidate PCSC cache for all devices on the given bus
+ * and all subordinate buses.
+ */
+static void pcsc_reset_bus_recursively(struct pci_bus *bus)
+{
+ struct pci_dev *dev;
+
+ if (!bus)
+ return;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ pcsc_device_reset(dev);
+ /* If this device is a bridge, recursively process its subordinate bus */
+ if (dev->subordinate)
+ pcsc_reset_bus_recursively(dev->subordinate);
+ }
+}
+#endif
+
void pci_reset_secondary_bus(struct pci_dev *dev)
{
u16 ctrl;
@@ -4920,6 +4968,10 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+
+#ifdef CONFIG_PCSC
+ pcsc_reset_bus_recursively(dev->subordinate);
+#endif
}
void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
@@ -5542,6 +5594,9 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
list_for_each_entry(dev, &bus->devices, bus_list) {
pci_dev_restore(dev);
+#ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+#endif
if (dev->subordinate) {
pci_bridge_wait_for_secondary_bus(dev, "bus reset");
pci_bus_restore_locked(dev->subordinate);
@@ -5579,6 +5634,9 @@ static void pci_slot_restore_locked(struct pci_slot *slot)
if (!dev->slot || dev->slot != slot)
continue;
pci_dev_restore(dev);
+#ifdef CONFIG_PCSC
+ pcsc_device_reset(dev);
+#endif
if (dev->subordinate) {
pci_bridge_wait_for_secondary_bus(dev, "slot reset");
pci_bus_restore_locked(dev->subordinate);
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 44d842733230..5412dea23446 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -837,6 +837,23 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
}
EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
+int pcsc_device_reset(struct pci_dev *dev)
+{
+ if (unlikely((!dev)))
+ return -EINVAL;
+
+ if (unlikely(!pcsc_is_initialised()))
+ return 0;
+
+ /* The layout of the CFG Space is not going to change after a device
+ * reset, whether the reset is FLR or conventional. Only the values
+ * are going to change. We could further optimise the cache to maintain
+ * some of the HWInt values that are going to remain constant after a reset.
+ */
+ bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE);
+ return 0;
+}
+
static struct pci_ops pcsc_ops = {
.add_bus = pcsc_add_bus,
.remove_bus = pcsc_remove_bus,
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6eb3d20386e9..97555fbba938 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4239,8 +4239,13 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
if ((i->vendor == dev->vendor ||
i->vendor == (u16)PCI_ANY_ID) &&
(i->device == dev->device ||
- i->device == (u16)PCI_ANY_ID))
+ i->device == (u16)PCI_ANY_ID)) {
+#ifdef CONFIG_PCSC
+ if (!probe)
+ pcsc_device_reset(dev);
+#endif
return i->reset(dev, probe);
+ }
}
return -ENOTTY;
diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
index 516d73931608..85471273c0a9 100644
--- a/include/linux/pcsc.h
+++ b/include/linux/pcsc.h
@@ -121,4 +121,15 @@ int pcsc_remove_device(struct pci_dev *dev);
*/
bool pcsc_is_initialised(void);
+/**
+ * pcsc_device_reset - Handle PCI device reset
+ * @dev: PCI device being reset
+ *
+ * This function should be called when a PCI device is being reset
+ * to ensure the cache is properly invalidated.
+ *
+ * Returns: 0 on success, negative error code on failure
+ */
+int pcsc_device_reset(struct pci_dev *dev);
+
#endif /* _LINUX_PCSC_H */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (5 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 06/13] pci: pcsc: handle device resets Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 14:56 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 08/13] pci: Save only spec-defined configuration space Evangelos Petrongonas
` (5 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Introduce optional statistics gathering for the PCI Configuration Space
Cache to measure cache effectiveness and performance impact.
When CONFIG_PCSC_STATS is enabled, the implementation tracks:
- Cache hits and misses
- Uncacheable reads
- Write operations and cache invalidations
- Total reads and hardware reads
- Time spent in cache vs hardware accesses
- Number of Device Resets
Statistics are exposed via /sys/bus/pci/pcsc/stats in a human-readable
format including calculated hit rates and access times in microseconds.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
Documentation/ABI/testing/sysfs-bus-pci-pcsc | 9 +
drivers/pci/Kconfig | 7 +
drivers/pci/pcsc.c | 183 ++++++++++++++++++-
3 files changed, 196 insertions(+), 3 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-pcsc b/Documentation/ABI/testing/sysfs-bus-pci-pcsc
index ee92bf087816..daf0d06c89c8 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-pcsc
+++ b/Documentation/ABI/testing/sysfs-bus-pci-pcsc
@@ -18,3 +18,12 @@ Description:
is utilizing the cache, while when on "0" the
system bypasses it. This setting can also be controlled
parameter.
+
+What: /sys/bus/pci/pcsc/stats
+Date: March 2025
+Contact: Evangelos Petrongonas <epetron@amazon.de>
+Description:
+ PCI Configuration Space Cache (PCSC) if the PCSC
+ Statistics are enabled via the PCSC_STATS
+ configuration option, the statistics can be recovered
+ via reading this sysfs.
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index c26162b58365..9b5275ef2d16 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -50,6 +50,13 @@ config PCSC
intercepts configuration space operations and maintains cached
copies of register values
+config PCSC_STATS
+ bool "PCI Configuration Space Cache Statistics"
+ depends on PCSC
+ default n
+ help
+ This option allows the collection of statistics for the PCSC.
+
source "drivers/pci/pcie/Kconfig"
config PCI_MSI
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 5412dea23446..304239b7ff8a 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -25,9 +25,84 @@ static int __init pcsc_enabled_setup(char *str)
}
__setup("pcsc_enabled=", pcsc_enabled_setup);
+#ifdef CONFIG_PCSC_STATS
+struct pcsc_stats {
+ /* Operation Counters */
+ unsigned long cache_hits;
+ unsigned long cache_misses;
+ unsigned long uncachable_reads;
+ unsigned long writes;
+ unsigned long cache_invalidations;
+ unsigned long total_reads;
+ unsigned long hw_reads;
+ unsigned long device_resets;
+ u64 total_cache_access_time; /* in milliseconds */
+ u64 total_hw_access_time; /* in milliseconds */
+ u64 hw_access_time_due_to_misses; /* in milliseconds */
+};
+#endif
+
static bool pcsc_initialised;
static atomic_t num_nodes = ATOMIC_INIT(0);
+#ifdef CONFIG_PCSC_STATS
+struct pcsc_stats pcsc_stats;
+
+static inline void pcsc_count_cache_hit(void)
+{
+ pcsc_stats.cache_hits++;
+ pcsc_stats.total_reads++;
+}
+
+static inline void pcsc_count_cache_miss(void)
+{
+ pcsc_stats.cache_misses++;
+ pcsc_stats.total_reads++;
+ pcsc_stats.hw_reads++;
+}
+
+static inline void pcsc_count_uncachable_read(void)
+{
+ pcsc_stats.uncachable_reads++;
+ pcsc_stats.total_reads++;
+ pcsc_stats.hw_reads++;
+}
+
+static inline void pcsc_count_write(void)
+{
+ pcsc_stats.writes++;
+}
+
+static inline void pcsc_count_cache_invalidation(void)
+{
+ pcsc_stats.cache_invalidations++;
+}
+
+static inline void pcsc_count_device_reset(void)
+{
+ pcsc_stats.device_resets++;
+}
+#else
+static inline void pcsc_count_cache_hit(void)
+{
+}
+static inline void pcsc_count_cache_miss(void)
+{
+}
+static inline void pcsc_count_uncachable_read(void)
+{
+}
+static inline void pcsc_count_write(void)
+{
+}
+static inline void pcsc_count_cache_invalidation(void)
+{
+}
+static inline void pcsc_count_device_reset(void)
+{
+}
+#endif
+
inline bool pcsc_is_initialised(void)
{
return pcsc_initialised && pcsc_enabled;
@@ -727,6 +802,10 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
u32 word_cached = 0;
u8 byte_val;
int rc, i;
+#ifdef CONFIG_PCSC_STATS
+ ktime_t start_time;
+ u64 duration;
+#endif
if (WARN_ON(!dev || !bus || !word))
return -EINVAL;
@@ -734,7 +813,6 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
if (WARN_ON(size != 1 && size != 2 && size != 4))
return -EINVAL;
- /* Check bounds */
if (where + size > PCSC_CFG_SPC_SIZE)
return -EINVAL;
@@ -746,8 +824,17 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
pcsc_get_byte(dev, where + i, &byte_val);
word_cached |= ((u32)byte_val << (i * 8));
}
+ pcsc_count_cache_hit();
} else {
+#ifdef CONFIG_PCSC_STATS
+ start_time = ktime_get();
+#endif
rc = pcsc_hw_config_read(bus, devfn, where, size, &word_cached);
+#ifdef CONFIG_PCSC_STATS
+ duration = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+ pcsc_stats.hw_access_time_due_to_misses += duration;
+ pcsc_stats.total_hw_access_time += duration;
+#endif
if (rc) {
pci_err(dev,
"%s: Failed to read CFG Space where=%d size=%d",
@@ -762,6 +849,7 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
byte_val = (word_cached >> (i * 8)) & 0xFF;
pcsc_update_byte(dev, where + i, byte_val);
}
+ pcsc_count_cache_miss();
}
*word = word_cached;
@@ -773,6 +861,17 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
{
int rc;
struct pci_dev *dev;
+#ifdef CONFIG_PCSC_STATS
+ ktime_t hw_start_time;
+ u64 hw_duration;
+#endif
+
+#ifdef CONFIG_PCSC_STATS
+ u64 duration;
+ ktime_t start_time;
+
+ start_time = ktime_get();
+#endif
if (unlikely(!pcsc_is_initialised()))
goto read_from_dev;
@@ -790,6 +889,10 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
pcsc_is_access_cacheable(dev, where, size)) {
rc = pcsc_get_and_insert_multiple(dev, bus, devfn, where, val,
size);
+#ifdef CONFIG_PCSC_STATS
+ duration = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+ pcsc_stats.total_cache_access_time += duration;
+#endif
if (likely(!rc)) {
pci_dev_put(dev);
return 0;
@@ -797,11 +900,23 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
/* if reading from the cache failed continue and try reading
* from the actual device
*/
+ } else {
+ if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+ pcsc_count_uncachable_read();
}
read_from_dev:
+#ifdef CONFIG_PCSC_STATS
+ hw_start_time = ktime_get();
+#endif
if (dev)
pci_dev_put(dev);
- return pcsc_hw_config_read(bus, devfn, where, size, val);
+ rc = pcsc_hw_config_read(bus, devfn, where, size, val);
+#ifdef CONFIG_PCSC_STATS
+ hw_duration = ktime_to_ns(ktime_sub(ktime_get(), hw_start_time));
+ /* Add timing for uncacheable reads */
+ pcsc_stats.total_hw_access_time += hw_duration;
+#endif
+ return rc;
}
EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
@@ -810,6 +925,11 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
{
int i;
struct pci_dev *dev;
+ int rc;
+#ifdef CONFIG_PCSC_STATS
+ ktime_t hw_start_time;
+ u64 hw_duration;
+#endif
if (unlikely(!pcsc_is_initialised()))
goto write_to_dev;
@@ -828,12 +948,22 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
if (pcsc_is_access_cacheable(dev, where, size)) {
for (i = 0; i < size; i++)
pcsc_set_cached(dev, where + i, false);
+ pcsc_count_cache_invalidation();
}
}
write_to_dev:
+ pcsc_count_write();
if (dev)
pci_dev_put(dev);
- return pcsc_hw_config_write(bus, devfn, where, size, val);
+#ifdef CONFIG_PCSC_STATS
+ hw_start_time = ktime_get();
+#endif
+ rc = pcsc_hw_config_write(bus, devfn, where, size, val);
+#ifdef CONFIG_PCSC_STATS
+ hw_duration = ktime_to_ns(ktime_sub(ktime_get(), hw_start_time));
+ pcsc_stats.total_hw_access_time += hw_duration;
+#endif
+ return rc;
}
EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
@@ -851,6 +981,7 @@ int pcsc_device_reset(struct pci_dev *dev)
* some of the HWInt values that are going to remain constant after a reset.
*/
bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE);
+ pcsc_count_device_reset();
return 0;
}
@@ -948,8 +1079,50 @@ static ssize_t pcsc_enabled_store(struct kobject *kobj,
static struct kobj_attribute pcsc_enabled_attribute =
__ATTR(enabled, 0644, pcsc_enabled_show, pcsc_enabled_store);
+#ifdef CONFIG_PCSC_STATS
+static ssize_t pcsc_stats_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(
+ buf,
+ "Cache Hits: %lu\n"
+ "Cache Misses: %lu\n"
+ "Uncachable Reads: %lu\n"
+ "Writes: %lu\n"
+ "Cache Invalidations: %lu\n"
+ "Device Resets: %lu\n"
+ "Total Reads: %lu\n"
+ "Hardware Reads: %lu\n"
+ "Hit Rate: %lu%%\n"
+ "Total Cache Access Time: %llu us\n"
+ "Cache Access Time (without HW reads due to Misses): %llu us\n"
+ "HW Access Time due to misses: %llu us\n"
+ "Total Hardware Access Time: %llu us\n",
+ pcsc_stats.cache_hits, pcsc_stats.cache_misses,
+ pcsc_stats.uncachable_reads, pcsc_stats.writes,
+ pcsc_stats.cache_invalidations, pcsc_stats.device_resets,
+ pcsc_stats.total_reads,
+ pcsc_stats.hw_reads,
+ pcsc_stats.total_reads ?
+ (pcsc_stats.cache_hits * 100) / pcsc_stats.total_reads :
+ 0,
+ pcsc_stats.total_cache_access_time / 1000,
+ (pcsc_stats.total_cache_access_time -
+ pcsc_stats.hw_access_time_due_to_misses) /
+ 1000,
+ pcsc_stats.hw_access_time_due_to_misses / 1000,
+ pcsc_stats.total_hw_access_time / 1000);
+}
+
+static struct kobj_attribute pcsc_stats_attribute =
+ __ATTR(stats, 0444, pcsc_stats_show, NULL);
+#endif
+
static struct attribute *pcsc_attrs[] = {
&pcsc_enabled_attribute.attr,
+#ifdef CONFIG_PCSC_STATS
+ &pcsc_stats_attribute.attr,
+#endif
NULL,
};
@@ -995,6 +1168,10 @@ static int __init pcsc_init(void)
/* Try to create sysfs entry, but don't fail if PCI bus isn't ready yet */
pcsc_create_sysfs();
+#ifdef CONFIG_PCSC_STATS
+ memset(&pcsc_stats, 0, sizeof(pcsc_stats));
+#endif
+
pcsc_initialised = true;
pr_info("initialised (enabled=%d)\n", pcsc_enabled);
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 08/13] pci: Save only spec-defined configuration space
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (6 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-09 14:58 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 09/13] vfio: pci: Fill only spec-defined configuration space regions Evangelos Petrongonas
` (4 subsequent siblings)
12 siblings, 1 reply; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Change PCI configuration space save/restore operations by
saving only the regions defined by the PCI specification avoiding any
potential side effects of undefined behaviour.
The current implementation saves the entire configuration space for
device restore operations, including reserved and undefined regions.
This change modifies the save logic to save only architecturally defined
configuration space regions and skipping the undefined areas.
This benefits the PCSC hitrate, as a 4byte access to a region where only
2 bytes are cacheable and 2 are undefined, therefore uncached, will lead
to a HW access instead.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/pci/pci.c | 61 +++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 56 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index db940f8fd408..3e99baaaf8cd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1752,11 +1752,62 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
int pci_save_state(struct pci_dev *dev)
{
int i;
- /* XXX: 100% dword access ok here? */
- for (i = 0; i < 16; i++) {
- pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
- pci_dbg(dev, "save config %#04x: %#010x\n",
- i * 4, dev->saved_config_space[i]);
+
+ if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
+ for (i = 0; i < 13; i++) {
+ pci_read_config_dword(dev, i * 4,
+ &dev->saved_config_space[i]);
+ pci_dbg(dev,
+ "saving config space at offset %#x (reading %#x)\n",
+ i * 4, dev->saved_config_space[i]);
+ }
+ pci_read_config_byte(
+ dev, PCI_CAPABILITY_LIST,
+ (u8 *)(&dev->saved_config_space[PCI_CAPABILITY_LIST /
+ 4]) +
+ (PCI_CAPABILITY_LIST % 4));
+ pci_dbg(dev,
+ "saving config space at offset %#x (reading %#x)\n",
+ PCI_CAPABILITY_LIST,
+ dev->saved_config_space[PCI_CAPABILITY_LIST]);
+ pci_read_config_dword(
+ dev, PCI_INTERRUPT_LINE,
+ &dev->saved_config_space[PCI_INTERRUPT_LINE / 4]);
+ pci_dbg(dev,
+ "saving config space at offset %#x (reading %#x)\n",
+ PCI_INTERRUPT_LINE,
+ dev->saved_config_space[PCI_INTERRUPT_LINE]);
+ } else if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ for (i = 0; i < 13; i++) {
+ pci_read_config_dword(dev, i * 4,
+ &dev->saved_config_space[i]);
+ pci_dbg(dev,
+ "saving config space at offset %#x (reading %#x)\n",
+ i * 4, dev->saved_config_space[i]);
+ }
+ pci_read_config_byte(
+ dev, PCI_CAPABILITY_LIST,
+ (u8 *)(&dev->saved_config_space[PCI_CAPABILITY_LIST /
+ 4]) +
+ (PCI_CAPABILITY_LIST % 4));
+ pci_dbg(dev,
+ "saving config space at offset %#x (reading %#x)\n",
+ PCI_CAPABILITY_LIST,
+ dev->saved_config_space[PCI_CAPABILITY_LIST]);
+ for (i = 14; i < 16; i++) {
+ pci_read_config_dword(dev, i * 4,
+ &dev->saved_config_space[i]);
+ pci_dbg(dev,
+ "saving config space at offset %#x (reading %#x)\n",
+ i * 4, dev->saved_config_space[i]);
+ }
+ } else {
+ for (i = 0; i < 16; i++) {
+ pci_read_config_dword(dev, i * 4,
+ &dev->saved_config_space[i]);
+ pci_dbg(dev, "save config %#04x: %#010x\n", i * 4,
+ dev->saved_config_space[i]);
+ }
}
dev->state_saved = true;
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 09/13] vfio: pci: Fill only spec-defined configuration space regions
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (7 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 08/13] pci: Save only spec-defined configuration space Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 10/13] pci: pcsc: Use contiguous pages for the cache data Evangelos Petrongonas
` (3 subsequent siblings)
12 siblings, 0 replies; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Ammend VFIO PCI configuration space initialization by filling only
the regions defined by the PCI specification, avoiding unnecessary
reads from undefined or reserved areas.
The current implementation reads the entire configuration space during
initialization, including reserved regions that may not be implemented
by the device or may have side effects when accessed. This change
modifies vfio_fill_vconfig_bytes() skips reserved regions in the
standard configuration space header.
This benefits the PCSC hit rate, as 4byte access to a region where only
2 bytes are cacheable and 2 are undefined, therefore uncached, will lead
to a HW access instead.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/vfio/pci/vfio_pci_config.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4..4fc7156a77d1 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1485,7 +1485,18 @@ static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
while (size) {
int filled;
- if (size >= 4 && !(offset % 4)) {
+ if (offset == PCI_CAPABILITY_LIST) {
+ u8 *byte = &vdev->vconfig[offset];
+
+ ret = pci_read_config_byte(pdev, offset, byte);
+ if (ret)
+ return ret;
+ /* Skip the reserved area */
+ filled = 4;
+ } else if (offset == 0x38) {
+ /* Skip the reserved area */
+ filled = 4;
+ } else if (size >= 4 && !(offset % 4)) {
__le32 *dwordp = (__le32 *)&vdev->vconfig[offset];
u32 dword;
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 10/13] pci: pcsc: Use contiguous pages for the cache data
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (8 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 09/13] vfio: pci: Fill only spec-defined configuration space regions Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 11/13] pci: pcsc: Add kexec persistence support via KHO Evangelos Petrongonas
` (2 subsequent siblings)
12 siblings, 0 replies; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
This patch refactors PCSC to use a single contiguous memory block for
all per-device data. This improves memory allocation efficiency.
This is a preparatory step for KHO persistence support, as it is easier
to manipulate physically continuous pages.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/pci/pcsc.c | 28 ++++++++++++++++++++++++----
include/linux/pcsc.h | 32 ++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 304239b7ff8a..18d508f76649 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -725,6 +725,7 @@ int pcsc_add_device(struct pci_dev *dev)
{
struct pcsc_node *node;
struct pci_bus *bus;
+ size_t data_size;
if (WARN_ON(!dev))
return -EINVAL;
@@ -741,12 +742,27 @@ int pcsc_add_device(struct pci_dev *dev)
* nodes for these devices, as it simplifies the code flow
*/
if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
- dev->pcsc->cfg_space = kzalloc(PCSC_CFG_SPC_SIZE, GFP_KERNEL);
- if (!dev->pcsc->cfg_space)
+ /* Allocate contiguous, page aligned data block. This will be
+ * needed for persisting the data with KHO.
+ */
+ data_size = sizeof(struct pcsc_data);
+
+ dev->pcsc->data =
+ (struct pcsc_data *)__get_free_pages(
+ GFP_KERNEL | __GFP_ZERO, get_order(data_size));
+ if (!dev->pcsc->data)
+
goto err_free_node;
+ dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
+ dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
+ dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+
infer_cacheability(dev);
} else {
+ dev->pcsc->data = NULL;
+ dev->pcsc->cachable_bitmask = NULL;
+ dev->pcsc->cached_bitmask = NULL;
dev->pcsc->cfg_space = NULL;
}
@@ -771,8 +787,12 @@ int pcsc_remove_device(struct pci_dev *dev)
atomic_dec(&num_nodes);
- if (dev->pcsc && dev->pcsc->cfg_space) {
- kfree(dev->pcsc->cfg_space);
+ if (dev->pcsc && dev->pcsc->data) {
+ size_t data_size = sizeof(struct pcsc_data);
+ size_t total_size = PAGE_ALIGN(data_size);
+
+ free_pages((unsigned long)dev->pcsc->data,
+ get_order(total_size));
kfree(dev->pcsc);
}
dev->pcsc = NULL;
diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
index 85471273c0a9..88894f641257 100644
--- a/include/linux/pcsc.h
+++ b/include/linux/pcsc.h
@@ -18,12 +18,40 @@
#define PCSC_CFG_SPC_SIZE 256
#endif
-struct pcsc_node {
- u8 *cfg_space;
+/*
+ * struct pcsc__data - Continuous data block for PCSC
+ *
+ * This structure contains all the PCSC data in a single continuous
+ * memory block.
+ *
+ * @cfg_space: Configuration space cache
+ * @cachable_bitmask: Bitmap of cacheable configuration space offsets
+ * @cached_bitmask: Bitmap of cached configuration space offsets
+ */
+struct pcsc_data {
+ u8 cfg_space[PCSC_CFG_SPC_SIZE];
DECLARE_BITMAP(cachable_bitmask, PCSC_CFG_SPC_SIZE);
DECLARE_BITMAP(cached_bitmask, PCSC_CFG_SPC_SIZE);
};
+/*
+ * struct pcsc_node - PCSC node structure
+ *
+ * This structure represents a PCSC node for a PCI device.
+ * It contains pointers into the data block for convenient access.
+ *
+ * @data: Pointer to the continuous data block
+ * @cachable_bitmask: Pointer to cachable_bitmask in data
+ * @cached_bitmask: Pointer to cached_bitmask in data
+ * @cfg_space: Pointer to cfg_space in data
+ */
+struct pcsc_node {
+ struct pcsc_data *data; /* Pointer to continuous data block */
+ unsigned long *cachable_bitmask; /* Convenience pointer into data */
+ unsigned long *cached_bitmask; /* Convenience pointer into data */
+ u8 *cfg_space; /* Convenience pointer into data */
+};
+
/**
* pcsc_hw_config_read - Direct hardware PCI config space read
* @bus: PCI bus
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 11/13] pci: pcsc: Add kexec persistence support via KHO
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (9 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 10/13] pci: pcsc: Use contiguous pages for the cache data Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 12/13] pci: pcsc: introduce persistence versioning Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 13/13] pci: pcsc: introduce hashtable lookup to speed up restoration Evangelos Petrongonas
12 siblings, 0 replies; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
Add support for preserving PCI Configuration Space Cache (PCSC) data
across kexec operations using the Kexec Handover (KHO) framework. This
allows the cached PCI configuration data to survive kexec, eliminating
the need to re-probe PCI configuration space after kexec, which can
significantly reduce boot time on systems with many PCI devices.
To enable PCSC persistence, the kernel must be built with
CONFIG_PCSC_KHO enabled, which depends on both `CONFIG_PCSC` and
`CONFIG_KEXEC_HANDOVER`. When enabled, persistence can be controlled at
runtime using the 'pcsc_persistence_enabled' kernel parameter. By
default, persistence is disabled, but it can be enabled by passing
'pcsc_persistence_enabled=0' on the kernel command line.
During kexec preparation, the implementation iterates through all PCI
devices and saves the PCSC data for endpoint devices (header type 0). It
creates a Flattened Device Tree (FDT) structure containing device
information and physical addresses of the preserved data. The physical
memory pages containing PCSC data are preserved through KHO, and the FDT
is added to the KHO tree for the new kernel to discover.
After kexec, during PCI device initialization, the implementation checks
if KHO data is available for each device being initialized. If found, it
restores the cached configuration space data, avoiding the need to
re-probe the device. The implementation tracks timing statistics to
measure the performance benefits of this optimization.
Performance metrics are collected and reported, showing both the time
taken to save devices during kexec and the time saved during restore in
the new kernel. This helps quantify the boot time improvements.
The implementation handles error cases gracefully, falling back to
normal PCSC initialization if KHO data is not available or corrupted.
This ensures that the system remains functional even if persistence
cannot be achieved.
The time complexity of this implementation is O(n^2), where n is the
number of restored devices, as for every device the FDT needs to be
traversed again. This will be improved in a future patch
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
.../admin-guide/kernel-parameters.txt | 4 +
drivers/pci/Kconfig | 10 +
drivers/pci/pcsc.c | 389 +++++++++++++++++-
3 files changed, 386 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 08c7a13f107c..39f71e27df2d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5039,6 +5039,10 @@
pcsc_enabled= [PCSC] enable the use of the PCI Configuration Space
Cache (PCSC).
+ pcsc_persistence_enabled= [PCSC] enable the persistence over kexec
+ using KHO of the PCI Configuration Space Cache Data. For more
+ information seen drivers/pci/pcsc.c
+
pd_ignore_unused
[PM]
Keep all power-domains already enabled by bootloader on,
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 9b5275ef2d16..0eb189ad526b 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -57,6 +57,16 @@ config PCSC_STATS
help
This option allows the collection of statistics for the PCSC.
+config PCSC_KHO
+ bool "PCI Configuration Space Cache persist data over kexec"
+ depends on PCSC
+ depends on KEXEC_HANDOVER
+ default n
+ help
+ This option enables the persistence of the cache data over kexec
+ using Kexec Handover KHO. For more information, check
+ `drivers/pci/pcsc.c'
+
source "drivers/pci/pcie/Kconfig"
config PCI_MSI
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 18d508f76649..0c4ae73744d6 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -17,6 +17,11 @@
#include <linux/atomic.h>
#include <linux/pcsc.h>
#include <linux/sysfs.h>
+#include <linux/kexec_handover.h>
+#include <linux/libfdt.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
static bool pcsc_enabled;
static int __init pcsc_enabled_setup(char *str)
@@ -25,6 +30,16 @@ static int __init pcsc_enabled_setup(char *str)
}
__setup("pcsc_enabled=", pcsc_enabled_setup);
+static bool pcsc_persistence_enabled;
+static int __init pcsc_persistence_enabled_setup(char *str)
+{
+ return kstrtobool(str, &pcsc_persistence_enabled) == 0;
+}
+__setup("pcsc_persistence_enabled=", pcsc_persistence_enabled_setup);
+
+#define PCSC_KHO_FDT "pcsc"
+#define PCSC_KHO_NODE_COMPATIBLE "pcsc-v1"
+
#ifdef CONFIG_PCSC_STATS
struct pcsc_stats {
/* Operation Counters */
@@ -39,6 +54,10 @@ struct pcsc_stats {
u64 total_cache_access_time; /* in milliseconds */
u64 total_hw_access_time; /* in milliseconds */
u64 hw_access_time_due_to_misses; /* in milliseconds */
+#ifdef CONFIG_PCSC_KHO
+ u64 pcsc_kho_total_restore_time_ns;
+ u32 pcsc_kho_restored_device_count;
+#endif
};
#endif
@@ -82,6 +101,12 @@ static inline void pcsc_count_device_reset(void)
{
pcsc_stats.device_resets++;
}
+#ifdef CONFIG_PCSC_KHO
+static inline void pcsc_count_restored_devices(void)
+{
+ pcsc_stats.pcsc_kho_restored_device_count++;
+}
+#endif
#else
static inline void pcsc_count_cache_hit(void)
{
@@ -101,6 +126,11 @@ static inline void pcsc_count_cache_invalidation(void)
static inline void pcsc_count_device_reset(void)
{
}
+#ifdef CONFIG_PCSC_KHO
+static inline void pcsc_count_restored_devices(void)
+{
+}
+#endif
#endif
inline bool pcsc_is_initialised(void)
@@ -721,6 +751,288 @@ static void infer_cacheability(struct pci_dev *dev)
}
}
+#ifdef CONFIG_PCSC_KHO
+static struct page *pcsc_kho_fdt;
+static int pcsc_kho_fdt_order;
+
+static int pcsc_kho_save_device(struct pci_dev *dev, void *fdt)
+{
+ char node_name[32];
+ size_t data_size, total_size;
+ u64 data_addr;
+ int err = 0;
+
+ if (!dev->pcsc || !dev->pcsc->data)
+ return 1;
+
+ if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+ return 1;
+
+ /* Create FDT node for this device - node name contains device identifer */
+ snprintf(node_name, sizeof(node_name), "dev_%04x_%02x_%02x_%x",
+ pci_domain_nr(dev->bus), dev->bus->number,
+ PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+
+ err = fdt_begin_node(fdt, node_name);
+ if (err) {
+ pci_err(dev, "PCSC: Failed to begin FDT node '%s': %d\n",
+ node_name, err);
+ return err;
+ }
+
+ data_size = sizeof(struct pcsc_data);
+ total_size = PAGE_ALIGN(data_size);
+
+ data_addr = virt_to_phys(dev->pcsc->data);
+ err = kho_preserve_phys(data_addr, total_size);
+ if (err) {
+ pci_err(dev, "PCSC: Failed to preserve data buffer: %d\n", err);
+ return err;
+ }
+
+ err = fdt_property(fdt, "da", &data_addr, sizeof(data_addr));
+ if (err) {
+ pci_err(dev, "PCSC: Failed to set da property: %d\n",
+ err);
+ return err;
+ }
+
+ err = fdt_end_node(fdt);
+ if (err) {
+ pci_err(dev, "PCSC: Failed to end FDT node: %d\n", err);
+ return err;
+ }
+
+ return 0;
+}
+
+static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
+ void *v)
+{
+ struct kho_serialization *ser = v;
+ struct pci_dev *dev = NULL;
+ void *fdt;
+ int err = 0;
+ size_t fdt_size;
+ u32 dev_count = 0;
+ u32 eligible_count = 0;
+ u32 saved_count = 0;
+ u32 skipped_count = 0;
+
+ switch (cmd) {
+ case KEXEC_KHO_ABORT:
+ if (pcsc_kho_fdt) {
+ __free_pages(pcsc_kho_fdt, pcsc_kho_fdt_order);
+ pcsc_kho_fdt = NULL;
+ }
+ return NOTIFY_DONE;
+ case KEXEC_KHO_FINALIZE:
+ /* Handled below */
+ break;
+ default:
+ return NOTIFY_BAD;
+ }
+
+#ifdef CONFIG_PCSC_STATS
+ ktime_t start_time = ktime_get();
+#endif
+
+ for_each_pci_dev(dev) {
+ dev_count++;
+ if (dev->pcsc && dev->pcsc->cfg_space &&
+ dev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+ eligible_count++;
+ }
+
+ pr_info("Total PCI devices: %u, eligible for save: %u\n",
+ dev_count, eligible_count);
+
+ if (eligible_count == 0)
+ return NOTIFY_DONE;
+
+ /* Allocate FDT with size calculation (conservative estimates):
+ * - Per device: node_name(~20) + node_overhead(~12) + da_property(~20)
+ * = ~52 bytes, round up to 64 for alignment/margin
+ * - Fixed overhead: header(40) + root_node(~40) + strings_table(~30)
+ * + misc(~32) = ~144 bytes, round up to 256
+ */
+ fdt_size = PAGE_ALIGN((eligible_count * 64 + 256));
+ pcsc_kho_fdt_order = get_order(fdt_size);
+ pcsc_kho_fdt = alloc_pages(GFP_KERNEL, pcsc_kho_fdt_order);
+ if (!pcsc_kho_fdt) {
+ pr_err("PCSC: Failed to allocate FDT pages (size=%zu, order=%d)\n",
+ fdt_size, pcsc_kho_fdt_order);
+ return NOTIFY_BAD;
+ }
+
+ fdt = page_to_virt(pcsc_kho_fdt);
+
+ /* Create FDT */
+ err = fdt_create(fdt, fdt_size);
+ if (err) {
+ pr_err("PCSC: Failed to create FDT: %d\n", err);
+ goto error_cleanup;
+ }
+
+ err = fdt_finish_reservemap(fdt);
+ if (err) {
+ pr_err("PCSC: Failed to finish FDT reservemap: %d\n", err);
+ goto error_cleanup;
+ }
+
+ err = fdt_begin_node(fdt, "");
+ if (err) {
+ pr_err("PCSC: Failed to begin root FDT node: %d\n", err);
+ goto error_cleanup;
+ }
+
+ err = fdt_property_string(fdt, "compatible", PCSC_KHO_NODE_COMPATIBLE);
+ if (err) {
+ pr_err("PCSC: Failed to set compatible property: %d\n", err);
+ goto error_cleanup;
+ }
+
+ for_each_pci_dev(dev) {
+ int save_err = pcsc_kho_save_device(dev, fdt);
+
+ if (save_err == 0) {
+ saved_count++;
+ } else if (save_err == 1) {
+ /* Skipped (not eligible) */
+ skipped_count++;
+ } else {
+ pr_err("Failed to save device %04x:%02x:%02x.%d: %d\n",
+ pci_domain_nr(dev->bus), dev->bus->number,
+ PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+ save_err);
+ break;
+ }
+ }
+
+ err = fdt_end_node(fdt);
+ if (err) {
+ pr_err("Failed to end root FDT node: %d\n", err);
+ goto error_cleanup;
+ }
+
+ err = fdt_finish(fdt);
+ if (err) {
+ pr_err("Failed to finish FDT: %d\n", err);
+ goto error_cleanup;
+ }
+
+ int fdt_final_size = fdt_totalsize(fdt);
+ int num_pages = PAGE_ALIGN(fdt_final_size) / PAGE_SIZE;
+
+ err = kho_preserve_phys(page_to_phys(pcsc_kho_fdt),
+ num_pages * PAGE_SIZE);
+ if (err) {
+ pr_err("Failed to preserve FDT pages: %d\n", err);
+ goto error_cleanup;
+ }
+
+ err = kho_add_subtree(ser, PCSC_KHO_FDT, fdt);
+ if (err) {
+ pr_err("Failed to add FDT to KHO tree: %d\n", err);
+ goto error_cleanup;
+ }
+
+#ifdef CONFIG_PCSC_STATS
+ ktime_t end_time = ktime_get();
+ u64 duration_ns = ktime_to_ns(ktime_sub(end_time, start_time));
+ u64 duration_us = duration_ns / 1000;
+
+ pr_info("Saved %u devices to KHO in %llu us (%llu.%03llu ms)\n",
+ saved_count, duration_us, duration_us / 1000,
+ duration_us % 1000);
+#endif
+ return NOTIFY_DONE;
+
+error_cleanup:
+ pr_err("KHO save failed with error %d\n", err);
+ __free_pages(pcsc_kho_fdt, pcsc_kho_fdt_order);
+ pcsc_kho_fdt = NULL;
+ return NOTIFY_BAD;
+}
+
+static struct notifier_block pcsc_kho_nb = {
+ .notifier_call = pcsc_kho_notifier,
+};
+
+static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
+ int node)
+{
+ const struct pcsc_data *preserved_data;
+ const u64 *data_addr;
+ int len;
+
+ data_addr = fdt_getprop(fdt, node, "da", &len);
+ if (!data_addr || len != sizeof(*data_addr))
+ return false;
+
+ preserved_data = phys_to_virt(*data_addr);
+ if (!preserved_data)
+ return false;
+
+
+ dev->pcsc->data = (struct pcsc_data *)preserved_data;
+ dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
+ dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
+ dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+
+ return true;
+}
+
+static bool pcsc_kho_check_restore(struct pci_dev *dev)
+{
+ phys_addr_t fdt_phys;
+ const void *fdt;
+ int node, err;
+ bool restored = false;
+ char node_name[32];
+#ifdef CONFIG_PCSC_STATS
+ ktime_t start_time, end_time;
+ u64 duration_ns;
+#endif
+
+ err = kho_retrieve_subtree(PCSC_KHO_FDT, &fdt_phys);
+ if (err) {
+ pci_dbg(dev, "PCSC: kho_retrieve_subtree failed: %d\n", err);
+ return false;
+ }
+
+ fdt = phys_to_virt(fdt_phys);
+ if (fdt_node_check_compatible(fdt, 0, PCSC_KHO_NODE_COMPATIBLE)) {
+ pci_dbg(dev, "PCSC: FDT node not compatible\n");
+ return false;
+ }
+
+#ifdef CONFIG_PCSC_STATS
+ start_time = ktime_get();
+#endif
+
+ snprintf(node_name, sizeof(node_name), "dev_%04x_%02x_%02x_%x",
+ pci_domain_nr(dev->bus), dev->bus->number,
+ PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+
+ node = fdt_subnode_offset(fdt, 0, node_name);
+ if (node >= 0)
+ restored = pcsc_kho_restore_device(dev, fdt, node);
+
+#ifdef CONFIG_PCSC_STATS
+ if (restored) {
+ end_time = ktime_get();
+ duration_ns = ktime_to_ns(ktime_sub(end_time, start_time));
+
+ pcsc_stats.pcsc_kho_total_restore_time_ns += duration_ns;
+ pcsc_count_restored_devices();
+ }
+#endif
+
+ return restored;
+}
+#endif
+
int pcsc_add_device(struct pci_dev *dev)
{
struct pcsc_node *node;
@@ -742,23 +1054,34 @@ int pcsc_add_device(struct pci_dev *dev)
* nodes for these devices, as it simplifies the code flow
*/
if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
- /* Allocate contiguous, page aligned data block. This will be
- * needed for persisting the data with KHO.
- */
- data_size = sizeof(struct pcsc_data);
-
- dev->pcsc->data =
- (struct pcsc_data *)__get_free_pages(
- GFP_KERNEL | __GFP_ZERO, get_order(data_size));
- if (!dev->pcsc->data)
+#ifdef CONFIG_PCSC_KHO
+ bool restored = false;
- goto err_free_node;
+ /* Try to restore from KHO first, before any allocation */
+ if (pcsc_persistence_enabled && kho_is_enabled())
+ restored = pcsc_kho_check_restore(dev);
- dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
- dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
- dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
-
- infer_cacheability(dev);
+ if (!restored) {
+#endif
+ /* Allocate contiguous, page aligned data block. This is
+ * needed for persisting the data with KHO.
+ */
+ data_size = sizeof(struct pcsc_data);
+
+ dev->pcsc->data =
+ (struct pcsc_data *)__get_free_pages(
+ GFP_KERNEL | __GFP_ZERO, get_order(data_size));
+ if (!dev->pcsc->data)
+ goto err_free_node;
+
+ dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
+ dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
+ dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+
+ infer_cacheability(dev);
+#ifdef CONFIG_PCSC_KHO
+ }
+#endif
} else {
dev->pcsc->data = NULL;
dev->pcsc->cachable_bitmask = NULL;
@@ -1103,7 +1426,9 @@ static struct kobj_attribute pcsc_enabled_attribute =
static ssize_t pcsc_stats_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return sysfs_emit(
+ ssize_t ret;
+
+ ret = sysfs_emit(
buf,
"Cache Hits: %lu\n"
"Cache Misses: %lu\n"
@@ -1132,6 +1457,20 @@ static ssize_t pcsc_stats_show(struct kobject *kobj,
1000,
pcsc_stats.hw_access_time_due_to_misses / 1000,
pcsc_stats.total_hw_access_time / 1000);
+
+#ifdef CONFIG_PCSC_KHO
+ u64 total_restore_time_us = pcsc_stats.pcsc_kho_total_restore_time_ns / 1000;
+
+ ret += sysfs_emit_at(buf, ret,
+ "KHO Restore Statistics:\n"
+ " Restored Devices: %u\n"
+ " Total Restore Time: %llu us\n",
+ pcsc_stats.pcsc_kho_restored_device_count,
+ total_restore_time_us);
+
+#endif
+
+ return ret;
}
static struct kobj_attribute pcsc_stats_attribute =
@@ -1183,6 +1522,10 @@ static void pcsc_create_sysfs(void)
static int __init pcsc_init(void)
{
+#ifdef CONFIG_PCSC_KHO
+ int ret;
+#endif
+
bus_register_notifier(&pci_bus_type, &pcsc_bus_nb);
/* Try to create sysfs entry, but don't fail if PCI bus isn't ready yet */
@@ -1192,8 +1535,20 @@ static int __init pcsc_init(void)
memset(&pcsc_stats, 0, sizeof(pcsc_stats));
#endif
+#ifdef CONFIG_PCSC_KHO
+ /* Register KHO notifier if persistence is enabled */
+ if (pcsc_persistence_enabled && kho_is_enabled()) {
+ ret = register_kho_notifier(&pcsc_kho_nb);
+ if (ret == 0)
+ pr_info("KHO notifier registered successfully\n");
+ else
+ pr_err("Failed to register KHO notifier: %d\n", ret);
+ }
+#endif /* CONFIG_PCSC_KHO */
+
pcsc_initialised = true;
- pr_info("initialised (enabled=%d)\n", pcsc_enabled);
+ pr_info("initialised (enabled=%d, persistence=%d)\n",
+ pcsc_enabled, pcsc_persistence_enabled);
return 0;
}
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 12/13] pci: pcsc: introduce persistence versioning
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (10 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 11/13] pci: pcsc: Add kexec persistence support via KHO Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 13/13] pci: pcsc: introduce hashtable lookup to speed up restoration Evangelos Petrongonas
12 siblings, 0 replies; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
The cacheability of the devices can change, either because registers
need to be removed due to a bug / non-compliant device, or adding more
cacheable registers. The current version does not support the
concept of changing the `is_cacheable` bitmask and in order to do so,
the whole persistence mechanism needs to be disabled, change the bitmask
and reenable it again. This adds maintenance difficulty, as well as
negatively impacting the cache hit rate.
This commit adds a mechanism to handle those changes more efficiently.
A version number is identified and it is stored in the FDT during the
save process in the outgoing kernel. The incoming kernel, if compiled
with a different persistence version, will re-infer the cacheability of
all the saved devices, without touching the `is_cached` or the actual
configuration space saved data. This is safe as all access to the cache
are guarded by the `is_cacheable` bitmask. As a result changing the
cacheability will only change the differing registers, while the rest
of the cache will remain valid.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/pci/pcsc.c | 40 +++++++++++++++++++++++++++++++++++-----
1 file changed, 35 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 0c4ae73744d6..8ff91ed24a37 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -23,6 +23,9 @@
#include <linux/vmalloc.h>
#include <linux/slab.h>
+/* PCSC persistent data version - increment when the cacheability is changed */
+#define PCSC_PERSISTENT_VERSION 1U
+
static bool pcsc_enabled;
static int __init pcsc_enabled_setup(char *str)
{
@@ -818,6 +821,7 @@ static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
u32 eligible_count = 0;
u32 saved_count = 0;
u32 skipped_count = 0;
+ const u32 version = PCSC_PERSISTENT_VERSION;
switch (cmd) {
case KEXEC_KHO_ABORT:
@@ -853,8 +857,8 @@ static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
/* Allocate FDT with size calculation (conservative estimates):
* - Per device: node_name(~20) + node_overhead(~12) + da_property(~20)
* = ~52 bytes, round up to 64 for alignment/margin
- * - Fixed overhead: header(40) + root_node(~40) + strings_table(~30)
- * + misc(~32) = ~144 bytes, round up to 256
+ * - Fixed overhead: header(40) + root_node(~48) + strings_table(~30)
+ * + misc(~32) = ~152 bytes, round up to 256
*/
fdt_size = PAGE_ALIGN((eligible_count * 64 + 256));
pcsc_kho_fdt_order = get_order(fdt_size);
@@ -892,6 +896,12 @@ static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
goto error_cleanup;
}
+ err = fdt_property(fdt, "pv", &version, sizeof(version));
+ if (err) {
+ pr_err("PCSC: Failed to set version property: %d\n", err);
+ goto error_cleanup;
+ }
+
for_each_pci_dev(dev) {
int save_err = pcsc_kho_save_device(dev, fdt);
@@ -960,7 +970,7 @@ static struct notifier_block pcsc_kho_nb = {
};
static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
- int node)
+ int node, bool version_mismatch)
{
const struct pcsc_data *preserved_data;
const u64 *data_addr;
@@ -980,6 +990,9 @@ static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+ if (version_mismatch)
+ infer_cacheability(dev);
+
return true;
}
@@ -987,9 +1000,12 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
{
phys_addr_t fdt_phys;
const void *fdt;
- int node, err;
+ int node, err, len;
bool restored = false;
+ bool version_mismatch = false;
char node_name[32];
+ const u32 *version_ptr;
+ u32 saved_version;
#ifdef CONFIG_PCSC_STATS
ktime_t start_time, end_time;
u64 duration_ns;
@@ -1007,6 +1023,20 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
return false;
}
+ version_ptr = fdt_getprop(fdt, 0, "pv", &len);
+ if (version_ptr && len == sizeof(*version_ptr)) {
+ saved_version = *version_ptr;
+ if (saved_version != PCSC_PERSISTENT_VERSION)
+ version_mismatch = true;
+
+ } else {
+ /* No version found, assume version 0 */
+ pci_info(
+ dev,
+ "PCSC: No version found in restored data. Re-infer Cacheability.\n");
+ version_mismatch = true;
+ }
+
#ifdef CONFIG_PCSC_STATS
start_time = ktime_get();
#endif
@@ -1017,7 +1047,7 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
node = fdt_subnode_offset(fdt, 0, node_name);
if (node >= 0)
- restored = pcsc_kho_restore_device(dev, fdt, node);
+ restored = pcsc_kho_restore_device(dev, fdt, node, version_mismatch);
#ifdef CONFIG_PCSC_STATS
if (restored) {
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 13/13] pci: pcsc: introduce hashtable lookup to speed up restoration
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
` (11 preceding siblings ...)
2025-10-03 9:00 ` [RFC PATCH 12/13] pci: pcsc: introduce persistence versioning Evangelos Petrongonas
@ 2025-10-03 9:00 ` Evangelos Petrongonas
12 siblings, 0 replies; 22+ messages in thread
From: Evangelos Petrongonas @ 2025-10-03 9:00 UTC (permalink / raw)
To: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown
Cc: Evangelos Petrongonas, Pasha Tatashin, David Matlack,
Vipin Sharma, Chris Li, Jason Miu, Pratyush Yadav,
Stanislav Spassov, linux-pci, linux-acpi, linux-kernel,
nh-open-source
During PCSC KHO (Kexec Handover) restoration, the current implementation
searches for each device's saved data by performing a linear traversal
through the FDT (Flattened Device Tree) nodes. This approach requires
examining every node until finding a match, resulting in O(n) complexity
per device lookup. On systems with thousands of PCI devices, this
becomes a significant bottleneck, as restoring all devices requires
O(n^2) operations in total.
This patch replaces the linear search with a hashtable that provides
O(1) lookup performance. During initialization, when KHO restore
data is detected, we build a hashtable by traversing the FDT once and
indexing all device nodes. Each device is keyed by a 32-bit value
combining its PCI domain (16 bits), bus number (8 bits), and device
function (8 bits), ensuring unique identification across the entire
PCI topology.
The hashtable uses 1024 buckets (PCSC_KHO_HASH_BITS=10).
As devices are successfully restored, their entries are removed from the
hashtable, progressively freeing memory. Once all devices have been
restored, the entire hashtable structure is cleaned up.
An additional optimization moves the version compatibility check from
per-device to a single global check during hashtable initialization.
This eliminates redundant version comparisons and simplifies the
restoration logic. Since the hashtable is built once during module
initialization and never modified during lookups, no locking is
required.
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
drivers/pci/pcsc.c | 244 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 205 insertions(+), 39 deletions(-)
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 8ff91ed24a37..43880b85b3f9 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
+#include <linux/hashtable.h>
/* PCSC persistent data version - increment when the cacheability is changed */
#define PCSC_PERSISTENT_VERSION 1U
@@ -755,8 +756,182 @@ static void infer_cacheability(struct pci_dev *dev)
}
#ifdef CONFIG_PCSC_KHO
+
+/*
+ * Hash table for O(1) device lookup during restore
+ * Size is chosen to handle systems with up to 4096 devices
+ */
+#define PCSC_KHO_HASH_BITS 10 /* 1024 buckets */
+static DEFINE_HASHTABLE(pcsc_kho_lookup_table, PCSC_KHO_HASH_BITS);
+
+struct pcsc_kho_lookup_entry {
+ struct hlist_node node;
+ u32 key; /* Hash of domain:bus:devfn */
+ int fdt_offset; /* Offset of this device's node in FDT */
+};
+
static struct page *pcsc_kho_fdt;
static int pcsc_kho_fdt_order;
+static bool pcsc_kho_lookup_initialized;
+static phys_addr_t pcsc_kho_fdt_phys;
+static const void *pcsc_kho_fdt_virt;
+static bool version_mismatch;
+static atomic_t pcsc_kho_hashtable_entries = ATOMIC_INIT(0);
+static u32 pcsc_kho_initial_hashtable_entries;
+#ifdef CONFIG_PCSC_STATS
+static u64 pcsc_kho_hashtable_build_time_ns;
+#endif
+
+static inline u32 pcsc_kho_device_key(u32 domain, u32 bus, u32 devfn)
+{
+ return (domain << 16) | (bus << 8) | devfn;
+}
+
+static int pcsc_kho_build_lookup_table(const void *fdt)
+{
+ int node;
+ u32 domain, bus, slot, func, devfn;
+ struct pcsc_kho_lookup_entry *entry;
+ u32 key;
+ int count = 0;
+
+ fdt_for_each_subnode(node, fdt, 0) {
+ const char *name = fdt_get_name(fdt, node, NULL);
+
+ if (!name)
+ continue;
+
+ if (sscanf(name, "dev_%x_%x_%x_%x", &domain, &bus, &slot,
+ &func) != 4)
+ continue;
+
+ devfn = PCI_DEVFN(slot, func);
+ key = pcsc_kho_device_key(domain, bus, devfn);
+
+ entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry)
+ return -ENOMEM;
+
+ entry->key = key;
+ entry->fdt_offset = node;
+
+ hash_add(pcsc_kho_lookup_table, &entry->node, key);
+ count++;
+ }
+
+ atomic_set(&pcsc_kho_hashtable_entries, count);
+ pcsc_kho_initial_hashtable_entries = count;
+ pr_info("Built KHO lookup table with %d entries\n", count);
+ return 0;
+}
+
+static int pcsc_kho_init_lookup(void)
+{
+ int err, len;
+ const u32 *version_ptr;
+ u32 saved_version;
+#ifdef CONFIG_PCSC_STATS
+ ktime_t start_time, end_time;
+#endif
+
+ if (pcsc_kho_lookup_initialized)
+ return 0;
+
+#ifdef CONFIG_PCSC_STATS
+ start_time = ktime_get();
+#endif
+
+ err = kho_retrieve_subtree(PCSC_KHO_FDT, &pcsc_kho_fdt_phys);
+ if (err)
+ return err;
+
+ pcsc_kho_fdt_virt = phys_to_virt(pcsc_kho_fdt_phys);
+ if (fdt_node_check_compatible(pcsc_kho_fdt_virt, 0,
+ PCSC_KHO_NODE_COMPATIBLE))
+ return -EINVAL;
+
+ /* Check version for all devices */
+ version_ptr = fdt_getprop(pcsc_kho_fdt_virt, 0, "pv", &len);
+ if (version_ptr && len == sizeof(*version_ptr)) {
+ saved_version = *version_ptr;
+ if (saved_version != PCSC_PERSISTENT_VERSION) {
+ version_mismatch = true;
+ pr_info("Version mismatch (saved=%u, current=%u), will re-infer cacheability\n",
+ saved_version, PCSC_PERSISTENT_VERSION);
+ }
+ } else {
+ /* No version found, assume version 0 */
+ version_mismatch = true;
+ pr_info("No version found in restored data, will re-infer cacheability\n");
+ }
+
+ err = pcsc_kho_build_lookup_table(pcsc_kho_fdt_virt);
+ if (err)
+ return err;
+
+#ifdef CONFIG_PCSC_STATS
+ end_time = ktime_get();
+ pcsc_kho_hashtable_build_time_ns =
+ ktime_to_ns(ktime_sub(end_time, start_time));
+#endif
+
+ pcsc_kho_lookup_initialized = true;
+ pr_info("KHO lookup table initialized in %llu us\n",
+#ifdef CONFIG_PCSC_STATS
+ pcsc_kho_hashtable_build_time_ns / 1000
+#else
+ 0ULL
+#endif
+ );
+ return 0;
+}
+
+static int pcsc_kho_find_device_node(u32 domain, u32 bus, u32 devfn)
+{
+ u32 key = pcsc_kho_device_key(domain, bus, devfn);
+ struct pcsc_kho_lookup_entry *entry;
+ int offset = -1;
+
+ hash_for_each_possible(pcsc_kho_lookup_table, entry, node, key) {
+ if (entry->key == key) {
+ offset = entry->fdt_offset;
+ break;
+ }
+ }
+
+ return offset;
+}
+
+
+static void pcsc_kho_cleanup_hashtable(void)
+{
+ if (!pcsc_kho_lookup_initialized)
+ return;
+
+ pcsc_kho_lookup_initialized = false;
+ pr_info("KHO hashtable cleaned up - all devices restored\n");
+}
+
+static void pcsc_kho_remove_device_entry(u32 domain, u32 bus, u32 devfn)
+{
+ u32 key = pcsc_kho_device_key(domain, bus, devfn);
+ struct pcsc_kho_lookup_entry *entry;
+ struct hlist_node *tmp;
+
+ hash_for_each_possible_safe(pcsc_kho_lookup_table, entry, tmp, node,
+ key) {
+ if (entry->key == key) {
+ hash_del(&entry->node);
+ kfree(entry);
+
+ if (atomic_dec_and_test(&pcsc_kho_hashtable_entries))
+ pcsc_kho_cleanup_hashtable();
+
+ break;
+ }
+ }
+}
+
static int pcsc_kho_save_device(struct pci_dev *dev, void *fdt)
{
@@ -998,56 +1173,31 @@ static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
static bool pcsc_kho_check_restore(struct pci_dev *dev)
{
- phys_addr_t fdt_phys;
- const void *fdt;
- int node, err, len;
+ int node;
bool restored = false;
- bool version_mismatch = false;
- char node_name[32];
- const u32 *version_ptr;
- u32 saved_version;
+ u32 domain, bus_num;
#ifdef CONFIG_PCSC_STATS
ktime_t start_time, end_time;
u64 duration_ns;
-#endif
- err = kho_retrieve_subtree(PCSC_KHO_FDT, &fdt_phys);
- if (err) {
- pci_dbg(dev, "PCSC: kho_retrieve_subtree failed: %d\n", err);
- return false;
- }
+ start_time = ktime_get();
+#endif
- fdt = phys_to_virt(fdt_phys);
- if (fdt_node_check_compatible(fdt, 0, PCSC_KHO_NODE_COMPATIBLE)) {
- pci_dbg(dev, "PCSC: FDT node not compatible\n");
+ if (!pcsc_kho_lookup_initialized)
return false;
- }
- version_ptr = fdt_getprop(fdt, 0, "pv", &len);
- if (version_ptr && len == sizeof(*version_ptr)) {
- saved_version = *version_ptr;
- if (saved_version != PCSC_PERSISTENT_VERSION)
- version_mismatch = true;
-
- } else {
- /* No version found, assume version 0 */
- pci_info(
- dev,
- "PCSC: No version found in restored data. Re-infer Cacheability.\n");
- version_mismatch = true;
- }
+ domain = pci_domain_nr(dev->bus);
+ bus_num = dev->bus->number;
-#ifdef CONFIG_PCSC_STATS
- start_time = ktime_get();
-#endif
+ node = pcsc_kho_find_device_node(domain, bus_num, dev->devfn);
- snprintf(node_name, sizeof(node_name), "dev_%04x_%02x_%02x_%x",
- pci_domain_nr(dev->bus), dev->bus->number,
- PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+ if (node >= 0) {
+ restored = pcsc_kho_restore_device(dev, pcsc_kho_fdt_virt, node,
+ version_mismatch);
- node = fdt_subnode_offset(fdt, 0, node_name);
- if (node >= 0)
- restored = pcsc_kho_restore_device(dev, fdt, node, version_mismatch);
+ if (restored)
+ pcsc_kho_remove_device_entry(domain, bus_num, dev->devfn);
+ }
#ifdef CONFIG_PCSC_STATS
if (restored) {
@@ -1056,6 +1206,9 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
pcsc_stats.pcsc_kho_total_restore_time_ns += duration_ns;
pcsc_count_restored_devices();
+
+ pci_dbg(dev, "PCSC: Restored from KHO in %llu ns\n",
+ duration_ns);
}
#endif
@@ -1498,6 +1651,11 @@ static ssize_t pcsc_stats_show(struct kobject *kobj,
pcsc_stats.pcsc_kho_restored_device_count,
total_restore_time_us);
+ ret += sysfs_emit_at(buf, ret,
+ " Hashtable Initial Entries: %u\n"
+ " Hashtable Build Time: %llu us\n",
+ pcsc_kho_initial_hashtable_entries,
+ pcsc_kho_hashtable_build_time_ns / 1000);
#endif
return ret;
@@ -1573,6 +1731,14 @@ static int __init pcsc_init(void)
pr_info("KHO notifier registered successfully\n");
else
pr_err("Failed to register KHO notifier: %d\n", ret);
+
+ ret = pcsc_kho_init_lookup();
+ if (ret == 0)
+ pr_info("KHO lookup table initialized during init\n");
+ else if (ret == -ENOENT)
+ pr_info("No KHO saved data found - fresh boot\n");
+ else
+ pr_err("Failed to initialize KHO lookup table: %d\n", ret);
}
#endif /* CONFIG_PCSC_KHO */
--
2.47.3
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 01/13] pci: pcsc: Add plumbing for the PCI Configuration Space Cache (PCSC)
2025-10-03 9:00 ` [RFC PATCH 01/13] pci: pcsc: Add plumbing for the " Evangelos Petrongonas
@ 2025-10-09 12:38 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 12:38 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:37 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> Introduce the basic infrastructure for the PCI Configuration Space Cache
> (PCSC), a mechanism to cache PCI configuration space accesses to reduce
> latency and bus traffic.
>
> The PCSC implements a transparent interception layer for PCI config
> space operations by dynamically injecting its own ops into the PCI bus
> hierarchy. The design preserves existing PCI ops while allowing PCSC to
> intercept and cache accesses:
>
> The` struct pci_bus` is extended to hold the original `pci_ops`, while
> the cache ones are injected via `pcsc_inject_bus_ops()`. The cache ops
> are injected when new buses are added via registering it to a bus
> notifier and integrating it at:
> * `pci_register_host_bridge()` - for root buses
> * `pci_alloc_child_bus()` - for child buses
> * `pci_bus_set_ops()` - when ops are dynamically changed
>
> The implementation includes weak pcsc_hw_config_read/write functions
> that handle calling the original op, when access to the actual HW is
> required.
>
> This approach ensures complete transparency - existing drivers and
> subsystems continue to use standard PCI config access functions while
> PCSC can intercept and cache accesses as needed. The weak functions also
> allow architecture-specific implementations to override the default
> behavior.
>
> The `core` initcall level is chosen so the cache is initialised before
> the PCI driver, ensuring that all config space access go through the
> cache.
>
> Kconfig options are added for both PCSC and PCIe PCSC support, with the
> latter extending the cache to handle 4KB PCIe configuration space.
>
> In this initial patch, the cache simply passes through all accesses to
> the hardware via the original ops - actual caching functionality will be
> added in subsequent patches.
>
> There is one caveat in this patch. The map_bus operations can
> potentially alter the cache, without invalidating / updating the cache.
> This is not an issue for the current upstream usages, as it is only
> being used in Root complexes and the
> `pci_generic_config_{read,write}{,32}`
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
Hi,
This is an interesting idea. I'll probably give a bunch of trivial
feedback whilst I do a first read through. I appreciate this is only
an RFC at this point though!
> ---
> drivers/pci/Kconfig | 10 ++
> drivers/pci/Makefile | 1 +
> drivers/pci/access.c | 81 ++++++++++++++-
> drivers/pci/pcie/Kconfig | 9 ++
> drivers/pci/pcsc.c | 208 +++++++++++++++++++++++++++++++++++++++
> drivers/pci/probe.c | 24 ++++-
> include/linux/pci.h | 3 +
> include/linux/pcsc.h | 86 ++++++++++++++++
> 8 files changed, 419 insertions(+), 3 deletions(-)
> create mode 100644 drivers/pci/pcsc.c
> create mode 100644 include/linux/pcsc.h
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 9a249c65aedc..c26162b58365 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -40,6 +40,16 @@ config PCI_DOMAINS_GENERIC
> config PCI_SYSCALL
> bool
>
> +config PCSC
> + bool "PCI Configuration Space Cache"
> + depends on PCI
> + default n
Not needed as default if you don't have this line is n.
> + help
> + This option enables support for the PCI Configuration Space Cache
> + (PCSC). PCSC is a transparent caching layer that
> + intercepts configuration space operations and maintains cached
> + copies of register values
Odd wrapping. Rewrap to consistent 80 chars.
> +
> source "drivers/pci/pcie/Kconfig"
>
> config PCI_MSI
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 67647f1880fb..012561b97e32 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
> obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> obj-$(CONFIG_PCI_NPEM) += npem.o
> obj-$(CONFIG_PCIE_TPH) += tph.o
> +obj-$(CONFIG_PCSC) += pcsc.o
>
> # Endpoint library must be initialized before its users
> obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
> diff --git a/drivers/pci/access.c b/drivers/pci/access.c
> index b123da16b63b..b89e9210d330 100644
> --- a/drivers/pci/access.c
> +++ b/drivers/pci/access.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/pci.h>
> +#include <linux/pcsc.h>
> #include <linux/module.h>
> #include <linux/slab.h>
> #include <linux/ioport.h>
> @@ -189,15 +190,93 @@ EXPORT_SYMBOL_GPL(pci_generic_config_write32);
> * @ops: new raw operations
> *
> * Return previous raw operations
> + *
> + * When PCSC is enabled, this function maintains transparency by:
> + * - Returning the original non-PCSC ops to the caller
> + * - Properly handling the case where PCSC ops are already injected
> + * - Re-injecting PCSC ops after setting new ops when appropriate
> */
> struct pci_ops *pci_bus_set_ops(struct pci_bus *bus, struct pci_ops *ops)
> {
> struct pci_ops *old_ops;
> unsigned long flags;
> +#ifdef CONFIG_PCSC
> + bool pcsc_was_injected = false;
> + struct pci_ops *pcsc_ops_ptr = NULL;
> +#endif
With change suggested below the compiler will happily remove these
without needing the ifdef fun.
>
> raw_spin_lock_irqsave(&pci_lock, flags);
I'd precede this set with a patch using guard() for this
as then you can avoid the goto complexity in your new code
and potentially just return directly in error cases which will
I think simpify flow a little more.
> - old_ops = bus->ops;
> +
> +#ifdef CONFIG_PCSC
Is it possible to use
if (IS_ENABLED(CONFIG_PCSC))
to expose the code to the compiler /static analysis etc but allow
the compiler to cleanly do dead code removal so we get same resulting
binary as here?
> + /*
> + * Check if PCSC ops are currently injected. If so, we need to:
> + * 1. Return the original (non-PCSC) ops to maintain transparency
> + * 2. Update orig_ops to point to the new ops
> + * 3. Re-inject PCSC ops if the new ops are different from PCSC ops
> + */
> + if (bus->orig_ops) {
> + pcsc_was_injected = true;
> + pcsc_ops_ptr = bus->ops; /* Save current PCSC ops */
> + old_ops = bus->orig_ops; /* Return the real original ops */
> +
> + /*
> + * If the caller is trying to restore the PCSC ops themselves,
> + * just keep the current setup and return the original ops
> + */
> + if (ops == pcsc_ops_ptr)
> + goto out_unlock;
> +
> + /* Clear orig_ops temporarily to allow re-injection */
> + bus->orig_ops = NULL;
> + } else
> +#endif
> + {
> + old_ops = bus->ops;
> + }
> +
> bus->ops = ops;
> +
> +#ifdef CONFIG_PCSC
Similar: if (IS_ENABLED())
> + /*
> + * Re-inject PCSC ops if they were previously injected and the new ops
> + * are not the PCSC ops themselves. This maintains caching transparency.
> + */
> + if (pcsc_was_injected && ops != pcsc_ops_ptr) {
> + /*
> + * IMPORTANT: Dynamic ops changes after PCSC injection can lead to
> + * cache consistency issues if operations were performed that should
> + * have invalidated the cache. We re-inject PCSC ops here, but the
> + * caller is responsible for ensuring cache consistency if needed.
> + * This will be fixed in a future commit, when PCSC resets are
> + * introduced.
> + */
> +
> + pr_warn("PCSC: Dynamic ops change detected on bus %04x:%02x, resetting cache\n",
> + pci_domain_nr(bus), bus->number);
> +
> + if (pcsc_inject_bus_ops(bus)) {
> + pr_err("PCSC: Failed to re-inject ops after ops change on bus %04x:%02x\n",
> + pci_domain_nr(bus), bus->number);
> + /*
> + * If re-injection fails, we've lost caching but at least
> + * the caller's requested ops are in place. Log it
> + */
> + pr_warn("PCSC: Cache disabled for bus %04x:%02x after ops change\n",
> + pci_domain_nr(bus), bus->number);
> + } else {
> + pr_debug("PCSC: Successfully re-injected ops after ops change on bus %04x:%02x\n",
> + pci_domain_nr(bus), bus->number);
> + }
> + } else if (!pcsc_was_injected) {
> + /* First-time injection for this bus */
> + if (pcsc_inject_bus_ops(bus)) {
> + pr_err("PCSC: Failed to inject ops on bus %04x:%02x\n",
> + pci_domain_nr(bus), bus->number);
> + }
> + }
> +
> +out_unlock:
> +#endif
> raw_spin_unlock_irqrestore(&pci_lock, flags);
> return old_ops;
> }
> diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
> new file mode 100644
> index 000000000000..dec7c51b5cfd
> --- /dev/null
> +++ b/drivers/pci/pcsc.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + * Author: Evangelos Petrongonas <epetron@amazon.de>
> + *
> + * Implementation of the PCI Configuration Space Cache (PCSC)
> + * PCSC is a module which caches the PCI Configuration Space Accesses
> + * It implements a write-invalidate policy, meaning that writes are
> + * propagated to the device and invalidating the cache. The registers that
> + * we are caching are based on the values that are safe to cache and we
> + * are not expecting them to change without OS actions.
> + *
Trivial but this trailing blank line doesn't add much.
> + */
> +/* Weak references to allow architecture-specific overrides */
> +int __weak pcsc_hw_config_read(struct pci_bus *bus, unsigned int devfn,
> + int where, int size, u32 *val)
> +{
> + /*
> + * This function is only called from pcsc_cached_config_read,
pcsc_cached_config_read()
> + * which means PCSC ops have already been injected and orig_ops
> + * should be valid.
> + */
> + if (bus->orig_ops && bus->orig_ops->read)
> + return bus->orig_ops->read(bus, devfn, where, size, val);
> +
> + *val = 0xffffffff;
I'd use GENMASK(31, 0);
as no one likes counting fs.
> + return PCIBIOS_FUNC_NOT_SUPPORTED;
> +}
> +EXPORT_SYMBOL_GPL(pcsc_hw_config_read);
> +
> +int __weak pcsc_hw_config_write(struct pci_bus *bus, unsigned int devfn,
> + int where, int size, u32 val)
> +{
> + /*
> + * This function is only called from pcsc_cached_config_write,
() same for other references to functions in comments.
> + * which means PCSC ops have already been injected and orig_ops
> + * should be valid.
> + */
> + if (bus->orig_ops && bus->orig_ops->write)
> + return bus->orig_ops->write(bus, devfn, where, size, val);
> +
> + return PCIBIOS_FUNC_NOT_SUPPORTED;
> +}
> +EXPORT_SYMBOL_GPL(pcsc_hw_config_write);
> +static int __init pcsc_init(void)
> +{
> + bus_register_notifier(&pci_bus_type, &pcsc_bus_nb);
> +
> + pcsc_initialised = true;
> + pr_info("initialised\n");
Bit noisy but I guess there is no other way to tell this is running.
> +
> + return 0;
> +}
> +
> +core_initcall(pcsc_init);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 37f5bd476f39..33a186e4bf1e 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -8,6 +8,7 @@
> #include <linux/delay.h>
> #include <linux/init.h>
> #include <linux/pci.h>
> +#include <linux/pcsc.h>
> #include <linux/msi.h>
> #include <linux/of_pci.h>
> #include <linux/of_platform.h>
> @@ -1039,6 +1040,11 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> }
> #endif
>
> +#ifdef CONFIG_PCSC
if (IS_ENABLED())
and a stub probably to get rid of the ifdef mess.
> + if (pcsc_inject_bus_ops(bus))
> + pci_err(bus, "PCSC: Failed to inject ops\n");
> +#endif
> +
> b = pci_find_bus(pci_domain_nr(bus), bridge->busnr);
> if (b) {
> /* Ignore it if we already got here via a different bridge */
> @@ -1236,10 +1242,24 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
> child->bus_flags = parent->bus_flags;
>
> host = pci_find_host_bridge(parent);
> - if (host->child_ops)
> + if (host->child_ops) {
> child->ops = host->child_ops;
> - else
> +#ifdef CONFIG_PCSC
> + child->orig_ops = host->child_ops;
I'd see if anyone really minds the unused orig_ops vs
the complexity of lots of local code. Otherwise wrap
setting and reading it up in access functions and stub
them out on !IS_ENABLED(CONFIG_PCSC)
> +#endif
> + } else {
> child->ops = parent->ops;
> +#ifdef CONFIG_PCSC
> + child->orig_ops = parent->orig_ops;
> +#endif
> + }
> +
> +#ifdef CONFIG_PCSC
Similar to above.
> + if (child->ops) {
> + if (pcsc_inject_bus_ops(child))
> + pci_err(child, "PCSC: Failed to inject ops\n");
> + }
> +#endif
>
> diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
> new file mode 100644
> index 000000000000..45816eb2b2c8
> --- /dev/null
> +++ b/include/linux/pcsc.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + * Author: Evangelos Petrongonas <epetron@amazon.de>
> + *
Trivial but drop this line as it adds nothing to readability and makes me scroll
a tiny little bit more ;)
> + */
> +/**
> + * pcsc_inject_bus_ops Inject the pcsc ops into bus pci_ops
> + * @bus: the bus in which to inject the ops
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int pcsc_inject_bus_ops(struct pci_bus *bus);
It's a bit awkward given the weak functions, but mostly PCI seems to
put documentation alongside implementation rather then in the header.
> +#endif /* _LINUX_PCSC_H */
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 02/13] pci: pcsc: implement basic functionality
2025-10-03 9:00 ` [RFC PATCH 02/13] pci: pcsc: implement basic functionality Evangelos Petrongonas
@ 2025-10-09 13:12 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 13:12 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:38 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> Implement the core functionality of the PCI Configuration Space Cache
> using per-device cache nodes attached to struct pci_dev.
>
> Each cache node stores:
> - A 256-byte array (4KB for PCIe) representing the configuration space
> - A cacheable bitmask indicating which registers can be cached
> - A cached bitmask tracking which bytes are currently cached
>
> The implementation attaches cache nodes directly to pci_dev structures
> during `pci_device_add()` and removes them during `pci_device_remove()`.
>
> The cache implements a write-invalidate policy where writes are
> propagated to the device while invalidating the cache. This design
> choice improves robustness and increases the number of cacheable
> registers, particularly for operations like BAR sizing which use
> write-read sequences to detect read-only bits.
>
> Currently, the cacheable bitmask is zero-initialized,
> effectively disabling the cache. This will be changed in the next
> commits.
>
> This implementation only supports endpoint devices; bridges and
> root complexes are not cached.
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> ---
> drivers/pci/pci-driver.c | 5 +
> drivers/pci/pcsc.c | 244 ++++++++++++++++++++++++++++++++++++++-
> drivers/pci/probe.c | 9 ++
> include/linux/pci.h | 5 +
> include/linux/pcsc.h | 38 ++++++
> 5 files changed, 299 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 302d61783f6c..7c0cbbd50b32 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -21,6 +21,7 @@
> #include <linux/acpi.h>
> #include <linux/dma-map-ops.h>
> #include <linux/iommu.h>
> +#include <linux/pcsc.h>
> #include "pci.h"
> #include "pcie/portdrv.h"
>
> @@ -497,7 +498,11 @@ static void pci_device_remove(struct device *dev)
> * horrible the crap we have to deal with is when we are awake...
> */
>
> + #ifdef CONFIG_PCSC
if (IS_ENABLED()) or a stub.
> + pcsc_remove_device(pci_dev);
> +#endif
> pci_dev_put(pci_dev);
> +
Clean this out.
> }
>
> static void pci_device_shutdown(struct device *dev)
> diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
> index dec7c51b5cfd..7531217925e8 100644
> --- a/drivers/pci/pcsc.c
> +++ b/drivers/pci/pcsc.c
>
> +/**
> + * pcsc_get_and_insert_multiple - Read multiple bytes from PCI cache or HW
> + * @dev: PCI device to read from
> + * @bus: PCI bus to read from
> + * @devfn: device and function number
Is this always the same as dev->devfn?
> + * @where: offset in config space
> + * @word: pointer to store read value
> + * @size: number of bytes to read (1, 2 or 4)
> + *
> + * Reads consecutive bytes from PCI cache or hardware. If values are not cached,
> + * reads from hardware and inserts into cache.
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
> + struct pci_bus *bus, unsigned int devfn,
> + int where, u32 *word, int size)
> +{
> + u32 word_cached = 0;
> + u8 byte_val;
> + int rc, i;
> +
> + if (WARN_ON(!dev || !bus || !word))
> + return -EINVAL;
> +
> + if (WARN_ON(size != 1 && size != 2 && size != 4))
> + return -EINVAL;
> +
> + /* Check bounds */
> + if (where + size > PCSC_CFG_SPC_SIZE)
> + return -EINVAL;
> +
> + if (pcsc_is_cached(dev, where, size)) {
> + /* Read bytes from cache and assemble them into word_cached
Multiline comment syntax as below. Same for others.
> + * in little-endian order (as per PCI spec)
> + */
> + for (i = 0; i < size; i++) {
> + pcsc_get_byte(dev, where + i, &byte_val);
> + word_cached |= ((u32)byte_val << (i * 8));
> + }
> + } else {
> + rc = pcsc_hw_config_read(bus, devfn, where, size, &word_cached);
> + if (rc) {
> + pci_err(dev,
> + "%s: Failed to read CFG Space where=%d size=%d",
> + __func__, where, size);
> + return rc;
> + }
> +
> + /* Extract bytes from word_cached in little-endian order
> + * and store them in cache.
> + */
> + for (i = 0; i < size; i++) {
> + byte_val = (word_cached >> (i * 8)) & 0xFF;
> + pcsc_update_byte(dev, where + i, byte_val);
> + }
> + }
> +
> + *word = word_cached;
> + return 0;
> +}
> +
> int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
> int size, u32 *val)
> {
> - if (!pcsc_initialised)
> + int rc;
> + struct pci_dev *dev;
> +
> + if (unlikely(!pcsc_is_initialised()))
As below. Just add a stub function to patch 1 for this.
> goto read_from_dev;
>
> + if (WARN_ON(!bus || !val || (size != 1 && size != 2 && size != 4) ||
> + where + size > PCSC_CFG_SPC_SIZE))
> + return -EINVAL;
> +
> + dev = pci_get_slot(bus, devfn);
> +
> + if (unlikely(!dev || !dev->pcsc))
I'm curious how much difference that unlikely is making. Generally don't
use them unless you have the perf numbers to back them up.
Letting the branch predictors do their thing is usually a better plan.
> + goto read_from_dev;
> +
> + if (dev->pcsc->cfg_space &&
> + pcsc_is_access_cacheable(dev, where, size)) {
> + rc = pcsc_get_and_insert_multiple(dev, bus, devfn, where, val,
> + size);
> + if (likely(!rc)) {
> + pci_dev_put(dev);
> + return 0;
> + }
> + /* if reading from the cache failed continue and try reading
Match multiline comment syntax for PCI.
/*
* If reading...
> + * from the actual device
> + */
> + }
> read_from_dev:
> + if (dev)
> + pci_dev_put(dev);
> return pcsc_hw_config_read(bus, devfn, where, size, val);
> }
> EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
> @@ -117,10 +336,31 @@ EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
> int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
> int size, u32 val)
> {
> - if (!pcsc_initialised)
> + int i;
> + struct pci_dev *dev;
> +
> + if (unlikely(!pcsc_is_initialised()))
Make it this from the start to reduce churn.
> goto write_to_dev;
>
> + if (WARN_ON(!bus || (size != 1 && size != 2 && size != 4) ||
> + where + size > PCSC_CFG_SPC_SIZE))
> + return -EINVAL;
> +
> + dev = pci_get_slot(bus, devfn);
> +
> + if (unlikely(!dev || !dev->pcsc || !dev->pcsc->cfg_space)) {
> + /* Do not add nodes on arbitrary writes */
> + goto write_to_dev;
> + } else {
Can drop the else given the goto above.
> + /* Mark the cache as dirty */
> + if (pcsc_is_access_cacheable(dev, where, size)) {
> + for (i = 0; i < size; i++)
> + pcsc_set_cached(dev, where + i, false);
> + }
> + }
> write_to_dev:
> + if (dev)
> + pci_dev_put(dev);
> return pcsc_hw_config_write(bus, devfn, where, size, val);
> }
> EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 33a186e4bf1e..c231e09e5a6e 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -23,6 +23,7 @@
> #include <linux/irqdomain.h>
> #include <linux/pm_runtime.h>
> #include <linux/bitfield.h>
> +#include <linux/pcsc.h>
> #include "pci.h"
>
> #define CARDBUS_LATENCY_TIMER 176 /* secondary latency timer */
> @@ -2801,6 +2802,14 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>
> dev->state_saved = false;
>
> +#ifdef CONFIG_PCSC
Similar request for if (IS_ENABLED()) etc.
> + if (likely(pcsc_is_initialised()))
> + if (!dev->pcsc)
> + if (pcsc_add_device(dev))
Maybe combine some of that if stack.
> + pci_warn(dev,
> + "Failed to add PCI device to PCSC\n");
> +#endif
> +
> pci_init_capabilities(dev);
>
> /*
> static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
> diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
> index 45816eb2b2c8..516d73931608 100644
> --- a/include/linux/pcsc.h
> +++ b/include/linux/pcsc.h
> +/**
> + * @brief Returns if the PCSC infrastructure is initialised
> + *
Run the kernel-doc script over these files it gets fussy about
partial documentation etc. I'm fairly sure this will trip it up.
> + */
> +bool pcsc_is_initialised(void);
> +
> #endif /* _LINUX_PCSC_H */
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities
2025-10-03 9:00 ` [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities Evangelos Petrongonas
@ 2025-10-09 14:17 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 14:17 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:39 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> Implement cacheability inference for PCI capabilities to
> determine which configuration space registers can be safely cached.
>
> The first 64 bytes of PCI configuration space follow a standardized
> format, allowing straightforward cacheability determination. For
> capability-specific registers, the implementation traverses the PCI
> capability list to identify supported capabilities.
>
> Cacheable registers are identified for the following capabilities:
> - Power Management (PM)
> - Message Signaled Interrupts (MSI)
> - Message Signaled Interrupts Extensions (MSI-X)
> - PCI Express
> - PCI Advanced Features (AF)
> - Enhanced Allocation (EA)
> - Vital Product Data (VPD)
> - Vendor Specific
>
> The implementation pre-populates the cache with known values including
> device/vendor IDs and header type to avoid unnecessary configuration
> space reads during initialization.
>
> We are currently not caching the Command/Status registers.
>
> The cacheability of all capabilities apart from MSI, are straightforward
> and can be deduced from the spec. Regarding MSI the MSI flags are read
> and based on this, the cacheability is inferred.
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
A few minor things below.
> +
> +static void infer_capabilities_pointers(struct pci_dev *dev)
> +{
> + u8 pos, cap_id, next_cap;
> + u32 val;
> + int i;
> +
> + if (pcsc_hw_config_read(dev->bus, dev->devfn, PCI_CAPABILITY_LIST, 1,
> + &val) != PCIBIOS_SUCCESSFUL)
> + return;
> +
> + pos = (val & 0xFF) & ~0x3;
Given single byte read, shouldn't need the 0xFF masking.
Might be worth setting val = 0 at declaration though so that the compiler
can see it is assigned if we reach here.
> +
> + while (pos) {
> + if (pos < 0x40 || pos > 0xFE)
> + break;
> +
> + pos &= ~0x3;
I couldn't immediately find a spec statement of the bottom two bits being 0 for
the next capability pointers. (There is one for the PCI_CAPABILITY_LIST)
> + if (pcsc_hw_config_read(dev->bus, dev->devfn, pos, 2, &val) !=
> + PCIBIOS_SUCCESSFUL)
> + break;
> +
> + cap_id = val & 0xFF; /* PCI_CAP_LIST_ID */
> + next_cap = (val >> 8) & 0xFF; /* PCI_CAP_LIST_NEXT */
> +
> + bitmap_set(dev->pcsc->cachable_bitmask, pos, 2);
> + pcsc_update_byte(dev, pos, cap_id); /* PCI_CAP_LIST_ID */
> + pcsc_update_byte(dev, pos + 1,
> + next_cap); /* PCI_CAP_LIST_NEXT */
Could you do something like moving the bitmap_set before the pcsc_hw_config_read() and
cal pcsc_cached_config_read() to fill the cache directly during the read?
> +
> + pci_dbg(dev, "Capability ID %#x found at %#x\n", cap_id, pos);
> +
> + /* Check if this is a supported capability and infer cacheability */
> + for (i = 0; i < ARRAY_SIZE(PCSC_SUPPORTED_CAPABILITIES); i++) {
> + if (cap_id == PCSC_SUPPORTED_CAPABILITIES[i]) {
> + infer_capability_cacheability(dev, pos, cap_id);
> + break;
> + }
> + }
> +
> + /* Move to next capability */
> + pos = next_cap;
> + }
> +}
> +
> +static void infer_cacheability(struct pci_dev *dev)
> +{
> + if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
> + return;
> +
> + bitmap_zero(dev->pcsc->cachable_bitmask, PCSC_CFG_SPC_SIZE);
> +
> + /* Type 0 Configuration Space Header */
> + if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
Unless you plan to add type 1 very soon I'd flip the logic to reduce
indent of the code that follows.
> + /*
> + * Mark cacheable registers in the PCI configuration space header.
> + * We cache read-only and rarely changing registers:
> + * - PCI_VENDOR_ID, PCI_DEVICE_ID (0x00-0x03)
> + * - PCI_CLASS_REVISION through PCI_CAPABILITY_LIST (0x08-0x34)
> + * Includes: CLASS_REVISION, CACHE_LINE_SIZE, LATENCY_TIMER,
> + * HEADER_TYPE, BIST, BASE_ADDRESS_0-5, CARDBUS_CIS,
> + * SUBSYSTEM_VENDOR_ID, SUBSYSTEM_ID, ROM_ADDRESS, CAPABILITY_LIST
> + * - PCI_INTERRUPT_LINE through PCI_MAX_LAT (0x3c-0x3f)
> + * Includes: INTERRUPT_LINE, INTERRUPT_PIN, MIN_GNT, MAX_LAT
> + */
> + bitmap_set(dev->pcsc->cachable_bitmask, PCI_VENDOR_ID, 4);
> + bitmap_set(dev->pcsc->cachable_bitmask, PCI_CLASS_REVISION, 45);
For this large range can you derive that 45 as something like
PCI_CAPABILITY_LIST - PCI_CLASS_REVISION + 1
Same applies for the other multifield bitmap sets
> + bitmap_set(dev->pcsc->cachable_bitmask, PCI_INTERRUPT_LINE, 4);
> +
> + /* Pre populate the cache with the values that we already know */
I'm curious - do you have perf numbers to show it's worth writing these 5 bytes
directly into the cache? Feels like complexity that maybe doesn't belong there
in initial patch but should come along later in a patch with numbers to support
the small amount of extra complexity.
> + pcsc_update_byte(dev, PCI_HEADER_TYPE,
> + dev->hdr_type |
> + (dev->multifunction ? 0x80 : 0));
> +
> + /*
> + * SR-IOV VFs must return 0xFFFF (PCI_ANY_ID) for vendor/device ID
> + * registers per PCIe spec.
> + */
> + if (dev->is_virtfn) {
> + pcsc_update_byte(dev, PCI_VENDOR_ID, 0xFF);
> + pcsc_update_byte(dev, PCI_VENDOR_ID + 1, 0xFF);
> + pcsc_update_byte(dev, PCI_DEVICE_ID, 0xFF);
> + pcsc_update_byte(dev, PCI_DEVICE_ID + 1, 0xFF);
> + } else {
> + if (dev->vendor != PCI_ANY_ID) {
> + pcsc_update_byte(dev, PCI_VENDOR_ID,
> + dev->vendor & 0xFF);
> + pcsc_update_byte(dev, PCI_VENDOR_ID + 1,
> + (dev->vendor >> 8) & 0xFF);
> + }
> + if (dev->device != PCI_ANY_ID) {
> + pcsc_update_byte(dev, PCI_DEVICE_ID,
> + dev->device & 0xFF);
> + pcsc_update_byte(dev, PCI_DEVICE_ID + 1,
> + (dev->device >> 8) & 0xFF);
> + }
> + }
> +
> + infer_capabilities_pointers(dev);
> + }
> +}
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities
2025-10-03 9:00 ` [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities Evangelos Petrongonas
@ 2025-10-09 14:24 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 14:24 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:40 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> Extend PCSC to support cacheability inference for PCIe extended
> capabilities located in the 4KB extended configuration space.
>
> Similar to the capabilities, PCIe extended capabilities require
> traversal of the capability list to determine cacheability. The
> implementation identifies cacheable registers for capabilities used
> by the generic PCIe driver:
>
> - Advanced Error Reporting (AER)
> - Access Control Services (ACS)
> - Alternative Routing-ID (ARI)
> - SR-IOV
> - Address Translation Services (ATS)
> - Page Request Interface (PRI)
> - Process Address Space ID (PASID)
> - Downstream Port Containment (DPC)
> - Precision Time Measurement (PTM)
>
> The extended capability header (4 bytes) is always cached to enable
> efficient capability list traversal.
>
> All the extended capabilities apart from the DPC are static. Regarding
> DPC, the DPC capabilities is read and based on its value the
> cacheability of RP* registers is inferred.
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
A few comments below.
> ---
> drivers/pci/pcsc.c | 203 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 203 insertions(+)
>
> diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
> index 29945eac4190..343f8b03831a 100644
> --- a/drivers/pci/pcsc.c
> +++ b/drivers/pci/pcsc.c
> +static void infer_extended_capabilities_pointers(struct pci_dev *dev)
> +{
> + int pos = 0x100;
> + u32 header;
> + int cap_ver, cap_id;
> + int i;
> +
if (!IS_ENABLED(CONFIG_PCIE_PCSC))
return;
is probably enough to allow the compiler to get rid of everything without
the ifdef magic.
> + while (pos) {
> + if (pos > 0xFFC || pos < 0x100)
> + break;
> +
> + pos &= ~0x3;
> +
> + if (pcsc_hw_config_read(dev->bus, dev->devfn, pos, 4,
> + &header) != PCIBIOS_SUCCESSFUL)
> + break;
> +
> + if (!header)
> + break;
> +
> + bitmap_set(dev->pcsc->cachable_bitmask, pos, 4);
> + for (i = 0; i < 4; i++)
> + pcsc_update_byte(dev, pos + i,
> + (header >> (i * 8)) & 0xFF);
> +
> + cap_id = PCI_EXT_CAP_ID(header);
> + cap_ver = PCI_EXT_CAP_VER(header);
> +
> + pci_dbg(dev,
> + "Extended capability ID %#x (ver %d) found at %#x, next cap at %#x\n",
> + cap_id, cap_ver, pos, PCI_EXT_CAP_NEXT(header));
> +
> + /* Check if this is a supported extended capability and infer cacheability */
> + for (i = 0; i < ARRAY_SIZE(PCSCS_SUPPORTED_EXT_CAPABILITIES);
> + i++) {
> + if (cap_id == PCSCS_SUPPORTED_EXT_CAPABILITIES[i]) {
> + infer_extended_capability_cacheability(dev, pos,
> + cap_id);
> + break;
> + }
> + }
> +
> + pos = PCI_EXT_CAP_NEXT(header);
> + }
> +}
> +#endif
> +
> static void infer_cacheability(struct pci_dev *dev)
> {
> if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
> @@ -432,6 +631,10 @@ static void infer_cacheability(struct pci_dev *dev)
> }
>
> infer_capabilities_pointers(dev);
> +#ifdef CONFIG_PCIE_PCSC
> + if (pci_is_pcie(dev))
> + infer_extended_capabilities_pointers(dev);
> +#endif
> }
> }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params
2025-10-03 9:00 ` [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params Evangelos Petrongonas
@ 2025-10-09 14:41 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 14:41 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:41 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> Add kernel parameters and runtime control mechanisms for the PCSC
>
> A new kernel parameter 'pcsc_enabled' allows enabling or disabling
> the cache at boot time. The parameter defaults to disabled.
>
> A sysfs interface at /sys/bus/pci/pcsc/enabled provides:
> - Read access to query current cache status (1=enabled, 0=disabled)
> - Write access to dynamically enable/disable the cache at runtime
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> ---
> Documentation/ABI/testing/sysfs-bus-pci-pcsc | 20 ++++
> .../admin-guide/kernel-parameters.txt | 3 +
> drivers/pci/pcsc.c | 93 ++++++++++++++++++-
> 3 files changed, 114 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-pcsc
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-pcsc b/Documentation/ABI/testing/sysfs-bus-pci-pcsc
> new file mode 100644
> index 000000000000..ee92bf087816
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-pcsc
> @@ -0,0 +1,20 @@
> +PCI Configuration Space Cache (PCSC)
> +-------------------------------------
> +
> +The PCI Configuration Space Cache (PCSC) is a transparent caching layer
> +that intercepts configuration space operations to reduce hardware access
> +overhead. This subsystem addresses performance bottlenecks in PCI
> +configuration space accesses, particularly in virtualization
> +environments with high-density SR-IOV deployments where repeated
> +enumeration of Virtual Functions creates substantial delays.
> +
> +What: /sys/bus/pci/pcsc/enabled
> +Date: September 2025
> +Contact: Linux PCI developers <linux-pci@vger.kernel.org>
> +Description:
> + PCI Configuration Space Cache (PCSC) is a subsystem that
> + caches accesses to the PCI configuration space of PCI
> + functions. When this file contains the "1", the kernel
> + is utilizing the cache, while when on "0" the
> + system bypasses it. This setting can also be controlled
> +parameter.
indent issue on this last line.
Excellent to see someone remembering the ABI docs for once in an RFC!
> diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
> index 343f8b03831a..44d842733230 100644
> --- a/drivers/pci/pcsc.c
> +++ b/drivers/pci/pcsc.c
> +static struct kobj_attribute pcsc_enabled_attribute =
> + __ATTR(enabled, 0644, pcsc_enabled_show, pcsc_enabled_store);
> +
> +static struct attribute *pcsc_attrs[] = {
> + &pcsc_enabled_attribute.attr,
> + NULL,
Trivial but no need for that trailing comma after the NULL terminator.
We don't want it to be easy to accidentally add something after that.
> +};
> +
> +static struct attribute_group pcsc_attr_group = {
> + .attrs = pcsc_attrs,
> +};
> +
> +static struct kobject *pcsc_kobj;
> +
> +static void pcsc_create_sysfs(void)
> +{
> + struct kset *pci_bus_kset;
> + int ret;
> +
> + if (pcsc_kobj)
> + return; /* Already created */
Why do we need the kobject? Can't we make this a group on the
pci_bus_kset->kobj with a group name of pcsc?
(I see you have a comment on this next bit later in here)
Event better if we can arrange for not to be added after that is
created but just be a group on it in the first place.
That is make it a group in bus_groups of the pci_bus_type alongside
the one with bus_attr_rescan.attr in it.
That should mean you don't need the two tried to set it up that
you have currently.
> +
> + pci_bus_kset = bus_get_kset(&pci_bus_type);
> + if (!pci_bus_kset) {
> + /* PCI bus kset not ready yet, will be retried later */
> + return;
> + }
> +
> +
> +/*
> + * The PCI subsystem is initialised later, therefore we need to add
> + * our sysfs entries later. This is done to avoid modifying the sysfs
> + * creation of the core pci driver.
Vs complexity and races, I think I'd rather you did modify that.
> + */
> +late_initcall(pcsc_sysfs_init);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 06/13] pci: pcsc: handle device resets
2025-10-03 9:00 ` [RFC PATCH 06/13] pci: pcsc: handle device resets Evangelos Petrongonas
@ 2025-10-09 14:49 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 14:49 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:42 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> The PCI Configuration Space Cache (PCSC) maintains cached values of
> configuration space registers for performance optimization. When a PCI
> device is reset or bus operations are dynamically changed, cached values
> become stale and can cause incorrect behavior. This patch ensures cache
> coherency by invalidating the PCSC cache in all scenarios where the
> underlying configuration space values may have changed.
>
> Device Reset Handling:
> ----------------------
> When PCI devices are reset, their configuration space registers return
> to default values. Add pcsc_device_reset() calls after all device reset
> operations to invalidate stale cached values:
>
> - Function Level Resets (FLR) in `pcie_flr()`
> - Advanced Features FLR in `pci_af_flr()`
> - Power Management resets (D3hot->D0 transition) in `pci_pm_reset()`
> - Device-specific resets in `pci_dev_specific_reset()`
> - D3cold power state transitions in `__pci_set_power_state()`
> - ACPI-based resets in `pci_dev_acpi_reset()`
> - Bus restore operations in `pci_bus_restore_locked()`
> - Slot restore operations in `pci_slot_restore_locked()`
> - Secondary bus resets in `pci_bridge_secondary_bus_reset()`
cxl bus reset?
>
> For secondary bus resets, `pcsc_reset_bus_recursively()` invalidates the
> cache for all devices on the secondary bus and subordinate buses. This
> also covers hotplug slot reset operations since `pciehp_reset_slot()`
> calls `pci_bridge_secondary_bus_reset()`.
>
> In addition, functions like `pci_dev_wait` are configured to bypass the
> cahce and reads the actual HW values.
cache
>
> Dynamic Ops Changes:
> --------------------
> The patch also addresses cache consistency issues when bus operations
> are dynamically changed via `pci_bus_set_ops()``. Different ops
> implementations may return different values for the same registers, and
> hardware state may have changed while using the different ops. This
> commit resets the cache for all devices on the affected bus
>
> Implementation Details:
> -----------------------
> The cache invalidation clears the cached_bitmask while preserving the
> cacheable_bitmask, as the configuration space layout remains unchanged
> after a reset. This allows the cache to be repopulated with fresh values
> on subsequent configuration space accesses.
>
> Known Limitations:
> ------------------
> - There is currently a gap in handling PowerPC secondary bus resets, as
> the architecture-specific `pcibios_reset_secondary_bus()` can bypass the
> generic `pci_reset_secondary_bus()` where our cache invalidation occurs.
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f518cfa266b5..db940f8fd408 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -26,6 +26,7 @@
> #include <linux/device.h>
> #include <linux/pm_runtime.h>
> #include <linux/pci_hotplug.h>
> +#include <linux/pcsc.h>
> #include <linux/vmalloc.h>
> #include <asm/dma.h>
> #include <linux/aer.h>
> @@ -1248,11 +1249,19 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> }
>
> if (root && root->config_rrs_sv) {
> +#ifdef CONFIG_PCSC
> + pcsc_hw_config_read(dev->bus, dev->devfn, PCI_VENDOR_ID, 4, &id);
> +#else
> pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +#endif
> if (!pci_bus_rrs_vendor_id(id))
> break;
> } else {
> +#ifdef CONFIG_PCSC
> + pcsc_hw_config_read(dev->bus, dev->devfn, PCI_COMMAND, 4, &id);
In the !CONFIG case define this to be pci_read_config_dword()
> +#else
> pci_read_config_dword(dev, PCI_COMMAND, &id);
> +#endif
> if (!PCI_POSSIBLE_ERROR(id))
> break;
> }
> void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
> @@ -5542,6 +5594,9 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
>
> list_for_each_entry(dev, &bus->devices, bus_list) {
> pci_dev_restore(dev);
> +#ifdef CONFIG_PCSC
> + pcsc_device_reset(dev);
> +#endif
> if (dev->subordinate) {
> pci_bridge_wait_for_secondary_bus(dev, "bus reset");
> pci_bus_restore_locked(dev->subordinate);
> @@ -5579,6 +5634,9 @@ static void pci_slot_restore_locked(struct pci_slot *slot)
> if (!dev->slot || dev->slot != slot)
> continue;
> pci_dev_restore(dev);
> +#ifdef CONFIG_PCSC
> + pcsc_device_reset(dev);
Definitely use a stub for these.
> +#endif
> if (dev->subordinate) {
> pci_bridge_wait_for_secondary_bus(dev, "slot reset");
> pci_bus_restore_locked(dev->subordinate);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools
2025-10-03 9:00 ` [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools Evangelos Petrongonas
@ 2025-10-09 14:56 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 14:56 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
Mostly suggestions to move away from ifdefs.
Jonathan
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index c26162b58365..9b5275ef2d16 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -50,6 +50,13 @@ config PCSC
> intercepts configuration space operations and maintains cached
> copies of register values
>
> +config PCSC_STATS
> + bool "PCI Configuration Space Cache Statistics"
> + depends on PCSC
> + default n
No need, that's what you get if you don't specify a default.
> + help
> + This option allows the collection of statistics for the PCSC.
> +
> source "drivers/pci/pcie/Kconfig"
>
> config PCI_MSI
> diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
> index 5412dea23446..304239b7ff8a 100644
> --- a/drivers/pci/pcsc.c
> +++ b/drivers/pci/pcsc.c
> @@ -25,9 +25,84 @@ static int __init pcsc_enabled_setup(char *str)
> }
> __setup("pcsc_enabled=", pcsc_enabled_setup);
>
> +#ifdef CONFIG_PCSC_STATS
Use IS_ENABLED() on this instead of defines below and let
dead code removal drop this without needing the ifdefs.
> +struct pcsc_stats {
> + /* Operation Counters */
> + unsigned long cache_hits;
> + unsigned long cache_misses;
> + unsigned long uncachable_reads;
> + unsigned long writes;
> + unsigned long cache_invalidations;
> + unsigned long total_reads;
> + unsigned long hw_reads;
> + unsigned long device_resets;
> + u64 total_cache_access_time; /* in milliseconds */
> + u64 total_hw_access_time; /* in milliseconds */
> + u64 hw_access_time_due_to_misses; /* in milliseconds */
> +};
> +#endif
> +
> static bool pcsc_initialised;
> static atomic_t num_nodes = ATOMIC_INIT(0);
>
> +#ifdef CONFIG_PCSC_STATS
> +struct pcsc_stats pcsc_stats;
static?
> +
> +static inline void pcsc_count_cache_hit(void)
> +{
I'd drop the stubs and do
if (IS_ENABLED_CONFIG_PCSC_STATS)) {
pcsc_stats.cache_hits++;
pcsc_stats.total_reads++;
}
Compiler should then figure out they are stubs and drop them for you.
> + pcsc_stats.cache_hits++;
> + pcsc_stats.total_reads++;
> +}
>
> if (WARN_ON(!dev || !bus || !word))
> return -EINVAL;
> @@ -734,7 +813,6 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
> if (WARN_ON(size != 1 && size != 2 && size != 4))
> return -EINVAL;
>
> - /* Check bounds */
Not sure why this is removed here.
> if (where + size > PCSC_CFG_SPC_SIZE)
> return -EINVAL;
>
> @@ -746,8 +824,17 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
> pcsc_get_byte(dev, where + i, &byte_val);
> word_cached |= ((u32)byte_val << (i * 8));
> }
> + pcsc_count_cache_hit();
> } else {
> +#ifdef CONFIG_PCSC_STATS
> + start_time = ktime_get();
> +#endif
> rc = pcsc_hw_config_read(bus, devfn, where, size, &word_cached);
> +#ifdef CONFIG_PCSC_STATS
> + duration = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> + pcsc_stats.hw_access_time_due_to_misses += duration;
> + pcsc_stats.total_hw_access_time += duration;
> +#endif
> if (rc) {
> pci_err(dev,
> "%s: Failed to read CFG Space where=%d size=%d",
> @@ -762,6 +849,7 @@ static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
> byte_val = (word_cached >> (i * 8)) & 0xFF;
> pcsc_update_byte(dev, where + i, byte_val);
> }
> + pcsc_count_cache_miss();
> }
>
> *word = word_cached;
> @@ -773,6 +861,17 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
> {
> int rc;
> struct pci_dev *dev;
> +#ifdef CONFIG_PCSC_STATS
> + ktime_t hw_start_time;
> + u64 hw_duration;
> +#endif
> +
> +#ifdef CONFIG_PCSC_STATS
> + u64 duration;
> + ktime_t start_time;
> +
> + start_time = ktime_get();
> +#endif
>
> if (unlikely(!pcsc_is_initialised()))
> goto read_from_dev;
> @@ -790,6 +889,10 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
> pcsc_is_access_cacheable(dev, where, size)) {
> rc = pcsc_get_and_insert_multiple(dev, bus, devfn, where, val,
> size);
> +#ifdef CONFIG_PCSC_STATS
As above. Stick this under and IS_ENABLED() and let dead code removal tidy it up.
> + duration = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> + pcsc_stats.total_cache_access_time += duration;
> +#endif
> if (likely(!rc)) {
> pci_dev_put(dev);
> return 0;
> @@ -797,11 +900,23 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
> /* if reading from the cache failed continue and try reading
> * from the actual device
> */
> + } else {
> + if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL)
> + pcsc_count_uncachable_read();
> }
> read_from_dev:
> +#ifdef CONFIG_PCSC_STATS
> + hw_start_time = ktime_get();
> +#endif
> if (dev)
> pci_dev_put(dev);
> - return pcsc_hw_config_read(bus, devfn, where, size, val);
> + rc = pcsc_hw_config_read(bus, devfn, where, size, val);
> +#ifdef CONFIG_PCSC_STATS
> + hw_duration = ktime_to_ns(ktime_sub(ktime_get(), hw_start_time));
> + /* Add timing for uncacheable reads */
> + pcsc_stats.total_hw_access_time += hw_duration;
> +#endif
> + return rc;
> }
> EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
>
> @@ -810,6 +925,11 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
> {
> int i;
> struct pci_dev *dev;
> + int rc;
> +#ifdef CONFIG_PCSC_STATS
> + ktime_t hw_start_time;
> + u64 hw_duration;
> +#endif
>
> if (unlikely(!pcsc_is_initialised()))
> goto write_to_dev;
> @@ -828,12 +948,22 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
> if (pcsc_is_access_cacheable(dev, where, size)) {
> for (i = 0; i < size; i++)
> pcsc_set_cached(dev, where + i, false);
> + pcsc_count_cache_invalidation();
> }
> }
> write_to_dev:
> + pcsc_count_write();
> if (dev)
> pci_dev_put(dev);
> - return pcsc_hw_config_write(bus, devfn, where, size, val);
> +#ifdef CONFIG_PCSC_STATS
> + hw_start_time = ktime_get();
> +#endif
> + rc = pcsc_hw_config_write(bus, devfn, where, size, val);
> +#ifdef CONFIG_PCSC_STATS
> + hw_duration = ktime_to_ns(ktime_sub(ktime_get(), hw_start_time));
> + pcsc_stats.total_hw_access_time += hw_duration;
> +#endif
> + return rc;
> }
> EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
>
> @@ -851,6 +981,7 @@ int pcsc_device_reset(struct pci_dev *dev)
> * some of the HWInt values that are going to remain constant after a reset.
> */
> bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE);
> + pcsc_count_device_reset();
> return 0;
> }
>
> @@ -948,8 +1079,50 @@ static ssize_t pcsc_enabled_store(struct kobject *kobj,
> static struct kobj_attribute pcsc_enabled_attribute =
> __ATTR(enabled, 0644, pcsc_enabled_show, pcsc_enabled_store);
>
> +#ifdef CONFIG_PCSC_STATS
> +static ssize_t pcsc_stats_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sysfs_emit(
> + buf,
> + "Cache Hits: %lu\n"
> + "Cache Misses: %lu\n"
> + "Uncachable Reads: %lu\n"
> + "Writes: %lu\n"
> + "Cache Invalidations: %lu\n"
> + "Device Resets: %lu\n"
> + "Total Reads: %lu\n"
> + "Hardware Reads: %lu\n"
> + "Hit Rate: %lu%%\n"
> + "Total Cache Access Time: %llu us\n"
> + "Cache Access Time (without HW reads due to Misses): %llu us\n"
> + "HW Access Time due to misses: %llu us\n"
> + "Total Hardware Access Time: %llu us\n",
> + pcsc_stats.cache_hits, pcsc_stats.cache_misses,
> + pcsc_stats.uncachable_reads, pcsc_stats.writes,
> + pcsc_stats.cache_invalidations, pcsc_stats.device_resets,
> + pcsc_stats.total_reads,
> + pcsc_stats.hw_reads,
> + pcsc_stats.total_reads ?
> + (pcsc_stats.cache_hits * 100) / pcsc_stats.total_reads :
> + 0,
> + pcsc_stats.total_cache_access_time / 1000,
> + (pcsc_stats.total_cache_access_time -
> + pcsc_stats.hw_access_time_due_to_misses) /
> + 1000,
> + pcsc_stats.hw_access_time_due_to_misses / 1000,
> + pcsc_stats.total_hw_access_time / 1000);
> +}
> +
This would need a __maybe_unused
> +static struct kobj_attribute pcsc_stats_attribute =
> + __ATTR(stats, 0444, pcsc_stats_show, NULL);
> +#endif
> +
> static struct attribute *pcsc_attrs[] = {
> &pcsc_enabled_attribute.attr,
> +#ifdef CONFIG_PCSC_STATS
And this ifdef can't be as easily avoided. You could use an
is_visible but probably not worth it.
> + &pcsc_stats_attribute.attr,
> +#endif
> NULL,
> };
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 08/13] pci: Save only spec-defined configuration space
2025-10-03 9:00 ` [RFC PATCH 08/13] pci: Save only spec-defined configuration space Evangelos Petrongonas
@ 2025-10-09 14:58 ` Jonathan Cameron
0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Cameron @ 2025-10-09 14:58 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Bjorn Helgaas, Alex Williamson, Rafael J . Wysocki, Len Brown,
Pasha Tatashin, David Matlack, Vipin Sharma, Chris Li, Jason Miu,
Pratyush Yadav, Stanislav Spassov, linux-pci, linux-acpi,
linux-kernel, nh-open-source
On Fri, 3 Oct 2025 09:00:44 +0000
Evangelos Petrongonas <epetron@amazon.de> wrote:
> Change PCI configuration space save/restore operations by
> saving only the regions defined by the PCI specification avoiding any
> potential side effects of undefined behaviour.
>
> The current implementation saves the entire configuration space for
> device restore operations, including reserved and undefined regions.
> This change modifies the save logic to save only architecturally defined
> configuration space regions and skipping the undefined areas.
>
> This benefits the PCSC hitrate, as a 4byte access to a region where only
> 2 bytes are cacheable and 2 are undefined, therefore uncached, will lead
> to a HW access instead.
>
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> ---
> drivers/pci/pci.c | 61 +++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 56 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index db940f8fd408..3e99baaaf8cd 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1752,11 +1752,62 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
> int pci_save_state(struct pci_dev *dev)
> {
> int i;
> - /* XXX: 100% dword access ok here? */
> - for (i = 0; i < 16; i++) {
> - pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
> - pci_dbg(dev, "save config %#04x: %#010x\n",
> - i * 4, dev->saved_config_space[i]);
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
> + for (i = 0; i < 13; i++) {
Needs basing on the register defines not magic numbers.
Same for other cases.
Jonathan
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-10-09 14:59 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-03 9:00 [RFC PATCH 00/13] Introduce PCI Configuration Space Cache (PCSC) Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 01/13] pci: pcsc: Add plumbing for the " Evangelos Petrongonas
2025-10-09 12:38 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 02/13] pci: pcsc: implement basic functionality Evangelos Petrongonas
2025-10-09 13:12 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities Evangelos Petrongonas
2025-10-09 14:17 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities Evangelos Petrongonas
2025-10-09 14:24 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params Evangelos Petrongonas
2025-10-09 14:41 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 06/13] pci: pcsc: handle device resets Evangelos Petrongonas
2025-10-09 14:49 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools Evangelos Petrongonas
2025-10-09 14:56 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 08/13] pci: Save only spec-defined configuration space Evangelos Petrongonas
2025-10-09 14:58 ` Jonathan Cameron
2025-10-03 9:00 ` [RFC PATCH 09/13] vfio: pci: Fill only spec-defined configuration space regions Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 10/13] pci: pcsc: Use contiguous pages for the cache data Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 11/13] pci: pcsc: Add kexec persistence support via KHO Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 12/13] pci: pcsc: introduce persistence versioning Evangelos Petrongonas
2025-10-03 9:00 ` [RFC PATCH 13/13] pci: pcsc: introduce hashtable lookup to speed up restoration Evangelos Petrongonas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).