* [PATCH v5 00/11] PCI: liveupdate: PCI core support for Live Update
From: David Matlack @ 2026-05-12 18:48 UTC
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
This series can be found on GitHub:
https://github.com/dmatlack/linux/tree/liveupdate/pci/base/v5
This patch series introduces the initial support in the PCI core for
Live Update, enabling drivers to preserve PCI devices across a
kexec-based kernel update without interrupting the device. This
functionality is critical for minimizing downtime in environments where
PCI devices (e.g., those assigned to VMs via VFIO) must continue
operating or maintain state across a host kernel upgrade.
Specifically, this patch series allows preserved PCI devices to perform
memory transactions to/from system memory (DMA) uninterrupted across a
Live Update. The devices can be behind a bridge, but must not be VFs.
Support for P2P and preserving VFs will come in future series.
Series Overview
---------------
This series implements the following to support PCI device preservation
across Live Update:
1. Set up a File-Lifecycle-Bound (FLB) handler to track and preserve
PCI-specific state (struct pci_ser) across Live Update using Kexec
Handover (KHO).
2. Add APIs for drivers to register "outgoing" devices for
preservation and for the PCI core to identify "incoming" preserved
devices during enumeration.
3. Automatically preserve all upstream bridges for any preserved
endpoint. Use reference counting to ensure bridges remain preserved
as long as any downstream device is preserved.
4. Guarantee that preserved devices can be identified by the same
RequesterID (bus, device, function) for as long as they are
preserved by always inheriting secondary and subordinate bus
numbers and ARI Forwarding Enable on bridges with preserved
downstream endpoints.
5. Guarantee that memory transactions to/from preserved devices are
routed the same way by inheriting Access Control Services (ACS)
flags across a Live Update.
6. Modify the PCI shutdown path to avoid disabling bus mastering on
preserved devices and their upstream bridges, allowing memory
transactions to continue uninterrupted.
7. Provide comprehensive documentation for the FLB API, device
tracking mechanisms, and the division of responsibilities between
the PCI core, drivers, and userspace.
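The bridge reference counting in item 3 can be pictured with a small user-space model. This is only a sketch: struct node and the model_*() helpers are hypothetical names invented for illustration and are not part of the series, which tracks the count in struct pci_dev_ser.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct node {
	struct node *upstream;	/* parent bridge, NULL at the root */
	unsigned int refcount;	/* preserved devices at or below this node */
};

/* Preserve a device: take a reference on it and on every upstream bridge. */
static void model_preserve(struct node *dev)
{
	for (struct node *n = dev; n; n = n->upstream)
		n->refcount++;
}

/* Cancel preservation: drop the references taken by model_preserve(). */
static void model_unpreserve(struct node *dev)
{
	for (struct node *n = dev; n; n = n->upstream) {
		assert(n->refcount > 0);
		n->refcount--;
	}
}

/* A node stays preserved while anything at or below it is preserved. */
static int model_is_preserved(const struct node *n)
{
	return n->refcount > 0;
}
```

With two endpoints below one bridge, the bridge's count reaches 2 and it remains preserved until both endpoints have been unpreserved.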
Dependencies
------------
This series is built on top of the next branch of the liveupdate.git tree,
which has 2 commits to enable refcounting the incoming FLB:
https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git/log/?h=next
Testing
-------
This series was tested in conjunction with v4 of the VFIO PCI driver
series:
https://lore.kernel.org/kvm/20260511234802.2280368-1-vipinsh@google.com/
The full set of patches that I used for testing can be found on GitHub:
https://github.com/dmatlack/linux/tree/liveupdate/pci/base/v5-with-vfio
The full set of patches was tested using the new VFIO selftests:
- vfio_pci_liveupdate_uapi_test
- vfio_pci_liveupdate_kexec_test
Both tests were run in a QEMU-based VM environment, using a
single virtio-net PCIe device connected to a root port (to exercise the
bridge support in this series), and in a baremetal environment on an
Intel EMR server, using 8x Intel DSA PCIe devices (each on a host
bridge) and 1x NVMe device connected to a root port.
Future Work
-----------
After this series we expect to make further improvements to the PCI core
support for Live Update:
- Allow P2P across Live Update by avoiding sizing or moving preserved
device BARs and preserving all upstream bridge windows.
- Support preserving Virtual Functions, by preserving SR-IOV
configuration on PFs and enumerating VFs after Live Update.
Changelog
---------
v5:
- Update PCI LIVE UPDATE entry in MAINTAINERS to use liveupdate.git,
add kexec@ mailing list, and drop Bjorn (Pasha, Bjorn, Pratyush)
- Create separate headers for Live Update definitions to avoid future
patch conflicts (me)
- Add kernel doc for public (Driver) API (me)
- Rename reserved field to padding (Vipin)
- Reorder checks outside of mutex where possible (Jacob)
- Clarify refcount in struct pci_dev_ser in kernel-doc (Sami)
- Require CONFIG_64BIT to avoid overflowing xarray key (Sashiko)
- Various spelling and grammar fixes (Bjorn)
- Ensure incoming and outgoing devices do not have their bus numbers
changed during manual rescans via sysfs (Jacob)
- Fix refcount dropping for upstream bridges during finish (Sashiko)
- Disallow devices with PCI_DEV_FLAGS_ACS_ENABLED_QUIRK to simplify
ACS inheritance across Live Update (Sashiko)
- Fix ACS re-enablement via pci_restore_state() (Sashiko)
- Drop commit that requires singleton iommu groups (me, Sami)
- Add per-device lock to protect Live Update fields (Sami, Sashiko)
v4: https://lore.kernel.org/linux-pci/20260423212316.3431746-1-dmatlack@google.com/
v3: https://lore.kernel.org/kvm/20260323235817.1960573-1-dmatlack@google.com/
v2: https://lore.kernel.org/kvm/20260129212510.967611-1-dmatlack@google.com/
v1: https://lore.kernel.org/kvm/20251126193608.2678510-1-dmatlack@google.com/
rfc: https://lore.kernel.org/kvm/20251018000713.677779-1-vipinsh@google.com/
David Matlack (11):
PCI: liveupdate: Set up FLB handler for the PCI core
PCI: liveupdate: Track outgoing preserved PCI devices
PCI: liveupdate: Track incoming preserved PCI devices
PCI: liveupdate: Document driver binding responsibilities
PCI: liveupdate: Keep bus numbers constant during Live Update
PCI: liveupdate: Auto-preserve upstream bridges across Live Update
PCI: liveupdate: Inherit ACS flags in incoming preserved devices
PCI: liveupdate: Inherit ARI Forwarding Enable on preserved bridges
PCI: liveupdate: Freeze preservation status during shutdown
PCI: liveupdate: Do not disable bus mastering on preserved devices
during kexec
Documentation: PCI: Add documentation for Live Update
Documentation/PCI/index.rst | 1 +
Documentation/PCI/liveupdate.rst | 29 +
.../admin-guide/kernel-parameters.txt | 6 +-
Documentation/core-api/liveupdate.rst | 1 +
MAINTAINERS | 12 +
drivers/pci/Kconfig | 14 +
drivers/pci/Makefile | 1 +
drivers/pci/liveupdate.c | 807 ++++++++++++++++++
drivers/pci/liveupdate.h | 66 ++
drivers/pci/pci-driver.c | 33 +-
drivers/pci/pci.c | 13 +-
drivers/pci/probe.c | 29 +-
include/linux/kho/abi/pci.h | 64 ++
include/linux/pci.h | 4 +
include/linux/pci_liveupdate.h | 75 ++
15 files changed, 1140 insertions(+), 15 deletions(-)
create mode 100644 Documentation/PCI/liveupdate.rst
create mode 100644 drivers/pci/liveupdate.c
create mode 100644 drivers/pci/liveupdate.h
create mode 100644 include/linux/kho/abi/pci.h
create mode 100644 include/linux/pci_liveupdate.h
base-commit: 34e8f02817e31826e76bb2ded48bf28fe921f20b
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v5 01/11] PCI: liveupdate: Set up FLB handler for the PCI core
From: David Matlack @ 2026-05-12 18:48 UTC
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Set up a File-Lifecycle-Bound (FLB) handler for the PCI core to enable
it to participate in the preservation of PCI devices across Live Update.
Essentially, this commit enables the PCI core to allocate a struct
(struct pci_ser) and preserve it across a Live Update whenever at least
one device is preserved.
Preserving PCI devices across Live Update is built on top of the Live
Update Orchestrator's (LUO) support for file preservation. Drivers are
expected to expose a file to userspace to represent a single PCI device
and support preservation of that file. This is intended primarily to
support preservation of PCI devices bound to VFIO drivers.
This commit enables drivers to register their liveupdate_file_handler
with the PCI core so that the PCI core can do its own tracking and
enforcement of which devices are preserved.
pci_liveupdate_register_flb(driver_file_handler);
pci_liveupdate_unregister_flb(driver_file_handler);
When the first file (with a handler registered with the PCI core) is
preserved, the PCI core will be notified to allocate its tracking struct
(pci_ser). When the last file is unpreserved (i.e. preservation
cancelled) the PCI core will be notified to free struct pci_ser.
This struct is preserved across a Live Update using KHO and can be
fetched by the PCI core during early boot (e.g. during device
enumeration) so that it knows which devices were preserved.
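The allocate-on-first, free-on-last lifecycle described above can be modeled in user space as follows. This is a rough sketch under stated assumptions: struct flb_model and both helpers are hypothetical, and calloc()/free() merely stand in for the KHO allocation done by the real preserve() and unpreserve() callbacks.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of FLB-managed state; names are illustrative only. */
struct flb_model {
	unsigned int nr_files;	/* preserved files sharing this FLB */
	void *obj;		/* stands in for struct pci_ser */
};

/* A file is preserved: allocate the shared object on the first one. */
static int flb_model_file_preserve(struct flb_model *flb, size_t size)
{
	if (flb->nr_files == 0) {
		flb->obj = calloc(1, size);
		if (!flb->obj)
			return -1;
	}
	flb->nr_files++;
	return 0;
}

/* A file is unpreserved: free the shared object on the last one. */
static void flb_model_file_unpreserve(struct flb_model *flb)
{
	assert(flb->nr_files > 0);
	if (--flb->nr_files == 0) {
		free(flb->obj);
		flb->obj = NULL;
	}
}
```

The second preserved file reuses the object allocated for the first, which is what lets the PCI core keep one struct pci_ser for all preserved devices.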
Note: This commit only allocates struct pci_ser and preserves it across
Live Update. A subsequent commit will add an API for drivers to tell the
PCI core exactly which devices are being preserved.
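For a sense of how much memory is preserved, the sizing can be reproduced in user space. The sketch below copies the struct layout from this patch using fixed-width types; pci_ser_size() is a hypothetical helper that mirrors struct_size_t() without the kernel's overflow saturation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space copies of the ABI structs in this patch, for illustration. */
struct pci_dev_ser {
	uint32_t domain;
	uint16_t bdf;
	uint16_t padding;
} __attribute__((packed));

struct pci_ser {
	uint32_t max_nr_devices;
	uint32_t nr_devices;
	struct pci_dev_ser devices[];
} __attribute__((packed));

/* Header size plus n array elements (no overflow handling in this model). */
static size_t pci_ser_size(uint32_t n)
{
	return offsetof(struct pci_ser, devices) +
	       (size_t)n * sizeof(struct pci_dev_ser);
}
```

Each device costs 8 bytes on top of an 8-byte header, so a system with 10 PCI devices present would preserve 88 bytes under this layout.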
Note: There is no reason to check for kho_is_enabled() since it can be
assumed to return true. If KHO was not enabled then Live Update would
not be enabled and these routines would never run.
Signed-off-by: David Matlack <dmatlack@google.com>
---
MAINTAINERS | 10 +++
drivers/pci/Kconfig | 14 +++
drivers/pci/Makefile | 1 +
drivers/pci/liveupdate.c | 153 +++++++++++++++++++++++++++++++++
include/linux/kho/abi/pci.h | 61 +++++++++++++
include/linux/pci.h | 1 +
include/linux/pci_liveupdate.h | 30 +++++++
7 files changed, 270 insertions(+)
create mode 100644 drivers/pci/liveupdate.c
create mode 100644 include/linux/kho/abi/pci.h
create mode 100644 include/linux/pci_liveupdate.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 2fb1c75afd16..6c618830cf61 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20530,6 +20530,16 @@ L: linux-pci@vger.kernel.org
S: Supported
F: Documentation/PCI/pci-error-recovery.rst
+PCI LIVE UPDATE
+M: David Matlack <dmatlack@google.com>
+L: kexec@lists.infradead.org
+L: linux-pci@vger.kernel.org
+S: Maintained
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
+F: drivers/pci/liveupdate.c
+F: include/linux/kho/abi/pci.h
+F: include/linux/pci_liveupdate.h
+
PCI MSI DRIVER FOR ALTERA MSI IP
L: linux-pci@vger.kernel.org
S: Orphan
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 33c88432b728..08398cbe970c 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -328,6 +328,20 @@ config VGA_ARB_MAX_GPUS
Reserves space in the kernel to maintain resource locking for
multiple GPUS. The overhead for each GPU is very small.
+config PCI_LIVEUPDATE
+ bool "PCI Live Update Support (EXPERIMENTAL)"
+ depends on PCI && LIVEUPDATE
+ help
+ Enable PCI core support for preserving PCI devices across Live
+ Update. This, in combination with support in a device's driver,
+ enables PCI devices to run and perform memory transactions
+ uninterrupted during a kexec for Live Update.
+
+ This option should only be enabled by developers working on
+ implementing this support.
+
+ If unsure, say N.
+
source "drivers/pci/hotplug/Kconfig"
source "drivers/pci/controller/Kconfig"
source "drivers/pci/endpoint/Kconfig"
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 41ebc3b9a518..e8d003cb6757 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_SYSFS) += pci-sysfs.o slot.o
obj-$(CONFIG_ACPI) += pci-acpi.o
obj-$(CONFIG_GENERIC_PCI_IOMAP) += iomap.o
+obj-$(CONFIG_PCI_LIVEUPDATE) += liveupdate.o
endif
obj-$(CONFIG_OF) += of.o
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
new file mode 100644
index 000000000000..dd2449e12b6d
--- /dev/null
+++ b/drivers/pci/liveupdate.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2026, Google LLC.
+ * David Matlack <dmatlack@google.com>
+ */
+
+/**
+ * DOC: PCI Live Update
+ *
+ * The PCI subsystem participates in the Live Update process to enable drivers
+ * to preserve their PCI devices across kexec.
+ *
+ * .. note::
+ * The support for preserving PCI devices across Live Update is currently
+ * *partial* and should be considered *experimental*. It should only be
+ * used by developers working on the implementation for the time being.
+ *
+ * To enable the support, enable ``CONFIG_PCI_LIVEUPDATE``.
+ *
+ * File-Lifecycle-Bound (FLB) Data
+ * ===============================
+ *
+ * PCI device preservation across Live Update is built on top of the Live Update
+ * Orchestrator's (LUO) support for file preservation across kexec. Drivers
+ * are expected to expose a file to represent a single PCI device and support
+ * preservation of that file with ``ioctl(LIVEUPDATE_SESSION_PRESERVE_FD)``.
+ * This allows userspace to control the preservation of devices and ensure
+ * proper lifecycle management while a device is preserved. The first intended
+ * use-case is preserving vfio-pci device files.
+ *
+ * The PCI core maintains its own state about what devices are being preserved
+ * across Live Update using a feature called File-Lifecycle-Bound (FLB) data in
+ * LUO. Essentially, this allows the PCI core to allocate struct pci_ser when
+ * the first device (file) is preserved and free it when the last device (file)
+ * is unpreserved. After kexec, the PCI core can fetch the struct pci_ser (which
+ * was constructed by the previous kernel) from LUO at any time (e.g. during
+ * enumeration) so that it knows which devices were preserved.
+ *
+ * To enable the PCI core to be notified whenever a file representing a device
+ * is preserved, drivers must register their struct liveupdate_file_handler with
+ * the PCI core by using the following APIs:
+ *
+ * * ``pci_liveupdate_register_flb(driver_file_handler)``
+ * * ``pci_liveupdate_unregister_flb(driver_file_handler)``
+ */
+
+#define pr_fmt(fmt) "PCI: liveupdate: " fmt
+
+#include <linux/io.h>
+#include <linux/kexec_handover.h>
+#include <linux/kho/abi/pci.h>
+#include <linux/liveupdate.h>
+#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/pci.h>
+
+static int pci_flb_preserve(struct liveupdate_flb_op_args *args)
+{
+ struct pci_dev *dev = NULL;
+ u32 max_nr_devices = 0;
+ struct pci_ser *ser;
+ unsigned long size;
+
+ /*
+ * Allocate enough space to preserve all of the devices that are
+ * currently present on the system. Extra padding can be added to this
+ * in the future to increase the chances that there is enough room to
+ * preserve devices that are not yet present on the system (e.g. VFs,
+ * hot-plugged devices).
+ */
+ for_each_pci_dev(dev)
+ max_nr_devices++;
+
+ size = struct_size_t(struct pci_ser, devices, max_nr_devices);
+
+ ser = kho_alloc_preserve(size);
+ if (IS_ERR(ser))
+ return PTR_ERR(ser);
+
+ pr_debug("Preserved struct pci_ser with room for %u devices\n",
+ max_nr_devices);
+
+ ser->max_nr_devices = max_nr_devices;
+ ser->nr_devices = 0;
+
+ args->obj = ser;
+ args->data = virt_to_phys(ser);
+ return 0;
+}
+
+static void pci_flb_unpreserve(struct liveupdate_flb_op_args *args)
+{
+ struct pci_ser *ser = args->obj;
+
+ WARN_ON_ONCE(ser->nr_devices);
+ kho_unpreserve_free(ser);
+
+ pr_debug("Unpreserved struct pci_ser\n");
+}
+
+static int pci_flb_retrieve(struct liveupdate_flb_op_args *args)
+{
+ args->obj = phys_to_virt(args->data);
+ return 0;
+}
+
+static void pci_flb_finish(struct liveupdate_flb_op_args *args)
+{
+ kho_restore_free(args->obj);
+}
+
+static struct liveupdate_flb_ops pci_liveupdate_flb_ops = {
+ .preserve = pci_flb_preserve,
+ .unpreserve = pci_flb_unpreserve,
+ .retrieve = pci_flb_retrieve,
+ .finish = pci_flb_finish,
+ .owner = THIS_MODULE,
+};
+
+static struct liveupdate_flb pci_liveupdate_flb = {
+ .ops = &pci_liveupdate_flb_ops,
+ .compatible = PCI_LUO_FLB_COMPATIBLE,
+};
+
+/**
+ * pci_liveupdate_register_flb() - Register a file handler with the PCI core
+ * @fh: The file handler to register.
+ *
+ * Drivers should call pci_liveupdate_register_flb() to register their
+ * struct liveupdate_file_handler with the PCI core. This enables the PCI core
+ * to allocate its outgoing struct pci_ser whenever the first device is
+ * preserved, and free it when the last device is unpreserved.
+ *
+ * Return: 0 on success, <0 on failure.
+ */
+int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh)
+{
+ pr_debug("Registering file handler \"%s\"\n", fh->compatible);
+ return liveupdate_register_flb(fh, &pci_liveupdate_flb);
+}
+EXPORT_SYMBOL_GPL(pci_liveupdate_register_flb);
+
+/**
+ * pci_liveupdate_unregister_flb() - Unregister a file handler from the PCI core
+ * @fh: The file handler to unregister.
+ */
+void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh)
+{
+ pr_debug("Unregistering file handler \"%s\"\n", fh->compatible);
+ liveupdate_unregister_flb(fh, &pci_liveupdate_flb);
+}
+EXPORT_SYMBOL_GPL(pci_liveupdate_unregister_flb);
diff --git a/include/linux/kho/abi/pci.h b/include/linux/kho/abi/pci.h
new file mode 100644
index 000000000000..6ebcf817fff4
--- /dev/null
+++ b/include/linux/kho/abi/pci.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2026, Google LLC.
+ * David Matlack <dmatlack@google.com>
+ */
+
+#ifndef _LINUX_KHO_ABI_PCI_H
+#define _LINUX_KHO_ABI_PCI_H
+
+#include <linux/bug.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/**
+ * DOC: PCI File-Lifecycle-Bound (FLB) Live Update ABI
+ *
+ * This header defines the ABI for preserving core PCI state across kexec using
+ * Live Update File-Lifecycle-Bound (FLB) data.
+ *
+ * This interface is a contract. Any modification to any of the serialization
+ * structs defined here constitutes a breaking change. Such changes require
+ * incrementing the version number in the PCI_LUO_FLB_COMPATIBLE string.
+ */
+
+#define PCI_LUO_FLB_COMPATIBLE "pci-v1"
+
+/**
+ * struct pci_dev_ser - Serialized state about a single PCI device.
+ *
+ * @domain: The device's PCI domain number (segment).
+ * @bdf: The device's PCI bus, device, and function number.
+ * @padding: Padding to naturally align struct pci_dev_ser.
+ */
+struct pci_dev_ser {
+ u32 domain;
+ u16 bdf;
+ u16 padding;
+} __packed;
+
+/**
+ * struct pci_ser - PCI Subsystem Live Update State
+ *
+ * This struct tracks state about all devices that are being preserved across
+ * a Live Update for the next kernel.
+ *
+ * @max_nr_devices: The length of the devices[] flexible array.
+ * @nr_devices: The number of devices that were preserved.
+ * @devices: Flexible array of pci_dev_ser structs for each device.
+ */
+struct pci_ser {
+ u32 max_nr_devices;
+ u32 nr_devices;
+ struct pci_dev_ser devices[];
+} __packed;
+
+/* Ensure all elements of devices[] are naturally aligned. */
+static_assert(offsetof(struct pci_ser, devices) % sizeof(unsigned long) == 0);
+static_assert(sizeof(struct pci_dev_ser) % sizeof(unsigned long) == 0);
+
+#endif /* _LINUX_KHO_ABI_PCI_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..8cadeeab86fd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -42,6 +42,7 @@
#include <uapi/linux/pci.h>
#include <linux/pci_ids.h>
+#include <linux/pci_liveupdate.h>
#define PCI_STATUS_ERROR_BITS (PCI_STATUS_DETECTED_PARITY | \
PCI_STATUS_SIG_SYSTEM_ERROR | \
diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
new file mode 100644
index 000000000000..8ec98beefcb4
--- /dev/null
+++ b/include/linux/pci_liveupdate.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Live Update support (Public/Driver API)
+ *
+ * Copyright (c) 2026, Google LLC.
+ * David Matlack <dmatlack@google.com>
+ */
+#ifndef LINUX_PCI_LIVEUPDATE_H
+#define LINUX_PCI_LIVEUPDATE_H
+
+#include <linux/liveupdate.h>
+#include <linux/types.h>
+
+struct pci_dev;
+
+#ifdef CONFIG_PCI_LIVEUPDATE
+int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh);
+void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh);
+#else
+static inline int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh)
+{
+}
+#endif
+
+#endif /* LINUX_PCI_LIVEUPDATE_H */
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v5 02/11] PCI: liveupdate: Track outgoing preserved PCI devices
From: David Matlack @ 2026-05-12 18:48 UTC
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Add APIs to allow drivers to notify the PCI core of which devices are
being preserved across a Live Update for the next kernel, i.e.
"outgoing" devices.
Drivers must notify the PCI core when devices are preserved so that the
PCI core can update its FLB data (struct pci_ser) and track the list of
outgoing devices. pci_liveupdate_preserve() notifies the PCI core that a
device must be preserved across Live Update. pci_liveupdate_unpreserve()
reverses this (cancels the preservation of the device).
This tracking ensures the PCI core knows which devices may need special
handling during shutdown and kexec, and allows the list of preserved
devices to be handed off to the next kernel.
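The slot search used when a device is preserved can be tried out in user space. The following is a simplified model, not the kernel code: struct slot, struct table, table_claim() and table_release() are hypothetical names, locking is omitted, and the array is fixed-size rather than a flexible array in KHO memory.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_DEVICES 4

/* Simplified slot: refcount == 0 means the slot is free. */
struct slot {
	uint32_t domain;
	uint16_t bdf;
	uint16_t refcount;
};

struct table {
	uint32_t max_nr_devices;
	uint32_t nr_devices;
	struct slot devices[MODEL_MAX_DEVICES];
};

/* Returns the index of the claimed slot, or -1 if the table is full. */
static int table_claim(struct table *t, uint32_t domain, uint16_t bdf)
{
	if (t->nr_devices == t->max_nr_devices)
		return -1;

	for (uint32_t i = 0; i < t->max_nr_devices; i++) {
		/* Start at nr_devices; wrap around to reuse released slots. */
		uint32_t index = (t->nr_devices + i) % t->max_nr_devices;
		struct slot *s = &t->devices[index];

		if (s->refcount)
			continue;

		s->domain = domain;
		s->bdf = bdf;
		s->refcount = 1;
		t->nr_devices++;
		return (int)index;
	}
	return -1;
}

/* Zero the slot so that a later claim can reuse it. */
static void table_release(struct table *t, int index)
{
	t->devices[index] = (struct slot){ 0 };
	t->nr_devices--;
}
```

When no slot has been released, the first probe at index nr_devices is always free, so the search finishes in one step. After a release, the search may walk past busy slots before wrapping around to the freed one.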
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 167 ++++++++++++++++++++++++++++++---
drivers/pci/probe.c | 3 +
include/linux/kho/abi/pci.h | 9 +-
include/linux/pci.h | 3 +
include/linux/pci_liveupdate.h | 23 +++++
5 files changed, 191 insertions(+), 14 deletions(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index dd2449e12b6d..9c4582ecd55c 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -43,6 +43,26 @@
*
* * ``pci_liveupdate_register_flb(driver_file_handler)``
* * ``pci_liveupdate_unregister_flb(driver_file_handler)``
+ *
+ * Device Tracking
+ * ===============
+ *
+ * Drivers must notify the PCI core when specific devices are preserved or
+ * unpreserved with the following APIs:
+ *
+ * * ``pci_liveupdate_preserve(pci_dev)``
+ * * ``pci_liveupdate_unpreserve(pci_dev)``
+ *
+ * This allows the PCI core to keep its FLB data (struct pci_ser) up to date
+ * with the list of **outgoing** preserved devices for the next kernel.
+ *
+ * Restrictions
+ * ============
+ *
+ * The PCI core enforces the following restrictions on which devices can be
+ * preserved. These may be relaxed in the future:
+ *
+ * * The device cannot be a Virtual Function (VF).
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
@@ -55,13 +75,29 @@
#include <linux/mm.h>
#include <linux/pci.h>
+/**
+ * struct pci_flb_outgoing - Outgoing PCI FLB object
+ * @ser: The outgoing struct pci_ser for the next kernel.
+ * @lock: Lock used to protect against changes to @ser.
+ */
+struct pci_flb_outgoing {
+ struct pci_ser *ser;
+ struct mutex lock;
+};
+
static int pci_flb_preserve(struct liveupdate_flb_op_args *args)
{
+ struct pci_flb_outgoing *outgoing;
struct pci_dev *dev = NULL;
u32 max_nr_devices = 0;
- struct pci_ser *ser;
unsigned long size;
+ outgoing = kmalloc_obj(*outgoing);
+ if (!outgoing)
+ return -ENOMEM;
+
+ mutex_init(&outgoing->lock);
+
/*
* Allocate enough space to preserve all of the devices that are
* currently present on the system. Extra padding can be added to this
@@ -74,27 +110,30 @@ static int pci_flb_preserve(struct liveupdate_flb_op_args *args)
size = struct_size_t(struct pci_ser, devices, max_nr_devices);
- ser = kho_alloc_preserve(size);
- if (IS_ERR(ser))
- return PTR_ERR(ser);
+ outgoing->ser = kho_alloc_preserve(size);
+ if (IS_ERR(outgoing->ser)) {
+ /* Read the error before freeing outgoing to avoid a use-after-free. */
+ int err = PTR_ERR(outgoing->ser);
+
+ kfree(outgoing);
+ return err;
+ }
pr_debug("Preserved struct pci_ser with room for %u devices\n",
max_nr_devices);
- ser->max_nr_devices = max_nr_devices;
- ser->nr_devices = 0;
+ outgoing->ser->max_nr_devices = max_nr_devices;
+ outgoing->ser->nr_devices = 0;
- args->obj = ser;
- args->data = virt_to_phys(ser);
+ args->obj = outgoing;
+ args->data = virt_to_phys(outgoing->ser);
return 0;
}
static void pci_flb_unpreserve(struct liveupdate_flb_op_args *args)
{
- struct pci_ser *ser = args->obj;
+ struct pci_flb_outgoing *outgoing = args->obj;
- WARN_ON_ONCE(ser->nr_devices);
- kho_unpreserve_free(ser);
+ WARN_ON_ONCE(outgoing->ser->nr_devices);
+ kho_unpreserve_free(outgoing->ser);
+ kfree(outgoing);
pr_debug("Unpreserved struct pci_ser\n");
}
@@ -123,6 +162,112 @@ static struct liveupdate_flb pci_liveupdate_flb = {
.compatible = PCI_LUO_FLB_COMPATIBLE,
};
+/**
+ * pci_liveupdate_preserve() - Preserve a PCI device across Live Update
+ * @dev: The PCI device to preserve.
+ *
+ * pci_liveupdate_preserve() notifies the PCI core that a PCI device should be
+ * preserved across the next Live Update. Drivers must call
+ * pci_liveupdate_preserve() from their struct liveupdate_file_handler
+ * preserve() callback to ensure the outgoing struct pci_ser is allocated.
+ *
+ * Return: 0 on success, <0 on failure.
+ */
+int pci_liveupdate_preserve(struct pci_dev *dev)
+{
+ struct pci_flb_outgoing *outgoing = NULL;
+ struct pci_ser *ser;
+ int i, ret;
+
+ if (dev->is_virtfn)
+ return -EINVAL;
+
+ ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&outgoing);
+ if (ret)
+ return ret;
+
+ if (!outgoing)
+ return -ENOENT;
+
+ guard(mutex)(&outgoing->lock);
+ ser = outgoing->ser;
+
+ guard(write_lock)(&dev->liveupdate.lock);
+
+ if (dev->liveupdate.outgoing)
+ return -EBUSY;
+
+ if (ser->nr_devices == ser->max_nr_devices)
+ return -ENOSPC;
+
+ for (i = 0; i < ser->max_nr_devices; i++) {
+ /*
+ * Start searching at index ser->nr_devices. This should result
+ * in a constant time search under expected conditions (devices
+ * are not getting unpreserved).
+ */
+ int index = (ser->nr_devices + i) % ser->max_nr_devices;
+ struct pci_dev_ser *dev_ser = &ser->devices[index];
+
+ if (dev_ser->refcount)
+ continue;
+
+ pci_info(dev, "Device will be preserved across next Live Update\n");
+ ser->nr_devices++;
+
+ dev_ser->domain = pci_domain_nr(dev->bus);
+ dev_ser->bdf = pci_dev_id(dev);
+ dev_ser->refcount = 1;
+
+ dev->liveupdate.outgoing = dev_ser;
+ return 0;
+ }
+
+ return -ENOSPC;
+}
+EXPORT_SYMBOL_GPL(pci_liveupdate_preserve);
+
+/**
+ * pci_liveupdate_unpreserve() - Cancel preservation of a PCI device
+ * @dev: The PCI device to unpreserve.
+ *
+ * pci_liveupdate_unpreserve() notifies the PCI core that a PCI device should no
+ * longer be preserved across the next Live Update. Drivers must call
+ * pci_liveupdate_unpreserve() from their struct liveupdate_file_handler
+ * unpreserve() callback to ensure the outgoing struct pci_ser is still allocated.
+ */
+void pci_liveupdate_unpreserve(struct pci_dev *dev)
+{
+ struct pci_flb_outgoing *outgoing = NULL;
+ struct pci_dev_ser *dev_ser;
+ struct pci_ser *ser;
+ int ret;
+
+ ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&outgoing);
+
+ if (ret || !outgoing) {
+ pci_warn(dev, "Cannot unpreserve device without outgoing Live Update state\n");
+ return;
+ }
+
+ guard(mutex)(&outgoing->lock);
+ ser = outgoing->ser;
+
+ guard(write_lock)(&dev->liveupdate.lock);
+
+ dev_ser = dev->liveupdate.outgoing;
+ if (!dev_ser) {
+ pci_warn(dev, "Cannot unpreserve device that is not preserved\n");
+ return;
+ }
+
+ pci_info(dev, "Device will no longer be preserved across next Live Update\n");
+ ser->nr_devices--;
+ memset(dev_ser, 0, sizeof(*dev_ser));
+ dev->liveupdate.outgoing = NULL;
+}
+EXPORT_SYMBOL_GPL(pci_liveupdate_unpreserve);
+
/**
* pci_liveupdate_register_flb() - Register a file handler with the PCI core
* @fh: The file handler to register.
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b63cd0c310bc..54ae32cb0000 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2522,6 +2522,9 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
spin_lock_init(&dev->pcie_cap_lock);
#ifdef CONFIG_PCI_MSI
raw_spin_lock_init(&dev->msi_lock);
+#endif
+#ifdef CONFIG_PCI_LIVEUPDATE
+ rwlock_init(&dev->liveupdate.lock);
#endif
return dev;
}
diff --git a/include/linux/kho/abi/pci.h b/include/linux/kho/abi/pci.h
index 6ebcf817fff4..807fe0e6538f 100644
--- a/include/linux/kho/abi/pci.h
+++ b/include/linux/kho/abi/pci.h
@@ -23,19 +23,22 @@
* incrementing the version number in the PCI_LUO_FLB_COMPATIBLE string.
*/
-#define PCI_LUO_FLB_COMPATIBLE "pci-v1"
+#define PCI_LUO_FLB_COMPATIBLE "pci-v2"
/**
* struct pci_dev_ser - Serialized state about a single PCI device.
*
* @domain: The device's PCI domain number (segment).
* @bdf: The device's PCI bus, device, and function number.
- * @padding: Padding to naturally align struct pci_dev_ser.
+ * @refcount: Reference count used by the PCI core to keep track of whether it
+ * is done using a device's struct pci_dev_ser. The value of the
+ * refcount is equal to the number of devices preserved at or below
+ * this device in the PCI hierarchy.
*/
struct pci_dev_ser {
u32 domain;
u16 bdf;
- u16 padding;
+ u16 refcount;
} __packed;
/**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8cadeeab86fd..a7c3722b1e77 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -594,6 +594,9 @@ struct pci_dev {
u8 tph_mode; /* TPH mode */
u8 tph_req_type; /* TPH requester type */
#endif
+#ifdef CONFIG_PCI_LIVEUPDATE
+ struct pci_liveupdate liveupdate;
+#endif
};
static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
index 8ec98beefcb4..0803d44becd5 100644
--- a/include/linux/pci_liveupdate.h
+++ b/include/linux/pci_liveupdate.h
@@ -8,14 +8,28 @@
#ifndef LINUX_PCI_LIVEUPDATE_H
#define LINUX_PCI_LIVEUPDATE_H
+#include <linux/kho/abi/pci.h>
#include <linux/liveupdate.h>
#include <linux/types.h>
+#include <linux/spinlock_types.h>
+
+/**
+ * struct pci_liveupdate - PCI Live Update state for a struct pci_dev
+ * @lock: Lock used to protect members of struct pci_liveupdate.
+ * @outgoing: State preserved for the next kernel.
+ */
+struct pci_liveupdate {
+ rwlock_t lock;
+ struct pci_dev_ser *outgoing;
+};
struct pci_dev;
#ifdef CONFIG_PCI_LIVEUPDATE
int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh);
void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh);
+int pci_liveupdate_preserve(struct pci_dev *dev);
+void pci_liveupdate_unpreserve(struct pci_dev *dev);
#else
static inline int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh)
{
@@ -25,6 +39,15 @@ static inline int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh
static inline void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh)
{
}
+
+static inline int pci_liveupdate_preserve(struct pci_dev *dev)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void pci_liveupdate_unpreserve(struct pci_dev *dev)
+{
+}
#endif
#endif /* LINUX_PCI_LIVEUPDATE_H */
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v5 03/11] PCI: liveupdate: Track incoming preserved PCI devices
From: David Matlack @ 2026-05-12 18:48 UTC
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
During PCI enumeration, the previous kernel might have passed state about
devices that were preserved across kexec. The PCI core needs to fetch
this state to identify which devices are "incoming" and require special
handling.
Add pci_liveupdate_setup_device() which is called during device setup
to fetch the serialized state (struct pci_ser) from the Live Update
Orchestrator. The first time this happens, pci_flb_retrieve() will run
and convert the array of pci_dev_ser structs into an xarray so that it
can be looked up efficiently.
If a device is found in the xarray, the PCI core stores a pointer to its
state in dev->liveupdate.incoming and holds a reference to the incoming
FLB until pci_liveupdate_finish() is called by the driver.
This ensures proper lifecycle management for incoming preserved devices
and allows the PCI core and drivers to apply specific Live Update
logic to them in subsequent commits.
Drivers can check if a device is an incoming preserved device (e.g.
during probe) by calling pci_liveupdate_is_incoming().
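The lifecycle described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the model_* names are hypothetical stand-ins, the FLB reference is reduced to a plain counter, and locking is omitted entirely. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev_ser and struct pci_dev. */
struct model_dev_ser {
	unsigned int refcount;	/* non-zero while the device is still incoming */
};

struct model_dev {
	struct model_dev_ser *incoming;
};

/* Number of references currently held on the modelled incoming FLB. */
static int model_flb_refs;

/* Loosely mirrors pci_liveupdate_setup_device(); ser is NULL if not found. */
static void model_setup_device(struct model_dev *dev, struct model_dev_ser *ser)
{
	assert(dev != NULL);

	model_flb_refs++;		/* take a reference for the lookup */

	if (!ser || !ser->refcount) {
		model_flb_refs--;	/* not preserved, or already finished */
		return;
	}

	dev->incoming = ser;		/* keep the reference until finish */
}

static bool model_is_incoming(const struct model_dev *dev)
{
	return dev->incoming != NULL;
}

/* Loosely mirrors pci_liveupdate_finish(). */
static void model_finish(struct model_dev *dev)
{
	if (!dev->incoming)
		return;

	dev->incoming->refcount = 0;	/* never treat it as incoming again */
	dev->incoming = NULL;
	model_flb_refs--;		/* drop the reference taken at setup */
}
```

In this model, a second model_setup_device() call after model_finish() (e.g. a re-enumeration after hotplug) leaves the device unmarked and holds no extra reference.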
CONFIG_64BIT is now required to enable CONFIG_PCI_LIVEUPDATE so that the
domain and bdf can be guaranteed to fit in an unsigned long and be used
as the xarray key.
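As a rough illustration of why a 64-bit key is needed, the sketch below packs a 32-bit domain and a 16-bit BDF into one integer. The helper names are hypothetical and uint64_t stands in for the kernel's unsigned long on a 64-bit build. Note that the domain is widened before the shift; shifting a 32-bit value by 16 would discard its upper 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a PCI domain and BDF into one 64-bit lookup key. */
static uint64_t model_xa_key(uint32_t domain, uint16_t bdf)
{
	/* Widen before shifting so the upper 16 bits of the domain survive. */
	return ((uint64_t)domain << 16) | bdf;
}

/* Hypothetical helper: build a 16-bit BDF from bus, device and function. */
static uint16_t model_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);
	return (uint16_t)((bus << 8) | (dev << 3) | fn);
}
```

A 48-bit key of this shape does not fit in a 32-bit unsigned long, which is consistent with the CONFIG_64BIT dependency described above.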
Signed-off-by: David Matlack <dmatlack@google.com>
---
MAINTAINERS | 1 +
drivers/pci/Kconfig | 2 +-
drivers/pci/liveupdate.c | 223 ++++++++++++++++++++++++++++++++-
drivers/pci/liveupdate.h | 26 ++++
drivers/pci/probe.c | 5 +
include/linux/pci_liveupdate.h | 13 ++
6 files changed, 267 insertions(+), 3 deletions(-)
create mode 100644 drivers/pci/liveupdate.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 6c618830cf61..0e262c0ceb43 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20537,6 +20537,7 @@ L: linux-pci@vger.kernel.org
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
F: drivers/pci/liveupdate.c
+F: drivers/pci/liveupdate.h
F: include/linux/kho/abi/pci.h
F: include/linux/pci_liveupdate.h
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 08398cbe970c..eea0a6cd388a 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -330,7 +330,7 @@ config VGA_ARB_MAX_GPUS
config PCI_LIVEUPDATE
bool "PCI Live Update Support (EXPERIMENTAL)"
- depends on PCI && LIVEUPDATE
+ depends on PCI && LIVEUPDATE && 64BIT
help
Enable PCI core support for preserving PCI devices across Live
Update. This, in combination with support in a device's driver,
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 9c4582ecd55c..f14396dd1477 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -56,6 +56,20 @@
* This allows the PCI core to keep its FLB data (struct pci_ser) up to date
* with the list of **outgoing** preserved devices for the next kernel.
*
+ * After kexec, whenever a device is enumerated, the PCI core will check if it
+ * is an **incoming** preserved device (i.e. preserved by the previous kernel)
+ * by checking the incoming FLB data (struct pci_ser).
+ *
+ * Drivers must notify the PCI core when an **incoming** device is done
+ * participating in the incoming Live Update with the following API:
+ *
+ * * ``pci_liveupdate_finish(pci_dev)``
+ *
+ * The PCI core does not enforce any ordering of ``pci_liveupdate_finish()`` and
+ * ``pci_liveupdate_preserve()``, i.e. a PCI device can be **outgoing**
+ * (preserved for next kernel) and **incoming** (preserved by previous kernel)
+ * at the same time.
+ *
* Restrictions
* ============
*
@@ -75,6 +89,8 @@
#include <linux/mm.h>
#include <linux/pci.h>
+#include "liveupdate.h"
+
/**
* struct pci_flb_outgoing - Outgoing PCI FLB object
* @ser: The outgoing struct pci_ser for the next kernel.
@@ -85,6 +101,21 @@ struct pci_flb_outgoing {
struct mutex lock;
};
+/**
+ * struct pci_flb_incoming - Incoming PCI FLB object
+ * @ser: The incoming struct pci_ser from the previous kernel.
+ * @xa: Xarray used to quickly lookup devices in @ser.
+ */
+struct pci_flb_incoming {
+ struct pci_ser *ser;
+ struct xarray xa;
+};
+
+static unsigned long pci_ser_xa_key(u32 domain, u16 bdf)
+{
+ return (unsigned long)domain << 16 | bdf;
+}
+
static int pci_flb_preserve(struct liveupdate_flb_op_args *args)
{
struct pci_flb_outgoing *outgoing;
@@ -140,13 +171,44 @@ static void pci_flb_unpreserve(struct liveupdate_flb_op_args *args)
static int pci_flb_retrieve(struct liveupdate_flb_op_args *args)
{
- args->obj = phys_to_virt(args->data);
+ struct pci_flb_incoming *incoming;
+ int i, ret;
+
+ incoming = kmalloc_obj(*incoming);
+ if (!incoming)
+ return -ENOMEM;
+
+ incoming->ser = phys_to_virt(args->data);
+
+ xa_init(&incoming->xa);
+
+ for (i = 0; i < incoming->ser->max_nr_devices; i++) {
+ struct pci_dev_ser *dev_ser = &incoming->ser->devices[i];
+ unsigned long key;
+
+ if (!dev_ser->refcount)
+ continue;
+
+ key = pci_ser_xa_key(dev_ser->domain, dev_ser->bdf);
+ ret = xa_err(xa_store(&incoming->xa, key, dev_ser, GFP_KERNEL));
+ if (ret) {
+ xa_destroy(&incoming->xa);
+ kfree(incoming);
+ return ret;
+ }
+ }
+
+ args->obj = incoming;
return 0;
}
static void pci_flb_finish(struct liveupdate_flb_op_args *args)
{
- kho_restore_free(args->obj);
+ struct pci_flb_incoming *incoming = args->obj;
+
+ xa_destroy(&incoming->xa);
+ kho_restore_free(incoming->ser);
+ kfree(incoming);
}
static struct liveupdate_flb_ops pci_liveupdate_flb_ops = {
@@ -268,6 +330,163 @@ void pci_liveupdate_unpreserve(struct pci_dev *dev)
}
EXPORT_SYMBOL_GPL(pci_liveupdate_unpreserve);
+static struct pci_flb_incoming *pci_liveupdate_flb_get_incoming(void)
+{
+ struct pci_flb_incoming *incoming = NULL;
+ int ret;
+
+ ret = liveupdate_flb_get_incoming(&pci_liveupdate_flb, (void **)&incoming);
+
+ /* Live Update is not enabled. */
+ if (ret == -EOPNOTSUPP)
+ return NULL;
+
+ /* Live Update is enabled, but there is no incoming FLB data. */
+ if (ret == -ENODATA)
+ return NULL;
+
+ /*
+ * Live Update is enabled and there is incoming FLB data, but none of it
+ * matches pci_liveupdate_flb.compatible.
+ *
+ * This could mean that no PCI FLB data was passed by the previous
+ * kernel, but it could also mean the previous kernel used a different
+ * compatibility string (i.e. a different ABI).
+ */
+ if (ret == -ENOENT) {
+ pr_info_once("No incoming FLB matched %s\n", pci_liveupdate_flb.compatible);
+ return NULL;
+ }
+
+ /*
+ * There is incoming FLB data that matches pci_liveupdate_flb.compatible
+ * but it cannot be retrieved.
+ */
+ if (ret) {
+ WARN_ONCE(ret, "Failed to retrieve incoming FLB data\n");
+ return NULL;
+ }
+
+ return incoming;
+}
+
+static void pci_liveupdate_flb_put_incoming(void)
+{
+ liveupdate_flb_put_incoming(&pci_liveupdate_flb);
+}
+
+void pci_liveupdate_setup_device(struct pci_dev *dev)
+{
+ struct pci_flb_incoming *incoming;
+ struct pci_dev_ser *dev_ser;
+ unsigned long key;
+
+ incoming = pci_liveupdate_flb_get_incoming();
+ if (!incoming)
+ return;
+
+ key = pci_ser_xa_key(pci_domain_nr(dev->bus), pci_dev_id(dev));
+ dev_ser = xa_load(&incoming->xa, key);
+
+ /* This device was not preserved across Live Update */
+ if (!dev_ser) {
+ pci_liveupdate_flb_put_incoming();
+ return;
+ }
+
+ /*
+ * This device was preserved, but has already been probed and gone
+ * through pci_liveupdate_finish(). This can happen if PCI core probes
+ * the same device multiple times, e.g. due to hotplug.
+ */
+ if (!dev_ser->refcount) {
+ pci_liveupdate_flb_put_incoming();
+ return;
+ }
+
+ pci_info(dev, "Device was preserved by previous kernel across Live Update\n");
+ guard(write_lock)(&dev->liveupdate.lock);
+ dev->liveupdate.incoming = dev_ser;
+
+ /*
+ * Hold the ref on the incoming FLB until pci_liveupdate_finish() so
+ * that dev->liveupdate.incoming does not get freed while it is in use.
+ */
+}
+
+void pci_liveupdate_cleanup_device(struct pci_dev *dev)
+{
+ bool incoming;
+
+ scoped_guard(write_lock, &dev->liveupdate.lock)
+ incoming = !!xchg(&dev->liveupdate.incoming, NULL);
+
+ /*
+ * Drop the FLB reference acquired in pci_liveupdate_setup_device() if
+ * the device is being cleaned up before pci_liveupdate_finish(), e.g.
+ * due to allocation failure during setup.
+ *
+ * Do not drop dev->liveupdate.incoming->refcount since this device has
+ * not gone through pci_liveupdate_finish() and thus is still an
+ * incoming preserved device.
+ */
+ if (incoming)
+ pci_liveupdate_flb_put_incoming();
+}
+
+/**
+ * pci_liveupdate_finish() - Finish the preservation of a PCI device across Live Update
+ * @dev: The PCI device
+ *
+ * pci_liveupdate_finish() notifies the PCI core that a PCI device that was
+ * preserved across the previous Live Update has finished participating in Live
+ * Update. Drivers must call pci_liveupdate_finish() from their struct
+ * liveupdate_file_handler finish() callback to ensure the incoming struct
+ * pci_ser is allocated.
+ */
+void pci_liveupdate_finish(struct pci_dev *dev)
+{
+ guard(write_lock)(&dev->liveupdate.lock);
+
+ if (!dev->liveupdate.incoming) {
+ pci_warn(dev, "Cannot finish preserving an unpreserved device\n");
+ return;
+ }
+
+ pci_info(dev, "Device is finished participating in Live Update\n");
+
+ /*
+ * Drop the refcount so this device does not get treated as an incoming
+ * device again, e.g. in case pci_liveupdate_setup_device() gets called
+ * again because the device is hot-plugged.
+ */
+ dev->liveupdate.incoming->refcount = 0;
+ dev->liveupdate.incoming = NULL;
+
+ /* Drop this device's reference on the incoming FLB. */
+ pci_liveupdate_flb_put_incoming();
+}
+EXPORT_SYMBOL_GPL(pci_liveupdate_finish);
+
+/**
+ * pci_liveupdate_is_incoming() - Check if a device is incoming preserved
+ * @dev: The PCI device to check
+ *
+ * Check if a device was preserved across Live Update by the previous kernel,
+ * i.e. the device is incoming preserved. Note that a device is only considered
+ * incoming preserved prior to pci_liveupdate_finish(). It is up to drivers to
+ * synchronize usage of pci_liveupdate_is_incoming() with their own call to
+ * pci_liveupdate_finish() to avoid acting on stale data.
+ *
+ * Returns: True if the device is incoming preserved, false otherwise.
+ */
+bool pci_liveupdate_is_incoming(struct pci_dev *dev)
+{
+ guard(read_lock)(&dev->liveupdate.lock);
+ return dev->liveupdate.incoming;
+}
+EXPORT_SYMBOL_GPL(pci_liveupdate_is_incoming);
+
/**
* pci_liveupdate_register_flb() - Register a file handler with the PCI core
* @fh: The file handler to register.
diff --git a/drivers/pci/liveupdate.h b/drivers/pci/liveupdate.h
new file mode 100644
index 000000000000..eaaa3559fd77
--- /dev/null
+++ b/drivers/pci/liveupdate.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Live Update support (core API)
+ *
+ * Copyright (c) 2026, Google LLC.
+ * David Matlack <dmatlack@google.com>
+ */
+#ifndef DRIVERS_PCI_LIVEUPDATE_H
+#define DRIVERS_PCI_LIVEUPDATE_H
+
+#include <linux/pci.h>
+
+#ifdef CONFIG_PCI_LIVEUPDATE
+void pci_liveupdate_setup_device(struct pci_dev *dev);
+void pci_liveupdate_cleanup_device(struct pci_dev *dev);
+#else
+static inline void pci_liveupdate_setup_device(struct pci_dev *dev)
+{
+}
+
+static inline void pci_liveupdate_cleanup_device(struct pci_dev *dev)
+{
+}
+#endif
+
+#endif /* DRIVERS_PCI_LIVEUPDATE_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 54ae32cb0000..b5fdc5017f92 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -24,6 +24,7 @@
#include <linux/pm_runtime.h>
#include <linux/bitfield.h>
#include <trace/events/pci.h>
+#include "liveupdate.h"
#include "pci.h"
static struct resource busn_resource = {
@@ -2069,6 +2070,8 @@ int pci_setup_device(struct pci_dev *dev)
if (pci_early_dump)
early_dump_pci_device(dev);
+ pci_liveupdate_setup_device(dev);
+
/* Need to have dev->class ready */
dev->cfg_size = pci_cfg_space_size(dev);
@@ -2192,6 +2195,7 @@ int pci_setup_device(struct pci_dev *dev)
default: /* unknown header */
pci_err(dev, "unknown header type %02x, ignoring device\n",
dev->hdr_type);
+ pci_liveupdate_cleanup_device(dev);
pci_release_of_node(dev);
return -EIO;
@@ -2490,6 +2494,7 @@ static void pci_release_dev(struct device *dev)
pci_dev = to_pci_dev(dev);
pci_release_capabilities(pci_dev);
+ pci_liveupdate_cleanup_device(pci_dev);
pci_release_of_node(pci_dev);
pcibios_release_device(pci_dev);
pci_bus_put(pci_dev->bus);
diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
index 0803d44becd5..1c2ee32ad058 100644
--- a/include/linux/pci_liveupdate.h
+++ b/include/linux/pci_liveupdate.h
@@ -17,10 +17,12 @@
* struct pci_liveupdate - PCI Live Update state for a struct pci_dev
* @lock: Lock used to protect members of struct pci_liveupdate.
* @outgoing: State preserved for the next kernel.
+ * @incoming: State preserved by the previous kernel.
*/
struct pci_liveupdate {
rwlock_t lock;
struct pci_dev_ser *outgoing;
+ struct pci_dev_ser *incoming;
};
struct pci_dev;
@@ -30,6 +32,8 @@ int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh);
void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh);
int pci_liveupdate_preserve(struct pci_dev *dev);
void pci_liveupdate_unpreserve(struct pci_dev *dev);
+void pci_liveupdate_finish(struct pci_dev *dev);
+bool pci_liveupdate_is_incoming(struct pci_dev *dev);
#else
static inline int pci_liveupdate_register_flb(struct liveupdate_file_handler *fh)
{
@@ -48,6 +52,15 @@ static inline int pci_liveupdate_preserve(struct pci_dev *dev)
static inline void pci_liveupdate_unpreserve(struct pci_dev *dev)
{
}
+
+static inline void pci_liveupdate_finish(struct pci_dev *dev)
+{
+}
+
+static inline bool pci_liveupdate_is_incoming(struct pci_dev *dev)
+{
+ return false;
+}
#endif
#endif /* LINUX_PCI_LIVEUPDATE_H */
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v5 04/11] PCI: liveupdate: Document driver binding responsibilities
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Document how driver binding works during a Live Update and what the PCI
core expects of drivers and users. Note that this only describes the
current division of responsibilities, which may change in the future.
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index f14396dd1477..d77e64906a25 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -77,6 +77,22 @@
* preserved. These may be relaxed in the future:
*
* * The device cannot be a Virtual Function (VF).
+ *
+ * Driver Binding
+ * ==============
+ *
+ * In the outgoing kernel, it is the driver's responsibility to ensure that it
+ * does not release a device between pci_liveupdate_preserve() and
+ * pci_liveupdate_unpreserve().
+ *
+ * In the incoming kernel, it is the driver's responsibility to ensure that it
+ * does not release a preserved device between probe() and
+ * pci_liveupdate_finish().
+ *
+ * It is the user's responsibility to ensure that incoming preserved devices are
+ * bound to the correct driver, i.e. the PCI core does not protect against a
+ * device getting preserved by driver A in the outgoing kernel and then getting
+ * bound to driver B in the incoming kernel.
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v5 05/11] PCI: liveupdate: Keep bus numbers constant during Live Update
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
During a Live Update, preserved devices must be allowed to continue
performing memory transactions so the kernel cannot change the fabric
topology, including bus numbers, since that would require disabling
and flushing any memory transactions first.
To keep bus numbers constant, always inherit the secondary and
subordinate bus numbers assigned to bridges during scanning, instead of
assigning new ones, if any PCI devices are being preserved. Note that
the kernel inherits bus numbers even on bridges without any downstream
endpoints that were preserved. This avoids accidentally assigning a
bridge a new window that overlaps with a preserved device that is
downstream of a different bridge.
If a bridge is scanned with a broken topology or has no bus numbers
set during a Live Update, refuse to assign it new bus numbers and refuse
to enumerate devices below it. This is a safety measure to prevent
topology conflicts.
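The resulting per-bridge decision can be summarized with a simplified model. This is a sketch, not the code in pci_scan_bridge_extend(): it collapses the two scanning passes into a single decision, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum model_bus_action {
	MODEL_INHERIT,	/* keep the secondary/subordinate numbers already set */
	MODEL_REASSIGN,	/* no Live Update: the kernel picks new bus numbers */
	MODEL_REFUSE,	/* Live Update: do not reconfigure, skip children */
};

/*
 * secondary/subordinate: bus numbers currently programmed in the bridge.
 * broken:     the programmed topology failed the sanity checks.
 * assign_all: the platform or command line asked to reassign all buses.
 * inherit:    at least one device is preserved across Live Update.
 */
static enum model_bus_action
model_bridge_action(unsigned int secondary, unsigned int subordinate,
		    bool broken, bool assign_all, bool inherit)
{
	/* Inheriting takes priority over any "assign all buses" policy. */
	bool assign_new = inherit ? false : assign_all;

	assert(secondary <= 0xff && subordinate <= 0xff);

	if ((secondary || subordinate) && !assign_new && !broken)
		return MODEL_INHERIT;

	/* Broken or unconfigured bridge while devices are preserved. */
	if (inherit)
		return MODEL_REFUSE;

	return MODEL_REASSIGN;
}
```

The only case that differs from a normal boot is when inherit is set: a sane bridge is always inherited, and anything else is refused rather than renumbered.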
Make CONFIG_PCI_LIVEUPDATE depend on !CONFIG_CARDBUS, since inheriting
bus numbers on PCI-to-CardBus bridges requires additional work that is
not a priority at the moment.
Signed-off-by: David Matlack <dmatlack@google.com>
---
.../admin-guide/kernel-parameters.txt | 6 +-
drivers/pci/Kconfig | 2 +-
drivers/pci/liveupdate.c | 60 +++++++++++++++++++
drivers/pci/liveupdate.h | 6 ++
drivers/pci/probe.c | 21 +++++--
5 files changed, 89 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb3ec..a64af71c2705 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5138,7 +5138,11 @@ Kernel parameters
explicitly which ones they are.
assign-busses [X86] Always assign all PCI bus
numbers ourselves, overriding
- whatever the firmware may have done.
+ whatever the firmware may have done. Ignored
+ during a Live Update, where the kernel must
+ inherit the PCI topology (including bus numbers)
+ to avoid interrupting ongoing memory
+ transactions of preserved devices.
usepirqmask [X86] Honor the possible IRQ mask stored
in the BIOS $PIR table. This is needed on
some systems with broken BIOSes, notably
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index eea0a6cd388a..aa665231921c 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -330,7 +330,7 @@ config VGA_ARB_MAX_GPUS
config PCI_LIVEUPDATE
bool "PCI Live Update Support (EXPERIMENTAL)"
- depends on PCI && LIVEUPDATE && 64BIT
+ depends on PCI && LIVEUPDATE && 64BIT && !CARDBUS
help
Enable PCI core support for preserving PCI devices across Live
Update. This, in combination with support in a device's driver,
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index d77e64906a25..558fbaec8ddd 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -93,6 +93,21 @@
* bound to the correct driver, i.e. the PCI core does not protect against a
* device getting preserved by driver A in the outgoing kernel and then getting
* bound to driver B in the incoming kernel.
+ *
+ * BDF Stability
+ * =============
+ *
+ * The PCI core guarantees that preserved devices can be identified by the same
+ * bus, device, and function numbers for as long as they are preserved
+ * (including across kexec). To accomplish this, the PCI core always inherits
+ * the secondary and subordinate bus numbers assigned to bridges during scanning
+ * if any device is preserved. This is true even on architectures that always
+ * assign new bus numbers during scanning. The kernel assumes the previous
+ * kernel established a sane bus topology across kexec.
+ *
+ * If a misconfigured or unconfigured bridge is encountered during enumeration
+ * while there are preserved devices, its secondary and subordinate bus numbers
+ * will be cleared and devices below it will not be enumerated.
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
@@ -107,6 +122,20 @@
#include "liveupdate.h"
+/*
+ * During a Live Update, preserved devices are allowed to continue performing
+ * memory transactions. The kernel must not change the fabric topology,
+ * including bus numbers, since that would require disabling and flushing any
+ * memory transactions first.
+ *
+ * To keep things simple, inherit the secondary and subordinate bus numbers on
+ * _all_ bridges if _any_ PCI devices are preserved (i.e. even bridges without
+ * any downstream endpoints that were preserved). This avoids accidentally
+ * assigning a bridge a new window that overlaps with a preserved device that is
+ * downstream of a different bridge.
+ */
+static atomic_t inherit_buses;
+
/**
* struct pci_flb_outgoing - Outgoing PCI FLB object
* @ser: The outgoing struct pci_ser for the next kernel.
@@ -132,6 +161,29 @@ static unsigned long pci_ser_xa_key(u32 domain, u16 bdf)
return (unsigned long)domain << 16 | bdf;
}
+bool pci_liveupdate_inherit_buses(void)
+{
+ return atomic_read(&inherit_buses);
+}
+
+static void pci_set_liveupdate_inherit_buses(bool enable)
+{
+ /* Ensure updates to inherit_buses do not race with rescans */
+ pci_lock_rescan_remove();
+
+ /*
+ * Increment/decrement instead of setting directly to true/false so that
+ * pci_liveupdate_inherit_buses() returns true if any device is outgoing
+ * preserved or incoming preserved.
+ */
+ if (enable)
+ atomic_inc(&inherit_buses);
+ else
+ atomic_dec(&inherit_buses);
+
+ pci_unlock_rescan_remove();
+}
+
static int pci_flb_preserve(struct liveupdate_flb_op_args *args)
{
struct pci_flb_outgoing *outgoing;
@@ -171,6 +223,8 @@ static int pci_flb_preserve(struct liveupdate_flb_op_args *args)
args->obj = outgoing;
args->data = virt_to_phys(outgoing->ser);
+
+ pci_set_liveupdate_inherit_buses(true);
return 0;
}
@@ -178,6 +232,8 @@ static void pci_flb_unpreserve(struct liveupdate_flb_op_args *args)
{
struct pci_flb_outgoing *outgoing = args->obj;
+ pci_set_liveupdate_inherit_buses(false);
+
WARN_ON_ONCE(outgoing->ser->nr_devices);
kho_unpreserve_free(outgoing->ser);
kfree(outgoing);
@@ -215,6 +271,8 @@ static int pci_flb_retrieve(struct liveupdate_flb_op_args *args)
}
args->obj = incoming;
+
+ pci_set_liveupdate_inherit_buses(true);
return 0;
}
@@ -222,6 +280,8 @@ static void pci_flb_finish(struct liveupdate_flb_op_args *args)
{
struct pci_flb_incoming *incoming = args->obj;
+ pci_set_liveupdate_inherit_buses(false);
+
xa_destroy(&incoming->xa);
kho_restore_free(incoming->ser);
kfree(incoming);
diff --git a/drivers/pci/liveupdate.h b/drivers/pci/liveupdate.h
index eaaa3559fd77..0bd3e961d5c5 100644
--- a/drivers/pci/liveupdate.h
+++ b/drivers/pci/liveupdate.h
@@ -13,6 +13,7 @@
#ifdef CONFIG_PCI_LIVEUPDATE
void pci_liveupdate_setup_device(struct pci_dev *dev);
void pci_liveupdate_cleanup_device(struct pci_dev *dev);
+bool pci_liveupdate_inherit_buses(void);
#else
static inline void pci_liveupdate_setup_device(struct pci_dev *dev)
{
@@ -21,6 +22,11 @@ static inline void pci_liveupdate_setup_device(struct pci_dev *dev)
static inline void pci_liveupdate_cleanup_device(struct pci_dev *dev)
{
}
+
+static inline bool pci_liveupdate_inherit_buses(void)
+{
+ return false;
+}
#endif
#endif /* DRIVERS_PCI_LIVEUPDATE_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b5fdc5017f92..08ea9324647b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1375,6 +1375,14 @@ bool pci_ea_fixed_busnrs(struct pci_dev *dev, u8 *sec, u8 *sub)
return true;
}
+static bool pci_should_assign_new_buses(void)
+{
+ if (pci_liveupdate_inherit_buses())
+ return false;
+
+ return pcibios_assign_all_busses();
+}
+
/*
* pci_scan_bridge_extend() - Scan buses behind a bridge
* @bus: Parent bus the bridge is on
@@ -1402,6 +1410,7 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev,
int max, unsigned int available_buses,
int pass)
{
+ const bool assign_new_buses = pci_should_assign_new_buses();
struct pci_bus *child;
u32 buses;
u16 bctl;
@@ -1454,8 +1463,7 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev,
goto out;
}
- if ((secondary || subordinate) &&
- !pcibios_assign_all_busses() && !broken) {
+ if ((secondary || subordinate) && !assign_new_buses && !broken) {
unsigned int cmax, buses;
/*
@@ -1497,8 +1505,7 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev,
* do in the second pass.
*/
if (!pass) {
- if (pcibios_assign_all_busses() || broken)
-
+ if (assign_new_buses || broken)
/*
* Temporarily disable forwarding of the
* configuration cycles on all bridges in
@@ -1512,6 +1519,12 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev,
goto out;
}
+ if (pci_liveupdate_inherit_buses()) {
+ pci_err(dev, "Cannot reconfigure bridge during Live Update!\n");
+ pci_err(dev, "Downstream devices will not be enumerated!\n");
+ goto out;
+ }
+
/* Clear errors */
pci_write_config_word(dev, PCI_STATUS, 0xffff);
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v5 06/11] PCI: liveupdate: Auto-preserve upstream bridges across Live Update
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
When a PCI device is preserved across a Live Update, all of its upstream
bridges up to the root port must also be preserved. This enables the PCI
core and any drivers bound to the bridges to manage bridges correctly
across a Live Update.
Notably, this will be used in subsequent commits to ensure that
preserved devices can continue performing memory transactions without a
disruption or change in routing.
To preserve bridges, the PCI core tracks the number of downstream
devices preserved under each bridge using a reference count in struct
pci_dev_ser. This allows a bridge to remain preserved until all its
downstream preserved devices are unpreserved or finish their
participation in the Live Update.
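The reference counting scheme can be illustrated with a small userspace model. It is a sketch under assumptions: struct model_node and the model_* functions are hypothetical, locking and the serialized array are left out, and the refcount lives directly in the node rather than in struct pci_dev_ser.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_node {
	struct model_node *upstream;	/* NULL above the root port */
	bool is_bridge;
	unsigned int refcount;		/* 0 means "not preserved" */
};

static int model_preserve_one(struct model_node *n)
{
	/* Endpoints are preserved once; bridges once per downstream device. */
	if (n->refcount && !n->is_bridge)
		return -EBUSY;

	n->refcount++;
	return 0;
}

/* Preserve a device and every bridge above it, rolling back on failure. */
static int model_preserve_path(struct model_node *n)
{
	int ret;

	if (!n)
		return 0;

	ret = model_preserve_one(n);
	if (ret)
		return ret;

	ret = model_preserve_path(n->upstream);
	if (ret)
		n->refcount--;	/* undo this level if a level above failed */

	return ret;
}

/* Drop one reference on a device and on every bridge above it. */
static void model_unpreserve_path(struct model_node *n)
{
	for (; n; n = n->upstream) {
		assert(n->refcount > 0);
		n->refcount--;
	}
}
```

With two endpoints below the same bridge, the bridge ends up with a count of two and only returns to zero once both endpoints have been unpreserved.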
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 241 ++++++++++++++++++++++++++++++---------
1 file changed, 184 insertions(+), 57 deletions(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 558fbaec8ddd..d8e06afde2c7 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -108,6 +108,18 @@
* If a misconfigured or unconfigured bridge is encountered during enumeration
* while there are preserved devices, its secondary and subordinate bus numbers
* will be cleared and devices below it will not be enumerated.
+ *
+ * PCI-to-PCI Bridges
+ * ==================
+ *
+ * Any PCI-to-PCI bridges upstream of a preserved device are automatically
+ * preserved when the device is preserved. The PCI core keeps track of the
+ * number of downstream devices that are preserved under a bridge so that the
+ * bridge is only unpreserved once all downstream devices are unpreserved.
+ *
+ * This enables the PCI core and any drivers bound to the bridge to participate
+ * in the Live Update so that preserved endpoints can continue issuing memory
+ * transactions during the Live Update.
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
@@ -300,41 +312,55 @@ static struct liveupdate_flb pci_liveupdate_flb = {
.compatible = PCI_LUO_FLB_COMPATIBLE,
};
-/**
- * pci_liveupdate_preserve() - Preserve a PCI device across Live Update
- * @dev: The PCI device to preserve.
- *
- * pci_liveupdate_preserve() notifies the PCI core that a PCI device should be
- * preserved across the next Live Update. Drivers must call
- * pci_liveupdate_preserve() from their struct liveupdate_file_handler
- * preserve() callback to ensure the outgoing struct pci_ser is allocated.
- *
- * Returns: 0 on success, <0 on failure.
- */
-int pci_liveupdate_preserve(struct pci_dev *dev)
+static int pci_liveupdate_unpreserve_device(struct pci_ser *ser, struct pci_dev *dev)
{
- struct pci_flb_outgoing *outgoing = NULL;
- struct pci_ser *ser;
- int i, ret;
+ struct pci_dev_ser *dev_ser;
- if (dev->is_virtfn)
+ guard(write_lock)(&dev->liveupdate.lock);
+
+ dev_ser = dev->liveupdate.outgoing;
+ if (!dev_ser) {
+ pci_warn(dev, "Cannot unpreserve device that is not preserved\n");
return -EINVAL;
+ }
- ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&outgoing);
- if (ret)
- return ret;
+ if (!dev_ser->refcount) {
+ pci_WARN(dev, 1, "Preserved device has a 0 refcount!\n");
+ return -EINVAL;
+ }
- if (!outgoing)
- return -ENOENT;
+ if (--dev_ser->refcount)
+ return 0;
- guard(mutex)(&outgoing->lock);
- ser = outgoing->ser;
+ pci_info(dev, "Device will no longer be preserved across next Live Update\n");
+ ser->nr_devices--;
+ memset(dev_ser, 0, sizeof(*dev_ser));
+ dev->liveupdate.outgoing = NULL;
+ return 0;
+}
- guard(write_lock)(&dev->liveupdate.lock);
+static int pci_liveupdate_preserve_device_existing(struct pci_dev *dev)
+{
+ if (!dev->liveupdate.outgoing->refcount) {
+ pci_WARN(dev, 1, "Preserved device with 0 refcount!\n");
+ return -EINVAL;
+ }
- if (dev->liveupdate.outgoing)
+ /*
+ * Endpoint devices should not be preserved more than once. Bridges are
+ * preserved once for every downstream device that is preserved.
+ */
+ if (!dev->subordinate)
return -EBUSY;
+ dev->liveupdate.outgoing->refcount++;
+ return 0;
+}
+
+static int pci_liveupdate_preserve_device_new(struct pci_ser *ser, struct pci_dev *dev)
+{
+ int i;
+
if (ser->nr_devices == ser->max_nr_devices)
return -ENOSPC;
@@ -363,8 +389,82 @@ int pci_liveupdate_preserve(struct pci_dev *dev)
return -ENOSPC;
}
+
+static int pci_liveupdate_preserve_device(struct pci_ser *ser, struct pci_dev *dev)
+{
+ guard(write_lock)(&dev->liveupdate.lock);
+
+ if (dev->liveupdate.outgoing)
+ return pci_liveupdate_preserve_device_existing(dev);
+ else
+ return pci_liveupdate_preserve_device_new(ser, dev);
+}
+
+static int pci_liveupdate_preserve_path(struct pci_ser *ser, struct pci_dev *dev)
+{
+ int ret;
+
+ if (!dev)
+ return 0;
+
+ ret = pci_liveupdate_preserve_device(ser, dev);
+ if (ret)
+ return ret;
+
+ ret = pci_liveupdate_preserve_path(ser, dev->bus->self);
+ if (ret) {
+ pci_liveupdate_unpreserve_device(ser, dev);
+ return ret;
+ }
+
+ return 0;
+}
+
+/**
+ * pci_liveupdate_preserve() - Preserve a PCI device across Live Update
+ * @dev: The PCI device to preserve.
+ *
+ * pci_liveupdate_preserve() notifies the PCI core that a PCI device should be
+ * preserved across the next Live Update. Drivers must call
+ * pci_liveupdate_preserve() from their struct liveupdate_file_handler
+ * preserve() callback to ensure the outgoing struct pci_ser is allocated.
+ *
+ * pci_liveupdate_preserve() automatically preserves all bridges upstream of
+ * @dev.
+ *
+ * Returns: 0 on success, <0 on failure.
+ */
+int pci_liveupdate_preserve(struct pci_dev *dev)
+{
+ struct pci_flb_outgoing *outgoing = NULL;
+ int ret;
+
+ if (dev->is_virtfn)
+ return -EINVAL;
+
+ ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&outgoing);
+ if (ret)
+ return ret;
+
+ if (!outgoing)
+ return -ENOENT;
+
+ guard(mutex)(&outgoing->lock);
+ return pci_liveupdate_preserve_path(outgoing->ser, dev);
+}
EXPORT_SYMBOL_GPL(pci_liveupdate_preserve);
+static void pci_liveupdate_unpreserve_path(struct pci_ser *ser, struct pci_dev *dev)
+{
+ if (!dev)
+ return;
+
+ if (pci_liveupdate_unpreserve_device(ser, dev))
+ return;
+
+ pci_liveupdate_unpreserve_path(ser, dev->bus->self);
+}
+
/**
* pci_liveupdate_unpreserve() - Cancel preservation of a PCI device
* @dev: The PCI device to preserve.
@@ -373,12 +473,13 @@ EXPORT_SYMBOL_GPL(pci_liveupdate_preserve);
* longer be preserved across the next Live Update. Drivers must call
* pci_liveupdate_unpreserve() from their struct liveupdate_file_handler
* unpreserve() callback to ensure the outgoing struct pci_ser is allocated.
+ *
+ * pci_liveupdate_unpreserve() automatically unpreserves all bridges upstream of
+ * @dev.
*/
void pci_liveupdate_unpreserve(struct pci_dev *dev)
{
struct pci_flb_outgoing *outgoing = NULL;
- struct pci_dev_ser *dev_ser;
- struct pci_ser *ser;
int ret;
ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&outgoing);
@@ -389,20 +490,7 @@ void pci_liveupdate_unpreserve(struct pci_dev *dev)
}
guard(mutex)(&outgoing->lock);
- ser = outgoing->ser;
-
- guard(write_lock)(&dev->liveupdate.lock);
-
- dev_ser = dev->liveupdate.outgoing;
- if (!dev_ser) {
- pci_warn(dev, "Cannot unpreserve device that is not preserved\n");
- return;
- }
-
- pci_info(dev, "Device will no longer be preserved across next Live Update\n");
- ser->nr_devices--;
- memset(dev_ser, 0, sizeof(*dev_ser));
- dev->liveupdate.outgoing = NULL;
+ pci_liveupdate_unpreserve_path(outgoing->ser, dev);
}
EXPORT_SYMBOL_GPL(pci_liveupdate_unpreserve);
@@ -510,6 +598,55 @@ void pci_liveupdate_cleanup_device(struct pci_dev *dev)
pci_liveupdate_flb_put_incoming();
}
+static int __pci_liveupdate_finish_device(struct pci_dev *dev)
+{
+ guard(write_lock)(&dev->liveupdate.lock);
+
+ if (!dev->liveupdate.incoming) {
+ pci_warn(dev, "Cannot finish preserving an unpreserved device\n");
+ return -EINVAL;
+ }
+
+ if (!dev->liveupdate.incoming->refcount) {
+ pci_WARN(dev, 1, "Preserved device has a 0 refcount!\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Decrement the refcount so this device does not get treated as an
+ * incoming device again, e.g. in case pci_liveupdate_setup_device()
+ * gets called again because the device is hot-plugged.
+ */
+ if (--dev->liveupdate.incoming->refcount)
+ return -EBUSY;
+
+ pci_info(dev, "Device is finished participating in Live Update\n");
+ dev->liveupdate.incoming = NULL;
+ return 0;
+}
+
+static int pci_liveupdate_finish_device(struct pci_dev *dev)
+{
+ int ret;
+
+ /*
+ * If ret == -EBUSY the device is still preserved due to remaining
+ * references. Return 0 up to the caller to indicate it should proceed
+ * to finish preserving upstream devices but do not drop the device's
+ * reference on the incoming FLB below.
+ */
+ ret = __pci_liveupdate_finish_device(dev);
+ if (ret)
+ return ret == -EBUSY ? 0 : ret;
+
+ /*
+ * Once the device's refcount reaches zero drop the device's reference
+ * on the incoming FLB so it can be freed.
+ */
+ pci_liveupdate_flb_put_incoming();
+ return 0;
+}
+
/**
* pci_liveupdate_finish() - Finish the preservation of a PCI device across Live Update
* @dev: The PCI device
@@ -519,28 +656,18 @@ void pci_liveupdate_cleanup_device(struct pci_dev *dev)
* Update. Drivers must call pci_liveupdate_finish() from their struct
* liveupdate_file_handler finish() callback to ensure the incoming struct
* pci_ser is allocated.
+ *
+ * pci_liveupdate_finish() automatically finishes all bridges upstream of @dev.
*/
void pci_liveupdate_finish(struct pci_dev *dev)
{
- guard(write_lock)(&dev->liveupdate.lock);
-
- if (!dev->liveupdate.incoming) {
- pci_warn(dev, "Cannot finish preserving an unpreserved device\n");
+ if (!dev)
return;
- }
-
- pci_info(dev, "Device is finished participating in Live Update\n");
- /*
- * Drop the refcount so this device does not get treated as an incoming
- * device again, e.g. in case pci_liveupdate_setup_device() gets called
- * again because the device is hot-plugged.
- */
- dev->liveupdate.incoming->refcount = 0;
- dev->liveupdate.incoming = NULL;
+ if (pci_liveupdate_finish_device(dev))
+ return;
- /* Drop this device's reference on the incoming FLB. */
- pci_liveupdate_flb_put_incoming();
+ pci_liveupdate_finish(dev->bus->self);
}
EXPORT_SYMBOL_GPL(pci_liveupdate_finish);
--
2.54.0.563.g4f69b47b94-goog
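
For readers following the path walk in this patch, here is a minimal userspace sketch of the preserve/unpreserve recursion, with rollback when an upstream device cannot be preserved. It is only a model: struct model_dev, its refcount field and the fail knob are hypothetical stand-ins rather than the kernel structures, the "one reference per preserved downstream device" rule is assumed from the cover letter, and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only what the sketch needs. */
struct model_dev {
	struct model_dev *parent;	/* upstream bridge, NULL at the root */
	unsigned int refcount;		/* preservation references (assumed) */
	int fail;			/* force preserve to fail, for illustration */
};

/* Take one preservation reference on a single device. */
static int model_preserve_device(struct model_dev *dev)
{
	if (dev->fail)
		return -1;
	dev->refcount++;
	return 0;
}

/* Drop one reference; nonzero tells the caller to stop walking upstream. */
static int model_unpreserve_device(struct model_dev *dev)
{
	if (!dev->refcount)
		return -1;
	dev->refcount--;
	return 0;
}

/* Preserve dev and every bridge above it, undoing dev on upstream failure. */
static int model_preserve_path(struct model_dev *dev)
{
	int ret;

	if (!dev)
		return 0;

	ret = model_preserve_device(dev);
	if (ret)
		return ret;

	ret = model_preserve_path(dev->parent);
	if (ret) {
		model_unpreserve_device(dev);
		return ret;
	}

	return 0;
}

/* Drop the references taken by model_preserve_path(), bottom up. */
static void model_unpreserve_path(struct model_dev *dev)
{
	if (!dev)
		return;

	if (model_unpreserve_device(dev))
		return;

	model_unpreserve_path(dev->parent);
}
```

In this model, preserving two endpoints behind the same bridge leaves the bridge with two references, and unpreserving one endpoint leaves the bridge preserved for the other.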
* [PATCH v5 07/11] PCI: liveupdate: Inherit ACS flags in incoming preserved devices
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Inherit Access Control Services (ACS) flags on all incoming preserved
devices (endpoints and upstream bridges) during a Live Update.
Inheriting ACS flags avoids changing routing rules while memory
transactions are in flight from preserved devices. This is also strictly
necessary to ensure that IOMMU group assignments do not change across
a Live Update for preserved devices, as changing ACS configurations can
split or merge IOMMU groups.
Cache the inherited ACS controls established by the previous kernel in
struct pci_dev so that ACS controls do not change after a reset
(pci_restore_state() calls pci_enable_acs()).
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 49 ++++++++++++++++++++++++++++++++++
drivers/pci/liveupdate.h | 11 ++++++++
drivers/pci/pci.c | 5 ++++
include/linux/pci_liveupdate.h | 6 +++++
4 files changed, 71 insertions(+)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index d8e06afde2c7..e3cd6d76636c 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -120,6 +120,18 @@
* This enables the PCI core and any drivers bound to the bridge to participate
* in the Live Update so that preserved endpoints can continue issuing memory
* transactions during the Live Update.
+ *
+ * Handling Preserved Devices
+ * ==========================
+ *
+ * The PCI core treats preserved devices differently than non-preserved devices.
+ * This section enumerates those differences.
+ *
+ * * The PCI core inherits all ACS flags enabled on incoming preserved devices
+ * rather than assigning new ones. This ensures that TLPs are routed the same
+ * way after Live Update and ensures that IOMMU groups do not change. Note
+ * that a device will use its inherited ACS flags for the lifetime of its
+ * struct pci_dev (i.e. even after pci_liveupdate_finish()).
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
@@ -361,6 +373,16 @@ static int pci_liveupdate_preserve_device_new(struct pci_ser *ser, struct pci_de
{
int i;
+ /*
+ * Do not preserve devices that rely on device-specific ACS
+ * equivalents (for now) since that would complicate keeping ACS
+ * flags constant across Live Update.
+ */
+ if (dev->dev_flags & PCI_DEV_FLAGS_ACS_ENABLED_QUIRK) {
+ pci_warn(dev, "Refusing to preserve device that relies on ACS quirks\n");
+ return -EINVAL;
+ }
+
if (ser->nr_devices == ser->max_nr_devices)
return -ENOSPC;
@@ -571,6 +593,7 @@ void pci_liveupdate_setup_device(struct pci_dev *dev)
pci_info(dev, "Device was preserved by previous kernel across Live Update\n");
guard(write_lock)(&dev->liveupdate.lock);
dev->liveupdate.incoming = dev_ser;
+ dev->liveupdate.was_preserved = true;
/*
* Hold the ref on the incoming FLB until pci_liveupdate_finish() so
@@ -671,6 +694,32 @@ void pci_liveupdate_finish(struct pci_dev *dev)
}
EXPORT_SYMBOL_GPL(pci_liveupdate_finish);
+void pci_liveupdate_init_acs(struct pci_dev *dev)
+{
+ guard(read_lock)(&dev->liveupdate.lock);
+
+ if (!dev->acs_cap || !dev->liveupdate.incoming)
+ return;
+
+ pci_read_config_word(dev, dev->acs_cap + PCI_ACS_CTRL, &dev->liveupdate.acs_ctrl);
+}
+
+bool pci_liveupdate_inherit_acs(struct pci_dev *dev)
+{
+ guard(read_lock)(&dev->liveupdate.lock);
+
+ /*
+ * Use liveupdate.was_preserved instead of liveupdate.incoming since the
+ * device's ACS controls should not change even after the device is
+ * finished participating in the Live Update.
+ */
+ if (!dev->acs_cap || !dev->liveupdate.was_preserved)
+ return false;
+
+ pci_write_config_word(dev, dev->acs_cap + PCI_ACS_CTRL, dev->liveupdate.acs_ctrl);
+ return true;
+}
+
/**
* pci_liveupdate_is_incoming() - Check if a device is incoming preserved
* @dev: The PCI device to check
diff --git a/drivers/pci/liveupdate.h b/drivers/pci/liveupdate.h
index 0bd3e961d5c5..c0826ca717e3 100644
--- a/drivers/pci/liveupdate.h
+++ b/drivers/pci/liveupdate.h
@@ -14,6 +14,8 @@
void pci_liveupdate_setup_device(struct pci_dev *dev);
void pci_liveupdate_cleanup_device(struct pci_dev *dev);
bool pci_liveupdate_inherit_buses(void);
+void pci_liveupdate_init_acs(struct pci_dev *dev);
+bool pci_liveupdate_inherit_acs(struct pci_dev *dev);
#else
static inline void pci_liveupdate_setup_device(struct pci_dev *dev)
{
@@ -27,6 +29,15 @@ static inline bool pci_liveupdate_inherit_buses(void)
{
return false;
}
+
+static inline void pci_liveupdate_init_acs(struct pci_dev *dev)
+{
+}
+
+static inline bool pci_liveupdate_inherit_acs(struct pci_dev *dev)
+{
+ return false;
+}
#endif
#endif /* DRIVERS_PCI_LIVEUPDATE_H */
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8f7cfcc00090..cd2c1f2ada92 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -33,6 +33,7 @@
#include <asm/dma.h>
#include <linux/aer.h>
#include <linux/bitfield.h>
+#include "liveupdate.h"
#include "pci.h"
DEFINE_MUTEX(pci_slot_mutex);
@@ -1017,6 +1018,9 @@ void pci_enable_acs(struct pci_dev *dev)
bool enable_acs = false;
int pos;
+ if (pci_liveupdate_inherit_acs(dev))
+ return;
+
/* If an iommu is present we start with kernel default caps */
if (pci_acs_enable) {
if (pci_dev_specific_enable_acs(dev))
@@ -3657,6 +3661,7 @@ void pci_acs_init(struct pci_dev *dev)
pci_read_config_word(dev, pos + PCI_ACS_CAP, &dev->acs_capabilities);
pci_disable_broken_acs_cap(dev);
+ pci_liveupdate_init_acs(dev);
}
/**
diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
index 1c2ee32ad058..34f9900c7d29 100644
--- a/include/linux/pci_liveupdate.h
+++ b/include/linux/pci_liveupdate.h
@@ -18,11 +18,17 @@
* @lock: Lock used to protect members of struct pci_liveupdate.
* @outgoing: State preserved for the next kernel.
* @incoming: State preserved by the previous kernel.
+ * @acs_ctrl: ACS features established by the previous kernel.
+ * @was_preserved: True if this struct pci_dev was preserved by the previous
+ * kernel. Unlike @incoming, this field is not cleared after
+ * the device is finished participating in Live Update.
*/
struct pci_liveupdate {
rwlock_t lock;
struct pci_dev_ser *outgoing;
struct pci_dev_ser *incoming;
+ u16 acs_ctrl;
+ unsigned int was_preserved:1;
};
struct pci_dev;
--
2.54.0.563.g4f69b47b94-goog
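
The reason this patch needs both @incoming and @was_preserved may be easier to see in a small userspace model. The sketch below is an illustration under assumptions: struct model_dev and its fields are hypothetical, hw_acs_ctrl stands in for the ACS Control register, and default_ctrl stands in for whatever the normal pci_enable_acs() policy would program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ACS-related state of a struct pci_dev. */
struct model_dev {
	bool has_acs_cap;
	bool incoming;			/* cleared once the device is finished */
	bool was_preserved;		/* sticky for the device's lifetime */
	uint16_t hw_acs_ctrl;		/* models the ACS Control register */
	uint16_t cached_acs_ctrl;	/* value left by the previous kernel */
};

/* Enumeration time: snapshot what the previous kernel programmed. */
static void model_init_acs(struct model_dev *dev)
{
	if (!dev->has_acs_cap || !dev->incoming)
		return;

	dev->cached_acs_ctrl = dev->hw_acs_ctrl;
}

/* Restore the snapshot; keyed on was_preserved, not incoming. */
static bool model_inherit_acs(struct model_dev *dev)
{
	if (!dev->has_acs_cap || !dev->was_preserved)
		return false;

	dev->hw_acs_ctrl = dev->cached_acs_ctrl;
	return true;
}

/* Models pci_enable_acs(): inherited controls win over the default policy. */
static void model_enable_acs(struct model_dev *dev, uint16_t default_ctrl)
{
	if (model_inherit_acs(dev))
		return;

	if (dev->has_acs_cap)
		dev->hw_acs_ctrl = default_ctrl;
}
```

Because the restore path checks the sticky flag, a reset that happens after the device has finished participating in Live Update still ends up with the inherited controls rather than the default ones.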
* [PATCH v5 08/11] PCI: liveupdate: Inherit ARI Forwarding Enable on preserved bridges
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Inherit the ARI Forwarding Enable on preserved bridges and update
pci_dev->ari_enabled accordingly during a Live Update. This ensures that
the preserved devices on the bridge's secondary bus can be identified
with the same expanded 8-bit function number after a Live Update.
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 18 ++++++++++++++++++
drivers/pci/liveupdate.h | 6 ++++++
drivers/pci/pci.c | 8 +++++++-
3 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index e3cd6d76636c..6ab03bd548b3 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -132,6 +132,10 @@
* way after Live Update and ensures that IOMMU groups do not change. Note
* that a device will use its inherited ACS flags for the lifetime of its
* struct pci_dev (i.e. even after pci_liveupdate_finish()).
+ *
+ * * The PCI core inherits ARI Forwarding Enable on all bridges with downstream
+ * preserved devices to ensure that all preserved devices on the bridge's
+ * secondary bus are addressable after the Live Update.
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
@@ -720,6 +724,20 @@ bool pci_liveupdate_inherit_acs(struct pci_dev *dev)
return true;
}
+bool pci_liveupdate_inherit_ari(struct pci_dev *dev)
+{
+ u16 val;
+
+ guard(read_lock)(&dev->liveupdate.lock);
+
+ if (!dev->liveupdate.incoming)
+ return false;
+
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL2, &val);
+ dev->ari_enabled = !!(val & PCI_EXP_DEVCTL2_ARI);
+ return true;
+}
+
/**
* pci_liveupdate_is_incoming() - Check if a device is incoming preserved
* @dev: The PCI device to check
diff --git a/drivers/pci/liveupdate.h b/drivers/pci/liveupdate.h
index c0826ca717e3..fd7693c7ddd2 100644
--- a/drivers/pci/liveupdate.h
+++ b/drivers/pci/liveupdate.h
@@ -16,6 +16,7 @@ void pci_liveupdate_cleanup_device(struct pci_dev *dev);
bool pci_liveupdate_inherit_buses(void);
void pci_liveupdate_init_acs(struct pci_dev *dev);
bool pci_liveupdate_inherit_acs(struct pci_dev *dev);
+bool pci_liveupdate_inherit_ari(struct pci_dev *dev);
#else
static inline void pci_liveupdate_setup_device(struct pci_dev *dev)
{
@@ -38,6 +39,11 @@ static inline bool pci_liveupdate_inherit_acs(struct pci_dev *dev)
{
return false;
}
+
+static inline bool pci_liveupdate_inherit_ari(struct pci_dev *dev)
+{
+ return false;
+}
#endif
#endif /* DRIVERS_PCI_LIVEUPDATE_H */
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index cd2c1f2ada92..7e9768dfe092 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3495,7 +3495,7 @@ void pci_configure_ari(struct pci_dev *dev)
u32 cap;
struct pci_dev *bridge;
- if (pcie_ari_disabled || !pci_is_pcie(dev) || dev->devfn)
+ if (!pci_is_pcie(dev) || dev->devfn)
return;
bridge = dev->bus->self;
@@ -3506,6 +3506,12 @@ void pci_configure_ari(struct pci_dev *dev)
if (!(cap & PCI_EXP_DEVCAP2_ARI))
return;
+ if (pci_liveupdate_inherit_ari(bridge))
+ return;
+
+ if (pcie_ari_disabled)
+ return;
+
if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ARI)) {
pcie_capability_set_word(bridge, PCI_EXP_DEVCTL2,
PCI_EXP_DEVCTL2_ARI);
--
2.54.0.563.g4f69b47b94-goog
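
The reordering of the pcie_ari_disabled check in pci_configure_ari() is the subtle part of this patch. A rough model of the resulting decision order, with hypothetical names throughout (enum ari_action and model_configure_ari() do not exist in the kernel), could look like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of the modelled pci_configure_ari() call. */
enum ari_action {
	ARI_SKIP,	/* leave the bridge untouched */
	ARI_INHERIT,	/* adopt whatever the previous kernel left enabled */
	ARI_CONFIGURE,	/* normal path: set or clear ARI Forwarding Enable */
};

static enum ari_action model_configure_ari(bool is_pcie, unsigned int devfn,
					    bool bridge_supports_ari,
					    bool bridge_is_incoming,
					    bool ari_disabled)
{
	if (!is_pcie || devfn)
		return ARI_SKIP;

	if (!bridge_supports_ari)
		return ARI_SKIP;

	/* Inheritance is checked before the global disable flag. */
	if (bridge_is_incoming)
		return ARI_INHERIT;

	if (ari_disabled)
		return ARI_SKIP;

	return ARI_CONFIGURE;
}
```

The point of the ordering is that a preserved bridge keeps its ARI setting even when the new kernel would otherwise not touch ARI because of pcie_ari_disabled.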
* [PATCH v5 09/11] PCI: liveupdate: Freeze preservation status during shutdown
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Freeze a device's outgoing preservation status (preserved or not
preserved) during shutdown. This enables the PCI core and drivers to
safely make decisions based on the device's preservation status during
shutdown.
Note that pci_liveupdate_freeze() is triggered by the PCI core rather
than from drivers participating in Live Update so that all devices can
have their status frozen (i.e. prevent non-preserved devices from
getting preserved late).
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 16 ++++++++++++++++
drivers/pci/liveupdate.h | 5 +++++
drivers/pci/pci-driver.c | 2 ++
include/linux/pci_liveupdate.h | 3 +++
4 files changed, 26 insertions(+)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 6ab03bd548b3..825166a57913 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -334,6 +334,11 @@ static int pci_liveupdate_unpreserve_device(struct pci_ser *ser, struct pci_dev
guard(write_lock)(&dev->liveupdate.lock);
+ if (dev->liveupdate.frozen) {
+ pci_WARN(dev, 1, "Cannot unpreserve device after it is frozen!\n");
+ return -EINVAL;
+ }
+
dev_ser = dev->liveupdate.outgoing;
if (!dev_ser) {
pci_warn(dev, "Cannot unpreserve device that is not preserved\n");
@@ -420,6 +425,11 @@ static int pci_liveupdate_preserve_device(struct pci_ser *ser, struct pci_dev *d
{
guard(write_lock)(&dev->liveupdate.lock);
+ if (dev->liveupdate.frozen) {
+ pci_WARN(dev, 1, "Cannot preserve device after it is frozen!\n");
+ return -EINVAL;
+ }
+
if (dev->liveupdate.outgoing)
return pci_liveupdate_preserve_device_existing(dev);
else
@@ -625,6 +635,12 @@ void pci_liveupdate_cleanup_device(struct pci_dev *dev)
pci_liveupdate_flb_put_incoming();
}
+void pci_liveupdate_freeze(struct pci_dev *dev)
+{
+ guard(write_lock)(&dev->liveupdate.lock);
+ dev->liveupdate.frozen = 1;
+}
+
static int __pci_liveupdate_finish_device(struct pci_dev *dev)
{
guard(write_lock)(&dev->liveupdate.lock);
diff --git a/drivers/pci/liveupdate.h b/drivers/pci/liveupdate.h
index fd7693c7ddd2..30deaa673efe 100644
--- a/drivers/pci/liveupdate.h
+++ b/drivers/pci/liveupdate.h
@@ -13,6 +13,7 @@
#ifdef CONFIG_PCI_LIVEUPDATE
void pci_liveupdate_setup_device(struct pci_dev *dev);
void pci_liveupdate_cleanup_device(struct pci_dev *dev);
+void pci_liveupdate_freeze(struct pci_dev *dev);
bool pci_liveupdate_inherit_buses(void);
void pci_liveupdate_init_acs(struct pci_dev *dev);
bool pci_liveupdate_inherit_acs(struct pci_dev *dev);
@@ -26,6 +27,10 @@ static inline void pci_liveupdate_cleanup_device(struct pci_dev *dev)
{
}
+static inline void pci_liveupdate_freeze(struct pci_dev *dev)
+{
+}
+
static inline bool pci_liveupdate_inherit_buses(void)
{
return false;
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index d10ece0889f0..f7a5e65a7c75 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -21,6 +21,7 @@
#include <linux/acpi.h>
#include <linux/dma-map-ops.h>
#include <linux/iommu.h>
+#include "liveupdate.h"
#include "pci.h"
#include "pcie/portdrv.h"
@@ -536,6 +537,7 @@ static void pci_device_shutdown(struct device *dev)
struct pci_dev *pci_dev = to_pci_dev(dev);
struct pci_driver *drv = pci_dev->driver;
+ pci_liveupdate_freeze(pci_dev);
pm_runtime_resume(dev);
if (drv && drv->shutdown)
diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
index 34f9900c7d29..7e4ac7a0f4fc 100644
--- a/include/linux/pci_liveupdate.h
+++ b/include/linux/pci_liveupdate.h
@@ -22,6 +22,8 @@
* @was_preserved: True if this struct pci_dev was preserved by the previous
* kernel. Unlike @incoming, this field is not cleared after
* the device is finished participating in Live Update.
+ * @frozen: True if the outgoing preservation status of this device is frozen
+ * and thus cannot be changed.
*/
struct pci_liveupdate {
rwlock_t lock;
@@ -29,6 +31,7 @@ struct pci_liveupdate {
struct pci_dev_ser *incoming;
u16 acs_ctrl;
unsigned int was_preserved:1;
+ unsigned int frozen:1;
};
struct pci_dev;
--
2.54.0.563.g4f69b47b94-goog
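
As an illustration of what freezing buys, here is a minimal userspace model of the frozen flag. Everything in it is hypothetical (struct model_dev, the model_*() helpers, and the -EINVAL returned for unpreserving a device that was never preserved); it only shows that the preservation status cannot change in either direction once the flag is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the outgoing state in struct pci_liveupdate. */
struct model_dev {
	bool outgoing;	/* device will be preserved for the next kernel */
	bool frozen;	/* status may no longer change */
};

static int model_preserve(struct model_dev *dev)
{
	if (dev->frozen)
		return -EINVAL;

	dev->outgoing = true;
	return 0;
}

static int model_unpreserve(struct model_dev *dev)
{
	if (dev->frozen)
		return -EINVAL;

	if (!dev->outgoing)
		return -EINVAL;	/* assumed error code for this sketch */

	dev->outgoing = false;
	return 0;
}

/* Models pci_liveupdate_freeze() being called at the start of shutdown. */
static void model_freeze(struct model_dev *dev)
{
	dev->frozen = true;
}
```

After model_freeze(), a caller reading dev->outgoing gets a stable answer, which is what the later bus mastering decision relies on.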
* [PATCH v5 10/11] PCI: liveupdate: Do not disable bus mastering on preserved devices during kexec
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Do not disable bus mastering on outgoing preserved devices during
pci_device_shutdown() for kexec.
Preserved devices must be allowed to perform memory transactions during
a Live Update to minimize downtime and ensure continuous operation.
Clearing the bus mastering bit would prevent these devices from issuing
any memory requests while the new kernel boots.
Because bridges upstream of preserved endpoint devices are also
automatically preserved, this change also avoids clearing bus mastering
on them. This is critical because clearing bus mastering on an upstream
bridge prevents the bridge from forwarding memory requests upstream (i.e.
it would prevent the endpoint device from accessing system RAM and doing
peer-to-peer transactions with devices not downstream of the bridge).
Signed-off-by: David Matlack <dmatlack@google.com>
---
drivers/pci/liveupdate.c | 4 ++++
drivers/pci/liveupdate.h | 12 ++++++++++++
drivers/pci/pci-driver.c | 31 ++++++++++++++++++++++---------
3 files changed, 38 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 825166a57913..6c4b57d8f780 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -136,6 +136,10 @@
* * The PCI core inherits ARI Forwarding Enable on all bridges with downstream
* preserved devices to ensure that all preserved devices on the bridge's
* secondary bus are addressable after the Live Update.
+ *
+ * * The PCI core does not disable bus mastering on outgoing preserved devices
+ * during kexec. This allows preserved devices to issue memory transactions
+ * throughout the Live Update.
*/
#define pr_fmt(fmt) "PCI: liveupdate: " fmt
diff --git a/drivers/pci/liveupdate.h b/drivers/pci/liveupdate.h
index 30deaa673efe..8ad404307a70 100644
--- a/drivers/pci/liveupdate.h
+++ b/drivers/pci/liveupdate.h
@@ -18,6 +18,13 @@ bool pci_liveupdate_inherit_buses(void);
void pci_liveupdate_init_acs(struct pci_dev *dev);
bool pci_liveupdate_inherit_acs(struct pci_dev *dev);
bool pci_liveupdate_inherit_ari(struct pci_dev *dev);
+
+static inline bool pci_liveupdate_is_outgoing(struct pci_dev *dev)
+{
+ guard(read_lock)(&dev->liveupdate.lock);
+ pci_WARN_ONCE(dev, !dev->liveupdate.frozen, "Preservation status is unstable!\n");
+ return dev->liveupdate.outgoing;
+}
#else
static inline void pci_liveupdate_setup_device(struct pci_dev *dev)
{
@@ -49,6 +56,11 @@ static inline bool pci_liveupdate_inherit_ari(struct pci_dev *dev)
{
return false;
}
+
+static inline bool pci_liveupdate_is_outgoing(struct pci_dev *dev)
+{
+ return false;
+}
#endif
#endif /* DRIVERS_PCI_LIVEUPDATE_H */
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f7a5e65a7c75..b6c931ebd3be 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -532,6 +532,27 @@ static void pci_device_remove(struct device *dev)
pci_dev_put(pci_dev);
}
+/*
+ * Disable bus mastering on the device so that it does not perform memory
+ * transactions during kexec.
+ *
+ * Don't touch devices that are being preserved across kexec for Live
+ * Update or that are in D3cold or unknown states.
+ */
+static void pci_clear_master_for_shutdown(struct pci_dev *pci_dev)
+{
+ if (!kexec_in_progress)
+ return;
+
+ if (pci_liveupdate_is_outgoing(pci_dev))
+ return;
+
+ if (pci_dev->current_state > PCI_D3hot)
+ return;
+
+ pci_clear_master(pci_dev);
+}
+
static void pci_device_shutdown(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
@@ -543,15 +564,7 @@ static void pci_device_shutdown(struct device *dev)
if (drv && drv->shutdown)
drv->shutdown(pci_dev);
- /*
- * If this is a kexec reboot, turn off Bus Master bit on the
- * device to tell it to not continue to do DMA. Don't touch
- * devices in D3cold or unknown states.
- * If it is not a kexec reboot, firmware will hit the PCI
- * devices with big hammer and stop their DMA any way.
- */
- if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
- pci_clear_master(pci_dev);
+ pci_clear_master_for_shutdown(pci_dev);
}
#ifdef CONFIG_PM_SLEEP
--
2.54.0.563.g4f69b47b94-goog
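
The three early returns in pci_clear_master_for_shutdown() amount to a small truth table. The sketch below restates it as a pure function so the cases can be checked in isolation; enum model_power_state and model_should_clear_master() are hypothetical names, with the states ordered the same way as the kernel's PCI power states (D0 lowest, unknown highest).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the PCI power state ordering. */
enum model_power_state {
	MODEL_D0,
	MODEL_D1,
	MODEL_D2,
	MODEL_D3HOT,
	MODEL_D3COLD,
	MODEL_UNKNOWN,
};

/* True when the modelled shutdown path would clear bus mastering. */
static bool model_should_clear_master(bool kexec_in_progress, bool outgoing,
				      enum model_power_state state)
{
	if (!kexec_in_progress)
		return false;

	/* Preserved devices keep issuing memory transactions. */
	if (outgoing)
		return false;

	/* Devices in D3cold or an unknown state are left alone. */
	if (state > MODEL_D3HOT)
		return false;

	return true;
}
```

Only a non-preserved device in D0 through D3hot during a kexec reboot has bus mastering cleared; every other combination is left as is.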
* [PATCH v5 11/11] Documentation: PCI: Add documentation for Live Update
From: David Matlack @ 2026-05-12 18:48 UTC (permalink / raw)
To: kexec, linux-doc, linux-kernel, linux-mm, linux-pci
Cc: Adithya Jayachandran, Alexander Graf, Alex Williamson,
Bjorn Helgaas, Chris Li, David Matlack, David Rientjes, Jacob Pan,
Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260512184846.119396-1-dmatlack@google.com>
Add documentation files for the PCI subsystem's participation in Live
Update.
These documentation files are generated from the kernel-doc comments
in the PCI Live Update source code. They describe the File-Lifecycle
Bound (FLB) API, the device tracking API, and the specific policies
applied to preserved devices (such as bus number inheritance and bus
mastering preservation).
Signed-off-by: David Matlack <dmatlack@google.com>
---
Documentation/PCI/index.rst | 1 +
Documentation/PCI/liveupdate.rst | 29 +++++++++++++++++++++++++++
Documentation/core-api/liveupdate.rst | 1 +
MAINTAINERS | 1 +
4 files changed, 32 insertions(+)
create mode 100644 Documentation/PCI/liveupdate.rst
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index 5d720d2a415e..23fb737ac969 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -20,3 +20,4 @@ PCI Bus Subsystem
controller/index
boot-interrupts
tph
+ liveupdate
diff --git a/Documentation/PCI/liveupdate.rst b/Documentation/PCI/liveupdate.rst
new file mode 100644
index 000000000000..eba55f8a92ae
--- /dev/null
+++ b/Documentation/PCI/liveupdate.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+===========================
+PCI Support for Live Update
+===========================
+
+.. kernel-doc:: drivers/pci/liveupdate.c
+ :doc: PCI Live Update
+
+Driver API
+==========
+
+.. kernel-doc:: drivers/pci/liveupdate.c
+ :export:
+
+Live Update ABI
+===============
+
+.. kernel-doc:: include/linux/kho/abi/pci.h
+ :doc: PCI File-Lifecycle Bound (FLB) Live Update ABI
+
+.. kernel-doc:: include/linux/kho/abi/pci.h
+ :internal:
+
+See Also
+========
+
+ * :doc:`/core-api/liveupdate`
+ * :doc:`/core-api/kho/index`
diff --git a/Documentation/core-api/liveupdate.rst b/Documentation/core-api/liveupdate.rst
index 5a292d0f3706..d56a7760978a 100644
--- a/Documentation/core-api/liveupdate.rst
+++ b/Documentation/core-api/liveupdate.rst
@@ -70,3 +70,4 @@ See Also
- :doc:`Live Update uAPI </userspace-api/liveupdate>`
- :doc:`/core-api/kho/index`
+- :doc:`PCI </PCI/liveupdate>`
diff --git a/MAINTAINERS b/MAINTAINERS
index 0e262c0ceb43..6f0b0ebf67cd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20536,6 +20536,7 @@ L: kexec@lists.infradead.org
L: linux-pci@vger.kernel.org
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
+F: Documentation/PCI/liveupdate.rst
F: drivers/pci/liveupdate.c
F: drivers/pci/liveupdate.h
F: include/linux/kho/abi/pci.h
--
2.54.0.563.g4f69b47b94-goog
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: T.J. Mercier @ 2026-05-12 18:53 UTC (permalink / raw)
To: Christian König
Cc: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Shuah Khan, Sumit Semwal, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Benjamin Gaignard, Brian Starkey, John Stultz, Christian Brauner,
Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, linux-mm,
linux-security-module, selinux, linux-kselftest, mripard,
echanude
In-Reply-To: <8ef38815-6ae9-4359-86d4-042554357639@amd.com>
On Tue, May 12, 2026 at 3:14 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/12/26 11:10, Albert Esteve wrote:
> > On embedded platforms a central process often allocates dma-buf
> > memory on behalf of client applications. Without a way to
> > attribute the charge to the requesting client's cgroup, the
> > cost lands on the allocator, making per-cgroup memory limits
> > ineffective for the actual consumers.
> >
> > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > the mem_accounting module parameter enabled, the buffer is charged
> > to the allocator's own cgroup.
> >
> > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > all accounting through a single MEMCG_DMABUF path.
> >
> > Usage examples:
> >
> > 1. Central allocator charging to a client at allocation time.
> > The allocator knows the client's PID (e.g., from binder's
> > sender_pid) and uses pidfd to attribute the charge:
> >
> > pid_t client_pid = txn->sender_pid;
> > int pidfd = pidfd_open(client_pid, 0);
> >
> > struct dma_heap_allocation_data alloc = {
> > .len = buffer_size,
> > .fd_flags = O_RDWR | O_CLOEXEC,
> > .charge_pid_fd = pidfd,
> > };
> > ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > close(pidfd);
> > /* alloc.fd is now charged to client's cgroup */
> >
> > 2. Default allocation (no pidfd, mem_accounting=1).
> > When charge_pid_fd is not set and the mem_accounting module
> > parameter is enabled, the buffer is charged to the allocator's
> > own cgroup:
> >
> > struct dma_heap_allocation_data alloc = {
> > .len = buffer_size,
> > .fd_flags = O_RDWR | O_CLOEXEC,
> > };
> > ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> > /* charged to current process's cgroup */
> >
> > Current limitations:
> >
> > - Single-owner model: a dma-buf carries one memcg charge regardless of
> > how many processes share it. Means only the first owner (and exporter)
> > of the shared buffer bears the charge.
> > - Only memcg accounting supported. While this makes sense for system
> > heap buffers, other heaps (e.g., CMA heaps) will require selectively
> > charging also for the dmem controller.
>
> Well that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
Yeah I think this might work. I know of 3 cases, and it trivially
solves the first two. The third requires some work on our end to
extend our userspace interfaces to include the pidfd but it seems
doable. I'm checking with our graphics folks.
1) Direct allocation from user (e.g. app -> allocation ioctl on
/dev/dma_heap/foo)
No changes required to userspace. mem_accounting=1 charges the app.
2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate
-> gralloc)
gralloc has the caller's pid as described in the commit message. Open
a pidfd and pass it in the dma_heap_allocation_data.
3) Double hop remote allocation (e.g. app -> dequeueBuffer ->
SurfaceFlinger -> gralloc)
In this case gralloc knows SurfaceFlinger's pid, but not the app's. So
we need to add the app's pidfd to the SurfaceFlinger -> gralloc
interface, or transfer the memcg charge from SurfaceFlinger to the app
after the allocation.
It'd be nice to avoid the charge transfer option entirely, but if we
need it that doesn't seem so bad in this case because it's a bulk
charge for the entire dmabuf rather than per-page. So the exporter
doesn't need to get involved (we wouldn't need a new dma_buf_op) and
we wouldn't have to worry about looping and locking for each page.
> I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive etc...
>
> Essentially the problem boils down to two limitations:
> 1) a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups
Yup, memcg already has this problem with pagecache and shmem.
> 2) when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
>
> The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers etc...
>
> On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps.
I have a question about this part. Albert, I guess you are interested
only in accounting dmabuf-heap allocations? Or do you expect to add
__GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
non-dmabuf-heap exporters?
> On the GPU side I saw yet another attempt, just a few weeks ago, at a driver doing some kind of special driver-specific accounting to solve this. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > ---
> > Documentation/admin-guide/cgroup-v2.rst | 5 ++--
> > drivers/dma-buf/dma-buf.c | 16 ++++---------
> > drivers/dma-buf/dma-heap.c | 42 ++++++++++++++++++++++++++++++---
> > drivers/dma-buf/heaps/system_heap.c | 2 --
> > include/uapi/linux/dma-heap.h | 6 +++++
> > 5 files changed, 53 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 8bdbc2e866430..824d269531eb1 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> > structures.
> >
> > dmabuf (npn)
> > - Amount of memory used for exported DMA buffers allocated by the cgroup.
> > - Stays with the allocating cgroup regardless of how the buffer is shared.
> > + Amount of memory used for exported DMA buffers allocated by or on
> > + behalf of the cgroup. Stays with the allocating cgroup regardless
> > + of how the buffer is shared.
> >
> > workingset_refault_anon
> > Number of refaults of previously evicted anonymous pages.
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index ce02377f48908..23fb758b78297 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> > */
> > BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> >
> > - mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > - mem_cgroup_put(dmabuf->memcg);
> > + if (dmabuf->memcg) {
> > + mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> > + PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> > + mem_cgroup_put(dmabuf->memcg);
> > + }
> >
> > dmabuf->ops->release(dmabuf);
> >
> > @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> > dmabuf->resv = resv;
> > }
> >
> > - dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> > - if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> > - GFP_KERNEL)) {
> > - ret = -ENOMEM;
> > - goto err_memcg;
> > - }
> > -
> > file->private_data = dmabuf;
> > file->f_path.dentry->d_fsdata = dmabuf;
> > dmabuf->file = file;
> > @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >
> > return dmabuf;
> >
> > -err_memcg:
> > - mem_cgroup_put(dmabuf->memcg);
> > err_file:
> > fput(file);
> > err_module:
> > diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> > index ac5f8685a6494..ff6e259afcdc0 100644
> > --- a/drivers/dma-buf/dma-heap.c
> > +++ b/drivers/dma-buf/dma-heap.c
> > @@ -7,13 +7,17 @@
> > */
> >
> > #include <linux/cdev.h>
> > +#include <linux/cgroup.h>
> > #include <linux/device.h>
> > #include <linux/dma-buf.h>
> > #include <linux/dma-heap.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/sched/mm.h>
> > #include <linux/err.h>
> > #include <linux/export.h>
> > #include <linux/list.h>
> > #include <linux/nospec.h>
> > +#include <linux/pidfd.h>
> > #include <linux/syscalls.h>
> > #include <linux/uaccess.h>
> > #include <linux/xarray.h>
> > @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> > "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> >
> > static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > - u32 fd_flags,
> > - u64 heap_flags)
> > + u32 fd_flags, u64 heap_flags,
> > + struct mem_cgroup *charge_to)
> > {
> > struct dma_buf *dmabuf;
> > + unsigned int nr_pages;
> > + struct mem_cgroup *memcg = charge_to;
> > int fd;
> >
> > /*
> > @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> > if (IS_ERR(dmabuf))
> > return PTR_ERR(dmabuf);
> >
> > + nr_pages = len / PAGE_SIZE;
> > +
> > + if (memcg)
> > + css_get(&memcg->css);
> > + else if (mem_accounting)
> > + memcg = get_mem_cgroup_from_mm(current->mm);
> > +
> > + if (memcg) {
> > + if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> > + mem_cgroup_put(memcg);
> > + dma_buf_put(dmabuf);
> > + return -ENOMEM;
> > + }
> > + dmabuf->memcg = memcg;
> > + }
> > +
> > fd = dma_buf_fd(dmabuf, fd_flags);
> > if (fd < 0) {
> > dma_buf_put(dmabuf);
> > @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > {
> > struct dma_heap_allocation_data *heap_allocation = data;
> > struct dma_heap *heap = file->private_data;
> > + struct mem_cgroup *memcg = NULL;
> > + struct task_struct *task;
> > + unsigned int pidfd_flags;
> > int fd;
> >
> > if (heap_allocation->fd)
> > @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> > if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> > return -EINVAL;
> >
> > + if (heap_allocation->charge_pid_fd) {
> > + task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> > + if (IS_ERR(task))
> > + return PTR_ERR(task);
> > +
> > + memcg = get_mem_cgroup_from_mm(task->mm);
> > + put_task_struct(task);
> > + }
> > +
> > fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> > heap_allocation->fd_flags,
> > - heap_allocation->heap_flags);
> > + heap_allocation->heap_flags,
> > + memcg);
> > + mem_cgroup_put(memcg);
> > if (fd < 0)
> > return fd;
> >
> > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > index 03c2b87cb1112..95d7688167b93 100644
> > --- a/drivers/dma-buf/heaps/system_heap.c
> > +++ b/drivers/dma-buf/heaps/system_heap.c
> > @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> > if (max_order < orders[i])
> > continue;
> > flags = order_flags[i];
> > - if (mem_accounting)
> > - flags |= __GFP_ACCOUNT;
> > page = alloc_pages(flags, orders[i]);
> > if (!page)
> > continue;
> > diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> > index a4cf716a49fa6..e02b0f8cbc6a1 100644
> > --- a/include/uapi/linux/dma-heap.h
> > +++ b/include/uapi/linux/dma-heap.h
> > @@ -29,6 +29,10 @@
> > * handle to the allocated dma-buf
> > * @fd_flags: file descriptor flags used when allocating
> > * @heap_flags: flags passed to heap
> > + * @charge_pid_fd: optional pidfd of the process whose cgroup should be
> > + * charged for this allocation; 0 means charge the calling
> > + * process's cgroup
> > + * @__padding: reserved, must be zero
> > *
> > * Provided by userspace as an argument to the ioctl
> > */
> > @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> > __u32 fd;
> > __u32 fd_flags;
> > __u64 heap_flags;
> > + __u32 charge_pid_fd;
> > + __u32 __padding;
> > };
> >
> > #define DMA_HEAP_IOC_MAGIC 'H'
> >
>
* Re: [PATCH v12 05/11] iio: core: add decimal value formatting into 64-bit value
From: Rodrigo Alencar @ 2026-05-12 19:01 UTC (permalink / raw)
To: Andy Shevchenko, Rodrigo Alencar
Cc: rodrigo.alencar, linux-kernel, linux-iio, devicetree, linux-doc,
Jonathan Cameron, David Lechner, Andy Shevchenko,
Lars-Peter Clausen, Michael Hennerich, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Andrew Morton,
Petr Mladek, Steven Rostedt, Rasmus Villemoes, Sergey Senozhatsky,
Shuah Khan
In-Reply-To: <agNoKbcwT6_spC93@ashevche-desk.local>
On 26/05/12 08:49PM, Andy Shevchenko wrote:
> On Tue, May 12, 2026 at 05:09:32PM +0100, Rodrigo Alencar wrote:
> > On 26/05/12 05:35PM, Andy Shevchenko wrote:
> > > On Sun, May 10, 2026 at 01:42:23PM +0100, Rodrigo Alencar via B4 Relay wrote:
> > >
> > > > Create new format types for iio values (IIO_VAL_DECIMAL64_*), which
> > > > defines the representation of fixed decimal point values into a single
> > > > 64-bit number. This new format increases the range of represented values,
> > > > allowing for integer parts greater than 2^32, as bits are not "wasted"
> > > > in the fractional part, which can be seen in IIO_VAL_INT_PLUS_MICRO and
> > > > IIO_VAL_INT_PLUS_NANO. Helpers are created to compose and decompose 64-bit
> > > > decimals into integer values used in IIO formatting interfaces, which
> > > > creates consistency and avoid error-prone manual assignments when using
> > > > wordpart macros. When doing the parsing, kstrtodec64() is used with the
> > > > scale defined by the specific decimal format type.
>
> ...
>
> > > > + tmp2 = div64_s64_rem(iio_val_s64_from_array(vals),
> > > > + int_pow(10, scale), &frac);
> > > > + if (tmp2 == 0 && frac < 0)
> > > > + return sysfs_emit_at(buf, offset, "-0.%0*lld", scale,
> > > > + abs(frac));
> > > > + else
> > > > + return sysfs_emit_at(buf, offset, "%lld.%0*lld", tmp2,
> > > > + scale, abs(frac));
> > > > + }
> > >
> > > What about
> > >
> > > /* Print a leading '-' for negative fractions */
> > > if (tmp2 == 0 && frac < 0)
> > > offset += sysfs_emit_at(buf, offset, "-");
> > >
> > > return sysfs_emit_at(buf, offset, "%lld.%0*lld", tmp2, scale, abs(frac));
> > >
> > > Also note this won't work with the frac that are == S64_MIN. It's UB (undefined
> > > behaviour), see the comment at abs() implementation. Maybe a time to add abs()
> > > corner case tests...
> >
> > frac cannot be S64_MIN, it is always a remainder modulo a power of 10.
>
> Okay, but what about input of -0.9999999999999999999 ? Will it fit the signed
> frac type?
For the scales considered here it would not be a problem (*_PICO = 12 + *_BASE).
For the max scale of 19 it would probably fail the parsing of the fractional part
with overflow.
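For reference, the formatting path under discussion can be sketched in plain userspace C as below. This is only an illustration, not the code from the patch: format_dec64() and pow10_i64() are made-up names, and the scale is assumed to be in the range 1..18 so that the power of 10 fits in an s64. Because |frac| is always smaller than 10^scale, negating it cannot hit the S64_MIN corner case.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* 10^scale; the caller keeps scale in 1..18 so the result fits in int64_t. */
static int64_t pow10_i64(unsigned int scale)
{
	int64_t p = 1;

	while (scale--)
		p *= 10;
	return p;
}

/*
 * Format a fixed-point decimal held in a single signed 64-bit value.
 * C division truncates toward zero, so the remainder carries the sign of
 * val and an explicit '-' is needed only when the integer part is 0.
 */
static int format_dec64(char *buf, size_t size, int64_t val, unsigned int scale)
{
	int64_t div = pow10_i64(scale);
	int64_t ipart = val / div;
	int64_t frac = val % div;
	/* |frac| < 10^scale <= 10^18, so the negation cannot overflow. */
	uint64_t ufrac = frac < 0 ? (uint64_t)(-frac) : (uint64_t)frac;
	const char *sign = (ipart == 0 && frac < 0) ? "-" : "";

	return snprintf(buf, size, "%s%" PRId64 ".%0*" PRIu64,
			sign, ipart, (int)scale, ufrac);
}
```

With scale 3, the value -5 comes out as "-0.005" and -1234567 as "-1234.567".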
--
Kind regards,
Rodrigo Alencar
* Re: [PATCH v12 02/11] lib: kstrtox: add kstrtoudec64() and kstrtodec64()
From: Andy Shevchenko @ 2026-05-12 19:08 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Andy Shevchenko, Jonathan Cameron, Rodrigo Alencar via B4 Relay,
rodrigo.alencar, linux-kernel, linux-iio, devicetree, linux-doc,
David Lechner, Andy Shevchenko, Lars-Peter Clausen,
Michael Hennerich, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Jonathan Corbet, Andrew Morton, Petr Mladek, Steven Rostedt,
Rasmus Villemoes, Sergey Senozhatsky, Shuah Khan, David Laight
In-Reply-To: <bc7mqfgll34vyaxdtvfssgypkhyx233wd4hxfzu32rddxnolaq@rd6c3z6yu6aq>
On Tue, May 12, 2026 at 07:15:17PM +0100, Rodrigo Alencar wrote:
> On 26/05/12 08:46PM, Andy Shevchenko wrote:
> > On Tue, May 12, 2026 at 06:26:12PM +0100, Rodrigo Alencar wrote:
> > > On 26/05/12 08:13PM, Andy Shevchenko wrote:
> > > > On Tue, May 12, 2026 at 05:35:59PM +0100, Rodrigo Alencar wrote:
> > > > > On 26/05/12 06:21PM, Andy Shevchenko wrote:
> > > > > > On Tue, May 12, 2026 at 6:11 PM Rodrigo Alencar
> > > > > > <455.rodrigo.alencar@gmail.com> wrote:
> > > > > > > On 26/05/12 05:43PM, Andy Shevchenko wrote:
> > > > > > > > On Tue, May 12, 2026 at 03:12:24PM +0100, Rodrigo Alencar wrote:
> > > > > > > > > On 26/05/12 04:48PM, Andy Shevchenko wrote:
> > > > > > > > > > On Tue, May 12, 2026 at 02:21:14PM +0100, Rodrigo Alencar wrote:
> > > > > > > > > > > On 26/05/12 04:12PM, Andy Shevchenko wrote:
> > > > > > > > > > > > On Tue, May 12, 2026 at 12:39:53PM +0100, Jonathan Cameron wrote:
> > > > > > > > > > > > > On Sun, 10 May 2026 13:42:20 +0100
> > > > > > > > > > > > > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Add helpers that parses decimal numbers into 64-bit number, i.e., decimal
> > > > > > > > > > > > > > point numbers with pre-defined scale are parsed into a 64-bit value (fixed
> > > > > > > > > > > > > > precision). After the decimal point, digits beyond the specified scale
> > > > > > > > > > > > > > are ignored.
...
> > > > > > > I think we are going in circles here and we could look at the code instead:
> > > > > > > - integer parsing with _parse_integer()
> > > > > > > - overflow check and validation of the return value
> > > > > > > - fractional parsing with _parse_integer_limit()
> > > > > > > - overflow check and validation of the return value
> > > > > >
> > > > > > No, this is not fully true. That's what my whole point is about. The
> > > > > > max_chars parameter limits the input check, then it skips an arbitrary
> > > > > > number of digits and only *then* it checks for \n and \0. What will be
> > > > > > the result of the
> > > > > > 0.00000000000000000000000000000000423 in your case? Whatever scale you
> > > > > > gave it will return 0 without checking on how many digits were
> > > > > > supplied.
> > > > >
> > > > > I suppose that is a valid input and 0 is the expected result there.
> > > > >
> > > > > > All the same for 0.9999999999999999999999999999999000423. My
> > > > > > point is that we should limit this by 19 digits.
> > > > >
> > > > > why we need to limit by 19? Digits beyond the scale carry no value...
> > > >
> > > > ...only if they are all 0:s.
> > >
> > > I thought your concern was on input length.
> >
> > One of them, since I think you raised the topic of leading 0:s for integers and
> > I agreed with that, which makes sense to have mirrored in the fractional part.
> >
> > > > > just like leading zeros to the integer part (which is also accepted by
> > > > > kstrtoull() when parsing with base 10). Not sure why this is invalid input.
> > > >
> > > > See above. I agree on truncating trailing 0:s as it's done for leading ones
> > > > in integer part, but if any of the digit behind 19th is not 0, it's an overflow
> > > > condition (or bad input, depending how strict the rules are).
> > >
> > > stating in the documentation that digits beyond the scale are ignored is not
> > > enough?
> >
> > It's in case we are not for kstrto*() family. My understanding that kstrto*()
> > use strict rules on the input in overflow check.
> >
> > > > > > On top of that, what about -0.9(19 times) ? the fraction should be u64
> > > > > > in this case and it's fine. The sign applies to the combined value.
> > > > >
> > > > > yes, range for signed values are verified later.
> > > >
> > > > > > > - extra scaling and truncation happening outside if needed.
> > > > > >
> > > > > > Right, but the given input may be way too long and still needs more validation.
> > > > >
> > > > > What is the problem with a long input of digits?
> > > > > C compiler does not complain about this when parsing a float value,
> > > > > python does not
> > > > > complain about this when parsing floats or decimals either.
> > > >
> > > > Because there is an exponent limit and for double it's something like 1e307
> > > > IIRC, meaning, try 1024 digits to be sure.
> > > >
> > > > Python most likely uses the library for big numbers, you can't compare it at all with this.
> > >
> > > You would be fine if the truncation loop:
> > >
> > > while (isdigit(*s)) /* truncate */
> > > s++;
> > >
> > > is bounded by (19-scale) iteration count? or it should keep iterating if those are zero?
> >
> > Ideally both.
> >
> > We don't care about the digits in the range of 19-scale and skip all 0:s after
> > that.
> >
> > /* truncate unrequired digits within type limit, i.e. 19 decimal digits */
> > while (isdigit(*s) && "(s - pos_of_dot) is less than 19")
> > s++;
> > while (*s == '0') /* truncate trailing 0:s, it's not a bad input nor overflow */
> > s++;
>
> We could have agreed on something like that since the beginning!
Yes, but who knew that we would come to this agreement?
> And I think that changing the logic to something like this would not change a
> thing on the kind of inputs we expect, it will just complicate the code.
> I suppose that kind of kstrto*() rules were never stated anywhere.
>
> |> 20th digit
> Also, 0.00000000000000000001 still sounds like a valid decimal number to me, even
> though it is going to be parsed as 0!
Hmm... It would mean that testing for 19th/20th digits is not enough... :-(
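To see what the rule quoted above does with such an input, here is a standalone userspace sketch of it. The name check_fraction_tail() and the MAX_FRAC_DIGITS constant are made up for illustration, this is not the kstrtox code, and the value of the digits is assumed to be parsed elsewhere.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>

#define MAX_FRAC_DIGITS 19	/* fractional digits covered by the type limit */

/*
 * frac points at the first character after the decimal point.
 * Returns 0 if the tail is acceptable, -ERANGE if a non-zero digit
 * follows the 19th one, -EINVAL on any other trailing character.
 */
static int check_fraction_tail(const char *frac)
{
	size_t n = 0;

	/* Skip digits within the type limit, whatever their value. */
	while (n < MAX_FRAC_DIGITS && isdigit((unsigned char)frac[n]))
		n++;
	/* Trailing 0:s beyond the limit are neither bad input nor overflow. */
	while (frac[n] == '0')
		n++;
	/* Still a digit here: treat it as overflow, like other kstrto*(). */
	if (isdigit((unsigned char)frac[n]))
		return -ERANGE;
	/* A single trailing newline is tolerated. */
	if (frac[n] == '\n')
		n++;
	return frac[n] == '\0' ? 0 : -EINVAL;
}
```

With this sketch, 19 zeros followed by a 1 is rejected with -ERANGE instead of being parsed as 0, which is exactly the input in question.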
> >
> > // Now if it's not \0 nor \n and
> > // a) still a digit consider either overflow or bad input,
> > // b) if not a digit, consider as bad input.
> >
> > In a) I tend to be on par with the other k*() and consider that as overflow.
> >
> > > is that the only concern? Again, the usage of _parse_integer_limit(s, 10, &_frac, scale)
> > > avoids a 64-bit division when checking the rv.
> >
> > I'm not against usage of _parse_integer_limit(), I'm for stricter rules on the input.
> > With the above addressed, I have no more concerns.
>
> Thanks! I will proceed with the requested adjustments.
But it seems it's not enough as you pointed out!
So the biggest fraction we may consume in a 64-bit (unsigned) value is
0.18446744073709551615. If we go with one digit less, the largest
value is
In [3]: hex(9999999999999999999)
Out[3]: '0x8ac7230489e7ffff'
So, I don't know how we are supposed to represent values between
-0.9223372036854775808
and
-0.9999999999999999999
in a signed type as they have bit 63 set.
The easiest way out is to limit scale to 18 (but still accept 19th digit, and
with check for overflow even 20th up to 0.18446744073709551615). This will need
to run _parse_integer_limit() twice (with given scale and with 20).
Can you add the respective test cases and see what is currently going on with
them?
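The arithmetic behind the 18 vs. 19 digits limit can be checked with a few lines of userspace C. This is only a sketch and the helper names max_frac() and frac_fits_s64() are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Largest fraction with the given number of digits, i.e. 10^digits - 1.
 * Only valid for digits <= 19, since 10^20 does not fit in 64 bits.
 */
static uint64_t max_frac(unsigned int digits)
{
	uint64_t p = 1;

	while (digits--)
		p *= 10;
	return p - 1;
}

/* True if every fraction with that many digits fits in a signed 64-bit value. */
static bool frac_fits_s64(unsigned int digits)
{
	return max_frac(digits) <= (uint64_t)INT64_MAX;
}
```

frac_fits_s64(18) holds while frac_fits_s64(19) does not, because max_frac(19) is 0x8ac7230489e7ffff and has bit 63 set.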
--
With Best Regards,
Andy Shevchenko
* Re: [PATCH v3 0/3] Documentation: security-bugs: new updates covering triage and AI
From: Willy Tarreau @ 2026-05-12 19:13 UTC (permalink / raw)
To: Jonathan Corbet
Cc: greg, Leon Romanovsky, skhan, security, workflows, linux-doc,
linux-kernel
In-Reply-To: <871pfgpn2a.fsf@trenco.lwn.net>
On Tue, May 12, 2026 at 11:14:37AM -0600, Jonathan Corbet wrote:
> Willy Tarreau <w@1wt.eu> writes:
>
> > This series tries to translate recent discussions on the security list
> > on how to better handle reports. It details:
> > - when not to Cc: the security list
> > - what classes of bugs do not need to be handled privately
> > - minimum requirements for AI-assisted reports
> >
> > As usual, this is probably perfectible but can already help in the short
> > term as we can point it to reporters, so barring any strong disagreement,
> > better continue to proceed in small incremental improvements and observe
> > the effects.
>
> OK, I've applied the series to docs-fixes; after a short exposure in
> linux-next I'll ship it Linusward.
Thank you!
> I have a couple of comments on the individual changes that might merit
> an eventual add-on patch.
Yes, feel free to suggest. I'm not fond of how the pub/priv decision is
stretched into multiple sections and I'd like to rework it to have a
dedicated section "public or private" which describes how to make the
decision then later we can explain whom to contact depending on this
choice. It's not much different from what we have but it would clarify
certain points. So in any case I think I'll propose an update later, so
anything you can propose to improve the situation is more than welcome!
Thanks!
Willy
* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: David Hildenbrand (Arm) @ 2026-05-12 19:18 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <agNS4llNtAHBkMA2@skinsburskii.localdomain>
On 5/12/26 18:18, Stanislav Kinsburskii wrote:
> On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
>>
>>> + for (; addr < end; addr += PAGE_SIZE) {
>>> + vm_fault_t ret;
>>> +
>>> + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
>>> +
>>> + if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
>>> + /*
>>> + * The mmap lock has been dropped by the fault handler.
>>> + * Record the failing address and signal lock-drop to
>>> + * the caller.
>>> + */
>>> + *hmm_vma_walk->locked = 0;
>>> + hmm_vma_walk->last = addr;
>>> + return -EAGAIN;
>>
>>
>> Okay, so we'll return straight from hmm_vma_fault() to
>> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
>>
>> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
>> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
>> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
>> requires the vma in there.
>>
>
> It looks like a caller can provide a post_vma callback in mm_walk_ops. I
> missed that case here. This callback cannot be supported by this change.
> I will update the patch.
>
>>
>> Note: am I wrong, or is hmm_vma_fault() really always called with
>> required_fault=true?
>>
>
> No, hmm_pte_need_fault can return false.
That's not what I mean. Looks like all paths leading to hmm_vma_fault() have
required_fault = true;
IOW, there is always an "if (required_fault)" before it one way or the other.
Ah, and there even is a "WARN_ON_ONCE(!required_fault)" in the function. What an
odd thing to do :)
>
>>> + }
>>> +
>>> + if (ret & VM_FAULT_ERROR)
>>> return -EFAULT;
>>> + }
>>> return -EBUSY;
>>> }
>>>
>>> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>>> if (required_fault) {
>>> int ret;
>>>
>>> + /*
>>> + * Faulting hugetlb pages on the unlockable path is not
>>> + * supported. The walk framework holds hugetlb_vma_lock_read
>>> + * which must be dropped before handle_mm_fault, but if the
>>> + * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
>>> + * be freed and the walk framework's unconditional unlock
>>> + * becomes a use-after-free.
>>> + */
>>> + if (hmm_vma_walk->locked)
>>> + return -EFAULT;
>>
>> Just because it's unlockable doesn't mean that you must unlock. Can't this be
>> kept working as is, just simulating here as if it would not be unlockable?
>>
>
> I’m not sure how to implement this. The walk_page_range code expects the
> hugetlb VMA to still be read-locked when we return from
> hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
> be gone?
>
> I added a note in the docs. Whoever tackles this will likely need to
> either rework `walk_page_range` to handle the case where the VMA is
> gone, or use a different approach.
>
> Do you have any other suggestions on how to implement it?
You just want hmm_vma_fault() to not set
"FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE".
The hacky way could be:
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83d..83dba990e10a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -564,6 +564,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
if (required_fault) {
+ int *saved_locked = hmm_vma_walk->locked;
int ret;
spin_unlock(ptl);
@@ -576,7 +577,9 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
* use here of either pte or ptl after dropping the vma
* lock.
*/
+ hmm_vma_walk->locked = NULL;
ret = hmm_vma_fault(addr, end, required_fault, walk);
+ hmm_vma_walk->locked = saved_locked;
hugetlb_vma_lock_read(vma);
return ret;
}
But really, I think we should just try to get uffd support working properly, not
excluding hugetlb.
GUP achieves it properly by performing the fault handling outside of page table
walking context ... essentially what I described in my first comment above:
return the information to the caller and let it just trigger the fault.
The issue here is that we trigger a fault out of walk_hugetlb_range() where we
still hold locks, resulting in this questionable hugetlb_vma_unlock_read +
hugetlb_vma_lock_read pattern.
The fault should just be triggered from a place where we don't have to play with
hugetlb vma locks or be afraid that dropping the mmap lock causes other problems.
--
Cheers,
David
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
From: Yosry Ahmed @ 2026-05-12 19:19 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: fujunjie, Andrew Morton, Chris Li, Kairui Song, Johannes Weiner,
Nhat Pham, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
Ryan Roberts, Barry Song, Baolin Wang, Chengming Zhou, Baoquan He,
Lorenzo Stoakes
In-Reply-To: <c0effa9f-1262-4bed-a99e-ae5441ac47ea@kernel.org>
> >> Feedback would be especially helpful on:
> >>
> >> 1. whether it makes sense to support all-zswap large folio swapin first,
> >> while keeping mixed zswap/disk ranges on the order-0 fallback path
> >
> > I think so, yes, but based on my read of the code this RFC only affects
> > > synchronous swapin, which is more-or-less zram+zswap. This is an
> > uncommon setup outside of testing.
>
> BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
> also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
> an emulated NVDIMM to use as swap backend ... maybe.
Yeah, I said "more-or-less" to capture pmem/brd/etc :P
> I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
> other usage.
>
> So seeing it for zswap is pretty rare I assume.
Yeah that's my understanding as well.
* [PATCH][next] stddef: Fix kernel-doc/Sphinx warnings for __TRAILING_OVERLAP()
From: Gustavo A. R. Silva @ 2026-05-12 19:34 UTC (permalink / raw)
To: Kees Cook; +Cc: linux-kernel, Gustavo A. R. Silva, linux-hardening, linux-doc
Fix the following kdoc warnings:
Documentation/driver-api/basics:127: ./include/linux/stddef.h:110: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:115: ERROR: Unexpected indentation. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:116: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:117: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:122: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:124: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:139: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Documentation/driver-api/basics:127: ./include/linux/stddef.h:140: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605120507.9iQRMgKR-lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
---
include/linux/stddef.h | 48 +++++++++++++++++++++---------------------
1 file changed, 24 insertions(+), 24 deletions(-)
diff --git a/include/linux/stddef.h b/include/linux/stddef.h
index 665d7a68cd98..f666ced2a9e7 100644
--- a/include/linux/stddef.h
+++ b/include/linux/stddef.h
@@ -104,26 +104,26 @@ enum {
* union, designated initializers for MEMBERS may overwrite portions
* previously initialized through NAME.
*
- * For example:
- *
- * struct flex {
- * size_t count;
- * u8 fam[];
- * };
- *
- * struct composite {
- * ...
- * __TRAILING_OVERLAP(struct flex, flex, fam, __packed,
- * u8 data;
- * );
- * } __packed;
- *
- * static struct composite comp = {
- * .flex = {
- * .count = 1,
- * },
- * .data = 2,
- * };
+ * For example::
+ *
+ * struct flex {
+ * size_t count;
+ * u8 fam[];
+ * };
+ *
+ * struct composite {
+ * ...
+ * __TRAILING_OVERLAP(struct flex, flex, fam, __packed,
+ * u8 data;
+ * );
+ * } __packed;
+ *
+ * static struct composite comp = {
+ * .flex = {
+ * .count = 1,
+ * },
+ * .data = 2,
+ * };
*
* In the example above, .flex and .data initialize different views of the same
* union storage. Since .data is initialized last, it _may_ overwrite portions
@@ -133,7 +133,7 @@ enum {
* A couple of alternatives are shown below.
*
* a) Initialize only one view of the overlapped storage and assign the rest
- * at runtime:
+ * at runtime::
*
* static struct composite comp = {
* .flex = {
@@ -147,9 +147,7 @@ enum {
* ...
* }
*
- * (Compiler Explorer test code: https://godbolt.org/z/zz4K1Ejvf)
- *
- * b) Alternatively, replace designated initializers with runtime assignments.
+ * b) Alternatively, replace designated initializers with runtime assignments::
*
* static void foo(void)
* {
@@ -160,6 +158,8 @@ enum {
* ...
* }
*
+ * Compiler Explorer test code: https://godbolt.org/z/zz4K1Ejvf
+ *
* For another example of the above see commit 5e54510a9389 ("acpi: nfit:
* intel: avoid multiple -Wflex-array-member-not-at-end warnings")
*
--
2.51.0
* Re: [PATCH v12 02/11] lib: kstrtox: add kstrtoudec64() and kstrtodec64()
From: Rodrigo Alencar @ 2026-05-12 19:39 UTC (permalink / raw)
To: Andy Shevchenko, Rodrigo Alencar
Cc: Andy Shevchenko, Jonathan Cameron, Rodrigo Alencar via B4 Relay,
rodrigo.alencar, linux-kernel, linux-iio, devicetree, linux-doc,
David Lechner, Andy Shevchenko, Lars-Peter Clausen,
Michael Hennerich, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Jonathan Corbet, Andrew Morton, Petr Mladek, Steven Rostedt,
Rasmus Villemoes, Sergey Senozhatsky, Shuah Khan, David Laight
In-Reply-To: <agN6onIAwG1yn5p6@ashevche-desk.local>
On 26/05/12 10:08PM, Andy Shevchenko wrote:
> On Tue, May 12, 2026 at 07:15:17PM +0100, Rodrigo Alencar wrote:
> > On 26/05/12 08:46PM, Andy Shevchenko wrote:
> > > On Tue, May 12, 2026 at 06:26:12PM +0100, Rodrigo Alencar wrote:
> > > > On 26/05/12 08:13PM, Andy Shevchenko wrote:
> > > > > On Tue, May 12, 2026 at 05:35:59PM +0100, Rodrigo Alencar wrote:
> > > > > > On 26/05/12 06:21PM, Andy Shevchenko wrote:
> > > > > > > On Tue, May 12, 2026 at 6:11 PM Rodrigo Alencar
> > > > > > > <455.rodrigo.alencar@gmail.com> wrote:
> > > > > > > > On 26/05/12 05:43PM, Andy Shevchenko wrote:
> > > > > > > > > On Tue, May 12, 2026 at 03:12:24PM +0100, Rodrigo Alencar wrote:
> > > > > > > > > > On 26/05/12 04:48PM, Andy Shevchenko wrote:
> > > > > > > > > > > On Tue, May 12, 2026 at 02:21:14PM +0100, Rodrigo Alencar wrote:
> > > > > > > > > > > > On 26/05/12 04:12PM, Andy Shevchenko wrote:
> > > > > > > > > > > > > On Tue, May 12, 2026 at 12:39:53PM +0100, Jonathan Cameron wrote:
> > > > > > > > > > > > > > On Sun, 10 May 2026 13:42:20 +0100
> > > > > > > > > > > > > > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Add helpers that parses decimal numbers into 64-bit number, i.e., decimal
> > > > > > > > > > > > > > > point numbers with pre-defined scale are parsed into a 64-bit value (fixed
> > > > > > > > > > > > > > > precision). After the decimal point, digits beyond the specified scale
> > > > > > > > > > > > > > > are ignored.
>
> ...
>
> > > > > > > > I think we are going in circles here and we could look at the code instead:
> > > > > > > > - integer parsing with _parse_integer()
> > > > > > > > - overflow check and validation of the return value
> > > > > > > > - fractional parsing with _parse_integer_limit()
> > > > > > > > - overflow check and validation of the return value
> > > > > > >
> > > > > > > No, this is not fully true. That's what my whole point is about. The
> > > > > > > max_chars parameter limits the input check, then it skips an arbitrary
> > > > > > > number of digits and only *then* it checks for \n and \0. What will be
> > > > > > > the result of the
> > > > > > > 0.00000000000000000000000000000000423 in your case? Whatever scale you
> > > > > > > gave it will return 0 without checking on how many digits were
> > > > > > > supplied.
> > > > > >
> > > > > > I suppose that is a valid input and 0 is the expected result there.
> > > > > >
> > > > > > > All the same for 0.9999999999999999999999999999999000423. My
> > > > > > > point is that we should limit this by 19 digits.
> > > > > >
> > > > > > why we need to limit by 19? Digits beyond the scale carry no value...
> > > > >
> > > > > ...only if they are all 0:s.
> > > >
> > > > I thought your concern was on input length.
> > >
> > > One of, since I think you rose the topic of leading 0:s for integers and
> > > I agreed with that which makes sense to have mirrored in fractional part.
> > >
> > > > > > just like leading zeros to the integer part (which is also accepted by
> > > > > > kstrtoull() when parsing with base 10). Not sure why this is invalid input.
> > > > >
> > > > > See above. I agree on truncating trailing 0:s as it's done for leading ones
> > > > > in integer part, but if any of the digit behind 19th is not 0, it's an overflow
> > > > > condition (or bad input, depending how strict the rules are).
> > > >
> > > > stating in the documentation that digits beyond the scale are ignored is not
> > > > enough?
> > >
> > > It's in case we are not for kstrto*() family. My understanding that kstrto*()
> > > use strict rules on the input in overflow check.
> > >
> > > > > > > On top of that, what about -0.9(19 times) ? the fraction should be u64
> > > > > > > in this case and it's fine. The sign applies to the combined value.
> > > > > >
> > > > > > yes, range for signed values are verified later.
> > > > >
> > > > > > > > - extra scaling and truncation happening outside if needed.
> > > > > > >
> > > > > > > Right, but the given input may be way too long and still needs more validation.
> > > > > >
> > > > > > What is the problem with a long input of digits?
> > > > > > C compiler does not complain about this when parsing a float value,
> > > > > > python does not
> > > > > > complain about this when parsing floats or decimals either.
> > > > >
> > > > > Because there is an exponent limit and for double it's something like 1e307
> > > > > IIRC, meaning, try 1024 digits to be sure.
> > > > >
> > > > > Python most likely uses the library for big numbers, you can't compare it at all with this.
> > > >
> > > > You would be fine if the truncation loop:
> > > >
> > > > while (isdigit(*s)) /* truncate */
> > > > s++;
> > > >
> > > > is bounded by (19-scale) iteration count? or it should keep iterating if those are zero?
> > >
> > > Ideally both.
> > >
> > > We don't care about the digits in the range of 19-scale and skip all 0:s after
> > > that.
> > >
> > > /* truncate unrequired digits within type limit, i.e. 19 decimal digits */
> > > while (isdigit(*s) && "(s - pos_of_dot) is less than 19")
> > > s++;
> > > while (s == '0') /* truncate trailing 0:s, it's not a bad input nor overflow */
> > > s++;
> >
> > We could have agreed on something like that since the beginning!
>
> Yes, but who knew that we go to have this agreement?
>
> > And I think that changing the logic to something like this would not change a
> > thing on the kind of inputs we expect, it will just complicate the code.
> > I suppose that kind of kstrto*() rules were never stated anywhere.
> >
> > |> 20th digit
> > Also, 0.00000000000000000001 still sounds like a valid decimal number to me, even
> > though it is going to be parsed as 0!
>
> Hmm... It would mean that testing for 19th/20th digits is not enough... :-(
>
> > >
> > > // Now if it's not \0 nor \n and
> > > // a) still a digit consider either overflow or bad input,
> > > // b) if not a digit, consider as bad input.
> > >
> > > In a) I tend to be on par with the other k*() and consider that as overflow.
> > >
> > > > is that the only concern? Again, the usage of _parse_integer_limit(s, 10, &_frac, scale)
> > > > avoids a 64-bit division when checking the rv.
> > >
> > > I'm not against usage of _parse_integer_limit(), I'm for stricter rules on the input.
> > > With the above addressed, I have no more concerns.
> >
> > Thanks! I will proceed with the requested adjustments.
>
> But it seems it's not enough as you pointed out!
>
> So the biggest fraction we may consume in 64-bit (unsigned) value is
> 0.18446744073709551615. If we go with one digit less, the whole value
> can be
>
> In [3]: hex(9999999999999999999)
> Out[3]: '0x8ac7230489e7ffff'
>
> So, I don't know how we are supposed to represent values between
> -0.9223372036854775808
> -0.9999999999999999999
> in a signed type as they have bit 63 set.
>
> The easiest way out is to limit scale to 18 (but still accept 19th digit, and
> with check for overflow even 20th up to 0.18446744073709551615). This will need
> to run _parse_integer_limit() twice (with given scale and with 20).
>
> Can you add the respective test cases and see what is currently going on with
> them?
I can add test cases, but for the signed case the situation is:
scale = 0
max = 9223372036854775807, min = -9223372036854775808
scale = 1
max = 922337203685477580.7, min = -922337203685477580.8
scale = 2
max = 92233720368547758.07, min = -92233720368547758.08
...
scale = 18
max = 9.223372036854775807, min = -9.223372036854775808
scale = 19
max = 0.9223372036854775807, min = -0.9223372036854775808
anything outside those ranges will give you -ERANGE. Then it depends on the scale used.
I am not representing -0.9999999999999999999 as is. The desired scale will have this
truncated. It may be -0.9999 or -0.999999 or -0.9. And this is practical for a
reasonable scale value... for pico and femto precision you still get a decent range.
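The ranges above can be reproduced with a minimal userspace sketch. It assumes the documented behaviour (digits beyond the scale are ignored, anything outside the s64 window returns -ERANGE). The names sketch_strtodec64() and sketch_append_digit() are hypothetical and this is not the kstrtodec64() implementation from the patch:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>

/* Append one decimal digit to *acc, refusing to go past limit. */
static int sketch_append_digit(uint64_t *acc, unsigned int d, uint64_t limit)
{
	if (*acc > (limit - d) / 10)
		return -ERANGE;
	*acc = *acc * 10 + d;
	return 0;
}

/* Hypothetical helper: parse "[+-]int[.frac]" into a scaled s64. */
static int sketch_strtodec64(const char *s, unsigned int scale, int64_t *res)
{
	uint64_t acc = 0, limit;
	unsigned int frac_digits = 0;
	int neg = 0, ret;

	assert(s && res);
	if (scale > 19)
		return -EINVAL;

	if (*s == '-') {
		neg = 1;
		s++;
	} else if (*s == '+') {
		s++;
	}
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	/* Magnitude may reach 2^63 for negative input, 2^63 - 1 otherwise. */
	limit = (uint64_t)INT64_MAX + (neg ? 1 : 0);

	/* Integer part. */
	for (; isdigit((unsigned char)*s); s++) {
		ret = sketch_append_digit(&acc, *s - '0', limit);
		if (ret)
			return ret;
	}

	if (*s == '.') {
		s++;
		/* Fractional part: keep at most 'scale' digits. */
		for (; isdigit((unsigned char)*s) && frac_digits < scale;
		     s++, frac_digits++) {
			ret = sketch_append_digit(&acc, *s - '0', limit);
			if (ret)
				return ret;
		}
		/* Digits beyond the scale are ignored (truncated). */
		while (isdigit((unsigned char)*s))
			s++;
	}

	/* Pad with zeros so the result always carries 'scale' digits. */
	for (; frac_digits < scale; frac_digits++) {
		ret = sketch_append_digit(&acc, 0, limit);
		if (ret)
			return ret;
	}

	if (*s == '\n')
		s++;
	if (*s)
		return -EINVAL;

	/* Negate without overflowing when the magnitude is 2^63. */
	if (neg && acc)
		*res = -(int64_t)(acc - 1) - 1;
	else
		*res = (int64_t)acc;
	return 0;
}
```

With this sketch, "-0.9999999999999999999" parsed with scale = 4 yields -9999, while the same input with scale = 19 returns -ERANGE because the 19-digit magnitude no longer fits in the signed window.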
--
Kind regards,
Rodrigo Alencar
^ permalink raw reply
* Re: [PATCH 0/6] alloc_tag: introduce IOCTL-based filtering for MAP
From: Suren Baghdasaryan @ 2026-05-12 19:58 UTC (permalink / raw)
To: Hao Ge
Cc: Abhishek Bapat, Shuah Khan, Jonathan Corbet, linux-doc,
linux-kernel, linux-mm, Sourav Panda, Andrew Morton,
Kent Overstreet
In-Reply-To: <9bff01a8-eb97-4d09-81a4-f4dbf9b59b73@linux.dev>
On Wed, May 6, 2026 at 1:45 AM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek and Suren
>
>
> On 2026/5/5 07:36, Abhishek Bapat wrote:
> > Currently, memory allocation profiling data is primarily exposed through
> > /proc/allocinfo. While useful for manual inspection, this text-based
> > interface poses challenges for production monitoring and large-scale
> > analysis:
> >
> > 1. Userspace must parse large amounts of text to extract specific
> > fields.
> > 2. To find specific tags, userspace must read the entire dataset,
> > requiring many context switches and high data copying.
> > 3. The kernel currently aggregates per-CPU counters for every allocation
> > size, even those the user intends to filter out immediately.
> >
> > This series introduces a new IOCTL-based binary interface for allocinfo
> > that supports kernel-side filtering. By allowing the user to specify a
> > filter mask, we significantly reduce the work performed in-kernel and
> > the amount of data transferred to userspace.
> >
> > Performance measurements were conducted on an Intel Xeon Platinum 8481C
> > (224 CPUs) with caches dropped before each run.
> >
> > The IOCTL mechanism shows a ~20x performance improvement for
> > filtered queries. The kernel avoids the expensive per-CPU counter
> > aggregation (alloc_tag_read) for any tags that fail the initial string
> > or location filters.
> >
> > Scenario 1: Specific File Filtering (arch/x86/events/rapl.c)
> > 1. Traditional (cat /proc/allocinfo | grep): 22ms (sys)
> > 2. IOCTL Interface: 1ms (sys)
> >
> > Scenario 2: Compound Filtering (Filename + Size)
> > 1. Traditional: (cat ... | grep | awk): 21ms (sys)
> > 2. IOCTL Interface: 1ms (sys)
> >
> > Scenario 3: Size-Based Filtering (min_size = 1MB)
> > 1. Traditional: (cat ... | awk): 21ms (sys)
> > 2. IOCTL Interface: 14ms (sys)
>
> What a coincidence! I was just about to send an email to Suren
>
> asking about plans for upstreaming a filtering tool for /proc/allocinfo,
>
> and then I came across this patchset.
>
> I have been following and using memory allocation profiling since
>
> it was first introduced. It has been very helpful for our memory
>
> analysis by providing clear visibility into allocation data. However,
>
> we have always wanted a tool to efficiently filter this data to get
>
> exactly what we need, so I previously developed a userspace tool [1]
>
> to help with that.
>
> [1] https://lore.kernel.org/all/20250106112103.25401-1-hao.ge@linux.dev/
>
> So this patchset provides efficient filtering of allocinfo data via ioctl.
>
> Would the next step be to develop a general-purpose tool under
>
> tools/mm that leverages these ioctls instead of parsing /proc/allocinfo
> text output?
Hi Hao,
Sorry for the delay, I was travelling for LSFMM and missed a bunch of emails.
Yes, we are planning to upstream the alloctop tool
(https://android-review.googlesource.com/c/platform/system/memory/libmeminfo/+/3431860)
and now with ioctl support it becomes more relevant. Once this
patchset is merged, we will prepare the tool and post the patch.
Thanks,
Suren.
>
> Thanks
>
> Best Regards
>
> Hao
>
> > Abhishek Bapat (5):
> > alloc_tag: add ioctl filters to /proc/allocinfo
> > alloc_tag: add size-based filtering to ioctl
> > alloc_tag: add accuracy based filtering to ioctl
> > kselftest: alloc_tag: add kselftest for ioctl interface
> > kselftest: alloc_tag: extend the allocinfo ioctl kselftest
> >
> > Suren Baghdasaryan (1):
> > alloc_tag: add ioctl to /proc/allocinfo
> >
> > .../userspace-api/ioctl/ioctl-number.rst | 2 +
> > include/linux/codetag.h | 1 +
> > include/uapi/linux/alloc_tag.h | 87 +++
> > lib/alloc_tag.c | 249 ++++++++-
> > lib/codetag.c | 11 +
> > tools/testing/selftests/alloc_tag/Makefile | 9 +
> > .../alloc_tag/allocinfo_ioctl_test.c | 508 ++++++++++++++++++
> > 7 files changed, 865 insertions(+), 2 deletions(-)
> > create mode 100644 include/uapi/linux/alloc_tag.h
> > create mode 100644 tools/testing/selftests/alloc_tag/Makefile
> > create mode 100644 tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> >
^ permalink raw reply
* Re: [PATCH v12 02/11] lib: kstrtox: add kstrtoudec64() and kstrtodec64()
From: Andy Shevchenko @ 2026-05-12 20:16 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Andy Shevchenko, Jonathan Cameron, Rodrigo Alencar via B4 Relay,
rodrigo.alencar, linux-kernel, linux-iio, devicetree, linux-doc,
David Lechner, Andy Shevchenko, Lars-Peter Clausen,
Michael Hennerich, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Jonathan Corbet, Andrew Morton, Petr Mladek, Steven Rostedt,
Rasmus Villemoes, Sergey Senozhatsky, Shuah Khan, David Laight
In-Reply-To: <hvwyrb7g3ar7hzesj32zoxzqvjmdtwybamy4zxepqdbu37qvog@xnmgqhfya34f>
On Tue, May 12, 2026 at 08:39:21PM +0100, Rodrigo Alencar wrote:
> On 26/05/12 10:08PM, Andy Shevchenko wrote:
> > On Tue, May 12, 2026 at 07:15:17PM +0100, Rodrigo Alencar wrote:
> > > On 26/05/12 08:46PM, Andy Shevchenko wrote:
> > > > On Tue, May 12, 2026 at 06:26:12PM +0100, Rodrigo Alencar wrote:
> > > > > On 26/05/12 08:13PM, Andy Shevchenko wrote:
> > > > > > On Tue, May 12, 2026 at 05:35:59PM +0100, Rodrigo Alencar wrote:
> > > > > > > On 26/05/12 06:21PM, Andy Shevchenko wrote:
> > > > > > > > On Tue, May 12, 2026 at 6:11 PM Rodrigo Alencar
> > > > > > > > <455.rodrigo.alencar@gmail.com> wrote:
> > > > > > > > > On 26/05/12 05:43PM, Andy Shevchenko wrote:
> > > > > > > > > > On Tue, May 12, 2026 at 03:12:24PM +0100, Rodrigo Alencar wrote:
> > > > > > > > > > > On 26/05/12 04:48PM, Andy Shevchenko wrote:
> > > > > > > > > > > > On Tue, May 12, 2026 at 02:21:14PM +0100, Rodrigo Alencar wrote:
> > > > > > > > > > > > > On 26/05/12 04:12PM, Andy Shevchenko wrote:
> > > > > > > > > > > > > > On Tue, May 12, 2026 at 12:39:53PM +0100, Jonathan Cameron wrote:
> > > > > > > > > > > > > > > On Sun, 10 May 2026 13:42:20 +0100
> > > > > > > > > > > > > > > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Add helpers that parses decimal numbers into 64-bit number, i.e., decimal
> > > > > > > > > > > > > > > > point numbers with pre-defined scale are parsed into a 64-bit value (fixed
> > > > > > > > > > > > > > > > precision). After the decimal point, digits beyond the specified scale
> > > > > > > > > > > > > > > > are ignored.
...
> > > > > > > > > I think we are going in circles here and we could look at the code instead:
> > > > > > > > > - integer parsing with _parse_integer()
> > > > > > > > > - overflow check and validation of the return value
> > > > > > > > > - fractional parsing with _parse_integer_limit()
> > > > > > > > > - overflow check and validation of the return value
> > > > > > > >
> > > > > > > > No, this is not fully true. That's what my whole point is about. The
> > > > > > > > max_chars parameter limits the input check, then it skips an arbitrary
> > > > > > > > number of digits and only *then* it checks for \n and \0. What will be
> > > > > > > > the result of the
> > > > > > > > 0.00000000000000000000000000000000423 in your case? Whatever scale you
> > > > > > > > gave it will return 0 without checking on how many digits were
> > > > > > > > supplied.
> > > > > > >
> > > > > > > I suppose that is a valid input and 0 is the expected result there.
> > > > > > >
> > > > > > > > All the same for 0.9999999999999999999999999999999000423. My
> > > > > > > > point is that we should limit this by 19 digits.
> > > > > > >
> > > > > > > why we need to limit by 19? Digits beyond the scale carry no value...
> > > > > >
> > > > > > ...only if they are all 0:s.
> > > > >
> > > > > I thought your concern was on input length.
> > > >
> > > > One of, since I think you rose the topic of leading 0:s for integers and
> > > > I agreed with that which makes sense to have mirrored in fractional part.
> > > >
> > > > > > > just like leading zeros to the integer part (which is also accepted by
> > > > > > > kstrtoull() when parsing with base 10). Not sure why this is invalid input.
> > > > > >
> > > > > > See above. I agree on truncating trailing 0:s as it's done for leading ones
> > > > > > in integer part, but if any of the digit behind 19th is not 0, it's an overflow
> > > > > > condition (or bad input, depending how strict the rules are).
> > > > >
> > > > > stating in the documentation that digits beyond the scale are ignored is not
> > > > > enough?
> > > >
> > > > It's in case we are not for kstrto*() family. My understanding that kstrto*()
> > > > use strict rules on the input in overflow check.
> > > >
> > > > > > > > On top of that, what about -0.9(19 times) ? the fraction should be u64
> > > > > > > > in this case and it's fine. The sign applies to the combined value.
> > > > > > >
> > > > > > > yes, range for signed values are verified later.
> > > > > >
> > > > > > > > > - extra scaling and truncation happening outside if needed.
> > > > > > > >
> > > > > > > > Right, but the given input may be way too long and still needs more validation.
> > > > > > >
> > > > > > > What is the problem with a long input of digits?
> > > > > > > C compiler does not complain about this when parsing a float value,
> > > > > > > python does not
> > > > > > > complain about this when parsing floats or decimals either.
> > > > > >
> > > > > > Because there is an exponent limit and for double it's something like 1e307
> > > > > > IIRC, meaning, try 1024 digits to be sure.
> > > > > >
> > > > > > Python most likely uses the library for big numbers, you can't compare it at all with this.
> > > > >
> > > > > You would be fine if the truncation loop:
> > > > >
> > > > > while (isdigit(*s)) /* truncate */
> > > > > s++;
> > > > >
> > > > > is bounded by (19-scale) iteration count? or it should keep iterating if those are zero?
> > > >
> > > > Ideally both.
> > > >
> > > > We don't care about the digits in the range of 19-scale and skip all 0:s after
> > > > that.
> > > >
> > > > /* truncate unrequired digits within type limit, i.e. 19 decimal digits */
> > > > while (isdigit(*s) && "(s - pos_of_dot) is less than 19")
> > > > s++;
> > > > while (s == '0') /* truncate trailing 0:s, it's not a bad input nor overflow */
> > > > s++;
> > >
> > > We could have agreed on something like that since the beginning!
> >
> > Yes, but who knew that we go to have this agreement?
> >
> > > And I think that changing the logic to something like this would not change a
> > > thing on the kind of inputs we expect, it will just complicate the code.
> > > I suppose that kind of kstrto*() rules were never stated anywhere.
> > >
> > > |> 20th digit
> > > Also, 0.00000000000000000001 still sounds like a valid decimal number to me, even
> > > though it is going to be parsed as 0!
> >
> > Hmm... It would mean that testing for 19th/20th digits is not enough... :-(
> >
> > > >
> > > > // Now if it's not \0 nor \n and
> > > > // a) still a digit consider either overflow or bad input,
> > > > // b) if not a digit, consider as bad input.
> > > >
> > > > In a) I tend to be on par with the other k*() and consider that as overflow.
> > > >
> > > > > is that the only concern? Again, the usage of _parse_integer_limit(s, 10, &_frac, scale)
> > > > > avoids a 64-bit division when checking the rv.
> > > >
> > > > I'm not against usage of _parse_integer_limit(), I'm for stricter rules on the input.
> > > > With the above addressed, I have no more concerns.
> > >
> > > Thanks! I will proceed with the requested adjustments.
> >
> > But it seems it's not enough as you pointed out!
> >
> > So the biggest fraction we may consume in 64-bit (unsigned) value is
> > 0.18446744073709551615. If we go with one digit less, the whole value
> > can be
> >
> > In [3]: hex(9999999999999999999)
> > Out[3]: '0x8ac7230489e7ffff'
> >
> > So, I don't know how we are supposed to represent values between
> > -0.9223372036854775808
> > -0.9999999999999999999
> > in a signed type as they have bit 63 set.
> >
> > The easiest way out is to limit scale to 18 (but still accept 19th digit, and
> > with check for overflow even 20th up to 0.18446744073709551615). This will need
> > to run _parse_integer_limit() twice (with given scale and with 20).
> >
> > Can you add the respective test cases and see what is currently going on with
> > them?
>
> I can add test cases, but for the signed case the situation is:
>
> scale = 0
> max = 9223372036854775807, min = -9223372036854775808
> scale = 1
> max = 922337203685477580.7, min = -922337203685477580.8
> scale = 2
> max = 92233720368547758.07, min = -92233720368547758.08
> ...
> scale = 18
> max = 9.223372036854775807, min = -9.223372036854775808
> scale = 19
> max = 0.9223372036854775807, min = -0.9223372036854775808
>
> anything outside those ranges will give you -ERANGE. Then it depends on the scale used.
Oh, I only now realised that this is a sliding window for a single 64-bit signed value!
I was under the impression that you wanted an implementation that covers a 128-bit
signed value (with 64 + 64)...
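The sliding window can be illustrated with a small userspace sketch that splits one raw s64 into its integer and fractional magnitudes for a given scale. The helpers sketch_pow10() and sketch_split() are hypothetical names used only for illustration, and the sketch assumes scale is at most 19:

```c
#include <assert.h>
#include <stdint.h>

/* 10^n for n in 0..19; 10^19 still fits in an unsigned 64-bit value. */
static uint64_t sketch_pow10(unsigned int n)
{
	uint64_t p = 1;

	assert(n <= 19);
	while (n--)
		p *= 10;
	return p;
}

/*
 * Split a raw s64 interpreted with 'scale' fractional digits into the
 * magnitude of its integer part and of its fractional part.
 */
static void sketch_split(int64_t raw, unsigned int scale,
			 uint64_t *ipart, uint64_t *frac)
{
	/* Unsigned arithmetic keeps INT64_MIN well defined. */
	uint64_t mag = raw < 0 ? 0 - (uint64_t)raw : (uint64_t)raw;
	uint64_t div = sketch_pow10(scale);

	*ipart = mag / div;
	*frac = mag % div;
}
```

The same 64 bits are shared between both parts: every extra digit of scale moves the decimal point one place to the left, so INT64_MAX reads as 9.223372036854775807 at scale = 18 and as 0.9223372036854775807 at scale = 19.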
> I am not representing -0.9999999999999999999 as is. The desired scale will have this
> truncated. It may be -0.9999 or -0.999999 or -0.9. And this is practical for a
> reasonable scale value... for pico and femto precision you still get a decent range.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH] Documentation: kvm: update links in the references section of AMD Memory Encryption
From: Paolo Bonzini @ 2026-05-12 20:26 UTC (permalink / raw)
To: Ninad Naik, corbet, skhan, seanjc, michael.roth, liam.merwick,
vannapurve
Cc: kvm, linux-doc, linux-kernel, me, linux-kernel-mentees
In-Reply-To: <20260511174302.811918-1-ninadnaik07@gmail.com>
On 5/11/26 19:43, Ninad Naik wrote:
> Replace non-working links in the reference section with the working ones.
>
> Signed-off-by: Ninad Naik <ninadnaik07@gmail.com>
Applied, thanks.
Paolo
> ---
> Documentation/virt/kvm/x86/amd-memory-encryption.rst | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index b2395dd4769d..bd04a908a8db 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -656,8 +656,8 @@ References
> See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
> for more info.
>
> -.. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
> -.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
> -.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
> +.. [white-paper] https://docs.amd.com/v/u/en-US/memory-encryption-white-paper
> +.. [api-spec] https://docs.amd.com/v/u/en-US/55766_PUB_3.24_SEV_API
> +.. [amd-apm] https://docs.amd.com/v/u/en-US/24593_3.44_APM_Vol2 (section 15.34)
> .. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
> -.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
> +.. [snp-fw-abi] https://www.amd.com/content/dam/amd/en/documents/developer/56860.pdf
^ permalink raw reply
* htmldocs: Documentation/hwmon/d1u74t.rst:4: WARNING: Title underline too short.
From: kernel test robot @ 2026-05-12 20:31 UTC (permalink / raw)
To: Abdurrahman Hussain; +Cc: oe-kbuild-all, 0day robot, linux-doc
tree: https://github.com/intel-lab-lkp/linux/commits/Abdurrahman-Hussain/dt-bindings-hwmon-pmbus-Add-Murata-D1U74T-PSU/20260512-185756
head: 0fa96a515da810b1526b58447f677f4096e601c7
commit: 0fa96a515da810b1526b58447f677f4096e601c7 hwmon: (pmbus/d1u74t) Add Murata D1U74T PSU driver
date: 9 hours ago
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260512/202605122253.zInzmUeX-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605122253.zInzmUeX-lkp@intel.com/
All warnings (new ones prefixed by >>):
Runtime Survivability
===================== [docutils]
>> Documentation/hwmon/d1u74t.rst:4: WARNING: Title underline too short.
vim +4 Documentation/hwmon/d1u74t.rst
2
3 Kernel driver d1u74t
> 4 ==================
5
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH 2/2] Documentation: maple_tree: Clarify behavior when using reserved values
From: Liam R. Howlett @ 2026-05-12 20:50 UTC (permalink / raw)
To: Wei-Lin Chang
Cc: maple-tree, linux-mm, linux-doc, linux-kernel, Liam R . Howlett,
Alice Ryhl, Andrew Ballance, Jonathan Corbet, Shuah Khan
In-Reply-To: <q2dtphja7i45kknjk3bs4hn2bpictyoaideyjfbdh4sz4pxllo@xtsyvo3eztdb>
On 26/05/07 11:09PM, Wei-Lin Chang wrote:
> On Thu, May 07, 2026 at 05:24:11AM +0200, Liam R. Howlett wrote:
> > On 26/05/04 05:57PM, Wei-Lin Chang wrote:
> > > It doesn't matter whether the normal or the advanced API is used if the
> > > user uses xa_{mk, to}_value when storing and retrieving the values. Just
> > > specify that the normal API blocks usages of reserved values while the
> > > advanced API does not.
> >
> > Your comment above is incorrect.
> >
> > The normal API will filter out reserved values on return while the
> > advanced API will return whatever is stored there regardless of the
> > value.
> >
> > Meaning, if you store a reserved value with the advanced API, it will
> > not be returned by the normal API.
>
> This is valuable information, thanks for explaining.
Hmm, maybe I answered too quickly here. We filter out XA_ZERO_ENTRY on
normal API searches, which is in the reserved range.
> However, I'm confused how this shows my comment incorrect?
It matters if you use xa_{mk, to}_value() since the top bit will be
lost. Re-reading your comment, you don't specifically say that though;
you said 'if the user uses..', so I was confused by your wording.
>
> From the original doc:
>
> <quote>
> If the user needs to use a reserved value, then the user can convert the
> value when using the :ref:`maple-tree-advanced-api`, but are blocked by
> the normal API.
> </quote>
>
> To me this is conveying the following points:
>
> 1. User can convert the value with xa_{mk, to}_value() when using the
> advanced API if reserved values are being stored. This works because
> those functions transform the reserved values into non-reserved ones.
> 2. User can not use reserved values with or without xa_{mk, to}_value()
> with the normal API.
> 3. What happens when reserved values are stored is not clearly stated,
> but the normal API will block it.
>
> In my understanding 2. is incorrect because if xa_{mk, to}_value() are
> deployed, it doesn't matter whether the normal or advanced API is used,
> they both work since the values stored aren't reserved.
>
> Please do you mind pointing out what I am getting wrong here?
I think you are missing the part where the top bit may be lost?
I also don't think the reserved values will matter if you use the
advanced API exclusively. You would have to filter the special cases or
whatever you want - that is, if you mix the interfaces then you may see
odd behaviour in regards to the special cases in the normal API while
the advanced API would return the reserved items and need to be filtered
at a higher level than the maple tree code.
>
> I was genuinely confused when I was reading the doc and trying to use
> this data structure.
Then we need to rework the wording somehow. Thanks.
>
> >
> > >
> > > Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
> > > ---
> > > Documentation/core-api/maple_tree.rst | 6 +++---
> > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
> > > index 87020a30ba69..e5ccafb84804 100644
> > > --- a/Documentation/core-api/maple_tree.rst
> > > +++ b/Documentation/core-api/maple_tree.rst
> > > @@ -30,9 +30,9 @@ Tree reserves values with the bottom two bits set to '10' which are below 4096
> > > (ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved
> > > entries then the users can convert the entries using xa_mk_value() and convert
> > > them back by calling xa_to_value(). Note that xa_{mk, to}_value() bit shifts
> > > -the given data, so the top bit will be lost. If the user needs to use a
> > > -reserved value, then the user can convert the value when using the
> > > -:ref:`maple-tree-advanced-api`, but are blocked by the normal API.
> > > +the given data, so the top bit will be lost. Usage of reserved values is
> > > +blocked by the normal API, and will cause undefined behavior if used with the
> > > +:ref:`maple-tree-advanced-api`.
> >
> > Which behaviour is undefined?
>
> I originally thought storing reserved values could break the tree
> because of its internal use (see 3. above).
You can't break the tree by storing reserved values. The normal API
will outright not allow storing it while the advanced API will store and
return it.
The issue comes from when you mix and match - if you store a reserved
value using the advanced API and then iterate through with the normal
API, some values may be lost. Today, that's XA_ZERO_ENTRY only, but we
reserve the right to change that if it is necessary for some tree
version.
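A userspace sketch of the bit layout being discussed, for illustration only. The sketch_* helpers are hypothetical stand-ins that mirror what xa_mk_value(), xa_to_value() and the reserved-range test do in the kernel as far as I recall them; they are not the kernel functions themselves:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors MAPLE_RESERVED_RANGE: reserved entries live below 4096. */
#define SKETCH_RESERVED_RANGE 4096UL

/* Like xa_mk_value(): shift left by one and tag bit 0. */
static void *sketch_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* A value entry has bit 0 set. */
static bool sketch_is_value(const void *entry)
{
	return (uintptr_t)entry & 1;
}

/* Like xa_to_value(): undo the shift; the original top bit is gone. */
static unsigned long sketch_to_value(const void *entry)
{
	assert(sketch_is_value(entry));
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* Reserved: bottom two bits are '10' and the entry is below 4096. */
static bool sketch_is_reserved(const void *entry)
{
	uintptr_t e = (uintptr_t)entry;

	return e < SKETCH_RESERVED_RANGE && (e & 3) == 2;
}
```

Storing the raw number 2 would hit the reserved range, while sketch_mk_value(2) produces 5, which is not reserved and converts back to 2. The cost is the shift: a value with its top bit set does not survive the round trip. XA_ZERO_ENTRY, which is (257 << 2) | 2 = 1030, falls inside the reserved range.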
Does that make sense?
Thanks,
Liam
^ permalink raw reply