* [PATCH v2 00/10] LUO: PCI subsystem (phase I)
@ 2025-09-16 7:45 Chris Li
2025-09-16 7:45 ` [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
` (10 more replies)
0 siblings, 11 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
This is phase I of the LUO PCI series. It implements the minimal set of PCI
device liveupdate support: preserving the bus master bit in the PCI command
register.
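For readers less familiar with the register layout, here is a rough
userspace sketch (not part of this series) of what "the bus master bit"
means. The register offset and bit value are the standard PCI ones; the two
helper functions are hypothetical names made up for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI configuration space values (same as the kernel's pci_regs.h). */
#define PCI_COMMAND         0x04  /* offset of the 16-bit command register */
#define PCI_COMMAND_MASTER  0x4   /* bit 2: enable bus mastering (DMA) */

/* Hypothetical helper: is bus mastering enabled in this command value? */
static bool cmd_bus_master_enabled(uint16_t cmd)
{
	return (cmd & PCI_COMMAND_MASTER) != 0;
}

/*
 * Hypothetical helper: compute the command value to write so that a bus
 * master bit that was set before stays set, while all other bits take
 * their new values.
 */
static uint16_t cmd_keep_bus_master(uint16_t old_cmd, uint16_t new_cmd)
{
	return (uint16_t)(new_cmd | (old_cmd & PCI_COMMAND_MASTER));
}
```

The point of the second helper is only to show the intent: a write to the
command register during boot should not clear a bus master bit that is being
preserved.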
The LUO PCI subsystem is based on the LUO V2 series.
https://lore.kernel.org/lkml/20250515182322.117840-1-pasha.tatashin@soleen.com/
It registers PCI as a LUO subsystem and forwards the liveupdate
callbacks to the devices. struct dev_liveupdate has been added to struct
device to keep track of the liveupdate related context.
A device can be marked as requested for liveupdate during the normal
state.
In the prepare() callback, the PCI core builds a list of the PCI devices
for liveupdate based on the PCI device dependency:
1) The requested device depends on the PCI bridge it is on to preserve
the bridge bus master bit, all the way up to the root bridge. If bus
master has been disabled on a bridge, DMA on the child devices will be
impacted.
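The dependency rule above can be sketched in plain userspace C as follows.
This is only an illustration under simplifying assumptions: struct toy_dev
and mark_upstream_bridges() are hypothetical stand-ins, and the flag values
are chosen to match the "flags 1" / "flags 2" values in the test log below.

```c
#include <stddef.h>

enum lu_flag {
	LU_BUSMASTER        = 1 << 0,  /* device requested bus master preservation */
	LU_BUSMASTER_BRIDGE = 1 << 1,  /* bridge must preserve it for a child */
};

/* Hypothetical stand-in for a PCI device and its upstream bridge. */
struct toy_dev {
	struct toy_dev *parent;  /* upstream bridge, NULL at the root */
	unsigned int flags;
};

/*
 * Mark every bridge between a requested device and the root.
 * Returns the number of bridges that were newly marked.
 */
static int mark_upstream_bridges(struct toy_dev *dev)
{
	int marked = 0;

	if (!(dev->flags & LU_BUSMASTER))
		return 0;

	for (struct toy_dev *b = dev->parent; b; b = b->parent) {
		if (!(b->flags & LU_BUSMASTER_BRIDGE)) {
			b->flags |= LU_BUSMASTER_BRIDGE;
			marked++;
		}
	}
	return marked;
}
```

A device that did not request preservation leaves its bridges untouched, and
marking is idempotent, so two requested devices behind the same bridge do not
count that bridge twice.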
The list of liveupdate devices is used for the prepare(), cancel(),
freeze() and finish() callbacks.
The PCI subsystem will preserve the driver name for each liveupdate PCI
device and only probe that driver after kexec boot up.
Disclaimer:
The data preservation format is not final. It currently uses C structs
directly and does not deal with version changes of the data format yet. I
do have some ideas on how to address the versioning of the data layout.
Those will be outside the scope of this series.
Testing:
Testing was done with Intel diorite NVMe VF device 8086:1457. Bind the
test device with pci-lu-stub driver.
0000:05:00.1 current driver is
0000:05:00.1 bind new driver pci-lu-stub
[ 557.006998] pci-lu-stub 0000:05:00.1: Marking device liveupdate busmaster
Now perform luo prepare. The PCI subsystem builds the liveupdate device
list starting from the PCI root bridge. The test device will have
LU_BUSMASTER and the PCI bridge will have LU_BUSMASTER_BRIDGE.
[ 701.573423] pci-lu-stub 0000:05:00.1: PCI liveupdate: collect liveupdate device: flags 1
[ 701.582430] pcieport 0000:04:01.0: PCI liveupdate: collect liveupdate device: flags 2
[ 701.590297] pci-lu-stub 0000:05:00.1: pci_lu_stub_prepare(): data: 0x1ac6f4000
[ 701.598916] PCI liveupdate: prepare data[1f1d28000]
[ 701.603832] luo_core: Switched from [normal] to [prepared] state
After the kexec reboot, the liveupdate devices are probed and restore the
liveupdate context.
[ 3.622083] pci 0000:04:01.0: PCI liveupdate: liveupdate restore flags 2 driver: pcieport data: [0]
[ 4.768060] pci 0000:05:00.1: PCI liveupdate: liveupdate restore flags 1 driver: pci-lu-stub data: [1ac6f4000]
Perform luo finish to switch from the updated state to the normal state.
The reserved folio will be freed.
[ 310.359830] PCI liveupdate: finish data[1f1d28000]
[ 310.364664] pci-lu-stub 0000:05:00.1: pci_lu_stub_finish(): data: 0x1ac6f4000
[ 310.371824] luo_core: Switched from [updated] to [normal] state
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v2:
- reduce the scope of the series to phase I. Only preserve the bus
master bit.
- Use finer-grained flags to specify which liveupdate feature gets
preserved.
- Modify the pci-lu-stub driver to set the bus master bit before
requesting preserving the bus master.
- Add WARN_ON() for a PCI device that has LU_BUSMASTER but whose bus
master bit is not set.
- Link to v1: https://lore.kernel.org/r/20250728-luo-pci-v1-0-955b078dd653@kernel.org
---
Chris Li (10):
PCI/LUO: Register with Liveupdate Orchestrator
PCI/LUO: Create requested liveupdate device list
PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
PCI/LUO: Restore state at PCI enumeration
PCI/LUO: Forward finish callbacks to drivers
PCI/LUO: Save and restore driver name
PCI/LUO: Add liveupdate to pcieport driver
PCI/LUO: Add pci_liveupdate_get_driver_data()
PCI/LUO: Avoid write to bus master at boot
PCI: pci-lu-stub: Add a stub driver for Live Update testing
MAINTAINERS | 4 +
drivers/pci/Kconfig | 10 +
drivers/pci/Makefile | 2 +
drivers/pci/liveupdate.c | 450 +++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-lu-stub.c | 140 +++++++++++++
drivers/pci/pci.c | 7 +-
drivers/pci/pci.h | 8 +
drivers/pci/pcie/portdrv.c | 13 ++
drivers/pci/probe.c | 8 +-
include/linux/dev_liveupdate.h | 69 +++++++
include/linux/device.h | 15 ++
include/linux/device/driver.h | 6 +
include/linux/pci.h | 9 +
13 files changed, 738 insertions(+), 3 deletions(-)
---
base-commit: 9ab803064e3d1be9673d2829785a69fd0578b24e
change-id: 20250724-luo-pci-1291890b710f
Best regards,
--
Chris Li <chrisl@kernel.org>
* [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-30 15:15 ` Greg Kroah-Hartman
2025-09-30 15:17 ` Greg Kroah-Hartman
2025-09-16 7:45 ` [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list Chris Li
` (9 subsequent siblings)
10 siblings, 2 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC
Register the PCI subsystem with the Liveupdate Orchestrator
and provide no-op liveupdate callbacks.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
MAINTAINERS | 2 ++
drivers/pci/Makefile | 1 +
drivers/pci/liveupdate.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 91cec3288cc81aea199f730924eee1f5fda1fd72..85749a5da69f88544ccc749e9d723b1b54c0e3b7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14014,11 +14014,13 @@ F: tools/testing/selftests/livepatch/
LIVE UPDATE
M: Pasha Tatashin <pasha.tatashin@soleen.com>
+M: Chris Li <chrisl@kernel.org>
L: linux-kernel@vger.kernel.org
S: Maintained
F: Documentation/ABI/testing/sysfs-kernel-liveupdate
F: Documentation/admin-guide/liveupdate.rst
F: drivers/misc/liveupdate/
+F: drivers/pci/liveupdate/
F: include/linux/liveupdate.h
F: include/uapi/linux/liveupdate.h
F: tools/testing/selftests/liveupdate/
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 67647f1880fb8fb0629d680398f5b88d69aac660..aa1bac7aed7d12c641a6b55e56176fb3cdde4c91 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
obj-$(CONFIG_PCI_NPEM) += npem.o
obj-$(CONFIG_PCIE_TPH) += tph.o
+obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
# Endpoint library must be initialized before its users
obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
new file mode 100644
index 0000000000000000000000000000000000000000..86b4f3a2fb44781c6e323ba029db510450556fa9
--- /dev/null
+++ b/drivers/pci/liveupdate.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Chris Li <chrisl@kernel.org>
+ */
+
+#define pr_fmt(fmt) "PCI liveupdate: " fmt
+
+#include <linux/liveupdate.h>
+
+#define PCI_SUBSYSTEM_NAME "pci"
+
+static int pci_liveupdate_prepare(void *arg, u64 *data)
+{
+ pr_info("prepare data[%llx]\n", *data);
+ return 0;
+}
+
+static int pci_liveupdate_freeze(void *arg, u64 *data)
+{
+ pr_info("freeze data[%llx]\n", *data);
+ return 0;
+}
+
+static void pci_liveupdate_cancel(void *arg, u64 data)
+{
+ pr_info("cancel data[%llx]\n", data);
+}
+
+static void pci_liveupdate_finish(void *arg, u64 data)
+{
+ pr_info("finish data[%llx]\n", data);
+}
+
+struct liveupdate_subsystem pci_liveupdate_ops = {
+ .prepare = pci_liveupdate_prepare,
+ .freeze = pci_liveupdate_freeze,
+ .cancel = pci_liveupdate_cancel,
+ .finish = pci_liveupdate_finish,
+ .name = PCI_SUBSYSTEM_NAME,
+};
+
+static int __init pci_liveupdate_init(void)
+{
+ int ret;
+
+ ret = liveupdate_register_subsystem(&pci_liveupdate_ops);
+ if (ret && liveupdate_state_updated())
+ panic("PCI liveupdate: Register subsystem failed: %d", ret);
+ WARN(ret, "PCI liveupdate: Register subsystem failed %d", ret);
+ return 0;
+}
+late_initcall_sync(pci_liveupdate_init);
--
2.51.0.384.g4c02a37b29-goog
* [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
2025-09-16 7:45 ` [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-29 17:46 ` Jason Gunthorpe
2025-09-30 15:26 ` Greg Kroah-Hartman
2025-09-16 7:45 ` [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver Chris Li
` (8 subsequent siblings)
10 siblings, 2 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC
Introduce struct dev_liveupdate and add it to struct device.
Use the new struct to track a device's liveupdate states.
- flags: If not zero, the device participates in the live update.
Currently "flags" has two possible bits:
LU_BUSMASTER: The device is requested for preserving the bus master.
LU_BUSMASTER_BRIDGE: A child device is requested for preserving the
bus master. The bridge will need to preserve bus master as well.
In the PCI subsystem prepare callback, create the requested device list
as per the following rules:
- If the device is requested for liveupdate LU_BUSMASTER, then its parent
bridge is marked LU_BUSMASTER_BRIDGE.
The list of PCI root buses and their child bus lists form a tree of all
PCI buses. The tree is walked in postorder, so that a device on a child
bus can mark the parent bridge with LU_BUSMASTER_BRIDGE.
As each bus is visited in the postorder traversal, its devices are
enumerated in reverse order, and every device marked as either requested
or depended is added to the requested device list.
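The traversal described above (an explicit stack plus a per-bus "visited"
flag that records whether the children have already been pushed) can be
modelled in userspace like this. It is a minimal sketch, not the kernel
code: struct toy_bus, the fixed-size arrays, and postorder_buses() are
assumptions made so the example is self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_STACK    16

/* Hypothetical stand-in for struct pci_bus. */
struct toy_bus {
	int id;
	struct toy_bus *children[MAX_CHILDREN];
	int nr_children;
	bool visited;  /* set once the children have been pushed */
};

/*
 * Visit root and all buses below it in postorder and store their ids in
 * out[]. Returns the number of ids stored. The toy stack is bounded; a
 * tree that needs more than MAX_STACK entries is silently truncated.
 */
static int postorder_buses(struct toy_bus *root, int *out, int out_len)
{
	struct toy_bus *stack[MAX_STACK];
	int top = 0, n = 0;

	stack[top++] = root;
	while (top > 0) {
		struct toy_bus *bus = stack[top - 1];

		if (!bus->visited && bus->nr_children > 0) {
			/* First time on top: push the children, come back later. */
			for (int i = 0; i < bus->nr_children && top < MAX_STACK; i++)
				stack[top++] = bus->children[i];
			bus->visited = true;
			continue;
		}

		/* Leaf bus, or all children already handled: process it. */
		if (n < out_len)
			out[n++] = bus->id;
		bus->visited = false;  /* leave the flag clean for the next walk */
		top--;
	}
	return n;
}
```

Because a bus is only processed after everything pushed on top of it has
been popped, every child bus is handled before its parent, which is what
lets a device on a child bus mark the bridge above it before that bridge is
collected.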
This list of devices will be used in the next change to forward the
liveupdate callbacks to individual devices.
Note that build_liveupdate_devices() returns the number of devices it
added to requested_devices. This will be used in a subsequent commit so that
the PCI subsystem can calculate what size folio to allocate for its save
Signed-off-by: Chris Li <chrisl@kernel.org>
---
MAINTAINERS | 1 +
drivers/pci/liveupdate.c | 80 ++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.c | 1 +
drivers/pci/probe.c | 4 ++-
include/linux/dev_liveupdate.h | 44 +++++++++++++++++++++++
include/linux/device.h | 15 ++++++++
6 files changed, 144 insertions(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 85749a5da69f88544ccc749e9d723b1b54c0e3b7..1ae3d166cd35ec5c7818f202079ed5d10c09144b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14021,6 +14021,7 @@ F: Documentation/ABI/testing/sysfs-kernel-liveupdate
F: Documentation/admin-guide/liveupdate.rst
F: drivers/misc/liveupdate/
F: drivers/pci/liveupdate/
+F: include/linux/dev_liveupdate.h
F: include/linux/liveupdate.h
F: include/uapi/linux/liveupdate.h
F: tools/testing/selftests/liveupdate/
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 86b4f3a2fb44781c6e323ba029db510450556fa9..e8891844b8194dabf8d1e8e2d74d9c701bd741ca 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -6,14 +6,94 @@
*/
#define pr_fmt(fmt) "PCI liveupdate: " fmt
+#define dev_fmt(fmt) "PCI liveupdate: " fmt
+#include <linux/types.h>
#include <linux/liveupdate.h>
+#include "pci.h"
#define PCI_SUBSYSTEM_NAME "pci"
+static void stack_push_buses(struct list_head *stack, struct list_head *buses)
+{
+ struct pci_bus *bus;
+
+ list_for_each_entry(bus, buses, node)
+ list_move_tail(&bus->dev.lu.lu_next, stack);
+}
+
+static void liveupdate_add_dev(struct device *dev, struct list_head *head)
+{
+ dev_info(dev, "collect liveupdate device: flags %x\n", dev->lu.flags);
+ list_move_tail(&dev->lu.lu_next, head);
+}
+
+static int collect_bus_devices_reverse(struct pci_bus *bus, struct list_head *head)
+{
+ struct pci_dev *pdev;
+ int count = 0;
+
+ list_for_each_entry_reverse(pdev, &bus->devices, bus_list) {
+ if (pdev->dev.lu.flags & LU_BUSMASTER && pdev->dev.parent)
+ pdev->dev.parent->lu.flags |= LU_BUSMASTER_BRIDGE;
+ if (pdev->dev.lu.flags) {
+ liveupdate_add_dev(&pdev->dev, head);
+ count++;
+ }
+ }
+ return count;
+}
+
+static int build_liveupdate_devices(struct list_head *head)
+{
+ LIST_HEAD(bus_stack);
+ int count = 0;
+
+ stack_push_buses(&bus_stack, &pci_root_buses);
+
+ while (!list_empty(&bus_stack)) {
+ struct device *busdev;
+ struct pci_bus *bus;
+
+ busdev = list_last_entry(&bus_stack, struct device, lu.lu_next);
+ bus = to_pci_bus(busdev);
+ if (!busdev->lu.visited && !list_empty(&bus->children)) {
+ stack_push_buses(&bus_stack, &bus->children);
+ busdev->lu.visited = 1;
+ continue;
+ }
+
+ count += collect_bus_devices_reverse(bus, head);
+ busdev->lu.visited = 0;
+ list_del_init(&busdev->lu.lu_next);
+ }
+ return count;
+}
+
+static void cleanup_liveupdate_devices(struct list_head *head)
+{
+ struct device *d, *n;
+
+ list_for_each_entry_safe(d, n, head, lu.lu_next) {
+ d->lu.flags &= ~LU_DEPENDED;
+ list_del_init(&d->lu.lu_next);
+ }
+}
+
static int pci_liveupdate_prepare(void *arg, u64 *data)
{
+ LIST_HEAD(requested_devices);
+
pr_info("prepare data[%llx]\n", *data);
+
+ pci_lock_rescan_remove();
+ down_write(&pci_bus_sem);
+
+ build_liveupdate_devices(&requested_devices);
+ cleanup_liveupdate_devices(&requested_devices);
+
+ up_write(&pci_bus_sem);
+ pci_unlock_rescan_remove();
return 0;
}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index e8318fd5f6ed537a1b236a3a0f054161d5710abd..0e9ef387182856771d857181d88f376632b46f0d 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -304,6 +304,7 @@ static int pcie_device_init(struct pci_dev *pdev, int service, int irq)
device = &pcie->device;
device->bus = &pcie_port_bus_type;
device->release = release_pcie_device; /* callback to free pcie dev */
+ dev_liveupdate_init(device);
dev_set_name(device, "%s:pcie%03x",
pci_name(pdev),
get_descriptor_id(pci_pcie_type(pdev), service));
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4b8693ec9e4c67fc1655e0057b3b96b4098e6630..dddd7ebc03d1a6e6ee456e0bf02ab9833a819509 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -614,6 +614,7 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
INIT_LIST_HEAD(&b->devices);
INIT_LIST_HEAD(&b->slots);
INIT_LIST_HEAD(&b->resources);
+ dev_liveupdate_init(&b->dev);
b->max_bus_speed = PCI_SPEED_UNKNOWN;
b->cur_bus_speed = PCI_SPEED_UNKNOWN;
#ifdef CONFIG_PCI_DOMAINS_GENERIC
@@ -1985,6 +1986,7 @@ int pci_setup_device(struct pci_dev *dev)
dev->sysdata = dev->bus->sysdata;
dev->dev.parent = dev->bus->bridge;
dev->dev.bus = &pci_bus_type;
+ dev_liveupdate_init(&dev->dev);
dev->hdr_type = hdr_type & 0x7f;
dev->multifunction = !!(hdr_type & 0x80);
dev->error_state = pci_channel_io_normal;
@@ -3184,7 +3186,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
return NULL;
bridge->dev.parent = parent;
-
+ dev_liveupdate_init(&bridge->dev);
list_splice_init(resources, &bridge->windows);
bridge->sysdata = sysdata;
bridge->busnr = bus;
diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
new file mode 100644
index 0000000000000000000000000000000000000000..72297cba08a999e89f7bc0997dabdbe14e0aa12c
--- /dev/null
+++ b/include/linux/dev_liveupdate.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ * Chris Li <chrisl@kernel.org>
+ */
+#ifndef _LINUX_DEV_LIVEUPDATE_H
+#define _LINUX_DEV_LIVEUPDATE_H
+
+#include <linux/liveupdate.h>
+
+#ifdef CONFIG_LIVEUPDATE
+
+enum liveupdate_flag {
+ LU_BUSMASTER = 1 << 0,
+ LU_BUSMASTER_BRIDGE = 2 << 0,
+};
+
+#define LU_REQUESTED (LU_BUSMASTER)
+#define LU_DEPENDED (LU_BUSMASTER_BRIDGE)
+
+/**
+ * struct dev_liveupdate - Device state for live update operations
+ * @lu_next: List head for linking the device into live update
+ * related lists (e.g., a list of devices participating
+ * in a live update sequence).
+ * @flags: Indicate which liveupdate features the device
+ * participates in.
+ * @visited: Only used by the bus devices when traversing the PCI buses
+ * to build the liveupdate devices list. Set if the child
+ * buses have been pushed into the pending stack.
+ *
+ * This structure holds the state information required for performing
+ * live update operations on a device. It is embedded within a struct device.
+ */
+struct dev_liveupdate {
+ struct list_head lu_next;
+ enum liveupdate_flag flags;
+ bool visited:1;
+};
+
+#endif /* CONFIG_LIVEUPDATE */
+#endif /* _LINUX_DEV_LIVEUPDATE_H */
diff --git a/include/linux/device.h b/include/linux/device.h
index 4940db137fffff4ceacf819b32433a0f4898b125..e0b35c723239f1254a3b6152f433e0412cd3fb34 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -21,6 +21,7 @@
#include <linux/lockdep.h>
#include <linux/compiler.h>
#include <linux/types.h>
+#include <linux/dev_liveupdate.h>
#include <linux/mutex.h>
#include <linux/pm.h>
#include <linux/atomic.h>
@@ -508,6 +509,7 @@ struct device_physical_location {
* @pm_domain: Provide callbacks that are executed during system suspend,
* hibernation, system resume and during runtime PM transitions
* along with subsystem-level and driver-level callbacks.
+ * @lu: Live update state.
* @em_pd: device's energy model performance domain
* @pins: For device pin management.
* See Documentation/driver-api/pin-control.rst for details.
@@ -603,6 +605,10 @@ struct device {
struct dev_pm_info power;
struct dev_pm_domain *pm_domain;
+#ifdef CONFIG_LIVEUPDATE
+ struct dev_liveupdate lu;
+#endif
+
#ifdef CONFIG_ENERGY_MODEL
struct em_perf_domain *em_pd;
#endif
@@ -1168,4 +1174,13 @@ void device_link_wait_removal(void);
#define MODULE_ALIAS_CHARDEV_MAJOR(major) \
MODULE_ALIAS("char-major-" __stringify(major) "-*")
+#ifdef CONFIG_LIVEUPDATE
+static inline void dev_liveupdate_init(struct device *dev)
+{
+ INIT_LIST_HEAD(&dev->lu.lu_next);
+}
+#else
+static inline void dev_liveupdate_init(struct device *dev) {}
+#endif
+
#endif /* _DEVICE_H_ */
--
2.51.0.384.g4c02a37b29-goog
* [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
2025-09-16 7:45 ` [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
2025-09-16 7:45 ` [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-29 17:48 ` Jason Gunthorpe
2025-09-30 15:27 ` Greg Kroah-Hartman
2025-09-16 7:45 ` [PATCH v2 04/10] PCI/LUO: Restore state at PCI enumeration Chris Li
` (7 subsequent siblings)
10 siblings, 2 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC
After the list of preserved devices is constructed, the PCI subsystem can
now forward the liveupdate request to the driver.
The PCI subsystem saves and restores a u64 data value from the LUO
callback. For each device, the PCI subsystem preserves a "dev_state"
struct, which contains the path (domain + bus + devfn) and a per-device
u64 data field. The device driver uses this u64 data field to store the
device driver state. The device liveupdate callbacks look very similar to
the LUO subsystem callbacks, with "void *arg" changed to "struct device *dev".
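To make the path encoding concrete, here is a small userspace sketch of the
(domain << 16) | (bus << 8) | devfn layout used for the saved path.
toy_pci_path() is a hypothetical helper written for this illustration; it
assumes the usual devfn packing of a 5-bit slot and a 3-bit function.

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper: pack domain, bus, slot and function into
 * a 32-bit path. devfn is (slot << 3) | fn, as in the standard PCI layout.
 */
static uint32_t toy_pci_path(uint16_t domain, uint8_t bus,
			     uint8_t slot, uint8_t fn)
{
	uint8_t devfn = (uint8_t)(((slot & 0x1f) << 3) | (fn & 0x07));

	return ((uint32_t)domain << 16) | ((uint32_t)bus << 8) | devfn;
}
```

With this layout the test device 0000:05:00.1 maps to 0x0501 and the bridge
0000:04:01.0 maps to 0x0408. Note that only 16 bits of the domain number fit
into a 32-bit path.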
In the prepare callback, the PCI subsystem allocates and then preserves a
folio big enough to hold the state of all requested devices (an array of
struct pci_dev_ser) plus the count.
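The size calculation can be sketched in userspace as below. The two structs
mirror the ones in this patch with fixed-width userspace types;
pci_ser_size() and toy_get_order() are hypothetical helpers, and the 4 KiB
page size is an assumption for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

struct pci_dev_ser {
	uint32_t path;         /* domain + bus + slot + fn */
	uint32_t flags;
	uint64_t driver_data;  /* per-device driver data */
};

struct pci_ser {
	uint32_t count;
	struct pci_dev_ser devs[];  /* flexible array, one entry per device */
};

/* Bytes needed for the header plus count device entries. */
static size_t pci_ser_size(uint32_t count)
{
	return sizeof(struct pci_ser) +
	       sizeof(struct pci_dev_ser) * (size_t)count;
}

/*
 * Userspace model of get_order(): the smallest order such that
 * (TOY_PAGE_SIZE << order) is at least size bytes.
 */
static unsigned int toy_get_order(size_t size)
{
	unsigned int order = 0;

	while (((size_t)TOY_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

For the two-device test setup in the cover letter the state fits easily in
a single page, so the allocation would be order 0 under these assumptions.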
The PCI subsystem simply forwards the liveupdate callback, with the u64
data pointer pointing to the u64 field of the device's entry in the
device state array.
If some device fails the prepare callback, all previous devices that
already successfully finished the prepare callback will get the cancel
callback to clean up the saved state. This cleanup is the one special case
in which the full list is not walked.
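The partial rollback on a prepare failure follows a common pattern, sketched
here in userspace with an array instead of a list. struct toy_lu_dev,
prepare_all() and the three sample callbacks are hypothetical and exist only
to make the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device with liveupdate callbacks. */
struct toy_lu_dev {
	int (*prepare)(struct toy_lu_dev *dev, uint64_t *data);
	void (*cancel)(struct toy_lu_dev *dev, uint64_t data);
	uint64_t data;  /* per-device u64 handed to the callbacks */
	int prepared;
};

/* Sample callback: prepare succeeds and records some state. */
static int toy_prepare_ok(struct toy_lu_dev *dev, uint64_t *data)
{
	dev->prepared = 1;
	*data = 0x1000;
	return 0;
}

/* Sample callback: prepare fails without touching the device. */
static int toy_prepare_fail(struct toy_lu_dev *dev, uint64_t *data)
{
	(void)dev;
	(void)data;
	return -1;
}

/* Sample callback: drop whatever prepare saved. */
static void toy_cancel(struct toy_lu_dev *dev, uint64_t data)
{
	(void)data;
	dev->prepared = 0;
	dev->data = 0;
}

/*
 * Call prepare() on each device in order. On the first failure, call
 * cancel() only on the devices that already prepared successfully, then
 * return the error.
 */
static int prepare_all(struct toy_lu_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = devs[i].prepare(&devs[i], &devs[i].data);

		if (ret) {
			while (i-- > 0)
				devs[i].cancel(&devs[i], devs[i].data);
			return ret;
		}
	}
	return 0;
}
```

The device whose prepare() failed is not cancelled, on the assumption that a
failing prepare() leaves nothing behind to clean up.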
In other live update callbacks, all the devices in the preserved device
list will get the callback with their own u64 data field.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
drivers/pci/liveupdate.c | 203 +++++++++++++++++++++++++++++++++++++++--
include/linux/dev_liveupdate.h | 23 +++++
include/linux/device/driver.h | 6 ++
3 files changed, 223 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index e8891844b8194dabf8d1e8e2d74d9c701bd741ca..2b215c224fb78c908579b0d22be713e1dc7ca21f 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -9,11 +9,25 @@
#define dev_fmt(fmt) "PCI liveupdate: " fmt
#include <linux/types.h>
+#include <linux/kexec_handover.h>
#include <linux/liveupdate.h>
#include "pci.h"
#define PCI_SUBSYSTEM_NAME "pci"
+static LIST_HEAD(preserved_devices);
+
+struct pci_dev_ser {
+ u32 path; /* domain + bus + slot + fn */
+ u32 flags;
+ u64 driver_data; /* driver data */
+};
+
+struct pci_ser {
+ u32 count;
+ struct pci_dev_ser devs[];
+};
+
static void stack_push_buses(struct list_head *stack, struct list_head *buses)
{
struct pci_bus *bus;
@@ -70,42 +84,213 @@ static int build_liveupdate_devices(struct list_head *head)
return count;
}
+static void dev_cleanup_liveupdate(struct device *dev)
+{
+ dev->lu.flags &= ~LU_DEPENDED;
+ list_del_init(&dev->lu.lu_next);
+}
+
static void cleanup_liveupdate_devices(struct list_head *head)
{
struct device *d, *n;
- list_for_each_entry_safe(d, n, head, lu.lu_next) {
- d->lu.flags &= ~LU_DEPENDED;
- list_del_init(&d->lu.lu_next);
+ list_for_each_entry_safe(d, n, head, lu.lu_next)
+ dev_cleanup_liveupdate(d);
+}
+
+static void cleanup_liveupdate_state(struct pci_ser *pci_state)
+{
+ struct folio *folio = virt_to_folio(pci_state);
+
+ kho_unpreserve_folio(folio);
+ folio_put(folio);
+}
+
+static void pci_call_cancel(struct pci_ser *pci_state)
+{
+ struct pci_dev_ser *si = pci_state->devs;
+ struct device *dev, *next;
+
+ list_for_each_entry_safe(dev, next, &preserved_devices, lu.lu_next) {
+ struct pci_dev_ser *s = si++;
+
+ if (!dev->driver)
+ panic("PCI liveupdate cancel: %s has no driver", dev_name(dev));
+ if (!dev->driver->lu)
+ panic("PCI liveupdate cancel: %s driver %s does not support liveupdate",
+ dev_name(dev), dev->driver->name ? : "(null name)");
+ if (dev->driver->lu->cancel)
+ dev->driver->lu->cancel(dev, s->driver_data);
+ dev_cleanup_liveupdate(dev);
}
}
-static int pci_liveupdate_prepare(void *arg, u64 *data)
+static int pci_get_device_path(struct pci_dev *pdev)
+{
+ return (pci_domain_nr(pdev->bus) << 16) | pci_dev_id(pdev);
+}
+
+static int pci_save_device_state(struct device *dev, struct pci_dev_ser *s)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ s->path = pci_get_device_path(pdev);
+ s->flags = dev->lu.flags;
+ return 0;
+}
+
+static int pci_call_prepare(struct pci_ser *pci_state,
+ struct list_head *devices)
+{
+ struct pci_dev_ser *pdev_state_current = pci_state->devs;
+ struct device *dev, *next;
+ int ret;
+ char *reason;
+
+ list_for_each_entry_safe(dev, next, devices, lu.lu_next) {
+ struct pci_dev_ser *s = pdev_state_current++;
+
+ if (!dev->driver) {
+ reason = "no driver";
+ ret = -ENOENT;
+ goto cancel;
+ }
+ if (!dev->driver->lu) {
+ reason = "driver does not support liveupdate";
+ ret = -EPERM;
+ goto cancel;
+ }
+ ret = pci_save_device_state(dev, s);
+ if (ret) {
+ reason = "save device state failed";
+ goto cancel;
+ }
+ if (dev->driver->lu->prepare) {
+ ret = dev->driver->lu->prepare(dev, &s->driver_data);
+ if (ret) {
+ reason = "prepare() failed";
+ goto cancel;
+ }
+ }
+ list_move_tail(&dev->lu.lu_next, &preserved_devices);
+ }
+ return 0;
+
+cancel:
+ dev_err(dev, "luo prepare failed %d (%s)\n", ret, reason);
+ pci_call_cancel(pci_state);
+ return ret;
+}
+
+static int __pci_liveupdate_prepare(void *arg, u64 *data)
{
LIST_HEAD(requested_devices);
+ struct pci_ser *pci_state;
+ int ret;
+ int count = build_liveupdate_devices(&requested_devices);
+ int size = sizeof(*pci_state) + sizeof(pci_state->devs[0]) * count;
+ int order = get_order(size);
+ struct folio *folio;
- pr_info("prepare data[%llx]\n", *data);
+ folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, order);
+ if (!folio) {
+ ret = -ENOMEM;
+ goto cleanup_device;
+ }
- pci_lock_rescan_remove();
- down_write(&pci_bus_sem);
+ pci_state = folio_address(folio);
+ pci_state->count = count;
+
+ ret = kho_preserve_folio(folio);
+ if (ret) {
+ pr_err("liveupdate_preserve_folio failed\n");
+ goto release_folio;
+ }
+
+ ret = pci_call_prepare(pci_state, &requested_devices);
+ if (ret)
+ goto unpreserve;
- build_liveupdate_devices(&requested_devices);
+ *data = __pa(pci_state);
+ pr_info("prepare data[%llx]\n", *data);
+ return 0;
+
+unpreserve:
+ kho_unpreserve_folio(folio);
+release_folio:
+ folio_put(folio);
+cleanup_device:
cleanup_liveupdate_devices(&requested_devices);
+ return ret;
+}
+static int pci_liveupdate_prepare(void *arg, u64 *data)
+{
+ int ret;
+
+ pci_lock_rescan_remove();
+ down_write(&pci_bus_sem);
+ ret = __pci_liveupdate_prepare(arg, data);
up_write(&pci_bus_sem);
pci_unlock_rescan_remove();
+ return ret;
+}
+
+static int pci_call_freeze(struct pci_ser *pci_state, struct list_head *devlist)
+{
+ struct pci_dev_ser *n = pci_state->devs;
+ struct device *dev;
+ int ret = 0;
+
+ list_for_each_entry(dev, devlist, lu.lu_next) {
+ struct pci_dev_ser *s = n++;
+
+ if (!dev->driver) {
+ if (!dev->parent)
+ continue;
+ panic("PCI liveupdate freeze: %s has no driver", dev_name(dev));
+ }
+ if (!dev->driver->lu->freeze)
+ continue;
+ ret = dev->driver->lu->freeze(dev, &s->driver_data);
+ if (ret) {
+ dev_err(dev, "luo freeze failed %d\n", ret);
+ pci_call_cancel(pci_state);
+ return ret;
+ }
+ }
return 0;
}
static int pci_liveupdate_freeze(void *arg, u64 *data)
{
+ struct pci_ser *pci_state = phys_to_virt(*data);
+ int ret;
+
pr_info("freeze data[%llx]\n", *data);
- return 0;
+ pci_lock_rescan_remove();
+ down_write(&pci_bus_sem);
+
+ ret = pci_call_freeze(pci_state, &preserved_devices);
+
+ up_write(&pci_bus_sem);
+ pci_unlock_rescan_remove();
+ return ret;
}
static void pci_liveupdate_cancel(void *arg, u64 data)
{
+ struct pci_ser *pci_state = phys_to_virt(data);
+
pr_info("cancel data[%llx]\n", data);
+ pci_lock_rescan_remove();
+ down_write(&pci_bus_sem);
+
+ pci_call_cancel(pci_state);
+ cleanup_liveupdate_state(pci_state);
+
+ up_write(&pci_bus_sem);
+ pci_unlock_rescan_remove();
}
static void pci_liveupdate_finish(void *arg, u64 data)
diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
index 72297cba08a999e89f7bc0997dabdbe14e0aa12c..80a723c7701ac4ddc2ddd03d0ffc9cc5a62a6083 100644
--- a/include/linux/dev_liveupdate.h
+++ b/include/linux/dev_liveupdate.h
@@ -20,6 +20,8 @@ enum liveupdate_flag {
#define LU_REQUESTED (LU_BUSMASTER)
#define LU_DEPENDED (LU_BUSMASTER_BRIDGE)
+struct device;
+
/**
* struct dev_liveupdate - Device state for live update operations
* @lu_next: List head for linking the device into live update
@@ -40,5 +42,26 @@ struct dev_liveupdate {
bool visited:1;
};
+/**
+ * struct dev_liveupdate_ops - Live Update callback functions
+ * @prepare: Prepare device for the upcoming state transition. Driver and
+ * buses should save the necessary device state.
+ * @freeze: A final notification before the system jumps to the new kernel.
+ * Called from reboot() syscall.
+ * @cancel: Cancel the live update process. Driver should clean
+ * up any saved state if necessary.
+ * @finish: The system has completed a transition. Drivers and buses should
+ * have already restored the previously saved device state.
+ * Clean-up any saved state or reset unreclaimed device.
+ *
+ * This structure is used by drivers and buses to hold the callback from LUO.
+ */
+struct dev_liveupdate_ops {
+ int (*prepare)(struct device *dev, u64 *data);
+ int (*freeze)(struct device *dev, u64 *data);
+ void (*cancel)(struct device *dev, u64 data);
+ void (*finish)(struct device *dev, u64 data);
+};
+
#endif /* CONFIG_LIVEUPDATE */
#endif /* _LINUX_DEV_LIVEUPDATE_H */
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index cd8e0f0a634be9ea63ff22e89d66ada3b1a9eaf2..b2ba469cc3065a412f02230c62e811af19c4d2c6 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -19,6 +19,7 @@
#include <linux/pm.h>
#include <linux/device/bus.h>
#include <linux/module.h>
+#include <linux/dev_liveupdate.h>
/**
* enum probe_type - device driver probe type to try
@@ -80,6 +81,8 @@ enum probe_type {
* it is bound to the driver.
* @pm: Power management operations of the device which matched
* this driver.
+ * @lu: Live update callbacks, notify device of the live
+ * update state, and allow preserve device across reboot.
* @coredump: Called when sysfs entry is written to. The device driver
* is expected to call the dev_coredump API resulting in a
* uevent.
@@ -116,6 +119,9 @@ struct device_driver {
const struct attribute_group **dev_groups;
const struct dev_pm_ops *pm;
+#ifdef CONFIG_LIVEUPDATE
+ const struct dev_liveupdate_ops *lu;
+#endif
void (*coredump) (struct device *dev);
struct driver_private *p;
--
2.51.0.384.g4c02a37b29-goog
* [PATCH v2 04/10] PCI/LUO: Restore state at PCI enumeration
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (2 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 05/10] PCI/LUO: Forward finish callbacks to drivers Chris Li
` (6 subsequent siblings)
10 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC
Add a PCI device saved state member to indicate the device is requested
vs depended.
Restore the PCI subsystem saved state folio during PCI enumeration.
When a new PCI device is created, restore the per device state pointer
into the dev->lu.dev_state if the device is found in the saved
devices array, by matching the device path.
Restore the dev->lu.flags from the saved flags field.
Add such devices to the "probe_devices" list.
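The lookup by path amounts to a linear search over the saved array, roughly
as in the userspace sketch below. find_saved_state() is a hypothetical
helper, and struct pci_dev_ser is repeated here with userspace types so the
example stands alone.

```c
#include <stddef.h>
#include <stdint.h>

struct pci_dev_ser {
	uint32_t path;         /* domain + bus + slot + fn */
	uint32_t flags;
	uint64_t driver_data;  /* per-device driver data */
};

/*
 * Hypothetical helper: return the saved entry whose path matches, or
 * NULL if the device was not preserved across the liveupdate.
 */
static struct pci_dev_ser *find_saved_state(struct pci_dev_ser *devs,
					    uint32_t count, uint32_t path)
{
	for (uint32_t i = 0; i < count; i++)
		if (devs[i].path == path)
			return &devs[i];
	return NULL;
}
```

A device that is not found in the array is simply a normal device with no
liveupdate state to restore. The search runs once per enumerated device, so
its cost grows with the number of preserved devices.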
Signed-off-by: Chris Li <chrisl@kernel.org>
---
drivers/pci/liveupdate.c | 50 ++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.h | 6 +++++
drivers/pci/probe.c | 2 ++
include/linux/dev_liveupdate.h | 2 ++
4 files changed, 60 insertions(+)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 2b215c224fb78c908579b0d22be713e1dc7ca21f..305c5e85aba6bac9d02f97c83e7b3250298d2eff 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -16,6 +16,7 @@
#define PCI_SUBSYSTEM_NAME "pci"
static LIST_HEAD(preserved_devices);
+static LIST_HEAD(probe_devices);
struct pci_dev_ser {
u32 path; /* domain + bus + slot + fn */
@@ -87,6 +88,7 @@ static int build_liveupdate_devices(struct list_head *head)
static void dev_cleanup_liveupdate(struct device *dev)
{
dev->lu.flags &= ~LU_DEPENDED;
+ dev->lu.dev_state = NULL;
list_del_init(&dev->lu.lu_next);
}
@@ -306,6 +308,54 @@ struct liveupdate_subsystem pci_liveupdate_ops = {
.name = PCI_SUBSYSTEM_NAME,
};
+static struct pci_ser *pci_state_get(void)
+{
+ static struct pci_ser *pci_state;
+ struct folio *folio;
+ phys_addr_t data = 0;
+ int ret;
+
+ if (pci_state)
+ return pci_state;
+
+ ret = liveupdate_get_subsystem_data(&pci_liveupdate_ops, &data);
+ if (ret || !data)
+ panic("PCI liveupdate: get subsystem data: [%llx] ret: %d", data, ret);
+
+ folio = kho_restore_folio(data);
+ if (!folio)
+ panic("PCI liveupdate: restore folio from %llx failed", data);
+
+ /* Cache the value for future callers. */
+ pci_state = folio_address(folio);
+ return pci_state;
+}
+
+static void pci_dev_do_restore(struct pci_dev *dev, struct pci_dev_ser *s)
+{
+ dev->dev.lu.dev_state = s;
+ dev->dev.lu.flags = s->flags;
+ pci_info(dev, "liveupdate restore flags %x data: [%llx]\n",
+ s->flags, s->driver_data);
+ list_move_tail(&dev->dev.lu.lu_next, &probe_devices);
+}
+
+void pci_liveupdate_restore(struct pci_dev *dev)
+{
+ int path;
+ struct pci_dev_ser *s, *end;
+
+ if (!liveupdate_state_updated())
+ return;
+
+ path = pci_get_device_path(dev);
+ s = pci_state_get()->devs;
+ end = s + pci_state_get()->count;
+ for (; s < end; s++)
+ if (s->path == path)
+ return pci_dev_do_restore(dev, s);
+}
+
static int __init pci_liveupdate_init(void)
{
int ret;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb682b669c0e3a582b5379828e70c4..c9a7383753949994e031dc362920286a475fe2ab 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1159,4 +1159,10 @@ static inline int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int inde
(PCI_CONF1_ADDRESS(bus, dev, func, reg) | \
PCI_CONF1_EXT_REG(reg))
+#ifdef CONFIG_LIVEUPDATE
+void pci_liveupdate_restore(struct pci_dev *dev);
+#else
+static inline void pci_liveupdate_restore(struct pci_dev *dev) {}
+#endif
+
#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index dddd7ebc03d1a6e6ee456e0bf02ab9833a819509..a0605af1a699cd07b09897172803dcba1d2da9f9 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2017,6 +2017,8 @@ int pci_setup_device(struct pci_dev *dev)
if (pci_early_dump)
early_dump_pci_device(dev);
+ pci_liveupdate_restore(dev);
+
/* Need to have dev->class ready */
dev->cfg_size = pci_cfg_space_size(dev);
diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
index 80a723c7701ac4ddc2ddd03d0ffc9cc5a62a6083..bb7ecf159dfa82e3779d938811541dddcf8f40af 100644
--- a/include/linux/dev_liveupdate.h
+++ b/include/linux/dev_liveupdate.h
@@ -27,6 +27,7 @@ struct device;
* @lu_next: List head for linking the device into live update
* related lists (e.g., a list of devices participating
* in a live update sequence).
+ * @dev_state: Set to the device state at restore.
* @flags: Indicate what liveupdate feature does the device
* participtate.
* @visited: Only used by the bus devices when travese the PCI buses
@@ -38,6 +39,7 @@ struct device;
*/
struct dev_liveupdate {
struct list_head lu_next;
+ void *dev_state;
enum liveupdate_flag flags;
bool visited:1;
};
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v2 05/10] PCI/LUO: Forward finish callbacks to drivers
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (3 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 04/10] PCI/LUO: Restore state at PCI enumeration Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 06/10] PCI/LUO: Save and restore driver name Chris Li
` (5 subsequent siblings)
10 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC (permalink / raw)
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
When PCI receives the LUO finish callback, the PCI subsystem forwards the
finish callback to the driver, passing the driver_data from the restored
dev->lu.dev_state.
Tested: In qemu, mark a virtio net device as requested.
Perform luo prepare, then kexec. Verify that the new kernel's boot
dmesg shows the requested device has its per-device live update state
restored. Perform liveupdate finish and see that the device finish
callback gets invoked.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
drivers/pci/liveupdate.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 305c5e85aba6bac9d02f97c83e7b3250298d2eff..41606df346f751c78f6c69caa275b4a76be72510 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -264,6 +264,29 @@ static int pci_call_freeze(struct pci_ser *pci_state, struct list_head *devlist)
return 0;
}
+static void pci_call_finish(struct list_head *devlist)
+{
+ struct device *dev;
+
+ pci_lock_rescan_remove();
+ down_write(&pci_bus_sem);
+
+ list_for_each_entry(dev, devlist, lu.lu_next) {
+ struct pci_dev_ser *s = dev->lu.dev_state;
+
+ if (!dev->driver)
+ panic("PCI luo finish: dev %s does not have driver", dev_name(dev));
+ if (!dev->driver->lu)
+ panic("PCI luo finish: dev %s does not support liveupdate",
+ dev_name(dev));
+ if (!dev->driver->lu->finish)
+ continue;
+ dev->driver->lu->finish(dev, s->driver_data);
+ }
+ up_write(&pci_bus_sem);
+ pci_unlock_rescan_remove();
+}
+
static int pci_liveupdate_freeze(void *arg, u64 *data)
{
struct pci_ser *pci_state = phys_to_virt(*data);
@@ -297,7 +320,12 @@ static void pci_liveupdate_cancel(void *arg, u64 data)
static void pci_liveupdate_finish(void *arg, u64 data)
{
+ struct pci_ser *pci_state = phys_to_virt(data);
+
pr_info("finish data[%llx]\n", data);
+ pci_call_finish(&probe_devices);
+ cleanup_liveupdate_devices(&probe_devices);
+ cleanup_liveupdate_state(pci_state);
}
struct liveupdate_subsystem pci_liveupdate_ops = {
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (4 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 05/10] PCI/LUO: Forward finish callbacks to drivers Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-29 17:57 ` Jason Gunthorpe
2025-09-16 7:45 ` [PATCH v2 07/10] PCI/LUO: Add liveupdate to pcieport driver Chris Li
` (4 subsequent siblings)
10 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC (permalink / raw)
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
Save the PCI driver name into "struct pci_dev_ser" during the PCI
prepare callback.
After kexec, use driver_set_override() to ensure the device is
bound only to the saved driver.
Clear the override after the finish callback.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
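Illustrative note, not part of the patch: the save step rejects a missing
or oversized driver name before copying it into the fixed-size field. Below
is a minimal userspace sketch of that check. save_driver_name() and
DRIVER_NAME_LEN are hypothetical names, and memcpy() stands in for the
kernel-only strscpy().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DRIVER_NAME_LEN 63	/* mirrors char driver_name[63] in the patch */

/*
 * Copy a driver name into a fixed-size serialized field.
 * Returns 0 on success, -ENXIO if there is no name, and -ENOSPC if the
 * name plus its NUL terminator does not fit.
 */
static int save_driver_name(char dst[DRIVER_NAME_LEN], const char *name)
{
	size_t len;

	if (!name)
		return -ENXIO;
	len = strlen(name);
	if (len > DRIVER_NAME_LEN - 1)
		return -ENOSPC;
	memcpy(dst, name, len + 1);	/* len + 1 copies the terminating NUL */
	return 0;
}
```

Failing with -ENOSPC rather than truncating matters here: a truncated name
would be used as the driver override after kexec and would match no driver.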
drivers/pci/liveupdate.c | 36 ++++++++++++++++++++++++++++++++++--
drivers/pci/pci.h | 2 ++
drivers/pci/probe.c | 2 ++
3 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 41606df346f751c78f6c69caa275b4a76be72510..ae8f4dc5cf92577a4da83743c3b80bc72974a43e 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -21,6 +21,7 @@ static LIST_HEAD(probe_devices);
struct pci_dev_ser {
u32 path; /* domain + bus + slot + fn */
u32 flags;
+ char driver_name[63];
u64 driver_data; /* driver data */
};
@@ -87,6 +88,10 @@ static int build_liveupdate_devices(struct list_head *head)
static void dev_cleanup_liveupdate(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (liveupdate_state_updated())
+ WARN_ON(driver_set_override(dev, &pdev->driver_override, "", 0));
dev->lu.flags &= ~LU_DEPENDED;
dev->lu.dev_state = NULL;
list_del_init(&dev->lu.lu_next);
@@ -135,7 +140,13 @@ static int pci_get_device_path(struct pci_dev *pdev)
static int pci_save_device_state(struct device *dev, struct pci_dev_ser *s)
{
struct pci_dev *pdev = to_pci_dev(dev);
+ const char *name = dev->driver->name;
+ if (!name)
+ return -ENXIO;
+ if (strlen(name) > sizeof(s->driver_name) - 1)
+ return -ENOSPC;
+ strscpy(s->driver_name, name, sizeof(s->driver_name));
s->path = pci_get_device_path(pdev);
s->flags = dev->lu.flags;
return 0;
@@ -363,8 +374,8 @@ static void pci_dev_do_restore(struct pci_dev *dev, struct pci_dev_ser *s)
{
dev->dev.lu.dev_state = s;
dev->dev.lu.flags = s->flags;
- pci_info(dev, "liveupdate restore flags %x data: [%llx]\n",
- s->flags, s->driver_data);
+ pci_info(dev, "liveupdate restore flags %x driver: %s data: [%llx]\n",
+ s->flags, s->driver_name, s->driver_data);
list_move_tail(&dev->dev.lu.lu_next, &probe_devices);
}
@@ -384,6 +395,27 @@ void pci_liveupdate_restore(struct pci_dev *dev)
return pci_dev_do_restore(dev, s);
}
+void pci_liveupdate_override_driver(struct pci_dev *dev)
+{
+ struct pci_dev_ser *s = dev->dev.lu.dev_state;
+ int ret;
+ int len;
+
+ if (!s)
+ return;
+
+ len = strlen(s->driver_name);
+ if (!len)
+ return;
+
+ ret = driver_set_override(&dev->dev,
+ &dev->driver_override,
+ s->driver_name, len);
+ if (ret)
+ panic("PCI Liveupdate override driver failed: %s", s->driver_name);
+}
+
+
static int __init pci_liveupdate_init(void)
{
int ret;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c9a7383753949994e031dc362920286a475fe2ab..b79a18c5e948980fe2ef3f0a10e0d795b1eee6d7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1161,8 +1161,10 @@ static inline int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int inde
#ifdef CONFIG_LIVEUPDATE
void pci_liveupdate_restore(struct pci_dev *dev);
+void pci_liveupdate_override_driver(struct pci_dev *dev);
#else
static inline void pci_liveupdate_restore(struct pci_dev *dev) {}
+static inline void pci_liveupdate_override_driver(struct pci_dev *dev) {}
#endif
#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index a0605af1a699cd07b09897172803dcba1d2da9f9..e41a1bef2083aa9184fd1c894d5de964f19d5c01 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2714,6 +2714,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
/* Set up MSI IRQ domain */
pci_set_msi_domain(dev);
+ pci_liveupdate_override_driver(dev);
+
/* Notifier could use PCI capabilities */
ret = device_add(&dev->dev);
WARN_ON(ret < 0);
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v2 07/10] PCI/LUO: Add liveupdate to pcieport driver
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (5 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 06/10] PCI/LUO: Save and restore driver name Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 08/10] PCI/LUO: Add pci_liveupdate_get_driver_data() Chris Li
` (3 subsequent siblings)
10 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC (permalink / raw)
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
The PCIe port driver is the driver bound to the PCI-PCI bridge.
The PCIe port device is depended on by its child PCI devices.
Add an empty liveupdate ops structure to the pcieport driver to indicate
that this driver supports liveupdate. Otherwise the liveupdate operation
can fail if the PCI-PCI bridge is depended on.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
drivers/pci/pcie/portdrv.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 0e9ef387182856771d857181d88f376632b46f0d..fd43c1ebfb9d2852fbc460b0390dd7fb016226d2 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -789,6 +789,15 @@ static const struct pci_error_handlers pcie_portdrv_err_handler = {
.mmio_enabled = pcie_portdrv_mmio_enabled,
};
+#ifdef CONFIG_LIVEUPDATE
+
+/*
+ * Empty pcie_port_lu_ops to indicate this driver supports liveupdate.
+ */
+static struct dev_liveupdate_ops pcie_port_lu_ops;
+
+#endif /* CONFIG_LIVEUPDATE */
+
static struct pci_driver pcie_portdriver = {
.name = "pcieport",
.id_table = port_pci_ids,
@@ -802,6 +811,9 @@ static struct pci_driver pcie_portdriver = {
.driver_managed_dma = true,
.driver.pm = PCIE_PORTDRV_PM_OPS,
+#ifdef CONFIG_LIVEUPDATE
+ .driver.lu = &pcie_port_lu_ops,
+#endif
};
static int __init dmi_pcie_pme_disable_msi(const struct dmi_system_id *d)
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v2 08/10] PCI/LUO: Add pci_liveupdate_get_driver_data()
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (6 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 07/10] PCI/LUO: Add liveupdate to pcieport driver Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 09/10] PCI/LUO: Avoid write to bus master at boot Chris Li
` (2 subsequent siblings)
10 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC (permalink / raw)
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
Similar to liveupdate_get_subsystem_data(), the PCI subsystem
provides pci_liveupdate_get_driver_data() for the driver to
receive the driver data during new kernel boot up, in the liveupdate
updated state.
This function will return an error in any other liveupdate state.
For example, vfio-pci will use this API in probe() to access the
liveupdate state from the previous kernel.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
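Illustrative note, not part of the patch: the accessor distinguishes two
failure cases, wrong liveupdate state (-EINVAL) and a device with no
preserved state (-ENOENT). Below is a minimal userspace sketch of that
contract. struct mock_dev, struct dev_ser and get_driver_data() are
hypothetical stand-ins, and the liveupdate state is passed in as a plain
bool instead of calling liveupdate_state_updated().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the serialized per-device record (assumption). */
struct dev_ser {
	uint64_t driver_data;
};

/* Mock device: dev_state stays NULL unless the device was preserved. */
struct mock_dev {
	const struct dev_ser *dev_state;
};

/* Returns 0 and fills *data only in the updated state for a preserved device. */
static int get_driver_data(const struct mock_dev *dev, bool state_updated,
			   uint64_t *data)
{
	if (!state_updated)
		return -EINVAL;
	if (!dev->dev_state)
		return -ENOENT;
	*data = dev->dev_state->driver_data;
	return 0;
}
```

A caller such as a probe() routine would check the return value before
using *data, since the output is left untouched on either error.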
drivers/pci/liveupdate.c | 15 +++++++++++++++
include/linux/pci.h | 9 +++++++++
2 files changed, 24 insertions(+)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index ae8f4dc5cf92577a4da83743c3b80bc72974a43e..1b12fc0649f479c6f45ffb26e6e3754f41054ea8 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -395,6 +395,21 @@ void pci_liveupdate_restore(struct pci_dev *dev)
return pci_dev_do_restore(dev, s);
}
+int pci_liveupdate_get_driver_data(struct pci_dev *pdev, u64 *data)
+{
+ struct dev_liveupdate *lu = &pdev->dev.lu;
+ struct pci_dev_ser *s = lu->dev_state;
+
+ if (!liveupdate_state_updated())
+ return -EINVAL;
+
+ if (!lu->dev_state)
+ return -ENOENT;
+
+ *data = s->driver_data;
+ return 0;
+}
+
void pci_liveupdate_override_driver(struct pci_dev *dev)
{
struct pci_dev_ser *s = dev->dev.lu.dev_state;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f39238f8b9ce08df97b384d1c1e89bbe..50296bb04aaa7f2bbd2260f8ec4670533e019e38 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2767,4 +2767,13 @@ void pci_uevent_ers(struct pci_dev *pdev, enum pci_ers_result err_type);
WARN_ONCE(condition, "%s %s: " fmt, \
dev_driver_string(&(pdev)->dev), pci_name(pdev), ##arg)
+#ifdef CONFIG_LIVEUPDATE
+int pci_liveupdate_get_driver_data(struct pci_dev *pdev, u64 *data);
+#else
+static inline int pci_liveupdate_get_driver_data(struct pci_dev *pdev,
+ u64 *data)
+{
+ return 0;
+}
+#endif
#endif /* LINUX_PCI_H */
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v2 09/10] PCI/LUO: Avoid write to bus master at boot
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (7 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 08/10] PCI/LUO: Add pci_liveupdate_get_driver_data() Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-29 17:14 ` Bjorn Helgaas
2025-09-16 7:45 ` [PATCH v2 10/10] PCI: pci-lu-stub: Add a stub driver for Live Update testing Chris Li
2025-09-27 17:13 ` [PATCH v2 00/10] LUO: PCI subsystem (phase I) Bjorn Helgaas
10 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC (permalink / raw)
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
If the liveupdate flags contain LU_BUSMASTER or LU_BUSMASTER_BRIDGE, the
device participates in liveupdate by preserving the bus master bit in the
PCI config space command register.
Avoid writing to the PCI command register for the bus master bit during
boot up.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
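Illustrative note, not part of the patch: the change gates the command
register write on the liveupdate flags. Below is a minimal userspace sketch
of that guard. The LU_BUSMASTER and LU_BUSMASTER_BRIDGE values here are
hypothetical (the series defines the real enum liveupdate_flag), and
may_write_command() and disable_master() are made-up helper names;
PCI_COMMAND_MASTER is the standard bus master enable bit.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER	0x4	/* bus master enable bit in PCI_COMMAND */

/* Hypothetical flag values; the series defines the real ones. */
#define LU_BUSMASTER		(1u << 0)
#define LU_BUSMASTER_BRIDGE	(1u << 1)

/* Devices preserving bus mastering must not have the bit rewritten. */
static bool may_write_command(uint32_t lu_flags)
{
	return !(lu_flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE));
}

/* Returns the command register value after an attempted bus master disable. */
static uint16_t disable_master(uint16_t cmd, uint32_t lu_flags)
{
	if ((cmd & PCI_COMMAND_MASTER) && may_write_command(lu_flags))
		cmd &= (uint16_t)~PCI_COMMAND_MASTER;
	return cmd;
}
```

With either flag set the register value is returned unchanged, so any DMA
in flight across the kexec is not interrupted by the new kernel.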
drivers/pci/liveupdate.c | 6 ++++++
drivers/pci/pci.c | 7 +++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
index 1b12fc0649f479c6f45ffb26e6e3754f41054ea8..a09a166b6ee271b96bce763716c3b62b24f3edbb 100644
--- a/drivers/pci/liveupdate.c
+++ b/drivers/pci/liveupdate.c
@@ -377,6 +377,12 @@ static void pci_dev_do_restore(struct pci_dev *dev, struct pci_dev_ser *s)
pci_info(dev, "liveupdate restore flags %x driver: %s data: [%llx]\n",
s->flags, s->driver_name, s->driver_data);
list_move_tail(&dev->dev.lu.lu_next, &probe_devices);
+ if (s->flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)) {
+ u16 pci_command;
+
+ pci_read_config_word(dev, PCI_COMMAND, &pci_command);
+ WARN_ON(!(pci_command & PCI_COMMAND_MASTER));
+ }
}
void pci_liveupdate_restore(struct pci_dev *dev)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9e42090fb108920995ebe34bd2535a0e23fef7fd..2339ac1bd57616a78d2105ba3a4fc72bbf49973e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2248,7 +2248,8 @@ static void do_pci_disable_device(struct pci_dev *dev)
pci_read_config_word(dev, PCI_COMMAND, &pci_command);
if (pci_command & PCI_COMMAND_MASTER) {
pci_command &= ~PCI_COMMAND_MASTER;
- pci_write_config_word(dev, PCI_COMMAND, pci_command);
+ if (!(dev->dev.lu.flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)))
+ pci_write_config_word(dev, PCI_COMMAND, pci_command);
}
pcibios_disable_device(dev);
@@ -4276,7 +4277,9 @@ static void __pci_set_master(struct pci_dev *dev, bool enable)
if (cmd != old_cmd) {
pci_dbg(dev, "%s bus mastering\n",
enable ? "enabling" : "disabling");
- pci_write_config_word(dev, PCI_COMMAND, cmd);
+
+ if (!(dev->dev.lu.flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)))
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
}
dev->is_busmaster = enable;
}
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v2 10/10] PCI: pci-lu-stub: Add a stub driver for Live Update testing
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (8 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 09/10] PCI/LUO: Avoid write to bus master at boot Chris Li
@ 2025-09-16 7:45 ` Chris Li
2025-09-27 17:13 ` [PATCH v2 00/10] LUO: PCI subsystem (phase I) Bjorn Helgaas
10 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-16 7:45 UTC (permalink / raw)
To: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin
Cc: linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Jason Miu, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Chris Li, Jason Gunthorpe, Leon Romanovsky
Introduce a new driver, pci-lu-stub, that can be bound to any PCI device
and used to test the PCI subsystem support for Live Update. This driver
gives developers a way to opt a device in to Live Update and to exercise
driver interaction with the PCI subsystem. This driver is only intended for
testing purposes.
In the future this driver can be extended to test other scenarios (such
as failing prepare() on purpose).
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
MAINTAINERS | 1 +
drivers/pci/Kconfig | 10 ++++
drivers/pci/Makefile | 1 +
drivers/pci/pci-lu-stub.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 152 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 1ae3d166cd35ec5c7818f202079ed5d10c09144b..43e5813e6f030a80c2c109b38e332eef536707d6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14021,6 +14021,7 @@ F: Documentation/ABI/testing/sysfs-kernel-liveupdate
F: Documentation/admin-guide/liveupdate.rst
F: drivers/misc/liveupdate/
F: drivers/pci/liveupdate/
+F: drivers/pci/pci-lu-stub.c
F: include/linux/dev_liveupdate.h
F: include/linux/liveupdate.h
F: include/uapi/linux/liveupdate.h
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 9c0e4aaf4e8cb7fecd9f80ac6289b8d854ce03aa..37e44782fa35c64c2eba6a0f6942d44d8003a499 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -327,3 +327,13 @@ source "drivers/pci/switch/Kconfig"
source "drivers/pci/pwrctrl/Kconfig"
endif
+
+config PCI_LU_STUB
+ tristate "PCI Live Update Stub Driver"
+ depends on LIVEUPDATE
+ help
+ Say Y or M here if you want to enable support for the Live Update stub
+ driver. This driver can be used to test the PCI subsystem support for
+ Live Updates.
+
+ When in doubt, say N.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index aa1bac7aed7d12c641a6b55e56176fb3cdde4c91..061e98d0411a951573e1996c61ce5a98f2775e53 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
obj-$(CONFIG_PCI_NPEM) += npem.o
obj-$(CONFIG_PCIE_TPH) += tph.o
obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
+obj-$(CONFIG_PCI_LU_STUB) += pci-lu-stub.o
# Endpoint library must be initialized before its users
obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
diff --git a/drivers/pci/pci-lu-stub.c b/drivers/pci/pci-lu-stub.c
new file mode 100644
index 0000000000000000000000000000000000000000..aa0404d16336278d76b062d8126dc5f45732403e
--- /dev/null
+++ b/drivers/pci/pci-lu-stub.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kexec_handover.h>
+#include <linux/liveupdate.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+struct pci_lu_stub_ser {
+ u16 dev_id;
+} __packed;
+
+static const struct pci_device_id pci_lu_stub_id_table[] = {
+ /* Allow binding to any device but only via driver_override. */
+ { PCI_DEVICE_DRIVER_OVERRIDE(PCI_ANY_ID, PCI_ANY_ID, 1) },
+ {},
+};
+
+static int validate_folio(struct pci_dev *dev, struct folio *folio)
+{
+ const struct pci_lu_stub_ser *ser = folio_address(folio);
+
+ if (folio_order(folio) != get_order(sizeof(*ser))) {
+ pci_err(dev, "Restored folio has unexpected order %u\n", folio_order(folio));
+ return -ERANGE;
+ }
+
+ if (ser->dev_id != pci_dev_id(dev)) {
+ pci_err(dev, "Restored folio contains unexpected dev_id: 0x%x\n", ser->dev_id);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int pci_lu_stub_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+ struct folio *folio;
+ u64 data;
+ int ret;
+
+ if (liveupdate_state_normal()) {
+ pci_set_master(dev);
+ pci_info(dev, "Marking device liveupdate busmaster\n");
+ dev->dev.lu.flags = LU_BUSMASTER;
+ return 0;
+ }
+
+ if (!liveupdate_state_updated()) {
+ pci_err(dev, "Unable to handle probe() outside of normal and updated states.\n");
+ return -EOPNOTSUPP;
+ }
+
+ ret = pci_liveupdate_get_driver_data(dev, &data);
+ if (ret) {
+ pci_err(dev, "Failed to get driver data for device (%d)\n", ret);
+ return ret;
+ }
+
+ pci_info(dev, "%s(): data: 0x%llx\n", __func__, data);
+
+ folio = kho_restore_folio(data);
+ if (!folio) {
+ pci_err(dev, "Failed to restore folio at 0x%llx.\n", data);
+ return -ENOENT;
+ }
+
+ return validate_folio(dev, folio);
+}
+
+static void pci_lu_stub_remove(struct pci_dev *dev)
+{
+ WARN_ON(!liveupdate_state_normal());
+ dev->dev.lu.flags = 0;
+}
+
+static int pci_lu_stub_prepare(struct device *dev, u64 *data)
+{
+ struct pci_lu_stub_ser *ser;
+ struct folio *folio;
+ int ret;
+
+ folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, get_order(sizeof(*ser)));
+ if (!folio)
+ return -ENOMEM;
+
+ ret = kho_preserve_folio(folio);
+ if (ret) {
+ dev_err(dev, "Failed to preserve folio (%d)\n", ret);
+ folio_put(folio);
+ return ret;
+ }
+
+ ser = folio_address(folio);
+ ser->dev_id = pci_dev_id(to_pci_dev(dev));
+
+ *data = virt_to_phys(ser);
+ dev_info(dev, "%s(): data: 0x%llx\n", __func__, *data);
+ return 0;
+}
+
+static int pci_lu_stub_freeze(struct device *dev, u64 *data)
+{
+ struct folio *folio = pfn_folio(PHYS_PFN(*data));
+
+ dev_info(dev, "%s(): data: 0x%llx\n", __func__, *data);
+ return validate_folio(to_pci_dev(dev), folio);
+}
+
+static void pci_lu_stub_finish(struct device *dev, u64 data)
+{
+ struct folio *folio = pfn_folio(PHYS_PFN(data));
+
+ dev_info(dev, "%s(): data: 0x%llx\n", __func__, data);
+ WARN_ON(validate_folio(to_pci_dev(dev), folio));
+ folio_put(folio);
+}
+
+static void pci_lu_stub_cancel(struct device *dev, u64 data)
+{
+ dev_info(dev, "%s(): data: 0x%llx\n", __func__, data);
+ pci_lu_stub_finish(dev, data);
+}
+
+static struct dev_liveupdate_ops liveupdate_ops = {
+ .prepare = pci_lu_stub_prepare,
+ .freeze = pci_lu_stub_freeze,
+ .finish = pci_lu_stub_finish,
+ .cancel = pci_lu_stub_cancel,
+};
+
+static struct pci_driver pci_lu_stub_driver = {
+ .name = "pci-lu-stub",
+ .id_table = pci_lu_stub_id_table,
+ .probe = pci_lu_stub_probe,
+ .remove = pci_lu_stub_remove,
+ .driver.lu = &liveupdate_ops,
+};
+
+module_pci_driver(pci_lu_stub_driver);
+MODULE_LICENSE("GPL");
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
` (9 preceding siblings ...)
2025-09-16 7:45 ` [PATCH v2 10/10] PCI: pci-lu-stub: Add a stub driver for Live Update testing Chris Li
@ 2025-09-27 17:13 ` Bjorn Helgaas
2025-09-27 18:05 ` Pasha Tatashin
10 siblings, 1 reply; 84+ messages in thread
From: Bjorn Helgaas @ 2025-09-27 17:13 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:08AM -0700, Chris Li wrote:
> This is phase I of the LUO PCI series. It does the minimal set of PCI
> device liveupdate which is preserving a bus master bit in the PCI command
> register.
>
> The LUO PCI subsystem is based on the LUO V2 series.
> https://lore.kernel.org/lkml/20250515182322.117840-1-pasha.tatashin@soleen.com/
Pasha's email points to
https://github.com/googleprodkernel/linux-liveupdate/tree/luo/rfc-v2,
so I cloned https://github.com/googleprodkernel/linux-liveupdate.git
and tried to apply this series on top of the luo/rfc-v2 branch (head
5c8d261fdc15 ("MAINTAINERS: add liveupdate entry")), but it doesn't
apply cleanly.
Also tried the luo/v2 branch (head 75716df00a94 ("libluo: add
tests")), but it doesn't apply there either.
Am I looking in the wrong place? Do you have a public repo with this
series in it?
Bjorn
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-09-27 17:13 ` [PATCH v2 00/10] LUO: PCI subsystem (phase I) Bjorn Helgaas
@ 2025-09-27 18:05 ` Pasha Tatashin
2025-09-29 15:04 ` Bjorn Helgaas
0 siblings, 1 reply; 84+ messages in thread
From: Pasha Tatashin @ 2025-09-27 18:05 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
Hi Bjorn,
My latest submission is the following:
https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/
And github repo is in cover letter:
https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
It applies cleanly against the mainline without the first three
patches, as they were already merged.
Pasha
On Sat, Sep 27, 2025 at 1:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:08AM -0700, Chris Li wrote:
> > This is phase I of the LUO PCI series. It does the minimal set of PCI
> > device liveupdate which is preserving a bus master bit in the PCI command
> > register.
> >
> > The LUO PCI subsystem is based on the LUO V2 series.
> > https://lore.kernel.org/lkml/20250515182322.117840-1-pasha.tatashin@soleen.com/
>
> Pasha's email points to
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/rfc-v2,
> so I cloned https://github.com/googleprodkernel/linux-liveupdate.git
> and tried to apply this series on top of the luo/rfc-v2 branch (head
> 5c8d261fdc15 ("MAINTAINERS: add liveupdate entry")), but it doesn't
> apply cleanly.
>
> Also tried the luo/v2 branch (head 75716df00a94 ("libluo: add
> tests")), but it doesn't apply there either.
>
> Am I looking in the wrong place? Do you have a public repo with this
> series in it?
>
> Bjorn
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-09-27 18:05 ` Pasha Tatashin
@ 2025-09-29 15:04 ` Bjorn Helgaas
2025-09-29 18:13 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Bjorn Helgaas @ 2025-09-29 15:04 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Sat, Sep 27, 2025 at 02:05:38PM -0400, Pasha Tatashin wrote:
> Hi Bjorn,
>
> My latest submission is the following:
> https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/
>
> And github repo is in cover letter:
>
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
>
> It applies cleanly against the mainline without the first three
> patches, as they were already merged.
Not sure what I'm missing. I've tried various things but none apply
cleanly:
$ git remote add luo https://github.com/googleprodkernel/linux-liveupdate.git
$ git fetch luo
From https://github.com/googleprodkernel/linux-liveupdate
* [new branch] hack_pci_pf_stub_demo -> luo/hack_pci_pf_stub_demo
* [new branch] iommu/rfc-v1 -> luo/iommu/rfc-v1
* [new branch] kho/v5 -> luo/kho/v5
* [new branch] kho/v6 -> luo/kho/v6
* [new branch] kho/v7 -> luo/kho/v7
* [new branch] kho/v8 -> luo/kho/v8
* [new branch] lucx/v1 -> luo/lucx/v1
* [new branch] luo/kho-v8 -> luo/luo/kho-v8
* [new branch] luo/memfd-v0.1 -> luo/luo/memfd-v0.1
* [new branch] luo/rfc-v1 -> luo/luo/rfc-v1
* [new branch] luo/rfc-v2 -> luo/luo/rfc-v2
* [new branch] luo/v1 -> luo/luo/v1
* [new branch] luo/v2 -> luo/luo/v2
* [new branch] luo/v3 -> luo/luo/v3
* [new branch] luo/v4 -> luo/luo/v4
* [new branch] master -> luo/master
$ b4 am -om/ https://lore.kernel.org/r/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
Grabbing thread from lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org/t.mbox.gz
Analyzing 13 messages in the thread
Looking for additional code-review trailers on lore.kernel.org
Checking attestation on all messages, may take a moment...
---
✓ [PATCH v2 1/10] PCI/LUO: Register with Liveupdate Orchestrator
✓ [PATCH v2 2/10] PCI/LUO: Create requested liveupdate device list
✓ [PATCH v2 3/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
✓ [PATCH v2 4/10] PCI/LUO: Restore state at PCI enumeration
✓ [PATCH v2 5/10] PCI/LUO: Forward finish callbacks to drivers
✓ [PATCH v2 6/10] PCI/LUO: Save and restore driver name
✓ [PATCH v2 7/10] PCI/LUO: Add liveupdate to pcieport driver
✓ [PATCH v2 8/10] PCI/LUO: Add pci_liveupdate_get_driver_data()
✓ [PATCH v2 9/10] PCI/LUO: Avoid write to bus master at boot
✓ [PATCH v2 10/10] PCI: pci-lu-stub: Add a stub driver for Live Update testing
---
✓ Signed: DKIM/kernel.org
---
Total patches: 10
---
Cover: m/v2_20250916_chrisl_luo_pci_subsystem_phase_i.cover
Link: https://lore.kernel.org/r/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
Base: base-commit 9ab803064e3d1be9673d2829785a69fd0578b24e not known, ignoring
Base: not specified
git am m/v2_20250916_chrisl_luo_pci_subsystem_phase_i.mbx
$ git checkout -b wip/2509-chris-luo-pci-v2 luo/luo/rfc-v2; git am m/v2_20250916_chrisl_luo_pci_subsystem_phase_i.mbx
Updating files: 100% (21294/21294), done.
branch 'wip/2509-chris-luo-pci-v2' set up to track 'luo/luo/rfc-v2'.
Switched to a new branch 'wip/2509-chris-luo-pci-v2'
Applying: PCI/LUO: Register with Liveupdate Orchestrator
Applying: PCI/LUO: Create requested liveupdate device list
Applying: PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
Applying: PCI/LUO: Restore state at PCI enumeration
Applying: PCI/LUO: Forward finish callbacks to drivers
Applying: PCI/LUO: Save and restore driver name
error: patch failed: drivers/pci/probe.c:2714
error: drivers/pci/probe.c: patch does not apply
Patch failed at 0006 PCI/LUO: Save and restore driver name
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
$ git checkout -b wip/2509-chris-luo-pci-v2 luo/luo/v2; git am m/v2_20250916_chrisl_luo_pci_subsystem_phase_i.mbx
Updating files: 100% (12217/12217), done.
branch 'wip/2509-chris-luo-pci-v2' set up to track 'luo/luo/v2'.
Switched to a new branch 'wip/2509-chris-luo-pci-v2'
Applying: PCI/LUO: Register with Liveupdate Orchestrator
error: patch failed: MAINTAINERS:14014
error: MAINTAINERS: patch does not apply
Patch failed at 0001 PCI/LUO: Register with Liveupdate Orchestrator
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
$ git checkout -b wip/2509-chris-luo-pci-v2 luo/luo/v3; git am m/v2_20250916_chrisl_luo_pci_subsystem_phase_i.mbx
branch 'wip/2509-chris-luo-pci-v2' set up to track 'luo/luo/v3'.
Switched to a new branch 'wip/2509-chris-luo-pci-v2'
Applying: PCI/LUO: Register with Liveupdate Orchestrator
error: patch failed: MAINTAINERS:14014
error: MAINTAINERS: patch does not apply
Patch failed at 0001 PCI/LUO: Register with Liveupdate Orchestrator
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 09/10] PCI/LUO: Avoid write to bus master at boot
2025-09-16 7:45 ` [PATCH v2 09/10] PCI/LUO: Avoid write to bus master at boot Chris Li
@ 2025-09-29 17:14 ` Bjorn Helgaas
0 siblings, 0 replies; 84+ messages in thread
From: Bjorn Helgaas @ 2025-09-29 17:14 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:17AM -0700, Chris Li wrote:
> If the liveupdate flag has LU_BUSMASTER or LU_BUSMASTER_BRIDGE, the
> device is participating in the liveupdate preserving bus master bit in the
> PCI config space command register.
>
> Avoid writing to the PCI command register for the bus master bit during
> boot up.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> drivers/pci/liveupdate.c | 6 ++++++
> drivers/pci/pci.c | 7 +++++--
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> index 1b12fc0649f479c6f45ffb26e6e3754f41054ea8..a09a166b6ee271b96bce763716c3b62b24f3edbb 100644
> --- a/drivers/pci/liveupdate.c
> +++ b/drivers/pci/liveupdate.c
> @@ -377,6 +377,12 @@ static void pci_dev_do_restore(struct pci_dev *dev, struct pci_dev_ser *s)
> pci_info(dev, "liveupdate restore flags %x driver: %s data: [%llx]\n",
> s->flags, s->driver_name, s->driver_data);
> list_move_tail(&dev->dev.lu.lu_next, &probe_devices);
> + if (s->flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)) {
> + u16 pci_command;
> +
> + pci_read_config_word(dev, PCI_COMMAND, &pci_command);
> + WARN_ON(!(pci_command & PCI_COMMAND_MASTER));
> + }
> }
>
> void pci_liveupdate_restore(struct pci_dev *dev)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 9e42090fb108920995ebe34bd2535a0e23fef7fd..2339ac1bd57616a78d2105ba3a4fc72bbf49973e 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2248,7 +2248,8 @@ static void do_pci_disable_device(struct pci_dev *dev)
> pci_read_config_word(dev, PCI_COMMAND, &pci_command);
> if (pci_command & PCI_COMMAND_MASTER) {
> pci_command &= ~PCI_COMMAND_MASTER;
> - pci_write_config_word(dev, PCI_COMMAND, pci_command);
> + if (!(dev->dev.lu.flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)))
> + pci_write_config_word(dev, PCI_COMMAND, pci_command);
I think changing the semantics of interfaces like this is a problem
because callers rely on the existing semantics, and it's hard to
reason about how this change would affect them. How would you update
the kernel-doc to reflect this change?
do_pci_disable_device() is used in the PM suspend, freeze, and
poweroff paths. I suppose those paths are allowed even when devices
have been marked with LU_BUSMASTER/LU_BUSMASTER_BRIDGE? And I assume
you probably would want the existing semantics there?
I.e., if a device has been marked with LU_BUSMASTER, you want to keep
its bus mastering enabled across a liveupdate kexec. But if we
suspend before doing the kexec, I assume we would still want to clear
bus mastering on suspend and restore bus mastering on resume?
The other path that uses do_pci_disable_device() is
pci_disable_device(), which is primarily used in driver .remove()
methods. You have to modify drivers to support liveupdate anyway, so
if we call driver .remove() methods during a liveupdate kexec, I think
you should change the .remove() method so it only calls
pci_disable_device() when you want bus mastering disabled.
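As a rough userspace illustration of that last suggestion (this is not
kernel code: the mock_* names are made up, and the MOCK_LU_* values are
placeholders, since the real LU_BUSMASTER values are not shown in this
patch), the disable helper keeps its existing semantics and the
liveupdate decision moves into the driver's remove path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit as PCI_COMMAND_MASTER in the PCI command register. */
#define MOCK_PCI_COMMAND_MASTER  0x4

/* Hypothetical stand-ins for LU_BUSMASTER / LU_BUSMASTER_BRIDGE. */
#define MOCK_LU_BUSMASTER        0x1
#define MOCK_LU_BUSMASTER_BRIDGE 0x2

/* Minimal mock device: only the command register and liveupdate flags. */
struct mock_lu_pci_dev {
	uint16_t command;
	unsigned int lu_flags;
};

/* Unchanged semantics: always clears bus mastering, as callers expect. */
void mock_pci_disable_device(struct mock_lu_pci_dev *dev)
{
	assert(dev);
	dev->command &= (uint16_t)~MOCK_PCI_COMMAND_MASTER;
}

/* The driver, not the PCI core helper, decides whether to disable. */
void mock_driver_remove(struct mock_lu_pci_dev *dev)
{
	bool preserve;

	assert(dev);
	preserve = dev->lu_flags &
		   (MOCK_LU_BUSMASTER | MOCK_LU_BUSMASTER_BRIDGE);
	if (!preserve)
		mock_pci_disable_device(dev);
}
```

With this shape the PM suspend, freeze, and poweroff paths would see no
change in behavior, because the helper itself never looks at the
liveupdate flags.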
> }
>
> pcibios_disable_device(dev);
> @@ -4276,7 +4277,9 @@ static void __pci_set_master(struct pci_dev *dev, bool enable)
> if (cmd != old_cmd) {
> pci_dbg(dev, "%s bus mastering\n",
> enable ? "enabling" : "disabling");
> - pci_write_config_word(dev, PCI_COMMAND, cmd);
> +
> + if (!(dev->dev.lu.flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)))
> + pci_write_config_word(dev, PCI_COMMAND, cmd);
> }
> dev->is_busmaster = enable;
> }
>
> --
> 2.51.0.384.g4c02a37b29-goog
>
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-16 7:45 ` [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list Chris Li
@ 2025-09-29 17:46 ` Jason Gunthorpe
2025-09-30 2:13 ` Chris Li
2025-10-03 5:33 ` Chris Li
2025-09-30 15:26 ` Greg Kroah-Hartman
1 sibling, 2 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-29 17:46 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> static int pci_liveupdate_prepare(void *arg, u64 *data)
> {
> + LIST_HEAD(requested_devices);
> +
> pr_info("prepare data[%llx]\n", *data);
> +
> + pci_lock_rescan_remove();
> + down_write(&pci_bus_sem);
> +
> + build_liveupdate_devices(&requested_devices);
> + cleanup_liveupdate_devices(&requested_devices);
> +
> + up_write(&pci_bus_sem);
> + pci_unlock_rescan_remove();
> return 0;
> }
This doesn't seem conceptually right, PCI should not be preserving
everything. Only devices and their related hierarchy that are opted
into live update by iommufd should be preserved.
Jason
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-16 7:45 ` [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver Chris Li
@ 2025-09-29 17:48 ` Jason Gunthorpe
2025-09-30 2:11 ` Chris Li
2025-09-30 15:27 ` Greg Kroah-Hartman
1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-29 17:48 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> After the list of preserved devices is constructed, the PCI subsystem can
> now forward the liveupdate request to the driver.
This also seems completely backwards for how iommufd should be
working. It doesn't want callbacks triggered on prepare, it wants to
drive everything from its own ioctl.
Let's just do one thing at a time please and make this series about
iommufd to match the other luo series for iommufd.
non-iommufd cases can be proposed in their own series.
Jason
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-16 7:45 ` [PATCH v2 06/10] PCI/LUO: Save and restore driver name Chris Li
@ 2025-09-29 17:57 ` Jason Gunthorpe
2025-09-30 2:10 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-29 17:57 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> Save the PCI driver name into "struct pci_dev_ser" during the PCI
> prepare callback.
>
> After kexec, use driver_set_override() to ensure the device is
> bound only to the saved driver.
This doesn't seem like a great idea, driver name should not be made
ABI.
I would drop this patch and punt to the initrd. We need a more
flexible way to manage driver auto binding for CC under initrd control
anyhow, the same should be reused for hypervisors to shift driver
binding policy to userspace.
Jason
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-09-29 15:04 ` Bjorn Helgaas
@ 2025-09-29 18:13 ` Chris Li
2025-10-07 23:32 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-29 18:13 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Mon, Sep 29, 2025 at 8:04 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sat, Sep 27, 2025 at 02:05:38PM -0400, Pasha Tatashin wrote:
> > Hi Bjorn,
> >
> > My latest submission is the following:
> > https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/
> >
> > And github repo is in cover letter:
> >
> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> >
> > It applies cleanly against the mainline without the first three
> > patches, as they were already merged.
>
> Not sure what I'm missing. I've tried various things but none apply
> cleanly:
Sorry about that. Let me do a refresh of the LUOPCI V3 patch and send
out the git repo link as well. The issue is that luopci depends on
other patches that are not in the mainline kernel. Using a git repo
would make it easier to get a working tree.
Working on it now, please stay tuned.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-29 17:57 ` Jason Gunthorpe
@ 2025-09-30 2:10 ` Chris Li
2025-09-30 13:02 ` Pasha Tatashin
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-30 2:10 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > prepare callback.
> >
> > After kexec, use driver_set_override() to ensure the device is
> > bound only to the saved driver.
>
> This doesn't seem like a great idea, driver name should not be made
> ABI.
Let's break it down with baby steps.
1) Do you agree the liveupdated PCI device needs to bind to the exact
same driver after kexec?
To me that is a firm yes. If the device binds to another driver, we
can't expect the other driver to understand the original driver's
saved state.
2) Assuming your answer to 1) is yes, are you just unhappy that the
kernel saves the driver name? Do you want user space to save it
instead?
Otherwise, how do we reference the driver after kexec? If drivers had
a UUID, I would be happy to use it, but they don't. Using the driver
name fits the existing PCI driver_override framework. If we do not use
the driver_override API, we need some other API to prevent the device
from binding to other drivers.
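To make the driver_override point concrete, here is a minimal userspace
sketch of the name check as I understand it (the mock_* types and the
function name are invented for this example; the real check lives in
the PCI match path and also does the usual PCI ID matching, which is
left out here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mock device: only the override string matters for this sketch. */
struct mock_ovr_dev {
	const char *driver_override;	/* NULL when no override is set */
};

/* Mock driver: identified by name, as with driver_override today. */
struct mock_ovr_driver {
	const char *name;
};

/*
 * With an override set, only the driver with exactly that name may
 * bind. Without one, the name places no restriction on binding.
 */
bool mock_override_allows(const struct mock_ovr_dev *dev,
			  const struct mock_ovr_driver *drv)
{
	assert(dev && drv && drv->name);
	if (!dev->driver_override)
		return true;
	return strcmp(dev->driver_override, drv->name) == 0;
}
```

Restoring the saved name into the override at enumeration time is what
would keep any other driver from claiming the device.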
Do you just want user space (the initrd), rather than the kernel, to
save the driver name? Someone needs to set that driver_override when
the PCI device is enumerated. Setting it from the initrd before PCI
enumeration would be too early, because the saved PCI device state has
not been found yet. Setting it after PCI enumeration would be too late.
> I would drop this patch and punt to the initrd. We need a more
> flexible way to manage driver auto binding for CC under initrd control
> anyhow, the same should be reused for hypervisors to shift driver
> binding policy to userspace.
What does CC stand for?
Once in the liveupdate, the liveupdated device and its bound driver
are fixed. It seems (to me) more complicated to let the initrd fetch
the liveupdate saved state and then act on it. The initrd is not part
of the kernel; it is more like user space programming. It cannot use
the LUO API to get the list of preserved PCI devices, etc. We could
add an API that lets user space access the preserved data in the
kernel, but that seems like extra complexity.
Once a liveupdate is in progress, there is no flexible driver binding
policy for a device that is being liveupdated; the device needs to
bind to its original driver.
I feel that I am missing something, please help me understand.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-29 17:48 ` Jason Gunthorpe
@ 2025-09-30 2:11 ` Chris Li
2025-09-30 16:38 ` Jason Gunthorpe
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-30 2:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 10:48 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > After the list of preserved devices is constructed, the PCI subsystem can
> > now forward the liveupdate request to the driver.
>
> This also seems completely backwards for how iommufd should be
> working. It doesn't want callbacks triggered on prepare, it wants to
> drive everything from its own ioctl.
This series is about basic PCI device support, not IOMMUFD.
> Let's just do one thing at a time please and make this series about
> iommufd to match the other luo series for iommufd.
I am confused by you.
> non-iommufd cases can be proposed in their own series.
This is that non-iommufd series.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-29 17:46 ` Jason Gunthorpe
@ 2025-09-30 2:13 ` Chris Li
2025-09-30 16:47 ` Jason Gunthorpe
2025-10-03 5:33 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-30 2:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 10:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> > static int pci_liveupdate_prepare(void *arg, u64 *data)
> > {
> > + LIST_HEAD(requested_devices);
> > +
> > pr_info("prepare data[%llx]\n", *data);
> > +
> > + pci_lock_rescan_remove();
> > + down_write(&pci_bus_sem);
> > +
> > + build_liveupdate_devices(&requested_devices);
> > + cleanup_liveupdate_devices(&requested_devices);
> > +
> > + up_write(&pci_bus_sem);
> > + pci_unlock_rescan_remove();
> > return 0;
> > }
>
> This doesn't seem conceptually right, PCI should not be preserving
> everything. Only devices and their related hierarchy that are opted
> into live update by iommufd should be preserved.
Can you elaborate? This is not preserving everything. To preserve bus
master, only the device and its parent PCI bridges are added to the
requested_devices list. That is done in build_liveupdate_devices();
the device is added to the list head passed into the function. So it
matches the "their related hierarchy" part.
Can you explain which unnecessary device is preserved here?
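A simplified userspace sketch of that walk, for illustration only (the
mock_* names, the MOCK_LU_* values, and the array-based list are all
assumptions for this example and not the code in the patch; the list
ordering is also just what this sketch happens to produce):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for LU_BUSMASTER / LU_BUSMASTER_BRIDGE. */
#define MOCK_LU_BUSMASTER        0x1
#define MOCK_LU_BUSMASTER_BRIDGE 0x2

struct mock_dev {
	const char *name;
	struct mock_dev *parent;	/* upstream bridge, NULL at the root */
	unsigned int lu_flags;
	int on_list;			/* already collected? */
};

/*
 * Append the requested device and every bridge above it to list[].
 * Devices that nobody requested and that are not on the path to the
 * root are never touched. Returns the new number of list entries.
 */
size_t mock_add_with_bridges(struct mock_dev *dev, struct mock_dev **list,
			     size_t count, size_t cap)
{
	struct mock_dev *d;

	assert(dev && list);
	for (d = dev; d; d = d->parent) {
		if (d->on_list)
			break;	/* it and its ancestors were added earlier */
		if (count == cap)
			break;	/* list full; this sketch simply stops */
		if (d != dev)
			d->lu_flags |= MOCK_LU_BUSMASTER_BRIDGE;
		d->on_list = 1;
		list[count++] = d;
	}
	return count;
}
```

A sibling device under the same bridge stays off the list unless it was
requested itself.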
It does not use iommufd, just the dev->lu.flags flags, which are set
when VFIO adds a device to a VM. Here the PCI LU test driver just sets
the flag in its probe function to simulate that VFIO ioctl.
Iommufd is a separate series, not this one.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 2:10 ` Chris Li
@ 2025-09-30 13:02 ` Pasha Tatashin
2025-09-30 13:41 ` Greg Kroah-Hartman
2025-09-30 16:37 ` Jason Gunthorpe
0 siblings, 2 replies; 84+ messages in thread
From: Pasha Tatashin @ 2025-09-30 13:02 UTC (permalink / raw)
To: Chris Li
Cc: Jason Gunthorpe, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 10:10 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > > prepare callback.
> > >
> > > After kexec, use driver_set_override() to ensure the device is
> > > bound only to the saved driver.
> >
> > This doesn't seem like a great idea, driver name should not be made
> > ABI.
>
> Let's break it down with baby steps.
>
> 1) Do you agree the liveupdated PCI device needs to bind to the exact
> same driver after kexec?
> To me that is a firm yes. If the device binds to another driver, we
> can't expect the other driver will understand the original driver's
> saved state.
Hi Chris,
Driver name does not have to be an ABI. Drivers that support live
updates should provide a live update-specific ABI to detect
compatibility with the preserved data. We can use a preservation
schema GUID for this.
> 2) Assume the 1) is yes from you. Are you just not happy that the
> kernel saves the driver name? You want user space to save it, is that
> it?
> How does it reference the driver after kexec otherwise?
If we use GUID, drivers would advertise the GUIDs they support and we
would modify the core device-driver matching process to use this
information.
Each driver that supports this mechanism would need to declare an
array of GUIDs it is compatible with. This would be a new field in its
struct pci_driver.
static const guid_t my_driver_guids[] = {
GUID_INIT(0x123e4567, ...), // Schema V1
GUID_INIT(0x987a6543, ...), // Schema V2
{},
};
static struct pci_driver my_pci_driver = {
.name = "my_driver",
.id_table = my_pci_ids,
.probe = my_probe,
.live_update_guids = my_driver_guids,
};
The kernel's PCI core would perform an extra check before falling back
to the standard PCI ID matching.
1. When a PCI device is discovered, the core first asks the Live
Update framework: "Is there a preserved GUID for this device?"
2. If a GUID is found, the core will only attempt to bind drivers that
both match the device's PCI ID and have that specific GUID in their
live_update_guids list.
3. If no GUID is preserved for the device, the core proceeds with the
normal matching logic.
4. If no driver matches the GUID, the device is left unbound. The
state gets removed during finish(), and the device is reset.
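A rough userspace sketch of the four steps above, to show the intended
decision only (struct mock_guid stands in for the kernel's guid_t, and
every mock_* name is hypothetical; PCI ID matching is assumed to have
passed already and is not modeled):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for guid_t: 16 raw bytes. */
struct mock_guid {
	unsigned char b[16];
};

struct mock_lu_driver {
	const char *name;
	/* Array terminated by an all-zero GUID; NULL if unsupported. */
	const struct mock_guid *live_update_guids;
};

static bool mock_guid_is_null(const struct mock_guid *g)
{
	static const struct mock_guid zero = { { 0 } };

	return memcmp(g, &zero, sizeof(zero)) == 0;
}

/* Does the driver list this preservation schema GUID? */
bool mock_driver_has_guid(const struct mock_lu_driver *drv,
			  const struct mock_guid *guid)
{
	const struct mock_guid *p;

	assert(drv && guid);
	if (!drv->live_update_guids)
		return false;
	for (p = drv->live_update_guids; !mock_guid_is_null(p); p++)
		if (memcmp(p, guid, sizeof(*p)) == 0)
			return true;
	return false;
}

/*
 * preserved == NULL: nothing was preserved, normal matching (step 3).
 * Otherwise only a driver listing that GUID may bind (step 2). If no
 * driver returns true, the device stays unbound (step 4).
 */
bool mock_may_bind(const struct mock_lu_driver *drv,
		   const struct mock_guid *preserved)
{
	if (!preserved)
		return true;
	return mock_driver_has_guid(drv, preserved);
}
```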
Pasha
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 13:02 ` Pasha Tatashin
@ 2025-09-30 13:41 ` Greg Kroah-Hartman
2025-09-30 14:53 ` Pasha Tatashin
2025-09-30 15:41 ` Chris Li
2025-09-30 16:37 ` Jason Gunthorpe
1 sibling, 2 replies; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-09-30 13:41 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 09:02:44AM -0400, Pasha Tatashin wrote:
> On Mon, Sep 29, 2025 at 10:10 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > > > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > > > prepare callback.
> > > >
> > > > After kexec, use driver_set_override() to ensure the device is
> > > > bound only to the saved driver.
> > >
> > > This doesn't seem like a great idea, driver name should not be made
> > > ABI.
> >
> > Let's break it down with baby steps.
> >
> > 1) Do you agree the liveupdated PCI device needs to bind to the exact
> > same driver after kexec?
> > To me that is a firm yes. If the device binds to another driver, we
> > can't expect the other driver will understand the original driver's
> > saved state.
>
> Hi Chris,
>
> Driver name does not have to be an ABI.
A driver name can NEVER be an abi, please don't do that.
> Drivers that support live
> updates should provide a live update-specific ABI to detect
> compatibility with the preserved data. We can use a preservation
> schema GUID for this.
>
> > 2) Assume the 1) is yes from you. Are you just not happy that the
> > kernel saves the driver name? You want user space to save it, is that
> > it?
> > How does it reference the driver after kexec otherwise?
>
> If we use GUID, drivers would advertise the GUIDs they support and we
> would modify the core device-driver matching process to use this
> information.
>
> Each driver that supports this mechanism would need to declare an
> array of GUIDs it is compatible with. This would be a new field in its
> struct pci_driver.
>
> static const guid_t my_driver_guids[] = {
> GUID_INIT(0x123e4567, ...), // Schema V1
> GUID_INIT(0x987a6543, ...), // Schema V2
> {},
> };
That's crazy, who is going to be adding all of that to all drivers? And
knowing to bump this if the internal data representation changes? And it
will change underneath it without the driver even knowing? This feels
really really wrong, unless I'm missing something.
> static struct pci_driver my_pci_driver = {
> .name = "my_driver",
> .id_table = my_pci_ids,
> .probe = my_probe,
> .live_update_guids = my_driver_guids,
> };
>
> The kernel's PCI core would perform an extra check before falling back
> to the standard PCI ID matching.
> 1. When a PCI device is discovered, the core first asks the Live
> Update framework: "Is there a preserved GUID for this device?"
> 2. If a GUID is found, the core will only attempt to bind drivers that
> both match the device's PCI ID and have that specific GUID in their
> live_update_guids list.
What "core" is doing this? And how exactly?
And why is PCI somehow special here?
> 3. If no GUID is preserved for the device, the core proceeds with the
> normal matching logic
> 4. If no driver matches the GUID, the device is left unbound. The
> state gets removed during finish(), and the device is reset.
How do you reset a device you are not bound to? That feels ripe for
causing problems (think multi-function devices...)
And what about PCI drivers that are really just an aux-bus "root" point?
How is the sharing of all of the child devices going to work?
This feels really rough and might possibly work if you squint hard
enough and test it in a very limited way with almost no real hardware :)
good luck!
greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 13:41 ` Greg Kroah-Hartman
@ 2025-09-30 14:53 ` Pasha Tatashin
2025-09-30 15:08 ` Greg Kroah-Hartman
2025-09-30 15:41 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Pasha Tatashin @ 2025-09-30 14:53 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 9:41 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 30, 2025 at 09:02:44AM -0400, Pasha Tatashin wrote:
> > On Mon, Sep 29, 2025 at 10:10 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > > > > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > > > > prepare callback.
> > > > >
> > > > > After kexec, use driver_set_override() to ensure the device is
> > > > > bound only to the saved driver.
> > > >
> > > > This doesn't seem like a great idea, driver name should not be made
> > > > ABI.
> > >
> > > Let's break it down with baby steps.
> > >
> > > 1) Do you agree the liveupdated PCI device needs to bind to the exact
> > > same driver after kexec?
> > > To me that is a firm yes. If the device binds to another driver, we
> > > can't expect the other driver will understand the original driver's
> > > saved state.
> >
> > Hi Chris,
> >
> > Driver name does not have to be an ABI.
>
> A driver name can NEVER be an abi, please don't do that.
>
> > Drivers that support live
> > updates should provide a live update-specific ABI to detect
> > compatibility with the preserved data. We can use a preservation
> > schema GUID for this.
> >
> > > 2) Assume the 1) is yes from you. Are you just not happy that the
> > > kernel saves the driver name? You want user space to save it, is that
> > > it?
> > > How does it reference the driver after kexec otherwise?
> >
> > If we use GUID, drivers would advertise the GUIDs they support and we
> > would modify the core device-driver matching process to use this
> > information.
> >
> > Each driver that supports this mechanism would need to declare an
> > array of GUIDs it is compatible with. This would be a new field in its
> > struct pci_driver.
> >
> > static const guid_t my_driver_guids[] = {
> > GUID_INIT(0x123e4567, ...), // Schema V1
> > GUID_INIT(0x987a6543, ...), // Schema V2
> > {},
> > };
>
> That's crazy, who is going to be adding all of that to all drivers? And
Only to the drivers that support live updates; that would be just a
few drivers.
> knowing to bump this if the internal data representation changes? And it
> will change underneath it without the driver even knowing? This feels
> really really wrong, unless I'm missing something.
A driver that preserves state across a reboot already has an implicit
contract with its future self about that data's format. The GUID
simply makes that contract explicit and machine-checkable. It does not
have to be GUID, but nevertheless there has to be a specific contract.
Pasha
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 14:53 ` Pasha Tatashin
@ 2025-09-30 15:08 ` Greg Kroah-Hartman
2025-09-30 15:56 ` Pasha Tatashin
0 siblings, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-09-30 15:08 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 10:53:50AM -0400, Pasha Tatashin wrote:
> On Tue, Sep 30, 2025 at 9:41 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Sep 30, 2025 at 09:02:44AM -0400, Pasha Tatashin wrote:
> > > On Mon, Sep 29, 2025 at 10:10 PM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > > > > > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > > > > > prepare callback.
> > > > > >
> > > > > > After kexec, use driver_set_override() to ensure the device is
> > > > > > bound only to the saved driver.
> > > > >
> > > > > This doesn't seem like a great idea, driver name should not be made
> > > > > ABI.
> > > >
> > > > Let's break it down with baby steps.
> > > >
> > > > 1) Do you agree the liveupdated PCI device needs to bind to the exact
> > > > same driver after kexec?
> > > > To me that is a firm yes. If the driver binds to another driver, we
> > > > can't expect the other driver will understand the original driver's
> > > > saved state.
> > >
> > > Hi Chris,
> > >
> > > Driver name does not have to be an ABI.
> >
> > A driver name can NEVER be an abi, please don't do that.
> >
> > > Drivers that support live
> > > updates should provide a live update-specific ABI to detect
> > > compatibility with the preserved data. We can use a preservation
> > > schema GUID for this.
> > >
> > > > 2) Assume the 1) is yes from you. Are you just not happy that the
> > > > kernel saves the driver name? You want user space to save it, is that
> > > > it?
> > > > How does it reference the driver after kexec otherwise?
> > >
> > > If we use GUID, drivers would advertise the GUIDs they support and we
> > > would modify the core device-driver matching process to use this
> > > information.
> > >
> > > Each driver that supports this mechanism would need to declare an
> > > array of GUIDs it is compatible with. This would be a new field in its
> > > struct pci_driver.
> > >
> > > static const guid_t my_driver_guids[] = {
> > > GUID_INIT(0x123e4567, ...), // Schema V1
> > > GUID_INIT(0x987a6543, ...), // Schema V2
> > > {},
> > > };
> >
> > That's crazy, who is going to be adding all of that to all drivers? And
>
> Only to the drivers that support live updates, that would be just a few drivers.
>
> > knowing to bump this if the internal data representation changes? And it
> > will change underneath it without the driver even knowing? This feels
> > really really wrong, unless I'm missing something.
>
> A driver that preserves state across a reboot already has an implicit
> contract with its future self about that data's format. The GUID
> simply makes that contract explicit and machine-checkable. It does not
> have to be GUID, but nevertheless there has to be a specific contract.
So how are you going to "version" these GUIDs? I see you use "schema Vx"
above, but how is that really going to work in the end? Lots of data
structures change underneath the base driver that it knows nothing
about, not to mention basic things like compiler flags and the like
(think about how we have changed things for spectre issues over the
years...)
And when can you delete an old "schema"? This feels like you are
forcing future developers to maintain things "for forever"...
thanks,
greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator
2025-09-16 7:45 ` [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
@ 2025-09-30 15:15 ` Greg Kroah-Hartman
2025-09-30 23:41 ` Chris Li
2025-09-30 15:17 ` Greg Kroah-Hartman
1 sibling, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-09-30 15:15 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:09AM -0700, Chris Li wrote:
> Register PCI subsystem with the Liveupdate Orchestrator
> and provide noop liveupdate callbacks.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> MAINTAINERS | 2 ++
> drivers/pci/Makefile | 1 +
> drivers/pci/liveupdate.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 57 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 91cec3288cc81aea199f730924eee1f5fda1fd72..85749a5da69f88544ccc749e9d723b1b54c0e3b7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14014,11 +14014,13 @@ F: tools/testing/selftests/livepatch/
>
> LIVE UPDATE
> M: Pasha Tatashin <pasha.tatashin@soleen.com>
> +M: Chris Li <chrisl@kernel.org>
> L: linux-kernel@vger.kernel.org
> S: Maintained
> F: Documentation/ABI/testing/sysfs-kernel-liveupdate
> F: Documentation/admin-guide/liveupdate.rst
> F: drivers/misc/liveupdate/
> +F: drivers/pci/liveupdate/
> F: include/linux/liveupdate.h
> F: include/uapi/linux/liveupdate.h
> F: tools/testing/selftests/liveupdate/
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 67647f1880fb8fb0629d680398f5b88d69aac660..aa1bac7aed7d12c641a6b55e56176fb3cdde4c91 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
> obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> obj-$(CONFIG_PCI_NPEM) += npem.o
> obj-$(CONFIG_PCIE_TPH) += tph.o
> +obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
>
> # Endpoint library must be initialized before its users
> obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
> diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..86b4f3a2fb44781c6e323ba029db510450556fa9
> --- /dev/null
> +++ b/drivers/pci/liveupdate.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Chris Li <chrisl@kernel.org>
> + */
> +
> +#define pr_fmt(fmt) "PCI liveupdate: " fmt
> +
> +#include <linux/liveupdate.h>
> +
> +#define PCI_SUBSYSTEM_NAME "pci"
> +
> +static int pci_liveupdate_prepare(void *arg, u64 *data)
> +{
> + pr_info("prepare data[%llx]\n", *data);
You do know that's a security bug, right?
Please don't do this, even in "debug" code, as it can escape into the
wild...
thanks,
greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator
2025-09-16 7:45 ` [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
2025-09-30 15:15 ` Greg Kroah-Hartman
@ 2025-09-30 15:17 ` Greg Kroah-Hartman
2025-09-30 23:38 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-09-30 15:17 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:09AM -0700, Chris Li wrote:
> Register PCI subsystem with the Liveupdate Orchestrator
> and provide noop liveupdate callbacks.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> MAINTAINERS | 2 ++
> drivers/pci/Makefile | 1 +
> drivers/pci/liveupdate.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 57 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 91cec3288cc81aea199f730924eee1f5fda1fd72..85749a5da69f88544ccc749e9d723b1b54c0e3b7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14014,11 +14014,13 @@ F: tools/testing/selftests/livepatch/
>
> LIVE UPDATE
> M: Pasha Tatashin <pasha.tatashin@soleen.com>
> +M: Chris Li <chrisl@kernel.org>
> L: linux-kernel@vger.kernel.org
> S: Maintained
> F: Documentation/ABI/testing/sysfs-kernel-liveupdate
> F: Documentation/admin-guide/liveupdate.rst
> F: drivers/misc/liveupdate/
> +F: drivers/pci/liveupdate/
> F: include/linux/liveupdate.h
> F: include/uapi/linux/liveupdate.h
> F: tools/testing/selftests/liveupdate/
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 67647f1880fb8fb0629d680398f5b88d69aac660..aa1bac7aed7d12c641a6b55e56176fb3cdde4c91 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
> obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> obj-$(CONFIG_PCI_NPEM) += npem.o
> obj-$(CONFIG_PCIE_TPH) += tph.o
> +obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
>
> # Endpoint library must be initialized before its users
> obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
> diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..86b4f3a2fb44781c6e323ba029db510450556fa9
> --- /dev/null
> +++ b/drivers/pci/liveupdate.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Chris Li <chrisl@kernel.org>
> + */
> +
> +#define pr_fmt(fmt) "PCI liveupdate: " fmt
> +
> +#include <linux/liveupdate.h>
> +
> +#define PCI_SUBSYSTEM_NAME "pci"
> +
> +static int pci_liveupdate_prepare(void *arg, u64 *data)
> +{
> + pr_info("prepare data[%llx]\n", *data);
> + return 0;
> +}
> +
> +static int pci_liveupdate_freeze(void *arg, u64 *data)
> +{
> + pr_info("freeze data[%llx]\n", *data);
> + return 0;
> +}
> +
> +static void pci_liveupdate_cancel(void *arg, u64 data)
> +{
> + pr_info("cancel data[%llx]\n", data);
> +}
> +
> +static void pci_liveupdate_finish(void *arg, u64 data)
> +{
> + pr_info("finish data[%llx]\n", data);
> +}
> +
> +struct liveupdate_subsystem pci_liveupdate_ops = {
> + .prepare = pci_liveupdate_prepare,
> + .freeze = pci_liveupdate_freeze,
> + .cancel = pci_liveupdate_cancel,
> + .finish = pci_liveupdate_finish,
> + .name = PCI_SUBSYSTEM_NAME,
> +};
> +
> +static int __init pci_liveupdate_init(void)
> +{
> + int ret;
> +
> + ret = liveupdate_register_subsystem(&pci_liveupdate_ops);
> + if (ret && liveupdate_state_updated())
> + panic("PCI liveupdate: Register subsystem failed: %d", ret);
> + WARN(ret, "PCI liveupdate: Register subsystem failed %d", ret);
But this didn't fail.
And you just crashed the box if panic-on-warn is enabled, so if some
test infrastructure builds this first patch, boom.
{sigh}
If you are going to do a "dummy" driver, please make it at least work
and not do anything bad.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-16 7:45 ` [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list Chris Li
2025-09-29 17:46 ` Jason Gunthorpe
@ 2025-09-30 15:26 ` Greg Kroah-Hartman
2025-10-03 6:57 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-09-30 15:26 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> #define pr_fmt(fmt) "PCI liveupdate: " fmt
> +#define dev_fmt(fmt) "PCI liveupdate: " fmt
Please no. Use the default dev_ formatting so that people can correctly
track the devices spitting out messages here.
> +#include <linux/types.h>
> #include <linux/liveupdate.h>
> +#include "pci.h"
>
> #define PCI_SUBSYSTEM_NAME "pci"
I still don't know why this is needed, why?
>
> +static void stack_push_buses(struct list_head *stack, struct list_head *buses)
> +{
> + struct pci_bus *bus;
> +
> + list_for_each_entry(bus, buses, node)
> + list_move_tail(&bus->dev.lu.lu_next, stack);
> +}
> +
> +static void liveupdate_add_dev(struct device *dev, struct list_head *head)
> +{
> + dev_info(dev, "collect liveupdate device: flags %x\n", dev->lu.flags);
Debugging code can go away please.
> + list_move_tail(&dev->lu.lu_next, head);
> +}
> +
> +static int collect_bus_devices_reverse(struct pci_bus *bus, struct list_head *head)
> +{
> + struct pci_dev *pdev;
> + int count = 0;
> +
> + list_for_each_entry_reverse(pdev, &bus->devices, bus_list) {
Why are you allowed to walk the pci bus list here? Shouldn't there be
some type of core function to do that?
And why in reverse?
> + if (pdev->dev.lu.flags & LU_BUSMASTER && pdev->dev.parent)
> + pdev->dev.parent->lu.flags |= LU_BUSMASTER_BRIDGE;
> + if (pdev->dev.lu.flags) {
> + liveupdate_add_dev(&pdev->dev, head);
> + count++;
No locking?
> + }
> + }
> + return count;
What prevents this value from changing right after you return it?
> +}
> +
> +static int build_liveupdate_devices(struct list_head *head)
> +{
> + LIST_HEAD(bus_stack);
> + int count = 0;
> +
> + stack_push_buses(&bus_stack, &pci_root_buses);
> +
> + while (!list_empty(&bus_stack)) {
> + struct device *busdev;
> + struct pci_bus *bus;
> +
> + busdev = list_last_entry(&bus_stack, struct device, lu.lu_next);
> + bus = to_pci_bus(busdev);
> + if (!busdev->lu.visited && !list_empty(&bus->children)) {
> + stack_push_buses(&bus_stack, &bus->children);
> + busdev->lu.visited = 1;
> + continue;
> + }
> +
> + count += collect_bus_devices_reverse(bus, head);
> + busdev->lu.visited = 0;
> + list_del_init(&busdev->lu.lu_next);
> + }
> + return count;
A comment here about what you are trying to do with walking the list of
devices. Somehow. Are you sure that's right? It feels backwards, and
the lack of any locking makes me very nervous. How is this integrating
into the normal driver model lists?
> +}
> +
> +static void cleanup_liveupdate_devices(struct list_head *head)
> +{
> + struct device *d, *n;
> +
> + list_for_each_entry_safe(d, n, head, lu.lu_next) {
> + d->lu.flags &= ~LU_DEPENDED;
> + list_del_init(&d->lu.lu_next);
> + }
> +}
What does "cleanup" mean?
> +
> static int pci_liveupdate_prepare(void *arg, u64 *data)
> {
> + LIST_HEAD(requested_devices);
> +
> pr_info("prepare data[%llx]\n", *data);
Addresses written to the kernel log?
> +
> + pci_lock_rescan_remove();
> + down_write(&pci_bus_sem);
> +
> + build_liveupdate_devices(&requested_devices);
Ah, you lock here. Document the heck out of this and put the proper
build macros in there so we know what is going on.
> + cleanup_liveupdate_devices(&requested_devices);
> +
> + up_write(&pci_bus_sem);
Why is it a write? You aren't modifying the list, are you?
> + pci_unlock_rescan_remove();
> return 0;
> }
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index e8318fd5f6ed537a1b236a3a0f054161d5710abd..0e9ef387182856771d857181d88f376632b46f0d 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -304,6 +304,7 @@ static int pcie_device_init(struct pci_dev *pdev, int service, int irq)
> device = &pcie->device;
> device->bus = &pcie_port_bus_type;
> device->release = release_pcie_device; /* callback to free pcie dev */
> + dev_liveupdate_init(device);
Why here?
> dev_set_name(device, "%s:pcie%03x",
> pci_name(pdev),
> get_descriptor_id(pci_pcie_type(pdev), service));
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4b8693ec9e4c67fc1655e0057b3b96b4098e6630..dddd7ebc03d1a6e6ee456e0bf02ab9833a819509 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -614,6 +614,7 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
> INIT_LIST_HEAD(&b->devices);
> INIT_LIST_HEAD(&b->slots);
> INIT_LIST_HEAD(&b->resources);
> + dev_liveupdate_init(&b->dev);
Same, why here? Shouldn't the driver core be doing this all for you
automatically? Are you going to make each bus do this manually?
> b->max_bus_speed = PCI_SPEED_UNKNOWN;
> b->cur_bus_speed = PCI_SPEED_UNKNOWN;
> #ifdef CONFIG_PCI_DOMAINS_GENERIC
> @@ -1985,6 +1986,7 @@ int pci_setup_device(struct pci_dev *dev)
> dev->sysdata = dev->bus->sysdata;
> dev->dev.parent = dev->bus->bridge;
> dev->dev.bus = &pci_bus_type;
> + dev_liveupdate_init(&dev->dev);
Looks like you are :(
Do it in one place please.
> dev->hdr_type = hdr_type & 0x7f;
> dev->multifunction = !!(hdr_type & 0x80);
> dev->error_state = pci_channel_io_normal;
> @@ -3184,7 +3186,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
> return NULL;
>
> bridge->dev.parent = parent;
> -
> + dev_liveupdate_init(&bridge->dev);
Again, one place.
> list_splice_init(resources, &bridge->windows);
> bridge->sysdata = sysdata;
> bridge->busnr = bus;
> diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
> new file mode 100644
> index 0000000000000000000000000000000000000000..72297cba08a999e89f7bc0997dabdbe14e0aa12c
> --- /dev/null
> +++ b/include/linux/dev_liveupdate.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + * Chris Li <chrisl@kernel.org>
> + */
> +#ifndef _LINUX_DEV_LIVEUPDATE_H
> +#define _LINUX_DEV_LIVEUPDATE_H
> +
> +#include <linux/liveupdate.h>
> +
> +#ifdef CONFIG_LIVEUPDATE
> +
> +enum liveupdate_flag {
> + LU_BUSMASTER = 1 << 0,
> + LU_BUSMASTER_BRIDGE = 2 << 0,
BIT() please.
> +};
> +
> +#define LU_REQUESTED (LU_BUSMASTER)
> +#define LU_DEPENDED (LU_BUSMASTER_BRIDGE)
Why 2 names for the same thing?
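For reference, a minimal userspace sketch of what the BIT() request above
would look like. The BIT() definition here is a stand-in for the kernel
macro from <linux/bits.h>, the comments on the flags are inferred from how
the patch uses them, and lu_flags_any() is a hypothetical helper added only
to show the flags in use.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(nr) (1UL << (nr))

enum liveupdate_flag {
	LU_BUSMASTER		= BIT(0),	/* device asked to keep bus mastering */
	LU_BUSMASTER_BRIDGE	= BIT(1),	/* bridge above such a device */
};

/* Hypothetical helper: nonzero if any live update flag is set in @flags. */
int lu_flags_any(unsigned int flags)
{
	return (flags & (LU_BUSMASTER | LU_BUSMASTER_BRIDGE)) != 0;
}
```

Note that "2 << 0" already evaluates to 2, the same value as BIT(1), so the
change affects only readability, not the resulting flag values.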
> +
> +/**
> + * struct dev_liveupdate - Device state for live update operations
> + * @lu_next: List head for linking the device into live update
> + * related lists (e.g., a list of devices participating
> + * in a live update sequence).
> + * @flags: Indicate what liveupdate feature does the device
> + * participate.
> + * @visited: Only used by the bus devices when traversing the PCI buses
> + * to build the liveupdate devices list. Set if the child
> + * buses have been pushed into the pending stack.
> + *
> + * This structure holds the state information required for performing
> + * live update operations on a device. It is embedded within a struct device.
> + */
> +struct dev_liveupdate {
> + struct list_head lu_next;
Another list?
> + enum liveupdate_flag flags;
> + bool visited:1;
You shouldn't need this, you "know" you only touch one device at a time
when walking a bus, don't try to manually keep track of it on your own.
And again, why is the pci core doing this, the driver core should be
doing all of this, PLEASE do not bury driver-model-core-changes down in
a "PCI" patch. That will make the driver core maintainers very grumpy
when they run across stuff like this (as it did here...)
> +};
> +
> +#endif /* CONFIG_LIVEUPDATE */
> +#endif /* _LINUX_DEV_LIVEUPDATE_H */
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 4940db137fffff4ceacf819b32433a0f4898b125..e0b35c723239f1254a3b6152f433e0412cd3fb34 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -21,6 +21,7 @@
> #include <linux/lockdep.h>
> #include <linux/compiler.h>
> #include <linux/types.h>
> +#include <linux/dev_liveupdate.h>
Look, driver core changes. Please do this all in stuff that is NOT for
just PCI.
> #include <linux/mutex.h>
> #include <linux/pm.h>
> #include <linux/atomic.h>
> @@ -508,6 +509,7 @@ struct device_physical_location {
> * @pm_domain: Provide callbacks that are executed during system suspend,
> * hibernation, system resume and during runtime PM transitions
> * along with subsystem-level and driver-level callbacks.
> + * @lu: Live update state.
You have more letters, please use them. "lu" is too short.
> * @em_pd: device's energy model performance domain
> * @pins: For device pin management.
> * See Documentation/driver-api/pin-control.rst for details.
> @@ -603,6 +605,10 @@ struct device {
> struct dev_pm_info power;
> struct dev_pm_domain *pm_domain;
>
> +#ifdef CONFIG_LIVEUPDATE
> + struct dev_liveupdate lu;
> +#endif
Why not a pointer?
> +
> #ifdef CONFIG_ENERGY_MODEL
> struct em_perf_domain *em_pd;
> #endif
> @@ -1168,4 +1174,13 @@ void device_link_wait_removal(void);
> #define MODULE_ALIAS_CHARDEV_MAJOR(major) \
> MODULE_ALIAS("char-major-" __stringify(major) "-*")
>
> +#ifdef CONFIG_LIVEUPDATE
> +static inline void dev_liveupdate_init(struct device *dev)
> +{
> + INIT_LIST_HEAD(&dev->lu.lu_next);
Why does this have to be in device.h? The driver core should do this
for you (as I say above).
thanks,
greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-16 7:45 ` [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver Chris Li
2025-09-29 17:48 ` Jason Gunthorpe
@ 2025-09-30 15:27 ` Greg Kroah-Hartman
2025-10-02 20:38 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-09-30 15:27 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> include/linux/dev_liveupdate.h | 23 +++++
> include/linux/device/driver.h | 6 ++
Driver core changes under the guise of only PCI changes? Please no.
Break this series out properly, get the driver core stuff working FIRST,
then show how multiple busses will work with them (i.e. you usually need
3 to know if you got it right).
I'm guessing you will need/want PCI, platform, and something else?
thanks,
greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 13:41 ` Greg Kroah-Hartman
2025-09-30 14:53 ` Pasha Tatashin
@ 2025-09-30 15:41 ` Chris Li
2025-10-01 5:13 ` Greg Kroah-Hartman
1 sibling, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-09-30 15:41 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Pasha Tatashin, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 6:41 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 30, 2025 at 09:02:44AM -0400, Pasha Tatashin wrote:
> > On Mon, Sep 29, 2025 at 10:10 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > > > > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > > > > prepare callback.
> > > > >
> > > > > After kexec, use driver_set_override() to ensure the device is
> > > > > bound only to the saved driver.
> > > >
> > > > This doesn't seem like a great idea, driver name should not be made
> > > > ABI.
> > >
> > > Let's break it down with baby steps.
> > >
> > > 1) Do you agree the liveupdated PCI device needs to bind to the exact
> > > same driver after kexec?
> > > To me that is a firm yes. If the driver binds to another driver, we
> > > can't expect the other driver will understand the original driver's
> > > saved state.
> >
> > Hi Chris,
> >
> > Driver name does not have to be an ABI.
>
> A driver name can NEVER be an abi, please don't do that.
Can you please clarify that?
For example, PCI has this sysfs control API:
"/sys/bus/pci/devices/0000:04:00.0/driver_override", which takes the
*driver name* as data to override which driver is allowed to bind to
this device.
Does driver_override count as using the driver name as part of the
ABI? If not, why?
What live update wants is to make that driver_override persistent over
kexec. It does not introduce the "driver_override" API; that is a
pre-existing interface. The PCI liveupdate just wants to use it.
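For readers who have not used it, here is a minimal userspace sketch of
driving that existing sysfs file. The helper names are made up for
illustration; only the sysfs path comes from the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the driver_override attribute path for @bdf, e.g. "0000:04:00.0". */
int pci_override_path(char *buf, size_t len, const char *bdf)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/driver_override", bdf);

	/* Treat truncation as an error instead of using a partial path. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Restrict @bdf to @driver by writing the driver name into driver_override. */
int pci_set_driver_override(const char *bdf, const char *driver)
{
	char path[256];
	FILE *f;
	int ret = 0;

	assert(bdf && driver);
	if (pci_override_path(path, sizeof(path), bdf))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such device, or not enough privilege */

	if (fputs(driver, f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)
		ret = -1;	/* buffered data is written here, so check it too */

	return ret;
}
```

A caller would use something like
pci_set_driver_override("0000:04:00.0", "vfio-pci"), where "vfio-pci" is
only an example driver name. The setting lives in the running kernel's
device state, which is why it does not survive kexec today.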
I want to get a basic understanding before venturing into the more
complex solutions.
> > Drivers that support live
> > updates should provide a live update-specific ABI to detect
> > compatibility with the preserved data. We can use a preservation
> > schema GUID for this.
> >
> > > 2) Assume the 1) is yes from you. Are you just not happy that the
> > > kernel saves the driver name? You want user space to save it, is that
> > > it?
> > > How does it reference the driver after kexec otherwise?
> >
> > If we use GUID, drivers would advertise the GUIDs they support and we
> > would modify the core device-driver matching process to use this
> > information.
> >
> > Each driver that supports this mechanism would need to declare an
> > array of GUIDs it is compatible with. This would be a new field in its
> > struct pci_driver.
> >
> > static const guid_t my_driver_guids[] = {
> > GUID_INIT(0x123e4567, ...), // Schema V1
> > GUID_INIT(0x987a6543, ...), // Schema V2
> > {},
> > };
>
> That's crazy, who is going to be adding all of that to all drivers? And
> knowing to bump this if the internal data representation changes? And it
> will change underneath it without the driver even knowing? This feels
> really really wrong, unless I'm missing something.
The GUID is more complex than a driver name. I am fine with not using
GUID if you are so strongly opposed to it.
You are saying don't do A (driver name) or B (GUID). I am waiting for
the part where you say "please do C instead".
Do you have any other suggestion for how to prevent the live update PCI
device from binding to a different driver after kexec? I am happy to
work in the direction you point out and turn that into a patch for
discussion.
Thanks
Chris
> > static struct pci_driver my_pci_driver = {
> > .name = "my_driver",
> > .id_table = my_pci_ids,
> > .probe = my_probe,
> > .live_update_guids = my_driver_guids,
> > };
> >
> > The kernel's PCI core would perform an extra check before falling back
> > to the standard PCI ID matching.
> > 1. When a PCI device is discovered, the core first asks the Live
> > Update framework: "Is there a preserved GUID for this device?"
> > 2. If a GUID is found, the core will only attempt to bind drivers that
> > both match the device's PCI ID and have that specific GUID in their
> > live_update_guids list.
>
> What "core" is doing this? And how exactly?
>
> And why is PCI somehow special here?
>
> > 3. If no GUID is preserved for the device, the core proceeds with the
> > normal matching logic
> > 4. If no driver matches the GUID, the device is left unbound. The
> > state gets removed during finish(), and the device is reset.
>
> How do you reset a device you are not bound to? That feels ripe for
> causing problems (think multi-function devices...)
>
> And what about PCI drivers that are really just a aux-bus "root" point?
> How is the sharing of all of the child devices going to work?
>
> This feels really rough and might possibly work if you squint hard
> enough and test it in a very limited way with almost no real hardware :)
>
> good luck!
>
> greg k-h
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 15:08 ` Greg Kroah-Hartman
@ 2025-09-30 15:56 ` Pasha Tatashin
2025-10-01 5:06 ` Greg Kroah-Hartman
0 siblings, 1 reply; 84+ messages in thread
From: Pasha Tatashin @ 2025-09-30 15:56 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
> > A driver that preserves state across a reboot already has an implicit
> > contract with its future self about that data's format. The GUID
> > simply makes that contract explicit and machine-checkable. It does not
> > have to be GUID, but nevertheless there has to be a specific contract.
>
> So how are you going to "version" these GUID? I see you use "schema Vx"
That is up to the driver developer who changes a driver to support
live update.
> above, but how is that really going to work in the end? Lots of data
> structures change underneath the base driver that it knows nothing
> about, not to mention basic things like compiler flags and the like
> (think about how we have changed things for spectre issues over the
> years...)
We are working on a versioning protocol. The GUID I am suggesting is not
meant to protect "struct" coherency, but just to identify which driver
is compatible with, and should bind to, which device.
>
> And when can you delete an old "schema"? This feels like you are
> forcing future developers to maintain things "for forever"...
This won't be an issue because of how live update support is planned.
The support model will be phased and limited:
Initially, and for a while, there will be no stability guarantees
between different kernel versions.
Eventually, we will support specific, narrow upgrade paths (e.g.,
minor-to-minor, or stable-A to stable-A+1).
Downgrades and arbitrary version jumps ("any-to-any") will not be
supported upstream. Since we only ever need to handle a well-defined
forward path, the code for old, irrelevant schemas can always be
removed. There is no "forever".
Pasha
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 13:02 ` Pasha Tatashin
2025-09-30 13:41 ` Greg Kroah-Hartman
@ 2025-09-30 16:37 ` Jason Gunthorpe
2025-10-02 21:39 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-30 16:37 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 09:02:44AM -0400, Pasha Tatashin wrote:
> The kernel's PCI core would perform an extra check before falling back
> to the standard PCI ID matching.
This still seems very complex just to solve the VFIO case.
As I said, I would punt all of this to the initrd and let the initrd
explicitly bind drivers.
The only behavior we need from the kernel is to not autobind some
drivers so userspace can control it, and in a LUO type environment
userspace should well know what drivers go where - or can get it from
a preceding kernel from a memfd.
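As a rough illustration of the userspace side, here is a sketch that binds
devices explicitly through the existing per-driver "bind" files in sysfs.
struct bind_entry and the function names are hypothetical, and how the
table reaches the initrd (for example from a memfd, as mentioned above) is
outside the scope of the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "this device goes to that driver" decision made by userspace. */
struct bind_entry {
	const char *bdf;	/* PCI address, e.g. "0000:04:00.0" */
	const char *driver;	/* driver name as it appears in sysfs */
};

/* Build the path of the "bind" attribute of @driver. */
int pci_bind_path(char *buf, size_t len, const char *driver)
{
	int n = snprintf(buf, len, "/sys/bus/pci/drivers/%s/bind", driver);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Ask @driver to bind @bdf by writing the address into its bind file. */
int pci_bind_device(const char *bdf, const char *driver)
{
	char path[256];
	FILE *f;
	int ret = 0;

	if (pci_bind_path(path, sizeof(path), driver))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* driver not loaded, or not enough privilege */

	if (fputs(bdf, f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)
		ret = -1;	/* the write is flushed here, so check it too */

	return ret;
}

/* Walk the table and return how many entries could not be bound. */
size_t pci_bind_all(const struct bind_entry *tbl, size_t n)
{
	size_t i, failed = 0;

	assert(tbl || n == 0);
	for (i = 0; i < n; i++)
		if (pci_bind_device(tbl[i].bdf, tbl[i].driver))
			failed++;

	return failed;
}
```

This only covers the binding step. It relies on the kernel not having
auto-bound the device to something else first, which is the kernel
behavior asked for above.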
This is broadly the same thing we need for Confidential Compute anyhow.
Jason
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-30 2:11 ` Chris Li
@ 2025-09-30 16:38 ` Jason Gunthorpe
2025-10-02 18:54 ` David Matlack
2025-10-02 20:44 ` Chris Li
0 siblings, 2 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-30 16:38 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 07:11:06PM -0700, Chris Li wrote:
> On Mon, Sep 29, 2025 at 10:48 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > > After the list of preserved devices is constructed, the PCI subsystem can
> > > now forward the liveupdate request to the driver.
> >
> > This also seems completely backwards for how iommufd should be
> > working. It doesn't want callbacks triggered on prepare, it wants to
> > drive everything from its own ioctl.
>
> This series is about basic PCI device support, not IOMMUFD.
>
> > Let's just do one thing at a time please and make this series about
> > iommufd to match the other luo series for iommufd.
>
> I am confused by you.
>
> > non-iommufd cases can be proposed in their own series.
>
> This is that non-iommufd series.
Then don't do generic devices until we get iommufd done and you have a
meaningful in-tree driver to consume what you are adding.
Jason
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-30 2:13 ` Chris Li
@ 2025-09-30 16:47 ` Jason Gunthorpe
2025-10-03 7:09 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-30 16:47 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 07:13:51PM -0700, Chris Li wrote:
> On Mon, Sep 29, 2025 at 10:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> > > static int pci_liveupdate_prepare(void *arg, u64 *data)
> > > {
> > > + LIST_HEAD(requested_devices);
> > > +
> > > pr_info("prepare data[%llx]\n", *data);
> > > +
> > > + pci_lock_rescan_remove();
> > > + down_write(&pci_bus_sem);
> > > +
> > > + build_liveupdate_devices(&requested_devices);
> > > + cleanup_liveupdate_devices(&requested_devices);
> > > +
> > > + up_write(&pci_bus_sem);
> > > + pci_unlock_rescan_remove();
> > > return 0;
> > > }
> >
> > This doesn't seem conceptually right, PCI should not be preserving
> > everything. Only devices and their related hierarchy that are opted
> > into live update by iommufd should be preserved.
>
> Can you elaborate? This is not preserving everything. For preserving the
> bus master bit, only the device and the parent PCI bridge are added to the
> requested_devices list. That is done in build_liveupdate_devices(): the
> device is added to the list head passed into the function. So it matches
> the "their related hierarchy" part.
> Can you explain what unnecessary device was preserved in this?
I expected an exported function to request a pci device be preserved
and to populate a tracking list linked to a luo session when that
function is called.
Setting flags and then searching over all the buses seems, IDK, strange and
should probably be justified.
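For illustration only, here is a user-space sketch of the "explicit request"
shape described above: a call that links the device and each bridge above it
into a per-session tracking list at request time, so nothing has to scan the
buses later. Every name in it (demo_pci_dev, demo_luo_session,
demo_pci_preserve) is hypothetical and not taken from the series or from the
PCI core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only what the sketch needs. */
struct demo_pci_dev {
	const char *name;
	struct demo_pci_dev *parent;	/* upstream bridge, NULL at the root */
	bool preserved;			/* already on a session list */
	struct demo_pci_dev *next;	/* link in the session tracking list */
};

/* Hypothetical stand-in for a LUO session that owns the tracking list. */
struct demo_luo_session {
	struct demo_pci_dev *head;
	size_t count;
};

/*
 * Request preservation of one device. The device and every bridge above
 * it are linked into the session when the request is made. The walk stops
 * at the first device that is already preserved, because its own parents
 * were added by the earlier request that preserved it.
 * Returns the number of entries newly added to the session.
 */
size_t demo_pci_preserve(struct demo_luo_session *s, struct demo_pci_dev *dev)
{
	size_t added = 0;

	assert(s != NULL);
	for (; dev && !dev->preserved; dev = dev->parent) {
		dev->preserved = true;
		dev->next = s->head;
		s->head = dev;
		s->count++;
		added++;
	}
	return added;
}
```

With this shape, prepare() would only need to iterate the session list; two
endpoints behind the same bridge share the bridge entry instead of adding it
twice.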
Jason
* Re: [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator
2025-09-30 15:17 ` Greg Kroah-Hartman
@ 2025-09-30 23:38 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-30 23:38 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 30, 2025 at 8:31 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:09AM -0700, Chris Li wrote:
> > Register PCI subsystem with the Liveupdate Orchestrator
> > and provide noop liveupdate callbacks.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > MAINTAINERS | 2 ++
> > drivers/pci/Makefile | 1 +
> > drivers/pci/liveupdate.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 57 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 91cec3288cc81aea199f730924eee1f5fda1fd72..85749a5da69f88544ccc749e9d723b1b54c0e3b7 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -14014,11 +14014,13 @@ F: tools/testing/selftests/livepatch/
> >
> > LIVE UPDATE
> > M: Pasha Tatashin <pasha.tatashin@soleen.com>
> > +M: Chris Li <chrisl@kernel.org>
> > L: linux-kernel@vger.kernel.org
> > S: Maintained
> > F: Documentation/ABI/testing/sysfs-kernel-liveupdate
> > F: Documentation/admin-guide/liveupdate.rst
> > F: drivers/misc/liveupdate/
> > +F: drivers/pci/liveupdate/
> > F: include/linux/liveupdate.h
> > F: include/uapi/linux/liveupdate.h
> > F: tools/testing/selftests/liveupdate/
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index 67647f1880fb8fb0629d680398f5b88d69aac660..aa1bac7aed7d12c641a6b55e56176fb3cdde4c91 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
> > obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> > obj-$(CONFIG_PCI_NPEM) += npem.o
> > obj-$(CONFIG_PCIE_TPH) += tph.o
> > +obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
> >
> > # Endpoint library must be initialized before its users
> > obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
> > diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..86b4f3a2fb44781c6e323ba029db510450556fa9
> > --- /dev/null
> > +++ b/drivers/pci/liveupdate.c
> > @@ -0,0 +1,54 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Chris Li <chrisl@kernel.org>
> > + */
> > +
> > +#define pr_fmt(fmt) "PCI liveupdate: " fmt
> > +
> > +#include <linux/liveupdate.h>
> > +
> > +#define PCI_SUBSYSTEM_NAME "pci"
> > +
> > +static int pci_liveupdate_prepare(void *arg, u64 *data)
> > +{
> > + pr_info("prepare data[%llx]\n", *data);
> > + return 0;
> > +}
> > +
> > +static int pci_liveupdate_freeze(void *arg, u64 *data)
> > +{
> > + pr_info("freeze data[%llx]\n", *data);
> > + return 0;
> > +}
> > +
> > +static void pci_liveupdate_cancel(void *arg, u64 data)
> > +{
> > + pr_info("cancel data[%llx]\n", data);
> > +}
> > +
> > +static void pci_liveupdate_finish(void *arg, u64 data)
> > +{
> > + pr_info("finish data[%llx]\n", data);
> > +}
> > +
> > +struct liveupdate_subsystem pci_liveupdate_ops = {
> > + .prepare = pci_liveupdate_prepare,
> > + .freeze = pci_liveupdate_freeze,
> > + .cancel = pci_liveupdate_cancel,
> > + .finish = pci_liveupdate_finish,
> > + .name = PCI_SUBSYSTEM_NAME,
> > +};
> > +
> > +static int __init pci_liveupdate_init(void)
> > +{
> > + int ret;
> > +
> > + ret = liveupdate_register_subsystem(&pci_liveupdate_ops);
> > + if (ret && liveupdate_state_updated())
> > + panic("PCI liveupdate: Register subsystem failed: %d", ret);
> > + WARN(ret, "PCI liveupdate: Register subsystem failed %d", ret);
>
> But this didn't fail.
>
> And you just crashed the box if panic-on-warn is enabled, so if some
> test infrastructure builds this first patch, boom.
Sorry, the second WARN should be removed. It somehow slipped through the
cracks during rebase conflict resolution.
I will remove the second WARN() there.
Thanks for catching it.
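As a rough user-space model of the intended control flow once the second
WARN() is gone: the sketch below assumes a failed registration should panic
only when the kernel was booted via a live update, and should otherwise just
be logged and returned as an error. The demo_* names are hypothetical; ret and
updated stand for the results of liveupdate_register_subsystem() and
liveupdate_state_updated() from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes of the init path, modelled for user space. */
enum demo_init_action { DEMO_OK, DEMO_LOG_AND_FAIL, DEMO_PANIC };

/*
 * Decide what the init function does with the registration result.
 * ret:     kernel-style return value, 0 or a negative errno
 * updated: true if this boot is the result of a live update
 */
enum demo_init_action demo_init_decide(int ret, bool updated)
{
	assert(ret <= 0);

	if (!ret)
		return DEMO_OK;
	if (updated)
		return DEMO_PANIC;	/* preserved state cannot be restored */
	return DEMO_LOG_AND_FAIL;	/* plain error path, no WARN() */
}
```

The point of the split is that a registration failure on an ordinary boot
never reaches a WARN(), so panic-on-warn setups are not affected by it.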
>
> {sigh}
>
> If you are going to do a "dummy" driver, please make it at least work
> and not do anything bad.
I did test it on a real machine with a real PCI device, but I did not
have panic-on-warn enabled.
Again my bad. Sorry about that.
Chris
* Re: [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator
2025-09-30 15:15 ` Greg Kroah-Hartman
@ 2025-09-30 23:41 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-09-30 23:41 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 30, 2025 at 8:31 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:09AM -0700, Chris Li wrote:
> > Register PCI subsystem with the Liveupdate Orchestrator
> > and provide noop liveupdate callbacks.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > MAINTAINERS | 2 ++
> > drivers/pci/Makefile | 1 +
> > drivers/pci/liveupdate.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 57 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 91cec3288cc81aea199f730924eee1f5fda1fd72..85749a5da69f88544ccc749e9d723b1b54c0e3b7 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -14014,11 +14014,13 @@ F: tools/testing/selftests/livepatch/
> >
> > LIVE UPDATE
> > M: Pasha Tatashin <pasha.tatashin@soleen.com>
> > +M: Chris Li <chrisl@kernel.org>
> > L: linux-kernel@vger.kernel.org
> > S: Maintained
> > F: Documentation/ABI/testing/sysfs-kernel-liveupdate
> > F: Documentation/admin-guide/liveupdate.rst
> > F: drivers/misc/liveupdate/
> > +F: drivers/pci/liveupdate/
> > F: include/linux/liveupdate.h
> > F: include/uapi/linux/liveupdate.h
> > F: tools/testing/selftests/liveupdate/
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index 67647f1880fb8fb0629d680398f5b88d69aac660..aa1bac7aed7d12c641a6b55e56176fb3cdde4c91 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) += doe.o
> > obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> > obj-$(CONFIG_PCI_NPEM) += npem.o
> > obj-$(CONFIG_PCIE_TPH) += tph.o
> > +obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
> >
> > # Endpoint library must be initialized before its users
> > obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
> > diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..86b4f3a2fb44781c6e323ba029db510450556fa9
> > --- /dev/null
> > +++ b/drivers/pci/liveupdate.c
> > @@ -0,0 +1,54 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Chris Li <chrisl@kernel.org>
> > + */
> > +
> > +#define pr_fmt(fmt) "PCI liveupdate: " fmt
> > +
> > +#include <linux/liveupdate.h>
> > +
> > +#define PCI_SUBSYSTEM_NAME "pci"
> > +
> > +static int pci_liveupdate_prepare(void *arg, u64 *data)
> > +{
> > + pr_info("prepare data[%llx]\n", *data);
>
> You do know that's a security bug, right?
Right. It was useful for debugging and inspecting the preserved data.
My bad; I will remove the raw pointer.
>
> Please don't do this, even in "debug" code, as it can escape into the
> wild...
Fully agree. Thanks for catching it.
Chris
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 15:56 ` Pasha Tatashin
@ 2025-10-01 5:06 ` Greg Kroah-Hartman
2025-10-01 21:03 ` Pasha Tatashin
0 siblings, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-01 5:06 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 11:56:58AM -0400, Pasha Tatashin wrote:
> > > A driver that preserves state across a reboot already has an implicit
> > > contract with its future self about that data's format. The GUID
> > > simply makes that contract explicit and machine-checkable. It does not
> > > have to be GUID, but nevertheless there has to be a specific contract.
> >
> > So how are you going to "version" these GUID? I see you use "schema Vx"
>
> Driver developer who changes a driver to support live-update.
I do not understand this response, sorry.
> > above, but how is that really going to work in the end? Lots of data
> > structures change underneath the base driver that it knows nothing
> > about, not to mention basic things like compiler flags and the like
> > (think about how we have changed things for spectre issues over the
> > years...)
>
> We are working on versioning protocol, the GUID I am suggesting is not
> to protect "struct" coherency, but just to identify which driver to
> bind to which device compatibility.
So you have a new way of matching drivers to devices? That's odd.
> > And when can you delete an old "schema"? This feels like you are
> > forcing future developers to maintain things "for forever"...
>
> This won't be an issue because of how live update support is planned.
> The support model will be phased and limited:
>
> Initially, and for a while there will be no stability guarantees
> between different kernel versions.
> Eventually, we will support specific, narrow upgrade paths (e.g.,
> minor-to-minor, or stable-A to stable-A+1).
> Downgrades and arbitrary version jumps ("any-to-any") will not be
> supported upstream. Since we only ever need to handle a well-defined
> forward path, the code for old, irrelevant schemas can always be
> removed. There is no "forever".
This is kernel code, it is always "forever", sorry.
If you want "minor to minor" update, how is that going to work given
that you do not add changes only to "minor" releases (that being the
6.12.y the "y" number).
Remember, Linux does not use "semantic versioning" as its release
numbering is older than that scheme. It just does "this version is
newer than that version" and that's it. You can't really take anything
else from the number.
And if this isn't for "upstream" at all, then why have it? We can't add
new features and support it if we can't actually use it and it's only
for out-of-tree vendor kernels.
And how will you document properly a "well defined forward path"? That
should be done first, before you have any code here that we are
reviewing.
Please do that, get people to agree on the idea and how it will work
before asking us to review code.
thanks,
greg k-h
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 15:41 ` Chris Li
@ 2025-10-01 5:13 ` Greg Kroah-Hartman
2025-10-02 22:05 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-01 5:13 UTC (permalink / raw)
To: Chris Li
Cc: Pasha Tatashin, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 08:41:29AM -0700, Chris Li wrote:
> On Tue, Sep 30, 2025 at 6:41 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Sep 30, 2025 at 09:02:44AM -0400, Pasha Tatashin wrote:
> > > On Mon, Sep 29, 2025 at 10:10 PM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Mon, Sep 29, 2025 at 10:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Tue, Sep 16, 2025 at 12:45:14AM -0700, Chris Li wrote:
> > > > > > Save the PCI driver name into "struct pci_dev_ser" during the PCI
> > > > > > prepare callback.
> > > > > >
> > > > > > After kexec, use driver_set_override() to ensure the device is
> > > > > > bound only to the saved driver.
> > > > >
> > > > > This doesn't seem like a great idea, driver name should not be made
> > > > > ABI.
> > > >
> > > > Let's break it down with baby steps.
> > > >
> > > > 1) Do you agree the liveupdated PCI device needs to bind to the exact
> > > > same driver after kexec?
> > > > To me that is a firm yes. If the driver binds to another driver, we
> > > > can't expect the other driver will understand the original driver's
> > > > saved state.
> > >
> > > Hi Chris,
> > >
> > > Driver name does not have to be an ABI.
> >
> > A driver name can NEVER be an abi, please don't do that.
>
> Can you please clarify that.
>
> for example, the pci has this sysfs control api:
>
> "/sys/bus/pci/devices/0000:04:00.0/driver_override" which takes the
> *driver name* as data to override what driver is allowed to bind to
> this device.
> Does this driver_override count as using the driver name as part of
> the ABI? If not, why?
Because the bind/unbind/override was created as a debug facility for
doing kernel development and then people have turned it into a "let's
operate our massive cloud systems with this fragile feature".
We have never said that driver names will remain the same across
releases, and they have changed over time. Device ids have also moved
from one driver to another as well, making the "control" of the device
seem to have changed names.
> What live update wants is to make that driver_override persistent over
> kexec. It does not introduce the "driver_override" API. That is
> pre-existing conditions. The PCI liveupdate just wants to use it.
That does not mean that this is the correct api to use at all. Again,
this was a debugging aid, to help with users who wanted to add a device
id to a driver without having to rebuild it. Don't make it something
that it was never intended to be.
Why not just make a new api as you are doing something new here? That
way you get to define it to work exactly the way you need?
> I want to get some basic understanding before venturing into the more
> complex solutions.
You mean "real" solutions :)
> > > Drivers that support live
> > > updates should provide a live update-specific ABI to detect
> > > compatibility with the preserved data. We can use a preservation
> > > schema GUID for this.
> > >
> > > > 2) Assume the 1) is yes from you. Are you just not happy that the
> > > > kernel saves the driver name? You want user space to save it, is that
> > > > it?
> > > > How does it reference the driver after kexec otherwise?
> > >
> > > If we use GUID, drivers would advertise the GUIDs they support and we
> > > would modify the core device-driver matching process to use this
> > > information.
> > >
> > > Each driver that supports this mechanism would need to declare an
> > > array of GUIDs it is compatible with. This would be a new field in its
> > > struct pci_driver.
> > >
> > > static const guid_t my_driver_guids[] = {
> > > GUID_INIT(0x123e4567, ...), // Schema V1
> > > GUID_INIT(0x987a6543, ...), // Schema V2
> > > {},
> > > };
> >
> > That's crazy, who is going to be adding all of that to all drivers? And
> > knowing to bump this if the internal data representation changes? And it
> > will change underneath it without the driver even knowing? This feels
> > really really wrong, unless I'm missing something.
>
> The GUID is more complex than a driver name. I am fine with not using
> GUID if you are so strongly opposed to it.
>
> You are saying don't do A(driver name) and B(GUID). I am waiting for
> the part where you say "please do C instead".
It's not my requirement to say "here is C", but rather I am saying "B is
not going to scale over time as GUIDs are a pain to manage".
> Do you have any other suggestion how to prevent the live update PCI
> device bind to a different driver after kexec? I am happy to work on
> the direction you point out and turn that into a patch for the
> discussion purpose.
Why prevent it? Why not just have a special api just for drivers that
want to use this new feature?
thanks,
greg k-h
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-10-01 5:06 ` Greg Kroah-Hartman
@ 2025-10-01 21:03 ` Pasha Tatashin
2025-10-02 6:09 ` Greg Kroah-Hartman
0 siblings, 1 reply; 84+ messages in thread
From: Pasha Tatashin @ 2025-10-01 21:03 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
Hi Greg,
On Wed, Oct 1, 2025 at 1:06 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 30, 2025 at 11:56:58AM -0400, Pasha Tatashin wrote:
> > > > A driver that preserves state across a reboot already has an implicit
> > > > contract with its future self about that data's format. The GUID
> > > > simply makes that contract explicit and machine-checkable. It does not
> > > > have to be GUID, but nevertheless there has to be a specific contract.
> > >
> > > So how are you going to "version" these GUID? I see you use "schema Vx"
> >
> > Driver developer who changes a driver to support live-update.
>
> I do not understand this response, sorry.
Sorry for the confusion, I misunderstood your question. I thought you
were asking who would add a new field to a driver. My answer was that
it would be the developer who is adding support for the Live Update
feature to that specific driver.
I now realize you were asking about how the GUID would be versioned.
Using a GUID was just one of several ideas. My main point is that we
need some form of versioned compatibility identifier, whether it's a
string or a number. This would allow the system to verify that the new
driver can understand the preserved data for this device from the
previous kernel before it binds to the device.
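To make the idea concrete, a minimal user-space sketch of such a check
follows. It assumes the identifier is a plain string stored with the preserved
device record and that each driver lists the identifiers it can parse. All
names (demo_preserved_dev, demo_driver, demo_match, the "demo-nic/vN" strings)
are hypothetical and not part of any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record the old kernel would preserve for one device. */
struct demo_preserved_dev {
	const char *compat;	/* identifier of the preserved data format */
};

/* Hypothetical driver description; compat_ids is NULL-terminated. */
struct demo_driver {
	const char *name;
	const char *const *compat_ids;
};

/* True if the driver declares that it understands the preserved format. */
bool demo_driver_understands(const struct demo_driver *drv,
			     const struct demo_preserved_dev *dev)
{
	const char *const *id;

	assert(drv && dev);
	if (!drv->compat_ids || !dev->compat)
		return false;
	for (id = drv->compat_ids; *id; id++)
		if (strcmp(*id, dev->compat) == 0)
			return true;
	return false;
}

/* Pick the first driver that can take over the device, or NULL. */
const struct demo_driver *demo_match(const struct demo_driver *drvs, size_t n,
				     const struct demo_preserved_dev *dev)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (demo_driver_understands(&drvs[i], dev))
			return &drvs[i];
	return NULL;
}
```

A device whose preserved identifier no driver claims gets no match here, which
in this model means it stays unbound rather than being handed to a driver that
cannot read its state.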
> > > above, but how is that really going to work in the end? Lots of data
> > > structures change underneath the base driver that it knows nothing
> > > about, not to mention basic things like compiler flags and the like
> > > (think about how we have changed things for spectre issues over the
> > > years...)
> >
> > We are working on versioning protocol, the GUID I am suggesting is not
> > to protect "struct" coherency, but just to identify which driver to
> > bind to which device compatibility.
>
> So you have a new way of matching drivers to devices? That's odd.
Correct. For a device that persists across a live update, the driver
matching logic in the new kernel would need to be altered.
The exception is if the device can stay unbound until initramfs, as Jason
suggested earlier in the thread. Even then, probing would still need to be
altered to keep the device unbound.
> > > And when can you delete an old "schema"? This feels like you are
> > > forcing future developers to maintain things "for forever"...
> >
> > This won't be an issue because of how live update support is planned.
> > The support model will be phased and limited:
> >
> > Initially, and for a while there will be no stability guarantees
> > between different kernel versions.
> > Eventually, we will support specific, narrow upgrade paths (e.g.,
> > minor-to-minor, or stable-A to stable-A+1).
> > Downgrades and arbitrary version jumps ("any-to-any") will not be
> > supported upstream. Since we only ever need to handle a well-defined
> > forward path, the code for old, irrelevant schemas can always be
> > removed. There is no "forever".
>
> This is kernel code, it is always "forever", sorry.
I'm sorry, but I don't quite understand what you mean. There is no
stable internal kernel API; the upstream tree is constantly evolving
with features being added, improved, and removed.
> If you want "minor to minor" update, how is that going to work given
> that you do not add changes only to "minor" releases (that being the
> 6.12.y the "y" number).
You are correct. Initially, our plan is to allow live updates to break
between any kernel version. However, it is my hope that we will
eventually stabilize this process and only allow breakages between,
for example, versions 6.n and 6.n+2, and eventually from one stable
release to stable+2. This would create a well-defined window for
safely removing deprecated data formats and the code that handles them
from the kernel.
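One way to picture such a window, as a sketch only: suppose each preserved
blob carries a format number, and the running kernel accepts a bounded range
of them. This models the window in terms of data formats rather than kernel
release numbers, which is an assumption of the sketch and not something
agreed in this thread. The demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: range of preserved-data formats the running kernel reads. */
struct demo_format_window {
	unsigned int oldest;	/* oldest format still parsed */
	unsigned int newest;	/* format this kernel itself writes */
};

/* A saved blob can be restored only if its format lies inside the window. */
bool demo_can_restore(const struct demo_format_window *w, unsigned int saved)
{
	assert(w->oldest <= w->newest);
	return saved >= w->oldest && saved <= w->newest;
}

/* Dropping a deprecated format moves the lower bound forward by one. */
void demo_retire_oldest(struct demo_format_window *w)
{
	if (w->oldest < w->newest)
		w->oldest++;
}
```

Formats newer than the window are rejected as well, which corresponds to
downgrades not being supported.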
> Remember, Linux does not use "semantic versioning" as its release
> numbering is older than that scheme. It just does "this version is
> newer than that version" and that's it. You can't really take anything
> else from the number.
Understood. If that's the case, we could use stable releases as the
basis for defining when a live update can break. It would take longer
to achieve, but it is a possibility. These are the kinds of questions
that will be discussed at the LPC Liveupdate MC. If you are attending
LPC, I encourage you to join the discussion, as your thoughts on how
we can frame long-term live update support would be very valuable.
> And if this isn't for "upstream" at all, then why have it? We can't add
> new features and support it if we can't actually use it and it's only
> for out-of-tree vendor kernels.
Our goal is to have full support in the upstream kernel. Downstream
users will then need to adapt live updates to their specific needs.
For example, if a live update from version A to version C is broken, a
downstream user would either have to update incrementally from A to B
and then to C, or they would have to internally fix whatever is
causing the breakage before performing the live update.
> And how will you document properly a "well defined forward path"? That
> should be done first, before you have any code here that we are
> reviewing.
Currently, and for the near future, live updates will only be
supported within the same kernel version.
> Please do that, get people to agree on the idea and how it will work
> before asking us to review code.
This is an industry-wide effort. We have engineers from Amazon,
Google, Microsoft, Nvidia, and other companies meeting bi-weekly to
discuss Live Update support, and sending and landing patches upstream.
We are also organizing an LPC Live Update Micro Conference where the
versioning strategy will be a topic.
For now, we have agreed that the live update can break between any
kernel versions or with any commit while the feature is under active
development. This approach allows us the flexibility to build the core
functionality while we collaboratively define the long-term versioning
and stability model.
Thank you,
Pasha
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-10-01 21:03 ` Pasha Tatashin
@ 2025-10-02 6:09 ` Greg Kroah-Hartman
2025-10-02 13:23 ` Jason Gunthorpe
2025-10-02 22:30 ` Chris Li
0 siblings, 2 replies; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-02 6:09 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Li, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Wed, Oct 01, 2025 at 05:03:19PM -0400, Pasha Tatashin wrote:
> On Wed, Oct 1, 2025 at 1:06 AM Greg Kroah-Hartman
> > On Tue, Sep 30, 2025 at 11:56:58AM -0400, Pasha Tatashin wrote:
> > > > > A driver that preserves state across a reboot already has an implicit
> > > > > contract with its future self about that data's format. The GUID
> > > > > simply makes that contract explicit and machine-checkable. It does not
> > > > > have to be GUID, but nevertheless there has to be a specific contract.
> > > >
> > > > So how are you going to "version" these GUID? I see you use "schema Vx"
> > >
> > > Driver developer who changes a driver to support live-update.
> >
> > I do not understand this response, sorry.
>
> Sorry for the confusion, I misunderstood your question. I thought you
> were asking who would add a new field to a driver. My answer was that
> it would be the developer who is adding support for the Live Update
> feature to that specific driver.
> I now realize you were asking about how the GUID would be versioned.
> Using a GUID was just one of several ideas. My main point is that we
> need some form of versioned compatibility identifier, whether it's a
> string or a number. This would allow the system to verify that the new
> driver can understand the preserved data for this device from the
> previous kernel before it binds to the device.
Again, "versioned" identifiers will not work over time as you can never
drop old versions, AND a driver author does not know whether the underlying
structures that are outside of the driver have changed, whether the
compiler settings have changed, or whether anything else that could affect
it has changed.
> > > > And when can you delete an old "schema"? This feels like you are
> > > > forcing future developers to maintain things "for forever"...
> > >
> > > This won't be an issue because of how live update support is planned.
> > > The support model will be phased and limited:
> > >
> > > Initially, and for a while there will be no stability guarantees
> > > between different kernel versions.
> > > Eventually, we will support specific, narrow upgrade paths (e.g.,
> > > minor-to-minor, or stable-A to stable-A+1).
> > > Downgrades and arbitrary version jumps ("any-to-any") will not be
> > > supported upstream. Since we only ever need to handle a well-defined
> > > forward path, the code for old, irrelevant schemas can always be
> > > removed. There is no "forever".
> >
> > This is kernel code, it is always "forever", sorry.
>
> I'm sorry, but I don't quite understand what you mean. There is no
> stable internal kernel API; the upstream tree is constantly evolving
> with features being added, improved, and removed.
Yes, that is very true, but you can not remove user-visible
functionality, which is what you are saying you are going to do here.
> > If you want "minor to minor" update, how is that going to work given
> > that you do not add changes only to "minor" releases (that being the
> > 6.12.y the "y" number).
>
> You are correct. Initially, our plan is to allow live updates to break
> between any kernel version.
Then there is no such thing as live updates :)
> However, it is my hope that we will
> eventually stabilize this process and only allow breakages between,
> for example, versions 6.n and 6.n+2, and eventually from one stable
> release to stable+2. This would create a well-defined window for
> safely removing deprecated data formats and the code that handles them
> from the kernel.
How are you going to define this? We can not break old users when they
upgrade, and so you are going to have to support this "upgrade path" for
forever.
> > Remember, Linux does not use "semantic versioning" as its release
> > numbering is older than that scheme. It just does "this version is
> > newer than that version" and that's it. You can't really take anything
> > else from the number.
>
> Understood. If that's the case, we could use stable releases as the
> basis for defining when a live update can break.
So every single release?
> It would take longer
> to achieve, but it is a possibility. These are the kinds of questions
> that will be discussed at the LPC Liveupdate MC. If you are attending
> LPC, I encourage you to join the discussion, as your thoughts on how
> we can frame long-term live update support would be very valuable.
I will be at LPC, but can't guarantee I can make it to that MC, it all
depends on scheduling.
> > And if this isn't for "upstream" at all, then why have it? We can't add
> > new features and support it if we can't actually use it and it's only
> > for out-of-tree vendor kernels.
>
> Our goal is to have full support in the upstream kernel. Downstream
> users will then need to adapt live updates to their specific needs.
> For example, if a live update from version A to version C is broken, a
> downstream user would either have to update incrementally from A to B
> and then to C, or they would have to internally fix whatever is
> causing the breakage before performing the live update.
What does "internally fix" mean exactly here?
> > And how will you document properly a "well defined forward path"? That
> > should be done first, before you have any code here that we are
> > reviewing.
>
> Currently, and for the near future, live updates will only be
> supported within the same kernel version.
Ok, then no need for any GUID at all. Just update and pray! :)
> > Please do that, get people to agree on the idea and how it will work
> > before asking us to review code.
>
> This is an industry-wide effort. We have engineers from Amazon,
> Google, Microsoft, Nvidia, and other companies meeting bi-weekly to
> discuss Live Update support, and sending and landing patches upstream.
> We are also organizing an LPC Live Update Micro Conference where the
> versioning strategy will be a topic.
>
> For now, we have agreed that the live update can break between any
> kernel versions or with any commit while the feature is under active
> development. This approach allows us the flexibility to build the core
> functionality while we collaboratively define the long-term versioning
> and stability model.
Just keeping a device "alive" while rebooting into the same exact kernel
image seems odd to me given that this is almost never what people
actually do. They update their kernel with the weekly stable release to
get the new bugfixes (remember we fix 13 CVEs a day), and away you go.
You are saying that this workload would not actually be supported, so
why do you want live update at all? Who needs this?
thanks,
greg k-h
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-10-02 6:09 ` Greg Kroah-Hartman
@ 2025-10-02 13:23 ` Jason Gunthorpe
2025-10-02 22:30 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-02 13:23 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Pasha Tatashin, Chris Li, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Thu, Oct 02, 2025 at 08:09:11AM +0200, Greg Kroah-Hartman wrote:
> > However, it is my hope that we will
> > eventually stabilize this process and only allow breakages between,
> > for example, versions 6.n and 6.n+2, and eventually from one stable
> > release to stable+2. This would create a well-defined window for
> > safely removing deprecated data formats and the code that handles them
> > from the kernel.
>
> How are you going to define this? We can not break old users when they
> upgrade, and so you are going to have to support this "upgrade path" for
> forever.
I think the realistic proposal for LUO/kexec version compatibility is
more like eBPF. Expressly saying it is not ABI, not stable, but here
are a bunch of tools and it is still useful.
> Just keeping a device "alive" while rebooting into the same exact kernel
> image seems odd to me given that this is almost never what people
> actually do.
This feature has a lot of development to go. Right now the baseline
for upstream is no ABI promise. You can live update between any two
kernel versions that don't change the LUO kexec ABI. In practice that
will be a lot of version pairs.
The downstreams are going to take this raw capability and choose
specific downstream version pairs, patch in support for certain ABI
versions that they need, and test.
When things mature and the project is more complete then the kernel
community may have a discussion about what upstream version pairs
should be supported by the community.
I don't think this would be as broad as every combination of linux
versions ever, but ideas like sequential pairs of stable
releases, sequential pairs of main release and so on are worth
exploring.
Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-30 16:38 ` Jason Gunthorpe
@ 2025-10-02 18:54 ` David Matlack
2025-10-02 20:57 ` Chris Li
2025-10-02 20:44 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: David Matlack @ 2025-10-02 18:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On 2025-09-30 01:38 PM, Jason Gunthorpe wrote:
> On Mon, Sep 29, 2025 at 07:11:06PM -0700, Chris Li wrote:
> > On Mon, Sep 29, 2025 at 10:48 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > > > After the list of preserved devices is constructed, the PCI subsystem can
> > > > now forward the liveupdate request to the driver.
> > >
> > > This also seems completely backwards for how iommufd should be
> > > working. It doesn't want callbacks triggered on prepare, it wants to
> > > drive everything from its own ioctl.
> >
> > This series is about basic PCI device support, not IOMMUFD.
> >
> > > Let's just do one thing at a time please and make this series about
> > > iommufd to match the other luo series for iommufd.
> >
> > I am confused by you.
> >
> > > non-iommufd cases can be proposed in their own series.
> >
> > This is that non-iommufd series.
>
> Then don't do generic devices until we get iommufd done and you have a
> meaningful in-tree driver to consume what you are adding.
I agree with Jason. I don't think we can reasonably make the argument
that we need this series until we have actual use cases for it.
I think we should focus on vfio-pci device preservation next, and use
that to incrementally drive whatever changes are necessary to the PCI
and generic device layer bit by bit.
For example, once we have basic vfio-pci device preservation working, we
can start thinking about how to handle when that device is a VF, and we
have to start also preserving the SR-IOV state on the PF and get the PF
driver involved in the process. At that point we can discuss how to
solve that specific problem. Maybe the solution will look something like
this series, maybe it will look like something else. There is open
design space.
Without approaching it this way, I don't see how we can't reasonably
argue that anything in this series is necessary. And I suspect some
parts of this series truly are unnecessary, at least in the short term.
In our internal implementation, the only dependent device that truly
needed to participate is the PF driver when a VF is preserved.
Everything else (e.g. pcieport callbacks) have just been no-ops.
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-30 15:27 ` Greg Kroah-Hartman
@ 2025-10-02 20:38 ` Chris Li
2025-10-03 6:18 ` Greg Kroah-Hartman
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-02 20:38 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 30, 2025 at 8:30 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > include/linux/dev_liveupdate.h | 23 +++++
> > include/linux/device/driver.h | 6 ++
>
> Driver core changes under the guise of only PCI changes? Please no.
There is a reason why I use the device struct rather than the pci_dev
struct even though liveupdate currently only works with PCI devices.
It comes down to the fact that pci_bus and pci_host_bridge are not
pci_dev structs. We need something that is common across all three
types of PCI-related structs I care about (pci_dev, pci_bus,
pci_host_bridge). The device struct is the common base of those. I can
move the dev_liveupdate struct into pci_bus, pci_host_bridge and
pci_dev independently. That will be more contained inside PCI, not
touching the device struct. The patch would be bigger because the data
structure is spread into different structs. Do you have a preference
which way to go?
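To make the "common base" argument concrete, here is a simplified userspace mock. In the kernel, pci_dev, pci_bus and pci_host_bridge each embed a struct device; the mock below keeps only that relationship. The `requested` field and the lu_* helpers are hypothetical and are not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the series' struct dev_liveupdate. */
struct dev_liveupdate {
	bool requested;	/* assumed field: device was marked for liveupdate */
};

/* Heavily simplified mock of struct device. */
struct device {
	const char *name;
	struct dev_liveupdate lu;
};

/* The three PCI-related types all embed a struct device. */
struct pci_dev { unsigned int devfn; struct device dev; };
struct pci_bus { unsigned char number; struct device dev; };
struct pci_host_bridge { int domain_nr; struct device dev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* One helper pair serves all three types via the embedded device. */
static inline void lu_mark_requested(struct device *dev)
{
	assert(dev != NULL);
	dev->lu.requested = true;
}

static inline bool lu_is_requested(const struct device *dev)
{
	assert(dev != NULL);
	return dev->lu.requested;
}
```

The alternative discussed in this mail would put a dev_liveupdate member into each of the three PCI structs instead, at the cost of one accessor per type.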
> Break this series out properly, get the driver core stuff working FIRST,
> then show how multiple busses will work with them (i.e. you usually need
> 3 to know if you got it right).
By multiple buses, do you mean different types of bus, e.g. USB, PCI and
others, or are 3 pci_bus instances good enough? Right now we have no
intention to support bus types other than PCI. The liveupdate is about
preserving the GPU context across kernel upgrades. Suggestions welcome.
> I'm guessing you will need/want PCI, platform, and something else?
This series only cares about PCI. The LUO series has subsystems. The
PCI liveupdate code is registered as an LUO subsystem. I guess the
subsystem is close to the platform you have in mind? LUO also has the
memfd in addition to the subsystem.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-09-30 16:38 ` Jason Gunthorpe
2025-10-02 18:54 ` David Matlack
@ 2025-10-02 20:44 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-02 20:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 9:38 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > > non-iommufd cases can be proposed in their own series.
> >
> > This is that non-iommufd series.
>
> Then don't do generic devices until we get iommufd done and you have a
> meaningful in-tree driver to consume what you are adding.
Thanks for the suggestion. See the above explanation of why I added it
to the device struct: it is needed in all of the pci_dev, pci_bus and
pci_host_bridge structs. They are all PCI related, but the device
struct is the common base.
I can move them into the three PCI related structs separately and
leave the device struct alone. Will that be better?
I did add the pci-lu-test driver to consume and test the LUO/PCI
series for the bus master bits. Are you suggesting the LUO/PCI series
should wait until VFIO can consume the LUO/PCI changes, and then send
them out together as one series?
That will delay the LUO/PCI a bit for the VFIO.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 18:54 ` David Matlack
@ 2025-10-02 20:57 ` Chris Li
2025-10-02 21:31 ` David Matlack
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-02 20:57 UTC (permalink / raw)
To: David Matlack
Cc: Jason Gunthorpe, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 2, 2025 at 11:54 AM David Matlack <dmatlack@google.com> wrote:
> > Then don't do generic devices until we get iommufd done and you have a
> > meaningful in-tree driver to consume what you are adding.
>
> I agree with Jason. I don't think we can reasonably make the argument
> that we need this series until we have actual use cases for it.
>
> I think we should focus on vfio-pci device preservation next, and use
> that to incrementally drive whatever changes are necessary to the PCI
> and generic device layer bit by bit.
The feedback I got for the PCI V1 was to make it as minimal as
possible. We agree preserving the BUS MASTER bit is the first minimal
step. That is what I did in the V2 phase I series. Only the bus
master. I think the pci-lu-test driver did demo the bus master bit, it
is not vfio yet. At least that was the plan shared in the upstream
alignment meeting.
> For example, once we have basic vfio-pci device preservation working, we
> can start thinking about how to handle when that device is a VF, and we
> have to start also preserving the SR-IOV state on the PF and get the PF
SR-IOV is a much bigger step than the BUS Master bit. I recall at one
point in the upstream discussion meeting that we don't do SR-IOV as
the first step. I am not opposed to it, we need to get to vfio and
SR-IOV eventually. I just feel that the PCI + VFIO + SR-IOV will be a
much bigger series. I worry the series size is not friendly for
reviewers. I wish there could be smaller incremental steps digestible.
> driver involved in the process. At that point we can discuss how to
> solve that specific problem. Maybe the solution will look something like
> this series, maybe it will look like something else. There is open
> design space.
Yes doable, just will delay the LUO/PCI series by a bit and a much
bigger series.
> Without approaching it this way, I don't see how we can't reasonably
> argue that anything in this series is necessary. And I suspect some
> parts of this series truly are unnecessary, at least in the short term.
You have me on the double negatives, always not very good at those.
If the bigger series is what we want, I can do that. Just will have
some latency to get the VFIO.
> In our internal implementation, the only dependent device that truly
> needed to participate is the PF driver when a VF is preserved.
> Everything else (e.g. pcieport callbacks) have just been no-ops.
Your VF device does not need to preserve DMA? If you want to preserve
DMA the bus master bit is required, and the pcieport driver for the
PCI-PCI bridge is also required. I am not sure pure VF and PF without
any DMA makes practical sense.
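The dependency described here (DMA from a device only works if bus mastering stays enabled on every bridge up to the root) can be sketched with a mock device tree. PCI_COMMAND_MASTER is the real bus master enable bit of the PCI command register; struct mock_pci_dev and both helpers are hypothetical and only model the walk, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER 0x4	/* bus master enable bit, PCI command register */

/* Hypothetical mock: a device and a pointer to the bridge above it. */
struct mock_pci_dev {
	const char *name;
	uint16_t command;		/* cached PCI command register */
	struct mock_pci_dev *upstream;	/* NULL at the root */
};

/* DMA reaches memory only if every hop up to the root has bus master set. */
static inline bool dma_path_intact(const struct mock_pci_dev *dev)
{
	for (; dev; dev = dev->upstream)
		if (!(dev->command & PCI_COMMAND_MASTER))
			return false;
	return true;
}

/* Collect the requested device plus every bridge it depends on. */
static inline size_t collect_chain(struct mock_pci_dev *dev,
				   struct mock_pci_dev **out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	for (; dev && n < max; dev = dev->upstream)
		out[n++] = dev;
	return n;
}
```

Clearing the bit on any bridge in the collected chain makes dma_path_intact() fail for the endpoint, which is why the bridges end up on the list of devices to preserve.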
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 20:57 ` Chris Li
@ 2025-10-02 21:31 ` David Matlack
2025-10-02 23:21 ` Jason Gunthorpe
2025-10-03 5:17 ` Chris Li
0 siblings, 2 replies; 84+ messages in thread
From: David Matlack @ 2025-10-02 21:31 UTC (permalink / raw)
To: Chris Li
Cc: Jason Gunthorpe, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 2, 2025 at 1:58 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Oct 2, 2025 at 11:54 AM David Matlack <dmatlack@google.com> wrote:
> > > Then don't do generic devices until we get iommufd done and you have a
> > > meaningful in-tree driver to consume what you are adding.
> >
> > I agree with Jason. I don't think we can reasonably make the argument
> > that we need this series until we have actual use cases for it.
> >
> > I think we should focus on vfio-pci device preservation next, and use
> > that to incrementally drive whatever changes are necessary to the PCI
> > and generic device layer bit by bit.
>
> The feedback I got for the PCI V1 was to make it as minimal as
> possible. We agree preserving the BUS MASTER bit is the first minimal
> step. That is what I did in the V2 phase I series. Only the bus
> master. I think the pci-lu-test driver did demo the bus master bit, it
> is not vfio yet. At least that was the plan shared in the upstream
> alignment meeting.
What do the driver callbacks in patch 3 and patches 5-8 have to do
with preserving the bus master bit? That's half the series.
>
> > For example, once we have basic vfio-pci device preservation working, we
> > can start thinking about how to handle when that device is a VF, and we
> > have to start also preserving the SR-IOV state on the PF and get the PF
>
> SR-IOV is a much bigger step than the BUS Master bit. I recall at one
> point in the upstream discussion meeting that we don't do SR-IOV as
> the first step. I am not opposed to it, we need to get to vfio and
> SR-IOV eventually. I just feel that the PCI + VFIO + SR-IOV will be a
> much bigger series. I worry the series size is not friendly for
> reviewers. I wish there could be smaller incremental steps digestible.
SR-IOV is not a first step, of course. That's not what I said. I'm
saying SR-IOV is future work that could justify some of the larger
changes in this series (e.g. driver callbacks). So I am suggesting we
revisit those changes when we are working on SR-IOV.
>
> > driver involved in the process. At that point we can discuss how to
> > solve that specific problem. Maybe the solution will look something like
> > this series, maybe it will look like something else. There is open
> > design space.
>
> Yes doable, just will delay the LUO/PCI series by a bit and a much
> bigger series.
There will not be one "LUO/PCI series". There will be many incremental steps.
>
> > Without approaching it this way, I don't see how we can't reasonably
> > argue that anything in this series is necessary. And I suspect some
> > parts of this series truly are unnecessary, at least in the short term.
>
> You have me on the double negatives, always not very good at those.
> If the bigger series is what we want, I can do that. Just will have
> some latency to get the VFIO.
Oops, typo on my end. I meant "I don't see how we _can_ reasonably
argue". I am saying we don't have enough justification (yet) for a lot
of the code changes in this series.
>
> > In our internal implementation, the only dependent device that truly
> > needed to participate is the PF driver when a VF is preserved.
> > Everything else (e.g. pcieport callbacks) have just been no-ops.
>
> Your VF device does not need to preserve DMA? If you want to preserve
> DMA the bus master bit is required, and the pcieport driver for the
> PCI-PCI bridge is also required. I am not sure pure VF and PF without
> any DMA makes practical sense.
I'm saying the only drivers that actually needed to implement Live
Update driver callbacks have been vfio-pci and PF drivers. The
vfio-pci support doesn't exist yet upstream, and we are planning to
use FD preservation there. So we don't know if driver callbacks will
be needed for that. And we don't care about PF drivers until we get to
supporting SR-IOV. So the driver callbacks all seem unnecessary at
this point.
I totally agree we need to avoid clearing the bus master bit. Let's
focus on that.
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-09-30 16:37 ` Jason Gunthorpe
@ 2025-10-02 21:39 ` Chris Li
2025-10-03 14:28 ` Jason Gunthorpe
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-02 21:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 9:37 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> As I said, I would punt all of this to the initrd and let the initrd
> explicitly bind drivers.
You still need a mechanism that, after the PCI bridge scan creates the
pci_devs, prevents the drivers from being auto probed. If it is not
driver_override, it will be some new PCI API and liveupdate is the
first user of it.
I was afraid to add a new liveupdate specific PCI API for this
purpose. However, if that is what upstream wants, I can certainly do
it in the next version.
> The only behavior we need from the kernel is to not autobind some
> drivers so userspace can control it, and in a LUO type environment
> userspace should well know what drivers go where - or can get it from
> a preceding kernel from a memfd.
There are two slightly different things here:
1) modprobe the driver. That is typically controlled by udev.
2) auto probing the driver after the driver has been loaded or the PCI
device scanned.
As you envision it, the initrd controls both of the above two things,
right?
Chris
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-10-01 5:13 ` Greg Kroah-Hartman
@ 2025-10-02 22:05 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-02 22:05 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Pasha Tatashin, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 10:13 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> >
> > for example, the pci has this sysfs control api:
> >
> > "/sys/bus/pci/devices/0000:04:00.0/driver_override" which takes the
> > *driver name* as data to override what driver is allowed to bind to
> > this device.
> > Does this driver_override consider it as using the driver name as part
> > of the abi? If not, why?
>
> Because the bind/unbind/override was created as a debug facility for
> doing kernel development and then people have turned it into a "let's
> operate our massive cloud systems with this fragile feature".
Frankly, I did not know that it was a debug API or should be treated like one.
Let's say we want to get it right for now and for the future: any
suggestions or guidelines for the new API?
> We have never said that driver names will remain the same across
> releases, and they have changed over time. Device ids have also moved
That is fine. LUO PCI only records that, in the old kernel A1 that the
liveupdate starts from, the driver name is "foo1". The new kernel A2
that gets booted will know about the old kernel A1, at least in the
typical data center. There will be a test liveupdate from A1 to A2 for
validation before officially rolling out the liveupdate kernel. The new
kernel A2 can know that the driver "foo2" used to be called "foo1" in
A1, and it can let the PCI core bind the device to "foo2" instead.
Later, when A2 liveupdates to A3, A3 can drop the knowledge of "foo1"
once we are sure the A1 kernel is no longer supported.
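A minimal sketch of the rename handling described here, assuming the new kernel carries a small table of driver names that changed since the kernels it supports updating from. The table, struct lu_driver_rename and lu_resolve_driver() are hypothetical; "foo1" and "foo2" are the placeholder names from this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one entry per driver renamed since a supported old kernel. */
struct lu_driver_rename {
	const char *old_name;	/* name preserved by the old kernel (A1) */
	const char *new_name;	/* name of the same driver in this kernel (A2) */
};

static const struct lu_driver_rename lu_renames[] = {
	{ "foo1", "foo2" },
};

/* Map a preserved driver name to the name this kernel should bind. */
static inline const char *lu_resolve_driver(const char *preserved)
{
	size_t i;

	assert(preserved != NULL);
	for (i = 0; i < sizeof(lu_renames) / sizeof(lu_renames[0]); i++)
		if (strcmp(preserved, lu_renames[i].old_name) == 0)
			return lu_renames[i].new_name;
	return preserved;	/* unchanged names pass through */
}
```

Dropping the "foo1" entry in a later kernel corresponds to A3 no longer supporting a liveupdate from A1.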
> from one driver to another as well, making the "control" of the device
> seem to have changed names.
The name can be changed, just the new kernel needs to know about the
change and handle it. Extra complexity but not impossible.
>
> > What live update wants is to make that driver_override persistent over
> > kexec. It does not introduce the "driver_override" API. That is
> > pre-existing conditions. The PCI liveupdate just wants to use it.
>
> That does not mean that this is the correct api to use at all. Again,
> this was a debugging aid, to help with users who wanted to add a device
> id to a driver without having to rebuild it. Don't make it something
> that it was never intended to be.
>
> Why not just make a new api as you are doing something new here? That
> way you get to define it to work exactly the way you need?
Sure, I can invent a new API. I am just a bit afraid to introduce a
new API and carry the burden of supporting it forever.
Another idea is that we don't remember the driver's name. The kernel
just enforces that, if the device is a liveupdate device, there is no
auto probe at all. That pushes the responsibility to user space to load
the driver and manually bind the device to the right driver. User space
will still need to know the previous driver name, or some other way to
identify the right driver for this liveupdate process. Somebody will
need to know something like a driver name and pass that to the new
kernel to restore it, but not the kernel itself.
The drawback is extra latency in the blackout window: after the kernel
scans the PCI bus, a user space program has to run to bind and probe
the driver.
>
> > I want to get some basic understanding before adventure into the more
> > complex solutions.
>
> You mean "real" solutions :)
I mean the more upstream accepted solutions.
> It's not my requirement to say "here is C", but rather I am saying "B is
> not going to scale over time as GUIDs are a pain to manage".
I can agree to that.
> > Do you have any other suggestion how to prevent the live update PCI
> > device bind to a different driver after kexec? I am happy to work on
> > the direction you point out and turn that into a patch for the
> > discussion purpose.
>
> Why prevent it? Why not just have a special api just for drivers that
> want to use this new feature?
The typical GPU will be bound to the VFIO driver when a VM is using it.
If we don't prevent auto probe, the PCI device will be auto probed by
the native driver after the next kexec. Naturally, the native driver
will have no way to decode the data saved by the previous vfio driver.
Chris
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-10-02 6:09 ` Greg Kroah-Hartman
2025-10-02 13:23 ` Jason Gunthorpe
@ 2025-10-02 22:30 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-02 22:30 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Pasha Tatashin, Jason Gunthorpe, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Wed, Oct 1, 2025 at 11:09 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> Just keeping a device "alive" while rebooting into the same exact kernel
> image seems odd to me given that this is almost never what people
> actually do. They update their kernel with the weekly stable release to
> get the new bugfixes (remember we fix 13 CVEs a day), and away you go.
> You are saying that this workload would not actually be supported, so
> why do you want live update at all? Who needs this?
I saw Pasha reply to a lot of your questions. I can take a stab at who
needs it. Others, feel free to add to or correct me. The major cloud
vendors (you know who the usual suspects are) providing GPUs to VMs
will want it. The use case is that the VM is controlled by the
customer. The cloud provider has a contract on how much maintenance
downtime the VM can have. Let's say X seconds of maintenance downtime
per year. When upgrading the host kernel, typically the VM can be
migrated to another host without much interruption, so it does not
take much from the downtime budget. However, when you have a GPU
attached to the VM and the GPU is running some ML jobs, there is no
good way to migrate that GPU context to another machine. Instead, we
can do a liveupdate of the host kernel. During the liveupdate, the old
kernel saves the liveupdate state. The VM is paused in memory while
the GPU, as a PCI device, is kept running. The ML jobs are still up.
The kernel kexecs into the new kernel version, which restores and
reconstructs the software side of the device state. The VM re-attaches
to the file descriptor to get the previous context. In the end the VM
can resume running on the new kernel while the GPU keeps running the
ML job. From the VM's point of view, there are Y seconds during which
the VM does not respond during the kexec. The GPU did not lose its
context and the VM did not reboot. The benefit is that Y seconds is
much smaller than the time to reboot the VM and restart the GPU ML
jobs, so Y can fit into the X seconds of maintenance downtime per year
in the service contract.
Hope that explanation makes sense to you.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 21:31 ` David Matlack
@ 2025-10-02 23:21 ` Jason Gunthorpe
2025-10-02 23:42 ` David Matlack
2025-10-03 5:24 ` Chris Li
2025-10-03 5:17 ` Chris Li
1 sibling, 2 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-02 23:21 UTC (permalink / raw)
To: David Matlack
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Thu, Oct 02, 2025 at 02:31:08PM -0700, David Matlack wrote:
> I'm saying the only drivers that actually needed to implement Live
> Update driver callbacks have been vfio-pci and PF drivers. The
> vfio-pci support doesn't exist yet upstream, and we are planning to
> use FD preservation there. So we don't know if driver callbacks will
> be needed for that.
I don't expect driver callbacks through the pci subsystem, and I think
they should be removed from this series.
As I said the flow is backwards from what we want. The vfio driver
gets a luo FD from an ioctl, and it goes through and calls into luo
and pci with that session object to do all the required serialization.
Any required callbacks should be routed through luo based on creating
preserved objects within luo and providing ops to luo.
There is no other way to properly link things to sessions.
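The direction of control described above can be sketched as follows: the driver holds a session and registers the objects it wants preserved together with their ops, and any later callback reaches the driver through that session rather than through the bus. This is a userspace mock under stated assumptions; struct lu_session, struct lu_object_ops, lu_session_preserve() and lu_session_freeze() are hypothetical names and not the LUO API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-object ops supplied by whoever preserves the object. */
struct lu_object_ops {
	int (*freeze)(void *obj);
};

struct lu_preserved {
	const struct lu_object_ops *ops;
	void *obj;
};

#define LU_SESSION_MAX 8

/* Hypothetical session: the set of objects one user chose to preserve. */
struct lu_session {
	struct lu_preserved objs[LU_SESSION_MAX];
	size_t count;
};

/* Called by the driver (e.g. from its ioctl path) to add an object. */
static inline int lu_session_preserve(struct lu_session *s,
				      const struct lu_object_ops *ops, void *obj)
{
	assert(s != NULL && ops != NULL && ops->freeze != NULL);
	if (s->count == LU_SESSION_MAX)
		return -1;
	s->objs[s->count].ops = ops;
	s->objs[s->count].obj = obj;
	s->count++;
	return 0;
}

/* Callbacks are routed per session, only to objects registered in it. */
static inline int lu_session_freeze(struct lu_session *s)
{
	size_t i;

	assert(s != NULL);
	for (i = 0; i < s->count; i++) {
		int ret = s->objs[i].ops->freeze(s->objs[i].obj);

		if (ret)
			return ret;
	}
	return 0;
}

/* Mock driver-side op: counts how often the object was frozen. */
static inline int mock_vfio_freeze(void *obj)
{
	*(int *)obj += 1;
	return 0;
}

static const struct lu_object_ops mock_vfio_ops = { .freeze = mock_vfio_freeze };
```

Because every preserved object is registered against a session, a callback can always be attributed to the session that owns it, which a bus-wide callback forwarded by the PCI core cannot do.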
> And we don't care about PF drivers until we get to
> supporting SR-IOV. So the driver callbacks all seem unnecessary at
> this point.
I guess we will see, but I'm hoping we can get quite far using
vfio-pci as the SRIOV PF driver and don't need to try to get a big PF
in-kernel driver entangled in this.
Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 23:21 ` Jason Gunthorpe
@ 2025-10-02 23:42 ` David Matlack
2025-10-03 12:03 ` Jason Gunthorpe
2025-10-03 5:24 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: David Matlack @ 2025-10-02 23:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Thu, Oct 2, 2025 at 4:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Oct 02, 2025 at 02:31:08PM -0700, David Matlack wrote:
> > And we don't care about PF drivers until we get to
> > supporting SR-IOV. So the driver callbacks all seem unnecessary at
> > this point.
>
> I guess we will see, but I'm hoping we can get quite far using
> vfio-pci as the SRIOV PF driver and don't need to try to get a big PF
> in-kernel driver entangled in this.
So far we have had to support vfio-pci, pci-pf-stub, and idpf as PF
drivers, and nvme looks like it's coming soon :(
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 21:31 ` David Matlack
2025-10-02 23:21 ` Jason Gunthorpe
@ 2025-10-03 5:17 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-03 5:17 UTC (permalink / raw)
To: David Matlack
Cc: Jason Gunthorpe, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 2, 2025 at 2:31 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Oct 2, 2025 at 1:58 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Thu, Oct 2, 2025 at 11:54 AM David Matlack <dmatlack@google.com> wrote:
> > > > Then don't do generic devices until we get iommufd done and you have a
> > > > meaningful in-tree driver to consume what you are adding.
> > >
> > > I agree with Jason. I don't think we can reasonably make the argument
> > > that we need this series until we have actual use cases for it.
> > >
> > > I think we should focus on vfio-pci device preservation next, and use
> > > that to incrementally drive whatever changes are necessary to the PCI
> > > and generic device layer bit by bit.
> >
> > The feedback I got for the PCI V1 was to make it as minimal as
> > possible. We agree preserving the BUS MASTER bit is the first minimal
> > step. That is what I did in the V2 phase I series. Only the bus
> > master. I think the pci-lu-test driver did demo the bus master bit, it
> > is not vfio yet. At least that was the plan shared in the upstream
> > alignment meeting.
>
> What do the driver callbacks in patch 3 and patches 5-8 have to do
> with preserving the bus master bit? That's half the series.
I was thinking the pcie driver is for preserving the bus master bit on
the bridge. I just realized that the actual bus master bit preservation
happens at the pci_dev level. The pcie driver callback is effectively a
no-op. Yes, I can delete those from the series and not save driver data
at all. Point taken.
>
> >
> > > For example, once we have basic vfio-pci device preservation working, we
> > > can start thinking about how to handle when that device is a VF, and we
> > > have to start also preserving the SR-IOV state on the PF and get the PF
> >
> > SR-IOV is a much bigger step than the BUS Master bit. I recall at one
> > point in the upstream discussion meeting that we don't do SR-IOV as
> > the first step. I am not opposed to it, we need to get to vfio and
> > SR-IOV eventually. I just feel that the PCI + VFIO + SR-IOV will be a
> > much bigger series. I worry the series size is not friendly for
> > reviewers. I wish there could be smaller incremental steps digestible.
>
> SR-IOV is not a first step, of course. That's not what I said. I'm
> saying SR-IOV is future work that could justify some of the larger
> changes in this series (e.g. driver callbacks). So I am suggesting we
> revisit those changes when we are working on SR-IOV.
Just to confirm my understanding is aligned with yours: we remove the
driver callbacks from the PCI series and add them back when we get to
vfio SR-IOV.
> > > driver involved in the process. At that point we can discuss how to
> > > solve that specific problem. Maybe the solution will look something like
> > > this series, maybe it will look like something else. There is open
> > > design space.
> >
> > Yes doable, just will delay the LUO/PCI series by a bit and a much
> > bigger series.
>
> There will not be one "LUO/PCI series". There will be many incremental steps.
Oh, that is good then. I can still keep the PCI series as one incremental step.
>
> >
> > > Without approaching it this way, I don't see how we can't reasonably
> > > argue that anything in this series is necessary. And I suspect some
> > > parts of this series truly are unnecessary, at least in the short term.
> >
> > You have me on the double negatives, always not very good at those.
> > If the bigger series is what we want, I can do that. Just will have
> > some latency to get the VFIO.
>
> Oops, typo on my end. I meant "I don't see how we _can_ reasonably
> argue". I am saying we don't have enough justification (yet) for a lot
> of the code changes in this series.
Ack for removing driver callbacks in this series.
> > > In our internal implementation, the only dependent device that truly
> > > needed to participate is the PF driver when a VF is preserved.
> > > Everything else (e.g. pcieport callbacks) have just been no-ops.
> >
> > Your VF device does not need to preserve DMA? If you want to preserve
> > DMA the bus master bit is required, and the pcieport driver for the
> > PCI-PCI bridge is also required. I am not sure pure VF and PF without
> > any DMA makes practical sense.
>
> I'm saying the only drivers that actually needed to implement Live
> Update driver callbacks have been vfio-pci and PF drivers. The
> vfio-pci support doesn't exist yet upstream, and we are planning to
> use FD preservation there. So we don't know if driver callbacks will
> be needed for that. And we don't care about PF drivers until we get to
> supporting SR-IOV. So the driver callbacks all seem unnecessary at
> this point.
Ack.
>
> I totally agree we need to avoid clearing the bus master bit. Let's
> focus on that.
Ack.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 23:21 ` Jason Gunthorpe
2025-10-02 23:42 ` David Matlack
@ 2025-10-03 5:24 ` Chris Li
2025-10-03 12:06 ` Jason Gunthorpe
1 sibling, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-03 5:24 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Matlack, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 2, 2025 at 4:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 02:31:08PM -0700, David Matlack wrote:
> > I'm saying the only drivers that actually needed to implement Live
> > Update driver callbacks have been vfio-pci and PF drivers. The
> > vfio-pci support doesn't exist yet upstream, and we are planning to
> > use FD preservation there. So we don't know if driver callbacks will
> > be needed for that.
>
> I don't expect driver callbacks through the pci subsystem, and I think
> they should be removed from this series.
Per David's suggestion as well, I will remove the driver callbacks
from the PCI series.
>
> As I said the flow is backwards from what we want. The vfio driver
> gets a luo FD from an ioctl, and it goes through and calls into luo
> and pci with that session object to do all the required serialization.
>
> Any required callbacks should be routed through luo based on creating
> preserved objects within luo and providing ops to luo.
>
> There is no other way to properly link things to sessions.
As David pointed out in the other email, the PCI subsystem also supports
non-vfio PCI devices, which have no FD and no FD-related sessions.
That is the original intent for the LUO PCI subsystem.
> > And we don't care about PF drivers until we get to
> > supporting SR-IOV. So the driver callbacks all seem unnecessary at
> > this point.
>
> I guess we will see, but I'm hoping we can get quite far using
> vfio-pci as the SRIOV PF driver and don't need to try to get a big PF
> in-kernel driver entangled in this.
Yes, vfio-pci is a big series as well. It might be easier to get PF DMA
working with live update on a NIC, but that work will be thrown away
once we have vfio-pci as the real user. Actually, getting the
pci-pf-stub driver working would be a smaller and more reasonable step
to justify the PF support in LUO PCI.
Chris
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-29 17:46 ` Jason Gunthorpe
2025-09-30 2:13 ` Chris Li
@ 2025-10-03 5:33 ` Chris Li
2025-10-03 14:04 ` Jason Gunthorpe
1 sibling, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-03 5:33 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Mon, Sep 29, 2025 at 10:46 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> > static int pci_liveupdate_prepare(void *arg, u64 *data)
> > {
> > + LIST_HEAD(requested_devices);
> > +
> > pr_info("prepare data[%llx]\n", *data);
> > +
> > + pci_lock_rescan_remove();
> > + down_write(&pci_bus_sem);
> > +
> > + build_liveupdate_devices(&requested_devices);
> > + cleanup_liveupdate_devices(&requested_devices);
> > +
> > + up_write(&pci_bus_sem);
> > + pci_unlock_rescan_remove();
> > return 0;
> > }
>
> This doesn't seem conceptually right, PCI should not be preserving
> everything. Only devices and their related hierarchy that are opted
> into live update by iommufd should be preserved.
The consideration is that some non-vfio devices, like IDPF, are preserved
as well. Does iommufd encapsulate the whole PCI device hierarchy? My
thinking was that the PCI layer knows about the PCI device hierarchy,
hence the use of pci_dev->dev.lu.flags to indicate participation in
PCI liveupdate. I am not sure how to drive that from iommufd. Can you
explain a bit more?
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 20:38 ` Chris Li
@ 2025-10-03 6:18 ` Greg Kroah-Hartman
2025-10-03 7:26 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-03 6:18 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Thu, Oct 02, 2025 at 01:38:56PM -0700, Chris Li wrote:
> On Tue, Sep 30, 2025 at 8:30 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > > include/linux/dev_liveupdate.h | 23 +++++
> > > include/linux/device/driver.h | 6 ++
> >
> > Driver core changes under the guise of only PCI changes? Please no.
>
> There is a reason why I use the device struct rather than the pci_dev
> struct even though liveupdate currently only works with PCI devices.
> It comes down to the fact that the pci_bus and pci_host_bridge are not
> pci_dev struct. We need something that is common across all those
> three types of PCI related struct I care about(pci_dev, pci_bus,
> pci_host_bridge). The device struct is just common around those. I can
> move the dev_liveupdate struct into pci_bus, pci_host_bridge and
> pci_dev independently. That will be more contained inside PCI, not
> touching the device struct. The patch would be bigger because the data
> structure is spread into different structs. Do you have a preference
> which way to go?
If you only are caring about one single driver, don't mess with a
subsystem or the driver core, just change the driver. My objection here
was that you were claiming it was a PCI change, yet it was actually only
touching the driver core which means that all devices in the systems for
all Linux users will be affected.
> > Break this series out properly, get the driver core stuff working FIRST,
> > then show how multiple busses will work with them (i.e. you usually need
> > 3 to know if you got it right).
>
> Multiple buses you mean different types of bus, e.g. USB, PCI and
> others or 3 pci_bus is good enough? Right now we have no intention to
> support bus types other than PCI devices. The liveupdate is about
> preserving the GPU context cross kernel upgrade. Suggestion welcome.
So all of this is just for one single driver. Ugh. Just do it in the
single driver then, don't mess with the driver core, or even the PCI
core. Just make it specific to the driver and then none of us will even
notice the mess that this all creates :)
thanks,
greg k-h
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-30 15:26 ` Greg Kroah-Hartman
@ 2025-10-03 6:57 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-03 6:57 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Tue, Sep 30, 2025 at 8:30 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> > #define pr_fmt(fmt) "PCI liveupdate: " fmt
> > +#define dev_fmt(fmt) "PCI liveupdate: " fmt
>
> Please no. Use the default dev_ formatting so that people can correctly
> track the devices spitting out messages here.
Ack. I added this private dev_fmt prefix in response to other feedback.
I can remove it.
>
> > +#include <linux/types.h>
> > #include <linux/liveupdate.h>
> > +#include "pci.h"
> >
> > #define PCI_SUBSYSTEM_NAME "pci"
>
> I still don't know why this is needed, why?
Oh, this is required by the LUO subsystem registration interface.
Pasha can comment more on the LUO subsystem API design. Each subsystem
uses a name for subsystem registration and lookup.
https://lore.kernel.org/linux-mm/20250929010321.3462457-10-pasha.tatashin@soleen.com/
>
> >
> > +static void stack_push_buses(struct list_head *stack, struct list_head *buses)
> > +{
> > + struct pci_bus *bus;
> > +
> > + list_for_each_entry(bus, buses, node)
> > + list_move_tail(&bus->dev.lu.lu_next, stack);
> > +}
> > +
> > +static void liveupdate_add_dev(struct device *dev, struct list_head *head)
> > +{
> > + dev_info(dev, "collect liveupdate device: flags %x\n", dev->lu.flags);
>
> Debugging code can go away please.
I was considering this part of the standard kernel output for dmesg
logging purposes, rather than debugging. It is only triggered when a
liveupdate device is requested. Only a small handful of the PCI devices
will be liveupdate devices. The same set of devices is expected to be
restored when the new kernel boots after kexec. If the set of devices
mismatches, that is a bug. If the set is not in dmesg, it is very hard
to find out that it was mismatched.
I consider it similar to a storage driver reporting that it found
storage device "/dev/sda" etc. when the kernel boots up. Those are
not debugging prints.
Please let me know if that is not justifiable or if there is another
mechanism to do the above logging.
> > + list_move_tail(&dev->lu.lu_next, head);
> > +}
> > +
> > +static int collect_bus_devices_reverse(struct pci_bus *bus, struct list_head *head)
> > +{
> > + struct pci_dev *pdev;
> > + int count = 0;
> > +
> > + list_for_each_entry_reverse(pdev, &bus->devices, bus_list) {
>
> Why are you allowed to walk the pci bus list here? Shouldn't there be
> some type of core function to do that?
By core function, do you mean the device core? This is the PCI
liveupdate core function.
>
> And why in reverse?
Very good question. Reverse is to allow the later created device to
mark the earlier created device as dependent. For example, the VF can
mark the PF as a dependent device.
> > + if (pdev->dev.lu.flags & LU_BUSMASTER && pdev->dev.parent)
> > + pdev->dev.parent->lu.flags |= LU_BUSMASTER_BRIDGE;
> > + if (pdev->dev.lu.flags) {
> > + liveupdate_add_dev(&pdev->dev, head);
> > + count++;
>
> No locking?
The locking is done in the calling function.
>
> > + }
> > + }
> > + return count;
>
> What prevents this value from changing right after you return it?
The liveupdate device collection is static per prepare() callback.
Each PCI device needs to go through the same set of callbacks:
prepare(), freeze() and, after kexec, finish(). It is a bug if
some device calls freeze() without prepare(). For each liveupdate
session, the number of devices participating in liveupdate is fixed,
e.g. if a new VM tries to add another GPU device to the liveupdate set
after the prepare stage, that is not allowed. This list of liveupdate
devices remains fixed during the liveupdate session.
Please see Pasha's LUO V4 patch for more detail on the LUO states and
the callbacks invoked on state transitions.
https://lore.kernel.org/linux-mm/20250929010321.3462457-1-pasha.tatashin@soleen.com/
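The rule that the device set is frozen once prepare() has run can be
illustrated with a small user-space state machine. This is only a sketch
under assumptions: struct toy_session, the toy_* helpers and the three
states are hypothetical names, not the LUO API, and finish() is modelled
as a plain transition even though in reality it runs in the new kernel
after kexec. Only the cancel transition mentioned in this thread
(prepared to normal) is modelled.

```c
#include <assert.h>

/* Hypothetical session states; the real LUO state names may differ. */
enum toy_state { TOY_NORMAL, TOY_PREPARED, TOY_FROZEN };

struct toy_session {
	enum toy_state state;
	int nr_devices;		/* size of the requested device set */
};

/* Move from one state to another; returns -1 if we are not in 'from'. */
static int toy_transition(struct toy_session *s, enum toy_state from,
			  enum toy_state to)
{
	if (s->state != from)
		return -1;
	s->state = to;
	return 0;
}

/* Devices may only be requested before prepare(). */
static int toy_request_device(struct toy_session *s)
{
	if (s->state != TOY_NORMAL)
		return -1;
	s->nr_devices++;
	return 0;
}

static int toy_prepare(struct toy_session *s)
{
	return toy_transition(s, TOY_NORMAL, TOY_PREPARED);
}

static int toy_freeze(struct toy_session *s)
{
	return toy_transition(s, TOY_PREPARED, TOY_FROZEN);
}

static int toy_cancel(struct toy_session *s)
{
	return toy_transition(s, TOY_PREPARED, TOY_NORMAL);
}

static int toy_finish(struct toy_session *s)
{
	return toy_transition(s, TOY_FROZEN, TOY_NORMAL);
}
```

With this shape, freeze() without a prior prepare() and a late device
request both fail with an error instead of silently changing the set.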
> > +}
> > +
> > +static int build_liveupdate_devices(struct list_head *head)
> > +{
> > + LIST_HEAD(bus_stack);
> > + int count = 0;
> > +
> > + stack_push_buses(&bus_stack, &pci_root_buses);
> > +
> > + while (!list_empty(&bus_stack)) {
> > + struct device *busdev;
> > + struct pci_bus *bus;
> > +
> > + busdev = list_last_entry(&bus_stack, struct device, lu.lu_next);
> > + bus = to_pci_bus(busdev);
> > + if (!busdev->lu.visited && !list_empty(&bus->children)) {
> > + stack_push_buses(&bus_stack, &bus->children);
> > + busdev->lu.visited = 1;
> > + continue;
> > + }
> > +
> > + count += collect_bus_devices_reverse(bus, head);
> > + busdev->lu.visited = 0;
> > + list_del_init(&busdev->lu.lu_next);
> > + }
> > + return count;
>
> A comment here about what you are trying to do with walking the list of
This is a post-order traversal of the PCI buses. It makes sure to visit
the child bus before the parent bus, so that the child bus can mark the
parent bus as dependent. E.g. if we want to preserve the bus master bit
on a leaf node PCI device, all the parent bridges up to the root bridge
need to preserve the bus master bit as well. Otherwise the device will
not be able to DMA, because a parent bridge has bus master disabled.
> devices. Somehow. Are you sure that's right? It feels backwards, and
I am confident this is right. This liveupdate device list collection
is the core value of the LUO/PCI series. This way it only needs to
walk the tree once, rather than in two passes, where the first pass
marks the parents recursively up to the root and the second pass walks
the tree to collect the buses/devices that are marked.
Let me know if you think there is a bug somewhere. This code has been
running in our internal liveupdate kernel for vfio and DMA liveupdate
for over a month now.
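The single-pass walk described above can be sketched in user-space C.
This is an illustration, not the patch code: struct toy_bus, the TOY_*
flags and the fixed-size arrays are hypothetical stand-ins for
struct pci_bus, the LU_* flags and the list_head based stack, and the
sketch marks the parent for any non-zero flag rather than only for
LU_BUSMASTER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BUSES 16
#define TOY_REQUESTED 0x1u	/* stand-in for a requested flag */
#define TOY_DEPENDED  0x2u	/* stand-in for a derived (dependent) flag */

/* Hypothetical bus node; the whole tree must fit in TOY_MAX_BUSES nodes. */
struct toy_bus {
	unsigned int flags;
	bool visited;
	struct toy_bus *parent;
	struct toy_bus *children[TOY_MAX_BUSES];
	size_t nr_children;
};

static void toy_add_child(struct toy_bus *parent, struct toy_bus *child)
{
	assert(parent->nr_children < TOY_MAX_BUSES);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/*
 * Iterative post-order walk with an explicit stack. A bus is seen twice:
 * the first time its children are pushed and 'visited' is set, the second
 * time all children are done and the bus itself is collected. Collected
 * buses mark their parent as dependent, so the mark propagates to the
 * root within the same pass. 'out' must hold TOY_MAX_BUSES entries.
 */
static size_t toy_collect(struct toy_bus *root, struct toy_bus **out)
{
	struct toy_bus *stack[TOY_MAX_BUSES];
	size_t top = 0, n = 0;

	stack[top++] = root;
	while (top > 0) {
		struct toy_bus *bus = stack[top - 1];

		if (!bus->visited && bus->nr_children > 0) {
			for (size_t i = 0; i < bus->nr_children; i++) {
				assert(top < TOY_MAX_BUSES);
				stack[top++] = bus->children[i];
			}
			bus->visited = true;
			continue;
		}

		top--;			/* leaf, or second visit */
		bus->visited = false;	/* reset for the next walk */
		if (bus->flags) {
			if (bus->parent)
				bus->parent->flags |= TOY_DEPENDED;
			out[n++] = bus;
		}
	}
	return n;
}
```

For a tree root -> a -> leaf with only the leaf requested, the walk
collects leaf, a and root in that order, and an unrelated sibling bus of
a stays out of the list.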
> the lack of any locking makes me very nervous. How is this integrating
> into the normal driver model lists?
It does not integrate into the normal driver model lists. It is a new
list. Because the list needs to be fixed during a liveupdate session, it
is simpler to use a new list than to walk the existing driver model
list to find out which devices have the liveupdate flags. The VM can
add a GPU and then remove it before reaching the kernel liveupdate
kexec, which also impacts the dependent devices (bridge/PF).
The set of liveupdate requested devices is dynamic before prepare().
Creating a new list in prepare() is simpler.
> > +}
> > +
> > +static void cleanup_liveupdate_devices(struct list_head *head)
> > +{
> > + struct device *d, *n;
> > +
> > + list_for_each_entry_safe(d, n, head, lu.lu_next) {
> > + d->lu.flags &= ~LU_DEPENDED;
> > + list_del_init(&d->lu.lu_next);
> > + }
> > +}
>
> What does "cleanup" mean?
Cleanup means removing the dependent device flags, which are derived
from the requested PCI devices. The devices are also removed from the
liveupdate device list because they are not in the liveupdate session
any more.
This cleanup happens on the transition from the "finish" to the "normal"
state due to finish(), or from "prepare" to "normal" due to cancel().
>
> > +
> > static int pci_liveupdate_prepare(void *arg, u64 *data)
> > {
> > + LIST_HEAD(requested_devices);
> > +
> > pr_info("prepare data[%llx]\n", *data);
>
> Addresses written to the kernel log?
Ack, my bad. Will fix that, as pointed out by Jason as well.
>
> > +
> > + pci_lock_rescan_remove();
> > + down_write(&pci_bus_sem);
> > +
> > + build_liveupdate_devices(&requested_devices);
>
> Ah, you lock here. Document the heck out of this and put the proper
Do you mean in this function, or in build_liveupdate_devices() as well?
> build macros in there so we know what is going on.
By build macros, do you mean the kind of macro that asserts a lock is held?
Let me know if you have other types of build macros in mind.
>
> > + cleanup_liveupdate_devices(&requested_devices);
> > +
> > + up_write(&pci_bus_sem);
>
> Why is it a write? You aren't modifying the list, are you?
I am modifying the requested_devices list but not the normal PCI
device driver model list. I think I can change to down_read()/up_read().
>
> > + pci_unlock_rescan_remove();
> > return 0;
> > }
> >
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index e8318fd5f6ed537a1b236a3a0f054161d5710abd..0e9ef387182856771d857181d88f376632b46f0d 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -304,6 +304,7 @@ static int pcie_device_init(struct pci_dev *pdev, int service, int irq)
> > device = &pcie->device;
> > device->bus = &pcie_port_bus_type;
> > device->release = release_pcie_device; /* callback to free pcie dev */
> > + dev_liveupdate_init(device);
>
> Why here?
Because the device has just been allocated. I need to initialize the
device->lu.lu_next list pointers. Otherwise list debug will complain of
NULL pointers when the device is added to the liveupdate device list.
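Why the list node has to be initialized before its first move can be
seen in a minimal user-space model of a circular doubly linked list. The
helpers below are hypothetical simplified stand-ins written for this
sketch; they mirror what INIT_LIST_HEAD() and list_move_tail() do but are
not the kernel implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_list { struct toy_list *next, *prev; };

/* An initialized node points at itself, i.e. it is an empty list. */
static void toy_list_init(struct toy_list *h)
{
	h->next = h;
	h->prev = h;
}

static bool toy_list_empty(const struct toy_list *h)
{
	return h->next == h;
}

/*
 * Move 'e' to the tail of 'head'. The unlink step dereferences e->prev
 * and e->next, so a zero-filled (never initialized) node would fault
 * here. A self-pointing node makes the unlink a harmless no-op.
 */
static void toy_list_move_tail(struct toy_list *e, struct toy_list *head)
{
	e->prev->next = e->next;	/* unlink from wherever it is */
	e->next->prev = e->prev;

	e->prev = head->prev;		/* link in before 'head' */
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}
```

Because the move always unlinks first, the same helper works both for a
node that was never on a list and for one being moved between lists.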
>
> > dev_set_name(device, "%s:pcie%03x",
> > pci_name(pdev),
> > get_descriptor_id(pci_pcie_type(pdev), service));
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index 4b8693ec9e4c67fc1655e0057b3b96b4098e6630..dddd7ebc03d1a6e6ee456e0bf02ab9833a819509 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -614,6 +614,7 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
> > INIT_LIST_HEAD(&b->devices);
> > INIT_LIST_HEAD(&b->slots);
> > INIT_LIST_HEAD(&b->resources);
> > + dev_liveupdate_init(&b->dev);
>
> Same, why here? Shouldn't the driver core be doing this all for you
> automatically? Are you going to make each bus do this manually?
No, the PCI enumeration happens well before the driver core registers
the device. We already need to add the device to the liveupdate device
list during PCI enumeration. That is before the driver is bound and
probed.
Yes, it needs to happen when the bus is allocated and initialized,
earlier than driver init.
> > b->max_bus_speed = PCI_SPEED_UNKNOWN;
> > b->cur_bus_speed = PCI_SPEED_UNKNOWN;
> > #ifdef CONFIG_PCI_DOMAINS_GENERIC
> > @@ -1985,6 +1986,7 @@ int pci_setup_device(struct pci_dev *dev)
> > dev->sysdata = dev->bus->sysdata;
> > dev->dev.parent = dev->bus->bridge;
> > dev->dev.bus = &pci_bus_type;
> > + dev_liveupdate_init(&dev->dev);
>
> Looks like you are :(
Yes, I need it initialized earlier. Suggestions are welcome. I haven't
found a way to insert dev_liveupdate_init() into an existing device
init function. The existing device init function is called too late.
> Do it in one place please.
Which place? If there is a function that is called by all the different
flavors of device and runs early enough, I am happy to move it
there. There is none as far as I can tell.
> > dev->hdr_type = hdr_type & 0x7f;
> > dev->multifunction = !!(hdr_type & 0x80);
> > dev->error_state = pci_channel_io_normal;
> > @@ -3184,7 +3186,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
> > return NULL;
> >
> > bridge->dev.parent = parent;
> > -
> > + dev_liveupdate_init(&bridge->dev);
>
> Again, one place.
Any suggestions on where to move it? dev_liveupdate_init() is the one
place that performs the work. It just needs to have multiple callers. I
can't find an alternative yet.
> > list_splice_init(resources, &bridge->windows);
> > bridge->sysdata = sysdata;
> > bridge->busnr = bus;
> > diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..72297cba08a999e89f7bc0997dabdbe14e0aa12c
> > --- /dev/null
> > +++ b/include/linux/dev_liveupdate.h
> > @@ -0,0 +1,44 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Pasha Tatashin <pasha.tatashin@soleen.com>
> > + * Chris Li <chrisl@kernel.org>
> > + */
> > +#ifndef _LINUX_DEV_LIVEUPDATE_H
> > +#define _LINUX_DEV_LIVEUPDATE_H
> > +
> > +#include <linux/liveupdate.h>
> > +
> > +#ifdef CONFIG_LIVEUPDATE
> > +
> > +enum liveupdate_flag {
> > + LU_BUSMASTER = 1 << 0,
> > + LU_BUSMASTER_BRIDGE = 2 << 0,
>
> BIT() please.
Ack. Will do.
>
> > +};
> > +
> > +#define LU_REQUESTED (LU_BUSMASTER)
> > +#define LU_DEPENDED (LU_BUSMASTER_BRIDGE)
>
> Why 2 names for the same thing?
LU_DEPENDED covers all dependent devices, i.e. the derived devices that
get pulled into the liveupdate device set. When a liveupdate session
gets canceled, those derived devices need to be cleaned up. There
will be more liveupdate feature flags (e.g. LU_DMA, LU_SRIOV) added to
the requested and dependent flags. These two are the aggregate flags
for all requested features and dependent features.
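The split between requested and derived flags can be sketched as below.
This is a user-space illustration only: BIT() is redefined locally, and
toy_cancel() and toy_participates() are hypothetical helpers, not
functions from the patch. The two enum values and the two aggregate
masks follow the quoted header, written with BIT() as requested.

```c
#include <assert.h>

#define BIT(n) (1u << (n))	/* local stand-in for the kernel macro */

enum liveupdate_flag {
	LU_BUSMASTER        = BIT(0),	/* requested on the device itself */
	LU_BUSMASTER_BRIDGE = BIT(1),	/* derived: a child needs bus master */
};

/* Aggregate masks; future feature flags would be OR-ed in here. */
#define LU_REQUESTED	(LU_BUSMASTER)
#define LU_DEPENDED	(LU_BUSMASTER_BRIDGE)

/* On cancel only the derived bits are dropped; requests survive. */
static unsigned int toy_cancel(unsigned int flags)
{
	return flags & ~(unsigned int)LU_DEPENDED;
}

/* A device is in the liveupdate set if any flag of either kind is set. */
static int toy_participates(unsigned int flags)
{
	return (flags & (LU_REQUESTED | LU_DEPENDED)) != 0;
}
```

Keeping the two masks separate is what lets the cleanup path clear the
derived state with a single mask while leaving the explicit requests
untouched.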
> > +
> > +/**
> > + * struct dev_liveupdate - Device state for live update operations
> > + * @lu_next: List head for linking the device into live update
> > + * related lists (e.g., a list of devices participating
> > + * in a live update sequence).
> > + * @flags: Indicate what liveupdate feature does the device
> > + * participtate.
> > + * @visited: Only used by the bus devices when travese the PCI buses
> > + * to build the liveupdate devices list. Set if the child
> > + * buses have been pushed into the pending stack.
> > + *
> > + * This structure holds the state information required for performing
> > + * live update operations on a device. It is embedded within a struct device.
> > + */
> > +struct dev_liveupdate {
> > + struct list_head lu_next;
>
> Another list?
Yes, as explained earlier, a fixed list for the liveupdate session.
>
> > + enum liveupdate_flag flags;
> > + bool visited:1;
>
> You shouldn't need this, you "know" you only touch one device at a time
> when walking a bus, don't try to manually keep track of it on your own.
No, I do need this because of the post-order traversal of the buses. I
need to know whether this is the first time I visit a bus. If it is, I
walk its child buses first. Otherwise it is the second visit, all
the child buses have been visited, and the bus is added to the
liveupdate list if it has non-zero liveupdate flags.
I could do recursive bus walking without an additional bit to
indicate whether this is the first visit to the bus, but recursive tree
walking in the kernel is considered bad due to the stack usage.
> And again, why is the pci core doing this, the driver core should be
> doing all of this, PLEASE do not bury driver-model-core-changes down in
The driver core does not have the knowledge to do this, e.g. the PF
and VF relationship. The reason the liveupdate struct was added to the
device struct is that the device struct is embedded in pci_dev,
pci_host_bridge and pci_bus. Those are the three structs I care about
for liveupdate.
The alternative is adding the liveupdate struct to the above three
structs separately without touching the device struct. The patch will be
bigger, if that is possible at all. I recall having some problem due to
bus->bridge being a device struct rather than a pci_dev or
pci_host_bridge. I can try that again if you think that is better.
> a "PCI" patch. That will make the driver core maintainers very grumpy
> when they run across stuff like this (as it did here...)
By driver core, I assume you mean the core around "struct device". As I
explained earlier, this needs PCI-specific knowledge outside of the
common driver core. Suggestions welcome on how to upset you less :-)
> > +};
> > +
> > +#endif /* CONFIG_LIVEUPDATE */
> > +#endif /* _LINUX_DEV_LIVEUPDATE_H */
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 4940db137fffff4ceacf819b32433a0f4898b125..e0b35c723239f1254a3b6152f433e0412cd3fb34 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -21,6 +21,7 @@
> > #include <linux/lockdep.h>
> > #include <linux/compiler.h>
> > #include <linux/types.h>
> > +#include <linux/dev_liveupdate.h>
>
> Look, driver core changes. Please do this all in stuff that is NOT for
> just PCI.
But I only have PCI devices that are supported. Should I also consider
having an entry point in the device core, with PCI as one of the
registered device subsystems that knows about liveupdate and gets called
that way?
Another way is to just move liveupdate entirely into the PCI-related
structs.
> > #include <linux/mutex.h>
> > #include <linux/pm.h>
> > #include <linux/atomic.h>
> > @@ -508,6 +509,7 @@ struct device_physical_location {
> > * @pm_domain: Provide callbacks that are executed during system suspend,
> > * hibernation, system resume and during runtime PM transitions
> > * along with subsystem-level and driver-level callbacks.
> > + * @lu: Live update state.
>
> You have more letters, please use them. "lu" is too short.
>
> > * @em_pd: device's energy model performance domain
> > * @pins: For device pin management.
> > * See Documentation/driver-api/pin-control.rst for details.
> > @@ -603,6 +605,10 @@ struct device {
> > struct dev_pm_info power;
> > struct dev_pm_domain *pm_domain;
> >
> > +#ifdef CONFIG_LIVEUPDATE
> > + struct dev_liveupdate lu;
> > +#endif
>
> Why not a pointer?
To avoid a memory allocation failure when the prepare() callback tries
to mark a dependent device for liveupdate. Actually, now that
prepare() can be cancelled, I can make this a pointer and
dynamically allocate the lu struct as well, if that is preferred.
>
> > +
> > #ifdef CONFIG_ENERGY_MODEL
> > struct em_perf_domain *em_pd;
> > #endif
> > @@ -1168,4 +1174,13 @@ void device_link_wait_removal(void);
> > #define MODULE_ALIAS_CHARDEV_MAJOR(major) \
> > MODULE_ALIAS("char-major-" __stringify(major) "-*")
> >
> > +#ifdef CONFIG_LIVEUPDATE
> > +static inline void dev_liveupdate_init(struct device *dev)
> > +{
> > + INIT_LIST_HEAD(&dev->lu.lu_next);
>
> Why does this have to be in device.h? The driver core should do this
> for you (as I say above).
I need a more specific pointer to which driver core function can do
this for me. PCI device enumeration happens pretty early, that is before
registering the device.
Thanks for the long detailed feedback. I am still working my way
through my email backlog.
Chris
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-09-30 16:47 ` Jason Gunthorpe
@ 2025-10-03 7:09 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-03 7:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Tue, Sep 30, 2025 at 9:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Sep 29, 2025 at 07:13:51PM -0700, Chris Li wrote:
> > Can you elaborate? This is not preserving everything. For preserving
> > bus master, only the device and the parent PCI bridge are added to the
> > requested_devices list. That is done in
> > build_liveupdate_devices(): the device is added to the list head passed
> > into the function. So it matches the "their related hierarchy" part.
> > Can you explain what unnecessary device was preserved in this?
>
> I expected an exported function to request a pci device be preserved
> and to populate a tracking list linked to a luo session when that
> function is called.
The current PCI subsystem is designed outside of memfd.
As for an exported function that requests a PCI device and populates a
liveupdate device list: that has been considered, and the current
approach is simpler. The reason is that, to populate the device list,
you have to know all the device dependency rules: devices depend on
their parent bridge, and a VF depends on its PF. Because a request can
also be canceled before reaching the live update prepare(), those
derived dependent flags need to be tracked and reference counted. Even
worse, they need to be reference counted per liveupdate feature, e.g.
LU_BUSMASTER vs LU_SRIOV vs LU_DMA each need their own reference
counter, so that a dependent flag can be removed when its refcount
drops to zero.
> This flags and then search over all the buses seems, IDK, strange and
> should probably be justified.
The current approach is much simpler when requesting and unrequesting a
PCI device. It does not need to recursively walk the parents or the PF
relationship.
In prepare() it only walks the PCI root bus tree top down in one pass.
That is the only place that deals with dependency relationships. It is
simpler and doesn't need to maintain a per-feature dependent refcount.
That is my justification.
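For comparison, here is a rough sketch of the bookkeeping the
request-time alternative would need. It is not proposed code: struct
toy_dev, the TOY_FEAT_* indices and the toy_* helpers are hypothetical,
and only the parent chain is modelled (no PF/VF relationship).

```c
#include <assert.h>
#include <stddef.h>

/* One counter per feature, since each feature derives its own flag. */
enum { TOY_FEAT_BUSMASTER, TOY_FEAT_SRIOV, TOY_FEAT_DMA, TOY_FEAT_COUNT };

struct toy_dev {
	struct toy_dev *parent;			/* NULL at the root */
	unsigned int dep_refs[TOY_FEAT_COUNT];	/* children needing us */
};

/* Requesting a feature takes a reference on every ancestor. */
static void toy_request(struct toy_dev *dev, int feat)
{
	for (struct toy_dev *p = dev->parent; p; p = p->parent)
		p->dep_refs[feat]++;
}

/* Unrequesting must undo exactly the same walk. */
static void toy_unrequest(struct toy_dev *dev, int feat)
{
	for (struct toy_dev *p = dev->parent; p; p = p->parent) {
		assert(p->dep_refs[feat] > 0);
		p->dep_refs[feat]--;
	}
}

/* The derived flag is "set" while any child still holds a reference. */
static int toy_is_dependent(const struct toy_dev *dev, int feat)
{
	return dev->dep_refs[feat] > 0;
}
```

Every request and unrequest walks the ancestor chain and has to stay
balanced per feature, which is the cost the flag-and-walk approach in
prepare() avoids.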
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 6:18 ` Greg Kroah-Hartman
@ 2025-10-03 7:26 ` Chris Li
2025-10-03 12:26 ` Greg Kroah-Hartman
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-03 7:26 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Thu, Oct 2, 2025 at 11:19 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, Oct 02, 2025 at 01:38:56PM -0700, Chris Li wrote:
> > On Tue, Sep 30, 2025 at 8:30 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > > > include/linux/dev_liveupdate.h | 23 +++++
> > > > include/linux/device/driver.h | 6 ++
> > >
> > > Driver core changes under the guise of only PCI changes? Please no.
> >
> > There is a reason why I use the device struct rather than the pci_dev
> > struct even though liveupdate currently only works with PCI devices.
> > It comes down to the fact that the pci_bus and pci_host_bridge are not
> > pci_dev struct. We need something that is common across all those
> > three types of PCI related struct I care about(pci_dev, pci_bus,
> > pci_host_bridge). The device struct is just common around those. I can
> > move the dev_liveupdate struct into pci_bus, pci_host_bridge and
> > pci_dev independently. That will be more contained inside PCI, not
> > touching the device struct. The patch would be bigger because the data
> > structure is spread into different structs. Do you have a preference
> > which way to go?
>
> If you only are caring about one single driver, don't mess with a
> subsystem or the driver core, just change the driver. My objection here
It is more than just one driver, we have vfio-pci, idpf, pci-pf-stub
and possibly the nvme driver.
The change needs to happen in the PCI enumeration and probing as well,
that is outside of the driver code.
> was that you were claiming it was a PCI change, yet it was actually only
> touching the driver core which means that all devices in the systems for
In theory all the devices can be liveupdate preserved. But now we only
support PCI.
I can look into containing the change in PCI only and not touching the
device struct, if that is what you mean. I recall I tried that
previously and failed because bus->bridge is a device struct rather
than a pci_dev or pci_host_bridge. I can try harder not to touch the
device struct. The patch will be bigger and more complex than it is
right now, but at least the damage is limited to PCI if successful.
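A rough user-space model of why bus->bridge gets in the way, under
heavy simplification: the struct layouts below are stand-ins, not the
kernel definitions, and enum dev_kind plus bridge_lu() are hypothetical.
If the liveupdate state lives in each PCI type instead of in the device
struct, any code that starts from bus->bridge has to work out which
containing type it is looking at first.

```c
#include <assert.h>
#include <stddef.h>

struct dev_liveupdate {
	unsigned int flags;
};

/* Hypothetical type tag; a stand-in for telling the device types apart. */
enum dev_kind {
	KIND_PCI_DEV,
	KIND_HOST_BRIDGE,
};

/* Simplified stand-ins: each PCI type embeds a device and its own lu. */
struct device {
	enum dev_kind kind;
};

struct pci_dev {
	struct device dev;
	struct dev_liveupdate lu;
};

struct pci_host_bridge {
	struct device dev;
	struct dev_liveupdate lu;
};

struct pci_bus {
	struct device *bridge; /* points at a pci_dev or a host bridge */
	struct dev_liveupdate lu;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Find the liveupdate state of whatever sits upstream of this bus. */
static struct dev_liveupdate *bridge_lu(struct pci_bus *bus)
{
	struct device *d = bus->bridge;

	assert(d != NULL);
	switch (d->kind) {
	case KIND_PCI_DEV:
		return &container_of(d, struct pci_dev, dev)->lu;
	case KIND_HOST_BRIDGE:
		return &container_of(d, struct pci_host_bridge, dev)->lu;
	}
	return NULL;
}
```

With the state in the device struct, the same lookup would be a plain
member access on bus->bridge; the per-type layout trades that for
keeping the change inside PCI.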
> all Linux users will be affected.
I understand your concerns. I was wishing one day all devices could
support liveupdate, but that is not the case right now.
> > > Break this series out properly, get the driver core stuff working FIRST,
> > > then show how multiple busses will work with them (i.e. you usually need
> > > 3 to know if you got it right).
> >
> > Multiple buses you mean different types of bus, e.g. USB, PCI and
> > others or 3 pci_bus is good enough? Right now we have no intention to
> > support bus types other than PCI devices. The liveupdate is about
> > preserving the GPU context cross kernel upgrade. Suggestion welcome.
>
> So all of this is just for one single driver. Ugh. Just do it in the
> single driver then, don't mess with the driver core, or even the PCI
Not a single driver. It is the whole PCI core. Or in your book the
whole PCI is just one single driver?
> core. Just make it specific to the driver and then none of us will even
> notice the mess that this all creates :)
OK. Let me try the PCI-only approach again and try harder. I will
report back.
Thanks for the feedback.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-02 23:42 ` David Matlack
@ 2025-10-03 12:03 ` Jason Gunthorpe
2025-10-03 16:03 ` David Matlack
0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-03 12:03 UTC (permalink / raw)
To: David Matlack
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Thu, Oct 02, 2025 at 04:42:17PM -0700, David Matlack wrote:
> On Thu, Oct 2, 2025 at 4:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Oct 02, 2025 at 02:31:08PM -0700, David Matlack wrote:
> > > And we don't care about PF drivers until we get to
> > > supporting SR-IOV. So the driver callbacks all seem unnecessary at
> > > this point.
> >
> > I guess we will see, but I'm hoping we can get quite far using
> > vfio-pci as the SRIOV PF driver and don't need to try to get a big PF
> > in-kernel driver entangled in this.
>
> So far we have had to support vfio-pci, pci-pf-stub, and idpf as PF
> drivers, and nvme looks like it's coming soon :(
How much effort did you put into moving them to vfio though? Hack Hack
in the kernel is easy, but upstreaming may be very hard :\
Shutting down enough of the PF kernel driver to safely kexec is almost
the same as unbinding it completely.
Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 5:24 ` Chris Li
@ 2025-10-03 12:06 ` Jason Gunthorpe
2025-10-03 16:27 ` David Matlack
2025-10-03 17:44 ` Chris Li
0 siblings, 2 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-03 12:06 UTC (permalink / raw)
To: Chris Li
Cc: David Matlack, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 02, 2025 at 10:24:59PM -0700, Chris Li wrote:
> As David pointed out in the other email, the PCI also supports other
> non vfio PCI devices which do not have the FD and FD related sessions.
> That is the original intent for the LUO PCI subsystem.
This doesn't make sense. We don't know how to solve this problem yet,
but I'm pretty confident we will need to inject a FD and session into
these drivers too.
> away once we have the vfio-pci as the real user. Actually getting the
> pci-pf-stub driver working would be a smaller and reasonable step to
> justify the PF support in LUO PCI.
In this context pci-pf-stub is useless, just use vfio-pci as the SRIOV
stub. I wouldn't invest in it. Especially since it creates more
complexity because we don't have an obvious way to get the session FD.
Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 7:26 ` Chris Li
@ 2025-10-03 12:26 ` Greg Kroah-Hartman
2025-10-03 17:49 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-03 12:26 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Fri, Oct 03, 2025 at 12:26:01AM -0700, Chris Li wrote:
> On Thu, Oct 2, 2025 at 11:19 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Oct 02, 2025 at 01:38:56PM -0700, Chris Li wrote:
> > > On Tue, Sep 30, 2025 at 8:30 AM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 12:45:11AM -0700, Chris Li wrote:
> > > > > include/linux/dev_liveupdate.h | 23 +++++
> > > > > include/linux/device/driver.h | 6 ++
> > > >
> > > > Driver core changes under the guise of only PCI changes? Please no.
> > >
> > > There is a reason why I use the device struct rather than the pci_dev
> > > struct even though liveupdate currently only works with PCI devices.
> > > It comes down to the fact that the pci_bus and pci_host_bridge are not
> > > pci_dev struct. We need something that is common across all those
> > > three types of PCI related struct I care about(pci_dev, pci_bus,
> > > pci_host_bridge). The device struct is just common around those. I can
> > > move the dev_liveupdate struct into pci_bus, pci_host_bridge and
> > > pci_dev independently. That will be more contained inside PCI, not
> > > touching the device struct. The patch would be bigger because the data
> > > structure is spread into different structs. Do you have a preference
> > > which way to go?
> >
> > If you only are caring about one single driver, don't mess with a
> > subsystem or the driver core, just change the driver. My objection here
>
> It is more than just one driver, we have vfio-pci, idpf, pci-pf-stub
> and possibly the nvme driver.
Why is nvme considered a "GPU" that needs context saved?
> The change needs to happen in the PCI enumeration and probing as well,
> that is outside of the driver code.
So all just PCI drivers? Then keep this in PCI-only please, and don't
touch the driver core.
> > was that you were claiming it was a PCI change, yet it was actually only
> > touching the driver core which means that all devices in the systems for
>
> In theory all the devices can be liveupdate preserved. But now we only
> support PCI.
Then for now, only focus on PCI.
thanks,
greg k-h
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-10-03 5:33 ` Chris Li
@ 2025-10-03 14:04 ` Jason Gunthorpe
2025-10-03 21:06 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-03 14:04 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 02, 2025 at 10:33:20PM -0700, Chris Li wrote:
> The consideration is that some non vfio device like IDPF is preserved
> as well. Does the iommufd encapsulate all the PCI device hierarchy? I
> was thinking the PCI layer knows about the PCI device hierarchy,
> therefore using pci_dev->dev.lu.flags to indicate the participation of
> the PCI liveupdate. Not sure how to drive that from iommufd. Can you
> explain a bit more?
I think you need to start from here and explain what is minimally
needed and identify what gets put in the luo session and what has to
be luo global.
Jason
* Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
2025-10-02 21:39 ` Chris Li
@ 2025-10-03 14:28 ` Jason Gunthorpe
0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-03 14:28 UTC (permalink / raw)
To: Chris Li
Cc: Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Thu, Oct 02, 2025 at 02:39:26PM -0700, Chris Li wrote:
> On Tue, Sep 30, 2025 at 9:37 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > As I said, I would punt all of this to the initrd and let the initrd
> > explicitly bind drivers.
>
> You still need a mechanism so that, after the PCI bridge scan creates
> the pci_devices, the drivers are not auto probed. If it is not
> driver_override, it will be some new PCI API and liveupdate is the
> first user of it.
Yes, we need userspace to control the timing of driver binding for
Confidential Compute too, so I would prefer to see a generic proposal
that can solve both.
If this is to be a luo thing, then preserving the disabled driver auto
bind state for specific devices could be reasonable.
> There are two slightly different things here:
> 1) modprobe the driver. That is typically controlled by udev.
> 2) auto probing the driver after the driver has been loaded or PCI
> device scanned.
> In your envisioning, the initrd autobind controls both of the above
> two spec of things, right?
Today the initrd runs udev which does the module loading and then
the kernel does driver auto binding.
You'd want to move driver binding to userspace so that userspace can
select which is the right driver for luo and for CC we want to delay
binding the drivers until after userspace has measured and verified
the device.
The idea is that userspace, through the modules.alias file, would run
the same driver selection algorithm and signal the kernel to load the
driver.
Also, for VFIO we have addressed Greg's remarks about driver name ABI
by adding VFIO-specific modules.alias entries:
alias vfio_pci:v*d*sv*sd*bc*sc*i* vfio_pci
alias vfio_pci:v000015B3d0000101Esv*sd*bc*sc*i* mlx5_vfio_pci
Modern userspace is already supposed to be entering VFIO mode by using
this file and avoiding making driver name an ABI.
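As a rough illustration of that userspace selection step, the sketch
below matches a device modalias against the two alias patterns quoted
above using POSIX fnmatch(). Several things here are assumptions rather
than anything stated in this thread: that userspace builds the lookup
string by swapping the "pci:" prefix of the device modalias for
"vfio_pci:", that the most specific match should win, and that counting
literal characters is a reasonable way to rank specificity.
pick_vfio_module() and literal_chars() are hypothetical helpers, not an
existing API.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

struct alias {
	const char *pattern;
	const char *module;
};

/* The two entries quoted above; a real tool would parse modules.alias. */
static const struct alias vfio_aliases[] = {
	{ "vfio_pci:v*d*sv*sd*bc*sc*i*", "vfio_pci" },
	{ "vfio_pci:v000015B3d0000101Esv*sd*bc*sc*i*", "mlx5_vfio_pci" },
};

/* Assumed ranking: more non-wildcard characters means more specific. */
static size_t literal_chars(const char *pattern)
{
	size_t n = 0;

	for (; *pattern; pattern++)
		if (*pattern != '*')
			n++;
	return n;
}

/* Return the module of the most specific matching alias, or NULL. */
static const char *pick_vfio_module(const char *modalias)
{
	const char *best = NULL;
	size_t best_score = 0;

	assert(modalias != NULL);
	for (size_t i = 0; i < sizeof(vfio_aliases) / sizeof(vfio_aliases[0]); i++) {
		const struct alias *a = &vfio_aliases[i];
		size_t score;

		if (fnmatch(a->pattern, modalias, 0) != 0)
			continue;
		score = literal_chars(a->pattern);
		if (best == NULL || score > best_score) {
			best = a->module;
			best_score = score;
		}
	}
	return best;
}
```

With this ranking, a device whose vendor and device IDs match the
second entry resolves to mlx5_vfio_pci, any other PCI device falls back
to the generic vfio_pci entry, and no driver name has to be hard-coded
by the caller.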
Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 12:03 ` Jason Gunthorpe
@ 2025-10-03 16:03 ` David Matlack
2025-10-03 16:16 ` Jason Gunthorpe
0 siblings, 1 reply; 84+ messages in thread
From: David Matlack @ 2025-10-03 16:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky, Brian Vazquez
On Fri, Oct 3, 2025 at 5:04 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 04:42:17PM -0700, David Matlack wrote:
> > On Thu, Oct 2, 2025 at 4:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Oct 02, 2025 at 02:31:08PM -0700, David Matlack wrote:
> > > > And we don't care about PF drivers until we get to
> > > > supporting SR-IOV. So the driver callbacks all seem unnecessary at
> > > > this point.
> > >
> > > I guess we will see, but I'm hoping we can get quite far using
> > > vfio-pci as the SRIOV PF driver and don't need to try to get a big PF
> > > in-kernel driver entangled in this.
> >
> > So far we have had to support vfio-pci, pci-pf-stub, and idpf as PF
> > drivers, and nvme looks like it's coming soon :(
>
> How much effort did you put into moving them to vfio though? Hack Hack
> in the kernel is easy, but upstreaming may be very hard :\
>
> Shutting down enough of the PF kernel driver to safely kexec is almost
> the same as unbinding it completely.
I think it's totally fair to tell us to replace pci-pf-stub with
vfio-pci. That gets rid of one PF driver.
idpf cannot be easily replaced with vfio-pci, since the PF is also
used for host networking. Brian Vazquez from Google will be giving a
talk about the idpf support at LPC so we can revisit this topic there.
We took the approach of only preserving the SR-IOV configuration in
the PF, everything else gets reset (so no DMA mapping preservation, no
driver state preservation, etc.).
We haven't looked into nvme yet so we'll have to revisit that discussion later.
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 16:03 ` David Matlack
@ 2025-10-03 16:16 ` Jason Gunthorpe
2025-10-03 16:28 ` Pasha Tatashin
0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-10-03 16:16 UTC (permalink / raw)
To: David Matlack
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky, Brian Vazquez
On Fri, Oct 03, 2025 at 09:03:36AM -0700, David Matlack wrote:
> > Shutting down enough of the PF kernel driver to safely kexec is almost
> > the same as unbinding it completely.
>
> I think it's totally fair to tell us to replace pci-pf-stub with
> vfio-pci. That gets rid of one PF driver.
>
> idpf cannot be easily replaced with vfio-pci, since the PF is also
> used for host networking.
Run host networking on a VF instead?
> Brian Vazquez from Google will be giving a
> talk about the idpf support at LPC so we can revisit this topic there.
> We took the approach of only preserving the SR-IOV configuration in
> the PF, everything else gets reset (so no DMA mapping preservation, no
> driver state preservation, etc.).
Yes, that's pretty much what you'd have to do. It sure would be nice
to have some helper to manage this to minimize driver work. It really
is: remove the existing driver and just leave the device idle, then
rebind the driver only if luo fails.
> We haven't looked into nvme yet so we'll have to revisit that discussion later.
Put any host storage on a NVMe VF?
Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 12:06 ` Jason Gunthorpe
@ 2025-10-03 16:27 ` David Matlack
2025-10-03 16:41 ` Vipin Sharma
2025-10-03 17:44 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: David Matlack @ 2025-10-03 16:27 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Chris Li, Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Fri, Oct 3, 2025 at 5:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 10:24:59PM -0700, Chris Li wrote:
>
> > As David pointed out in the other email, the PCI also supports other
> > non vfio PCI devices which do not have the FD and FD related sessions.
> > That is the original intent for the LUO PCI subsystem.
>
> This doesn't make sense. We don't know how to solve this problem yet,
> but I'm pretty confident we will need to inject a FD and session into
> these drivers too.
Google's LUO PCI subsystem (i.e. this series) predated a lot of the
discussion about FD preservation and needed to support the legacy vfio
container/group model. Outside of vfio-pci, the only other drivers
that participate are the PF drivers (pci-pf-stub and idpf), but they
just register empty callbacks.
So from an upstream perspective we don't really have a usecase for
callbacks. Chris, I saw in your other email that you agree with
dropping them in the next version, so it sounds like we are aligned
then.
Vipin Sharma is working on the vfio-pci MVP series. Vipin, if you
anticipate VFIO is going to need driver callbacks on top of the LUO FD
callbacks, please chime in here.
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 16:16 ` Jason Gunthorpe
@ 2025-10-03 16:28 ` Pasha Tatashin
2025-10-03 16:56 ` David Matlack
0 siblings, 1 reply; 84+ messages in thread
From: Pasha Tatashin @ 2025-10-03 16:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Matlack, Chris Li, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky, Brian Vazquez
On Fri, Oct 3, 2025 at 12:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 03, 2025 at 09:03:36AM -0700, David Matlack wrote:
> > > Shutting down enough of the PF kernel driver to safely kexec is almost
> > > the same as unbinding it completely.
> >
> > I think it's totally fair to tell us to replace pci-pf-stub with
> > vfio-pci. That gets rid of one PF driver.
> >
> > idpf cannot be easily replaced with vfio-pci, since the PF is also
> > used for host networking.
>
> Run host networking on a VF instead?
There is a plan for this, but not immediately. In upstream, I suspect
vfio-pci is all we need, and other drivers can be added when it is
really necessary.
>
> > Brian Vazquez from Google will be giving a
> > talk about the idpf support at LPC so we can revisit this topic there.
> > We took the approach of only preserving the SR-IOV configuration in
> > the PF, everything else gets reset (so no DMA mapping preservation, no
> > driver state preservation, etc.).
>
> Yes, that's pretty much what you'd have to do, it sure would be nice
> to have some helper to manage this to minimize driver work. It really
> is remove the existing driver and just leave it idle unless luo fails
> then rebind it..
>
> > We haven't looked into nvme yet so we'll have to revisit that discussion later.
>
> Put any host storage on a NVMe VF?
>
> Jason
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 16:27 ` David Matlack
@ 2025-10-03 16:41 ` Vipin Sharma
0 siblings, 0 replies; 84+ messages in thread
From: Vipin Sharma @ 2025-10-03 16:41 UTC (permalink / raw)
To: David Matlack
Cc: Jason Gunthorpe, Chris Li, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky
On Fri, Oct 3, 2025 at 9:27 AM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Oct 3, 2025 at 5:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Oct 02, 2025 at 10:24:59PM -0700, Chris Li wrote:
> >
> > > As David pointed out in the other email, the PCI also supports other
> > > non vfio PCI devices which do not have the FD and FD related sessions.
> > > That is the original intent for the LUO PCI subsystem.
> >
> > This doesn't make sense. We don't know how to solve this problem yet,
> > but I'm pretty confident we will need to inject a FD and session into
> > these drivers too.
>
> Google's LUO PCI subsystem (i.e. this series) predated a lot of the
> discussion about FD preservation and needed to support the legacy vfio
> container/group model. Outside of vfio-pci, the only other drivers
> that participate are the PF drivers (pci-pf-stub and idpf), but they
> just register empty callbacks.
>
> So from an upstream perspective we don't really have a usecase for
> callbacks. Chris, I saw in your other email that you agree with
> dropping them in the next version, so it sounds like we are aligned
> then.
>
> Vipin Sharma is working on the vfio-pci MVP series. Vipin, if you
> anticipate VFIO is going to need driver callbacks on top of the LUO FD
> callbacks, please chime in here.
Currently, LUO FD callbacks are the only ones I need. I think we are
fine for now without PCI callbacks. I will have better clarity next
week but for now I am only using LUO FD callbacks.
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 16:28 ` Pasha Tatashin
@ 2025-10-03 16:56 ` David Matlack
0 siblings, 0 replies; 84+ messages in thread
From: David Matlack @ 2025-10-03 16:56 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Jason Gunthorpe, Chris Li, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Leon Romanovsky, Brian Vazquez
On Fri, Oct 3, 2025 at 9:28 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Fri, Oct 3, 2025 at 12:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Fri, Oct 03, 2025 at 09:03:36AM -0700, David Matlack wrote:
> > > > Shutting down enough of the PF kernel driver to safely kexec is almost
> > > > the same as unbinding it completely.
> > >
> > > I think it's totally fair to tell us to replace pci-pf-stub with
> > > vfio-pci. That gets rid of one PF driver.
> > >
> > > idpf cannot be easily replaced with vfio-pci, since the PF is also
> > > used for host networking.
> >
> > Run host networking on a VF instead?
>
> There is a plan for this, but not immediately. In upstream, I suspect
> vfio-pci is all we need, and other drivers can be added when it is
> really necessary.
Sounds good to me.
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 12:06 ` Jason Gunthorpe
2025-10-03 16:27 ` David Matlack
@ 2025-10-03 17:44 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-03 17:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Matlack, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, Pasha Tatashin,
linux-kernel, linux-pci, linux-acpi, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Fri, Oct 3, 2025 at 5:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 10:24:59PM -0700, Chris Li wrote:
>
> > As David pointed out in the other email, the PCI also supports other
> > non vfio PCI devices which do not have the FD and FD related sessions.
> > That is the original intent for the LUO PCI subsystem.
>
> This doesn't make sense. We don't know how to solve this problem yet,
> but I'm pretty confident we will need to inject a FD and session into
> these drivers too.
Ack. I can start hacking on hooking up the PCI layer to the vfio FD
and sessions. I am not sure how to do that yet; I will take a stab at
it and report back.
>
> > away once we have the vfio-pci as the real user. Actually getting the
> > pci-pf-stub driver working would be a smaller and reasonable step to
> > justify the PF support in LUO PCI.
>
> In this context pci-pf-stub is useless, just use vfio-pci as the SRIOV
> stub. I wouldn't invest in it. Especially since it creates more
> complexity because we don't have an obvious way to get the session FD.
Ack. I will not do pci-pf-stub then.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 12:26 ` Greg Kroah-Hartman
@ 2025-10-03 17:49 ` Chris Li
2025-10-03 18:27 ` David Matlack
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-03 17:49 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Bjorn Helgaas, Rafael J. Wysocki, Danilo Krummrich, Len Brown,
Pasha Tatashin, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Fri, Oct 3, 2025 at 5:26 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Oct 03, 2025 at 12:26:01AM -0700, Chris Li wrote:
>
> > It is more than just one driver, we have vfio-pci, idpf, pci-pf-stub
> > and possibly the nvme driver.
>
> Why is nvme considered a "GPU" that needs context saved?
NVME is not a GPU. The internal reason to have NVME participate in
liveupdate is that the NVME shutdown of the IO queue is very slow; it
contributes the largest chunk of delay in the blackout window for
liveupdate. The NVME participation is just an optimization to avoid
resetting the NVME queue. Consider it an (optional) speed
optimization.
> > The change needs to happen in the PCI enumeration and probing as well,
> > that is outside of the driver code.
>
> So all just PCI drivers? Then keep this in PCI-only please, and don't
> touch the driver core.
Ack. Will do.
>
> > > was that you were claiming it was a PCI change, yet it was actually only
> > > touching the driver core which means that all devices in the systems for
> >
> > In theory all the devices can be liveupdate preserved. But now we only
> > support PCI.
>
> Then for now, only focus on PCI.
Agree, thanks for the alignment.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 17:49 ` Chris Li
@ 2025-10-03 18:27 ` David Matlack
2025-10-03 21:10 ` Chris Li
0 siblings, 1 reply; 84+ messages in thread
From: David Matlack @ 2025-10-03 18:27 UTC (permalink / raw)
To: Chris Li
Cc: Greg Kroah-Hartman, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Fri, Oct 3, 2025 at 10:49 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Oct 3, 2025 at 5:26 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Fri, Oct 03, 2025 at 12:26:01AM -0700, Chris Li wrote:
> >
> > > It is more than just one driver, we have vfio-pci, idpf, pci-pf-stub
> > > and possibly the nvme driver.
> >
> > Why is nvme considered a "GPU" that needs context saved?
>
> NVME is not a GPU. The internal reason to have NVME participate in the
> liveupdate is because the NVME shutdown of the IO queue is very slow,
> it contributes the largest chunk of delay in the black out window for
> liveupdate. The NVME participation is just an optimization to avoid
> resetting the NVME queue. Consider it as (optional) speed
> optimization.
This is not true. We haven't made any changes to the nvme driver
within Google for Live Update.
The reason I mentioned nvme in another email chain is because Google
has some hosts where we want to preserve VFs bound to vfio-pci across
Live Update where the PF is bound to nvme. But Jason is suggesting we
seriously explore switching the PF driver to vfio-pci before trying to
upstream nvme support for Live Update, which I think is fair.
* Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
2025-10-03 14:04 ` Jason Gunthorpe
@ 2025-10-03 21:06 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-03 21:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Leon Romanovsky
On Fri, Oct 3, 2025 at 7:05 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 10:33:20PM -0700, Chris Li wrote:
> > The consideration is that some non vfio device like IDPF is preserved
> > as well. Does the iommufd encapsulate all the PCI device hierarchy? I
> > was thinking the PCI layer knows about the PCI device hierarchy,
> > therefore using pci_dev->dev.lu.flags to indicate the participation of
> > the PCI liveupdate. Not sure how to drive that from iommufd. Can you
> > explain a bit more?
>
> I think you need to start from here and explain what is minimally
> needed and identify what gets put in the luo session and what has to
> be luo global.
That means it is bigger than PCI alone; this will need to change the
LUO subsystem design. I can start from the vfio/iommufd point of view:
if the liveupdate device is driven from the vfio/iommufd side, what
does the PCI layer need to do to collaborate? It will likely end up
changing the LUO subsystem and callback design. I will make that my
starting point for the next step.
Thank you.
Chris
* Re: [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver
2025-10-03 18:27 ` David Matlack
@ 2025-10-03 21:10 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-03 21:10 UTC (permalink / raw)
To: David Matlack
Cc: Greg Kroah-Hartman, Bjorn Helgaas, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, Pasha Tatashin, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky
On Fri, Oct 3, 2025 at 11:28 AM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Oct 3, 2025 at 10:49 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Fri, Oct 3, 2025 at 5:26 AM Greg Kroah-Hartman
> > NVME is not a GPU. The internal reason to have NVME participate in the
> > liveupdate is because the NVME shutdown of the IO queue is very slow,
> > it contributes the largest chunk of delay in the black out window for
> > liveupdate. The NVME participation is just an optimization to avoid
> > resetting the NVME queue. Consider it as (optional ) speed
> > optimization.
>
> This is not true. We haven't made any changes to the nvme driver
> within Google for Live Update.
>
> The reason I mentioned nvme in another email chain is because Google
> has some hosts where we want to preserve VFs bound to vfio-pci across
> Live Update where the PF is bound to nvme. But Jason is suggesting we
> seriously explore switching the PF driver to vfio-pci before trying to
> upstream nvme support for Live Update, which I think is fair.
Ah, thanks for the clarification and sorry for my confusion. I think I
was thinking of a different storage driver, not the NVME you have in
mind.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-09-29 18:13 ` Chris Li
@ 2025-10-07 23:32 ` Chris Li
2025-10-08 23:00 ` David Matlack
2025-10-09 23:21 ` Pratyush Yadav
0 siblings, 2 replies; 84+ messages in thread
From: Chris Li @ 2025-10-07 23:32 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Jason Gunthorpe, Leon Romanovsky,
skhawaja
Thanks to everyone who provided good feedback on the PCI series.
I just want to give an update on the state of the LUO PCI series,
based on the feedback I received. The LUO PCI series should be called
from the memfd side and remove global subsystem state if possible,
which means the PCI series will depend on the VFIO or iommu series.
I have some internal alignment with Vipin (for VFIO) and Samiullah
(for iommu). Here is the new plan for upstream patch submission:
1) The KHO series goes first, which is already happening with additional improvements.
2) Next is Pasha's LUO series with memfd support, also happening right now.
3) The next series will be Vipin's VFIO series, which preserves one
busmaster bit in the config space of the endpoint VFIO device; there
is no PCI layer involved yet. VFIO will use some driver trick to
prevent the native driver from binding to the liveupdate device used
by VFIO after kexec. After kexec, the VFIO driver validates that the
busmaster bit in the PCI config register is already set.
4) After the VFIO series, the PCI layer can start to preserve the liveupdate
device by BDF and avoid driver auto-probe on the liveupdate devices.
At this point the VFIO driver in stage 3 will no longer need the other
driver trick to avoid the auto-bind of the native driver; the PCI layer
takes care of that. This PCI series will have very limited
support: most of the driver callbacks are not needed, and there is no
bridge device dependency either.
5) The VFIO device will continue DMA across the kexec. This series will
require the IOMMU series for DMA mapping support. The PCI layer will hook
up with VFIO and build the list of liveupdate devices, which
includes the PCI bridges with the bus master bit preserved as well.
So I will pause the LUO PCI series a bit to wait for the integration
with VFIO series.
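As a rough illustration of steps 3 and 5 above, here is a minimal
userspace sketch, not kernel code. It collects an endpoint together with
every upstream bridge and then checks that the bus master bit is set on
each entry. The struct lu_dev type and the lu_* helpers are hypothetical
names invented for this example; only PCI_COMMAND_MASTER (bit 2 of the
PCI command register) comes from the PCI register definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bus master enable bit in the 16-bit PCI command register. */
#define PCI_COMMAND_MASTER 0x4

/* Hypothetical stand-in for a PCI device; NOT the kernel's struct pci_dev. */
struct lu_dev {
	uint8_t bus, dev, fn;     /* BDF address of the device */
	uint16_t command;         /* cached copy of the command register */
	struct lu_dev *upstream;  /* parent bridge, NULL at the root */
};

/*
 * Fill list[] with the endpoint followed by each upstream bridge, up to
 * the root. Returns the number of entries, or 0 if list[] is too small.
 */
size_t lu_build_list(struct lu_dev *endpoint, struct lu_dev **list, size_t max)
{
	size_t n = 0;

	for (struct lu_dev *d = endpoint; d; d = d->upstream) {
		if (n == max)
			return 0;
		list[n++] = d;
	}
	return n;
}

/* True only if the list is non-empty and every entry has bus master set. */
bool lu_bus_master_preserved(struct lu_dev **list, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!(list[i]->command & PCI_COMMAND_MASTER))
			return false;
	return n > 0;
}
```

The walk is endpoint-first because a cleared bus master bit on any bridge
along the path would affect DMA from the devices below it.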
Meanwhile, I will continue to fix up the LUO PCI series internally for
the other feedback I have received:
- Clean up device info printing, remove raw address value (Greg KH, Jason).
- Remove the device format string (Greg KH).
- Remove the liveupdate struct from struct device, move it to the PCI (Greg KH).
- Remove LUO call back forwarding and hook it up with the VFIO (Jason, David)
- Drive the PCI from memfd context on VFIO or iommu, no subsystem
registration. (Jason)
- up_read(&pci_bus_sem); instead of up_write (Greg KH)
- Avoid preserving the driver name, just avoid auto-probing the
liveupdate devices. Let user space do the driver loading in initrd
(Jason).
That will keep me busy for a while waiting for the VFIO series.
Thanks
Chris
On Mon, Sep 29, 2025 at 11:13 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Mon, Sep 29, 2025 at 8:04 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Sat, Sep 27, 2025 at 02:05:38PM -0400, Pasha Tatashin wrote:
> > > Hi Bjorn,
> > >
> > > My latest submission is the following:
> > > https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/
> > >
> > > And github repo is in cover letter:
> > >
> > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> > >
> > > It applies cleanly against the mainline without the first three
> > > patches, as they were already merged.
> >
> > Not sure what I'm missing. I've tried various things but none apply
> > cleanly:
>
> Sorry about that. Let me do a refresh of the LUOPCI V3 patch and send
> out the git repo link as well. The issue is that there are other
> patches not in the mainline kernel which luopci is dependent on. Using
> a git repo would be easier to get a working tree.
>
> Working on it now, please stay tuned.
>
> Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-07 23:32 ` Chris Li
@ 2025-10-08 23:00 ` David Matlack
2025-10-09 17:12 ` Chris Li
2025-10-09 23:21 ` Pratyush Yadav
1 sibling, 1 reply; 84+ messages in thread
From: David Matlack @ 2025-10-08 23:00 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky, skhawaja
On Tue, Oct 7, 2025 at 4:32 PM Chris Li <chrisl@kernel.org> wrote:
>
> Thanks to one that provides good feedback on the PCI series.
>
> I just want to give an update on the state of the LUO PCI series,
> based on the feedback I received. The LUO PCI series should be called
> from the memfd side and remove global subsystem state if possible.
By "memfd side" I believe you are referring to LUO fd preservation
(likely the VFIO cdev fd).
> Which means the PCI series will depend on the VIFO or iommu series.
> I have some internal alignment with Vipin (for VFIO) and Samiullah
> (for iommu). Here is the new plan for upstream patch submission:
>
> 1) KHO series go first, which is already happening with additional improvement.
>
> 2) Next is Pasha's LUO series with memfd support, also happening right now.
>
> 3) Next series will be Vipin's VFIO series with preserving one
> busmaster bit in the config space of the end point vfio device, there
> is no PCI layer involved yet. The VFIO will use some driver trick to
> prevent the native driver from binding to the liveupdate device used
> by VFIO after kexec. After kexec, the VFIO driver validates that the
> busmaster in the PCI config register is already set.
Yes. Last we discussed Vipin is planning to just compile out the
native driver of the device he is using to test. So we don't expect to
need any kernel code changes to unblock basic testing and posting the
RFC.
>
> 4) After the VFIO series, the PCI can start to preserve the livedupate
> device by BDF. Avoid the driver auto probe on the livedupate devices.
> At this point the VFIO driver in stage 3 will not need the other
> driver trick to avoid the auto bind of native driver. The PCI layer
> takes the core of that. This series PCI will have very limited
> support, most of the driver callback is not needed, no bridge device
> dependent as well.
I suspect we'll need the new file-lifecycle-bound global state thing
that Pasha is working on [1] to accomplish this. So please track
LUOv5+ as a dependency for this.
[1] https://lore.kernel.org/lkml/CA+CK2bB+RdapsozPHe84MP4NVSPLo6vje5hji5MKSg8L6ViAbw@mail.gmail.com/
>
> 5) VFIO device will continue DMA across the kexec. This series will
> require the IOMMU series for DMA mapping support. The PCI will hook up
> with the VFIO and build the list of the liveupdate device, which
> includes the PCI bridge with bus master big preserved as well.
>
> So I will pause the LUO PCI series a bit to wait for the integration
> with VFIO series.
> Meanwhile, I will continue to fix up the LUO PCI series internally for
> the other feedback I have received:
> - Clean up device info printing, remove raw address value (Greg KH, Jason).
> - Remove the device format string (Greg KH).
> - Remove the liveupdate struct from struct device, move it to the PCI (Greg KH).
> - Remove LUO call back forwarding and hook it up with the VFIO (Jason, David)
> - Drive the PCI from memfd context on VFIO or iommu, no subsystem
> registration. (Jason)
> - up_read(&pci_bus_sem); instead of up_write (Greg KH)
> - Avoid preserving the driver name, just avoid auto-probing the
> liveupdate devices. Let user space do the driver loading in initrd
> (Jason).
>
> That will keep me busy for a while waiting for the VFIO series.
Sounds good. Thanks for the update Chris!
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-08 23:00 ` David Matlack
@ 2025-10-09 17:12 ` Chris Li
0 siblings, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-09 17:12 UTC (permalink / raw)
To: David Matlack
Cc: Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, Pasha Tatashin, Jason Miu, Vipin Sharma,
Saeed Mahameed, Adithya Jayachandran, Parav Pandit, William Tu,
Mike Rapoport, Jason Gunthorpe, Leon Romanovsky, skhawaja
On Wed, Oct 8, 2025 at 4:01 PM David Matlack <dmatlack@google.com> wrote:
>
> On Tue, Oct 7, 2025 at 4:32 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Thanks to one that provides good feedback on the PCI series.
> >
> > I just want to give an update on the state of the LUO PCI series,
> > based on the feedback I received. The LUO PCI series should be called
> > from the memfd side and remove global subsystem state if possible.
>
> By "memfd side" I believe you are referring to LUO fd preservation
> (likely the VFIO cdev fd).
Yes. I haven't taken a closer look at the recent LUO fd preservation
series. It is on my to do list, now I am depending on it.
> > Which means the PCI series will depend on the VIFO or iommu series.
> > I have some internal alignment with Vipin (for VFIO) and Samiullah
> > (for iommu). Here is the new plan for upstream patch submission:
> >
> > 1) KHO series go first, which is already happening with additional improvement.
> >
> > 2) Next is Pasha's LUO series with memfd support, also happening right now.
> >
> > 3) Next series will be Vipin's VFIO series with preserving one
> > busmaster bit in the config space of the end point vfio device, there
> > is no PCI layer involved yet. The VFIO will use some driver trick to
> > prevent the native driver from binding to the liveupdate device used
> > by VFIO after kexec. After kexec, the VFIO driver validates that the
> > busmaster in the PCI config register is already set.
>
> Yes. Last we discussed Vipin is planning to just compile out the
> native driver of the device he is using to test. So we don't expect to
> need any kernel code changes to unblock basic testing and posting the
> RFC.
Ack.
>
> >
> > 4) After the VFIO series, the PCI can start to preserve the livedupate
> > device by BDF. Avoid the driver auto probe on the livedupate devices.
> > At this point the VFIO driver in stage 3 will not need the other
> > driver trick to avoid the auto bind of native driver. The PCI layer
> > takes the core of that. This series PCI will have very limited
> > support, most of the driver callback is not needed, no bridge device
> > dependent as well.
>
> I suspect we'll need the new file-lifecycle-bound global state thing
> that Pasha is working on [1] to accomplish this. So please track
> LUOv5+ as a dependency for this.
>
> [1] https://lore.kernel.org/lkml/CA+CK2bB+RdapsozPHe84MP4NVSPLo6vje5hji5MKSg8L6ViAbw@mail.gmail.com/
Agreed, I need to figure out the boilerplate changes to hook up PCI to
the file descriptors.
Thanks for the clarification.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-07 23:32 ` Chris Li
2025-10-08 23:00 ` David Matlack
@ 2025-10-09 23:21 ` Pratyush Yadav
2025-10-10 4:19 ` Chris Li
1 sibling, 1 reply; 84+ messages in thread
From: Pratyush Yadav @ 2025-10-09 23:21 UTC (permalink / raw)
To: Chris Li
Cc: Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Jason Gunthorpe, Leon Romanovsky,
skhawaja
On Tue, Oct 07 2025, Chris Li wrote:
[...]
> That will keep me busy for a while waiting for the VFIO series.
I recall we talked in one of the biweekly meetings about some sanity
checking of folios right before reboot (make sure they are right order,
etc.) under a KEXEC_HANDOVER_DEBUG option. If you have some spare time
on your hands, would be cool to see some patches for that as well :-)
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-09 23:21 ` Pratyush Yadav
@ 2025-10-10 4:19 ` Chris Li
2025-10-10 23:49 ` Jason Miu
0 siblings, 1 reply; 84+ messages in thread
From: Chris Li @ 2025-10-10 4:19 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Len Brown, linux-kernel,
linux-pci, linux-acpi, David Matlack, Pasha Tatashin, Jason Miu,
Vipin Sharma, Saeed Mahameed, Adithya Jayachandran, Parav Pandit,
William Tu, Mike Rapoport, Jason Gunthorpe, Leon Romanovsky,
skhawaja
On Thu, Oct 9, 2025 at 4:21 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Oct 07 2025, Chris Li wrote:
>
> [...]
> > That will keep me busy for a while waiting for the VFIO series.
>
> I recall we talked in one of the biweekly meetings about some sanity
> checking of folios right before reboot (make sure they are right order,
> etc.) under a KEXEC_HANDOVER_DEBUG option. If you have some spare time
> on your hands, would be cool to see some patches for that as well :-)
Sure, I will add that to my "nice to have" list. No promises that I will
have time to get to it alongside the PCI work. It belongs to the KHO
series, not PCI, though.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-10 4:19 ` Chris Li
@ 2025-10-10 23:49 ` Jason Miu
2025-10-13 13:58 ` Pratyush Yadav
0 siblings, 1 reply; 84+ messages in thread
From: Jason Miu @ 2025-10-10 23:49 UTC (permalink / raw)
To: Chris Li
Cc: Pratyush Yadav, Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas,
Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
Len Brown, linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Jason Gunthorpe, Leon Romanovsky, skhawaja
On Thu, Oct 9, 2025 at 9:19 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Oct 9, 2025 at 4:21 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >
> > On Tue, Oct 07 2025, Chris Li wrote:
> >
> > [...]
> > > That will keep me busy for a while waiting for the VFIO series.
> >
> > I recall we talked in one of the biweekly meetings about some sanity
> > checking of folios right before reboot (make sure they are right order,
> > etc.) under a KEXEC_HANDOVER_DEBUG option. If you have some spare time
> > on your hands, would be cool to see some patches for that as well :-)
>
> Sure, I will add that to my "nice to have" list. No promised I got
> time to get to it with the PCI. It belong to the KHO series not PCI
> though.
>
> Chris
Hi Pratyush, Chris,
For the folio sanity check with KEXEC_HANDOVER_DEBUG, I can follow
that up. Could you tell me what we would like to check before reboot? I
may have missed some context. Thanks!
--
Jason Miu
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-10 23:49 ` Jason Miu
@ 2025-10-13 13:58 ` Pratyush Yadav
2025-10-14 16:11 ` Pratyush Yadav
2025-10-14 20:44 ` Chris Li
0 siblings, 2 replies; 84+ messages in thread
From: Pratyush Yadav @ 2025-10-13 13:58 UTC (permalink / raw)
To: Jason Miu
Cc: Chris Li, Pratyush Yadav, Bjorn Helgaas, Pasha Tatashin,
Bjorn Helgaas, Greg Kroah-Hartman, Rafael J. Wysocki,
Danilo Krummrich, Len Brown, linux-kernel, linux-pci, linux-acpi,
David Matlack, Pasha Tatashin, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Jason Gunthorpe, Leon Romanovsky, skhawaja
On Fri, Oct 10 2025, Jason Miu wrote:
> On Thu, Oct 9, 2025 at 9:19 PM Chris Li <chrisl@kernel.org> wrote:
>>
>> On Thu, Oct 9, 2025 at 4:21 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >
>> > On Tue, Oct 07 2025, Chris Li wrote:
>> >
>> > [...]
>> > > That will keep me busy for a while waiting for the VFIO series.
>> >
>> > I recall we talked in one of the biweekly meetings about some sanity
>> > checking of folios right before reboot (make sure they are right order,
>> > etc.) under a KEXEC_HANDOVER_DEBUG option. If you have some spare time
>> > on your hands, would be cool to see some patches for that as well :-)
>>
>> Sure, I will add that to my "nice to have" list. No promised I got
>> time to get to it with the PCI. It belong to the KHO series not PCI
>> though.
>>
Right. It is only a "nice to have", and not a requirement. And certainly
not for the PCI series.
>
> Hi Pratyush, Chris,
>
> For the folio sanity check with KEXEC_HANDOVER_DEBUG, I can follow
> that up. Would you tell me what we like to check before reboot, I may
> have missed some context. Thanks!
The idea is to sanity-check the preserved folios in the kexec-reboot
flow somewhere. The main check discussed was to make sure the folios are
of the same order as they were preserved with. This will help catch bugs
where folios might split after being preserved.
Maybe we can add some more checks too? Like making sure the folios
aren't freed after they were preserved. But that condition is a bit
trickier to catch. But at least the former should be simple enough to
do as a start.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-13 13:58 ` Pratyush Yadav
@ 2025-10-14 16:11 ` Pratyush Yadav
2025-10-14 20:44 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Pratyush Yadav @ 2025-10-14 16:11 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Jason Miu, Chris Li, Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas,
Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
Len Brown, linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Jason Gunthorpe, Leon Romanovsky, skhawaja
On Mon, Oct 13 2025, Pratyush Yadav wrote:
> On Fri, Oct 10 2025, Jason Miu wrote:
[...]
>> For the folio sanity check with KEXEC_HANDOVER_DEBUG, I can follow
>> that up. Would you tell me what we like to check before reboot, I may
>> have missed some context. Thanks!
>
> The idea is to sanity-check the preserved folios in the kexec-reboot
> flow somewhere. The main check discussed was to make sure the folios are
> of the same order as they were preserved with. This will help catch bugs
> where folios might split after being preserved.
>
> Maybe we can add some more checks too? Like making sure the folios
> aren't freed after they were preserved. But that condition is a bit
> trickier to catch. But at least the former should be simple enough to
> do as a start.
Also perhaps check in kho_preserve_folio() that the preserved folio is
not in scratch memory? This can be a non-debug check as well I suppose,
though looping through all scratch areas every time might end up being
too slow.
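A minimal userspace sketch of such a scratch-overlap check, assuming the
scratch areas are available as an array of (base, size) ranges. The
struct scratch_area type and range_in_scratch() are hypothetical names
for this example and are not the KHO data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one scratch region: [base, base + size). */
struct scratch_area {
	uint64_t base;
	uint64_t size;
};

/*
 * Return true if the physical range [phys, phys + len) overlaps any
 * scratch area. This is the linear scan over all areas whose cost is
 * the concern raised above.
 */
bool range_in_scratch(const struct scratch_area *areas, size_t n,
		      uint64_t phys, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t start = areas[i].base;
		uint64_t end = start + areas[i].size;

		/* Two half-open ranges overlap iff each starts before the other ends. */
		if (phys < end && start < phys + len)
			return true;
	}
	return false;
}
```

If the scan turns out to be too slow, keeping the areas sorted by base
and using a binary search would be one option.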
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v2 00/10] LUO: PCI subsystem (phase I)
2025-10-13 13:58 ` Pratyush Yadav
2025-10-14 16:11 ` Pratyush Yadav
@ 2025-10-14 20:44 ` Chris Li
1 sibling, 0 replies; 84+ messages in thread
From: Chris Li @ 2025-10-14 20:44 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Jason Miu, Bjorn Helgaas, Pasha Tatashin, Bjorn Helgaas,
Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
Len Brown, linux-kernel, linux-pci, linux-acpi, David Matlack,
Pasha Tatashin, Vipin Sharma, Saeed Mahameed,
Adithya Jayachandran, Parav Pandit, William Tu, Mike Rapoport,
Jason Gunthorpe, Leon Romanovsky, skhawaja
On Mon, Oct 13, 2025 at 6:58 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Fri, Oct 10 2025, Jason Miu wrote:
>
> > On Thu, Oct 9, 2025 at 9:19 PM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> On Thu, Oct 9, 2025 at 4:21 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >> >
> >> > On Tue, Oct 07 2025, Chris Li wrote:
> >> >
> >> > [...]
> >> > > That will keep me busy for a while waiting for the VFIO series.
> >> >
> >> > I recall we talked in one of the biweekly meetings about some sanity
> >> > checking of folios right before reboot (make sure they are right order,
> >> > etc.) under a KEXEC_HANDOVER_DEBUG option. If you have some spare time
> >> > on your hands, would be cool to see some patches for that as well :-)
> >>
> >> Sure, I will add that to my "nice to have" list. No promised I got
> >> time to get to it with the PCI. It belong to the KHO series not PCI
> >> though.
> >>
>
> Right. It is only a "nice to have", and not a requirement. And certainly
> not for the PCI series.
Ack.
> >
> > For the folio sanity check with KEXEC_HANDOVER_DEBUG, I can follow
> > that up. Would you tell me what we like to check before reboot, I may
> > have missed some context. Thanks!
>
> The idea is to sanity-check the preserved folios in the kexec-reboot
> flow somewhere. The main check discussed was to make sure the folios are
> of the same order as they were preserved with. This will help catch bugs
> where folios might split after being preserved.
Yes, the idea is that, for every folio that has been preserved, we
remember the folio order at the time of preserve_folio(). Right before
the kexec reboot, maybe after the freeze() call, KHO can go through the
internal list of preserved folios and verify that the folio starting at
each physical address still has the same order as at preservation
time. In other words, the folio order hasn't changed between
preserve_folio() and the kexec reboot for any folio that has been
preserved.
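The check described above could be sketched roughly as follows. This is
a userspace model under stated assumptions, not KHO code: struct
fake_folio stands in for whatever the kernel would look up at a physical
address, and struct preserve_tracker, tracker_record() and
tracker_count_mismatches() are hypothetical names for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PRESERVED 64

/* Hypothetical model of a folio: start address and current order. */
struct fake_folio {
	uint64_t phys;
	unsigned int order;
};

/* One record per preserved folio, taken at preservation time. */
struct preserved_rec {
	uint64_t phys;
	unsigned int order;
};

struct preserve_tracker {
	struct preserved_rec recs[MAX_PRESERVED];
	size_t count;
};

/* Remember the order a folio had when it was preserved. */
bool tracker_record(struct preserve_tracker *t, uint64_t phys, unsigned int order)
{
	if (t->count == MAX_PRESERVED)
		return false;
	t->recs[t->count++] = (struct preserved_rec){ phys, order };
	return true;
}

/* Order of the folio starting at phys, or -1 if no folio starts there. */
int fake_folio_order_at(const struct fake_folio *mem, size_t n, uint64_t phys)
{
	for (size_t i = 0; i < n; i++)
		if (mem[i].phys == phys)
			return (int)mem[i].order;
	return -1;
}

/* Count records whose folio is missing or whose order has changed. */
size_t tracker_count_mismatches(const struct preserve_tracker *t,
				const struct fake_folio *mem, size_t n)
{
	size_t bad = 0;

	for (size_t i = 0; i < t->count; i++) {
		int cur = fake_folio_order_at(mem, n, t->recs[i].phys);

		if (cur < 0 || (unsigned int)cur != t->recs[i].order)
			bad++;
	}
	return bad;
}
```

A nonzero mismatch count right before the reboot would point at a folio
that was split, or otherwise changed, after it was preserved.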
> Maybe we can add some more checks too? Like making sure the folios
> aren't freed after they were preserved. But that condition is a bit
> trickier to catch. But at least the former should be simple enough to
> do as a start.
Agreed, we can have more checks there. We can also add those additional
checks as follow-up patches in the same series or a later series. They
don't have to be done in one go.
Chris
^ permalink raw reply [flat|nested] 84+ messages in thread
end of thread, other threads:[~2025-10-14 20:44 UTC | newest]
Thread overview: 84+ messages
2025-09-16 7:45 [PATCH v2 00/10] LUO: PCI subsystem (phase I) Chris Li
2025-09-16 7:45 ` [PATCH v2 01/10] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
2025-09-30 15:15 ` Greg Kroah-Hartman
2025-09-30 23:41 ` Chris Li
2025-09-30 15:17 ` Greg Kroah-Hartman
2025-09-30 23:38 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list Chris Li
2025-09-29 17:46 ` Jason Gunthorpe
2025-09-30 2:13 ` Chris Li
2025-09-30 16:47 ` Jason Gunthorpe
2025-10-03 7:09 ` Chris Li
2025-10-03 5:33 ` Chris Li
2025-10-03 14:04 ` Jason Gunthorpe
2025-10-03 21:06 ` Chris Li
2025-09-30 15:26 ` Greg Kroah-Hartman
2025-10-03 6:57 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 03/10] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver Chris Li
2025-09-29 17:48 ` Jason Gunthorpe
2025-09-30 2:11 ` Chris Li
2025-09-30 16:38 ` Jason Gunthorpe
2025-10-02 18:54 ` David Matlack
2025-10-02 20:57 ` Chris Li
2025-10-02 21:31 ` David Matlack
2025-10-02 23:21 ` Jason Gunthorpe
2025-10-02 23:42 ` David Matlack
2025-10-03 12:03 ` Jason Gunthorpe
2025-10-03 16:03 ` David Matlack
2025-10-03 16:16 ` Jason Gunthorpe
2025-10-03 16:28 ` Pasha Tatashin
2025-10-03 16:56 ` David Matlack
2025-10-03 5:24 ` Chris Li
2025-10-03 12:06 ` Jason Gunthorpe
2025-10-03 16:27 ` David Matlack
2025-10-03 16:41 ` Vipin Sharma
2025-10-03 17:44 ` Chris Li
2025-10-03 5:17 ` Chris Li
2025-10-02 20:44 ` Chris Li
2025-09-30 15:27 ` Greg Kroah-Hartman
2025-10-02 20:38 ` Chris Li
2025-10-03 6:18 ` Greg Kroah-Hartman
2025-10-03 7:26 ` Chris Li
2025-10-03 12:26 ` Greg Kroah-Hartman
2025-10-03 17:49 ` Chris Li
2025-10-03 18:27 ` David Matlack
2025-10-03 21:10 ` Chris Li
2025-09-16 7:45 ` [PATCH v2 04/10] PCI/LUO: Restore state at PCI enumeration Chris Li
2025-09-16 7:45 ` [PATCH v2 05/10] PCI/LUO: Forward finish callbacks to drivers Chris Li
2025-09-16 7:45 ` [PATCH v2 06/10] PCI/LUO: Save and restore driver name Chris Li
2025-09-29 17:57 ` Jason Gunthorpe
2025-09-30 2:10 ` Chris Li
2025-09-30 13:02 ` Pasha Tatashin
2025-09-30 13:41 ` Greg Kroah-Hartman
2025-09-30 14:53 ` Pasha Tatashin
2025-09-30 15:08 ` Greg Kroah-Hartman
2025-09-30 15:56 ` Pasha Tatashin
2025-10-01 5:06 ` Greg Kroah-Hartman
2025-10-01 21:03 ` Pasha Tatashin
2025-10-02 6:09 ` Greg Kroah-Hartman
2025-10-02 13:23 ` Jason Gunthorpe
2025-10-02 22:30 ` Chris Li
2025-09-30 15:41 ` Chris Li
2025-10-01 5:13 ` Greg Kroah-Hartman
2025-10-02 22:05 ` Chris Li
2025-09-30 16:37 ` Jason Gunthorpe
2025-10-02 21:39 ` Chris Li
2025-10-03 14:28 ` Jason Gunthorpe
2025-09-16 7:45 ` [PATCH v2 07/10] PCI/LUO: Add liveupdate to pcieport driver Chris Li
2025-09-16 7:45 ` [PATCH v2 08/10] PCI/LUO: Add pci_liveupdate_get_driver_data() Chris Li
2025-09-16 7:45 ` [PATCH v2 09/10] PCI/LUO: Avoid write to bus master at boot Chris Li
2025-09-29 17:14 ` Bjorn Helgaas
2025-09-16 7:45 ` [PATCH v2 10/10] PCI: pci-lu-stub: Add a stub driver for Live Update testing Chris Li
2025-09-27 17:13 ` [PATCH v2 00/10] LUO: PCI subsystem (phase I) Bjorn Helgaas
2025-09-27 18:05 ` Pasha Tatashin
2025-09-29 15:04 ` Bjorn Helgaas
2025-09-29 18:13 ` Chris Li
2025-10-07 23:32 ` Chris Li
2025-10-08 23:00 ` David Matlack
2025-10-09 17:12 ` Chris Li
2025-10-09 23:21 ` Pratyush Yadav
2025-10-10 4:19 ` Chris Li
2025-10-10 23:49 ` Jason Miu
2025-10-13 13:58 ` Pratyush Yadav
2025-10-14 16:11 ` Pratyush Yadav
2025-10-14 20:44 ` Chris Li