Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Jork Loeser <jloeser@linux.microsoft.com>
To: linux-hyperv@vger.kernel.org, linux-mm@kvack.org,
	kexec@lists.infradead.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
	Long Li <longli@microsoft.com>, Mike Rapoport <rppt@kernel.org>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Pratyush Yadav <pratyush@kernel.org>,
	Alexander Graf <graf@amazon.com>, Jason Miu <jasonmiu@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Muchun Song <muchun.song@linux.dev>,
	Oscar Salvador <osalvador@suse.de>, Baoquan He <bhe@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Thomas Gleixner <tglx@kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Kees Cook <kees@kernel.org>,
	Ran Xiaokai <ran.xiaokai@zte.com.cn>,
	Justinien Bouron <jbouron@amazon.com>,
	Sourabh Jain <sourabhjain@linux.ibm.com>,
	Pingfan Liu <piliu@redhat.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	Mario Limonciello <mario.limonciello@amd.com>,
	linux-arm-kernel@lists.infradead.org, x86@kernel.org,
	linux-kernel@vger.kernel.org,
	Michael Kelley <mhklinux@outlook.com>,
	Jork Loeser <jloeser@linux.microsoft.com>
Subject: [RFC PATCH 20/20] mshv: freeze and vacuum partitions across kexec
Date: Wed, 27 May 2026 17:42:02 -0700	[thread overview]
Message-ID: <20260528004204.1484584-21-jloeser@linux.microsoft.com> (raw)
In-Reply-To: <20260528004204.1484584-1-jloeser@linux.microsoft.com>

Before kexec the kernel must ensure no VMs are actively running so that
no VP modifies VM-memory that Linux will re-use post-kexec, and record
the partition IDs so the next kernel can clean them up.

Add mshv_freeze_and_get_partition_ids() which:
- Sets a global frozen flag to block new partition and VP creation
- Prevents (re-)dispatching of existing VPs
- Kicks all VPs to exit their dispatch loops, then waits for each to
  finish by acquiring its mutex
- Tears down doorbell ports owned by the parent partition, which
  otherwise survive kexec and cause port ID collisions in the new kernel
- Collects all partition IDs into a kho_alloc_preserve()'d array

The freeze is triggered from the reboot notifier callback. The ID
array is serialized into the KHO FDT so the next kernel can retrieve
it via mshv_retrieve_frozen_partition_ids().

After kexec the previous kernel's partitions are still alive in the
hypervisor. vacuum_stale_partitions() retrieves their IDs from the
KHO-preserved FDT and tears each one down (finalize, withdraw
deposited memory, delete) so the pages become available to the new
kernel.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_page_preserve.c | 100 +++++++++-
 drivers/hv/mshv_page_preserve.h |   3 +
 drivers/hv/mshv_root.h          |   2 +
 drivers/hv/mshv_root_main.c     | 339 +++++++++++++++++++++++++++++---
 4 files changed, 417 insertions(+), 27 deletions(-)

diff --git a/drivers/hv/mshv_page_preserve.c b/drivers/hv/mshv_page_preserve.c
index e16fb946790d..dba7975ab058 100644
--- a/drivers/hv/mshv_page_preserve.c
+++ b/drivers/hv/mshv_page_preserve.c
@@ -24,6 +24,8 @@
 
 static void *fdt_page;
 static struct kho_radix_tree preserved_pages_tree;
+static u64 *frozen_partition_ids;
+static unsigned int nr_frozen_partition_ids;
 
 /**
  * mshv_register_preserve_page() - Register a page to be preserved by KHO
@@ -78,7 +80,7 @@ static int preserve_table_cb(phys_addr_t phys, void *data)
 	return kho_preserve_pages(phys_to_page(phys), 1);
 }
 
-static int create_fdt(void)
+static int create_fdt(u64 *partition_ids, unsigned int nr_partition_ids)
 {
 	int err;
 	void *fdt;
@@ -106,6 +108,19 @@ static int create_fdt(void)
 	err = fdt_property(fdt, "root_table", &root_table, sizeof(root_table));
 	if (err)
 		return err;
+	if (nr_partition_ids) {
+		phys_addr_t ids_pa = virt_to_phys(partition_ids);
+		u32 count = nr_partition_ids;
+
+		err = fdt_property(fdt, "partition_ids", &ids_pa,
+				   sizeof(ids_pa));
+		if (err)
+			return err;
+		err = fdt_property(fdt, "nr_partition_ids", &count,
+				   sizeof(count));
+		if (err)
+			return err;
+	}
 	err = fdt_end_node(fdt);
 	if (err)
 		return err;
@@ -118,6 +133,8 @@ static int create_fdt(void)
 
 /**
  * preserve_tree() - Preserve pages owned by Microsoft Hypervisor
+ * @partition_ids: array of frozen partition IDs to serialize, or NULL
+ * @nr_partition_ids: number of entries in @partition_ids
  *
  * This gets called prior to kexec and is our signal to finally preserve the
  * pages with KHO, and create & register the named FDT. We also need to freeze
@@ -125,7 +142,7 @@ static int create_fdt(void)
  *
  * Return: 0 on success, -errno on error.
  */
-static int preserve_tree(void)
+static int preserve_tree(u64 *partition_ids, unsigned int nr_partition_ids)
 {
 	const struct kho_radix_walk_cb preserve_cb = {
 		.key = preserve_key_cb,
@@ -141,7 +158,7 @@ static int preserve_tree(void)
 	}
 
 	/* Populate the pre-allocated FDT page with current tree state */
-	err = create_fdt();
+	err = create_fdt(partition_ids, nr_partition_ids);
 	if (err) {
 		pr_warn("%s() - create_fdt() failed: %d\n", __func__, err);
 		return err;
@@ -177,6 +194,11 @@ static int preserve_tree(void)
 /*
  * Reboot-callback triggering page preservation prior to kexec. Other reboots
  * need no KHO preservation.
+ *
+ * The mshv_root module's higher-priority reboot notifier freezes all VPs
+ * and hands off partition IDs via mshv_set_frozen_partition_ids() before
+ * this callback runs. If the module is not loaded, no partitions exist
+ * and the tree is preserved without partition IDs.
  */
 static int reboot_cb(struct notifier_block *nb, unsigned long action,
 		     void *data)
@@ -185,9 +207,9 @@ static int reboot_cb(struct notifier_block *nb, unsigned long action,
 	if (kexec_in_progress) {
 		int err;
 
-		/* Finalize handover: write KHO descriptors, flush metadata */
 		pr_debug("%s() - KHO-preserving page tree\n", __func__);
-		err = preserve_tree();
+		err = preserve_tree(frozen_partition_ids,
+				    nr_frozen_partition_ids);
 		if (err)
 			panic("preserve_tree() failed - must not kexec: %d\n",
 			      err);
@@ -260,6 +282,74 @@ static int __init restore_tree(void)
 	return 0;
 }
 
+/**
+ * mshv_set_frozen_partition_ids() - Hand off frozen partition IDs for KHO
+ * @ids: kho_alloc_preserve()'d array of partition IDs, or NULL
+ * @nr: number of entries in @ids
+ *
+ * Called by the mshv_root module's reboot notifier (which runs at higher
+ * priority) to pass the frozen partition ID list to the built-in page
+ * preservation code before it serializes the KHO FDT.
+ */
+void mshv_set_frozen_partition_ids(u64 *ids, unsigned int nr)
+{
+	frozen_partition_ids = ids;
+	nr_frozen_partition_ids = nr;
+}
+EXPORT_SYMBOL_GPL(mshv_set_frozen_partition_ids);
+
+/**
+ * mshv_retrieve_frozen_partition_ids() - Retrieve frozen partition IDs
+ * @partition_ids: receives pointer to the preserved ID array, or NULL
+ * @nr_ids: receives the number of entries, or 0
+ *
+ * Counterpart to mshv_freeze_and_get_partition_ids(). Reads the partition
+ * ID list from the KHO-preserved FDT. The returned pointer (if non-NULL)
+ * refers to kho_alloc_preserve()'d memory from the previous kernel.
+ *
+ * Return: 0 on success (including when no IDs are found), negative errno on
+ *  error.
+ */
+int mshv_retrieve_frozen_partition_ids(u64 **partition_ids,
+				       unsigned int *nr_ids)
+{
+	int node, len;
+	const phys_addr_t *ids_pa;
+	const u32 *count_prop;
+
+	*partition_ids = NULL;
+	*nr_ids = 0;
+
+	if (!fdt_page)
+		return 0;
+
+	node = fdt_path_offset(fdt_page, "/");
+	if (node < 0)
+		return 0;
+
+	ids_pa = fdt_getprop(fdt_page, node, "partition_ids", &len);
+	if (!ids_pa)
+		return 0;
+
+	if (len != sizeof(*ids_pa)) {
+		pr_err("Malformed preserved FDT: invalid partition_ids property.\n");
+		return -EINVAL;
+	}
+
+	count_prop = fdt_getprop(fdt_page, node, "nr_partition_ids", &len);
+	if (!count_prop || len != sizeof(*count_prop)) {
+		pr_err("Malformed preserved FDT: invalid nr_partition_ids property.\n");
+		return -EINVAL;
+	}
+
+	*partition_ids = phys_to_virt(*ids_pa);
+	*nr_ids = *count_prop;
+
+	pr_info("Retrieved %u frozen partition ID(s) from KHO\n", *nr_ids);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mshv_retrieve_frozen_partition_ids);
+
 /*
  * Restore individual pages using KHO's helper during boot.
  *
diff --git a/drivers/hv/mshv_page_preserve.h b/drivers/hv/mshv_page_preserve.h
index ac99b4e33285..4625d59a3070 100644
--- a/drivers/hv/mshv_page_preserve.h
+++ b/drivers/hv/mshv_page_preserve.h
@@ -14,5 +14,8 @@ int mshv_preserve_init(void);
 int mshv_register_preserve_page(struct page *pg);
 int mshv_unregister_preserve_page(struct page *pg);
 int mshv_iterate_preserved(const struct kho_radix_walk_cb *cb, void *data);
+void mshv_set_frozen_partition_ids(u64 *ids, unsigned int nr);
+int mshv_retrieve_frozen_partition_ids(u64 **partition_ids,
+				       unsigned int *nr_ids);
 
 #endif /* _MSHV_PAGE_PRESERVE_H */
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 216053f8e0ab..7476398d3b47 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -195,6 +195,7 @@ struct mshv_root {
 	spinlock_t pt_ht_lock;
 	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
 	struct hv_partition_property_vmm_capabilities vmm_caps;
+	bool frozen;
 };
 
 /*
@@ -364,6 +365,7 @@ static inline void mshv_debugfs_vp_remove(struct mshv_vp *vp) { }
 
 extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
+
 extern u8 * __percpu *hv_synic_eventring_tail;
 
 struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 5fbd01c12ab8..e95abf4698f8 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -18,6 +18,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/mm.h>
 #include <linux/io.h>
+#include <linux/cleanup.h>
 #include <linux/cpuhotplug.h>
 #include <linux/random.h>
 #include <asm/mshyperv.h>
@@ -27,6 +28,7 @@
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
 #include <linux/crash_dump.h>
+#include <linux/kexec_handover.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
 #include <linux/rseq.h>
@@ -359,6 +361,7 @@ mshv_suspend_vp(const struct mshv_vp *vp, bool *message_in_flight)
  */
 static long mshv_run_vp_with_hyp_scheduler(struct mshv_vp *vp)
 {
+	bool message_in_flight;
 	long ret;
 	struct hv_register_assoc suspend_regs[2] = {
 			{ .name = HV_REGISTER_INTERCEPT_SUSPEND },
@@ -375,32 +378,40 @@ static long mshv_run_vp_with_hyp_scheduler(struct mshv_vp *vp)
 	}
 
 	ret = wait_event_interruptible(vp->run.vp_suspend_queue,
-				       vp->run.kicked_by_hv == 1);
-	if (ret) {
-		bool message_in_flight;
+				       vp->run.kicked_by_hv == 1 ||
+				       READ_ONCE(mshv_root.frozen));
 
-		/*
-		 * Otherwise the waiting was interrupted by a signal: suspend
-		 * the vCPU explicitly and copy message in flight (if any).
-		 */
-		ret = mshv_suspend_vp(vp, &message_in_flight);
-		if (ret)
-			return ret;
+	/* Normal wakeup: intercept arrived */
+	if (!ret && !READ_ONCE(mshv_root.frozen)) {
+		vp->run.kicked_by_hv = 0;
+		return 0;
+	}
 
-		/* Return if no message in flight */
-		if (!message_in_flight)
-			return -EINTR;
+	/*
+	 * Signal or frozen: VP was resumed above and may still be
+	 * running in the hypervisor. Suspend it before returning.
+	 */
+	ret = mshv_suspend_vp(vp, &message_in_flight);
+	if (ret)
+		return ret;
 
-		/* Wait for the message in flight. */
-		wait_event(vp->run.vp_suspend_queue, vp->run.kicked_by_hv == 1);
-	}
+	/* No in-flight message or frozen — nothing to deliver */
+	if (!message_in_flight || READ_ONCE(mshv_root.frozen))
+		return -EINTR;
+
+	/* Signal case: wait for the in-flight intercept message */
+	wait_event(vp->run.vp_suspend_queue,
+		   vp->run.kicked_by_hv == 1 ||
+		   READ_ONCE(mshv_root.frozen));
+
+	if (READ_ONCE(mshv_root.frozen))
+		return -EINTR;
 
 	/*
 	 * Reset the flag to make the wait_event call above work
 	 * next time.
 	 */
 	vp->run.kicked_by_hv = 0;
-
 	return 0;
 }
 
@@ -503,7 +514,8 @@ mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
 	ret = wait_event_interruptible(vp->run.vp_suspend_queue,
 				       (vp->run.kicked_by_hv == 1 &&
 					!mshv_vp_dispatch_thread_blocked(vp)) ||
-				       mshv_vp_interrupt_pending(vp));
+				       mshv_vp_interrupt_pending(vp) ||
+				       READ_ONCE(mshv_root.frozen));
 	if (ret)
 		return -EINTR;
 
@@ -513,6 +525,9 @@ mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
 				       mshv_vp_dispatch_thread_blocked(vp),
 				       mshv_vp_interrupt_pending(vp));
 
+	if (READ_ONCE(mshv_root.frozen))
+		return -EBUSY;
+
 	vp->run.flags.root_sched_blocked = 0;
 	vp->run.kicked_by_hv = 0;
 
@@ -539,6 +554,11 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 		u32 flags = 0;
 		struct hv_output_dispatch_vp output;
 
+		if (READ_ONCE(mshv_root.frozen)) {
+			ret = -EBUSY;
+			break;
+		}
+
 		if (__xfer_to_guest_mode_work_pending()) {
 			ret = xfer_to_guest_mode_handle_work();
 
@@ -712,6 +732,11 @@ static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 	trace_mshv_run_vp_entry(vp->vp_partition->pt_id, vp->vp_index);
 
 	do {
+		if (READ_ONCE(mshv_root.frozen)) {
+			rc = -EBUSY;
+			break;
+		}
+
 		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
 			rc = mshv_run_vp_with_root_scheduler(vp);
 		else
@@ -1074,6 +1099,9 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	struct hv_stats_page *stats_pages[2];
 	long ret;
 
+	if (READ_ONCE(mshv_root.frozen))
+		return -EBUSY;
+
 	if (copy_from_user(&args, arg, sizeof(args)))
 		return -EFAULT;
 
@@ -1762,6 +1790,201 @@ static void drain_all_vps(const struct mshv_partition *partition)
 	}
 }
 
+/**
+ * mshv_freeze_and_get_partition_ids() - Freeze all partitions and collect IDs
+ * @partition_ids: on success, receives a kho_alloc_preserve()'d array of
+ *      partition IDs; set to NULL on failure or when no partitions exist
+ * @nr_ids: on success, receives the number of entries in @partition_ids; set to
+ *      0 on failure or when no partitions exist
+ *
+ * Sets the global frozen flag to prevent creation of new partitions and
+ * (re-)dispatching of VPs. Kicks all VPs so they exit their dispatch loops,
+ * then waits for each VP to actually finish by acquiring its mutex.
+ *
+ * Must be called before kexec to ensure no VP modifies VM-memory that Linux
+ * will re-use post-kexec.
+ *
+ * Return: 0 on success, negative errno on failure. On failure, partitions
+ *  and VPs are left in an undefined state — the caller must not proceed
+ *  with kexec and should panic.
+ */
+static int
+mshv_freeze_and_get_partition_ids(u64 **partition_ids, unsigned int *nr_ids)
+{
+	unsigned int nr_alloc = 0, nr_ref = 0, nr_noref = 0;
+	struct mshv_partition *partition;
+	struct mshv_vp *vp;
+	int bkt, i;
+	u64 *ids;
+
+	*partition_ids = NULL;
+	*nr_ids = 0;
+
+	scoped_guard(spinlock, &mshv_root.pt_ht_lock)
+		mshv_root.frozen = true;
+
+	/*
+	 * Count partitions to size the ID array. Frozen prevents new additions,
+	 * so this is an upper bound.
+	 */
+	scoped_guard(rcu)
+		hash_for_each_rcu(mshv_root.pt_htable, bkt, partition, pt_hnode)
+			nr_alloc++;
+
+	if (!nr_alloc) {
+		pr_info("Frozen 0 partition(s) for kexec\n");
+		return 0;
+	}
+
+	ids = kho_alloc_preserve(nr_alloc * sizeof(*ids));
+	if (IS_ERR(ids)) {
+		pr_err("Failed to allocate partition ID array for freeze\n");
+		return PTR_ERR(ids);
+	}
+
+	/*
+	 * Record every partition's ID and obtain a reference for later use.
+	 *
+	 * Zero-refcount partitions (destroy_partition() in progress) still get
+	 * their ID recorded — destruction may not complete before kexec, and
+	 * the next kernel must clean them up. Their IDs are stored at the back
+	 * of the array so the kick/drain phase can iterate only the ref'd
+	 * prefix ids[0..nr_ref).
+	 *
+	 * VP kicking is deferred to the next phase where it happens under
+	 * pt_mutex, which serializes against mshv_partition_ioctl_create_vp().
+	 */
+	rcu_read_lock();
+	hash_for_each_rcu(mshv_root.pt_htable, bkt, partition, pt_hnode) {
+		if (!mshv_partition_get(partition)) {
+			/*
+			 * Zero refcount — destroy_partition() is in progress.
+			 * All fds are closed so no VP ioctl can be running.
+			 * Store at the back; skip VP kicking.
+			 */
+			ids[nr_alloc - 1 - nr_noref++] = partition->pt_id;
+			continue;
+		}
+
+		ids[nr_ref++] = partition->pt_id;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * For each ref'd partition, acquire and release pt_mutex as a barrier
+	 * against any in-flight create_vp. After this, the frozen flag
+	 * prevents new VPs from being created, so pt_vp_array is stable.
+	 * Then kick all VPs and drain by acquiring each vp_mutex.
+	 *
+	 * Root scheduler: disable_vp_dispatch() sets
+	 * HV_REGISTER_DISPATCH_SUSPEND, which causes any in-progress dispatch
+	 * hypercall to return. This is safe regardless of VP state because the
+	 * VP only executes while the kernel thread's dispatch hypercall is
+	 * active — once it returns, the VP cannot run until re-dispatched,
+	 * which the frozen check prevents.
+	 *
+	 * Hyp scheduler: the VP runs independently in the hypervisor and must
+	 * be explicitly suspended from within its dispatch loop (via
+	 * mshv_suspend_vp()) when the kernel thread detects the frozen flag.
+	 * wake_up_all() unblocks the kernel thread so it can do so.
+	 */
+	for (i = 0; i < nr_ref; i++) {
+		/* Ref held; partition stays in hash and alive outside RCU */
+		scoped_guard(rcu)
+			partition = mshv_partition_find(ids[i]);
+
+		/* Barrier: wait for any in-flight create_vp to complete */
+		scoped_guard(mutex, &partition->pt_mutex) {}
+
+		for (bkt = 0; bkt < MSHV_MAX_VPS; bkt++) {
+			vp = partition->pt_vp_array[bkt];
+			if (!vp)
+				continue;
+
+			if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+				disable_vp_dispatch(vp);
+
+			wake_up_all(&vp->run.vp_suspend_queue);
+		}
+
+		/*
+		 * Wait for every VP to finish its current ioctl. Taking the VP
+		 * mutex proves the VP is no longer inside run_vp.
+		 *
+		 * On Hyp-scheduler, prior mshv_suspend_vp() might have failed.
+		 * Since it's idempotent, we can safely re-issue and fail kexec
+		 * if suspend fails again. In this case, the caller is expected
+		 * to panic, so cleanup is unnecessary.
+		 */
+		for (bkt = 0; bkt < MSHV_MAX_VPS; bkt++) {
+			vp = partition->pt_vp_array[bkt];
+			if (!vp)
+				continue;
+
+			scoped_guard(mutex, &vp->vp_mutex) {
+				if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT) {
+					bool mif;
+					int ret;
+
+					ret = mshv_suspend_vp(vp, &mif);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+
+		/*
+		 * Tear down doorbell ports owned by the parent partition.
+		 * These survive child partition deletion and kexec, so the
+		 * new kernel would collide on port IDs if we leave them.
+		 */
+		mshv_eventfd_release(partition);
+
+		mshv_partition_put(partition);
+	}
+
+	/* Move non-ref'd IDs next to ref'd IDs to form a contiguous array */
+	if (nr_noref) {
+		memmove(&ids[nr_ref], &ids[nr_alloc - nr_noref],
+			nr_noref * sizeof(*ids));
+	}
+
+	*partition_ids = ids;
+	*nr_ids = nr_ref + nr_noref;
+
+	pr_info("Frozen %u partition(s) for kexec\n", nr_ref + nr_noref);
+	return 0;
+}
+
+/*
+ * Reboot notifier for the mshv_root module. Runs at higher priority than
+ * the built-in page-preservation notifier so that all VPs are frozen and
+ * partition IDs are handed off before the tree is serialized.
+ */
+static int mshv_root_reboot_cb(struct notifier_block *nb, unsigned long action,
+			       void *data)
+{
+	if (kexec_in_progress) {
+		u64 *partition_ids;
+		unsigned int nr_partition_ids;
+		int err;
+
+		err = mshv_freeze_and_get_partition_ids(&partition_ids,
+							&nr_partition_ids);
+		if (err)
+			panic("mshv_freeze_and_get_partition_ids() failed - must not kexec: %d\n",
+			      err);
+
+		mshv_set_frozen_partition_ids(partition_ids, nr_partition_ids);
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block mshv_root_reboot_notifier = {
+	.notifier_call = mshv_root_reboot_cb,
+	.priority = 1, /* higher than the built-in preserve notifier (0) */
+};
+
 static void
 remove_partition(struct mshv_partition *partition)
 {
@@ -1911,13 +2134,27 @@ mshv_partition_release(struct inode *inode, struct file *filp)
 static int
 add_partition(struct mshv_partition *partition)
 {
-	spin_lock(&mshv_root.pt_ht_lock);
+	guard(spinlock)(&mshv_root.pt_ht_lock);
+
+	/*
+	 * Reject new partitions once frozen. Note: there is a small window
+	 * where a concurrent create-ioctl has already called
+	 * hv_call_create_partition() but not yet reached here. If kexec fires
+	 * during that window, the caller's error-path
+	 * hv_call_delete_partition() may never execute and the empty partition
+	 * leaks in the hypervisor.
+	 *
+	 * No pages are deposited at that point, so only the hypervisor-internal
+	 * tracking is lost. Closing this fully would require reworking the
+	 * entire mshv-locking logic so that the frozen check and the hypervisor
+	 * create call happen atomically.
+	 */
+	if (mshv_root.frozen)
+		return -EBUSY;
 
 	hash_add_rcu(mshv_root.pt_htable, &partition->pt_hnode,
 		     partition->pt_id);
 
-	spin_unlock(&mshv_root.pt_ht_lock);
-
 	return 0;
 }
 
@@ -2316,6 +2553,55 @@ root_scheduler_deinit(void)
 	free_percpu(root_scheduler_output);
 }
 
+/**
+ * vacuum_stale_partitions() - Tear down partitions left by a prior kernel.
+ * @dev: device for logging
+ *
+ * After kexec the previous kernel's partitions are still alive in the
+ * hypervisor. Retrieve their IDs from the KHO-preserved FDT and finalize,
+ * withdraw, and delete each one so the deposited pages return to the free pool.
+ */
+static void __init vacuum_stale_partitions(struct device *dev)
+{
+	u64 *ids;
+	unsigned int nr;
+	int i, err;
+
+	err = mshv_retrieve_frozen_partition_ids(&ids, &nr);
+	if (err) {
+		dev_err(dev, "Failed to retrieve stale partition IDs: %d\n",
+			err);
+		return;
+	}
+
+	for (i = 0; i < nr; i++) {
+		dev_info(dev, "Cleaning up stale partition %llu\n",
+			 ids[i]);
+
+		err = hv_call_finalize_partition(ids[i]);
+		if (err == -EINVAL) {
+			dev_info(dev, "partition %llu already gone\n",
+				 ids[i]);
+			continue;
+		}
+		if (err)
+			dev_warn(dev, "finalize partition %llu failed: %d\n",
+				 ids[i], err);
+
+		err = hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, ids[i]);
+		if (err)
+			dev_warn(dev, "withdraw memory %llu failed: %d\n",
+				 ids[i], err);
+
+		err = hv_call_delete_partition(ids[i]);
+		if (err)
+			dev_warn(dev, "delete partition %llu failed: %d\n",
+				 ids[i], err);
+	}
+
+	kho_restore_free(ids);
+}
+
 static int __init mshv_init_vmm_caps(struct device *dev)
 {
 	int ret;
@@ -2372,10 +2658,16 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto synic_cleanup;
 
-	ret = root_scheduler_init(dev);
+	vacuum_stale_partitions(dev);
+
+	ret = register_reboot_notifier(&mshv_root_reboot_notifier);
 	if (ret)
 		goto synic_cleanup;
 
+	ret = root_scheduler_init(dev);
+	if (ret)
+		goto unregister_reboot;
+
 	ret = mshv_debugfs_init();
 	if (ret)
 		goto deinit_root_scheduler;
@@ -2395,6 +2687,8 @@ static int __init mshv_parent_partition_init(void)
 	mshv_debugfs_exit();
 deinit_root_scheduler:
 	root_scheduler_deinit();
+unregister_reboot:
+	unregister_reboot_notifier(&mshv_root_reboot_notifier);
 synic_cleanup:
 	mshv_synic_exit();
 device_deregister:
@@ -2410,6 +2704,7 @@ static void __exit mshv_parent_partition_exit(void)
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
 	root_scheduler_deinit();
+	unregister_reboot_notifier(&mshv_root_reboot_notifier);
 	mshv_synic_exit();
 }
 
-- 
2.43.0



      parent reply	other threads:[~2026-05-28  0:43 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-28  0:41 [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 01/20] kho: generalize radix tree APIs Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 02/20] kho: store incoming radix tree in kho_in Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 03/20] kho: add a struct for radix callbacks Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 04/20] kho: add callback for table pages Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 05/20] kho: add data argument to radix walk callback Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 06/20] kho: allow early-boot usage of the KHO radix tree Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 07/20] kho: allow destroying " Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 08/20] kho: add kho_radix_init_tree() Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 09/20] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 10/20] kho: extended scratch Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 11/20] kho: return virtual address of mem_map Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 12/20] mm/hugetlb: make bootmem allocation work with KHO Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 13/20] kho: add radix tree freeze and del_key() error reporting Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 14/20] kho: Add crash-kernel-safe radix tree presence check Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 15/20] mshv: Use page tracker to manage MSHV-owned pages and preserve with KHO Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 16/20] mshv: Add debugfs interface to page tracker Jork Loeser
2026-05-28  0:41 ` [RFC PATCH 17/20] hyperv: Reserve crash MSR P2 for page preservation root PA Jork Loeser
2026-05-28  0:42 ` [RFC PATCH 18/20] mshv: Exclude Hyper-V donated pages from crash dump collection Jork Loeser
2026-05-28  0:42 ` [RFC PATCH 19/20] kexec: export kexec_in_progress for modules Jork Loeser
2026-05-28  0:42 ` Jork Loeser [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260528004204.1484584-21-jloeser@linux.microsoft.com \
    --to=jloeser@linux.microsoft.com \
    --cc=akpm@linux-foundation.org \
    --cc=bhe@redhat.com \
    --cc=bp@alien8.de \
    --cc=catalin.marinas@arm.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=decui@microsoft.com \
    --cc=graf@amazon.com \
    --cc=haiyangz@microsoft.com \
    --cc=hpa@zytor.com \
    --cc=jasonmiu@google.com \
    --cc=jbouron@amazon.com \
    --cc=kees@kernel.org \
    --cc=kexec@lists.infradead.org \
    --cc=kys@microsoft.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longli@microsoft.com \
    --cc=mario.limonciello@amd.com \
    --cc=mhklinux@outlook.com \
    --cc=mingo@redhat.com \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=piliu@redhat.com \
    --cc=pratyush@kernel.org \
    --cc=rafael.j.wysocki@intel.com \
    --cc=ran.xiaokai@zte.com.cn \
    --cc=rppt@kernel.org \
    --cc=sourabhjain@linux.ibm.com \
    --cc=tglx@kernel.org \
    --cc=wei.liu@kernel.org \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox