linux-hyperv.vger.kernel.org archive mirror
* [PATCH 0/3] Introduce movable pages for Hyper-V guests
@ 2025-09-24 21:30 Stanislav Kinsburskii
  2025-09-24 21:31 ` [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity Stanislav Kinsburskii
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-24 21:30 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

From the start, the root-partition driver allocates, pins, and maps all
guest memory into the hypervisor at guest creation. This is simple: Linux
cannot move the pages, so the guest’s view in Linux and in Microsoft
Hypervisor never diverges.

However, this approach has major drawbacks:
- NUMA: affinity can’t be changed at runtime, so you can’t migrate guest memory closer to the CPUs running it → performance hit.
- Memory management: unused guest memory can’t be swapped out, compacted, or merged.
- Provisioning time: upfront allocation/pinning slows guest create/destroy.
- Overcommit: no memory overcommit on hosts with pinned-guest memory.

This series adds movable memory pages for Hyper-V child partitions. Guest
pages are no longer allocated upfront; they’re allocated and mapped into
the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
backed by a host PFN).
When a page is moved, Linux no longer holds it and it is unmapped from the hypervisor.
As a result, Hyper-V guests behave like regular Linux processes, enabling standard Linux memory features to apply to guests.
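
In code terms, the VP run loop just retries once such a fault has been
serviced; a trimmed-down excerpt from patch 3 (the comment is added here
for illustration only):

	/*
	 * A GPA intercept for a GFN with no backing host PFN faults the
	 * pages in via HMM and maps them into the hypervisor; the VP is
	 * then simply re-run.
	 */
	do {
		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
			rc = mshv_run_vp_with_root_scheduler(vp);
		else
			rc = mshv_run_vp_with_hyp_scheduler(vp);
	} while (rc == 0 && mshv_vp_handle_intercept(vp));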

Exceptions (still pinned):
  1. Encrypted guests (explicit).
  2. Guests with passthrough devices (implicitly pinned by the VFIO framework).

---

Stanislav Kinsburskii (3):
      Drivers: hv: Rename a few memory region related functions for clarity
      Drivers: hv: Centralize guest memory region destruction in helper
      Drivers: hv: Add support for movable memory regions


 drivers/hv/Kconfig          |    1 
 drivers/hv/mshv_root.h      |    8 +
 drivers/hv/mshv_root_main.c |  448 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 397 insertions(+), 60 deletions(-)



* [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity
  2025-09-24 21:30 [PATCH 0/3] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
@ 2025-09-24 21:31 ` Stanislav Kinsburskii
  2025-09-26 17:14   ` Nuno Das Neves
  2025-09-24 21:31 ` [PATCH 2/3] Drivers: hv: Centralize guest memory region destruction in helper Stanislav Kinsburskii
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-24 21:31 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

A cleanup and precursor patch.

Rename "mshv_partition_mem_region_map" to "mshv_handle_pinned_region",
"mshv_region_populate" to "mshv_pin_region" and
"mshv_region_populate_pages" to "mshv_region_pin_pages"
to better reflect its purpose of handling pinned memory regions.

Update the "mshv_handle_pinned_region" function's documentation to provide
detailed information about its behavior and return values.

Also drop the check for the range being pinned, since the function is
static and all memory regions are pinned at this point anyway.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index a1c8c3bc79bf1..5ed6bce334417 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1137,8 +1137,8 @@ mshv_region_evict(struct mshv_mem_region *region)
 }
 
 static int
-mshv_region_populate_pages(struct mshv_mem_region *region,
-			   u64 page_offset, u64 page_count)
+mshv_region_pin_pages(struct mshv_mem_region *region,
+		      u64 page_offset, u64 page_count)
 {
 	u64 done_count, nr_pages;
 	struct page **pages;
@@ -1164,14 +1164,9 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		if (region->flags.range_pinned)
-			ret = pin_user_pages_fast(userspace_addr,
-						  nr_pages,
-						  FOLL_WRITE | FOLL_LONGTERM,
-						  pages);
-		else
-			ret = -EOPNOTSUPP;
-
+		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+					  FOLL_WRITE | FOLL_LONGTERM,
+					  pages);
 		if (ret < 0)
 			goto release_pages;
 	}
@@ -1187,9 +1182,9 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
 }
 
 static int
-mshv_region_populate(struct mshv_mem_region *region)
+mshv_region_pin(struct mshv_mem_region *region)
 {
-	return mshv_region_populate_pages(region, 0, region->nr_pages);
+	return mshv_region_pin_pages(region, 0, region->nr_pages);
 }
 
 static struct mshv_mem_region *
@@ -1264,17 +1259,25 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/*
- * Map guest ram. if snp, make sure to release that from the host first
- * Side Effects: In case of failure, pages are unpinned when feasible.
+/**
+ * mshv_handle_pinned_region - Handle pinned memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-static int
-mshv_partition_mem_region_map(struct mshv_mem_region *region)
+static int mshv_handle_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
-	ret = mshv_region_populate(region);
+	ret = mshv_region_pin(region);
 	if (ret) {
 		pt_err(partition, "Failed to populate memory region: %d\n",
 		       ret);
@@ -1368,7 +1371,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
 	else
-		ret = mshv_partition_mem_region_map(region);
+		ret = mshv_handle_pinned_region(region);
 
 	if (ret)
 		goto errout;




* [PATCH 2/3] Drivers: hv: Centralize guest memory region destruction in helper
  2025-09-24 21:30 [PATCH 0/3] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
  2025-09-24 21:31 ` [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity Stanislav Kinsburskii
@ 2025-09-24 21:31 ` Stanislav Kinsburskii
  2025-09-26 18:15   ` Nuno Das Neves
  2025-09-24 21:31 ` [PATCH 3/3] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  2025-09-27  2:02 ` [PATCH 0/3] Introduce movable pages for Hyper-V guests Mukesh R
  3 siblings, 1 reply; 11+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-24 21:31 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

This is a precursor and cleanup patch.

- Introduce mshv_partition_destroy_region() to encapsulate memory region
  cleanup, including:
  - Removing the region from the partition's list
  - Regaining access for encrypted partitions
  - Unmapping only mapped pages for efficiency
  - Evicting and freeing the region

- Update mshv_unmap_user_memory() to call mshv_partition_destroy_region()
  instead of duplicating cleanup logic.

- Update destroy_partition() to use mshv_partition_destroy_region() for
  all regions, removing the previous inlined cleanup loop.

These changes eliminate code duplication, ensure consistent cleanup, and
improve maintainability for both unmap and partition destruction paths.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   83 ++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 32 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 5ed6bce334417..c0f6023e459c2 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1386,13 +1386,59 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	return ret;
 }
 
+static void mshv_partition_destroy_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	u64 page_offset, page_count;
+	u32 unmap_flags = 0;
+	int ret;
+
+	hlist_del(&region->hnode);
+
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_partition_region_share(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
+			       ret);
+			return;
+		}
+	}
+
+	if (region->flags.large_pages)
+		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	/*
+	 * Unmap only the mapped pages to optimize performance,
+	 * especially for large memory regions.
+	 */
+	for (page_offset = 0; page_offset < region->nr_pages; page_offset += page_count) {
+		page_count = 1;
+		if (!region->pages[page_offset])
+			continue;
+
+		for (; page_count < region->nr_pages - page_offset; page_count++) {
+			if (!region->pages[page_offset + page_count])
+				break;
+		}
+
+		/* ignore unmap failures and continue as process may be exiting */
+		hv_call_unmap_gpa_pages(partition->pt_id,
+					region->start_gfn + page_offset,
+					page_count, unmap_flags);
+	}
+
+	mshv_region_evict(region);
+
+	vfree(region);
+}
+
 /* Called for unmapping both the guest ram and the mmio space */
 static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region mem)
 {
 	struct mshv_mem_region *region;
-	u32 unmap_flags = 0;
 
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1407,18 +1453,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	    region->nr_pages != HVPFN_DOWN(mem.size))
 		return -EINVAL;
 
-	hlist_del(&region->hnode);
-
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
-
-	mshv_region_evict(region);
-
-	vfree(region);
+	mshv_partition_destroy_region(region);
 	return 0;
 }
 
@@ -1754,8 +1789,8 @@ static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
 	struct mshv_mem_region *region;
-	int i, ret;
 	struct hlist_node *n;
+	int i;
 
 	if (refcount_read(&partition->pt_ref_count)) {
 		pt_err(partition,
@@ -1815,25 +1850,9 @@ static void destroy_partition(struct mshv_partition *partition)
 
 	remove_partition(partition);
 
-	/* Remove regions, regain access to the memory and unpin the pages */
 	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
-				  hnode) {
-		hlist_del(&region->hnode);
-
-		if (mshv_partition_encrypted(partition)) {
-			ret = mshv_partition_region_share(region);
-			if (ret) {
-				pt_err(partition,
-				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
-				      ret);
-				return;
-			}
-		}
-
-		mshv_region_evict(region);
-
-		vfree(region);
-	}
+				  hnode)
+		mshv_partition_destroy_region(region);
 
 	/* Withdraw and free all pages we deposited */
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);




* [PATCH 3/3] Drivers: hv: Add support for movable memory regions
  2025-09-24 21:30 [PATCH 0/3] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
  2025-09-24 21:31 ` [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity Stanislav Kinsburskii
  2025-09-24 21:31 ` [PATCH 2/3] Drivers: hv: Centralize guest memory region destruction in helper Stanislav Kinsburskii
@ 2025-09-24 21:31 ` Stanislav Kinsburskii
  2025-09-27  2:02 ` [PATCH 0/3] Introduce movable pages for Hyper-V guests Mukesh R
  3 siblings, 0 replies; 11+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-24 21:31 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Introduce support for movable memory regions in the Hyper-V root
partition driver. This includes integration with MMU notifiers to
handle memory invalidation and remapping efficiently.

- Integrated `mmu_interval_notifier` for movable regions.
- Implemented functions to handle HMM faults and memory invalidation.
- Updated memory region mapping logic to support movable regions.

This change improves memory management flexibility and prepares the
driver for advanced use cases like dynamic memory remapping.

While MMU notifiers are commonly used in virtualization drivers, this
implementation leverages HMM (Heterogeneous Memory Management) for its
tailored functionality. HMM provides a ready-made framework for mirroring,
invalidation, and fault handling, avoiding the need to reimplement these
mechanisms for a single callback. Although MMU notifiers are more generic,
using HMM reduces boilerplate and ensures maintainability by utilizing a
mechanism specifically designed for such use cases.

Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Kconfig          |    1 
 drivers/hv/mshv_root.h      |    8 +
 drivers/hv/mshv_root_main.c |  326 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 325 insertions(+), 10 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index e24f6299c3760..9d24a8c8c52e3 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -68,6 +68,7 @@ config MSHV_ROOT
 	depends on PAGE_SIZE_4KB
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
+	select HMM_MIRROR
 	default n
 	help
 	  Select this option to enable support for booting and running as root
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e3931b0f12693..ac64f062b5a51 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -15,6 +15,7 @@
 #include <linux/hashtable.h>
 #include <linux/dev_printk.h>
 #include <linux/build_bug.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
 
 /*
@@ -79,9 +80,14 @@ struct mshv_mem_region {
 	struct {
 		u64 large_pages:  1; /* 2MiB */
 		u64 range_pinned: 1;
-		u64 reserved:	 62;
+		u64 is_ram	: 1; /* mem region can be ram or mmio */
+		u64 reserved:	 61;
 	} flags;
 	struct mshv_partition *partition;
+#if defined(CONFIG_MMU_NOTIFIER)
+	struct mmu_interval_notifier mni;
+	struct mutex mutex;	/* protects region pages remapping */
+#endif
 	struct page *pages[];
 };
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index c0f6023e459c2..e7066efbcccd5 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,11 +29,14 @@
 #include <linux/crash_dump.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
+#include <linux/hmm.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
 #include "mshv_root.h"
 
+#define MSHV_MAP_FAULT_IN_PAGES			HPAGE_PMD_NR
+
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
@@ -74,6 +77,11 @@ static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
 static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
 static int mshv_init_async_handler(struct mshv_partition *partition);
 static void mshv_async_hvcall_handler(void *data, u64 *status);
+static struct mshv_mem_region
+	*mshv_partition_region_by_gfn(struct mshv_partition *pt, u64 gfn);
+static int mshv_region_remap_pages(struct mshv_mem_region *region,
+				   u32 map_flags, u64 page_offset,
+				   u64 page_count);
 
 static const union hv_input_vtl input_vtl_zero;
 static const union hv_input_vtl input_vtl_normal = {
@@ -600,14 +608,197 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
+#ifdef CONFIG_X86_64
+
+#if defined(CONFIG_MMU_NOTIFIER)
+/**
+ * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * @region: Pointer to the memory region structure
+ * @range: Pointer to the HMM range structure
+ *
+ * This function performs the following steps:
+ * 1. Reads the notifier sequence for the HMM range.
+ * 2. Acquires a read lock on the memory map.
+ * 3. Handles HMM faults for the specified range.
+ * 4. Releases the read lock on the memory map.
+ * 5. If successful, locks the memory region mutex.
+ * 6. Verifies if the notifier sequence has changed during the operation.
+ *    If it has, releases the mutex and returns -EBUSY to match with
+ *    hmm_range_fault() return code for repeating.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+					  struct hmm_range *range)
+{
+	int ret;
+
+	range->notifier_seq = mmu_interval_read_begin(range->notifier);
+	mmap_read_lock(region->mni.mm);
+	ret = hmm_range_fault(range);
+	mmap_read_unlock(region->mni.mm);
+	if (ret)
+		return ret;
+
+	mutex_lock(&region->mutex);
+
+	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+		mutex_unlock(&region->mutex);
+		cond_resched();
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/**
+ * mshv_region_range_fault - Handle memory range faults for a given region.
+ * @region: Pointer to the memory region structure.
+ * @page_offset: Offset of the page within the region.
+ * @page_count: Number of pages to handle.
+ *
+ * This function resolves memory faults for a specified range of pages
+ * within a memory region. It uses HMM (Heterogeneous Memory Management)
+ * to fault in the required pages and updates the region's page array.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	struct hmm_range range = {
+		.notifier = &region->mni,
+		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
+	};
+	unsigned long *pfns;
+	int ret;
+	u64 i;
+
+	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
+	if (!pfns)
+		return -ENOMEM;
+
+	range.hmm_pfns = pfns;
+	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+
+	do {
+		ret = mshv_region_hmm_fault_and_lock(region, &range);
+	} while (ret == -EBUSY);
+
+	if (ret)
+		goto out;
+
+	for (i = 0; i < page_count; i++)
+		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+
+	if (PageHuge(region->pages[page_offset]))
+		region->flags.large_pages = true;
+
+	ret = mshv_region_remap_pages(region, region->hv_map_flags,
+				      page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+out:
+	kfree(pfns);
+	return ret;
+}
+#else /* CONFIG_MMU_NOTIFIER */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+{
+	u64 page_offset, page_count;
+	int ret;
+
+	if (WARN_ON_ONCE(region->flags.range_pinned))
+		return false;
+
+	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
+	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				 MSHV_MAP_FAULT_IN_PAGES);
+
+	/* Map more pages than requested to reduce the number of faults. */
+	page_count = min(region->nr_pages - page_offset,
+			 MSHV_MAP_FAULT_IN_PAGES);
+
+	ret = mshv_region_range_fault(region, page_offset, page_count);
+
+	WARN_ONCE(ret,
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  region->partition->pt_id, region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  gfn, page_offset, page_count);
+
+	return !ret;
+}
+
+/**
+ * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
+ * @vp: Pointer to the virtual processor structure.
+ *
+ * This function processes GPA intercepts by identifying the memory region
+ * corresponding to the intercepted GPA, aligning the page offset, and
+ * mapping the required pages. It ensures that the region is valid and
+ * handles faults efficiently by mapping multiple pages at once.
+ *
+ * Return: true if the intercept was handled successfully, false otherwise.
+ */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+	struct mshv_mem_region *region;
+	struct hv_x64_memory_intercept_message *msg;
+	u64 gfn;
+
+	msg = (struct hv_x64_memory_intercept_message *)
+		vp->vp_intercept_msg_page->u.payload;
+
+	gfn = HVPFN_DOWN(msg->guest_physical_address);
+
+	region = mshv_partition_region_by_gfn(p, gfn);
+	if (!region)
+		return false;
+
+	if (WARN_ON_ONCE(!region->flags.is_ram))
+		return false;
+
+	if (WARN_ON_ONCE(region->flags.range_pinned))
+		return false;
+
+	return mshv_region_handle_gfn_fault(region, gfn);
+}
+
+#else	/* CONFIG_X86_64 */
+
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
+
+#endif	/* CONFIG_X86_64 */
+
+static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
+{
+	switch (vp->vp_intercept_msg_page->header.message_type) {
+	case HVMSG_GPA_INTERCEPT:
+		return mshv_handle_gpa_intercept(vp);
+	}
+	return false;
+}
+
 static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;
 
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		rc = mshv_run_vp_with_root_scheduler(vp);
-	else
-		rc = mshv_run_vp_with_hyp_scheduler(vp);
+	do {
+		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+			rc = mshv_run_vp_with_root_scheduler(vp);
+		else
+			rc = mshv_run_vp_with_hyp_scheduler(vp);
+	} while (rc == 0 && mshv_vp_handle_intercept(vp));
 
 	if (rc)
 		return rc;
@@ -1216,6 +1407,108 @@ mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr)
 	return NULL;
 }
 
+#if defined(CONFIG_MMU_NOTIFIER)
+static void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+	if (region->flags.range_pinned)
+		return;
+
+	mmu_interval_notifier_remove(&region->mni);
+}
+
+/**
+ * mshv_region_invalidate - Invalidate a memory region
+ * @mni: Pointer to the mmu_interval_notifier structure
+ * @range: Pointer to the mmu_notifier_range structure
+ * @cur_seq: Current sequence number for the interval notifier
+ *
+ * This function invalidates a memory region by remapping its pages with
+ * no access permissions. It locks the region's mutex to ensure thread safety
+ * and updates the sequence number for the interval notifier. If the range
+ * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
+ * lock and returns false if unsuccessful.
+ *
+ * Return: true if the region was successfully invalidated, false otherwise.
+ */
+static bool mshv_region_invalidate(struct mmu_interval_notifier *mni,
+				   const struct mmu_notifier_range *range,
+				   unsigned long cur_seq)
+{
+	struct mshv_mem_region *region = container_of(mni,
+						struct mshv_mem_region,
+						mni);
+	u64 page_offset, page_count;
+	unsigned long mstart, mend;
+	int ret;
+
+	if (!mmget_not_zero(mni->mm))
+		return true;
+
+	if (mmu_notifier_range_blockable(range)) {
+		mutex_lock(&region->mutex);
+	} else if (!mutex_trylock(&region->mutex)) {
+		mmput(mni->mm);
+		return false;
+	}
+
+	mmu_interval_set_seq(mni, cur_seq);
+
+	mstart = max(range->start, region->start_uaddr);
+	mend = min(range->end, region->start_uaddr +
+		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+
+	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	page_count = HVPFN_DOWN(mend - mstart);
+
+	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+				      page_offset, page_count);
+
+	WARN_ONCE(ret,
+		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
+		  region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  range->start, range->end, range->event,
+		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+
+	memset(region->pages + page_offset, 0,
+	       page_count * sizeof(struct page *));
+
+	mutex_unlock(&region->mutex);
+	mmput(mni->mm);
+
+	return true;
+}
+
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
+	.invalidate = mshv_region_invalidate,
+};
+
+static bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	int ret;
+
+	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+					   region->start_uaddr,
+					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   &mshv_region_mni_ops);
+	if (ret)
+		return false;
+
+	mutex_init(&region->mutex);
+
+	return true;
+}
+#else
+static inline void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+}
+
+static inline bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	return false;
+}
+#endif
+
 /*
  * NB: caller checks and makes sure mem->size is page aligned
  * Returns: 0 with regionpp updated on success, or -errno
@@ -1248,9 +1541,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	/* Note: large_pages flag populated when we pin the pages */
-	if (!is_mmio)
-		region->flags.range_pinned = true;
+	/* Note: large_pages flag populated when pages are allocated. */
+	if (!is_mmio) {
+		region->flags.is_ram = true;
+
+		if (mshv_partition_encrypted(partition) ||
+		    !mshv_region_movable_init(region))
+			region->flags.range_pinned = true;
+	}
 
 	region->partition = partition;
 
@@ -1370,9 +1668,16 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	if (is_mmio)
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
-	else
+	else if (region->flags.range_pinned)
 		ret = mshv_handle_pinned_region(region);
-
+	else
+		/*
+		 * For non-pinned regions, remap with no access to let the
+		 * hypervisor track dirty pages, enabling pre-copy live
+		 * migration.
+		 */
+		ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+					      0, region->nr_pages);
 	if (ret)
 		goto errout;
 
@@ -1395,6 +1700,9 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 
 	hlist_del(&region->hnode);
 
+	if (region->flags.is_ram)
+		mshv_region_movable_fini(region);
+
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_partition_region_share(region);
 		if (ret) {




* Re: [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity
  2025-09-24 21:31 ` [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity Stanislav Kinsburskii
@ 2025-09-26 17:14   ` Nuno Das Neves
  2025-09-26 21:58     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 11+ messages in thread
From: Nuno Das Neves @ 2025-09-26 17:14 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 9/24/2025 2:31 PM, Stanislav Kinsburskii wrote:
> A cleanup and precursor patch.
> 
This line doesn't add much, I think you can remove it.

> Rename "mshv_partition_mem_region_map" to "mshv_handle_pinned_region",
> "mshv_region_populate" to "mshv_pin_region" and
> "mshv_region_populate_pages" to "mshv_region_pin_pages"
> to better reflect its purpose of handling pinned memory regions.
> 
> Update the "mshv_handle_pinned_region" function's documentation to provide
> detailed information about its behavior and return values.
> 
> Also drop the check for the range being pinned, since the function is
> static and all memory regions are pinned at this point anyway.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |   41 ++++++++++++++++++++++-------------------
>  1 file changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index a1c8c3bc79bf1..5ed6bce334417 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1137,8 +1137,8 @@ mshv_region_evict(struct mshv_mem_region *region)
>  }
>  
>  static int
> -mshv_region_populate_pages(struct mshv_mem_region *region,
> -			   u64 page_offset, u64 page_count)
> +mshv_region_pin_pages(struct mshv_mem_region *region,
> +		      u64 page_offset, u64 page_count)
>  {
>  	u64 done_count, nr_pages;
>  	struct page **pages;
> @@ -1164,14 +1164,9 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
>  		 * with the FOLL_LONGTERM flag does a large temporary
>  		 * allocation of contiguous memory.
>  		 */
> -		if (region->flags.range_pinned)
> -			ret = pin_user_pages_fast(userspace_addr,
> -						  nr_pages,
> -						  FOLL_WRITE | FOLL_LONGTERM,
> -						  pages);
> -		else
> -			ret = -EOPNOTSUPP;
> -
> +		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> +					  FOLL_WRITE | FOLL_LONGTERM,
> +					  pages);
>  		if (ret < 0)
>  			goto release_pages;
>  	}
> @@ -1187,9 +1182,9 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
>  }
>  
>  static int
> -mshv_region_populate(struct mshv_mem_region *region)
> +mshv_region_pin(struct mshv_mem_region *region)
>  {
> -	return mshv_region_populate_pages(region, 0, region->nr_pages);
> +	return mshv_region_pin_pages(region, 0, region->nr_pages);
>  }
Do we ever partially pin a region? Maybe we don't need a function called
mshv_region_pin_pages() and we just have mshv_region_pin() instead.
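
Something along these lines, perhaps (just a rough sketch; the batching
the current helper does to avoid a large temporary FOLL_LONGTERM
allocation is elided here):

static int mshv_region_pin(struct mshv_mem_region *region)
{
	long ret;

	/* Pin the whole range at once into region->pages. */
	ret = pin_user_pages_fast(region->start_uaddr, region->nr_pages,
				  FOLL_WRITE | FOLL_LONGTERM,
				  region->pages);
	if (ret < 0)
		return ret;

	/* Partially pinned: release what was pinned and bail out. */
	if (ret != region->nr_pages) {
		unpin_user_pages(region->pages, ret);
		return -EFAULT;
	}

	return 0;
}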

>  
>  static struct mshv_mem_region *
> @@ -1264,17 +1259,25 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  	return 0;
>  }
>  
> -/*
> - * Map guest ram. if snp, make sure to release that from the host first
> - * Side Effects: In case of failure, pages are unpinned when feasible.
> +/**
> + * mshv_handle_pinned_region - Handle pinned memory regions
> + * @region: Pointer to the memory region structure
> + *
> + * This function processes memory regions that are explicitly marked as pinned.
> + * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
> + * population. The function ensures the region is properly populated, handles
> + * encryption requirements for SNP partitions if applicable, maps the region,
> + * and performs necessary sharing or eviction operations based on the mapping
> + * result.
> + *
> + * Return: 0 on success, negative error code on failure.
>   */
> -static int
> -mshv_partition_mem_region_map(struct mshv_mem_region *region)
> +static int mshv_handle_pinned_region(struct mshv_mem_region *region)

Why the verb "handle"? It doesn't provide any information on what the function does,
when it might be called etc. Maybe mshv_init_pinned_region() ?

>  {
>  	struct mshv_partition *partition = region->partition;
>  	int ret;
>  
> -	ret = mshv_region_populate(region);
> +	ret = mshv_region_pin(region);
>  	if (ret) {
>  		pt_err(partition, "Failed to populate memory region: %d\n",
>  		       ret);
> @@ -1368,7 +1371,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
>  					     mmio_pfn, HVPFN_DOWN(mem.size));
>  	else
> -		ret = mshv_partition_mem_region_map(region);
> +		ret = mshv_handle_pinned_region(region);
>  
>  	if (ret)
>  		goto errout;
> 
> 



* Re: [PATCH 2/3] Drivers: hv: Centralize guest memory region destruction in helper
  2025-09-24 21:31 ` [PATCH 2/3] Drivers: hv: Centralize guest memory region destruction in helper Stanislav Kinsburskii
@ 2025-09-26 18:15   ` Nuno Das Neves
  2025-09-26 22:10     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 11+ messages in thread
From: Nuno Das Neves @ 2025-09-26 18:15 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 9/24/2025 2:31 PM, Stanislav Kinsburskii wrote:
> This is a precursor and cleanup patch.
> 
This line can be removed, IMO, it doesn't add much.

> - Introduce mshv_partition_destroy_region() to encapsulate memory region
>   cleanup, including:

I think these points should just be short paragraphs instead.

>   - Removing the region from the partition's list
>   - Regaining access for encrypted partitions
>   - Unmapping only mapped pages for efficiency

The optimization here seems like new functionality that should be in a
separate patch. Also, can there be unmapped pages in a region before
patch #3? If not, it should be introduced with/after that patch.

>   - Evicting and freeing the region
> 
> - Update mshv_unmap_user_memory() to call mshv_partition_destroy_region()
>   instead of duplicating cleanup logic.
> 
> - Update destroy_partition() to use mshv_partition_destroy_region() for
>   all regions, removing the previous inlined cleanup loop.
> 
> These changes eliminate code duplication, ensure consistent cleanup, and
> improve maintainability for both unmap and partition destruction paths.

This sentence should maybe go first in the description, because it summarizes
the reasoning for the changes nicely.

Also, make sure to describe your changes in imperative mood, e.g. Instead of
"These changes eliminate code duplication..." just "Eliminate code duplication..."

https://docs.kernel.org/process/submitting-patches.html#describe-your-changes

> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |   83 ++++++++++++++++++++++++++-----------------
>  1 file changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 5ed6bce334417..c0f6023e459c2 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1386,13 +1386,59 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  	return ret;
>  }
>  
> +static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> +{
> +	struct mshv_partition *partition = region->partition;
> +	u64 page_offset, page_count;
> +	u32 unmap_flags = 0;
> +	int ret;
> +
> +	hlist_del(&region->hnode);
> +
> +	if (mshv_partition_encrypted(partition)) {
> +		ret = mshv_partition_region_share(region);
> +		if (ret) {
> +			pt_err(partition,
> +			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
> +			       ret);
> +			return;
> +		}
> +	}
> +
> +	if (region->flags.large_pages)
> +		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> +
> +	/*
> +	 * Unmap only the mapped pages to optimize performance,
> +	 * especially for large memory regions.
> +	 */
> +	for (page_offset = 0; page_offset < region->nr_pages; page_offset += page_count) {
> +		page_count = 1;
> +		if (!region->pages[page_offset])
> +			continue;
I mentioned it above, but can this even happen in the current code (i.e. without
movable pages)?

Also, has the impact of this change been measured? I understand the logic behind
the change - there could be large unmapped sequences within the region so we might
be able to skip a lot of reps of the unmap hypercall, but the region could also be
very fragmented and this method might cause *more* reps in that case, right?

Either way, this change belongs in a separate patch.
> +
> +		for (; page_count < region->nr_pages - page_offset; page_count++) {
> +			if (!region->pages[page_offset + page_count])
> +				break;
> +		}
> +
> +		/* ignore unmap failures and continue as process may be exiting */
> +		hv_call_unmap_gpa_pages(partition->pt_id,
> +					region->start_gfn + page_offset,
> +					page_count, unmap_flags);
> +	}
> +
> +	mshv_region_evict(region);
> +
> +	vfree(region);
> +}
> +
>  /* Called for unmapping both the guest ram and the mmio space */
>  static long
>  mshv_unmap_user_memory(struct mshv_partition *partition,
>  		       struct mshv_user_mem_region mem)
>  {
>  	struct mshv_mem_region *region;
> -	u32 unmap_flags = 0;
>  
>  	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
>  		return -EINVAL;
> @@ -1407,18 +1453,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
>  	    region->nr_pages != HVPFN_DOWN(mem.size))
>  		return -EINVAL;
>  
> -	hlist_del(&region->hnode);
> -
> -	if (region->flags.large_pages)
> -		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> -
> -	/* ignore unmap failures and continue as process may be exiting */
> -	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> -				region->nr_pages, unmap_flags);
> -
> -	mshv_region_evict(region);
> -
> -	vfree(region);
> +	mshv_partition_destroy_region(region);
>  	return 0;
>  }
>  
> @@ -1754,8 +1789,8 @@ static void destroy_partition(struct mshv_partition *partition)
>  {
>  	struct mshv_vp *vp;
>  	struct mshv_mem_region *region;
> -	int i, ret;
>  	struct hlist_node *n;
> +	int i;
>  
>  	if (refcount_read(&partition->pt_ref_count)) {
>  		pt_err(partition,
> @@ -1815,25 +1850,9 @@ static void destroy_partition(struct mshv_partition *partition)
>  
>  	remove_partition(partition);
>  
> -	/* Remove regions, regain access to the memory and unpin the pages */
>  	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
> -				  hnode) {
> -		hlist_del(&region->hnode);
> -
> -		if (mshv_partition_encrypted(partition)) {
> -			ret = mshv_partition_region_share(region);
> -			if (ret) {
> -				pt_err(partition,
> -				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
> -				      ret);
> -				return;
> -			}
> -		}
> -
> -		mshv_region_evict(region);
> -
> -		vfree(region);
> -	}
> +				  hnode)
> +		mshv_partition_destroy_region(region);
>  
>  	/* Withdraw and free all pages we deposited */
>  	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> 
> 



* Re: [PATCH 1/3] Drivers: hv: Rename a few memory region related functions for clarity
  2025-09-26 17:14   ` Nuno Das Neves
@ 2025-09-26 21:58     ` Stanislav Kinsburskii
  0 siblings, 0 replies; 11+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-26 21:58 UTC (permalink / raw)
  To: Nuno Das Neves; +Cc: kys, haiyangz, wei.liu, decui, linux-hyperv, linux-kernel

On Fri, Sep 26, 2025 at 10:14:25AM -0700, Nuno Das Neves wrote:
> On 9/24/2025 2:31 PM, Stanislav Kinsburskii wrote:
> > A cleanup and precursor patch.
> > 
> This line doesn't add much, I think you can remove it.
> 

It actually conveys something important: it explains why the change is
being made and that the changes to follow will make more sense in light
of this one.

> >  
> >  static int
> > -mshv_region_populate(struct mshv_mem_region *region)
> > +mshv_region_pin(struct mshv_mem_region *region)
> >  {
> > -	return mshv_region_populate_pages(region, 0, region->nr_pages);
> > +	return mshv_region_pin_pages(region, 0, region->nr_pages);
> >  }
> Do we ever partially pin a region? Maybe we don't need a function called
> mshv_region_pin_pages() and we just have mshv_region_pin() instead.
> 

We don't, and we likely won't until we support virtio-iommu.
I can remove mshv_region_pin_pages().

> >  
> >  static struct mshv_mem_region *
> > @@ -1264,17 +1259,25 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * Map guest ram. if snp, make sure to release that from the host first
> > - * Side Effects: In case of failure, pages are unpinned when feasible.
> > +/**
> > + * mshv_handle_pinned_region - Handle pinned memory regions
> > + * @region: Pointer to the memory region structure
> > + *
> > + * This function processes memory regions that are explicitly marked as pinned.
> > + * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
> > + * population. The function ensures the region is properly populated, handles
> > + * encryption requirements for SNP partitions if applicable, maps the region,
> > + * and performs necessary sharing or eviction operations based on the mapping
> > + * result.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> >   */
> > -static int
> > -mshv_partition_mem_region_map(struct mshv_mem_region *region)
> > +static int mshv_handle_pinned_region(struct mshv_mem_region *region)
> 
> Why the verb "handle"? It doesn't provide any information on what the function does,
> when it might be called etc. Maybe mshv_init_pinned_region() ?
> 

I see what you mean. Indeed, "handle" isn't goot, but "init" is quite
overloaded either. I think "mshv_prepare_pinned_region" suit better
here.
Is it okay with you?

Thanks,
Stanislav



* Re: [PATCH 2/3] Drivers: hv: Centralize guest memory region destruction in helper
  2025-09-26 18:15   ` Nuno Das Neves
@ 2025-09-26 22:10     ` Stanislav Kinsburskii
  0 siblings, 0 replies; 11+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-26 22:10 UTC (permalink / raw)
  To: Nuno Das Neves; +Cc: kys, haiyangz, wei.liu, decui, linux-hyperv, linux-kernel

On Fri, Sep 26, 2025 at 11:15:54AM -0700, Nuno Das Neves wrote:
> On 9/24/2025 2:31 PM, Stanislav Kinsburskii wrote:

<snip>

> > +	/*
> > +	 * Unmap only the mapped pages to optimize performance,
> > +	 * especially for large memory regions.
> > +	 */
> > +	for (page_offset = 0; page_offset < region->nr_pages; page_offset += page_count) {
> > +		page_count = 1;
> > +		if (!region->pages[page_offset])
> > +			continue;
> I mentioned it above, but can this even happen in the current code (i.e. without
> moveable pages)?
> 

No.

> Also, has the impact of this change been measured? I understand the logic behind
> the change - there could be large unmapped sequences within the region so we might
> be able to skip a lot of reps of the unmap hypercall, but the region could also be
> very fragmented and this method might cause *more* reps in that case, right?
> 

I see your point. Indeed, we should cap each unmap at the maximum number
of pages the hypercall allows.
I'll make this change, thanks.
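
Roughly, the loop would become the following (HV_UNMAP_GPA_BATCH_SIZE is
just a placeholder here for the real per-call limit):

	for (page_offset = 0; page_offset < region->nr_pages; page_offset += page_count) {
		page_count = 1;
		if (!region->pages[page_offset])
			continue;

		/* Grow the run of mapped pages, but never past the limit. */
		while (page_count < region->nr_pages - page_offset &&
		       page_count < HV_UNMAP_GPA_BATCH_SIZE &&
		       region->pages[page_offset + page_count])
			page_count++;

		/* ignore unmap failures and continue as process may be exiting */
		hv_call_unmap_gpa_pages(partition->pt_id,
					region->start_gfn + page_offset,
					page_count, unmap_flags);
	}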

> Either way, this change belongs in a separate patch.

Fair enough.

Thanks,
Stanislav

> > +
> > +		for (; page_count < region->nr_pages - page_offset; page_count++) {
> > +			if (!region->pages[page_offset + page_count])
> > +				break;
> > +		}
> > +
> > +		/* ignore unmap failures and continue as process may be exiting */
> > +		hv_call_unmap_gpa_pages(partition->pt_id,
> > +					region->start_gfn + page_offset,
> > +					page_count, unmap_flags);
> > +	}
> > +
> > +	mshv_region_evict(region);
> > +
> > +	vfree(region);
> > +}
> > +
> >  /* Called for unmapping both the guest ram and the mmio space */
> >  static long
> >  mshv_unmap_user_memory(struct mshv_partition *partition,
> >  		       struct mshv_user_mem_region mem)
> >  {
> >  	struct mshv_mem_region *region;
> > -	u32 unmap_flags = 0;
> >  
> >  	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
> >  		return -EINVAL;
> > @@ -1407,18 +1453,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
> >  	    region->nr_pages != HVPFN_DOWN(mem.size))
> >  		return -EINVAL;
> >  
> > -	hlist_del(&region->hnode);
> > -
> > -	if (region->flags.large_pages)
> > -		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > -
> > -	/* ignore unmap failures and continue as process may be exiting */
> > -	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> > -				region->nr_pages, unmap_flags);
> > -
> > -	mshv_region_evict(region);
> > -
> > -	vfree(region);
> > +	mshv_partition_destroy_region(region);
> >  	return 0;
> >  }
> >  
> > @@ -1754,8 +1789,8 @@ static void destroy_partition(struct mshv_partition *partition)
> >  {
> >  	struct mshv_vp *vp;
> >  	struct mshv_mem_region *region;
> > -	int i, ret;
> >  	struct hlist_node *n;
> > +	int i;
> >  
> >  	if (refcount_read(&partition->pt_ref_count)) {
> >  		pt_err(partition,
> > @@ -1815,25 +1850,9 @@ static void destroy_partition(struct mshv_partition *partition)
> >  
> >  	remove_partition(partition);
> >  
> > -	/* Remove regions, regain access to the memory and unpin the pages */
> >  	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
> > -				  hnode) {
> > -		hlist_del(&region->hnode);
> > -
> > -		if (mshv_partition_encrypted(partition)) {
> > -			ret = mshv_partition_region_share(region);
> > -			if (ret) {
> > -				pt_err(partition,
> > -				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
> > -				      ret);
> > -				return;
> > -			}
> > -		}
> > -
> > -		mshv_region_evict(region);
> > -
> > -		vfree(region);
> > -	}
> > +				  hnode)
> > +		mshv_partition_destroy_region(region);
> >  
> >  	/* Withdraw and free all pages we deposited */
> >  	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> > 
> > 
> 


* Re: [PATCH 0/3] Introduce movable pages for Hyper-V guests
  2025-09-24 21:30 [PATCH 0/3] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
                   ` (2 preceding siblings ...)
  2025-09-24 21:31 ` [PATCH 3/3] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
@ 2025-09-27  2:02 ` Mukesh R
  2025-10-01  4:18   ` Wei Liu
  3 siblings, 1 reply; 11+ messages in thread
From: Mukesh R @ 2025-09-27  2:02 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 9/24/25 14:30, Stanislav Kinsburskii wrote:
> From the start, the root-partition driver allocates, pins, and maps all
> guest memory into the hypervisor at guest creation. This is simple: Linux
> cannot move the pages, so the guest’s view in Linux and in Microsoft
> Hypervisor never diverges.
> 
> However, this approach has major drawbacks:
> - NUMA: affinity can’t be changed at runtime, so you can’t migrate guest memory closer to the CPUs running it → performance hit.
> - Memory management: unused guest memory can’t be swapped out, compacted, or merged.
> - Provisioning time: upfront allocation/pinning slows guest create/destroy.
> - Overcommit: no memory overcommit on hosts with pinned-guest memory.
> 
> This series adds movable memory pages for Hyper-V child partitions. Guest
> pages are no longer allocated upfront; they’re allocated and mapped into
> the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
> backed by a host PFN).
> When a page is moved, Linux no longer holds it and it is unmapped from the hypervisor.
> As a result, Hyper-V guests behave like regular Linux processes, enabling standard Linux memory features to apply to guests.
> 
> Exceptions (still pinned):
>   1. Encrypted guests (explicit).
>   2. Guests with passthrough devices (implicitly pinned by the VFIO framework).


As I had commented internally, I am not fully comfortable with the
approach here, especially around the use of HMM and the correctness of
locking for shared memory regions, but my knowledge is from 4.15 and may
be outdated, and I don't have time right now. So I won't object to it if
other hard-core MMU developers think there are no issues.

However, we won't be using this for minkernel, so we would like a driver
boot option to disable it at boot that we can just set in the minkernel
init path. This option can also be used to disable it if problems are
observed in the field. The minkernel design is still being worked on, so
I cannot provide many details on it yet.
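
Something as simple as a module parameter would cover it, e.g. (a rough
sketch only; the parameter name below is just illustrative):

/*
 * Illustrative sketch: allow disabling movable regions at boot, e.g.
 * with mshv_root.movable_regions=0 on the kernel command line.
 */
static bool movable_regions = true;
module_param(movable_regions, bool, 0444);
MODULE_PARM_DESC(movable_regions,
		 "Allow movable (non-pinned) guest memory regions (default: on)");

/* ... and in mshv_partition_create_region(), fall back to pinning: */
	if (!is_mmio) {
		region->flags.is_ram = true;

		if (!movable_regions ||
		    mshv_partition_encrypted(partition) ||
		    !mshv_region_movable_init(region))
			region->flags.range_pinned = true;
	}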

Thanks,
-Mukesh


> ---
> 
> Stanislav Kinsburskii (3):
>       Drivers: hv: Rename a few memory region related functions for clarity
>       Drivers: hv: Centralize guest memory region destruction in helper
>       Drivers: hv: Add support for movable memory regions
> 
> 
>  drivers/hv/Kconfig          |    1 
>  drivers/hv/mshv_root.h      |    8 +
>  drivers/hv/mshv_root_main.c |  448 +++++++++++++++++++++++++++++++++++++------
>  3 files changed, 397 insertions(+), 60 deletions(-)
> 



* Re: [PATCH 0/3] Introduce movable pages for Hyper-V guests
  2025-09-27  2:02 ` [PATCH 0/3] Introduce movable pages for Hyper-V guests Mukesh R
@ 2025-10-01  4:18   ` Wei Liu
  2025-10-01 16:39     ` Mike Rapoport
  0 siblings, 1 reply; 11+ messages in thread
From: Wei Liu @ 2025-10-01  4:18 UTC (permalink / raw)
  To: Mukesh R, Mike Rapoport
  Cc: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui,
	linux-hyperv, linux-kernel

+Mike Rapoport, our resident memory management expert.

On Fri, Sep 26, 2025 at 07:02:02PM -0700, Mukesh R wrote:
> On 9/24/25 14:30, Stanislav Kinsburskii wrote:
> > From the start, the root-partition driver allocates, pins, and maps all
> > guest memory into the hypervisor at guest creation. This is simple: Linux
> > cannot move the pages, so the guest’s view in Linux and in Microsoft
> > Hypervisor never diverges.
> > 
> > However, this approach has major drawbacks:
> > - NUMA: affinity can’t be changed at runtime, so you can’t migrate guest memory closer to the CPUs running it → performance hit.
> > - Memory management: unused guest memory can’t be swapped out, compacted, or merged.
> > - Provisioning time: upfront allocation/pinning slows guest create/destroy.
> > - Overcommit: no memory overcommit on hosts with pinned-guest memory.
> > 
> > This series adds movable memory pages for Hyper-V child partitions. Guest
> > pages are no longer allocated upfront; they’re allocated and mapped into
> > the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
> > backed by a host PFN).
> > When a page is moved, Linux no longer holds it and it is unmapped from the hypervisor.
> > As a result, Hyper-V guests behave like regular Linux processes, enabling standard Linux memory features to apply to guests.
> > 
> > Exceptions (still pinned):
> >   1. Encrypted guests (explicit).
> >   2. Guests with passthrough devices (implicitly pinned by the VFIO framework).
> 
> 
> As I had commented internally, I am not fully comfortable with the
> approach here, especially around the use of HMM and the correctness of
> locking for shared memory regions, but my knowledge is from 4.15 and may
> be outdated, and I don't have time right now. So I won't object to it if
> other hard-core MMU developers think there are no issues.
> 

Mike, I seem to remember you had a discussion with Stanislav about this?
Can you confirm that this is a reasonable approach?

Better yet, if you have time to review the code, that would be great.
Note that there is a v2 on linux-hyperv.  But I would like to close
Mukesh's question first.

Thanks,
Wei

> However, we won't be using this for minkernel, so we would like a driver
> boot option to disable it at boot that we can just set in the minkernel
> init path. This option can also be used to disable it if problems are
> observed in the field. The minkernel design is still being worked on, so
> I cannot provide many details on it yet.
> 
> Thanks,
> -Mukesh
> 
> 
> > ---
> > 
> > Stanislav Kinsburskii (3):
> >       Drivers: hv: Rename a few memory region related functions for clarity
> >       Drivers: hv: Centralize guest memory region destruction in helper
> >       Drivers: hv: Add support for movable memory regions
> > 
> > 
> >  drivers/hv/Kconfig          |    1 
> >  drivers/hv/mshv_root.h      |    8 +
> >  drivers/hv/mshv_root_main.c |  448 +++++++++++++++++++++++++++++++++++++------
> >  3 files changed, 397 insertions(+), 60 deletions(-)
> > 
> 


* Re: [PATCH 0/3] Introduce movable pages for Hyper-V guests
  2025-10-01  4:18   ` Wei Liu
@ 2025-10-01 16:39     ` Mike Rapoport
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Rapoport @ 2025-10-01 16:39 UTC (permalink / raw)
  To: Wei Liu
  Cc: Mukesh R, Stanislav Kinsburskii, kys, haiyangz, decui,
	linux-hyperv, linux-kernel

On Wed, Oct 01, 2025 at 04:18:30AM +0000, Wei Liu wrote:
> +Mike Rapoport, our resident memory management expert.
> 
> On Fri, Sep 26, 2025 at 07:02:02PM -0700, Mukesh R wrote:
> > On 9/24/25 14:30, Stanislav Kinsburskii wrote:
> > > From the start, the root-partition driver allocates, pins, and maps all
> > > guest memory into the hypervisor at guest creation. This is simple: Linux
> > > cannot move the pages, so the guest’s view in Linux and in Microsoft
> > > Hypervisor never diverges.
> > > 
> > > However, this approach has major drawbacks:
> > > - NUMA: affinity can’t be changed at runtime, so you can’t migrate guest memory closer to the CPUs running it → performance hit.
> > > - Memory management: unused guest memory can’t be swapped out, compacted, or merged.
> > > - Provisioning time: upfront allocation/pinning slows guest create/destroy.
> > > - Overcommit: no memory overcommit on hosts with pinned-guest memory.
> > > 
> > > This series adds movable memory pages for Hyper-V child partitions. Guest
> > > pages are no longer allocated upfront; they’re allocated and mapped into
> > > the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
> > > backed by a host PFN).
> > > When a page is moved, Linux no longer holds it and it is unmapped from the hypervisor.
> > > As a result, Hyper-V guests behave like regular Linux processes, enabling standard Linux memory features to apply to guests.
> > > 
> > > Exceptions (still pinned):
> > >   1. Encrypted guests (explicit).
> > >   2. Guests with passthrough devices (implicitly pinned by the VFIO framework).
> > 
> > 
> > As I had commented internally, I am not fully comfortable with the
> > approach here, especially around the use of HMM and the correctness of
> > locking for shared memory regions, but my knowledge is from 4.15 and may
> > be outdated, and I don't have time right now. So I won't object to it if
> > other hard-core MMU developers think there are no issues.
> > 
> 
> Mike, I seem to remember you had a discussion with Stanislav about this?
> Can you confirm that this is a reasonable approach?
>
> Better yet, if you have time to review the code, that would be great.
> Note that there is a v2 on linux-hyperv.  But I would like to close
> Mukesh's question first.

I only had time to skim through the patches, and yes, this is a
reasonable approach. I also confirmed privately with the HMM maintainer
a while ago that the use of HMM and MMU notifiers is correct.

I don't know enough about mshv to see if there are corner cases that
these patches don't cover, but conceptually they make the memory model
follow KVM best practices.
 
> > However, we won't be using this for minkernel, so we would like a driver
> > boot option to disable it at boot that we can just set in the minkernel
> > init path. This option can also be used to disable it if problems are
> > observed in the field. The minkernel design is still being worked on, so
> > I cannot provide many details on it yet.

The usual way we do things in the kernel is to add functionality when it
has users, so a boot option can be added later, when the minkernel design
is more mature and ready for upstream.

-- 
Sincerely yours,
Mike.

