public inbox for linux-hyperv@vger.kernel.org
* [PATCH v5 0/5] Introduce movable pages for Hyper-V guests
@ 2025-10-16  0:26 Stanislav Kinsburskii
  2025-10-16  0:26 ` [PATCH v5 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-16  0:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Since its introduction, the root-partition driver has allocated, pinned,
and mapped all guest memory into the hypervisor at guest creation. This is
simple: Linux cannot move the pages, so the guest’s view of memory in Linux
and in the Microsoft Hypervisor never diverges.

However, this approach has major drawbacks:
 - NUMA: affinity can’t be changed at runtime, so guest memory can’t be
   migrated closer to the CPUs running it, which hurts performance.
 - Memory management: unused guest memory can’t be swapped out, compacted,
   or merged.
 - Provisioning time: upfront allocation and pinning slow down guest
   creation and destruction.
 - Overcommit: hosts can’t overcommit memory while guest memory is pinned.

This series adds movable memory pages for Hyper-V child partitions. Guest
pages are no longer allocated upfront; they’re allocated and mapped into
the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
backed by a host PFN).
When Linux moves a page, the driver no longer holds it and the page is
unmapped from the hypervisor; the next guest access faults it back in.
As a result, Hyper-V guests behave like regular Linux processes, so
standard Linux memory-management features apply to them.
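
The lifecycle described above can be pictured with a small user-space
model. This is only a sketch under simplifying assumptions: the toy_* names
and the fixed region size are made up for illustration, a zero entry stands
for "GFN not backed by a host PFN", and no locking or hypercalls are
modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_PAGES 8	/* hypothetical region size, in pages */

/* One slot per guest page; 0 means "not backed by a host PFN". */
struct toy_region {
	uint64_t pfn[TOY_NR_PAGES];
};

/* Guest touched page idx: back it with host_pfn (on-demand mapping). */
static void toy_region_fault_in(struct toy_region *r, size_t idx,
				uint64_t host_pfn)
{
	if (idx < TOY_NR_PAGES)
		r->pfn[idx] = host_pfn;
}

/* Linux moved or reclaimed pages [off, off + count): forget them. */
static void toy_region_invalidate(struct toy_region *r, size_t off,
				  size_t count)
{
	size_t i;

	for (i = off; i < off + count && i < TOY_NR_PAGES; i++)
		r->pfn[i] = 0;
}

/* Number of pages currently backed. */
static size_t toy_region_backed(const struct toy_region *r)
{
	size_t i, n = 0;

	for (i = 0; i < TOY_NR_PAGES; i++)
		n += r->pfn[i] != 0;
	return n;
}

/* Fault in pages 1, 2 and 5, invalidate [off, off + count), count the rest. */
static size_t toy_demo_backed_after(size_t off, size_t count)
{
	struct toy_region r = { { 0 } };

	toy_region_fault_in(&r, 1, 100);
	toy_region_fault_in(&r, 2, 101);
	toy_region_fault_in(&r, 5, 102);
	toy_region_invalidate(&r, off, count);
	return toy_region_backed(&r);
}
```

In this model a freshly created region backs nothing, and an invalidated
page simply returns to the "not backed" state until the guest touches it
again.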

Exceptions (still pinned):
 1. Encrypted guests (explicit).
 2. Guests with passthrough devices (implicitly pinned by the VFIO framework).

v5:
 - Fix a bug in MMU notifier handling where an uninitialized 'ret' variable
   could cause the warning about failed page invalidation to be skipped.
 - Improve comment grammar regarding skipping the unmapping of non-mapped pages.

v4:
 - Fix a bug where batch unmapping could skip mapped pages when selecting a
   new batch due to a wrong offset calculation.
 - Fix an error message in case of failed memory region pinning.

v3:
 - Region is invalidated even if the mm has no users.
 - Page remapping logic is updated to support 2M-unaligned remappings for
   regions that are PMD-aligned, which can occur during both faults and
   invalidations.

v2:
 - Split unmap batching into a separate patch.
 - Fixed commit messages from v1 review.
 - Renamed a few functions for clarity.

---

Stanislav Kinsburskii (5):
      Drivers: hv: Refactor and rename memory region handling functions
      Drivers: hv: Centralize guest memory region destruction
      Drivers: hv: Batch GPA unmap operations to improve large region performance
      Drivers: hv: Ensure large page GPA mapping is PMD-aligned
      Drivers: hv: Add support for movable memory regions


 drivers/hv/Kconfig             |    1 
 drivers/hv/mshv_root.h         |   10 +
 drivers/hv/mshv_root_hv_call.c |    2 
 drivers/hv/mshv_root_main.c    |  495 +++++++++++++++++++++++++++++++++-------
 4 files changed, 424 insertions(+), 84 deletions(-)



* [PATCH v5 1/5] Drivers: hv: Refactor and rename memory region handling functions
  2025-10-16  0:26 [PATCH v5 0/5] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
@ 2025-10-16  0:26 ` Stanislav Kinsburskii
  2025-10-16  0:27 ` [PATCH v5 2/5] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-16  0:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

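Simplify and unify memory region management to improve code clarity and
reliability:

 - Rename mshv_region_evict() and mshv_region_evict_pages() to
   mshv_region_invalidate() and mshv_region_invalidate_pages().
 - Replace mshv_region_populate() and mshv_region_populate_pages() with
   mshv_region_pin(), which always pins the whole region, and drop the
   range check and the -EOPNOTSUPP branch it no longer needs.
 - Rename mshv_partition_mem_region_map() to mshv_prepare_pinned_region()
   and document it with a kernel-doc comment.

Update call sites accordingly.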

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   80 +++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index fa42c40e1e02..e923947d3c54 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1120,8 +1120,8 @@ mshv_region_map(struct mshv_mem_region *region)
 }
 
 static void
-mshv_region_evict_pages(struct mshv_mem_region *region,
-			u64 page_offset, u64 page_count)
+mshv_region_invalidate_pages(struct mshv_mem_region *region,
+			     u64 page_offset, u64 page_count)
 {
 	if (region->flags.range_pinned)
 		unpin_user_pages(region->pages + page_offset, page_count);
@@ -1131,29 +1131,24 @@ mshv_region_evict_pages(struct mshv_mem_region *region,
 }
 
 static void
-mshv_region_evict(struct mshv_mem_region *region)
+mshv_region_invalidate(struct mshv_mem_region *region)
 {
-	mshv_region_evict_pages(region, 0, region->nr_pages);
+	mshv_region_invalidate_pages(region, 0, region->nr_pages);
 }
 
 static int
-mshv_region_populate_pages(struct mshv_mem_region *region,
-			   u64 page_offset, u64 page_count)
+mshv_region_pin(struct mshv_mem_region *region)
 {
 	u64 done_count, nr_pages;
 	struct page **pages;
 	__u64 userspace_addr;
 	int ret;
 
-	if (page_offset + page_count > region->nr_pages)
-		return -EINVAL;
-
-	for (done_count = 0; done_count < page_count; done_count += ret) {
-		pages = region->pages + page_offset + done_count;
+	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
+		pages = region->pages + done_count;
 		userspace_addr = region->start_uaddr +
-				(page_offset + done_count) *
-				HV_HYP_PAGE_SIZE;
-		nr_pages = min(page_count - done_count,
+				 done_count * HV_HYP_PAGE_SIZE;
+		nr_pages = min(region->nr_pages - done_count,
 			       MSHV_PIN_PAGES_BATCH_SIZE);
 
 		/*
@@ -1164,34 +1159,23 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		if (region->flags.range_pinned)
-			ret = pin_user_pages_fast(userspace_addr,
-						  nr_pages,
-						  FOLL_WRITE | FOLL_LONGTERM,
-						  pages);
-		else
-			ret = -EOPNOTSUPP;
-
+		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+					  FOLL_WRITE | FOLL_LONGTERM,
+					  pages);
 		if (ret < 0)
 			goto release_pages;
 	}
 
-	if (PageHuge(region->pages[page_offset]))
+	if (PageHuge(region->pages[0]))
 		region->flags.large_pages = true;
 
 	return 0;
 
 release_pages:
-	mshv_region_evict_pages(region, page_offset, done_count);
+	mshv_region_invalidate_pages(region, 0, done_count);
 	return ret;
 }
 
-static int
-mshv_region_populate(struct mshv_mem_region *region)
-{
-	return mshv_region_populate_pages(region, 0, region->nr_pages);
-}
-
 static struct mshv_mem_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
@@ -1264,19 +1248,27 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/*
- * Map guest ram. if snp, make sure to release that from the host first
- * Side Effects: In case of failure, pages are unpinned when feasible.
+/**
+ * mshv_prepare_pinned_region - Pin and map memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-static int
-mshv_partition_mem_region_map(struct mshv_mem_region *region)
+static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
-	ret = mshv_region_populate(region);
+	ret = mshv_region_pin(region);
 	if (ret) {
-		pt_err(partition, "Failed to populate memory region: %d\n",
+		pt_err(partition, "Failed to pin memory region: %d\n",
 		       ret);
 		goto err_out;
 	}
@@ -1294,7 +1286,7 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 			pt_err(partition,
 			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
 			       region->start_gfn, ret);
-			goto evict_region;
+			goto invalidate_region;
 		}
 	}
 
@@ -1304,7 +1296,7 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 
 		shrc = mshv_partition_region_share(region);
 		if (!shrc)
-			goto evict_region;
+			goto invalidate_region;
 
 		pt_err(partition,
 		       "Failed to share memory region (guest_pfn: %llu): %d\n",
@@ -1318,8 +1310,8 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 
 	return 0;
 
-evict_region:
-	mshv_region_evict(region);
+invalidate_region:
+	mshv_region_invalidate(region);
 err_out:
 	return ret;
 }
@@ -1368,7 +1360,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
 	else
-		ret = mshv_partition_mem_region_map(region);
+		ret = mshv_prepare_pinned_region(region);
 
 	if (ret)
 		goto errout;
@@ -1413,7 +1405,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
 				region->nr_pages, unmap_flags);
 
-	mshv_region_evict(region);
+	mshv_region_invalidate(region);
 
 	vfree(region);
 	return 0;
@@ -1827,7 +1819,7 @@ static void destroy_partition(struct mshv_partition *partition)
 			}
 		}
 
-		mshv_region_evict(region);
+		mshv_region_invalidate(region);
 
 		vfree(region);
 	}




* [PATCH v5 2/5] Drivers: hv: Centralize guest memory region destruction
  2025-10-16  0:26 [PATCH v5 0/5] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
  2025-10-16  0:26 ` [PATCH v5 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
@ 2025-10-16  0:27 ` Stanislav Kinsburskii
  2025-10-16  0:27 ` [PATCH v5 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-16  0:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Centralize guest memory region destruction to prevent resource leaks and
inconsistent cleanup across unmap and partition destruction paths.

Unify region removal, encrypted partition access recovery, and region
invalidation in a single helper, mshv_partition_destroy_region(). This
reduces code duplication and makes future updates less error-prone.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   65 ++++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 31 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e923947d3c54..97e322f3c6b5 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1375,13 +1375,42 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	return ret;
 }
 
+static void mshv_partition_destroy_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	u32 unmap_flags = 0;
+	int ret;
+
+	hlist_del(&region->hnode);
+
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_partition_region_share(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
+			       ret);
+			return;
+		}
+	}
+
+	if (region->flags.large_pages)
+		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	/* ignore unmap failures and continue as process may be exiting */
+	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
+				region->nr_pages, unmap_flags);
+
+	mshv_region_invalidate(region);
+
+	vfree(region);
+}
+
 /* Called for unmapping both the guest ram and the mmio space */
 static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region mem)
 {
 	struct mshv_mem_region *region;
-	u32 unmap_flags = 0;
 
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1396,18 +1425,8 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	    region->nr_pages != HVPFN_DOWN(mem.size))
 		return -EINVAL;
 
-	hlist_del(&region->hnode);
+	mshv_partition_destroy_region(region);
 
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
-
-	mshv_region_invalidate(region);
-
-	vfree(region);
 	return 0;
 }
 
@@ -1743,8 +1762,8 @@ static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
 	struct mshv_mem_region *region;
-	int i, ret;
 	struct hlist_node *n;
+	int i;
 
 	if (refcount_read(&partition->pt_ref_count)) {
 		pt_err(partition,
@@ -1804,25 +1823,9 @@ static void destroy_partition(struct mshv_partition *partition)
 
 	remove_partition(partition);
 
-	/* Remove regions, regain access to the memory and unpin the pages */
 	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
-				  hnode) {
-		hlist_del(&region->hnode);
-
-		if (mshv_partition_encrypted(partition)) {
-			ret = mshv_partition_region_share(region);
-			if (ret) {
-				pt_err(partition,
-				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
-				      ret);
-				return;
-			}
-		}
-
-		mshv_region_invalidate(region);
-
-		vfree(region);
-	}
+				  hnode)
+		mshv_partition_destroy_region(region);
 
 	/* Withdraw and free all pages we deposited */
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);




* [PATCH v5 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance
  2025-10-16  0:26 [PATCH v5 0/5] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
  2025-10-16  0:26 ` [PATCH v5 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
  2025-10-16  0:27 ` [PATCH v5 2/5] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
@ 2025-10-16  0:27 ` Stanislav Kinsburskii
  2025-10-16  0:27 ` [PATCH v5 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned Stanislav Kinsburskii
  2025-10-16  0:27 ` [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  4 siblings, 0 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-16  0:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

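Reduce overhead when unmapping large memory regions by issuing GPA unmap
operations in batches of at most MSHV_MAX_UNMAP_GPA_PAGES (512) pages,
each ending on a 2MB-aligned boundary. Skip batches in which no page is
mapped, and log unmap failures instead of silently ignoring them.

Use a dedicated constant for the batch size to improve code clarity and
maintainability.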
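
The batch selection can be sketched in user space as follows. This is an
illustration, not the driver code: the toy_* names are hypothetical, and
TOY_MAX_UNMAP_PAGES merely stands in for MSHV_MAX_UNMAP_GPA_PAGES.

```c
#include <stdint.h>

/* Hypothetical stand-in for MSHV_MAX_UNMAP_GPA_PAGES. */
#define TOY_MAX_UNMAP_PAGES 512ULL

/*
 * Size of the batch starting at gfn: run up to the next 512-page
 * boundary (a full 512 pages if gfn is already aligned), but never
 * past end_gfn.
 */
static uint64_t toy_next_batch(uint64_t gfn, uint64_t end_gfn)
{
	uint64_t count = TOY_MAX_UNMAP_PAGES - (gfn % TOY_MAX_UNMAP_PAGES);

	if (gfn + count > end_gfn)
		count = end_gfn - gfn;
	return count;
}

/* Number of batches needed to cover [start_gfn, end_gfn). */
static unsigned int toy_count_batches(uint64_t start_gfn, uint64_t end_gfn)
{
	unsigned int batches = 0;
	uint64_t gfn;

	for (gfn = start_gfn; gfn < end_gfn;
	     gfn += toy_next_batch(gfn, end_gfn))
		batches++;
	return batches;
}
```

For a region covering GFNs 100 to 1199, this yields a short leading batch
of 412 pages, one full batch of 512 pages, and a trailing batch of 176
pages, so only the first and last batches are partial.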

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root.h         |    2 ++
 drivers/hv/mshv_root_hv_call.c |    2 +-
 drivers/hv/mshv_root_main.c    |   28 +++++++++++++++++++++++++---
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e3931b0f1269..97e64d5341b6 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -32,6 +32,8 @@ static_assert(HV_HYP_PAGE_SIZE == MSHV_HV_PAGE_SIZE);
 
 #define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
 
+#define MSHV_MAX_UNMAP_GPA_PAGES	512
+
 struct mshv_vp {
 	u32 vp_index;
 	struct mshv_partition *vp_partition;
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index c9c274f29c3c..0696024ccfe3 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -17,7 +17,7 @@
 /* Determined empirically */
 #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
 #define HV_MAP_GPA_DEPOSIT_PAGES	256
-#define HV_UMAP_GPA_PAGES		512
+#define HV_UMAP_GPA_PAGES		MSHV_MAX_UNMAP_GPA_PAGES
 
 #define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 97e322f3c6b5..a3e5b41f3a7f 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1378,6 +1378,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
+	u64 gfn, gfn_count, start_gfn, end_gfn;
 	u32 unmap_flags = 0;
 	int ret;
 
@@ -1396,9 +1397,30 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 	if (region->flags.large_pages)
 		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
 
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
+	start_gfn = region->start_gfn;
+	end_gfn = region->start_gfn + region->nr_pages;
+
+	for (gfn = start_gfn; gfn < end_gfn; gfn += gfn_count) {
+		if (gfn % MSHV_MAX_UNMAP_GPA_PAGES)
+			gfn_count = ALIGN(gfn, MSHV_MAX_UNMAP_GPA_PAGES) - gfn;
+		else
+			gfn_count = MSHV_MAX_UNMAP_GPA_PAGES;
+
+		if (gfn + gfn_count > end_gfn)
+			gfn_count = end_gfn - gfn;
+
+		/* Skip all pages in this range if none are mapped */
+		if (!memchr_inv(region->pages + (gfn - start_gfn), 0,
+				gfn_count * sizeof(struct page *)))
+			continue;
+
+		ret = hv_call_unmap_gpa_pages(partition->pt_id, gfn,
+					      gfn_count, unmap_flags);
+		if (ret)
+			pt_err(partition,
+			       "Failed to unmap GPA pages %#llx-%#llx: %d\n",
+			       gfn, gfn + gfn_count - 1, ret);
+	}
 
 	mshv_region_invalidate(region);
 




* [PATCH v5 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned
  2025-10-16  0:26 [PATCH v5 0/5] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
                   ` (2 preceding siblings ...)
  2025-10-16  0:27 ` [PATCH v5 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
@ 2025-10-16  0:27 ` Stanislav Kinsburskii
  2025-10-16  0:27 ` [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  4 siblings, 0 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-16  0:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

With the upcoming introduction of movable pages, a region is no longer
guaranteed to have large pages mapped throughout. Both mapping on fault and
unmapping during PTE invalidation may cover ranges that are not 2M-aligned,
while the hypervisor requires both the GFN and the page count to be
2M-aligned when the large page flag is used.

Update mshv_region_remap_pages() to set the large page flag only when both
page_offset and page_count are PMD-aligned.
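
A user-space sketch of the resulting condition is shown below. It assumes
512 4K pages per 2M large page (the x86-64 value of PTRS_PER_PMD); the
toy_* names are hypothetical and only mirror the check in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 512 4K pages per 2M large page, as on x86-64. */
#define TOY_PAGES_PER_PMD 512ULL

/* True if v is a multiple of the large-page size, counted in pages. */
static bool toy_pmd_aligned(uint64_t v)
{
	return (v & (TOY_PAGES_PER_PMD - 1)) == 0;
}

/*
 * The large-page flag is used only when the region is backed by large
 * pages and both the offset and the count are PMD-aligned.
 */
static bool toy_use_large_page(bool region_large_pages,
			       uint64_t page_offset, uint64_t page_count)
{
	return region_large_pages &&
	       toy_pmd_aligned(page_offset) &&
	       toy_pmd_aligned(page_count);
}
```

A remap of 100 pages at offset 0, or of 512 pages at offset 256, therefore
falls back to 4K mappings even in a region flagged as using large pages.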

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index a3e5b41f3a7f..c4f114376435 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -34,6 +34,8 @@
 #include "mshv.h"
 #include "mshv_root.h"
 
+#define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
+
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
@@ -1100,7 +1102,9 @@ mshv_region_remap_pages(struct mshv_mem_region *region, u32 map_flags,
 	if (page_offset + page_count > region->nr_pages)
 		return -EINVAL;
 
-	if (region->flags.large_pages)
+	if (region->flags.large_pages &&
+	    VALUE_PMD_ALIGNED(page_offset) &&
+	    VALUE_PMD_ALIGNED(page_count))
 		map_flags |= HV_MAP_GPA_LARGE_PAGE;
 
 	/* ask the hypervisor to map guest ram */




* [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions
  2025-10-16  0:26 [PATCH v5 0/5] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
                   ` (3 preceding siblings ...)
  2025-10-16  0:27 ` [PATCH v5 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned Stanislav Kinsburskii
@ 2025-10-16  0:27 ` Stanislav Kinsburskii
  2025-10-16 14:47   ` kernel test robot
  2025-10-17 23:55   ` Mukesh R
  4 siblings, 2 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-16  0:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Introduce support for movable memory regions in the Hyper-V root partition
driver. Pages of a movable region are not pinned: they are mapped into the
hypervisor when the guest faults on them and remapped with no access when
Linux invalidates them.

Integrate mmu_interval_notifier for movable regions, implement functions to
handle HMM faults and memory invalidation, and update memory region mapping
logic to support movable regions.
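
The window that is faulted in around an intercepted GFN can be sketched in
user space as follows. The toy_* names are hypothetical, and
TOY_FAULT_IN_PAGES assumes that MSHV_MAP_FAULT_IN_PAGES (HPAGE_PMD_NR)
equals 512 with 4K pages.

```c
#include <stdint.h>

/* Assumption: MSHV_MAP_FAULT_IN_PAGES is 512 with 4K pages. */
#define TOY_FAULT_IN_PAGES 512ULL

struct toy_window {
	uint64_t page_offset;	/* first page of the window, in the region */
	uint64_t page_count;	/* number of pages to fault in */
};

/*
 * Round the faulting page down to a 512-page boundary within the
 * region and fault in up to 512 pages, clamped to the region end.
 * The caller guarantees start_gfn <= gfn < start_gfn + nr_pages.
 */
static struct toy_window toy_fault_window(uint64_t start_gfn,
					  uint64_t nr_pages, uint64_t gfn)
{
	struct toy_window w;
	uint64_t remaining;

	w.page_offset = (gfn - start_gfn) & ~(TOY_FAULT_IN_PAGES - 1);
	remaining = nr_pages - w.page_offset;
	w.page_count = remaining < TOY_FAULT_IN_PAGES ?
		       remaining : TOY_FAULT_IN_PAGES;
	return w;
}
```

For a 1300-page region, a fault on page 700 maps pages 512 to 1023, while a
fault on page 1290 maps only the 276 pages that remain after offset 1024.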

While MMU notifiers are commonly used in virtualization drivers, this
implementation uses HMM (Heterogeneous Memory Management) on top of an MMU
interval notifier. HMM provides a ready-made framework for mirroring,
invalidation, and fault handling, which avoids reimplementing these
mechanisms for a single callback. Although plain MMU notifiers are more
generic, HMM reduces boilerplate and improves maintainability because it
is designed specifically for such use cases.
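
The collect-and-retry pattern behind the fault path can be modelled in user
space as below. This is a heavily simplified sketch: the toy_* names are
hypothetical, a plain counter stands in for the interval notifier sequence,
and the region mutex taken before the retry check in the driver is omitted.

```c
#include <stdbool.h>

/* A plain counter stands in for the interval notifier sequence. */
struct toy_notifier {
	unsigned long seq;
};

static unsigned long toy_read_begin(const struct toy_notifier *n)
{
	return n->seq;
}

/* True if an invalidation ran since seq was sampled. */
static bool toy_read_retry(const struct toy_notifier *n, unsigned long seq)
{
	return n->seq != seq;
}

static void toy_notifier_invalidate(struct toy_notifier *n)
{
	n->seq++;
}

/* Stand-in for the fault step; races with an invalidation *ctx times. */
static void toy_racy_collect(struct toy_notifier *n, void *ctx)
{
	int *races = ctx;

	if (*races > 0) {
		(*races)--;
		toy_notifier_invalidate(n);
	}
}

/* Repeat the collect step until no invalidation raced with it. */
static int toy_fault_with_retry(struct toy_notifier *n,
				void (*collect)(struct toy_notifier *, void *),
				void *ctx)
{
	unsigned long seq;
	int attempts = 0;

	do {
		seq = toy_read_begin(n);
		collect(n, ctx);
		attempts++;
	} while (toy_read_retry(n, seq));

	return attempts;
}

/* Attempts needed when the first `races` collects are invalidated. */
static int toy_demo_attempts(int races)
{
	struct toy_notifier n = { 0 };

	return toy_fault_with_retry(&n, toy_racy_collect, &races);
}
```

The point of the pattern is that collected pages are used only if no
invalidation ran between sampling the sequence and checking it again;
otherwise the whole collection is repeated.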

Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Kconfig          |    1 
 drivers/hv/mshv_root.h      |    8 +
 drivers/hv/mshv_root_main.c |  328 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 327 insertions(+), 10 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 0b8c391a0342..5f1637cbb6e3 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -75,6 +75,7 @@ config MSHV_ROOT
 	depends on PAGE_SIZE_4KB
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
+	select HMM_MIRROR
 	default n
 	help
 	  Select this option to enable support for booting and running as root
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 97e64d5341b6..13367c84497c 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -15,6 +15,7 @@
 #include <linux/hashtable.h>
 #include <linux/dev_printk.h>
 #include <linux/build_bug.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
 
 /*
@@ -81,9 +82,14 @@ struct mshv_mem_region {
 	struct {
 		u64 large_pages:  1; /* 2MiB */
 		u64 range_pinned: 1;
-		u64 reserved:	 62;
+		u64 is_ram	: 1; /* mem region can be ram or mmio */
+		u64 reserved:	 61;
 	} flags;
 	struct mshv_partition *partition;
+#if defined(CONFIG_MMU_NOTIFIER)
+	struct mmu_interval_notifier mni;
+	struct mutex mutex;	/* protects region pages remapping */
+#endif
 	struct page *pages[];
 };
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index c4f114376435..b2738443ac5d 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,6 +29,7 @@
 #include <linux/crash_dump.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
+#include <linux/hmm.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
@@ -36,6 +37,8 @@
 
 #define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
 
+#define MSHV_MAP_FAULT_IN_PAGES			HPAGE_PMD_NR
+
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
@@ -76,6 +79,11 @@ static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
 static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
 static int mshv_init_async_handler(struct mshv_partition *partition);
 static void mshv_async_hvcall_handler(void *data, u64 *status);
+static struct mshv_mem_region
+	*mshv_partition_region_by_gfn(struct mshv_partition *pt, u64 gfn);
+static int mshv_region_remap_pages(struct mshv_mem_region *region,
+				   u32 map_flags, u64 page_offset,
+				   u64 page_count);
 
 static const union hv_input_vtl input_vtl_zero;
 static const union hv_input_vtl input_vtl_normal = {
@@ -602,14 +610,197 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
+#ifdef CONFIG_X86_64
+
+#if defined(CONFIG_MMU_NOTIFIER)
+/**
+ * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * @region: Pointer to the memory region structure
+ * @range: Pointer to the HMM range structure
+ *
+ * This function performs the following steps:
+ * 1. Reads the notifier sequence for the HMM range.
+ * 2. Acquires a read lock on the memory map.
+ * 3. Handles HMM faults for the specified range.
+ * 4. Releases the read lock on the memory map.
+ * 5. If successful, locks the memory region mutex.
+ * 6. Verifies if the notifier sequence has changed during the operation.
+ *    If it has, releases the mutex and returns -EBUSY to match with
+ *    hmm_range_fault() return code for repeating.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+					  struct hmm_range *range)
+{
+	int ret;
+
+	range->notifier_seq = mmu_interval_read_begin(range->notifier);
+	mmap_read_lock(region->mni.mm);
+	ret = hmm_range_fault(range);
+	mmap_read_unlock(region->mni.mm);
+	if (ret)
+		return ret;
+
+	mutex_lock(&region->mutex);
+
+	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+		mutex_unlock(&region->mutex);
+		cond_resched();
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/**
+ * mshv_region_range_fault - Handle memory range faults for a given region.
+ * @region: Pointer to the memory region structure.
+ * @page_offset: Offset of the page within the region.
+ * @page_count: Number of pages to handle.
+ *
+ * This function resolves memory faults for a specified range of pages
+ * within a memory region. It uses HMM (Heterogeneous Memory Management)
+ * to fault in the required pages and updates the region's page array.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	struct hmm_range range = {
+		.notifier = &region->mni,
+		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
+	};
+	unsigned long *pfns;
+	int ret;
+	u64 i;
+
+	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
+	if (!pfns)
+		return -ENOMEM;
+
+	range.hmm_pfns = pfns;
+	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+
+	do {
+		ret = mshv_region_hmm_fault_and_lock(region, &range);
+	} while (ret == -EBUSY);
+
+	if (ret)
+		goto out;
+
+	for (i = 0; i < page_count; i++)
+		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+
+	if (PageHuge(region->pages[page_offset]))
+		region->flags.large_pages = true;
+
+	ret = mshv_region_remap_pages(region, region->hv_map_flags,
+				      page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+out:
+	kfree(pfns);
+	return ret;
+}
+#else /* CONFIG_MMU_NOTIFIER */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+{
+	u64 page_offset, page_count;
+	int ret;
+
+	if (WARN_ON_ONCE(region->flags.range_pinned))
+		return false;
+
+	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
+	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				 MSHV_MAP_FAULT_IN_PAGES);
+
+	/* Map more pages than requested to reduce the number of faults. */
+	page_count = min(region->nr_pages - page_offset,
+			 MSHV_MAP_FAULT_IN_PAGES);
+
+	ret = mshv_region_range_fault(region, page_offset, page_count);
+
+	WARN_ONCE(ret,
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  region->partition->pt_id, region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  gfn, page_offset, page_count);
+
+	return !ret;
+}
+
+/**
+ * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
+ * @vp: Pointer to the virtual processor structure.
+ *
+ * This function processes GPA intercepts by identifying the memory region
+ * corresponding to the intercepted GPA, aligning the page offset, and
+ * mapping the required pages. It ensures that the region is valid and
+ * handles faults efficiently by mapping multiple pages at once.
+ *
+ * Return: true if the intercept was handled successfully, false otherwise.
+ */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+	struct mshv_mem_region *region;
+	struct hv_x64_memory_intercept_message *msg;
+	u64 gfn;
+
+	msg = (struct hv_x64_memory_intercept_message *)
+		vp->vp_intercept_msg_page->u.payload;
+
+	gfn = HVPFN_DOWN(msg->guest_physical_address);
+
+	region = mshv_partition_region_by_gfn(p, gfn);
+	if (!region)
+		return false;
+
+	if (WARN_ON_ONCE(!region->flags.is_ram))
+		return false;
+
+	if (WARN_ON_ONCE(region->flags.range_pinned))
+		return false;
+
+	return mshv_region_handle_gfn_fault(region, gfn);
+}
+
+#else	/* CONFIG_X86_64 */
+
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
+
+#endif	/* CONFIG_X86_64 */
+
+static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
+{
+	switch (vp->vp_intercept_msg_page->header.message_type) {
+	case HVMSG_GPA_INTERCEPT:
+		return mshv_handle_gpa_intercept(vp);
+	}
+	return false;
+}
+
 static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;
 
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		rc = mshv_run_vp_with_root_scheduler(vp);
-	else
-		rc = mshv_run_vp_with_hyp_scheduler(vp);
+	do {
+		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+			rc = mshv_run_vp_with_root_scheduler(vp);
+		else
+			rc = mshv_run_vp_with_hyp_scheduler(vp);
+	} while (rc == 0 && mshv_vp_handle_intercept(vp));
 
 	if (rc)
 		return rc;
@@ -1209,6 +1400,110 @@ mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr)
 	return NULL;
 }
 
+#if defined(CONFIG_MMU_NOTIFIER)
+static void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+	if (region->flags.range_pinned)
+		return;
+
+	mmu_interval_notifier_remove(&region->mni);
+}
+
+/**
+ * mshv_region_interval_invalidate - Invalidate a range of memory region
+ * @mni: Pointer to the mmu_interval_notifier structure
+ * @range: Pointer to the mmu_notifier_range structure
+ * @cur_seq: Current sequence number for the interval notifier
+ *
+ * This function invalidates a memory region by remapping its pages with
+ * no access permissions. It locks the region's mutex to ensure thread safety
+ * and updates the sequence number for the interval notifier. If the range
+ * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
+ * lock and returns false if unsuccessful.
+ *
+ * NOTE: Failure to invalidate a region is a serious error, as the pages will
+ * be considered freed while they are still mapped by the hypervisor.
+ * Any attempt to access such pages will likely crash the system.
+ *
+ * Return: true if the region was successfully invalidated, false otherwise.
+ */
+static bool
+mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
+				const struct mmu_notifier_range *range,
+				unsigned long cur_seq)
+{
+	struct mshv_mem_region *region = container_of(mni,
+						struct mshv_mem_region,
+						mni);
+	u64 page_offset, page_count;
+	unsigned long mstart, mend;
+	int ret = -EPERM;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&region->mutex);
+	else if (!mutex_trylock(&region->mutex))
+		goto out_fail;
+
+	mmu_interval_set_seq(mni, cur_seq);
+
+	mstart = max(range->start, region->start_uaddr);
+	mend = min(range->end, region->start_uaddr +
+		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+
+	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	page_count = HVPFN_DOWN(mend - mstart);
+
+	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+				      page_offset, page_count);
+	if (ret)
+		goto out_fail;
+
+	mshv_region_invalidate_pages(region, page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+
+	return true;
+
+out_fail:
+	WARN_ONCE(ret,
+		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
+		  region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  range->start, range->end, range->event,
+		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+	return false;
+}
+
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
+	.invalidate = mshv_region_interval_invalidate,
+};
+
+static bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	int ret;
+
+	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+					   region->start_uaddr,
+					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   &mshv_region_mni_ops);
+	if (ret)
+		return false;
+
+	mutex_init(&region->mutex);
+
+	return true;
+}
+#else
+static inline void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+}
+
+static inline bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	return false;
+}
+#endif
+
 /*
  * NB: caller checks and makes sure mem->size is page aligned
  * Returns: 0 with regionpp updated on success, or -errno
@@ -1241,9 +1536,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	/* Note: large_pages flag populated when we pin the pages */
-	if (!is_mmio)
-		region->flags.range_pinned = true;
+	/* Note: large_pages flag populated when pages are allocated. */
+	if (!is_mmio) {
+		region->flags.is_ram = true;
+
+		if (mshv_partition_encrypted(partition) ||
+		    !mshv_region_movable_init(region))
+			region->flags.range_pinned = true;
+	}
 
 	region->partition = partition;
 
@@ -1363,9 +1663,16 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	if (is_mmio)
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
-	else
+	else if (region->flags.range_pinned)
 		ret = mshv_prepare_pinned_region(region);
-
+	else
+		/*
+		 * For non-pinned regions, remap with no access to let the
+		 * hypervisor track dirty pages, enabling pre-copy live
+		 * migration.
+		 */
+		ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+					      0, region->nr_pages);
 	if (ret)
 		goto errout;
 
@@ -1388,6 +1695,9 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 
 	hlist_del(&region->hnode);
 
+	if (region->flags.is_ram)
+		mshv_region_movable_fini(region);
+
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_partition_region_share(region);
 		if (ret) {




* Re: [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions
  2025-10-16  0:27 ` [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
@ 2025-10-16 14:47   ` kernel test robot
  2025-10-17 23:55   ` Mukesh R
  1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2025-10-16 14:47 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: oe-kbuild-all, linux-hyperv, linux-kernel

Hi Stanislav,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.18-rc1 next-20251015]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/Drivers-hv-Refactor-and-rename-memory-region-handling-functions/20251016-082944
base:   linus/master
patch link:    https://lore.kernel.org/r/176057443695.74314.10584965103467299030.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions
config: x86_64-buildonly-randconfig-002-20251016 (https://download.01.org/0day-ci/archive/20251016/202510162231.7UOw1jQq-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251016/202510162231.7UOw1jQq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510162231.7UOw1jQq-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/hmm.c: In function 'hmm_range_fault':
>> mm/hmm.c:667:21: error: implicit declaration of function 'mmu_interval_check_retry' [-Wimplicit-function-declaration]
     667 |                 if (mmu_interval_check_retry(range->notifier,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~


vim +/mmu_interval_check_retry +667 mm/hmm.c

7b86ac3371b70c Christoph Hellwig 2019-08-28  634  
9a4903e49e495b Christoph Hellwig 2019-07-25  635  /**
9a4903e49e495b Christoph Hellwig 2019-07-25  636   * hmm_range_fault - try to fault some address in a virtual address range
f970b977e068aa Jason Gunthorpe   2020-03-27  637   * @range:	argument structure
9a4903e49e495b Christoph Hellwig 2019-07-25  638   *
be957c886d92aa Jason Gunthorpe   2020-05-01  639   * Returns 0 on success or one of the following error codes:
73231612dc7c90 Jérôme Glisse     2019-05-13  640   *
9a4903e49e495b Christoph Hellwig 2019-07-25  641   * -EINVAL:	Invalid arguments or mm or virtual address is in an invalid vma
9a4903e49e495b Christoph Hellwig 2019-07-25  642   *		(e.g., device file vma).
73231612dc7c90 Jérôme Glisse     2019-05-13  643   * -ENOMEM:	Out of memory.
9a4903e49e495b Christoph Hellwig 2019-07-25  644   * -EPERM:	Invalid permission (e.g., asking for write and range is read
9a4903e49e495b Christoph Hellwig 2019-07-25  645   *		only).
9a4903e49e495b Christoph Hellwig 2019-07-25  646   * -EBUSY:	The range has been invalidated and the caller needs to wait for
9a4903e49e495b Christoph Hellwig 2019-07-25  647   *		the invalidation to finish.
f970b977e068aa Jason Gunthorpe   2020-03-27  648   * -EFAULT:     A page was requested to be valid and could not be made valid
f970b977e068aa Jason Gunthorpe   2020-03-27  649   *              ie it has no backing VMA or it is illegal to access
74eee180b935fc Jérôme Glisse     2017-09-08  650   *
f970b977e068aa Jason Gunthorpe   2020-03-27  651   * This is similar to get_user_pages(), except that it can read the page tables
f970b977e068aa Jason Gunthorpe   2020-03-27  652   * without mutating them (ie causing faults).
74eee180b935fc Jérôme Glisse     2017-09-08  653   */
be957c886d92aa Jason Gunthorpe   2020-05-01  654  int hmm_range_fault(struct hmm_range *range)
74eee180b935fc Jérôme Glisse     2017-09-08  655  {
d28c2c9a487708 Ralph Campbell    2019-11-04  656  	struct hmm_vma_walk hmm_vma_walk = {
d28c2c9a487708 Ralph Campbell    2019-11-04  657  		.range = range,
d28c2c9a487708 Ralph Campbell    2019-11-04  658  		.last = range->start,
d28c2c9a487708 Ralph Campbell    2019-11-04  659  	};
a22dd506400d0f Jason Gunthorpe   2019-11-12  660  	struct mm_struct *mm = range->notifier->mm;
74eee180b935fc Jérôme Glisse     2017-09-08  661  	int ret;
74eee180b935fc Jérôme Glisse     2017-09-08  662  
42fc541404f249 Michel Lespinasse 2020-06-08  663  	mmap_assert_locked(mm);
74eee180b935fc Jérôme Glisse     2017-09-08  664  
a3e0d41c2b1f86 Jérôme Glisse     2019-05-13  665  	do {
a3e0d41c2b1f86 Jérôme Glisse     2019-05-13  666  		/* If range is no longer valid force retry. */
a22dd506400d0f Jason Gunthorpe   2019-11-12 @667  		if (mmu_interval_check_retry(range->notifier,
a22dd506400d0f Jason Gunthorpe   2019-11-12  668  					     range->notifier_seq))
2bcbeaefde2f03 Christoph Hellwig 2019-07-24  669  			return -EBUSY;
d28c2c9a487708 Ralph Campbell    2019-11-04  670  		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
7b86ac3371b70c Christoph Hellwig 2019-08-28  671  				      &hmm_walk_ops, &hmm_vma_walk);
be957c886d92aa Jason Gunthorpe   2020-05-01  672  		/*
be957c886d92aa Jason Gunthorpe   2020-05-01  673  		 * When -EBUSY is returned the loop restarts with
be957c886d92aa Jason Gunthorpe   2020-05-01  674  		 * hmm_vma_walk.last set to an address that has not been stored
be957c886d92aa Jason Gunthorpe   2020-05-01  675  		 * in pfns. All entries < last in the pfn array are set to their
be957c886d92aa Jason Gunthorpe   2020-05-01  676  		 * output, and all >= are still at their input values.
be957c886d92aa Jason Gunthorpe   2020-05-01  677  		 */
d28c2c9a487708 Ralph Campbell    2019-11-04  678  	} while (ret == -EBUSY);
73231612dc7c90 Jérôme Glisse     2019-05-13  679  	return ret;
74eee180b935fc Jérôme Glisse     2017-09-08  680  }
73231612dc7c90 Jérôme Glisse     2019-05-13  681  EXPORT_SYMBOL(hmm_range_fault);
8cad4713056612 Leon Romanovsky   2025-04-28  682  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions
  2025-10-16  0:27 ` [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  2025-10-16 14:47   ` kernel test robot
@ 2025-10-17 23:55   ` Mukesh R
  2025-10-20 17:35     ` Stanislav Kinsburskii
  1 sibling, 1 reply; 9+ messages in thread
From: Mukesh R @ 2025-10-17 23:55 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 10/15/25 17:27, Stanislav Kinsburskii wrote:
> Introduce support for movable memory regions in the Hyper-V root partition
> driver, thus improving memory management flexibility and preparing the
> driver for advanced use cases such as dynamic memory remapping.
> 
> Integrate mmu_interval_notifier for movable regions, implement functions to
> handle HMM faults and memory invalidation, and update memory region mapping
> logic to support movable regions.
> 
> While MMU notifiers are commonly used in virtualization drivers, this
> implementation leverages HMM (Heterogeneous Memory Management) for its
> tailored functionality. HMM provides a ready-made framework for mirroring,
> invalidation, and fault handling, avoiding the need to reimplement these
> mechanisms for a single callback. Although MMU notifiers are more generic,
> using HMM reduces boilerplate and ensures maintainability by utilizing a
> mechanism specifically designed for such use cases.
> 
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/Kconfig          |    1 
>  drivers/hv/mshv_root.h      |    8 +
>  drivers/hv/mshv_root_main.c |  328 ++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 327 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 0b8c391a0342..5f1637cbb6e3 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
>  	depends on PAGE_SIZE_4KB
>  	select EVENTFD
>  	select VIRT_XFER_TO_GUEST_WORK
> +	select HMM_MIRROR
>  	default n
>  	help
>  	  Select this option to enable support for booting and running as root
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 97e64d5341b6..13367c84497c 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -15,6 +15,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/dev_printk.h>
>  #include <linux/build_bug.h>
> +#include <linux/mmu_notifier.h>
>  #include <uapi/linux/mshv.h>
>  
>  /*
> @@ -81,9 +82,14 @@ struct mshv_mem_region {
>  	struct {
>  		u64 large_pages:  1; /* 2MiB */
>  		u64 range_pinned: 1;
> -		u64 reserved:	 62;
> +		u64 is_ram	: 1; /* mem region can be ram or mmio */

In case this gets accepted/merged, note that the internal version names this field:
+               u64 memreg_isram: 1; /* mem region can be ram or mmio */

Keeping the same name will avoid unnecessary diffs when we compare files
to see what is missing internally or externally.

Thanks,
-Mukesh

>  	} flags;
>  	struct mshv_partition *partition;
> +#if defined(CONFIG_MMU_NOTIFIER)
> +	struct mmu_interval_notifier mni;
> +	struct mutex mutex;	/* protects region pages remapping */
> +#endif
>  	struct page *pages[];
>  };
>  
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index c4f114376435..b2738443ac5d 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -29,6 +29,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/vmalloc.h>
> +#include <linux/hmm.h>
>  
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
> @@ -36,6 +37,8 @@
>  
>  #define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
>  
> +#define MSHV_MAP_FAULT_IN_PAGES			HPAGE_PMD_NR
> +
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
> @@ -76,6 +79,11 @@ static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
>  static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
>  static int mshv_init_async_handler(struct mshv_partition *partition);
>  static void mshv_async_hvcall_handler(void *data, u64 *status);
> +static struct mshv_mem_region
> +	*mshv_partition_region_by_gfn(struct mshv_partition *pt, u64 gfn);
> +static int mshv_region_remap_pages(struct mshv_mem_region *region,
> +				   u32 map_flags, u64 page_offset,
> +				   u64 page_count);
>  
>  static const union hv_input_vtl input_vtl_zero;
>  static const union hv_input_vtl input_vtl_normal = {
> @@ -602,14 +610,197 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
>  static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
>  	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
>  
> +#ifdef CONFIG_X86_64
> +
> +#if defined(CONFIG_MMU_NOTIFIER)
> +/**
> + * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> + * @region: Pointer to the memory region structure
> + * @range: Pointer to the HMM range structure
> + *
> + * This function performs the following steps:
> + * 1. Reads the notifier sequence for the HMM range.
> + * 2. Acquires a read lock on the memory map.
> + * 3. Handles HMM faults for the specified range.
> + * 4. Releases the read lock on the memory map.
> + * 5. If successful, locks the memory region mutex.
> + * 6. Verifies if the notifier sequence has changed during the operation.
> + *    If it has, releases the mutex and returns -EBUSY to match with
> + *    hmm_range_fault() return code for repeating.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> +					  struct hmm_range *range)
> +{
> +	int ret;
> +
> +	range->notifier_seq = mmu_interval_read_begin(range->notifier);
> +	mmap_read_lock(region->mni.mm);
> +	ret = hmm_range_fault(range);
> +	mmap_read_unlock(region->mni.mm);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&region->mutex);
> +
> +	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> +		mutex_unlock(&region->mutex);
> +		cond_resched();
> +		return -EBUSY;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * mshv_region_range_fault - Handle memory range faults for a given region.
> + * @region: Pointer to the memory region structure.
> + * @page_offset: Offset of the page within the region.
> + * @page_count: Number of pages to handle.
> + *
> + * This function resolves memory faults for a specified range of pages
> + * within a memory region. It uses HMM (Heterogeneous Memory Management)
> + * to fault in the required pages and updates the region's page array.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> +				   u64 page_offset, u64 page_count)
> +{
> +	struct hmm_range range = {
> +		.notifier = &region->mni,
> +		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> +	};
> +	unsigned long *pfns;
> +	int ret;
> +	u64 i;
> +
> +	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
> +	if (!pfns)
> +		return -ENOMEM;
> +
> +	range.hmm_pfns = pfns;
> +	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> +	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> +
> +	do {
> +		ret = mshv_region_hmm_fault_and_lock(region, &range);
> +	} while (ret == -EBUSY);
> +
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < page_count; i++)
> +		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> +
> +	if (PageHuge(region->pages[page_offset]))
> +		region->flags.large_pages = true;
> +
> +	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> +				      page_offset, page_count);
> +
> +	mutex_unlock(&region->mutex);
> +out:
> +	kfree(pfns);
> +	return ret;
> +}
> +#else /* CONFIG_MMU_NOTIFIER */
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> +				   u64 page_offset, u64 page_count)
> +{
> +	return -ENODEV;
> +}
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> +{
> +	u64 page_offset, page_count;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(region->flags.range_pinned))
> +		return false;
> +
> +	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> +	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> +				 MSHV_MAP_FAULT_IN_PAGES);
> +
> +	/* Map more pages than requested to reduce the number of faults. */
> +	page_count = min(region->nr_pages - page_offset,
> +			 MSHV_MAP_FAULT_IN_PAGES);
> +
> +	ret = mshv_region_range_fault(region, page_offset, page_count);
> +
> +	WARN_ONCE(ret,
> +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> +		  region->partition->pt_id, region->start_uaddr,
> +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> +		  gfn, page_offset, page_count);
> +
> +	return !ret;
> +}
> +
> +/**
> + * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
> + * @vp: Pointer to the virtual processor structure.
> + *
> + * This function processes GPA intercepts by identifying the memory region
> + * corresponding to the intercepted GPA, aligning the page offset, and
> + * mapping the required pages. It ensures that the region is valid and
> + * handles faults efficiently by mapping multiple pages at once.
> + *
> + * Return: true if the intercept was handled successfully, false otherwise.
> + */
> +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> +{
> +	struct mshv_partition *p = vp->vp_partition;
> +	struct mshv_mem_region *region;
> +	struct hv_x64_memory_intercept_message *msg;
> +	u64 gfn;
> +
> +	msg = (struct hv_x64_memory_intercept_message *)
> +		vp->vp_intercept_msg_page->u.payload;
> +
> +	gfn = HVPFN_DOWN(msg->guest_physical_address);
> +
> +	region = mshv_partition_region_by_gfn(p, gfn);
> +	if (!region)
> +		return false;
> +
> +	if (WARN_ON_ONCE(!region->flags.is_ram))
> +		return false;
> +
> +	if (WARN_ON_ONCE(region->flags.range_pinned))
> +		return false;
> +
> +	return mshv_region_handle_gfn_fault(region, gfn);
> +}
> +
> +#else	/* CONFIG_X86_64 */
> +
> +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> +{
> +	switch (vp->vp_intercept_msg_page->header.message_type) {
> +	case HVMSG_GPA_INTERCEPT:
> +		return mshv_handle_gpa_intercept(vp);
> +	}
> +	return false;
> +}
> +
>  static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
>  {
>  	long rc;
>  
> -	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> -		rc = mshv_run_vp_with_root_scheduler(vp);
> -	else
> -		rc = mshv_run_vp_with_hyp_scheduler(vp);
> +	do {
> +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> +			rc = mshv_run_vp_with_root_scheduler(vp);
> +		else
> +			rc = mshv_run_vp_with_hyp_scheduler(vp);
> +	} while (rc == 0 && mshv_vp_handle_intercept(vp));
>  
>  	if (rc)
>  		return rc;
> @@ -1209,6 +1400,110 @@ mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr)
>  	return NULL;
>  }
>  
> +#if defined(CONFIG_MMU_NOTIFIER)
> +static void mshv_region_movable_fini(struct mshv_mem_region *region)
> +{
> +	if (region->flags.range_pinned)
> +		return;
> +
> +	mmu_interval_notifier_remove(&region->mni);
> +}
> +
> +/**
> + * mshv_region_interval_invalidate - Invalidate a range of memory region
> + * @mni: Pointer to the mmu_interval_notifier structure
> + * @range: Pointer to the mmu_notifier_range structure
> + * @cur_seq: Current sequence number for the interval notifier
> + *
> + * This function invalidates a memory region by remapping its pages with
> + * no access permissions. It locks the region's mutex to ensure thread safety
> + * and updates the sequence number for the interval notifier. If the range
> + * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
> + * lock and returns false if unsuccessful.
> + *
> + * NOTE: Failure to invalidate a region is a serious error, as the pages will
> + * be considered freed while they are still mapped by the hypervisor.
> + * Any attempt to access such pages will likely crash the system.
> + *
> + * Return: true if the region was successfully invalidated, false otherwise.
> + */
> +static bool
> +mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> +				const struct mmu_notifier_range *range,
> +				unsigned long cur_seq)
> +{
> +	struct mshv_mem_region *region = container_of(mni,
> +						struct mshv_mem_region,
> +						mni);
> +	u64 page_offset, page_count;
> +	unsigned long mstart, mend;
> +	int ret = -EPERM;
> +
> +	if (mmu_notifier_range_blockable(range))
> +		mutex_lock(&region->mutex);
> +	else if (!mutex_trylock(&region->mutex))
> +		goto out_fail;
> +
> +	mmu_interval_set_seq(mni, cur_seq);
> +
> +	mstart = max(range->start, region->start_uaddr);
> +	mend = min(range->end, region->start_uaddr +
> +		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> +
> +	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> +	page_count = HVPFN_DOWN(mend - mstart);
> +
> +	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> +				      page_offset, page_count);
> +	if (ret)
> +		goto out_fail;
> +
> +	mshv_region_invalidate_pages(region, page_offset, page_count);
> +
> +	mutex_unlock(&region->mutex);
> +
> +	return true;
> +
> +out_fail:
> +	WARN_ONCE(ret,
> +		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
> +		  region->start_uaddr,
> +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> +		  range->start, range->end, range->event,
> +		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> +	return false;
> +}
> +
> +static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
> +	.invalidate = mshv_region_interval_invalidate,
> +};
> +
> +static bool mshv_region_movable_init(struct mshv_mem_region *region)
> +{
> +	int ret;
> +
> +	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
> +					   region->start_uaddr,
> +					   region->nr_pages << HV_HYP_PAGE_SHIFT,
> +					   &mshv_region_mni_ops);
> +	if (ret)
> +		return false;
> +
> +	mutex_init(&region->mutex);
> +
> +	return true;
> +}
> +#else
> +static inline void mshv_region_movable_fini(struct mshv_mem_region *region)
> +{
> +}
> +
> +static inline bool mshv_region_movable_init(struct mshv_mem_region *region)
> +{
> +	return false;
> +}
> +#endif
> +
>  /*
>   * NB: caller checks and makes sure mem->size is page aligned
>   * Returns: 0 with regionpp updated on success, or -errno
> @@ -1241,9 +1536,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
>  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
>  
> -	/* Note: large_pages flag populated when we pin the pages */
> -	if (!is_mmio)
> -		region->flags.range_pinned = true;
> +	/* Note: large_pages flag populated when pages are allocated. */
> +	if (!is_mmio) {
> +		region->flags.is_ram = true;
> +
> +		if (mshv_partition_encrypted(partition) ||
> +		    !mshv_region_movable_init(region))
> +			region->flags.range_pinned = true;
> +	}
>  
>  	region->partition = partition;
>  
> @@ -1363,9 +1663,16 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  	if (is_mmio)
>  		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
>  					     mmio_pfn, HVPFN_DOWN(mem.size));
> -	else
> +	else if (region->flags.range_pinned)
>  		ret = mshv_prepare_pinned_region(region);
> -
> +	else
> +		/*
> +		 * For non-pinned regions, remap with no access to let the
> +		 * hypervisor track dirty pages, enabling pre-copy live
> +		 * migration.
> +		 */
> +		ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> +					      0, region->nr_pages);
>  	if (ret)
>  		goto errout;
>  
> @@ -1388,6 +1695,9 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
>  
>  	hlist_del(&region->hnode);
>  
> +	if (region->flags.is_ram)
> +		mshv_region_movable_fini(region);
> +
>  	if (mshv_partition_encrypted(partition)) {
>  		ret = mshv_partition_region_share(region);
>  		if (ret) {
> 
> 



* Re: [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions
  2025-10-17 23:55   ` Mukesh R
@ 2025-10-20 17:35     ` Stanislav Kinsburskii
  0 siblings, 0 replies; 9+ messages in thread
From: Stanislav Kinsburskii @ 2025-10-20 17:35 UTC (permalink / raw)
  To: Mukesh R; +Cc: kys, haiyangz, wei.liu, decui, linux-hyperv, linux-kernel

On Fri, Oct 17, 2025 at 04:55:24PM -0700, Mukesh R wrote:
> On 10/15/25 17:27, Stanislav Kinsburskii wrote:
> > Introduce support for movable memory regions in the Hyper-V root partition
> > driver, thus improving memory management flexibility and preparing the
> > driver for advanced use cases such as dynamic memory remapping.
> > 
> > Integrate mmu_interval_notifier for movable regions, implement functions to
> > handle HMM faults and memory invalidation, and update memory region mapping
> > logic to support movable regions.
> > 
> > While MMU notifiers are commonly used in virtualization drivers, this
> > implementation leverages HMM (Heterogeneous Memory Management) for its
> > tailored functionality. HMM provides a ready-made framework for mirroring,
> > invalidation, and fault handling, avoiding the need to reimplement these
> > mechanisms for a single callback. Although MMU notifiers are more generic,
> > using HMM reduces boilerplate and ensures maintainability by utilizing a
> > mechanism specifically designed for such use cases.
> > 
> > Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/Kconfig          |    1 
> >  drivers/hv/mshv_root.h      |    8 +
> >  drivers/hv/mshv_root_main.c |  328 ++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 327 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > index 0b8c391a0342..5f1637cbb6e3 100644
> > --- a/drivers/hv/Kconfig
> > +++ b/drivers/hv/Kconfig
> > @@ -75,6 +75,7 @@ config MSHV_ROOT
> >  	depends on PAGE_SIZE_4KB
> >  	select EVENTFD
> >  	select VIRT_XFER_TO_GUEST_WORK
> > +	select HMM_MIRROR
> >  	default n
> >  	help
> >  	  Select this option to enable support for booting and running as root
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 97e64d5341b6..13367c84497c 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -15,6 +15,7 @@
> >  #include <linux/hashtable.h>
> >  #include <linux/dev_printk.h>
> >  #include <linux/build_bug.h>
> > +#include <linux/mmu_notifier.h>
> >  #include <uapi/linux/mshv.h>
> >  
> >  /*
> > @@ -81,9 +82,14 @@ struct mshv_mem_region {
> >  	struct {
> >  		u64 large_pages:  1; /* 2MiB */
> >  		u64 range_pinned: 1;
> > -		u64 reserved:	 62;
> > +		u64 is_ram	: 1; /* mem region can be ram or mmio */
> 
> In case this gets accepted/merged, this is named 
> +               u64 memreg_isram: 1; /* mem region can be ram or mmio */
> 

The proposed naming scheme doesn't align with the other fields in this structure.
Renaming this one should probably be done together with the other fields,
in a separate change.

Thanks,
Stanislav

> Keeping the name same will avoid unnecessary diffs when we try to compare
> files to see what is missing internally or externally.
> 
> Thanks,
> -Mukesh
> 
> >  	} flags;
> >  	struct mshv_partition *partition;
> > +#if defined(CONFIG_MMU_NOTIFIER)
> > +	struct mmu_interval_notifier mni;
> > +	struct mutex mutex;	/* protects region pages remapping */
> > +#endif
> >  	struct page *pages[];
> >  };
> >  
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index c4f114376435..b2738443ac5d 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/crash_dump.h>
> >  #include <linux/panic_notifier.h>
> >  #include <linux/vmalloc.h>
> > +#include <linux/hmm.h>
> >  
> >  #include "mshv_eventfd.h"
> >  #include "mshv.h"
> > @@ -36,6 +37,8 @@
> >  
> >  #define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
> >  
> > +#define MSHV_MAP_FAULT_IN_PAGES			HPAGE_PMD_NR
> > +
> >  MODULE_AUTHOR("Microsoft");
> >  MODULE_LICENSE("GPL");
> >  MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
> > @@ -76,6 +79,11 @@ static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
> >  static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
> >  static int mshv_init_async_handler(struct mshv_partition *partition);
> >  static void mshv_async_hvcall_handler(void *data, u64 *status);
> > +static struct mshv_mem_region
> > +	*mshv_partition_region_by_gfn(struct mshv_partition *pt, u64 gfn);
> > +static int mshv_region_remap_pages(struct mshv_mem_region *region,
> > +				   u32 map_flags, u64 page_offset,
> > +				   u64 page_count);
> >  
> >  static const union hv_input_vtl input_vtl_zero;
> >  static const union hv_input_vtl input_vtl_normal = {
> > @@ -602,14 +610,197 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
> >  static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
> >  	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
> >  
> > +#ifdef CONFIG_X86_64
> > +
> > +#if defined(CONFIG_MMU_NOTIFIER)
> > +/**
> > + * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> > + * @region: Pointer to the memory region structure
> > + * @range: Pointer to the HMM range structure
> > + *
> > + * This function performs the following steps:
> > + * 1. Reads the notifier sequence for the HMM range.
> > + * 2. Acquires a read lock on the memory map.
> > + * 3. Handles HMM faults for the specified range.
> > + * 4. Releases the read lock on the memory map.
> > + * 5. If successful, locks the memory region mutex.
> > + * 6. Checks whether the notifier sequence changed during the operation.
> > + *    If it did, releases the mutex and returns -EBUSY, the same code
> > + *    hmm_range_fault() uses to request a retry.
> > + *
> > + * Return: 0 on success (region mutex held), a negative error code otherwise.
> > + */
> > +static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > +					  struct hmm_range *range)
> > +{
> > +	int ret;
> > +
> > +	range->notifier_seq = mmu_interval_read_begin(range->notifier);
> > +	mmap_read_lock(region->mni.mm);
> > +	ret = hmm_range_fault(range);
> > +	mmap_read_unlock(region->mni.mm);
> > +	if (ret)
> > +		return ret;
> > +
> > +	mutex_lock(&region->mutex);
> > +
> > +	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> > +		mutex_unlock(&region->mutex);
> > +		cond_resched();
> > +		return -EBUSY;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * mshv_region_range_fault - Handle memory range faults for a given region.
> > + * @region: Pointer to the memory region structure.
> > + * @page_offset: Offset of the page within the region.
> > + * @page_count: Number of pages to handle.
> > + *
> > + * This function resolves memory faults for a specified range of pages
> > + * within a memory region. It uses HMM (Heterogeneous Memory Management)
> > + * to fault in the required pages and updates the region's page array.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> > + */
> > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > +				   u64 page_offset, u64 page_count)
> > +{
> > +	struct hmm_range range = {
> > +		.notifier = &region->mni,
> > +		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> > +	};
> > +	unsigned long *pfns;
> > +	int ret;
> > +	u64 i;
> > +
> > +	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
> > +	if (!pfns)
> > +		return -ENOMEM;
> > +
> > +	range.hmm_pfns = pfns;
> > +	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> > +	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> > +
> > +	do {
> > +		ret = mshv_region_hmm_fault_and_lock(region, &range);
> > +	} while (ret == -EBUSY);
> > +
> > +	if (ret)
> > +		goto out;
> > +
> > +	for (i = 0; i < page_count; i++)
> > +		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +	if (PageHuge(region->pages[page_offset]))
> > +		region->flags.large_pages = true;
> > +
> > +	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > +				      page_offset, page_count);
> > +
> > +	mutex_unlock(&region->mutex);
> > +out:
> > +	kfree(pfns);
> > +	return ret;
> > +}
> > +#else /* CONFIG_MMU_NOTIFIER */
> > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > +				   u64 page_offset, u64 page_count)
> > +{
> > +	return -ENODEV;
> > +}
> > +#endif /* CONFIG_MMU_NOTIFIER */
> > +
> > +static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> > +{
> > +	u64 page_offset, page_count;
> > +	int ret;
> > +
> > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > +		return false;
> > +
> > +	/* Round the offset down to a MSHV_MAP_FAULT_IN_PAGES boundary. */
> > +	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > +				 MSHV_MAP_FAULT_IN_PAGES);
> > +
> > +	/* Map more pages than requested to reduce the number of faults. */
> > +	page_count = min(region->nr_pages - page_offset,
> > +			 MSHV_MAP_FAULT_IN_PAGES);
> > +
> > +	ret = mshv_region_range_fault(region, page_offset, page_count);
> > +
> > +	WARN_ONCE(ret,
> > +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> > +		  region->partition->pt_id, region->start_uaddr,
> > +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > +		  gfn, page_offset, page_count);
> > +
> > +	return !ret;
> > +}
> > +
> > +/**
> > + * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
> > + * @vp: Pointer to the virtual processor structure.
> > + *
> > + * This function processes GPA intercepts by identifying the memory region
> > + * corresponding to the intercepted GPA, aligning the page offset, and
> > + * mapping the required pages. It ensures that the region is valid and
> > + * handles faults efficiently by mapping multiple pages at once.
> > + *
> > + * Return: true if the intercept was handled successfully, false otherwise.
> > + */
> > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > +{
> > +	struct mshv_partition *p = vp->vp_partition;
> > +	struct mshv_mem_region *region;
> > +	struct hv_x64_memory_intercept_message *msg;
> > +	u64 gfn;
> > +
> > +	msg = (struct hv_x64_memory_intercept_message *)
> > +		vp->vp_intercept_msg_page->u.payload;
> > +
> > +	gfn = HVPFN_DOWN(msg->guest_physical_address);
> > +
> > +	region = mshv_partition_region_by_gfn(p, gfn);
> > +	if (!region)
> > +		return false;
> > +
> > +	if (WARN_ON_ONCE(!region->flags.is_ram))
> > +		return false;
> > +
> > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > +		return false;
> > +
> > +	return mshv_region_handle_gfn_fault(region, gfn);
> > +}
> > +
> > +#else	/* CONFIG_X86_64 */
> > +
> > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > +
> > +#endif	/* CONFIG_X86_64 */
> > +
> > +static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > +{
> > +	switch (vp->vp_intercept_msg_page->header.message_type) {
> > +	case HVMSG_GPA_INTERCEPT:
> > +		return mshv_handle_gpa_intercept(vp);
> > +	}
> > +	return false;
> > +}
> > +
> >  static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
> >  {
> >  	long rc;
> >  
> > -	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > -		rc = mshv_run_vp_with_root_scheduler(vp);
> > -	else
> > -		rc = mshv_run_vp_with_hyp_scheduler(vp);
> > +	do {
> > +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > +			rc = mshv_run_vp_with_root_scheduler(vp);
> > +		else
> > +			rc = mshv_run_vp_with_hyp_scheduler(vp);
> > +	} while (rc == 0 && mshv_vp_handle_intercept(vp));
> >  
> >  	if (rc)
> >  		return rc;
> > @@ -1209,6 +1400,110 @@ mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr)
> >  	return NULL;
> >  }
> >  
> > +#if defined(CONFIG_MMU_NOTIFIER)
> > +static void mshv_region_movable_fini(struct mshv_mem_region *region)
> > +{
> > +	if (region->flags.range_pinned)
> > +		return;
> > +
> > +	mmu_interval_notifier_remove(&region->mni);
> > +}
> > +
> > +/**
> > + * mshv_region_interval_invalidate - Invalidate a range of a memory region
> > + * @mni: Pointer to the mmu_interval_notifier structure
> > + * @range: Pointer to the mmu_notifier_range structure
> > + * @cur_seq: Current sequence number for the interval notifier
> > + *
> > + * This function invalidates a memory region by remapping its pages with
> > + * no access permissions. It locks the region's mutex to ensure thread safety
> > + * and updates the sequence number for the interval notifier. If the range
> > + * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
> > + * lock and returns false if unsuccessful.
> > + *
> > + * NOTE: Failure to invalidate a region is a serious error, as the pages will
> > + * be considered freed while they are still mapped by the hypervisor.
> > + * Any attempt to access such pages will likely crash the system.
> > + *
> > + * Return: true if the region was successfully invalidated, false otherwise.
> > + */
> > +static bool
> > +mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > +				const struct mmu_notifier_range *range,
> > +				unsigned long cur_seq)
> > +{
> > +	struct mshv_mem_region *region = container_of(mni,
> > +						struct mshv_mem_region,
> > +						mni);
> > +	u64 page_offset, page_count;
> > +	unsigned long mstart, mend;
> > +	int ret = -EPERM;
> > +
> > +	if (mmu_notifier_range_blockable(range))
> > +		mutex_lock(&region->mutex);
> > +	else if (!mutex_trylock(&region->mutex))
> > +		goto out_fail;
> > +
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +
> > +	mstart = max(range->start, region->start_uaddr);
> > +	mend = min(range->end, region->start_uaddr +
> > +		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> > +
> > +	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> > +	page_count = HVPFN_DOWN(mend - mstart);
> > +
> > +	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> > +				      page_offset, page_count);
> > +	if (ret)
> > +		goto out_fail;
> > +
> > +	mshv_region_invalidate_pages(region, page_offset, page_count);
> > +
> > +	mutex_unlock(&region->mutex);
> > +
> > +	return true;
> > +
> > +out_fail:
> > +	WARN_ONCE(ret,
> > +		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
> > +		  region->start_uaddr,
> > +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > +		  range->start, range->end, range->event,
> > +		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> > +	return false;
> > +}
> > +
> > +static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
> > +	.invalidate = mshv_region_interval_invalidate,
> > +};
> > +
> > +static bool mshv_region_movable_init(struct mshv_mem_region *region)
> > +{
> > +	int ret;
> > +
> > +	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
> > +					   region->start_uaddr,
> > +					   region->nr_pages << HV_HYP_PAGE_SHIFT,
> > +					   &mshv_region_mni_ops);
> > +	if (ret)
> > +		return false;
> > +
> > +	mutex_init(&region->mutex);
> > +
> > +	return true;
> > +}
> > +#else
> > +static inline void mshv_region_movable_fini(struct mshv_mem_region *region)
> > +{
> > +}
> > +
> > +static inline bool mshv_region_movable_init(struct mshv_mem_region *region)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> >  /*
> >   * NB: caller checks and makes sure mem->size is page aligned
> >   * Returns: 0 with regionpp updated on success, or -errno
> > @@ -1241,9 +1536,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> >  	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
> >  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> >  
> > -	/* Note: large_pages flag populated when we pin the pages */
> > -	if (!is_mmio)
> > -		region->flags.range_pinned = true;
> > +	/* Note: large_pages flag populated when pages are allocated. */
> > +	if (!is_mmio) {
> > +		region->flags.is_ram = true;
> > +
> > +		if (mshv_partition_encrypted(partition) ||
> > +		    !mshv_region_movable_init(region))
> > +			region->flags.range_pinned = true;
> > +	}
> >  
> >  	region->partition = partition;
> >  
> > @@ -1363,9 +1663,16 @@ mshv_map_user_memory(struct mshv_partition *partition,
> >  	if (is_mmio)
> >  		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
> >  					     mmio_pfn, HVPFN_DOWN(mem.size));
> > -	else
> > +	else if (region->flags.range_pinned)
> >  		ret = mshv_prepare_pinned_region(region);
> > -
> > +	else
> > +		/*
> > +		 * For non-pinned regions, remap with no access to let the
> > +		 * hypervisor track dirty pages, enabling pre-copy live
> > +		 * migration.
> > +		 */
> > +		ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> > +					      0, region->nr_pages);
> >  	if (ret)
> >  		goto errout;
> >  
> > @@ -1388,6 +1695,9 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> >  
> >  	hlist_del(&region->hnode);
> >  
> > +	if (region->flags.is_ram)
> > +		mshv_region_movable_fini(region);
> > +
> >  	if (mshv_partition_encrypted(partition)) {
> >  		ret = mshv_partition_region_share(region);
> >  		if (ret) {
> > 
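For readers following the fault path, the window computed in
mshv_region_handle_gfn_fault() can be worked through in userspace. This is
only a sketch: FAULT_IN_PAGES is assumed to be 512 (HPAGE_PMD_NR with 4 KiB
pages and 2 MiB PMDs), and struct fault_window, align_down() and
fault_window() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512 pages per PMD, i.e. HPAGE_PMD_NR for 4 KiB pages. */
#define FAULT_IN_PAGES 512ULL

/* Hypothetical helper type: the range handed to the range-fault step. */
struct fault_window {
	uint64_t page_offset;	/* first page to map, relative to the region */
	uint64_t page_count;	/* number of pages to map */
};

/* Round x down to a multiple of a, where a is a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/*
 * Mirror of the arithmetic in the patch: round the faulting page down to
 * a FAULT_IN_PAGES boundary, then clamp the count to the end of the region.
 */
static struct fault_window fault_window(uint64_t start_gfn, uint64_t nr_pages,
					uint64_t gfn)
{
	struct fault_window w;
	uint64_t remaining;

	w.page_offset = align_down(gfn - start_gfn, FAULT_IN_PAGES);
	remaining = nr_pages - w.page_offset;
	w.page_count = remaining < FAULT_IN_PAGES ? remaining : FAULT_IN_PAGES;

	return w;
}
```

With a 1300-page region starting at GFN 0x1000, a fault on page 1100 yields
offset 1024 and count 276, so the last window is shorter than a full PMD.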
> > 
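The retry loop around mshv_region_hmm_fault_and_lock() follows the usual
"sample sequence, do slow work, recheck sequence" pattern. Below is a
single-threaded toy model of that control flow only. All toy_* names are
hypothetical, and the real mmu_interval_read_begin() /
mmu_interval_read_retry() do more than compare a counter (read_begin can
also wait for an invalidation in progress), so treat this as an
illustration rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the interval notifier: a bare sequence counter. */
struct toy_notifier {
	unsigned long seq;
};

/* Sample the sequence before starting the slow work. */
static unsigned long toy_read_begin(const struct toy_notifier *n)
{
	return n->seq;
}

/* True if an invalidation ran since the sequence was sampled. */
static bool toy_read_retry(const struct toy_notifier *n, unsigned long seq)
{
	return n->seq != seq;
}

/* Invalidation side: advancing the sequence makes in-flight faults stale. */
static void toy_invalidate(struct toy_notifier *n)
{
	n->seq++;
}

/*
 * Fault side. 'races' is the number of invalidations that land while the
 * slow fault-in step is running. Returns the number of attempts needed.
 */
static int toy_fault_with_retry(struct toy_notifier *n, int races)
{
	int attempts = 0;

	for (;;) {
		unsigned long seq = toy_read_begin(n);

		attempts++;

		/* The slow step (hmm_range_fault() in the patch) runs here. */
		if (races > 0) {
			toy_invalidate(n);
			races--;
		}

		/* The patch takes the region mutex before this recheck. */
		if (!toy_read_retry(n, seq))
			return attempts;
	}
}
```

Two racing invalidations cost two wasted attempts; the third attempt sees
an unchanged sequence and proceeds to the remap step.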


end of thread, other threads:[~2025-10-20 17:36 UTC | newest]

Thread overview: 9+ messages
2025-10-16  0:26 [PATCH v5 0/5] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
2025-10-16  0:26 ` [PATCH v5 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
2025-10-16  0:27 ` [PATCH v5 2/5] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
2025-10-16  0:27 ` [PATCH v5 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
2025-10-16  0:27 ` [PATCH v5 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned Stanislav Kinsburskii
2025-10-16  0:27 ` [PATCH v5 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
2025-10-16 14:47   ` kernel test robot
2025-10-17 23:55   ` Mukesh R
2025-10-20 17:35     ` Stanislav Kinsburskii
