linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/5] Drivers: hv: Movable guest memory for child partitions
@ 2025-11-17 16:52 Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-17 16:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Since its introduction, the root-partition driver has allocated, pinned, and
mapped all guest memory into the hypervisor at guest creation time. This
keeps things simple: Linux cannot move the pages, so the guest's view of its
memory in Linux and in the Microsoft Hypervisor never diverges.

However, this approach has major drawbacks:
 - NUMA: affinity cannot be changed at runtime, so guest memory cannot be
   migrated closer to the CPUs running the guest, which hurts performance.
 - Memory management: unused guest memory cannot be swapped out, compacted,
   or merged.
 - Provisioning time: upfront allocation and pinning slow down guest
   creation and destruction.
 - Overcommit: memory cannot be overcommitted on hosts running guests with
   pinned memory.

This series adds movable memory pages for Hyper-V child partitions. Guest
pages are no longer allocated upfront; they’re allocated and mapped into
the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
backed by a host PFN).
When Linux wants to move or reclaim a page, the MMU notifier fires and the
page is unmapped from the hypervisor before Linux takes the page back.
As a result, Hyper-V guest memory behaves much like the memory of a regular
Linux process, and standard Linux memory-management features now apply to
guest memory.
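
At a high level, the fault and invalidation paths added in patch 5 look
roughly like this (simplified call flow, not the literal code):

  /* Guest touches an unmapped GFN -> hypervisor delivers a GPA intercept */
  mshv_handle_gpa_intercept()
    -> mshv_partition_region_by_gfn()                /* find the region          */
    -> hmm_range_fault()                             /* fault in the host pages  */
    -> mshv_region_remap_pages(hv_map_flags)         /* hypercall: map for guest */

  /* Linux wants the pages back (migration, compaction, swap, ...) */
  mshv_region_interval_invalidate()                  /* MMU interval notifier    */
    -> mshv_region_remap_pages(HV_MAP_GPA_NO_ACCESS) /* hypercall: unmap         */
    -> mshv_region_invalidate_pages()                /* drop the page pointers   */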

Exceptions (still pinned):
 1. Encrypted guests (explicit).
 2. Guests with passthrough devices (implicitly pinned by the VFIO framework).

v6:
 - Fix a bug in large page remapping: deriving the large-map flag from the
   page offset's alignment within the region implicitly assumed that the
   region's start GFN was also large-page aligned, which could cause map
   hypercall failures.
 - Fix a bug in large page unmapping where setting the large unmap flag for
   an unaligned guest PFN range could result in unmap hypercall failures.

v5:
 - Fix a bug in MMU notifier handling where an uninitialized 'ret' variable
   could cause the warning about failed page invalidation to be skipped.
 - Improve comment grammar regarding skipping the unmapping of non-mapped pages.

v4:
 - Fix a bug where batch unmapping could skip mapped pages when selecting a
   new batch, due to a wrong offset calculation.
 - Fix an error message in case of failed memory region pinning.

v3:
 - Region is invalidated even if the mm has no users.
 - Page remapping logic is updated to support 2M-unaligned remappings for
   regions that are PMD-aligned, which can occur during both faults and
   invalidations.

v2:
 - Split unmap batching into a separate patch.
 - Fixed commit messages from v1 review.
 - Renamed a few functions for clarity.

---

Stanislav Kinsburskii (5):
      Drivers: hv: Refactor and rename memory region handling functions
      Drivers: hv: Centralize guest memory region destruction
      Drivers: hv: Batch GPA unmap operations to improve large region performance
      Drivers: hv: Ensure large page GPA mapping is PMD-aligned
      Drivers: hv: Add support for movable memory regions


 drivers/hv/Kconfig             |    1 
 drivers/hv/mshv_root.h         |   10 +
 drivers/hv/mshv_root_hv_call.c |    2 
 drivers/hv/mshv_root_main.c    |  497 +++++++++++++++++++++++++++++++++-------
 4 files changed, 426 insertions(+), 84 deletions(-)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v6 1/5] Drivers: hv: Refactor and rename memory region handling functions
  2025-11-17 16:52 [PATCH v6 0/5] Drivers: hv: Movable guest memory for child partitions Stanislav Kinsburskii
@ 2025-11-17 16:52 ` Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 2/5] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-17 16:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Simplify and unify memory region management to improve code clarity:
rename the region "evict" helpers to "invalidate", fold
mshv_region_populate_pages() into a whole-region mshv_region_pin(), and
rename mshv_partition_mem_region_map() to mshv_prepare_pinned_region() to
reflect that it handles explicitly pinned regions.

Add kernel-doc for the renamed helper and update call sites.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   80 +++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index a6b9dbaed291..bddb3f1096b5 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1120,8 +1120,8 @@ mshv_region_map(struct mshv_mem_region *region)
 }
 
 static void
-mshv_region_evict_pages(struct mshv_mem_region *region,
-			u64 page_offset, u64 page_count)
+mshv_region_invalidate_pages(struct mshv_mem_region *region,
+			     u64 page_offset, u64 page_count)
 {
 	if (region->flags.range_pinned)
 		unpin_user_pages(region->pages + page_offset, page_count);
@@ -1131,29 +1131,24 @@ mshv_region_evict_pages(struct mshv_mem_region *region,
 }
 
 static void
-mshv_region_evict(struct mshv_mem_region *region)
+mshv_region_invalidate(struct mshv_mem_region *region)
 {
-	mshv_region_evict_pages(region, 0, region->nr_pages);
+	mshv_region_invalidate_pages(region, 0, region->nr_pages);
 }
 
 static int
-mshv_region_populate_pages(struct mshv_mem_region *region,
-			   u64 page_offset, u64 page_count)
+mshv_region_pin(struct mshv_mem_region *region)
 {
 	u64 done_count, nr_pages;
 	struct page **pages;
 	__u64 userspace_addr;
 	int ret;
 
-	if (page_offset + page_count > region->nr_pages)
-		return -EINVAL;
-
-	for (done_count = 0; done_count < page_count; done_count += ret) {
-		pages = region->pages + page_offset + done_count;
+	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
+		pages = region->pages + done_count;
 		userspace_addr = region->start_uaddr +
-				(page_offset + done_count) *
-				HV_HYP_PAGE_SIZE;
-		nr_pages = min(page_count - done_count,
+				 done_count * HV_HYP_PAGE_SIZE;
+		nr_pages = min(region->nr_pages - done_count,
 			       MSHV_PIN_PAGES_BATCH_SIZE);
 
 		/*
@@ -1164,34 +1159,23 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		if (region->flags.range_pinned)
-			ret = pin_user_pages_fast(userspace_addr,
-						  nr_pages,
-						  FOLL_WRITE | FOLL_LONGTERM,
-						  pages);
-		else
-			ret = -EOPNOTSUPP;
-
+		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+					  FOLL_WRITE | FOLL_LONGTERM,
+					  pages);
 		if (ret < 0)
 			goto release_pages;
 	}
 
-	if (PageHuge(region->pages[page_offset]))
+	if (PageHuge(region->pages[0]))
 		region->flags.large_pages = true;
 
 	return 0;
 
 release_pages:
-	mshv_region_evict_pages(region, page_offset, done_count);
+	mshv_region_invalidate_pages(region, 0, done_count);
 	return ret;
 }
 
-static int
-mshv_region_populate(struct mshv_mem_region *region)
-{
-	return mshv_region_populate_pages(region, 0, region->nr_pages);
-}
-
 static struct mshv_mem_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
@@ -1251,19 +1235,27 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/*
- * Map guest ram. if snp, make sure to release that from the host first
- * Side Effects: In case of failure, pages are unpinned when feasible.
+/**
+ * mshv_prepare_pinned_region - Pin and map memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-static int
-mshv_partition_mem_region_map(struct mshv_mem_region *region)
+static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
-	ret = mshv_region_populate(region);
+	ret = mshv_region_pin(region);
 	if (ret) {
-		pt_err(partition, "Failed to populate memory region: %d\n",
+		pt_err(partition, "Failed to pin memory region: %d\n",
 		       ret);
 		goto err_out;
 	}
@@ -1281,7 +1273,7 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 			pt_err(partition,
 			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
 			       region->start_gfn, ret);
-			goto evict_region;
+			goto invalidate_region;
 		}
 	}
 
@@ -1291,7 +1283,7 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 
 		shrc = mshv_partition_region_share(region);
 		if (!shrc)
-			goto evict_region;
+			goto invalidate_region;
 
 		pt_err(partition,
 		       "Failed to share memory region (guest_pfn: %llu): %d\n",
@@ -1305,8 +1297,8 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 
 	return 0;
 
-evict_region:
-	mshv_region_evict(region);
+invalidate_region:
+	mshv_region_invalidate(region);
 err_out:
 	return ret;
 }
@@ -1355,7 +1347,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
 	else
-		ret = mshv_partition_mem_region_map(region);
+		ret = mshv_prepare_pinned_region(region);
 
 	if (ret)
 		goto errout;
@@ -1400,7 +1392,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
 				region->nr_pages, unmap_flags);
 
-	mshv_region_evict(region);
+	mshv_region_invalidate(region);
 
 	vfree(region);
 	return 0;
@@ -1818,7 +1810,7 @@ static void destroy_partition(struct mshv_partition *partition)
 			}
 		}
 
-		mshv_region_evict(region);
+		mshv_region_invalidate(region);
 
 		vfree(region);
 	}



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 2/5] Drivers: hv: Centralize guest memory region destruction
  2025-11-17 16:52 [PATCH v6 0/5] Drivers: hv: Movable guest memory for child partitions Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
@ 2025-11-17 16:52 ` Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-17 16:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Centralize guest memory region destruction to prevent resource leaks and
inconsistent cleanup across unmap and partition destruction paths.

Unify region removal, encrypted partition access recovery, and region
invalidation to improve maintainability and reliability. Reduce code
duplication and make future updates less error-prone by encapsulating
cleanup logic in a single helper.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   65 ++++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 31 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bddb3f1096b5..a85872b72b1a 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1362,13 +1362,42 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	return ret;
 }
 
+static void mshv_partition_destroy_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	u32 unmap_flags = 0;
+	int ret;
+
+	hlist_del(&region->hnode);
+
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_partition_region_share(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
+			       ret);
+			return;
+		}
+	}
+
+	if (region->flags.large_pages)
+		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	/* ignore unmap failures and continue as process may be exiting */
+	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
+				region->nr_pages, unmap_flags);
+
+	mshv_region_invalidate(region);
+
+	vfree(region);
+}
+
 /* Called for unmapping both the guest ram and the mmio space */
 static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region mem)
 {
 	struct mshv_mem_region *region;
-	u32 unmap_flags = 0;
 
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1383,18 +1412,8 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	    region->nr_pages != HVPFN_DOWN(mem.size))
 		return -EINVAL;
 
-	hlist_del(&region->hnode);
+	mshv_partition_destroy_region(region);
 
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
-
-	mshv_region_invalidate(region);
-
-	vfree(region);
 	return 0;
 }
 
@@ -1730,8 +1749,8 @@ static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
 	struct mshv_mem_region *region;
-	int i, ret;
 	struct hlist_node *n;
+	int i;
 
 	if (refcount_read(&partition->pt_ref_count)) {
 		pt_err(partition,
@@ -1795,25 +1814,9 @@ static void destroy_partition(struct mshv_partition *partition)
 
 	remove_partition(partition);
 
-	/* Remove regions, regain access to the memory and unpin the pages */
 	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
-				  hnode) {
-		hlist_del(&region->hnode);
-
-		if (mshv_partition_encrypted(partition)) {
-			ret = mshv_partition_region_share(region);
-			if (ret) {
-				pt_err(partition,
-				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
-				      ret);
-				return;
-			}
-		}
-
-		mshv_region_invalidate(region);
-
-		vfree(region);
-	}
+				  hnode)
+		mshv_partition_destroy_region(region);
 
 	/* Withdraw and free all pages we deposited */
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance
  2025-11-17 16:52 [PATCH v6 0/5] Drivers: hv: Movable guest memory for child partitions Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 2/5] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
@ 2025-11-17 16:52 ` Stanislav Kinsburskii
  2025-11-18 16:28   ` Michael Kelley
  2025-11-17 16:52 ` [PATCH v6 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  4 siblings, 1 reply; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-17 16:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Reduce overhead when unmapping large memory regions by batching GPA unmap
operations in 2MB-aligned chunks.

Use a dedicated constant for batch size to improve code clarity and
maintainability.
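
For example, with MSHV_MAX_UNMAP_GPA_PAGES = 512, destroying a region that
starts at GFN 0x3f0 and spans 0x500 pages issues four unmap hypercalls:
GFNs 0x3f0-0x3ff (16 pages, up to the next 512-page boundary), 0x400-0x5ff
(512 pages), 0x600-0x7ff (512 pages), and 0x800-0x8ef (240 pages). Only the
two middle batches are both 2MB-aligned and 2MB-sized, so only they can use
the large-page unmap flag when the region is backed by large pages.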

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root.h         |    2 ++
 drivers/hv/mshv_root_hv_call.c |    2 +-
 drivers/hv/mshv_root_main.c    |   36 ++++++++++++++++++++++++++++++------
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3eb815011b46..5eece7077f8b 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -32,6 +32,8 @@ static_assert(HV_HYP_PAGE_SIZE == MSHV_HV_PAGE_SIZE);
 
 #define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
 
+#define MSHV_MAX_UNMAP_GPA_PAGES	512
+
 struct mshv_vp {
 	u32 vp_index;
 	struct mshv_partition *vp_partition;
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index caf02cfa49c9..8fa983f1109b 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -17,7 +17,7 @@
 /* Determined empirically */
 #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
 #define HV_MAP_GPA_DEPOSIT_PAGES	256
-#define HV_UMAP_GPA_PAGES		512
+#define HV_UMAP_GPA_PAGES		MSHV_MAX_UNMAP_GPA_PAGES
 
 #define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index a85872b72b1a..ef36d8115d8a 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1365,7 +1365,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
-	u32 unmap_flags = 0;
+	u64 gfn, gfn_count, start_gfn, end_gfn;
 	int ret;
 
 	hlist_del(&region->hnode);
@@ -1380,12 +1380,36 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 		}
 	}
 
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+	start_gfn = region->start_gfn;
+	end_gfn = region->start_gfn + region->nr_pages;
+
+	for (gfn = start_gfn; gfn < end_gfn; gfn += gfn_count) {
+		u32 unmap_flags = 0;
+
+		if (gfn % MSHV_MAX_UNMAP_GPA_PAGES)
+			gfn_count = ALIGN(gfn, MSHV_MAX_UNMAP_GPA_PAGES) - gfn;
+		else
+			gfn_count = MSHV_MAX_UNMAP_GPA_PAGES;
+
+		if (gfn + gfn_count > end_gfn)
+			gfn_count = end_gfn - gfn;
 
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
+		/* Skip all pages in this range if none are mapped */
+		if (!memchr_inv(region->pages + (gfn - start_gfn), 0,
+				gfn_count * sizeof(struct page *)))
+			continue;
+
+		if (region->flags.large_pages &&
+		    VALUE_PMD_ALIGNED(gfn) && VALUE_PMD_ALIGNED(gfn_count))
+			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+		ret = hv_call_unmap_gpa_pages(partition->pt_id, gfn,
+					      gfn_count, unmap_flags);
+		if (ret)
+			pt_err(partition,
+			       "Failed to unmap GPA pages %#llx-%#llx: %d\n",
+			       gfn, gfn + gfn_count - 1, ret);
+	}
 
 	mshv_region_invalidate(region);
 



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned
  2025-11-17 16:52 [PATCH v6 0/5] Drivers: hv: Movable guest memory for child partitions Stanislav Kinsburskii
                   ` (2 preceding siblings ...)
  2025-11-17 16:52 ` [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
@ 2025-11-17 16:52 ` Stanislav Kinsburskii
  2025-11-17 16:52 ` [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  4 siblings, 0 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-17 16:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

With the upcoming introduction of movable pages, a region is no longer
guaranteed to be mapped entirely with large pages. Both mapping on fault
and unmapping during PTE invalidation may happen on ranges that are not
2M-aligned, while the hypervisor requires both the GFN and the page count
to be 2M-aligned in order to use the large page flag.

Update the large page logic in mshv_region_remap_pages() to set
HV_MAP_GPA_LARGE_PAGE only when both the target GFN (start_gfn +
page_offset) and page_count are PMD-aligned.
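
For example, a region with start_gfn = 0x12345 is not 2M-aligned in guest
physical space: a PMD-aligned page_offset of 0x200 lands on GFN 0x12545,
which is still not 2M-aligned, so HV_MAP_GPA_LARGE_PAGE must not be set for
that mapping even though the offset within the region looks aligned.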

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index ef36d8115d8a..73aaa149c14c 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -34,6 +34,8 @@
 #include "mshv.h"
 #include "mshv_root.h"
 
+#define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
+
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
@@ -1100,7 +1102,9 @@ mshv_region_remap_pages(struct mshv_mem_region *region, u32 map_flags,
 	if (page_offset + page_count > region->nr_pages)
 		return -EINVAL;
 
-	if (region->flags.large_pages)
+	if (region->flags.large_pages &&
+	    VALUE_PMD_ALIGNED(region->start_gfn + page_offset) &&
+	    VALUE_PMD_ALIGNED(page_count))
 		map_flags |= HV_MAP_GPA_LARGE_PAGE;
 
 	/* ask the hypervisor to map guest ram */



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
  2025-11-17 16:52 [PATCH v6 0/5] Drivers: hv: Movable guest memory for child partitions Stanislav Kinsburskii
                   ` (3 preceding siblings ...)
  2025-11-17 16:52 ` [PATCH v6 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned Stanislav Kinsburskii
@ 2025-11-17 16:52 ` Stanislav Kinsburskii
  2025-11-18 16:29   ` Michael Kelley
  2025-11-19 18:06   ` kernel test robot
  4 siblings, 2 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-17 16:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Introduce support for movable memory regions in the Hyper-V root partition
driver, thus improving memory management flexibility and preparing the
driver for advanced use cases such as dynamic memory remapping.

Integrate mmu_interval_notifier for movable regions, implement functions to
handle HMM faults and memory invalidation, and update memory region mapping
logic to support movable regions.

While raw MMU notifiers are commonly used in virtualization drivers, this
implementation builds on HMM (Heterogeneous Memory Management) instead.
HMM provides a ready-made framework for address-space mirroring,
invalidation, and fault handling, so these mechanisms do not have to be
reimplemented around a single notifier callback. Although MMU notifiers are
more generic, HMM reduces boilerplate and keeps the code maintainable by
reusing infrastructure designed for exactly this kind of mirroring.

Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Kconfig          |    1 
 drivers/hv/mshv_root.h      |    8 +
 drivers/hv/mshv_root_main.c |  328 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 327 insertions(+), 10 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 0b8c391a0342..5f1637cbb6e3 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -75,6 +75,7 @@ config MSHV_ROOT
 	depends on PAGE_SIZE_4KB
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
+	select HMM_MIRROR
 	default n
 	help
 	  Select this option to enable support for booting and running as root
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 5eece7077f8b..117399dea780 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -15,6 +15,7 @@
 #include <linux/hashtable.h>
 #include <linux/dev_printk.h>
 #include <linux/build_bug.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
 
 /*
@@ -81,9 +82,14 @@ struct mshv_mem_region {
 	struct {
 		u64 large_pages:  1; /* 2MiB */
 		u64 range_pinned: 1;
-		u64 reserved:	 62;
+		u64 is_ram	: 1; /* mem region can be ram or mmio */
+		u64 reserved:	 61;
 	} flags;
 	struct mshv_partition *partition;
+#if defined(CONFIG_MMU_NOTIFIER)
+	struct mmu_interval_notifier mni;
+	struct mutex mutex;	/* protects region pages remapping */
+#endif
 	struct page *pages[];
 };
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 73aaa149c14c..fe0c5eaa1248 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,6 +29,7 @@
 #include <linux/crash_dump.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
+#include <linux/hmm.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
@@ -36,6 +37,8 @@
 
 #define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
 
+#define MSHV_MAP_FAULT_IN_PAGES			HPAGE_PMD_NR
+
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
@@ -76,6 +79,11 @@ static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
 static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
 static int mshv_init_async_handler(struct mshv_partition *partition);
 static void mshv_async_hvcall_handler(void *data, u64 *status);
+static struct mshv_mem_region
+	*mshv_partition_region_by_gfn(struct mshv_partition *pt, u64 gfn);
+static int mshv_region_remap_pages(struct mshv_mem_region *region,
+				   u32 map_flags, u64 page_offset,
+				   u64 page_count);
 
 static const union hv_input_vtl input_vtl_zero;
 static const union hv_input_vtl input_vtl_normal = {
@@ -602,14 +610,197 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
+#ifdef CONFIG_X86_64
+
+#if defined(CONFIG_MMU_NOTIFIER)
+/**
+ * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * @region: Pointer to the memory region structure
+ * @range: Pointer to the HMM range structure
+ *
+ * This function performs the following steps:
+ * 1. Reads the notifier sequence for the HMM range.
+ * 2. Acquires a read lock on the memory map.
+ * 3. Handles HMM faults for the specified range.
+ * 4. Releases the read lock on the memory map.
+ * 5. If successful, locks the memory region mutex.
+ * 6. Verifies if the notifier sequence has changed during the operation.
+ *    If it has, releases the mutex and returns -EBUSY to match with
+ *    hmm_range_fault() return code for repeating.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+					  struct hmm_range *range)
+{
+	int ret;
+
+	range->notifier_seq = mmu_interval_read_begin(range->notifier);
+	mmap_read_lock(region->mni.mm);
+	ret = hmm_range_fault(range);
+	mmap_read_unlock(region->mni.mm);
+	if (ret)
+		return ret;
+
+	mutex_lock(&region->mutex);
+
+	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+		mutex_unlock(&region->mutex);
+		cond_resched();
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/**
+ * mshv_region_range_fault - Handle memory range faults for a given region.
+ * @region: Pointer to the memory region structure.
+ * @page_offset: Offset of the page within the region.
+ * @page_count: Number of pages to handle.
+ *
+ * This function resolves memory faults for a specified range of pages
+ * within a memory region. It uses HMM (Heterogeneous Memory Management)
+ * to fault in the required pages and updates the region's page array.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	struct hmm_range range = {
+		.notifier = &region->mni,
+		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
+	};
+	unsigned long *pfns;
+	int ret;
+	u64 i;
+
+	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
+	if (!pfns)
+		return -ENOMEM;
+
+	range.hmm_pfns = pfns;
+	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+
+	do {
+		ret = mshv_region_hmm_fault_and_lock(region, &range);
+	} while (ret == -EBUSY);
+
+	if (ret)
+		goto out;
+
+	for (i = 0; i < page_count; i++)
+		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+
+	if (PageHuge(region->pages[page_offset]))
+		region->flags.large_pages = true;
+
+	ret = mshv_region_remap_pages(region, region->hv_map_flags,
+				      page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+out:
+	kfree(pfns);
+	return ret;
+}
+#else /* CONFIG_MMU_NOTIFIER */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+{
+	u64 page_offset, page_count;
+	int ret;
+
+	if (WARN_ON_ONCE(region->flags.range_pinned))
+		return false;
+
+	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
+	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				 MSHV_MAP_FAULT_IN_PAGES);
+
+	/* Map more pages than requested to reduce the number of faults. */
+	page_count = min(region->nr_pages - page_offset,
+			 MSHV_MAP_FAULT_IN_PAGES);
+
+	ret = mshv_region_range_fault(region, page_offset, page_count);
+
+	WARN_ONCE(ret,
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  region->partition->pt_id, region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  gfn, page_offset, page_count);
+
+	return !ret;
+}
+
+/**
+ * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
+ * @vp: Pointer to the virtual processor structure.
+ *
+ * This function processes GPA intercepts by identifying the memory region
+ * corresponding to the intercepted GPA, aligning the page offset, and
+ * mapping the required pages. It ensures that the region is valid and
+ * handles faults efficiently by mapping multiple pages at once.
+ *
+ * Return: true if the intercept was handled successfully, false otherwise.
+ */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+	struct mshv_mem_region *region;
+	struct hv_x64_memory_intercept_message *msg;
+	u64 gfn;
+
+	msg = (struct hv_x64_memory_intercept_message *)
+		vp->vp_intercept_msg_page->u.payload;
+
+	gfn = HVPFN_DOWN(msg->guest_physical_address);
+
+	region = mshv_partition_region_by_gfn(p, gfn);
+	if (!region)
+		return false;
+
+	if (WARN_ON_ONCE(!region->flags.is_ram))
+		return false;
+
+	if (WARN_ON_ONCE(region->flags.range_pinned))
+		return false;
+
+	return mshv_region_handle_gfn_fault(region, gfn);
+}
+
+#else	/* CONFIG_X86_64 */
+
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
+
+#endif	/* CONFIG_X86_64 */
+
+static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
+{
+	switch (vp->vp_intercept_msg_page->header.message_type) {
+	case HVMSG_GPA_INTERCEPT:
+		return mshv_handle_gpa_intercept(vp);
+	}
+	return false;
+}
+
 static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;
 
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		rc = mshv_run_vp_with_root_scheduler(vp);
-	else
-		rc = mshv_run_vp_with_hyp_scheduler(vp);
+	do {
+		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+			rc = mshv_run_vp_with_root_scheduler(vp);
+		else
+			rc = mshv_run_vp_with_hyp_scheduler(vp);
+	} while (rc == 0 && mshv_vp_handle_intercept(vp));
 
 	if (rc)
 		return rc;
@@ -1194,6 +1385,110 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 	return NULL;
 }
 
+#if defined(CONFIG_MMU_NOTIFIER)
+static void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+	if (region->flags.range_pinned)
+		return;
+
+	mmu_interval_notifier_remove(&region->mni);
+}
+
+/**
+ * mshv_region_interval_invalidate - Invalidate a range of memory region
+ * @mni: Pointer to the mmu_interval_notifier structure
+ * @range: Pointer to the mmu_notifier_range structure
+ * @cur_seq: Current sequence number for the interval notifier
+ *
+ * This function invalidates a memory region by remapping its pages with
+ * no access permissions. It locks the region's mutex to ensure thread safety
+ * and updates the sequence number for the interval notifier. If the range
+ * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
+ * lock and returns false if unsuccessful.
+ *
+ * NOTE: Failure to invalidate a region is a serious error, as the pages will
+ * be considered freed while they are still mapped by the hypervisor.
+ * Any attempt to access such pages will likely crash the system.
+ *
+ * Return: true if the region was successfully invalidated, false otherwise.
+ */
+static bool
+mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
+				const struct mmu_notifier_range *range,
+				unsigned long cur_seq)
+{
+	struct mshv_mem_region *region = container_of(mni,
+						struct mshv_mem_region,
+						mni);
+	u64 page_offset, page_count;
+	unsigned long mstart, mend;
+	int ret = -EPERM;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&region->mutex);
+	else if (!mutex_trylock(&region->mutex))
+		goto out_fail;
+
+	mmu_interval_set_seq(mni, cur_seq);
+
+	mstart = max(range->start, region->start_uaddr);
+	mend = min(range->end, region->start_uaddr +
+		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+
+	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	page_count = HVPFN_DOWN(mend - mstart);
+
+	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+				      page_offset, page_count);
+	if (ret)
+		goto out_fail;
+
+	mshv_region_invalidate_pages(region, page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+
+	return true;
+
+out_fail:
+	WARN_ONCE(ret,
+		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
+		  region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  range->start, range->end, range->event,
+		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+	return false;
+}
+
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
+	.invalidate = mshv_region_interval_invalidate,
+};
+
+static bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	int ret;
+
+	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+					   region->start_uaddr,
+					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   &mshv_region_mni_ops);
+	if (ret)
+		return false;
+
+	mutex_init(&region->mutex);
+
+	return true;
+}
+#else
+static inline void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+}
+
+static inline bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	return false;
+}
+#endif
+
 /*
  * NB: caller checks and makes sure mem->size is page aligned
  * Returns: 0 with regionpp updated on success, or -errno
@@ -1228,9 +1523,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	/* Note: large_pages flag populated when we pin the pages */
-	if (!is_mmio)
-		region->flags.range_pinned = true;
+	/* Note: large_pages flag populated when pages are allocated. */
+	if (!is_mmio) {
+		region->flags.is_ram = true;
+
+		if (mshv_partition_encrypted(partition) ||
+		    !mshv_region_movable_init(region))
+			region->flags.range_pinned = true;
+	}
 
 	region->partition = partition;
 
@@ -1350,9 +1650,16 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	if (is_mmio)
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
-	else
+	else if (region->flags.range_pinned)
 		ret = mshv_prepare_pinned_region(region);
-
+	else
+		/*
+		 * For non-pinned regions, remap with no access to let the
+		 * hypervisor track dirty pages, enabling pre-copy live
+		 * migration.
+		 */
+		ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+					      0, region->nr_pages);
 	if (ret)
 		goto errout;
 
@@ -1374,6 +1681,9 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
 
 	hlist_del(&region->hnode);
 
+	if (region->flags.is_ram)
+		mshv_region_movable_fini(region);
+
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_partition_region_share(region);
 		if (ret) {



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance
  2025-11-17 16:52 ` [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
@ 2025-11-18 16:28   ` Michael Kelley
  2025-11-19 17:42     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Kelley @ 2025-11-18 16:28 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM
> 
> Reduce overhead when unmapping large memory regions by batching GPA unmap
> operations in 2MB-aligned chunks.
> 
> Use a dedicated constant for batch size to improve code clarity and
> maintainability.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h         |    2 ++
>  drivers/hv/mshv_root_hv_call.c |    2 +-
>  drivers/hv/mshv_root_main.c    |   36 ++++++++++++++++++++++++++++++------
>  3 files changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 3eb815011b46..5eece7077f8b 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -32,6 +32,8 @@ static_assert(HV_HYP_PAGE_SIZE == MSHV_HV_PAGE_SIZE);
> 
>  #define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
> 
> +#define MSHV_MAX_UNMAP_GPA_PAGES	512
> +
>  struct mshv_vp {
>  	u32 vp_index;
>  	struct mshv_partition *vp_partition;
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index caf02cfa49c9..8fa983f1109b 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -17,7 +17,7 @@
>  /* Determined empirically */
>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
>  #define HV_MAP_GPA_DEPOSIT_PAGES	256
> -#define HV_UMAP_GPA_PAGES		512
> +#define HV_UMAP_GPA_PAGES		MSHV_MAX_UNMAP_GPA_PAGES
> 
>  #define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index a85872b72b1a..ef36d8115d8a 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1365,7 +1365,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  static void mshv_partition_destroy_region(struct mshv_mem_region *region)
>  {
>  	struct mshv_partition *partition = region->partition;
> -	u32 unmap_flags = 0;
> +	u64 gfn, gfn_count, start_gfn, end_gfn;
>  	int ret;
> 
>  	hlist_del(&region->hnode);
> @@ -1380,12 +1380,36 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
>  		}
>  	}
> 
> -	if (region->flags.large_pages)
> -		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> +	start_gfn = region->start_gfn;
> +	end_gfn = region->start_gfn + region->nr_pages;
> +
> +	for (gfn = start_gfn; gfn < end_gfn; gfn += gfn_count) {
> +		u32 unmap_flags = 0;
> +
> +		if (gfn % MSHV_MAX_UNMAP_GPA_PAGES)
> +			gfn_count = ALIGN(gfn, MSHV_MAX_UNMAP_GPA_PAGES) - gfn;
> +		else
> +			gfn_count = MSHV_MAX_UNMAP_GPA_PAGES;
> +
> +		if (gfn + gfn_count > end_gfn)
> +			gfn_count = end_gfn - gfn;
> 
> -	/* ignore unmap failures and continue as process may be exiting */
> -	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> -				region->nr_pages, unmap_flags);
> +		/* Skip all pages in this range if none are mapped */
> +		if (!memchr_inv(region->pages + (gfn - start_gfn), 0,
> +				gfn_count * sizeof(struct page *)))
> +			continue;
> +
> +		if (region->flags.large_pages &&
> +		    VALUE_PMD_ALIGNED(gfn) && VALUE_PMD_ALIGNED(gfn_count))

VALUE_PMD_ALIGNED isn't defined until Patch 4 of this series.

The idea of a page count being PMD aligned occurs in a few other places in
Linux kernel code, and it is usually written as IS_ALIGNED(count, PTRS_PER_PMD),
though there's one occurrence of !(count % PTRS_PER_PMD).

Also mshv_root_hv_call.c has HV_PAGE_COUNT_2M_ALIGNED() that does
the same thing. Some macro consolidation is appropriate, or just open code
as IS_ALIGNED(<cnt>, PTRS_PER_PMD) and eliminate the macros.
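
i.e., open-coded here it would be something like:

	if (region->flags.large_pages &&
	    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
	    IS_ALIGNED(gfn_count, PTRS_PER_PMD))
		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;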

> +			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> +
> +		ret = hv_call_unmap_gpa_pages(partition->pt_id, gfn,
> +					      gfn_count, unmap_flags);
> +		if (ret)
> +			pt_err(partition,
> +			       "Failed to unmap GPA pages %#llx-%#llx: %d\n",
> +			       gfn, gfn + gfn_count - 1, ret);
> +	}
> 
>  	mshv_region_invalidate(region);
> 
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
  2025-11-17 16:52 ` [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
@ 2025-11-18 16:29   ` Michael Kelley
  2025-11-20  0:45     ` Stanislav Kinsburskii
  2025-11-19 18:06   ` kernel test robot
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Kelley @ 2025-11-18 16:29 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM
> 
> Introduce support for movable memory regions in the Hyper-V root partition
> driver, thus improving memory management flexibility and preparing the
> driver for advanced use cases such as dynamic memory remapping.
> 
> Integrate mmu_interval_notifier for movable regions, implement functions to
> handle HMM faults and memory invalidation, and update memory region mapping
> logic to support movable regions.
> 
> While MMU notifiers are commonly used in virtualization drivers, this
> implementation leverages HMM (Heterogeneous Memory Management) for its
> tailored functionality. HMM provides a ready-made framework for mirroring,
> invalidation, and fault handling, avoiding the need to reimplement these
> mechanisms for a single callback. Although MMU notifiers are more generic,
> using HMM reduces boilerplate and ensures maintainability by utilizing a
> mechanism specifically designed for such use cases.

I wasn't previously familiar with HMM when reviewing this code, so I had to
work through the details, and mentally build a high-level view of how this
case use maps to concepts in the Linux kernel documentation for HMM.
Here's my take:

In this use case, the goal is what HMM calls "address space mirroring"
between Linux in the root partition and a guest VM. Linux in the root
partition is the primary owner of the memory. Each guest VM is a
"device" from the HMM standpoint, and the device page tables are the
hypervisor's SLAT entries that are managed using hypercalls to map and
unmap memory. When a guest VM is using unpinned memory, the guest
VM faults and generates a VP intercept when it first touches a page in its
GFN space. MSHV is the device driver for the "device", and it handles the
VP intercept by calling hmm_range_fault() to populate the pages, and then
making hypercalls to set up the "device" page tables (i.e., the SLAT entries).
Linux in the root partition can reclaim ownership of a memory range via
the HMM "invalidate" callback, which MSHV handles by making hypercalls
to clear the SLAT entries. After such an invalidate, the guest VM will fault
again if it touches the memory range.

Is this a correct understanding?  If so, including such a summary in the
commit message or as a code comment would be helpful to people
in the future who are looking at the code.

> 
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/Kconfig          |    1
>  drivers/hv/mshv_root.h      |    8 +
>  drivers/hv/mshv_root_main.c |  328 ++++++++++++++++++++++++++++++-
>  3 files changed, 327 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 0b8c391a0342..5f1637cbb6e3 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
>  	depends on PAGE_SIZE_4KB
>  	select EVENTFD
>  	select VIRT_XFER_TO_GUEST_WORK
> +	select HMM_MIRROR

Couldn't you also do "select MMU_NOTIFIER" to avoid the #ifdef's
and stubs for when it isn't selected? There are other Linux kernel
drivers that select it. Or is the intent to allow building an image that
doesn't support unpinned memory, and the #ifdef's save space?

>  	default n
>  	help
>  	  Select this option to enable support for booting and running as root
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 5eece7077f8b..117399dea780 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -15,6 +15,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/dev_printk.h>
>  #include <linux/build_bug.h>
> +#include <linux/mmu_notifier.h>
>  #include <uapi/linux/mshv.h>
> 
>  /*
> @@ -81,9 +82,14 @@ struct mshv_mem_region {
>  	struct {
>  		u64 large_pages:  1; /* 2MiB */
>  		u64 range_pinned: 1;
> -		u64 reserved:	 62;
> +		u64 is_ram	: 1; /* mem region can be ram or mmio */

The combination of "range_pinned" and "is_ram" effectively define three
MSHV region types:

1) MMIO
2) Pinned RAM
3) Unpinned RAM

Would it make sense to have an enum instead of 2 booleans? It seems
like that would simplify the code a bit in a couple places. For example,
mshv_handle_gpa_intercept() could just do one WARN_ON() instead of two.
Also you would not need mshv_partition_destroy_region() testing "is_ram",
and then mshv_region_movable_fini() testing "range_pinned". A single test
could cover both.
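
For illustration, something like (names are just a strawman):

	enum mshv_region_type {
		MSHV_REGION_MMIO,
		MSHV_REGION_RAM_PINNED,
		MSHV_REGION_RAM_MOVABLE,
	};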

Just a suggestion. Ignore my comment if you prefer the 2 booleans.

> +		u64 reserved:	 61;
>  	} flags;
>  	struct mshv_partition *partition;
> +#if defined(CONFIG_MMU_NOTIFIER)
> +	struct mmu_interval_notifier mni;
> +	struct mutex mutex;	/* protects region pages remapping */
> +#endif
>  	struct page *pages[];
>  };
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 73aaa149c14c..fe0c5eaa1248 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -29,6 +29,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/vmalloc.h>
> +#include <linux/hmm.h>
> 
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
> @@ -36,6 +37,8 @@
> 
>  #define VALUE_PMD_ALIGNED(c)			(!((c) & (PTRS_PER_PMD - 1)))
> 
> +#define MSHV_MAP_FAULT_IN_PAGES			HPAGE_PMD_NR
> +
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
> @@ -76,6 +79,11 @@ static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
>  static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
>  static int mshv_init_async_handler(struct mshv_partition *partition);
>  static void mshv_async_hvcall_handler(void *data, u64 *status);
> +static struct mshv_mem_region
> +	*mshv_partition_region_by_gfn(struct mshv_partition *pt, u64 gfn);
> +static int mshv_region_remap_pages(struct mshv_mem_region *region,
> +				   u32 map_flags, u64 page_offset,
> +				   u64 page_count);
> 
>  static const union hv_input_vtl input_vtl_zero;
>  static const union hv_input_vtl input_vtl_normal = {
> @@ -602,14 +610,197 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
>  static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
>  	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
> 
> +#ifdef CONFIG_X86_64
> +
> +#if defined(CONFIG_MMU_NOTIFIER)
> +/**
> + * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> + * @region: Pointer to the memory region structure
> + * @range: Pointer to the HMM range structure
> + *
> + * This function performs the following steps:
> + * 1. Reads the notifier sequence for the HMM range.
> + * 2. Acquires a read lock on the memory map.
> + * 3. Handles HMM faults for the specified range.
> + * 4. Releases the read lock on the memory map.
> + * 5. If successful, locks the memory region mutex.
> + * 6. Verifies if the notifier sequence has changed during the operation.
> + *    If it has, releases the mutex and returns -EBUSY to match with
> + *    hmm_range_fault() return code for repeating.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> +					  struct hmm_range *range)
> +{
> +	int ret;
> +
> +	range->notifier_seq = mmu_interval_read_begin(range->notifier);
> +	mmap_read_lock(region->mni.mm);
> +	ret = hmm_range_fault(range);
> +	mmap_read_unlock(region->mni.mm);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&region->mutex);
> +
> +	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> +		mutex_unlock(&region->mutex);
> +		cond_resched();
> +		return -EBUSY;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * mshv_region_range_fault - Handle memory range faults for a given region.
> + * @region: Pointer to the memory region structure.
> + * @page_offset: Offset of the page within the region.
> + * @page_count: Number of pages to handle.
> + *
> + * This function resolves memory faults for a specified range of pages
> + * within a memory region. It uses HMM (Heterogeneous Memory Management)
> + * to fault in the required pages and updates the region's page array.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> +				   u64 page_offset, u64 page_count)
> +{
> +	struct hmm_range range = {
> +		.notifier = &region->mni,
> +		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> +	};
> +	unsigned long *pfns;
> +	int ret;
> +	u64 i;
> +
> +	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
> +	if (!pfns)
> +		return -ENOMEM;
> +
> +	range.hmm_pfns = pfns;
> +	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> +	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> +
> +	do {
> +		ret = mshv_region_hmm_fault_and_lock(region, &range);
> +	} while (ret == -EBUSY);

I expected the looping on -EBUSY to be in mshv_region_hmm_fault_and_lock(),
but I guess it really doesn't matter.

> +
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < page_count; i++)
> +		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);

The comment with hmm_pfn_to_page() says that the caller is assumed to
have tested the pfn for HMM_PFN_VALID, which I don't see done anywhere.
Is there a reason it's not necessary to test?

> +
> +	if (PageHuge(region->pages[page_offset]))
> +		region->flags.large_pages = true;

See comment below in mshv_region_handle_gfn_fault().

> +
> +	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> +				      page_offset, page_count);
> +
> +	mutex_unlock(&region->mutex);
> +out:
> +	kfree(pfns);
> +	return ret;
> +}
> +#else /* CONFIG_MMU_NOTIFIER */
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> +				   u64 page_offset, u64 page_count)
> +{
> +	return -ENODEV;
> +}
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> +{
> +	u64 page_offset, page_count;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(region->flags.range_pinned))
> +		return false;
> +
> +	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> +	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> +				 MSHV_MAP_FAULT_IN_PAGES);
> +
> +	/* Map more pages than requested to reduce the number of faults. */
> +	page_count = min(region->nr_pages - page_offset,
> +			 MSHV_MAP_FAULT_IN_PAGES);

These computations make the range defined by page_offset and page_count
start on a 512 page boundary relative to start_gfn, and have a size that is a
multiple of 512 pages. But they don't ensure that the range aligns to a large page
boundary within gfn space since region->start_gfn may not be a multiple of
512. Then mshv_region_range_fault() tests the first page of the range for
being a large page, and if so, sets region->large_pages. This doesn't make
sense to me if the range doesn't align to a large page boundary.

Does this code need to make sure that the range is aligned to a large
page boundary in gfn space? Or am I misunderstanding what the
region->large_pages flag means? Given the fixes in this v6 of the
patch set, I was thinking that region->large_pages means that every
large page aligned area within the region is a large page. If region->
start_gfn and region->nr_pages aren't multiples of 512, then there
may be an initial range and a final range that aren't large pages,
but everything in between is. If that's not a correct understanding,
could you clarify the exact meaning of the region->large_pages
flag?

> +
> +	ret = mshv_region_range_fault(region, page_offset, page_count);
> +
> +	WARN_ONCE(ret,
> +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> +		  region->partition->pt_id, region->start_uaddr,
> +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> +		  gfn, page_offset, page_count);
> +
> +	return !ret;
> +}
> +
> +/**
> + * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
> + * @vp: Pointer to the virtual processor structure.
> + *
> + * This function processes GPA intercepts by identifying the memory region
> + * corresponding to the intercepted GPA, aligning the page offset, and
> + * mapping the required pages. It ensures that the region is valid and
> + * handles faults efficiently by mapping multiple pages at once.
> + *
> + * Return: true if the intercept was handled successfully, false otherwise.
> + */
> +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> +{
> +	struct mshv_partition *p = vp->vp_partition;
> +	struct mshv_mem_region *region;
> +	struct hv_x64_memory_intercept_message *msg;
> +	u64 gfn;
> +
> +	msg = (struct hv_x64_memory_intercept_message *)
> +		vp->vp_intercept_msg_page->u.payload;
> +
> +	gfn = HVPFN_DOWN(msg->guest_physical_address);
> +
> +	region = mshv_partition_region_by_gfn(p, gfn);
> +	if (!region)
> +		return false;

Does it ever happen that the gfn is legitimately not found in any
region, perhaps due to a race? I think the vp_mutex is held here,
so maybe that protects the region layout for the VP and "not found"
should never occur. If so, should there be a WARN_ON here?

If "gfn not found" can be legitimate, perhaps a comment to
explain the circumstances would be helpful.

> +
> +	if (WARN_ON_ONCE(!region->flags.is_ram))
> +		return false;
> +
> +	if (WARN_ON_ONCE(region->flags.range_pinned))
> +		return false;
> +
> +	return mshv_region_handle_gfn_fault(region, gfn);
> +}
> +
> +#else	/* CONFIG_X86_64 */
> +
> +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> +{
> +	switch (vp->vp_intercept_msg_page->header.message_type) {
> +	case HVMSG_GPA_INTERCEPT:
> +		return mshv_handle_gpa_intercept(vp);
> +	}
> +	return false;
> +}
> +
>  static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
>  {
>  	long rc;
> 
> -	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> -		rc = mshv_run_vp_with_root_scheduler(vp);
> -	else
> -		rc = mshv_run_vp_with_hyp_scheduler(vp);
> +	do {
> +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> +			rc = mshv_run_vp_with_root_scheduler(vp);
> +		else
> +			rc = mshv_run_vp_with_hyp_scheduler(vp);
> +	} while (rc == 0 && mshv_vp_handle_intercept(vp));
> 
>  	if (rc)
>  		return rc;
> @@ -1194,6 +1385,110 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>  	return NULL;
>  }
> 
> +#if defined(CONFIG_MMU_NOTIFIER)
> +static void mshv_region_movable_fini(struct mshv_mem_region *region)
> +{
> +	if (region->flags.range_pinned)
> +		return;
> +
> +	mmu_interval_notifier_remove(&region->mni);
> +}
> +
> +/**
> + * mshv_region_interval_invalidate - Invalidate a range of memory region
> + * @mni: Pointer to the mmu_interval_notifier structure
> + * @range: Pointer to the mmu_notifier_range structure
> + * @cur_seq: Current sequence number for the interval notifier
> + *
> + * This function invalidates a memory region by remapping its pages with
> + * no access permissions. It locks the region's mutex to ensure thread safety
> + * and updates the sequence number for the interval notifier. If the range
> + * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
> + * lock and returns false if unsuccessful.
> + *
> + * NOTE: Failure to invalidate a region is a serious error, as the pages will
> + * be considered freed while they are still mapped by the hypervisor.
> + * Any attempt to access such pages will likely crash the system.
> + *
> + * Return: true if the region was successfully invalidated, false otherwise.
> + */
> +static bool
> +mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> +				const struct mmu_notifier_range *range,
> +				unsigned long cur_seq)
> +{
> +	struct mshv_mem_region *region = container_of(mni,
> +						struct mshv_mem_region,
> +						mni);
> +	u64 page_offset, page_count;
> +	unsigned long mstart, mend;
> +	int ret = -EPERM;
> +
> +	if (mmu_notifier_range_blockable(range))
> +		mutex_lock(&region->mutex);
> +	else if (!mutex_trylock(&region->mutex))
> +		goto out_fail;
> +
> +	mmu_interval_set_seq(mni, cur_seq);
> +
> +	mstart = max(range->start, region->start_uaddr);
> +	mend = min(range->end, region->start_uaddr +
> +		   (region->nr_pages << HV_HYP_PAGE_SHIFT));

I'm pretty sure region->start_uaddr is always page aligned. But what
about range->start and range->end?  The code here and below assumes
they are page aligned. It also assumes that range->end is greater than
range->start so the computation of page_count doesn't wrap and so
page_count is >= 1. I don't know whether checks for these assumptions
are appropriate.
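If checks were wanted, something minimal like this would cover both
assumptions (a sketch only, using existing helpers; whether the extra
code is worth it is exactly what I'm unsure about):

	/* Reject unexpected notifier input before taking region->mutex */
	if (WARN_ON_ONCE(!PAGE_ALIGNED(range->start) ||
			 !PAGE_ALIGNED(range->end) ||
			 range->end <= range->start))
		return false;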

> +
> +	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> +	page_count = HVPFN_DOWN(mend - mstart);
> +
> +	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> +				      page_offset, page_count);
> +	if (ret)
> +		goto out_fail;
> +
> +	mshv_region_invalidate_pages(region, page_offset, page_count);
> +
> +	mutex_unlock(&region->mutex);
> +
> +	return true;
> +
> +out_fail:
> +	WARN_ONCE(ret,
> +		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
> +		  region->start_uaddr,
> +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> +		  range->start, range->end, range->event,
> +		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> +	return false;
> +}
> +
> +static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
> +	.invalidate = mshv_region_interval_invalidate,
> +};
> +
> +static bool mshv_region_movable_init(struct mshv_mem_region *region)
> +{
> +	int ret;
> +
> +	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
> +					   region->start_uaddr,
> +					   region->nr_pages << HV_HYP_PAGE_SHIFT,
> +					   &mshv_region_mni_ops);
> +	if (ret)
> +		return false;
> +
> +	mutex_init(&region->mutex);
> +
> +	return true;
> +}
> +#else
> +static inline void mshv_region_movable_fini(struct mshv_mem_region *region)
> +{
> +}
> +
> +static inline bool mshv_region_movable_init(struct mshv_mem_region *region)
> +{
> +	return false;
> +}
> +#endif
> +
>  /*
>   * NB: caller checks and makes sure mem->size is page aligned
>   * Returns: 0 with regionpp updated on success, or -errno
> @@ -1228,9 +1523,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
>  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> 
> -	/* Note: large_pages flag populated when we pin the pages */
> -	if (!is_mmio)
> -		region->flags.range_pinned = true;
> +	/* Note: large_pages flag populated when pages are allocated. */
> +	if (!is_mmio) {
> +		region->flags.is_ram = true;
> +
> +		if (mshv_partition_encrypted(partition) ||
> +		    !mshv_region_movable_init(region))
> +			region->flags.range_pinned = true;
> +	}
> 
>  	region->partition = partition;
> 
> @@ -1350,9 +1650,16 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  	if (is_mmio)
>  		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
>  					     mmio_pfn, HVPFN_DOWN(mem.size));
> -	else
> +	else if (region->flags.range_pinned)
>  		ret = mshv_prepare_pinned_region(region);
> -
> +	else
> +		/*
> +		 * For non-pinned regions, remap with no access to let the
> +		 * hypervisor track dirty pages, enabling pre-copy live
> +		 * migration.
> +		 */
> +		ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> +					      0, region->nr_pages);
>  	if (ret)
>  		goto errout;
> 
> @@ -1374,6 +1681,9 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> 
>  	hlist_del(&region->hnode);
> 
> +	if (region->flags.is_ram)
> +		mshv_region_movable_fini(region);
> +
>  	if (mshv_partition_encrypted(partition)) {
>  		ret = mshv_partition_region_share(region);
>  		if (ret) {
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance
  2025-11-18 16:28   ` Michael Kelley
@ 2025-11-19 17:42     ` Stanislav Kinsburskii
  0 siblings, 0 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-19 17:42 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Tue, Nov 18, 2025 at 04:28:55PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM
> > 
> > Reduce overhead when unmapping large memory regions by batching GPA unmap
> > operations in 2MB-aligned chunks.
> > 
> > Use a dedicated constant for batch size to improve code clarity and
> > maintainability.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> > Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root.h         |    2 ++
> >  drivers/hv/mshv_root_hv_call.c |    2 +-
> >  drivers/hv/mshv_root_main.c    |   36 ++++++++++++++++++++++++++++++------
> >  3 files changed, 33 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 3eb815011b46..5eece7077f8b 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -32,6 +32,8 @@ static_assert(HV_HYP_PAGE_SIZE == MSHV_HV_PAGE_SIZE);
> > 
> >  #define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
> > 
> > +#define MSHV_MAX_UNMAP_GPA_PAGES	512
> > +
> >  struct mshv_vp {
> >  	u32 vp_index;
> >  	struct mshv_partition *vp_partition;
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index caf02cfa49c9..8fa983f1109b 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -17,7 +17,7 @@
> >  /* Determined empirically */
> >  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> >  #define HV_MAP_GPA_DEPOSIT_PAGES	256
> > -#define HV_UMAP_GPA_PAGES		512
> > +#define HV_UMAP_GPA_PAGES		MSHV_MAX_UNMAP_GPA_PAGES
> > 
> >  #define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
> > 
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index a85872b72b1a..ef36d8115d8a 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1365,7 +1365,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
> >  static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> >  {
> >  	struct mshv_partition *partition = region->partition;
> > -	u32 unmap_flags = 0;
> > +	u64 gfn, gfn_count, start_gfn, end_gfn;
> >  	int ret;
> > 
> >  	hlist_del(&region->hnode);
> > @@ -1380,12 +1380,36 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> >  		}
> >  	}
> > 
> > -	if (region->flags.large_pages)
> > -		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > +	start_gfn = region->start_gfn;
> > +	end_gfn = region->start_gfn + region->nr_pages;
> > +
> > +	for (gfn = start_gfn; gfn < end_gfn; gfn += gfn_count) {
> > +		u32 unmap_flags = 0;
> > +
> > +		if (gfn % MSHV_MAX_UNMAP_GPA_PAGES)
> > +			gfn_count = ALIGN(gfn, MSHV_MAX_UNMAP_GPA_PAGES) - gfn;
> > +		else
> > +			gfn_count = MSHV_MAX_UNMAP_GPA_PAGES;
> > +
> > +		if (gfn + gfn_count > end_gfn)
> > +			gfn_count = end_gfn - gfn;
> > 
> > -	/* ignore unmap failures and continue as process may be exiting */
> > -	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> > -				region->nr_pages, unmap_flags);
> > +		/* Skip all pages in this range if none are mapped */
> > +		if (!memchr_inv(region->pages + (gfn - start_gfn), 0,
> > +				gfn_count * sizeof(struct page *)))
> > +			continue;
> > +
> > +		if (region->flags.large_pages &&
> > +		    VALUE_PMD_ALIGNED(gfn) && VALUE_PMD_ALIGNED(gfn_count))
> 
> VALUE_PMD_ALIGNED isn't defined until Patch 4 of this series.
> 
> The idea of a page count being PMD aligned occurs a few other places in
> Linux kernel code, and it is usually written as IS_ALIGNED(count, PTRS_PER_PMD),
> though there's one occurrence of !(count % PTRS_PER_PMD).
> 
> Also mshv_root_hv_call.c has HV_PAGE_COUNT_2M_ALIGNED() that does
> the same thing. Some macro consolidation is appropriate, or just open code
> as IS_ALIGNED(<cnt>, PTRS_PER_PMD) and eliminate the macros.
> 

Thanks for the idea.
I'll update accordingly.
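Roughly along these lines (a sketch of what I have in mind, not the
final patch; IS_ALIGNED() and PTRS_PER_PMD are the existing kernel
definitions):

		if (region->flags.large_pages &&
		    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
		    IS_ALIGNED(gfn_count, PTRS_PER_PMD))
			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;

with VALUE_PMD_ALIGNED() and HV_PAGE_COUNT_2M_ALIGNED() consolidated
or dropped.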

Thanks,
Stanislav

> > +			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > +
> > +		ret = hv_call_unmap_gpa_pages(partition->pt_id, gfn,
> > +					      gfn_count, unmap_flags);
> > +		if (ret)
> > +			pt_err(partition,
> > +			       "Failed to unmap GPA pages %#llx-%#llx: %d\n",
> > +			       gfn, gfn + gfn_count - 1, ret);
> > +	}
> > 
> >  	mshv_region_invalidate(region);
> > 
> > 
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
  2025-11-17 16:52 ` [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  2025-11-18 16:29   ` Michael Kelley
@ 2025-11-19 18:06   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2025-11-19 18:06 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: oe-kbuild-all, linux-hyperv, linux-kernel

Hi Stanislav,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.18-rc6]
[cannot apply to next-20251119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/Drivers-hv-Refactor-and-rename-memory-region-handling-functions/20251118-010501
base:   linus/master
patch link:    https://lore.kernel.org/r/176339837995.27330.14240947043073674139.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
config: x86_64-randconfig-076-20251119 (https://download.01.org/0day-ci/archive/20251120/202511200145.iDV1mUnj-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251120/202511200145.iDV1mUnj-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511200145.iDV1mUnj-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/vdso/const.h:5,
                    from include/linux/const.h:4,
                    from include/uapi/linux/kernel.h:6,
                    from include/linux/cache.h:5,
                    from arch/x86/include/asm/current.h:10,
                    from include/linux/sched.h:12,
                    from include/linux/resume_user_mode.h:6,
                    from include/linux/entry-virt.h:6,
                    from drivers/hv/mshv_root_main.c:11:
   In function 'mshv_region_handle_gfn_fault',
       inlined from 'mshv_handle_gpa_intercept' at drivers/hv/mshv_root_main.c:742:9,
       inlined from 'mshv_vp_handle_intercept' at drivers/hv/mshv_root_main.c:755:10,
       inlined from 'mshv_vp_ioctl_run_vp' at drivers/hv/mshv_root_main.c:769:22,
       inlined from 'mshv_vp_ioctl' at drivers/hv/mshv_root_main.c:958:7:
>> include/linux/compiler_types.h:602:45: error: call to '__compiletime_assert_1093' declared with attribute error: BUILD_BUG failed
     602 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/uapi/linux/const.h:49:44: note: in definition of macro '__ALIGN_KERNEL_MASK'
      49 | #define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
         |                                            ^
   include/vdso/align.h:9:33: note: in expansion of macro '__ALIGN_KERNEL'
       9 | #define ALIGN_DOWN(x, a)        __ALIGN_KERNEL((x) - ((a) - 1), (a))
         |                                 ^~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:691:23: note: in expansion of macro 'ALIGN_DOWN'
     691 |         page_offset = ALIGN_DOWN(gfn - region->start_gfn,
         |                       ^~~~~~~~~~
   include/linux/compiler_types.h:590:9: note: in expansion of macro '__compiletime_assert'
     590 |         __compiletime_assert(condition, msg, prefix, suffix)
         |         ^~~~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:602:9: note: in expansion of macro '_compiletime_assert'
     602 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
         |                     ^~~~~~~~~~~~~~~~
   include/linux/huge_mm.h:113:28: note: in expansion of macro 'BUILD_BUG'
     113 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
         |                            ^~~~~~~~~
   include/linux/huge_mm.h:117:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
     117 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
         |                          ^~~~~~~~~~~~~~~
   include/linux/huge_mm.h:118:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
     118 | #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
         |                          ^~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:40:49: note: in expansion of macro 'HPAGE_PMD_NR'
      40 | #define MSHV_MAP_FAULT_IN_PAGES                 HPAGE_PMD_NR
         |                                                 ^~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:692:34: note: in expansion of macro 'MSHV_MAP_FAULT_IN_PAGES'
     692 |                                  MSHV_MAP_FAULT_IN_PAGES);
         |                                  ^~~~~~~~~~~~~~~~~~~~~~~


vim +/__compiletime_assert_1093 +602 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  588  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  589  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  590  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  591  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  592  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  593   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  594   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  595   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  596   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  597   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  598   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  599   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  600   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  601  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @602  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  603  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
  2025-11-18 16:29   ` Michael Kelley
@ 2025-11-20  0:45     ` Stanislav Kinsburskii
  2025-11-21  5:45       ` Michael Kelley
  0 siblings, 1 reply; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-20  0:45 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Tue, Nov 18, 2025 at 04:29:56PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM
> > 
> > Introduce support for movable memory regions in the Hyper-V root partition
> > driver, thus improving memory management flexibility and preparing the
> > driver for advanced use cases such as dynamic memory remapping.
> > 
> > Integrate mmu_interval_notifier for movable regions, implement functions to
> > handle HMM faults and memory invalidation, and update memory region mapping
> > logic to support movable regions.
> > 
> > While MMU notifiers are commonly used in virtualization drivers, this
> > implementation leverages HMM (Heterogeneous Memory Management) for its
> > tailored functionality. HMM provides a ready-made framework for mirroring,
> > invalidation, and fault handling, avoiding the need to reimplement these
> > mechanisms for a single callback. Although MMU notifiers are more generic,
> > using HMM reduces boilerplate and ensures maintainability by utilizing a
> > mechanism specifically designed for such use cases.
> 
> I wasn't previously familiar with HMM when reviewing this code, so I had to
> work through the details, and mentally build a high-level view of how this
> use case maps to concepts in the Linux kernel documentation for HMM.
> Here's my take:
> 
> In this use case, the goal is what HMM calls "address space mirroring"
> between Linux in the root partition and a guest VM. Linux in the root
> partition is the primary owner of the memory. Each guest VM is a
> "device" from the HMM standpoint, and the device page tables are the
> hypervisor's SLAT entries that are managed using hypercalls to map and
> unmap memory. When a guest VM is using unpinned memory, the guest
> VM faults and generates a VP intercept when it first touches a page in its
> GFN space. MSHV is the device driver for the "device", and it handles the
> VP intercept by calling hmm_range_fault() to populate the pages, and then
> making hypercalls to set up the "device" page tables (i.e., the SLAT entries).
> Linux in the root partition can reclaim ownership of a memory range via
> the HMM "invalidate" callback, which MSHV handles by making hypercalls
> to clear the SLAT entries. After such an invalidate, the guest VM will fault
> again if it touches the memory range.
> 
> Is this a correct understanding?  If so, including such a summary in the
> commit message or as a code comment would be helpful to people
> in the future who are looking at the code.
> 

Yes, this is a correct understanding.
I'll add a description to the commit message.

> > 
> > Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/Kconfig          |    1
> >  drivers/hv/mshv_root.h      |    8 +
> >  drivers/hv/mshv_root_main.c |  328
> > ++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 327 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > index 0b8c391a0342..5f1637cbb6e3 100644
> > --- a/drivers/hv/Kconfig
> > +++ b/drivers/hv/Kconfig
> > @@ -75,6 +75,7 @@ config MSHV_ROOT
> >  	depends on PAGE_SIZE_4KB
> >  	select EVENTFD
> >  	select VIRT_XFER_TO_GUEST_WORK
> > +	select HMM_MIRROR
> 
> Couldn't you also do "select MMU_NOTIFIER" to avoid the #ifdef's
> and stubs for when it isn't selected? There are other Linux kernel
> drivers that select it. Or is the intent to allow building an image that
> doesn't support unpinned memory, and the #ifdef's save space?
> 

That's an interesting question. This driver can function without MMU notifiers
by pinning all memory, which might be advantageous for certain real-time
applications.
However, since most other virtualization solutions use MMU_NOTIFIER, there
doesn't appear to be a strong reason for this driver to deviate.

> >  	default n
> >  	help
> >  	  Select this option to enable support for booting and running as root
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 5eece7077f8b..117399dea780 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -15,6 +15,7 @@
> >  #include <linux/hashtable.h>
> >  #include <linux/dev_printk.h>
> >  #include <linux/build_bug.h>
> > +#include <linux/mmu_notifier.h>
> >  #include <uapi/linux/mshv.h>
> > 
> >  /*
> > @@ -81,9 +82,14 @@ struct mshv_mem_region {
> >  	struct {
> >  		u64 large_pages:  1; /* 2MiB */
> >  		u64 range_pinned: 1;
> > -		u64 reserved:	 62;
> > +		u64 is_ram	: 1; /* mem region can be ram or mmio */
> 
> The combination of "range_pinned" and "is_ram" effectively define three
> MSHV region types:
> 
> 1) MMIO
> 2) Pinned RAM
> 3) Unpinned RAM
> 
> Would it make sense to have an enum instead of 2 booleans? It seems
> like that would simplify the code a bit in a couple places. For example,
> mshv_handle_gpa_intercept() could just do one WARN_ON() instead of two.
> Also you would not need mshv_partition_destroy_region() testing "is_ram",
> and then mshv_region_movable_fini() testing "range_pinned". A single test
> could cover both.
> 
> Just a suggestion. Ignore my comment if you prefer the 2 booleans.
> 

I'd rather keep it as is, since it's a legacy pattern people are used to.
A small runtime optimization isn't worth the churn from my POV.

> > +/**
> > + * mshv_region_range_fault - Handle memory range faults for a given region.
> > + * @region: Pointer to the memory region structure.
> > + * @page_offset: Offset of the page within the region.
> > + * @page_count: Number of pages to handle.
> > + *
> > + * This function resolves memory faults for a specified range of pages
> > + * within a memory region. It uses HMM (Heterogeneous Memory Management)
> > + * to fault in the required pages and updates the region's page array.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> > + */
> > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > +				   u64 page_offset, u64 page_count)
> > +{
> > +	struct hmm_range range = {
> > +		.notifier = &region->mni,
> > +		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> > +	};
> > +	unsigned long *pfns;
> > +	int ret;
> > +	u64 i;
> > +
> > +	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
> > +	if (!pfns)
> > +		return -ENOMEM;
> > +
> > +	range.hmm_pfns = pfns;
> > +	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> > +	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> > +
> > +	do {
> > +		ret = mshv_region_hmm_fault_and_lock(region, &range);
> > +	} while (ret == -EBUSY);
> 
> I expected the looping on -EBUSY to be in mshv_region_hmm_fault_and_lock(),
> but I guess it really doesn't matter.
> 
> > +
> > +	if (ret)
> > +		goto out;
> > +
> > +	for (i = 0; i < page_count; i++)
> > +		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> 
> The comment with hmm_pfn_to_page() says that the caller is assumed to
> have tested the pfn for HMM_PFN_VALID, which I don't see done anywhere.
> Is there a reason it's not necessary to test?
> 

The reason is the HMM_PFN_REQ_FAULT range flag, which requires all PFNs to be
faulted and populated, or mshv_region_hmm_fault_and_lock will return -EFAULT.
Additionally, note that mshv_region_hmm_fault_and_lock returns with
region->mutex held upon success, ensuring that no page can be moved until the
lock is released.

> > +
> > +	if (PageHuge(region->pages[page_offset]))
> > +		region->flags.large_pages = true;
> 
> See comment below in mshv_region_handle_gfn_fault().
> 
> > +
> > +	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > +				      page_offset, page_count);
> > +
> > +	mutex_unlock(&region->mutex);
> > +out:
> > +	kfree(pfns);
> > +	return ret;
> > +}
> > +#else /* CONFIG_MMU_NOTIFIER */
> > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > +				   u64 page_offset, u64 page_count)
> > +{
> > +	return -ENODEV;
> > +}
> > +#endif /* CONFIG_MMU_NOTIFIER */
> > +
> > +static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> > +{
> > +	u64 page_offset, page_count;
> > +	int ret;
> > +
> > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > +		return false;
> > +
> > +	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> > +	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > +				 MSHV_MAP_FAULT_IN_PAGES);
> > +
> > +	/* Map more pages than requested to reduce the number of faults. */
> > +	page_count = min(region->nr_pages - page_offset,
> > +			 MSHV_MAP_FAULT_IN_PAGES);
> 
> These computations make the range defined by page_offset and page_count
> start on a 512 page boundary relative to start_gfn, and have a size that is a
> multiple of 512 pages. But they don't ensure that the range aligns to a large page
> boundary within gfn space since region->start_gfn may not be a multiple of
> 512. Then mshv_region_range_fault() tests the first page of the range for
> being a large page, and if so, sets region->large_pages. This doesn't make
> sense to me if the range doesn't align to a large page boundary.
> 
> Does this code need to make sure that the range is aligned to a large
> page boundary in gfn space? Or am I misunderstanding what the
> region->large_pages flag means? Given the fixes in this v6 of the
> patch set, I was thinking that region->large_pages means that every
> large page aligned area within the region is a large page. If region->
> start_gfn and region->nr_pages aren't multiples of 512, then there
> may be an initial range and a final range that aren't large pages,
> but everything in between is. If that's not a correct understanding,
> could you clarify the exact meaning of the region->large_pages
> flag?
>

That's a good catch. Initially, the approach to memory deposit involved pinning
and depositing all pages. The code assumes that if the first page in the region
is huge, it is sufficient to use the "map huge page" flag in the hypercall.

With this series, the region is sparse by default, reducing the likelihood of
huge pages in the region. As a result, using this flag seems neither correct
nor reasonable.

Ideally, whether to use the flag should be determined during each guest memory
map/unmap operation, rather than relying on the flag set during the initial
region mapping.

For now, I will remove the large_pages flag for movable regions in this
series, as it is the least intrusive change. However, I plan to investigate
this further and potentially replace the large_pages flag with a runtime
check in the next series.
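For illustration, such a runtime check could look roughly like this (a
sketch only, glossing over host PFN alignment and the hugetlb vs. THP
distinction; mshv_chunk_can_map_huge() is a hypothetical name, not part
of this series):

	static bool mshv_chunk_can_map_huge(struct mshv_mem_region *region,
					    u64 page_offset, u64 page_count)
	{
		u64 i;

		if (!IS_ALIGNED(region->start_gfn + page_offset, PTRS_PER_PMD) ||
		    !IS_ALIGNED(page_count, PTRS_PER_PMD))
			return false;

		/* Every 2M subrange must be backed by a huge folio */
		for (i = 0; i < page_count; i += PTRS_PER_PMD) {
			struct folio *folio;

			folio = page_folio(region->pages[page_offset + i]);
			if (folio_nr_pages(folio) < PTRS_PER_PMD)
				return false;
		}

		return true;
	}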

> > +
> > +	ret = mshv_region_range_fault(region, page_offset, page_count);
> > +
> > +	WARN_ONCE(ret,
> > +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> > +		  region->partition->pt_id, region->start_uaddr,
> > +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > +		  gfn, page_offset, page_count);
> > +
> > +	return !ret;
> > +}
> > +
> > +/**
> > + * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
> > + * @vp: Pointer to the virtual processor structure.
> > + *
> > + * This function processes GPA intercepts by identifying the memory region
> > + * corresponding to the intercepted GPA, aligning the page offset, and
> > + * mapping the required pages. It ensures that the region is valid and
> > + * handles faults efficiently by mapping multiple pages at once.
> > + *
> > + * Return: true if the intercept was handled successfully, false otherwise.
> > + */
> > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > +{
> > +	struct mshv_partition *p = vp->vp_partition;
> > +	struct mshv_mem_region *region;
> > +	struct hv_x64_memory_intercept_message *msg;
> > +	u64 gfn;
> > +
> > +	msg = (struct hv_x64_memory_intercept_message *)
> > +		vp->vp_intercept_msg_page->u.payload;
> > +
> > +	gfn = HVPFN_DOWN(msg->guest_physical_address);
> > +
> > +	region = mshv_partition_region_by_gfn(p, gfn);
> > +	if (!region)
> > +		return false;
> 
> Does it ever happen that the gfn is legitimately not found in any
> region, perhaps due to a race? I think the vp_mutex is held here,
> so maybe that protects the region layout for the VP and "not found"
> should never occur. If so, should there be a WARN_ON here?
> 
> If "gfn not found" can be legitimate, perhaps a comment to
> explain the circumstances would be helpful.
> 

This is possible if the hypervisor returns an invalid GFN.
But there is also a possibility that this code can race with region removal from a guest.
I'll address it in the next revision.

> > +
> > +	if (WARN_ON_ONCE(!region->flags.is_ram))
> > +		return false;
> > +
> > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > +		return false;
> > +
> > +	return mshv_region_handle_gfn_fault(region, gfn);
> > +}
> > +
> > +#else	/* CONFIG_X86_64 */
> > +
> > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > +
> > +#endif	/* CONFIG_X86_64 */
> > +
> > +static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > +{
> > +	switch (vp->vp_intercept_msg_page->header.message_type) {
> > +	case HVMSG_GPA_INTERCEPT:
> > +		return mshv_handle_gpa_intercept(vp);
> > +	}
> > +	return false;
> > +}
> > +
> >  static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
> >  {
> >  	long rc;
> > 
> > -	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > -		rc = mshv_run_vp_with_root_scheduler(vp);
> > -	else
> > -		rc = mshv_run_vp_with_hyp_scheduler(vp);
> > +	do {
> > +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > +			rc = mshv_run_vp_with_root_scheduler(vp);
> > +		else
> > +			rc = mshv_run_vp_with_hyp_scheduler(vp);
> > +	} while (rc == 0 && mshv_vp_handle_intercept(vp));
> > 
> >  	if (rc)
> >  		return rc;
> > @@ -1194,6 +1385,110 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> >  	return NULL;
> >  }
> > 
> > +#if defined(CONFIG_MMU_NOTIFIER)
> > +static void mshv_region_movable_fini(struct mshv_mem_region *region)
> > +{
> > +	if (region->flags.range_pinned)
> > +		return;
> > +
> > +	mmu_interval_notifier_remove(&region->mni);
> > +}
> > +
> > +/**
> > + * mshv_region_interval_invalidate - Invalidate a range of memory region
> > + * @mni: Pointer to the mmu_interval_notifier structure
> > + * @range: Pointer to the mmu_notifier_range structure
> > + * @cur_seq: Current sequence number for the interval notifier
> > + *
> > + * This function invalidates a memory region by remapping its pages with
> > + * no access permissions. It locks the region's mutex to ensure thread safety
> > + * and updates the sequence number for the interval notifier. If the range
> > + * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
> > + * lock and returns false if unsuccessful.
> > + *
> > + * NOTE: Failure to invalidate a region is a serious error, as the pages will
> > + * be considered freed while they are still mapped by the hypervisor.
> > + * Any attempt to access such pages will likely crash the system.
> > + *
> > + * Return: true if the region was successfully invalidated, false otherwise.
> > + */
> > +static bool
> > +mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > +				const struct mmu_notifier_range *range,
> > +				unsigned long cur_seq)
> > +{
> > +	struct mshv_mem_region *region = container_of(mni,
> > +						struct mshv_mem_region,
> > +						mni);
> > +	u64 page_offset, page_count;
> > +	unsigned long mstart, mend;
> > +	int ret = -EPERM;
> > +
> > +	if (mmu_notifier_range_blockable(range))
> > +		mutex_lock(&region->mutex);
> > +	else if (!mutex_trylock(&region->mutex))
> > +		goto out_fail;
> > +
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +
> > +	mstart = max(range->start, region->start_uaddr);
> > +	mend = min(range->end, region->start_uaddr +
> > +		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> 
> I'm pretty sure region->start_uaddr is always page aligned. But what
> about range->start and range->end?  The code here and below assumes
> they are page aligned. It also assumes that range->end is greater than
> range->start so the computation of page_count doesn't wrap and so
> page_count is >= 1. I don't know whether checks for these assumptions
> are appropriate.
> 

There is a check that the memory region size is non-zero and page aligned
in the mshv_partition_ioctl_set_memory() function, which is the only caller
that creates memory regions. And the region start is specified in PFNs.

Thanks,
Stanislav


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
  2025-11-20  0:45     ` Stanislav Kinsburskii
@ 2025-11-21  5:45       ` Michael Kelley
  2025-11-24 21:28         ` Stanislav Kinsburskii
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Kelley @ 2025-11-21  5:45 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, November 19, 2025 4:45 PM
> 
> On Tue, Nov 18, 2025 at 04:29:56PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM

[snip]

> > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > index 0b8c391a0342..5f1637cbb6e3 100644
> > > --- a/drivers/hv/Kconfig
> > > +++ b/drivers/hv/Kconfig
> > > @@ -75,6 +75,7 @@ config MSHV_ROOT
> > >  	depends on PAGE_SIZE_4KB
> > >  	select EVENTFD
> > >  	select VIRT_XFER_TO_GUEST_WORK
> > > +	select HMM_MIRROR
> >
> > Couldn't you also do "select MMU_NOTIFIER" to avoid the #ifdef's
> > and stubs for when it isn't selected? There are other Linux kernel
> > drivers that select it. Or is the intent to allow building an image that
> > doesn't support unpinned memory, and the #ifdef's save space?
> >
> 
> That's an interesting question. This driver can function without MMU notifiers
> by pinning all memory, which might be advantageous for certain real-time
> applications.
> However, since most other virtualization solutions use MMU_NOTIFIER, there
> doesn't appear to be a strong reason for this driver to deviate.

I'm not clear on your last sentence. Could you elaborate?

> 
> > >  	default n
> > >  	help
> > >  	  Select this option to enable support for booting and running as root

[snip]

> > > +
> > > +	for (i = 0; i < page_count; i++)
> > > +		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> >
> > The comment with hmm_pfn_to_page() says that the caller is assumed to
> > have tested the pfn for HMM_PFN_VALID, which I don't see done anywhere.
> > Is there a reason it's not necessary to test?
> 
> The reason is the HMM_PFN_REQ_FAULT range flag, which requires all PFNs to be
> faulted and populated, or mshv_region_hmm_fault_and_lock will return -EFAULT.
> Additionally, note that mshv_region_hmm_fault_and_lock returns with
> region->mutex held upon success, ensuring that no page can be moved until the
> lock is released.

OK, that makes sense.

> 
> > > +
> > > +	if (PageHuge(region->pages[page_offset]))
> > > +		region->flags.large_pages = true;
> >
> > See comment below in mshv_region_handle_gfn_fault().
> >
> > > +
> > > +	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > > +				      page_offset, page_count);
> > > +
> > > +	mutex_unlock(&region->mutex);
> > > +out:
> > > +	kfree(pfns);
> > > +	return ret;
> > > +}
> > > +#else /* CONFIG_MMU_NOTIFIER */
> > > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > > +				   u64 page_offset, u64 page_count)
> > > +{
> > > +	return -ENODEV;
> > > +}
> > > +#endif /* CONFIG_MMU_NOTIFIER */
> > > +
> > > +static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> > > +{
> > > +	u64 page_offset, page_count;
> > > +	int ret;
> > > +
> > > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > > +		return false;
> > > +
> > > +	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> > > +	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > > +				 MSHV_MAP_FAULT_IN_PAGES);
> > > +
> > > +	/* Map more pages than requested to reduce the number of faults. */
> > > +	page_count = min(region->nr_pages - page_offset,
> > > +			 MSHV_MAP_FAULT_IN_PAGES);
> >
> > These computations make the range defined by page_offset and page_count
> > start on a 512 page boundary relative to start_gfn, and have a size that is a
> > multiple of 512 pages. But they don't ensure that the range aligns to a large page
> > boundary within gfn space since region->start_gfn may not be a multiple of
> > 512. Then mshv_region_range_fault() tests the first page of the range for
> > being a large page, and if so, sets region->large_pages. This doesn't make
> > sense to me if the range doesn't align to a large page boundary.
> >
> > Does this code need to make sure that the range is aligned to a large
> > page boundary in gfn space? Or am I misunderstanding what the
> > region->large_pages flag means? Given the fixes in this v6 of the
> > patch set, I was thinking that region->large_pages means that every
> > large page aligned area within the region is a large page. If region->
> > start_gfn and region->nr_pages aren't multiples of 512, then there
> > may be an initial range and a final range that aren't large pages,
> > but everything in between is. If that's not a correct understanding,
> > could you clarify the exact meaning of the region->large_pages
> > flag?
> >
> 
> That's a good catch. Initially, the approach to memory deposit involved pinning
> and depositing all pages. The code assumes that if the first page in the region
> is huge, it is sufficient to use the "map huge page" flag in the hypercall.

Right. But even for pinned regions as coded today, is that assumption
correct? Due to memory being fragmented at the time of region creation,
it would be possible that some 2Meg ranges in a region are backed by a large
page, while other 2Meg ranges are not. In that case, a single per-region flag
isn't enough information. Or does the hypercall work OK if the "map huge
page" flag is passed when the range isn't a huge page? I'm not clear on what
the hypercall requires as input.

> 
> With this series, the region is sparse by default, reducing the likelihood of
> huge pages in the region. As a result, using this flag seems neither correct
> nor reasonable.

Yep.

> 
> Ideally, whether to use the flag should be determined during each guest memory
> map/unmap operation, rather than relying on the flag set during the initial
> region mapping.
> 
> For now, I will remove the large_pages flag for movable regions in this
> series, as it is the least intrusive change. However, I plan to investigate
> this further and potentially replace the large_pages flag with a runtime
> check in the next series.

OK.  So what is the impact? Losing the perf benefit of mapping guest
memory in the SLAT as a 2 Meg large page vs. a bunch of individual 4K
pages? Anything else?

> 
> > > +
> > > +	ret = mshv_region_range_fault(region, page_offset, page_count);
> > > +
> > > +	WARN_ONCE(ret,
> > > +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> > > +		  region->partition->pt_id, region->start_uaddr,
> > > +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > > +		  gfn, page_offset, page_count);
> > > +
> > > +	return !ret;
> > > +}
> > > +
> > > +/**
> > > + * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
> > > + * @vp: Pointer to the virtual processor structure.
> > > + *
> > > + * This function processes GPA intercepts by identifying the memory region
> > > + * corresponding to the intercepted GPA, aligning the page offset, and
> > > + * mapping the required pages. It ensures that the region is valid and
> > > + * handles faults efficiently by mapping multiple pages at once.
> > > + *
> > > + * Return: true if the intercept was handled successfully, false otherwise.
> > > + */
> > > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > +{
> > > +	struct mshv_partition *p = vp->vp_partition;
> > > +	struct mshv_mem_region *region;
> > > +	struct hv_x64_memory_intercept_message *msg;
> > > +	u64 gfn;
> > > +
> > > +	msg = (struct hv_x64_memory_intercept_message *)
> > > +		vp->vp_intercept_msg_page->u.payload;
> > > +
> > > +	gfn = HVPFN_DOWN(msg->guest_physical_address);
> > > +
> > > +	region = mshv_partition_region_by_gfn(p, gfn);
> > > +	if (!region)
> > > +		return false;
> >
> > Does it ever happen that the gfn is legitimately not found in any
> > region, perhaps due to a race? I think the vp_mutex is held here,
> > so maybe that protects the region layout for the VP and "not found"
> > should never occur. If so, should there be a WARN_ON here?
> >
> > If "gfn not found" can be legitimate, perhaps a comment to
> > explain the circumstances would be helpful.
> >
> 
> This is possible if the hypervisor returns an invalid GFN.
> But there is also a possibility that this code can race with region removal from a guest.
> I'll address it in the next revision.

In either of these cases, what happens next? The MSHV_RUN_VP ioctl
will return to user space with the unhandled HVMSG_GPA_INTERCEPT
message. Is there anything user space can do to enable the VP to make
progress past the fault? Or does user space just have to terminate the
guest VM?

> 
> > > +
> > > +	if (WARN_ON_ONCE(!region->flags.is_ram))
> > > +		return false;
> > > +
> > > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > > +		return false;
> > > +
> > > +	return mshv_region_handle_gfn_fault(region, gfn);
> > > +}
> > > +
> > > +#else	/* CONFIG_X86_64 */
> > > +
> > > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > +
> > > +#endif	/* CONFIG_X86_64 */
> > > +
> > > +static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > +{
> > > +	switch (vp->vp_intercept_msg_page->header.message_type) {
> > > +	case HVMSG_GPA_INTERCEPT:
> > > +		return mshv_handle_gpa_intercept(vp);
> > > +	}
> > > +	return false;
> > > +}
> > > +
> > >  static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
> > >  {
> > >  	long rc;
> > >
> > > -	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > > -		rc = mshv_run_vp_with_root_scheduler(vp);
> > > -	else
> > > -		rc = mshv_run_vp_with_hyp_scheduler(vp);
> > > +	do {
> > > +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > > +			rc = mshv_run_vp_with_root_scheduler(vp);
> > > +		else
> > > +			rc = mshv_run_vp_with_hyp_scheduler(vp);
> > > +	} while (rc == 0 && mshv_vp_handle_intercept(vp));
> > >
> > >  	if (rc)
> > >  		return rc;
> > > @@ -1194,6 +1385,110 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > >  	return NULL;
> > >  }
> > >
> > > +#if defined(CONFIG_MMU_NOTIFIER)
> > > +static void mshv_region_movable_fini(struct mshv_mem_region *region)
> > > +{
> > > +	if (region->flags.range_pinned)
> > > +		return;
> > > +
> > > +	mmu_interval_notifier_remove(&region->mni);
> > > +}
> > > +
> > > +/**
> > > + * mshv_region_interval_invalidate - Invalidate a range of memory region
> > > + * @mni: Pointer to the mmu_interval_notifier structure
> > > + * @range: Pointer to the mmu_notifier_range structure
> > > + * @cur_seq: Current sequence number for the interval notifier
> > > + *
> > > + * This function invalidates a memory region by remapping its pages with
> > > + * no access permissions. It locks the region's mutex to ensure thread safety
> > > + * and updates the sequence number for the interval notifier. If the range
> > > + * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
> > > + * lock and returns false if unsuccessful.
> > > + *
> > > + * NOTE: Failure to invalidate a region is a serious error, as the pages will
> > > + * be considered freed while they are still mapped by the hypervisor.
> > > + * Any attempt to access such pages will likely crash the system.
> > > + *
> > > + * Return: true if the region was successfully invalidated, false otherwise.
> > > + */
> > > +static bool
> > > +mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > > +				const struct mmu_notifier_range *range,
> > > +				unsigned long cur_seq)
> > > +{
> > > +	struct mshv_mem_region *region = container_of(mni,
> > > +						struct mshv_mem_region,
> > > +						mni);
> > > +	u64 page_offset, page_count;
> > > +	unsigned long mstart, mend;
> > > +	int ret = -EPERM;
> > > +
> > > +	if (mmu_notifier_range_blockable(range))
> > > +		mutex_lock(&region->mutex);
> > > +	else if (!mutex_trylock(&region->mutex))
> > > +		goto out_fail;
> > > +
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +
> > > +	mstart = max(range->start, region->start_uaddr);
> > > +	mend = min(range->end, region->start_uaddr +
> > > +		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> >
> > I'm pretty sure region->start_uaddr is always page aligned. But what
> > about range->start and range->end?  The code here and below assumes
> > they are page aligned. It also assumes that range->end is greater than
> > range->start so the computation of page_count doesn't wrap and so
> > page_count is >= 1. I don't know whether checks for these assumptions
> > are appropriate.
> >
> 
> There is a check that the memory region size is non-zero and page aligned
> in the mshv_partition_ioctl_set_memory() function, which is the only caller
> that creates memory regions. And the region start is specified in PFNs.
> 

Right -- no disagreement that the region start and size are page aligned
and non-zero. But what about the range that is being invalidated?
(i.e., range->start and range->end) The values in that range are coming
from the mm subsystem, and aren't governed by how a region is created.
If that range is a subset of the MSHV region, then
mshv_region_interval_invalidate() will be operating on whatever subset
was provided in the 'range' argument.

mshv_region_interval_invalidate() is ultimately called from
mmu_notifier_invalidate_range_start(), which has about 30 different
callers in the kernel, mostly in the mm subsystem. It wasn't
clear to me what rules, if any, those 30 callers are following when they
set up the range to be invalidated. 

Michael

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions
  2025-11-21  5:45       ` Michael Kelley
@ 2025-11-24 21:28         ` Stanislav Kinsburskii
  0 siblings, 0 replies; 13+ messages in thread
From: Stanislav Kinsburskii @ 2025-11-24 21:28 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Fri, Nov 21, 2025 at 05:45:20AM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, November 19, 2025 4:45 PM
> > 
> > On Tue, Nov 18, 2025 at 04:29:56PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM
> 
> [snip]
> 
> > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > index 0b8c391a0342..5f1637cbb6e3 100644
> > > > --- a/drivers/hv/Kconfig
> > > > +++ b/drivers/hv/Kconfig
> > > > @@ -75,6 +75,7 @@ config MSHV_ROOT
> > > >  	depends on PAGE_SIZE_4KB
> > > >  	select EVENTFD
> > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > +	select HMM_MIRROR
> > >
> > > Couldn't you also do "select MMU_NOTIFIER" to avoid the #ifdef's
> > > and stubs for when it isn't selected? There are other Linux kernel
> > > drivers that select it. Or is the intent to allow building an image that
> > > doesn't support unpinned memory, and the #ifdef's save space?
> > >
> > 
> > That's an interesting question. This driver can function without MMU notifiers
> > by pinning all memory, which might be advantageous for certain real-time
> > applications.
> > However, since most other virtualization solutions use MMU_NOTIFIER, there
> > doesn't appear to be a strong reason for this driver to deviate.
> 
> I'm not clear on your last sentence. Could you elaborate?
> 

I meant I'll select MMU_NOTIFIER.
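I.e. the Kconfig entry would become (sketch):

	config MSHV_ROOT
		...
		select VIRT_XFER_TO_GUEST_WORK
		select HMM_MIRROR
		select MMU_NOTIFIER

so the CONFIG_MMU_NOTIFIER #ifdef stubs can be dropped.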

> 
> Right. But even for pinned regions as coded today, is that assumption
> correct? Due to memory being fragmented at the time of region creation,
> it would be possible that some 2Meg ranges in a region are backed by a large
> page, while other 2Meg ranges are not. In that case, a single per-region flag
> isn't enough information. Or does the hypercall work OK if the "map huge
> page" flag is passed when the range isn't a huge page? I'm not clear on what
> the hypercall requires as input.
> 

It is not correct when THP is enabled and the MAP_HUGETLB mmap flag is
missing. I'll fix it in the next revision.

> 
> OK.  So what is the impact? Losing the perf benefit of mapping guest
> memory in the SLAT as a 2 Meg large page vs. a bunch of individual 4K
> pages? Anything else?
> 

I decided to rework it in the scope of this series.

> > 
> > This is possible if the hypervisor returns an invalid GFN.
> > But there is also a possibility that this code can race with region removal from a guest.
> > I'll address it in the next revision.
> 
> In either of these cases, what happens next? The MSHV_RUN_VP ioctl
> will return to user space with the unhandled HVMSG_GPA_INTERCEPT
> message. Is there anything user space can do to enable the VP to make
> progress past the fault? Or does user space just have to terminate the
> guest VM?
> 

I don't think there is much to be done here in the kernel.
Control will be returned to the VMM, which will likely kill the guest.

> > >
> > > I'm pretty sure region->start_uaddr is always page aligned. But what
> > > about range->start and range->end?  The code here and below assumes
> > > they are page aligned. It also assumes that range->end is greater than
> > > range->start so the computation of page_count doesn't wrap and so
> > > page_count is >= 1. I don't know whether checks for these assumptions
> > > are appropriate.
> > >
> > 
> > There is a check that the memory region size is non-zero and page aligned
> > in the mshv_partition_ioctl_set_memory() function, which is the only caller
> > that creates memory regions. And the region start is specified in PFNs.
> > 
> 
> Right -- no disagreement that the region start and size are page aligned
> and non-zero. But what about the range that is being invalidated?
> (i.e., range->start and range->end) The values in that range are coming
> from the mm subsystem, and aren't governed by how a region is created.
> If that range is a subset of the MSHV region, then
> mshv_region_interval_invalidate() will be operating on whatever subset
> was provided in the 'range' argument.
> 
> mshv_region_interval_invalidate() is ultimately called from
> mmu_notifier_invalidate_range_start(), which has about 30 different
> callers in the kernel, mostly in the mm subsystem. It wasn't
> clear to me what rules, if any, those 30 callers are following when they
> set up the range to be invalidated. 
> 
> Michael

I see what you mean. Yes, these values are guaranteed to be page aligned.
And they must be, as anything else can't be invalidated.

Thanks,
Stanislav



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-11-24 21:28 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-17 16:52 [PATCH v6 0/5] Series short description Stanislav Kinsburskii
2025-11-17 16:52 ` [PATCH v6 1/5] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
2025-11-17 16:52 ` [PATCH v6 2/5] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
2025-11-17 16:52 ` [PATCH v6 3/5] Drivers: hv: Batch GPA unmap operations to improve large region performance Stanislav Kinsburskii
2025-11-18 16:28   ` Michael Kelley
2025-11-19 17:42     ` Stanislav Kinsburskii
2025-11-17 16:52 ` [PATCH v6 4/5] Drivers: hv: Ensure large page GPA mapping is PMD-aligned Stanislav Kinsburskii
2025-11-17 16:52 ` [PATCH v6 5/5] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
2025-11-18 16:29   ` Michael Kelley
2025-11-20  0:45     ` Stanislav Kinsburskii
2025-11-21  5:45       ` Michael Kelley
2025-11-24 21:28         ` Stanislav Kinsburskii
2025-11-19 18:06   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).