linux-hyperv.vger.kernel.org archive mirror
* [PATCH v8 0/6] Introduce movable pages for Hyper-V guests
@ 2025-12-03 18:23 Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 1/6] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

From the start, the root-partition driver has allocated, pinned, and mapped
all guest memory into the hypervisor at guest creation time. This approach
is simple: Linux cannot move the pages, so the guest's view of its memory
in Linux and in the Microsoft Hypervisor never diverges.

However, this approach has major drawbacks:
 - NUMA: affinity cannot be changed at runtime, so guest memory cannot be
   migrated closer to the CPUs running the guest, which hurts performance.
 - Memory management: unused guest memory cannot be swapped out, compacted,
   or merged.
 - Provisioning time: upfront allocation and pinning slow down guest
   creation and destruction.
 - Overcommit: memory overcommit is impossible on hosts whose guest memory
   is pinned.

This series adds movable memory pages for Hyper-V child partitions. Guest
pages are no longer allocated upfront; they’re allocated and mapped into
the hypervisor on demand (i.e., when the guest touches a GFN that isn’t yet
backed by a host PFN).
When a page is about to be moved, it is unmapped from the hypervisor, so
Linux no longer has to hold it in place.
As a result, Hyper-V guest memory behaves like that of a regular Linux
process, and the standard Linux memory-management features apply to guests.
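
At a high level, the fault path reduces to the sketch below. This is an
illustrative sketch only, not literal driver code: it is condensed from
the helpers introduced later in the series, with region-list locking and
error reporting omitted.

static bool mshv_handle_gpa_intercept(struct mshv_partition *partition,
				      u64 gfn)
{
	struct mshv_mem_region *region;
	bool ok;

	/* the real lookup runs under pt_mem_regions_lock (patch 5) */
	region = mshv_partition_region_by_gfn(partition, gfn);
	if (!region || !mshv_region_get(region))
		return false;

	/*
	 * Faults the pages in via hmm_range_fault() and issues the map
	 * hypercall (patch 6); the MMU notifier later unmaps the range
	 * again when Linux decides to move or reclaim the pages.
	 */
	ok = mshv_region_handle_gfn_fault(region, gfn);

	mshv_region_put(region);
	return ok;
}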

Exceptions (still pinned):
 1. Encrypted guests (explicitly pinned).
 2. Guests with passthrough devices (implicitly pinned by the VFIO framework).

v8:
 - Drop the guest region intersection cleanup patch, as it doesn't catch
   the case where a new region completely overlaps an existing region.
 - Fix missing propagation of the caller's unmap flags during chunk unmap.

v7:
 - Only the first two patches remain unchanged from v6.
 - Introduced reference counting for memory regions to resolve a race
   condition between region servicing (faulting and invalidation) and region
   destruction.
 - Corrected the assumption that regions starting with a huge page contain
   only huge pages; the code now properly handles regions with mixed page
   size segments.
 - Consolidated region management logic into a dedicated file.
 - Updated the driver to select MMU_NOTIFIER, removing support for
   configurations without this option.
 - Cleaned up and refactored the region management code.
 - Fixed a build issue reported by the kernel test robot for configurations
   where HPAGE_PMD_NR is defined to trigger a build bug.
 - Replaced VALUE_PMD_ALIGNED with the generic IS_ALIGNED macro.
 - Simplified region flags by introducing a region type for clarity.
 - Improved commit messages.

v6:
 - Fix a bug in large page remapping where setting the large map flag based
   on the PFN offset's large page alignment within the region implicitly
   assumed that the region's start offset was also large page aligned,
   which could cause map hypercall failures.
 - Fix a bug in large page unmapping where setting the large unmap flag for
   an unaligned guest PFN range could result in unmap hypercall failures.

v5:
 - Fix a bug in MMU notifier handling where an uninitialized 'ret' variable
   could cause the warning about failed page invalidation to be skipped.
 - Improve comment grammar regarding skipping the unmapping of non-mapped pages.

v4:
 - Fix a bug where batch unmapping could skip mapped pages when selecting a
   new batch, due to an incorrect offset calculation.
 - Fix an error message in case of failed memory region pinning.


v3:
 - The region is now invalidated even if the mm has no users.
 - Page remapping logic is updated to support 2M-unaligned remappings for
   regions that are PMD-aligned, which can occur during both faults and
   invalidations.

v2:
 - Split unmap batching into a separate patch.
 - Fixed commit messages from v1 review.
 - Renamed a few functions for clarity.

---

Stanislav Kinsburskii (6):
      Drivers: hv: Refactor and rename memory region handling functions
      Drivers: hv: Centralize guest memory region destruction
      Drivers: hv: Move region management to mshv_regions.c
      Drivers: hv: Fix huge page handling in memory region traversal
      Drivers: hv: Add refcount and locking to mem regions
      Drivers: hv: Add support for movable memory regions


 drivers/hv/Kconfig          |    2 
 drivers/hv/Makefile         |    2 
 drivers/hv/mshv_regions.c   |  548 +++++++++++++++++++++++++++++++++++++++++++
 drivers/hv/mshv_root.h      |   32 ++-
 drivers/hv/mshv_root_main.c |  376 +++++++++++++-----------------
 5 files changed, 743 insertions(+), 217 deletions(-)
 create mode 100644 drivers/hv/mshv_regions.c


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v8 1/6] Drivers: hv: Refactor and rename memory region handling functions
  2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
@ 2025-12-03 18:24 ` Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 2/6] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Simplify and unify memory region management to improve code clarity and
reliability. Consolidate pinning and invalidation logic, adopt consistent
naming, and remove redundant checks to reduce complexity.

Enhance documentation and update call sites for maintainability.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_root_main.c |   80 +++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bc15d6f6922f..fec82619684a 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1114,8 +1114,8 @@ mshv_region_map(struct mshv_mem_region *region)
 }
 
 static void
-mshv_region_evict_pages(struct mshv_mem_region *region,
-			u64 page_offset, u64 page_count)
+mshv_region_invalidate_pages(struct mshv_mem_region *region,
+			     u64 page_offset, u64 page_count)
 {
 	if (region->flags.range_pinned)
 		unpin_user_pages(region->pages + page_offset, page_count);
@@ -1125,29 +1125,24 @@ mshv_region_evict_pages(struct mshv_mem_region *region,
 }
 
 static void
-mshv_region_evict(struct mshv_mem_region *region)
+mshv_region_invalidate(struct mshv_mem_region *region)
 {
-	mshv_region_evict_pages(region, 0, region->nr_pages);
+	mshv_region_invalidate_pages(region, 0, region->nr_pages);
 }
 
 static int
-mshv_region_populate_pages(struct mshv_mem_region *region,
-			   u64 page_offset, u64 page_count)
+mshv_region_pin(struct mshv_mem_region *region)
 {
 	u64 done_count, nr_pages;
 	struct page **pages;
 	__u64 userspace_addr;
 	int ret;
 
-	if (page_offset + page_count > region->nr_pages)
-		return -EINVAL;
-
-	for (done_count = 0; done_count < page_count; done_count += ret) {
-		pages = region->pages + page_offset + done_count;
+	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
+		pages = region->pages + done_count;
 		userspace_addr = region->start_uaddr +
-				(page_offset + done_count) *
-				HV_HYP_PAGE_SIZE;
-		nr_pages = min(page_count - done_count,
+				 done_count * HV_HYP_PAGE_SIZE;
+		nr_pages = min(region->nr_pages - done_count,
 			       MSHV_PIN_PAGES_BATCH_SIZE);
 
 		/*
@@ -1158,34 +1153,23 @@ mshv_region_populate_pages(struct mshv_mem_region *region,
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		if (region->flags.range_pinned)
-			ret = pin_user_pages_fast(userspace_addr,
-						  nr_pages,
-						  FOLL_WRITE | FOLL_LONGTERM,
-						  pages);
-		else
-			ret = -EOPNOTSUPP;
-
+		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+					  FOLL_WRITE | FOLL_LONGTERM,
+					  pages);
 		if (ret < 0)
 			goto release_pages;
 	}
 
-	if (PageHuge(region->pages[page_offset]))
+	if (PageHuge(region->pages[0]))
 		region->flags.large_pages = true;
 
 	return 0;
 
 release_pages:
-	mshv_region_evict_pages(region, page_offset, done_count);
+	mshv_region_invalidate_pages(region, 0, done_count);
 	return ret;
 }
 
-static int
-mshv_region_populate(struct mshv_mem_region *region)
-{
-	return mshv_region_populate_pages(region, 0, region->nr_pages);
-}
-
 static struct mshv_mem_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
@@ -1245,19 +1229,27 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/*
- * Map guest ram. if snp, make sure to release that from the host first
- * Side Effects: In case of failure, pages are unpinned when feasible.
+/**
+ * mshv_prepare_pinned_region - Pin and map memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
-static int
-mshv_partition_mem_region_map(struct mshv_mem_region *region)
+static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
-	ret = mshv_region_populate(region);
+	ret = mshv_region_pin(region);
 	if (ret) {
-		pt_err(partition, "Failed to populate memory region: %d\n",
+		pt_err(partition, "Failed to pin memory region: %d\n",
 		       ret);
 		goto err_out;
 	}
@@ -1275,7 +1267,7 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 			pt_err(partition,
 			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
 			       region->start_gfn, ret);
-			goto evict_region;
+			goto invalidate_region;
 		}
 	}
 
@@ -1285,7 +1277,7 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 
 		shrc = mshv_partition_region_share(region);
 		if (!shrc)
-			goto evict_region;
+			goto invalidate_region;
 
 		pt_err(partition,
 		       "Failed to share memory region (guest_pfn: %llu): %d\n",
@@ -1299,8 +1291,8 @@ mshv_partition_mem_region_map(struct mshv_mem_region *region)
 
 	return 0;
 
-evict_region:
-	mshv_region_evict(region);
+invalidate_region:
+	mshv_region_invalidate(region);
 err_out:
 	return ret;
 }
@@ -1349,7 +1341,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
 					     mmio_pfn, HVPFN_DOWN(mem.size));
 	else
-		ret = mshv_partition_mem_region_map(region);
+		ret = mshv_prepare_pinned_region(region);
 
 	if (ret)
 		goto errout;
@@ -1394,7 +1386,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
 				region->nr_pages, unmap_flags);
 
-	mshv_region_evict(region);
+	mshv_region_invalidate(region);
 
 	vfree(region);
 	return 0;
@@ -1812,7 +1804,7 @@ static void destroy_partition(struct mshv_partition *partition)
 			}
 		}
 
-		mshv_region_evict(region);
+		mshv_region_invalidate(region);
 
 		vfree(region);
 	}



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v8 2/6] Drivers: hv: Centralize guest memory region destruction
  2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 1/6] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
@ 2025-12-03 18:24 ` Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 3/6] Drivers: hv: Move region management to mshv_regions.c Stanislav Kinsburskii
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Centralize guest memory region destruction to prevent resource leaks and
inconsistent cleanup across unmap and partition destruction paths.

Unify region removal, encrypted partition access recovery, and region
invalidation to improve maintainability and reliability. Reduce code
duplication and make future updates less error-prone by encapsulating
cleanup logic in a single helper.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_root_main.c |   65 ++++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 31 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index fec82619684a..ec18984c3f2d 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1356,13 +1356,42 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	return ret;
 }
 
+static void mshv_partition_destroy_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	u32 unmap_flags = 0;
+	int ret;
+
+	hlist_del(&region->hnode);
+
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_partition_region_share(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
+			       ret);
+			return;
+		}
+	}
+
+	if (region->flags.large_pages)
+		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	/* ignore unmap failures and continue as process may be exiting */
+	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
+				region->nr_pages, unmap_flags);
+
+	mshv_region_invalidate(region);
+
+	vfree(region);
+}
+
 /* Called for unmapping both the guest ram and the mmio space */
 static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region mem)
 {
 	struct mshv_mem_region *region;
-	u32 unmap_flags = 0;
 
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1377,18 +1406,8 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	    region->nr_pages != HVPFN_DOWN(mem.size))
 		return -EINVAL;
 
-	hlist_del(&region->hnode);
+	mshv_partition_destroy_region(region);
 
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
-
-	mshv_region_invalidate(region);
-
-	vfree(region);
 	return 0;
 }
 
@@ -1724,8 +1743,8 @@ static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
 	struct mshv_mem_region *region;
-	int i, ret;
 	struct hlist_node *n;
+	int i;
 
 	if (refcount_read(&partition->pt_ref_count)) {
 		pt_err(partition,
@@ -1789,25 +1808,9 @@ static void destroy_partition(struct mshv_partition *partition)
 
 	remove_partition(partition);
 
-	/* Remove regions, regain access to the memory and unpin the pages */
 	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
-				  hnode) {
-		hlist_del(&region->hnode);
-
-		if (mshv_partition_encrypted(partition)) {
-			ret = mshv_partition_region_share(region);
-			if (ret) {
-				pt_err(partition,
-				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
-				      ret);
-				return;
-			}
-		}
-
-		mshv_region_invalidate(region);
-
-		vfree(region);
-	}
+				  hnode)
+		mshv_partition_destroy_region(region);
 
 	/* Withdraw and free all pages we deposited */
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v8 3/6] Drivers: hv: Move region management to mshv_regions.c
  2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 1/6] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 2/6] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
@ 2025-12-03 18:24 ` Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 4/6] Drivers: hv: Fix huge page handling in memory region traversal Stanislav Kinsburskii
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Refactor memory region management functions from mshv_root_main.c into
mshv_regions.c for better modularity and code organization.

Adjust function calls and headers to use the new implementation. Improve
maintainability and separation of concerns in the mshv_root module.
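
For reference, after this patch the pinned-region path in mshv_root_main.c
consumes the new interface roughly as in the sketch below (illustrative
only, with error handling and the encrypted-partition handling trimmed):

	region = mshv_region_create(mem->guest_pfn, HVPFN_DOWN(mem->size),
				    mem->userspace_addr, mem->flags, is_mmio);
	if (IS_ERR(region))
		return PTR_ERR(region);
	region->partition = partition;

	ret = mshv_region_pin(region);		/* batched pin_user_pages_fast() */
	if (!ret)
		ret = mshv_region_map(region);	/* map GPA pages hypercall */
	if (ret)
		mshv_region_invalidate(region);	/* unpin and clear the page array */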

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/Makefile         |    2 
 drivers/hv/mshv_regions.c   |  175 +++++++++++++++++++++++++++++++++++++++++++
 drivers/hv/mshv_root.h      |   10 ++
 drivers/hv/mshv_root_main.c |  176 +++----------------------------------------
 4 files changed, 198 insertions(+), 165 deletions(-)
 create mode 100644 drivers/hv/mshv_regions.c

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 58b8d07639f3..46d4f4f1b252 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -14,7 +14,7 @@ hv_vmbus-y := vmbus_drv.o \
 hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
-	       mshv_root_hv_call.o mshv_portid_table.o
+	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
 mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
new file mode 100644
index 000000000000..35b866670840
--- /dev/null
+++ b/drivers/hv/mshv_regions.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, Microsoft Corporation.
+ *
+ * Memory region management for mshv_root module.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+
+#include <asm/mshyperv.h>
+
+#include "mshv_root.h"
+
+struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
+					   u64 uaddr, u32 flags,
+					   bool is_mmio)
+{
+	struct mshv_mem_region *region;
+
+	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	if (!region)
+		return ERR_PTR(-ENOMEM);
+
+	region->nr_pages = nr_pages;
+	region->start_gfn = guest_pfn;
+	region->start_uaddr = uaddr;
+	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
+	if (flags & BIT(MSHV_SET_MEM_BIT_WRITABLE))
+		region->hv_map_flags |= HV_MAP_GPA_WRITABLE;
+	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
+		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
+
+	/* Note: large_pages flag populated when we pin the pages */
+	if (!is_mmio)
+		region->flags.range_pinned = true;
+
+	return region;
+}
+
+int mshv_region_share(struct mshv_mem_region *region)
+{
+	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
+
+	if (region->flags.large_pages)
+		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
+
+	return hv_call_modify_spa_host_access(region->partition->pt_id,
+			region->pages, region->nr_pages,
+			HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
+			flags, true);
+}
+
+int mshv_region_unshare(struct mshv_mem_region *region)
+{
+	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
+
+	if (region->flags.large_pages)
+		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
+
+	return hv_call_modify_spa_host_access(region->partition->pt_id,
+			region->pages, region->nr_pages,
+			0,
+			flags, false);
+}
+
+static int mshv_region_remap_pages(struct mshv_mem_region *region,
+				   u32 map_flags,
+				   u64 page_offset, u64 page_count)
+{
+	if (page_offset + page_count > region->nr_pages)
+		return -EINVAL;
+
+	if (region->flags.large_pages)
+		map_flags |= HV_MAP_GPA_LARGE_PAGE;
+
+	return hv_call_map_gpa_pages(region->partition->pt_id,
+				     region->start_gfn + page_offset,
+				     page_count, map_flags,
+				     region->pages + page_offset);
+}
+
+int mshv_region_map(struct mshv_mem_region *region)
+{
+	u32 map_flags = region->hv_map_flags;
+
+	return mshv_region_remap_pages(region, map_flags,
+				       0, region->nr_pages);
+}
+
+static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
+					 u64 page_offset, u64 page_count)
+{
+	if (region->flags.range_pinned)
+		unpin_user_pages(region->pages + page_offset, page_count);
+
+	memset(region->pages + page_offset, 0,
+	       page_count * sizeof(struct page *));
+}
+
+void mshv_region_invalidate(struct mshv_mem_region *region)
+{
+	mshv_region_invalidate_pages(region, 0, region->nr_pages);
+}
+
+int mshv_region_pin(struct mshv_mem_region *region)
+{
+	u64 done_count, nr_pages;
+	struct page **pages;
+	__u64 userspace_addr;
+	int ret;
+
+	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
+		pages = region->pages + done_count;
+		userspace_addr = region->start_uaddr +
+				 done_count * HV_HYP_PAGE_SIZE;
+		nr_pages = min(region->nr_pages - done_count,
+			       MSHV_PIN_PAGES_BATCH_SIZE);
+
+		/*
+		 * Pinning assuming 4k pages works for large pages too.
+		 * All page structs within the large page are returned.
+		 *
+		 * Pin requests are batched because pin_user_pages_fast
+		 * with the FOLL_LONGTERM flag does a large temporary
+		 * allocation of contiguous memory.
+		 */
+		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+					  FOLL_WRITE | FOLL_LONGTERM,
+					  pages);
+		if (ret < 0)
+			goto release_pages;
+	}
+
+	if (PageHuge(region->pages[0]))
+		region->flags.large_pages = true;
+
+	return 0;
+
+release_pages:
+	mshv_region_invalidate_pages(region, 0, done_count);
+	return ret;
+}
+
+void mshv_region_destroy(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	u32 unmap_flags = 0;
+	int ret;
+
+	hlist_del(&region->hnode);
+
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_region_share(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
+			       ret);
+			return;
+		}
+	}
+
+	if (region->flags.large_pages)
+		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	/* ignore unmap failures and continue as process may be exiting */
+	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
+				region->nr_pages, unmap_flags);
+
+	mshv_region_invalidate(region);
+
+	vfree(region);
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3eb815011b46..0366f416c2f0 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -312,4 +312,14 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
+struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
+					   u64 uaddr, u32 flags,
+					   bool is_mmio);
+int mshv_region_share(struct mshv_mem_region *region);
+int mshv_region_unshare(struct mshv_mem_region *region);
+int mshv_region_map(struct mshv_mem_region *region);
+void mshv_region_invalidate(struct mshv_mem_region *region);
+int mshv_region_pin(struct mshv_mem_region *region);
+void mshv_region_destroy(struct mshv_mem_region *region);
+
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index ec18984c3f2d..5dfb933da981 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1059,117 +1059,6 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
 	*status = partition->async_hypercall_status;
 }
 
-static int
-mshv_partition_region_share(struct mshv_mem_region *region)
-{
-	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
-
-	if (region->flags.large_pages)
-		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
-
-	return hv_call_modify_spa_host_access(region->partition->pt_id,
-			region->pages, region->nr_pages,
-			HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
-			flags, true);
-}
-
-static int
-mshv_partition_region_unshare(struct mshv_mem_region *region)
-{
-	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
-
-	if (region->flags.large_pages)
-		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
-
-	return hv_call_modify_spa_host_access(region->partition->pt_id,
-			region->pages, region->nr_pages,
-			0,
-			flags, false);
-}
-
-static int
-mshv_region_remap_pages(struct mshv_mem_region *region, u32 map_flags,
-			u64 page_offset, u64 page_count)
-{
-	if (page_offset + page_count > region->nr_pages)
-		return -EINVAL;
-
-	if (region->flags.large_pages)
-		map_flags |= HV_MAP_GPA_LARGE_PAGE;
-
-	/* ask the hypervisor to map guest ram */
-	return hv_call_map_gpa_pages(region->partition->pt_id,
-				     region->start_gfn + page_offset,
-				     page_count, map_flags,
-				     region->pages + page_offset);
-}
-
-static int
-mshv_region_map(struct mshv_mem_region *region)
-{
-	u32 map_flags = region->hv_map_flags;
-
-	return mshv_region_remap_pages(region, map_flags,
-				       0, region->nr_pages);
-}
-
-static void
-mshv_region_invalidate_pages(struct mshv_mem_region *region,
-			     u64 page_offset, u64 page_count)
-{
-	if (region->flags.range_pinned)
-		unpin_user_pages(region->pages + page_offset, page_count);
-
-	memset(region->pages + page_offset, 0,
-	       page_count * sizeof(struct page *));
-}
-
-static void
-mshv_region_invalidate(struct mshv_mem_region *region)
-{
-	mshv_region_invalidate_pages(region, 0, region->nr_pages);
-}
-
-static int
-mshv_region_pin(struct mshv_mem_region *region)
-{
-	u64 done_count, nr_pages;
-	struct page **pages;
-	__u64 userspace_addr;
-	int ret;
-
-	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
-		pages = region->pages + done_count;
-		userspace_addr = region->start_uaddr +
-				 done_count * HV_HYP_PAGE_SIZE;
-		nr_pages = min(region->nr_pages - done_count,
-			       MSHV_PIN_PAGES_BATCH_SIZE);
-
-		/*
-		 * Pinning assuming 4k pages works for large pages too.
-		 * All page structs within the large page are returned.
-		 *
-		 * Pin requests are batched because pin_user_pages_fast
-		 * with the FOLL_LONGTERM flag does a large temporary
-		 * allocation of contiguous memory.
-		 */
-		ret = pin_user_pages_fast(userspace_addr, nr_pages,
-					  FOLL_WRITE | FOLL_LONGTERM,
-					  pages);
-		if (ret < 0)
-			goto release_pages;
-	}
-
-	if (PageHuge(region->pages[0]))
-		region->flags.large_pages = true;
-
-	return 0;
-
-release_pages:
-	mshv_region_invalidate_pages(region, 0, done_count);
-	return ret;
-}
-
 static struct mshv_mem_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
@@ -1193,7 +1082,7 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_mem_region **regionpp,
 					bool is_mmio)
 {
-	struct mshv_mem_region *region, *rg;
+	struct mshv_mem_region *rg;
 	u64 nr_pages = HVPFN_DOWN(mem->size);
 
 	/* Reject overlapping regions */
@@ -1205,26 +1094,15 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 		return -EEXIST;
 	}
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
-	if (!region)
-		return -ENOMEM;
-
-	region->nr_pages = nr_pages;
-	region->start_gfn = mem->guest_pfn;
-	region->start_uaddr = mem->userspace_addr;
-	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
-	if (mem->flags & BIT(MSHV_SET_MEM_BIT_WRITABLE))
-		region->hv_map_flags |= HV_MAP_GPA_WRITABLE;
-	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
-		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
-
-	/* Note: large_pages flag populated when we pin the pages */
-	if (!is_mmio)
-		region->flags.range_pinned = true;
+	rg = mshv_region_create(mem->guest_pfn, nr_pages,
+				mem->userspace_addr, mem->flags,
+				is_mmio);
+	if (IS_ERR(rg))
+		return PTR_ERR(rg);
 
-	region->partition = partition;
+	rg->partition = partition;
 
-	*regionpp = region;
+	*regionpp = rg;
 
 	return 0;
 }
@@ -1262,7 +1140,7 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 	 * access to guest memory regions.
 	 */
 	if (mshv_partition_encrypted(partition)) {
-		ret = mshv_partition_region_unshare(region);
+		ret = mshv_region_unshare(region);
 		if (ret) {
 			pt_err(partition,
 			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
@@ -1275,7 +1153,7 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 	if (ret && mshv_partition_encrypted(partition)) {
 		int shrc;
 
-		shrc = mshv_partition_region_share(region);
+		shrc = mshv_region_share(region);
 		if (!shrc)
 			goto invalidate_region;
 
@@ -1356,36 +1234,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	return ret;
 }
 
-static void mshv_partition_destroy_region(struct mshv_mem_region *region)
-{
-	struct mshv_partition *partition = region->partition;
-	u32 unmap_flags = 0;
-	int ret;
-
-	hlist_del(&region->hnode);
-
-	if (mshv_partition_encrypted(partition)) {
-		ret = mshv_partition_region_share(region);
-		if (ret) {
-			pt_err(partition,
-			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
-			       ret);
-			return;
-		}
-	}
-
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
-
-	mshv_region_invalidate(region);
-
-	vfree(region);
-}
-
 /* Called for unmapping both the guest ram and the mmio space */
 static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
@@ -1406,7 +1254,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	    region->nr_pages != HVPFN_DOWN(mem.size))
 		return -EINVAL;
 
-	mshv_partition_destroy_region(region);
+	mshv_region_destroy(region);
 
 	return 0;
 }
@@ -1810,7 +1658,7 @@ static void destroy_partition(struct mshv_partition *partition)
 
 	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
 				  hnode)
-		mshv_partition_destroy_region(region);
+		mshv_region_destroy(region);
 
 	/* Withdraw and free all pages we deposited */
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v8 4/6] Drivers: hv: Fix huge page handling in memory region traversal
  2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
                   ` (2 preceding siblings ...)
  2025-12-03 18:24 ` [PATCH v8 3/6] Drivers: hv: Move region management to mshv_regions.c Stanislav Kinsburskii
@ 2025-12-03 18:24 ` Stanislav Kinsburskii
  2025-12-03 19:47   ` Nuno Das Neves
  2025-12-03 18:24 ` [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions Stanislav Kinsburskii
  2025-12-03 18:24 ` [PATCH v8 6/6] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  5 siblings, 1 reply; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

The previous code assumed that if a region's first page was huge, the
entire region consisted of huge pages and stored this in a large_pages
flag. This premise is incorrect not only for movable regions (where
pages can be split and merged on invalidate callbacks or page faults),
but even for pinned regions: THPs can be split and merged during
allocation, so a large, pinned region may contain a mix of huge and
regular pages.

This change removes the large_pages flag and replaces region-wide
assumptions with per-chunk inspection of the actual page size when
mapping, unmapping, sharing, and unsharing. This makes huge page
handling correct for mixed-page regions and avoids relying on stale
metadata that can easily become invalid as memory is remapped.
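
The heart of the change is that the stride and the large-page flag are now
derived from the folio actually backing each chunk, roughly as in the
fragment below (illustrative; the full traversal in the patch also skips
non-present pages and validates chunk boundaries):

	struct page *page = region->pages[page_offset];
	unsigned int order = folio_order(page_folio(page));
	u64 stride = 1ULL << order;	/* 1 for a 4K page, 512 for a 2M THP on x86 */

	if (PageTransCompound(page))
		map_flags |= HV_MAP_GPA_LARGE_PAGE;	/* per chunk, not per region */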

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_regions.c |  213 +++++++++++++++++++++++++++++++++++++++------
 drivers/hv/mshv_root.h    |    3 -
 2 files changed, 184 insertions(+), 32 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 35b866670840..1356f68ccb29 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -14,6 +14,124 @@
 
 #include "mshv_root.h"
 
+/**
+ * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
+ *                             in a region.
+ * @region     : Pointer to the memory region structure.
+ * @flags      : Flags to pass to the handler.
+ * @page_offset: Offset into the region's pages array to start processing.
+ * @page_count : Number of pages to process.
+ * @handler    : Callback function to handle the chunk.
+ *
+ * This function scans the region's pages starting from @page_offset,
+ * checking for contiguous present pages of the same size (normal or huge).
+ * It invokes @handler for the chunk of contiguous pages found. Returns the
+ * number of pages handled, or a negative error code if the first page is
+ * not present or the handler fails.
+ *
+ * Note: The @handler callback must be able to handle both normal and huge
+ * pages.
+ *
+ * Return: Number of pages handled, or negative error code.
+ */
+static long mshv_region_process_chunk(struct mshv_mem_region *region,
+				      u32 flags,
+				      u64 page_offset, u64 page_count,
+				      int (*handler)(struct mshv_mem_region *region,
+						     u32 flags,
+						     u64 page_offset,
+						     u64 page_count))
+{
+	u64 count, stride;
+	unsigned int page_order;
+	struct page *page;
+	int ret;
+
+	page = region->pages[page_offset];
+	if (!page)
+		return -EINVAL;
+
+	page_order = folio_order(page_folio(page));
+	/* 1G huge pages aren't supported by the hypercalls */
+	if (page_order == PUD_ORDER)
+		return -EINVAL;
+
+	stride = 1 << page_order;
+
+	/* Start at stride since the first page is validated */
+	for (count = stride; count < page_count; count += stride) {
+		page = region->pages[page_offset + count];
+
+		/* Break if current page is not present */
+		if (!page)
+			break;
+
+		/* Break if page size changes */
+		if (page_order != folio_order(page_folio(page)))
+			break;
+	}
+
+	ret = handler(region, flags, page_offset, count);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+/**
+ * mshv_region_process_range - Processes a range of memory pages in a
+ *                             region.
+ * @region     : Pointer to the memory region structure.
+ * @flags      : Flags to pass to the handler.
+ * @page_offset: Offset into the region's pages array to start processing.
+ * @page_count : Number of pages to process.
+ * @handler    : Callback function to handle each chunk of contiguous
+ *               pages.
+ *
+ * Iterates over the specified range of pages in @region, skipping
+ * non-present pages. For each contiguous chunk of present pages, invokes
+ * @handler via mshv_region_process_chunk.
+ *
+ * Note: The @handler callback must be able to handle both normal and huge
+ * pages.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+static int mshv_region_process_range(struct mshv_mem_region *region,
+				     u32 flags,
+				     u64 page_offset, u64 page_count,
+				     int (*handler)(struct mshv_mem_region *region,
+						    u32 flags,
+						    u64 page_offset,
+						    u64 page_count))
+{
+	long ret;
+
+	if (page_offset + page_count > region->nr_pages)
+		return -EINVAL;
+
+	while (page_count) {
+		/* Skip non-present pages */
+		if (!region->pages[page_offset]) {
+			page_offset++;
+			page_count--;
+			continue;
+		}
+
+		ret = mshv_region_process_chunk(region, flags,
+						page_offset,
+						page_count,
+						handler);
+		if (ret < 0)
+			return ret;
+
+		page_offset += ret;
+		page_count -= ret;
+	}
+
+	return 0;
+}
+
 struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 					   u64 uaddr, u32 flags,
 					   bool is_mmio)
@@ -33,55 +151,80 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	/* Note: large_pages flag populated when we pin the pages */
 	if (!is_mmio)
 		region->flags.range_pinned = true;
 
 	return region;
 }
 
+static int mshv_region_chunk_share(struct mshv_mem_region *region,
+				   u32 flags,
+				   u64 page_offset, u64 page_count)
+{
+	if (PageTransCompound(region->pages[page_offset]))
+		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
+
+	return hv_call_modify_spa_host_access(region->partition->pt_id,
+					      region->pages + page_offset,
+					      page_count,
+					      HV_MAP_GPA_READABLE |
+					      HV_MAP_GPA_WRITABLE,
+					      flags, true);
+}
+
 int mshv_region_share(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
-	if (region->flags.large_pages)
+	return mshv_region_process_range(region, flags,
+					 0, region->nr_pages,
+					 mshv_region_chunk_share);
+}
+
+static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
+				     u32 flags,
+				     u64 page_offset, u64 page_count)
+{
+	if (PageTransCompound(region->pages[page_offset]))
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-			region->pages, region->nr_pages,
-			HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
-			flags, true);
+					      region->pages + page_offset,
+					      page_count, 0,
+					      flags, false);
 }
 
 int mshv_region_unshare(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
-	if (region->flags.large_pages)
-		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
-
-	return hv_call_modify_spa_host_access(region->partition->pt_id,
-			region->pages, region->nr_pages,
-			0,
-			flags, false);
+	return mshv_region_process_range(region, flags,
+					 0, region->nr_pages,
+					 mshv_region_chunk_unshare);
 }
 
-static int mshv_region_remap_pages(struct mshv_mem_region *region,
-				   u32 map_flags,
+static int mshv_region_chunk_remap(struct mshv_mem_region *region,
+				   u32 flags,
 				   u64 page_offset, u64 page_count)
 {
-	if (page_offset + page_count > region->nr_pages)
-		return -EINVAL;
-
-	if (region->flags.large_pages)
-		map_flags |= HV_MAP_GPA_LARGE_PAGE;
+	if (PageTransCompound(region->pages[page_offset]))
+		flags |= HV_MAP_GPA_LARGE_PAGE;
 
 	return hv_call_map_gpa_pages(region->partition->pt_id,
 				     region->start_gfn + page_offset,
-				     page_count, map_flags,
+				     page_count, flags,
 				     region->pages + page_offset);
 }
 
+static int mshv_region_remap_pages(struct mshv_mem_region *region,
+				   u32 map_flags,
+				   u64 page_offset, u64 page_count)
+{
+	return mshv_region_process_range(region, map_flags,
+					 page_offset, page_count,
+					 mshv_region_chunk_remap);
+}
+
 int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
@@ -134,9 +277,6 @@ int mshv_region_pin(struct mshv_mem_region *region)
 			goto release_pages;
 	}
 
-	if (PageHuge(region->pages[0]))
-		region->flags.large_pages = true;
-
 	return 0;
 
 release_pages:
@@ -144,10 +284,28 @@ int mshv_region_pin(struct mshv_mem_region *region)
 	return ret;
 }
 
+static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
+				   u32 flags,
+				   u64 page_offset, u64 page_count)
+{
+	if (PageTransCompound(region->pages[page_offset]))
+		flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	return hv_call_unmap_gpa_pages(region->partition->pt_id,
+				       region->start_gfn + page_offset,
+				       page_count, flags);
+}
+
+static int mshv_region_unmap(struct mshv_mem_region *region)
+{
+	return mshv_region_process_range(region, 0,
+					 0, region->nr_pages,
+					 mshv_region_chunk_unmap);
+}
+
 void mshv_region_destroy(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
-	u32 unmap_flags = 0;
 	int ret;
 
 	hlist_del(&region->hnode);
@@ -162,12 +320,7 @@ void mshv_region_destroy(struct mshv_mem_region *region)
 		}
 	}
 
-	if (region->flags.large_pages)
-		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-
-	/* ignore unmap failures and continue as process may be exiting */
-	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
-				region->nr_pages, unmap_flags);
+	mshv_region_unmap(region);
 
 	mshv_region_invalidate(region);
 
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 0366f416c2f0..ff3374f13691 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -77,9 +77,8 @@ struct mshv_mem_region {
 	u64 start_uaddr;
 	u32 hv_map_flags;
 	struct {
-		u64 large_pages:  1; /* 2MiB */
 		u64 range_pinned: 1;
-		u64 reserved:	 62;
+		u64 reserved:	 63;
 	} flags;
 	struct mshv_partition *partition;
 	struct page *pages[];



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions
  2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
                   ` (3 preceding siblings ...)
  2025-12-03 18:24 ` [PATCH v8 4/6] Drivers: hv: Fix huge page handling in memory region traversal Stanislav Kinsburskii
@ 2025-12-03 18:24 ` Stanislav Kinsburskii
  2025-12-03 19:26   ` Nuno Das Neves
  2025-12-03 18:24 ` [PATCH v8 6/6] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
  5 siblings, 1 reply; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Introduce kref-based reference counting and spinlock protection for
memory regions in Hyper-V partition management. This change improves
memory region lifecycle management and ensures thread-safe access to the
region list.

Also improve the check for overlapping memory regions during region
creation, preventing duplicate or conflicting mappings.

Previously, the regions list was protected by the partition mutex.
However, that approach is too heavyweight for the frequent fault and
invalidation operations, so finer-grained locking is now used to improve
efficiency and concurrency.

This is a precursor to supporting movable memory regions. Fault and
invalidation handling for movable regions will require safe traversal of
the region list and holding a region reference while performing
invalidation or fault operations.
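
The lookup pattern those paths are expected to follow looks roughly like
this (illustrative sketch; the actual fault-path consumer arrives in the
next patch):

	spin_lock(&partition->pt_mem_regions_lock);
	region = mshv_partition_region_by_gfn(partition, gfn);
	if (region && !mshv_region_get(region))	/* region may already be dying */
		region = NULL;
	spin_unlock(&partition->pt_mem_regions_lock);

	if (!region)
		return -ENOENT;

	/* ... service the fault or invalidation against the region ... */

	mshv_region_put(region);	/* the last put runs mshv_region_destroy() */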

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   19 ++++++++++++++++---
 drivers/hv/mshv_root.h      |    6 +++++-
 drivers/hv/mshv_root_main.c |   32 ++++++++++++++++++++++++--------
 3 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 1356f68ccb29..94f33754f545 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -7,6 +7,7 @@
  * Authors: Microsoft Linux virtualization team
  */
 
+#include <linux/kref.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 
@@ -154,6 +155,8 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 	if (!is_mmio)
 		region->flags.range_pinned = true;
 
+	kref_init(&region->refcount);
+
 	return region;
 }
 
@@ -303,13 +306,13 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
 					 mshv_region_chunk_unmap);
 }
 
-void mshv_region_destroy(struct mshv_mem_region *region)
+static void mshv_region_destroy(struct kref *ref)
 {
+	struct mshv_mem_region *region =
+		container_of(ref, struct mshv_mem_region, refcount);
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
-	hlist_del(&region->hnode);
-
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_region_share(region);
 		if (ret) {
@@ -326,3 +329,13 @@ void mshv_region_destroy(struct mshv_mem_region *region)
 
 	vfree(region);
 }
+
+void mshv_region_put(struct mshv_mem_region *region)
+{
+	kref_put(&region->refcount, mshv_region_destroy);
+}
+
+int mshv_region_get(struct mshv_mem_region *region)
+{
+	return kref_get_unless_zero(&region->refcount);
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index ff3374f13691..4249534ba900 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -72,6 +72,7 @@ do { \
 
 struct mshv_mem_region {
 	struct hlist_node hnode;
+	struct kref refcount;
 	u64 nr_pages;
 	u64 start_gfn;
 	u64 start_uaddr;
@@ -97,6 +98,8 @@ struct mshv_partition {
 	u64 pt_id;
 	refcount_t pt_ref_count;
 	struct mutex pt_mutex;
+
+	spinlock_t pt_mem_regions_lock;
 	struct hlist_head pt_mem_regions; // not ordered
 
 	u32 pt_vp_count;
@@ -319,6 +322,7 @@ int mshv_region_unshare(struct mshv_mem_region *region);
 int mshv_region_map(struct mshv_mem_region *region);
 void mshv_region_invalidate(struct mshv_mem_region *region);
 int mshv_region_pin(struct mshv_mem_region *region);
-void mshv_region_destroy(struct mshv_mem_region *region);
+void mshv_region_put(struct mshv_mem_region *region);
+int mshv_region_get(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 5dfb933da981..aa1a11f4dc3e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1086,13 +1086,15 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	u64 nr_pages = HVPFN_DOWN(mem->size);
 
 	/* Reject overlapping regions */
+	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
 		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
 		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
 			continue;
-
+		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EEXIST;
 	}
+	spin_unlock(&partition->pt_mem_regions_lock);
 
 	rg = mshv_region_create(mem->guest_pfn, nr_pages,
 				mem->userspace_addr, mem->flags,
@@ -1224,8 +1226,9 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	if (ret)
 		goto errout;
 
-	/* Install the new region */
+	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_add_head(&region->hnode, &partition->pt_mem_regions);
+	spin_unlock(&partition->pt_mem_regions_lock);
 
 	return 0;
 
@@ -1244,17 +1247,27 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
 
+	spin_lock(&partition->pt_mem_regions_lock);
+
 	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
-	if (!region)
-		return -EINVAL;
+	if (!region) {
+		spin_unlock(&partition->pt_mem_regions_lock);
+		return -ENOENT;
+	}
 
 	/* Paranoia check */
 	if (region->start_uaddr != mem.userspace_addr ||
 	    region->start_gfn != mem.guest_pfn ||
-	    region->nr_pages != HVPFN_DOWN(mem.size))
+	    region->nr_pages != HVPFN_DOWN(mem.size)) {
+		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EINVAL;
+	}
+
+	hlist_del(&region->hnode);
 
-	mshv_region_destroy(region);
+	spin_unlock(&partition->pt_mem_regions_lock);
+
+	mshv_region_put(region);
 
 	return 0;
 }
@@ -1657,8 +1670,10 @@ static void destroy_partition(struct mshv_partition *partition)
 	remove_partition(partition);
 
 	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
-				  hnode)
-		mshv_region_destroy(region);
+				  hnode) {
+		hlist_del(&region->hnode);
+		mshv_region_put(region);
+	}
 
 	/* Withdraw and free all pages we deposited */
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
@@ -1856,6 +1871,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 
 	INIT_HLIST_HEAD(&partition->pt_devices);
 
+	spin_lock_init(&partition->pt_mem_regions_lock);
 	INIT_HLIST_HEAD(&partition->pt_mem_regions);
 
 	mshv_eventfd_init(partition);



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v8 6/6] Drivers: hv: Add support for movable memory regions
  2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
                   ` (4 preceding siblings ...)
  2025-12-03 18:24 ` [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions Stanislav Kinsburskii
@ 2025-12-03 18:24 ` Stanislav Kinsburskii
  2025-12-03 20:36   ` Nuno Das Neves
  5 siblings, 1 reply; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 18:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui; +Cc: linux-hyperv, linux-kernel

Introduce support for movable memory regions in the Hyper-V root partition
driver to improve memory management flexibility and enable advanced use
cases such as dynamic memory remapping.

Mirror the address space between the Linux root partition and guest VMs
using HMM. The root partition owns the memory, while guest VMs act as
devices with page tables managed via hypercalls. MSHV handles VP intercepts
by invoking hmm_range_fault() and updating SLAT entries. When memory is
reclaimed, HMM invalidates the relevant regions, prompting MSHV to clear
SLAT entries; guest VMs will fault again on access.

Integrate mmu_interval_notifier for movable regions, implement handlers for
HMM faults and memory invalidation, and update memory region mapping logic
to support movable regions.

While MMU notifiers are commonly used in virtualization drivers, this
implementation leverages HMM (Heterogeneous Memory Management) for its
specialized functionality. HMM provides a framework for mirroring,
invalidation, and fault handling, reducing boilerplate and improving
maintainability compared to generic MMU notifiers.
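
The fault side follows the standard HMM mirroring pattern; condensed from
mshv_region_hmm_fault_and_lock()/mshv_region_range_fault() below, it
amounts to:

	do {
		range.notifier_seq = mmu_interval_read_begin(range.notifier);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);	/* -EBUSY means "try again" */
		mmap_read_unlock(mm);
		if (ret)
			continue;		/* loops only while ret == -EBUSY */

		mutex_lock(&region->mutex);
		if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
			/* an invalidation raced with the fault */
			mutex_unlock(&region->mutex);
			ret = -EBUSY;
			continue;
		}
		/*
		 * range.hmm_pfns is now stable until the mutex is dropped:
		 * convert the PFNs and issue the map hypercall here.
		 */
	} while (ret == -EBUSY);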

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Kconfig          |    2 
 drivers/hv/mshv_regions.c   |  215 ++++++++++++++++++++++++++++++++++++++++++-
 drivers/hv/mshv_root.h      |   17 +++
 drivers/hv/mshv_root_main.c |  139 +++++++++++++++++++++++-----
 4 files changed, 343 insertions(+), 30 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index d4a8d349200c..7937ac0cbd0f 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -76,6 +76,8 @@ config MSHV_ROOT
 	depends on PAGE_SIZE_4KB
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
+	select HMM_MIRROR
+	select MMU_NOTIFIER
 	default n
 	help
 	  Select this option to enable support for booting and running as root
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 94f33754f545..afe03258caf0 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -7,6 +7,8 @@
  * Authors: Microsoft Linux virtualization team
  */
 
+#include <linux/hmm.h>
+#include <linux/hyperv.h>
 #include <linux/kref.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
@@ -15,6 +17,8 @@
 
 #include "mshv_root.h"
 
+#define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
+
 /**
  * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
  *                             in a region.
@@ -152,9 +156,6 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	if (!is_mmio)
-		region->flags.range_pinned = true;
-
 	kref_init(&region->refcount);
 
 	return region;
@@ -239,7 +240,7 @@ int mshv_region_map(struct mshv_mem_region *region)
 static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
 					 u64 page_offset, u64 page_count)
 {
-	if (region->flags.range_pinned)
+	if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
 		unpin_user_pages(region->pages + page_offset, page_count);
 
 	memset(region->pages + page_offset, 0,
@@ -313,6 +314,9 @@ static void mshv_region_destroy(struct kref *ref)
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
+	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+		mshv_region_movable_fini(region);
+
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_region_share(region);
 		if (ret) {
@@ -339,3 +343,206 @@ int mshv_region_get(struct mshv_mem_region *region)
 {
 	return kref_get_unless_zero(&region->refcount);
 }
+
+/**
+ * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * @region: Pointer to the memory region structure
+ * @range: Pointer to the HMM range structure
+ *
+ * This function performs the following steps:
+ * 1. Reads the notifier sequence for the HMM range.
+ * 2. Acquires a read lock on the memory map.
+ * 3. Handles HMM faults for the specified range.
+ * 4. Releases the read lock on the memory map.
+ * 5. If successful, locks the memory region mutex.
+ * 6. Verifies if the notifier sequence has changed during the operation.
+ *    If it has, releases the mutex and returns -EBUSY to match with
+ *    hmm_range_fault() return code for repeating.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+					  struct hmm_range *range)
+{
+	int ret;
+
+	range->notifier_seq = mmu_interval_read_begin(range->notifier);
+	mmap_read_lock(region->mni.mm);
+	ret = hmm_range_fault(range);
+	mmap_read_unlock(region->mni.mm);
+	if (ret)
+		return ret;
+
+	mutex_lock(&region->mutex);
+
+	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+		mutex_unlock(&region->mutex);
+		cond_resched();
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/**
+ * mshv_region_range_fault - Handle memory range faults for a given region.
+ * @region: Pointer to the memory region structure.
+ * @page_offset: Offset of the page within the region.
+ * @page_count: Number of pages to handle.
+ *
+ * This function resolves memory faults for a specified range of pages
+ * within a memory region. It uses HMM (Heterogeneous Memory Management)
+ * to fault in the required pages and updates the region's page array.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 page_offset, u64 page_count)
+{
+	struct hmm_range range = {
+		.notifier = &region->mni,
+		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
+	};
+	unsigned long *pfns;
+	int ret;
+	u64 i;
+
+	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);
+	if (!pfns)
+		return -ENOMEM;
+
+	range.hmm_pfns = pfns;
+	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+
+	do {
+		ret = mshv_region_hmm_fault_and_lock(region, &range);
+	} while (ret == -EBUSY);
+
+	if (ret)
+		goto out;
+
+	for (i = 0; i < page_count; i++)
+		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+
+	ret = mshv_region_remap_pages(region, region->hv_map_flags,
+				      page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+out:
+	kfree(pfns);
+	return ret;
+}
+
+bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+{
+	u64 page_offset, page_count;
+	int ret;
+
+	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
+	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				 MSHV_MAP_FAULT_IN_PAGES);
+
+	/* Map more pages than requested to reduce the number of faults. */
+	page_count = min(region->nr_pages - page_offset,
+			 MSHV_MAP_FAULT_IN_PAGES);
+
+	ret = mshv_region_range_fault(region, page_offset, page_count);
+
+	WARN_ONCE(ret,
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  region->partition->pt_id, region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  gfn, page_offset, page_count);
+
+	return !ret;
+}
+
+/**
+ * mshv_region_interval_invalidate - Invalidate a range of memory region
+ * @mni: Pointer to the mmu_interval_notifier structure
+ * @range: Pointer to the mmu_notifier_range structure
+ * @cur_seq: Current sequence number for the interval notifier
+ *
+ * This function invalidates a memory region by remapping its pages with
+ * no access permissions. It locks the region's mutex to ensure thread safety
+ * and updates the sequence number for the interval notifier. If the range
+ * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
+ * lock and returns false if unsuccessful.
+ *
+ * NOTE: Failure to invalidate a region is a serious error, as the pages will
+ * be considered freed while they are still mapped by the hypervisor.
+ * Any attempt to access such pages will likely crash the system.
+ *
+ * Return: true if the region was successfully invalidated, false otherwise.
+ */
+static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
+					    const struct mmu_notifier_range *range,
+					    unsigned long cur_seq)
+{
+	struct mshv_mem_region *region = container_of(mni,
+						      struct mshv_mem_region,
+						      mni);
+	u64 page_offset, page_count;
+	unsigned long mstart, mend;
+	int ret = -EPERM;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&region->mutex);
+	else if (!mutex_trylock(&region->mutex))
+		goto out_fail;
+
+	mmu_interval_set_seq(mni, cur_seq);
+
+	mstart = max(range->start, region->start_uaddr);
+	mend = min(range->end, region->start_uaddr +
+		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+
+	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	page_count = HVPFN_DOWN(mend - mstart);
+
+	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
+				      page_offset, page_count);
+	if (ret)
+		goto out_fail;
+
+	mshv_region_invalidate_pages(region, page_offset, page_count);
+
+	mutex_unlock(&region->mutex);
+
+	return true;
+
+out_fail:
+	WARN_ONCE(ret,
+		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
+		  region->start_uaddr,
+		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  range->start, range->end, range->event,
+		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+	return false;
+}
+
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
+	.invalidate = mshv_region_interval_invalidate,
+};
+
+void mshv_region_movable_fini(struct mshv_mem_region *region)
+{
+	mmu_interval_notifier_remove(&region->mni);
+}
+
+bool mshv_region_movable_init(struct mshv_mem_region *region)
+{
+	int ret;
+
+	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+					   region->start_uaddr,
+					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   &mshv_region_mni_ops);
+	if (ret)
+		return false;
+
+	mutex_init(&region->mutex);
+
+	return true;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 4249534ba900..9cd76076d490 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -15,6 +15,7 @@
 #include <linux/hashtable.h>
 #include <linux/dev_printk.h>
 #include <linux/build_bug.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
 
 /*
@@ -70,6 +71,12 @@ do { \
 #define vp_info(v, fmt, ...)	vp_devprintk(info, v, fmt, ##__VA_ARGS__)
 #define vp_dbg(v, fmt, ...)	vp_devprintk(dbg, v, fmt, ##__VA_ARGS__)
 
+enum mshv_region_type {
+	MSHV_REGION_TYPE_MEM_PINNED,
+	MSHV_REGION_TYPE_MEM_MOVABLE,
+	MSHV_REGION_TYPE_MMIO
+};
+
 struct mshv_mem_region {
 	struct hlist_node hnode;
 	struct kref refcount;
@@ -77,11 +84,10 @@ struct mshv_mem_region {
 	u64 start_gfn;
 	u64 start_uaddr;
 	u32 hv_map_flags;
-	struct {
-		u64 range_pinned: 1;
-		u64 reserved:	 63;
-	} flags;
 	struct mshv_partition *partition;
+	enum mshv_region_type type;
+	struct mmu_interval_notifier mni;
+	struct mutex mutex;	/* protects region pages remapping */
 	struct page *pages[];
 };
 
@@ -324,5 +330,8 @@ void mshv_region_invalidate(struct mshv_mem_region *region);
 int mshv_region_pin(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
+bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
+void mshv_region_movable_fini(struct mshv_mem_region *region);
+bool mshv_region_movable_init(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index aa1a11f4dc3e..536182463f84 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -594,14 +594,98 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
+static struct mshv_mem_region *
+mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
+{
+	struct mshv_mem_region *region;
+
+	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
+		if (gfn >= region->start_gfn &&
+		    gfn < region->start_gfn + region->nr_pages)
+			return region;
+	}
+
+	return NULL;
+}
+
+#ifdef CONFIG_X86_64
+static struct mshv_mem_region *
+mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
+{
+	struct mshv_mem_region *region;
+
+	spin_lock(&p->pt_mem_regions_lock);
+	region = mshv_partition_region_by_gfn(p, gfn);
+	if (!region || !mshv_region_get(region)) {
+		spin_unlock(&p->pt_mem_regions_lock);
+		return NULL;
+	}
+	spin_unlock(&p->pt_mem_regions_lock);
+
+	return region;
+}
+
+/**
+ * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
+ * @vp: Pointer to the virtual processor structure.
+ *
+ * This function processes GPA intercepts by identifying the memory region
+ * corresponding to the intercepted GPA, aligning the page offset, and
+ * mapping the required pages. It ensures that the region is valid and
+ * handles faults efficiently by mapping multiple pages at once.
+ *
+ * Return: true if the intercept was handled successfully, false otherwise.
+ */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+	struct mshv_mem_region *region;
+	struct hv_x64_memory_intercept_message *msg;
+	bool ret;
+	u64 gfn;
+
+	msg = (struct hv_x64_memory_intercept_message *)
+		vp->vp_intercept_msg_page->u.payload;
+
+	gfn = HVPFN_DOWN(msg->guest_physical_address);
+
+	region = mshv_partition_region_by_gfn_get(p, gfn);
+	if (!region)
+		return false;
+
+	/* Only movable memory ranges are supported for GPA intercepts */
+	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+		ret = mshv_region_handle_gfn_fault(region, gfn);
+	else
+		ret = false;
+
+	mshv_region_put(region);
+
+	return ret;
+}
+#else  /* CONFIG_X86_64 */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
+#endif /* CONFIG_X86_64 */
+
+static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
+{
+	switch (vp->vp_intercept_msg_page->header.message_type) {
+	case HVMSG_GPA_INTERCEPT:
+		return mshv_handle_gpa_intercept(vp);
+	}
+	return false;
+}
+
 static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;
 
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		rc = mshv_run_vp_with_root_scheduler(vp);
-	else
-		rc = mshv_run_vp_with_hyp_scheduler(vp);
+	do {
+		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+			rc = mshv_run_vp_with_root_scheduler(vp);
+		else
+			rc = mshv_run_vp_with_hyp_scheduler(vp);
+	} while (rc == 0 && mshv_vp_handle_intercept(vp));
 
 	if (rc)
 		return rc;
@@ -1059,20 +1143,6 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
 	*status = partition->async_hypercall_status;
 }
 
-static struct mshv_mem_region *
-mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
-{
-	struct mshv_mem_region *region;
-
-	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
-		if (gfn >= region->start_gfn &&
-		    gfn < region->start_gfn + region->nr_pages)
-			return region;
-	}
-
-	return NULL;
-}
-
 /*
  * NB: caller checks and makes sure mem->size is page aligned
  * Returns: 0 with regionpp updated on success, or -errno
@@ -1102,6 +1172,14 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	if (IS_ERR(rg))
 		return PTR_ERR(rg);
 
+	if (is_mmio)
+		rg->type = MSHV_REGION_TYPE_MMIO;
+	else if (mshv_partition_encrypted(partition) ||
+		 !mshv_region_movable_init(rg))
+		rg->type = MSHV_REGION_TYPE_MEM_PINNED;
+	else
+		rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
+
 	rg->partition = partition;
 
 	*regionpp = rg;
@@ -1217,11 +1295,28 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	if (ret)
 		return ret;
 
-	if (is_mmio)
-		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
-					     mmio_pfn, HVPFN_DOWN(mem.size));
-	else
+	switch (region->type) {
+	case MSHV_REGION_TYPE_MEM_PINNED:
 		ret = mshv_prepare_pinned_region(region);
+		break;
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		/*
+		 * For movable memory regions, remap with no access to let
+		 * the hypervisor track dirty pages, enabling pre-copy live
+		 * migration.
+		 */
+		ret = hv_call_map_gpa_pages(partition->pt_id,
+					    region->start_gfn,
+					    region->nr_pages,
+					    HV_MAP_GPA_NO_ACCESS, NULL);
+		break;
+	case MSHV_REGION_TYPE_MMIO:
+		ret = hv_call_map_mmio_pages(partition->pt_id,
+					     region->start_gfn,
+					     mmio_pfn,
+					     region->nr_pages);
+		break;
+	}
 
 	if (ret)
 		goto errout;



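As a worked example of the fault batching in mshv_region_handle_gfn_fault()
above (illustrative numbers only; MSHV_MAP_FAULT_IN_PAGES resolves to
PTRS_PER_PMD, i.e. 512 with 4K pages):

	start_gfn = 0x100000, nr_pages = 768, faulting gfn = 0x1002ab
	gfn - start_gfn                    = 683
	page_offset = ALIGN_DOWN(683, 512) = 512
	page_count  = min(768 - 512, 512)  = 256

A single intercept therefore faults in and maps GFNs 0x100200-0x1002ff in one
go, and the clamp against nr_pages keeps the batch from running past the end
of the region.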

* Re: [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions
  2025-12-03 18:24 ` [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions Stanislav Kinsburskii
@ 2025-12-03 19:26   ` Nuno Das Neves
  2025-12-03 19:55     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 12+ messages in thread
From: Nuno Das Neves @ 2025-12-03 19:26 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 12/3/2025 10:24 AM, Stanislav Kinsburskii wrote:
> Introduce kref-based reference counting and spinlock protection for
> memory regions in Hyper-V partition management. This change improves
> memory region lifecycle management and ensures thread-safe access to the
> region list.
> 
> Also improves the check for overlapped memory regions during region
> creation, preventing duplicate or conflicting mappings.
> 
> Previously, the regions list was protected by the partition mutex.
> However, this approach is too heavy for frequent fault and invalidation
> operations. Finer grained locking is now used to improve efficiency and
> concurrency.
> 
> This is a precursor to supporting movable memory regions. Fault and
> invalidation handling for movable regions will require safe traversal of
> the region list and holding a region reference while performing
> invalidation or fault operations.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c   |   19 ++++++++++++++++---
>  drivers/hv/mshv_root.h      |    6 +++++-
>  drivers/hv/mshv_root_main.c |   32 ++++++++++++++++++++++++--------
>  3 files changed, 45 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 1356f68ccb29..94f33754f545 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -7,6 +7,7 @@
>   * Authors: Microsoft Linux virtualization team
>   */
>  
> +#include <linux/kref.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  
> @@ -154,6 +155,8 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>  	if (!is_mmio)
>  		region->flags.range_pinned = true;
>  
> +	kref_init(&region->refcount);
> +
>  	return region;
>  }
>  
> @@ -303,13 +306,13 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
>  					 mshv_region_chunk_unmap);
>  }
>  
> -void mshv_region_destroy(struct mshv_mem_region *region)
> +static void mshv_region_destroy(struct kref *ref)
>  {
> +	struct mshv_mem_region *region =
> +		container_of(ref, struct mshv_mem_region, refcount);
>  	struct mshv_partition *partition = region->partition;
>  	int ret;
>  
> -	hlist_del(&region->hnode);
> -
>  	if (mshv_partition_encrypted(partition)) {
>  		ret = mshv_region_share(region);
>  		if (ret) {
> @@ -326,3 +329,13 @@ void mshv_region_destroy(struct mshv_mem_region *region)
>  
>  	vfree(region);
>  }
> +
> +void mshv_region_put(struct mshv_mem_region *region)
> +{
> +	kref_put(&region->refcount, mshv_region_destroy);
> +}
> +
> +int mshv_region_get(struct mshv_mem_region *region)
> +{
> +	return kref_get_unless_zero(&region->refcount);
> +}
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index ff3374f13691..4249534ba900 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -72,6 +72,7 @@ do { \
>  
>  struct mshv_mem_region {
>  	struct hlist_node hnode;
> +	struct kref refcount;
>  	u64 nr_pages;
>  	u64 start_gfn;
>  	u64 start_uaddr;
> @@ -97,6 +98,8 @@ struct mshv_partition {
>  	u64 pt_id;
>  	refcount_t pt_ref_count;
>  	struct mutex pt_mutex;
> +
> +	spinlock_t pt_mem_regions_lock;
>  	struct hlist_head pt_mem_regions; // not ordered
>  
>  	u32 pt_vp_count;
> @@ -319,6 +322,7 @@ int mshv_region_unshare(struct mshv_mem_region *region);
>  int mshv_region_map(struct mshv_mem_region *region);
>  void mshv_region_invalidate(struct mshv_mem_region *region);
>  int mshv_region_pin(struct mshv_mem_region *region);
> -void mshv_region_destroy(struct mshv_mem_region *region);
> +void mshv_region_put(struct mshv_mem_region *region);
> +int mshv_region_get(struct mshv_mem_region *region);
>  
>  #endif /* _MSHV_ROOT_H_ */
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 5dfb933da981..aa1a11f4dc3e 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1086,13 +1086,15 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  	u64 nr_pages = HVPFN_DOWN(mem->size);
>  
>  	/* Reject overlapping regions */
> +	spin_lock(&partition->pt_mem_regions_lock);
>  	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
>  		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
>  		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
>  			continue;
> -
> +		spin_unlock(&partition->pt_mem_regions_lock);
>  		return -EEXIST;
>  	}
> +	spin_unlock(&partition->pt_mem_regions_lock);
>  
>  	rg = mshv_region_create(mem->guest_pfn, nr_pages,
>  				mem->userspace_addr, mem->flags,
> @@ -1224,8 +1226,9 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  	if (ret)
>  		goto errout;
>  
> -	/* Install the new region */
> +	spin_lock(&partition->pt_mem_regions_lock);
>  	hlist_add_head(&region->hnode, &partition->pt_mem_regions);
> +	spin_unlock(&partition->pt_mem_regions_lock);
>  
>  	return 0;
>  
> @@ -1244,17 +1247,27 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
>  	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
>  		return -EINVAL;
>  
> +	spin_lock(&partition->pt_mem_regions_lock);
> +
>  	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
> -	if (!region)
> -		return -EINVAL;
> +	if (!region) {
> +		spin_unlock(&partition->pt_mem_regions_lock);
> +		return -ENOENT;
> +	}
>  
>  	/* Paranoia check */
>  	if (region->start_uaddr != mem.userspace_addr ||
>  	    region->start_gfn != mem.guest_pfn ||
> -	    region->nr_pages != HVPFN_DOWN(mem.size))
> +	    region->nr_pages != HVPFN_DOWN(mem.size)) {
> +		spin_unlock(&partition->pt_mem_regions_lock);
>  		return -EINVAL;
> +	}
> +
> +	hlist_del(&region->hnode);
>  
> -	mshv_region_destroy(region);
> +	spin_unlock(&partition->pt_mem_regions_lock);
> +
> +	mshv_region_put(region);
>  
>  	return 0;
>  }
> @@ -1657,8 +1670,10 @@ static void destroy_partition(struct mshv_partition *partition)
>  	remove_partition(partition);
>  
>  	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
> -				  hnode)
> -		mshv_region_destroy(region);
> +				  hnode) {
> +		hlist_del(&region->hnode);
> +		mshv_region_put(region);
> +	}
>  

With the following patch introducing movable memory, it looks like the
list could be traversed by mshv_partition_region_by_gfn() even while
this hlist_del() is being called.

Maybe that's not possible for some reason I'm unaware of, could you
explain why we don't need to spin_lock here for hlist_del()?
Or, alternatively, use hlist_for_each_entry_safe() in
mshv_partition_region_by_gfn() to guard against the deletion?

>  	/* Withdraw and free all pages we deposited */
>  	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> @@ -1856,6 +1871,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
>  
>  	INIT_HLIST_HEAD(&partition->pt_devices);
>  
> +	spin_lock_init(&partition->pt_mem_regions_lock);
>  	INIT_HLIST_HEAD(&partition->pt_mem_regions);
>  
>  	mshv_eventfd_init(partition);
> 
> 



* Re: [PATCH v8 4/6] Drivers: hv: Fix huge page handling in memory region traversal
  2025-12-03 18:24 ` [PATCH v8 4/6] Drivers: hv: Fix huge page handling in memory region traversal Stanislav Kinsburskii
@ 2025-12-03 19:47   ` Nuno Das Neves
  0 siblings, 0 replies; 12+ messages in thread
From: Nuno Das Neves @ 2025-12-03 19:47 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 12/3/2025 10:24 AM, Stanislav Kinsburskii wrote:
> The previous code assumed that if a region's first page was huge, the
> entire region consisted of huge pages and stored this in a large_pages
> flag. This premise is incorrect not only for movable regions (where
> pages can be split and merged on invalidate callbacks or page faults),
> but even for pinned regions: THPs can be split and merged during
> allocation, so a large, pinned region may contain a mix of huge and
> regular pages.
> 
> This change removes the large_pages flag and replaces region-wide
> assumptions with per-chunk inspection of the actual page size when
> mapping, unmapping, sharing, and unsharing. This makes huge page
> handling correct for mixed-page regions and avoids relying on stale
> metadata that can easily become invalid as memory is remapped.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

I was still looking at v7 when you posted v8 of this series.
Re-posting my review of this patch here so the discussion can be in the
right thread.

> ---
>  drivers/hv/mshv_regions.c |  213 +++++++++++++++++++++++++++++++++++++++------
>  drivers/hv/mshv_root.h    |    3 -
>  2 files changed, 184 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 35b866670840..1356f68ccb29 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -14,6 +14,124 @@
>  
>  #include "mshv_root.h"
>  
> +/**
> + * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> + *                             in a region.
> + * @region     : Pointer to the memory region structure.
> + * @flags      : Flags to pass to the handler.
> + * @page_offset: Offset into the region's pages array to start processing.
> + * @page_count : Number of pages to process.
> + * @handler    : Callback function to handle the chunk.
> + *
> + * This function scans the region's pages starting from @page_offset,
> + * checking for contiguous present pages of the same size (normal or huge).
> + * It invokes @handler for the chunk of contiguous pages found. Returns the
> + * number of pages handled, or a negative error code if the first page is
> + * not present or the handler fails.
> + *
> + * Note: The @handler callback must be able to handle both normal and huge
> + * pages.
> + *
> + * Return: Number of pages handled, or negative error code.
> + */
> +static long mshv_region_process_chunk(struct mshv_mem_region *region,
> +				      u32 flags,
> +				      u64 page_offset, u64 page_count,
> +				      int (*handler)(struct mshv_mem_region *region,
> +						     u32 flags,
> +						     u64 page_offset,
> +						     u64 page_count))
> +{
> +	u64 count, stride;
> +	unsigned int page_order;
> +	struct page *page;
> +	int ret;
> +
> +	page = region->pages[page_offset];
> +	if (!page)
> +		return -EINVAL;
> +
> +	page_order = folio_order(page_folio(page));
> +	/* 1G huge pages aren't supported by the hypercalls */
> +	if (page_order == PUD_ORDER)
> +		return -EINVAL;
> +

I'd prefer to be explicit about exactly which page_orders we *do*
support instead of just disallowing PUD_ORDER.

Without looking up folio_order(), there's an implication here that
page_order can be anything except PUD_ORDER, but that's not the case;
there's only 2 valid values for page_order.

The comment can instead read something like:
"The hypervisor only supports 4K and 2M page sizes"

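One way to make the accepted orders explicit, along the lines of the comment
suggested above (just a sketch; using PMD_ORDER for the 2M case is an
assumption about what is available in this tree):

	page_order = folio_order(page_folio(page));

	/* The map/unmap hypercalls only understand 4K and 2M mappings. */
	if (page_order != 0 && page_order != PMD_ORDER)
		return -EINVAL;
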
> +	stride = 1 << page_order;
> +
> +	/* Start at stride since the first page is validated */
> +	for (count = stride; count < page_count; count += stride) {
> +		page = region->pages[page_offset + count];
> +
> +		/* Break if current page is not present */
> +		if (!page)
> +			break;
> +
> +		/* Break if page size changes */
> +		if (page_order != folio_order(page_folio(page)))
> +			break;
> +	}
> +
> +	ret = handler(region, flags, page_offset, count);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +/**
> + * mshv_region_process_range - Processes a range of memory pages in a
> + *                             region.
> + * @region     : Pointer to the memory region structure.
> + * @flags      : Flags to pass to the handler.
> + * @page_offset: Offset into the region's pages array to start processing.
> + * @page_count : Number of pages to process.
> + * @handler    : Callback function to handle each chunk of contiguous
> + *               pages.
> + *
> + * Iterates over the specified range of pages in @region, skipping
> + * non-present pages. For each contiguous chunk of present pages, invokes
> + * @handler via mshv_region_process_chunk.
> + *
> + * Note: The @handler callback must be able to handle both normal and huge
> + * pages.
> + *
> + * Returns 0 on success, or a negative error code on failure.
> + */
> +static int mshv_region_process_range(struct mshv_mem_region *region,
> +				     u32 flags,
> +				     u64 page_offset, u64 page_count,
> +				     int (*handler)(struct mshv_mem_region *region,
> +						    u32 flags,
> +						    u64 page_offset,
> +						    u64 page_count))
> +{
> +	long ret;
> +
> +	if (page_offset + page_count > region->nr_pages)
> +		return -EINVAL;
> +
> +	while (page_count) {
> +		/* Skip non-present pages */
> +		if (!region->pages[page_offset]) {
> +			page_offset++;
> +			page_count--;
> +			continue;
> +		}
> +
> +		ret = mshv_region_process_chunk(region, flags,
> +						page_offset,
> +						page_count,
> +						handler);
> +		if (ret < 0)
> +			return ret;
> +
> +		page_offset += ret;
> +		page_count -= ret;
> +	}
> +
> +	return 0;
> +}
> +
>  struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>  					   u64 uaddr, u32 flags,
>  					   bool is_mmio)
> @@ -33,55 +151,80 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>  	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
>  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
>  
> -	/* Note: large_pages flag populated when we pin the pages */
>  	if (!is_mmio)
>  		region->flags.range_pinned = true;
>  
>  	return region;
>  }
>  
> +static int mshv_region_chunk_share(struct mshv_mem_region *region,
> +				   u32 flags,
> +				   u64 page_offset, u64 page_count)
> +{
> +	if (PageTransCompound(region->pages[page_offset]))
> +		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;

PageTransCompound() returns false if CONFIG_TRANSPARENT_HUGEPAGE is not
enabled. This won't work for hugetlb pages, will it?

Do we need to check if (PageHuge(page) || PageTransCompound(page)) ?

> +
> +	return hv_call_modify_spa_host_access(region->partition->pt_id,
> +					      region->pages + page_offset,
> +					      page_count,
> +					      HV_MAP_GPA_READABLE |
> +					      HV_MAP_GPA_WRITABLE,
> +					      flags, true);
> +}
> +
>  int mshv_region_share(struct mshv_mem_region *region)
>  {
>  	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
>  
> -	if (region->flags.large_pages)
> +	return mshv_region_process_range(region, flags,
> +					 0, region->nr_pages,
> +					 mshv_region_chunk_share);
> +}
> +
> +static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> +				     u32 flags,
> +				     u64 page_offset, u64 page_count)
> +{
> +	if (PageTransCompound(region->pages[page_offset]))
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>  
>  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> -			region->pages, region->nr_pages,
> -			HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
> -			flags, true);
> +					      region->pages + page_offset,
> +					      page_count, 0,
> +					      flags, false);
>  }
>  
>  int mshv_region_unshare(struct mshv_mem_region *region)
>  {
>  	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
>  
> -	if (region->flags.large_pages)
> -		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> -
> -	return hv_call_modify_spa_host_access(region->partition->pt_id,
> -			region->pages, region->nr_pages,
> -			0,
> -			flags, false);
> +	return mshv_region_process_range(region, flags,
> +					 0, region->nr_pages,
> +					 mshv_region_chunk_unshare);
>  }
>  
> -static int mshv_region_remap_pages(struct mshv_mem_region *region,
> -				   u32 map_flags,
> +static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> +				   u32 flags,
>  				   u64 page_offset, u64 page_count)

nit: Why the name change from map_flags to flags? It creates some
noise here.

>  {
> -	if (page_offset + page_count > region->nr_pages)
> -		return -EINVAL;
> -
> -	if (region->flags.large_pages)
> -		map_flags |= HV_MAP_GPA_LARGE_PAGE;
> +	if (PageTransCompound(region->pages[page_offset]))
> +		flags |= HV_MAP_GPA_LARGE_PAGE;
>  
>  	return hv_call_map_gpa_pages(region->partition->pt_id,
>  				     region->start_gfn + page_offset,
> -				     page_count, map_flags,
> +				     page_count, flags,
>  				     region->pages + page_offset);
>  }
>  
> +static int mshv_region_remap_pages(struct mshv_mem_region *region,
> +				   u32 map_flags,
> +				   u64 page_offset, u64 page_count)
> +{
> +	return mshv_region_process_range(region, map_flags,
> +					 page_offset, page_count,
> +					 mshv_region_chunk_remap);
> +}
> +
>  int mshv_region_map(struct mshv_mem_region *region)
>  {
>  	u32 map_flags = region->hv_map_flags;
> @@ -134,9 +277,6 @@ int mshv_region_pin(struct mshv_mem_region *region)
>  			goto release_pages;
>  	}
>  
> -	if (PageHuge(region->pages[0]))
> -		region->flags.large_pages = true;
> -
>  	return 0;
>  
>  release_pages:
> @@ -144,10 +284,28 @@ int mshv_region_pin(struct mshv_mem_region *region)
>  	return ret;
>  }
>  
> +static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> +				   u32 flags,
> +				   u64 page_offset, u64 page_count)
> +{
> +	if (PageTransCompound(region->pages[page_offset]))
> +		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> +
> +	return hv_call_unmap_gpa_pages(region->partition->pt_id,
> +				       region->start_gfn + page_offset,
> +				       page_count, flags);
> +}
> +
> +static int mshv_region_unmap(struct mshv_mem_region *region)
> +{
> +	return mshv_region_process_range(region, 0,
> +					 0, region->nr_pages,
> +					 mshv_region_chunk_unmap);
> +}
> +
>  void mshv_region_destroy(struct mshv_mem_region *region)
>  {
>  	struct mshv_partition *partition = region->partition;
> -	u32 unmap_flags = 0;
>  	int ret;
>  
>  	hlist_del(&region->hnode);
> @@ -162,12 +320,7 @@ void mshv_region_destroy(struct mshv_mem_region *region)
>  		}
>  	}
>  
> -	if (region->flags.large_pages)
> -		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> -
> -	/* ignore unmap failures and continue as process may be exiting */
> -	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> -				region->nr_pages, unmap_flags);
> +	mshv_region_unmap(region);
>  
>  	mshv_region_invalidate(region);
>  
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 0366f416c2f0..ff3374f13691 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -77,9 +77,8 @@ struct mshv_mem_region {
>  	u64 start_uaddr;
>  	u32 hv_map_flags;
>  	struct {
> -		u64 large_pages:  1; /* 2MiB */
>  		u64 range_pinned: 1;
> -		u64 reserved:	 62;
> +		u64 reserved:	 63;
>  	} flags;
>  	struct mshv_partition *partition;
>  	struct page *pages[];
> 
> 



* Re: [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions
  2025-12-03 19:26   ` Nuno Das Neves
@ 2025-12-03 19:55     ` Stanislav Kinsburskii
  2025-12-03 20:09       ` Nuno Das Neves
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Kinsburskii @ 2025-12-03 19:55 UTC (permalink / raw)
  To: Nuno Das Neves; +Cc: kys, haiyangz, wei.liu, decui, linux-hyperv, linux-kernel

On Wed, Dec 03, 2025 at 11:26:23AM -0800, Nuno Das Neves wrote:
> On 12/3/2025 10:24 AM, Stanislav Kinsburskii wrote:
> > Introduce kref-based reference counting and spinlock protection for
> > memory regions in Hyper-V partition management. This change improves
> > memory region lifecycle management and ensures thread-safe access to the
> > region list.
> > 
> > Also improves the check for overlapped memory regions during region
> > creation, preventing duplicate or conflicting mappings.
> > 
> > Previously, the regions list was protected by the partition mutex.
> > However, this approach is too heavy for frequent fault and invalidation
> > operations. Finer grained locking is now used to improve efficiency and
> > concurrency.
> > 
> > This is a precursor to supporting movable memory regions. Fault and
> > invalidation handling for movable regions will require safe traversal of
> > the region list and holding a region reference while performing
> > invalidation or fault operations.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c   |   19 ++++++++++++++++---
> >  drivers/hv/mshv_root.h      |    6 +++++-
> >  drivers/hv/mshv_root_main.c |   32 ++++++++++++++++++++++++--------
> >  3 files changed, 45 insertions(+), 12 deletions(-)
> > 
> > @@ -1657,8 +1670,10 @@ static void destroy_partition(struct mshv_partition *partition)
> >  	remove_partition(partition);
> >  
> >  	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
> > -				  hnode)
> > -		mshv_region_destroy(region);
> > +				  hnode) {
> > +		hlist_del(&region->hnode);
> > +		mshv_region_put(region);
> > +	}
> >  
> 
> With the following patch introducing movable memory, it looks like the
> list could be traversed by mshv_partition_region_by_gfn() even while
> this hlist_del() is being called.
> 
> Maybe that's not possible for some reason I'm unaware of, could you
> explain why we don't need to spin_lock here for hlist_del()?
> Or, alternatively, use hlist_for_each_entry_safe() in
> mshv_partition_region_by_gfn() to guard against the deletion?
> 

This function (destroy_partition) is called when there are no active
references to either the partition or its VPs (they are destroyed earlier in
the same function).
In other words, there can't be any callers of mshv_partition_region_by_gfn.

As for the mshv_partition_region_by_gfn function itself, the caller is
expected to take the lock.

Giving it more thought, I'm starting to think that an rwlock here would be
a better option than a spinlock + reference count, as regions won't be
added or removed too often, and using it would allow getting rid of the
reference counting.

However, this looks like an optimization that isn't required, and its
usefulness can be investigated in the future.

Thanks,
Stanislav

> >  	/* Withdraw and free all pages we deposited */
> >  	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> > @@ -1856,6 +1871,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
> >  
> >  	INIT_HLIST_HEAD(&partition->pt_devices);
> >  
> > +	spin_lock_init(&partition->pt_mem_regions_lock);
> >  	INIT_HLIST_HEAD(&partition->pt_mem_regions);
> >  
> >  	mshv_eventfd_init(partition);
> > 
> > 
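
A rough sketch of the rwlock idea above, for reference (the pt_mem_regions_sem
field and the function name are made up for illustration and are not part of
this series; an rw_semaphore rather than rwlock_t, since the fault path
sleeps):

	static bool mshv_gpa_intercept_locked(struct mshv_partition *p, u64 gfn)
	{
		struct mshv_mem_region *region;
		bool handled = false;

		/* Hold the region list for read across the sleeping fault path. */
		down_read(&p->pt_mem_regions_sem);
		region = mshv_partition_region_by_gfn(p, gfn);
		if (region && region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
			handled = mshv_region_handle_gfn_fault(region, gfn);
		up_read(&p->pt_mem_regions_sem);

		return handled;
	}

Region add/remove would then take the semaphore for write, and the per-region
kref could go away.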


* Re: [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions
  2025-12-03 19:55     ` Stanislav Kinsburskii
@ 2025-12-03 20:09       ` Nuno Das Neves
  0 siblings, 0 replies; 12+ messages in thread
From: Nuno Das Neves @ 2025-12-03 20:09 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, linux-hyperv, linux-kernel

On 12/3/2025 11:55 AM, Stanislav Kinsburskii wrote:
> On Wed, Dec 03, 2025 at 11:26:23AM -0800, Nuno Das Neves wrote:
>> On 12/3/2025 10:24 AM, Stanislav Kinsburskii wrote:
>>> Introduce kref-based reference counting and spinlock protection for
>>> memory regions in Hyper-V partition management. This change improves
>>> memory region lifecycle management and ensures thread-safe access to the
>>> region list.
>>>
>>> Also improves the check for overlapped memory regions during region
>>> creation, preventing duplicate or conflicting mappings.
>>>
>>> Previously, the regions list was protected by the partition mutex.
>>> However, this approach is too heavy for frequent fault and invalidation
>>> operations. Finer grained locking is now used to improve efficiency and
>>> concurrency.
>>>
>>> This is a precursor to supporting movable memory regions. Fault and
>>> invalidation handling for movable regions will require safe traversal of
>>> the region list and holding a region reference while performing
>>> invalidation or fault operations.
>>>
>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>> ---
>>>  drivers/hv/mshv_regions.c   |   19 ++++++++++++++++---
>>>  drivers/hv/mshv_root.h      |    6 +++++-
>>>  drivers/hv/mshv_root_main.c |   32 ++++++++++++++++++++++++--------
>>>  3 files changed, 45 insertions(+), 12 deletions(-)
>>>
>>> @@ -1657,8 +1670,10 @@ static void destroy_partition(struct mshv_partition *partition)
>>>  	remove_partition(partition);
>>>  
>>>  	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
>>> -				  hnode)
>>> -		mshv_region_destroy(region);
>>> +				  hnode) {
>>> +		hlist_del(&region->hnode);
>>> +		mshv_region_put(region);
>>> +	}
>>>  
>>
>> With the following patch introducing movable memory, it looks like the
>> list could be traversed by mshv_partition_region_by_gfn() even while
>> this hlist_del() is being called.
>>
>> Maybe that's not possible for some reason I'm unaware of, could you
>> explain why we don't need to spin_lock here for hlist_del()?
>> Or, alternatively, use hlist_for_each_entry_safe() in
>> mshv_partition_region_by_gfn() to guard against the deletion?
>>
> 
> This function (destroy_partition) is called when there are no active
> references to either the partition or its VPs (they are destroyed earlier in
> the same function).
> In other words, there can't be any callers of mshv_partition_region_by_gfn.

Ah, I see, even if the mmu_notifier is still active, it doesn't traverse the
list, as it gets the region by container_of().
Thanks.

> 
> As for the mshv_partition_region_by_gfn function itself, the caller is
> expected to take the lock.
> 
> Giving it more thought, I'm starting to think that an rwlock here would be
> a better option than a spinlock + reference count, as regions won't be
> added or removed too often, and using it would allow getting rid of the
> reference counting.
> 
> However, this looks like an optimization that isn't required, and its
> usefulness can be investigated in the future.
> 
> Thanks,
> Stanislav
> 
>>>  	/* Withdraw and free all pages we deposited */
>>>  	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
>>> @@ -1856,6 +1871,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
>>>  
>>>  	INIT_HLIST_HEAD(&partition->pt_devices);
>>>  
>>> +	spin_lock_init(&partition->pt_mem_regions_lock);
>>>  	INIT_HLIST_HEAD(&partition->pt_mem_regions);
>>>  
>>>  	mshv_eventfd_init(partition);
>>>
>>>



* Re: [PATCH v8 6/6] Drivers: hv: Add support for movable memory regions
  2025-12-03 18:24 ` [PATCH v8 6/6] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
@ 2025-12-03 20:36   ` Nuno Das Neves
  0 siblings, 0 replies; 12+ messages in thread
From: Nuno Das Neves @ 2025-12-03 20:36 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui
  Cc: linux-hyperv, linux-kernel

On 12/3/2025 10:24 AM, Stanislav Kinsburskii wrote:
> Introduce support for movable memory regions in the Hyper-V root partition
> driver to improve memory management flexibility and enable advanced use
> cases such as dynamic memory remapping.
> 
> Mirror the address space between the Linux root partition and guest VMs
> using HMM. The root partition owns the memory, while guest VMs act as
> devices with page tables managed via hypercalls. MSHV handles VP intercepts
> by invoking hmm_range_fault() and updating SLAT entries. When memory is
> reclaimed, HMM invalidates the relevant regions, prompting MSHV to clear
> SLAT entries; guest VMs will fault again on access.
> 
> Integrate mmu_interval_notifier for movable regions, implement handlers for
> HMM faults and memory invalidation, and update memory region mapping logic
> to support movable regions.
> 
> While MMU notifiers are commonly used in virtualization drivers, this
> implementation leverages HMM (Heterogeneous Memory Management) for its
> specialized functionality. HMM provides a framework for mirroring,
> invalidation, and fault handling, reducing boilerplate and improving
> maintainability compared to generic MMU notifiers.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/Kconfig          |    2 
>  drivers/hv/mshv_regions.c   |  215 ++++++++++++++++++++++++++++++++++++++++++-
>  drivers/hv/mshv_root.h      |   17 +++
>  drivers/hv/mshv_root_main.c |  139 +++++++++++++++++++++++-----
>  4 files changed, 343 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index d4a8d349200c..7937ac0cbd0f 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -76,6 +76,8 @@ config MSHV_ROOT
>  	depends on PAGE_SIZE_4KB
>  	select EVENTFD
>  	select VIRT_XFER_TO_GUEST_WORK
> +	select HMM_MIRROR
> +	select MMU_NOTIFIER
>  	default n
>  	help
>  	  Select this option to enable support for booting and running as root
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 94f33754f545..afe03258caf0 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -7,6 +7,8 @@
>   * Authors: Microsoft Linux virtualization team
>   */
>  
> +#include <linux/hmm.h>
> +#include <linux/hyperv.h>
>  #include <linux/kref.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
> @@ -15,6 +17,8 @@
>  
>  #include "mshv_root.h"
>  
> +#define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
> +
>  /**
>   * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
>   *                             in a region.
> @@ -152,9 +156,6 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>  	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
>  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
>  
> -	if (!is_mmio)
> -		region->flags.range_pinned = true;
> -

The parameter is_mmio is now unused in this function.

>  	kref_init(&region->refcount);
>  
>  	return region;
> @@ -239,7 +240,7 @@ int mshv_region_map(struct mshv_mem_region *region)
>  static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
>  					 u64 page_offset, u64 page_count)
>  {
> -	if (region->flags.range_pinned)
> +	if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
>  		unpin_user_pages(region->pages + page_offset, page_count);
>  
>  	memset(region->pages + page_offset, 0,
> @@ -313,6 +314,9 @@ static void mshv_region_destroy(struct kref *ref)
>  	struct mshv_partition *partition = region->partition;
>  	int ret;
>  
> +	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> +		mshv_region_movable_fini(region);
> +
>  	if (mshv_partition_encrypted(partition)) {
>  		ret = mshv_region_share(region);
>  		if (ret) {
> @@ -339,3 +343,206 @@ int mshv_region_get(struct mshv_mem_region *region)
>  {
>  	return kref_get_unless_zero(&region->refcount);
>  }
> +
> +/**
> + * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> + * @region: Pointer to the memory region structure
> + * @range: Pointer to the HMM range structure
> + *
> + * This function performs the following steps:
> + * 1. Reads the notifier sequence for the HMM range.
> + * 2. Acquires a read lock on the memory map.
> + * 3. Handles HMM faults for the specified range.
> + * 4. Releases the read lock on the memory map.
> + * 5. If successful, locks the memory region mutex.
> + * 6. Verifies if the notifier sequence has changed during the operation.
> + *    If it has, releases the mutex and returns -EBUSY to match with
> + *    hmm_range_fault() return code for repeating.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> +					  struct hmm_range *range)
> +{
> +	int ret;
> +
> +	range->notifier_seq = mmu_interval_read_begin(range->notifier);
> +	mmap_read_lock(region->mni.mm);
> +	ret = hmm_range_fault(range);
> +	mmap_read_unlock(region->mni.mm);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&region->mutex);
> +
> +	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> +		mutex_unlock(&region->mutex);
> +		cond_resched();
> +		return -EBUSY;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * mshv_region_range_fault - Handle memory range faults for a given region.
> + * @region: Pointer to the memory region structure.
> + * @page_offset: Offset of the page within the region.
> + * @page_count: Number of pages to handle.
> + *
> + * This function resolves memory faults for a specified range of pages
> + * within a memory region. It uses HMM (Heterogeneous Memory Management)
> + * to fault in the required pages and updates the region's page array.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> +				   u64 page_offset, u64 page_count)
> +{
> +	struct hmm_range range = {
> +		.notifier = &region->mni,
> +		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> +	};
> +	unsigned long *pfns;
> +	int ret;
> +	u64 i;
> +
> +	pfns = kmalloc_array(page_count, sizeof(unsigned long), GFP_KERNEL);

nit: Prefer sizeof(*pfns)

<snip>

The rest looks fine to me. With the minor issues above fixed,
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>


Thread overview: 12+ messages
2025-12-03 18:23 [PATCH v8 0/6] Introduce movable pages for Hyper-V guests Stanislav Kinsburskii
2025-12-03 18:24 ` [PATCH v8 1/6] Drivers: hv: Refactor and rename memory region handling functions Stanislav Kinsburskii
2025-12-03 18:24 ` [PATCH v8 2/6] Drivers: hv: Centralize guest memory region destruction Stanislav Kinsburskii
2025-12-03 18:24 ` [PATCH v8 3/6] Drivers: hv: Move region management to mshv_regions.c Stanislav Kinsburskii
2025-12-03 18:24 ` [PATCH v8 4/6] Drivers: hv: Fix huge page handling in memory region traversal Stanislav Kinsburskii
2025-12-03 19:47   ` Nuno Das Neves
2025-12-03 18:24 ` [PATCH v8 5/6] Drivers: hv: Add refcount and locking to mem regions Stanislav Kinsburskii
2025-12-03 19:26   ` Nuno Das Neves
2025-12-03 19:55     ` Stanislav Kinsburskii
2025-12-03 20:09       ` Nuno Das Neves
2025-12-03 18:24 ` [PATCH v8 6/6] Drivers: hv: Add support for movable memory regions Stanislav Kinsburskii
2025-12-03 20:36   ` Nuno Das Neves
