Linux-HyperV List

* [PATCH v3 4/7] mshv: Optimize memory region mapping operations
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Two operations do not require PFN iteration: region unmapping and
region remapping with no access. For unmapping, all frames in MSHV
memory regions are guaranteed to be mapped with page access, so the
whole region can be unmapped without checking individual PFNs. For
remapping with no access, all frames are likewise already mapped with
page access, so the whole range can be remapped with no access in one
pass.

Since neither operation needs PFN validation, iterating over PFNs is
redundant. Instead, split the range into an unaligned head, a
large-page-aligned middle, and an unaligned tail, and issue one
hypercall for each non-empty part. This eliminates PFN traversal for
these operations, requires no additional hypercalls compared to the
PFN-checking approach, and keeps the execution path short and
sequential.
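The head/middle/tail split can be checked in isolation. The userspace
sketch below mirrors the starting_pfns, aligned_pfns and remaining_pfns
arithmetic from the patch. It assumes 512 base pages per large page
(PTRS_PER_PMD on x86-64 with 4 KiB pages); split_gfn_range(), struct
gfn_split and the ALIGN_*_POW2 macros are hypothetical stand-ins for
the kernel helpers, not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: 4 KiB base pages and 2 MiB large pages, i.e. 512 base
 * pages per large page (the value of PTRS_PER_PMD on x86-64).
 */
#define PAGES_PER_LARGE_PAGE 512ULL

/* Userspace stand-ins for the kernel's ALIGN() and ALIGN_DOWN(). */
#define ALIGN_UP_POW2(x, a)   (((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN_POW2(x, a) ((x) & ~((a) - 1))

struct gfn_split {
	uint64_t starting;	/* unaligned head: base pages */
	uint64_t aligned;	/* aligned middle: whole large pages */
	uint64_t remaining;	/* unaligned tail: base pages */
};

/* Hypothetical helper mirroring the split arithmetic in the patch. */
static struct gfn_split split_gfn_range(uint64_t gfn, uint64_t nr_pfns)
{
	struct gfn_split s;
	uint64_t head = ALIGN_UP_POW2(gfn, PAGES_PER_LARGE_PAGE) - gfn;

	/* The head never extends past the end of the range. */
	s.starting = head < nr_pfns ? head : nr_pfns;
	s.aligned = ALIGN_DOWN_POW2(nr_pfns - s.starting,
				    PAGES_PER_LARGE_PAGE);
	s.remaining = nr_pfns - s.aligned - s.starting;

	/* The three pieces always cover the range exactly once. */
	assert(s.starting + s.aligned + s.remaining == nr_pfns);
	return s;
}
```

For example, split_gfn_range(0x1f0, 1100) yields a 16-page head up to
GFN 0x200, a 1024-page aligned middle (two large pages) and a 60-page
tail, while a range shorter than the distance to the next large-page
boundary ends up entirely in the head.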

The aligned middle is processed with the HV_UNMAP_GPA_LARGE_PAGE or
HV_MAP_GPA_LARGE_PAGE flag, and only the head and tail are processed at
base page granularity. This also removes mshv_region_chunk_unmap(),
reducing code complexity.
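To illustrate the resulting hypercall pattern, the sketch below replaces
hv_call_unmap_pfns() with a hypothetical recorder, mock_unmap_pfns(), so
the sequence of calls can be inspected. MOCK_UNMAP_LARGE_PAGE is a
placeholder value, not the real HV_UNMAP_GPA_LARGE_PAGE constant, and
512 base pages per large page is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Placeholder flag for illustration only; the real
 * HV_UNMAP_GPA_LARGE_PAGE value comes from the Hyper-V headers.
 */
#define MOCK_UNMAP_LARGE_PAGE	0x1u
#define PAGES_PER_LARGE_PAGE	512ULL	/* assumed: 2 MiB / 4 KiB */

struct mock_call {
	uint64_t gfn;
	uint64_t count;
	uint32_t flags;
};

/* Records each "hypercall" instead of issuing it (hypothetical). */
static struct mock_call calls[3];
static size_t nr_calls;

static int mock_unmap_pfns(uint64_t gfn, uint64_t count, uint32_t flags)
{
	assert(nr_calls < 3);	/* the split never needs more than three */
	calls[nr_calls++] = (struct mock_call){ gfn, count, flags };
	return 0;
}

/*
 * Same shape as the patched mshv_region_unmap(): head, aligned middle,
 * tail, stopping at the first error.
 */
static int sketch_unmap(uint64_t gfn, uint64_t nr_pfns)
{
	uint64_t mask = PAGES_PER_LARGE_PAGE - 1;
	uint64_t head = ((gfn + mask) & ~mask) - gfn;
	uint64_t starting = head < nr_pfns ? head : nr_pfns;
	uint64_t aligned = (nr_pfns - starting) & ~mask;
	uint64_t remaining = nr_pfns - aligned - starting;
	int ret = 0;

	if (starting)
		ret = mock_unmap_pfns(gfn, starting, 0);
	gfn += starting;

	if (!ret && aligned)
		ret = mock_unmap_pfns(gfn, aligned, MOCK_UNMAP_LARGE_PAGE);
	gfn += aligned;

	if (!ret && remaining)
		ret = mock_unmap_pfns(gfn, remaining, 0);

	return ret;
}
```

Unmapping 1100 pages at GFN 0x1f0 records three calls: 16 base pages at
0x1f0, 1024 pages at 0x200 with the large-page flag, and 60 base pages
at 0x600. A fully aligned range collapses to a single large-page call.
mshv_region_map_no_access() follows the same shape, with
HV_MAP_GPA_NO_ACCESS set on every call.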

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   87 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 2c4215381e0b..f209a34afb3a 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -449,27 +449,38 @@ static int mshv_region_pin(struct mshv_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_region *region,
-				   u32 flags,
-				   u64 pfn_offset, u64 pfn_count,
-				   bool huge_page)
+static int mshv_region_unmap(struct mshv_region *region)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
-		return 0;
+	u64 gfn, nr_pfns, starting_pfns, aligned_pfns, remaining_pfns;
+	int ret = 0;
 
-	if (huge_page)
-		flags |= HV_UNMAP_GPA_LARGE_PAGE;
+	gfn = region->start_gfn;
+	nr_pfns = region->nr_pfns;
 
-	return hv_call_unmap_pfns(region->partition->pt_id,
-				  region->start_gfn + pfn_offset,
-				  pfn_count, flags);
-}
+	starting_pfns = min(ALIGN(gfn, PTRS_PER_PMD) - gfn, nr_pfns);
+	aligned_pfns = ALIGN_DOWN(nr_pfns - starting_pfns, PTRS_PER_PMD);
+	remaining_pfns = nr_pfns - aligned_pfns - starting_pfns;
 
-static int mshv_region_unmap(struct mshv_region *region)
-{
-	return mshv_region_process_range(region, 0,
-					 0, region->nr_pfns,
-					 mshv_region_chunk_unmap);
+	if (starting_pfns)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 gfn, starting_pfns,
+					 0);
+
+	gfn += starting_pfns;
+
+	if (!ret && aligned_pfns)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 gfn, aligned_pfns,
+					 HV_UNMAP_GPA_LARGE_PAGE);
+
+	gfn += aligned_pfns;
+
+	if (!ret && remaining_pfns)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 gfn, remaining_pfns,
+					 0);
+
+	return ret;
 }
 
 static void mshv_region_destroy(struct kref *ref)
@@ -684,6 +695,45 @@ bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 	return !ret;
 }
 
+static int mshv_region_map_no_access(struct mshv_region *region,
+				     u64 pfn_offset, u64 pfn_count)
+{
+	u64 gfn, nr_pfns, starting_pfns, aligned_pfns, remaining_pfns;
+	int ret = 0;
+
+	gfn = region->start_gfn + pfn_offset;
+	nr_pfns = pfn_count;
+
+	starting_pfns = min(ALIGN(gfn, PTRS_PER_PMD) - gfn, nr_pfns);
+	aligned_pfns = ALIGN_DOWN(nr_pfns - starting_pfns, PTRS_PER_PMD);
+	remaining_pfns = nr_pfns - aligned_pfns - starting_pfns;
+
+	if (starting_pfns)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   gfn, starting_pfns,
+					   HV_MAP_GPA_NO_ACCESS,
+					   NULL);
+
+	gfn += starting_pfns;
+
+	if (!ret && aligned_pfns)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   gfn, aligned_pfns,
+					   HV_MAP_GPA_NO_ACCESS |
+					   HV_MAP_GPA_LARGE_PAGE,
+					   NULL);
+
+	gfn += aligned_pfns;
+
+	if (!ret && remaining_pfns)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   gfn, remaining_pfns,
+					   HV_MAP_GPA_NO_ACCESS,
+					   NULL);
+
+	return ret;
+}
+
 /**
  * mshv_region_interval_invalidate - Invalidate a range of memory region
  * @mni: Pointer to the mmu_interval_notifier structure
@@ -727,8 +777,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
-				     pfn_offset, pfn_count);
+	ret = mshv_region_map_no_access(region, pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 




* [PATCH v3 3/7] mshv: Rename mshv_mem_region to mshv_region
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The mshv_mem_region structure represents guest address space regions,
which can be either RAM-backed memory or memory-mapped I/O (MMIO)
regions with no RAM backing. The "mem_" prefix incorrectly suggests
that the structure only handles memory regions, obscuring its actual
purpose.

Remove the "mem_" prefix to align with existing function naming
(mshv_region_map, mshv_region_pin, etc.) and accurately reflect that
this structure manages arbitrary guest address space mappings
regardless of their backing type.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   74 ++++++++++++++++++++++---------------------
 drivers/hv/mshv_root.h      |   18 +++++-----
 drivers/hv/mshv_root_main.c |   20 ++++++------
 3 files changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 70cd0857a28e..2c4215381e0b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,7 +20,7 @@
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
 #define MSHV_INVALID_PFN				ULONG_MAX
 
-typedef int (*pfn_handler_t)(struct mshv_mem_region *region, u32 flags,
+typedef int (*pfn_handler_t)(struct mshv_region *region, u32 flags,
 			     u64 pfn_offset, u64 pfn_count,
 			     bool huge_page);
 
@@ -81,7 +81,7 @@ static int mshv_chunk_stride(struct page *page,
  *
  * Return: Number of pages handled, or negative error code.
  */
-static long mshv_region_process_pfns(struct mshv_mem_region *region,
+static long mshv_region_process_pfns(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     pfn_handler_t handler)
@@ -135,7 +135,7 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
  *
  * Return: Number of PFNs handled, or negative error code.
  */
-static long mshv_region_process_hole(struct mshv_mem_region *region,
+static long mshv_region_process_hole(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     pfn_handler_t handler)
@@ -149,7 +149,7 @@ static long mshv_region_process_hole(struct mshv_mem_region *region,
 	return pfn_count;
 }
 
-static long mshv_region_process_chunk(struct mshv_mem_region *region,
+static long mshv_region_process_chunk(struct mshv_region *region,
 				      u32 flags,
 				      u64 pfn_offset, u64 pfn_count,
 				      pfn_handler_t handler)
@@ -182,7 +182,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
  *
  * Returns 0 on success, or a negative error code on failure.
  */
-static int mshv_region_process_range(struct mshv_mem_region *region,
+static int mshv_region_process_range(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     pfn_handler_t handler)
@@ -231,12 +231,12 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
-					   u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags,
-					   ulong mmio_pfn)
+struct mshv_region *mshv_region_create(enum mshv_region_type type,
+				       u64 guest_pfn, u64 nr_pfns,
+				       u64 uaddr, u32 flags,
+				       ulong mmio_pfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	int ret = 0;
 	u64 i;
 
@@ -286,7 +286,7 @@ struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
 	return ERR_PTR(ret);
 }
 
-static int mshv_region_chunk_share(struct mshv_mem_region *region,
+static int mshv_region_chunk_share(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -305,7 +305,7 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 					      flags, true);
 }
 
-static int mshv_region_share(struct mshv_mem_region *region)
+static int mshv_region_share(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
@@ -314,7 +314,7 @@ static int mshv_region_share(struct mshv_mem_region *region)
 					 mshv_region_chunk_share);
 }
 
-static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
+static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
@@ -331,7 +331,7 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 					      flags, false);
 }
 
-static int mshv_region_unshare(struct mshv_mem_region *region)
+static int mshv_region_unshare(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
@@ -340,7 +340,7 @@ static int mshv_region_unshare(struct mshv_mem_region *region)
 					 mshv_region_chunk_unshare);
 }
 
-static int mshv_region_chunk_remap(struct mshv_mem_region *region,
+static int mshv_region_chunk_remap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -362,7 +362,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				    region->mreg_pfns + pfn_offset);
 }
 
-static int mshv_region_remap_pfns(struct mshv_mem_region *region,
+static int mshv_region_remap_pfns(struct mshv_region *region,
 				  u32 map_flags,
 				  u64 pfn_offset, u64 pfn_count)
 {
@@ -371,7 +371,7 @@ static int mshv_region_remap_pfns(struct mshv_mem_region *region,
 					 mshv_region_chunk_remap);
 }
 
-static int mshv_region_map(struct mshv_mem_region *region)
+static int mshv_region_map(struct mshv_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
@@ -379,7 +379,7 @@ static int mshv_region_map(struct mshv_mem_region *region)
 				      0, region->nr_pfns);
 }
 
-static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
+static void mshv_region_invalidate_pfns(struct mshv_region *region,
 					u64 pfn_offset, u64 pfn_count)
 {
 	u64 i;
@@ -395,12 +395,12 @@ static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
 	}
 }
 
-static void mshv_region_invalidate(struct mshv_mem_region *region)
+static void mshv_region_invalidate(struct mshv_region *region)
 {
 	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
-static int mshv_region_pin(struct mshv_mem_region *region)
+static int mshv_region_pin(struct mshv_region *region)
 {
 	u64 done_count, nr_pfns, i;
 	unsigned long *pfns;
@@ -449,7 +449,7 @@ static int mshv_region_pin(struct mshv_mem_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
+static int mshv_region_chunk_unmap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -465,7 +465,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				  pfn_count, flags);
 }
 
-static int mshv_region_unmap(struct mshv_mem_region *region)
+static int mshv_region_unmap(struct mshv_region *region)
 {
 	return mshv_region_process_range(region, 0,
 					 0, region->nr_pfns,
@@ -474,8 +474,8 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
 
 static void mshv_region_destroy(struct kref *ref)
 {
-	struct mshv_mem_region *region =
-		container_of(ref, struct mshv_mem_region, mreg_refcount);
+	struct mshv_region *region =
+		container_of(ref, struct mshv_region, mreg_refcount);
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
@@ -499,12 +499,12 @@ static void mshv_region_destroy(struct kref *ref)
 	vfree(region);
 }
 
-void mshv_region_put(struct mshv_mem_region *region)
+void mshv_region_put(struct mshv_region *region)
 {
 	kref_put(&region->mreg_refcount, mshv_region_destroy);
 }
 
-int mshv_region_get(struct mshv_mem_region *region)
+int mshv_region_get(struct mshv_region *region)
 {
 	return kref_get_unless_zero(&region->mreg_refcount);
 }
@@ -534,7 +534,7 @@ int mshv_region_get(struct mshv_mem_region *region)
  *
  * Return: 0 on success, a negative error code otherwise.
  */
-static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
 					  unsigned long start,
 					  unsigned long end,
 					  unsigned long *pfns,
@@ -613,7 +613,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
  *
  * Return: 0 on success, negative errno on failure.
  */
-static int mshv_region_collect_and_map(struct mshv_mem_region *region,
+static int mshv_region_collect_and_map(struct mshv_region *region,
 				       u64 pfn_offset, u64 pfn_count,
 				       bool do_fault)
 {
@@ -653,14 +653,14 @@ static int mshv_region_collect_and_map(struct mshv_mem_region *region,
 	return ret;
 }
 
-static int mshv_region_range_fault(struct mshv_mem_region *region,
+static int mshv_region_range_fault(struct mshv_region *region,
 				   u64 pfn_offset, u64 pfn_count)
 {
 	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
 					   true);
 }
 
-bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 {
 	u64 pfn_offset, pfn_count;
 	int ret;
@@ -706,9 +706,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 					    const struct mmu_notifier_range *range,
 					    unsigned long cur_seq)
 {
-	struct mshv_mem_region *region = container_of(mni,
-						      struct mshv_mem_region,
-						      mreg_mni);
+	struct mshv_region *region = container_of(mni,
+						  struct mshv_region,
+						  mreg_mni);
 	u64 pfn_offset, pfn_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
@@ -767,7 +767,7 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
  *
  * Return: 0 on success, negative error code on failure.
  */
-static int mshv_map_pinned_region(struct mshv_mem_region *region)
+static int mshv_map_pinned_region(struct mshv_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
@@ -823,13 +823,13 @@ static int mshv_map_pinned_region(struct mshv_mem_region *region)
 	return ret;
 }
 
-static int mshv_map_movable_region(struct mshv_mem_region *region)
+static int mshv_map_movable_region(struct mshv_region *region)
 {
 	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
 					   false);
 }
 
-static int mshv_map_mmio_region(struct mshv_mem_region *region)
+static int mshv_map_mmio_region(struct mshv_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 
@@ -838,7 +838,7 @@ static int mshv_map_mmio_region(struct mshv_mem_region *region)
 				     region->nr_pfns);
 }
 
-int mshv_map_region(struct mshv_mem_region *region)
+int mshv_map_region(struct mshv_region *region)
 {
 	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 2bcdfa070517..97659ba55418 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -81,7 +81,7 @@ enum mshv_region_type {
 	MSHV_REGION_TYPE_MMIO
 };
 
-struct mshv_mem_region {
+struct mshv_region {
 	struct hlist_node hnode;
 	struct kref mreg_refcount;
 	u64 nr_pfns;
@@ -367,13 +367,13 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
-struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
-					   u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags,
-					   ulong mmio_pfn);
-void mshv_region_put(struct mshv_mem_region *region);
-int mshv_region_get(struct mshv_mem_region *region);
-bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
-int mshv_map_region(struct mshv_mem_region *region);
+struct mshv_region *mshv_region_create(enum mshv_region_type type,
+				       u64 guest_pfn, u64 nr_pfns,
+				       u64 uaddr, u32 flags,
+				       ulong mmio_pfn);
+void mshv_region_put(struct mshv_region *region);
+int mshv_region_get(struct mshv_region *region);
+bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn);
+int mshv_map_region(struct mshv_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 3bfa9e9c575f..9d83a2348655 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -612,10 +612,10 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
-static struct mshv_mem_region *
+static struct mshv_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
 		if (gfn >= region->start_gfn &&
@@ -626,10 +626,10 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 	return NULL;
 }
 
-static struct mshv_mem_region *
+static struct mshv_region *
 mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	spin_lock(&p->pt_mem_regions_lock);
 	region = mshv_partition_region_by_gfn(p, gfn);
@@ -656,7 +656,7 @@ mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
 static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
 {
 	struct mshv_partition *p = vp->vp_partition;
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	bool ret = false;
 	u64 gfn;
 #if defined(CONFIG_X86_64)
@@ -1217,9 +1217,9 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
  */
 static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_user_mem_region *mem,
-					struct mshv_mem_region **regionpp)
+					struct mshv_region **regionpp)
 {
-	struct mshv_mem_region *rg;
+	struct mshv_region *rg;
 	enum mshv_region_type type;
 	u64 nr_pfns = HVPFN_DOWN(mem->size);
 	struct vm_area_struct *vma;
@@ -1282,7 +1282,7 @@ static long
 mshv_map_user_memory(struct mshv_partition *partition,
 		     struct mshv_user_mem_region mem)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	long ret;
 
 	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
@@ -1318,7 +1318,7 @@ static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region mem)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1690,7 +1690,7 @@ remove_partition(struct mshv_partition *partition)
 static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	struct hlist_node *n;
 	int i;
 




* [PATCH v3 2/7] mshv: Improve code readability with handler function typedef
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The inline function pointer declarations in mshv_region_process_*
functions make the code harder to read and maintain. Each function
signature repeats the same lengthy callback parameter definition,
adding visual noise and making the actual logic less clear.

Introduce a pfn_handler_t typedef to replace the repeated inline
function pointer declarations. This shortens the function signatures,
makes the code easier to maintain, and follows a common kernel pattern
for callbacks.
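A reduced, standalone illustration of the pattern follows. struct
region, sketch_process() and count_handler() are hypothetical and only
stand in for struct mshv_mem_region, the mshv_region_process_*()
helpers and a chunk handler such as mshv_region_chunk_share().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct region;	/* opaque here; stands in for struct mshv_mem_region */

/*
 * Before: every helper spelled out the full pointer type, e.g.
 *   int (*handler)(struct region *region, uint32_t flags,
 *                  uint64_t pfn_offset, uint64_t pfn_count,
 *                  bool huge_page)
 * After: the callback signature is named once and reused.
 */
typedef int (*pfn_handler_t)(struct region *region, uint32_t flags,
			     uint64_t pfn_offset, uint64_t pfn_count,
			     bool huge_page);

/* Returns the number of PFNs handled, or a negative error code. */
static long sketch_process(struct region *region, uint32_t flags,
			   uint64_t pfn_offset, uint64_t pfn_count,
			   pfn_handler_t handler)
{
	int ret;

	assert(handler != NULL);
	ret = handler(region, flags, pfn_offset, pfn_count, false);
	return ret < 0 ? ret : (long)pfn_count;
}

/* Example handler matching pfn_handler_t: counts the PFNs it sees. */
static uint64_t handled;

static int count_handler(struct region *region, uint32_t flags,
			 uint64_t pfn_offset, uint64_t pfn_count,
			 bool huge_page)
{
	(void)region;
	(void)flags;
	(void)pfn_offset;
	(void)huge_page;
	handled += pfn_count;
	return 0;
}
```

Any function with the matching signature, such as count_handler(), can
be passed where a pfn_handler_t is expected, and a signature mismatch is
reported by the compiler at the call site.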

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index a85d18e2c279..70cd0857a28e 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,6 +20,10 @@
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
 #define MSHV_INVALID_PFN				ULONG_MAX
 
+typedef int (*pfn_handler_t)(struct mshv_mem_region *region, u32 flags,
+			     u64 pfn_offset, u64 pfn_count,
+			     bool huge_page);
+
 static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
 
 /**
@@ -80,11 +84,7 @@ static int mshv_chunk_stride(struct page *page,
 static long mshv_region_process_pfns(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     pfn_handler_t handler)
 {
 	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count;
@@ -138,11 +138,7 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
 static long mshv_region_process_hole(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     pfn_handler_t handler)
 {
 	long ret;
 
@@ -156,11 +152,7 @@ static long mshv_region_process_hole(struct mshv_mem_region *region,
 static long mshv_region_process_chunk(struct mshv_mem_region *region,
 				      u32 flags,
 				      u64 pfn_offset, u64 pfn_count,
-				      int (*handler)(struct mshv_mem_region *region,
-						     u32 flags,
-						     u64 pfn_offset,
-						     u64 pfn_count,
-						     bool huge_page))
+				      pfn_handler_t handler)
 {
 	if (pfn_valid(region->mreg_pfns[pfn_offset]))
 		return mshv_region_process_pfns(region, flags,
@@ -193,11 +185,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 static int mshv_region_process_range(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     pfn_handler_t handler)
 {
 	u64 start, end;
 	long ret;




* [PATCH v3 1/7] mshv: Consolidate region creation and mapping
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Consolidate region type detection and initialization into
mshv_region_create() to simplify the region creation flow. Move type
determination logic (MMIO/pinned/movable) earlier in the process and
initialize type-specific fields during creation rather than after.

This eliminates the need for mshv_region_movable_init/fini() by
handling MMU interval notifier setup directly in the constructor and
teardown in the destructor. Region mapping is also unified through a
single mshv_map_region() dispatcher that routes to the appropriate
type-specific handler.
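The constructor-plus-dispatcher structure can be sketched in userspace
as follows. All names are hypothetical simplifications: the MMU interval
notifier setup for movable regions and the real mapping hypercalls are
omitted, and the per-type mappers are stubs that only record which path
was taken.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

enum region_type { REGION_MEM_PINNED, REGION_MEM_MOVABLE, REGION_MMIO };

struct region {
	enum region_type type;
	uint64_t nr_pfns;
	uint64_t mmio_pfn;	/* meaningful only for REGION_MMIO */
};

/* Type-specific setup happens inside the constructor. */
static struct region *region_create(enum region_type type,
				    uint64_t nr_pfns, uint64_t mmio_pfn)
{
	struct region *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;

	switch (type) {
	case REGION_MEM_MOVABLE:	/* notifier insertion would go here */
	case REGION_MEM_PINNED:
		break;
	case REGION_MMIO:
		r->mmio_pfn = mmio_pfn;
		break;
	default:
		free(r);
		return NULL;	/* stands in for ERR_PTR(-EINVAL) */
	}
	r->type = type;
	r->nr_pfns = nr_pfns;
	return r;
}

/* Stub mappers: record the path taken and report success. */
static int last_mapped = -1;

static int map_pinned(struct region *r)
{
	(void)r;
	last_mapped = REGION_MEM_PINNED;
	return 0;
}

static int map_movable(struct region *r)
{
	(void)r;
	last_mapped = REGION_MEM_MOVABLE;
	return 0;
}

static int map_mmio(struct region *r)
{
	(void)r;
	last_mapped = REGION_MMIO;
	return 0;
}

/* Single entry point that routes on the type fixed at creation. */
static int map_region(struct region *r)
{
	assert(r != NULL);
	switch (r->type) {
	case REGION_MEM_PINNED:
		return map_pinned(r);
	case REGION_MEM_MOVABLE:
		return map_movable(r);
	case REGION_MMIO:
		return map_mmio(r);
	}
	return -EINVAL;
}
```

Because the type and the MMIO PFN are fixed when the region is created,
callers only ever need map_region(); the per-type mappers can stay
static, which is where the smaller exported API comes from.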

These changes improve code organization by:
- Reducing the API surface (four fewer exported functions)
- Centralizing type determination and validation
- Making the region lifecycle more explicit and easier to follow
- Removing post-construction initialization steps

Apart from the behavior change described below, the refactoring
preserves existing functionality while making the codebase more
maintainable and less error-prone.

Additionally, movable region initialization now fails explicitly
if mmu_interval_notifier_insert() returns an error, rather than
silently falling back to pinned memory. This fail-fast approach
makes configuration issues more visible.
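The difference between the old fallback and the new fail-fast behavior,
reduced to a sketch: notifier_insert() is a hypothetical stand-in for
mmu_interval_notifier_insert() whose failure is controlled by the
caller, and -ENOMEM is only an example error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum region_type { REGION_MEM_PINNED, REGION_MEM_MOVABLE };

/* Hypothetical stand-in for mmu_interval_notifier_insert(). */
static int notifier_insert(bool should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

/* Old flow: a failed insert silently downgraded the region to pinned. */
static int old_select_type(bool insert_fails, enum region_type *type)
{
	assert(type != NULL);
	if (notifier_insert(insert_fails))
		*type = REGION_MEM_PINNED;
	else
		*type = REGION_MEM_MOVABLE;
	return 0;
}

/* New flow: the error propagates and *type is left untouched. */
static int new_select_type(bool insert_fails, enum region_type *type)
{
	int ret;

	assert(type != NULL);
	ret = notifier_insert(insert_fails);
	if (ret)
		return ret;
	*type = REGION_MEM_MOVABLE;
	return 0;
}
```

In the patch itself, the error is returned from mshv_region_create() as
an ERR_PTR() value, and mshv_partition_create_region() passes it back to
its caller instead of creating a pinned region.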

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   81 ++++++++++++++++++++++++++++---------------
 drivers/hv/mshv_root.h      |   14 +++----
 drivers/hv/mshv_root_main.c |   61 +++++++++++++-------------------
 3 files changed, 83 insertions(+), 73 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 6b703b269a4f..a85d18e2c279 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,6 +20,8 @@
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
 #define MSHV_INVALID_PFN				ULONG_MAX
 
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
+
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
  * @page      : The page to check for huge page backing
@@ -241,16 +243,39 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags)
+struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
+					   u64 guest_pfn, u64 nr_pfns,
+					   u64 uaddr, u32 flags,
+					   ulong mmio_pfn)
 {
 	struct mshv_mem_region *region;
+	int ret = 0;
 	u64 i;
 
 	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
+	switch (type) {
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		ret = mmu_interval_notifier_insert(&region->mreg_mni,
+						   current->mm, uaddr,
+						   nr_pfns << HV_HYP_PAGE_SHIFT,
+						   &mshv_region_mni_ops);
+		break;
+	case MSHV_REGION_TYPE_MEM_PINNED:
+		break;
+	case MSHV_REGION_TYPE_MMIO:
+		region->mreg_mmio_pfn = mmio_pfn;
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret)
+		goto free_region;
+
+	region->mreg_type = type;
 	region->nr_pfns = nr_pfns;
 	region->start_gfn = guest_pfn;
 	region->start_uaddr = uaddr;
@@ -263,9 +288,14 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
 	for (i = 0; i < nr_pfns; i++)
 		region->mreg_pfns[i] = MSHV_INVALID_PFN;
 
+	mutex_init(&region->mreg_mutex);
 	kref_init(&region->mreg_refcount);
 
 	return region;
+
+free_region:
+	vfree(region);
+	return ERR_PTR(ret);
 }
 
 static int mshv_region_chunk_share(struct mshv_mem_region *region,
@@ -462,7 +492,7 @@ static void mshv_region_destroy(struct kref *ref)
 	int ret;
 
 	if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
-		mshv_region_movable_fini(region);
+		mmu_interval_notifier_remove(&region->mreg_mni);
 
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_region_share(region);
@@ -736,27 +766,6 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
 	.invalidate = mshv_region_interval_invalidate,
 };
 
-void mshv_region_movable_fini(struct mshv_mem_region *region)
-{
-	mmu_interval_notifier_remove(&region->mreg_mni);
-}
-
-bool mshv_region_movable_init(struct mshv_mem_region *region)
-{
-	int ret;
-
-	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
-					   region->start_uaddr,
-					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
-					   &mshv_region_mni_ops);
-	if (ret)
-		return false;
-
-	mutex_init(&region->mreg_mutex);
-
-	return true;
-}
-
 /**
  * mshv_map_pinned_region - Pin and map memory regions
  * @region: Pointer to the memory region structure
@@ -770,7 +779,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
  *
  * Return: 0 on success, negative error code on failure.
  */
-int mshv_map_pinned_region(struct mshv_mem_region *region)
+static int mshv_map_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
@@ -826,17 +835,31 @@ int mshv_map_pinned_region(struct mshv_mem_region *region)
 	return ret;
 }
 
-int mshv_map_movable_region(struct mshv_mem_region *region)
+static int mshv_map_movable_region(struct mshv_mem_region *region)
 {
 	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
 					   false);
 }
 
-int mshv_map_mmio_region(struct mshv_mem_region *region,
-			 unsigned long mmio_pfn)
+static int mshv_map_mmio_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 
 	return hv_call_map_mmio_pfns(partition->pt_id, region->start_gfn,
-				     mmio_pfn, region->nr_pfns);
+				     region->mreg_mmio_pfn,
+				     region->nr_pfns);
+}
+
+int mshv_map_region(struct mshv_mem_region *region)
+{
+	switch (region->mreg_type) {
+	case MSHV_REGION_TYPE_MEM_PINNED:
+		return mshv_map_pinned_region(region);
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		return mshv_map_movable_region(region);
+	case MSHV_REGION_TYPE_MMIO:
+		return mshv_map_mmio_region(region);
+	}
+
+	return -EINVAL;
 }
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 1f92b9f85b60..2bcdfa070517 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -92,6 +92,7 @@ struct mshv_mem_region {
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
 	struct mutex mreg_mutex;	/* protects region PFNs remapping */
+	u64 mreg_mmio_pfn;
 	unsigned long mreg_pfns[];
 };
 
@@ -366,16 +367,13 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
-					   u64 uaddr, u32 flags);
+struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
+					   u64 guest_pfn, u64 nr_pfns,
+					   u64 uaddr, u32 flags,
+					   ulong mmio_pfn);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
-void mshv_region_movable_fini(struct mshv_mem_region *region);
-bool mshv_region_movable_init(struct mshv_mem_region *region);
-int mshv_map_pinned_region(struct mshv_mem_region *region);
-int mshv_map_movable_region(struct mshv_mem_region *region);
-int mshv_map_mmio_region(struct mshv_mem_region *region,
-			 unsigned long mmio_pfn);
+int mshv_map_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index adb09350205a..3bfa9e9c575f 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1217,11 +1217,14 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
  */
 static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_user_mem_region *mem,
-					struct mshv_mem_region **regionpp,
-					bool is_mmio)
+					struct mshv_mem_region **regionpp)
 {
 	struct mshv_mem_region *rg;
+	enum mshv_region_type type;
 	u64 nr_pfns = HVPFN_DOWN(mem->size);
+	struct vm_area_struct *vma;
+	ulong mmio_pfn;
+	bool is_mmio;
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
@@ -1234,18 +1237,27 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	}
 	spin_unlock(&partition->pt_mem_regions_lock);
 
-	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
-				mem->userspace_addr, mem->flags);
-	if (IS_ERR(rg))
-		return PTR_ERR(rg);
+	mmap_read_lock(current->mm);
+	vma = vma_lookup(current->mm, mem->userspace_addr);
+	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
+	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
+	mmap_read_unlock(current->mm);
+
+	if (!vma)
+		return -EINVAL;
 
 	if (is_mmio)
-		rg->mreg_type = MSHV_REGION_TYPE_MMIO;
-	else if (mshv_partition_encrypted(partition) ||
-		 !mshv_region_movable_init(rg))
-		rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
+		type = MSHV_REGION_TYPE_MMIO;
+	else if (mshv_partition_encrypted(partition))
+		type = MSHV_REGION_TYPE_MEM_PINNED;
 	else
-		rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
+		type = MSHV_REGION_TYPE_MEM_MOVABLE;
+
+	rg = mshv_region_create(type, mem->guest_pfn, nr_pfns,
+				mem->userspace_addr, mem->flags,
+				mmio_pfn);
+	if (IS_ERR(rg))
+		return PTR_ERR(rg);
 
 	rg->partition = partition;
 
@@ -1271,40 +1283,17 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		     struct mshv_user_mem_region mem)
 {
 	struct mshv_mem_region *region;
-	struct vm_area_struct *vma;
-	bool is_mmio;
-	ulong mmio_pfn;
 	long ret;
 
 	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
 	    !access_ok((const void __user *)mem.userspace_addr, mem.size))
 		return -EINVAL;
 
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, mem.userspace_addr);
-	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
-	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
-	mmap_read_unlock(current->mm);
-
-	if (!vma)
-		return -EINVAL;
-
-	ret = mshv_partition_create_region(partition, &mem, &region,
-					   is_mmio);
+	ret = mshv_partition_create_region(partition, &mem, &region);
 	if (ret)
 		return ret;
 
-	switch (region->mreg_type) {
-	case MSHV_REGION_TYPE_MEM_PINNED:
-		ret = mshv_map_pinned_region(region);
-		break;
-	case MSHV_REGION_TYPE_MEM_MOVABLE:
-		ret = mshv_map_movable_region(region);
-		break;
-	case MSHV_REGION_TYPE_MMIO:
-		ret = mshv_map_mmio_region(region, mmio_pfn);
-		break;
-	}
+	ret = mshv_map_region(region);
 
 	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
 				   region->start_gfn, region->nr_pfns,



^ permalink raw reply related

* [PATCH v3 0/7] mshv: Reduce memory consumption for unpinned regions
From: Stanislav Kinsburskii @ 2026-04-09 15:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series reduces memory consumption for unpinned regions by avoiding
PFN array allocation. A 1GB unpinned region currently wastes 2MB for an
unused PFN array that HMM-managed regions don't need.
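The 2MB figure follows from simple arithmetic. It assumes 4 KiB Hyper-V pages and one 8-byte `unsigned long` per PFN on a 64-bit kernel, matching the `unsigned long mreg_pfns[]` flexible array in the header diff above: 1 GiB / 4 KiB = 262,144 PFNs, and 262,144 x 8 bytes = 2 MiB. The small helper below reproduces the calculation. Its name and macros are illustrative only and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 4 KiB Hyper-V pages, 8-byte PFN entries (64-bit kernel). */
#define SKETCH_HV_PAGE_SIZE   4096ULL
#define SKETCH_PFN_ENTRY_SIZE 8ULL

/* Bytes needed for a per-page PFN array covering region_bytes. */
static uint64_t pfn_array_bytes(uint64_t region_bytes)
{
	/* Regions are expected to be a whole number of pages. */
	assert(region_bytes % SKETCH_HV_PAGE_SIZE == 0);
	return (region_bytes / SKETCH_HV_PAGE_SIZE) * SKETCH_PFN_ENTRY_SIZE;
}
```

For example, pfn_array_bytes(1ULL << 30) returns 2097152, which is 2 MiB.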

The first three patches are preparatory refactoring. Patch 1 consolidates
region creation and mapping logic, reducing API surface by 4 functions.
Patch 2 introduces a typedef for PFN handler callbacks to simplify
function signatures. Patch 3 renames mshv_mem_region to mshv_region to
align with existing function naming conventions.

Patch 4 optimizes unmap and no-access remap operations by eliminating
redundant PFN iteration when all frames are guaranteed to be mapped.
This uses large page flags for aligned chunks and removes unnecessary
helper functions.
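The batching in patch 4 can be pictured as splitting a PFN range into three parts. The first is a head of base pages up to the first large-page boundary, which is the part the v3 changelog refers to as "pages before the first huge page". The second is a run of aligned large pages, where a flag such as HV_UNMAP_GPA_LARGE_PAGE would apply. The third is a tail of leftover base pages. This is a user-space sketch, not the driver code. It makes three assumptions: 4 KiB base pages, 2 MiB large pages (512 base pages each), and large-page eligibility decided purely by guest PFN alignment. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MiB large pages over 4 KiB base pages. */
#define PAGES_PER_LARGE_PAGE 512ULL

struct pfn_split {
	uint64_t head_count;   /* base pages before the first aligned boundary */
	uint64_t large_start;  /* first PFN of the aligned large-page run */
	uint64_t large_count;  /* base pages covered by whole large pages */
	uint64_t tail_start;   /* first PFN after the large-page run */
	uint64_t tail_count;   /* base pages after the large-page run */
};

static struct pfn_split split_range(uint64_t start, uint64_t count)
{
	struct pfn_split s = {0};
	uint64_t end = start + count;
	/* Round start up and end down to large-page boundaries. */
	uint64_t aligned_start = (start + PAGES_PER_LARGE_PAGE - 1) &
				 ~(PAGES_PER_LARGE_PAGE - 1);
	uint64_t aligned_end = end & ~(PAGES_PER_LARGE_PAGE - 1);

	if (aligned_start >= aligned_end) {
		/* No whole large page fits: process everything as base pages. */
		s.head_count = count;
		s.large_start = end;
		s.tail_start = end;
		return s;
	}
	s.head_count = aligned_start - start;
	s.large_start = aligned_start;
	s.large_count = aligned_end - aligned_start;
	s.tail_start = aligned_end;
	s.tail_count = end - aligned_end;
	assert(s.head_count + s.large_count + s.tail_count == count);
	return s;
}
```

Forgetting the head segment and starting directly at the first aligned boundary would leave PFNs 100..511 untouched in a range starting at PFN 100. That is the class of bug the v3 note describes.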

Patches 5-6 decouple PFN processing from the region->pfns storage.
Patch 5 threads the pfns pointer explicitly through the processing
chain. Patch 6 removes offset-based indexing by having callers pass
pre-offset pointers.
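Patch 6 replaces offset-based indexing with pre-offset pointers. Two toy handlers that compute the same result illustrate the difference. The names and the summing operation are hypothetical stand-ins for the driver's PFN handlers, whose exact signatures are not shown in this cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Before" style (hypothetical): the handler gets the whole array and
 * an offset, and every access must add the offset. */
static uint64_t sum_pfns_at_offset(const uint64_t *pfns, size_t offset,
				   size_t count)
{
	uint64_t sum = 0;

	assert(pfns != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		sum += pfns[offset + i];
	return sum;
}

/* "After" style (hypothetical): the caller passes pfns + offset, so the
 * handler indexes from zero and never sees the rest of the array. */
static uint64_t sum_pfns(const uint64_t *pfns, size_t count)
{
	uint64_t sum = 0;

	assert(pfns != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		sum += pfns[i];
	return sum;
}
```

Because sum_pfns(pfns + offset, count) equals the offset form, the handler no longer cares where the array lives. That is what lets the PFN storage change independently of the processing chain.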

Patch 7 converts the pfns array from a flexible array member to a
conditional pointer, allocated only for pinned regions that need it
for share/unshare/evict operations. This eliminates the memory waste
for unpinned regions and allows using kzalloc instead of vzalloc.
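Patch 7 changes the layout so that the struct holds a pointer instead of a flexible array member sized to the region. The pointer is populated only for pinned regions. The user-space sketch below uses calloc()/free() as stand-ins for the kernel allocators, and the struct and function names are hypothetical. Because the struct no longer embeds a per-page array, its size is small and fixed. That is the property that lets the struct allocation move to kzalloc. How the pinned-only array itself is allocated is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the region struct. */
struct region_sketch {
	uint64_t nr_pfns;
	bool pinned;
	uint64_t *pfns;   /* NULL for non-pinned (e.g. HMM-managed) regions */
};

static struct region_sketch *region_sketch_create(uint64_t nr_pfns, bool pinned)
{
	struct region_sketch *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->nr_pfns = nr_pfns;
	r->pinned = pinned;
	r->pfns = NULL;
	/* Only pinned regions pay for the per-page array. */
	if (pinned && nr_pfns > 0) {
		r->pfns = calloc(nr_pfns, sizeof(*r->pfns));
		if (!r->pfns) {
			free(r);
			return NULL;
		}
	}
	assert(r->pinned || r->pfns == NULL);
	return r;
}

static void region_sketch_destroy(struct region_sketch *r)
{
	if (!r)
		return;
	free(r->pfns);  /* free(NULL) is a no-op for unpinned regions */
	free(r);
}
```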

v3:
- Fix missing unmap/remap of pages before the first huge page.

v2:
- Improved commit message
- Fixed invalid vfree(region->mreg_pfns) call for MMIO-backed regions
- Fixed unpinning of already-released pages in the error path during
  pinned region creation
- Removed redundant mshv_map_region helper in favor of the new
  optimized mapping logic

---

Stanislav Kinsburskii (7):
      mshv: Consolidate region creation and mapping
      mshv: Improve code readability with handler function typedef
      mshv: Rename mshv_mem_region to mshv_region
      mshv: Optimize memory region mapping operations
      mshv: Pass pfns array explicitly through processing chain
      mshv: Simplify pfn array handling in region processing
      mshv: Allocate pfns array only for pinned regions

 drivers/hv/mshv_regions.c   |  372 ++++++++++++++++++++++++++-----------------
 drivers/hv/mshv_root.h      |   26 ++-
 drivers/hv/mshv_root_main.c |   79 ++++-----
 3 files changed, 271 insertions(+), 206 deletions(-)

^ permalink raw reply

* Re: [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO
From: Stanislav Kinsburskii @ 2026-04-09 14:41 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <c30ede65-46c4-02b1-756a-868f9a265cf1@linux.microsoft.com>

On Tue, Apr 07, 2026 at 10:41:12AM -0700, Mukesh R wrote:
> On 1/20/26 08:09, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:21PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > 
> > > Add a new file to implement VFIO-MSHV bridge pseudo device. These
> > > functions are called in the VFIO framework, and credits to kvm/vfio.c
> > > as this file was adapted from it.
> > > 
> > > Original author: Wei Liu <wei.liu@kernel.org>
> > > (Slightly modified from the original version).
> > > 
> > 
> > There is a Linux standard for giving credit when code is adapted from
> > elsewhere. This doesn't follow that standard. Please fix.
> > 
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > >   drivers/hv/Makefile    |   3 +-
> > >   drivers/hv/mshv_vfio.c | 210 +++++++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 212 insertions(+), 1 deletion(-)
> > >   create mode 100644 drivers/hv/mshv_vfio.c
> > > 
> > > diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> > > index a49f93c2d245..eae003c4cb8f 100644
> > > --- a/drivers/hv/Makefile
> > > +++ b/drivers/hv/Makefile
> > > @@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
> > >   hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
> > >   hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
> > >   mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
> > > -	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
> > > +	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
> > > +               mshv_vfio.o
> > >   mshv_vtl-y := mshv_vtl_main.o
> > >   # Code that must be built-in
> > > diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
> > > new file mode 100644
> > > index 000000000000..6ea4d99a3bd2
> > > --- /dev/null
> > > +++ b/drivers/hv/mshv_vfio.c
> > > @@ -0,0 +1,210 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * VFIO-MSHV bridge pseudo device
> > > + *
> > > + * Heavily inspired by the VFIO-KVM bridge pseudo device.
> > > + */
> > > +#include <linux/errno.h>
> > > +#include <linux/file.h>
> > > +#include <linux/list.h>
> > > +#include <linux/module.h>
> > > +#include <linux/mutex.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/vfio.h>
> > > +
> > > +#include "mshv.h"
> > > +#include "mshv_root.h"
> > > +
> > > +struct mshv_vfio_file {
> > > +	struct list_head node;
> > > +	struct file *file;	/* list of struct mshv_vfio_file */
> > > +};
> > > +
> > > +struct mshv_vfio {
> > > +	struct list_head file_list;
> > > +	struct mutex lock;
> > > +};
> > > +
> > > +static bool mshv_vfio_file_is_valid(struct file *file)
> > > +{
> > > +	bool (*fn)(struct file *file);
> > > +	bool ret;
> > > +
> > > +	fn = symbol_get(vfio_file_is_valid);
> > > +	if (!fn)
> > > +		return false;
> > > +
> > > +	ret = fn(file);
> > > +
> > > +	symbol_put(vfio_file_is_valid);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
> > > +{
> > > +	struct mshv_vfio *mshv_vfio = mshvdev->device_private;
> > > +	struct mshv_vfio_file *mvf;
> > > +	struct file *filp;
> > > +	long ret = 0;
> > > +
> > > +	filp = fget(fd);
> > > +	if (!filp)
> > > +		return -EBADF;
> > > +
> > > +	/* Ensure the FD is a vfio FD. */
> > > +	if (!mshv_vfio_file_is_valid(filp)) {
> > > +		ret = -EINVAL;
> > > +		goto out_fput;
> > > +	}
> > > +
> > > +	mutex_lock(&mshv_vfio->lock);
> > > +
> > > +	list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
> > > +		if (mvf->file == filp) {
> > > +			ret = -EEXIST;
> > > +			goto out_unlock;
> > > +		}
> > > +	}
> > > +
> > > +	mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
> > > +	if (!mvf) {
> > > +		ret = -ENOMEM;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	mvf->file = get_file(filp);
> > > +	list_add_tail(&mvf->node, &mshv_vfio->file_list);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&mshv_vfio->lock);
> > > +out_fput:
> > > +	fput(filp);
> > > +	return ret;
> > > +}
> > > +
> > > +static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
> > > +{
> > > +	struct mshv_vfio *mshv_vfio = mshvdev->device_private;
> > > +	struct mshv_vfio_file *mvf;
> > > +	long ret;
> > > +
> > > +	CLASS(fd, f)(fd);
> > > +
> > > +	if (fd_empty(f))
> > > +		return -EBADF;
> > > +
> > > +	ret = -ENOENT;
> > > +	mutex_lock(&mshv_vfio->lock);
> > > +
> > > +	list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
> > > +		if (mvf->file != fd_file(f))
> > > +			continue;
> > > +
> > > +		list_del(&mvf->node);
> > > +		fput(mvf->file);
> > > +		kfree(mvf);
> > > +		ret = 0;
> > > +		break;
> > > +	}
> > > +
> > > +	mutex_unlock(&mshv_vfio->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
> > > +			      void __user *arg)
> > > +{
> > > +	int32_t __user *argp = arg;
> > > +	int32_t fd;
> > > +
> > > +	switch (attr) {
> > > +	case MSHV_DEV_VFIO_FILE_ADD:
> > > +		if (get_user(fd, argp))
> > > +			return -EFAULT;
> > > +		return mshv_vfio_file_add(mshvdev, fd);
> > > +
> > > +	case MSHV_DEV_VFIO_FILE_DEL:
> > > +		if (get_user(fd, argp))
> > > +			return -EFAULT;
> > > +		return mshv_vfio_file_del(mshvdev, fd);
> > > +	}
> > > +
> > > +	return -ENXIO;
> > > +}
> > > +
> > > +static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
> > > +			      struct mshv_device_attr *attr)
> > > +{
> > > +	switch (attr->group) {
> > > +	case MSHV_DEV_VFIO_FILE:
> > > +		return mshv_vfio_set_file(mshvdev, attr->attr,
> > > +					  u64_to_user_ptr(attr->addr));
> > > +	}
> > > +
> > > +	return -ENXIO;
> > > +}
> > > +
> > > +static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
> > > +			      struct mshv_device_attr *attr)
> > > +{
> > > +	switch (attr->group) {
> > > +	case MSHV_DEV_VFIO_FILE:
> > > +		switch (attr->attr) {
> > > +		case MSHV_DEV_VFIO_FILE_ADD:
> > > +		case MSHV_DEV_VFIO_FILE_DEL:
> > > +			return 0;
> > > +		}
> > > +
> > > +		break;
> > > +	}
> > > +
> > > +	return -ENXIO;
> > > +}
> > > +
> > > +static long mshv_vfio_create_device(struct mshv_device *mshvdev, u32 type)
> > > +{
> > > +	struct mshv_device *tmp;
> > > +	struct mshv_vfio *mshv_vfio;
> > > +
> > > +	/* Only one VFIO "device" per VM */
> > > +	hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
> > > +			     device_ptnode)
> > > +		if (tmp->device_ops == &mshv_vfio_device_ops)
> > > +			return -EBUSY;
> > > +
> > > +	mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
> > > +	if (mshv_vfio == NULL)
> > > +		return -ENOMEM;
> > > +
> > > +	INIT_LIST_HEAD(&mshv_vfio->file_list);
> > > +	mutex_init(&mshv_vfio->lock);
> > > +
> > > +	mshvdev->device_private = mshv_vfio;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* This is called from mshv_device_fop_release() */
> > > +static void mshv_vfio_release_device(struct mshv_device *mshvdev)
> > > +{
> > > +	struct mshv_vfio *mv = mshvdev->device_private;
> > > +	struct mshv_vfio_file *mvf, *tmp;
> > > +
> > > +	list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
> > > +		fput(mvf->file);
> > 
> > This put must be synchronous, as the device must be detached from the
> > domain before partition destruction is attempted.
> 
> Like I said in 6.6 PR, this does not attach or detach devices.
> 

You are mistaken. It absolutely does.

Thanks,
Stanislav

> > This was explicitly mentioned in the patch originated this code.
> > Please fix, add a comment and credits to the commit message.
> 
> That was the ".destroy" hook, which is gone.
> 
> Thanks,
> -Mukesh
> 
> 
> > Thanks,
> > Stanislav
> > 
> > 
> > > +		list_del(&mvf->node);
> > > +		kfree(mvf);
> > > +	}
> > > +
> > > +	kfree(mv);
> > > +	kfree(mshvdev);
> > > +}
> > > +
> > > +struct mshv_device_ops mshv_vfio_device_ops = {
> > > +	.device_name = "mshv-vfio",
> > > +	.device_create = mshv_vfio_create_device,
> > > +	.device_release = mshv_vfio_release_device,
> > > +	.device_set_attr = mshv_vfio_set_attr,
> > > +	.device_has_attr = mshv_vfio_has_attr,
> > > +};
> > > -- 
> > > 2.51.2.vfs.0.1
> > > 
> 

^ permalink raw reply

* [PATCH v3] tools: hv: Fix cross-compilation
From: Aditya Garg @ 2026-04-09 10:32 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, gregkh, ssengar,
	linux-hyperv, linux-kernel, avladu, vdso, gargaditya, gargaditya
  Cc: Roman Kisel

Use the native ARCH only if ARCH is not already set. This allows
cross-compilation when ARCH is set explicitly.

Additionally, simplify the ARCH check to build the fcopy daemon only
for x86 and x86_64.

Fixes: 82b0945ce2c2 ("tools: hv: Add new fcopy application based on uio driver")
Reported-by: Adrian Vladu <avladu@cloudbasesolutions.com>
Closes: https://lore.kernel.org/linux-hyperv/PR3PR09MB54119DB2FD76977C62D8DD6AB04D2@PR3PR09MB5411.eurprd09.prod.outlook.com/
Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
---
Changes since v2:
    - Handle the normalized ARCH=x86 value from the top-level kernel Makefile

Changes since v1:
    - Dropped the info target printing CC, LD and ARCH

v2: https://lore.kernel.org/all/20260407122040.249733-1-gargaditya@linux.microsoft.com/
v1: https://lore.kernel.org/all/1733992114-7305-1-git-send-email-ssengar@linux.microsoft.com/
---
 tools/hv/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/hv/Makefile b/tools/hv/Makefile
index 34ffcec264ab..016753f3dd7f 100644
--- a/tools/hv/Makefile
+++ b/tools/hv/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Hyper-V tools
 include ../scripts/Makefile.include
 
-ARCH := $(shell uname -m 2>/dev/null)
+ARCH ?= $(shell uname -m 2>/dev/null)
 sbindir ?= /usr/sbin
 libexecdir ?= /usr/libexec
 sharedstatedir ?= /var/lib
@@ -20,7 +20,7 @@ override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
 override CFLAGS += -Wno-address-of-packed-member
 
 ALL_TARGETS := hv_kvp_daemon hv_vss_daemon
-ifneq ($(ARCH), aarch64)
+ifneq ($(filter x86_64 x86,$(ARCH)),)
 ALL_TARGETS += hv_fcopy_uio_daemon
 endif
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
-- 
2.43.0
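The make semantics behind this patch work as follows. `?=` assigns only if ARCH is not already defined, for example on the make command line or in the environment. `$(filter x86_64 x86,$(ARCH))` is non-empty only when ARCH is one of those two words. The C sketch below models the old and new fcopy conditions as predicates on a single-word ARCH value. It models the make logic only and is not part of the patch; the function names are hypothetical. It shows why, once ARCH can come from outside, the kernel's `arm64` spelling would still have passed the old `aarch64`-only check.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Old condition: ifneq ($(ARCH), aarch64)
 * Build fcopy unless ARCH is literally "aarch64". */
static bool old_builds_fcopy(const char *arch)
{
	assert(arch != NULL);
	return strcmp(arch, "aarch64") != 0;
}

/* New condition: ifneq ($(filter x86_64 x86,$(ARCH)),)
 * Build fcopy only when ARCH is exactly "x86_64" or "x86". */
static bool new_builds_fcopy(const char *arch)
{
	assert(arch != NULL);
	return strcmp(arch, "x86_64") == 0 || strcmp(arch, "x86") == 0;
}
```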


^ permalink raw reply related

* Re: [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Tianyu Lan @ 2026-04-09  2:05 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: kys, haiyangz, wei.liu, decui, longli, James.Bottomley,
	martin.petersen, apais, Tianyu Lan, linux-hyperv, linux-kernel,
	linux-scsi, vdso, mhklinux
In-Reply-To: <2a80b7a6-2cfe-4bd0-a799-ff855df7bd41@linux.microsoft.com>

On Thu, Apr 9, 2026 at 12:55 AM Easwar Hariharan
<easwar.hariharan@linux.microsoft.com> wrote:
>
> On 4/8/2026 12:31 AM, Tianyu Lan wrote:
> > Hyper-V provides Confidential VMBus to communicate between
> > device model and device guest driver via encrypted/private
> > memory in Confidential VM. The device model is in OpenHCL
> > (https://openvmm.dev/guide/user_guide/openhcl.html) that
> > plays the paravisor role.
> >
> > For a VMBus device, there are two communication methods to
> > talk with Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic
> > DMA transfer.
> >
> > The Confidential VMBus ring buffer has been upstreamed by
> > Roman Kisel (commit 6802d8af47d1).
> >
> > Dynamic DMA transfers of a VMBus device normally go
> > through the DMA core, which uses SWIOTLB as a bounce buffer
> > in a CoCo VM.
> >
> > The Confidential VMBus device can do DMA directly to
> > private/encrypted memory. Because the swiotlb is decrypted
> > memory, the DMA transfer must not be bounced through the
> > swiotlb, so as to preserve confidentiality. This is different
> > from the default for Linux CoCo VMs, so do not use the DMA
> > (SWIOTLB) API in the VMBus driver when the confidential dynamic
> > DMA transfer capability is present.
> >
> > Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> > ---
> >  drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
> >  include/linux/hyperv.h     |  1 +
> >  2 files changed, 22 insertions(+), 7 deletions(-)
> >
>
> Does netvsc not need this same sort of patch?
>

Hi Easwar:
     Thanks for your review. AFAIK, storvsc supports the capability.
We may add a similar change to the netvsc driver later, once netvsc
also supports confidential external memory.

-- 
Thanks
Tianyu Lan

^ permalink raw reply

* Re: [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Martin K. Petersen @ 2026-04-09  1:52 UTC (permalink / raw)
  To: Li Tian
  Cc: linux-scsi, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, James E.J. Bottomley, Martin K. Petersen, linux-hyperv,
	linux-kernel
In-Reply-To: <20260406015344.12566-1-litian@redhat.com>


Li,

> The storvsc driver has become stricter in handling SRB status codes
> returned by the Hyper-V host. When using Virtual Fibre Channel (vFC)
> passthrough, the host may return SRB_STATUS_DATA_OVERRUN for
> PERSISTENT_RESERVE_IN commands if the allocation length in the CDB
> does not match the host's expected response size.

Applied to 7.1/scsi-staging, thanks!

-- 
Martin K. Petersen

^ permalink raw reply

* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Easwar Hariharan @ 2026-04-08 20:20 UTC (permalink / raw)
  To: Michael Kelley
  Cc: easwar.hariharan, Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157098A14BE63FCA8C0A70ED480A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 1/11/2026 9:36 AM, Michael Kelley wrote:
> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
>>
>> On 1/8/2026 10:46 AM, Michael Kelley wrote:
>>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>>>>
>>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>
>>>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
>>>> child partitions. This ID will also be required for future hypercalls
>>>> used by the Hyper-V IOMMU driver.
>>>>
>>>> Refactor the logic for building this logical device ID into a standalone
>>>> helper function and export the interface for wider use.
>>>>
>>>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
>>>> ---
>>>>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
>>>>  include/asm-generic/mshyperv.h      |  2 ++
>>>>  2 files changed, 22 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>>>> index 146b43981b27..4b82e06b5d93 100644
>>>> --- a/drivers/pci/controller/pci-hyperv.c
>>>> +++ b/drivers/pci/controller/pci-hyperv.c
>>>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>>>
>>>>  #define hv_msi_prepare		pci_msi_prepare
>>>>
>>>> +/**
>>>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
>>>> + * function number of the device.
>>>> + */
>>>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
>>>> +{
>>>> +	struct pci_bus *pbus = pdev->bus;
>>>> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
>>>> +						struct hv_pcibus_device, sysdata);
>>>> +
>>>> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
>>>> +		     (hbus->hdev->dev_instance.b[4] << 16) |
>>>> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
>>>> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
>>>> +		     PCI_FUNC(pdev->devfn));
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
>>>
>>> While this change is fine for hv_irq_retarget_interrupt(), it doesn't help the
>>> new IOMMU driver because pci-hyperv.c can (and often is) built as a module.
>>> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
>>> use this symbol in that case -- you'll get a link error on vmlinux when building
>>> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
>>> require that the VMBus driver not be built as a module, so I don't think that's
>>> the right solution.
>>>
>>> This is a messy problem. The new IOMMU driver needs to start with a generic
>>> "struct device" for the PCI device, and somehow find the corresponding VMBus
>>> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
>>> about ways to do this that don't depend on code and data structures that are
>>> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
>>
>> Thank you, Michael. FWIW, I did try to pull the device ID components out of
>> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
>> but it was just too messy as you say.
> 
> Yes, the current approach for getting the device ID wanders through struct
> hv_pcibus_device (which is private to the pci-hyperv driver), and through
> struct hv_device (which is a VMBus data structure). That makes the linkage
> between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
> substantial, which is not good.

Hi Michael,

I missed this, or made a mental note to follow up but forgot. Either way, Yu reminded
me about this email chain and I started looking at it this week.

> 
> But here's an idea for an alternate approach. The PV IOMMU driver doesn't
> have to generate the logical device ID on-the-fly by going to the dev_instance
> field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
> device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
> driver to access. The logical device ID doesn't change while Linux is running, so
> stashing another copy somewhere isn't a problem.

In exploring this and consulting with Dexuan, I realized that one of the components of
the logical device ID, the PCI function number, is set only in pci_scan_device(), well into
pci_scan_root_bus_bridge(), which you call out as the point by which the communication must
have occurred.

But then, Dexuan also pointed me to hv_pci_assign_slots() with its call to wslot_to_devfn(), and I'm
honestly confused about how these two interact. With the current approach, it looks like whatever
devfn pci_scan_device() set is the correct function number to use for the logical device
ID, in which case, the best I can do with your suggested approach below is to inform the
pvIOMMU driver of the GUID, rather than the logical device ID itself.

Perhaps with your history, you can clarify the interaction, and/or share your thoughts
on the above?
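The user-space model below mirrors the expression in the quoted hv_build_logical_dev_id() and makes the dependency concrete. Bytes 4 to 7 of the bus instance GUID fill bits 31..3. The low three bits come from PCI_FUNC(devfn), so the ID cannot be computed from the GUID alone. The array and parameter names stand in for dev_instance.b[] and the function number. The sketch widens each byte to uint32_t before shifting. A plain u8 operand is promoted to int, and shifting a value of 0x80 or more left by 24 does not fit in a signed int.

```c
#include <assert.h>
#include <stdint.h>

/* guid[] stands in for dev_instance.b[]; func for PCI_FUNC(devfn). */
static uint64_t logical_dev_id(const uint8_t guid[16], uint8_t func)
{
	assert(func < 8);  /* PCI function numbers are 3 bits wide */

	uint32_t id = ((uint32_t)guid[5] << 24) |  /* bits 31..24 */
		      ((uint32_t)guid[4] << 16) |  /* bits 23..16 */
		      ((uint32_t)guid[7] << 8)  |  /* bits 15..8 */
		      ((uint32_t)guid[6] & 0xf8) | /* bits 7..3 */
		      func;                        /* bits 2..0 */
	return id;
}
```

For example, with b[4..7] = 0x22, 0x11, 0x4f, 0x33 and function 2, the ID is 0x1122334a.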

> 
> So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
> a PCI domain ID and the related logical device ID. The PV IOMMU driver is
> responsible for storing this data in a form that it can later search. hv_pci_probe()
> calls this new function when it instantiates a new PCI pass-thru device. Then when
> the IOMMU driver needs to attach a new device, it can get the PCI domain ID
> from the struct pci_dev (or struct pci_bus), search for the related logical device
> ID in its own data structure, and use it. The pci-hyperv driver has a dependency
> on the IOMMU driver, but that's a dependency in the desired direction. The
> PCI domain ID and logical device ID are just integers, so no data structures are
> shared.

In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
of the GUID being used to create the domain number. I thought this approach could
help with that too, but as I coded it up, I realized that using the domain number 
(not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
is the wrong way around. It is unfortunately the only available key in the pci_dev
handed to the pvIOMMU driver in this approach though...

Do you think that's a fatal flaw?
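Here is a minimal sketch of the lookup table Michael proposes: the pci-hyperv side registers an ID for a PCI domain number, and the IOMMU side looks it up later. The names, the fixed-size array, and the absence of locking are all simplifying assumptions, not a proposed implementation. The insert-or-replace path shows the concern above. Two different devices that end up with the same domain number share one slot, and the second registration silently overwrites the first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ENTRIES 64  /* hypothetical fixed capacity */

struct dev_id_entry {
	int domain;
	uint64_t logical_id;
	bool used;
};

static struct dev_id_entry registry[SKETCH_MAX_ENTRIES];

/* Insert, or replace an existing entry for the same domain. */
static bool registry_set(int domain, uint64_t logical_id)
{
	struct dev_id_entry *free_slot = NULL;

	for (size_t i = 0; i < SKETCH_MAX_ENTRIES; i++) {
		if (registry[i].used && registry[i].domain == domain) {
			registry[i].logical_id = logical_id;  /* overwrite */
			return true;
		}
		if (!registry[i].used && !free_slot)
			free_slot = &registry[i];
	}
	if (!free_slot)
		return false;
	free_slot->domain = domain;
	free_slot->logical_id = logical_id;
	free_slot->used = true;
	return true;
}

static bool registry_get(int domain, uint64_t *logical_id)
{
	assert(logical_id != NULL);
	for (size_t i = 0; i < SKETCH_MAX_ENTRIES; i++) {
		if (registry[i].used && registry[i].domain == domain) {
			*logical_id = registry[i].logical_id;
			return true;
		}
	}
	return false;
}
```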

> 
> Note that the pci-hyperv must inform the PV IOMMU driver of the logical
> device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge().
> The latter function eventually invokes hv_iommu_attach_dev(), which will
> need the logical device ID. See example stack trace. [1]
> 
> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
> remove the information if a PCI pass-thru device is unbound or removed, as
> the logical device ID will be the same if the device ever comes back. At worst,
> the IOMMU driver can simply replace an existing logical device ID if a new one
> is provided for the same PCI domain ID.

As above, replacing a unique GUID whenever a lookup on a non-unique key
succeeds may fail if the device that came "back" is not in fact the same
device (or class of device) that went away, but just happens to have the
same domain number, either because bytes 4 and 5 are identical or because
of a collision in pci_domain_nr_dynamic_ida.

Thanks,
Easwar (he/him)

> 
> An include file must provide a stub for the new function if
> CONFIG_HYPERV_PVIOMMU is not defined, so that the pci-hyperv driver still
> builds and works.
> 
> I haven't coded this up, but it seems like it should be pretty clean.
> 
> Michael
> 
> [1] Example stack trace, starting with vmbus_add_channel_work() as a
> result of Hyper-V offering the PCI pass-thru device to the guest.
> hv_pci_probe() runs, and ends up in the generic Linux code for adding
> a PCI device, which in turn sets up the IOMMU.
> 
> [    1.731786]  hv_iommu_attach_dev+0xf0/0x1d0
> [    1.731788]  __iommu_attach_device+0x21/0xb0
> [    1.731790]  __iommu_device_set_domain+0x65/0xd0
> [    1.731792]  __iommu_group_set_domain_internal+0x61/0x120
> [    1.731795]  iommu_setup_default_domain+0x3a4/0x530
> [    1.731796]  __iommu_probe_device.part.0+0x15d/0x1d0
> [    1.731798]  iommu_probe_device+0x81/0xb0
> [    1.731799]  iommu_bus_notifier+0x2c/0x80
> [    1.731800]  notifier_call_chain+0x66/0xe0
> [    1.731802]  blocking_notifier_call_chain+0x47/0x70
> [    1.731804]  bus_notify+0x3b/0x50
> [    1.731805]  device_add+0x631/0x850
> [    1.731807]  pci_device_add+0x2db/0x670
> [    1.731809]  pci_scan_single_device+0xc3/0x100
> [    1.731810]  pci_scan_slot+0x97/0x230
> [    1.731812]  pci_scan_child_bus_extend+0x3b/0x2f0
> [    1.731814]  pci_scan_root_bus_bridge+0xc0/0xf0
> [    1.731816]  hv_pci_probe+0x398/0x5f0
> [    1.731817]  vmbus_probe+0x42/0xa0
> [    1.731819]  really_probe+0xe5/0x3e0
> [    1.731822]  __driver_probe_device+0x7e/0x170
> [    1.731823]  driver_probe_device+0x23/0xa0
> [    1.731824]  __device_attach_driver+0x92/0x130
> [    1.731826]  bus_for_each_drv+0x8c/0xe0
> [    1.731828]  __device_attach+0xc0/0x200
> [    1.731830]  device_initial_probe+0x4c/0x50
> [    1.731831]  bus_probe_device+0x32/0x90
> [    1.731832]  device_add+0x65b/0x850
> [    1.731836]  device_register+0x1f/0x30
> [    1.731837]  vmbus_device_register+0x87/0x130
> [    1.731840]  vmbus_add_channel_work+0x139/0x1a0
> [    1.731841]  process_one_work+0x19f/0x3f0
> [    1.731843]  worker_thread+0x188/0x2f0
> [    1.731845]  kthread+0x119/0x230
> [    1.731852]  ret_from_fork+0x1b4/0x1e0
> [    1.731854]  ret_from_fork_asm+0x1a/0x30
> 
>>

^ permalink raw reply

* Re: [EXTERNAL] [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Laurence Oberman @ 2026-04-08 18:06 UTC (permalink / raw)
  To: Long Li, Li Tian, linux-scsi@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB6683ABEAC8B490387658B7C7CE5AA@SA1PR21MB6683.namprd21.prod.outlook.com>

On Tue, 2026-04-07 at 22:30 +0000, Long Li wrote:
> 
> 
> > -----Original Message-----
> > From: Li Tian <litian@redhat.com>
> > Sent: Sunday, April 5, 2026 6:54 PM
> > To: linux-scsi@vger.kernel.org
> > Cc: Li Tian <litian@redhat.com>; KY Srinivasan <kys@microsoft.com>;
> > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>;
> > Dexuan Cui <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> > James E.J. Bottomley <James.Bottomley@HansenPartnership.com>;
> > Martin K. Petersen <martin.petersen@oracle.com>;
> > linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: [EXTERNAL] [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN
> > truncation for Hyper-V vFC
> > 
> > The storvsc driver has become stricter in handling SRB status codes
> > returned by the Hyper-V host. When using Virtual Fibre Channel (vFC)
> > passthrough, the host may return SRB_STATUS_DATA_OVERRUN for
> > PERSISTENT_RESERVE_IN commands if the allocation length in the CDB
> > does not match the host's expected response size.
> > 
> > Currently, this status is treated as a fatal error, propagating
> > Host_status=0x07 [DID_ERROR] to the SCSI mid-layer. This causes
> > userspace storage utilities (such as sg_persist) to fail with
> > transport errors, even when the host has actually returned the
> > requested reservation data in the buffer.
> > 
> > Refactor the existing command-specific workarounds into a new helper
> > function, storvsc_host_mishandles_cmd(), and add PERSISTENT_RESERVE_IN
> > to the list of commands where SRB status errors should be suppressed
> > for vFC devices. This ensures that the SCSI mid-layer processes the
> > returned data buffer instead of terminating the command.
> > 
> > Signed-off-by: Li Tian <litian@redhat.com>
> 
> Reviewed-by: Long Li <longli@microsoft.com>
> 
> 
> > ---
> >  drivers/scsi/storvsc_drv.c | 32 +++++++++++++++++++++-----------
> >  1 file changed, 21 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index ae1abab97835..6977ca8a0658 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1131,6 +1131,26 @@ static void storvsc_command_completion(struct storvsc_cmd_request *cmd_request,
> >  		kfree(payload);
> >  }
> > 
> > +/*
> > + * The current SCSI handling on the host side does not correctly handle:
> > + * INQUIRY with page code 0x80, MODE_SENSE / MODE_SENSE_10 with cmd[2] == 0x1c,
> > + * and (for FC) MAINTENANCE_IN / PERSISTENT_RESERVE_IN passthrough.
> > + */
> > +static bool storvsc_host_mishandles_cmd(u8 opcode, struct hv_device *device)
> > +{
> > +	switch (opcode) {
> > +	case INQUIRY:
> > +	case MODE_SENSE:
> > +	case MODE_SENSE_10:
> > +		return true;
> > +	case MAINTENANCE_IN:
> > +	case PERSISTENT_RESERVE_IN:
> > +		return hv_dev_is_fc(device);
> > +	default:
> > +		return false;
> > +	}
> > +}
> > +
> >  static void storvsc_on_io_completion(struct storvsc_device
> > *stor_device,
> >  				  struct vstor_packet
> > *vstor_packet,
> >  				  struct storvsc_cmd_request
> > *request) @@ -
> > 1141,22 +1161,12 @@ static void storvsc_on_io_completion(struct
> > storvsc_device *stor_device,
> >  	stor_pkt = &request->vstor_packet;
> > 
> >  	/*
> > -	 * The current SCSI handling on the host side does
> > -	 * not correctly handle:
> > -	 * INQUIRY command with page code parameter set to 0x80
> > -	 * MODE_SENSE and MODE_SENSE_10 command with cmd[2] ==
> > 0x1c
> > -	 * MAINTENANCE_IN is not supported by HyperV FC
> > passthrough
> > -	 *
> >  	 * Setup srb and scsi status so this won't be fatal.
> >  	 * We do this so we can distinguish truly fatal failues
> >  	 * (srb status == 0x4) and off-line the device in that
> > case.
> >  	 */
> > 
> > -	if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
> > -	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
> > -	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
> > -	   (stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
> > -	   hv_dev_is_fc(device))) {
> > +	if (storvsc_host_mishandles_cmd(stor_pkt->vm_srb.cdb[0],
> > device)) {
> >  		vstor_packet->vm_srb.scsi_status = 0;
> >  		vstor_packet->vm_srb.srb_status =
> > SRB_STATUS_SUCCESS;
> >  	}
> > --
> > 2.53.0
> 

Looks good. This reworks how it was done before, but achieves the same
behavior we wanted for the new PR addition.

Reviewed-by: Laurence Oberman <loberman@redhat.com>
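A user-space sketch of the helper's decision table may help when reviewing which opcodes get their SRB status suppressed. The opcode values are the standard SCSI ones, and the SRB status values are assumed to follow the Windows SRB encoding that storvsc mirrors (0x04 matches the "truly fatal" status in the comment above). `host_mishandles_cmd()`, `fixup_completion()`, `struct srb_status_demo` and the `is_fc` flag (standing in for `hv_dev_is_fc(device)`) are hypothetical names for illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI opcode values. */
enum {
	INQUIRY               = 0x12,
	MODE_SENSE            = 0x1a,
	READ_10               = 0x28,
	MODE_SENSE_10         = 0x5a,
	PERSISTENT_RESERVE_IN = 0x5e,
	MAINTENANCE_IN        = 0xa3,
};

/* Assumed SRB status values (Windows SRB encoding). */
#define SRB_STATUS_SUCCESS      0x01
#define SRB_STATUS_DATA_OVERRUN 0x12

/* Model of storvsc_host_mishandles_cmd(); is_fc replaces hv_dev_is_fc(). */
static bool host_mishandles_cmd(uint8_t opcode, bool is_fc)
{
	switch (opcode) {
	case INQUIRY:
	case MODE_SENSE:
	case MODE_SENSE_10:
		return true;            /* mishandled on every device type */
	case MAINTENANCE_IN:
	case PERSISTENT_RESERVE_IN:
		return is_fc;           /* only mishandled for FC passthrough */
	default:
		return false;
	}
}

/* Simplified completion status pair. */
struct srb_status_demo {
	uint8_t scsi_status;
	uint8_t srb_status;
};

/* Clear the error so the mid-layer consumes the returned data buffer. */
static void fixup_completion(struct srb_status_demo *s, uint8_t opcode,
			     bool is_fc)
{
	if (host_mishandles_cmd(opcode, is_fc)) {
		s->scsi_status = 0;
		s->srb_status = SRB_STATUS_SUCCESS;
	}
}
```

With this table, a PERSISTENT_RESERVE_IN that completes with SRB_STATUS_DATA_OVERRUN on a vFC device is turned into a success, while the same opcode on a non-FC device keeps its original status.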


^ permalink raw reply

* Re: [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Easwar Hariharan @ 2026-04-08 16:54 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: kys, haiyangz, wei.liu, decui, longli, James.Bottomley,
	martin.petersen, apais, easwar.hariharan, Tianyu Lan,
	linux-hyperv, linux-kernel, linux-scsi, vdso, mhklinux
In-Reply-To: <20260408073105.272255-1-tiala@microsoft.com>

On 4/8/2026 12:31 AM, Tianyu Lan wrote:
> Hyper-V provides Confidential VMBus to communicate between the
> device model and the guest device driver via encrypted/private
> memory in a Confidential VM. The device model is in OpenHCL
> (https://openvmm.dev/guide/user_guide/openhcl.html), which
> plays the paravisor role.
> 
> A VMBus device has two methods to communicate with the
> host/hypervisor: 1) the VMBus ring buffer and 2) dynamic
> DMA transfers.
> 
> The Confidential VMBus ring buffer has been upstreamed by
> Roman Kisel (commit 6802d8af47d1).
> 
> Dynamic DMA transfers of a VMBus device normally go through
> the DMA core, which uses SWIOTLB as a bounce buffer in
> a CoCo VM.
> 
> The Confidential VMBus device can do DMA directly to
> private/encrypted memory. Because the swiotlb is decrypted
> memory, the DMA transfer must not be bounced through the
> swiotlb, so as to preserve confidentiality. This is different
> from the default for Linux CoCo VMs, so do not use the DMA
> (SWIOTLB) API in the VMBus driver when the confidential
> dynamic DMA transfer capability is present.
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
>  include/linux/hyperv.h     |  1 +
>  2 files changed, 22 insertions(+), 7 deletions(-)
> 

Does netvsc not need this same sort of patch?

Thanks,
Easwar (he/him)
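Easwar's question comes down to which path a dynamic DMA transfer takes. The following is a minimal user-space model of the decision described in the quoted commit message. Every name in it (`enum dma_path`, `struct demo_vmbus_dev`, `pick_dma_path()`) is hypothetical, and the patch's actual capability check is not shown in this excerpt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for where a dynamic DMA transfer ends up. */
enum dma_path {
	DMA_PATH_DIRECT,   /* DMA API, typically no bouncing (non-CoCo VM) */
	DMA_PATH_SWIOTLB,  /* DMA API bounces through decrypted SWIOTLB */
	DMA_PATH_PRIVATE,  /* bypass the DMA API; device uses private memory */
};

/* Hypothetical device state; not struct hv_device or vmbus_channel. */
struct demo_vmbus_dev {
	bool coco_vm;          /* guest is a confidential VM */
	bool confidential_dma; /* confidential dynamic DMA capability present */
};

static enum dma_path pick_dma_path(const struct demo_vmbus_dev *dev)
{
	if (!dev->coco_vm)
		return DMA_PATH_DIRECT;
	/*
	 * SWIOTLB memory is decrypted (shared with the host), so bouncing
	 * would expose the data. Use private memory when the capability
	 * is present, otherwise keep the default CoCo bounce path.
	 */
	if (dev->confidential_dma)
		return DMA_PATH_PRIVATE;
	return DMA_PATH_SWIOTLB;
}
```

In this model, any VMBus driver that performs dynamic DMA, and not only storvsc, would need the same branch. That is the substance of the netvsc question above.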



^ permalink raw reply

* Re: [PATCH 2/8] firmware: efi: Never declare sysfb_primary_display on x86
From: Thomas Zimmermann @ 2026-04-08 14:07 UTC (permalink / raw)
  To: Ard Biesheuvel, Javier Martinez Canillas, Arnd Bergmann,
	Ilias Apalodimas, Huacai Chen, WANG Xuerui, maarten.lankhorst,
	mripard, David Airlie, Simona Vetter, kys, haiyangz, Wei Liu,
	decui, Long Li, Helge Deller
  Cc: linux-arm-kernel, loongarch, linux-efi, linux-riscv, dri-devel,
	linux-hyperv, linux-fbdev, kernel test robot
In-Reply-To: <d0624a61-b96b-4b2f-89c2-029e8671039d@app.fastmail.com>

Hi

Am 08.04.26 um 15:45 schrieb Ard Biesheuvel:
> Hi Thomas,
>
> On Thu, 2 Apr 2026, at 11:09, Thomas Zimmermann wrote:
>> The x86 architecture comes with its own instance of the global
>> state variable sysfb_primary_display. Never declare it in the EFI
>> subsystem. Fix the test for CONFIG_FIRMWARE_EDID accordingly.
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>> Fixes: e65ca1646311 ("efi: export sysfb_primary_display for EDID")
>> Cc: kernel test robot <lkp@intel.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Thomas Zimmermann <tzimmermann@suse.de>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
>> Cc: linux-efi@vger.kernel.org
>> ---
>>   drivers/firmware/efi/efi-init.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
> Should this be sent out as a fix?

Yes, please.



-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-08 13:53 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69215C164B06109C6682984EBF5BA@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Wednesday, April 8, 2026 2:24 AM
> 
> > From: Michael Kelley <mhklinux@outlook.com>
> > Sent: Sunday, April 5, 2026 4:15 PM
> > > ...
> > > Note: we still need to figure out how to address the possible MMIO
> > > conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
> > > MMIO BARs, but that's of low priority because all PCI devices available
> > > to a Linux VM on Azure or on a modern host should use 64-bit BARs and
> > > should not use 32-bit BARs -- I checked Mellanox VFs, MANA VFs, NVMe
> > > devices, and GPUs in Linux VMs on Azure, and found no 32-bit BARs.
> >
> > Just to clarify, since this patch is predicated on all BARs being 64-bit,
> > hv_pci_alloc_bridge_windows() never encounters a non-zero
> > hbus->low_mmio_space, and hence also never allocates from low
> > MMIO space. So hv_pci_alloc_bridge_windows() does not need to be
> > patched. Is that correct?
> 
> Correct. For 32-bit BARs (if any), IMO we can't really do anything for
> them in hv_pci_allocate_bridge_windows(), since they must reside
> below 4GB.
> 
> Note: while the patch doesn't fix the MMIO conflict if there are any
> 32-bit BARs, the patch doesn't make things worse for 32-bit BARs (if any).

OK, right. Your patch doesn't prevent 32-bit BARs from working. It
just doesn't fix any potential frame buffer conflicts with 32-bit BARs.
I misinterpreted the situation.

> 
> > Taking a broader view, fundamentally the current MMIO location of
> > the frame buffer may be unknown to the Linux guest. At the same time,
> > Linux must ensure that PCI devices don't get assigned to the MMIO space
> > where the frame buffer is located. While the current MMIO location of
> > the frame buffer may be unknown, we can assume it was placed in low
> > MMIO space by the host -- either Windows Hyper-V or Linux/VMM
> > in the root partition, and perhaps as mediated by a paravisor. Probably
> > need to confirm with the Linux-in-the-root partition team (and maybe
> > the OpenHCL team) that this assumption is true.
> 
> IMO this is a good idea! It looks like the framebuffer base always starts
> at the beginning of the low MMIO space. We can reserve some
> MMIO for the framebuffer at the beginning of the low MMIO space.
> 
> > Presumably the
> > hyperv_drm driver doesn't need to move the frame buffer, but if it
> > does, it must stay in the low MMIO space.
> 
> It looks like this assumption is true.
> 
> > This patch depends on this assumption, and effectively reserves
> > the entire low MMIO space for the frame buffer.
> 
> To make it precise, the patch reserves the entire low MMIO space for
> the frame buffer and the 32-bit BARs (if any), and there is no MMIO
> conflict in the first kernel (assuming hyperv_drm doesn't relocate the
> MMIO range), and there can be an MMIO conflict in the
> kdump/kexec kernel if there is any 32-bit BAR.
> 
> > The low MMIO space
> > size defaults to 128 MiB on a local Hyper-V,
> Yes, by default, the low MMIO base =0xf800_0000, size=128MB,
> but the range [0xfed4_0000, 0xffff_ffff], whose size is 18.75MB,
> is reserved for vTPM: see vmbus_walk_resources(). So by default
> the available low MMIO size for hyperv_drm is 128 - 18.75 =
> 109.25 MB.
> 
> The size of the framebuffer should be aligned to 2MB, so if the
> framebuffer size is bigger than 108MB, it looks like there is not
> enough MMIO space in the low MMIO range, e.g. with the below
> command:
> Set-VMVideo -VMName vm_name -HorizontalResolution 7680 -VerticalResolution 4320 -ResolutionType Maximum
> the resulting max framebuffer size is
> 7680 * 4320 * 32/8 /1024.0/1024 = 126.5625 MB, which would be
> rounded up to 128MB.
> 
> However, according to my testing, with the above command,
> the low MMIO base = 0xf000_0000, size=256MB, so it's probably
> ok to reserve 128 MB for the frame buffer.
> 
> In case the low MMIO size is <=64MB, we would want to reserve
> less MMIO for the frame buffer.
> 
> > and is set to 3 GiB in most
> > Azure VMs (or to 1 GiB in an Azure CVM), so that all gets reserved.
> >
> > A slightly different approach to the whole problem is to change
> > vmbus_reserve_fb(). If it is unable to get a non-zero "start" value, then
> > it should use the same assumption as above, and reserve a frame buffer
> > area starting at the lowest address in low MMIO space. The reserved size
> > could be the max possible frame buffer size, which I think is 64 MiB (?).
> 
> It can be 128MB with the highest resolution 7680*4320 (I hope the
> highest resolution won't become bigger in the future).

Indeed!

> 
> > This still leaves low MMIO space for subsequent PCI devices, and allows
> > 32-bit BARs to continue to work. This approach requires one further
> > assumption, which is that the host, plus any movement by hyperv_drm,
> > has kept the frame buffer at the low end of the low MMIO space. From
> > what I've seen, that assumption is reality -- the frame buffer always
> > starts at the beginning of low MMIO space.
> >
> > This approach could be taken one step further, where vmbus_reserve_fb()
> > *always* reserves 64 MiB starting at the low end of low MMIO space,
> > regardless of the value of "start". The messy code for getting "start"
> > could be dropped entirely, and the dependency on CONFIG_SYSFB goes
> > away. Or maybe still get the value of "start" and "size", and if non-zero
> > just do a sanity check that they are within the fixed 64 MiB reserved area.
> >
> > Thoughts? To me tweaking vmbus_reserve_fb() is a more
> > straightforward and explicit way to do the reserving, vs. modifying
> > the requested range in the Hyper-V PCI driver.
> 
> Agreed. Let me try to make a new patch for review.
> 
> > And FWIW, it avoids introducing the 32-bit BAR limitation.
> 
> This patch addresses the MMIO conflict for 64-bit BARs and not for
> 32-bit BARs (if any). The patch does not introduce the 32-bit BAR limitation.

Right.  I misinterpreted the problem you mentioned about 32-bit BARs.

Michael
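The fixed-window variant of vmbus_reserve_fb() discussed above could be sketched in user space as follows. This illustrates the idea under stated assumptions and is not the patch Dexuan offered to write. The 128 MiB size follows the 7680x4320 figure in the thread, clamping to the low MMIO size is an assumed policy (the thread only says a smaller reservation may be wanted when low MMIO is small), and `struct mmio_range`, `reserve_fb_window()` and `fb_inside_window()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_1M (1ULL << 20)

/* Assumed reservation size: enough for 7680x4320 at 32 bpp, 2 MB aligned. */
#define FB_RESERVE_SIZE (128 * SZ_1M)

struct mmio_range {
	uint64_t start;
	uint64_t size;
};

/*
 * Reserve a fixed window at the bottom of low MMIO, where the frame
 * buffer has been observed to start. Clamping to the low MMIO size is
 * an assumption made for this sketch.
 */
static struct mmio_range reserve_fb_window(struct mmio_range low_mmio)
{
	struct mmio_range r = { low_mmio.start, FB_RESERVE_SIZE };

	if (r.size > low_mmio.size)
		r.size = low_mmio.size;
	return r;
}

/*
 * Optional sanity check: if a non-zero frame buffer start/size was
 * reported, verify that it lies entirely inside the reserved window.
 */
static bool fb_inside_window(struct mmio_range win, uint64_t fb_start,
			     uint64_t fb_size)
{
	if (fb_start == 0 || fb_size == 0)
		return true; /* nothing reported, nothing to check */
	return fb_start >= win.start && fb_size <= win.size &&
	       fb_start - win.start <= win.size - fb_size;
}
```

Because the reservation no longer depends on the reported "start", the CONFIG_SYSFB lookup becomes optional. It would only feed the sanity check.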

^ permalink raw reply

* Re: [PATCH 2/8] firmware: efi: Never declare sysfb_primary_display on x86
From: Ard Biesheuvel @ 2026-04-08 13:45 UTC (permalink / raw)
  To: Thomas Zimmermann, Javier Martinez Canillas, Arnd Bergmann,
	Ilias Apalodimas, Huacai Chen, WANG Xuerui, maarten.lankhorst,
	mripard, David Airlie, Simona Vetter, kys, haiyangz, Wei Liu,
	decui, Long Li, Helge Deller
  Cc: linux-arm-kernel, loongarch, linux-efi, linux-riscv, dri-devel,
	linux-hyperv, linux-fbdev, kernel test robot
In-Reply-To: <20260402092305.208728-3-tzimmermann@suse.de>

Hi Thomas,

On Thu, 2 Apr 2026, at 11:09, Thomas Zimmermann wrote:
> The x86 architecture comes with its own instance of the global
> state variable sysfb_primary_display. Never declare it in the EFI
> subsystem. Fix the test for CONFIG_FIRMWARE_EDID accordingly.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: e65ca1646311 ("efi: export sysfb_primary_display for EDID")
> Cc: kernel test robot <lkp@intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Cc: linux-efi@vger.kernel.org
> ---
>  drivers/firmware/efi/efi-init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

Should this be sent out as a fix?

^ permalink raw reply

* Re: [PATCH v2] tools: hv: Fix cross-compilation
From: Aditya Garg @ 2026-04-08 12:36 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, gregkh, ssengar,
	linux-hyperv, linux-kernel, romank, avladu, vdso, gargaditya
In-Reply-To: <20260407122040.249733-1-gargaditya@linux.microsoft.com>

On 07-04-2026 17:50, Aditya Garg wrote:
> Use the native ARCH only if it is not already set; this allows
> cross-compilation where ARCH is explicitly set.
> 
> Additionally, simplify the check for ARCH so that the fcopy daemon is
> built only for x86_64.
> 
> Fixes: 82b0945ce2c2 ("tools: hv: Add new fcopy application based on uio driver")
> Reported-by: Adrian Vladu <avladu@cloudbasesolutions.com>
> Closes: https://lore.kernel.org/linux-hyperv/PR3PR09MB54119DB2FD76977C62D8DD6AB04D2@PR3PR09MB5411.eurprd09.prod.outlook.com/
> Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
> Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
> ---
> Changes since v1:
>      - Dropped the info target printing CC, LD and ARCH
> 
> v1: https://lore.kernel.org/all/1733992114-7305-1-git-send-email-ssengar@linux.microsoft.com/
> ---
>   tools/hv/Makefile | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/hv/Makefile b/tools/hv/Makefile
> index 34ffcec264ab..e377caf89fb6 100644
> --- a/tools/hv/Makefile
> +++ b/tools/hv/Makefile
> @@ -2,7 +2,7 @@
>   # Makefile for Hyper-V tools
>   include ../scripts/Makefile.include
>   
> -ARCH := $(shell uname -m 2>/dev/null)
> +ARCH ?= $(shell uname -m 2>/dev/null)
>   sbindir ?= /usr/sbin
>   libexecdir ?= /usr/libexec
>   sharedstatedir ?= /var/lib
> @@ -20,7 +20,7 @@ override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
>   override CFLAGS += -Wno-address-of-packed-member
>   
>   ALL_TARGETS := hv_kvp_daemon hv_vss_daemon
> -ifneq ($(ARCH), aarch64)
> +ifeq ($(ARCH), x86_64)
>   ALL_TARGETS += hv_fcopy_uio_daemon
>   endif
>   ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))

Sashiko AI review flagged an issue; I tested it and confirmed it.
When building via make tools/hv from the top-level kernel directory,
scripts/subarch.include normalizes x86_64 to x86, and since ARCH is
exported, the ?= assignment in tools/hv/Makefile preserves the
normalized value, causing ifeq ($(ARCH), x86_64) to be false and
hv_fcopy_uio_daemon to be silently excluded.

I'll change this to include x86 as well in v3.

Regards,
Aditya

^ permalink raw reply

* Re: [PATCH net-next 0/4] net: mana: Fix probe/remove error path bugs
From: Erni Sri Satya Vennela @ 2026-04-08 10:17 UTC (permalink / raw)
  To: Mohsin Bashir
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <62c3fcc1-6758-47c4-984e-f6940139de0c@gmail.com>

On Fri, Apr 03, 2026 at 11:28:34PM -0700, Mohsin Bashir wrote:
> 
> 
> On 4/3/26 12:09 AM, Erni Sri Satya Vennela wrote:
> > Fix four pre-existing bugs in mana_probe()/mana_remove() error handling
> > that can cause warnings on uninitialized work structs, masked errors,
> > and resource leaks when early probe steps fail.
> > 
> > Patches 1-2 move work struct initialization (link_change_work and
> > gf_stats_work) to before any error path that could trigger
> > mana_remove(), preventing WARN_ON in __flush_work() or debug object
> > warnings when sync cancellation runs on uninitialized work structs.
> > 
> > Patch 3 prevents add_adev() from overwriting a port probe error,
> > which could leave the driver in a broken state with NULL ports while
> > reporting success.
> > 
> > Patch 4 changes 'goto out' to 'break' in mana_remove()'s port loop
> > so that mana_destroy_eq() is always reached, preventing EQ leaks when
> > a NULL port is encountered.
> > 
> > Erni Sri Satya Vennela (4):
> >    net: mana: Init link_change_work before potential error paths in probe
> >    net: mana: Init gf_stats_work before potential error paths in probe
> >    net: mana: Don't overwrite port probe error with add_adev result
> >    net: mana: Fix EQ leak in mana_remove on NULL port
> > 
> >   drivers/net/ethernet/microsoft/mana/mana_en.c | 28 +++++++++----------
> >   1 file changed, 14 insertions(+), 14 deletions(-)
> > 
> I believe mana is already in the mainline so fixes go to the net tree?

Thanks for the correction Mohsin.
I'll make this change in the next version.

- Vennela
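The point of patch 4 is that `break` still reaches the EQ teardown, while `goto out` skips it. A small user-space model illustrates this. `struct demo_port`, `struct demo_ctx`, `demo_remove()` and `demo_destroy_eq()` are hypothetical stand-ins rather than the mana driver's code, and the loop shape follows only the cover letter's description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real mana structures. */
struct demo_port {
	int torn_down;
};

struct demo_ctx {
	struct demo_port *ports[4];
	size_t num_ports;
	int eq_destroyed;
};

static void demo_destroy_eq(struct demo_ctx *ac)
{
	ac->eq_destroyed = 1;
}

/*
 * On a NULL port, 'break' leaves the loop but still falls through to
 * the EQ teardown; jumping to a label past it would leak the EQs.
 */
static void demo_remove(struct demo_ctx *ac)
{
	for (size_t i = 0; i < ac->num_ports; i++) {
		struct demo_port *port = ac->ports[i];

		if (!port)
			break;
		port->torn_down = 1;
	}

	demo_destroy_eq(ac);
}
```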

^ permalink raw reply

* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08  9:24 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Long Li, lpieralisi@kernel.org, kwilczynski@kernel.org,
	mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
	Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SN6PR02MB415794E53D2B621F6A8BA382D45CA@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:15 PM
> > ...
> > Note: we still need to figure out how to address the possible MMIO
> > conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
> > MMIO BARs, but that's of low priority because all PCI devices available
> > to a Linux VM on Azure or on a modern host should use 64-bit BARs and
> > should not use 32-bit BARs -- I checked Mellanox VFs, MANA VFs, NVMe
> > devices, and GPUs in Linux VMs on Azure, and found no 32-bit BARs.
> 
> Just to clarify, since this patch is predicated on all BARs being 64-bit,
> hv_pci_alloc_bridge_windows() never encounters a non-zero
> hbus->low_mmio_space, and hence also never allocates from low
> MMIO space. So hv_pci_alloc_bridge_windows() does not need to be
> patched. Is that correct?

Correct. For 32-bit BARs (if any), IMO we can't really do anything for
them in hv_pci_allocate_bridge_windows(), since they must reside
below 4GB.

Note: while the patch doesn't fix the MMIO conflict if there are any
32-bit BARs, the patch doesn't make things worse for 32-bit BARs (if any).

> Taking a broader view, fundamentally the current MMIO location of
> the frame buffer may be unknown to the Linux guest. At the same time,
> Linux must ensure that PCI devices don't get assigned to the MMIO space
> where the frame buffer is located. While the current MMIO location of
> the frame buffer may be unknown, we can assume it was placed in low
> MMIO space by the host -- either Windows Hyper-V or Linux/VMM
> in the root partition, and perhaps as mediated by a paravisor. Probably
> need to confirm with the Linux-in-the-root partition team (and maybe
> the OpenHCL team) that this assumption is true. 

IMO this is a good idea! It looks like the framebuffer base always starts
at the beginning of the low MMIO space. We can reserve some
MMIO for the framebuffer at the beginning of the low MMIO space.

> Presumably the
> hyperv_drm driver doesn't need to move the frame buffer, but if it
> does, it must stay in the low MMIO space.

It looks like this assumption is true.

> This patch depends on this assumption, and effectively reserves
> the entire low MMIO space for the frame buffer. 

To make it precise, the patch reserves the entire low MMIO space for
the frame buffer and the 32-bit BARs (if any), and there is no MMIO
conflict in the first kernel (assuming hyperv_drm doesn't relocate the
MMIO range), and there can be an MMIO conflict in the
kdump/kexec kernel if there is any 32-bit BAR.

> The low MMIO space
> size defaults to 128 MiB on a local Hyper-V, 
Yes, by default, the low MMIO base =0xf800_0000, size=128MB, 
but the range [0xfed4_0000, 0xffff_ffff], whose size is 18.75MB,
is reserved for vTPM: see vmbus_walk_resources(). So by default
the available low MMIO size for hyperv_drm is 128 - 18.75 = 
109.25 MB.

The size of the framebuffer should be aligned to 2MB, so if the
framebuffer size is bigger than 108MB, it looks like there is not
enough MMIO space in the low MMIO range, e.g. with the below
command:
Set-VMVideo -VMName vm_name -HorizontalResolution 7680 -VerticalResolution 4320 -ResolutionType Maximum
the resulting max framebuffer size is
7680 * 4320 * 32/8 /1024.0/1024 = 126.5625 MB, which would be
rounded up to 128MB.

However, according to my testing, with the above command,
the low MMIO base = 0xf000_0000, size=256MB, so it's probably
ok to reserve 128 MB for the frame buffer. 

In case the low MMIO size is <=64MB, we would want to reserve
less MMIO for the frame buffer.

> and is set to 3 GiB in most
> Azure VMs (or to 1 GiB in an Azure CVM), so that all gets reserved.
> 
> A slightly different approach to the whole problem is to change
> vmbus_reserve_fb(). If it is unable to get a non-zero "start" value, then
> it should use the same assumption as above, and reserve a frame buffer
> area starting at the lowest address in low MMIO space. The reserved size
> could be the max possible frame buffer size, which I think is 64 MiB (?).

It can be 128MB with the highest resolution 7680*4320 (I hope the
highest resolution won't become bigger in the future).

> This still leaves low MMIO space for subsequent PCI devices, and allows
> 32-bit BARs to continue to work. This approach requires one further
> assumption, which is that the host, plus any movement by hyperv_drm,
> has kept the frame buffer at the low end of the low MMIO space. From
> what I've seen, that assumption is reality -- the frame buffer always
> starts at the beginning of low MMIO space.
> 
> This approach could be taken one step further, where vmbus_reserve_fb()
> *always* reserves 64 MiB starting at the low end of low MMIO space,
> regardless of the value of "start". The messy code for getting "start"
> could be dropped entirely, and the dependency on CONFIG_SYSFB goes
> away. Or maybe still get the value of "start" and "size", and if non-zero
> just do a sanity check that they are within the fixed 64 MiB reserved area.
> 
> Thoughts? To me tweaking vmbus_reserve_fb() is a more
> straightforward and explicit way to do the reserving, vs. modifying
> the requested range in the Hyper-V PCI driver. 

Agreed. Let me try to make a new patch for review.

> And FWIW, it avoids introducing the 32-bit BAR limitation.

This patch addresses the MMIO conflict for 64-bit BARs and not for
32-bit BARs (if any). The patch does not introduce the 32-bit BAR limitation.

Thanks,
-- Dexuan
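The MMIO arithmetic above can be checked with a few helpers. The constants come from the message: a default low MMIO base of 0xf800_0000, the vTPM range starting at 0xfed4_0000, 2 MB alignment, and 7680x4320 at 32 bpp. The helper names are hypothetical, and the code only reproduces the numbers. It does not model vmbus_walk_resources().

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M (1ULL << 20)
#define SZ_2M (2 * SZ_1M)
#define SZ_4G (1ULL << 32)

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Bytes for a width x height frame buffer at bpp bits per pixel. */
static uint64_t fb_bytes(uint64_t width, uint64_t height, uint64_t bpp)
{
	return width * height * bpp / 8;
}

/*
 * Low MMIO usable by the frame buffer: the window [base, 4 GiB) minus
 * the vTPM range [vtpm_base, 4 GiB), which is simply vtpm_base - base.
 */
static uint64_t low_mmio_avail(uint64_t base, uint64_t vtpm_base)
{
	return vtpm_base - base;
}
```

For example, `round_up_pow2(fb_bytes(7680, 4320, 32), SZ_2M)` yields 128 MiB. That exceeds the 109.25 MiB left in the default 128 MiB window but fits in the 256 MiB window observed with the larger resolution setting.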

^ permalink raw reply

* [PATCH net-next v6] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-04-08  8:15 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	shirazsaleem, yury.norov, kees, ssengar, ernis, dipayanroy,
	gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma

Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.

The debugfs directory for each PCI device is named using pci_name()
(the unique BDF address), and its creation and removal is integrated
into mana_gd_setup() and mana_gd_cleanup_device() respectively, so
that all callers (probe, remove, suspend, resume, shutdown) share a
single code path.

Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
  - num_msix_usable, max_num_queues: Max resources from hardware
  - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
  - num_vports, bm_hostmode: Device configuration

Per-vPort entries (under /sys/kernel/debug/mana/<BDF>/vportN/):
  - port_handle: Hardware vPort handle
  - max_sq, max_rq: Max queues from vPort config
  - indir_table_sz: Indirection table size
  - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
    Last applied steering configuration parameters

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
This patch depends on the following fixes submitted to net:
  - "net: mana: Use pci_name() for debugfs directory naming"
  - "net: mana: Move current_speed debugfs file to mana_init_port()"
Conflict resolution may be needed when net merges into net-next.
---
Changes in v6:
* Move out of patchset and create a separate patch.
Changes in v5:
* Update commit message.
* Fix conflicts to align with the new patches.
* Make it part of patchset.
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
  because "ac" gets freed before debugfs_remove_recursive, to avoid
  Use-After-Free error.
* Add "goto out" in mana_cfg_vport_steering to avoid populating apc
  values when resp.hdr.status is non-zero.
Changes in v2:
* Add debugfs_remove_recursive for gc->mana_pci_debugfs in
  mana_gd_suspend to avoid creating duplicate entries in the
  mana_gd_setup and mana_gd_resume paths.
* Move debugfs creation for num_vports and bm_hostmode out of
  if(!resuming) condition since we have to create it again even for
  resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 59 ++++++++++---------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 33 +++++++++++
 include/net/mana/gdma.h                       |  1 +
 include/net/mana/mana.h                       |  8 +++
 4 files changed, 74 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..7a99db9afa03 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -194,6 +194,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues > gc->num_msix_usable - 1)
 		gc->max_num_queues = gc->num_msix_usable - 1;
 
+	debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+			   &gc->num_msix_usable);
+	debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+			   &gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1264,6 +1269,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 		return err ? err : -EPROTO;
 	}
 	gc->pf_cap_flags1 = resp.pf_cap_flags1;
+	gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+	debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+			   &gc->gdma_protocol_ver);
+	debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+			   &gc->pf_cap_flags1);
+
 	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
 		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
 		if (err) {
@@ -1943,15 +1955,20 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	int err;
 
+	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+						  mana_debugfs_root);
+
 	err = mana_gd_init_registers(pdev);
 	if (err)
-		return err;
+		goto remove_debugfs;
 
 	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
 
 	gc->service_wq = alloc_ordered_workqueue("gdma_service_wq", 0);
-	if (!gc->service_wq)
-		return -ENOMEM;
+	if (!gc->service_wq) {
+		err = -ENOMEM;
+		goto remove_debugfs;
+	}
 
 	err = mana_gd_setup_hwc_irqs(pdev);
 	if (err) {
@@ -1992,11 +2009,14 @@ static int mana_gd_setup(struct pci_dev *pdev)
 free_workqueue:
 	destroy_workqueue(gc->service_wq);
 	gc->service_wq = NULL;
+remove_debugfs:
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
 	dev_err(&pdev->dev, "%s failed (error %d)\n", __func__, err);
 	return err;
 }
 
-static void mana_gd_cleanup(struct pci_dev *pdev)
+static void mana_gd_cleanup_device(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 
@@ -2008,6 +2028,10 @@ static void mana_gd_cleanup(struct pci_dev *pdev)
 		destroy_workqueue(gc->service_wq);
 		gc->service_wq = NULL;
 	}
+
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
+
 	dev_dbg(&pdev->dev, "mana gdma cleanup successful\n");
 }
 
@@ -2065,9 +2089,6 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
-						  mana_debugfs_root);
-
 	err = mana_gd_setup(pdev);
 	if (err)
 		goto unmap_bar;
@@ -2096,16 +2117,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 cleanup_mana:
 	mana_remove(&gc->mana, false);
 cleanup_gd:
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 unmap_bar:
-	/*
-	 * at this point we know that the other debugfs child dir/files
-	 * are either not yet created or are already cleaned up.
-	 * The pci debugfs folder clean-up now, will only be cleaning up
-	 * adapter-MTU file and apc->mana_pci_debugfs folder.
-	 */
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-	gc->mana_pci_debugfs = NULL;
 	xa_destroy(&gc->irq_contexts);
 	pci_iounmap(pdev, bar0_va);
 free_gc:
@@ -2155,11 +2168,7 @@ static void mana_gd_remove(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, false);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	xa_destroy(&gc->irq_contexts);
 
@@ -2181,7 +2190,7 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 
 	return 0;
 }
@@ -2220,11 +2229,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 6302432b9bf6..e7c627e3379a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1276,6 +1276,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 	apc->port_handle = resp.vport;
 	ether_addr_copy(apc->mac_addr, resp.mac_addr);
 
+	apc->vport_max_sq = *max_sq;
+	apc->vport_max_rq = *max_rq;
+
 	return 0;
 }
 
@@ -1430,6 +1433,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
+
+	apc->steer_rx = rx;
+	apc->steer_rss = apc->rss_state;
+	apc->steer_update_tab = update_tab;
+	apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
 out:
 	kfree(req);
 	return err;
@@ -3154,6 +3162,23 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+			   &apc->port_handle);
+	debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_sq);
+	debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_rq);
+	debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+			   &apc->indir_table_sz);
+	debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rx);
+	debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rss);
+	debugfs_create_u32("steer_update_tab", 0400, apc->mana_port_debugfs,
+			   &apc->steer_update_tab);
+	debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+			   &apc->steer_cqe_coalescing);
 	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
 			   &apc->speed);
 	return 0;
@@ -3646,6 +3671,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	ac->bm_hostmode = bm_hostmode;
 
+	debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+			   &ac->num_ports);
+	debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+			  &ac->bm_hostmode);
+
 	if (!resuming) {
 		ac->num_ports = num_ports;
 
@@ -3786,6 +3816,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	mana_gd_deregister_device(gd);
 
+	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
+	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
+
 	if (suspending)
 		return;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 7fe3a1b61b2d..c4e3ce5147f7 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -442,6 +442,7 @@ struct gdma_context {
 	struct gdma_dev		mana_ib;
 
 	u64 pf_cap_flags1;
+	u64 gdma_protocol_ver;
 
 	struct workqueue_struct *service_wq;
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 96d21cbbdee2..6d2e05a7368c 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -568,6 +568,14 @@ struct mana_port_context {
 
 	/* Debugfs */
 	struct dentry *mana_port_debugfs;
+
+	/* Cached vport/steering config for debugfs */
+	u32 vport_max_sq;
+	u32 vport_max_rq;
+	u32 steer_rx;
+	u32 steer_rss;
+	u32 steer_update_tab;
+	u32 steer_cqe_coalescing;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
-- 
2.34.1


* [PATCH net 2/2] net: mana: Move current_speed debugfs file to mana_init_port()
From: Erni Sri Satya Vennela @ 2026-04-08  8:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260408081224.302308-1-ernis@linux.microsoft.com>

Move the current_speed debugfs file creation from mana_probe_port() to
mana_init_port(). The file was previously created only during initial
probe, but mana_cleanup_port_context() removes the entire vPort debugfs
directory during detach/attach cycles. Since mana_init_port() recreates
the directory on re-attach, moving current_speed here ensures it survives
these cycles.
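
The lifecycle problem can be illustrated with a small userspace model. This is a hypothetical sketch, not driver code: model_vport_dir, model_probe_port(), model_init_port(), model_cleanup_port() and model_detach_attach() are invented stand-ins that only track whether the vPort directory and its current_speed file exist. The sketch assumes, as described above, that probe goes through init and that a detach/attach cycle is a cleanup followed by an init.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one vPort debugfs directory and its current_speed file. */
struct model_vport_dir {
	bool dir_exists;
	bool has_current_speed;
};

/* Stands in for mana_cleanup_port_context(): the whole directory goes away,
 * taking every file in it along. */
static void model_cleanup_port(struct model_vport_dir *d)
{
	d->dir_exists = false;
	d->has_current_speed = false;
}

/* Stands in for mana_init_port(): always recreates the directory, and creates
 * current_speed only when 'fixed' (the placement this patch introduces). */
static void model_init_port(struct model_vport_dir *d, bool fixed)
{
	d->dir_exists = true;
	if (fixed)
		d->has_current_speed = true;
}

/* Stands in for mana_probe_port(): runs init and, before the fix, created
 * current_speed here, exactly once. */
static void model_probe_port(struct model_vport_dir *d, bool fixed)
{
	model_init_port(d, fixed);
	if (!fixed)
		d->has_current_speed = true;
}

/* A detach/attach cycle (e.g. an MTU or XDP change): cleanup, then init. */
static void model_detach_attach(struct model_vport_dir *d, bool fixed)
{
	assert(d->dir_exists); /* only an attached port can be detached */
	model_cleanup_port(d);
	model_init_port(d, fixed);
}
```

In the "before" model the file exists after probe but is gone after one detach/attach cycle; in the "after" model it is recreated together with the directory.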

Fixes: 75cabb46935b ("net: mana: Add support for net_shaper_ops")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 07630322545f..6302432b9bf6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3154,6 +3154,8 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
+			   &apc->speed);
 	return 0;
 
 reset_apc:
@@ -3432,8 +3434,6 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 
 	netif_carrier_on(ndev);
 
-	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs, &apc->speed);
-
 	return 0;
 
 free_indir:
-- 
2.34.1


* [PATCH net 1/2] net: mana: Use pci_name() for debugfs directory naming
From: Erni Sri Satya Vennela @ 2026-04-08  8:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260408081224.302308-1-ernis@linux.microsoft.com>

Use pci_name(pdev) for the per-device debugfs directory instead of
hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
previous approach had two issues:

1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
   in environments like generic VFIO passthrough or nested KVM,
   causing a NULL pointer dereference.

2. Multiple PFs would all use "0", and VFs across different PCI
   domains or buses could share the same slot name, leading to
   -EEXIST errors from debugfs_create_dir().

pci_name(pdev) returns the device's BDF address (domain:bus:device.function),
which is always valid and unique across the system.
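
The two naming schemes can be compared with a small userspace model. This is a hypothetical sketch: struct model_pci_dev and the model_* helpers are invented for illustration and are not kernel APIs. It assumes that pci_name() returns the device name the PCI core formats as "%04x:%02x:%02x.%d" (domain, bus, slot, function), and models a NULL pdev->slot as a NULL slot_name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the identifiers discussed above; not struct pci_dev. */
struct model_pci_dev {
	unsigned int domain;   /* PCI domain (segment) */
	unsigned int bus;      /* bus number */
	unsigned int dev;      /* PCI_SLOT(devfn) */
	unsigned int func;     /* PCI_FUNC(devfn) */
	const char *slot_name; /* NULL models a VF whose pdev->slot is NULL */
	bool is_pf;
};

/* Old scheme: "0" for every PF, the slot name for VFs. Returns NULL where
 * the kernel would have dereferenced a NULL pdev->slot. */
static const char *model_old_name(const struct model_pci_dev *d)
{
	if (d->is_pf)
		return "0";
	return d->slot_name;
}

/* New scheme: a BDF string in the layout the PCI core uses for device names. */
static void model_bdf_name(const struct model_pci_dev *d, char *buf, size_t len)
{
	assert(buf != NULL && len >= sizeof("0000:00:00.0"));
	snprintf(buf, len, "%04x:%02x:%02x.%u", d->domain, d->bus, d->dev, d->func);
}

/* Two names collide if debugfs_create_dir() would see the same string twice. */
static bool model_names_collide(const char *a, const char *b)
{
	return a != NULL && b != NULL && strcmp(a, b) == 0;
}
```

For example, two PFs on buses 01 and 02 both map to "0" under the old scheme, but to "0000:01:00.0" and "0000:02:00.0" under the BDF scheme.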

Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 43741cd35af8..098fbda0d128 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -2065,11 +2065,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	if (gc->is_pf)
-		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
-	else
-		gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
-							  mana_debugfs_root);
+	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+						  mana_debugfs_root);
 
 	err = mana_gd_setup(pdev);
 	if (err)
-- 
2.34.1


* [PATCH net 0/2] net: mana: Fix debugfs directory naming and file lifecycle
From: Erni Sri Satya Vennela @ 2026-04-08  8:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
	linux-kernel

This series fixes two pre-existing debugfs issues in the MANA driver.

Patch 1 fixes the per-device debugfs directory naming to use the unique
PCI BDF address via pci_name(), avoiding a potential NULL pointer
dereference when pdev->slot is NULL (e.g. VFIO passthrough, nested KVM)
and preventing name collisions across multiple PFs or VFs.

Patch 2 moves the current_speed debugfs file creation from
mana_probe_port() to mana_init_port() so it survives detach/attach
cycles triggered by MTU changes or XDP program changes.

Erni Sri Satya Vennela (2):
  net: mana: Use pci_name() for debugfs directory naming
  net: mana: Move current_speed debugfs file to mana_init_port()

 drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 ++-----
 drivers/net/ethernet/microsoft/mana/mana_en.c   | 4 ++--
 2 files changed, 4 insertions(+), 7 deletions(-)

-- 
2.34.1

* Re: [PATCH net-next v5 1/3] net: mana: Use pci_name() for debugfs directory naming
From: Erni Sri Satya Vennela @ 2026-04-08  8:12 UTC (permalink / raw)
  To: Simon Horman
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, shradhagupta, shirazsaleem,
	yury.norov, kees, ssengar, dipayanroy, gargaditya, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <20260404090514.GS113102@horms.kernel.org>

On Sat, Apr 04, 2026 at 10:05:14AM +0100, Simon Horman wrote:
> On Thu, Apr 02, 2026 at 11:26:55AM -0700, Erni Sri Satya Vennela wrote:
> > Use pci_name(pdev) for the per-device debugfs directory instead of
> > hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
> > previous approach had two issues:
> > 
> > 1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
> >    in environments like generic VFIO passthrough or nested KVM,
> >    causing a NULL pointer dereference.
> > 
> > 2. Multiple PFs would all use "0", and VFs across different PCI
> >    domains or buses could share the same slot name, leading to
> >    -EEXIST errors from debugfs_create_dir().
> > 
> > pci_name(pdev) returns the unique BDF address, is always valid, and
> > is unique across the system.
> > 
> > Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> 
> Hi Erni,
> 
> Possibly the code differs between net and net-next.
> But if this is fixing a bug in code present in net - as per the cited
> commit - then I think it should be a patch that targets net.
> With some strategy for merging that change into net-next
> if conflicts are expected.

Thank you for the clarification, Simon.
I will send a separate patch set targeting the net tree with the fixes.

- Vennela

* [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Tianyu Lan @ 2026-04-08  7:31 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, James.Bottomley,
	martin.petersen, apais
  Cc: Tianyu Lan, linux-hyperv, linux-kernel, linux-scsi, vdso,
	mhklinux

Hyper-V provides Confidential VMBus so that the device model and the
guest device driver can communicate through encrypted/private memory
in a Confidential VM. The device model runs in OpenHCL
(https://openvmm.dev/guide/user_guide/openhcl.html), which plays the
paravisor role.

A VMBus device has two ways to communicate with the host/hypervisor:
1) the VMBus ring buffer and 2) dynamic DMA transfers.

Confidential VMBus ring buffer support has already been upstreamed by
Roman Kisel (commit 6802d8af47d1).

Dynamic DMA transfers for a VMBus device normally go through the DMA
core, which uses SWIOTLB as a bounce buffer in a CoCo VM.

A Confidential VMBus device can DMA directly to private/encrypted
memory. Because the swiotlb is decrypted memory, the DMA transfer must
not be bounced through the swiotlb, so as to preserve confidentiality.
This differs from the default for Linux CoCo VMs, so do not use the
DMA (SWIOTLB) API in the VMBus device driver (storvsc in this patch)
when the confidential dynamic DMA transfer capability is present.
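
When co_external_memory is set, the patch computes the Hyper-V PFN range for each scatterlist entry from the page itself (sg_page(), sg->offset, sg->length) rather than from sg_dma_address()/sg_dma_len() after scsi_dma_map(). The userspace sketch below models only that arithmetic. The MODEL_ macros mirror HV_HYP_PAGE_SHIFT, HVPFN_UP() and HVPFN_DOWN(); model_sg_to_hvpfns() is a hypothetical helper, not storvsc code, and the sketch assumes 4 KiB guest pages as on x86.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of HV_HYP_PAGE_SHIFT, HVPFN_UP() and HVPFN_DOWN(). */
#define MODEL_HV_PAGE_SHIFT 12
#define MODEL_HV_PAGE_SIZE  (1UL << MODEL_HV_PAGE_SHIFT)
#define MODEL_HVPFN_UP(x)   (((x) + MODEL_HV_PAGE_SIZE - 1) >> MODEL_HV_PAGE_SHIFT)
#define MODEL_HVPFN_DOWN(x) ((x) >> MODEL_HV_PAGE_SHIFT)

/* Assumption: 4 KiB guest pages, so one guest page is one Hyper-V page. */
#define MODEL_GUEST_PAGE_SIZE 4096UL

struct model_hvpfn_range {
	uint64_t first; /* first Hyper-V PFN handed to the host */
	uint64_t count; /* number of consecutive Hyper-V PFNs */
};

/* guest_pfn: PFN of sg_page(sg); offset/length: sg->offset and sg->length. */
static struct model_hvpfn_range
model_sg_to_hvpfns(uint64_t guest_pfn, unsigned long offset, unsigned long length)
{
	struct model_hvpfn_range r;
	unsigned long hvpgoff = MODEL_HVPFN_DOWN(offset);

	assert(length > 0);
	/* Like page_to_hvpfn(): scale the guest PFN to Hyper-V page units. */
	r.first = guest_pfn * (MODEL_GUEST_PAGE_SIZE / MODEL_HV_PAGE_SIZE) + hvpgoff;
	/* Hyper-V pages touched by [offset, offset + length), minus skipped ones. */
	r.count = MODEL_HVPFN_UP(offset + length) - hvpgoff;
	return r;
}
```

For instance, a 4 KiB transfer starting 0x200 bytes into guest PFN 0x1000 spans two Hyper-V pages, so the model returns first = 0x1000 and count = 2.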

Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
 include/linux/hyperv.h     |  1 +
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index ae1abab97835..79b7611518b7 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1316,7 +1316,8 @@ static void storvsc_on_channel_callback(void *context)
 					continue;
 				}
 				request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
-				scsi_dma_unmap(scmnd);
+				if (!device->co_external_memory)
+					scsi_dma_unmap(scmnd);
 			}
 
 			storvsc_on_receive(stor_device, packet, request);
@@ -1339,6 +1340,8 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
 
 	device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
 	device->channel->next_request_id_callback = storvsc_next_request_id;
+	if (device->channel->co_external_memory)
+		device->co_external_memory = true;
 
 	ret = vmbus_open(device->channel,
 			 ring_size,
@@ -1805,7 +1808,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
 		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
 		struct scatterlist *sg;
-		unsigned long hvpfn, hvpfns_to_add;
+		unsigned long hvpfn, hvpfns_to_add, hvpgoff;
 		int j, i = 0, sg_count;
 
 		payload_sz = (hvpg_count * sizeof(u64) +
@@ -1821,7 +1824,11 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 		payload->range.len = length;
 		payload->range.offset = offset_in_hvpg;
 
-		sg_count = scsi_dma_map(scmnd);
+		if (dev->co_external_memory)
+			sg_count = scsi_sg_count(scmnd);
+		else
+			sg_count = scsi_dma_map(scmnd);
+
 		if (sg_count < 0) {
 			ret = SCSI_MLQUEUE_DEVICE_BUSY;
 			goto err_free_payload;
@@ -1836,9 +1843,16 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 			 * Such offsets are handled even on other than the first
 			 * sgl entry, provided they are a multiple of PAGE_SIZE.
 			 */
-			hvpfn = HVPFN_DOWN(sg_dma_address(sg));
-			hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
-						 sg_dma_len(sg)) - hvpfn;
+			if (dev->co_external_memory) {
+				hvpgoff = HVPFN_DOWN(sg->offset);
+				hvpfn = page_to_hvpfn(sg_page(sg)) + hvpgoff;
+				hvpfns_to_add =	HVPFN_UP(sg->offset + sg->length) -
+							hvpgoff;
+			} else {
+				hvpfn = HVPFN_DOWN(sg_dma_address(sg));
+				hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
+							 sg_dma_len(sg)) - hvpfn;
+			}
 
 			/*
 			 * Fill the next portion of the PFN array with
@@ -1860,7 +1874,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
 	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
 	migrate_enable();
 
-	if (ret)
+	if (ret && (!dev->co_external_memory))
 		scsi_dma_unmap(scmnd);
 
 	if (ret == -EAGAIN) {
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..bcb143766d6e 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1285,6 +1285,7 @@ struct hv_device {
 
 	/* place holder to keep track of the dir for hv device in debugfs */
 	struct dentry *debug_dir;
+	bool co_external_memory;
 
 };
 
-- 
2.50.1


* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08  6:37 UTC (permalink / raw)
  To: Michael Kelley, Matthew Ruffell
  Cc: bhelgaas@google.com, Haiyang Zhang, Jake Oshins,
	kwilczynski@kernel.org, KY Srinivasan,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, Long Li, lpieralisi@kernel.org,
	mani@kernel.org, robh@kernel.org, stable@vger.kernel.org,
	wei.liu@kernel.org
In-Reply-To: <SN6PR02MB4157F48F3B37D74E3A9F1AFBD45CA@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:13 PM
> > ...
> > 96959283a58d adds "select SYSFB if !HYPERV_VTL_MODE", but we can
> > still manually unset CONFIG_SYSFB (I happened to do this when debugging
> > the kdump issue), and hv_pci won't work.
> 
> Just curious -- how would you manually unset CONFIG_SYSFB? The kernel
> makefile always resyncs .config against the Kconfig rules, which would add
> CONFIG_SYSFB back again. The Kconfig files essentially say that removing
> CONFIG_SYSFB is an invalid configuration.

Sorry, my description above is wrong: on the mainline kernel that has
96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests"),
I'm unable to unset CONFIG_SYSFB.

When I was able to unset CONFIG_SYSFB, I was actually on Ubuntu 22.04
(Ubuntu-azure-6.8-6.8.0-1049.55_22.04.1, released in Feb 2026). I thought that
kernel had 96959283a58d, but it actually doesn't.

> > IMO vmbus_reserve_fb() should unconditionally reserve the frame buffer
> > MMIO range. I'll post a patch like this:
> >
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -2395,10 +2398,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
> >
> >         if (efi_enabled(EFI_BOOT)) {
> >                 /* Gen2 VM: get FB base from EFI framebuffer */
> > -               if (IS_ENABLED(CONFIG_SYSFB)) {
> > -                       start = sysfb_primary_display.screen.lfb_base;
> > -                       size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
> > -               }
> > +               start = sysfb_primary_display.screen.lfb_base;
> > +               size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);

Please ignore the change above.

> On arm64 the existence of sysfb_primary_display is conditional on
> several config variables, including CONFIG_SYSFB and CONFIG_EFI_EARLYCON.
> (see drivers/firmware/efi/efi-init.c) If you can take away CONFIG_SYSFB, you
> could also take away CONFIG_EFI_EARLYCON and end up with build error on
> arm64. So I'm not clear how this approach would be more robust against
> invalid .config changes.

Agreed. Then let's keep vmbus_reserve_fb() as is.

Thanks,
Dexuan

