Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Anirudh Rayabharam @ 2026-05-13 12:14 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862362.21765.11809618639989414561.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> In mshv_irqfd_assign(), the level-triggered validation for resample
> irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> mshv_irqfd_update() has populated the field. Since the irqfd struct is
> zero-allocated, level_triggered is always 0 at that point, causing the
> check to always reject resample irqfds with -EINVAL. This makes
> level-triggered interrupt resampling — used to avoid interrupt storms
> with assigned devices — completely non-functional.

What bugs would this manifest as? Why haven't we seen any such bugs so
far?

Thanks,
Anirudh.

^ permalink raw reply

* Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: sashiko-bot @ 2026-05-13 12:41 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-10-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [Critical] `hv_iommu_unmap_pages` makes a rep hypercall but ignores the completion count, leaving pages permanently mapped.
- [Critical] Integer truncation in `hv_iommu_unmap_pages` causes incomplete hardware unmapping.
- [Critical] Global `unique_id` data race leads to IOMMU domain collisions and cross-VM DMA sharing.
- [High] The `sscanf` return value check in `hv_iommu_probe_device` breaks the loop on success instead of failure.
- [High] Partial unmaps corrupt the interval tree state, desynchronizing hardware and software.
- [High] Unsafe interval tree iteration leads to skipped nodes during removal.
- [High] Memory leak of interval tree nodes on domain free.
- [High] Ignoring `gfp` flags causes premature OOM and risks sleeping in atomic context.
- [Medium] Missing NULL check on `mapped` pointer in fast path.
- [Low] The `__setup` handlers violate the API by returning 0.
--

commit 4a3c258bcace1805e7afe537c6eb3e451a6fd030
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Implement Hyper-V virtual IOMMU

Add a new file to implement management of device domains, mapping and
unmapping of IOMMU memory, and other iommu_ops to fit within the VFIO
framework for PCI passthru on Hyper-V running Linux as baremetal root
or L1VH root.

> diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
> new file mode 100644
> index 0000000000000..a2e0f6cc78e66
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu-root.c

[ ... ]

> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}

Should this return 1 instead of 0? Returning 0 causes the kernel to print an
unknown command line parameters warning and passes the parameters to the init
process.

> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev;	 /* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> +	hv_no_attdev = true;
> +	return 0;
> +}

Should this also return 1 instead of 0 to indicate the parameter was handled?

[ ... ]

> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	struct hv_domain *hvdom;
> +	int rc;

[ ... ]

> +	/* Called under iommu group mutex, so single threaded */
> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_NULL)   /* ie, UINTMAX */
> +		goto out_err;

Can this increment of unique_id race if called concurrently?

iommu_domain_alloc() can be called by subsystems without holding global or
group locks. If multiple allocations race, it could assign identical domid_num
values to different domains, which might place them in the same S2 domain.

[ ... ]

> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_delete_device_domain *input;
> +
> +	if (hv_special_domain(hvdom))
> +		return;
> +
> +	if (!hv_dom_owner_is_vmm(hvdom) || hv_no_attdev) {

[ ... ]

> +	}
> +
> +	kfree(hvdom);
> +}

Does this code leak the interval tree nodes?

It looks like the remaining struct hv_iommu_mapping nodes in
hvdom->mappings_tree are not freed before kfree(hvdom).

[ ... ]

> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> +				     unsigned long iova, phys_addr_t paddr,
> +				     size_t size, u32 flags)
> +{
> +	unsigned long irqflags;
> +	struct hv_iommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);

Should this use the gfp flag passed into hv_iommu_map_pages() instead of
GFP_ATOMIC? It might avoid failing prematurely under memory pressure when the
caller can sleep.

[ ... ]

> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> +					unsigned long iova, size_t size)
> +{
> +	unsigned long flags;
> +	size_t unmapped = 0;
> +	unsigned long last = iova + size - 1;
> +	struct hv_iommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now. */
> +		if (mapping->iova.start < iova)
> +			break;

Does breaking out early here leave stale mappings in the tree?

If the target region is unmapped in hardware, the tracking node would remain,
causing stale lookups. Also, if unmapping the head of a mapping, is the
entire node deleted even though the tail is still mapped?

> +
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> +		interval_tree_remove(node, &hvdom->mappings_tree);

Can erasing the node corrupt the iteration?

Since interval_tree_remove() rebalances the tree and alters parent pointers,
calling interval_tree_iter_next(next) on the next loop iteration might skip
overlapping nodes.

[ ... ]

> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> +			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
> +			      int prot, gfp_t gfp, size_t *mapped)
> +{
> +	u32 map_flags;
> +	int ret;
> +	u64 status;
> +	unsigned long npages, done = 0;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t size = pgsize * pgcount;
> +
> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (hvdom->attached_dom) {
> +		*mapped = size;
> +		return 0;
> +	}

Is it possible for mapped to be NULL here?

The fast path assigns to *mapped without validating the pointer, while later
in the function there are defensive checks like if (mapped).

[ ... ]

> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    hv_current_partition_id,
> +						    256);

Could this trigger a sleeping in atomic context bug?

hv_call_deposit_pages() typically performs sleepable allocations, but the
caller of hv_iommu_map_pages() might have provided a non-sleeping gfp flag
which was not forwarded.

[ ... ]

> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	unsigned long flags, npages;
> +	struct hv_input_unmap_device_gpa_pages *input;
> +	u64 status;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t unmapped, size = pgsize * pgcount;

[ ... ]

> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->target_device_va_base = iova;
> +
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> +				     0, input, NULL);

Can npages be truncated here?

The hypercall parameter rep_count in hv_do_rep_hypercall() expects a u16.
If npages is >= 65536, it would be silently truncated, causing incomplete
unmapping in the hardware.

Does this also need a loop to handle partial completions?

hv_do_rep_hypercall() might only partially complete. Since the completion count
from hv_repcomp(status) is ignored, remaining pages could be left fully
mapped in the hardware.

[ ... ]

> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int rc, pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> +				    &segment, &bus, &slot, &func, &parsed);
> +			if (rc)
> +				break;

Should this check for failure instead of success?

sscanf() returns the number of successfully matched items. If it succeeds, rc
will be 4, causing the loop to break prematurely without checking the parsed
values against the device.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=9

^ permalink raw reply

* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
From: Frederic Weisbecker @ 2026-05-13 13:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Anna-Maria Behnsen,
	Ingo Molnar, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87340od7ev.ffs@tglx>

Le Tue, Apr 21, 2026 at 10:50:00AM +0200, Thomas Gleixner a écrit :
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> > +	/*
> > +	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
> > +	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
> > +	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
> > +	 * for those CPUs that have their states changed. Those CPUs should be
> > +	 * in an offline state.
> > +	 */
> > +	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
> > +		WARN_ON_ONCE(cpu_online(cpu));
> > +		ct_cpu_track_user(cpu);
> > +		cpumask_set_cpu(cpu, tick_nohz_full_mask);
> > +	}
> > +
> > +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
> > +		WARN_ON_ONCE(cpu_online(cpu));
> > +		ct_cpu_untrack_user(cpu);
> > +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> > +	}
> > +}
> 
> So this writes to tick_nohz_full_mask while other CPUs can access
> it. That's just wrong and I'm not at all interested in the resulting
> KCSAN warnings.
> 
> tick_nohz_full_mask needs to become a RCU protected pointer, which is
> updated once the new mask is established in a separately allocated one.

How about just dropping tick_nohz_full_mask that is just
 ~housekeeping_cpumask(HK_TYPE_KERNEL_NOISE) which itself is becoming RCU
protected in this patchset?

Thanks.

> 
> Thanks,
> 
>         tglx
> 
> 

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

* [PATCH v4] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Anirudh Rayabharam (Microsoft) @ 2026-05-13 13:25 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv, linux-kernel, Anirudh Rayabharam (Microsoft),
	Stanislav Kinsburskii

The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.

Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
hypercall has no 1G stride, so 1G folios are processed as a
sequence of 2M chunks. Folios whose order is less than PMD_ORDER
(e.g. mTHP) fall back to single-page stride; mapping them as 2M
would fail in the hypervisor anyway.

Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
Changes in v4:
- Changed the check to page_order < PMD_ORDER for using page stride of 1
  - Also updated the commit message.
- Pick up Acked-by:
- Link to v3: https://lore.kernel.org/r/20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com

Changes in v3:
- Fixed various corner cases reported by Sashiko.
- Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com

Changes in v2:
- Handled the case where we can have 2M aligned pages in the middle of a
  1G page
- Brought back the page order check but expanded it to include 1G
- Clamp stride to requested page count in mshv_region_process_chunk
- Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
---
 drivers/hv/mshv_regions.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6..6d65e5b42152 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -29,29 +29,27 @@
  * Uses huge page stride if the backing page is huge and the guest mapping
  * is properly aligned; otherwise falls back to single page stride.
  *
- * Return: Stride in pages, or -EINVAL if page order is unsupported.
+ * Return: Stride in pages.
  */
-static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 page_count)
+static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
+				      u64 page_count)
 {
-	unsigned int page_order;
+	unsigned int page_order = folio_order(page_folio(page));
 
 	/*
 	 * Use single page stride by default. For huge page stride, the
-	 * page must be compound and point to the head of the compound
-	 * page, and both gfn and page_count must be huge-page aligned.
+	 * folio order must be at least PMD_ORDER, the page's PFN must be
+	 * 2M-aligned (so that a 2M-aligned tail page of a larger folio is
+	 * acceptable), and both gfn and page_count must be 2M-aligned.
 	 */
-	if (!PageCompound(page) || !PageHead(page) ||
+	if (page_order < PMD_ORDER ||
+	    !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
 	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
 		return 1;
 
-	page_order = folio_order(page_folio(page));
-	/* The hypervisor only supports 2M huge page */
-	if (page_order != PMD_ORDER)
-		return -EINVAL;
-
-	return 1 << page_order;
+	/* Use 2M stride always i.e. process 1G folios as 2M chunks */
+	return 1 << PMD_ORDER;
 }
 
 /**
@@ -86,15 +84,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 	u64 gfn = region->start_gfn + page_offset;
 	u64 count;
 	struct page *page;
-	int stride, ret;
+	unsigned int stride;
+	int ret;
 
 	page = region->mreg_pages[page_offset];
 	if (!page)
 		return -EINVAL;
 
 	stride = mshv_chunk_stride(page, gfn, page_count);
-	if (stride < 0)
-		return stride;
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < page_count; count += stride) {

---
base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
change-id: 20260416-huge_1g-e44461393c8f

Best regards,
-- 
Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply related

* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Dennis Dalessandro @ 2026-05-13 13:45 UTC (permalink / raw)
  To: Jason Gunthorpe, sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260513113835.GE7655@nvidia.com>

On 5/13/26 7:38 AM, Jason Gunthorpe wrote:
> On Wed, May 13, 2026 at 03:12:36AM +0000, sashiko-bot@kernel.org wrote:
>> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
>> - [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
>> - [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
>> - [High] Locking imbalance and freeing memory while locked.
>> - [High] Inconsistent state and Use-After-Free on error path.
>> - [Low] Uninitialized variable compiler warning for `offset_addr`.
>> --
> 
> These are all pre-existing and rvt is too confusing to try to fix be
> me. Dennis will have to handle them

Will take care of it.

-Denny

^ permalink raw reply

* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Souradeep Chakrabarti @ 2026-05-13 15:08 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device ID as a parameter. This device ID refers
> to that specific device during the lifetime of passthru.
> 
> An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> device is slightly different in that it uses the hv_pcibus_device info
> for building it to make sure it matches exactly what the hypervisor
> expects. This VMBus based device ID is needed when attaching devices in
> an L1VH based guest VM. Add a function to build and export it. Before
> building it, a check is done to make sure the device is a valid VMBus
> device.
> 
> In remaining cases, PCI device ID is used. So, also make PCI device ID
> build function hv_build_devid_type_pci() public.
>
Reviewed-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c         |  9 +++++----
>  arch/x86/include/asm/mshyperv.h     |  6 ++++++
>  drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
>  include/asm-generic/mshyperv.h      | 11 +++++++++++
>  4 files changed, 46 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dc..8780573a4332 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
>  	int pos;
>  	union hv_device_id hv_devid;
> @@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>  	}
>  
>  out:
> -	return hv_devid;
> +	return hv_devid.as_uint64;
>  }
> +EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
>  
>  /*
>   * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> @@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
>  
>  	msidesc = irq_data_get_msi_desc(data);
>  	pdev = msi_desc_to_pci_dev(msidesc);
> -	hv_devid = hv_build_devid_type_pci(pdev);
> +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
>  	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>  
>  	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> @@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>  {
>  	union hv_device_id hv_devid;
>  
> -	hv_devid = hv_build_devid_type_pci(pdev);
> +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
>  	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
>  }
>  
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee..2ef34001f8d3 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
>  static inline void hv_root_crash_init(void) {}
>  #endif  /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
>  
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> +
>  #else /* CONFIG_HYPERV */
>  static inline void hyperv_init(void) {}
>  static inline void hyperv_setup_mmu_ops(void) {}
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad..50d793ca8f31 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -573,6 +573,7 @@ struct hv_pci_compl {
>  };
>  
>  static void hv_pci_onchannelcallback(void *context);
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
>  
>  #ifdef CONFIG_X86
>  #define DELIVERY_MODE		APIC_DELIVERY_MODE_FIXED
> @@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
>  static void hv_arch_irq_unmask(struct irq_data *data) { }
>  #endif /* CONFIG_ARM64 */
>  
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{
> +	struct hv_pcibus_device *hbus;
> +	struct pci_bus *pbus = pdev->bus;
> +
> +	if (!hv_vmbus_pci_device(pbus))
> +		return 0;
> +
> +	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> +
> +	return	(hbus->hdev->dev_instance.b[5] << 24) |
> +		(hbus->hdev->dev_instance.b[4] << 16) |
> +		(hbus->hdev->dev_instance.b[7] << 8) |
> +		(hbus->hdev->dev_instance.b[6] & 0xf8) |
> +		PCI_FUNC(pdev->devfn);
> +}
> +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> +
>  /**
>   * hv_pci_generic_compl() - Invoked for a completion packet
>   * @context:		Set up by the sender of the packet.
> @@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
>  	.write = hv_pcifront_write_config,
>  };
>  
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> +{
> +	return pbus->ops == &hv_pcifront_ops;
> +}
> +
>  /*
>   * Paravirtual backchannel
>   *
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index e8cbc4e3f7ad..25ac7ca0fd8b 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
>  /* Sentinel value for an uninitialized entry in hv_vp_index array */
>  #define VP_INVAL	U32_MAX
>  
> +/* Forward declarations */
> +struct pci_dev;
> +
>  int __init hv_common_init(void);
>  void __init hv_get_partition_id(void);
>  void __init hv_common_free(void);
> @@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
>  void hyperv_cleanup(void);
>  bool hv_query_ext_cap(u64 cap_query);
>  void hv_setup_dma_ops(struct device *dev, bool coherent);
> +
> +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{ return 0; }
> +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> +
>  #else /* CONFIG_HYPERV */
>  static inline void hv_identify_partition_type(void) {}
>  static inline bool hv_is_hyperv_initialized(void) { return false; }
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* RE: [PATCH v4] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Michael Kelley @ 2026-05-13 15:13 UTC (permalink / raw)
  To: Anirudh Rayabharam (Microsoft), K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Stanislav Kinsburskii
In-Reply-To: <20260513-huge_1g-v4-1-33cda59e4a70@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Sent: Wednesday, May 13, 2026 6:26 AM
> 
> The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
> chunks into 1G mappings when alignment permits, so the driver can
> support 1G hugepages by feeding them in as 2M chunks. Note that this
> is the only way to make 1G mappings; there is no way to directly map
> a 1G hugepage using the hypercall.
> 
> Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
> hypercall has no 1G stride, so 1G folios are processed as a
> sequence of 2M chunks. Folios whose order is less than PMD_ORDER
> (e.g. mTHP) fall back to single-page stride; mapping them as 2M
> would fail in the hypervisor anyway.
> 
> Assisted-by: Copilot-CLI:claude-opus-4.7
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> ---
> Changes in v4:
> - Changed the check to page_order < PMD_ORDER for using page stride of 1
>   - Also updated the commit message.
> - Pick up Acked-by:
> - Link to v3: https://lore.kernel.org/r/20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com
> 
> Changes in v3:
> - Fixed various corner cases reported by Sashiko.
> - Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com
> 
> Changes in v2:
> - Handled the case where we can have 2M aligned pages in the middle of a
>   1G page
> - Brought back the page order check but expanded it to include 1G
> - Clamp stride to requested page count in mshv_region_process_chunk
> - Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
> ---
>  drivers/hv/mshv_regions.c | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..6d65e5b42152 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -29,29 +29,27 @@
>   * Uses huge page stride if the backing page is huge and the guest mapping
>   * is properly aligned; otherwise falls back to single page stride.
>   *
> - * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + * Return: Stride in pages.
>   */
> -static int mshv_chunk_stride(struct page *page,
> -			     u64 gfn, u64 page_count)
> +static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
> +				      u64 page_count)
>  {
> -	unsigned int page_order;
> +	unsigned int page_order = folio_order(page_folio(page));
> 
>  	/*
>  	 * Use single page stride by default. For huge page stride, the
> -	 * page must be compound and point to the head of the compound
> -	 * page, and both gfn and page_count must be huge-page aligned.
> +	 * folio order must be at least PMD_ORDER, the page's PFN must be
> +	 * 2M-aligned (so that a 2M-aligned tail page of a larger folio is
> +	 * acceptable), and both gfn and page_count must be 2M-aligned.
>  	 */
> -	if (!PageCompound(page) || !PageHead(page) ||
> +	if (page_order < PMD_ORDER ||
> +	    !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
>  	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
>  	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
>  		return 1;
> 
> -	page_order = folio_order(page_folio(page));
> -	/* The hypervisor only supports 2M huge page */
> -	if (page_order != PMD_ORDER)
> -		return -EINVAL;
> -
> -	return 1 << page_order;
> +	/* Use 2M stride always i.e. process 1G folios as 2M chunks */
> +	return 1 << PMD_ORDER;
>  }
> 
>  /**
> @@ -86,15 +84,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>  	u64 gfn = region->start_gfn + page_offset;
>  	u64 count;
>  	struct page *page;
> -	int stride, ret;
> +	unsigned int stride;
> +	int ret;
> 
>  	page = region->mreg_pages[page_offset];
>  	if (!page)
>  		return -EINVAL;
> 
>  	stride = mshv_chunk_stride(page, gfn, page_count);
> -	if (stride < 0)
> -		return stride;
> 
>  	/* Start at stride since the first stride is validated */
>  	for (count = stride; count < page_count; count += stride) {
> 
> ---
> base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
> change-id: 20260416-huge_1g-e44461393c8f
> 
> Best regards,
> --
> Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 


^ permalink raw reply

* Re: [PATCH V1 3/3] mshv: Implement guest irq migration for passthru devices
From: Souradeep Chakrabarti @ 2026-05-13 15:15 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, linux-hyperv, linux-kernel, iommu,
	linux-pci, linux-arch
In-Reply-To: <20260512021242.1679786-4-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:12:42PM -0700, Mukesh R wrote:
> Ask the hypervisor to retarget interrupts to new guest cpu or vector
> upon guest irq migration. This happens in the irqfd update path.
> 
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c | 78 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 1f5c1e9ee9b7..c05201d857fd 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -192,6 +192,77 @@ static int mshv_map_device_interrupt(u64 ptid, union hv_device_id hv_devid,
>  
>  }
>  
> +/* NOTE: caller does spin_lock_irq on pt_irqfds_lock, hence no disable here */
> +static void mshv_do_guest_irq_retarget(u64 partid, struct mshv_irqfd *irqfd)
> +{
> +	int rc, var_size;
> +	u64 status;
> +	union hv_device_id hv_devid;
> +	struct hv_input_get_vp_set_from_mda *mda_input;
> +	union hv_output_get_vp_set_from_mda *mda_output;
> +	struct hv_retarget_device_interrupt *remap_inp;
> +	struct pci_dev *pdev;
> +	struct irq_data *irqdata;
> +	struct mshv_lapic_irq *lapic_irq = &irqfd->irqfd_lapic_irq;
> +	struct hv_interrupt_entry *inte = NULL;
> +
> +	if (!irqfd->irqfd_girq_ent.girq_entry_valid ||
> +	    irqfd->irqfd_bypass_prod == NULL)
> +		return;
> +
> +	rc = mshv_parse_mshv_irqfd(irqfd, &pdev, &irqdata);
> +	if (rc)
> +		return;
> +
> +	inte = irqdata->chip_data;
> +	if (inte == NULL)
> +		return;
> +
> +	hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
> +
> +
> +	mda_input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	mda_output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	rc = hv_vpset_from_hyp_disabled(mda_input, mda_output, lapic_irq,
> +					partid);
> +	if (rc)
> +		return;
> +
> +	remap_inp = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(remap_inp, 0, sizeof(*remap_inp));
> +
> +	rc = hv_copy_vpset(&remap_inp->int_target.vp_set,
> +			   &mda_output->target_vpset);
> +	if (rc <= 0) {
> +		pr_err("Hyper-V: ptid %lld - vpset copy failed (%d)\n",
> +		       partid, rc);
> +		return;
> +	}
> +
> +	/*
> +	 * var-sized hcall: var-size starts after vp_mask (thus vp_set.format
> +	 * does not count, but vp_set.valid_bank_mask does).
> +	 */
> +	var_size = rc + 1;
> +
> +	remap_inp->partition_id = partid;
> +	remap_inp->device_id = hv_devid.as_uint64;
> +	remap_inp->int_target.vector = lapic_irq->lapic_vector;
> +	remap_inp->int_target.flags = HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> +
> +	remap_inp->int_entry.source = inte->source;
> +	remap_inp->int_entry.msi_entry.as_uint64 = inte->msi_entry.as_uint64;
> +
> +	status = hv_do_rep_hypercall(HVCALL_RETARGET_INTERRUPT, 0, var_size,
> +				     remap_inp, NULL);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "pt:%lld vec:%d lapic-id:%lld\n",
> +			      partid, lapic_irq->lapic_vector,
> +			      lapic_irq->lapic_apic_id);
> +}
> +
>  static int mshv_unmap_device_interrupt(union hv_device_id hv_devid,
>  				       struct hv_interrupt_entry *irq_entry)
>  {
> @@ -729,9 +800,12 @@ static void mshv_irqfd_update(struct mshv_partition *pt,
>  			      struct mshv_irqfd *irqfd)
>  {
>  	write_seqcount_begin(&irqfd->irqfd_irqe_sc);
> -	irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt,
> -						    irqfd->irqfd_irqnum);
> +	irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt, irqfd->irqfd_irqnum);
>  	mshv_copy_girq_info(&irqfd->irqfd_girq_ent, &irqfd->irqfd_lapic_irq);
> +
> +#if IS_ENABLED(CONFIG_X86_64)
> +	mshv_do_guest_irq_retarget(pt->pt_id, irqfd);
> +#endif
>  	write_seqcount_end(&irqfd->irqfd_irqe_sc);
>  }
>  
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Souradeep Chakrabarti @ 2026-05-13 15:17 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <agST9ZRXNWhbUGMY@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Wed, May 13, 2026 at 08:08:37AM -0700, Souradeep Chakrabarti wrote:
> On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> > On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> > interrupts, etc need a device ID as a parameter. This device ID refers
> > to that specific device during the lifetime of passthru.
> > 
> > An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> > device is slightly different in that it uses the hv_pcibus_device info
> > for building it to make sure it matches exactly what the hypervisor
> > expects. This VMBus based device ID is needed when attaching devices in
> > an L1VH based guest VM. Add a function to build and export it. Before
> > building it, a check is done to make sure the device is a valid VMBus
> > device.
> > 
> > In remaining cases, PCI device ID is used. So, also make PCI device ID
> > build function hv_build_devid_type_pci() public.
> >
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Reviewed-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
> > Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> > ---
> >  arch/x86/hyperv/irqdomain.c         |  9 +++++----
> >  arch/x86/include/asm/mshyperv.h     |  6 ++++++
> >  drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
> >  include/asm-generic/mshyperv.h      | 11 +++++++++++
> >  4 files changed, 46 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> > index b3ad50a874dc..8780573a4332 100644
> > --- a/arch/x86/hyperv/irqdomain.c
> > +++ b/arch/x86/hyperv/irqdomain.c
> > @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> >  	return 0;
> >  }
> >  
> > -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> > +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
> >  {
> >  	int pos;
> >  	union hv_device_id hv_devid;
> > @@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> >  	}
> >  
> >  out:
> > -	return hv_devid;
> > +	return hv_devid.as_uint64;
> >  }
> > +EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
> >  
> >  /*
> >   * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> > @@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
> >  
> >  	msidesc = irq_data_get_msi_desc(data);
> >  	pdev = msi_desc_to_pci_dev(msidesc);
> > -	hv_devid = hv_build_devid_type_pci(pdev);
> > +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> >  	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
> >  
> >  	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> > @@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> >  {
> >  	union hv_device_id hv_devid;
> >  
> > -	hv_devid = hv_build_devid_type_pci(pdev);
> > +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> >  	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> >  }
> >  
> > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > index f64393e853ee..2ef34001f8d3 100644
> > --- a/arch/x86/include/asm/mshyperv.h
> > +++ b/arch/x86/include/asm/mshyperv.h
> > @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
> >  static inline void hv_root_crash_init(void) {}
> >  #endif  /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
> >  
> > +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> > +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> > +#else
> > +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> > +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> > +
> >  #else /* CONFIG_HYPERV */
> >  static inline void hyperv_init(void) {}
> >  static inline void hyperv_setup_mmu_ops(void) {}
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index cfc8fa403dad..50d793ca8f31 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -573,6 +573,7 @@ struct hv_pci_compl {
> >  };
> >  
> >  static void hv_pci_onchannelcallback(void *context);
> > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> >  
> >  #ifdef CONFIG_X86
> >  #define DELIVERY_MODE		APIC_DELIVERY_MODE_FIXED
> > @@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
> >  static void hv_arch_irq_unmask(struct irq_data *data) { }
> >  #endif /* CONFIG_ARM64 */
> >  
> > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > +{
> > +	struct hv_pcibus_device *hbus;
> > +	struct pci_bus *pbus = pdev->bus;
> > +
> > +	if (!hv_vmbus_pci_device(pbus))
> > +		return 0;
> > +
> > +	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > +
> > +	return	(hbus->hdev->dev_instance.b[5] << 24) |
> > +		(hbus->hdev->dev_instance.b[4] << 16) |
> > +		(hbus->hdev->dev_instance.b[7] << 8) |
> > +		(hbus->hdev->dev_instance.b[6] & 0xf8) |
> > +		PCI_FUNC(pdev->devfn);
> > +}
> > +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> > +
> >  /**
> >   * hv_pci_generic_compl() - Invoked for a completion packet
> >   * @context:		Set up by the sender of the packet.
> > @@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
> >  	.write = hv_pcifront_write_config,
> >  };
> >  
> > +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> > +{
> > +	return pbus->ops == &hv_pcifront_ops;
> > +}
> > +
> >  /*
> >   * Paravirtual backchannel
> >   *
> > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > index e8cbc4e3f7ad..25ac7ca0fd8b 100644
> > --- a/include/asm-generic/mshyperv.h
> > +++ b/include/asm-generic/mshyperv.h
> > @@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
> >  /* Sentinel value for an uninitialized entry in hv_vp_index array */
> >  #define VP_INVAL	U32_MAX
> >  
> > +/* Forward declarations */
> > +struct pci_dev;
> > +
> >  int __init hv_common_init(void);
> >  void __init hv_get_partition_id(void);
> >  void __init hv_common_free(void);
> > @@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
> >  void hyperv_cleanup(void);
> >  bool hv_query_ext_cap(u64 cap_query);
> >  void hv_setup_dma_ops(struct device *dev, bool coherent);
> > +
> > +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > +#else
> > +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > +{ return 0; }
> > +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> > +
> >  #else /* CONFIG_HYPERV */
> >  static inline void hv_identify_partition_type(void) {}
> >  static inline bool hv_is_hyperv_initialized(void) { return false; }
> > -- 
> > 2.51.2.vfs.0.1
> > 

^ permalink raw reply

* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
From: Frederic Weisbecker @ 2026-05-13 16:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <20260421030351.281436-9-longman@redhat.com>

Le Mon, Apr 20, 2026 at 11:03:36PM -0400, Waiman Long a écrit :
> As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
> need to use RCU to protect access to the cpumask to prevent it from
> going away in the middle of the operation.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  arch/arm64/kernel/topology.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb..48f150801689 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
>  	if (!amu_fie_cpu_supported(cpu))
>  		return;
>  
> +	guard(rcu)();
>  	/* Kick in AMU update but only if one has not happened already */
>  	if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
>  	    time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update,
>  	cpu)))

This is called with IRQs disabled in the current CPU that is online so it's
already guaranteed to be stable.


> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
>  	unsigned int start_cpu = cpu;
>  	unsigned long last_update;
>  	unsigned int freq = 0;
> +	bool hk_cpu;
>  	u64 scale;
>  
>  	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
>  		return -EOPNOTSUPP;
>  
> +	scoped_guard(rcu) {
> +		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> +	}
> +
>  	while (1) {
>  
>  		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
>  		 * (and thus freq scale), if available, for given policy: this boils
>  		 * down to identifying an active cpu within the same freq domain, if any.
>  		 */
> -		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> +		if (!hk_cpu ||
>  		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>  			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> +			bool hk_intersects;
>  			int ref_cpu;
>  
>  			if (!policy)
>  				return -EINVAL;
>  
> -			if (!cpumask_intersects(policy->related_cpus,
> -						housekeeping_cpumask(HK_TYPE_TICK))) {
> +			scoped_guard(rcu) {
> +				hk_intersects = cpumask_intersects(policy->related_cpus,
> +							housekeeping_cpumask(HK_TYPE_TICK));
> +			}
> +
> +			if (!hk_intersects) {
>  				cpufreq_cpu_put(policy);
>  				return -EOPNOTSUPP;
>  			}

Ok so this is racy but it's fine because:

This function is only used by cpufreq with either cpufreq_policy_write or
cpufreq_policy_read held (that is, struct cpufreq_policy::rwsem).

And that rwsem is write held on cpufreq_online() -> cpufreq_policy_online() and
also offline to guarantee the policy->cpus and policy->cpu stability.

Therefore housekeeping_cpumask() should only deal with stable online CPUs here. So
even if the housekeeping mask can be changed concurrently, those CPUs can't
appear or disappear from it.

Would be worth adding a comment about that.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Stanislav Kinsburskii @ 2026-05-13 17:31 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-roaring-gentle-kiwi-21bccd@anirudhrb>

On Wed, May 13, 2026 at 11:15:37AM +0000, Anirudh Rayabharam wrote:
> On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> > On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > > fails on an unencrypted partition. The condition on the error path:
> > > > 
> > > >     if (ret && mshv_partition_encrypted(partition))
> > > > 
> > > > only handles map failures for encrypted partitions — if the partition is
> > > > not encrypted and the map fails, execution falls through to 'return 0',
> > > > silently ignoring the error.
> > > > 
> > > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > > zeroes the mreg_pages array before the caller's cleanup path
> > > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > > mappings for partially-mapped pages.
> > > > 
> > > > Fix by returning immediately on success and falling through to error
> > > > return on failure. For unencrypted partitions, the caller's
> > > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > > correct order. For encrypted partitions where re-sharing fails, zero
> > > > the page array without unpinning — the pages are inaccessible to the
> > > > host and must not be unpinned, but zeroing prevents
> > > > mshv_region_destroy() from attempting to unpin them.
> > > 
> > > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > > mshv_region_invalidate_pages() which just does:
> > > 
> > > 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > > 						 u64 page_offset, u64 page_count)
> > > 	{
> > > 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > > 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > > 
> > > 		memset(region->mreg_pages + page_offset, 0,
> > > 		       page_count * sizeof(struct page *));
> > > 	}
> > > 
> > > It doesn't checked for zeroed pages before unpinning.
> > > 
> > > > 
> > > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> > > >  1 file changed, 16 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index 665d565899c15..7e4252b6bc65c 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > > >  			pt_err(partition,
> > > >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > > >  			       region->start_gfn, ret);
> > > > -			goto invalidate_region;
> > > > +			goto err_out;
> > > >  		}
> > > >  	}
> > > >  
> > > >  	ret = mshv_region_map(region);
> > > > -	if (ret && mshv_partition_encrypted(partition)) {
> > > > +	if (ret)
> > > > +		goto share_region;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +share_region:
> > > > +	if (mshv_partition_encrypted(partition)) {
> > > >  		int shrc;
> > > >  
> > > >  		shrc = mshv_region_share(region);
> > > >  		if (!shrc)
> > > > -			goto invalidate_region;
> > > > +			goto err_out;
> > > >  
> > > >  		pt_err(partition,
> > > >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> > > >  		       region->start_gfn, shrc);
> > > >  		/*
> > > > -		 * Don't unpin if marking shared failed because pages are no
> > > > -		 * longer mapped in the host, ie root, anymore.
> > > > +		 * Re-sharing failed — the pages remain inaccessible to the
> > > > +		 * host.  Zero the page array so that mshv_region_destroy()
> > > > +		 * won't attempt to unpin them (leaking the page references
> > > > +		 * is intentional; unpinning host-inaccessible pages would be
> > > > +		 * unsafe).
> > > >  		 */
> > > 
> > > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > > can remove it from here altogether. Either way, should this zeroing
> > > logic be added there too so as not to crash the host by unpinning pages
> > > that weren't marked shared?
> > > 
> > 
> > Indeed.
> > The issue with all this code is that we are leaking state in
> > mshv_region_pin if we wimply return from it: we don't know if the pages
> > are pinned or unshared (or mapped) in the destruction callback.
> > We should either introduce a state variable to track this or just don't
> > call the generic mshv_region_put on case of region creation error.
> > The latter approch make mshv_map_user_memory idempotent and thus looks
> > like a better way forward.
> > What do you think?
> 

I tried to say, that there can be two apporaches to solving this issue:
  1. Either make sure mshv_region_pus() can destroy a partially created
     region (tha'ts what this patch is doing) or
  2. Make mshv_map_user_region be idempotent by not calling mshv_region_put() if the region was not
     fully created and destroy the partioally created parts of it on
     error paths. This would require preserving some state in the region
     struct to know if the region was pinned or shared or mapped.
     This looks like a leaner apporach, but it requires additional state
     tracking and more code changes.

Thanks,
Stanislav


> I'm not sure I follow the latter approach. Omitting mshv_region_put()
> would cause a dangling reference to the mshv_region right?
> 
> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Stanislav Kinsburskii @ 2026-05-13 17:38 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-omniscient-enchanted-otter-dfd602@anirudhrb>

On Wed, May 13, 2026 at 12:14:49PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> > In mshv_irqfd_assign(), the level-triggered validation for resample
> > irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> > mshv_irqfd_update() has populated the field. Since the irqfd struct is
> > zero-allocated, level_triggered is always 0 at that point, causing the
> > check to always reject resample irqfds with -EINVAL. This makes
> > level-triggered interrupt resampling — used to avoid interrupt storms
> > with assigned devices — completely non-functional.
> 
> What bugs would this manifest as? Why haven't we seen any such bugs so
> far?
> 

This patch fixes a logical error.
Whtout the change this hunk always fails:

        if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
            !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {

and the reason we never seen it as that we never used
register_irqfd_with_resample() function of the mshv crate.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Stanislav Kinsburskii @ 2026-05-13 17:39 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-grinning-firefly-of-refinement-ed06cd@anirudhrb>

On Wed, May 13, 2026 at 11:12:05AM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> > handle_pair_message() iterates up to msg->vp_count without verifying it
> > against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> > from untrusted hypervisor data, a malformed message with a large value
> > would cause out-of-bounds reads from the partition_ids and vp_indexes
> > arrays.
> > 
> > handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> > 64) and advances bank_contents for each one. However, the payload buffer
> > only has space for 16 bank entries. A valid_bank_mask with more than 16
> > bits set causes bank_contents to read beyond the message buffer.
> > 
> > Fix both by adding bounds validation:
> > - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> > - Track banks consumed and stop before exceeding buffer capacity
> > 
> > Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index 89207aad7cf1f..5d509299f14d7 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
> >  static void
> >  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> >  {
> > -	int bank_idx, vps_signaled = 0, bank_mask_size;
> > +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> > +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> > +			      sizeof(u64) - 2; /* subtract format + mask */
> 
> Could this be a constant in the header?
> 

Yes, it could. But it the only place it's used and it's pretty
self-explanatory, so I don't think it needs to be.

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 

^ permalink raw reply

* RE: [EXTERNAL] [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Long Li @ 2026-05-13 18:30 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, James.Bottomley@HansenPartnership.com,
	martin.petersen@oracle.com, Allen Pais
  Cc: Tianyu Lan, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org,
	vdso@hexbites.dev, mhklinux@outlook.com
In-Reply-To: <20260408073105.272255-1-tiala@microsoft.com>

> Hyper-V provides Confidential VMBus to communicate between device model
> and device guest driver via encrypted/private memory in Confidential VM. The
> device model is in OpenHCL
> (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fopenvm
> m.dev%2Fguide%2Fuser_guide%2Fopenhcl.html&data=05%7C02%7Clongli%40mi
> crosoft.com%7C0ccfea7cda8e4500ae9808de9540d01e%7C72f988bf86f141af91a
> b2d7cd011db47%7C1%7C0%7C639112302777934798%7CUnknown%7CTWFpbG
> Zsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIk
> FOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=5Uc%2FM4ZVgJT1
> NAq08cIlNtfF5oW4n%2FTj%2Bqg3YqBUeZg%3D&reserved=0) that plays the
> paravisor role.
> 
> For a VMBus device, there are two communication methods to talk with
> Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic DMA transfer.
> 
> The Confidential VMBus Ring buffer has been upstreamed by Roman Kisel(commit
> 6802d8af47d1).
> 
> The dynamic DMA transition of VMBus device normally goes through DMA core
> and it uses SWIOTLB as bounce buffer in a CoCo VM.
> 
> The Confidential VMBus device can do DMA directly to private/encrypted
> memory. Because the swiotlb is decrypted memory, the DMA transfer must not
> be bounced through the swiotlb, so as to preserve confidentiality. This is different
> from the default for Linux CoCo VMs, so not use DMA(SWIOTLB) API in VMBus
> driver when confidential dynamic DMA transfers capability is present.
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
>  include/linux/hyperv.h     |  1 +
>  2 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index
> ae1abab97835..79b7611518b7 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1316,7 +1316,8 @@ static void storvsc_on_channel_callback(void *context)
>  					continue;
>  				}
>  				request = (struct storvsc_cmd_request
> *)scsi_cmd_priv(scmnd);
> -				scsi_dma_unmap(scmnd);
> +				if (!device->co_external_memory)
> +					scsi_dma_unmap(scmnd);
>  			}
> 
>  			storvsc_on_receive(stor_device, packet, request); @@ -
> 1339,6 +1340,8 @@ static int storvsc_connect_to_vsp(struct hv_device *device,
> u32 ring_size,
> 
>  	device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
>  	device->channel->next_request_id_callback = storvsc_next_request_id;
> +	if (device->channel->co_external_memory)
> +		device->co_external_memory = true;
> 
>  	ret = vmbus_open(device->channel,
>  			 ring_size,
> @@ -1805,7 +1808,7 @@ static enum scsi_qc_status
> storvsc_queuecommand(struct Scsi_Host *host,
>  		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
>  		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
>  		struct scatterlist *sg;
> -		unsigned long hvpfn, hvpfns_to_add;
> +		unsigned long hvpfn, hvpfns_to_add, hvpgoff;
>  		int j, i = 0, sg_count;
> 
>  		payload_sz = (hvpg_count * sizeof(u64) + @@ -1821,7 +1824,11
> @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
>  		payload->range.len = length;
>  		payload->range.offset = offset_in_hvpg;
> 
> -		sg_count = scsi_dma_map(scmnd);
> +		if (dev->co_external_memory)
> +			sg_count = scsi_sg_count(scmnd);

scsi_sg_count() returns unsigned int, sg_count can't be negative. The check for sg_count < 0 below becomes dead code. Add a comment to say this is expected behavior.

> +		else
> +			sg_count = scsi_dma_map(scmnd);
> +
>  		if (sg_count < 0) {
>  			ret = SCSI_MLQUEUE_DEVICE_BUSY;
>  			goto err_free_payload;
> @@ -1836,9 +1843,16 @@ static enum scsi_qc_status
> storvsc_queuecommand(struct Scsi_Host *host,
>  			 * Such offsets are handled even on other than the first
>  			 * sgl entry, provided they are a multiple of PAGE_SIZE.
>  			 */
> -			hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> -			hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
> -						 sg_dma_len(sg)) - hvpfn;
> +			if (dev->co_external_memory) {
> +				hvpgoff = HVPFN_DOWN(sg->offset);
> +				hvpfn = page_to_hvpfn(sg_page(sg)) + hvpgoff;
> +				hvpfns_to_add =	HVPFN_UP(sg->offset
> + sg->length) -
> +							hvpgoff;
> +			} else {
> +				hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> +				hvpfns_to_add =
> HVPFN_UP(sg_dma_address(sg) +
> +							 sg_dma_len(sg)) -
> hvpfn;
> +			}
> 
>  			/*
>  			 * Fill the next portion of the PFN array with @@ -1860,7
> +1874,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct
> Scsi_Host *host,
>  	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
>  	migrate_enable();
> 
> -	if (ret)
> +	if (ret && (!dev->co_external_memory))
>  		scsi_dma_unmap(scmnd);
> 
>  	if (ret == -EAGAIN) {
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index
> dfc516c1c719..bcb143766d6e 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1285,6 +1285,7 @@ struct hv_device {
> 
>  	/* place holder to keep track of the dir for hv device in debugfs */
>  	struct dentry *debug_dir;
> +	bool co_external_memory;

You don't need to introduce co_external_memory in hv_device, vmbus_channel already has co_external_memory. Is it possible that you can check the vmbus_channel->co_external_memory directly? If you can remove this,  you can reword this patch to " scsi: storvsc: Confidential VMBus for dynamic DMA transfers".

Thanks,
Long



^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Jacob Pan @ 2026-05-13 18:39 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, jgg, mhklinux,
	tgopinath, easwar.hariharan, jacob.pan
In-Reply-To: <20260511162408.1180069-4-zhangyu1@linux.microsoft.com>

Hi Yu,

On Tue, 12 May 2026 00:24:07 +0800
Yu Zhang <zhangyu1@linux.microsoft.com> wrote:

> Add a para-virtualized IOMMU driver for Linux guests running on
> Hyper-V. This driver implements stage-1 IO translation within the
> guest OS. It integrates with the Linux IOMMU core, utilizing Hyper-V
> hypercalls for:
>  - Capability discovery
>  - Domain allocation, configuration, and deallocation
>  - Device attachment and detachment
>  - IOTLB invalidation
> 
> The driver constructs x86-compatible stage-1 IO page tables in the
> guest memory using consolidated IO page table helpers. This allows
> the guest to manage stage-1 translations independently of vendor-
> specific drivers (like Intel VT-d or AMD IOMMU).
> 
> Hyper-V consumes this stage-1 IO page table when a device domain is
> created and configured, and nests it with the host's stage-2 IO page
> tables, therefore eliminating the VM exits for guest IOMMU mapping
> operations. For unmapping operations, VM exits to perform the IOTLB
> flush are still unavoidable.
> 
> Hyper-V identifies each PCI pass-thru device by a logical device ID
> in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> per-bus portion of this ID with the pvIOMMU driver during bus probe.
> The pvIOMMU driver stores this mapping and combines it with the
> function number of the endpoint PCI device to form the complete ID
> for hypercalls.
> 
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Co-developed-by: Easwar Hariharan
> <easwar.hariharan@linux.microsoft.com> Signed-off-by: Easwar
> Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Yu
> Zhang <zhangyu1@linux.microsoft.com> ---
>  arch/x86/hyperv/hv_init.c           |   4 +
>  arch/x86/include/asm/mshyperv.h     |   4 +
>  drivers/iommu/hyperv/Kconfig        |  17 +
>  drivers/iommu/hyperv/Makefile       |   1 +
>  drivers/iommu/hyperv/iommu.c        | 705
> ++++++++++++++++++++++++++++ drivers/iommu/hyperv/iommu.h        |
> 54 +++ drivers/pci/controller/pci-hyperv.c |  19 +-
>  include/asm-generic/mshyperv.h      |  12 +
>  8 files changed, 815 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/iommu/hyperv/iommu.c
>  create mode 100644 drivers/iommu/hyperv/iommu.h
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 323adc93f2dc..2c8ff8e06249 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -578,6 +578,10 @@ void __init hyperv_init(void)
>  	old_setup_percpu_clockev =
> x86_init.timers.setup_percpu_clockev;
> x86_init.timers.setup_percpu_clockev =
> hv_stimer_setup_percpu_clockev; +#ifdef CONFIG_HYPERV_PVIOMMU
> +	x86_init.iommu.iommu_init = hv_iommu_init;
> +#endif
> +
>  	hv_apic_init();
>  
>  	x86_init.pci.arch_init = hv_pci_init;
> diff --git a/arch/x86/include/asm/mshyperv.h
> b/arch/x86/include/asm/mshyperv.h index f64393e853ee..20d947c2c758
> 100644 --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -313,6 +313,10 @@ static inline void
> mshv_vtl_return_hypercall(void) {} static inline void
> __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {} #endif
>  
> +#ifdef CONFIG_HYPERV_PVIOMMU
> +int __init hv_iommu_init(void);
> +#endif
> +
>  #include <asm-generic/mshyperv.h>
>  
>  #endif
> diff --git a/drivers/iommu/hyperv/Kconfig
> b/drivers/iommu/hyperv/Kconfig index 30f40d867036..9e658d5c9a77 100644
> --- a/drivers/iommu/hyperv/Kconfig
> +++ b/drivers/iommu/hyperv/Kconfig
> @@ -8,3 +8,20 @@ config HYPERV_IOMMU
>  	help
>  	  Stub IOMMU driver to handle IRQs to support Hyper-V Linux
>  	  guest and root partitions.
> +
> +if HYPERV_IOMMU
> +config HYPERV_PVIOMMU
> +	bool "Microsoft Hypervisor para-virtualized IOMMU support"
> +	depends on X86 && HYPERV
> +	select IOMMU_API
> +	select GENERIC_PT
> +	select IOMMU_PT
> +	select IOMMU_PT_X86_64
nit:
If HYPERV_PVIOMMU is enabled on a (hypothetical) platform with
GENERIC_ATOMIC64=y, the select would force-enable IOMMU_PT_X86_64 even
though its depends on is unsatisfied — leading to a build failure.

In practice this can't happen today because HYPERV_PVIOMMU already
depends on X86 && HYPERV, and x86 never sets GENERIC_ATOMIC64. But
adding the explicit guard is more defensive.
i.e.
       depends on !GENERIC_ATOMIC64    # for cmpxchg64 in IOMMU_PT

> +	select IOMMU_IOVA
> +	default HYPERV
> +	help
> +	  Para-virtualized IOMMU driver for Linux guests running on
> +	  Microsoft Hyper-V. Provides DMA remapping and IOTLB
> +	  flush support to enable DMA isolation for devices
> +	  assigned to the guest.
> +endif
> diff --git a/drivers/iommu/hyperv/Makefile
> b/drivers/iommu/hyperv/Makefile index 9f557bad94ff..8669741c0a51
> 100644 --- a/drivers/iommu/hyperv/Makefile
> +++ b/drivers/iommu/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o
> +obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o
> diff --git a/drivers/iommu/hyperv/iommu.c
> b/drivers/iommu/hyperv/iommu.c new file mode 100644
> index 000000000000..e5fc625314b5
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.c
> @@ -0,0 +1,705 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2019, 2024-2026 Microsoft, Inc.
> + */
> +
> +#define pr_fmt(fmt) "Hyper-V pvIOMMU: " fmt
> +#define dev_fmt(fmt) pr_fmt(fmt)
> +
> +#include <linux/iommu.h>
> +#include <linux/pci.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/generic_pt/iommu.h>
> +#include <linux/pci-ats.h>
> +
> +#include <asm/iommu.h>
> +#include <asm/hypervisor.h>
> +#include <asm/mshyperv.h>
> +
> +#include "iommu.h"
> +#include "../iommu-pages.h"
> +
> +struct hv_iommu_dev *hv_iommu_device;
> +
> +/*
> + * Identity and blocking domains are static singletons: identity is
> a 1:1
> + * passthrough with no page table, blocking rejects all DMA. Neither
> holds
> + * per-IOMMU state, so one instance suffices even with multiple
> vIOMMUs.
> + */
> +static struct hv_iommu_domain hv_identity_domain;
> +static struct hv_iommu_domain hv_blocking_domain;
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
> +static struct iommu_ops hv_iommu_ops;
> +static LIST_HEAD(hv_iommu_pci_bus_list);
> +static DEFINE_SPINLOCK(hv_iommu_pci_bus_lock);
> +
> +#define hv_iommu_present(iommu_cap) (iommu_cap &
> HV_IOMMU_CAP_PRESENT) +#define
> hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap &
> HV_IOMMU_CAP_S1_5LVL) +#define hv_iommu_ats_supported(iommu_cap)
> (iommu_cap & HV_IOMMU_CAP_ATS) + +int hv_iommu_register_pci_bus(int
> pci_domain_nr, u32 logical_dev_id_prefix) +{
> +	struct hv_pci_busdata *bus, *new;
> +	int ret = 0;
> +
> +	if (no_iommu || !iommu_detected)
> +		return 0;
> +
> +	new = kzalloc_obj(*new, GFP_KERNEL);
> +	if (!new)
> +		return -ENOMEM;
> +
> +	spin_lock(&hv_iommu_pci_bus_lock);
> +	list_for_each_entry(bus, &hv_iommu_pci_bus_list, list) {
> +		if (bus->pci_domain_nr != pci_domain_nr)
> +			continue;
> +
> +		if (bus->logical_dev_id_prefix !=
> logical_dev_id_prefix) {
> +			pr_err("stale registration for PCI domain %d
> (old prefix 0x%08x, new 0x%08x)\n",
> +			       pci_domain_nr,
> bus->logical_dev_id_prefix,
> +			       logical_dev_id_prefix);
> +			ret = -EEXIST;
> +		}
> +
> +		goto out_free;
> +	}
> +
> +	new->pci_domain_nr = pci_domain_nr;
> +	new->logical_dev_id_prefix = logical_dev_id_prefix;
> +	list_add(&new->list, &hv_iommu_pci_bus_list);
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +	return 0;
> +
> +out_free:
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +	kfree(new);
> +	return ret;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(hv_iommu_register_pci_bus, "pci-hyperv");
> +
> +void hv_iommu_unregister_pci_bus(int pci_domain_nr)
> +{
> +	struct hv_pci_busdata *bus, *tmp;
> +
> +	spin_lock(&hv_iommu_pci_bus_lock);
> +	list_for_each_entry_safe(bus, tmp, &hv_iommu_pci_bus_list,
> list) {
> +		if (bus->pci_domain_nr == pci_domain_nr) {
> +			list_del(&bus->list);
> +			kfree(bus);
> +			break;
> +		}
> +	}
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +}
> +EXPORT_SYMBOL_FOR_MODULES(hv_iommu_unregister_pci_bus, "pci-hyperv");
> +
> +/*
> + * Look up the logical device ID for a vPCI device. Returns 0 on
> success
> + * with *logical_id filled in; -ENODEV if no entry registered for
> this
> + * device's vPCI bus.
> + */
> +static int hv_iommu_lookup_logical_dev_id(struct pci_dev *pdev, u64
> *logical_id) +{
> +	struct hv_pci_busdata *bus;
> +	int domain = pci_domain_nr(pdev->bus);
> +	int ret = -ENODEV;
> +
> +	spin_lock(&hv_iommu_pci_bus_lock);
this is called under local_irq_save, should use  raw_spinlock_t for RT
kernel?

> +	list_for_each_entry(bus, &hv_iommu_pci_bus_list, list) {
> +		if (bus->pci_domain_nr == domain) {
> +			*logical_id =
> (u64)bus->logical_dev_id_prefix |
> +				      PCI_FUNC(pdev->devfn);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +	return ret;
> +}
> +
> +static int hv_create_device_domain(struct hv_iommu_domain
> *hv_domain, u32 domain_stage) +{
> +	int ret;
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_create_device_domain *input;
> +
> +	ret = ida_alloc_range(&hv_iommu_device->domain_ids,
> +			hv_iommu_device->first_domain,
> hv_iommu_device->last_domain,
> +			GFP_KERNEL);
> +	if (ret < 0)
> +		return ret;
> +
> +	hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	hv_domain->device_domain.domain_id.type = domain_stage;
> +	hv_domain->device_domain.domain_id.id = ret;
> +	hv_domain->hv_iommu = hv_iommu_device;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->create_device_domain_flags.forward_progress_required
> = 1;
> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("HVCALL_CREATE_DEVICE_DOMAIN failed, status
> %lld\n", status);
> +		ida_free(&hv_iommu_device->domain_ids,
> hv_domain->device_domain.domain_id.id);
> +	}
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void hv_delete_device_domain(struct hv_iommu_domain
> *hv_domain) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_delete_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_DELETE_DEVICE_DOMAIN failed, status
> %lld\n", status); +
> +	ida_free(&hv_domain->hv_iommu->domain_ids,
> hv_domain->device_domain.domain_id.id); +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +		return true;
> +	case IOMMU_CAP_DEFERRED_FLUSH:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status
> %lld\n", status); +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct
> device *dev) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +	struct pci_dev *pdev;
> +	struct hv_iommu_domain *hv_domain =
> to_hv_iommu_domain(domain);
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	dev_dbg(dev, "detaching from domain %d\n",
> hv_domain->device_domain.domain_id.id); +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	if (hv_iommu_lookup_logical_dev_id(pdev,
> &input->device_id.as_uint64)) {
> +		local_irq_restore(flags);
> +		dev_warn(&pdev->dev, "no IOMMU registration for vPCI
> bus on detach\n");
> +		return;
> +	}
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_DETACH_DEVICE_DOMAIN failed, status
> %lld\n", status); +
> +	hv_flush_device_domain(hv_domain);
> +
> +	vdev->hv_domain = NULL;
> +}
> +
> +static int hv_iommu_attach_dev(struct iommu_domain *domain, struct
> device *dev,
> +			       struct iommu_domain *old)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct pci_dev *pdev;
> +	struct hv_input_attach_device_domain *input;
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct hv_iommu_domain *hv_domain =
> to_hv_iommu_domain(domain);
> +	int ret;
> +
> +	/* Only allow PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return -EINVAL;
> +
> +	if (vdev->hv_domain == hv_domain)
> +		return 0;
> +
> +	if (vdev->hv_domain)
> +		hv_iommu_detach_dev(&vdev->hv_domain->domain, dev);
> +
> +	pdev = to_pci_dev(dev);
> +	dev_dbg(dev, "attaching to domain %d\n",
> +		hv_domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	ret = hv_iommu_lookup_logical_dev_id(pdev,
> &input->device_id.as_uint64);
> +	if (ret) {
> +		local_irq_restore(flags);
> +		dev_err(&pdev->dev, "no IOMMU registration for vPCI
> bus\n");
> +		return ret;
> +	}
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_ATTACH_DEVICE_DOMAIN failed, status
> %lld\n", status);
> +	else
> +		vdev->hv_domain = hv_domain;
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int hv_iommu_get_logical_device_property(struct device *dev,
> +					u32 code,
> +					struct
> hv_output_get_logical_device_property *property) +{
> +	u64 status, lid;
> +	unsigned long flags;
> +	int ret;
> +	struct hv_input_get_logical_device_property *input;
> +	struct hv_output_get_logical_device_property *output;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_input_arg) +
> sizeof(*input);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	ret = hv_iommu_lookup_logical_dev_id(to_pci_dev(dev), &lid);
> +	if (ret) {
> +		local_irq_restore(flags);
> +		return ret;
> +	}
> +	input->logical_device_id = lid;
> +	input->code = code;
> +	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY,
> input, output);
> +	*property = *output;
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed,
> status %lld\n", status); +
> +	return hv_result_to_errno(status);
> +}
> +
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	struct pci_dev *pdev;
> +	struct hv_iommu_endpoint *vdev;
> +	struct hv_output_get_logical_device_property
> device_iommu_property = {0}; +
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	pdev = to_pci_dev(dev);
> +
> +	if (hv_iommu_get_logical_device_property(dev,
> +
> HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
> +
> &device_iommu_property) ||
> +	    !(device_iommu_property.device_iommu &
> HV_DEVICE_IOMMU_ENABLED))
> +		return ERR_PTR(-ENODEV);
> +
> +	vdev = kzalloc_obj(*vdev, GFP_KERNEL);
> +	if (!vdev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	vdev->dev = dev;
> +	vdev->hv_iommu = hv_iommu_device;
> +	dev_iommu_priv_set(dev, vdev);
> +
> +	if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
> +	    pci_ats_supported(pdev))
> +		pci_enable_ats(pdev,
> __ffs(hv_iommu_device->pgsize_bitmap)); +
> +	return &vdev->hv_iommu->iommu;
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (pdev->ats_enabled)
> +		pci_disable_ats(pdev);
> +
> +	dev_iommu_priv_set(dev, NULL);
> +	set_dma_ops(dev, NULL);
I don't think this is necessary.

> +
> +	kfree(vdev);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +	else
> +		return generic_device_group(dev);
non pci device already rejected during attach, maybe we should warn
here?
        WARN_ON_ONCE(1);
        return generic_device_group(dev);

> +}
> +
> +static int hv_configure_device_domain(struct hv_iommu_domain
> *hv_domain, u32 domain_type) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct pt_iommu_x86_64_hw_info pt_info;
> +	struct hv_input_configure_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->settings.flags.blocked = (domain_type ==
> IOMMU_DOMAIN_BLOCKED);
> +	input->settings.flags.translation_enabled = (domain_type !=
> IOMMU_DOMAIN_IDENTITY); +
Should this be:
input->settings.flags.translation_enabled =
     (domain_type & __IOMMU_DOMAIN_PAGING);
Otherwise, blocked domain will have translation enabled. Maybe add some
explanation of what HV expects.

> +	if (domain_type & __IOMMU_DOMAIN_PAGING) {
> +		pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64,
> &pt_info);
> +		input->settings.page_table_root = pt_info.gcr3_pt;
> +		input->settings.flags.first_stage_paging_mode =
> +			pt_info.levels == 5;
> +	}
> +	status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN,
> input, NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_CONFIGURE_DEVICE_DOMAIN failed,
> status %lld\n", status); +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int __init hv_initialize_static_domains(void)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +
> +	/* Default stage-1 identity domain */
> +	hv_domain = &hv_identity_domain;
> +
> +	ret = hv_create_device_domain(hv_domain,
> HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		return ret;
> +
> +	ret = hv_configure_device_domain(hv_domain,
> IOMMU_DOMAIN_IDENTITY);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> +	hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap =
> hv_iommu_device->pgsize_bitmap; +
> +	/* Default stage-1 blocked domain */
> +	hv_domain = &hv_blocking_domain;
> +
> +	ret = hv_create_device_domain(hv_domain,
> HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	ret = hv_configure_device_domain(hv_domain,
> IOMMU_DOMAIN_BLOCKED);
> +	if (ret)
> +		goto delete_blocked_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> +	hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap =
> hv_iommu_device->pgsize_bitmap; +
> +	return 0;
> +
> +delete_blocked_domain:
> +	hv_delete_device_domain(&hv_blocking_domain);
> +delete_identity_domain:
> +	hv_delete_device_domain(&hv_identity_domain);
> +	return ret;
> +}
> +
> +#define INTERRUPT_RANGE_START	(0xfee00000)
> +#define INTERRUPT_RANGE_END	(0xfeefffff)
> +static void hv_iommu_get_resv_regions(struct device *dev,
> +		struct list_head *head)
> +{
> +	struct iommu_resv_region *region;
> +
> +	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> +				      INTERRUPT_RANGE_END -
> INTERRUPT_RANGE_START + 1,
> +				      0, IOMMU_RESV_MSI, GFP_KERNEL);
> +	if (!region)
> +		return;
> +
> +	list_add_tail(&region->list, head);
> +}
> +
> +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +}
> +
> +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> +				struct iommu_iotlb_gather
> *iotlb_gather) +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +
> +	iommu_put_pages_list(&iotlb_gather->freelist);
> +}
> +
> +static void hv_iommu_paging_domain_free(struct iommu_domain *domain)
> +{
> +	struct hv_iommu_domain *hv_domain =
> to_hv_iommu_domain(domain); +
> +	/* Free all remaining mappings */
> +	pt_iommu_deinit(&hv_domain->pt_iommu);
> +
> +	hv_delete_device_domain(hv_domain);
> +
> +	kfree(hv_domain);
> +}
> +
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +	IOMMU_PT_DOMAIN_OPS(x86_64),
> +	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
> +	.iotlb_sync = hv_iommu_iotlb_sync,
> +	.free = hv_iommu_paging_domain_free,
> +};
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
> device *dev) +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +	struct pt_iommu_x86_64_cfg cfg = {};
> +
> +	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> +	if (!hv_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = hv_create_device_domain(hv_domain,
> HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret) {
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->pt_iommu.nid = dev_to_node(dev);
> +
> +	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
> +	cfg.common.hw_max_oasz_lg2 = 52;
> +	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 :
> 3; +
> +	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64,
> &cfg, GFP_KERNEL);
> +	if (ret) {
> +		hv_delete_device_domain(hv_domain);
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	/* Constrain to page sizes the hypervisor supports */
> +	hv_domain->domain.pgsize_bitmap &=
> hv_iommu_device->pgsize_bitmap; +
> +	hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
> +
> +	ret = hv_configure_device_domain(hv_domain,
> __IOMMU_DOMAIN_PAGING);
> +	if (ret) {
> +		pt_iommu_deinit(&hv_domain->pt_iommu);
> +		hv_delete_device_domain(hv_domain);
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return &hv_domain->domain;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> +	.capable		  = hv_iommu_capable,
> +	.domain_alloc_paging	  = hv_iommu_domain_alloc_paging,
> +	.probe_device		  = hv_iommu_probe_device,
> +	.release_device		  = hv_iommu_release_device,
> +	.device_group		  = hv_iommu_device_group,
> +	.get_resv_regions	  = hv_iommu_get_resv_regions,
> +	.owner			  = THIS_MODULE,
> +	.identity_domain	  = &hv_identity_domain.domain,
> +	.blocked_domain		  =
> &hv_blocking_domain.domain,
> +	.release_domain		  =
> &hv_blocking_domain.domain, +};
> +
> +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities
> *hv_iommu_cap) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_get_iommu_capabilities *input;
> +	struct hv_output_get_iommu_capabilities *output;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_input_arg) +
> sizeof(*input);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES,
> input, output);
> +	*hv_iommu_cap = *output;
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed, status
> %lld\n", status); +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void __init hv_init_iommu_device(struct hv_iommu_dev
> *hv_iommu,
> +			struct hv_output_get_iommu_capabilities
> *hv_iommu_cap) +{
> +	ida_init(&hv_iommu->domain_ids);
> +
> +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> +	    hv_iommu->max_iova_width > 48) {
> +		pr_info("5-level paging not supported, limiting iova
> width to 48.\n");
> +		hv_iommu->max_iova_width = 48;
> +	}
> +
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) <<
> hv_iommu->max_iova_width) - 1,
> +		.force_aperture = true,
> +	};
> +
> +	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> +	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> +	/* Only x86 page sizes (4K/2M/1G) are supported */
> +	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap &
> +				  (SZ_4K | SZ_2M | SZ_1G);
> +	if (hv_iommu->pgsize_bitmap != hv_iommu_cap->pgsize_bitmap)
> +		pr_warn("unsupported page sizes masked: 0x%llx ->
> 0x%llx\n",
> +			hv_iommu_cap->pgsize_bitmap,
> hv_iommu->pgsize_bitmap);
> +	if (!hv_iommu->pgsize_bitmap) {
> +		pr_warn("no supported page sizes, defaulting to
> 4K\n");
> +		hv_iommu->pgsize_bitmap = SZ_4K;
> +	}
> +	hv_iommu_device = hv_iommu;
> +}
> +
> +int __init hv_iommu_init(void)
> +{
> +	int ret = 0;
> +	struct hv_iommu_dev *hv_iommu = NULL;
> +	struct hv_output_get_iommu_capabilities hv_iommu_cap = {0};
> +
> +	if (no_iommu || iommu_detected)
> +		return -ENODEV;
> +
> +	if (!hv_is_hyperv_initialized())
> +		return -ENODEV;
> +
> +	ret = hv_iommu_detect(&hv_iommu_cap);
> +	if (ret) {
> +		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed: %d\n",
> ret);
> +		return -ENODEV;
> +	}
> +
> +	if (!hv_iommu_present(hv_iommu_cap.iommu_cap) ||
> +	    !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) {
> +		pr_err("IOMMU capabilities not sufficient:
> cap=0x%llx\n",
> +		       hv_iommu_cap.iommu_cap);
> +		return -ENODEV;
> +	}
> +
> +	iommu_detected = 1;
> +	pci_request_acs();
> +
> +	hv_iommu = kzalloc_obj(*hv_iommu, GFP_KERNEL);
> +	if (!hv_iommu)
> +		return -ENOMEM;
> +
> +	hv_init_iommu_device(hv_iommu, &hv_iommu_cap);
> +
> +	ret = hv_initialize_static_domains();
> +	if (ret) {
> +		pr_err("static domains init failed: %d\n", ret);
> +		goto err_free;
> +	}
> +
> +	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL,
> "%s", "hv-iommu");
> +	if (ret) {
> +		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
> +		goto err_delete_static_domains;
> +	}
> +
> +	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops,
> NULL);
> +	if (ret) {
> +		pr_err("iommu_device_register failed: %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	pr_info("successfully initialized\n");
> +	return 0;
> +
> +err_sysfs_remove:
> +	iommu_device_sysfs_remove(&hv_iommu->iommu);
> +err_delete_static_domains:
> +	hv_delete_device_domain(&hv_blocking_domain);
> +	hv_delete_device_domain(&hv_identity_domain);
> +err_free:
> +	kfree(hv_iommu);
> +	return ret;
> +}
> diff --git a/drivers/iommu/hyperv/iommu.h
> b/drivers/iommu/hyperv/iommu.h new file mode 100644
> index 000000000000..43f20d371245
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2024-2025, Microsoft, Inc.
> + *
> + */
> +
> +#ifndef _HYPERV_IOMMU_H
> +#define _HYPERV_IOMMU_H
> +
> +struct hv_iommu_dev {
> +	struct iommu_device iommu;
> +	struct ida domain_ids;
> +
> +	/* Device configuration */
> +	u8  max_iova_width;
> +	u8  max_pasid_width;
> +	u64 cap;
> +	u64 pgsize_bitmap;
> +
> +	struct iommu_domain_geometry geometry;
> +	u64 first_domain;
> +	u64 last_domain;
> +};
> +
> +struct hv_iommu_domain {
> +	union {
> +		struct iommu_domain    domain;
> +		struct pt_iommu        pt_iommu;
> +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> +	};
> +	struct hv_iommu_dev *hv_iommu;
> +	struct hv_input_device_domain device_domain;
> +	u64		pgsize_bitmap;
> +};
> +
> +struct hv_pci_busdata {
> +	int               pci_domain_nr;
> +	u32               logical_dev_id_prefix;
> +	struct list_head  list;
> +};
> +
> +struct hv_iommu_endpoint {
> +	struct device *dev;
> +	struct hv_iommu_dev *hv_iommu;
> +	struct hv_iommu_domain *hv_domain;
> +};
> +
> +#define to_hv_iommu_domain(d) \
> +	container_of(d, struct hv_iommu_domain, domain)
> +
> +#endif /* _HYPERV_IOMMU_H */
> diff --git a/drivers/pci/controller/pci-hyperv.c
> b/drivers/pci/controller/pci-hyperv.c index
> cfc8fa403dad..a4af9c8c2220 100644 ---
> a/drivers/pci/controller/pci-hyperv.c +++
> b/drivers/pci/controller/pci-hyperv.c @@ -3715,6 +3715,7 @@ static
> int hv_pci_probe(struct hv_device *hdev, struct hv_pcibus_device
> *hbus; int ret, dom;
>  	u16 dom_req;
> +	u32 prefix;
>  	char *name;
>  
>  	bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
> @@ -3857,13 +3858,25 @@ static int hv_pci_probe(struct hv_device
> *hdev, 
>  	hbus->state = hv_pcibus_probed;
>  
> -	ret = create_root_hv_pci_bus(hbus);
> +	/* Notify pvIOMMU before any device on the bus is scanned. */
> +	prefix = (hdev->dev_instance.b[5] << 24) |
> +		 (hdev->dev_instance.b[4] << 16) |
> +		 (hdev->dev_instance.b[7] <<  8) |
> +		 (hdev->dev_instance.b[6] & 0xf8);
> +
> +	ret = hv_iommu_register_pci_bus(dom, prefix);
>  	if (ret)
>  		goto free_windows;
>  
> +	ret = create_root_hv_pci_bus(hbus);
> +	if (ret)
> +		goto unregister_pviommu;
> +
>  	mutex_unlock(&hbus->state_lock);
>  	return 0;
>  
> +unregister_pviommu:
> +	hv_iommu_unregister_pci_bus(dom);
>  free_windows:
>  	hv_pci_free_bridge_windows(hbus);
>  exit_d0:
> @@ -3974,8 +3987,10 @@ static int hv_pci_bus_exit(struct hv_device
> *hdev, bool keep_devs) static void hv_pci_remove(struct hv_device
> *hdev) {
>  	struct hv_pcibus_device *hbus;
> +	int dom;
>  
>  	hbus = hv_get_drvdata(hdev);
> +	dom = hbus->bridge->domain_nr;
>  	if (hbus->state == hv_pcibus_installed) {
>  		tasklet_disable(&hdev->channel->callback_event);
>  		hbus->state = hv_pcibus_removing;
> @@ -3994,6 +4009,8 @@ static void hv_pci_remove(struct hv_device
> *hdev) hv_pci_remove_slots(hbus);
>  		pci_remove_root_bus(hbus->bridge->bus);
>  		pci_unlock_rescan_remove();
> +
> +		hv_iommu_unregister_pci_bus(dom);
>  	}
>  
>  	hv_pci_bus_exit(hdev, false);
> diff --git a/include/asm-generic/mshyperv.h
> b/include/asm-generic/mshyperv.h index bf601d67cecb..b71345c74568
> 100644 --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -73,6 +73,18 @@ extern enum hv_partition_type
> hv_curr_partition_type; extern void * __percpu *hyperv_pcpu_input_arg;
>  extern void * __percpu *hyperv_pcpu_output_arg;
>  
> +#ifdef CONFIG_HYPERV_PVIOMMU
> +int  hv_iommu_register_pci_bus(int pci_domain_nr, u32
> logical_dev_id_prefix); +void hv_iommu_unregister_pci_bus(int
> pci_domain_nr); +#else
> +static inline int hv_iommu_register_pci_bus(int pci_domain_nr,
> +					    u32
> logical_dev_id_prefix) +{
> +	return 0;
> +}
> +static inline void hv_iommu_unregister_pci_bus(int pci_domain_nr) { }
> +#endif
> +
>  u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>  u64 hv_do_fast_hypercall8(u16 control, u64 input8);
>  u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);


^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH net-next] net: mana: Add handler for sriov configure
From: Leon Romanovsky @ 2026-05-13 18:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Haiyang Zhang, Haiyang Zhang, Paul Rosswurm,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	KY Srinivasan, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Bjorn Helgaas, Simon Horman, Shradha Gupta, Dipayaan Roy,
	Erni Sri Satya Vennela, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org
In-Reply-To: <20260508231029.GA44712@bhelgaas>

On Fri, May 08, 2026 at 06:10:29PM -0500, Bjorn Helgaas wrote:
> On Fri, May 08, 2026 at 10:47:14PM +0000, Haiyang Zhang wrote:
> > > -----Original Message-----
> > > From: Bjorn Helgaas <helgaas@kernel.org>
> > > Sent: Friday, May 8, 2026 6:38 PM
> > > To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> > > Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> > > <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> > > <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> > > <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> > > Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> > > Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Bjorn Helgaas
> > > <bhelgaas@google.com>; Simon Horman <horms@kernel.org>; Shradha Gupta
> > > <shradhagupta@linux.microsoft.com>; Dipayaan Roy
> > > <dipayanroy@linux.microsoft.com>; Erni Sri Satya Vennela
> > > <ernis@linux.microsoft.com>; linux-kernel@vger.kernel.org; linux-
> > > pci@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> > > Subject: [EXTERNAL] Re: [PATCH net-next] net: mana: Add handler for sriov
> > > configure
> > > 
> > > On Fri, May 08, 2026 at 03:04:06PM -0700, Haiyang Zhang wrote:
> > > > From: Haiyang Zhang <haiyangz@microsoft.com>
> > > >
> > > > Add callback function for the pci_driver, sriov_configure.
> > > >
> > > > Also disable VF autoprobe when it runs as PF driver on bare metal,
> > > > since the hardware side may not have the VF ready immediately.
> > > >
> > > > Export pci_vf_drivers_autoprobe() so the driver can toggle the VF
> > > > autoprobe flag.
> > > 
> > > Technically pci_vf_drivers_autoprobe() doesn't *toggle* the autoprobe
> > > flag.  That would mean setting it to the opposite of its current
> > > value.
> > > 
> > > Here I would say "so the driver can prevent autoprobing of the VFs",
> > > which is the intent.
> > Thanks, I will change the wording.
> > 
> > > 
> > > Out of curiosity, how do the VFs eventually get probed?  I guess
> > > there's some other mechanism that tells you when they're ready, and
> > > you manually use sysfs 'sriov_drivers_autoprobe' to enable probing,
> > > then bind drivers to them via sysfs?
> > We have a user program talking to the Azure backplane to get that information.
> > @Paul Rosswurm, do you have more details?
> > 
> > 
> > > The prevention of autoprobing sounds like a critical part of this
> > > change; might be worth saying something in the subject, because "add
> > > sriov configure" doesn't include much information.
> > How about "Add handler for sriov configure with VF autoprobe off"?
> 
> OK by me :)

Bjorn,

I believe it is the wrong decision to allow toggling a user‑visible knob  
without the user’s awareness. In this case, they can either disable  
autoprobe on the PF or rely on EPROBE_DEFER. In all cases, the same
functionality can be achieved without changing PCI autoprobe code.

Thanks.

> 

^ permalink raw reply

* [PATCH v3 00/11] mshv: Refactor memory region management and map pages at creation
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel

This series refactors the mshv memory region subsystem in preparation
for mapping populated pages into the hypervisor at movable region
creation time, rather than relying solely on demand faulting.

The primary motivation is to ensure that when userspace passes a
pre-populated mapping for a movable memory region, those pages are
immediately visible to the hypervisor. Previously, all movable regions
were created with HV_MAP_GPA_NO_ACCESS on every page regardless of
whether the backing pages were already present, deferring all mapping
to the fault handler. This added unnecessary fault overhead and
complicated the initial setup of child partitions with pre-populated
memory.

v3:
 - Fixed a few issues.
 - Reworked the code flow to simplify the logic.

v2:
 - Rebased on top of latest mainline, simplified the check for valid PFNs,
   added other minor cleanups and improvements.

---

Stanislav Kinsburskii (11):
      mshv: Don't reject huge-page stride on unaligned region length
      mshv: Don't request HMM write fault for read-only regions
      mshv: Convert region storage from page pointers to PFNs
      mshv: Refactor region segmentation into a dedicated helper
      mshv: Support address range holes in remapping
      mshv: Iterate VMAs when faulting in region pages
      mshv: Scale fault granularity for non-4 KiB host pages
      mshv: Move pinned region setup to mshv_regions.c
      mshv: Map populated pages on movable region creation
      mshv: Extract MMIO region mapping into separate function
      mshv: Add tracepoint for map GPA hypercall


 drivers/hv/mshv_regions.c      |  636 ++++++++++++++++++++++++++++------------
 drivers/hv/mshv_root.h         |   30 +-
 drivers/hv/mshv_root_hv_call.c |   52 ++-
 drivers/hv/mshv_root_main.c    |  105 +------
 drivers/hv/mshv_trace.h        |   36 ++
 5 files changed, 530 insertions(+), 329 deletions(-)


^ permalink raw reply

* [PATCH v3 01/11] mshv: Don't reject huge-page stride on unaligned region length
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_chunk_stride() rejected huge-page stride whenever pfn_count was
not a multiple of PTRS_PER_PMD, even though the same-stride scan in
mshv_region_process_pfns() already handles the transition between
huge-page and 4K stride for the tail of a run. As a result, a region
backed by a 2 MiB folio with a length that wasn't a multiple of 512
PFNs was processed entirely at 4K stride, issuing one hypercall per
PFN instead of one per 2 MiB.

Reject huge-page stride only when fewer than PTRS_PER_PMD PFNs are
available from this point.  The starting GFN alignment requirement is
unchanged.  The same-stride scan continues to handle the tail
correctly.

Fixes: 259add0d982cb ("mshv: Align huge page stride with guest mapping")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..81e57f727be35 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -43,7 +43,7 @@ static int mshv_chunk_stride(struct page *page,
 	 */
 	if (!PageCompound(page) || !PageHead(page) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
-	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
+	    page_count < PTRS_PER_PMD)
 		return 1;
 
 	page_order = folio_order(page_folio(page));



^ permalink raw reply related

* [PATCH v3 02/11] mshv: Don't request HMM write fault for read-only regions
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_range_fault() unconditionally sets HMM_PFN_REQ_WRITE on
the hmm_range_fault() request.  When the region was created without
MSHV_SET_MEM_BIT_WRITABLE (so region->hv_map_flags has no
HV_MAP_GPA_WRITABLE), the request still asks HMM for writable pages.
On read-only mappings this causes hmm_range_fault() to break
copy-on-write — for example, the shared zero page or file-backed
pages — granting the guest a private writable copy of memory that
host policy intended to keep shared.

Gate HMM_PFN_REQ_WRITE on the region's HV_MAP_GPA_WRITABLE bit so
that read-only regions request read-only faults.

Note: this still asks for write on writable regions even if the
backing VMA is read-only.  A more thorough check would also consult
each VMA's vm_flags inside the fault loop; that requires iterating
VMAs and is left for a follow-up.

Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 81e57f727be35..d9e1fbfefe714 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -441,7 +441,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
-		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
+		.default_flags = HMM_PFN_REQ_FAULT,
 	};
 	unsigned long *pfns;
 	int ret;
@@ -455,6 +455,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
 	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
 
+	/*
+	 * Only request writable pages from HMM when the region itself
+	 * permits writes.  Without this, hmm_range_fault() would
+	 * trigger COW on read-only regions, breaking copy-on-write
+	 * semantics on shared host pages.
+	 */
+	if (region->hv_map_flags & HV_MAP_GPA_WRITABLE)
+		range.default_flags |= HMM_PFN_REQ_WRITE;
+
 	do {
 		ret = mshv_region_hmm_fault_and_lock(region, &range);
 	} while (ret == -EBUSY);



^ permalink raw reply related

* [PATCH v3 03/11] mshv: Convert region storage from page pointers to PFNs
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The HMM interface returns PFNs from hmm_range_fault(), and the
hypervisor hypercalls in this driver consume PFNs.  Storing struct
page pointers in mshv_mem_region between those two interfaces forces
a page_to_pfn() conversion at every hypercall site and a
hmm_pfn_to_page() conversion at the HMM fault site, neither of
which needs the struct page form.

Switch mreg_pages[] to mreg_pfns[] and propagate through
hv_call_map_ram_pfns(), hv_call_map_mmio_pfns(), hv_call_unmap_pfns(),
and hv_call_modify_spa_host_access(). Per-PFN conversions at those
sites go away; the HMM fault path writes hmm_range_fault() output
straight into mreg_pfns[] after stripping HMM_PFN_FLAGS. Convert
back to struct page only where the API requires it (e.g.
unpin_user_page() during region invalidation).

Use ULONG_MAX as the invalid-PFN sentinel (MSHV_INVALID_PFN); 0 is a
valid PFN on some architectures. mshv_region_init_pfns() centralises
the "mark slots empty" pattern so callers agree on the sentinel.

Trade-off: the bulk unpin_user_pages() call is replaced with a
per-PFN unpin_user_page() loop, losing per-folio coalescing for
huge-folio-backed pinned regions. The cost is paid only on region
teardown and MMU-notifier invalidation (it it will ever be added for pinned
regions).

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c      |  325 +++++++++++++++++++++++-----------------
 drivers/hv/mshv_root.h         |   21 +--
 drivers/hv/mshv_root_hv_call.c |   49 +++---
 drivers/hv/mshv_root_main.c    |   33 ++--
 4 files changed, 239 insertions(+), 189 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index d9e1fbfefe714..77fc94733cb20 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -18,12 +18,32 @@
 #include "mshv_root.h"
 
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
+#define MSHV_INVALID_PFN				ULONG_MAX
+
+static inline bool mshv_pfn_valid(unsigned long pfn)
+{
+	return pfn != MSHV_INVALID_PFN;
+}
+
+static void mshv_region_init_pfns_range(struct mshv_mem_region *region,
+					u64 pfn_offset, u64 pfn_count)
+{
+	u64 i;
+
+	for (i = pfn_offset; i < pfn_offset + pfn_count; i++)
+		region->mreg_pfns[i] = MSHV_INVALID_PFN;
+}
+
+void mshv_region_init_pfns(struct mshv_mem_region *region)
+{
+	mshv_region_init_pfns_range(region, 0, region->nr_pfns);
+}
 
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
- * @page      : The page to check for huge page backing
- * @gfn       : Guest frame number for the mapping
- * @page_count: Total number of pages in the mapping
+ * @page     : The page to check for huge page backing
+ * @gfn      : Guest frame number for the mapping
+ * @pfn_count: Total number of pages in the mapping
  *
  * Determines the appropriate stride (in pages) for mapping guest memory.
  * Uses huge page stride if the backing page is huge and the guest mapping
@@ -32,18 +52,19 @@
  * Return: Stride in pages, or -EINVAL if page order is unsupported.
  */
 static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 page_count)
+			     u64 gfn, u64 pfn_count)
 {
 	unsigned int page_order;
 
 	/*
 	 * Use single page stride by default. For huge page stride, the
 	 * page must be compound and point to the head of the compound
-	 * page, and both gfn and page_count must be huge-page aligned.
+	 * page, gfn must be huge-page aligned and pfn_count must be at
+	 * least the number of pages in a huge page.
 	 */
 	if (!PageCompound(page) || !PageHead(page) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
-	    page_count < PTRS_PER_PMD)
+	    pfn_count < PTRS_PER_PMD)
 		return 1;
 
 	page_order = folio_order(page_folio(page));
@@ -57,60 +78,61 @@ static int mshv_chunk_stride(struct page *page,
 /**
  * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
  *                             in a region.
- * @region     : Pointer to the memory region structure.
- * @flags      : Flags to pass to the handler.
- * @page_offset: Offset into the region's pages array to start processing.
- * @page_count : Number of pages to process.
- * @handler    : Callback function to handle the chunk.
+ * @region    : Pointer to the memory region structure.
+ * @flags     : Flags to pass to the handler.
+ * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_count : Number of PFNs to process.
+ * @handler   : Callback function to handle the chunk.
  *
- * This function scans the region's pages starting from @page_offset,
- * checking for contiguous present pages of the same size (normal or huge).
- * It invokes @handler for the chunk of contiguous pages found. Returns the
- * number of pages handled, or a negative error code if the first page is
- * not present or the handler fails.
+ * This function scans the region's PFNs starting from @pfn_offset,
+ * checking for contiguous valid PFNs backed by pages of the same size
+ * (normal or huge). It invokes @handler for the chunk of contiguous valid
+ * PFNs found. Returns the number of PFNs handled, or a negative error code
+ * if the first PFN is invalid or the handler fails.
  *
- * Note: The @handler callback must be able to handle both normal and huge
- * pages.
+ * Note: The @handler callback must be able to handle valid PFNs backed by
+ * both normal and huge pages.
  *
  * Return: Number of pages handled, or negative error code.
  */
-static long mshv_region_process_chunk(struct mshv_mem_region *region,
-				      u32 flags,
-				      u64 page_offset, u64 page_count,
-				      int (*handler)(struct mshv_mem_region *region,
-						     u32 flags,
-						     u64 page_offset,
-						     u64 page_count,
-						     bool huge_page))
+static long mshv_region_process_pfns(struct mshv_mem_region *region,
+				     u32 flags,
+				     u64 pfn_offset, u64 pfn_count,
+				     int (*handler)(struct mshv_mem_region *region,
+						    u32 flags,
+						    u64 pfn_offset,
+						    u64 pfn_count,
+						    bool huge_page))
 {
-	u64 gfn = region->start_gfn + page_offset;
+	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count;
-	struct page *page;
+	unsigned long pfn;
 	int stride, ret;
 
-	page = region->mreg_pages[page_offset];
-	if (!page)
+	pfn = region->mreg_pfns[pfn_offset];
+	if (!mshv_pfn_valid(pfn))
 		return -EINVAL;
 
-	stride = mshv_chunk_stride(page, gfn, page_count);
+	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
 	if (stride < 0)
 		return stride;
 
 	/* Start at stride since the first stride is validated */
-	for (count = stride; count < page_count; count += stride) {
-		page = region->mreg_pages[page_offset + count];
+	for (count = stride; count < pfn_count ; count += stride) {
+		pfn = region->mreg_pfns[pfn_offset + count];
 
-		/* Break if current page is not present */
-		if (!page)
+		/* Break if current pfn is invalid */
+		if (!mshv_pfn_valid(pfn))
 			break;
 
 		/* Break if stride size changes */
-		if (stride != mshv_chunk_stride(page, gfn + count,
-						page_count - count))
+		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
+						gfn + count,
+						pfn_count - count))
 			break;
 	}
 
-	ret = handler(region, flags, page_offset, count, stride > 1);
+	ret = handler(region, flags, pfn_offset, count, stride > 1);
 	if (ret)
 		return ret;
 
@@ -118,70 +140,72 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 }
 
 /**
- * mshv_region_process_range - Processes a range of memory pages in a
- *                             region.
- * @region     : Pointer to the memory region structure.
- * @flags      : Flags to pass to the handler.
- * @page_offset: Offset into the region's pages array to start processing.
- * @page_count : Number of pages to process.
- * @handler    : Callback function to handle each chunk of contiguous
- *               pages.
+ * mshv_region_process_range - Processes a range of PFNs in a region.
+ * @region    : Pointer to the memory region structure.
+ * @flags     : Flags to pass to the handler.
+ * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_count : Number of PFNs to process.
+ * @handler   : Callback function to handle each chunk of contiguous
+ *              valid PFNs.
  *
- * Iterates over the specified range of pages in @region, skipping
- * non-present pages. For each contiguous chunk of present pages, invokes
- * @handler via mshv_region_process_chunk.
+ * Iterates over the specified range of PFNs in @region, skipping
+ * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
+ * @handler via mshv_region_process_pfns.
  *
- * Note: The @handler callback must be able to handle both normal and huge
- * pages.
+ * Note: The @handler callback must be able to handle PFNs backed by both
+ * normal and huge pages.
  *
  * Returns 0 on success, or a negative error code on failure.
  */
 static int mshv_region_process_range(struct mshv_mem_region *region,
 				     u32 flags,
-				     u64 page_offset, u64 page_count,
+				     u64 pfn_offset, u64 pfn_count,
 				     int (*handler)(struct mshv_mem_region *region,
 						    u32 flags,
-						    u64 page_offset,
-						    u64 page_count,
+						    u64 pfn_offset,
+						    u64 pfn_count,
 						    bool huge_page))
 {
+	u64 end;
 	long ret;
 
-	if (page_offset + page_count > region->nr_pages)
+	if (check_add_overflow(pfn_offset, pfn_count, &end))
+		return -EOVERFLOW;
+
+	if (end > region->nr_pfns)
 		return -EINVAL;
 
-	while (page_count) {
+	while (pfn_count) {
 		/* Skip non-present pages */
-		if (!region->mreg_pages[page_offset]) {
-			page_offset++;
-			page_count--;
+		if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset])) {
+			pfn_offset++;
+			pfn_count--;
 			continue;
 		}
 
-		ret = mshv_region_process_chunk(region, flags,
-						page_offset,
-						page_count,
-						handler);
+		ret = mshv_region_process_pfns(region, flags,
+					       pfn_offset, pfn_count,
+					       handler);
 		if (ret < 0)
 			return ret;
 
-		page_offset += ret;
-		page_count -= ret;
+		pfn_offset += ret;
+		pfn_count -= ret;
 	}
 
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
+struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
 					   u64 uaddr, u32 flags)
 {
 	struct mshv_mem_region *region;
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	region = vzalloc(struct_size(region, mreg_pfns, nr_pfns));
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
-	region->nr_pages = nr_pages;
+	region->nr_pfns = nr_pfns;
 	region->start_gfn = guest_pfn;
 	region->start_uaddr = uaddr;
 	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
@@ -190,6 +214,8 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
+	mshv_region_init_pfns(region);
+
 	kref_init(&region->mreg_refcount);
 
 	return region;
@@ -197,15 +223,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 
 static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pages + page_offset,
-					      page_count,
+					      region->mreg_pfns + pfn_offset,
+					      pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
 					      flags, true);
@@ -216,21 +242,21 @@ int mshv_region_share(struct mshv_mem_region *region)
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
 	return mshv_region_process_range(region, flags,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_share);
 }
 
 static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u32 flags,
-				     u64 page_offset, u64 page_count,
+				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pages + page_offset,
-					      page_count, 0,
+					      region->mreg_pfns + pfn_offset,
+					      pfn_count, 0,
 					      flags, false);
 }
 
@@ -239,30 +265,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
 	return mshv_region_process_range(region, flags,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_unshare);
 }
 
 static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MAP_GPA_LARGE_PAGE;
 
-	return hv_call_map_gpa_pages(region->partition->pt_id,
-				     region->start_gfn + page_offset,
-				     page_count, flags,
-				     region->mreg_pages + page_offset);
+	return hv_call_map_ram_pfns(region->partition->pt_id,
+				    region->start_gfn + pfn_offset,
+				    pfn_count, flags,
+				    region->mreg_pfns + pfn_offset);
 }
 
-static int mshv_region_remap_pages(struct mshv_mem_region *region,
-				   u32 map_flags,
-				   u64 page_offset, u64 page_count)
+static int mshv_region_remap_pfns(struct mshv_mem_region *region,
+				  u32 map_flags,
+				  u64 pfn_offset, u64 pfn_count)
 {
 	return mshv_region_process_range(region, map_flags,
-					 page_offset, page_count,
+					 pfn_offset, pfn_count,
 					 mshv_region_chunk_remap);
 }
 
@@ -270,38 +296,50 @@ int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
-	return mshv_region_remap_pages(region, map_flags,
-				       0, region->nr_pages);
+	return mshv_region_remap_pfns(region, map_flags,
+				      0, region->nr_pfns);
 }
 
-static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
-					 u64 page_offset, u64 page_count)
+static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
+					u64 pfn_offset, u64 pfn_count)
 {
-	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
-		unpin_user_pages(region->mreg_pages + page_offset, page_count);
+	u64 i;
+
+	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
+		if (!mshv_pfn_valid(region->mreg_pfns[i]))
+			continue;
 
-	memset(region->mreg_pages + page_offset, 0,
-	       page_count * sizeof(struct page *));
+		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
+	}
+
+	mshv_region_init_pfns_range(region, pfn_offset, pfn_count);
 }
 
 void mshv_region_invalidate(struct mshv_mem_region *region)
 {
-	mshv_region_invalidate_pages(region, 0, region->nr_pages);
+	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
 int mshv_region_pin(struct mshv_mem_region *region)
 {
-	u64 done_count, nr_pages;
+	u64 done_count, nr_pfns, i;
+	unsigned long *pfns;
 	struct page **pages;
 	__u64 userspace_addr;
 	int ret;
 
-	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
-		pages = region->mreg_pages + done_count;
+	pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
+			      sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
+		pfns = region->mreg_pfns + done_count;
 		userspace_addr = region->start_uaddr +
 				 done_count * HV_HYP_PAGE_SIZE;
-		nr_pages = min(region->nr_pages - done_count,
-			       MSHV_PIN_PAGES_BATCH_SIZE);
+		nr_pfns = min(region->nr_pfns - done_count,
+			      MSHV_PIN_PAGES_BATCH_SIZE);
 
 		/*
 		 * Pinning assuming 4k pages works for large pages too.
@@ -311,39 +349,50 @@ int mshv_region_pin(struct mshv_mem_region *region)
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+		ret = pin_user_pages_fast(userspace_addr, nr_pfns,
 					  FOLL_WRITE | FOLL_LONGTERM,
 					  pages);
-		if (ret != nr_pages)
+
+		for (i = 0; i < ret; i++)
+			pfns[i] = page_to_pfn(pages[i]);
+
+		/*
+		 * Demand all requested pages were successfully pinned
+		 * or fail otherwise.
+		 */
+		if (ret != nr_pfns)
 			goto release_pages;
+
 	}
 
+	kfree(pages);
 	return 0;
 
 release_pages:
 	if (ret > 0)
 		done_count += ret;
-	mshv_region_invalidate_pages(region, 0, done_count);
+	mshv_region_invalidate_pfns(region, 0, done_count);
+	kfree(pages);
 	return ret < 0 ? ret : -ENOMEM;
 }
 
 static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_UNMAP_GPA_LARGE_PAGE;
 
-	return hv_call_unmap_gpa_pages(region->partition->pt_id,
-				       region->start_gfn + page_offset,
-				       page_count, flags);
+	return hv_call_unmap_pfns(region->partition->pt_id,
+				  region->start_gfn + pfn_offset,
+				  pfn_count, flags);
 }
 
 static int mshv_region_unmap(struct mshv_mem_region *region)
 {
 	return mshv_region_process_range(region, 0,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_unmap);
 }
 
@@ -427,8 +476,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 /**
  * mshv_region_range_fault - Handle memory range faults for a given region.
  * @region: Pointer to the memory region structure.
- * @page_offset: Offset of the page within the region.
- * @page_count: Number of pages to handle.
+ * @pfn_offset: Offset of the page within the region.
+ * @pfn_count: Number of pages to handle.
  *
  * This function resolves memory faults for a specified range of pages
  * within a memory region. It uses HMM (Heterogeneous Memory Management)
@@ -437,7 +486,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
  * Return: 0 on success, negative error code on failure.
  */
 static int mshv_region_range_fault(struct mshv_mem_region *region,
-				   u64 page_offset, u64 page_count)
+				   u64 pfn_offset, u64 pfn_count)
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
@@ -447,13 +496,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	int ret;
 	u64 i;
 
-	pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
+	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
 	if (!pfns)
 		return -ENOMEM;
 
 	range.hmm_pfns = pfns;
-	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
-	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
 
 	/*
 	 * Only request writable pages from HMM when the region itself
@@ -471,11 +520,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	if (ret)
 		goto out;
 
-	for (i = 0; i < page_count; i++)
-		region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+	for (i = 0; i < pfn_count; i++) {
+		if (!(pfns[i] & HMM_PFN_VALID))
+			continue;
+		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
+		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
+	}
 
-	ret = mshv_region_remap_pages(region, region->hv_map_flags,
-				      page_offset, page_count);
+	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
+				     pfn_offset, pfn_count);
 
 	mutex_unlock(&region->mreg_mutex);
 out:
@@ -485,24 +538,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
 {
-	u64 page_offset, page_count;
+	u64 pfn_offset, pfn_count;
 	int ret;
 
 	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
-	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
-				 MSHV_MAP_FAULT_IN_PAGES);
+	pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				MSHV_MAP_FAULT_IN_PAGES);
 
 	/* Map more pages than requested to reduce the number of faults. */
-	page_count = min(region->nr_pages - page_offset,
-			 MSHV_MAP_FAULT_IN_PAGES);
+	pfn_count = min(region->nr_pfns - pfn_offset,
+			MSHV_MAP_FAULT_IN_PAGES);
 
-	ret = mshv_region_range_fault(region, page_offset, page_count);
+	ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
 
 	WARN_ONCE(ret,
-		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
 		  region->partition->pt_id, region->start_uaddr,
-		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
-		  gfn, page_offset, page_count);
+		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
+		  gfn, pfn_offset, pfn_count);
 
 	return !ret;
 }
@@ -532,16 +585,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	struct mshv_mem_region *region = container_of(mni,
 						      struct mshv_mem_region,
 						      mreg_mni);
-	u64 page_offset, page_count;
+	u64 pfn_offset, pfn_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
 
 	mstart = max(range->start, region->start_uaddr);
 	mend = min(range->end, region->start_uaddr +
-		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+		   (region->nr_pfns << HV_HYP_PAGE_SHIFT));
 
-	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
-	page_count = HVPFN_DOWN(mend - mstart);
+	pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	pfn_count = HVPFN_DOWN(mend - mstart);
 
 	if (mmu_notifier_range_blockable(range))
 		mutex_lock(&region->mreg_mutex);
@@ -550,12 +603,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
-				      page_offset, page_count);
+	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
+				     pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 
-	mshv_region_invalidate_pages(region, page_offset, page_count);
+	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
 
 	mutex_unlock(&region->mreg_mutex);
 
@@ -567,9 +620,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	WARN_ONCE(ret,
 		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
 		  region->start_uaddr,
-		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
 		  range->start, range->end, range->event,
-		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+		  pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
 	return false;
 }
 
@@ -588,7 +641,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
 
 	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
 					   region->start_uaddr,
-					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
 					   &mshv_region_mni_ops);
 	if (ret)
 		return false;
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e19a84ea07905..34c9b57c50f47 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -85,15 +85,15 @@ enum mshv_region_type {
 struct mshv_mem_region {
 	struct hlist_node hnode;
 	struct kref mreg_refcount;
-	u64 nr_pages;
+	u64 nr_pfns;
 	u64 start_gfn;
 	u64 start_uaddr;
 	u32 hv_map_flags;
 	struct mshv_partition *partition;
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
-	struct mutex mreg_mutex;	/* protects region pages remapping */
-	struct page *mreg_pages[];
+	struct mutex mreg_mutex;	/* protects region PFNs remapping */
+	unsigned long mreg_pfns[];
 };
 
 struct mshv_irq_ack_notifier {
@@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
 int hv_call_initialize_partition(u64 partition_id);
 int hv_call_finalize_partition(u64 partition_id);
 int hv_call_delete_partition(u64 partition_id);
-int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
-int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			  u32 flags, struct page **pages);
-int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			    u32 flags);
+int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
+int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
+			 u32 flags, unsigned long *pfns);
+int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
+		       u32 flags);
 int hv_call_delete_vp(u64 partition_id, u32 vp_index);
 int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
 				     u64 dest_addr,
@@ -332,8 +332,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
 int hv_unmap_stats_page(enum hv_stats_object_type type,
 			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity);
-int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
-				   u64 page_struct_count, u32 host_access,
+int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
+				   u64 pfns_count, u32 host_access,
 				   u32 flags, u8 acquire);
 int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
 				      void *property_value, size_t property_value_sz);
@@ -375,6 +375,7 @@ int mshv_region_share(struct mshv_mem_region *region);
 int mshv_region_unshare(struct mshv_mem_region *region);
 int mshv_region_map(struct mshv_mem_region *region);
 void mshv_region_invalidate(struct mshv_mem_region *region);
+void mshv_region_init_pfns(struct mshv_mem_region *region);
 int mshv_region_pin(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index cc580225e9e45..00d37c39cbf26 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
 	return hv_result_to_errno(status);
 }
 
-/* Ask the hypervisor to map guest ram pages or the guest mmio space */
-static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
-			       u32 flags, struct page **pages, u64 mmio_spa)
+static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
+			  u32 flags, unsigned long *pfns, u64 mmio_spa)
 {
 	struct hv_input_map_gpa_pages *input_page;
 	u64 status, *pfnlist;
 	unsigned long irq_flags, large_shift = 0;
-	u64 done = 0, page_count = page_struct_count;
+	u64 done = 0, page_count = pfns_count;
 	int ret = 0;
 
-	if (page_count == 0 || (pages && mmio_spa))
+	if (page_count == 0 || (pfns && mmio_spa))
 		return -EINVAL;
 
 	if (flags & HV_MAP_GPA_LARGE_PAGE) {
@@ -227,9 +226,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		for (i = 0; i < rep_count; i++) {
 			if (flags & HV_MAP_GPA_NO_ACCESS)
 				pfnlist[i] = 0;
-			else if (pages)
-				pfnlist[i] = page_to_pfn(pages[(done + i) <<
-							 large_shift]);
+			else if (pfns)
+				pfnlist[i] = pfns[(done + i) << large_shift];
 			else
 				pfnlist[i] = mmio_spa + done + i;
 		}
@@ -257,38 +255,38 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 
 		if (flags & HV_MAP_GPA_LARGE_PAGE)
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-		hv_call_unmap_gpa_pages(partition_id, gfn,
-					done << large_shift, unmap_flags);
+		hv_call_unmap_pfns(partition_id, gfn,
+				   done << large_shift, unmap_flags);
 	}
 
 	return ret;
 }
 
 /* Ask the hypervisor to map guest ram pages */
-int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			  u32 flags, struct page **pages)
+int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
+			 u32 flags, unsigned long *pfns)
 {
-	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
-				   flags, pages, 0);
+	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
+			      pfns, 0);
 }
 
-/* Ask the hypervisor to map guest mmio space */
-int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
+int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
+			  u64 pfn_count)
 {
 	int i;
 	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
 		    HV_MAP_GPA_NOT_CACHED;
 
-	for (i = 0; i < numpgs; i++)
+	for (i = 0; i < pfn_count; i++)
 		if (page_is_ram(mmio_spa + i))
 			return -EINVAL;
 
-	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
-				   mmio_spa);
+	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
+			      NULL, mmio_spa);
 }
 
-int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
-			    u32 flags)
+int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
+		       u32 flags)
 {
 	struct hv_input_unmap_gpa_pages *input_page;
 	u64 status, page_count = page_count_4k;
@@ -1036,15 +1034,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
 	return ret;
 }
 
-int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
-				   u64 page_struct_count, u32 host_access,
+int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
+				   u64 pfns_count, u32 host_access,
 				   u32 flags, u8 acquire)
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
 	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
-	u64 page_count = page_struct_count;
+	u64 page_count = pfns_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
 			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
 
@@ -1076,8 +1074,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->host_access = host_access;
 
 		for (i = 0; i < rep_count; i++)
-			input_page->spa_page_list[i] =
-				page_to_pfn(pages[(done + i) << large_shift]);
+			input_page->spa_page_list[i] = pfns[(done + i) << large_shift];
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 03c65ff6a7397..986cb9a4c428e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 
 	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
 		if (gfn >= region->start_gfn &&
-		    gfn < region->start_gfn + region->nr_pages)
+		    gfn < region->start_gfn + region->nr_pfns)
 			return region;
 	}
 
@@ -1313,20 +1313,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 					bool is_mmio)
 {
 	struct mshv_mem_region *rg;
-	u64 nr_pages = HVPFN_DOWN(mem->size);
+	u64 nr_pfns = HVPFN_DOWN(mem->size);
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
-		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
-		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
+		if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
+		    rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
 			continue;
 		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EEXIST;
 	}
 	spin_unlock(&partition->pt_mem_regions_lock);
 
-	rg = mshv_region_create(mem->guest_pfn, nr_pages,
+	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
 				mem->userspace_addr, mem->flags);
 	if (IS_ERR(rg))
 		return PTR_ERR(rg);
@@ -1412,8 +1412,7 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 		 * is intentional; unpinning host-inaccessible pages would be
 		 * unsafe).
 		 */
-		memset(region->mreg_pages, 0,
-		       region->nr_pages * sizeof(region->mreg_pages[0]));
+		mshv_region_init_pfns(region);
 		goto err_out;
 	}
 err_out:
@@ -1470,21 +1469,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		 * the hypervisor track dirty pages, enabling pre-copy live
 		 * migration.
 		 */
-		ret = hv_call_map_gpa_pages(partition->pt_id,
-					    region->start_gfn,
-					    region->nr_pages,
-					    HV_MAP_GPA_NO_ACCESS, NULL);
+		ret = hv_call_map_ram_pfns(partition->pt_id,
+					   region->start_gfn,
+					   region->nr_pfns,
+					   HV_MAP_GPA_NO_ACCESS, NULL);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
-		ret = hv_call_map_mmio_pages(partition->pt_id,
-					     region->start_gfn,
-					     mmio_pfn,
-					     region->nr_pages);
+		ret = hv_call_map_mmio_pfns(partition->pt_id,
+					    region->start_gfn,
+					    mmio_pfn,
+					    region->nr_pfns);
 		break;
 	}
 
 	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
-				   region->start_gfn, region->nr_pages,
+				   region->start_gfn, region->nr_pfns,
 				   region->hv_map_flags, ret);
 
 	if (ret)
@@ -1522,7 +1521,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	/* Paranoia check */
 	if (region->start_uaddr != mem->userspace_addr ||
 	    region->start_gfn != mem->guest_pfn ||
-	    region->nr_pages != HVPFN_DOWN(mem->size)) {
+	    region->nr_pfns != HVPFN_DOWN(mem->size)) {
 		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EINVAL;
 	}



^ permalink raw reply related

* [PATCH v3 04/11] mshv: Refactor region segmentation into a dedicated helper
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_process_pfns() conflated three concerns: validating the first
PFN of a chunk, locating the longest contiguous run of same-stride PFNs
starting from there, and dispatching the chunk to the handler. The
locate-and-dispatch interleaving made the partial-consume case (4K-to-2M
stride transition inside a same-validity run) emergent rather than
explicit, and required process_range to handle a return value that was
simultaneously a count and an error code.

Split the locate step out into mshv_region_chunk_size().  The new helper
takes a starting offset and an upper bound, returns the length of the
same-stride run, and reports whether that run is huge-page-backed via an
out-parameter. mshv_region_process_pfns() goes away;
mshv_region_process_range() now drives the loop directly, calling
chunk_size() for the next segment length and dispatching the handler with
the precomputed huge_page hint.

mshv_chunk_stride() additionally takes a PFN instead of a struct page * and
validates it internally, so each call site no longer needs its own
mshv_pfn_valid() check before pfn_to_page().

No functional change; the per-handler dispatch shape, segmentation
boundaries, and lock context are all preserved.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |  103 ++++++++++++++++++++++-----------------------
 1 file changed, 50 insertions(+), 53 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 77fc94733cb20..090c4052f0f4d 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -41,7 +41,7 @@ void mshv_region_init_pfns(struct mshv_mem_region *region)
 
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
- * @page     : The page to check for huge page backing
+ * @pfn      : The PFN to check for huge page backing
  * @gfn      : Guest frame number for the mapping
  * @pfn_count: Total number of pages in the mapping
  *
@@ -51,11 +51,16 @@ void mshv_region_init_pfns(struct mshv_mem_region *region)
  *
  * Return: Stride in pages, or -EINVAL if page order is unsupported.
  */
-static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 pfn_count)
+static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
 {
+	struct page *page;
 	unsigned int page_order;
 
+	if (!mshv_pfn_valid(pfn))
+		return -EINVAL;
+
+	page = pfn_to_page(pfn);
+
 	/*
 	 * Use single page stride by default. For huge page stride, the
 	 * page must be compound and point to the head of the compound
@@ -76,65 +81,51 @@ static int mshv_chunk_stride(struct page *page,
 }
 
 /**
- * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
- *                             in a region.
- * @region    : Pointer to the memory region structure.
- * @flags     : Flags to pass to the handler.
- * @pfn_offset: Offset into the region's PFNs array to start processing.
- * @pfn_count : Number of PFNs to process.
- * @handler   : Callback function to handle the chunk.
+ * mshv_region_chunk_size - Length of the next same-stride PFN run.
+ * @region    : Memory region whose PFN array is being walked.
+ * @pfn_offset: Offset into region->mreg_pfns at which to start; the
+ *              PFN at this offset must be valid.
+ * @pfn_count : Upper bound on the run length (not necessarily the
+ *              region's total length; typically the residual passed
+ *              from mshv_region_process_range()).
+ * @huge_page : Out-parameter set to true if the run is backed by
+ *              PMD-order folios and may be dispatched as 2 MiB
+ *              chunks; false for 4 KiB-stride dispatch.
  *
- * This function scans the region's PFNs starting from @pfn_offset,
- * checking for contiguous valid PFNs backed by pages of the same size
- * (normal or huge). It invokes @handler for the chunk of contiguous valid
- * PFNs found. Returns the number of PFNs handled, or a negative error code
- * if the first PFN is invalid or the handler fails.
+ * Walks the PFN array starting at @pfn_offset and returns the length
+ * of the longest contiguous run that shares the stride classification
+ * (4 KiB vs 2 MiB) of the first PFN.  An invalid PFN inside the run
+ * terminates it.  The run is bounded above by @pfn_count.
  *
- * Note: The @handler callback must be able to handle valid PFNs backed by
- * both normal and huge pages.
+ * The caller may then dispatch [pfn_offset, pfn_offset + return) to a
+ * handler with @huge_page indicating which stride applies.  After the
+ * dispatch the caller advances by the returned length and re-invokes
+ * this function for the next run.
  *
- * Return: Number of pages handled, or negative error code.
+ * Return: Length of the run in PFNs, or a negative errno from
+ *         mshv_chunk_stride() if the starting PFN is invalid or its
+ *         backing folio order is unsupported.
  */
-static long mshv_region_process_pfns(struct mshv_mem_region *region,
-				     u32 flags,
-				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+static long mshv_region_chunk_size(struct mshv_mem_region *region,
+				   u64 pfn_offset, u64 pfn_count,
+				   bool *huge_page)
 {
+	unsigned long *pfns = region->mreg_pfns + pfn_offset;
 	u64 gfn = region->start_gfn + pfn_offset;
-	u64 count;
-	unsigned long pfn;
-	int stride, ret;
+	u64 count = 0, stride;
 
-	pfn = region->mreg_pfns[pfn_offset];
-	if (!mshv_pfn_valid(pfn))
-		return -EINVAL;
-
-	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
+	stride = mshv_chunk_stride(pfns[0], gfn, pfn_count);
 	if (stride < 0)
 		return stride;
 
-	/* Start at stride since the first stride is validated */
-	for (count = stride; count < pfn_count ; count += stride) {
-		pfn = region->mreg_pfns[pfn_offset + count];
-
-		/* Break if current pfn is invalid */
-		if (!mshv_pfn_valid(pfn))
-			break;
-
-		/* Break if stride size changes */
-		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
+	for (count = stride; count < pfn_count; count += stride) {
+		if (stride != mshv_chunk_stride(pfns[count],
 						gfn + count,
 						pfn_count - count))
 			break;
 	}
 
-	ret = handler(region, flags, pfn_offset, count, stride > 1);
-	if (ret)
-		return ret;
+	*huge_page = stride > 1;
 
 	return count;
 }
@@ -150,7 +141,7 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
  *
  * Iterates over the specified range of PFNs in @region, skipping
  * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
- * @handler via mshv_region_process_pfns.
+ * @handler.
  *
  * Note: The @handler callback must be able to handle PFNs backed by both
  * normal and huge pages.
@@ -176,6 +167,9 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 		return -EINVAL;
 
 	while (pfn_count) {
+		bool huge_page;
+		long count;
+
 		/* Skip non-present pages */
 		if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset])) {
 			pfn_offset++;
@@ -183,14 +177,17 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 			continue;
 		}
 
-		ret = mshv_region_process_pfns(region, flags,
-					       pfn_offset, pfn_count,
-					       handler);
+		count = mshv_region_chunk_size(region, pfn_offset, pfn_count,
+					       &huge_page);
+		if (count < 0)
+			return count;
+
+		ret = handler(region, flags, pfn_offset, count, huge_page);
 		if (ret < 0)
 			return ret;
 
-		pfn_offset += ret;
-		pfn_count -= ret;
+		pfn_offset += count;
+		pfn_count -= count;
 	}
 
 	return 0;



^ permalink raw reply related

* [PATCH v3 05/11] mshv: Support address range holes in remapping
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Consolidate memory region processing to handle both valid and invalid PFNs
uniformly. This eliminates code duplication across remap, unmap, share, and
unshare operations by using a common range processing interface.

Holes are now remapped with no-access permissions to enable
hypervisor dirty page tracking for precopy live migration.

This refactoring is a precursor to an upcoming change that will map
present pages in movable regions upon region creation, requiring
consistent handling of both mapped and unmapped ranges.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   70 ++++++++++++++++++++++++++++-----------------
 1 file changed, 43 insertions(+), 27 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 090c4052f0f4d..579a29f2924b8 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -81,30 +81,23 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
 }
 
 /**
- * mshv_region_chunk_size - Length of the next same-stride PFN run.
+ * mshv_region_chunk_size - Length of the next contiguous PFN run in a region.
  * @region    : Memory region whose PFN array is being walked.
- * @pfn_offset: Offset into region->mreg_pfns at which to start; the
- *              PFN at this offset must be valid.
- * @pfn_count : Upper bound on the run length (not necessarily the
- *              region's total length; typically the residual passed
- *              from mshv_region_process_range()).
- * @huge_page : Out-parameter set to true if the run is backed by
- *              PMD-order folios and may be dispatched as 2 MiB
- *              chunks; false for 4 KiB-stride dispatch.
+ * @pfn_offset: Offset into region->mreg_pfns at which to start.
+ * @pfn_count : Upper bound on the run length.
+ * @huge_page : Out-parameter set to true if the run may be dispatched
+ *              as a 2 MiB chunk; false for 4 KiB-stride dispatch.
  *
- * Walks the PFN array starting at @pfn_offset and returns the length
- * of the longest contiguous run that shares the stride classification
- * (4 KiB vs 2 MiB) of the first PFN.  An invalid PFN inside the run
- * terminates it.  The run is bounded above by @pfn_count.
- *
- * The caller may then dispatch [pfn_offset, pfn_offset + return) to a
- * handler with @huge_page indicating which stride applies.  After the
- * dispatch the caller advances by the returned length and re-invokes
- * this function for the next run.
+ * Returns the length of the longest contiguous run starting at @pfn_offset
+ * that shares the classification of the first PFN: either a same-stride run of
+ * valid PFNs (4 KiB or 2 MiB) or a hole of invalid PFNs. A hole that is
+ * huge-page aligned in @gfn space and at least PTRS_PER_PMD entries long is
+ * reported as a 2 MiB chunk (huge_page = true) so the caller can dispatch it
+ * as a single HV_MAP_GPA_NO_ACCESS huge mapping. The run is bounded above by
+ * @pfn_count.
  *
  * Return: Length of the run in PFNs, or a negative errno from
- *         mshv_chunk_stride() if the starting PFN is invalid or its
- *         backing folio order is unsupported.
+ *         mshv_chunk_stride() if the backing folio order is unsupported.
  */
 static long mshv_region_chunk_size(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
@@ -114,6 +107,22 @@ static long mshv_region_chunk_size(struct mshv_mem_region *region,
 	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count = 0, stride;
 
+	if (!mshv_pfn_valid(pfns[0])) {
+		for (count = 1; count < pfn_count; count++) {
+			if (mshv_pfn_valid(pfns[count]))
+				break;
+		}
+
+		if (IS_ALIGNED(gfn, PTRS_PER_PMD) &&
+		    count >= PTRS_PER_PMD) {
+			*huge_page = true;
+			return ALIGN_DOWN(count, PTRS_PER_PMD);
+		}
+
+		*huge_page = false;
+		return count;
+	}
+
 	stride = mshv_chunk_stride(pfns[0], gfn, pfn_count);
 	if (stride < 0)
 		return stride;
@@ -170,13 +179,6 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 		bool huge_page;
 		long count;
 
-		/* Skip non-present pages */
-		if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset])) {
-			pfn_offset++;
-			pfn_count--;
-			continue;
-		}
-
 		count = mshv_region_chunk_size(region, pfn_offset, pfn_count,
 					       &huge_page);
 		if (count < 0)
@@ -223,6 +225,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+		return -EINVAL;
+
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
@@ -248,6 +253,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
 {
+	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+		return -EINVAL;
+
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
@@ -271,6 +279,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	/*
+	 * Remap missing pages with no access to let the
+	 * hypervisor track dirty pages, enabling precopy live
+	 * migration.
+	 */
+	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+		flags = HV_MAP_GPA_NO_ACCESS;
+
 	if (huge_page)
 		flags |= HV_MAP_GPA_LARGE_PAGE;
 



^ permalink raw reply related

* [PATCH v3 06/11] mshv: Iterate VMAs when faulting in region pages
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_hmm_fault_and_lock() called hmm_range_fault() once for the
entire requested range. hmm_range_fault() can only handle a single VMA per
call, so a region whose user address range spans multiple VMAs fails the
fault even though each individual VMA is fault-able.

Walk the requested range VMA by VMA under mmap_read_lock and call
hmm_range_fault() for each [vma->vm_start, vma->vm_end) ∩ [start, end)
segment. The mmu notifier sequence is captured once before the loop, so a
writer racing with the multi-VMA fault is still detected at the closing
mmu_interval_read_retry().

Tighten the read-only gate added in 3f8e229cb787 ("mshv: Don't request HMM
write fault for read-only regions") so HMM_PFN_REQ_WRITE is requested only
when both the region (HV_MAP_GPA_WRITABLE) and the backing VMA (VM_WRITE)
permit writes. Without the per-VMA check, a writable region whose
underlying VMA is read-only would still trigger COW on the host's read-only
pages.

While here, restructure mshv_region_hmm_fault_and_lock() to take the range
as (start, end, pfns) directly rather than a populated hmm_range; the
struct is now constructed inside the function since its fields are
recomputed per VMA.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   97 +++++++++++++++++++++++++++++----------------
 1 file changed, 62 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 579a29f2924b8..807fff43deb43 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -447,37 +447,76 @@ int mshv_region_get(struct mshv_mem_region *region)
 }
 
 /**
- * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * mshv_region_hmm_fault_and_lock - Fault in pages across VMAs and lock
+ *                                  the memory region
  * @region: Pointer to the memory region structure
- * @range: Pointer to the HMM range structure
+ * @start : Starting virtual address of the range to fault (inclusive)
+ * @end   : Ending virtual address of the range to fault (exclusive)
+ * @pfns  : Output array for page frame numbers with HMM flags
  *
- * This function performs the following steps:
- * 1. Reads the notifier sequence for the HMM range.
- * 2. Acquires a read lock on the memory map.
- * 3. Handles HMM faults for the specified range.
- * 4. Releases the read lock on the memory map.
- * 5. If successful, locks the memory region mutex.
- * 6. Verifies if the notifier sequence has changed during the operation.
- *    If it has, releases the mutex and returns -EBUSY to match with
- *    hmm_range_fault() return code for repeating.
+ * Iterates through VMAs covering [start, end), faulting in pages via
+ * hmm_range_fault() for each VMA segment.  Write faults are requested
+ * only when both the VMA and the hypervisor mapping permit writes, to
+ * avoid breaking copy-on-write semantics on read-only mappings.
  *
- * Return: 0 on success, a negative error code otherwise.
+ * On success, returns with region->mreg_mutex held; the caller is
+ * responsible for releasing it.  Returns -EBUSY if the mmu notifier
+ * sequence changed during the operation, signalling the caller to retry.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
 static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
-					  struct hmm_range *range)
+					  unsigned long start,
+					  unsigned long end,
+					  unsigned long *pfns)
 {
+	struct hmm_range range = {
+		.notifier = &region->mreg_mni,
+	};
+	struct mm_struct *mm = region->mreg_mni.mm;
 	int ret;
 
-	range->notifier_seq = mmu_interval_read_begin(range->notifier);
-	mmap_read_lock(region->mreg_mni.mm);
-	ret = hmm_range_fault(range);
-	mmap_read_unlock(region->mreg_mni.mm);
+	range.notifier_seq = mmu_interval_read_begin(range.notifier);
+	mmap_read_lock(mm);
+	while (start < end) {
+		struct vm_area_struct *vma;
+
+		vma = vma_lookup(mm, start);
+		if (!vma) {
+			ret = -EFAULT;
+			break;
+		}
+
+		range.hmm_pfns = pfns;
+		range.start = start;
+		range.end = min(vma->vm_end, end);
+		range.default_flags = HMM_PFN_REQ_FAULT;
+		/*
+		 * Only request writable pages from HMM when both the
+		 * VMA and the hypervisor mapping allow writes.  Without
+		 * this, hmm_range_fault() would trigger COW on read-only
+		 * mappings (e.g. shared zero pages, file-backed pages),
+		 * breaking copy-on-write semantics and potentially
+		 * granting the guest write access to shared host pages.
+		 */
+		if ((vma->vm_flags & VM_WRITE) &&
+		    (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
+			range.default_flags |= HMM_PFN_REQ_WRITE;
+
+		ret = hmm_range_fault(&range);
+		if (ret)
+			break;
+
+		start = range.end;
+		pfns += (range.end - range.start) >> PAGE_SHIFT;
+	}
+	mmap_read_unlock(mm);
 	if (ret)
 		return ret;
 
 	mutex_lock(&region->mreg_mutex);
 
-	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+	if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
 		mutex_unlock(&region->mreg_mutex);
 		cond_resched();
 		return -EBUSY;
@@ -501,10 +540,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 static int mshv_region_range_fault(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count)
 {
-	struct hmm_range range = {
-		.notifier = &region->mreg_mni,
-		.default_flags = HMM_PFN_REQ_FAULT,
-	};
+	unsigned long start, end;
 	unsigned long *pfns;
 	int ret;
 	u64 i;
@@ -513,21 +549,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	if (!pfns)
 		return -ENOMEM;
 
-	range.hmm_pfns = pfns;
-	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
-	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
-
-	/*
-	 * Only request writable pages from HMM when the region itself
-	 * permits writes.  Without this, hmm_range_fault() would
-	 * trigger COW on read-only regions, breaking copy-on-write
-	 * semantics on shared host pages.
-	 */
-	if (region->hv_map_flags & HV_MAP_GPA_WRITABLE)
-		range.default_flags |= HMM_PFN_REQ_WRITE;
+	start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
+	end = start + pfn_count * HV_HYP_PAGE_SIZE;
 
 	do {
-		ret = mshv_region_hmm_fault_and_lock(region, &range);
+		ret = mshv_region_hmm_fault_and_lock(region, start, end,
+						     pfns);
 	} while (ret == -EBUSY);
 
 	if (ret)



^ permalink raw reply related

* [PATCH v3 07/11] mshv: Scale fault granularity for non-4 KiB host pages
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_handle_gfn_fault() faults in PTRS_PER_PMD HV pages
(2 MiB), sized for a single 2 MiB SLAT mapping.  When the host page
is larger than HV_HYP_PAGE_SIZE (e.g. arm64 16 KiB or 64 KiB), the
range covers fewer than PTRS_PER_PMD host pages, so a subsequent
guest fault re-enters HMM for offsets the same host folio already
backs.

Scale MSHV_MAP_FAULT_IN_PAGES by PAGE_SIZE / HV_HYP_PAGE_SIZE
(clamped to at least 1) so the fault range covers one huge page on
whichever side has the larger one.

No functional change for the H == V case; on H > V hosts the fault
grows from 2 MiB to one host huge page.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 807fff43deb43..057fc83895d37 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -17,7 +17,8 @@
 
 #include "mshv_root.h"
 
-#define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
+#define MSHV_MAP_FAULT_IN_PAGES				\
+	(PTRS_PER_PMD * max_t(unsigned long, 1, PAGE_SIZE / HV_HYP_PAGE_SIZE))
 #define MSHV_INVALID_PFN				ULONG_MAX
 
 static inline bool mshv_pfn_valid(unsigned long pfn)



^ permalink raw reply related

* [PATCH v3 08/11] mshv: Move pinned region setup to mshv_regions.c
From: Stanislav Kinsburskii @ 2026-05-13 18:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Move mshv_prepare_pinned_region() from mshv_root_main.c to
mshv_regions.c and rename it to mshv_map_pinned_region(). This
co-locates the pinned region logic with the rest of the memory region
operations.

Make mshv_region_pin(), mshv_region_map(), mshv_region_share(),
mshv_region_unshare(), and mshv_region_invalidate() static, as they are
no longer called outside of mshv_regions.c.

No functional change.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   83 ++++++++++++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root.h      |    6 +--
 drivers/hv/mshv_root_main.c |   75 +--------------------------------------
 3 files changed, 80 insertions(+), 84 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 057fc83895d37..d9e33e9ef8550 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -240,7 +240,7 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 					      flags, true);
 }
 
-int mshv_region_share(struct mshv_mem_region *region)
+static int mshv_region_share(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
@@ -266,7 +266,7 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 					      flags, false);
 }
 
-int mshv_region_unshare(struct mshv_mem_region *region)
+static int mshv_region_unshare(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
@@ -306,7 +306,7 @@ static int mshv_region_remap_pfns(struct mshv_mem_region *region,
 					 mshv_region_chunk_remap);
 }
 
-int mshv_region_map(struct mshv_mem_region *region)
+static int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
@@ -330,12 +330,12 @@ static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
 	mshv_region_init_pfns_range(region, pfn_offset, pfn_count);
 }
 
-void mshv_region_invalidate(struct mshv_mem_region *region)
+static void mshv_region_invalidate(struct mshv_mem_region *region)
 {
 	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
-int mshv_region_pin(struct mshv_mem_region *region)
+static int mshv_region_pin(struct mshv_mem_region *region)
 {
 	u64 done_count, nr_pfns, i;
 	unsigned long *pfns;
@@ -691,3 +691,76 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
 
 	return true;
 }
+
+/**
+ * mshv_map_pinned_region - Pin and map memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int mshv_map_pinned_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	int ret;
+
+	ret = mshv_region_pin(region);
+	if (ret) {
+		pt_err(partition, "Failed to pin memory region: %d\n",
+		       ret);
+		goto err_out;
+	}
+
+	/*
+	 * For an SNP partition it is a requirement that for every memory region
+	 * that we are going to map for this partition we should make sure that
+	 * host access to that region is released. This is ensured by doing an
+	 * additional hypercall which will update the SLAT to release host
+	 * access to guest memory regions.
+	 */
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_region_unshare(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
+			       region->start_gfn, ret);
+			goto err_out;
+		}
+	}
+
+	ret = mshv_region_map(region);
+	if (ret)
+		goto share_region;
+
+	return 0;
+
+share_region:
+	if (mshv_partition_encrypted(partition)) {
+		int shrc;
+
+		shrc = mshv_region_share(region);
+		if (!shrc)
+			goto err_out;
+
+		pt_err(partition,
+		       "Failed to share memory region (guest_pfn: %llu): %d\n",
+		       region->start_gfn, shrc);
+		/*
+		 * Re-sharing failed — the pages remain inaccessible to the
+		 * host.  Zero the page array so that mshv_region_destroy()
+		 * won't attempt to unpin them (leaking the page references
+		 * is intentional; unpinning host-inaccessible pages would be
+		 * unsafe).
+		 */
+		mshv_region_init_pfns(region);
+		goto err_out;
+	}
+err_out:
+	return ret;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 34c9b57c50f47..2a4eff27917f2 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -371,16 +371,12 @@ extern u8 * __percpu *hv_synic_eventring_tail;
 
 struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 					   u64 uaddr, u32 flags);
-int mshv_region_share(struct mshv_mem_region *region);
-int mshv_region_unshare(struct mshv_mem_region *region);
-int mshv_region_map(struct mshv_mem_region *region);
-void mshv_region_invalidate(struct mshv_mem_region *region);
 void mshv_region_init_pfns(struct mshv_mem_region *region);
-int mshv_region_pin(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
+int mshv_map_pinned_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 986cb9a4c428e..4af2b98738ee2 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1346,79 +1346,6 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/**
- * mshv_prepare_pinned_region - Pin and map memory regions
- * @region: Pointer to the memory region structure
- *
- * This function processes memory regions that are explicitly marked as pinned.
- * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
- * population. The function ensures the region is properly populated, handles
- * encryption requirements for SNP partitions if applicable, maps the region,
- * and performs necessary sharing or eviction operations based on the mapping
- * result.
- *
- * Return: 0 on success, negative error code on failure.
- */
-static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
-{
-	struct mshv_partition *partition = region->partition;
-	int ret;
-
-	ret = mshv_region_pin(region);
-	if (ret) {
-		pt_err(partition, "Failed to pin memory region: %d\n",
-		       ret);
-		goto err_out;
-	}
-
-	/*
-	 * For an SNP partition it is a requirement that for every memory region
-	 * that we are going to map for this partition we should make sure that
-	 * host access to that region is released. This is ensured by doing an
-	 * additional hypercall which will update the SLAT to release host
-	 * access to guest memory regions.
-	 */
-	if (mshv_partition_encrypted(partition)) {
-		ret = mshv_region_unshare(region);
-		if (ret) {
-			pt_err(partition,
-			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
-			       region->start_gfn, ret);
-			goto err_out;
-		}
-	}
-
-	ret = mshv_region_map(region);
-	if (ret)
-		goto share_region;
-
-	return 0;
-
-share_region:
-	if (mshv_partition_encrypted(partition)) {
-		int shrc;
-
-		shrc = mshv_region_share(region);
-		if (!shrc)
-			goto err_out;
-
-		pt_err(partition,
-		       "Failed to share memory region (guest_pfn: %llu): %d\n",
-		       region->start_gfn, shrc);
-		/*
-		 * Re-sharing failed — the pages remain inaccessible to the
-		 * host.  Zero the page array so that mshv_region_destroy()
-		 * won't attempt to unpin them (leaking the page references
-		 * is intentional; unpinning host-inaccessible pages would be
-		 * unsafe).
-		 */
-		mshv_region_init_pfns(region);
-		goto err_out;
-	}
-err_out:
-	return ret;
-}
-
 /*
  * This maps two things: guest RAM and for pci passthru mmio space.
  *
@@ -1461,7 +1388,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 
 	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
-		ret = mshv_prepare_pinned_region(region);
+		ret = mshv_map_pinned_region(region);
 		break;
 	case MSHV_REGION_TYPE_MEM_MOVABLE:
 		/*



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox