Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
From: Frederic Weisbecker @ 2026-05-13 16:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <20260421030351.281436-9-longman@redhat.com>

Le Mon, Apr 20, 2026 at 11:03:36PM -0400, Waiman Long a écrit :
> As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
> need to use RCU to protect access to the cpumask to prevent it from
> going away in the middle of the operation.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  arch/arm64/kernel/topology.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb..48f150801689 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
>  	if (!amu_fie_cpu_supported(cpu))
>  		return;
>  
> +	guard(rcu)();
>  	/* Kick in AMU update but only if one has not happened already */
>  	if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
>  	    time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update,
>  	cpu)))

This is called with IRQs disabled in the current CPU that is online so it's
already guaranteed to be stable.


> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
>  	unsigned int start_cpu = cpu;
>  	unsigned long last_update;
>  	unsigned int freq = 0;
> +	bool hk_cpu;
>  	u64 scale;
>  
>  	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
>  		return -EOPNOTSUPP;
>  
> +	scoped_guard(rcu) {
> +		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> +	}
> +
>  	while (1) {
>  
>  		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
>  		 * (and thus freq scale), if available, for given policy: this boils
>  		 * down to identifying an active cpu within the same freq domain, if any.
>  		 */
> -		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> +		if (!hk_cpu ||
>  		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>  			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> +			bool hk_intersects;
>  			int ref_cpu;
>  
>  			if (!policy)
>  				return -EINVAL;
>  
> -			if (!cpumask_intersects(policy->related_cpus,
> -						housekeeping_cpumask(HK_TYPE_TICK))) {
> +			scoped_guard(rcu) {
> +				hk_intersects = cpumask_intersects(policy->related_cpus,
> +							housekeeping_cpumask(HK_TYPE_TICK));
> +			}
> +
> +			if (!hk_intersects) {
>  				cpufreq_cpu_put(policy);
>  				return -EOPNOTSUPP;
>  			}

Ok so this is racy but it's fine because:

This function is only used by cpufreq with either cpufreq_policy_write or
cpufreq_policy_read held (that is, struct cpufreq_policy::rwsem).

And that rwsem is write held on cpufreq_online() -> cpufreq_policy_online() and
also offline to guarantee the policy->cpus and policy->cpu stability.

Therefore housekeeping_cpumask() should only deal with stable online CPUs here. So
even if the housekeeping mask can be changed concurrently, those CPUs can't
appear or disappear from it.

Would be worth adding a comment about that.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Stanislav Kinsburskii @ 2026-05-13 17:31 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-roaring-gentle-kiwi-21bccd@anirudhrb>

On Wed, May 13, 2026 at 11:15:37AM +0000, Anirudh Rayabharam wrote:
> On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> > On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > > fails on an unencrypted partition. The condition on the error path:
> > > > 
> > > >     if (ret && mshv_partition_encrypted(partition))
> > > > 
> > > > only handles map failures for encrypted partitions — if the partition is
> > > > not encrypted and the map fails, execution falls through to 'return 0',
> > > > silently ignoring the error.
> > > > 
> > > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > > zeroes the mreg_pages array before the caller's cleanup path
> > > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > > mappings for partially-mapped pages.
> > > > 
> > > > Fix by returning immediately on success and falling through to error
> > > > return on failure. For unencrypted partitions, the caller's
> > > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > > correct order. For encrypted partitions where re-sharing fails, zero
> > > > the page array without unpinning — the pages are inaccessible to the
> > > > host and must not be unpinned, but zeroing prevents
> > > > mshv_region_destroy() from attempting to unpin them.
> > > 
> > > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > > mshv_region_invalidate_pages() which just does:
> > > 
> > > 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > > 						 u64 page_offset, u64 page_count)
> > > 	{
> > > 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > > 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > > 
> > > 		memset(region->mreg_pages + page_offset, 0,
> > > 		       page_count * sizeof(struct page *));
> > > 	}
> > > 
> > > It doesn't checked for zeroed pages before unpinning.
> > > 
> > > > 
> > > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> > > >  1 file changed, 16 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index 665d565899c15..7e4252b6bc65c 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > > >  			pt_err(partition,
> > > >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > > >  			       region->start_gfn, ret);
> > > > -			goto invalidate_region;
> > > > +			goto err_out;
> > > >  		}
> > > >  	}
> > > >  
> > > >  	ret = mshv_region_map(region);
> > > > -	if (ret && mshv_partition_encrypted(partition)) {
> > > > +	if (ret)
> > > > +		goto share_region;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +share_region:
> > > > +	if (mshv_partition_encrypted(partition)) {
> > > >  		int shrc;
> > > >  
> > > >  		shrc = mshv_region_share(region);
> > > >  		if (!shrc)
> > > > -			goto invalidate_region;
> > > > +			goto err_out;
> > > >  
> > > >  		pt_err(partition,
> > > >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> > > >  		       region->start_gfn, shrc);
> > > >  		/*
> > > > -		 * Don't unpin if marking shared failed because pages are no
> > > > -		 * longer mapped in the host, ie root, anymore.
> > > > +		 * Re-sharing failed — the pages remain inaccessible to the
> > > > +		 * host.  Zero the page array so that mshv_region_destroy()
> > > > +		 * won't attempt to unpin them (leaking the page references
> > > > +		 * is intentional; unpinning host-inaccessible pages would be
> > > > +		 * unsafe).
> > > >  		 */
> > > 
> > > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > > can remove it from here altogether. Either way, should this zeroing
> > > logic be added there too so as not to crash the host by unpinning pages
> > > that weren't marked shared?
> > > 
> > 
> > Indeed.
> > The issue with all this code is that we are leaking state in
> > mshv_region_pin if we wimply return from it: we don't know if the pages
> > are pinned or unshared (or mapped) in the destruction callback.
> > We should either introduce a state variable to track this or just don't
> > call the generic mshv_region_put on case of region creation error.
> > The latter approch make mshv_map_user_memory idempotent and thus looks
> > like a better way forward.
> > What do you think?
> 

I tried to say, that there can be two apporaches to solving this issue:
  1. Either make sure mshv_region_pus() can destroy a partially created
     region (tha'ts what this patch is doing) or
  2. Make mshv_map_user_region be idempotent by not calling mshv_region_put() if the region was not
     fully created and destroy the partioally created parts of it on
     error paths. This would require preserving some state in the region
     struct to know if the region was pinned or shared or mapped.
     This looks like a leaner apporach, but it requires additional state
     tracking and more code changes.

Thanks,
Stanislav


> I'm not sure I follow the latter approach. Omitting mshv_region_put()
> would cause a dangling reference to the mshv_region right?
> 
> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Stanislav Kinsburskii @ 2026-05-13 17:38 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-omniscient-enchanted-otter-dfd602@anirudhrb>

On Wed, May 13, 2026 at 12:14:49PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> > In mshv_irqfd_assign(), the level-triggered validation for resample
> > irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> > mshv_irqfd_update() has populated the field. Since the irqfd struct is
> > zero-allocated, level_triggered is always 0 at that point, causing the
> > check to always reject resample irqfds with -EINVAL. This makes
> > level-triggered interrupt resampling — used to avoid interrupt storms
> > with assigned devices — completely non-functional.
> 
> What bugs would this manifest as? Why haven't we seen any such bugs so
> far?
> 

This patch fixes a logical error.
Whtout the change this hunk always fails:

        if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
            !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {

and the reason we never seen it as that we never used
register_irqfd_with_resample() function of the mshv crate.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Stanislav Kinsburskii @ 2026-05-13 17:39 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-grinning-firefly-of-refinement-ed06cd@anirudhrb>

On Wed, May 13, 2026 at 11:12:05AM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> > handle_pair_message() iterates up to msg->vp_count without verifying it
> > against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> > from untrusted hypervisor data, a malformed message with a large value
> > would cause out-of-bounds reads from the partition_ids and vp_indexes
> > arrays.
> > 
> > handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> > 64) and advances bank_contents for each one. However, the payload buffer
> > only has space for 16 bank entries. A valid_bank_mask with more than 16
> > bits set causes bank_contents to read beyond the message buffer.
> > 
> > Fix both by adding bounds validation:
> > - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> > - Track banks consumed and stop before exceeding buffer capacity
> > 
> > Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index 89207aad7cf1f..5d509299f14d7 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
> >  static void
> >  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> >  {
> > -	int bank_idx, vps_signaled = 0, bank_mask_size;
> > +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> > +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> > +			      sizeof(u64) - 2; /* subtract format + mask */
> 
> Could this be a constant in the header?
> 

Yes, it could. But it the only place it's used and it's pretty
self-explanatory, so I don't think it needs to be.

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 

^ permalink raw reply

* RE: [EXTERNAL] [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Long Li @ 2026-05-13 18:30 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, James.Bottomley@HansenPartnership.com,
	martin.petersen@oracle.com, Allen Pais
  Cc: Tianyu Lan, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org,
	vdso@hexbites.dev, mhklinux@outlook.com
In-Reply-To: <20260408073105.272255-1-tiala@microsoft.com>

> Hyper-V provides Confidential VMBus to communicate between device model
> and device guest driver via encrypted/private memory in Confidential VM. The
> device model is in OpenHCL
> (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fopenvm
> m.dev%2Fguide%2Fuser_guide%2Fopenhcl.html&data=05%7C02%7Clongli%40mi
> crosoft.com%7C0ccfea7cda8e4500ae9808de9540d01e%7C72f988bf86f141af91a
> b2d7cd011db47%7C1%7C0%7C639112302777934798%7CUnknown%7CTWFpbG
> Zsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIk
> FOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=5Uc%2FM4ZVgJT1
> NAq08cIlNtfF5oW4n%2FTj%2Bqg3YqBUeZg%3D&reserved=0) that plays the
> paravisor role.
> 
> For a VMBus device, there are two communication methods to talk with
> Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic DMA transfer.
> 
> The Confidential VMBus Ring buffer has been upstreamed by Roman Kisel(commit
> 6802d8af47d1).
> 
> The dynamic DMA transition of VMBus device normally goes through DMA core
> and it uses SWIOTLB as bounce buffer in a CoCo VM.
> 
> The Confidential VMBus device can do DMA directly to private/encrypted
> memory. Because the swiotlb is decrypted memory, the DMA transfer must not
> be bounced through the swiotlb, so as to preserve confidentiality. This is different
> from the default for Linux CoCo VMs, so not use DMA(SWIOTLB) API in VMBus
> driver when confidential dynamic DMA transfers capability is present.
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
>  include/linux/hyperv.h     |  1 +
>  2 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index
> ae1abab97835..79b7611518b7 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1316,7 +1316,8 @@ static void storvsc_on_channel_callback(void *context)
>  					continue;
>  				}
>  				request = (struct storvsc_cmd_request
> *)scsi_cmd_priv(scmnd);
> -				scsi_dma_unmap(scmnd);
> +				if (!device->co_external_memory)
> +					scsi_dma_unmap(scmnd);
>  			}
> 
>  			storvsc_on_receive(stor_device, packet, request); @@ -
> 1339,6 +1340,8 @@ static int storvsc_connect_to_vsp(struct hv_device *device,
> u32 ring_size,
> 
>  	device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
>  	device->channel->next_request_id_callback = storvsc_next_request_id;
> +	if (device->channel->co_external_memory)
> +		device->co_external_memory = true;
> 
>  	ret = vmbus_open(device->channel,
>  			 ring_size,
> @@ -1805,7 +1808,7 @@ static enum scsi_qc_status
> storvsc_queuecommand(struct Scsi_Host *host,
>  		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
>  		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
>  		struct scatterlist *sg;
> -		unsigned long hvpfn, hvpfns_to_add;
> +		unsigned long hvpfn, hvpfns_to_add, hvpgoff;
>  		int j, i = 0, sg_count;
> 
>  		payload_sz = (hvpg_count * sizeof(u64) + @@ -1821,7 +1824,11
> @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
>  		payload->range.len = length;
>  		payload->range.offset = offset_in_hvpg;
> 
> -		sg_count = scsi_dma_map(scmnd);
> +		if (dev->co_external_memory)
> +			sg_count = scsi_sg_count(scmnd);

scsi_sg_count() returns unsigned int, sg_count can't be negative. The check for sg_count < 0 below becomes dead code. Add a comment to say this is expected behavior.

> +		else
> +			sg_count = scsi_dma_map(scmnd);
> +
>  		if (sg_count < 0) {
>  			ret = SCSI_MLQUEUE_DEVICE_BUSY;
>  			goto err_free_payload;
> @@ -1836,9 +1843,16 @@ static enum scsi_qc_status
> storvsc_queuecommand(struct Scsi_Host *host,
>  			 * Such offsets are handled even on other than the first
>  			 * sgl entry, provided they are a multiple of PAGE_SIZE.
>  			 */
> -			hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> -			hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
> -						 sg_dma_len(sg)) - hvpfn;
> +			if (dev->co_external_memory) {
> +				hvpgoff = HVPFN_DOWN(sg->offset);
> +				hvpfn = page_to_hvpfn(sg_page(sg)) + hvpgoff;
> +				hvpfns_to_add =	HVPFN_UP(sg->offset
> + sg->length) -
> +							hvpgoff;
> +			} else {
> +				hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> +				hvpfns_to_add =
> HVPFN_UP(sg_dma_address(sg) +
> +							 sg_dma_len(sg)) -
> hvpfn;
> +			}
> 
>  			/*
>  			 * Fill the next portion of the PFN array with @@ -1860,7
> +1874,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct
> Scsi_Host *host,
>  	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
>  	migrate_enable();
> 
> -	if (ret)
> +	if (ret && (!dev->co_external_memory))
>  		scsi_dma_unmap(scmnd);
> 
>  	if (ret == -EAGAIN) {
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index
> dfc516c1c719..bcb143766d6e 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1285,6 +1285,7 @@ struct hv_device {
> 
>  	/* place holder to keep track of the dir for hv device in debugfs */
>  	struct dentry *debug_dir;
> +	bool co_external_memory;

You don't need to introduce co_external_memory in hv_device, vmbus_channel already has co_external_memory. Is it possible that you can check the vmbus_channel->co_external_memory directly? If you can remove this,  you can reword this patch to " scsi: storvsc: Confidential VMBus for dynamic DMA transfers".

Thanks,
Long



^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Jacob Pan @ 2026-05-13 18:39 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, jgg, mhklinux,
	tgopinath, easwar.hariharan, jacob.pan
In-Reply-To: <20260511162408.1180069-4-zhangyu1@linux.microsoft.com>

Hi Yu,

On Tue, 12 May 2026 00:24:07 +0800
Yu Zhang <zhangyu1@linux.microsoft.com> wrote:

> Add a para-virtualized IOMMU driver for Linux guests running on
> Hyper-V. This driver implements stage-1 IO translation within the
> guest OS. It integrates with the Linux IOMMU core, utilizing Hyper-V
> hypercalls for:
>  - Capability discovery
>  - Domain allocation, configuration, and deallocation
>  - Device attachment and detachment
>  - IOTLB invalidation
> 
> The driver constructs x86-compatible stage-1 IO page tables in the
> guest memory using consolidated IO page table helpers. This allows
> the guest to manage stage-1 translations independently of vendor-
> specific drivers (like Intel VT-d or AMD IOMMU).
> 
> Hyper-V consumes this stage-1 IO page table when a device domain is
> created and configured, and nests it with the host's stage-2 IO page
> tables, therefore eliminating the VM exits for guest IOMMU mapping
> operations. For unmapping operations, VM exits to perform the IOTLB
> flush are still unavoidable.
> 
> Hyper-V identifies each PCI pass-thru device by a logical device ID
> in its hypercall interface. The vPCI driver (pci-hyperv) registers the
> per-bus portion of this ID with the pvIOMMU driver during bus probe.
> The pvIOMMU driver stores this mapping and combines it with the
> function number of the endpoint PCI device to form the complete ID
> for hypercalls.
> 
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Co-developed-by: Easwar Hariharan
> <easwar.hariharan@linux.microsoft.com> Signed-off-by: Easwar
> Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Yu
> Zhang <zhangyu1@linux.microsoft.com> ---
>  arch/x86/hyperv/hv_init.c           |   4 +
>  arch/x86/include/asm/mshyperv.h     |   4 +
>  drivers/iommu/hyperv/Kconfig        |  17 +
>  drivers/iommu/hyperv/Makefile       |   1 +
>  drivers/iommu/hyperv/iommu.c        | 705
> ++++++++++++++++++++++++++++ drivers/iommu/hyperv/iommu.h        |
> 54 +++ drivers/pci/controller/pci-hyperv.c |  19 +-
>  include/asm-generic/mshyperv.h      |  12 +
>  8 files changed, 815 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/iommu/hyperv/iommu.c
>  create mode 100644 drivers/iommu/hyperv/iommu.h
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 323adc93f2dc..2c8ff8e06249 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -578,6 +578,10 @@ void __init hyperv_init(void)
>  	old_setup_percpu_clockev =
> x86_init.timers.setup_percpu_clockev;
> x86_init.timers.setup_percpu_clockev =
> hv_stimer_setup_percpu_clockev; +#ifdef CONFIG_HYPERV_PVIOMMU
> +	x86_init.iommu.iommu_init = hv_iommu_init;
> +#endif
> +
>  	hv_apic_init();
>  
>  	x86_init.pci.arch_init = hv_pci_init;
> diff --git a/arch/x86/include/asm/mshyperv.h
> b/arch/x86/include/asm/mshyperv.h index f64393e853ee..20d947c2c758
> 100644 --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -313,6 +313,10 @@ static inline void
> mshv_vtl_return_hypercall(void) {} static inline void
> __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {} #endif
>  
> +#ifdef CONFIG_HYPERV_PVIOMMU
> +int __init hv_iommu_init(void);
> +#endif
> +
>  #include <asm-generic/mshyperv.h>
>  
>  #endif
> diff --git a/drivers/iommu/hyperv/Kconfig
> b/drivers/iommu/hyperv/Kconfig index 30f40d867036..9e658d5c9a77 100644
> --- a/drivers/iommu/hyperv/Kconfig
> +++ b/drivers/iommu/hyperv/Kconfig
> @@ -8,3 +8,20 @@ config HYPERV_IOMMU
>  	help
>  	  Stub IOMMU driver to handle IRQs to support Hyper-V Linux
>  	  guest and root partitions.
> +
> +if HYPERV_IOMMU
> +config HYPERV_PVIOMMU
> +	bool "Microsoft Hypervisor para-virtualized IOMMU support"
> +	depends on X86 && HYPERV
> +	select IOMMU_API
> +	select GENERIC_PT
> +	select IOMMU_PT
> +	select IOMMU_PT_X86_64
nit:
If HYPERV_PVIOMMU is enabled on a (hypothetical) platform with
GENERIC_ATOMIC64=y, the select would force-enable IOMMU_PT_X86_64 even
though its depends on is unsatisfied — leading to a build failure.

In practice this can't happen today because HYPERV_PVIOMMU already
depends on X86 && HYPERV, and x86 never sets GENERIC_ATOMIC64. But
adding the explicit guard is more defensive.
i.e.
       depends on !GENERIC_ATOMIC64    # for cmpxchg64 in IOMMU_PT

> +	select IOMMU_IOVA
> +	default HYPERV
> +	help
> +	  Para-virtualized IOMMU driver for Linux guests running on
> +	  Microsoft Hyper-V. Provides DMA remapping and IOTLB
> +	  flush support to enable DMA isolation for devices
> +	  assigned to the guest.
> +endif
> diff --git a/drivers/iommu/hyperv/Makefile
> b/drivers/iommu/hyperv/Makefile index 9f557bad94ff..8669741c0a51
> 100644 --- a/drivers/iommu/hyperv/Makefile
> +++ b/drivers/iommu/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o
> +obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o
> diff --git a/drivers/iommu/hyperv/iommu.c
> b/drivers/iommu/hyperv/iommu.c new file mode 100644
> index 000000000000..e5fc625314b5
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.c
> @@ -0,0 +1,705 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2019, 2024-2026 Microsoft, Inc.
> + */
> +
> +#define pr_fmt(fmt) "Hyper-V pvIOMMU: " fmt
> +#define dev_fmt(fmt) pr_fmt(fmt)
> +
> +#include <linux/iommu.h>
> +#include <linux/pci.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/generic_pt/iommu.h>
> +#include <linux/pci-ats.h>
> +
> +#include <asm/iommu.h>
> +#include <asm/hypervisor.h>
> +#include <asm/mshyperv.h>
> +
> +#include "iommu.h"
> +#include "../iommu-pages.h"
> +
> +struct hv_iommu_dev *hv_iommu_device;
> +
> +/*
> + * Identity and blocking domains are static singletons: identity is
> a 1:1
> + * passthrough with no page table, blocking rejects all DMA. Neither
> holds
> + * per-IOMMU state, so one instance suffices even with multiple
> vIOMMUs.
> + */
> +static struct hv_iommu_domain hv_identity_domain;
> +static struct hv_iommu_domain hv_blocking_domain;
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
> +static struct iommu_ops hv_iommu_ops;
> +static LIST_HEAD(hv_iommu_pci_bus_list);
> +static DEFINE_SPINLOCK(hv_iommu_pci_bus_lock);
> +
> +#define hv_iommu_present(iommu_cap) (iommu_cap &
> HV_IOMMU_CAP_PRESENT) +#define
> hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap &
> HV_IOMMU_CAP_S1_5LVL) +#define hv_iommu_ats_supported(iommu_cap)
> (iommu_cap & HV_IOMMU_CAP_ATS) + +int hv_iommu_register_pci_bus(int
> pci_domain_nr, u32 logical_dev_id_prefix) +{
> +	struct hv_pci_busdata *bus, *new;
> +	int ret = 0;
> +
> +	if (no_iommu || !iommu_detected)
> +		return 0;
> +
> +	new = kzalloc_obj(*new, GFP_KERNEL);
> +	if (!new)
> +		return -ENOMEM;
> +
> +	spin_lock(&hv_iommu_pci_bus_lock);
> +	list_for_each_entry(bus, &hv_iommu_pci_bus_list, list) {
> +		if (bus->pci_domain_nr != pci_domain_nr)
> +			continue;
> +
> +		if (bus->logical_dev_id_prefix !=
> logical_dev_id_prefix) {
> +			pr_err("stale registration for PCI domain %d
> (old prefix 0x%08x, new 0x%08x)\n",
> +			       pci_domain_nr,
> bus->logical_dev_id_prefix,
> +			       logical_dev_id_prefix);
> +			ret = -EEXIST;
> +		}
> +
> +		goto out_free;
> +	}
> +
> +	new->pci_domain_nr = pci_domain_nr;
> +	new->logical_dev_id_prefix = logical_dev_id_prefix;
> +	list_add(&new->list, &hv_iommu_pci_bus_list);
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +	return 0;
> +
> +out_free:
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +	kfree(new);
> +	return ret;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(hv_iommu_register_pci_bus, "pci-hyperv");
> +
> +void hv_iommu_unregister_pci_bus(int pci_domain_nr)
> +{
> +	struct hv_pci_busdata *bus, *tmp;
> +
> +	spin_lock(&hv_iommu_pci_bus_lock);
> +	list_for_each_entry_safe(bus, tmp, &hv_iommu_pci_bus_list,
> list) {
> +		if (bus->pci_domain_nr == pci_domain_nr) {
> +			list_del(&bus->list);
> +			kfree(bus);
> +			break;
> +		}
> +	}
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +}
> +EXPORT_SYMBOL_FOR_MODULES(hv_iommu_unregister_pci_bus, "pci-hyperv");
> +
> +/*
> + * Look up the logical device ID for a vPCI device. Returns 0 on
> success
> + * with *logical_id filled in; -ENODEV if no entry registered for
> this
> + * device's vPCI bus.
> + */
> +static int hv_iommu_lookup_logical_dev_id(struct pci_dev *pdev, u64
> *logical_id) +{
> +	struct hv_pci_busdata *bus;
> +	int domain = pci_domain_nr(pdev->bus);
> +	int ret = -ENODEV;
> +
> +	spin_lock(&hv_iommu_pci_bus_lock);
this is called under local_irq_save, should use  raw_spinlock_t for RT
kernel?

> +	list_for_each_entry(bus, &hv_iommu_pci_bus_list, list) {
> +		if (bus->pci_domain_nr == domain) {
> +			*logical_id =
> (u64)bus->logical_dev_id_prefix |
> +				      PCI_FUNC(pdev->devfn);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +	spin_unlock(&hv_iommu_pci_bus_lock);
> +	return ret;
> +}
> +
> +static int hv_create_device_domain(struct hv_iommu_domain
> *hv_domain, u32 domain_stage) +{
> +	int ret;
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_create_device_domain *input;
> +
> +	ret = ida_alloc_range(&hv_iommu_device->domain_ids,
> +			hv_iommu_device->first_domain,
> hv_iommu_device->last_domain,
> +			GFP_KERNEL);
> +	if (ret < 0)
> +		return ret;
> +
> +	hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	hv_domain->device_domain.domain_id.type = domain_stage;
> +	hv_domain->device_domain.domain_id.id = ret;
> +	hv_domain->hv_iommu = hv_iommu_device;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->create_device_domain_flags.forward_progress_required
> = 1;
> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("HVCALL_CREATE_DEVICE_DOMAIN failed, status
> %lld\n", status);
> +		ida_free(&hv_iommu_device->domain_ids,
> hv_domain->device_domain.domain_id.id);
> +	}
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void hv_delete_device_domain(struct hv_iommu_domain
> *hv_domain) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_delete_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_DELETE_DEVICE_DOMAIN failed, status
> %lld\n", status); +
> +	ida_free(&hv_domain->hv_iommu->domain_ids,
> hv_domain->device_domain.domain_id.id); +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +		return true;
> +	case IOMMU_CAP_DEFERRED_FLUSH:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status
> %lld\n", status); +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct
> device *dev) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +	struct pci_dev *pdev;
> +	struct hv_iommu_domain *hv_domain =
> to_hv_iommu_domain(domain);
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	dev_dbg(dev, "detaching from domain %d\n",
> hv_domain->device_domain.domain_id.id); +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	if (hv_iommu_lookup_logical_dev_id(pdev,
> &input->device_id.as_uint64)) {
> +		local_irq_restore(flags);
> +		dev_warn(&pdev->dev, "no IOMMU registration for vPCI
> bus on detach\n");
> +		return;
> +	}
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_DETACH_DEVICE_DOMAIN failed, status
> %lld\n", status); +
> +	hv_flush_device_domain(hv_domain);
> +
> +	vdev->hv_domain = NULL;
> +}
> +
> +static int hv_iommu_attach_dev(struct iommu_domain *domain, struct
> device *dev,
> +			       struct iommu_domain *old)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct pci_dev *pdev;
> +	struct hv_input_attach_device_domain *input;
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct hv_iommu_domain *hv_domain =
> to_hv_iommu_domain(domain);
> +	int ret;
> +
> +	/* Only allow PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return -EINVAL;
> +
> +	if (vdev->hv_domain == hv_domain)
> +		return 0;
> +
> +	if (vdev->hv_domain)
> +		hv_iommu_detach_dev(&vdev->hv_domain->domain, dev);
> +
> +	pdev = to_pci_dev(dev);
> +	dev_dbg(dev, "attaching to domain %d\n",
> +		hv_domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	ret = hv_iommu_lookup_logical_dev_id(pdev,
> &input->device_id.as_uint64);
> +	if (ret) {
> +		local_irq_restore(flags);
> +		dev_err(&pdev->dev, "no IOMMU registration for vPCI
> bus\n");
> +		return ret;
> +	}
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input,
> NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_ATTACH_DEVICE_DOMAIN failed, status
> %lld\n", status);
> +	else
> +		vdev->hv_domain = hv_domain;
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int hv_iommu_get_logical_device_property(struct device *dev,
> +					u32 code,
> +					struct
> hv_output_get_logical_device_property *property) +{
> +	u64 status, lid;
> +	unsigned long flags;
> +	int ret;
> +	struct hv_input_get_logical_device_property *input;
> +	struct hv_output_get_logical_device_property *output;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_input_arg) +
> sizeof(*input);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	ret = hv_iommu_lookup_logical_dev_id(to_pci_dev(dev), &lid);
> +	if (ret) {
> +		local_irq_restore(flags);
> +		return ret;
> +	}
> +	input->logical_device_id = lid;
> +	input->code = code;
> +	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY,
> input, output);
> +	*property = *output;
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed,
> status %lld\n", status); +
> +	return hv_result_to_errno(status);
> +}
> +
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	struct pci_dev *pdev;
> +	struct hv_iommu_endpoint *vdev;
> +	struct hv_output_get_logical_device_property
> device_iommu_property = {0}; +
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	pdev = to_pci_dev(dev);
> +
> +	if (hv_iommu_get_logical_device_property(dev,
> +
> HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
> +
> &device_iommu_property) ||
> +	    !(device_iommu_property.device_iommu &
> HV_DEVICE_IOMMU_ENABLED))
> +		return ERR_PTR(-ENODEV);
> +
> +	vdev = kzalloc_obj(*vdev, GFP_KERNEL);
> +	if (!vdev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	vdev->dev = dev;
> +	vdev->hv_iommu = hv_iommu_device;
> +	dev_iommu_priv_set(dev, vdev);
> +
> +	if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
> +	    pci_ats_supported(pdev))
> +		pci_enable_ats(pdev,
> __ffs(hv_iommu_device->pgsize_bitmap)); +
> +	return &vdev->hv_iommu->iommu;
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (pdev->ats_enabled)
> +		pci_disable_ats(pdev);
> +
> +	dev_iommu_priv_set(dev, NULL);
> +	set_dma_ops(dev, NULL);
I don't think this is necessary.

> +
> +	kfree(vdev);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +	else
> +		return generic_device_group(dev);
non pci device already rejected during attach, maybe we should warn
here?
        WARN_ON_ONCE(1);
        return generic_device_group(dev);

> +}
> +
> +static int hv_configure_device_domain(struct hv_iommu_domain
> *hv_domain, u32 domain_type) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct pt_iommu_x86_64_hw_info pt_info;
> +	struct hv_input_configure_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->settings.flags.blocked = (domain_type ==
> IOMMU_DOMAIN_BLOCKED);
> +	input->settings.flags.translation_enabled = (domain_type !=
> IOMMU_DOMAIN_IDENTITY); +
Should this be:
input->settings.flags.translation_enabled =
     (domain_type & __IOMMU_DOMAIN_PAGING);
Otherwise, blocked domain will have translation enabled. Maybe add some
explanation of what HV expects.

> +	if (domain_type & __IOMMU_DOMAIN_PAGING) {
> +		pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64,
> &pt_info);
> +		input->settings.page_table_root = pt_info.gcr3_pt;
> +		input->settings.flags.first_stage_paging_mode =
> +			pt_info.levels == 5;
> +	}
> +	status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN,
> input, NULL); +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_CONFIGURE_DEVICE_DOMAIN failed,
> status %lld\n", status); +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int __init hv_initialize_static_domains(void)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +
> +	/* Default stage-1 identity domain */
> +	hv_domain = &hv_identity_domain;
> +
> +	ret = hv_create_device_domain(hv_domain,
> HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		return ret;
> +
> +	ret = hv_configure_device_domain(hv_domain,
> IOMMU_DOMAIN_IDENTITY);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> +	hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap =
> hv_iommu_device->pgsize_bitmap; +
> +	/* Default stage-1 blocked domain */
> +	hv_domain = &hv_blocking_domain;
> +
> +	ret = hv_create_device_domain(hv_domain,
> HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	ret = hv_configure_device_domain(hv_domain,
> IOMMU_DOMAIN_BLOCKED);
> +	if (ret)
> +		goto delete_blocked_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> +	hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap =
> hv_iommu_device->pgsize_bitmap; +
> +	return 0;
> +
> +delete_blocked_domain:
> +	hv_delete_device_domain(&hv_blocking_domain);
> +delete_identity_domain:
> +	hv_delete_device_domain(&hv_identity_domain);
> +	return ret;
> +}
> +
> +#define INTERRUPT_RANGE_START	(0xfee00000)
> +#define INTERRUPT_RANGE_END	(0xfeefffff)
> +static void hv_iommu_get_resv_regions(struct device *dev,
> +		struct list_head *head)
> +{
> +	struct iommu_resv_region *region;
> +
> +	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> +				      INTERRUPT_RANGE_END -
> INTERRUPT_RANGE_START + 1,
> +				      0, IOMMU_RESV_MSI, GFP_KERNEL);
> +	if (!region)
> +		return;
> +
> +	list_add_tail(&region->list, head);
> +}
> +
> +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +}
> +
> +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> +				struct iommu_iotlb_gather
> *iotlb_gather) +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +
> +	iommu_put_pages_list(&iotlb_gather->freelist);
> +}
> +
> +static void hv_iommu_paging_domain_free(struct iommu_domain *domain)
> +{
> +	struct hv_iommu_domain *hv_domain =
> to_hv_iommu_domain(domain); +
> +	/* Free all remaining mappings */
> +	pt_iommu_deinit(&hv_domain->pt_iommu);
> +
> +	hv_delete_device_domain(hv_domain);
> +
> +	kfree(hv_domain);
> +}
> +
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +	IOMMU_PT_DOMAIN_OPS(x86_64),
> +	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
> +	.iotlb_sync = hv_iommu_iotlb_sync,
> +	.free = hv_iommu_paging_domain_free,
> +};
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
> device *dev) +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +	struct pt_iommu_x86_64_cfg cfg = {};
> +
> +	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> +	if (!hv_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = hv_create_device_domain(hv_domain,
> HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret) {
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->pt_iommu.nid = dev_to_node(dev);
> +
> +	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
> +	cfg.common.hw_max_oasz_lg2 = 52;
> +	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 :
> 3; +
> +	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64,
> &cfg, GFP_KERNEL);
> +	if (ret) {
> +		hv_delete_device_domain(hv_domain);
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	/* Constrain to page sizes the hypervisor supports */
> +	hv_domain->domain.pgsize_bitmap &=
> hv_iommu_device->pgsize_bitmap; +
> +	hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
> +
> +	ret = hv_configure_device_domain(hv_domain,
> __IOMMU_DOMAIN_PAGING);
> +	if (ret) {
> +		pt_iommu_deinit(&hv_domain->pt_iommu);
> +		hv_delete_device_domain(hv_domain);
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return &hv_domain->domain;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> +	.capable		  = hv_iommu_capable,
> +	.domain_alloc_paging	  = hv_iommu_domain_alloc_paging,
> +	.probe_device		  = hv_iommu_probe_device,
> +	.release_device		  = hv_iommu_release_device,
> +	.device_group		  = hv_iommu_device_group,
> +	.get_resv_regions	  = hv_iommu_get_resv_regions,
> +	.owner			  = THIS_MODULE,
> +	.identity_domain	  = &hv_identity_domain.domain,
> +	.blocked_domain		  =
> &hv_blocking_domain.domain,
> +	.release_domain		  =
> &hv_blocking_domain.domain, +};
> +
> +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities
> *hv_iommu_cap) +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_get_iommu_capabilities *input;
> +	struct hv_output_get_iommu_capabilities *output;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_input_arg) +
> sizeof(*input);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES,
> input, output);
> +	*hv_iommu_cap = *output;
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed, status
> %lld\n", status); +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void __init hv_init_iommu_device(struct hv_iommu_dev
> *hv_iommu,
> +			struct hv_output_get_iommu_capabilities
> *hv_iommu_cap) +{
> +	ida_init(&hv_iommu->domain_ids);
> +
> +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> +	    hv_iommu->max_iova_width > 48) {
> +		pr_info("5-level paging not supported, limiting iova
> width to 48.\n");
> +		hv_iommu->max_iova_width = 48;
> +	}
> +
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) <<
> hv_iommu->max_iova_width) - 1,
> +		.force_aperture = true,
> +	};
> +
> +	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> +	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> +	/* Only x86 page sizes (4K/2M/1G) are supported */
> +	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap &
> +				  (SZ_4K | SZ_2M | SZ_1G);
> +	if (hv_iommu->pgsize_bitmap != hv_iommu_cap->pgsize_bitmap)
> +		pr_warn("unsupported page sizes masked: 0x%llx ->
> 0x%llx\n",
> +			hv_iommu_cap->pgsize_bitmap,
> hv_iommu->pgsize_bitmap);
> +	if (!hv_iommu->pgsize_bitmap) {
> +		pr_warn("no supported page sizes, defaulting to
> 4K\n");
> +		hv_iommu->pgsize_bitmap = SZ_4K;
> +	}
> +	hv_iommu_device = hv_iommu;
> +}
> +
> +int __init hv_iommu_init(void)
> +{
> +	int ret = 0;
> +	struct hv_iommu_dev *hv_iommu = NULL;
> +	struct hv_output_get_iommu_capabilities hv_iommu_cap = {0};
> +
> +	if (no_iommu || iommu_detected)
> +		return -ENODEV;
> +
> +	if (!hv_is_hyperv_initialized())
> +		return -ENODEV;
> +
> +	ret = hv_iommu_detect(&hv_iommu_cap);
> +	if (ret) {
> +		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed: %d\n",
> ret);
> +		return -ENODEV;
> +	}
> +
> +	if (!hv_iommu_present(hv_iommu_cap.iommu_cap) ||
> +	    !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) {
> +		pr_err("IOMMU capabilities not sufficient:
> cap=0x%llx\n",
> +		       hv_iommu_cap.iommu_cap);
> +		return -ENODEV;
> +	}
> +
> +	iommu_detected = 1;
> +	pci_request_acs();
> +
> +	hv_iommu = kzalloc_obj(*hv_iommu, GFP_KERNEL);
> +	if (!hv_iommu)
> +		return -ENOMEM;
> +
> +	hv_init_iommu_device(hv_iommu, &hv_iommu_cap);
> +
> +	ret = hv_initialize_static_domains();
> +	if (ret) {
> +		pr_err("static domains init failed: %d\n", ret);
> +		goto err_free;
> +	}
> +
> +	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL,
> "%s", "hv-iommu");
> +	if (ret) {
> +		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
> +		goto err_delete_static_domains;
> +	}
> +
> +	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops,
> NULL);
> +	if (ret) {
> +		pr_err("iommu_device_register failed: %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	pr_info("successfully initialized\n");
> +	return 0;
> +
> +err_sysfs_remove:
> +	iommu_device_sysfs_remove(&hv_iommu->iommu);
> +err_delete_static_domains:
> +	hv_delete_device_domain(&hv_blocking_domain);
> +	hv_delete_device_domain(&hv_identity_domain);
> +err_free:
> +	kfree(hv_iommu);
> +	return ret;
> +}
> diff --git a/drivers/iommu/hyperv/iommu.h
> b/drivers/iommu/hyperv/iommu.h new file mode 100644
> index 000000000000..43f20d371245
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2024-2025, Microsoft, Inc.
> + *
> + */
> +
> +#ifndef _HYPERV_IOMMU_H
> +#define _HYPERV_IOMMU_H
> +
> +struct hv_iommu_dev {
> +	struct iommu_device iommu;
> +	struct ida domain_ids;
> +
> +	/* Device configuration */
> +	u8  max_iova_width;
> +	u8  max_pasid_width;
> +	u64 cap;
> +	u64 pgsize_bitmap;
> +
> +	struct iommu_domain_geometry geometry;
> +	u64 first_domain;
> +	u64 last_domain;
> +};
> +
> +struct hv_iommu_domain {
> +	union {
> +		struct iommu_domain    domain;
> +		struct pt_iommu        pt_iommu;
> +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> +	};
> +	struct hv_iommu_dev *hv_iommu;
> +	struct hv_input_device_domain device_domain;
> +	u64		pgsize_bitmap;
> +};
> +
> +struct hv_pci_busdata {
> +	int               pci_domain_nr;
> +	u32               logical_dev_id_prefix;
> +	struct list_head  list;
> +};
> +
> +struct hv_iommu_endpoint {
> +	struct device *dev;
> +	struct hv_iommu_dev *hv_iommu;
> +	struct hv_iommu_domain *hv_domain;
> +};
> +
> +#define to_hv_iommu_domain(d) \
> +	container_of(d, struct hv_iommu_domain, domain)
> +
> +#endif /* _HYPERV_IOMMU_H */
> diff --git a/drivers/pci/controller/pci-hyperv.c
> b/drivers/pci/controller/pci-hyperv.c index
> cfc8fa403dad..a4af9c8c2220 100644 ---
> a/drivers/pci/controller/pci-hyperv.c +++
> b/drivers/pci/controller/pci-hyperv.c @@ -3715,6 +3715,7 @@ static
> int hv_pci_probe(struct hv_device *hdev, struct hv_pcibus_device
> *hbus; int ret, dom;
>  	u16 dom_req;
> +	u32 prefix;
>  	char *name;
>  
>  	bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
> @@ -3857,13 +3858,25 @@ static int hv_pci_probe(struct hv_device
> *hdev, 
>  	hbus->state = hv_pcibus_probed;
>  
> -	ret = create_root_hv_pci_bus(hbus);
> +	/* Notify pvIOMMU before any device on the bus is scanned. */
> +	prefix = (hdev->dev_instance.b[5] << 24) |
> +		 (hdev->dev_instance.b[4] << 16) |
> +		 (hdev->dev_instance.b[7] <<  8) |
> +		 (hdev->dev_instance.b[6] & 0xf8);
> +
> +	ret = hv_iommu_register_pci_bus(dom, prefix);
>  	if (ret)
>  		goto free_windows;
>  
> +	ret = create_root_hv_pci_bus(hbus);
> +	if (ret)
> +		goto unregister_pviommu;
> +
>  	mutex_unlock(&hbus->state_lock);
>  	return 0;
>  
> +unregister_pviommu:
> +	hv_iommu_unregister_pci_bus(dom);
>  free_windows:
>  	hv_pci_free_bridge_windows(hbus);
>  exit_d0:
> @@ -3974,8 +3987,10 @@ static int hv_pci_bus_exit(struct hv_device
> *hdev, bool keep_devs) static void hv_pci_remove(struct hv_device
> *hdev) {
>  	struct hv_pcibus_device *hbus;
> +	int dom;
>  
>  	hbus = hv_get_drvdata(hdev);
> +	dom = hbus->bridge->domain_nr;
>  	if (hbus->state == hv_pcibus_installed) {
>  		tasklet_disable(&hdev->channel->callback_event);
>  		hbus->state = hv_pcibus_removing;
> @@ -3994,6 +4009,8 @@ static void hv_pci_remove(struct hv_device
> *hdev) hv_pci_remove_slots(hbus);
>  		pci_remove_root_bus(hbus->bridge->bus);
>  		pci_unlock_rescan_remove();
> +
> +		hv_iommu_unregister_pci_bus(dom);
>  	}
>  
>  	hv_pci_bus_exit(hdev, false);
> diff --git a/include/asm-generic/mshyperv.h
> b/include/asm-generic/mshyperv.h index bf601d67cecb..b71345c74568
> 100644 --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -73,6 +73,18 @@ extern enum hv_partition_type
> hv_curr_partition_type; extern void * __percpu *hyperv_pcpu_input_arg;
>  extern void * __percpu *hyperv_pcpu_output_arg;
>  
> +#ifdef CONFIG_HYPERV_PVIOMMU
> +int  hv_iommu_register_pci_bus(int pci_domain_nr, u32
> logical_dev_id_prefix); +void hv_iommu_unregister_pci_bus(int
> pci_domain_nr); +#else
> +static inline int hv_iommu_register_pci_bus(int pci_domain_nr,
> +					    u32
> logical_dev_id_prefix) +{
> +	return 0;
> +}
> +static inline void hv_iommu_unregister_pci_bus(int pci_domain_nr) { }
> +#endif
> +
>  u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>  u64 hv_do_fast_hypercall8(u16 control, u64 input8);
>  u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);


^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH net-next] net: mana: Add handler for sriov configure
From: Leon Romanovsky @ 2026-05-13 18:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Haiyang Zhang, Haiyang Zhang, Paul Rosswurm,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	KY Srinivasan, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Bjorn Helgaas, Simon Horman, Shradha Gupta, Dipayaan Roy,
	Erni Sri Satya Vennela, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org
In-Reply-To: <20260508231029.GA44712@bhelgaas>

On Fri, May 08, 2026 at 06:10:29PM -0500, Bjorn Helgaas wrote:
> On Fri, May 08, 2026 at 10:47:14PM +0000, Haiyang Zhang wrote:
> > > -----Original Message-----
> > > From: Bjorn Helgaas <helgaas@kernel.org>
> > > Sent: Friday, May 8, 2026 6:38 PM
> > > To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> > > Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> > > <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> > > <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> > > <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> > > Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> > > Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Bjorn Helgaas
> > > <bhelgaas@google.com>; Simon Horman <horms@kernel.org>; Shradha Gupta
> > > <shradhagupta@linux.microsoft.com>; Dipayaan Roy
> > > <dipayanroy@linux.microsoft.com>; Erni Sri Satya Vennela
> > > <ernis@linux.microsoft.com>; linux-kernel@vger.kernel.org; linux-
> > > pci@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> > > Subject: [EXTERNAL] Re: [PATCH net-next] net: mana: Add handler for sriov
> > > configure
> > > 
> > > On Fri, May 08, 2026 at 03:04:06PM -0700, Haiyang Zhang wrote:
> > > > From: Haiyang Zhang <haiyangz@microsoft.com>
> > > >
> > > > Add callback function for the pci_driver, sriov_configure.
> > > >
> > > > Also disable VF autoprobe when it runs as PF driver on bare metal,
> > > > since the hardware side may not have the VF ready immediately.
> > > >
> > > > Export pci_vf_drivers_autoprobe() so the driver can toggle the VF
> > > > autoprobe flag.
> > > 
> > > Technically pci_vf_drivers_autoprobe() doesn't *toggle* the autoprobe
> > > flag.  That would mean setting it to the opposite of its current
> > > value.
> > > 
> > > Here I would say "so the driver can prevent autoprobing of the VFs",
> > > which is the intent.
> > Thanks, I will change the wording.
> > 
> > > 
> > > Out of curiosity, how do the VFs eventually get probed?  I guess
> > > there's some other mechanism that tells you when they're ready, and
> > > you manually use sysfs 'sriov_drivers_autoprobe' to enable probing,
> > > then bind drivers to them via sysfs?
> > We have a user program talking to the Azure backplane to get that information.
> > @Paul Rosswurm, do you have more details?
> > 
> > 
> > > The prevention of autoprobing sounds like a critical part of this
> > > change; might be worth saying something in the subject, because "add
> > > sriov configure" doesn't include much information.
> > How about "Add handler for sriov configure with VF autoprobe off"?
> 
> OK by me :)

Bjorn,

I believe it is the wrong decision to allow toggling a user‑visible knob  
without the user’s awareness. In this case, they can either disable  
autoprobe on the PF or rely on EPROBE_DEFER. In all cases, the same
functionality can be achieved without changing PCI autoprobe code.

Thanks.

> 

^ permalink raw reply

* [PATCH v3 00/11] mshv: Refactor memory region management and map pages at creation
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel

This series refactors the mshv memory region subsystem in preparation
for mapping populated pages into the hypervisor at movable region
creation time, rather than relying solely on demand faulting.

The primary motivation is to ensure that when userspace passes a
pre-populated mapping for a movable memory region, those pages are
immediately visible to the hypervisor. Previously, all movable regions
were created with HV_MAP_GPA_NO_ACCESS on every page regardless of
whether the backing pages were already present, deferring all mapping
to the fault handler. This added unnecessary fault overhead and
complicated the initial setup of child partitions with pre-populated
memory.

v3:
 - Fixed a few issues.
 - Reworked the code flow to simplify the logic.

v2:
 - Rebased on top of latest mainline, simplified the check for valid PFNs,
   added other minor cleanups and improvements.

---

Stanislav Kinsburskii (11):
      mshv: Don't reject huge-page stride on unaligned region length
      mshv: Don't request HMM write fault for read-only regions
      mshv: Convert region storage from page pointers to PFNs
      mshv: Refactor region segmentation into a dedicated helper
      mshv: Support address range holes in remapping
      mshv: Iterate VMAs when faulting in region pages
      mshv: Scale fault granularity for non-4 KiB host pages
      mshv: Move pinned region setup to mshv_regions.c
      mshv: Map populated pages on movable region creation
      mshv: Extract MMIO region mapping into separate function
      mshv: Add tracepoint for map GPA hypercall

 drivers/hv/mshv_regions.c      |  636 ++++++++++++++++++++++++++++------------
 drivers/hv/mshv_root.h         |   30 +-
 drivers/hv/mshv_root_hv_call.c |   52 ++-
 drivers/hv/mshv_root_main.c    |  105 +------
 drivers/hv/mshv_trace.h        |   36 ++
 5 files changed, 530 insertions(+), 329 deletions(-)

^ permalink raw reply

* [PATCH v3 01/11] mshv: Don't reject huge-page stride on unaligned region length
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_chunk_stride() rejected huge-page stride whenever pfn_count was
not a multiple of PTRS_PER_PMD, even though the same-stride scan in
mshv_region_process_pfns() already handles the transition between
huge-page and 4K stride for the tail of a run. As a result, a region
backed by a 2 MiB folio with a length that wasn't a multiple of 512
PFNs was processed entirely at 4K stride, issuing one hypercall per
PFN instead of one per 2 MiB.

Reject huge-page stride only when fewer than PTRS_PER_PMD PFNs are
available from this point.  The starting GFN alignment requirement is
unchanged.  The same-stride scan continues to handle the tail
correctly.

Fixes: 259add0d982cb ("mshv: Align huge page stride with guest mapping")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..81e57f727be35 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -43,7 +43,7 @@ static int mshv_chunk_stride(struct page *page,
 	 */
 	if (!PageCompound(page) || !PageHead(page) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
-	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
+	    page_count < PTRS_PER_PMD)
 		return 1;

 	page_order = folio_order(page_folio(page));

^ permalink raw reply related

* [PATCH v3 02/11] mshv: Don't request HMM write fault for read-only regions
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_range_fault() unconditionally sets HMM_PFN_REQ_WRITE on
the hmm_range_fault() request.  When the region was created without
MSHV_SET_MEM_BIT_WRITABLE (so region->hv_map_flags has no
HV_MAP_GPA_WRITABLE), the request still asks HMM for writable pages.
On read-only mappings this causes hmm_range_fault() to break
copy-on-write — for example, the shared zero page or file-backed
pages — granting the guest a private writable copy of memory that
host policy intended to keep shared.

Gate HMM_PFN_REQ_WRITE on the region's HV_MAP_GPA_WRITABLE bit so
that read-only regions request read-only faults.

Note: this still asks for write on writable regions even if the
backing VMA is read-only.  A more thorough check would also consult
each VMA's vm_flags inside the fault loop; that requires iterating
VMAs and is left for a follow-up.

Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 81e57f727be35..d9e1fbfefe714 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -441,7 +441,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
-		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
+		.default_flags = HMM_PFN_REQ_FAULT,
 	};
 	unsigned long *pfns;
 	int ret;
@@ -455,6 +455,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
 	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;

+	/*
+	 * Only request writable pages from HMM when the region itself
+	 * permits writes.  Without this, hmm_range_fault() would
+	 * trigger COW on read-only regions, breaking copy-on-write
+	 * semantics on shared host pages.
+	 */
+	if (region->hv_map_flags & HV_MAP_GPA_WRITABLE)
+		range.default_flags |= HMM_PFN_REQ_WRITE;
+
 	do {
 		ret = mshv_region_hmm_fault_and_lock(region, &range);
 	} while (ret == -EBUSY);

^ permalink raw reply related

* [PATCH v3 03/11] mshv: Convert region storage from page pointers to PFNs
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The HMM interface returns PFNs from hmm_range_fault(), and the
hypervisor hypercalls in this driver consume PFNs.  Storing struct
page pointers in mshv_mem_region between those two interfaces forces
a page_to_pfn() conversion at every hypercall site and a
hmm_pfn_to_page() conversion at the HMM fault site, neither of
which needs the struct page form.

Switch mreg_pages[] to mreg_pfns[] and propagate through
hv_call_map_ram_pfns(), hv_call_map_mmio_pfns(), hv_call_unmap_pfns(),
and hv_call_modify_spa_host_access(). Per-PFN conversions at those
sites go away; the HMM fault path writes hmm_range_fault() output
straight into mreg_pfns[] after stripping HMM_PFN_FLAGS. Convert
back to struct page only where the API requires it (e.g.
unpin_user_page() during region invalidation).

Use ULONG_MAX as the invalid-PFN sentinel (MSHV_INVALID_PFN); 0 is a
valid PFN on some architectures. mshv_region_init_pfns() centralises
the "mark slots empty" pattern so callers agree on the sentinel.

Trade-off: the bulk unpin_user_pages() call is replaced with a
per-PFN unpin_user_page() loop, losing per-folio coalescing for
huge-folio-backed pinned regions. The cost is paid only on region
teardown and MMU-notifier invalidation (it it will ever be added for pinned
regions).

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c      |  325 +++++++++++++++++++++++-----------------
 drivers/hv/mshv_root.h         |   21 +--
 drivers/hv/mshv_root_hv_call.c |   49 +++---
 drivers/hv/mshv_root_main.c    |   33 ++--
 4 files changed, 239 insertions(+), 189 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index d9e1fbfefe714..77fc94733cb20 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -18,12 +18,32 @@
 #include "mshv_root.h"
 
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
+#define MSHV_INVALID_PFN				ULONG_MAX
+
+static inline bool mshv_pfn_valid(unsigned long pfn)
+{
+	return pfn != MSHV_INVALID_PFN;
+}
+
+static void mshv_region_init_pfns_range(struct mshv_mem_region *region,
+					u64 pfn_offset, u64 pfn_count)
+{
+	u64 i;
+
+	for (i = pfn_offset; i < pfn_offset + pfn_count; i++)
+		region->mreg_pfns[i] = MSHV_INVALID_PFN;
+}
+
+void mshv_region_init_pfns(struct mshv_mem_region *region)
+{
+	mshv_region_init_pfns_range(region, 0, region->nr_pfns);
+}
 
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
- * @page      : The page to check for huge page backing
- * @gfn       : Guest frame number for the mapping
- * @page_count: Total number of pages in the mapping
+ * @page     : The page to check for huge page backing
+ * @gfn      : Guest frame number for the mapping
+ * @pfn_count: Total number of pages in the mapping
  *
  * Determines the appropriate stride (in pages) for mapping guest memory.
  * Uses huge page stride if the backing page is huge and the guest mapping
@@ -32,18 +52,19 @@
  * Return: Stride in pages, or -EINVAL if page order is unsupported.
  */
 static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 page_count)
+			     u64 gfn, u64 pfn_count)
 {
 	unsigned int page_order;
 
 	/*
 	 * Use single page stride by default. For huge page stride, the
 	 * page must be compound and point to the head of the compound
-	 * page, and both gfn and page_count must be huge-page aligned.
+	 * page, gfn must be huge-page aligned and pfn_count must be at
+	 * least the number of pages in a huge page.
 	 */
 	if (!PageCompound(page) || !PageHead(page) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
-	    page_count < PTRS_PER_PMD)
+	    pfn_count < PTRS_PER_PMD)
 		return 1;
 
 	page_order = folio_order(page_folio(page));
@@ -57,60 +78,61 @@ static int mshv_chunk_stride(struct page *page,
 /**
  * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
  *                             in a region.
- * @region     : Pointer to the memory region structure.
- * @flags      : Flags to pass to the handler.
- * @page_offset: Offset into the region's pages array to start processing.
- * @page_count : Number of pages to process.
- * @handler    : Callback function to handle the chunk.
+ * @region    : Pointer to the memory region structure.
+ * @flags     : Flags to pass to the handler.
+ * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_count : Number of PFNs to process.
+ * @handler   : Callback function to handle the chunk.
  *
- * This function scans the region's pages starting from @page_offset,
- * checking for contiguous present pages of the same size (normal or huge).
- * It invokes @handler for the chunk of contiguous pages found. Returns the
- * number of pages handled, or a negative error code if the first page is
- * not present or the handler fails.
+ * This function scans the region's PFNs starting from @pfn_offset,
+ * checking for contiguous valid PFNs backed by pages of the same size
+ * (normal or huge). It invokes @handler for the chunk of contiguous valid
+ * PFNs found. Returns the number of PFNs handled, or a negative error code
+ * if the first PFN is invalid or the handler fails.
  *
- * Note: The @handler callback must be able to handle both normal and huge
- * pages.
+ * Note: The @handler callback must be able to handle valid PFNs backed by
+ * both normal and huge pages.
  *
  * Return: Number of pages handled, or negative error code.
  */
-static long mshv_region_process_chunk(struct mshv_mem_region *region,
-				      u32 flags,
-				      u64 page_offset, u64 page_count,
-				      int (*handler)(struct mshv_mem_region *region,
-						     u32 flags,
-						     u64 page_offset,
-						     u64 page_count,
-						     bool huge_page))
+static long mshv_region_process_pfns(struct mshv_mem_region *region,
+				     u32 flags,
+				     u64 pfn_offset, u64 pfn_count,
+				     int (*handler)(struct mshv_mem_region *region,
+						    u32 flags,
+						    u64 pfn_offset,
+						    u64 pfn_count,
+						    bool huge_page))
 {
-	u64 gfn = region->start_gfn + page_offset;
+	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count;
-	struct page *page;
+	unsigned long pfn;
 	int stride, ret;
 
-	page = region->mreg_pages[page_offset];
-	if (!page)
+	pfn = region->mreg_pfns[pfn_offset];
+	if (!mshv_pfn_valid(pfn))
 		return -EINVAL;
 
-	stride = mshv_chunk_stride(page, gfn, page_count);
+	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
 	if (stride < 0)
 		return stride;
 
 	/* Start at stride since the first stride is validated */
-	for (count = stride; count < page_count; count += stride) {
-		page = region->mreg_pages[page_offset + count];
+	for (count = stride; count < pfn_count ; count += stride) {
+		pfn = region->mreg_pfns[pfn_offset + count];
 
-		/* Break if current page is not present */
-		if (!page)
+		/* Break if current pfn is invalid */
+		if (!mshv_pfn_valid(pfn))
 			break;
 
 		/* Break if stride size changes */
-		if (stride != mshv_chunk_stride(page, gfn + count,
-						page_count - count))
+		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
+						gfn + count,
+						pfn_count - count))
 			break;
 	}
 
-	ret = handler(region, flags, page_offset, count, stride > 1);
+	ret = handler(region, flags, pfn_offset, count, stride > 1);
 	if (ret)
 		return ret;
 
@@ -118,70 +140,72 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 }
 
 /**
- * mshv_region_process_range - Processes a range of memory pages in a
- *                             region.
- * @region     : Pointer to the memory region structure.
- * @flags      : Flags to pass to the handler.
- * @page_offset: Offset into the region's pages array to start processing.
- * @page_count : Number of pages to process.
- * @handler    : Callback function to handle each chunk of contiguous
- *               pages.
+ * mshv_region_process_range - Processes a range of PFNs in a region.
+ * @region    : Pointer to the memory region structure.
+ * @flags     : Flags to pass to the handler.
+ * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_count : Number of PFNs to process.
+ * @handler   : Callback function to handle each chunk of contiguous
+ *              valid PFNs.
  *
- * Iterates over the specified range of pages in @region, skipping
- * non-present pages. For each contiguous chunk of present pages, invokes
- * @handler via mshv_region_process_chunk.
+ * Iterates over the specified range of PFNs in @region, skipping
+ * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
+ * @handler via mshv_region_process_pfns.
  *
- * Note: The @handler callback must be able to handle both normal and huge
- * pages.
+ * Note: The @handler callback must be able to handle PFNs backed by both
+ * normal and huge pages.
  *
  * Returns 0 on success, or a negative error code on failure.
  */
 static int mshv_region_process_range(struct mshv_mem_region *region,
 				     u32 flags,
-				     u64 page_offset, u64 page_count,
+				     u64 pfn_offset, u64 pfn_count,
 				     int (*handler)(struct mshv_mem_region *region,
 						    u32 flags,
-						    u64 page_offset,
-						    u64 page_count,
+						    u64 pfn_offset,
+						    u64 pfn_count,
 						    bool huge_page))
 {
+	u64 end;
 	long ret;
 
-	if (page_offset + page_count > region->nr_pages)
+	if (check_add_overflow(pfn_offset, pfn_count, &end))
+		return -EOVERFLOW;
+
+	if (end > region->nr_pfns)
 		return -EINVAL;
 
-	while (page_count) {
+	while (pfn_count) {
 		/* Skip non-present pages */
-		if (!region->mreg_pages[page_offset]) {
-			page_offset++;
-			page_count--;
+		if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset])) {
+			pfn_offset++;
+			pfn_count--;
 			continue;
 		}
 
-		ret = mshv_region_process_chunk(region, flags,
-						page_offset,
-						page_count,
-						handler);
+		ret = mshv_region_process_pfns(region, flags,
+					       pfn_offset, pfn_count,
+					       handler);
 		if (ret < 0)
 			return ret;
 
-		page_offset += ret;
-		page_count -= ret;
+		pfn_offset += ret;
+		pfn_count -= ret;
 	}
 
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
+struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
 					   u64 uaddr, u32 flags)
 {
 	struct mshv_mem_region *region;
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	region = vzalloc(struct_size(region, mreg_pfns, nr_pfns));
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
-	region->nr_pages = nr_pages;
+	region->nr_pfns = nr_pfns;
 	region->start_gfn = guest_pfn;
 	region->start_uaddr = uaddr;
 	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
@@ -190,6 +214,8 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
+	mshv_region_init_pfns(region);
+
 	kref_init(&region->mreg_refcount);
 
 	return region;
@@ -197,15 +223,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 
 static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pages + page_offset,
-					      page_count,
+					      region->mreg_pfns + pfn_offset,
+					      pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
 					      flags, true);
@@ -216,21 +242,21 @@ int mshv_region_share(struct mshv_mem_region *region)
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
 	return mshv_region_process_range(region, flags,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_share);
 }
 
 static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u32 flags,
-				     u64 page_offset, u64 page_count,
+				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pages + page_offset,
-					      page_count, 0,
+					      region->mreg_pfns + pfn_offset,
+					      pfn_count, 0,
 					      flags, false);
 }
 
@@ -239,30 +265,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
 	return mshv_region_process_range(region, flags,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_unshare);
 }
 
 static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MAP_GPA_LARGE_PAGE;
 
-	return hv_call_map_gpa_pages(region->partition->pt_id,
-				     region->start_gfn + page_offset,
-				     page_count, flags,
-				     region->mreg_pages + page_offset);
+	return hv_call_map_ram_pfns(region->partition->pt_id,
+				    region->start_gfn + pfn_offset,
+				    pfn_count, flags,
+				    region->mreg_pfns + pfn_offset);
 }
 
-static int mshv_region_remap_pages(struct mshv_mem_region *region,
-				   u32 map_flags,
-				   u64 page_offset, u64 page_count)
+static int mshv_region_remap_pfns(struct mshv_mem_region *region,
+				  u32 map_flags,
+				  u64 pfn_offset, u64 pfn_count)
 {
 	return mshv_region_process_range(region, map_flags,
-					 page_offset, page_count,
+					 pfn_offset, pfn_count,
 					 mshv_region_chunk_remap);
 }
 
@@ -270,38 +296,50 @@ int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
-	return mshv_region_remap_pages(region, map_flags,
-				       0, region->nr_pages);
+	return mshv_region_remap_pfns(region, map_flags,
+				      0, region->nr_pfns);
 }
 
-static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
-					 u64 page_offset, u64 page_count)
+static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
+					u64 pfn_offset, u64 pfn_count)
 {
-	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
-		unpin_user_pages(region->mreg_pages + page_offset, page_count);
+	u64 i;
+
+	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
+		if (!mshv_pfn_valid(region->mreg_pfns[i]))
+			continue;
 
-	memset(region->mreg_pages + page_offset, 0,
-	       page_count * sizeof(struct page *));
+		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
+	}
+
+	mshv_region_init_pfns_range(region, pfn_offset, pfn_count);
 }
 
 void mshv_region_invalidate(struct mshv_mem_region *region)
 {
-	mshv_region_invalidate_pages(region, 0, region->nr_pages);
+	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
 int mshv_region_pin(struct mshv_mem_region *region)
 {
-	u64 done_count, nr_pages;
+	u64 done_count, nr_pfns, i;
+	unsigned long *pfns;
 	struct page **pages;
 	__u64 userspace_addr;
 	int ret;
 
-	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
-		pages = region->mreg_pages + done_count;
+	pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
+			      sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
+		pfns = region->mreg_pfns + done_count;
 		userspace_addr = region->start_uaddr +
 				 done_count * HV_HYP_PAGE_SIZE;
-		nr_pages = min(region->nr_pages - done_count,
-			       MSHV_PIN_PAGES_BATCH_SIZE);
+		nr_pfns = min(region->nr_pfns - done_count,
+			      MSHV_PIN_PAGES_BATCH_SIZE);
 
 		/*
 		 * Pinning assuming 4k pages works for large pages too.
@@ -311,39 +349,50 @@ int mshv_region_pin(struct mshv_mem_region *region)
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+		ret = pin_user_pages_fast(userspace_addr, nr_pfns,
 					  FOLL_WRITE | FOLL_LONGTERM,
 					  pages);
-		if (ret != nr_pages)
+
+		for (i = 0; i < ret; i++)
+			pfns[i] = page_to_pfn(pages[i]);
+
+		/*
+		 * Demand all requested pages were successfully pinned
+		 * or fail otherwise.
+		 */
+		if (ret != nr_pfns)
 			goto release_pages;
+
 	}
 
+	kfree(pages);
 	return 0;
 
 release_pages:
 	if (ret > 0)
 		done_count += ret;
-	mshv_region_invalidate_pages(region, 0, done_count);
+	mshv_region_invalidate_pfns(region, 0, done_count);
+	kfree(pages);
 	return ret < 0 ? ret : -ENOMEM;
 }
 
 static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_UNMAP_GPA_LARGE_PAGE;
 
-	return hv_call_unmap_gpa_pages(region->partition->pt_id,
-				       region->start_gfn + page_offset,
-				       page_count, flags);
+	return hv_call_unmap_pfns(region->partition->pt_id,
+				  region->start_gfn + pfn_offset,
+				  pfn_count, flags);
 }
 
 static int mshv_region_unmap(struct mshv_mem_region *region)
 {
 	return mshv_region_process_range(region, 0,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_unmap);
 }
 
@@ -427,8 +476,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 /**
  * mshv_region_range_fault - Handle memory range faults for a given region.
  * @region: Pointer to the memory region structure.
- * @page_offset: Offset of the page within the region.
- * @page_count: Number of pages to handle.
+ * @pfn_offset: Offset of the page within the region.
+ * @pfn_count: Number of pages to handle.
  *
  * This function resolves memory faults for a specified range of pages
  * within a memory region. It uses HMM (Heterogeneous Memory Management)
@@ -437,7 +486,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
  * Return: 0 on success, negative error code on failure.
  */
 static int mshv_region_range_fault(struct mshv_mem_region *region,
-				   u64 page_offset, u64 page_count)
+				   u64 pfn_offset, u64 pfn_count)
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
@@ -447,13 +496,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	int ret;
 	u64 i;
 
-	pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
+	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
 	if (!pfns)
 		return -ENOMEM;
 
 	range.hmm_pfns = pfns;
-	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
-	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
 
 	/*
 	 * Only request writable pages from HMM when the region itself
@@ -471,11 +520,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	if (ret)
 		goto out;
 
-	for (i = 0; i < page_count; i++)
-		region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+	for (i = 0; i < pfn_count; i++) {
+		if (!(pfns[i] & HMM_PFN_VALID))
+			continue;
+		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
+		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
+	}
 
-	ret = mshv_region_remap_pages(region, region->hv_map_flags,
-				      page_offset, page_count);
+	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
+				     pfn_offset, pfn_count);
 
 	mutex_unlock(&region->mreg_mutex);
 out:
@@ -485,24 +538,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
 {
-	u64 page_offset, page_count;
+	u64 pfn_offset, pfn_count;
 	int ret;
 
 	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
-	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
-				 MSHV_MAP_FAULT_IN_PAGES);
+	pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				MSHV_MAP_FAULT_IN_PAGES);
 
 	/* Map more pages than requested to reduce the number of faults. */
-	page_count = min(region->nr_pages - page_offset,
-			 MSHV_MAP_FAULT_IN_PAGES);
+	pfn_count = min(region->nr_pfns - pfn_offset,
+			MSHV_MAP_FAULT_IN_PAGES);
 
-	ret = mshv_region_range_fault(region, page_offset, page_count);
+	ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
 
 	WARN_ONCE(ret,
-		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
 		  region->partition->pt_id, region->start_uaddr,
-		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
-		  gfn, page_offset, page_count);
+		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
+		  gfn, pfn_offset, pfn_count);
 
 	return !ret;
 }
@@ -532,16 +585,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	struct mshv_mem_region *region = container_of(mni,
 						      struct mshv_mem_region,
 						      mreg_mni);
-	u64 page_offset, page_count;
+	u64 pfn_offset, pfn_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
 
 	mstart = max(range->start, region->start_uaddr);
 	mend = min(range->end, region->start_uaddr +
-		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+		   (region->nr_pfns << HV_HYP_PAGE_SHIFT));
 
-	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
-	page_count = HVPFN_DOWN(mend - mstart);
+	pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	pfn_count = HVPFN_DOWN(mend - mstart);
 
 	if (mmu_notifier_range_blockable(range))
 		mutex_lock(&region->mreg_mutex);
@@ -550,12 +603,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
-				      page_offset, page_count);
+	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
+				     pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 
-	mshv_region_invalidate_pages(region, page_offset, page_count);
+	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
 
 	mutex_unlock(&region->mreg_mutex);
 
@@ -567,9 +620,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	WARN_ONCE(ret,
 		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
 		  region->start_uaddr,
-		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
 		  range->start, range->end, range->event,
-		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+		  pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
 	return false;
 }
 
@@ -588,7 +641,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
 
 	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
 					   region->start_uaddr,
-					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
 					   &mshv_region_mni_ops);
 	if (ret)
 		return false;
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e19a84ea07905..34c9b57c50f47 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -85,15 +85,15 @@ enum mshv_region_type {
 struct mshv_mem_region {
 	struct hlist_node hnode;
 	struct kref mreg_refcount;
-	u64 nr_pages;
+	u64 nr_pfns;
 	u64 start_gfn;
 	u64 start_uaddr;
 	u32 hv_map_flags;
 	struct mshv_partition *partition;
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
-	struct mutex mreg_mutex;	/* protects region pages remapping */
-	struct page *mreg_pages[];
+	struct mutex mreg_mutex;	/* protects region PFNs remapping */
+	unsigned long mreg_pfns[];
 };
 
 struct mshv_irq_ack_notifier {
@@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
 int hv_call_initialize_partition(u64 partition_id);
 int hv_call_finalize_partition(u64 partition_id);
 int hv_call_delete_partition(u64 partition_id);
-int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
-int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			  u32 flags, struct page **pages);
-int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			    u32 flags);
+int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
+int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
+			 u32 flags, unsigned long *pfns);
+int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
+		       u32 flags);
 int hv_call_delete_vp(u64 partition_id, u32 vp_index);
 int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
 				     u64 dest_addr,
@@ -332,8 +332,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
 int hv_unmap_stats_page(enum hv_stats_object_type type,
 			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity);
-int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
-				   u64 page_struct_count, u32 host_access,
+int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
+				   u64 pfns_count, u32 host_access,
 				   u32 flags, u8 acquire);
 int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
 				      void *property_value, size_t property_value_sz);
@@ -375,6 +375,7 @@ int mshv_region_share(struct mshv_mem_region *region);
 int mshv_region_unshare(struct mshv_mem_region *region);
 int mshv_region_map(struct mshv_mem_region *region);
 void mshv_region_invalidate(struct mshv_mem_region *region);
+void mshv_region_init_pfns(struct mshv_mem_region *region);
 int mshv_region_pin(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index cc580225e9e45..00d37c39cbf26 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
 	return hv_result_to_errno(status);
 }
 
-/* Ask the hypervisor to map guest ram pages or the guest mmio space */
-static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
-			       u32 flags, struct page **pages, u64 mmio_spa)
+static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
+			  u32 flags, unsigned long *pfns, u64 mmio_spa)
 {
 	struct hv_input_map_gpa_pages *input_page;
 	u64 status, *pfnlist;
 	unsigned long irq_flags, large_shift = 0;
-	u64 done = 0, page_count = page_struct_count;
+	u64 done = 0, page_count = pfns_count;
 	int ret = 0;
 
-	if (page_count == 0 || (pages && mmio_spa))
+	if (page_count == 0 || (pfns && mmio_spa))
 		return -EINVAL;
 
 	if (flags & HV_MAP_GPA_LARGE_PAGE) {
@@ -227,9 +226,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		for (i = 0; i < rep_count; i++) {
 			if (flags & HV_MAP_GPA_NO_ACCESS)
 				pfnlist[i] = 0;
-			else if (pages)
-				pfnlist[i] = page_to_pfn(pages[(done + i) <<
-							 large_shift]);
+			else if (pfns)
+				pfnlist[i] = pfns[(done + i) << large_shift];
 			else
 				pfnlist[i] = mmio_spa + done + i;
 		}
@@ -257,38 +255,38 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 
 		if (flags & HV_MAP_GPA_LARGE_PAGE)
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-		hv_call_unmap_gpa_pages(partition_id, gfn,
-					done << large_shift, unmap_flags);
+		hv_call_unmap_pfns(partition_id, gfn,
+				   done << large_shift, unmap_flags);
 	}
 
 	return ret;
 }
 
 /* Ask the hypervisor to map guest ram pages */
-int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			  u32 flags, struct page **pages)
+int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
+			 u32 flags, unsigned long *pfns)
 {
-	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
-				   flags, pages, 0);
+	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
+			      pfns, 0);
 }
 
-/* Ask the hypervisor to map guest mmio space */
-int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
+int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
+			  u64 pfn_count)
 {
 	int i;
 	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
 		    HV_MAP_GPA_NOT_CACHED;
 
-	for (i = 0; i < numpgs; i++)
+	for (i = 0; i < pfn_count; i++)
 		if (page_is_ram(mmio_spa + i))
 			return -EINVAL;
 
-	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
-				   mmio_spa);
+	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
+			      NULL, mmio_spa);
 }
 
-int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
-			    u32 flags)
+int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
+		       u32 flags)
 {
 	struct hv_input_unmap_gpa_pages *input_page;
 	u64 status, page_count = page_count_4k;
@@ -1036,15 +1034,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
 	return ret;
 }
 
-int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
-				   u64 page_struct_count, u32 host_access,
+int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
+				   u64 pfns_count, u32 host_access,
 				   u32 flags, u8 acquire)
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
 	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
-	u64 page_count = page_struct_count;
+	u64 page_count = pfns_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
 			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
 
@@ -1076,8 +1074,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->host_access = host_access;
 
 		for (i = 0; i < rep_count; i++)
-			input_page->spa_page_list[i] =
-				page_to_pfn(pages[(done + i) << large_shift]);
+			input_page->spa_page_list[i] = pfns[(done + i) << large_shift];
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 03c65ff6a7397..986cb9a4c428e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 
 	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
 		if (gfn >= region->start_gfn &&
-		    gfn < region->start_gfn + region->nr_pages)
+		    gfn < region->start_gfn + region->nr_pfns)
 			return region;
 	}
 
@@ -1313,20 +1313,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 					bool is_mmio)
 {
 	struct mshv_mem_region *rg;
-	u64 nr_pages = HVPFN_DOWN(mem->size);
+	u64 nr_pfns = HVPFN_DOWN(mem->size);
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
-		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
-		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
+		if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
+		    rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
 			continue;
 		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EEXIST;
 	}
 	spin_unlock(&partition->pt_mem_regions_lock);
 
-	rg = mshv_region_create(mem->guest_pfn, nr_pages,
+	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
 				mem->userspace_addr, mem->flags);
 	if (IS_ERR(rg))
 		return PTR_ERR(rg);
@@ -1412,8 +1412,7 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 		 * is intentional; unpinning host-inaccessible pages would be
 		 * unsafe).
 		 */
-		memset(region->mreg_pages, 0,
-		       region->nr_pages * sizeof(region->mreg_pages[0]));
+		mshv_region_init_pfns(region);
 		goto err_out;
 	}
 err_out:
@@ -1470,21 +1469,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		 * the hypervisor track dirty pages, enabling pre-copy live
 		 * migration.
 		 */
-		ret = hv_call_map_gpa_pages(partition->pt_id,
-					    region->start_gfn,
-					    region->nr_pages,
-					    HV_MAP_GPA_NO_ACCESS, NULL);
+		ret = hv_call_map_ram_pfns(partition->pt_id,
+					   region->start_gfn,
+					   region->nr_pfns,
+					   HV_MAP_GPA_NO_ACCESS, NULL);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
-		ret = hv_call_map_mmio_pages(partition->pt_id,
-					     region->start_gfn,
-					     mmio_pfn,
-					     region->nr_pages);
+		ret = hv_call_map_mmio_pfns(partition->pt_id,
+					    region->start_gfn,
+					    mmio_pfn,
+					    region->nr_pfns);
 		break;
 	}
 
 	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
-				   region->start_gfn, region->nr_pages,
+				   region->start_gfn, region->nr_pfns,
 				   region->hv_map_flags, ret);
 
 	if (ret)
@@ -1522,7 +1521,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	/* Paranoia check */
 	if (region->start_uaddr != mem->userspace_addr ||
 	    region->start_gfn != mem->guest_pfn ||
-	    region->nr_pages != HVPFN_DOWN(mem->size)) {
+	    region->nr_pfns != HVPFN_DOWN(mem->size)) {
 		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EINVAL;
 	}



^ permalink raw reply related

* [PATCH v3 04/11] mshv: Refactor region segmentation into a dedicated helper
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_process_pfns() conflated three concerns: validating the first
PFN of a chunk, locating the longest contiguous run of same-stride PFNs
starting from there, and dispatching the chunk to the handler. The
locate-and-dispatch interleaving made the partial-consume case (4K-to-2M
stride transition inside a same-validity run) emergent rather than
explicit, and required process_range to handle a return value that was
simultaneously a count and an error code.

Split the locate step out into mshv_region_chunk_size().  The new helper
takes a starting offset and an upper bound, returns the length of the
same-stride run, and reports whether that run is huge-page-backed via an
out-parameter. mshv_region_process_pfns() goes away;
mshv_region_process_range() now drives the loop directly, calling
chunk_size() for the next segment length and dispatching the handler with
the precomputed huge_page hint.

mshv_chunk_stride() additionally takes a PFN instead of a struct page * and
validates it internally, so each call site no longer needs its own
mshv_pfn_valid() check before pfn_to_page().

No functional change; the per-handler dispatch shape, segmentation
boundaries, and lock context are all preserved.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |  103 ++++++++++++++++++++++-----------------------
 1 file changed, 50 insertions(+), 53 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 77fc94733cb20..090c4052f0f4d 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -41,7 +41,7 @@ void mshv_region_init_pfns(struct mshv_mem_region *region)
 
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
- * @page     : The page to check for huge page backing
+ * @pfn      : The PFN to check for huge page backing
  * @gfn      : Guest frame number for the mapping
  * @pfn_count: Total number of pages in the mapping
  *
@@ -51,11 +51,16 @@ void mshv_region_init_pfns(struct mshv_mem_region *region)
  *
  * Return: Stride in pages, or -EINVAL if page order is unsupported.
  */
-static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 pfn_count)
+static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
 {
+	struct page *page;
 	unsigned int page_order;
 
+	if (!mshv_pfn_valid(pfn))
+		return -EINVAL;
+
+	page = pfn_to_page(pfn);
+
 	/*
 	 * Use single page stride by default. For huge page stride, the
 	 * page must be compound and point to the head of the compound
@@ -76,65 +81,51 @@ static int mshv_chunk_stride(struct page *page,
 }
 
 /**
- * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
- *                             in a region.
- * @region    : Pointer to the memory region structure.
- * @flags     : Flags to pass to the handler.
- * @pfn_offset: Offset into the region's PFNs array to start processing.
- * @pfn_count : Number of PFNs to process.
- * @handler   : Callback function to handle the chunk.
+ * mshv_region_chunk_size - Length of the next same-stride PFN run.
+ * @region    : Memory region whose PFN array is being walked.
+ * @pfn_offset: Offset into region->mreg_pfns at which to start; the
+ *              PFN at this offset must be valid.
+ * @pfn_count : Upper bound on the run length (not necessarily the
+ *              region's total length; typically the residual passed
+ *              from mshv_region_process_range()).
+ * @huge_page : Out-parameter set to true if the run is backed by
+ *              PMD-order folios and may be dispatched as 2 MiB
+ *              chunks; false for 4 KiB-stride dispatch.
  *
- * This function scans the region's PFNs starting from @pfn_offset,
- * checking for contiguous valid PFNs backed by pages of the same size
- * (normal or huge). It invokes @handler for the chunk of contiguous valid
- * PFNs found. Returns the number of PFNs handled, or a negative error code
- * if the first PFN is invalid or the handler fails.
+ * Walks the PFN array starting at @pfn_offset and returns the length
+ * of the longest contiguous run that shares the stride classification
+ * (4 KiB vs 2 MiB) of the first PFN.  An invalid PFN inside the run
+ * terminates it.  The run is bounded above by @pfn_count.
  *
- * Note: The @handler callback must be able to handle valid PFNs backed by
- * both normal and huge pages.
+ * The caller may then dispatch [pfn_offset, pfn_offset + return) to a
+ * handler with @huge_page indicating which stride applies.  After the
+ * dispatch the caller advances by the returned length and re-invokes
+ * this function for the next run.
  *
- * Return: Number of pages handled, or negative error code.
+ * Return: Length of the run in PFNs, or a negative errno from
+ *         mshv_chunk_stride() if the starting PFN is invalid or its
+ *         backing folio order is unsupported.
  */
-static long mshv_region_process_pfns(struct mshv_mem_region *region,
-				     u32 flags,
-				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+static long mshv_region_chunk_size(struct mshv_mem_region *region,
+				   u64 pfn_offset, u64 pfn_count,
+				   bool *huge_page)
 {
+	unsigned long *pfns = region->mreg_pfns + pfn_offset;
 	u64 gfn = region->start_gfn + pfn_offset;
-	u64 count;
-	unsigned long pfn;
-	int stride, ret;
+	u64 count = 0, stride;
 
-	pfn = region->mreg_pfns[pfn_offset];
-	if (!mshv_pfn_valid(pfn))
-		return -EINVAL;
-
-	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
+	stride = mshv_chunk_stride(pfns[0], gfn, pfn_count);
 	if (stride < 0)
 		return stride;
 
-	/* Start at stride since the first stride is validated */
-	for (count = stride; count < pfn_count ; count += stride) {
-		pfn = region->mreg_pfns[pfn_offset + count];
-
-		/* Break if current pfn is invalid */
-		if (!mshv_pfn_valid(pfn))
-			break;
-
-		/* Break if stride size changes */
-		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
+	for (count = stride; count < pfn_count; count += stride) {
+		if (stride != mshv_chunk_stride(pfns[count],
 						gfn + count,
 						pfn_count - count))
 			break;
 	}
 
-	ret = handler(region, flags, pfn_offset, count, stride > 1);
-	if (ret)
-		return ret;
+	*huge_page = stride > 1;
 
 	return count;
 }
@@ -150,7 +141,7 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
  *
  * Iterates over the specified range of PFNs in @region, skipping
  * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
- * @handler via mshv_region_process_pfns.
+ * @handler.
  *
  * Note: The @handler callback must be able to handle PFNs backed by both
  * normal and huge pages.
@@ -176,6 +167,9 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 		return -EINVAL;
 
 	while (pfn_count) {
+		bool huge_page;
+		long count;
+
 		/* Skip non-present pages */
 		if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset])) {
 			pfn_offset++;
@@ -183,14 +177,17 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 			continue;
 		}
 
-		ret = mshv_region_process_pfns(region, flags,
-					       pfn_offset, pfn_count,
-					       handler);
+		count = mshv_region_chunk_size(region, pfn_offset, pfn_count,
+					       &huge_page);
+		if (count < 0)
+			return count;
+
+		ret = handler(region, flags, pfn_offset, count, huge_page);
 		if (ret < 0)
 			return ret;
 
-		pfn_offset += ret;
-		pfn_count -= ret;
+		pfn_offset += count;
+		pfn_count -= count;
 	}
 
 	return 0;



^ permalink raw reply related

* [PATCH v3 05/11] mshv: Support address range holes in remapping
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Consolidate memory region processing to handle both valid and invalid PFNs
uniformly. This eliminates code duplication across remap, unmap, share, and
unshare operations by using a common range processing interface.

Holes are now remapped with no-access permissions to enable
hypervisor dirty page tracking for precopy live migration.

This refactoring is a precursor to an upcoming change that will map
present pages in movable regions upon region creation, requiring
consistent handling of both mapped and unmapped ranges.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   70 ++++++++++++++++++++++++++++-----------------
 1 file changed, 43 insertions(+), 27 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 090c4052f0f4d..579a29f2924b8 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -81,30 +81,23 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
 }
 
 /**
- * mshv_region_chunk_size - Length of the next same-stride PFN run.
+ * mshv_region_chunk_size - Length of the next contiguous PFN run in a region.
  * @region    : Memory region whose PFN array is being walked.
- * @pfn_offset: Offset into region->mreg_pfns at which to start; the
- *              PFN at this offset must be valid.
- * @pfn_count : Upper bound on the run length (not necessarily the
- *              region's total length; typically the residual passed
- *              from mshv_region_process_range()).
- * @huge_page : Out-parameter set to true if the run is backed by
- *              PMD-order folios and may be dispatched as 2 MiB
- *              chunks; false for 4 KiB-stride dispatch.
+ * @pfn_offset: Offset into region->mreg_pfns at which to start.
+ * @pfn_count : Upper bound on the run length.
+ * @huge_page : Out-parameter set to true if the run may be dispatched
+ *              as a 2 MiB chunk; false for 4 KiB-stride dispatch.
  *
- * Walks the PFN array starting at @pfn_offset and returns the length
- * of the longest contiguous run that shares the stride classification
- * (4 KiB vs 2 MiB) of the first PFN.  An invalid PFN inside the run
- * terminates it.  The run is bounded above by @pfn_count.
- *
- * The caller may then dispatch [pfn_offset, pfn_offset + return) to a
- * handler with @huge_page indicating which stride applies.  After the
- * dispatch the caller advances by the returned length and re-invokes
- * this function for the next run.
+ * Returns the length of the longest contiguous run starting at @pfn_offset
+ * that shares the classification of the first PFN: either a same-stride run of
+ * valid PFNs (4 KiB or 2 MiB) or a hole of invalid PFNs. A hole that is
+ * huge-page aligned in @gfn space and at least PTRS_PER_PMD entries long is
+ * reported as a 2 MiB chunk (huge_page = true) so the caller can dispatch it
+ * as a single HV_MAP_GPA_NO_ACCESS huge mapping. The run is bounded above by
+ * @pfn_count.
  *
  * Return: Length of the run in PFNs, or a negative errno from
- *         mshv_chunk_stride() if the starting PFN is invalid or its
- *         backing folio order is unsupported.
+ *         mshv_chunk_stride() if the backing folio order is unsupported.
  */
 static long mshv_region_chunk_size(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
@@ -114,6 +107,22 @@ static long mshv_region_chunk_size(struct mshv_mem_region *region,
 	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count = 0, stride;
 
+	if (!mshv_pfn_valid(pfns[0])) {
+		for (count = 1; count < pfn_count; count++) {
+			if (mshv_pfn_valid(pfns[count]))
+				break;
+		}
+
+		if (IS_ALIGNED(gfn, PTRS_PER_PMD) &&
+		    count >= PTRS_PER_PMD) {
+			*huge_page = true;
+			return ALIGN_DOWN(count, PTRS_PER_PMD);
+		}
+
+		*huge_page = false;
+		return count;
+	}
+
 	stride = mshv_chunk_stride(pfns[0], gfn, pfn_count);
 	if (stride < 0)
 		return stride;
@@ -170,13 +179,6 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 		bool huge_page;
 		long count;
 
-		/* Skip non-present pages */
-		if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset])) {
-			pfn_offset++;
-			pfn_count--;
-			continue;
-		}
-
 		count = mshv_region_chunk_size(region, pfn_offset, pfn_count,
 					       &huge_page);
 		if (count < 0)
@@ -223,6 +225,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+		return -EINVAL;
+
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
@@ -248,6 +253,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
 {
+	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+		return -EINVAL;
+
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
@@ -271,6 +279,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	/*
+	 * Remap missing pages with no access to let the
+	 * hypervisor track dirty pages, enabling precopy live
+	 * migration.
+	 */
+	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+		flags = HV_MAP_GPA_NO_ACCESS;
+
 	if (huge_page)
 		flags |= HV_MAP_GPA_LARGE_PAGE;
 



^ permalink raw reply related

* [PATCH v3 06/11] mshv: Iterate VMAs when faulting in region pages
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_hmm_fault_and_lock() called hmm_range_fault() once for the
entire requested range. hmm_range_fault() can only handle a single VMA per
call, so a region whose user address range spans multiple VMAs fails the
fault even though each individual VMA is fault-able.

Walk the requested range VMA by VMA under mmap_read_lock and call
hmm_range_fault() for each [vma->vm_start, vma->vm_end) ∩ [start, end)
segment. The mmu notifier sequence is captured once before the loop, so a
writer racing with the multi-VMA fault is still detected at the closing
mmu_interval_read_retry().

Tighten the read-only gate added in 3f8e229cb787 ("mshv: Don't request HMM
write fault for read-only regions") so HMM_PFN_REQ_WRITE is requested only
when both the region (HV_MAP_GPA_WRITABLE) and the backing VMA (VM_WRITE)
permit writes. Without the per-VMA check, a writable region whose
underlying VMA is read-only would still trigger COW on the host's read-only
pages.

While here, restructure mshv_region_hmm_fault_and_lock() to take the range
as (start, end, pfns) directly rather than a populated hmm_range; the
struct is now constructed inside the function since its fields are
recomputed per VMA.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   97 +++++++++++++++++++++++++++++----------------
 1 file changed, 62 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 579a29f2924b8..807fff43deb43 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -447,37 +447,76 @@ int mshv_region_get(struct mshv_mem_region *region)
 }
 
 /**
- * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * mshv_region_hmm_fault_and_lock - Fault in pages across VMAs and lock
+ *                                  the memory region
  * @region: Pointer to the memory region structure
- * @range: Pointer to the HMM range structure
+ * @start : Starting virtual address of the range to fault (inclusive)
+ * @end   : Ending virtual address of the range to fault (exclusive)
+ * @pfns  : Output array for page frame numbers with HMM flags
  *
- * This function performs the following steps:
- * 1. Reads the notifier sequence for the HMM range.
- * 2. Acquires a read lock on the memory map.
- * 3. Handles HMM faults for the specified range.
- * 4. Releases the read lock on the memory map.
- * 5. If successful, locks the memory region mutex.
- * 6. Verifies if the notifier sequence has changed during the operation.
- *    If it has, releases the mutex and returns -EBUSY to match with
- *    hmm_range_fault() return code for repeating.
+ * Iterates through VMAs covering [start, end), faulting in pages via
+ * hmm_range_fault() for each VMA segment.  Write faults are requested
+ * only when both the VMA and the hypervisor mapping permit writes, to
+ * avoid breaking copy-on-write semantics on read-only mappings.
  *
- * Return: 0 on success, a negative error code otherwise.
+ * On success, returns with region->mreg_mutex held; the caller is
+ * responsible for releasing it.  Returns -EBUSY if the mmu notifier
+ * sequence changed during the operation, signalling the caller to retry.
+ *
+ * Return: 0 on success, negative error code on failure.
  */
 static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
-					  struct hmm_range *range)
+					  unsigned long start,
+					  unsigned long end,
+					  unsigned long *pfns)
 {
+	struct hmm_range range = {
+		.notifier = &region->mreg_mni,
+	};
+	struct mm_struct *mm = region->mreg_mni.mm;
 	int ret;
 
-	range->notifier_seq = mmu_interval_read_begin(range->notifier);
-	mmap_read_lock(region->mreg_mni.mm);
-	ret = hmm_range_fault(range);
-	mmap_read_unlock(region->mreg_mni.mm);
+	range.notifier_seq = mmu_interval_read_begin(range.notifier);
+	mmap_read_lock(mm);
+	while (start < end) {
+		struct vm_area_struct *vma;
+
+		vma = vma_lookup(mm, start);
+		if (!vma) {
+			ret = -EFAULT;
+			break;
+		}
+
+		range.hmm_pfns = pfns;
+		range.start = start;
+		range.end = min(vma->vm_end, end);
+		range.default_flags = HMM_PFN_REQ_FAULT;
+		/*
+		 * Only request writable pages from HMM when both the
+		 * VMA and the hypervisor mapping allow writes.  Without
+		 * this, hmm_range_fault() would trigger COW on read-only
+		 * mappings (e.g. shared zero pages, file-backed pages),
+		 * breaking copy-on-write semantics and potentially
+		 * granting the guest write access to shared host pages.
+		 */
+		if ((vma->vm_flags & VM_WRITE) &&
+		    (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
+			range.default_flags |= HMM_PFN_REQ_WRITE;
+
+		ret = hmm_range_fault(&range);
+		if (ret)
+			break;
+
+		start = range.end;
+		pfns += (range.end - range.start) >> PAGE_SHIFT;
+	}
+	mmap_read_unlock(mm);
 	if (ret)
 		return ret;
 
 	mutex_lock(&region->mreg_mutex);
 
-	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+	if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
 		mutex_unlock(&region->mreg_mutex);
 		cond_resched();
 		return -EBUSY;
@@ -501,10 +540,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 static int mshv_region_range_fault(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count)
 {
-	struct hmm_range range = {
-		.notifier = &region->mreg_mni,
-		.default_flags = HMM_PFN_REQ_FAULT,
-	};
+	unsigned long start, end;
 	unsigned long *pfns;
 	int ret;
 	u64 i;
@@ -513,21 +549,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	if (!pfns)
 		return -ENOMEM;
 
-	range.hmm_pfns = pfns;
-	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
-	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
-
-	/*
-	 * Only request writable pages from HMM when the region itself
-	 * permits writes.  Without this, hmm_range_fault() would
-	 * trigger COW on read-only regions, breaking copy-on-write
-	 * semantics on shared host pages.
-	 */
-	if (region->hv_map_flags & HV_MAP_GPA_WRITABLE)
-		range.default_flags |= HMM_PFN_REQ_WRITE;
+	start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
+	end = start + pfn_count * HV_HYP_PAGE_SIZE;
 
 	do {
-		ret = mshv_region_hmm_fault_and_lock(region, &range);
+		ret = mshv_region_hmm_fault_and_lock(region, start, end,
+						     pfns);
 	} while (ret == -EBUSY);
 
 	if (ret)



^ permalink raw reply related

* [PATCH v3 07/11] mshv: Scale fault granularity for non-4 KiB host pages
From: Stanislav Kinsburskii @ 2026-05-13 18:50 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_region_handle_gfn_fault() faults in PTRS_PER_PMD HV pages
(2 MiB), sized for a single 2 MiB SLAT mapping.  When the host page
is larger than HV_HYP_PAGE_SIZE (e.g. arm64 16 KiB or 64 KiB), the
range covers fewer than PTRS_PER_PMD host pages, so a subsequent
guest fault re-enters HMM for offsets the same host folio already
backs.

Scale MSHV_MAP_FAULT_IN_PAGES by PAGE_SIZE / HV_HYP_PAGE_SIZE
(clamped to at least 1) so the fault range covers one huge page on
whichever side has the larger one.

No functional change for the H == V case; on H > V hosts the fault
grows from 2 MiB to one host huge page.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 807fff43deb43..057fc83895d37 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -17,7 +17,8 @@

 #include "mshv_root.h"

-#define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
+#define MSHV_MAP_FAULT_IN_PAGES				\
+	(PTRS_PER_PMD * max_t(unsigned long, 1, PAGE_SIZE / HV_HYP_PAGE_SIZE))
 #define MSHV_INVALID_PFN				ULONG_MAX

 static inline bool mshv_pfn_valid(unsigned long pfn)

^ permalink raw reply related

* [PATCH v3 08/11] mshv: Move pinned region setup to mshv_regions.c
From: Stanislav Kinsburskii @ 2026-05-13 18:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Move mshv_prepare_pinned_region() from mshv_root_main.c to
mshv_regions.c and rename it to mshv_map_pinned_region(). This
co-locates the pinned region logic with the rest of the memory region
operations.

Make mshv_region_pin(), mshv_region_map(), mshv_region_share(),
mshv_region_unshare(), and mshv_region_invalidate() static, as they are
no longer called outside of mshv_regions.c.

No functional change.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   83 ++++++++++++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root.h      |    6 +--
 drivers/hv/mshv_root_main.c |   75 +--------------------------------------
 3 files changed, 80 insertions(+), 84 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 057fc83895d37..d9e33e9ef8550 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -240,7 +240,7 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 					      flags, true);
 }
 
-int mshv_region_share(struct mshv_mem_region *region)
+static int mshv_region_share(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
@@ -266,7 +266,7 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 					      flags, false);
 }
 
-int mshv_region_unshare(struct mshv_mem_region *region)
+static int mshv_region_unshare(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
@@ -306,7 +306,7 @@ static int mshv_region_remap_pfns(struct mshv_mem_region *region,
 					 mshv_region_chunk_remap);
 }
 
-int mshv_region_map(struct mshv_mem_region *region)
+static int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
@@ -330,12 +330,12 @@ static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
 	mshv_region_init_pfns_range(region, pfn_offset, pfn_count);
 }
 
-void mshv_region_invalidate(struct mshv_mem_region *region)
+static void mshv_region_invalidate(struct mshv_mem_region *region)
 {
 	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
-int mshv_region_pin(struct mshv_mem_region *region)
+static int mshv_region_pin(struct mshv_mem_region *region)
 {
 	u64 done_count, nr_pfns, i;
 	unsigned long *pfns;
@@ -691,3 +691,76 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
 
 	return true;
 }
+
+/**
+ * mshv_map_pinned_region - Pin and map memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int mshv_map_pinned_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	int ret;
+
+	ret = mshv_region_pin(region);
+	if (ret) {
+		pt_err(partition, "Failed to pin memory region: %d\n",
+		       ret);
+		goto err_out;
+	}
+
+	/*
+	 * For an SNP partition it is a requirement that for every memory region
+	 * that we are going to map for this partition we should make sure that
+	 * host access to that region is released. This is ensured by doing an
+	 * additional hypercall which will update the SLAT to release host
+	 * access to guest memory regions.
+	 */
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_region_unshare(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
+			       region->start_gfn, ret);
+			goto err_out;
+		}
+	}
+
+	ret = mshv_region_map(region);
+	if (ret)
+		goto share_region;
+
+	return 0;
+
+share_region:
+	if (mshv_partition_encrypted(partition)) {
+		int shrc;
+
+		shrc = mshv_region_share(region);
+		if (!shrc)
+			goto err_out;
+
+		pt_err(partition,
+		       "Failed to share memory region (guest_pfn: %llu): %d\n",
+		       region->start_gfn, shrc);
+		/*
+		 * Re-sharing failed — the pages remain inaccessible to the
+		 * host.  Zero the page array so that mshv_region_destroy()
+		 * won't attempt to unpin them (leaking the page references
+		 * is intentional; unpinning host-inaccessible pages would be
+		 * unsafe).
+		 */
+		mshv_region_init_pfns(region);
+		goto err_out;
+	}
+err_out:
+	return ret;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 34c9b57c50f47..2a4eff27917f2 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -371,16 +371,12 @@ extern u8 * __percpu *hv_synic_eventring_tail;
 
 struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 					   u64 uaddr, u32 flags);
-int mshv_region_share(struct mshv_mem_region *region);
-int mshv_region_unshare(struct mshv_mem_region *region);
-int mshv_region_map(struct mshv_mem_region *region);
-void mshv_region_invalidate(struct mshv_mem_region *region);
 void mshv_region_init_pfns(struct mshv_mem_region *region);
-int mshv_region_pin(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
+int mshv_map_pinned_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 986cb9a4c428e..4af2b98738ee2 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1346,79 +1346,6 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/**
- * mshv_prepare_pinned_region - Pin and map memory regions
- * @region: Pointer to the memory region structure
- *
- * This function processes memory regions that are explicitly marked as pinned.
- * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
- * population. The function ensures the region is properly populated, handles
- * encryption requirements for SNP partitions if applicable, maps the region,
- * and performs necessary sharing or eviction operations based on the mapping
- * result.
- *
- * Return: 0 on success, negative error code on failure.
- */
-static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
-{
-	struct mshv_partition *partition = region->partition;
-	int ret;
-
-	ret = mshv_region_pin(region);
-	if (ret) {
-		pt_err(partition, "Failed to pin memory region: %d\n",
-		       ret);
-		goto err_out;
-	}
-
-	/*
-	 * For an SNP partition it is a requirement that for every memory region
-	 * that we are going to map for this partition we should make sure that
-	 * host access to that region is released. This is ensured by doing an
-	 * additional hypercall which will update the SLAT to release host
-	 * access to guest memory regions.
-	 */
-	if (mshv_partition_encrypted(partition)) {
-		ret = mshv_region_unshare(region);
-		if (ret) {
-			pt_err(partition,
-			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
-			       region->start_gfn, ret);
-			goto err_out;
-		}
-	}
-
-	ret = mshv_region_map(region);
-	if (ret)
-		goto share_region;
-
-	return 0;
-
-share_region:
-	if (mshv_partition_encrypted(partition)) {
-		int shrc;
-
-		shrc = mshv_region_share(region);
-		if (!shrc)
-			goto err_out;
-
-		pt_err(partition,
-		       "Failed to share memory region (guest_pfn: %llu): %d\n",
-		       region->start_gfn, shrc);
-		/*
-		 * Re-sharing failed — the pages remain inaccessible to the
-		 * host.  Zero the page array so that mshv_region_destroy()
-		 * won't attempt to unpin them (leaking the page references
-		 * is intentional; unpinning host-inaccessible pages would be
-		 * unsafe).
-		 */
-		mshv_region_init_pfns(region);
-		goto err_out;
-	}
-err_out:
-	return ret;
-}
-
 /*
  * This maps two things: guest RAM and for pci passthru mmio space.
  *
@@ -1461,7 +1388,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 
 	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
-		ret = mshv_prepare_pinned_region(region);
+		ret = mshv_map_pinned_region(region);
 		break;
 	case MSHV_REGION_TYPE_MEM_MOVABLE:
 		/*



^ permalink raw reply related

* [PATCH v3 09/11] mshv: Map populated pages on movable region creation
From: Stanislav Kinsburskii @ 2026-05-13 18:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Map any populated pages into the hypervisor upfront when creating a
movable region, rather than waiting for faults. Previously, movable
regions were created with all pages marked as HV_MAP_GPA_NO_ACCESS
regardless of whether the userspace mapping contained populated pages.

This guarantees that if the caller passes a populated mapping, those
present pages will be mapped into the hypervisor immediately during
region creation instead of being faulted in later.

Pages that are present but not writable in host page tables (e.g.
shared zero pages) are left as no-access mappings to preserve copy-on-write
semantics; they will be faulted in on demand.

The region is processed in bounded chunks to avoid soft lockups and
livelock from concurrent invalidations.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |  126 +++++++++++++++++++++++++++++++++----------
 drivers/hv/mshv_root.h      |    1 
 drivers/hv/mshv_root_main.c |   10 ---
 3 files changed, 98 insertions(+), 39 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index d9e33e9ef8550..85f8b7bddf939 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -17,8 +17,12 @@
 
 #include "mshv_root.h"
 
+/* Process memory regions in chunks to avoid soft lockups and livelock */
+#define MSHV_MAX_PFN_BATCH				(SZ_2M / PAGE_SIZE)
+
 #define MSHV_MAP_FAULT_IN_PAGES				\
 	(PTRS_PER_PMD * max_t(unsigned long, 1, PAGE_SIZE / HV_HYP_PAGE_SIZE))
+
 #define MSHV_INVALID_PFN				ULONG_MAX
 
 static inline bool mshv_pfn_valid(unsigned long pfn)
@@ -450,13 +454,16 @@ int mshv_region_get(struct mshv_mem_region *region)
 /**
  * mshv_region_hmm_fault_and_lock - Fault in pages across VMAs and lock
  *                                  the memory region
- * @region: Pointer to the memory region structure
- * @start : Starting virtual address of the range to fault (inclusive)
- * @end   : Ending virtual address of the range to fault (exclusive)
- * @pfns  : Output array for page frame numbers with HMM flags
+ * @region  : Pointer to the memory region structure
+ * @start   : Starting virtual address of the range to fault (inclusive)
+ * @end     : Ending virtual address of the range to fault (exclusive)
+ * @pfns    : Output array for page frame numbers with HMM flags
+ * @do_fault: If true, fault in missing pages; if false, snapshot only
+ *            pages already present in page tables
  *
- * Iterates through VMAs covering [start, end), faulting in pages via
- * hmm_range_fault() for each VMA segment.  Write faults are requested
+ * Iterates through VMAs covering [start, end), collecting page frame
+ * numbers via hmm_range_fault() for each VMA segment.  When @do_fault
+ * is true, missing pages are faulted in and write faults are requested
  * only when both the VMA and the hypervisor mapping permit writes, to
  * avoid breaking copy-on-write semantics on read-only mappings.
  *
@@ -469,7 +476,8 @@ int mshv_region_get(struct mshv_mem_region *region)
 static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 					  unsigned long start,
 					  unsigned long end,
-					  unsigned long *pfns)
+					  unsigned long *pfns,
+					  bool do_fault)
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
@@ -491,18 +499,22 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 		range.hmm_pfns = pfns;
 		range.start = start;
 		range.end = min(vma->vm_end, end);
-		range.default_flags = HMM_PFN_REQ_FAULT;
-		/*
-		 * Only request writable pages from HMM when both the
-		 * VMA and the hypervisor mapping allow writes.  Without
-		 * this, hmm_range_fault() would trigger COW on read-only
-		 * mappings (e.g. shared zero pages, file-backed pages),
-		 * breaking copy-on-write semantics and potentially
-		 * granting the guest write access to shared host pages.
-		 */
-		if ((vma->vm_flags & VM_WRITE) &&
-		    (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
-			range.default_flags |= HMM_PFN_REQ_WRITE;
+		range.default_flags = 0;
+		if (do_fault) {
+			range.default_flags = HMM_PFN_REQ_FAULT;
+			/*
+			 * Only request writable pages from HMM when both
+			 * the VMA and the hypervisor mapping allow writes.
+			 * Without this, hmm_range_fault() would trigger
+			 * COW on read-only mappings (e.g. shared zero
+			 * pages, file-backed pages), breaking
+			 * copy-on-write semantics and potentially granting
+			 * the guest write access to shared host pages.
+			 */
+			if ((vma->vm_flags & VM_WRITE) &&
+			    (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
+				range.default_flags |= HMM_PFN_REQ_WRITE;
+		}
 
 		ret = hmm_range_fault(&range);
 		if (ret)
@@ -527,19 +539,33 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 }
 
 /**
- * mshv_region_range_fault - Handle memory range faults for a given region.
- * @region: Pointer to the memory region structure.
- * @pfn_offset: Offset of the page within the region.
- * @pfn_count: Number of pages to handle.
+ * mshv_region_collect_and_map - Collect PFNs for a user range and map them
+ * @region    : memory region being processed
+ * @pfn_offset: PFNs offset within the region
+ * @pfn_count : number of PFNs to process
+ * @do_fault  : if true, fault in missing pages;
+ *              if false, collect only present pages
  *
- * This function resolves memory faults for a specified range of pages
- * within a memory region. It uses HMM (Heterogeneous Memory Management)
- * to fault in the required pages and updates the region's page array.
+ * Collects PFNs for the specified portion of @region from the
+ * corresponding userspace VMAs and maps them into the hypervisor. The
+ * behavior depends on @do_fault:
  *
- * Return: 0 on success, negative error code on failure.
+ * - true: Fault in missing pages from userspace, ensuring all pages in the
+ *   range are present. Used for on-demand page population.
+ * - false: Collect PFNs only for pages already present in userspace,
+ *   leaving missing pages as invalid PFN markers.
+ *   Used for initial region setup.
+ *
+ * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
+ * flags cleared, then the range is mapped into the hypervisor. Present
+ * PFNs get mapped with region access permissions; missing PFNs (invalid
+ * entries) get mapped with no-access permissions.
+ *
+ * Return: 0 on success, negative errno on failure.
  */
-static int mshv_region_range_fault(struct mshv_mem_region *region,
-				   u64 pfn_offset, u64 pfn_count)
+static int mshv_region_collect_and_map(struct mshv_mem_region *region,
+				       u64 pfn_offset, u64 pfn_count,
+				       bool do_fault)
 {
 	unsigned long start, end;
 	unsigned long *pfns;
@@ -555,7 +581,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 
 	do {
 		ret = mshv_region_hmm_fault_and_lock(region, start, end,
-						     pfns);
+						     pfns, do_fault);
 	} while (ret == -EBUSY);
 
 	if (ret)
@@ -564,6 +590,11 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	for (i = 0; i < pfn_count; i++) {
 		if (!(pfns[i] & HMM_PFN_VALID))
 			continue;
+		/* Skip read-only pages to avoid bypassing COW */
+		if (!do_fault &&
+		    (region->hv_map_flags & HV_MAP_GPA_WRITABLE) &&
+		    !(pfns[i] & HMM_PFN_WRITE))
+			continue;
 		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
 		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
 	}
@@ -577,6 +608,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	return ret;
 }
 
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 pfn_offset, u64 pfn_count)
+{
+	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
+					   true);
+}
+
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
 {
 	u64 pfn_offset, pfn_count;
@@ -764,3 +802,31 @@ int mshv_map_pinned_region(struct mshv_mem_region *region)
 err_out:
 	return ret;
 }
+
+/*
+ * mshv_map_movable_region - Map a movable memory region to the hypervisor
+ * @region: The memory region to map
+ *
+ * Maps the entire movable region by processing it in bounded chunks to avoid
+ * soft lockups from holding mmap_read_lock() too long and to prevent livelock
+ * if concurrent memory invalidations force restarts.
+ *
+ * Returns: 0 on success, negative error code on failure.
+ */
+int mshv_map_movable_region(struct mshv_mem_region *region)
+{
+	u64 pfn, count;
+	int ret;
+
+	for (pfn = 0; pfn < region->nr_pfns; pfn += MSHV_MAX_PFN_BATCH) {
+		count = min_t(u64, MSHV_MAX_PFN_BATCH, region->nr_pfns - pfn);
+
+		ret = mshv_region_collect_and_map(region, pfn, count, false);
+		if (ret)
+			return ret;
+
+		cond_resched();
+	}
+
+	return 0;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 2a4eff27917f2..0f4fc57a14cd0 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -378,5 +378,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
 int mshv_map_pinned_region(struct mshv_mem_region *region);
+int mshv_map_movable_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 4af2b98738ee2..e38438c539c5d 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1391,15 +1391,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = mshv_map_pinned_region(region);
 		break;
 	case MSHV_REGION_TYPE_MEM_MOVABLE:
-		/*
-		 * For movable memory regions, remap with no access to let
-		 * the hypervisor track dirty pages, enabling pre-copy live
-		 * migration.
-		 */
-		ret = hv_call_map_ram_pfns(partition->pt_id,
-					   region->start_gfn,
-					   region->nr_pfns,
-					   HV_MAP_GPA_NO_ACCESS, NULL);
+		ret = mshv_map_movable_region(region);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
 		ret = hv_call_map_mmio_pfns(partition->pt_id,



^ permalink raw reply related

* [PATCH v3 10/11] mshv: Extract MMIO region mapping into separate function
From: Stanislav Kinsburskii @ 2026-05-13 18:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Extract the MMIO region mapping logic from mshv_map_user_memory() into
a dedicated mshv_map_mmio_region() function. This improves code
organization and consistency with the existing mshv_map_pinned_region()
and mshv_map_movable_region() functions.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |    8 ++++++++
 drivers/hv/mshv_root.h      |    2 ++
 drivers/hv/mshv_root_main.c |    5 +----
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 85f8b7bddf939..7bcfba9ebac12 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -830,3 +830,11 @@ int mshv_map_movable_region(struct mshv_mem_region *region)
 
 	return 0;
 }
+
+int mshv_map_mmio_region(struct mshv_mem_region *region,
+			 unsigned long mmio_pfn)
+{
+	return hv_call_map_mmio_pfns(region->partition->pt_id,
+				     region->start_gfn,
+				     mmio_pfn, region->nr_pfns);
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 0f4fc57a14cd0..b091db06390b0 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -379,5 +379,7 @@ void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
 int mshv_map_pinned_region(struct mshv_mem_region *region);
 int mshv_map_movable_region(struct mshv_mem_region *region);
+int mshv_map_mmio_region(struct mshv_mem_region *region,
+			 unsigned long mmio_pfn);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e38438c539c5d..d5197559d7650 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1394,10 +1394,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = mshv_map_movable_region(region);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
-		ret = hv_call_map_mmio_pfns(partition->pt_id,
-					    region->start_gfn,
-					    mmio_pfn,
-					    region->nr_pfns);
+		ret = mshv_map_mmio_region(region, mmio_pfn);
 		break;
 	}
 



^ permalink raw reply related

* [PATCH v3 11/11] mshv: Add tracepoint for map GPA hypercall
From: Stanislav Kinsburskii @ 2026-05-13 18:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869813422.18773.515308662914055136.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Add tracing for GPA mapping hypercalls to aid in debugging memory
management issues in child partitions. The tracepoint captures both
successful and failed mapping attempts, including the number of pages
successfully mapped before any failure occurred.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |    3 +++
 drivers/hv/mshv_trace.h        |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 00d37c39cbf26..ccf297fbfae3a 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -250,6 +250,9 @@ static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
 		}
 	}
 
+	trace_mshv_map_pfns(partition_id, gfn, pfns_count, page_count,
+			    flags, mmio_spa, done, ret);
+
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
diff --git a/drivers/hv/mshv_trace.h b/drivers/hv/mshv_trace.h
index e7280c47e579a..619c4563d4211 100644
--- a/drivers/hv/mshv_trace.h
+++ b/drivers/hv/mshv_trace.h
@@ -538,6 +538,42 @@ TRACE_EVENT(mshv_handle_gpa_intercept,
 	    )
 );
 
+TRACE_EVENT(mshv_map_pfns,
+	    TP_PROTO(u64 partition_id, u64 gfn, u64 pfn_count, u64 page_count, u32 flags,
+		     u64 mmio_spa, u64 done, int ret),
+	    TP_ARGS(partition_id, gfn, pfn_count, page_count, flags, mmio_spa, done, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, gfn)
+		    __field(u64, pfn_count)
+		    __field(u64, page_count)
+		    __field(u32, flags)
+		    __field(u64, mmio_spa)
+		    __field(u64, done)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->gfn = gfn;
+		    __entry->pfn_count = pfn_count;
+		    __entry->page_count = page_count;
+		    __entry->flags = flags;
+		    __entry->mmio_spa = mmio_spa;
+		    __entry->done = done;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu gfn=0x%llx pfn_count=%llu page_count=%llu flags=0x%x mmio_spa=0x%llx done=%llu ret=%d",
+		    __entry->partition_id,
+		    __entry->gfn,
+		    __entry->pfn_count,
+		    __entry->page_count,
+		    __entry->flags,
+		    __entry->mmio_spa,
+		    __entry->done,
+		    __entry->ret
+	    )
+);
+
 #endif /* _MSHV_TRACE_H_ */
 
 /* This part must be outside protection */



^ permalink raw reply related

* [PATCH v4 0/6] mshv: Reduce memory consumption for unpinned regions
From: Stanislav Kinsburskii @ 2026-05-13 18:54 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel

This series reduces memory consumption for unpinned regions by avoiding
PFN array allocation. A 1GB unpinned region currently wastes 2MB for an
unused PFN array that HMM-managed regions don't need.


v4:
 - Rebased on the latest state of the tree.
 - Dropped "mshv: Improve code readability with handler function typedef"
   as redundant.

v3:
- Fix missing unmap/remap of pages before the first huge page.

v2:
- Improved commit message
- Fixed invalid vfree(region->mreg_pfns) call for MMIO-backed regions
- Fixed unpinning of already-released pages in the error path during
  pinned region creation
- Removed redundant mshv_map_region helper in favor of the new
  optimized mapping logic

---

Stanislav Kinsburskii (6):
      mshv: Consolidate region creation and mapping
      mshv: Rename mshv_mem_region to mshv_region
      mshv: Optimize region unmap and invalidation with large pages
      mshv: Pass pfns array explicitly through processing chain
      mshv: Simplify pfn array handling in region processing
      mshv: Allocate pfns array only for pinned regions


 drivers/hv/mshv_regions.c   |  346 +++++++++++++++++++++++++++----------------
 drivers/hv/mshv_root.h      |   28 ++-
 drivers/hv/mshv_root_main.c |   81 ++++------
 3 files changed, 265 insertions(+), 190 deletions(-)


^ permalink raw reply

* [PATCH v4 1/6] mshv: Consolidate region creation and mapping
From: Stanislav Kinsburskii @ 2026-05-13 18:54 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869833161.19379.1357399549586915698.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Consolidate region type detection and initialization into
mshv_region_create() to simplify the region creation flow. Move type
determination logic (MMIO/pinned/movable) earlier in the process and
initialize type-specific fields during creation rather than after.

Movable region initialization now fails explicitly if
mmu_interval_notifier_insert() returns an error, rather than silently
falling back to pinned memory. This fail-fast approach makes
configuration issues more visible.

This eliminates the need for mshv_region_movable_init/fini() by
handling MMU interval notifier setup directly in the constructor and
teardown in the destructor. Region mapping is also unified through a
single mshv_map_region() dispatcher that routes to the appropriate
type-specific handler, reducing the exported API surface by 5
functions.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   83 ++++++++++++++++++++++++++++---------------
 drivers/hv/mshv_root.h      |   15 ++++----
 drivers/hv/mshv_root_main.c |   61 ++++++++++++--------------------
 3 files changed, 85 insertions(+), 74 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 7bcfba9ebac12..894eb4dcad93f 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -25,6 +25,8 @@
 
 #define MSHV_INVALID_PFN				ULONG_MAX
 
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
+
 static inline bool mshv_pfn_valid(unsigned long pfn)
 {
 	return pfn != MSHV_INVALID_PFN;
@@ -200,15 +202,21 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags)
+struct mshv_mem_region *mshv_region_create(struct mshv_partition *partition,
+					   enum mshv_region_type type,
+					   u64 guest_pfn, u64 nr_pfns,
+					   u64 uaddr, u32 flags,
+					   unsigned long mmio_pfn)
 {
 	struct mshv_mem_region *region;
+	int ret = 0;
 
 	region = vzalloc(struct_size(region, mreg_pfns, nr_pfns));
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
+	region->partition = partition;
+	region->mreg_type = type;
 	region->nr_pfns = nr_pfns;
 	region->start_gfn = guest_pfn;
 	region->start_uaddr = uaddr;
@@ -220,9 +228,33 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
 
 	mshv_region_init_pfns(region);
 
+	mutex_init(&region->mreg_mutex);
 	kref_init(&region->mreg_refcount);
 
+	switch (type) {
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		ret = mmu_interval_notifier_insert(&region->mreg_mni,
+						   current->mm, uaddr,
+						   nr_pfns << HV_HYP_PAGE_SHIFT,
+						   &mshv_region_mni_ops);
+		break;
+	case MSHV_REGION_TYPE_MEM_PINNED:
+		break;
+	case MSHV_REGION_TYPE_MMIO:
+		region->mreg_mmio_pfn = mmio_pfn;
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret)
+		goto free_region;
+
 	return region;
+
+free_region:
+	vfree(region);
+	return ERR_PTR(ret);
 }
 
 static int mshv_region_chunk_share(struct mshv_mem_region *region,
@@ -422,7 +454,7 @@ static void mshv_region_destroy(struct kref *ref)
 	int ret;
 
 	if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
-		mshv_region_movable_fini(region);
+		mmu_interval_notifier_remove(&region->mreg_mni);
 
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_region_share(region);
@@ -709,27 +741,6 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
 	.invalidate = mshv_region_interval_invalidate,
 };
 
-void mshv_region_movable_fini(struct mshv_mem_region *region)
-{
-	mmu_interval_notifier_remove(&region->mreg_mni);
-}
-
-bool mshv_region_movable_init(struct mshv_mem_region *region)
-{
-	int ret;
-
-	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
-					   region->start_uaddr,
-					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
-					   &mshv_region_mni_ops);
-	if (ret)
-		return false;
-
-	mutex_init(&region->mreg_mutex);
-
-	return true;
-}
-
 /**
  * mshv_map_pinned_region - Pin and map memory regions
  * @region: Pointer to the memory region structure
@@ -743,7 +754,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
  *
  * Return: 0 on success, negative error code on failure.
  */
-int mshv_map_pinned_region(struct mshv_mem_region *region)
+static int mshv_map_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
@@ -813,7 +824,7 @@ int mshv_map_pinned_region(struct mshv_mem_region *region)
  *
  * Returns: 0 on success, negative error code on failure.
  */
-int mshv_map_movable_region(struct mshv_mem_region *region)
+static int mshv_map_movable_region(struct mshv_mem_region *region)
 {
 	u64 pfn, count;
 	int ret;
@@ -831,10 +842,24 @@ int mshv_map_movable_region(struct mshv_mem_region *region)
 	return 0;
 }
 
-int mshv_map_mmio_region(struct mshv_mem_region *region,
-			 unsigned long mmio_pfn)
+static int mshv_map_mmio_region(struct mshv_mem_region *region)
 {
 	return hv_call_map_mmio_pfns(region->partition->pt_id,
 				     region->start_gfn,
-				     mmio_pfn, region->nr_pfns);
+				     region->mreg_mmio_pfn,
+				     region->nr_pfns);
+}
+
+int mshv_map_region(struct mshv_mem_region *region)
+{
+	switch (region->mreg_type) {
+	case MSHV_REGION_TYPE_MEM_PINNED:
+		return mshv_map_pinned_region(region);
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		return mshv_map_movable_region(region);
+	case MSHV_REGION_TYPE_MMIO:
+		return mshv_map_mmio_region(region);
+	default:
+		return -EINVAL;
+	}
 }
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index b091db06390b0..40fcf57219ece 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -93,6 +93,7 @@ struct mshv_mem_region {
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
 	struct mutex mreg_mutex;	/* protects region PFNs remapping */
+	u64 mreg_mmio_pfn;
 	unsigned long mreg_pfns[];
 };
 
@@ -369,17 +370,15 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
-					   u64 uaddr, u32 flags);
+struct mshv_mem_region *mshv_region_create(struct mshv_partition *partition,
+					   enum mshv_region_type type,
+					   u64 guest_pfn, u64 nr_pfns,
+					   u64 uaddr, u32 flags,
+					   unsigned long mmio_pfn);
 void mshv_region_init_pfns(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
-void mshv_region_movable_fini(struct mshv_mem_region *region);
-bool mshv_region_movable_init(struct mshv_mem_region *region);
-int mshv_map_pinned_region(struct mshv_mem_region *region);
-int mshv_map_movable_region(struct mshv_mem_region *region);
-int mshv_map_mmio_region(struct mshv_mem_region *region,
-			 unsigned long mmio_pfn);
+int mshv_map_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index d5197559d7650..68609f2452b3a 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1309,11 +1309,14 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
  */
 static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_user_mem_region *mem,
-					struct mshv_mem_region **regionpp,
-					bool is_mmio)
+					struct mshv_mem_region **regionpp)
 {
 	struct mshv_mem_region *rg;
+	enum mshv_region_type type;
 	u64 nr_pfns = HVPFN_DOWN(mem->size);
+	struct vm_area_struct *vma;
+	ulong mmio_pfn;
+	bool is_mmio;
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
@@ -1326,20 +1329,27 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	}
 	spin_unlock(&partition->pt_mem_regions_lock);
 
-	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
-				mem->userspace_addr, mem->flags);
-	if (IS_ERR(rg))
-		return PTR_ERR(rg);
+	mmap_read_lock(current->mm);
+	vma = vma_lookup(current->mm, mem->userspace_addr);
+	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
+	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
+	mmap_read_unlock(current->mm);
+
+	if (!vma)
+		return -EINVAL;
 
 	if (is_mmio)
-		rg->mreg_type = MSHV_REGION_TYPE_MMIO;
-	else if (mshv_partition_encrypted(partition) ||
-		 !mshv_region_movable_init(rg))
-		rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
+		type = MSHV_REGION_TYPE_MMIO;
+	else if (mshv_partition_encrypted(partition))
+		type = MSHV_REGION_TYPE_MEM_PINNED;
 	else
-		rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
+		type = MSHV_REGION_TYPE_MEM_MOVABLE;
 
-	rg->partition = partition;
+	rg = mshv_region_create(partition, type, mem->guest_pfn, nr_pfns,
+				mem->userspace_addr, mem->flags,
+				mmio_pfn);
+	if (IS_ERR(rg))
+		return PTR_ERR(rg);
 
 	*regionpp = rg;
 
@@ -1363,40 +1373,17 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		     struct mshv_user_mem_region *mem)
 {
 	struct mshv_mem_region *region;
-	struct vm_area_struct *vma;
-	bool is_mmio;
-	ulong mmio_pfn;
 	long ret;
 
 	if (mem->flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
 	    !access_ok((const void __user *)mem->userspace_addr, mem->size))
 		return -EINVAL;
 
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, mem->userspace_addr);
-	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
-	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
-	mmap_read_unlock(current->mm);
-
-	if (!vma)
-		return -EINVAL;
-
-	ret = mshv_partition_create_region(partition, mem, &region,
-					   is_mmio);
+	ret = mshv_partition_create_region(partition, mem, &region);
 	if (ret)
 		return ret;
 
-	switch (region->mreg_type) {
-	case MSHV_REGION_TYPE_MEM_PINNED:
-		ret = mshv_map_pinned_region(region);
-		break;
-	case MSHV_REGION_TYPE_MEM_MOVABLE:
-		ret = mshv_map_movable_region(region);
-		break;
-	case MSHV_REGION_TYPE_MMIO:
-		ret = mshv_map_mmio_region(region, mmio_pfn);
-		break;
-	}
+	ret = mshv_map_region(region);
 
 	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
 				   region->start_gfn, region->nr_pfns,



^ permalink raw reply related

* [PATCH v4 2/6] mshv: Rename mshv_mem_region to mshv_region
From: Stanislav Kinsburskii @ 2026-05-13 18:54 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869833161.19379.1357399549586915698.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The mshv_mem_region structure represents guest address space regions,
which can be either RAM-backed memory or memory-mapped IO regions
without physical backing. The "mem_" prefix incorrectly suggests the
structure only handles memory regions, creating confusion about its
actual purpose.

Remove the "mem_" prefix to align with existing function naming
(mshv_region_map, mshv_region_pin, etc.) and accurately reflect that
this structure manages arbitrary guest address space mappings
regardless of their backing type.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   76 ++++++++++++++++++++++---------------------
 drivers/hv/mshv_root.h      |   22 ++++++------
 drivers/hv/mshv_root_main.c |   22 ++++++------
 3 files changed, 60 insertions(+), 60 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 894eb4dcad93f..f81951ae3f808 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -32,7 +32,7 @@ static inline bool mshv_pfn_valid(unsigned long pfn)
 	return pfn != MSHV_INVALID_PFN;
 }
 
-static void mshv_region_init_pfns_range(struct mshv_mem_region *region,
+static void mshv_region_init_pfns_range(struct mshv_region *region,
 					u64 pfn_offset, u64 pfn_count)
 {
 	u64 i;
@@ -41,7 +41,7 @@ static void mshv_region_init_pfns_range(struct mshv_mem_region *region,
 		region->mreg_pfns[i] = MSHV_INVALID_PFN;
 }
 
-void mshv_region_init_pfns(struct mshv_mem_region *region)
+void mshv_region_init_pfns(struct mshv_region *region)
 {
 	mshv_region_init_pfns_range(region, 0, region->nr_pfns);
 }
@@ -106,7 +106,7 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
  * Return: Length of the run in PFNs, or a negative errno from
  *         mshv_chunk_stride() if the backing folio order is unsupported.
  */
-static long mshv_region_chunk_size(struct mshv_mem_region *region,
+static long mshv_region_chunk_size(struct mshv_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool *huge_page)
 {
@@ -164,10 +164,10 @@ static long mshv_region_chunk_size(struct mshv_mem_region *region,
  *
  * Returns 0 on success, or a negative error code on failure.
  */
-static int mshv_region_process_range(struct mshv_mem_region *region,
+static int mshv_region_process_range(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
+				     int (*handler)(struct mshv_region *region,
 						    u32 flags,
 						    u64 pfn_offset,
 						    u64 pfn_count,
@@ -202,13 +202,13 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(struct mshv_partition *partition,
-					   enum mshv_region_type type,
-					   u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags,
-					   unsigned long mmio_pfn)
+struct mshv_region *mshv_region_create(struct mshv_partition *partition,
+				       enum mshv_region_type type,
+				       u64 guest_pfn, u64 nr_pfns,
+				       u64 uaddr, u32 flags,
+				       unsigned long mmio_pfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	int ret = 0;
 
 	region = vzalloc(struct_size(region, mreg_pfns, nr_pfns));
@@ -257,7 +257,7 @@ struct mshv_mem_region *mshv_region_create(struct mshv_partition *partition,
 	return ERR_PTR(ret);
 }
 
-static int mshv_region_chunk_share(struct mshv_mem_region *region,
+static int mshv_region_chunk_share(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -276,7 +276,7 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 					      flags, true);
 }
 
-static int mshv_region_share(struct mshv_mem_region *region)
+static int mshv_region_share(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
@@ -285,7 +285,7 @@ static int mshv_region_share(struct mshv_mem_region *region)
 					 mshv_region_chunk_share);
 }
 
-static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
+static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
@@ -302,7 +302,7 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 					      flags, false);
 }
 
-static int mshv_region_unshare(struct mshv_mem_region *region)
+static int mshv_region_unshare(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
@@ -311,7 +311,7 @@ static int mshv_region_unshare(struct mshv_mem_region *region)
 					 mshv_region_chunk_unshare);
 }
 
-static int mshv_region_chunk_remap(struct mshv_mem_region *region,
+static int mshv_region_chunk_remap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -333,7 +333,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				    region->mreg_pfns + pfn_offset);
 }
 
-static int mshv_region_remap_pfns(struct mshv_mem_region *region,
+static int mshv_region_remap_pfns(struct mshv_region *region,
 				  u32 map_flags,
 				  u64 pfn_offset, u64 pfn_count)
 {
@@ -342,7 +342,7 @@ static int mshv_region_remap_pfns(struct mshv_mem_region *region,
 					 mshv_region_chunk_remap);
 }
 
-static int mshv_region_map(struct mshv_mem_region *region)
+static int mshv_region_map(struct mshv_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
@@ -350,7 +350,7 @@ static int mshv_region_map(struct mshv_mem_region *region)
 				      0, region->nr_pfns);
 }
 
-static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
+static void mshv_region_invalidate_pfns(struct mshv_region *region,
 					u64 pfn_offset, u64 pfn_count)
 {
 	u64 i;
@@ -366,12 +366,12 @@ static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
 	mshv_region_init_pfns_range(region, pfn_offset, pfn_count);
 }
 
-static void mshv_region_invalidate(struct mshv_mem_region *region)
+static void mshv_region_invalidate(struct mshv_region *region)
 {
 	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
-static int mshv_region_pin(struct mshv_mem_region *region)
+static int mshv_region_pin(struct mshv_region *region)
 {
 	u64 done_count, nr_pfns, i;
 	unsigned long *pfns;
@@ -426,7 +426,7 @@ static int mshv_region_pin(struct mshv_mem_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
+static int mshv_region_chunk_unmap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -439,7 +439,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				  pfn_count, flags);
 }
 
-static int mshv_region_unmap(struct mshv_mem_region *region)
+static int mshv_region_unmap(struct mshv_region *region)
 {
 	return mshv_region_process_range(region, 0,
 					 0, region->nr_pfns,
@@ -448,8 +448,8 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
 
 static void mshv_region_destroy(struct kref *ref)
 {
-	struct mshv_mem_region *region =
-		container_of(ref, struct mshv_mem_region, mreg_refcount);
+	struct mshv_region *region =
+		container_of(ref, struct mshv_region, mreg_refcount);
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
@@ -473,12 +473,12 @@ static void mshv_region_destroy(struct kref *ref)
 	vfree(region);
 }
 
-void mshv_region_put(struct mshv_mem_region *region)
+void mshv_region_put(struct mshv_region *region)
 {
 	kref_put(&region->mreg_refcount, mshv_region_destroy);
 }
 
-int mshv_region_get(struct mshv_mem_region *region)
+int mshv_region_get(struct mshv_region *region)
 {
 	return kref_get_unless_zero(&region->mreg_refcount);
 }
@@ -505,7 +505,7 @@ int mshv_region_get(struct mshv_mem_region *region)
  *
  * Return: 0 on success, negative error code on failure.
  */
-static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
 					  unsigned long start,
 					  unsigned long end,
 					  unsigned long *pfns,
@@ -595,7 +595,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
  *
  * Return: 0 on success, negative errno on failure.
  */
-static int mshv_region_collect_and_map(struct mshv_mem_region *region,
+static int mshv_region_collect_and_map(struct mshv_region *region,
 				       u64 pfn_offset, u64 pfn_count,
 				       bool do_fault)
 {
@@ -640,14 +640,14 @@ static int mshv_region_collect_and_map(struct mshv_mem_region *region,
 	return ret;
 }
 
-static int mshv_region_range_fault(struct mshv_mem_region *region,
+static int mshv_region_range_fault(struct mshv_region *region,
 				   u64 pfn_offset, u64 pfn_count)
 {
 	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
 					   true);
 }
 
-bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 {
 	u64 pfn_offset, pfn_count;
 	int ret;
@@ -693,9 +693,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 					    const struct mmu_notifier_range *range,
 					    unsigned long cur_seq)
 {
-	struct mshv_mem_region *region = container_of(mni,
-						      struct mshv_mem_region,
-						      mreg_mni);
+	struct mshv_region *region = container_of(mni,
+						  struct mshv_region,
+						  mreg_mni);
 	u64 pfn_offset, pfn_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
@@ -754,7 +754,7 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
  *
  * Return: 0 on success, negative error code on failure.
  */
-static int mshv_map_pinned_region(struct mshv_mem_region *region)
+static int mshv_map_pinned_region(struct mshv_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
@@ -824,7 +824,7 @@ static int mshv_map_pinned_region(struct mshv_mem_region *region)
  *
  * Returns: 0 on success, negative error code on failure.
  */
-static int mshv_map_movable_region(struct mshv_mem_region *region)
+static int mshv_map_movable_region(struct mshv_region *region)
 {
 	u64 pfn, count;
 	int ret;
@@ -842,7 +842,7 @@ static int mshv_map_movable_region(struct mshv_mem_region *region)
 	return 0;
 }
 
-static int mshv_map_mmio_region(struct mshv_mem_region *region)
+static int mshv_map_mmio_region(struct mshv_region *region)
 {
 	return hv_call_map_mmio_pfns(region->partition->pt_id,
 				     region->start_gfn,
@@ -850,7 +850,7 @@ static int mshv_map_mmio_region(struct mshv_mem_region *region)
 				     region->nr_pfns);
 }
 
-int mshv_map_region(struct mshv_mem_region *region)
+int mshv_map_region(struct mshv_region *region)
 {
 	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 40fcf57219ece..e9bd18013b486 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -82,7 +82,7 @@ enum mshv_region_type {
 	MSHV_REGION_TYPE_MMIO
 };
 
-struct mshv_mem_region {
+struct mshv_region {
 	struct hlist_node hnode;
 	struct kref mreg_refcount;
 	u64 nr_pfns;
@@ -370,15 +370,15 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
-struct mshv_mem_region *mshv_region_create(struct mshv_partition *partition,
-					   enum mshv_region_type type,
-					   u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags,
-					   unsigned long mmio_pfn);
-void mshv_region_init_pfns(struct mshv_mem_region *region);
-void mshv_region_put(struct mshv_mem_region *region);
-int mshv_region_get(struct mshv_mem_region *region);
-bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
-int mshv_map_region(struct mshv_mem_region *region);
+struct mshv_region *mshv_region_create(struct mshv_partition *partition,
+				       enum mshv_region_type type,
+				       u64 guest_pfn, u64 nr_pfns,
+				       u64 uaddr, u32 flags,
+				       unsigned long mmio_pfn);
+void mshv_region_init_pfns(struct mshv_region *region);
+void mshv_region_put(struct mshv_region *region);
+int mshv_region_get(struct mshv_region *region);
+bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn);
+int mshv_map_region(struct mshv_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 68609f2452b3a..0e9d14d1d6167 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -612,10 +612,10 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
-static struct mshv_mem_region *
+static struct mshv_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
 		if (gfn >= region->start_gfn &&
@@ -626,10 +626,10 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 	return NULL;
 }
 
-static struct mshv_mem_region *
+static struct mshv_region *
 mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	spin_lock(&p->pt_mem_regions_lock);
 	region = mshv_partition_region_by_gfn(p, gfn);
@@ -656,7 +656,7 @@ mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
 static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
 {
 	struct mshv_partition *p = vp->vp_partition;
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	bool ret = false;
 	u64 gfn;
 #if defined(CONFIG_X86_64)
@@ -933,7 +933,7 @@ mshv_vp_ioctl_translate_gva(struct mshv_vp *vp, void __user *user_args)
 			return ret;
 
 		if (mshv_gpa_fault_retryable(result.result_code)) {
-			struct mshv_mem_region *region;
+			struct mshv_region *region;
 			bool faulted;
 
 			region = mshv_partition_region_by_gfn_get(partition,
@@ -1309,9 +1309,9 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
  */
 static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_user_mem_region *mem,
-					struct mshv_mem_region **regionpp)
+					struct mshv_region **regionpp)
 {
-	struct mshv_mem_region *rg;
+	struct mshv_region *rg;
 	enum mshv_region_type type;
 	u64 nr_pfns = HVPFN_DOWN(mem->size);
 	struct vm_area_struct *vma;
@@ -1372,7 +1372,7 @@ static long
 mshv_map_user_memory(struct mshv_partition *partition,
 		     struct mshv_user_mem_region *mem)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	long ret;
 
 	if (mem->flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
@@ -1408,7 +1408,7 @@ static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region *mem)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	if (!(mem->flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1780,7 +1780,7 @@ remove_partition(struct mshv_partition *partition)
 static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	struct hlist_node *n;
 	int i;
 



^ permalink raw reply related

* [PATCH v4 3/6] mshv: Optimize region unmap and invalidation with large pages
From: Stanislav Kinsburskii @ 2026-05-13 18:54 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869833161.19379.1357399549586915698.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Region unmapping and no-access remapping do not require per-PFN
iteration: all GFNs in a region are guaranteed mapped (valid PFNs
with access permissions, holes with no-access), so they can be
processed in bulk.

Split GFN ranges into large-page-aligned chunks and use
HV_MAP_GPA_LARGE_PAGE / HV_UNMAP_GPA_LARGE_PAGE flags for the
aligned portions. This eliminates PFN traversal and reduces
hypercalls for large regions.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   77 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 64 insertions(+), 13 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index f81951ae3f808..cb42ee49c2e2f 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,11 +20,17 @@
 /* Process memory regions in chunks to avoid soft lockups and livelock */
 #define MSHV_MAX_PFN_BATCH				(SZ_2M / PAGE_SIZE)
 
+/* Hypervisor base pages per large page (2 MiB / 4 KiB) */
+#define HV_PFNS_PER_LARGE_PAGE				(SZ_2M / HV_HYP_PAGE_SIZE)
+
 #define MSHV_MAP_FAULT_IN_PAGES				\
 	(PTRS_PER_PMD * max_t(unsigned long, 1, PAGE_SIZE / HV_HYP_PAGE_SIZE))
 
 #define MSHV_INVALID_PFN				ULONG_MAX
 
+typedef int (*gfn_handler_t)(struct mshv_region *region,
+			      u64 gfn, u64 count, u32 flags);
+
 static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
 
 static inline bool mshv_pfn_valid(unsigned long pfn)
@@ -426,24 +432,54 @@ static int mshv_region_pin(struct mshv_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_region *region,
-				   u32 flags,
-				   u64 pfn_offset, u64 pfn_count,
-				   bool huge_page)
+/*
+ * Split a GFN range into head (unaligned), large-page-aligned middle,
+ * and tail, invoking @fn for each non-empty piece.
+ */
+static int mshv_region_for_each_gfn_chunk(struct mshv_region *region,
+					  u64 gfn, u64 nr_pfns,
+					  u32 base_flags, u32 large_flag,
+					  gfn_handler_t fn)
 {
-	if (huge_page)
-		flags |= HV_UNMAP_GPA_LARGE_PAGE;
+	u64 head, aligned, tail;
+	int ret;
+
+	head = min(ALIGN(gfn, HV_PFNS_PER_LARGE_PAGE) - gfn, nr_pfns);
+	aligned = ALIGN_DOWN(nr_pfns - head, HV_PFNS_PER_LARGE_PAGE);
+	tail = nr_pfns - head - aligned;
+
+	if (head) {
+		ret = fn(region, gfn, head, base_flags);
+		if (ret)
+			return ret;
+	}
+	if (aligned) {
+		ret = fn(region, gfn + head, aligned,
+			 base_flags | large_flag);
+		if (ret)
+			return ret;
+	}
+	if (tail) {
+		ret = fn(region, gfn + head + aligned, tail, base_flags);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
 
+static int mshv_unmap_gfns(struct mshv_region *region,
+			   u64 gfn, u64 count, u32 flags)
+{
 	return hv_call_unmap_pfns(region->partition->pt_id,
-				  region->start_gfn + pfn_offset,
-				  pfn_count, flags);
+				  gfn, count, flags);
 }
 
 static int mshv_region_unmap(struct mshv_region *region)
 {
-	return mshv_region_process_range(region, 0,
-					 0, region->nr_pfns,
-					 mshv_region_chunk_unmap);
+	return mshv_region_for_each_gfn_chunk(region, region->start_gfn,
+					      region->nr_pfns,
+					      0, HV_UNMAP_GPA_LARGE_PAGE,
+					      mshv_unmap_gfns);
 }
 
 static void mshv_region_destroy(struct kref *ref)
@@ -671,6 +707,22 @@ bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 	return !ret;
 }
 
+static int mshv_map_no_access_gfns(struct mshv_region *region,
+				   u64 gfn, u64 count, u32 flags)
+{
+	return hv_call_map_ram_pfns(region->partition->pt_id,
+				    gfn, count, flags, NULL);
+}
+
+static int mshv_region_map_no_access(struct mshv_region *region,
+				     u64 pfn_offset, u64 pfn_count)
+{
+	return mshv_region_for_each_gfn_chunk(region,
+			region->start_gfn + pfn_offset, pfn_count,
+			HV_MAP_GPA_NO_ACCESS, HV_MAP_GPA_LARGE_PAGE,
+			mshv_map_no_access_gfns);
+}
+
 /**
  * mshv_region_interval_invalidate - Invalidate a range of memory region
  * @mni: Pointer to the mmu_interval_notifier structure
@@ -714,8 +766,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
-				     pfn_offset, pfn_count);
+	ret = mshv_region_map_no_access(region, pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 



^ permalink raw reply related

* [PATCH v4 4/6] mshv: Pass pfns array explicitly through processing chain
From: Stanislav Kinsburskii @ 2026-05-13 18:54 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869833161.19379.1357399549586915698.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The current implementation relies on accessing region->pfns directly
within the pfn processing chain, making it difficult to use these
handlers with alternative pfn sources. This tight coupling limits
flexibility when processing pfns from different locations, such as
temporary arrays or external sources.

By threading the pfns pointer through the entire processing chain
(mshv_region_process_range, mshv_region_chunk_size, and all
handlers), we decouple the processing logic from the storage location.
This enables future enhancements like processing pfns from multiple
sources or implementing more sophisticated memory management strategies
without duplicating the core processing logic.

No functional change intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   50 +++++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index cb42ee49c2e2f..87204b2b48290 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -31,6 +31,10 @@
 typedef int (*gfn_handler_t)(struct mshv_region *region,
 			      u64 gfn, u64 count, u32 flags);
 
+typedef int (*pfn_handler_t)(struct mshv_region *region, u32 flags,
+			     u64 pfn_offset, u64 pfn_count,
+			     unsigned long *pfns, bool huge_page);
+
 static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
 
 static inline bool mshv_pfn_valid(unsigned long pfn)
@@ -98,6 +102,7 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
  * @region    : Memory region whose PFN array is being walked.
  * @pfn_offset: Offset into region->mreg_pfns at which to start.
  * @pfn_count : Upper bound on the run length.
+ * @pfns      : Pointer to an array of PFNs corresponding to the region.
  * @huge_page : Out-parameter set to true if the run may be dispatched
  *              as a 2 MiB chunk; false for 4 KiB-stride dispatch.
  *
@@ -114,12 +119,13 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
  */
 static long mshv_region_chunk_size(struct mshv_region *region,
 				   u64 pfn_offset, u64 pfn_count,
-				   bool *huge_page)
+				   unsigned long *pfns, bool *huge_page)
 {
-	unsigned long *pfns = region->mreg_pfns + pfn_offset;
 	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count = 0, stride;
 
+	pfns += pfn_offset;
+
 	if (!mshv_pfn_valid(pfns[0])) {
 		for (count = 1; count < pfn_count; count++) {
 			if (mshv_pfn_valid(pfns[count]))
@@ -158,6 +164,7 @@ static long mshv_region_chunk_size(struct mshv_region *region,
  * @flags     : Flags to pass to the handler.
  * @pfn_offset: Offset into the region's PFNs array to start processing.
  * @pfn_count : Number of PFNs to process.
+ * @pfns      : Pointer to an array of PFNs corresponding to the region.
  * @handler   : Callback function to handle each chunk of contiguous
  *              valid PFNs.
  *
@@ -173,11 +180,8 @@ static long mshv_region_chunk_size(struct mshv_region *region,
 static int mshv_region_process_range(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     unsigned long *pfns,
+				     pfn_handler_t handler)
 {
 	u64 end;
 	long ret;
@@ -193,11 +197,12 @@ static int mshv_region_process_range(struct mshv_region *region,
 		long count;
 
 		count = mshv_region_chunk_size(region, pfn_offset, pfn_count,
-					       &huge_page);
+					       pfns, &huge_page);
 		if (count < 0)
 			return count;
 
-		ret = handler(region, flags, pfn_offset, count, huge_page);
+		ret = handler(region, flags, pfn_offset, count, pfns,
+			      huge_page);
 		if (ret < 0)
 			return ret;
 
@@ -266,16 +271,17 @@ struct mshv_region *mshv_region_create(struct mshv_partition *partition,
 static int mshv_region_chunk_share(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
+				   unsigned long *pfns,
 				   bool huge_page)
 {
-	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!mshv_pfn_valid(pfns[pfn_offset]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pfns + pfn_offset,
+					      pfns + pfn_offset,
 					      pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
@@ -288,22 +294,24 @@ static int mshv_region_share(struct mshv_region *region)
 
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
+					 region->mreg_pfns,
 					 mshv_region_chunk_share);
 }
 
 static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     bool huge_page)
 {
-	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!mshv_pfn_valid(pfns[pfn_offset]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pfns + pfn_offset,
+					      pfns + pfn_offset,
 					      pfn_count, 0,
 					      flags, false);
 }
@@ -314,12 +322,14 @@ static int mshv_region_unshare(struct mshv_region *region)
 
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
+					 region->mreg_pfns,
 					 mshv_region_chunk_unshare);
 }
 
 static int mshv_region_chunk_remap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
+				   unsigned long *pfns,
 				   bool huge_page)
 {
 	/*
@@ -327,7 +337,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	 * hypervisor track dirty pages, enabling precopy live
 	 * migration.
 	 */
-	if (!mshv_pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!mshv_pfn_valid(pfns[pfn_offset]))
 		flags = HV_MAP_GPA_NO_ACCESS;
 
 	if (huge_page)
@@ -336,15 +346,17 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	return hv_call_map_ram_pfns(region->partition->pt_id,
 				    region->start_gfn + pfn_offset,
 				    pfn_count, flags,
-				    region->mreg_pfns + pfn_offset);
+				    pfns + pfn_offset);
 }
 
 static int mshv_region_remap_pfns(struct mshv_region *region,
 				  u32 map_flags,
-				  u64 pfn_offset, u64 pfn_count)
+				  u64 pfn_offset, u64 pfn_count,
+				  unsigned long *pfns)
 {
 	return mshv_region_process_range(region, map_flags,
 					 pfn_offset, pfn_count,
+					 pfns,
 					 mshv_region_chunk_remap);
 }
 
@@ -353,7 +365,8 @@ static int mshv_region_map(struct mshv_region *region)
 	u32 map_flags = region->hv_map_flags;
 
 	return mshv_region_remap_pfns(region, map_flags,
-				      0, region->nr_pfns);
+				      0, region->nr_pfns,
+				      region->mreg_pfns);
 }
 
 static void mshv_region_invalidate_pfns(struct mshv_region *region,
@@ -668,7 +681,8 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 	}
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
-				     pfn_offset, pfn_count);
+				     pfn_offset, pfn_count,
+				     region->mreg_pfns);
 
 	mutex_unlock(&region->mreg_mutex);
 out:



^ permalink raw reply related

* [PATCH v4 5/6] mshv: Simplify pfn array handling in region processing
From: Stanislav Kinsburskii @ 2026-05-13 18:55 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177869833161.19379.1357399549586915698.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The current code requires passing both the full pfn array and an offset
parameter to region processing functions, forcing callees to manually
index into arrays. This approach is inflexible and makes it difficult
to work with different sources of pfn arrays.

Upcoming changes will need to pass pfn arrays obtained from the HMM
framework directly to these functions. The HMM framework returns arrays
that represent specific ranges rather than full region arrays with
offsets, making the current offset-based indexing pattern incompatible.

Refactor by having callers pass pre-offset pointers to pfn arrays and
removing offset-based indexing from callees. This allows functions to
work with any pfn array starting at index 0, regardless of its source,
and prepares the code for HMM integration.

No functional change intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 87204b2b48290..e20db61e9829f 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -99,14 +99,13 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
 
 /**
  * mshv_region_chunk_size - Length of the next contiguous PFN run in a region.
- * @region    : Memory region whose PFN array is being walked.
- * @pfn_offset: Offset into region->mreg_pfns at which to start.
- * @pfn_count : Upper bound on the run length.
- * @pfns      : Pointer to an array of PFNs corresponding to the region.
- * @huge_page : Out-parameter set to true if the run may be dispatched
+ * @gfn      : GFN corresponding to the start of the PFN run.
+ * @pfn_count: Upper bound on the run length.
+ * @pfns     : PFN array starting at the chunk's first PFN.
+ * @huge_page: Out-parameter set to true if the run may be dispatched
  *              as a 2 MiB chunk; false for 4 KiB-stride dispatch.
  *
- * Returns the length of the longest contiguous run starting at @pfn_offset
+ * Returns the length of the longest contiguous run starting at at @pfns[0]
  * that shares the classification of the first PFN: either a same-stride run of
  * valid PFNs (4 KiB or 2 MiB) or a hole of invalid PFNs. A hole that is
  * huge-page aligned in @gfn space and at least PTRS_PER_PMD entries long is
@@ -117,15 +116,11 @@ static int mshv_chunk_stride(unsigned long pfn, u64 gfn, u64 pfn_count)
  * Return: Length of the run in PFNs, or a negative errno from
  *         mshv_chunk_stride() if the backing folio order is unsupported.
  */
-static long mshv_region_chunk_size(struct mshv_region *region,
-				   u64 pfn_offset, u64 pfn_count,
+static long mshv_region_chunk_size(u64 gfn, u64 pfn_count,
 				   unsigned long *pfns, bool *huge_page)
 {
-	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count = 0, stride;
 
-	pfns += pfn_offset;
-
 	if (!mshv_pfn_valid(pfns[0])) {
 		for (count = 1; count < pfn_count; count++) {
 			if (mshv_pfn_valid(pfns[count]))
@@ -162,7 +157,7 @@ static long mshv_region_chunk_size(struct mshv_region *region,
  * mshv_region_process_range - Processes a range of PFNs in a region.
  * @region    : Pointer to the memory region structure.
  * @flags     : Flags to pass to the handler.
- * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_offset: Offset into the region's PFN's array to start processing.
  * @pfn_count : Number of PFNs to process.
  * @pfns      : Pointer to an array of PFNs corresponding to the region.
  * @handler   : Callback function to handle each chunk of contiguous
@@ -183,6 +178,7 @@ static int mshv_region_process_range(struct mshv_region *region,
 				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
+	u64 gfn = region->start_gfn + pfn_offset;
 	u64 end;
 	long ret;
 
@@ -196,7 +192,7 @@ static int mshv_region_process_range(struct mshv_region *region,
 		bool huge_page;
 		long count;
 
-		count = mshv_region_chunk_size(region, pfn_offset, pfn_count,
+		count = mshv_region_chunk_size(gfn, pfn_count,
 					       pfns, &huge_page);
 		if (count < 0)
 			return count;
@@ -208,6 +204,8 @@ static int mshv_region_process_range(struct mshv_region *region,
 
 		pfn_offset += count;
 		pfn_count -= count;
+		pfns += count;
+		gfn += count;
 	}
 
 	return 0;
@@ -274,15 +272,14 @@ static int mshv_region_chunk_share(struct mshv_region *region,
 				   unsigned long *pfns,
 				   bool huge_page)
 {
-	if (!mshv_pfn_valid(pfns[pfn_offset]))
+	if (!mshv_pfn_valid(pfns[0]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      pfns + pfn_offset,
-					      pfn_count,
+					      pfns, pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
 					      flags, true);
@@ -304,15 +301,15 @@ static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     unsigned long *pfns,
 				     bool huge_page)
 {
-	if (!mshv_pfn_valid(pfns[pfn_offset]))
+	if (!mshv_pfn_valid(pfns[0]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      pfns + pfn_offset,
-					      pfn_count, 0,
+					      pfns, pfn_count,
+					      0,
 					      flags, false);
 }
 
@@ -337,7 +334,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	 * hypervisor track dirty pages, enabling precopy live
 	 * migration.
 	 */
-	if (!mshv_pfn_valid(pfns[pfn_offset]))
+	if (!mshv_pfn_valid(pfns[0]))
 		flags = HV_MAP_GPA_NO_ACCESS;
 
 	if (huge_page)
@@ -346,7 +343,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	return hv_call_map_ram_pfns(region->partition->pt_id,
 				    region->start_gfn + pfn_offset,
 				    pfn_count, flags,
-				    pfns + pfn_offset);
+				    pfns);
 }
 
 static int mshv_region_remap_pfns(struct mshv_region *region,
@@ -682,7 +679,7 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
 				     pfn_offset, pfn_count,
-				     region->mreg_pfns);
+				     region->mreg_pfns + pfn_offset);
 
 	mutex_unlock(&region->mreg_mutex);
 out:



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox