Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH net v3 3/5] net: mana: Guard mana_remove against double invocation
From: Simon Horman @ 2026-04-17 14:09 UTC (permalink / raw)
  To: ernis
  Cc: Simon Horman, haiyangz, davem, longli, ssengar, linux-kernel,
	kuba, kotaranov, stephen, andrew+netdev, kys, wei.liu,
	shirazsaleem, kees, pabeni, linux-hyperv, edumazet, decui, leon,
	gargaditya, dipayanroy, netdev, shacharr
In-Reply-To: <20260415080944.732901-4-ernis@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net: mana: Guard mana_remove against double invocation

If PM resume fails (e.g., mana_attach() returns an error), mana_probe()
calls mana_remove(), which tears down the device and sets
gd->gdma_context = NULL and gd->driver_data = NULL. However, a failed
resume callback does not automatically unbind the driver. When the device
is eventually unbound, mana_remove() is invoked a second time. Without a
NULL check, it dereferences gc->dev with gc == NULL, causing a kernel
panic. Add an early return if gdma_context or driver_data is NULL so the
second invocation is harmless.

> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")

The Fixes: tag points to ca9c54d2d6a5, which introduced mana_remove()
without NULL checks. However, the exploitable double invocation scenario
was actually introduced by commit 635096a86edb ("net: mana: Support
hibernation and kexec"), which added suspend/resume support and made
mana_probe() call mana_remove() on error.

Should the Fixes: tag be:

Fixes: 635096a86edb ("net: mana: Support hibernation and kexec")

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9d095a6fb56c..2ab7c89e2fed 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -3685,11 +3685,16 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
>  	struct gdma_context *gc = gd->gdma_context;
>  	struct mana_context *ac = gd->driver_data;
>  	struct mana_port_context *apc;
> -	struct device *dev = gc->dev;
> +	struct device *dev;
>  	struct net_device *ndev;
>  	int err;
>  	int i;
>
> +	if (!gc || !ac)
> +		return;
> +
> +	dev = gc->dev;
> +
>  	disable_work_sync(&ac->link_change_work);
>  	cancel_delayed_work_sync(&ac->gf_stats_work);

^ permalink raw reply

* Re: [PATCH net v3 4/5] net: mana: Don't overwrite port probe error with add_adev result
From: Simon Horman @ 2026-04-17 14:10 UTC (permalink / raw)
  To: ernis
  Cc: Simon Horman, haiyangz, davem, longli, ssengar, linux-kernel,
	kuba, kotaranov, stephen, andrew+netdev, kys, wei.liu,
	shirazsaleem, kees, pabeni, linux-hyperv, edumazet, decui, leon,
	gargaditya, dipayanroy, netdev, shacharr
In-Reply-To: <20260415080944.732901-5-ernis@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net: mana: Don't overwrite port probe error with add_adev result

In mana_probe(), if mana_probe_port() fails for any port, the error
is stored in 'err' and the loop breaks. However, the subsequent
unconditional 'err = add_adev(gd, "eth")' overwrites this error.
If add_adev() succeeds, mana_probe() returns success despite ports
being left in a partially initialized state (ac->ports[i] == NULL).
Only call add_adev() when there is no prior error, so the probe
correctly fails and triggers mana_remove() cleanup.

> In mana_probe(), if mana_probe_port() fails for any port, the error
> is stored in 'err' and the loop breaks. However, the subsequent
> unconditional 'err = add_adev(gd, "eth")' overwrites this error.
> If add_adev() succeeds, mana_probe() returns success despite ports
> being left in a partially initialized state (ac->ports[i] == NULL).
>
> Only call add_adev() when there is no prior error, so the probe
> correctly fails and triggers mana_remove() cleanup.
>
> Fixes: ced82fce77e9 ("net: mana: Probe rdma device in mana driver")

The existing Fixes: tag points to commit ced82fce77e9 ("net: mana: Probe
rdma device in mana driver"), but was the bug actually introduced there?

Looking at the commit history, commit a69839d4327d ("net: mana: Add support
for auxiliary device") added the unconditional 'err = add_adev(gd);' call
that overwrites the error from mana_probe_port(). Commit ced82fce77e9 only
modified the add_adev signature from add_adev(gd) to add_adev(gd, "eth")
but did not introduce the buggy pattern.

Should the Fixes: tag be:
    Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")

^ permalink raw reply

* Re: [PATCH] Drivers: hv: vmbus: Improve the logc of reserving fb_mmio on Gen2 VMs
From: Hardik Garg @ 2026-04-17 18:19 UTC (permalink / raw)
  To: Dexuan Cui, kys, haiyangz, wei.liu, longli, linux-hyperv,
	linux-kernel, mhklinux, matthew.ruffell, johansen
  Cc: stable
In-Reply-To: <20260416183529.838321-1-decui@microsoft.com>



On 4/16/2026 11:35 AM, Dexuan Cui wrote:
> If vmbus_reserve_fb() in the kdump kernel fails to properly reserve the
> framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero [1],
> there is an MMIO conflict between the drivers hyperv_drm and pci-hyperv.
> This is especially an issue if pci-hyperv is built-in and hyperv_drm is
> built as a module. Consequently, the kdump kernel fails to detect PCI
> devices via pci-hyperv, and may fail to mount the root file system,
> which may reside in a NVMe disk.
> 
> On Gen2 VMs, if the screen.lfb_base is 0 in the kdump kernel, fall
> back to the low MMIO base, which should be equal to the framebuffer
> MMIO base (Tested on x64 Windows Server 2016, and on x64 and ARM64 Windows
> Server 2025 and on Azure) [2]. In the first kernel, screen.lfb_base
> is not 0; if the user specifies a high resolution, it's not enough to
> only reserve 8MB: in this case, reserve half of the space below 4GB, but
> cap the reservation to 128MB, which is the required framebuffer size of
> the highest resolution 7680*4320 supported by Hyper-V.
> 
> Add the cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) check, because a CoCo
> VM (i.e. Confidential VM) on Hyper-V doesn't have any framebuffer
> device, so there is no need to reserve any MMIO for it.
> 
> While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> the > to >=. Here the 'end' is an inclusive end (typically, it's
> 0xFFFF_FFFF).
> 
> [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> CC: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
Reviewed-by: Hardik Garg <hargar@linux.microsoft.com>




Thanks,
Hardik

^ permalink raw reply

* Re: [PATCH] Drivers: hv: vmbus: Improve the logc of reserving fb_mmio on Gen2 VMs
From: Krister Johansen @ 2026-04-17 20:24 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: kys, haiyangz, wei.liu, longli, linux-hyperv, linux-kernel,
	mhklinux, matthew.ruffell, stable
In-Reply-To: <20260416183529.838321-1-decui@microsoft.com>

On Thu, Apr 16, 2026 at 11:35:29AM -0700, Dexuan Cui wrote:
> If vmbus_reserve_fb() in the kdump kernel fails to properly reserve the
> framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero [1],
> there is an MMIO conflict between the drivers hyperv_drm and pci-hyperv.
> This is especially an issue if pci-hyperv is built-in and hyperv_drm is
> built as a module. Consequently, the kdump kernel fails to detect PCI
> devices via pci-hyperv, and may fail to mount the root file system,
> which may reside in a NVMe disk.
> 
> On Gen2 VMs, if the screen.lfb_base is 0 in the kdump kernel, fall
> back to the low MMIO base, which should be equal to the framebuffer
> MMIO base (Tested on x64 Windows Server 2016, and on x64 and ARM64 Windows
> Server 2025 and on Azure) [2]. In the first kernel, screen.lfb_base
> is not 0; if the user specifies a high resolution, it's not enough to
> only reserve 8MB: in this case, reserve half of the space below 4GB, but
> cap the reservation to 128MB, which is the required framebuffer size of
> the highest resolution 7680*4320 supported by Hyper-V.
> 
> Add the cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) check, because a CoCo
> VM (i.e. Confidential VM) on Hyper-V doesn't have any framebuffer
> device, so there is no need to reserve any MMIO for it.
> 
> While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> the > to >=. Here the 'end' is an inclusive end (typically, it's
> 0xFFFF_FFFF).
> 
> [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> CC: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
 
Thanks for the updated patch.  I tested this on the arm64 instances that
had been failing and was able to confirm that without it present the
failure still occurred, but with the new patch networking was able to
attach correctly in the dump environment and kdumps were successful.

Tested-by: Krister Johansen <kjlx@templeofstupid.com>

-K

^ permalink raw reply

* Re: [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device
From: Mukesh R @ 2026-04-18  0:20 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <aW-pw7GlQdFv-lf5@skinsburskii.localdomain>

On 1/20/26 08:13, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:22PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Add ioctl support for creating MSHV devices for a paritition. At
>> present only VFIO device types are supported, but more could be
>> added. At a high level, a partition ioctl to create device verifies
>> it is of type VFIO and does some setup for bridge code in mshv_vfio.c.
>> Adapted from KVM device ioctls.
>>
>> Credits: Original author: Wei Liu <wei.liu@kernel.org>
>> NB: Slightly modified from the original version.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>   drivers/hv/mshv_root_main.c | 126 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 126 insertions(+)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index 83c7bad269a0..27313419828d 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -1551,6 +1551,129 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
>>   	return ret;
>>   }
>>   
>> +static long mshv_device_attr_ioctl(struct mshv_device *mshv_dev, int cmd,
>> +				   ulong uarg)
>> +{
>> +	struct mshv_device_attr attr;
>> +	const struct mshv_device_ops *devops = mshv_dev->device_ops;
>> +
>> +	if (copy_from_user(&attr, (void __user *)uarg, sizeof(attr)))
>> +		return -EFAULT;
>> +
>> +	switch (cmd) {
>> +	case MSHV_SET_DEVICE_ATTR:
>> +		if (devops->device_set_attr)
>> +			return devops->device_set_attr(mshv_dev, &attr);
>> +		break;
>> +	case MSHV_HAS_DEVICE_ATTR:
>> +		if (devops->device_has_attr)
>> +			return devops->device_has_attr(mshv_dev, &attr);
>> +		break;
>> +	}
>> +
>> +	return -EPERM;
>> +}
>> +
>> +static long mshv_device_fop_ioctl(struct file *filp, unsigned int cmd,
>> +				  ulong uarg)
>> +{
>> +	struct mshv_device *mshv_dev = filp->private_data;
>> +
>> +	switch (cmd) {
>> +	case MSHV_SET_DEVICE_ATTR:
>> +	case MSHV_HAS_DEVICE_ATTR:
>> +		return mshv_device_attr_ioctl(mshv_dev, cmd, uarg);
>> +	}
>> +
>> +	return -ENOTTY;
>> +}
>> +
>> +static int mshv_device_fop_release(struct inode *inode, struct file *filp)
>> +{
>> +	struct mshv_device *mshv_dev = filp->private_data;
>> +	struct mshv_partition *partition = mshv_dev->device_pt;
>> +
>> +	if (mshv_dev->device_ops->device_release) {
>> +		mutex_lock(&partition->pt_mutex);
>> +		hlist_del(&mshv_dev->device_ptnode);
>> +		mshv_dev->device_ops->device_release(mshv_dev);
>> +		mutex_unlock(&partition->pt_mutex);
>> +	}
>> +
>> +	mshv_partition_put(partition);
>> +	return 0;
>> +}
>> +
>> +static const struct file_operations mshv_device_fops = {
>> +	.owner = THIS_MODULE,
>> +	.unlocked_ioctl = mshv_device_fop_ioctl,
>> +	.release = mshv_device_fop_release,
>> +};
>> +
>> +long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
>> +					void __user *uarg)
>> +{
>> +	long rc;
>> +	struct mshv_create_device devargk;
>> +	struct mshv_device *mshv_dev;
>> +	const struct mshv_device_ops *vfio_ops;
>> +	int type;
>> +
>> +	if (copy_from_user(&devargk, uarg, sizeof(devargk))) {
>> +		rc = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	/* At present, only VFIO is supported */
>> +	if (devargk.type != MSHV_DEV_TYPE_VFIO) {
>> +		rc = -ENODEV;
>> +		goto out;
>> +	}
>> +
>> +	if (devargk.flags & MSHV_CREATE_DEVICE_TEST) {
>> +		rc = 0;
>> +		goto out;
>> +	}
>> +
>> +	mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
>> +	if (mshv_dev == NULL) {
>> +		rc = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	vfio_ops = &mshv_vfio_device_ops;
>> +	mshv_dev->device_ops = vfio_ops;
>> +	mshv_dev->device_pt = partition;
>> +
>> +	rc = vfio_ops->device_create(mshv_dev, type);
>> +	if (rc < 0) {
>> +		kfree(mshv_dev);
>> +		goto out;
>> +	}
>> +
>> +	hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
>> +
>> +	mshv_partition_get(partition);
>> +	rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
>> +			      mshv_dev, O_RDWR | O_CLOEXEC);
>> +	if (rc < 0) {
>> +		mshv_partition_put(partition);
>> +		hlist_del(&mshv_dev->device_ptnode);
>> +		vfio_ops->device_release(mshv_dev);
>> +		goto out;
>> +	}
>> +
>> +	devargk.fd = rc;
>> +	rc = 0;
>> +
>> +	if (copy_to_user(uarg, &devargk, sizeof(devargk))) {
> 
> Shouldn't the partition be put here?

No. anon_inode_getfd was successful and so it installed the fd already..
As a result the cleanup will happen in the file op release.

Thanks,
-Mukesh

> Thanks,
> Stanislav
> 
>> +		rc = -EFAULT;
>> +		goto out;
>> +	}
>> +out:
>> +	return rc;
>> +}
>> +
>>   static long
>>   mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>   {
>> @@ -1587,6 +1710,9 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>   	case MSHV_ROOT_HVCALL:
>>   		ret = mshv_ioctl_passthru_hvcall(partition, true, uarg);
>>   		break;
>> +	case MSHV_CREATE_DEVICE:
>> +		ret = mshv_partition_ioctl_create_device(partition, uarg);
>> +		break;
>>   	default:
>>   		ret = -ENOTTY;
>>   	}
>> -- 
>> 2.51.2.vfs.0.1
>>


^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Junrui Luo @ 2026-04-20  5:17 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: vdso@mailbox.org, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Nuno Das Neves, Anirudh Rayabharam,
	Mukesh Rathor, Muminul Islam, Praveen K Paladugu, Jinank Jain,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yuhao Jiang, stable@vger.kernel.org
In-Reply-To: <1644495552.14476.1776103846016@app.mailbox.org>

Hi Stanislav,

Gentle ping on this. Does this approach work for you?

Thanks,
Junrui Luo

^ permalink raw reply

* Re: [PATCH net v3 4/5] net: mana: Don't overwrite port probe error with add_adev result
From: Erni Sri Satya Vennela @ 2026-04-20 12:45 UTC (permalink / raw)
  To: Simon Horman
  Cc: haiyangz, davem, longli, ssengar, linux-kernel, kuba, kotaranov,
	stephen, andrew+netdev, kys, wei.liu, shirazsaleem, kees, pabeni,
	linux-hyperv, edumazet, decui, leon, gargaditya, dipayanroy,
	netdev, shacharr
In-Reply-To: <20260417141014.218936-1-horms@kernel.org>

On Fri, Apr 17, 2026 at 03:10:14PM +0100, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Don't overwrite port probe error with add_adev result
> 
> In mana_probe(), if mana_probe_port() fails for any port, the error
> is stored in 'err' and the loop breaks. However, the subsequent
> unconditional 'err = add_adev(gd, "eth")' overwrites this error.
> If add_adev() succeeds, mana_probe() returns success despite ports
> being left in a partially initialized state (ac->ports[i] == NULL).
> Only call add_adev() when there is no prior error, so the probe
> correctly fails and triggers mana_remove() cleanup.
> 
> > In mana_probe(), if mana_probe_port() fails for any port, the error
> > is stored in 'err' and the loop breaks. However, the subsequent
> > unconditional 'err = add_adev(gd, "eth")' overwrites this error.
> > If add_adev() succeeds, mana_probe() returns success despite ports
> > being left in a partially initialized state (ac->ports[i] == NULL).
> >
> > Only call add_adev() when there is no prior error, so the probe
> > correctly fails and triggers mana_remove() cleanup.
> >
> > Fixes: ced82fce77e9 ("net: mana: Probe rdma device in mana driver")
> 
> The existing Fixes: tag points to commit ced82fce77e9 ("net: mana: Probe
> rdma device in mana driver"), but was the bug actually introduced there?
> 
> Looking at the commit history, commit a69839d4327d ("net: mana: Add support
> for auxiliary device") added the unconditional 'err = add_adev(gd);' call
> that overwrites the error from mana_probe_port(). Commit ced82fce77e9 only
> modified the add_adev signature from add_adev(gd) to add_adev(gd, "eth")
> but did not introduce the buggy pattern.
> 
> Should the Fixes: tag be:
>     Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")

Thankyou for the correction, Simon.
I'll make this change in the next version.

- Vennela

^ permalink raw reply

* Re: [PATCH net v3 3/5] net: mana: Guard mana_remove against double invocation
From: Erni Sri Satya Vennela @ 2026-04-20 12:45 UTC (permalink / raw)
  To: Simon Horman
  Cc: haiyangz, davem, longli, ssengar, linux-kernel, kuba, kotaranov,
	stephen, andrew+netdev, kys, wei.liu, shirazsaleem, kees, pabeni,
	linux-hyperv, edumazet, decui, leon, gargaditya, dipayanroy,
	netdev, shacharr
In-Reply-To: <20260417140953.218832-1-horms@kernel.org>

On Fri, Apr 17, 2026 at 03:09:53PM +0100, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Guard mana_remove against double invocation
> 
> If PM resume fails (e.g., mana_attach() returns an error), mana_probe()
> calls mana_remove(), which tears down the device and sets
> gd->gdma_context = NULL and gd->driver_data = NULL. However, a failed
> resume callback does not automatically unbind the driver. When the device
> is eventually unbound, mana_remove() is invoked a second time. Without a
> NULL check, it dereferences gc->dev with gc == NULL, causing a kernel
> panic. Add an early return if gdma_context or driver_data is NULL so the
> second invocation is harmless.
> 
> > Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
> 
> The Fixes: tag points to ca9c54d2d6a5, which introduced mana_remove()
> without NULL checks. However, the exploitable double invocation scenario
> was actually introduced by commit 635096a86edb ("net: mana: Support
> hibernation and kexec"), which added suspend/resume support and made
> mana_probe() call mana_remove() on error.
> 
> Should the Fixes: tag be:
> 
> Fixes: 635096a86edb ("net: mana: Support hibernation and kexec")
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9d095a6fb56c..2ab7c89e2fed 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -3685,11 +3685,16 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> >  	struct gdma_context *gc = gd->gdma_context;
> >  	struct mana_context *ac = gd->driver_data;
> >  	struct mana_port_context *apc;
> > -	struct device *dev = gc->dev;
> > +	struct device *dev;
> >  	struct net_device *ndev;
> >  	int err;
> >  	int i;
> >
> > +	if (!gc || !ac)
> > +		return;
> > +
> > +	dev = gc->dev;
> > +
> >  	disable_work_sync(&ac->link_change_work);
> >  	cancel_delayed_work_sync(&ac->gf_stats_work);
Thankyou for the correction, Simon.
I'll make this change in the next version.

- Vennela

^ permalink raw reply

* [PATCH net v4 0/5] net: mana: Fix probe/remove error path bugs
From: Erni Sri Satya Vennela @ 2026-04-20 12:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel

Fix five bugs in mana_probe()/mana_remove() error handling that can
cause warnings on uninitialized work structs, NULL pointer dereferences,
masked errors, and resource leaks when early probe steps fail.

Patches 1-2 move work struct initialization (link_change_work and
gf_stats_work) to before any error path that could trigger
mana_remove(), preventing WARN_ON in __flush_work() or debug object
warnings when sync cancellation runs on uninitialized work structs.

Patch 3 guards mana_remove() against double invocation. If PM resume
fails, mana_probe() calls mana_remove() which sets gdma_context and
driver_data to NULL. A failed resume does not unbind the driver, so
when the device is eventually unbound, mana_remove() is called again
and dereferences NULL, causing a kernel panic. An early return on
NULL gdma_context or driver_data makes the second call harmless.

Patch 4 prevents add_adev() from overwriting a port probe error,
which could leave the driver in a broken state with NULL ports while
reporting success.

Patch 5 changes 'goto out' to 'break' in mana_remove()'s port loop
so that mana_destroy_eq() is always reached, preventing EQ leaks when
a NULL port is encountered.
---
Changes in v4:
* Correct Fixes tag from ca9c54d2d6a5 to 635096a86edb
* Correct Fixes tag from ced82fce77e9 to a69839d4327d
Changes in v3:
* Add patch 3: net: mana: Guard mana_remove against double invocation.
* Fix inaccurate comments.
* Correct Fixes tag from ca9c54d2d6a5 to 1e2d0824a9c3.
Changes in v2:
* Apply the patchset in net instead of net-next.
---

Erni Sri Satya Vennela (5):
  net: mana: Init link_change_work before potential error paths in probe
  net: mana: Init gf_stats_work before potential error paths in probe
  net: mana: Guard mana_remove against double invocation
  net: mana: Don't overwrite port probe error with add_adev result
  net: mana: Fix EQ leak in mana_remove on NULL port

 drivers/net/ethernet/microsoft/mana/mana_en.c | 35 +++++++++++--------
 1 file changed, 20 insertions(+), 15 deletions(-)

-- 
2.34.1


^ permalink raw reply

* [PATCH net v4 1/5] net: mana: Init link_change_work before potential error paths in probe
From: Erni Sri Satya Vennela @ 2026-04-20 12:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260420124741.1056179-1-ernis@linux.microsoft.com>

Move INIT_WORK(link_change_work) to right after the mana_context
allocation, before any error path that could reach mana_remove().

Previously, if mana_create_eq() or mana_query_device_cfg() failed,
mana_probe() would jump to the error path which calls mana_remove().
mana_remove() unconditionally calls disable_work_sync(link_change_work),
but the work struct had not been initialized yet. This can trigger
CONFIG_DEBUG_OBJECTS_WORK enabled.

Fixes: 54133f9b4b53 ("net: mana: Support HW link state events")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3,v4:
* No change.
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 6302432b9bf6..e3e4b6de6668 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3631,6 +3631,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 		ac->gdma_dev = gd;
 		gd->driver_data = ac;
+
+		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	}
 
 	err = mana_create_eq(ac);
@@ -3648,8 +3650,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	if (!resuming) {
 		ac->num_ports = num_ports;
-
-		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	} else {
 		if (ac->num_ports != num_ports) {
 			dev_err(dev, "The number of vPorts changed: %d->%d\n",
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v4 2/5] net: mana: Init gf_stats_work before potential error paths in probe
From: Erni Sri Satya Vennela @ 2026-04-20 12:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260420124741.1056179-1-ernis@linux.microsoft.com>

Move INIT_DELAYED_WORK(gf_stats_work) to before mana_create_eq(),
while keeping schedule_delayed_work() at its original location.

Previously, if any function between mana_create_eq() and the
INIT_DELAYED_WORK call failed, mana_probe() would call mana_remove()
which unconditionally calls cancel_delayed_work_sync(gf_stats_work)
in __flush_work() or debug object warnings with
CONFIG_DEBUG_OBJECTS_WORK enabled.

Fixes: be4f1d67ec56 ("net: mana: Add standard counter rx_missed_errors")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3,v4:
* No change.
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index e3e4b6de6668..468ed60a8a00 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3635,6 +3635,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	}
 
+	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
+
 	err = mana_create_eq(ac);
 	if (err) {
 		dev_err(dev, "Failed to create EQs: %d\n", err);
@@ -3709,7 +3711,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	err = add_adev(gd, "eth");
 
-	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
 
 out:
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v4 3/5] net: mana: Guard mana_remove against double invocation
From: Erni Sri Satya Vennela @ 2026-04-20 12:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260420124741.1056179-1-ernis@linux.microsoft.com>

If PM resume fails (e.g., mana_attach() returns an error), mana_probe()
calls mana_remove(), which tears down the device and sets
gd->gdma_context = NULL and gd->driver_data = NULL.

However, a failed resume callback does not automatically unbind the
driver. When the device is eventually unbound, mana_remove() is invoked
a second time. Without a NULL check, it dereferences gc->dev with
gc == NULL, causing a kernel panic.

Add an early return if gdma_context or driver_data is NULL so the second
invocation is harmless. Move the dev = gc->dev assignment after the
guard so it cannot dereference NULL.

Fixes: 635096a86edb ("net: mana: Support hibernation and kexec")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v4:
* Update Fixes tag to 635096a86edb
Changes in v3:
* Add this patch to the patchset
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 468ed60a8a00..ce1b7ec46a27 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3731,11 +3731,16 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	struct gdma_context *gc = gd->gdma_context;
 	struct mana_context *ac = gd->driver_data;
 	struct mana_port_context *apc;
-	struct device *dev = gc->dev;
+	struct device *dev;
 	struct net_device *ndev;
 	int err;
 	int i;
 
+	if (!gc || !ac)
+		return;
+
+	dev = gc->dev;
+
 	disable_work_sync(&ac->link_change_work);
 	cancel_delayed_work_sync(&ac->gf_stats_work);
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v4 4/5] net: mana: Don't overwrite port probe error with add_adev result
From: Erni Sri Satya Vennela @ 2026-04-20 12:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260420124741.1056179-1-ernis@linux.microsoft.com>

In mana_probe(), if mana_probe_port() fails for any port, the error
is stored in 'err' and the loop breaks. However, the subsequent
unconditional 'err = add_adev(gd, "eth")' overwrites this error.
If add_adev() succeeds, mana_probe() returns success despite ports
being left in a partially initialized state (ac->ports[i] == NULL).

Only call add_adev() when there is no prior error, so the probe
correctly fails and triggers mana_remove() cleanup.

Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v4:
* Update Fixes tag to a69839d4327d.
Changes in v3:
*  Fix inaccurate comments.
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ce1b7ec46a27..39b18577fb51 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3680,10 +3680,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (!resuming) {
 		for (i = 0; i < ac->num_ports; i++) {
 			err = mana_probe_port(ac, i, &ac->ports[i]);
-			/* we log the port for which the probe failed and stop
-			 * probes for subsequent ports.
-			 * Note that we keep running ports, for which the probes
-			 * were successful, unless add_adev fails too
+			/* Log the port for which the probe failed, stop probing
+			 * subsequent ports, and skip add_adev.
+			 * mana_remove() will clean up already-probed ports.
 			 */
 			if (err) {
 				dev_err(dev, "Probe Failed for port %d\n", i);
@@ -3697,10 +3696,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 			enable_work(&apc->queue_reset_work);
 			err = mana_attach(ac->ports[i]);
 			rtnl_unlock();
-			/* we log the port for which the attach failed and stop
-			 * attach for subsequent ports
-			 * Note that we keep running ports, for which the attach
-			 * were successful, unless add_adev fails too
+			/* Log the port for which the attach failed, stop
+			 * attaching subsequent ports, and skip add_adev.
+			 * mana_remove() will clean up already-attached ports.
 			 */
 			if (err) {
 				dev_err(dev, "Attach Failed for port %d\n", i);
@@ -3709,7 +3707,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		}
 	}
 
-	err = add_adev(gd, "eth");
+	if (!err)
+		err = add_adev(gd, "eth");
 
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v4 5/5] net: mana: Fix EQ leak in mana_remove on NULL port
From: Erni Sri Satya Vennela @ 2026-04-20 12:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260420124741.1056179-1-ernis@linux.microsoft.com>

In mana_remove(), when a NULL port is encountered in the port iteration
loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event
queues allocated earlier by mana_create_eq().

This can happen when mana_probe_port() fails for port 0, leaving
ac->ports[0] as NULL. On driver unload or error cleanup, mana_remove()
hits the NULL entry and jumps past mana_destroy_eq().

Change 'goto out' to 'break' so the for-loop exits normally and
mana_destroy_eq() is always reached. Remove the now-unreferenced out:
label.

Fixes: 1e2d0824a9c3 ("net: mana: Add support for EQ sharing")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v4:
* No change
Changes in v3;
* Update Fixes tag to appropriate commit id.
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 39b18577fb51..98e2fcc797ca 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3752,7 +3752,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 		if (!ndev) {
 			if (i == 0)
 				dev_err(dev, "No net device to remove\n");
-			goto out;
+			break;
 		}
 
 		apc = netdev_priv(ndev);
@@ -3783,7 +3783,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	}
 
 	mana_destroy_eq(ac);
-out:
+
 	if (ac->per_port_queue_reset_wq) {
 		destroy_workqueue(ac->per_port_queue_reset_wq);
 		ac->per_port_queue_reset_wq = NULL;
-- 
2.34.1


^ permalink raw reply related

* [PATCH AUTOSEL 7.0-6.18] net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG
From: Sasha Levin @ 2026-04-20 13:17 UTC (permalink / raw)
  To: patches, stable
  Cc: Erni Sri Satya Vennela, Jakub Kicinski, Sasha Levin, kys,
	haiyangz, wei.liu, decui, longli, andrew+netdev, davem, edumazet,
	pabeni, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

[ Upstream commit d7709812e13d06132ddae3d21540472ea5cb11c5 ]

As a part of MANA hardening for CVM, validate the adapter_mtu value
returned from the MANA_QUERY_DEV_CONFIG HWC command.

The adapter_mtu value is used to compute ndev->max_mtu via:
gc->adapter_mtu - ETH_HLEN. If hardware returns a bogus adapter_mtu
smaller than ETH_HLEN (e.g. 0), the unsigned subtraction wraps to a
huge value, silently allowing oversized MTU settings.

Add a validation check to reject adapter_mtu values below
ETH_MIN_MTU + ETH_HLEN, returning -EPROTO to fail the device
configuration early with a clear error message.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Link: https://patch.msgid.link/20260326173101.2010514-1-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

This confirms the integer underflow. Now let me complete the analysis.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1: Subject Line**
- Subsystem: `net: mana:` (Microsoft Azure Network Adapter driver)
- Action: "hardening: Validate" - input validation / defensive check
- Summary: Validates `adapter_mtu` from hardware config query to prevent
  integer underflow

**Step 1.2: Tags**
- `Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>` -
  author, Microsoft employee, regular MANA contributor (9+ commits)
- `Link: https://patch.msgid.link/20260326173101.2010514-1-
  ernis@linux.microsoft.com` - single patch (not part of a series,
  1-of-1)
- `Signed-off-by: Jakub Kicinski <kuba@kernel.org>` - netdev maintainer
  accepted the patch
- No Fixes: tag (expected for candidates under review)
- No Reported-by tag
- No Cc: stable tag

**Step 1.3: Body Text**
- Bug: `adapter_mtu` value from hardware can be bogus (< ETH_HLEN = 14).
  The subtraction `gc->adapter_mtu - ETH_HLEN` used to compute
  `ndev->max_mtu` wraps to a huge value (~4GB), silently allowing
  oversized MTU settings.
- Context: Part of CVM (Confidential VM) hardening where the hypervisor
  is less trusted.
- Fix: Reject values below `ETH_MIN_MTU + ETH_HLEN` (82 bytes) with
  `-EPROTO`.

**Step 1.4: Hidden Bug Fix Detection**
- Though labeled "hardening," this IS a real bug fix: it prevents a
  concrete integer underflow that leads to incorrect max_mtu. The bug
  mechanism is clear and the consequences (allowing oversized MTU
  settings) are real.

## PHASE 2: DIFF ANALYSIS

**Step 2.1: Inventory**
- Files: `drivers/net/ethernet/microsoft/mana/mana_en.c` (+8/-2 net, ~6
  lines of logic)
- Function modified: `mana_query_device_cfg()`
- Scope: Single-file, single-function, surgical fix

**Step 2.2: Code Flow Change**
- Before: `resp.adapter_mtu` was accepted unconditionally when
  msg_version >= GDMA_MESSAGE_V2
- After: Validates `resp.adapter_mtu >= ETH_MIN_MTU + ETH_HLEN` (82)
  before accepting; returns `-EPROTO` on failure
- The else branch and brace additions are purely cosmetic (adding braces
  to existing if/else)

**Step 2.3: Bug Mechanism**
- Category: Integer underflow / input validation bug
- Mechanism: `gc->adapter_mtu` (u16, could be 0) used in `ndev->max_mtu
  = gc->adapter_mtu - ETH_HLEN`. If adapter_mtu < 14, the result wraps
  to ~4GB as unsigned int.
- Confirmed via two usage sites:
  - `mana_en.c:3349`: `ndev->max_mtu = gc->adapter_mtu - ETH_HLEN`
  - `mana_bpf.c:242`: `ndev->max_mtu = gc->adapter_mtu - ETH_HLEN`

**Step 2.4: Fix Quality**
- Obviously correct: simple bounds check with a clear threshold
- Minimal: 6 lines of logic change
- No regression risk: only rejects values that would cause incorrect
  behavior anyway
- Clean: well-contained, single function

## PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1: Blame**
- The `adapter_mtu` field assignment was introduced in commit
  `80f6215b450eb8` ("net: mana: Add support for jumbo frame", Haiyang
  Zhang, 2023-04-12)
- This commit was first included in `v6.4-rc1`
- The vulnerable code has been present since v6.4

**Step 3.2: No Fixes: tag to follow**

**Step 3.3: File History**
- The file has active development with multiple fixes applied. No
  conflicting changes to the `mana_query_device_cfg()` function recently
  aside from commit `290e5d3c49f687` which added GDMA_MESSAGE_V3
  handling.

**Step 3.4: Author**
- Erni Sri Satya Vennela is a regular MANA contributor with 9+ commits
  to the driver, all from `@linux.microsoft.com`. The author is part of
  the Microsoft team maintaining this driver.

**Step 3.5: Dependencies**
- This is a standalone patch (1-of-1, not part of a series)
- Uses only existing constants (`ETH_MIN_MTU`, `ETH_HLEN`) which exist
  in all kernel versions
- The GDMA_MESSAGE_V2 check already exists in stable trees since v6.4

## PHASE 4: MAILING LIST RESEARCH

**Step 4.1-4.5:** b4 dig failed to find the thread. Lore is behind an
anti-scraping wall. However, the patch was accepted by netdev maintainer
Jakub Kicinski (signed-off-by), which indicates it passed netdev review.
The Link tag confirms it was a single-patch submission.

## PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1: Functions Modified**
- `mana_query_device_cfg()` - device configuration query during probe

**Step 5.2: Callers**
- Called from `mana_probe_port()` -> `mana_query_device_cfg()` during
  device initialization
- This is the main probe path for all MANA network interfaces in Azure
  VMs

**Step 5.3: Downstream Impact**
- `gc->adapter_mtu` is used in two places to compute `ndev->max_mtu`:
  - `mana_en.c:3349` during probe
  - `mana_bpf.c:242` when XDP is detached
- Both perform `gc->adapter_mtu - ETH_HLEN` without checking for
  underflow

**Step 5.4: Reachability**
- This code is reached during every MANA device probe in Azure VMs -
  very common path for Azure users

## PHASE 6: STABLE TREE ANALYSIS

**Step 6.1: Buggy Code in Stable Trees**
- `adapter_mtu` was added in v6.4-rc1 via commit `80f6215b450eb8`
- Present in stable trees: 6.6.y, 6.12.y, 7.0.y
- NOT present in: 6.1.y, 5.15.y, 5.10.y (pre-dates adapter_mtu feature)

**Step 6.2: Backport Complications**
- Note: the current 7.0 tree has `resp.hdr.response.msg_version` (from
  commit `290e5d3c49f687`) while older stable trees may have
  `resp.hdr.resp.msg_version`. The diff may need minor adjustment for
  6.6.y.
- The validation logic itself is self-contained and trivially adaptable.

**Step 6.3: No related fixes already in stable.**

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

**Step 7.1: Subsystem**
- `drivers/net/ethernet/microsoft/mana/` - MANA network driver for Azure
  VMs
- Criticality: IMPORTANT - widely used in Azure cloud infrastructure
  (millions of VMs)

**Step 7.2: Activity**
- Actively maintained with regular fixes. The author and team are
  Microsoft employees dedicated to this driver.

## PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1: Who is Affected**
- All Azure VM users running MANA driver (very large population)
- Especially CVM (Confidential VM) users where the hypervisor is less
  trusted

**Step 8.2: Trigger Conditions**
- Triggered when hardware/hypervisor returns `adapter_mtu < 82` in the
  config query response
- In CVM scenarios: malicious hypervisor could deliberately trigger this
- In non-CVM: unlikely but possible with firmware bugs

**Step 8.3: Failure Mode Severity**
- Integer underflow causes `max_mtu` to be set to ~4GB
- This silently allows setting huge MTU values that the hardware cannot
  support
- Could lead to packet corruption, buffer overflows in TX path, or
  device malfunction
- Severity: HIGH (potential for data corruption or security issue,
  especially in CVM)

**Step 8.4: Risk-Benefit Ratio**
- BENEFIT: Prevents integer underflow and incorrect device
  configuration. HIGH for CVM users, MEDIUM for regular Azure users.
- RISK: VERY LOW - only adds a bounds check on the initialization path.
  Cannot cause regression because it only rejects values that would
  cause broken behavior.

## PHASE 9: FINAL SYNTHESIS

**Step 9.1: Evidence Summary**

FOR backporting:
- Fixes a concrete integer underflow bug (adapter_mtu - ETH_HLEN wraps
  to ~4GB)
- Small, surgical fix (6 lines of logic)
- Obviously correct bounds check
- No regression risk
- Accepted by netdev maintainer
- Author is regular driver contributor
- Affects widely-used Azure MANA driver
- Security-relevant in CVM environments

AGAINST backporting:
- Labeled as "hardening" rather than "fix"
- No user reports of this being triggered in practice
- Trigger requires malicious or buggy firmware
- May need minor adjustment for older stable trees (response field name)

**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** - simple bounds check, accepted
   by netdev maintainer
2. Fixes a real bug? **YES** - integer underflow leading to incorrect
   max_mtu
3. Important issue? **YES** - incorrect MTU can cause device
   malfunction; security issue in CVM
4. Small and contained? **YES** - 6 lines, single function, single file
5. No new features or APIs? **CORRECT** - no new features
6. Can apply to stable? **YES** - may need trivial adjustment for
   response field name in 6.6.y

**Step 9.3: Exception Categories**
- Not a standard exception category, but fits the pattern of input
  validation fixes that prevent integer overflow/underflow.

**Step 9.4: Decision**
The fix prevents a concrete integer underflow that causes `max_mtu` to
be set to ~4GB when hardware returns an invalid adapter_mtu. The fix is
minimal, obviously correct, and has zero regression risk. It is relevant
for Azure CVM security and defensive against firmware bugs.

## Verification

- [Phase 1] Parsed tags: Signed-off-by from author (Microsoft) and
  netdev maintainer Jakub Kicinski; Link to single-patch submission
- [Phase 2] Diff analysis: +6 lines of validation in
  `mana_query_device_cfg()`, checks `resp.adapter_mtu >= ETH_MIN_MTU +
  ETH_HLEN` (82)
- [Phase 2] Integer underflow verified: adapter_mtu=0 ->
  max_mtu=4294967282 (~4GB) via Python simulation
- [Phase 3] git blame: adapter_mtu code introduced in commit
  `80f6215b450eb8` (v6.4-rc1, 2023-04-12)
- [Phase 3] git describe --contains: confirmed in v6.4-rc1
- [Phase 3] Author has 9+ commits to MANA driver, regular contributor
- [Phase 4] b4 dig failed to find thread (timeout); lore blocked by
  anti-bot measures
- [Phase 5] Callers: `mana_query_device_cfg()` called from probe path;
  `gc->adapter_mtu - ETH_HLEN` used at mana_en.c:3349 and mana_bpf.c:242
- [Phase 5] Both usage sites perform unsigned subtraction without bounds
  check
- [Phase 6] Buggy code exists in stable trees 6.6.y+ (since v6.4-rc1)
- [Phase 6] Standalone fix, may need minor field name adjustment for
  older trees
- [Phase 7] MANA driver widely used in Azure (IMPORTANT criticality)
- [Phase 8] Failure mode: max_mtu set to ~4GB, allowing oversized MTU;
  severity HIGH
- [Phase 8] Risk: VERY LOW (only rejects clearly invalid values)

**YES**

 drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 09a53c9775455..7589ead7efdb6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1214,10 +1214,16 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
 
 	*max_num_vports = resp.max_num_vports;
 
-	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
+	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
+		if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
+			dev_err(dev, "Adapter MTU too small: %u\n",
+				resp.adapter_mtu);
+			return -EPROTO;
+		}
 		gc->adapter_mtu = resp.adapter_mtu;
-	else
+	} else {
 		gc->adapter_mtu = ETH_FRAME_LEN;
+	}
 
 	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V3)
 		*bm_hostmode = resp.bm_hostmode;
-- 
2.53.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.18] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Sasha Levin @ 2026-04-20 13:21 UTC (permalink / raw)
  To: patches, stable
  Cc: Long Li, Michael Kelley, Wei Liu, Sasha Levin, kys, haiyangz,
	decui, lpieralisi, kwilczynski, mani, bhelgaas, mikelley,
	linux-hyperv, linux-pci, linux-kernel
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Long Li <longli@microsoft.com>

[ Upstream commit 7b3b1e5a87b2f5e35c52b5386d7c327be869454f ]

When hv_pci_assign_numa_node() processes a device that does not have
HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
virtual_numa_node, the device NUMA node is left unset. On x86_64,
the uninitialized default happens to be 0, but on ARM64 it is
NUMA_NO_NODE (-1).

Tests show that when no NUMA information is available from the Hyper-V
host, devices perform best when assigned to node 0. With NUMA_NO_NODE
the kernel may spread work across NUMA nodes, which degrades
performance on Hyper-V, particularly for high-throughput devices like
MANA.

Always set the device NUMA node to 0 before the conditional NUMA
affinity check, so that devices get a performant default when the host
provides no NUMA information, and behavior is consistent on both
x86_64 and ARM64.

Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Error: Failed to generate final synthesis

 drivers/pci/controller/pci-hyperv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 146b43981b278..28b1572974879 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2486,6 +2486,14 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
 		if (!hv_dev)
 			continue;
 
+		/*
+		 * If the Hyper-V host doesn't provide a NUMA node for the
+		 * device, default to node 0. With NUMA_NO_NODE the kernel
+		 * may spread work across NUMA nodes, which degrades
+		 * performance on Hyper-V.
+		 */
+		set_dev_node(&dev->dev, 0);
+
 		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
 		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
 			/*
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH 02/11] Drivers: hv: Move hv_vp_assist_page to common files
From: Naman Jain @ 2026-04-20 15:22 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB415790977DA40BAD0822DA54D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:25 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Move the logic to initialize and export hv_vp_assist_page from x86
>> architecture code to Hyper-V common code to allow it to be used for
>> upcoming arm64 support in MSHV_VTL driver.
>> Note: This change also improves error handling - if VP assist page
>> allocation fails, hyperv_init() now returns early instead of
>> continuing with partial initialization.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_init.c      | 88 +---------------------------------
>>   drivers/hv/hv_common.c         | 88 ++++++++++++++++++++++++++++++++++
>>   include/asm-generic/mshyperv.h |  4 ++
>>   include/hyperv/hvgdk_mini.h    |  2 +
>>   4 files changed, 95 insertions(+), 87 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 323adc93f2dc..75a98b5e451b 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -81,9 +81,6 @@ union hv_ghcb * __percpu *hv_ghcb_pg;
>>   /* Storage to save the hypercall page temporarily for hibernation */
>>   static void *hv_hypercall_pg_saved;
>>
>> -struct hv_vp_assist_page **hv_vp_assist_page;
>> -EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>> -
>>   static int hyperv_init_ghcb(void)
>>   {
>>   	u64 ghcb_gpa;
>> @@ -117,59 +114,12 @@ static int hyperv_init_ghcb(void)
>>
>>   static int hv_cpu_init(unsigned int cpu)
>>   {
>> -	union hv_vp_assist_msr_contents msr = { 0 };
>> -	struct hv_vp_assist_page **hvp;
>>   	int ret;
>>
>>   	ret = hv_common_cpu_init(cpu);
>>   	if (ret)
>>   		return ret;
>>
>> -	if (!hv_vp_assist_page)
>> -		return 0;
>> -
>> -	hvp = &hv_vp_assist_page[cpu];
>> -	if (hv_root_partition()) {
>> -		/*
>> -		 * For root partition we get the hypervisor provided VP assist
>> -		 * page, instead of allocating a new page.
>> -		 */
>> -		rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -		*hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> -				PAGE_SIZE, MEMREMAP_WB);
>> -	} else {
>> -		/*
>> -		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> -		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> -		 * out to make sure we always write the EOI MSR in
>> -		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> -		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> -		 * case of CPU offlining and the VM will hang.
>> -		 */
>> -		if (!*hvp) {
>> -			*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> -
>> -			/*
>> -			 * Hyper-V should never specify a VM that is a Confidential
>> -			 * VM and also running in the root partition. Root partition
>> -			 * is blocked to run in Confidential VM. So only decrypt assist
>> -			 * page in non-root partition here.
>> -			 */
>> -			if (*hvp && !ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>> -				WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> -				memset(*hvp, 0, PAGE_SIZE);
>> -			}
>> -		}
>> -
>> -		if (*hvp)
>> -			msr.pfn = vmalloc_to_pfn(*hvp);
>> -
>> -	}
>> -	if (!WARN_ON(!(*hvp))) {
>> -		msr.enable = 1;
>> -		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -	}
>> -
>>   	/* Allow Hyper-V stimer vector to be injected from Hypervisor. */
>>   	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
>>   		apic_update_vector(cpu, HYPERV_STIMER0_VECTOR, true);
>> @@ -286,23 +236,6 @@ static int hv_cpu_die(unsigned int cpu)
>>
>>   	hv_common_cpu_die(cpu);
>>
>> -	if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> -		union hv_vp_assist_msr_contents msr = { 0 };
>> -		if (hv_root_partition()) {
>> -			/*
>> -			 * For root partition the VP assist page is mapped to
>> -			 * hypervisor provided page, and thus we unmap the
>> -			 * page here and nullify it, so that in future we have
>> -			 * correct page address mapped in hv_cpu_init.
>> -			 */
>> -			memunmap(hv_vp_assist_page[cpu]);
>> -			hv_vp_assist_page[cpu] = NULL;
>> -			rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -			msr.enable = 0;
>> -		}
>> -		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -	}
>> -
>>   	if (hv_reenlightenment_cb == NULL)
>>   		return 0;
>>
>> @@ -460,21 +393,6 @@ void __init hyperv_init(void)
>>   	if (hv_common_init())
>>   		return;
>>
>> -	/*
>> -	 * The VP assist page is useless to a TDX guest: the only use we
>> -	 * would have for it is lazy EOI, which can not be used with TDX.
>> -	 */
>> -	if (hv_isolation_type_tdx())
>> -		hv_vp_assist_page = NULL;
>> -	else
>> -		hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
>> -	if (!hv_vp_assist_page) {
>> -		ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> -
>> -		if (!hv_isolation_type_tdx())
>> -			goto common_free;
>> -	}
>> -
>>   	if (ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>>   		/* Negotiate GHCB Version. */
>>   		if (!hv_ghcb_negotiate_protocol())
>> @@ -483,7 +401,7 @@ void __init hyperv_init(void)
>>
>>   		hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
>>   		if (!hv_ghcb_pg)
>> -			goto free_vp_assist_page;
>> +			goto free_ghcb_page;
>>   	}
>>
>>   	cpuhp = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "x86/hyperv_init:online",
>> @@ -613,10 +531,6 @@ void __init hyperv_init(void)
>>   	cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
>>   free_ghcb_page:
>>   	free_percpu(hv_ghcb_pg);
>> -free_vp_assist_page:
>> -	kfree(hv_vp_assist_page);
>> -	hv_vp_assist_page = NULL;
>> -common_free:
>>   	hv_common_free();
>>   }
>>
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 6b67ac616789..d1ebc0ebd08f 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -28,7 +28,9 @@
>>   #include <linux/slab.h>
>>   #include <linux/dma-map-ops.h>
>>   #include <linux/set_memory.h>
>> +#include <linux/vmalloc.h>
>>   #include <hyperv/hvhdk.h>
>> +#include <hyperv/hvgdk.h>
>>   #include <asm/mshyperv.h>
> 
> Need to add
> 
> #include <linux/io.h>
> 
> because of the memremap() and related calls that have been added.
> io.h is probably being #include'd indirectly, but it is better to #include
> it directly.
> 

Acked.

>>
>>   u64 hv_current_partition_id = HV_PARTITION_ID_SELF;
>> @@ -78,6 +80,8 @@ static struct ctl_table_header *hv_ctl_table_hdr;
>>   u8 * __percpu *hv_synic_eventring_tail;
>>   EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
>>
>> +struct hv_vp_assist_page **hv_vp_assist_page;
>> +EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>>   /*
>>    * Hyper-V specific initialization and shutdown code that is
>>    * common across all architectures.  Called from architecture
>> @@ -92,6 +96,9 @@ void __init hv_common_free(void)
>>   	if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE)
>>   		hv_kmsg_dump_unregister();
>>
>> +	kfree(hv_vp_assist_page);
>> +	hv_vp_assist_page = NULL;
>> +
>>   	kfree(hv_vp_index);
>>   	hv_vp_index = NULL;
>>
>> @@ -394,6 +401,23 @@ int __init hv_common_init(void)
>>   	for (i = 0; i < nr_cpu_ids; i++)
>>   		hv_vp_index[i] = VP_INVAL;
>>
>> +	/*
>> +	 * The VP assist page is useless to a TDX guest: the only use we
>> +	 * would have for it is lazy EOI, which can not be used with TDX.
>> +	 */
>> +	if (hv_isolation_type_tdx()) {
>> +		hv_vp_assist_page = NULL;
>> +	} else {
>> +		hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
>> +		if (!hv_vp_assist_page) {
>> +#ifdef CONFIG_X86_64
>> +			ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> +#endif
>> +			hv_common_free();
>> +			return -ENOMEM;
> 
> Given that "failure to allocate memory" now returns an error that is
> essentially fatal to hyperv_init(), is it still necessary to clear the flag in
> ms_hyperv.hints?  I'd love to see that #ifdef go away. It's the only
> #ifdef in hv_common.c, and I had worked hard in the past to avoid
> such #ifdef's. :-)

Yes, this particular block can be removed, and I will remove it in v2.
The other thing pointed out in Sashiko's AI review was having this 
if-def block in tdx case after setting hv_vp_assist_page to NULL. This 
is to maintain parity with existing code. That's the reason, I will need 
to add it back there.

> 
>> +		}
>> +	}
>> +
>>   	return 0;
>>   }
>>
>> @@ -471,6 +495,8 @@ void __init ms_hyperv_late_init(void)
>>
>>   int hv_common_cpu_init(unsigned int cpu)
>>   {
>> +	union hv_vp_assist_msr_contents msr = { 0 };
>> +	struct hv_vp_assist_page **hvp;
>>   	void **inputarg, **outputarg;
>>   	u8 **synic_eventring_tail;
>>   	u64 msr_vp_index;
>> @@ -542,6 +568,50 @@ int hv_common_cpu_init(unsigned int cpu)
>>   			ret = -ENOMEM;
> 
> The Sashiko AI comment here about a bug when ret is set to -ENOMEM
> seems valid to me.
> 

I'm planning to simply "return -ENOMEM" here.

>>   	}
>>
>> +	if (!hv_vp_assist_page)
>> +		return ret;
>> +
>> +	hvp = &hv_vp_assist_page[cpu];
>> +	if (hv_root_partition()) {
>> +		/*
>> +		 * For root partition we get the hypervisor provided VP assist
>> +		 * page, instead of allocating a new page.
>> +		 */
>> +		msr.as_uint64 = hv_get_msr(HV_SYN_REG_VP_ASSIST_PAGE);
>> +		*hvp = memremap(msr.pfn << HV_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> +				PAGE_SIZE, MEMREMAP_WB);
> 
> The Sashiko AI comment about potentially memremap'ing 64K instead of 4K can
> be ignored. We know that the root partition can only run with a 4K page size,
> and that is enforced in drivers/hv/Kconfig.
>

I am thinking of adding this config dependency (PAGE_SIZE_4KB) in the 
Kconfig patch in this series, to MSHV_VTL as well. We are also using 
only 4KB as page size. This should prevent all of PAGE_SIZE Sachiko 
issues. I am also replacing PAGE_SIZE with HV_HYP_PAGE_SIZE in all 
places. Hope that is fine?

> HV_VP_ASSIST_PAGE_ADDRESS_SHIFT is defined in asm-generic/mshyperv.h.
> But there is also HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT in hvgdk_mini.h.
> Is there a clean way to eliminate the duplication?

Although both these architectures are using same value - 12, I was 
hesitant to use x64 register for ARM64. I will move arch based 
definition of HV_VP_ASSIST_PAGE_ADDRESS_SHIFT to hvgdk_mini.h and remove 
it from asm-generic/mshyperv.h.

> 
>> +	} else {
>> +		/*
>> +		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> +		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> +		 * out to make sure we always write the EOI MSR in
>> +		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> +		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> +		 * case of CPU offlining and the VM will hang.
>> +		 */
> 
> Somewhere in the comment above, I'd suggest adding a short "on x86/x64"
> qualifier, as the comment doesn't apply on arm64 since it doesn't support
> the AutoEOI optimization.  Maybe "Here it must be zeroed out to make sure
> that on x86/x64 we always write the EOI MSR in ....".

Acked. I will add it.

> 
>> +		if (!*hvp) {
>> +			*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
> 
> The Sashiko AI comment about using "flags" instead of GFP_KERNEL seems valid.

Acked.

> 
>> +
>> +			/*
>> +			 * Hyper-V should never specify a VM that is a Confidential
>> +			 * VM and also running in the root partition. Root partition
>> +			 * is blocked to run in Confidential VM. So only decrypt assist
>> +			 * page in non-root partition here.
>> +			 */
>> +			if (*hvp && !ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>> +				WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> +				memset(*hvp, 0, PAGE_SIZE);
>> +			}
>> +		}
>> +
>> +		if (*hvp)
>> +			msr.pfn = vmalloc_to_pfn(*hvp);
> 
> The Sashiko AI comment about page size here seems valid. But what are the rules
> about arm64 page sizes that are supported for VTL2, and how does they relate
> to VTL0 allowing 4K, 16K, and 64K page size? What combinations are allowed?
> For example, can a VTL2 built with 4K page size run with a VTL0 built with
> 64K page size? It would be nice to have the rules recorded somewhere in a
> code comment, but I'm not sure of the best place.
> 

VTL2 uses 4k page size only. This can be enforced with a Kconfig change 
in next version. As and when other page size support is added in ARM64 
for MSHV_VTL, this change can be removed.

Regarding support of VTL0 kernel page sizes, page size in VTL2 is of no 
impact to it.


> But regardless of the rules, I'd suggest future-proofing by using
> "page_to_hvpfn(vmalloc_to_page(*hvp))" so that the PFN generated is always
> in terms of 4K page size as the Hyper-V host expects.

Acked. Will try this and make the changes.

> 
>> +	}
>> +	if (!WARN_ON(!(*hvp))) {
>> +		msr.enable = 1;
>> +		hv_set_msr(HV_SYN_REG_VP_ASSIST_PAGE, msr.as_uint64);
>> +	}
>> +
>>   	return ret;
>>   }
>>
>> @@ -566,6 +636,24 @@ int hv_common_cpu_die(unsigned int cpu)
>>   		*synic_eventring_tail = NULL;
>>   	}
>>
>> +	if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> +		union hv_vp_assist_msr_contents msr = { 0 };
>> +
>> +		if (hv_root_partition()) {
>> +			/*
>> +			 * For root partition the VP assist page is mapped to
>> +			 * hypervisor provided page, and thus we unmap the
>> +			 * page here and nullify it, so that in future we have
>> +			 * correct page address mapped in hv_cpu_init.
>> +			 */
>> +			memunmap(hv_vp_assist_page[cpu]);
>> +			hv_vp_assist_page[cpu] = NULL;
>> +			msr.as_uint64 = hv_get_msr(HV_SYN_REG_VP_ASSIST_PAGE);
>> +			msr.enable = 0;
>> +		}
>> +		hv_set_msr(HV_SYN_REG_VP_ASSIST_PAGE, msr.as_uint64);
>> +	}
>> +
>>   	return 0;
>>   }
>>
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index d37b68238c97..108f135d4fd9 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -25,6 +25,7 @@
>>   #include <linux/nmi.h>
>>   #include <asm/ptrace.h>
>>   #include <hyperv/hvhdk.h>
>> +#include <hyperv/hvgdk.h>
>>
>>   #define VTPM_BASE_ADDRESS 0xfed40000
>>
>> @@ -299,6 +300,8 @@ do { \
>>   #define hv_status_debug(status, fmt, ...) \
>>   	hv_status_printk(debug, status, fmt, ##__VA_ARGS__)
>>
>> +extern struct hv_vp_assist_page **hv_vp_assist_page;
> 
> This "extern" statement is added here so it is visible to both x86/x64 and arm64.
> And that's correct.
> 
> But there is still some VP assist page stuff that has been left in the arch/x86
> version of mshyperv.h.  That other stuff, including the inline function
> hv_get_vp_assist_page(), should also be moved to asm-generic/mshyperv.h.
> Given that the VP assist page support is now fully generic and not x86/x64
> specific, it shouldn't occur anywhere in the arch/x86 version of mshyperv.h.

Will move the remaining code.

> 
>> +
>>   const char *hv_result_to_string(u64 hv_status);
>>   int hv_result_to_errno(u64 status);
>>   void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>> @@ -377,6 +380,7 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>   	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>>   }
>>
>> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT	12
>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>   u8 __init get_vtl(void);
>>   #else
>> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
>> index 056ef7b6b360..be697ddb211a 100644
>> --- a/include/hyperv/hvgdk_mini.h
>> +++ b/include/hyperv/hvgdk_mini.h
>> @@ -149,6 +149,7 @@ struct hv_u128 {
>>   #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT	12
>>   #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK	\
>>   		(~((1ull << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
>> +#define HV_SYN_REG_VP_ASSIST_PAGE              (HV_X64_MSR_VP_ASSIST_PAGE)
>>
>>   /* Hyper-V Enlightened VMCS version mask in nested features CPUID */
>>   #define HV_X64_ENLIGHTENED_VMCS_VERSION		0xff
>> @@ -1185,6 +1186,7 @@ enum hv_register_name {
>>
>>   #define HV_MSR_STIMER0_CONFIG	(HV_REGISTER_STIMER0_CONFIG)
>>   #define HV_MSR_STIMER0_COUNT	(HV_REGISTER_STIMER0_COUNT)
>> +#define HV_SYN_REG_VP_ASSIST_PAGE    (HV_REGISTER_VP_ASSIST_PAGE)
> 
> This defines a new register name prefix "HV_SYN_REG_" that isn't used
> anywhere else. The prefixes for Hyper-V register names are already complex
> to account to x86/x64 and arm64 differences, and the fact the x86/x64 has
> synthetic MSRs, while arm64 does not. So introducing another prefix is
> undesirable. Couldn't this just be HV_MSR_VP_ASSIST_PAGE using the
> same structure as HV_MSR_STIMER0_COUNT (for example)?
>

Will rename it to HV_MSR_VP_ASSIST_PAGE in all places.

>>
>>   #endif /* CONFIG_ARM64 */
>>
>> --
>> 2.43.0
>>


Thank you so much for thoroughly reviwing this Michael.

Regards,
Naman

^ permalink raw reply

* Re: [PATCH 08/11] Drivers: hv: mshv_vtl: Move register page config to arch-specific files
From: Naman Jain @ 2026-04-20 15:23 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157CF364DA2C0CC657A6DCBD450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:28 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Move mshv_vtl_configure_reg_page() implementation from
>> drivers/hv/mshv_vtl_main.c to arch-specific files:
>> - arch/x86/hyperv/hv_vtl.c: full implementation with register page setup
>> - arch/arm64/hyperv/hv_vtl.c: stub implementation (unsupported)
>>
>> Move common type definitions to include/asm-generic/mshyperv.h:
>> - struct mshv_vtl_per_cpu
>> - union hv_synic_overlay_page_msr
>>
>> Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
>> declarations to include/asm-generic/mshyperv.h since these functions
>> are used by multiple modules.
>>
>> While at it, remove the unnecessary stub implementations in #else
>> case for mshv_vtl_return* functions in arch/x86/include/asm/mshyperv.h.
> 
> Seems like this patch is doing multiple things. The reg page configuration
> changes are more substantial and should probably be in a patch by
> themselves. The other changes are more trivial and maybe are OK
> grouped into a single patch, but you could also consider breaking them
> out.

I will split this patch into 3 patches.

> 
>>
>> This is essential for adding support for ARM64 in MSHV_VTL.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/hv_vtl.c        |  8 +++++
>>   arch/arm64/include/asm/mshyperv.h |  3 ++
>>   arch/x86/hyperv/hv_vtl.c          | 32 ++++++++++++++++++++
>>   arch/x86/include/asm/mshyperv.h   |  7 ++---
>>   drivers/hv/mshv.h                 |  8 -----
>>   drivers/hv/mshv_vtl_main.c        | 49 +++----------------------------
>>   include/asm-generic/mshyperv.h    | 42 ++++++++++++++++++++++++++
>>   7 files changed, 92 insertions(+), 57 deletions(-)
>>
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> index 66318672c242..d699138427c1 100644
>> --- a/arch/arm64/hyperv/hv_vtl.c
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -10,6 +10,7 @@
>>   #include <asm/boot.h>
>>   #include <asm/mshyperv.h>
>>   #include <asm/cpu_ops.h>
>> +#include <linux/export.h>
>>
>>   void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>>   {
>> @@ -142,3 +143,10 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>>   		"v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
>>   }
>>   EXPORT_SYMBOL(mshv_vtl_return_call);
>> +
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
>> +{
>> +	pr_debug("Register page not supported on ARM64\n");
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index de7f3a41a8ea..36803f0386cc 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -61,6 +61,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>   				ARM_SMCCC_OWNER_VENDOR_HYP,	\
>>   				HV_SMCCC_FUNC_NUMBER)
>>
>> +struct mshv_vtl_per_cpu;
>> +
>>   struct mshv_vtl_cpu_context {
>>   /*
>>    * NOTE: x18 is managed by the hypervisor. It won't be reloaded from this array.
>> @@ -82,6 +84,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs,
>> bool set, u
>>   }
>>
>>   void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
> 
> I think this declaration could be added in asm-generic/mshyperv.h so that it
> is shared by x86 and arm64. That also obviates the need for the forward
> ref to struct mshv_vtl_per_cpu that you've added here.

Acked.

> 
>>   #endif
>>
>>   #include <asm-generic/mshyperv.h>
>> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
>> index 72a0bb4ae0c7..ede290985d41 100644
>> --- a/arch/x86/hyperv/hv_vtl.c
>> +++ b/arch/x86/hyperv/hv_vtl.c
>> @@ -20,6 +20,7 @@
>>   #include <uapi/asm/mtrr.h>
>>   #include <asm/debugreg.h>
>>   #include <linux/export.h>
>> +#include <linux/hyperv.h>
>>   #include <../kernel/smpboot.h>
>>   #include "../../kernel/fpu/legacy.h"
>>
>> @@ -259,6 +260,37 @@ int __init hv_vtl_early_init(void)
>>   	return 0;
>>   }
>>
>> +static const union hv_input_vtl input_vtl_zero;
>> +
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
>> +{
>> +	struct hv_register_assoc reg_assoc = {};
>> +	union hv_synic_overlay_page_msr overlay = {};
>> +	struct page *reg_page;
>> +
>> +	reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
>> +	if (!reg_page) {
>> +		WARN(1, "failed to allocate register page\n");
>> +		return false;
>> +	}
>> +
>> +	overlay.enabled = 1;
>> +	overlay.pfn = page_to_hvpfn(reg_page);
>> +	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
>> +	reg_assoc.value.reg64 = overlay.as_uint64;
>> +
>> +	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> +				     1, input_vtl_zero, &reg_assoc)) {
>> +		WARN(1, "failed to setup register page\n");
>> +		__free_page(reg_page);
>> +		return false;
>> +	}
>> +
>> +	per_cpu->reg_page = reg_page;
>> +	return true;
> 
> As Sashiko AI noted, the memory allocated for the reg_page never gets freed.

These are present in existing code, I'll address them in a separate series.

> 
>> +}
>> +EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>> +
>>   DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
>>
>>   void mshv_vtl_return_call_init(u64 vtl_return_offset)
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index d5355a5b7517..d592fea49cdb 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -271,6 +271,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg) {
>> return 0; }
>>   static inline int hv_apicid_to_vp_index(u32 apic_id) { return -EINVAL; }
>>   #endif /* CONFIG_HYPERV */
>>
>> +struct mshv_vtl_per_cpu;
>> +
>>   struct mshv_vtl_cpu_context {
>>   	union {
>>   		struct {
>> @@ -305,13 +307,10 @@ void mshv_vtl_return_call_init(u64 vtl_return_offset);
>>   void mshv_vtl_return_hypercall(void);
>>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
> 
> Same as for arm64. Add a shared declaration in asm-generic/mshyperv.h.

Ditto.

> 
>>   #else
>>   static inline void __init hv_vtl_init_platform(void) {}
>>   static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> -static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> -static inline void mshv_vtl_return_hypercall(void) {}
>> -static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>   #endif
>>
>>   #include <asm-generic/mshyperv.h>
>> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
>> index d4813df92b9c..0fcb7f9ba6a9 100644
>> --- a/drivers/hv/mshv.h
>> +++ b/drivers/hv/mshv.h
>> @@ -14,14 +14,6 @@
>>   	memchr_inv(&((STRUCT).MEMBER), \
>>   		   0, sizeof_field(typeof(STRUCT), MEMBER))
>>
>> -int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> -			     union hv_input_vtl input_vtl,
>> -			     struct hv_register_assoc *registers);
>> -
>> -int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> -			     union hv_input_vtl input_vtl,
>> -			     struct hv_register_assoc *registers);
>> -
>>   int hv_call_get_partition_property(u64 partition_id, u64 property_code,
>>   				   u64 *property_value);
>>
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 91517b45d526..c79d24317b8e 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -78,21 +78,6 @@ struct mshv_vtl {
>>   	u64 id;
>>   };
>>
>> -struct mshv_vtl_per_cpu {
>> -	struct mshv_vtl_run *run;
>> -	struct page *reg_page;
>> -};
>> -
>> -/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>> -union hv_synic_overlay_page_msr {
>> -	u64 as_uint64;
>> -	struct {
>> -		u64 enabled: 1;
>> -		u64 reserved: 11;
>> -		u64 pfn: 52;
>> -	} __packed;
>> -};
>> -
>>   static struct mutex mshv_vtl_poll_file_lock;
>>   static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>>   static union hv_register_vsm_capabilities mshv_vsm_capabilities;
>> @@ -201,34 +186,6 @@ static struct page *mshv_vtl_cpu_reg_page(int cpu)
>>   	return *per_cpu_ptr(&mshv_vtl_per_cpu.reg_page, cpu);
>>   }
>>
>> -static void mshv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
>> -{
>> -	struct hv_register_assoc reg_assoc = {};
>> -	union hv_synic_overlay_page_msr overlay = {};
>> -	struct page *reg_page;
>> -
>> -	reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
>> -	if (!reg_page) {
>> -		WARN(1, "failed to allocate register page\n");
>> -		return;
>> -	}
>> -
>> -	overlay.enabled = 1;
>> -	overlay.pfn = page_to_hvpfn(reg_page);
>> -	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
>> -	reg_assoc.value.reg64 = overlay.as_uint64;
>> -
>> -	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> -				     1, input_vtl_zero, &reg_assoc)) {
>> -		WARN(1, "failed to setup register page\n");
>> -		__free_page(reg_page);
>> -		return;
>> -	}
>> -
>> -	per_cpu->reg_page = reg_page;
>> -	mshv_has_reg_page = true;
>> -}
>> -
>>   static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>>   {
>>   	union hv_synic_sint sint;
>> @@ -329,8 +286,10 @@ static int mshv_vtl_alloc_context(unsigned int cpu)
>>   	if (!per_cpu->run)
>>   		return -ENOMEM;
>>
>> -	if (mshv_vsm_capabilities.intercept_page_available)
>> -		mshv_vtl_configure_reg_page(per_cpu);
>> +	if (mshv_vsm_capabilities.intercept_page_available) {
>> +		if (hv_vtl_configure_reg_page(per_cpu))
>> +			mshv_has_reg_page = true;
> 
> As Sashiko AI noted, it doesn't work to use the global mshv_has_reg_page
> to indicate the success of configuring the reg page, which is a per-cpu
> operation. But this bug existed before this patch set, so maybe it should
> be fixed as a preliminary patch.

Acked. Will address them in a separate series.

> 
>> +	}
>>
>>   	mshv_vtl_synic_enable_regs(cpu);
>>
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index b147a12085e4..b53fcc071596 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -383,8 +383,50 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>   	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>>   }
>>
>> +#if IS_ENABLED(CONFIG_MSHV_ROOT) || IS_ENABLED(CONFIG_MSHV_VTL)
>> +int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> +			     union hv_input_vtl input_vtl,
>> +			     struct hv_register_assoc *registers);
>> +
>> +int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> +			     union hv_input_vtl input_vtl,
>> +			     struct hv_register_assoc *registers);
>> +#else
>> +static inline int hv_call_get_vp_registers(u32 vp_index, u64 partition_id,
>> +					   u16 count,
>> +					   union hv_input_vtl input_vtl,
>> +					   struct hv_register_assoc *registers)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline int hv_call_set_vp_registers(u32 vp_index, u64 partition_id,
>> +					   u16 count,
>> +					   union hv_input_vtl input_vtl,
>> +					   struct hv_register_assoc *registers)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +#endif /* CONFIG_MSHV_ROOT || CONFIG_MSHV_VTL */
>> +
>>   #define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT	12
>> +
>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> +struct mshv_vtl_per_cpu {
>> +	struct mshv_vtl_run *run;
>> +	struct page *reg_page;
>> +};
>> +
>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>> +union hv_synic_overlay_page_msr {
>> +	u64 as_uint64;
>> +	struct {
>> +		u64 enabled: 1;
>> +		u64 reserved: 11;
>> +		u64 pfn: 52;
>> +	} __packed;
>> +};
>> +
>>   u8 __init get_vtl(void);
>>   #else
>>   static inline u8 get_vtl(void) { return 0; }
>> --
>> 2.43.0
>>
> 
> Sashiko AI noted another existing bug in mshv_vtl_init(), which is that
> the error path does kfree(mem_dev) when it should do
> put_device(mem_dev).  See the comment in the header of
> device_initialize().


To avoid this series bloating up, I am thinking of taking up these fixes 
in a separate series.

Regards,
Naman

^ permalink raw reply

* Re: [PATCH 10/11] Drivers: hv: Add support for arm64 in MSHV_VTL
From: Naman Jain @ 2026-04-20 15:24 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB41576766C5FB291952CC58E8D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:28 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Add necessary support to make MSHV_VTL work for arm64 architecture.
>> * Add stub implementation for mshv_vtl_return_call_init(): not required
>>    for arm64
>> * Remove fpu/legacy.h header inclusion, as this is not required
>> * handle HV_REGISTER_VSM_CODE_PAGE_OFFSETS register: not supported
>>    in arm64
>> * Configure custom percpu_vmbus_handler by using
>>    hv_setup_percpu_vmbus_handler()
>> * Handle hugepage functions by config checks
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/include/asm/mshyperv.h |  2 ++
>>   drivers/hv/mshv_vtl_main.c        | 21 ++++++++++++++-------
>>   2 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index 36803f0386cc..027a7f062d70 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -83,6 +83,8 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
>>   	return 1;
>>   }
>>
>> +static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> +
>>   void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>>   #endif
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 4c9ae65ad3e8..5702fe258500 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -23,8 +23,6 @@
>>   #include <trace/events/ipi.h>
>>   #include <uapi/linux/mshv.h>
>>   #include <hyperv/hvhdk.h>
>> -
>> -#include "../../kernel/fpu/legacy.h"
> 
> Was there a particular code change that made this unnecessary? Or was it
> unnecessary from the start of this source code file? Just curious ....

This was present in initial driver changes when the assembly code was 
part of this driver. Then it moved to arch files and this was left here.
Just cleaning it up.


> 
>>   #include "mshv.h"
>>   #include "mshv_vtl.h"
>>   #include "hyperv_vmbus.h"
>> @@ -206,18 +204,21 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>>   static int mshv_vtl_get_vsm_regs(void)
>>   {
>>   	struct hv_register_assoc registers[2];
>> -	int ret, count = 2;
>> +	int ret, count = 0;
>>
>> -	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> -	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
>> +	registers[count++].name = HV_REGISTER_VSM_CAPABILITIES;
>> +	/* Code page offset register is not supported on ARM */
>> +	if (IS_ENABLED(CONFIG_X86_64))
>> +		registers[count++].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>>
>>   	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>>   				       count, input_vtl_zero, registers);
>>   	if (ret)
>>   		return ret;
>>
>> -	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
>> -	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
>> +	mshv_vsm_capabilities.as_uint64 = registers[0].value.reg64;
>> +	if (IS_ENABLED(CONFIG_X86_64))
>> +		mshv_vsm_page_offsets.as_uint64 = registers[1].value.reg64;
>>
>>   	return ret;
>>   }
> 
> This function has gotten somewhat messy to handle the x86 and arm64
> differences. Let me suggest a different approach. Have this function only
> get the VSM capabilities register, as that is generic across x86 and
> arm64. Then, update x86 mshv_vtl_return_call_init() to get the
> PAGE_OFFSETS register and then immediately use the value to update
> the static call. The global variable mshv_vms_page_offsets is no longer
> necessary.
> 
> My suggestion might be little more code because hv_call_get_vp_registers()
> is invoked in two different places. But it cleanly separates the two use
> cases, and keeps the x86 hackery under arch/x86.
> 

I implemented this in my dev branch, and it works fine. Thanks for the 
suggestion.


>> @@ -280,10 +281,13 @@ static int hv_vtl_setup_synic(void)
>>
>>   	/* Use our isr to first filter out packets destined for userspace */
>>   	hv_setup_vmbus_handler(mshv_vtl_vmbus_isr);
>> +	/* hv_setup_vmbus_handler() is stubbed for ARM64, add per-cpu VMBus handlers instead */
>> +	hv_setup_percpu_vmbus_handler(mshv_vtl_vmbus_isr);
>>
>>   	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vtl:online",
>>   				mshv_vtl_alloc_context, NULL);
>>   	if (ret < 0) {
>> +		hv_setup_percpu_vmbus_handler(vmbus_isr);
>>   		hv_setup_vmbus_handler(vmbus_isr);
>>   		return ret;
>>   	}
>> @@ -296,6 +300,7 @@ static int hv_vtl_setup_synic(void)
>>   static void hv_vtl_remove_synic(void)
>>   {
>>   	cpuhp_remove_state(mshv_vtl_cpuhp_online);
>> +	hv_setup_percpu_vmbus_handler(vmbus_isr);

hv_setup_percpu_vmbus_handler() calls will also be removed with the 
redesign.

Regards,
Naman

^ permalink raw reply

* Re: [PATCH 11/11] Drivers: hv: Kconfig: Add ARM64 support for MSHV_VTL
From: Naman Jain @ 2026-04-20 15:24 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157FEE5578344625418BFDBD450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:28 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
> 
> Nit: In keeping with past practice, the "Subject" prefix for this patch could
> just be "Drivers: hv:"

Acked.
I am also planning to change other subject line prefixes, based on your 
earlier suggestion:

mshv_vtl_main changes - "mshv_vtl: "
arch/arm64 Hyper-V changes - "arm64: hyperv: "
arch/x86 Hyper-V changes - "x86/hyperv: "


Thank you so much for doing such a thorough review. I really appreciate 
all the help and guidance.

Regards,
Naman



^ permalink raw reply

* [BUG/RFE] hv_balloon: hot-add not triggered under burst memory demand
From: Boiler Plate @ 2026-04-20 15:30 UTC (permalink / raw)
  To: linux-hyperv; +Cc: wei.liu, mhklinux
In-Reply-To: <CAPLehQ3cT+jEUkLnfBYFh3a=eCZcxzBeDvTX-B0MWdSJ_oEFuQ@mail.gmail.com>

Hi !

I'm experiencing a consistent failure of the hv_balloon driver to
respond to burst memory demand on Alpine Linux, with VS Code Remote
devcontainers as a
representative workload.

The issue has been thoroughly analyzed using PSI monitoring and kernel
configuration verification. The root cause is the absence of burst
demand support in the driver architecture, compounded by the 1-second
polling loop latency.

A detailed analysis with measurement data and proposed improvements down below.
--

Bug Report / Request for Enhancement: hv_balloon Dynamic Memory
Hot-Add Fails Under Burst Demand Workloads

SUMMARY

The Linux hv_balloon driver for Hyper-V Dynamic Memory is documented
and architecturally designed to manage guest memory in both
directions: increasing memory via hot-add when the guest needs more,
and decreasing it via balloon inflation when the guest needs less. The
original patch comment introducing the driver explicitly states this
dual purpose, and the driver's state machine contains distinct states
for DM_BALLOON_UP, DM_BALLOON_DOWN and DM_HOT_ADD.

In practice, only the downward direction works reliably. This was
demonstrated in a controlled test running Alpine Linux v3.23 (kernel
6.18.20-lts) as a Hyper-V guest with Dynamic Memory configured
(Startup RAM 1024 MB, Maximum 16384 MB). Under a representative burst
demand workload using VS Code Remote SSH with devcontainer startup,
the guest experienced 97% PSI memory stall, 176,000+ swap pages
written, and near-OOM conditions over a sustained period exceeding 150
seconds. During this entire period, MemTotal never increased by a
single kilobyte and dmesg showed zero hot-add activity. The upward
direction failed completely.

The root cause is the complete absence of burst demand support in the
driver architecture, compounded by a fixed-interval 1-second polling
loop between guest and host, sequential hot-add protocol semantics,
and a kernel default configuration (MHP_DEFAULT_ONLINE_TYPE_OFFLINE)
that leaves hot-added memory sections offline even when they do
arrive. Collectively these mean the driver cannot respond to burst
memory demand fast enough to be useful.

Proposed resolution: PSI-triggered hot-add requests independent of the
1-second polling loop, documentation of required
auto_online_blocks=online configuration, and improved diagnostics when
hot-add is not initiated despite high memory pressure.


1. VERIFIED PURPOSE AND DESIGN INTENT

The original patch introducing hv_balloon into the Linux kernel states
explicitly:

"Windows hosts dynamically manage the guest memory allocation via a
combination memory hot add and ballooning. Memory hot add is used to
grow the guest memory up to the maximum memory that can be allocated
to the guest. Ballooning is used to both shrink as well as expand up
to the max memory."

Source: K.Y. Srinivasan, [PATCH 2/2] Drivers: hv: Add Hyper-V balloon
driver, lkml.iu.edu, 2012.

The driver's state machine in the current kernel source at
drivers/hv/hv_balloon.c confirms this with explicit states
DM_BALLOON_UP, DM_BALLOON_DOWN and DM_HOT_ADD. Source:
github.com/torvalds/linux/blob/master/drivers/hv/hv_balloon.c
(verified April 2026).

The upward direction is therefore not an optional or aspirational
feature. It is the primary stated purpose of the hot-add component of
the driver.


2. SYSTEM CONFIGURATION

Host: Windows Server 2022 with Hyper-V, Dynamic Memory enabled

Guest OS: Alpine Linux v3.23, kernel 6.18.20-lts (x86_64)

Kernel config relevant to this report:
- CONFIG_MEMORY_HOTPLUG=y
- CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE=y (Alpine default)
- CONFIG_PSI=y
- CONFIG_PSI_DEFAULT_DISABLED=y

Hyper-V Dynamic Memory settings:
- Startup RAM: 1024 MB
- Minimum RAM: 512 MB
- Maximum RAM: 16384 MB

auto_online_blocks: Set to online manually after discovering the
default was offline

PSI: Enabled via psi=1 kernel parameter after discovering
CONFIG_PSI_DEFAULT_DISABLED=y

Driver module parameters verified on test system:
- /sys/module/hv_balloon/parameters/hot_add = Y (hot-add enabled)
- /sys/module/hv_balloon/parameters/pressure_report_delay = 0 (no startup delay)


3. USE CASE: VS CODE REMOTE SSH WITH DEVCONTAINER

This use case is representative of a class of developer workloads with
containerized development environments, that are increasingly common
on Linux VMs hosted on Hyper-V.

Workload profile:
VS Code Remote SSH connects to the Alpine guest and starts a Home
Assistant Add-on development container (ha-dev). The VS Code server
process (node) expands from zero to approximately 240 MB RSS within 10
seconds of connection. Multiple node processes spawn in rapid
succession as extensions and language servers load.

Expected behavior:
Hyper-V Dynamic Memory detects guest memory pressure, initiates
hot-add to expand MemTotal beyond the startup value, guest makes the
new memory available via auto_online_blocks, workload proceeds
normally.

Observed behavior:
MemTotal remained at 921764 kB (startup RAM) throughout the entire
session. dmesg showed no hot-add activity whatsoever. The system
responded by swapping aggressively and reaching PSI avg10 values of
97% before becoming unresponsive.


4. MEASUREMENT DATA

All measurements were collected using a custom shell script sampling
/proc/pressure/memory, /proc/vmstat and /proc/meminfo at 1-2 second
intervals, with per-process RSS from /proc/PID/status.

Timeline of VS Code startup of remote containers (elapsed time from connection)

Time    Event                               PSI delta/s      Swap
pages out   MemTotal
+0s     Baseline, docker running            0 us             0
       921764 kB
+48s    VS Code server appears              0 us             0
       921764 kB
+50s    node 107 MB RSS                     152,837 us       3,748
       921764 kB
+55s    4 node processes, 348 MB RSS total  941,966 us       22,945
       921764 kB
+92s    PSI avg10 48%                       6,357,506 us     159,934
       921764 kB
+106s   PSI avg10 83%                       13,446,517 us    160,262
       921764 kB
+179s   PSI avg10 97%                       20,093,449 us    165,208
       921764 kB

Key observations:
- MemTotal never changed from startup value
- No hot-add lines appeared in dmesg at any point during or after the session
- PSI cumulative stall since boot at session end: 804,383,983
microseconds (804 seconds of accumulated memory stall)


5. ROOT CAUSE ANALYSIS

The fundamental design gap in hv_balloon is the complete absence of
burst demand support. The driver was designed around a polling-based,
fixed-interval model and has no mechanism to detect or respond to
rapid memory transitions. All other issues described below are
consequences or amplifications of this core architectural limitation.

Issue 1: No burst demand support in the driver architecture

The driver has no concept of burst demand, a rapid transition from low
memory pressure to near-OOM within seconds. There is no fast path, no
threshold trigger, and no priority escalation mechanism. The entire
communication model between guest and host is based on periodic status
reporting, which by design introduces latency that is structurally
incompatible with burst workloads. A guest can transition from 0% to
97% PSI memory stall and write 160,000 swap pages before the driver
has sent more than a handful of status messages to the host.

Modern workloads such as containerized development environments,
Kubernetes pod scheduling, JVM heap initialization, Node.js extension
loading, routinely demand hundreds of megabytes of memory within a
5-10 second window. The driver architecture predates this workload
class entirely.

Issue 2: 1-second fixed-interval polling loop is too slow for burst workloads

The hv_balloon thread reports memory pressure to the host once per
second via post_status(). Source:
elixir.bootlin.com/linux/v6.14.6/source/drivers/hv/hv_balloon.c#L1381
(verified via Medium article by Shlomi Boutnaru, May 2025).

VS Code expanded from 0 to 240 MB RSS in under 10 seconds. By the time
the host received sufficient pressure signals to consider a hot-add
response, the guest had already exhausted available memory and entered
heavy swap. The polling cadence has no mechanism to accelerate or
escalate regardless of how severe or rapid the memory pressure
becomes.

Issue 3: pressure_report_delay not a factor in this case

The hv_balloon module parameter pressure_report_delay defaults to 30
seconds per the original 2013 patch. Source: K.Y. Srinivasan, [PATCH
1/2] Drivers: hv: balloon: Add a parameter to delay pressure
reporting, lkml.indiana.edu, 2013. On the test system this parameter
was verified to be 0
(/sys/module/hv_balloon/parameters/pressure_report_delay = 0), meaning
pressure reporting was not delayed. This eliminates
pressure_report_delay as a contributing factor in this specific case
and strengthens the conclusion that Issues 1 and 2 are solely
responsible for the observed failure.

Note: on systems where pressure_report_delay retains its default value
of 30, the failure window would be significantly wider, as the host
would receive no pressure data at all during the first 30 seconds
after driver load.

Issue 4: Sequential hot-add protocol prevents parallel responses

Per the Hyper-V Dynamic Memory protocol specification: the host must
not send a new hot-add request until the guest has responded to the
previous one. Source: quoted in QEMU developer discussion,
mail-archive.com, September 2020. Combined with the 128 MB minimum
DIMM size for Linux hot-add (source: patchew.org, verified search
result), each expansion step is large, slow and serialized.

Issue 5: MHP_DEFAULT_ONLINE_TYPE_OFFLINE leaves hot-added memory unusable

Alpine Linux ships with CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE=y.
Hot-added memory sections are registered in sysfs but remain in
offline state until explicitly brought online. Without udev (Alpine
uses mdev) there is no automatic mechanism to online new sections. The
auto_online_blocks sysfs interface defaults to offline and must be
manually set to online.

This issue was identified and resolved in this specific environment by
setting echo online > /sys/devices/system/memory/auto_online_blocks
and making it persistent via /etc/local.d/memory-hotplug.start.
However even with this fix applied, hot-add was never triggered by the
host during the VS Code session. This confirms that Issues 1-4 are the
primary blockers and Issue 5 is a prerequisite that was already
satisfied.


6. COMPARISON WITH KNOWN SIMILAR REPORTS AND RECENT PATCHES

This failure mode is not new. A Kubernetes/minikube issue from 2017
describes an identical pattern: memory demand increases, Hyper-V
Manager shows warning status, assigned memory never increases, OOM
killer activates. Source: github.com/kubernetes/minikube/issues/1403.
The issue was closed as stale without resolution. The present report
provides significantly more detailed measurement data and kernel
configuration context than the prior report.

The driver is actively maintained. Two recent patches are relevant as context:

- A March 2024 patch by Michael Kelley fixes hot-add failures on
systems with memblock sizes larger than 128 MB, where add_memory()
would fail with error -22. Source: lore.kernel.org/lkml, March 2024.
This is a separate correctness fix and does not address burst demand.
- A January 2025 patch accepted into hyperv-next fixes an issue where
the balloon driver's global page-onlining callback blocked hot-add of
memory from GPU and vPCI device drivers. Source:
mail-archive.com/linux-hyperv, January 2025. Again a separate
correctness fix, but both patches confirm the driver is under active
development and that the maintainers are responsive to bug reports.

No open patches or RFC discussions on linux-hyperv@vger.kernel.org
addressing burst demand or PSI integration in hv_balloon were
identified as of April 2026.


7. PROPOSED IMPROVEMENTS

RFE 1: PSI-triggered hot-add requests to handle burst demand

The driver's current architecture is built around fixed-interval
polling: it reports memory pressure to the host once per second via
post_status() and waits for the host to initiate a hot-add sequence.
This design has no mechanism to accelerate or escalate outside the
polling cadence regardless of how severe or rapid the memory pressure
becomes.

Modern workloads have fundamentally different memory characteristics.
Containers, container runtimes (Docker, containerd, podman), JVM-based
systems, Node.js applications, Kubernetes pods, and development
environments such as VS Code devcontainers routinely exhibit burst
demand patterns: a guest transitions from low memory pressure to
near-OOM within seconds as processes spawn, images are pulled, or
runtimes initialize their heaps. Kubernetes is a particularly
illustrative case - the entire value proposition of dynamic memory
allocation in a Kubernetes node depends on the hypervisor being able
to supply memory fast enough to honor pod scheduling decisions. When a
scheduler assigns a new pod to a node, it expects memory to be
available within seconds, not after a multi-second feedback loop that
may itself be preceded by a 30-second pressure_report_delay. This
pattern is not an edge case - it is the normal startup behavior of a
significant proportion of workloads running on Linux VMs today.

The VS Code devcontainer use case presented in this report is
representative but not exceptional. Any workload that combines a
container runtime with a language server, a build system, a database
startup sequence, or a Kubernetes pod scheduling event will exhibit
similar burst demand characteristics. The 1-second fixed-interval
polling loop is structurally incapable of protecting against this
class of memory event regardless of host configuration.

The infrastructure to solve this already exists in the Linux kernel.
PSI threshold triggers via poll() on /proc/pressure/memory have been
available since kernel 4.20 and are already used by systemd-oomd and
Facebook's oomd to react to memory pressure faster than periodic
polling allows. The driver should leverage this same mechanism to send
an immediate out-of-band hot-add request to the host when burst demand
is detected, specifically when memory.full exceeds a configurable
threshold such as 10% over a 500ms window. This would allow the host
to begin the hot-add sequence at the onset of burst demand rather than
after the guest has already entered heavy swap.

This improvement requires no protocol-level changes if the existing
hot-add request message is used as the signal. A protocol extension
adding an explicit "burst demand" flag to the status message would be
preferable, allowing the host to prioritize the response and bypass
any queuing of normal pressure-based adjustments.

RFE 2: Document auto_online_blocks requirement

The kernel documentation and Hyper-V guest integration documentation
should explicitly state that auto_online_blocks must be set to online
(or the kernel compiled with MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO) for
Dynamic Memory hot-add to function on distributions that do not use
udev with a memory hotplug rule. Currently this is undocumented and
discoverable only by reading kernel source or community bug reports.

RFE 3: Diagnostics when hot-add is not triggered

When PSI memory.full avg10 exceeds a significant threshold (e.g. 10%)
without a hot-add request being sent or received, the driver should
emit a pr_warn to dmesg. Currently the guest has no visibility into
whether the host is aware of pressure, whether a hot-add request was
sent, or whether it failed. This makes the failure mode completely
silent from the guest's perspective.

RFE 4: Consider proactive pressure signaling

The driver could be extended to send an out-of-band high-priority
pressure signal to the host when PSI crosses a critical threshold,
rather than waiting for the next 1-second polling cycle. This would
require a protocol-level change and coordination with the Hyper-V host
implementation but would address the fundamental latency issue.


8. WHAT IS UNKNOWN

Why the host never sent a hot-add request despite the guest reaching
near-OOM conditions is unknown. The host-side decision algorithm for
when to initiate hot-add is proprietary and not publicly documented.
It is possible the host's memory pressure thresholds were not met
because guest-side pressure reporting was too slow to accumulate
sufficient signal. It is also possible the host's algorithm is simply
not designed for burst workloads of this type.

The Hyper-V Dynamic Memory Buffer setting (configurable between 5% and
200%, default 20%) controls how much headroom the host maintains above
current demand. Whether increasing this value would provide sufficient
buffer to absorb burst demand without requiring hot-add at all is
unknown without testing. It would not address the architectural
limitation but could serve as a partial operational mitigation.
--

^ permalink raw reply

* Re: [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Stanislav Kinsburskii @ 2026-04-20 16:21 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157CD26728B2D4BFD171DB7D4242@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, Apr 13, 2026 at 09:08:16PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> > 
> > The HMM interface returns PFNs from hmm_range_fault(), and the
> > hypervisor hypercalls operate on PFNs. Storing page pointers in
> > between these interfaces requires unnecessary conversions and
> > temporary allocations.
> > 
> > Store PFNs directly in memory regions to match the natural data flow.
> > This eliminates the temporary PFN array allocation in the HMM fault
> > path and reduces page_to_pfn() conversions throughout the driver.
> > Convert to page structs via pfn_to_page() only when operations like
> > unpin_user_page() require them.
> 
> General comment for this series:  PFN fields are typed as "unsigned long".
> But pfn_offset and pfn_count are "u64".  GFNs are also "u64".  Any
> reason not to make PFNs also "u64"? I know that pfn_valid() takes
> an "unsigned long" input, but see comment below about pfn_valid().
> 

The only reason is to keep the type consistent with the standard Linux
kernel definition of PFN as unsigned long.

> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c      |  297 ++++++++++++++++++++++------------------
> >  drivers/hv/mshv_root.h         |   20 +--
> >  drivers/hv/mshv_root_hv_call.c |   50 +++----
> >  drivers/hv/mshv_root_main.c    |   30 ++--
> >  4 files changed, 212 insertions(+), 185 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index fdffd4f002f6..b1a707d16c07 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -18,12 +18,13 @@
> >  #include "mshv_root.h"
> > 
> >  #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
> > +#define MSHV_INVALID_PFN				ULONG_MAX
> > 
> >  /**
> >   * mshv_chunk_stride - Compute stride for mapping guest memory
> >   * @page      : The page to check for huge page backing
> >   * @gfn       : Guest frame number for the mapping
> > - * @page_count: Total number of pages in the mapping
> > + * @pfn_count: Total number of pages in the mapping
> 
> Nit: The colons are misaligned after this change.
> 
> >   *
> >   * Determines the appropriate stride (in pages) for mapping guest memory.
> >   * Uses huge page stride if the backing page is huge and the guest mapping
> > @@ -32,18 +33,18 @@
> >   * Return: Stride in pages, or -EINVAL if page order is unsupported.
> >   */
> >  static int mshv_chunk_stride(struct page *page,
> > -			     u64 gfn, u64 page_count)
> > +			     u64 gfn, u64 pfn_count)
> >  {
> >  	unsigned int page_order;
> > 
> >  	/*
> >  	 * Use single page stride by default. For huge page stride, the
> >  	 * page must be compound and point to the head of the compound
> > -	 * page, and both gfn and page_count must be huge-page aligned.
> > +	 * page, and both gfn and pfn_count must be huge-page aligned.
> >  	 */
> >  	if (!PageCompound(page) || !PageHead(page) ||
> >  	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> > -	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
> > +	    !IS_ALIGNED(pfn_count, PTRS_PER_PMD))
> >  		return 1;
> > 
> >  	page_order = folio_order(page_folio(page));
> > @@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
> >  /**
> >   * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> >   *                             in a region.
> > - * @region     : Pointer to the memory region structure.
> > - * @flags      : Flags to pass to the handler.
> > - * @page_offset: Offset into the region's pages array to start processing.
> > - * @page_count : Number of pages to process.
> > - * @handler    : Callback function to handle the chunk.
> > + * @region    : Pointer to the memory region structure.
> > + * @flags     : Flags to pass to the handler.
> > + * @pfn_offset: Offset into the region's PFNs array to start processing.
> > + * @pfn_count : Number of PFNs to process.
> > + * @handler   : Callback function to handle the chunk.
> >   *
> > - * This function scans the region's pages starting from @page_offset,
> > - * checking for contiguous present pages of the same size (normal or huge).
> > - * It invokes @handler for the chunk of contiguous pages found. Returns the
> > - * number of pages handled, or a negative error code if the first page is
> > - * not present or the handler fails.
> > + * This function scans the region's PFNs starting from @pfn_offset,
> > + * checking for contiguous valid PFNs backed by pages of the same size
> > + * (normal or huge). It invokes @handler for the chunk of contiguous valid
> > + * PFNs found. Returns the number of PFNs handled, or a negative error code
> > + * if the first PFN is invalid or the handler fails.
> >   *
> > - * Note: The @handler callback must be able to handle both normal and huge
> > - * pages.
> > + * Note: The @handler callback must be able to handle valid PFNs backed by
> > + * both normal and huge pages.
> >   *
> >   * Return: Number of pages handled, or negative error code.
> >   */
> > -static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > -				      u32 flags,
> > -				      u64 page_offset, u64 page_count,
> > -				      int (*handler)(struct mshv_mem_region *region,
> > -						     u32 flags,
> > -						     u64 page_offset,
> > -						     u64 page_count,
> > -						     bool huge_page))
> > +static long mshv_region_process_pfns(struct mshv_mem_region *region,
> > +				     u32 flags,
> > +				     u64 pfn_offset, u64 pfn_count,
> > +				     int (*handler)(struct mshv_mem_region *region,
> > +						    u32 flags,
> > +						    u64 pfn_offset,
> > +						    u64 pfn_count,
> > +						    bool huge_page))
> >  {
> > -	u64 gfn = region->start_gfn + page_offset;
> > +	u64 gfn = region->start_gfn + pfn_offset;
> >  	u64 count;
> > -	struct page *page;
> > +	unsigned long pfn;
> >  	int stride, ret;
> > 
> > -	page = region->mreg_pages[page_offset];
> > -	if (!page)
> > +	pfn = region->mreg_pfns[pfn_offset];
> > +	if (!pfn_valid(pfn))
> >  		return -EINVAL;
> > 
> > -	stride = mshv_chunk_stride(page, gfn, page_count);
> > +	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
> >  	if (stride < 0)
> >  		return stride;
> > 
> >  	/* Start at stride since the first stride is validated */
> > -	for (count = stride; count < page_count; count += stride) {
> > -		page = region->mreg_pages[page_offset + count];
> > +	for (count = stride; count < pfn_count ; count += stride) {
> > +		pfn = region->mreg_pfns[pfn_offset + count];
> > 
> > -		/* Break if current page is not present */
> > -		if (!page)
> > +		/* Break if current pfn is invalid */
> > +		if (!pfn_valid(pfn))
> 
> pfn_valid() is a relatively expensive test to be doing in a loop
> on what may be every single page. It does an RCU lock/unlock
> and make other checks that aren't necessary here. Since
> mreg_pfns[] is populated from mm calls, the only invalid PFNs
> would be MSHV_INVALID_PFN that code in this module has
> explicitly put there. Just testing against MSHV_INVALID_PFN
> would be a lot faster here and elsewhere in this module. It's
> really a "pfn set/not set" test. Defining a pfn_set() macro
> here in this module that tests against MSHV_INVALID_PFN
> would accomplish the same thing more efficiently.
> 

Yes, we could do it the way you suggest. For completeness, I should add
that pfn_valid() is expensive only on 32-bit ARM and ARC, which we
don’t care about.

> >  			break;
> > 
> >  		/* Break if stride size changes */
> > -		if (stride != mshv_chunk_stride(page, gfn + count,
> > -						page_count - count))
> > +		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
> > +						gfn + count,
> > +						pfn_count - count))
> >  			break;
> >  	}
> > 
> > -	ret = handler(region, flags, page_offset, count, stride > 1);
> > +	ret = handler(region, flags, pfn_offset, count, stride > 1);
> >  	if (ret)
> >  		return ret;
> > 
> > @@ -118,70 +120,73 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> >  }
> > 
> >  /**
> > - * mshv_region_process_range - Processes a range of memory pages in a
> > - *                             region.
> > - * @region     : Pointer to the memory region structure.
> > - * @flags      : Flags to pass to the handler.
> > - * @page_offset: Offset into the region's pages array to start processing.
> > - * @page_count : Number of pages to process.
> > - * @handler    : Callback function to handle each chunk of contiguous
> > - *               pages.
> > + * mshv_region_process_range - Processes a range of PFNs in a region.
> > + * @region    : Pointer to the memory region structure.
> > + * @flags     : Flags to pass to the handler.
> > + * @pfn_offset: Offset into the region's PFNs array to start processing.
> > + * @pfn_count : Number of PFNs to process.
> > + * @handler   : Callback function to handle each chunk of contiguous
> > + *              valid PFNs.
> >   *
> > - * Iterates over the specified range of pages in @region, skipping
> > - * non-present pages. For each contiguous chunk of present pages, invokes
> > - * @handler via mshv_region_process_chunk.
> > + * Iterates over the specified range of PFNs in @region, skipping
> > + * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
> > + * @handler via mshv_region_process_pfns.
> >   *
> > - * Note: The @handler callback must be able to handle both normal and huge
> > - * pages.
> > + * Note: The @handler callback must be able to handle PFNs backed by both
> > + * normal and huge pages.
> >   *
> >   * Returns 0 on success, or a negative error code on failure.
> >   */
> >  static int mshv_region_process_range(struct mshv_mem_region *region,
> >  				     u32 flags,
> > -				     u64 page_offset, u64 page_count,
> > +				     u64 pfn_offset, u64 pfn_count,
> >  				     int (*handler)(struct mshv_mem_region *region,
> >  						    u32 flags,
> > -						    u64 page_offset,
> > -						    u64 page_count,
> > +						    u64 pfn_offset,
> > +						    u64 pfn_count,
> >  						    bool huge_page))
> >  {
> > +	u64 pfn_end;
> 
> In Patch 2 of this series, "pfn_end" is changed to just "end", and
> the references are adjusted. Patch 2 could be a few lines smaller if it
> was named "end" here and Patch 2 didn't have to change it.
> 

Sure, can do.

> >  	long ret;
> > 
> > -	if (page_offset + page_count > region->nr_pages)
> > +	if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> > +		return -EOVERFLOW;
> > +
> > +	if (pfn_end > region->nr_pfns)
> >  		return -EINVAL;
> > 
> > -	while (page_count) {
> > +	while (pfn_count) {
> >  		/* Skip non-present pages */
> > -		if (!region->mreg_pages[page_offset]) {
> > -			page_offset++;
> > -			page_count--;
> > +		if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> > +			pfn_offset++;
> > +			pfn_count--;
> >  			continue;
> >  		}
> > 
> > -		ret = mshv_region_process_chunk(region, flags,
> > -						page_offset,
> > -						page_count,
> > -						handler);
> > +		ret = mshv_region_process_pfns(region, flags,
> > +					       pfn_offset, pfn_count,
> > +					       handler);
> >  		if (ret < 0)
> >  			return ret;
> > 
> > -		page_offset += ret;
> > -		page_count -= ret;
> > +		pfn_offset += ret;
> > +		pfn_count -= ret;
> >  	}
> > 
> >  	return 0;
> >  }
> > 
> > -struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> > +struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
> >  					   u64 uaddr, u32 flags)
> >  {
> >  	struct mshv_mem_region *region;
> > +	u64 i;
> > 
> > -	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
> > +	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
> 
> Use struct_size(region, mreg_pfns, nr_pfns) instead of open coding the arithmetic?
> 

This is new to me. Sure, will do.

Thanks,
Stanislav

> >  	if (!region)
> >  		return ERR_PTR(-ENOMEM);
> > 
> > -	region->nr_pages = nr_pages;
> > +	region->nr_pfns = nr_pfns;
> >  	region->start_gfn = guest_pfn;
> >  	region->start_uaddr = uaddr;
> >  	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
> > @@ -190,6 +195,9 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> >  	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
> >  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> > 
> > +	for (i = 0; i < nr_pfns; i++)
> > +		region->mreg_pfns[i] = MSHV_INVALID_PFN;
> > +
> >  	kref_init(&region->mreg_refcount);
> > 
> >  	return region;
> > @@ -197,15 +205,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> > 
> >  static int mshv_region_chunk_share(struct mshv_mem_region *region,
> >  				   u32 flags,
> > -				   u64 page_offset, u64 page_count,
> > +				   u64 pfn_offset, u64 pfn_count,
> >  				   bool huge_page)
> >  {
> >  	if (huge_page)
> >  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > 
> >  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> > -					      region->mreg_pages + page_offset,
> > -					      page_count,
> > +					      region->mreg_pfns + pfn_offset,
> > +					      pfn_count,
> >  					      HV_MAP_GPA_READABLE |
> >  					      HV_MAP_GPA_WRITABLE,
> >  					      flags, true);
> > @@ -216,21 +224,21 @@ int mshv_region_share(struct mshv_mem_region *region)
> >  	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
> > 
> >  	return mshv_region_process_range(region, flags,
> > -					 0, region->nr_pages,
> > +					 0, region->nr_pfns,
> >  					 mshv_region_chunk_share);
> >  }
> > 
> >  static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> >  				     u32 flags,
> > -				     u64 page_offset, u64 page_count,
> > +				     u64 pfn_offset, u64 pfn_count,
> >  				     bool huge_page)
> >  {
> >  	if (huge_page)
> >  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > 
> >  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> > -					      region->mreg_pages + page_offset,
> > -					      page_count, 0,
> > +					      region->mreg_pfns + pfn_offset,
> > +					      pfn_count, 0,
> >  					      flags, false);
> >  }
> > 
> > @@ -239,30 +247,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
> >  	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
> > 
> >  	return mshv_region_process_range(region, flags,
> > -					 0, region->nr_pages,
> > +					 0, region->nr_pfns,
> >  					 mshv_region_chunk_unshare);
> >  }
> > 
> >  static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> >  				   u32 flags,
> > -				   u64 page_offset, u64 page_count,
> > +				   u64 pfn_offset, u64 pfn_count,
> >  				   bool huge_page)
> >  {
> >  	if (huge_page)
> >  		flags |= HV_MAP_GPA_LARGE_PAGE;
> > 
> > -	return hv_call_map_gpa_pages(region->partition->pt_id,
> > -				     region->start_gfn + page_offset,
> > -				     page_count, flags,
> > -				     region->mreg_pages + page_offset);
> > +	return hv_call_map_ram_pfns(region->partition->pt_id,
> > +				    region->start_gfn + pfn_offset,
> > +				    pfn_count, flags,
> > +				    region->mreg_pfns + pfn_offset);
> >  }
> > 
> > -static int mshv_region_remap_pages(struct mshv_mem_region *region,
> > -				   u32 map_flags,
> > -				   u64 page_offset, u64 page_count)
> > +static int mshv_region_remap_pfns(struct mshv_mem_region *region,
> > +				  u32 map_flags,
> > +				  u64 pfn_offset, u64 pfn_count)
> >  {
> >  	return mshv_region_process_range(region, map_flags,
> > -					 page_offset, page_count,
> > +					 pfn_offset, pfn_count,
> >  					 mshv_region_chunk_remap);
> >  }
> > 
> > @@ -270,38 +278,50 @@ int mshv_region_map(struct mshv_mem_region *region)
> >  {
> >  	u32 map_flags = region->hv_map_flags;
> > 
> > -	return mshv_region_remap_pages(region, map_flags,
> > -				       0, region->nr_pages);
> > +	return mshv_region_remap_pfns(region, map_flags,
> > +				      0, region->nr_pfns);
> >  }
> > 
> > -static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > -					 u64 page_offset, u64 page_count)
> > +static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
> > +					u64 pfn_offset, u64 pfn_count)
> >  {
> > -	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > -		unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > +	u64 i;
> > +
> > +	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
> > +		if (!pfn_valid(region->mreg_pfns[i]))
> > +			continue;
> > +
> > +		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > +			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
> > 
> > -	memset(region->mreg_pages + page_offset, 0,
> > -	       page_count * sizeof(struct page *));
> > +		region->mreg_pfns[i] = MSHV_INVALID_PFN;
> > +	}
> >  }
> > 
> >  void mshv_region_invalidate(struct mshv_mem_region *region)
> >  {
> > -	mshv_region_invalidate_pages(region, 0, region->nr_pages);
> > +	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
> >  }
> > 
> >  int mshv_region_pin(struct mshv_mem_region *region)
> >  {
> > -	u64 done_count, nr_pages;
> > +	u64 done_count, nr_pfns, i;
> > +	unsigned long *pfns;
> >  	struct page **pages;
> >  	__u64 userspace_addr;
> >  	int ret;
> > 
> > -	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> > -		pages = region->mreg_pages + done_count;
> > +	pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
> > +			      sizeof(struct page *), GFP_KERNEL);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +
> > +	for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
> > +		pfns = region->mreg_pfns + done_count;
> >  		userspace_addr = region->start_uaddr +
> >  				 done_count * HV_HYP_PAGE_SIZE;
> > -		nr_pages = min(region->nr_pages - done_count,
> > -			       MSHV_PIN_PAGES_BATCH_SIZE);
> > +		nr_pfns = min(region->nr_pfns - done_count,
> > +			      MSHV_PIN_PAGES_BATCH_SIZE);
> > 
> >  		/*
> >  		 * Pinning assuming 4k pages works for large pages too.
> > @@ -311,39 +331,44 @@ int mshv_region_pin(struct mshv_mem_region *region)
> >  		 * with the FOLL_LONGTERM flag does a large temporary
> >  		 * allocation of contiguous memory.
> >  		 */
> > -		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > +		ret = pin_user_pages_fast(userspace_addr, nr_pfns,
> >  					  FOLL_WRITE | FOLL_LONGTERM,
> >  					  pages);
> > -		if (ret != nr_pages)
> > +		if (ret != nr_pfns)
> >  			goto release_pages;
> > +
> > +		for (i = 0; i < ret; i++)
> > +			pfns[i] = page_to_pfn(pages[i]);
> >  	}
> > 
> > +	kfree(pages);
> >  	return 0;
> > 
> >  release_pages:
> >  	if (ret > 0)
> >  		done_count += ret;
> > -	mshv_region_invalidate_pages(region, 0, done_count);
> > +	mshv_region_invalidate_pfns(region, 0, done_count);
> > +	kfree(pages);
> >  	return ret < 0 ? ret : -ENOMEM;
> >  }
> > 
> >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> >  				   u32 flags,
> > -				   u64 page_offset, u64 page_count,
> > +				   u64 pfn_offset, u64 pfn_count,
> >  				   bool huge_page)
> >  {
> >  	if (huge_page)
> >  		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > 
> > -	return hv_call_unmap_gpa_pages(region->partition->pt_id,
> > -				       region->start_gfn + page_offset,
> > -				       page_count, flags);
> > +	return hv_call_unmap_pfns(region->partition->pt_id,
> > +				  region->start_gfn + pfn_offset,
> > +				  pfn_count, flags);
> >  }
> > 
> >  static int mshv_region_unmap(struct mshv_mem_region *region)
> >  {
> >  	return mshv_region_process_range(region, 0,
> > -					 0, region->nr_pages,
> > +					 0, region->nr_pfns,
> >  					 mshv_region_chunk_unmap);
> >  }
> > 
> > @@ -427,8 +452,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> >  /**
> >   * mshv_region_range_fault - Handle memory range faults for a given region.
> >   * @region: Pointer to the memory region structure.
> > - * @page_offset: Offset of the page within the region.
> > - * @page_count: Number of pages to handle.
> > + * @pfn_offset: Offset of the page within the region.
> > + * @pfn_count: Number of pages to handle.
> >   *
> >   * This function resolves memory faults for a specified range of pages
> >   * within a memory region. It uses HMM (Heterogeneous Memory Management)
> > @@ -437,7 +462,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> >   * Return: 0 on success, negative error code on failure.
> >   */
> >  static int mshv_region_range_fault(struct mshv_mem_region *region,
> > -				   u64 page_offset, u64 page_count)
> > +				   u64 pfn_offset, u64 pfn_count)
> >  {
> >  	struct hmm_range range = {
> >  		.notifier = &region->mreg_mni,
> > @@ -447,13 +472,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> >  	int ret;
> >  	u64 i;
> > 
> > -	pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
> > +	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> >  	if (!pfns)
> >  		return -ENOMEM;
> > 
> >  	range.hmm_pfns = pfns;
> > -	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> > -	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> > +	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> > +	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> > 
> >  	do {
> >  		ret = mshv_region_hmm_fault_and_lock(region, &range);
> > @@ -462,11 +487,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> >  	if (ret)
> >  		goto out;
> > 
> > -	for (i = 0; i < page_count; i++)
> > -		region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> > +	for (i = 0; i < pfn_count; i++) {
> > +		if (!(pfns[i] & HMM_PFN_VALID))
> > +			continue;
> > +		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
> > +		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
> > +	}
> > 
> > -	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > -				      page_offset, page_count);
> > +	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
> > +				     pfn_offset, pfn_count);
> > 
> >  	mutex_unlock(&region->mreg_mutex);
> >  out:
> > @@ -476,24 +505,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > 
> >  bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> >  {
> > -	u64 page_offset, page_count;
> > +	u64 pfn_offset, pfn_count;
> >  	int ret;
> > 
> >  	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> > -	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > -				 MSHV_MAP_FAULT_IN_PAGES);
> > +	pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > +				MSHV_MAP_FAULT_IN_PAGES);
> > 
> >  	/* Map more pages than requested to reduce the number of faults. */
> > -	page_count = min(region->nr_pages - page_offset,
> > -			 MSHV_MAP_FAULT_IN_PAGES);
> > +	pfn_count = min(region->nr_pfns - pfn_offset,
> > +			MSHV_MAP_FAULT_IN_PAGES);
> > 
> > -	ret = mshv_region_range_fault(region, page_offset, page_count);
> > +	ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
> > 
> >  	WARN_ONCE(ret,
> > -		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> > +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
> >  		  region->partition->pt_id, region->start_uaddr,
> > -		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > -		  gfn, page_offset, page_count);
> > +		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> > +		  gfn, pfn_offset, pfn_count);
> > 
> >  	return !ret;
> >  }
> > @@ -523,16 +552,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> >  	struct mshv_mem_region *region = container_of(mni,
> >  						      struct mshv_mem_region,
> >  						      mreg_mni);
> > -	u64 page_offset, page_count;
> > +	u64 pfn_offset, pfn_count;
> >  	unsigned long mstart, mend;
> >  	int ret = -EPERM;
> > 
> >  	mstart = max(range->start, region->start_uaddr);
> >  	mend = min(range->end, region->start_uaddr +
> > -		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> > +		   (region->nr_pfns << HV_HYP_PAGE_SHIFT));
> > 
> > -	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> > -	page_count = HVPFN_DOWN(mend - mstart);
> > +	pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> > +	pfn_count = HVPFN_DOWN(mend - mstart);
> > 
> >  	if (mmu_notifier_range_blockable(range))
> >  		mutex_lock(&region->mreg_mutex);
> > @@ -541,12 +570,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > 
> >  	mmu_interval_set_seq(mni, cur_seq);
> > 
> > -	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> > -				      page_offset, page_count);
> > +	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
> > +				     pfn_offset, pfn_count);
> >  	if (ret)
> >  		goto out_unlock;
> > 
> > -	mshv_region_invalidate_pages(region, page_offset, page_count);
> > +	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
> > 
> >  	mutex_unlock(&region->mreg_mutex);
> > 
> > @@ -558,9 +587,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> >  	WARN_ONCE(ret,
> >  		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
> >  		  region->start_uaddr,
> > -		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > +		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> >  		  range->start, range->end, range->event,
> > -		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> > +		  pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
> >  	return false;
> >  }
> > 
> > @@ -579,7 +608,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
> > 
> >  	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
> >  					   region->start_uaddr,
> > -					   region->nr_pages << HV_HYP_PAGE_SHIFT,
> > +					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
> >  					   &mshv_region_mni_ops);
> >  	if (ret)
> >  		return false;
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 947dfb76bb19..f1d4bee97a3f 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -84,15 +84,15 @@ enum mshv_region_type {
> >  struct mshv_mem_region {
> >  	struct hlist_node hnode;
> >  	struct kref mreg_refcount;
> > -	u64 nr_pages;
> > +	u64 nr_pfns;
> >  	u64 start_gfn;
> >  	u64 start_uaddr;
> >  	u32 hv_map_flags;
> >  	struct mshv_partition *partition;
> >  	enum mshv_region_type mreg_type;
> >  	struct mmu_interval_notifier mreg_mni;
> > -	struct mutex mreg_mutex;	/* protects region pages remapping */
> > -	struct page *mreg_pages[];
> > +	struct mutex mreg_mutex;	/* protects region PFNs remapping */
> > +	unsigned long mreg_pfns[];
> >  };
> > 
> >  struct mshv_irq_ack_notifier {
> > @@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
> >  int hv_call_initialize_partition(u64 partition_id);
> >  int hv_call_finalize_partition(u64 partition_id);
> >  int hv_call_delete_partition(u64 partition_id);
> > -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64
> > numpgs);
> > -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> > -			  u32 flags, struct page **pages);
> > -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> > -			    u32 flags);
> > +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
> > +int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> > +			 u32 flags, unsigned long *pfns);
> > +int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> > +		       u32 flags);
> >  int hv_call_delete_vp(u64 partition_id, u32 vp_index);
> >  int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
> >  				     u64 dest_addr,
> > @@ -329,8 +329,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
> >  int hv_unmap_stats_page(enum hv_stats_object_type type,
> >  			struct hv_stats_page *page_addr,
> >  			const union hv_stats_object_identity *identity);
> > -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> > -				   u64 page_struct_count, u32 host_access,
> > +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> > +				   u64 pfns_count, u32 host_access,
> >  				   u32 flags, u8 acquire);
> >  int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
> >  				      void *property_value, size_t property_value_sz);
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index cb55d4d4be2e..a95f2cfc5da5 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
> >  	return hv_result_to_errno(status);
> >  }
> > 
> > -/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> > -static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > -			       u32 flags, struct page **pages, u64 mmio_spa)
> > +static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
> > +			  u32 flags, unsigned long *pfns, u64 mmio_spa)
> >  {
> >  	struct hv_input_map_gpa_pages *input_page;
> >  	u64 status, *pfnlist;
> >  	unsigned long irq_flags, large_shift = 0;
> >  	int ret = 0, done = 0;
> > -	u64 page_count = page_struct_count;
> > +	u64 page_count = pfns_count;
> > 
> > -	if (page_count == 0 || (pages && mmio_spa))
> > +	if (page_count == 0 || (pfns && mmio_spa))
> >  		return -EINVAL;
> > 
> >  	if (flags & HV_MAP_GPA_LARGE_PAGE) {
> > @@ -227,14 +226,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		for (i = 0; i < rep_count; i++)
> >  			if (flags & HV_MAP_GPA_NO_ACCESS) {
> >  				pfnlist[i] = 0;
> > -			} else if (pages) {
> > +			} else if (pfns) {
> >  				u64 index = (done + i) << large_shift;
> > 
> > -				if (index >= page_struct_count) {
> > +				if (index >= pfns_count) {
> >  					ret = -EINVAL;
> >  					break;
> >  				}
> > -				pfnlist[i] = page_to_pfn(pages[index]);
> > +				pfnlist[i] = pfns[index];
> >  			} else {
> >  				pfnlist[i] = mmio_spa + done + i;
> >  			}
> > @@ -266,37 +265,37 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > 
> >  		if (flags & HV_MAP_GPA_LARGE_PAGE)
> >  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> > +		hv_call_unmap_pfns(partition_id, gfn, done, unmap_flags);
> >  	}
> > 
> >  	return ret;
> >  }
> > 
> >  /* Ask the hypervisor to map guest ram pages */
> > -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> > -			  u32 flags, struct page **pages)
> > +int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
> > +			 u32 flags, unsigned long *pfns)
> >  {
> > -	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
> > -				   flags, pages, 0);
> > +	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> > +			      pfns, 0);
> >  }
> > 
> > -/* Ask the hypervisor to map guest mmio space */
> > -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> > +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
> > +			  u64 pfn_count)
> >  {
> >  	int i;
> >  	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
> >  		    HV_MAP_GPA_NOT_CACHED;
> > 
> > -	for (i = 0; i < numpgs; i++)
> > +	for (i = 0; i < pfn_count; i++)
> >  		if (page_is_ram(mmio_spa + i))
> >  			return -EINVAL;
> > 
> > -	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> > -				   mmio_spa);
> > +	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> > +			      NULL, mmio_spa);
> >  }
> > 
> > -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> > -			    u32 flags)
> > +int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
> > +		       u32 flags)
> >  {
> >  	struct hv_input_unmap_gpa_pages *input_page;
> >  	u64 status, page_count = page_count_4k;
> > @@ -1009,15 +1008,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
> >  	return ret;
> >  }
> > 
> > -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> > -				   u64 page_struct_count, u32 host_access,
> > +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> > +				   u64 pfns_count, u32 host_access,
> >  				   u32 flags, u8 acquire)
> >  {
> >  	struct hv_input_modify_sparse_spa_page_host_access *input_page;
> >  	u64 status;
> >  	int done = 0;
> >  	unsigned long irq_flags, large_shift = 0;
> > -	u64 page_count = page_struct_count;
> > +	u64 page_count = pfns_count;
> >  	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
> >  			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
> > 
> > @@ -1051,11 +1050,10 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> >  		for (i = 0; i < rep_count; i++) {
> >  			u64 index = (done + i) << large_shift;
> > 
> > -			if (index >= page_struct_count)
> > +			if (index >= pfns_count)
> >  				return -EINVAL;
> > 
> > -			input_page->spa_page_list[i] =
> > -						page_to_pfn(pages[index]);
> > +			input_page->spa_page_list[i] = pfns[index];
> >  		}
> > 
> >  		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index f2d83d6c8c4f..685e4b562186 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > 
> >  	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
> >  		if (gfn >= region->start_gfn &&
> > -		    gfn < region->start_gfn + region->nr_pages)
> > +		    gfn < region->start_gfn + region->nr_pfns)
> >  			return region;
> >  	}
> > 
> > @@ -1221,20 +1221,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> >  					bool is_mmio)
> >  {
> >  	struct mshv_mem_region *rg;
> > -	u64 nr_pages = HVPFN_DOWN(mem->size);
> > +	u64 nr_pfns = HVPFN_DOWN(mem->size);
> > 
> >  	/* Reject overlapping regions */
> >  	spin_lock(&partition->pt_mem_regions_lock);
> >  	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> > -		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> > -		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
> > +		if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
> > +		    rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
> >  			continue;
> >  		spin_unlock(&partition->pt_mem_regions_lock);
> >  		return -EEXIST;
> >  	}
> >  	spin_unlock(&partition->pt_mem_regions_lock);
> > 
> > -	rg = mshv_region_create(mem->guest_pfn, nr_pages,
> > +	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
> >  				mem->userspace_addr, mem->flags);
> >  	if (IS_ERR(rg))
> >  		return PTR_ERR(rg);
> > @@ -1372,21 +1372,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
> >  		 * the hypervisor track dirty pages, enabling pre-copy live
> >  		 * migration.
> >  		 */
> > -		ret = hv_call_map_gpa_pages(partition->pt_id,
> > -					    region->start_gfn,
> > -					    region->nr_pages,
> > -					    HV_MAP_GPA_NO_ACCESS, NULL);
> > +		ret = hv_call_map_ram_pfns(partition->pt_id,
> > +					   region->start_gfn,
> > +					   region->nr_pfns,
> > +					   HV_MAP_GPA_NO_ACCESS, NULL);
> >  		break;
> >  	case MSHV_REGION_TYPE_MMIO:
> > -		ret = hv_call_map_mmio_pages(partition->pt_id,
> > -					     region->start_gfn,
> > -					     mmio_pfn,
> > -					     region->nr_pages);
> > +		ret = hv_call_map_mmio_pfns(partition->pt_id,
> > +					    region->start_gfn,
> > +					    mmio_pfn,
> > +					    region->nr_pfns);
> >  		break;
> >  	}
> > 
> >  	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
> > -				   region->start_gfn, region->nr_pages,
> > +				   region->start_gfn, region->nr_pfns,
> >  				   region->hv_map_flags, ret);
> > 
> >  	if (ret)
> > @@ -1424,7 +1424,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
> >  	/* Paranoia check */
> >  	if (region->start_uaddr != mem.userspace_addr ||
> >  	    region->start_gfn != mem.guest_pfn ||
> > -	    region->nr_pages != HVPFN_DOWN(mem.size)) {
> > +	    region->nr_pfns != HVPFN_DOWN(mem.size)) {
> >  		spin_unlock(&partition->pt_mem_regions_lock);
> >  		return -EINVAL;
> >  	}
> > 
> > 
> 

^ permalink raw reply

* Re: [PATCH 2/7] mshv: Add support to address range holes remapping
From: Stanislav Kinsburskii @ 2026-04-20 16:24 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D44B15BAA0F3CA8B078BD4242@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, Apr 13, 2026 at 09:08:31PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> > 
> > Consolidate memory region processing to handle both valid and invalid PFNs
> > uniformly. This eliminates code duplication across remap, unmap, share, and
> > unshare operations by using a common range processing interface.
> > 
> > Holes are now remapped with no-access permissions to enable
> > hypervisor dirty page tracking for precopy live migration.
> > 
> > This refactoring is a precursor to an upcoming change that will map
> > present pages in movable regions upon region creation, requiring
> > consistent handling of both mapped and unmapped ranges.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c |  108
> > ++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 95 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index b1a707d16c07..ed9c55841140 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -119,6 +119,57 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
> >  	return count;
> >  }
> > 
> > +/**
> > + * mshv_region_process_hole - Handle a hole (invalid PFNs) in a memory
> > + *                            region
> > + * @region    : Memory region containing the hole
> > + * @flags     : Flags to pass to the handler function
> > + * @pfn_offset: Starting PFN offset within the region
> > + * @pfn_count : Number of PFNs in the hole
> > + * @handler   : Callback function to invoke for the hole
> > + *
> > + * Invokes the handler function for a contiguous hole with the specified
> > + * parameters.
> > + *
> > + * Return: Number of PFNs handled, or negative error code.
> > + */
> > +static long mshv_region_process_hole(struct mshv_mem_region *region,
> > +				     u32 flags,
> > +				     u64 pfn_offset, u64 pfn_count,
> > +				     int (*handler)(struct mshv_mem_region *region,
> > +						    u32 flags,
> > +						    u64 pfn_offset,
> > +						    u64 pfn_count,
> > +						    bool huge_page))
> > +{
> > +	long ret;
> > +
> > +	ret = handler(region, flags, pfn_offset, pfn_count, 0);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return pfn_count;
> > +}
> > +
> > +static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > +				      u32 flags,
> > +				      u64 pfn_offset, u64 pfn_count,
> > +				      int (*handler)(struct mshv_mem_region *region,
> > +						     u32 flags,
> > +						     u64 pfn_offset,
> > +						     u64 pfn_count,
> > +						     bool huge_page))
> > +{
> > +	if (pfn_valid(region->mreg_pfns[pfn_offset]))
> > +		return mshv_region_process_pfns(region, flags,
> > +				pfn_offset, pfn_count,
> > +				handler);
> > +	else
> > +		return mshv_region_process_hole(region, flags,
> > +				pfn_offset, pfn_count,
> > +				handler);
> > +}
> > +
> >  /**
> >   * mshv_region_process_range - Processes a range of PFNs in a region.
> >   * @region    : Pointer to the memory region structure.
> > @@ -146,33 +197,47 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> >  						    u64 pfn_count,
> >  						    bool huge_page))
> >  {
> > -	u64 pfn_end;
> > +	u64 start, end;
> >  	long ret;
> > 
> > -	if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> > +	if (!pfn_count)
> > +		return 0;
> > +
> > +	if (check_add_overflow(pfn_offset, pfn_count, &end))
> >  		return -EOVERFLOW;
> > 
> > -	if (pfn_end > region->nr_pfns)
> > +	if (end > region->nr_pfns)
> >  		return -EINVAL;
> > 
> > -	while (pfn_count) {
> > -		/* Skip non-present pages */
> > -		if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> > -			pfn_offset++;
> > -			pfn_count--;
> > +	start = pfn_offset;
> > +	end = pfn_offset + 1;
> > +
> > +	while (end < pfn_offset + pfn_count) {
> > +		/*
> > +		 * Accumulate contiguous pfns with the same validity
> > +		 * (valid or not).
> > +		 */
> > +		if (pfn_valid(region->mreg_pfns[start]) ==
> > +		    pfn_valid(region->mreg_pfns[end])) {
> > +			end++;
> >  			continue;
> >  		}
> > 
> > -		ret = mshv_region_process_pfns(region, flags,
> > -					       pfn_offset, pfn_count,
> > -					       handler);
> > +		ret = mshv_region_process_chunk(region, flags,
> > +						start, end - start,
> > +						handler);
> >  		if (ret < 0)
> >  			return ret;
> > 
> > -		pfn_offset += ret;
> > -		pfn_count -= ret;
> > +		start += ret;
> >  	}
> > 
> > +	ret = mshv_region_process_chunk(region, flags,
> > +					start, end - start,
> > +					handler);
> > +	if (ret < 0)
> > +		return ret;
> > +
> >  	return 0;
> >  }
> > 
> > @@ -208,6 +273,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
> >  				   u64 pfn_offset, u64 pfn_count,
> >  				   bool huge_page)
> >  {
> > +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > +		return -EINVAL;
> > +
> >  	if (huge_page)
> >  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > 
> > @@ -233,6 +301,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> >  				     u64 pfn_offset, u64 pfn_count,
> >  				     bool huge_page)
> >  {
> > +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > +		return -EINVAL;
> > +
> >  	if (huge_page)
> >  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > 
> > @@ -256,6 +327,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> >  				   u64 pfn_offset, u64 pfn_count,
> >  				   bool huge_page)
> >  {
> > +	/*
> > +	 * Remap missing pages with no access to let the
> > +	 * hypervisor track dirty pages, enabling precopy live
> > +	 * migration.
> > +	 */
> > +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > +		flags = HV_MAP_GPA_NO_ACCESS;
> 
> Is it OK to wipe out any other flags that might be set? Certainly, any previous
> flags in PERMISSIONS_MASK should be removed, but what about ADJUSTABLE
> and NOT_CACHED?
> 

Yes, this is the right approach. The HV_MAP_GPA_NO_ACCESS flag will
immediately cause a hypervisor fault on any access to the page. So
caching and adjustability no longer matter.

Thanks,
Stanislav

> > +
> >  	if (huge_page)
> >  		flags |= HV_MAP_GPA_LARGE_PAGE;
> > 
> > @@ -357,6 +436,9 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> >  				   u64 pfn_offset, u64 pfn_count,
> >  				   bool huge_page)
> >  {
> > +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > +		return 0;
> > +
> >  	if (huge_page)
> >  		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > 
> > 
> > 
> 

^ permalink raw reply

* Re: [PATCH 3/7] mshv: Support regions with different VMAs
From: Stanislav Kinsburskii @ 2026-04-20 16:29 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157043516DEB3DC5E987AA6D4242@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, Apr 13, 2026 at 09:08:52PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> > 
> > Allow HMM fault handling across memory regions that span multiple VMAs
> > with different protection flags. The previous implementation assumed a
> > single VMA per region, which would fail when guest memory crosses VMA
> > boundaries.
> > 
> > Iterate through VMAs within the range and handle each separately with
> > appropriate protection flags, enabling more flexible memory region
> > configurations for partitions.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c |   72 +++++++++++++++++++++++++++++++++------------
> >  1 file changed, 52 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index ed9c55841140..1bb1bfe177e2 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -492,37 +492,72 @@ int mshv_region_get(struct mshv_mem_region *region)
> >  }
> > 
> >  /**
> > - * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> > + * mshv_region_hmm_fault_and_lock - Handle HMM faults across VMAs and lock
> > + *                                  the memory region
> >   * @region: Pointer to the memory region structure
> > - * @range: Pointer to the HMM range structure
> > + * @start : Starting virtual address of the range to fault
> > + * @end   : Ending virtual address of the range to fault (exclusive)
> > + * @pfns  : Output array for page frame numbers with HMM flags
> >   *
> >   * This function performs the following steps:
> >   * 1. Reads the notifier sequence for the HMM range.
> >   * 2. Acquires a read lock on the memory map.
> > - * 3. Handles HMM faults for the specified range.
> > - * 4. Releases the read lock on the memory map.
> > - * 5. If successful, locks the memory region mutex.
> > - * 6. Verifies if the notifier sequence has changed during the operation.
> > - *    If it has, releases the mutex and returns -EBUSY to match with
> > - *    hmm_range_fault() return code for repeating.
> > + * 3. Iterates through VMAs in the specified range, handling each
> > + *    separately with appropriate protection flags (HMM_PFN_REQ_WRITE set
> > + *    based on VMA flags).
> > + * 4. Handles HMM faults for each VMA segment.
> > + * 5. Releases the read lock on the memory map.
> > + * 6. If successful, locks the memory region mutex.
> > + * 7. Verifies if the notifier sequence has changed during the operation.
> > + *    If it has, releases the mutex and returns -EBUSY to signal retry.
> > + *
> > + * The function expects the range [start, end] is backed by valid VMAs.
> 
> Use "[start, end)" to describe the range since end is exclusive.
> 

Will do

> > + * Returns -EFAULT if any address in the range is not covered by a VMA.
> >   *
> >   * Return: 0 on success, a negative error code otherwise.
> >   */
> >  static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > -					  struct hmm_range *range)
> > +					  unsigned long start,
> > +					  unsigned long end,
> > +					  unsigned long *pfns)
> >  {
> > +	struct hmm_range range = {
> > +		.notifier = &region->mreg_mni,
> > +	};
> >  	int ret;
> > 
> > -	range->notifier_seq = mmu_interval_read_begin(range->notifier);
> > +	range.notifier_seq = mmu_interval_read_begin(range.notifier);
> >  	mmap_read_lock(region->mreg_mni.mm);
> > -	ret = hmm_range_fault(range);
> > +	while (start < end) {
> > +		struct vm_area_struct *vma;
> > +
> > +		vma = vma_lookup(current->mm, start);
> 
> The mmap_read_lock() was obtained on region->mreg_mni.mm, but the
> lookup is done against current->mm. Maybe these are the same, but
> it looks wrong.  (Pointed out by a Co-Pilot AI review.)
> 

Yes, they arethe same, but I'll update to use the same mm for clarity.

> > +		if (!vma) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		range.hmm_pfns = pfns;
> > +		range.start = start;
> > +		range.end = min(vma->vm_end, end);
> > +		range.default_flags = HMM_PFN_REQ_FAULT;
> > +		if (vma->vm_flags & VM_WRITE)
> > +			range.default_flags |= HMM_PFN_REQ_WRITE;
> > +
> > +		ret = hmm_range_fault(&range);
> > +		if (ret)
> > +			break;
> > +
> > +		start = range.end + 1;
> 
> Since range.end is exclusive, the +1 should not be done.
> 

Is it always? I'll need to check to make sure the end passed to this
function is page aligned. If it is, then I'll remove the +1.

> > +		pfns += DIV_ROUND_UP(range.end - range.start, PAGE_SIZE);
> 
> Just to confirm, range.end and range.start should always be page aligned,
> right? So the ROUND_UP should never kick in.
> 

Same as above: if the end passed to this function is page aligned, then
I'll remove the DIV_ROUND_UP and just do a simple division.

Thanks,
Stanislav

> > +	}
> >  	mmap_read_unlock(region->mreg_mni.mm);
> >  	if (ret)
> >  		return ret;
> > 
> >  	mutex_lock(&region->mreg_mutex);
> > 
> > -	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> > +	if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
> >  		mutex_unlock(&region->mreg_mutex);
> >  		cond_resched();
> >  		return -EBUSY;
> > @@ -546,10 +581,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> >  static int mshv_region_range_fault(struct mshv_mem_region *region,
> >  				   u64 pfn_offset, u64 pfn_count)
> >  {
> > -	struct hmm_range range = {
> > -		.notifier = &region->mreg_mni,
> > -		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> > -	};
> > +	unsigned long start, end;
> >  	unsigned long *pfns;
> >  	int ret;
> >  	u64 i;
> > @@ -558,12 +590,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> >  	if (!pfns)
> >  		return -ENOMEM;
> > 
> > -	range.hmm_pfns = pfns;
> > -	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> > -	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> > +	start = region->start_uaddr + pfn_offset * PAGE_SIZE;
> > +	end = start + pfn_count * PAGE_SIZE;
> > 
> >  	do {
> > -		ret = mshv_region_hmm_fault_and_lock(region, &range);
> > +		ret = mshv_region_hmm_fault_and_lock(region, start, end,
> > +						     pfns);
> >  	} while (ret == -EBUSY);
> > 
> >  	if (ret)
> > 
> > 
> 

^ permalink raw reply

* Re: [PATCH 5/7] mshv: Map populated pages on movable region creation
From: Stanislav Kinsburskii @ 2026-04-20 16:35 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415725768DFBF9502CE942DDD4242@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, Apr 13, 2026 at 09:09:08PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:05 PM
> > 
> > Map any populated pages into the hypervisor upfront when creating a
> > movable region, rather than waiting for faults. Previously, movable
> > regions were created with all pages marked as HV_MAP_GPA_NO_ACCESS
> > regardless of whether the userspace mapping contained populated pages.
> > 
> > This guarantees that if the caller passes a populated mapping, those
> > present pages will be mapped into the hypervisor immediately during
> > region creation instead of being faulted in later.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c   |   65 ++++++++++++++++++++++++++++++++-----------
> >  drivers/hv/mshv_root.h      |    1 +
> >  drivers/hv/mshv_root_main.c |   10 +------
> >  3 files changed, 50 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index 133ec7771812..28d3f488d89f 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -519,7 +519,8 @@ int mshv_region_get(struct mshv_mem_region *region)
> >  static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> >  					  unsigned long start,
> >  					  unsigned long end,
> > -					  unsigned long *pfns)
> > +					  unsigned long *pfns,
> > +					  bool do_fault)
> >  {
> >  	struct hmm_range range = {
> >  		.notifier = &region->mreg_mni,
> > @@ -540,9 +541,12 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> >  		range.hmm_pfns = pfns;
> >  		range.start = start;
> >  		range.end = min(vma->vm_end, end);
> > -		range.default_flags = HMM_PFN_REQ_FAULT;
> > -		if (vma->vm_flags & VM_WRITE)
> > -			range.default_flags |= HMM_PFN_REQ_WRITE;
> > +		range.default_flags = 0;
> > +		if (do_fault) {
> > +			range.default_flags = HMM_PFN_REQ_FAULT;
> > +			if (vma->vm_flags & VM_WRITE)
> > +				range.default_flags |= HMM_PFN_REQ_WRITE;
> > +		}
> > 
> >  		ret = hmm_range_fault(&range);
> >  		if (ret)
> > @@ -567,26 +571,40 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> >  }
> > 
> >  /**
> > - * mshv_region_range_fault - Handle memory range faults for a given region.
> > - * @region: Pointer to the memory region structure.
> > - * @pfn_offset: Offset of the page within the region.
> > - * @pfn_count: Number of pages to handle.
> > + * mshv_region_collect_and_map - Collect PFNs for a user range and map them
> > + * @region    : memory region being processed
> > + * @pfn_offset: PFNs offset within the region
> > + * @pfn_count : number of PFNs to process
> > + * @do_fault  : if true, fault in missing pages;
> > + *              if false, collect only present pages
> >   *
> > - * This function resolves memory faults for a specified range of pages
> > - * within a memory region. It uses HMM (Heterogeneous Memory Management)
> > - * to fault in the required pages and updates the region's page array.
> > + * Collects PFNs for the specified portion of @region from the
> > + * corresponding userspace VMA and maps them into the hypervisor. The
> 
> Actually, this should be "userspace VMAs" (i.e., plural)
> 

Will change.

> > + * behavior depends on @do_fault:
> >   *
> > - * Return: 0 on success, negative error code on failure.
> > + * - true: Fault in missing pages from userspace, ensuring all pages in the
> > + *   range are present. Used for on-demand page population.
> > + * - false: Collect PFNs only for pages already present in userspace,
> > + *   leaving missing pages as invalid PFN markers.
> > + *   Used for initial region setup.
> > + *
> > + * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
> > + * flags cleared, then the range is mapped into the hypervisor. Present
> > + * PFNs get mapped with region access permissions; missing PFNs (zero
> > + * entries) get mapped with no-access permissions.
> 
> Hmmm. The missing PFNs are just skipped and the mreg_pfns[] array
> is not updated. Is the corresponding entry in mreg_pfns[] known to
> already be set to MSHV_INVALID_PFN? When mapping a new movable
> region, that appears to be so. I'm less sure about the 
> mshv_region_range_fault() case, though mshv_region_invalidate_pfns()
> does such initialization of any entries that are invalidated. At that point
> in the code, I'd add a comment about that assumption, as it took me a
> bit to figure it out.
> 

This logic is called for movable regions only.
Should this be mentioned in the comment from your POV?

> So does the comment about "zero entries" refer to what is returned
> by hmm_range_fault() via mshv_region_hmm_fault_and_lock()?
> The mention of "zero entries" here is a bit confusing.
> 

"Zero entries" should be changed to invalid PFN markers, which are
defined as MSHV_INVALID_PFN. I'll update the comment to clarify that.

Thanks,
Stanislav

> > + *
> > + * Return: 0 on success, negative errno on failure.
> >   */
> > -static int mshv_region_range_fault(struct mshv_mem_region *region,
> > -				   u64 pfn_offset, u64 pfn_count)
> > +static int mshv_region_collect_and_map(struct mshv_mem_region *region,
> > +				       u64 pfn_offset, u64 pfn_count,
> > +				       bool do_fault)
> >  {
> >  	unsigned long start, end;
> >  	unsigned long *pfns;
> >  	int ret;
> >  	u64 i;
> > 
> > -	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> > +	pfns = vmalloc_array(pfn_count, sizeof(unsigned long));
> >  	if (!pfns)
> >  		return -ENOMEM;
> > 
> > @@ -595,7 +613,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > 
> >  	do {
> >  		ret = mshv_region_hmm_fault_and_lock(region, start, end,
> > -						     pfns);
> > +						     pfns, do_fault);
> >  	} while (ret == -EBUSY);
> > 
> >  	if (ret)
> > @@ -613,10 +631,17 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > 
> >  	mutex_unlock(&region->mreg_mutex);
> >  out:
> > -	kfree(pfns);
> > +	vfree(pfns);
> >  	return ret;
> >  }
> > 
> > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > +				   u64 pfn_offset, u64 pfn_count)
> > +{
> > +	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
> > +					   true);
> > +}
> > +
> >  bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> >  {
> >  	u64 pfn_offset, pfn_count;
> > @@ -800,3 +825,9 @@ int mshv_map_pinned_region(struct mshv_mem_region
> > *region)
> >  err_out:
> >  	return ret;
> >  }
> > +
> > +int mshv_map_movable_region(struct mshv_mem_region *region)
> > +{
> > +	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
> > +					   false);
> > +}
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index d2e65a137bf4..02c1c11f701c 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -374,5 +374,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region
> > *region, u64 gfn);
> >  void mshv_region_movable_fini(struct mshv_mem_region *region);
> >  bool mshv_region_movable_init(struct mshv_mem_region *region);
> >  int mshv_map_pinned_region(struct mshv_mem_region *region);
> > +int mshv_map_movable_region(struct mshv_mem_region *region);
> > 
> >  #endif /* _MSHV_ROOT_H_ */
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index c393b5144e0b..91dab2a3bc92 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1299,15 +1299,7 @@ mshv_map_user_memory(struct mshv_partition
> > *partition,
> >  		ret = mshv_map_pinned_region(region);
> >  		break;
> >  	case MSHV_REGION_TYPE_MEM_MOVABLE:
> > -		/*
> > -		 * For movable memory regions, remap with no access to let
> > -		 * the hypervisor track dirty pages, enabling pre-copy live
> > -		 * migration.
> > -		 */
> > -		ret = hv_call_map_ram_pfns(partition->pt_id,
> > -					   region->start_gfn,
> > -					   region->nr_pfns,
> > -					   HV_MAP_GPA_NO_ACCESS, NULL);
> > +		ret = mshv_map_movable_region(region);
> >  		break;
> >  	case MSHV_REGION_TYPE_MMIO:
> >  		ret = hv_call_map_mmio_pfns(partition->pt_id,
> > 
> > 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox