Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH V3 02/11] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: sashiko-bot @ 2026-05-13  3:26 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-3-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Resource leak in `hv_msi_domain_free()` when freeing multiple IRQs.
- [High] Uninitialized memory write to hardware MSI registers in `hv_irq_compose_msi_msg()` error paths.
- [High] Invalid hardware state read in `hv_build_devid_type_pci()` for Type 0 PCI-X leaf devices.
--

commit 8c84576bf69ec6b4d58407408095bf71a9940bf2
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Cosmetic changes in irqdomain.c for readability

Make cosmetic changes such as renaming variables for consistency, fixing
checkpatch warnings, updating indentation, and replacing spaces with tabs.

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
[ ... ]
> @@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
> +static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
[ ... ]
> -	if (data.bridge) {
> -		int pos;
> +	if (data.bridge == NULL)
> +		goto out;
>  
> -		/*
> -		 * Microsoft Hypervisor requires a bus range when the bridge is
> -		 * running in PCI-X mode.
[ ... ]
> -		pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> -		if (pos) {
> -			u16 status;
> +	pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> +	if (pos) {
> +		u16 status;

If the leaf device is on the root bus, data.bridge will point to the leaf
device itself (a Type 0 device) after pci_for_each_dma_alias() completes.
Does this code incorrectly assume data.bridge is always a Type 1 bridge?

If it evaluates a Type 0 device here, reading PCI_X_BRIDGE_SSTATUS aliases
the Type 0 Command register. This could cause the code to read standard
Type 1 PCI_SECONDARY_BUS and PCI_SUBORDINATE_BUS offsets (which fall into
the middle of BAR2 for a Type 0 device), sending arbitrary BAR values as
shadow bus ranges to the hypervisor.

[ ... ]
> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	}
>  
>  	stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
> -	if (!stored_entry) {
> -		pr_debug("%s: failed to allocate chip data\n", __func__);
> +	if (!stored_entry)
>  		return;
> -	}

If kzalloc_obj() or hv_map_msi_interrupt() fails, this returns early
without populating the struct msi_msg *msg parameter.

Since callers (like msi_domain_activate() or irq_chip_compose_msi_msg())
allocate msg on the stack and cannot check for errors due to the void
return type, does this result in uninitialized stack memory being blindly
written to the hardware MSI registers?

[ ... ]
> @@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
>  		return ret;
>  
>  	for (int i = 0; i < nr_irqs; ++i) {
> -		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
> -				    handle_edge_irq, NULL, "edge");
> +		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
> +				    NULL, handle_edge_irq, NULL, "edge");
>  	}
> +
>  	return 0;
>  }
>  
> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> +			       unsigned int nr_irqs)
>  {
>  	for (int i = 0; i < nr_irqs; ++i) {
>  		struct irq_data *irqd = irq_domain_get_irq_data(d, virq);

Does this loop leak chip_data and leave IRQs mapped in the hypervisor for
the remaining nr_irqs - 1 IRQs?

It appears to call irq_domain_get_irq_data() using the base virq on every
iteration instead of virq + i, causing hv_teardown_msi_irq() to be
repeatedly invoked on the first IRQ only.

> @@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
>  
>  		hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
>  	}
> +
>  	irq_domain_free_irqs_top(d, virq, nr_irqs);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=2

^ permalink raw reply

* Re: [PATCH V3 03/11] mshv: Provide a way to get partition ID if running in a VMM process
From: sashiko-bot @ 2026-05-13  3:47 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-4-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Tying VM partition IDs to process TGIDs (`current->tgid`) is an architectural flaw causing isolation breakage and security risks.
- [High] `mshv_current_partid()` returns a raw partition ID without taking a reference, creating a TOCTOU vulnerability.
- [High] Pre-existing Use-After-Free in `destroy_partition()` due to RCU-001 violation during VP teardown.
--

commit 528e38745b2ae16c0fd8ad4876d893d1b1fea027
Author: Mukesh R <mrathor@linux.microsoft.com>

mshv: Provide a way to get partition ID if running in a VMM process

This commit introduces mshv_current_partid() to allow retrieving the
partition id for the current vmm process. It adds a pt_vmm_tgid field
to the mshv_partition struct and sets it during partition creation.
The partition id is found by iterating the global hash table.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..02c107458be90 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1908,6 +1908,27 @@ mshv_partition_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> +/* Given a process tgid, return partition id if it is a VMM process */
> +u64 mshv_current_partid(void)
> +{
> +	struct mshv_partition *pt;
> +	int i;
> +	u64 ret_ptid = HV_PARTITION_ID_INVALID;
> +
> +	rcu_read_lock();
> +
> +	hash_for_each_rcu(mshv_root.pt_htable, i, pt, pt_hnode) {
> +		if (pt->pt_vmm_tgid == current->tgid) {

Is tying the partition id to the process tgid safe?

Since a single vmm process can create multiple partitions, this loop returns
the first partition matching the tgid. This could result in pci passthrough
operations being misrouted to the wrong partition if a process manages
multiple virtual machines.

Also, since the partition's lifecycle is tied to an anonymous file descriptor,
passing the fd to another process and exiting would break
mshv_current_partid() for the new process.

Additionally, if the operating system recycles the old tgid, an unrelated
process might gain access to the partition id.

Should explicit file descriptor tracking be used instead of global process
ids?

> +			ret_ptid = pt->pt_id;
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	return ret_ptid;

Does returning a raw partition id without taking a reference create a race
condition?

After rcu_read_unlock() is called and the pt_id is returned, a concurrent
thread in the vmm could close the partition's file descriptor, causing the
partition to be deleted in the hypervisor.

If the hypervisor recycles the pt_id for a newly created partition belonging
to a completely different process, the pending iommu attachment or pci
passthrough hypercall will use the stale pt_id.

Can this allow the vmm to inadvertently attach its devices to an unrelated
partition?

> +}

This isn't a regression in this patch, but looking at the partition lifecycle
code, does destroy_partition() have a use-after-free?

destroy_partition() frees the virtual processor structures (kfree(vp))
before calling remove_partition(partition), which removes the partition
from the rcu-protected global hash table (pt_htable) and calls
synchronize_rcu().

Because the partition remains visible to rcu readers during the teardown loop,
can an interrupt handler (e.g., in mshv_synic.c) call mshv_partition_find(),
successfully retrieve the partition, and read
vp = partition->pt_vp_array[vp_index]?

If this happens concurrently with or immediately after kfree(vp), the
interrupt handler would access freed memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=3

^ permalink raw reply

* Re: [PATCH net-next v8] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-05-13  4:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, dipayanroy,
	shirazsaleem, yury.norov, kees, linux-hyperv, netdev,
	linux-kernel, linux-rdma
In-Reply-To: <20260508220114.3342627-1-kuba@kernel.org>

On Fri, May 08, 2026 at 03:01:14PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Expose hardware diagnostic info via debugfs
> 
> Add debugfs entries to expose hardware configuration and diagnostic
> information that aids in debugging driver initialization and runtime
> operations without adding noise to dmesg.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> [ ... ]
> > @@ -2231,13 +2239,18 @@ int mana_gd_resume(struct pci_dev *pdev)
> >  
> >  	err = mana_probe(&gc->mana, true);
> >  	if (err)
> > -		return err;
> > +		goto cleanup_gd;
> >  
> >  	err = mana_rdma_probe(&gc->mana_ib);
> >  	if (err)
> > -		return err;
> > +		goto cleanup_mana;
> >  
> >  	return 0;
> > +cleanup_mana:
> > +	mana_remove(&gc->mana, true);
> > +cleanup_gd:
> > +	mana_gd_cleanup_device(pdev);
> > +	return err;
> >  }
> 
> Since mana_remove() called with true for suspending skips
> unregister_netdevice(), the network devices remain registered in the system.
> 
> If mana_gd_cleanup_device() is then called, which destroys the hardware
> channel and clears gc->hwc.driver_data, could this lead to a NULL pointer
> dereference if userspace later interacts with the still-registered devices?

Sorry for delayed response.

This is a valid concern. I will update it to mana_remove(&gc->mana, false) in
the next version.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [ ... ]
> > @@ -3161,6 +3169,23 @@ static int mana_init_port(struct net_device *ndev)
> >  	eth_hw_addr_set(ndev, apc->mac_addr);
> >  	sprintf(vport, "vport%d", port_idx);
> >  	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
> > +
> > +	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
> > +			   &apc->port_handle);
> 
> When the device suspends, mana_gd_suspend() calls mana_gd_cleanup_device(),
> which recursively removes the entire debugfs directory tree.
> 
> During resume, the parent debugfs directory is recreated, but mana_probe()
> skips calling mana_init_port() for existing ports.
> 
> Does this mean the per-vPort debugfs entries are permanently lost after a
> suspend and resume cycle?
> 
This is false positive.
The per-vPort debugfs entries are recreated during resume. The flow is:

Suspend: mana_gd_cleanup_device() removes the entire debugfs tree.

Resume: mana_gd_setup() recreates the top-level mana_pci_debugfs
directory. Then mana_probe(gd, true) takes the resuming=true path,
which calls mana_attach() for each port. mana_attach() calls
mana_init_port(), which recreates the vportN directory and all
per-vPort debugfs files (port_handle, max_sq, max_rq, steer_*,
current_speed).

So the per-vPort entries are not permanently lost, they survive the
suspend/resume cycle.
> [ ... ]
> > @@ -3800,6 +3830,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> >  
> >  	mana_gd_deregister_device(gd);
> >  
> > +	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
> > +	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
> > +
> >  	if (suspending)
> >  		return;
> 
> Earlier in mana_remove(), free_netdev() is called, which frees the port
> context structure (apc) allocated via netdev_priv(ndev).
> 
> Since the per-vPort debugfs files created in mana_init_port() directly
> reference memory within apc, and apc->mana_port_debugfs is not explicitly
> removed here, does this leave the debugfs files accessible but pointing to
> freed memory until the parent directory is finally removed in
> mana_gd_cleanup_device()?

This is a false positive.
The per-vPort debugfs files are removed before free_netdev() frees apc.

In mana_remove(gd, false), the sequence for each port is:

mana_detach(ndev, false) -> mana_cleanup_port_context(apc)
-> debugfs_remove(apc->mana_port_debugfs).
This removes the entire vportN directory and all its child files, since
debugfs_remove() is recursive (simple_recursive_removal()).
free_netdev(ndev) which frees apc.
Because step 1 removes all debugfs files referencing apc fields before
step 2 frees the memory, there is no window where the files point to
freed memory.

^ permalink raw reply

* Re: [PATCH V3 05/11] mshv: Implement mshv bridge device for VFIO
From: sashiko-bot @ 2026-05-13  5:09 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-6-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Unprotected access to `pt_devices` list and missing lockdep assertion.
- [Low] The `mshv_vfio_device_ops` structure is missing the `const` qualifier.
--

commit 96ab470ecc51b095ed8b1538e4abdf37d030b039
Author: Mukesh R <mrathor@linux.microsoft.com>
Subject: mshv: Implement mshv bridge device for VFIO

This commit introduces a new VFIO-MSHV bridge pseudo device file, adapted
from KVM's VFIO bridge. It provides the necessary device operations to link
VFIO files with an MSHV partition.

> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
> new file mode 100644
> index 0000000000000..00a97920e25ba
> --- /dev/null
> +++ b/drivers/hv/mshv_vfio.c
[ ... ]
> +static long mshv_vfio_create_device(struct mshv_device *mshvdev)
> +{
> +	struct mshv_device *tmp;
> +	struct mshv_vfio *mshv_vfio;
> +
> +	/* Only one VFIO "device" per VM */
> +	hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
> +			     device_ptnode)
> +		if (tmp->device_ops == &mshv_vfio_device_ops)
> +			return -EBUSY;

Are we missing a lockdep assertion here to ensure the partition lock is held?

Since this iterates over the pt_devices list to enforce a singleton device,
could concurrent ioctl calls bypass this check and corrupt the hlist if the
caller doesn't acquire the lock?

[ ... ]

> +struct mshv_device_ops mshv_vfio_device_ops = {
> +	.device_name = "mshv-vfio",
> +	.device_create = mshv_vfio_create_device,
> +	.device_release = mshv_vfio_release_device,
> +	.device_set_attr = mshv_vfio_set_attr,
> +	.device_has_attr = mshv_vfio_has_attr,
> +};

Should mshv_vfio_device_ops be declared as const?

The structure consists entirely of function pointers and is never modified
after initialization. Making it const would place it in read-only memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=5

^ permalink raw reply

* Re: [PATCH V3 06/11] mshv: Add ioctl support for MSHV-VFIO bridge device
From: sashiko-bot @ 2026-05-13  5:27 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-7-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] FD leak and potential deadlock on copy_to_user() failure after anon_inode_getfd().
- [Medium] Asymmetric memory lifecycle and inconsistent device_release handling.
- [Medium] Missing validation of unknown flags in mshv_partition_ioctl_create_device().
--

commit a0a0db24ca9d194302f458a7afa8a424cc357990
Author: Mukesh R <mrathor@linux.microsoft.com>

mshv: Add ioctl support for MSHV-VFIO bridge device

This commit adds an ioctl for creating MSHV devices for a partition.
Currently only VFIO device types are supported, doing setup for the bridge
code in mshv_vfio.c.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 02c107458be90..6ceb5f608589f 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c

[ ... ]

> @@ -1625,6 +1625,119 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)

[ ... ]

> +static int mshv_device_fop_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_device *mshv_dev = filp->private_data;
> +	struct mshv_partition *partition = mshv_dev->device_pt;
> +
> +	if (mshv_dev->device_ops->device_release) {
> +		mutex_lock(&partition->pt_mutex);
> +		hlist_del(&mshv_dev->device_ptnode);
> +		mshv_dev->device_ops->device_release(mshv_dev);
> +		mutex_unlock(&partition->pt_mutex);
> +	}

What happens if a future device type does not implement the device_release
callback? It looks like the device would remain indefinitely linked in 
partition->pt_devices while the file descriptor is closed, leading to
list corruption and a memory leak.

Since the device structure is allocated by the core module via kzalloc(),
would it be safer to handle the list removal and generic kfree()
unconditionally in this function?

> +
> +	mshv_partition_put(partition);
> +	return 0;
> +}
> +
> +static const struct file_operations mshv_device_fops = {
> +	.owner = THIS_MODULE,
> +	.unlocked_ioctl = mshv_device_fop_ioctl,
> +	.release = mshv_device_fop_release,
> +};
> +
> +static long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> +					       void __user *uarg)
> +{
> +	long rc;
> +	struct mshv_create_device devargk;
> +	struct mshv_device *mshv_dev;
> +	const struct mshv_device_ops *vfio_ops;
> +
> +	if (copy_from_user(&devargk, uarg, sizeof(devargk)))
> +		return -EFAULT;
> +
> +	/* At present, only VFIO is supported */
> +	if (devargk.type != MSHV_DEV_TYPE_VFIO)
> +		return -ENODEV;
> +
> +	if (devargk.flags & MSHV_CREATE_DEVICE_TEST)
> +		return 0;

Should we explicitly validate that no undefined flags are set here? 
If a future kernel version introduces new flags for device creation, older 
kernels will silently ignore them and return success instead of -EINVAL, 
breaking forward compatibility.

> +
> +	/* This is freed later by mshv_vfio_release_device() */
> +	mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
> +	if (mshv_dev == NULL)
> +		return -ENOMEM;
> +
> +	vfio_ops = &mshv_vfio_device_ops;
> +	mshv_dev->device_ops = vfio_ops;
> +	mshv_dev->device_pt = partition;
> +
> +	rc = vfio_ops->device_create(mshv_dev);
> +	if (rc < 0) {
> +		kfree(mshv_dev);
> +		return rc;
> +	}
> +
> +	hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
> +
> +	mshv_partition_get(partition);
> +	rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
> +			      mshv_dev, O_RDWR | O_CLOEXEC);
> +	if (rc < 0)
> +		goto undo_out;
> +
> +	devargk.fd = rc;
> +	if (copy_to_user(uarg, &devargk, sizeof(devargk)))
> +		return -EFAULT;    /* cleanup in mshv_device_fop_release() */

Does this sequence leak the file descriptor if copy_to_user() fails?

The anon_inode_getfd() call installs the file descriptor into the process's 
fd table. If the subsequent copy_to_user() fails, the function returns 
-EFAULT, but the user space application never receives the fd number to 
explicitly close it, resulting in a leaked file descriptor.

Additionally, can this trigger a deadlock?

The ioctl is called with partition->pt_mutex held. If a malicious user 
provides a buffer backed by userfaultfd, the copy_to_user() call will 
block, holding the mutex. Concurrently, another thread could find the 
new fd in /proc/self/fd and call close(), which invokes 
mshv_device_fop_release(). That release function unconditionally attempts 
to acquire partition->pt_mutex, resulting in a deadlock.

Would it be better to separate file allocation and installation by using 
get_unused_fd_flags(), anon_inode_getfile(), and only calling fd_install() 
after copy_to_user() succeeds?

> +
> +	return 0;
> +
> +undo_out:
> +	hlist_del(&mshv_dev->device_ptnode);
> +	vfio_ops->device_release(mshv_dev);    /* will kfree(mshv_dev) */

If a future device type does not implement the device_release callback,
could this result in a NULL pointer dereference?

> +	mshv_partition_put(partition);
> +	return rc;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=6

^ permalink raw reply

* Re: [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR
From: Anirudh Rayabharam @ 2026-05-13  5:32 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865065.21765.5112039734725197893.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:10PM +0000, Stanislav Kinsburskii wrote:
> mshv_intercept_isr() reads vp_index from the SynIC intercept message
> payload and uses it directly to index into partition->pt_vp_array without
> validating that vp_index < MSHV_MAX_VPS.
> 
> Mshv treats the Microsoft Hypervisor as trusted, so a malformed vp_index is
> not a security concern; the threat model does not include a malicious
> hypervisor. A hypervisor bug that placed an out-of-range value here would,
> however, cause an out-of-bounds read of pt_vp_array in hardirq context,
> manifesting as random memory corruption or a host crash with no clear
> signal pointing back to the hypervisor as the source.
> 
> handle_bitset_message() and handle_pair_message() already perform this
> defensive check on hypervisor-supplied vp_index values, with an explicit
> "This shouldn't happen, but just in case" comment  Add the same check to
> mshv_intercept_isr() for consistency, turning a potential silent corruption
> into a debuggable pr_debug message.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index bac890cd2b468..89207aad7cf1f 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -387,6 +387,11 @@ mshv_intercept_isr(struct hv_message *msg)
>  	 */
>  	vp_index =
>  	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> +	/* This shouldn't happen, but just in case. */
> +	if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> +		pr_debug("VP index %u out of bounds\n", vp_index);
> +		goto unlock_out;
> +	}
>  	vp = partition->pt_vp_array[vp_index];
>  	if (unlikely(!vp)) {
>  		pr_debug("failed to find VP %u\n", vp_index);
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: sashiko-bot @ 2026-05-13  6:43 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Conflicting declarations and missing `#ifdef` guards for `hv_build_devid_type_pci()` cause build failures under certain Kconfig combinations.
--

commit 086da02227f217e959d3022c592e1b8698814d5b
Author: Mukesh R <mrathor@linux.microsoft.com>

PCI: hv: VMBus and PCI device IDs for PCI passthru

This commit introduces functions to build and export VMBus and PCI device
IDs for use with Hyper-V PCI passthru. It exposes hv_build_devid_type_pci()
and adds hv_pci_vmbus_device_id() to generate the device IDs expected by
the hypervisor.

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dca..8780573a43326 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
>  	int pos;
[ ... ]
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee3..2ef34001f8d31 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
[ ... ]
> @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
>  static inline void hv_root_crash_init(void) {}
>  #endif  /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
>  
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */

Will this cause build failures under certain Kconfig combinations?

If CONFIG_PCI_MSI=y and CONFIG_HYPERV_IOMMU=n, arch/x86/hyperv/irqdomain.c
provides a non-static definition of hv_build_devid_type_pci(), but this
header provides a static inline definition, which leads to a redefinition
error.

Conversely, if CONFIG_PCI_MSI=n and CONFIG_HYPERV_IOMMU=y, this header
declares the function, but it is never defined because irqdomain.c is not
compiled, resulting in linker errors.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=8

^ permalink raw reply

* Re: [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path
From: Anirudh Rayabharam @ 2026-05-13  9:57 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865606.21765.9555294064589953949.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:16PM +0000, Stanislav Kinsburskii wrote:
> mshv_partition_ioctl_create_vp() initialises a VP struct (allocations,
> mutex_init, init_waitqueue_head, page mappings) and then publishes the
> pointer into partition->pt_vp_array.  Several ISR paths read this array
> locklessly: the intercept ISR, the two scheduler ISRs, and
> mshv_try_assert_irq_fast() on the irqfd fast path.
> 
> Of these, only mshv_try_assert_irq_fast() can structurally race the
> publish.  It runs from an eventfd waker without holding pt_mutex, and
> MSHV_IRQFD does not require the target lapic_apic_id (== vp_index) to
> refer to an existing VP at registration time.  A user can therefore
> register an irqfd targeting a yet-to-be-created VP, then trigger
> mshv_try_assert_irq_fast() concurrently with MSHV_CREATE_VP for the
> same index.  On weakly-ordered architectures the reader can observe a
> non-NULL pointer in pt_vp_array before the initialising stores to the
> VP struct become visible, leading to use of partially-initialised
> fields (e.g. vp_register_page).
> 
> The other ISR readers cannot reach this race: the hypervisor will not
> generate intercept or scheduler messages for a VP that has never been
> told to run, and the user can only call MSHV_RUN_VP on the VP fd
> returned by MSHV_CREATE_VP, which by construction is returned after
> the publish.  Leave those readers as plain loads.
> 
> Use smp_store_release() in mshv_partition_ioctl_create_vp() to publish
> the pointer, and pair it with smp_load_acquire() in
> mshv_try_assert_irq_fast().  On x86 these compile to plain accesses
> under TSO; on ARM64 they emit one-instruction acquire/release barriers,
> acceptable on this fast path.
> 
> The destroy-side path (destroy_partition() clearing pt_vp_array[i] to
> NULL after kfree(vp)) has a separate ordering and lifetime concern
> that is out of scope here.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c   |    9 ++++++++-
>  drivers/hv/mshv_root_main.c |    8 +++++++-
>  2 files changed, 15 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period
From: Anirudh Rayabharam @ 2026-05-13 10:11 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816866152.21765.16203922564983237274.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:21PM +0000, Stanislav Kinsburskii wrote:
> destroy_partition() frees mshv_vp with plain kfree() while ISR readers
> walk pt_vp_array[] under rcu_read_lock().  On non-root schedulers,
> where drain_all_vps() does not run, an in-flight intercept ISR can
> observe a non-NULL pt_vp_array slot and dereference freed memory in
> kick_vp().  On the root scheduler the same race exists in a narrower
> form: drain_vp_signals() synchronises on kick_vp()'s kicked_by_hv flag
> but not on its wake_up() tail, so the wait-queue lock embedded in vp
> can still be held when destroy_partition() reaches kfree(vp).
> 
> Add struct rcu_head vp_rcu to struct mshv_vp, clear the pt_vp_array
> slot before the free, and use kfree_rcu() so the actual kfree happens
> after a grace period.  drain_all_vps() is retained because it serves a
> separate purpose (telling the hypervisor to stop signalling and
> reconciling signal counts) that kfree_rcu() does not address.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h      |    1 +
>  drivers/hv/mshv_root_main.c |    5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Anirudh Rayabharam @ 2026-05-13 11:12 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816866691.21765.15605640837157423543.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> handle_pair_message() iterates up to msg->vp_count without verifying it
> against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> from untrusted hypervisor data, a malformed message with a large value
> would cause out-of-bounds reads from the partition_ids and vp_indexes
> arrays.
> 
> handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> 64) and advances bank_contents for each one. However, the payload buffer
> only has space for 16 bank entries. A valid_bank_mask with more than 16
> bits set causes bank_contents to read beyond the message buffer.
> 
> Fix both by adding bounds validation:
> - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> - Track banks consumed and stop before exceeding buffer capacity
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 89207aad7cf1f..5d509299f14d7 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
>  static void
>  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>  {
> -	int bank_idx, vps_signaled = 0, bank_mask_size;
> +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> +			      sizeof(u64) - 2; /* subtract format + mask */

Could this be a constant in the header?

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Anirudh Rayabharam @ 2026-05-13 11:15 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <agHwhJrqEL3IewHz@skinsburskii.localdomain>

On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > fails on an unencrypted partition. The condition on the error path:
> > > 
> > >     if (ret && mshv_partition_encrypted(partition))
> > > 
> > > only handles map failures for encrypted partitions — if the partition is
> > > not encrypted and the map fails, execution falls through to 'return 0',
> > > silently ignoring the error.
> > > 
> > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > zeroes the mreg_pages array before the caller's cleanup path
> > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > mappings for partially-mapped pages.
> > > 
> > > Fix by returning immediately on success and falling through to error
> > > return on failure. For unencrypted partitions, the caller's
> > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > correct order. For encrypted partitions where re-sharing fails, zero
> > > the page array without unpinning — the pages are inaccessible to the
> > > host and must not be unpinned, but zeroing prevents
> > > mshv_region_destroy() from attempting to unpin them.
> > 
> > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > mshv_region_invalidate_pages() which just does:
> > 
> > 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > 						 u64 page_offset, u64 page_count)
> > 	{
> > 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > 
> > 		memset(region->mreg_pages + page_offset, 0,
> > 		       page_count * sizeof(struct page *));
> > 	}
> > 
> > It doesn't checked for zeroed pages before unpinning.
> > 
> > > 
> > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> > >  1 file changed, 16 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 665d565899c15..7e4252b6bc65c 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > >  			pt_err(partition,
> > >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > >  			       region->start_gfn, ret);
> > > -			goto invalidate_region;
> > > +			goto err_out;
> > >  		}
> > >  	}
> > >  
> > >  	ret = mshv_region_map(region);
> > > -	if (ret && mshv_partition_encrypted(partition)) {
> > > +	if (ret)
> > > +		goto share_region;
> > > +
> > > +	return 0;
> > > +
> > > +share_region:
> > > +	if (mshv_partition_encrypted(partition)) {
> > >  		int shrc;
> > >  
> > >  		shrc = mshv_region_share(region);
> > >  		if (!shrc)
> > > -			goto invalidate_region;
> > > +			goto err_out;
> > >  
> > >  		pt_err(partition,
> > >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> > >  		       region->start_gfn, shrc);
> > >  		/*
> > > -		 * Don't unpin if marking shared failed because pages are no
> > > -		 * longer mapped in the host, ie root, anymore.
> > > +		 * Re-sharing failed — the pages remain inaccessible to the
> > > +		 * host.  Zero the page array so that mshv_region_destroy()
> > > +		 * won't attempt to unpin them (leaking the page references
> > > +		 * is intentional; unpinning host-inaccessible pages would be
> > > +		 * unsafe).
> > >  		 */
> > 
> > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > can remove it from here altogether. Either way, should this zeroing
> > logic be added there too so as not to crash the host by unpinning pages
> > that weren't marked shared?
> > 
> 
> Indeed.
> The issue with all this code is that we are leaking state in
> mshv_region_pin if we wimply return from it: we don't know if the pages
> are pinned or unshared (or mapped) in the destruction callback.
> We should either introduce a state variable to track this or just don't
> call the generic mshv_region_put on case of region creation error.
> The latter approch make mshv_map_user_memory idempotent and thus looks
> like a better way forward.
> What do you think?

I'm not sure I follow the latter approach. Omitting mshv_region_put()
would cause a dangling reference to the mshv_region right?

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
From: Anirudh Rayabharam @ 2026-05-13 11:20 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816863447.21765.7284842709694944084.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:54PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_lookup() previously took rcu_read_lock() internally, ran
> idr_find(), released the read lock, and copied the struct contents
> into a caller-supplied buffer.  This had two problems.
> 
> 1. The struct copy ran outside the read section, racing with
>    mshv_portid_free() which does idr_remove + synchronize_rcu + kfree.
>    A copy that started just before synchronize_rcu() observed the read
>    section as already drained and was free to read freed memory while
>    the writer was kfree()'ing the entry.
> 
> 2. The only consumer, mshv_doorbell_isr(), then dispatched a callback
>    using fields of the snapshot — entirely outside any RCU read
>    section.  The callback's data argument and any field it touches
>    were therefore safe only because mshv_isr() runs from
>    sysvec_hyperv_callback, a non-threaded system vector that
>    synchronize_rcu() implicitly waits for via the hardirq quiescent-
>    state coupling.  That protection is real today but undocumented and
>    fragile: a future move of mshv_isr() to a threaded context, or a
>    future caller that registers a doorbell with a shorter-lived data
>    pointer, would silently expose a use-after-free.
> 
> Make the contract explicit instead of implicit.  mshv_portid_lookup()
> now returns a pointer to the table entry and requires the caller to
> hold rcu_read_lock for the entire lifetime of that pointer.  The
> contract is annotated with __must_hold(RCU) so sparse flags any
> direct caller that forgets it.  The sole caller, mshv_doorbell_isr(),
> takes rcu_read_lock around the whole drain loop, so the lookup, the
> field reads, and the doorbell_cb dispatch all run inside one
> read-side critical section.  synchronize_rcu() in mshv_portid_free()
> now genuinely waits for any in-flight callback before kfree() runs,
> without relying on hardirq context for correctness.
> 
> This also drops the by-value struct copy: entries are publish-once
> (populated before idr_alloc) and free-once (after synchronize_rcu),
> so a pointer dereferenced inside the read section gives a stable
> view of the contents without copying.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |   22 +++++++---------------
>  drivers/hv/mshv_root.h         |    2 +-
>  drivers/hv/mshv_synic.c        |   15 +++++++++------
>  3 files changed, 17 insertions(+), 22 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free
From: Anirudh Rayabharam @ 2026-05-13 11:22 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816864533.21765.15114003791446487007.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:05PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_free() uses synchronize_rcu() followed by kfree() to
> reclaim port table entries. This blocks the caller until a full RCU
> grace period elapses, which is unnecessary since the same module already
> uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().
> 
> Replace with kfree_rcu() to avoid the blocking wait and keep the
> reclamation strategy consistent across the file.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
> index d87a82e399e96..42d21b92b88fd 100644
> --- a/drivers/hv/mshv_portid_table.c
> +++ b/drivers/hv/mshv_portid_table.c
> @@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
>  	WARN_ON(!info);
>  	idr_unlock(&port_table_idr);
>  
> -	synchronize_rcu();
> -	kfree(info);
> +	kfree_rcu(info, portbl_rcu);
>  }
>  
>  /*
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0
From: Anirudh Rayabharam @ 2026-05-13 11:36 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862911.21765.306085307721937662.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:49PM +0000, Stanislav Kinsburskii wrote:
> The duplicate routing entry check in mshv_update_routing_table() uses
> guest_irq_num != 0 to detect whether a GSI slot is already occupied.
> This fails for GSI 0 because its guest_irq_num is 0 both when the slot
> is unused (zero-initialized) and when legitimately assigned. As a
> result, duplicate entries for GSI 0 are silently accepted, with the
> second entry overwriting the first — corrupting the routing table
> without any error reported to userspace.
> 
> While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
> in practice, the check is semantically wrong — it conflates
> "uninitialized" with "GSI number 0." Use girq_entry_valid instead,
> which is explicitly set to true when an entry is populated and remains
> zero for unused slots regardless of the GSI number.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_irq.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
> index 59584a132ca9f..db05512db5548 100644
> --- a/drivers/hv/mshv_irq.c
> +++ b/drivers/hv/mshv_irq.c
> @@ -88,7 +88,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
>  		/*
>  		 * Allow only one to one mapping between GSI and MSI routing.
>  		 */
> -		if (girq->guest_irq_num != 0) {
> +		if (girq->girq_entry_valid) {
>  			r = -EINVAL;
>  			goto out;
>  		}
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-13 11:38 UTC (permalink / raw)
  To: sashiko-reviews, Dennis Dalessandro; +Cc: linux-hyperv
In-Reply-To: <20260513031237.4280DC2BCB0@smtp.kernel.org>

On Wed, May 13, 2026 at 03:12:36AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> - [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
> - [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
> - [High] Locking imbalance and freeing memory while locked.
> - [High] Inconsistent state and Use-After-Free on error path.
> - [Low] Uninitialized variable compiler warning for `offset_addr`.
> --

These are all pre-existing and rvt is too confusing to try to fix be
me. Dennis will have to handle them
 
> commit dac7699d9e58c179b3a8859ac64da79e5fd21f46
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Subject: IB/rdmavt: Don't abuse udata and ib_respond_udata()
> 
> This commit modifies rvt_modify_srq to use copy_to_user directly instead
> of ib_copy_to_udata for returning the offset address.
> 
> > diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
> > index fe125bf85b272..d022aa56c5bfd 100644
> > --- a/drivers/infiniband/sw/rdmavt/srq.c
> > +++ b/drivers/infiniband/sw/rdmavt/srq.c
> > @@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> >  	struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
> >  	struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
> >  	struct rvt_rq tmp_rq = {};
> > +	__u64 offset_addr;
> >  	int ret = 0;
> >  
> >  	if (attr_mask & IB_SRQ_MAX_WR) {
> 
> Could hoisting offset_addr lead to a compiler warning for being potentially
> uninitialized?
> 
> The compiler might assume udata->inlen could change concurrently across the
> spin_lock_irq() and spin_unlock_irq() calls. If the first check fails
> but the second one passes:
> 
> if (udata && udata->inlen >= sizeof(__u64))
> 
> Does this allow offset_addr to be used without initialization?
> 
> [ ... ]
> 
> > @@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> >  			return -ENOMEM;
> >  		/* Check that we can write the offset to mmap. */
> >  		if (udata && udata->inlen >= sizeof(__u64)) {
> > -			__u64 offset_addr;
> >  			__u64 offset = 0;
> >  
> >  			ret = ib_copy_from_udata(&offset_addr, udata,
> >  						 sizeof(offset_addr));
> >  			if (ret)
> >  				goto bail_free;
> > -			udata->outbuf = (void __user *)
> > -					(unsigned long)offset_addr;
> > -			ret = ib_copy_to_udata(udata, &offset,
> > -					       sizeof(offset));
> > -			if (ret)
> > +			if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
> > +					 sizeof(offset))) {
> > +				ret = -EFAULT;
> >  				goto bail_free;
> > +			}
> >  		}
> >  
> >  		spin_lock_irq(&srq->rq.kwq->c_lock);
> 
> Can this code leak the kernel queue structure?
> 
> When resizing a user-backed SRQ, the code fetches elements from the old
> user-mapped queue. In the udata path, okwq is left as NULL:
> 
> if (udata) {
>         owq = srq->rq.wq;
>         head = RDMA_READ_UAPI_ATOMIC(owq->head);
>         tail = RDMA_READ_UAPI_ATOMIC(owq->tail);
> } else {
>         okwq = srq->rq.kwq;
>         ...
> 
> When it later calls kvfree(okwq), it seems nothing is freed, which could
> permanently leak the old kwq structure.
> 
> Is there a lock imbalance here when the queue is replaced?
> 
> The original queue is locked with:
> 
> spin_lock_irq(&srq->rq.kwq->c_lock);
> 
> But then the queue pointer is replaced:
> 
> srq->rq.kwq = tmp_rq.kwq;
> 
> When the lock is released:
> 
> spin_unlock_irq(&srq->rq.kwq->c_lock);
> 
> Does this unlock the new queue instead of the old one, and then free the old
> queue via kvfree(okwq) while its lock is still held?
> 
> Can the loop migrating WQEs overflow the destination buffer?
> 
> The number of scatter-gather elements is read directly from user-shared
> memory:
> 
> wqe = rvt_get_rwqe_ptr(&srq->rq, tail);
> p->wr_id = wqe->wr_id;
> p->num_sge = wqe->num_sge;
> for (i = 0; i < wqe->num_sge; i++)
>         p->sg_list[i] = wqe->sg_list[i];
> 
> If a userspace program concurrently modifies num_sge to a large value, could
> this loop exceed the bounds of tmp_rq.kwq->curr_wq, since the destination
> size was calculated based on srq->rq.max_sge?
> 
> [ ... ]
> 
> > @@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> >  			 * See rvt_mmap() for details.
> >  			 */
> >  			if (udata && udata->inlen >= sizeof(__u64)) {
> > -				ret = ib_copy_to_udata(udata, &ip->offset,
> > -						       sizeof(ip->offset));
> > -				if (ret)
> > -					return ret;
> > +				if (copy_to_user(u64_to_user_ptr(offset_addr),
> > +						 &ip->offset,
> > +						 sizeof(ip->offset)))
> > +					return -EFAULT;
> >  			}
> >  
> 
> Does returning -EFAULT here leave the kernel and userspace in an inconsistent
> state?
> 
> At this point, the kernel has already applied the SRQ size change, replaced
> the queue, and permanently freed the old memory:
> 
> vfree(owq);
> kvfree(okwq);
> 
> If copy_to_user fails and returns -EFAULT, userspace might interpret this as
> a failed resize and continue using its existing memory mapping of the old
> queue. Could this lead to a use-after-free since the pages have already
> been freed?
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com?part=2

^ permalink raw reply

* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Anirudh Rayabharam @ 2026-05-13 12:14 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862362.21765.11809618639989414561.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> In mshv_irqfd_assign(), the level-triggered validation for resample
> irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> mshv_irqfd_update() has populated the field. Since the irqfd struct is
> zero-allocated, level_triggered is always 0 at that point, causing the
> check to always reject resample irqfds with -EINVAL. This makes
> level-triggered interrupt resampling — used to avoid interrupt storms
> with assigned devices — completely non-functional.

What bugs would this manifest as? Why haven't we seen any such bugs so
far?

Thanks,
Anirudh.

^ permalink raw reply

* Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: sashiko-bot @ 2026-05-13 12:41 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-10-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [Critical] `hv_iommu_unmap_pages` makes a rep hypercall but ignores the completion count, leaving pages permanently mapped.
- [Critical] Integer truncation in `hv_iommu_unmap_pages` causes incomplete hardware unmapping.
- [Critical] Global `unique_id` data race leads to IOMMU domain collisions and cross-VM DMA sharing.
- [High] The `sscanf` return value check in `hv_iommu_probe_device` breaks the loop on success instead of failure.
- [High] Partial unmaps corrupt the interval tree state, desynchronizing hardware and software.
- [High] Unsafe interval tree iteration leads to skipped nodes during removal.
- [High] Memory leak of interval tree nodes on domain free.
- [High] Ignoring `gfp` flags causes premature OOM and risks sleeping in atomic context.
- [Medium] Missing NULL check on `mapped` pointer in fast path.
- [Low] The `__setup` handlers violate the API by returning 0.
--

commit 4a3c258bcace1805e7afe537c6eb3e451a6fd030
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Implement Hyper-V virtual IOMMU

Add a new file to implement management of device domains, mapping and
unmapping of IOMMU memory, and other iommu_ops to fit within the VFIO
framework for PCI passthru on Hyper-V running Linux as baremetal root
or L1VH root.

> diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
> new file mode 100644
> index 0000000000000..a2e0f6cc78e66
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu-root.c

[ ... ]

> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}

Should this return 1 instead of 0? Returning 0 causes the kernel to print an
unknown command line parameters warning and passes the parameters to the init
process.

> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev;	 /* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> +	hv_no_attdev = true;
> +	return 0;
> +}

Should this also return 1 instead of 0 to indicate the parameter was handled?

[ ... ]

> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	struct hv_domain *hvdom;
> +	int rc;

[ ... ]

> +	/* Called under iommu group mutex, so single threaded */
> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_NULL)   /* ie, UINTMAX */
> +		goto out_err;

Can this increment of unique_id race if called concurrently?

iommu_domain_alloc() can be called by subsystems without holding global or
group locks. If multiple allocations race, it could assign identical domid_num
values to different domains, which might place them in the same S2 domain.

[ ... ]

> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_delete_device_domain *input;
> +
> +	if (hv_special_domain(hvdom))
> +		return;
> +
> +	if (!hv_dom_owner_is_vmm(hvdom) || hv_no_attdev) {

[ ... ]

> +	}
> +
> +	kfree(hvdom);
> +}

Does this code leak the interval tree nodes?

It looks like the remaining struct hv_iommu_mapping nodes in
hvdom->mappings_tree are not freed before kfree(hvdom).

[ ... ]

> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> +				     unsigned long iova, phys_addr_t paddr,
> +				     size_t size, u32 flags)
> +{
> +	unsigned long irqflags;
> +	struct hv_iommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);

Should this use the gfp flag passed into hv_iommu_map_pages() instead of
GFP_ATOMIC? It might avoid failing prematurely under memory pressure when the
caller can sleep.

[ ... ]

> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> +					unsigned long iova, size_t size)
> +{
> +	unsigned long flags;
> +	size_t unmapped = 0;
> +	unsigned long last = iova + size - 1;
> +	struct hv_iommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now. */
> +		if (mapping->iova.start < iova)
> +			break;

Does breaking out early here leave stale mappings in the tree?

If the target region is unmapped in hardware, the tracking node would remain,
causing stale lookups. Also, if unmapping the head of a mapping, is the
entire node deleted even though the tail is still mapped?

> +
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> +		interval_tree_remove(node, &hvdom->mappings_tree);

Can erasing the node corrupt the iteration?

Since interval_tree_remove() rebalances the tree and alters parent pointers,
calling interval_tree_iter_next(next) on the next loop iteration might skip
overlapping nodes.

[ ... ]

> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> +			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
> +			      int prot, gfp_t gfp, size_t *mapped)
> +{
> +	u32 map_flags;
> +	int ret;
> +	u64 status;
> +	unsigned long npages, done = 0;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t size = pgsize * pgcount;
> +
> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (hvdom->attached_dom) {
> +		*mapped = size;
> +		return 0;
> +	}

Is it possible for mapped to be NULL here?

The fast path assigns to *mapped without validating the pointer, while later
in the function there are defensive checks like if (mapped).

[ ... ]

> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    hv_current_partition_id,
> +						    256);

Could this trigger a sleeping in atomic context bug?

hv_call_deposit_pages() typically performs sleepable allocations, but the
caller of hv_iommu_map_pages() might have provided a non-sleeping gfp flag
which was not forwarded.

[ ... ]

> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	unsigned long flags, npages;
> +	struct hv_input_unmap_device_gpa_pages *input;
> +	u64 status;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t unmapped, size = pgsize * pgcount;

[ ... ]

> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->target_device_va_base = iova;
> +
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> +				     0, input, NULL);

Can npages be truncated here?

The hypercall parameter rep_count in hv_do_rep_hypercall() expects a u16.
If npages is >= 65536, it would be silently truncated, causing incomplete
unmapping in the hardware.

Does this also need a loop to handle partial completions?

hv_do_rep_hypercall() might only partially complete. Since the completion count
from hv_repcomp(status) is ignored, remaining pages could be left fully
mapped in the hardware.

[ ... ]

> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int rc, pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> +				    &segment, &bus, &slot, &func, &parsed);
> +			if (rc)
> +				break;

Should this check for failure instead of success?

sscanf() returns the number of successfully matched items. If it succeeds, rc
will be 4, causing the loop to break prematurely without checking the parsed
values against the device.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=9

^ permalink raw reply

* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
From: Frederic Weisbecker @ 2026-05-13 13:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Anna-Maria Behnsen,
	Ingo Molnar, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87340od7ev.ffs@tglx>

Le Tue, Apr 21, 2026 at 10:50:00AM +0200, Thomas Gleixner a écrit :
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> > +	/*
> > +	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
> > +	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
> > +	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
> > +	 * for those CPUs that have their states changed. Those CPUs should be
> > +	 * in an offline state.
> > +	 */
> > +	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
> > +		WARN_ON_ONCE(cpu_online(cpu));
> > +		ct_cpu_track_user(cpu);
> > +		cpumask_set_cpu(cpu, tick_nohz_full_mask);
> > +	}
> > +
> > +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
> > +		WARN_ON_ONCE(cpu_online(cpu));
> > +		ct_cpu_untrack_user(cpu);
> > +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> > +	}
> > +}
> 
> So this writes to tick_nohz_full_mask while other CPUs can access
> it. That's just wrong and I'm not at all interested in the resulting
> KCSAN warnings.
> 
> tick_nohz_full_mask needs to become a RCU protected pointer, which is
> updated once the new mask is established in a separately allocated one.

How about just dropping tick_nohz_full_mask that is just
 ~housekeeping_cpumask(HK_TYPE_KERNEL_NOISE) which itself is becoming RCU
protected in this patchset?

Thanks.

> 
> Thanks,
> 
>         tglx
> 
> 

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

* [PATCH v4] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Anirudh Rayabharam (Microsoft) @ 2026-05-13 13:25 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv, linux-kernel, Anirudh Rayabharam (Microsoft),
	Stanislav Kinsburskii

The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.

Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
hypercall has no 1G stride, so 1G folios are processed as a
sequence of 2M chunks. Folios whose order is less than PMD_ORDER
(e.g. mTHP) fall back to single-page stride; mapping them as 2M
would fail in the hypervisor anyway.

Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
Changes in v4:
- Changed the check to page_order < PMD_ORDER for using page stride of 1
  - Also updated the commit message.
- Pick up Acked-by:
- Link to v3: https://lore.kernel.org/r/20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com

Changes in v3:
- Fixed various corner cases reported by Sashiko.
- Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com

Changes in v2:
- Handled the case where we can have 2M aligned pages in the middle of a
  1G page
- Brought back the page order check but expanded it to include 1G
- Clamp stride to requested page count in mshv_region_process_chunk
- Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
---
 drivers/hv/mshv_regions.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6..6d65e5b42152 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -29,29 +29,27 @@
  * Uses huge page stride if the backing page is huge and the guest mapping
  * is properly aligned; otherwise falls back to single page stride.
  *
- * Return: Stride in pages, or -EINVAL if page order is unsupported.
+ * Return: Stride in pages.
  */
-static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 page_count)
+static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
+				      u64 page_count)
 {
-	unsigned int page_order;
+	unsigned int page_order = folio_order(page_folio(page));
 
 	/*
 	 * Use single page stride by default. For huge page stride, the
-	 * page must be compound and point to the head of the compound
-	 * page, and both gfn and page_count must be huge-page aligned.
+	 * folio order must be at least PMD_ORDER, the page's PFN must be
+	 * 2M-aligned (so that a 2M-aligned tail page of a larger folio is
+	 * acceptable), and both gfn and page_count must be 2M-aligned.
 	 */
-	if (!PageCompound(page) || !PageHead(page) ||
+	if (page_order < PMD_ORDER ||
+	    !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
 	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
 		return 1;
 
-	page_order = folio_order(page_folio(page));
-	/* The hypervisor only supports 2M huge page */
-	if (page_order != PMD_ORDER)
-		return -EINVAL;
-
-	return 1 << page_order;
+	/* Use 2M stride always i.e. process 1G folios as 2M chunks */
+	return 1 << PMD_ORDER;
 }
 
 /**
@@ -86,15 +84,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 	u64 gfn = region->start_gfn + page_offset;
 	u64 count;
 	struct page *page;
-	int stride, ret;
+	unsigned int stride;
+	int ret;
 
 	page = region->mreg_pages[page_offset];
 	if (!page)
 		return -EINVAL;
 
 	stride = mshv_chunk_stride(page, gfn, page_count);
-	if (stride < 0)
-		return stride;
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < page_count; count += stride) {

---
base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
change-id: 20260416-huge_1g-e44461393c8f

Best regards,
-- 
Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply related

* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Dennis Dalessandro @ 2026-05-13 13:45 UTC (permalink / raw)
  To: Jason Gunthorpe, sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260513113835.GE7655@nvidia.com>

On 5/13/26 7:38 AM, Jason Gunthorpe wrote:
> On Wed, May 13, 2026 at 03:12:36AM +0000, sashiko-bot@kernel.org wrote:
>> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
>> - [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
>> - [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
>> - [High] Locking imbalance and freeing memory while locked.
>> - [High] Inconsistent state and Use-After-Free on error path.
>> - [Low] Uninitialized variable compiler warning for `offset_addr`.
>> --
> 
> These are all pre-existing and rvt is too confusing to try to fix be
> me. Dennis will have to handle them

Will take care of it.

-Denny

^ permalink raw reply

* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Souradeep Chakrabarti @ 2026-05-13 15:08 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device ID as a parameter. This device ID refers
> to that specific device during the lifetime of passthru.
> 
> An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> device is slightly different in that it uses the hv_pcibus_device info
> for building it to make sure it matches exactly what the hypervisor
> expects. This VMBus based device ID is needed when attaching devices in
> an L1VH based guest VM. Add a function to build and export it. Before
> building it, a check is done to make sure the device is a valid VMBus
> device.
> 
> In remaining cases, PCI device ID is used. So, also make PCI device ID
> build function hv_build_devid_type_pci() public.
>
Reviewed-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c         |  9 +++++----
>  arch/x86/include/asm/mshyperv.h     |  6 ++++++
>  drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
>  include/asm-generic/mshyperv.h      | 11 +++++++++++
>  4 files changed, 46 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dc..8780573a4332 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
>  	int pos;
>  	union hv_device_id hv_devid;
> @@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>  	}
>  
>  out:
> -	return hv_devid;
> +	return hv_devid.as_uint64;
>  }
> +EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
>  
>  /*
>   * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> @@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
>  
>  	msidesc = irq_data_get_msi_desc(data);
>  	pdev = msi_desc_to_pci_dev(msidesc);
> -	hv_devid = hv_build_devid_type_pci(pdev);
> +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
>  	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>  
>  	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> @@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>  {
>  	union hv_device_id hv_devid;
>  
> -	hv_devid = hv_build_devid_type_pci(pdev);
> +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
>  	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
>  }
>  
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee..2ef34001f8d3 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
>  static inline void hv_root_crash_init(void) {}
>  #endif  /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
>  
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> +
>  #else /* CONFIG_HYPERV */
>  static inline void hyperv_init(void) {}
>  static inline void hyperv_setup_mmu_ops(void) {}
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad..50d793ca8f31 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -573,6 +573,7 @@ struct hv_pci_compl {
>  };
>  
>  static void hv_pci_onchannelcallback(void *context);
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
>  
>  #ifdef CONFIG_X86
>  #define DELIVERY_MODE		APIC_DELIVERY_MODE_FIXED
> @@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
>  static void hv_arch_irq_unmask(struct irq_data *data) { }
>  #endif /* CONFIG_ARM64 */
>  
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{
> +	struct hv_pcibus_device *hbus;
> +	struct pci_bus *pbus = pdev->bus;
> +
> +	if (!hv_vmbus_pci_device(pbus))
> +		return 0;
> +
> +	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> +
> +	return	(hbus->hdev->dev_instance.b[5] << 24) |
> +		(hbus->hdev->dev_instance.b[4] << 16) |
> +		(hbus->hdev->dev_instance.b[7] << 8) |
> +		(hbus->hdev->dev_instance.b[6] & 0xf8) |
> +		PCI_FUNC(pdev->devfn);
> +}
> +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> +
>  /**
>   * hv_pci_generic_compl() - Invoked for a completion packet
>   * @context:		Set up by the sender of the packet.
> @@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
>  	.write = hv_pcifront_write_config,
>  };
>  
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> +{
> +	return pbus->ops == &hv_pcifront_ops;
> +}
> +
>  /*
>   * Paravirtual backchannel
>   *
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index e8cbc4e3f7ad..25ac7ca0fd8b 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
>  /* Sentinel value for an uninitialized entry in hv_vp_index array */
>  #define VP_INVAL	U32_MAX
>  
> +/* Forward declarations */
> +struct pci_dev;
> +
>  int __init hv_common_init(void);
>  void __init hv_get_partition_id(void);
>  void __init hv_common_free(void);
> @@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
>  void hyperv_cleanup(void);
>  bool hv_query_ext_cap(u64 cap_query);
>  void hv_setup_dma_ops(struct device *dev, bool coherent);
> +
> +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{ return 0; }
> +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> +
>  #else /* CONFIG_HYPERV */
>  static inline void hv_identify_partition_type(void) {}
>  static inline bool hv_is_hyperv_initialized(void) { return false; }
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* RE: [PATCH v4] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Michael Kelley @ 2026-05-13 15:13 UTC (permalink / raw)
  To: Anirudh Rayabharam (Microsoft), K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Stanislav Kinsburskii
In-Reply-To: <20260513-huge_1g-v4-1-33cda59e4a70@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Sent: Wednesday, May 13, 2026 6:26 AM
> 
> The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
> chunks into 1G mappings when alignment permits, so the driver can
> support 1G hugepages by feeding them in as 2M chunks. Note that this
> is the only way to make 1G mappings; there is no way to directly map
> a 1G hugepage using the hypercall.
> 
> Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
> hypercall has no 1G stride, so 1G folios are processed as a
> sequence of 2M chunks. Folios whose order is less than PMD_ORDER
> (e.g. mTHP) fall back to single-page stride; mapping them as 2M
> would fail in the hypervisor anyway.
> 
> Assisted-by: Copilot-CLI:claude-opus-4.7
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> ---
> Changes in v4:
> - Changed the check to page_order < PMD_ORDER for using page stride of 1
>   - Also updated the commit message.
> - Pick up Acked-by:
> - Link to v3: https://lore.kernel.org/r/20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com
> 
> Changes in v3:
> - Fixed various corner cases reported by Sashiko.
> - Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com
> 
> Changes in v2:
> - Handled the case where we can have 2M aligned pages in the middle of a
>   1G page
> - Brought back the page order check but expanded it to include 1G
> - Clamp stride to requested page count in mshv_region_process_chunk
> - Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
> ---
>  drivers/hv/mshv_regions.c | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..6d65e5b42152 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -29,29 +29,27 @@
>   * Uses huge page stride if the backing page is huge and the guest mapping
>   * is properly aligned; otherwise falls back to single page stride.
>   *
> - * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + * Return: Stride in pages.
>   */
> -static int mshv_chunk_stride(struct page *page,
> -			     u64 gfn, u64 page_count)
> +static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
> +				      u64 page_count)
>  {
> -	unsigned int page_order;
> +	unsigned int page_order = folio_order(page_folio(page));
> 
>  	/*
>  	 * Use single page stride by default. For huge page stride, the
> -	 * page must be compound and point to the head of the compound
> -	 * page, and both gfn and page_count must be huge-page aligned.
> +	 * folio order must be at least PMD_ORDER, the page's PFN must be
> +	 * 2M-aligned (so that a 2M-aligned tail page of a larger folio is
> +	 * acceptable), and both gfn and page_count must be 2M-aligned.
>  	 */
> -	if (!PageCompound(page) || !PageHead(page) ||
> +	if (page_order < PMD_ORDER ||
> +	    !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
>  	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
>  	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
>  		return 1;
> 
> -	page_order = folio_order(page_folio(page));
> -	/* The hypervisor only supports 2M huge page */
> -	if (page_order != PMD_ORDER)
> -		return -EINVAL;
> -
> -	return 1 << page_order;
> +	/* Use 2M stride always i.e. process 1G folios as 2M chunks */
> +	return 1 << PMD_ORDER;
>  }
> 
>  /**
> @@ -86,15 +84,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>  	u64 gfn = region->start_gfn + page_offset;
>  	u64 count;
>  	struct page *page;
> -	int stride, ret;
> +	unsigned int stride;
> +	int ret;
> 
>  	page = region->mreg_pages[page_offset];
>  	if (!page)
>  		return -EINVAL;
> 
>  	stride = mshv_chunk_stride(page, gfn, page_count);
> -	if (stride < 0)
> -		return stride;
> 
>  	/* Start at stride since the first stride is validated */
>  	for (count = stride; count < page_count; count += stride) {
> 
> ---
> base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
> change-id: 20260416-huge_1g-e44461393c8f
> 
> Best regards,
> --
> Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 


^ permalink raw reply

* Re: [PATCH V1 3/3] mshv: Implement guest irq migration for passthru devices
From: Souradeep Chakrabarti @ 2026-05-13 15:15 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, linux-hyperv, linux-kernel, iommu,
	linux-pci, linux-arch
In-Reply-To: <20260512021242.1679786-4-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:12:42PM -0700, Mukesh R wrote:
> Ask the hypervisor to retarget interrupts to new guest cpu or vector
> upon guest irq migration. This happens in the irqfd update path.
> 
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c | 78 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 1f5c1e9ee9b7..c05201d857fd 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -192,6 +192,77 @@ static int mshv_map_device_interrupt(u64 ptid, union hv_device_id hv_devid,
>  
>  }
>  
> +/* NOTE: caller does spin_lock_irq on pt_irqfds_lock, hence no disable here */
> +static void mshv_do_guest_irq_retarget(u64 partid, struct mshv_irqfd *irqfd)
> +{
> +	int rc, var_size;
> +	u64 status;
> +	union hv_device_id hv_devid;
> +	struct hv_input_get_vp_set_from_mda *mda_input;
> +	union hv_output_get_vp_set_from_mda *mda_output;
> +	struct hv_retarget_device_interrupt *remap_inp;
> +	struct pci_dev *pdev;
> +	struct irq_data *irqdata;
> +	struct mshv_lapic_irq *lapic_irq = &irqfd->irqfd_lapic_irq;
> +	struct hv_interrupt_entry *inte = NULL;
> +
> +	if (!irqfd->irqfd_girq_ent.girq_entry_valid ||
> +	    irqfd->irqfd_bypass_prod == NULL)
> +		return;
> +
> +	rc = mshv_parse_mshv_irqfd(irqfd, &pdev, &irqdata);
> +	if (rc)
> +		return;
> +
> +	inte = irqdata->chip_data;
> +	if (inte == NULL)
> +		return;
> +
> +	hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
> +
> +
> +	mda_input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	mda_output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	rc = hv_vpset_from_hyp_disabled(mda_input, mda_output, lapic_irq,
> +					partid);
> +	if (rc)
> +		return;
> +
> +	remap_inp = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(remap_inp, 0, sizeof(*remap_inp));
> +
> +	rc = hv_copy_vpset(&remap_inp->int_target.vp_set,
> +			   &mda_output->target_vpset);
> +	if (rc <= 0) {
> +		pr_err("Hyper-V: ptid %lld - vpset copy failed (%d)\n",
> +		       partid, rc);
> +		return;
> +	}
> +
> +	/*
> +	 * var-sized hcall: var-size starts after vp_mask (thus vp_set.format
> +	 * does not count, but vp_set.valid_bank_mask does).
> +	 */
> +	var_size = rc + 1;
> +
> +	remap_inp->partition_id = partid;
> +	remap_inp->device_id = hv_devid.as_uint64;
> +	remap_inp->int_target.vector = lapic_irq->lapic_vector;
> +	remap_inp->int_target.flags = HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> +
> +	remap_inp->int_entry.source = inte->source;
> +	remap_inp->int_entry.msi_entry.as_uint64 = inte->msi_entry.as_uint64;
> +
> +	status = hv_do_rep_hypercall(HVCALL_RETARGET_INTERRUPT, 0, var_size,
> +				     remap_inp, NULL);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "pt:%lld vec:%d lapic-id:%lld\n",
> +			      partid, lapic_irq->lapic_vector,
> +			      lapic_irq->lapic_apic_id);
> +}
> +
>  static int mshv_unmap_device_interrupt(union hv_device_id hv_devid,
>  				       struct hv_interrupt_entry *irq_entry)
>  {
> @@ -729,9 +800,12 @@ static void mshv_irqfd_update(struct mshv_partition *pt,
>  			      struct mshv_irqfd *irqfd)
>  {
>  	write_seqcount_begin(&irqfd->irqfd_irqe_sc);
> -	irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt,
> -						    irqfd->irqfd_irqnum);
> +	irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt, irqfd->irqfd_irqnum);
>  	mshv_copy_girq_info(&irqfd->irqfd_girq_ent, &irqfd->irqfd_lapic_irq);
> +
> +#if IS_ENABLED(CONFIG_X86_64)
> +	mshv_do_guest_irq_retarget(pt->pt_id, irqfd);
> +#endif
>  	write_seqcount_end(&irqfd->irqfd_irqe_sc);
>  }
>  
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Souradeep Chakrabarti @ 2026-05-13 15:17 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <agST9ZRXNWhbUGMY@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Wed, May 13, 2026 at 08:08:37AM -0700, Souradeep Chakrabarti wrote:
> On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> > On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> > interrupts, etc need a device ID as a parameter. This device ID refers
> > to that specific device during the lifetime of passthru.
> > 
> > An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> > device is slightly different in that it uses the hv_pcibus_device info
> > for building it to make sure it matches exactly what the hypervisor
> > expects. This VMBus based device ID is needed when attaching devices in
> > an L1VH based guest VM. Add a function to build and export it. Before
> > building it, a check is done to make sure the device is a valid VMBus
> > device.
> > 
> > In remaining cases, PCI device ID is used. So, also make PCI device ID
> > build function hv_build_devid_type_pci() public.
> >
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Reviewed-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
> > Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> > ---
> >  arch/x86/hyperv/irqdomain.c         |  9 +++++----
> >  arch/x86/include/asm/mshyperv.h     |  6 ++++++
> >  drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
> >  include/asm-generic/mshyperv.h      | 11 +++++++++++
> >  4 files changed, 46 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> > index b3ad50a874dc..8780573a4332 100644
> > --- a/arch/x86/hyperv/irqdomain.c
> > +++ b/arch/x86/hyperv/irqdomain.c
> > @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> >  	return 0;
> >  }
> >  
> > -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> > +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
> >  {
> >  	int pos;
> >  	union hv_device_id hv_devid;
> > @@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> >  	}
> >  
> >  out:
> > -	return hv_devid;
> > +	return hv_devid.as_uint64;
> >  }
> > +EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
> >  
> >  /*
> >   * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> > @@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
> >  
> >  	msidesc = irq_data_get_msi_desc(data);
> >  	pdev = msi_desc_to_pci_dev(msidesc);
> > -	hv_devid = hv_build_devid_type_pci(pdev);
> > +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> >  	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
> >  
> >  	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> > @@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> >  {
> >  	union hv_device_id hv_devid;
> >  
> > -	hv_devid = hv_build_devid_type_pci(pdev);
> > +	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> >  	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> >  }
> >  
> > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > index f64393e853ee..2ef34001f8d3 100644
> > --- a/arch/x86/include/asm/mshyperv.h
> > +++ b/arch/x86/include/asm/mshyperv.h
> > @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
> >  static inline void hv_root_crash_init(void) {}
> >  #endif  /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
> >  
> > +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> > +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> > +#else
> > +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> > +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> > +
> >  #else /* CONFIG_HYPERV */
> >  static inline void hyperv_init(void) {}
> >  static inline void hyperv_setup_mmu_ops(void) {}
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index cfc8fa403dad..50d793ca8f31 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -573,6 +573,7 @@ struct hv_pci_compl {
> >  };
> >  
> >  static void hv_pci_onchannelcallback(void *context);
> > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> >  
> >  #ifdef CONFIG_X86
> >  #define DELIVERY_MODE		APIC_DELIVERY_MODE_FIXED
> > @@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
> >  static void hv_arch_irq_unmask(struct irq_data *data) { }
> >  #endif /* CONFIG_ARM64 */
> >  
> > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > +{
> > +	struct hv_pcibus_device *hbus;
> > +	struct pci_bus *pbus = pdev->bus;
> > +
> > +	if (!hv_vmbus_pci_device(pbus))
> > +		return 0;
> > +
> > +	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > +
> > +	return	(hbus->hdev->dev_instance.b[5] << 24) |
> > +		(hbus->hdev->dev_instance.b[4] << 16) |
> > +		(hbus->hdev->dev_instance.b[7] << 8) |
> > +		(hbus->hdev->dev_instance.b[6] & 0xf8) |
> > +		PCI_FUNC(pdev->devfn);
> > +}
> > +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> > +
> >  /**
> >   * hv_pci_generic_compl() - Invoked for a completion packet
> >   * @context:		Set up by the sender of the packet.
> > @@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
> >  	.write = hv_pcifront_write_config,
> >  };
> >  
> > +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> > +{
> > +	return pbus->ops == &hv_pcifront_ops;
> > +}
> > +
> >  /*
> >   * Paravirtual backchannel
> >   *
> > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > index e8cbc4e3f7ad..25ac7ca0fd8b 100644
> > --- a/include/asm-generic/mshyperv.h
> > +++ b/include/asm-generic/mshyperv.h
> > @@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
> >  /* Sentinel value for an uninitialized entry in hv_vp_index array */
> >  #define VP_INVAL	U32_MAX
> >  
> > +/* Forward declarations */
> > +struct pci_dev;
> > +
> >  int __init hv_common_init(void);
> >  void __init hv_get_partition_id(void);
> >  void __init hv_common_free(void);
> > @@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
> >  void hyperv_cleanup(void);
> >  bool hv_query_ext_cap(u64 cap_query);
> >  void hv_setup_dma_ops(struct device *dev, bool coherent);
> > +
> > +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > +#else
> > +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > +{ return 0; }
> > +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> > +
> >  #else /* CONFIG_HYPERV */
> >  static inline void hv_identify_partition_type(void) {}
> >  static inline bool hv_is_hyperv_initialized(void) { return false; }
> > -- 
> > 2.51.2.vfs.0.1
> > 

^ permalink raw reply

* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
From: Frederic Weisbecker @ 2026-05-13 16:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <20260421030351.281436-9-longman@redhat.com>

Le Mon, Apr 20, 2026 at 11:03:36PM -0400, Waiman Long a écrit :
> As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
> need to use RCU to protect access to the cpumask to prevent it from
> going away in the middle of the operation.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  arch/arm64/kernel/topology.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb..48f150801689 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
>  	if (!amu_fie_cpu_supported(cpu))
>  		return;
>  
> +	guard(rcu)();
>  	/* Kick in AMU update but only if one has not happened already */
>  	if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
>  	    time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update,
>  	cpu)))

This is called with IRQs disabled in the current CPU that is online so it's
already guaranteed to be stable.


> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
>  	unsigned int start_cpu = cpu;
>  	unsigned long last_update;
>  	unsigned int freq = 0;
> +	bool hk_cpu;
>  	u64 scale;
>  
>  	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
>  		return -EOPNOTSUPP;
>  
> +	scoped_guard(rcu) {
> +		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> +	}
> +
>  	while (1) {
>  
>  		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
>  		 * (and thus freq scale), if available, for given policy: this boils
>  		 * down to identifying an active cpu within the same freq domain, if any.
>  		 */
> -		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> +		if (!hk_cpu ||
>  		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>  			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> +			bool hk_intersects;
>  			int ref_cpu;
>  
>  			if (!policy)
>  				return -EINVAL;
>  
> -			if (!cpumask_intersects(policy->related_cpus,
> -						housekeeping_cpumask(HK_TYPE_TICK))) {
> +			scoped_guard(rcu) {
> +				hk_intersects = cpumask_intersects(policy->related_cpus,
> +							housekeeping_cpumask(HK_TYPE_TICK));
> +			}
> +
> +			if (!hk_intersects) {
>  				cpufreq_cpu_put(policy);
>  				return -EOPNOTSUPP;
>  			}

Ok so this is racy but it's fine because:

This function is only used by cpufreq with either cpufreq_policy_write or
cpufreq_policy_read held (that is, struct cpufreq_policy::rwsem).

And that rwsem is write held on cpufreq_online() -> cpufreq_policy_online() and
also offline to guarantee the policy->cpus and policy->cpu stability.

Therefore housekeeping_cpumask() should only deal with stable online CPUs here. So
even if the housekeeping mask can be changed concurrently, those CPUs can't
appear or disappear from it.

Would be worth adding a comment about that.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox