Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR
From: Anirudh Rayabharam @ 2026-05-13  5:32 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865065.21765.5112039734725197893.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:10PM +0000, Stanislav Kinsburskii wrote:
> mshv_intercept_isr() reads vp_index from the SynIC intercept message
> payload and uses it directly to index into partition->pt_vp_array without
> validating that vp_index < MSHV_MAX_VPS.
> 
> Mshv treats the Microsoft Hypervisor as trusted, so a malformed vp_index is
> not a security concern; the threat model does not include a malicious
> hypervisor. A hypervisor bug that placed an out-of-range value here would,
> however, cause an out-of-bounds read of pt_vp_array in hardirq context,
> manifesting as random memory corruption or a host crash with no clear
> signal pointing back to the hypervisor as the source.
> 
> handle_bitset_message() and handle_pair_message() already perform this
> defensive check on hypervisor-supplied vp_index values, with an explicit
> "This shouldn't happen, but just in case" comment  Add the same check to
> mshv_intercept_isr() for consistency, turning a potential silent corruption
> into a debuggable pr_debug message.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index bac890cd2b468..89207aad7cf1f 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -387,6 +387,11 @@ mshv_intercept_isr(struct hv_message *msg)
>  	 */
>  	vp_index =
>  	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> +	/* This shouldn't happen, but just in case. */
> +	if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> +		pr_debug("VP index %u out of bounds\n", vp_index);
> +		goto unlock_out;
> +	}
>  	vp = partition->pt_vp_array[vp_index];
>  	if (unlikely(!vp)) {
>  		pr_debug("failed to find VP %u\n", vp_index);
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH V3 06/11] mshv: Add ioctl support for MSHV-VFIO bridge device
From: sashiko-bot @ 2026-05-13  5:27 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-7-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] FD leak and potential deadlock on copy_to_user() failure after anon_inode_getfd().
- [Medium] Asymmetric memory lifecycle and inconsistent device_release handling.
- [Medium] Missing validation of unknown flags in mshv_partition_ioctl_create_device().
--

commit a0a0db24ca9d194302f458a7afa8a424cc357990
Author: Mukesh R <mrathor@linux.microsoft.com>

mshv: Add ioctl support for MSHV-VFIO bridge device

This commit adds an ioctl for creating MSHV devices for a partition.
Currently only VFIO device types are supported, doing setup for the bridge
code in mshv_vfio.c.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 02c107458be90..6ceb5f608589f 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c

[ ... ]

> @@ -1625,6 +1625,119 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)

[ ... ]

> +static int mshv_device_fop_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_device *mshv_dev = filp->private_data;
> +	struct mshv_partition *partition = mshv_dev->device_pt;
> +
> +	if (mshv_dev->device_ops->device_release) {
> +		mutex_lock(&partition->pt_mutex);
> +		hlist_del(&mshv_dev->device_ptnode);
> +		mshv_dev->device_ops->device_release(mshv_dev);
> +		mutex_unlock(&partition->pt_mutex);
> +	}

What happens if a future device type does not implement the device_release
callback? It looks like the device would remain indefinitely linked in 
partition->pt_devices while the file descriptor is closed, leading to
list corruption and a memory leak.

Since the device structure is allocated by the core module via kzalloc(),
would it be safer to handle the list removal and generic kfree()
unconditionally in this function?

> +
> +	mshv_partition_put(partition);
> +	return 0;
> +}
> +
> +static const struct file_operations mshv_device_fops = {
> +	.owner = THIS_MODULE,
> +	.unlocked_ioctl = mshv_device_fop_ioctl,
> +	.release = mshv_device_fop_release,
> +};
> +
> +static long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> +					       void __user *uarg)
> +{
> +	long rc;
> +	struct mshv_create_device devargk;
> +	struct mshv_device *mshv_dev;
> +	const struct mshv_device_ops *vfio_ops;
> +
> +	if (copy_from_user(&devargk, uarg, sizeof(devargk)))
> +		return -EFAULT;
> +
> +	/* At present, only VFIO is supported */
> +	if (devargk.type != MSHV_DEV_TYPE_VFIO)
> +		return -ENODEV;
> +
> +	if (devargk.flags & MSHV_CREATE_DEVICE_TEST)
> +		return 0;

Should we explicitly validate that no undefined flags are set here? 
If a future kernel version introduces new flags for device creation, older 
kernels will silently ignore them and return success instead of -EINVAL, 
breaking forward compatibility.

> +
> +	/* This is freed later by mshv_vfio_release_device() */
> +	mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
> +	if (mshv_dev == NULL)
> +		return -ENOMEM;
> +
> +	vfio_ops = &mshv_vfio_device_ops;
> +	mshv_dev->device_ops = vfio_ops;
> +	mshv_dev->device_pt = partition;
> +
> +	rc = vfio_ops->device_create(mshv_dev);
> +	if (rc < 0) {
> +		kfree(mshv_dev);
> +		return rc;
> +	}
> +
> +	hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
> +
> +	mshv_partition_get(partition);
> +	rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
> +			      mshv_dev, O_RDWR | O_CLOEXEC);
> +	if (rc < 0)
> +		goto undo_out;
> +
> +	devargk.fd = rc;
> +	if (copy_to_user(uarg, &devargk, sizeof(devargk)))
> +		return -EFAULT;    /* cleanup in mshv_device_fop_release() */

Does this sequence leak the file descriptor if copy_to_user() fails?

The anon_inode_getfd() call installs the file descriptor into the process's 
fd table. If the subsequent copy_to_user() fails, the function returns 
-EFAULT, but the user space application never receives the fd number to 
explicitly close it, resulting in a leaked file descriptor.

Additionally, can this trigger a deadlock?

The ioctl is called with partition->pt_mutex held. If a malicious user 
provides a buffer backed by userfaultfd, the copy_to_user() call will 
block, holding the mutex. Concurrently, another thread could find the 
new fd in /proc/self/fd and call close(), which invokes 
mshv_device_fop_release(). That release function unconditionally attempts 
to acquire partition->pt_mutex, resulting in a deadlock.

Would it be better to separate file allocation and installation by using 
get_unused_fd_flags(), anon_inode_getfile(), and only calling fd_install() 
after copy_to_user() succeeds?

> +
> +	return 0;
> +
> +undo_out:
> +	hlist_del(&mshv_dev->device_ptnode);
> +	vfio_ops->device_release(mshv_dev);    /* will kfree(mshv_dev) */

If a future device type does not implement the device_release callback,
could this result in a NULL pointer dereference?

> +	mshv_partition_put(partition);
> +	return rc;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=6

^ permalink raw reply

* Re: [PATCH V3 05/11] mshv: Implement mshv bridge device for VFIO
From: sashiko-bot @ 2026-05-13  5:09 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-6-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Unprotected access to `pt_devices` list and missing lockdep assertion.
- [Low] The `mshv_vfio_device_ops` structure is missing the `const` qualifier.
--

commit 96ab470ecc51b095ed8b1538e4abdf37d030b039
Author: Mukesh R <mrathor@linux.microsoft.com>
Subject: mshv: Implement mshv bridge device for VFIO

This commit introduces a new VFIO-MSHV bridge pseudo device file, adapted
from KVM's VFIO bridge. It provides the necessary device operations to link
VFIO files with an MSHV partition.

> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
> new file mode 100644
> index 0000000000000..00a97920e25ba
> --- /dev/null
> +++ b/drivers/hv/mshv_vfio.c
[ ... ]
> +static long mshv_vfio_create_device(struct mshv_device *mshvdev)
> +{
> +	struct mshv_device *tmp;
> +	struct mshv_vfio *mshv_vfio;
> +
> +	/* Only one VFIO "device" per VM */
> +	hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
> +			     device_ptnode)
> +		if (tmp->device_ops == &mshv_vfio_device_ops)
> +			return -EBUSY;

Are we missing a lockdep assertion here to ensure the partition lock is held?

Since this iterates over the pt_devices list to enforce a singleton device,
could concurrent ioctl calls bypass this check and corrupt the hlist if the
caller doesn't acquire the lock?

[ ... ]

> +struct mshv_device_ops mshv_vfio_device_ops = {
> +	.device_name = "mshv-vfio",
> +	.device_create = mshv_vfio_create_device,
> +	.device_release = mshv_vfio_release_device,
> +	.device_set_attr = mshv_vfio_set_attr,
> +	.device_has_attr = mshv_vfio_has_attr,
> +};

Should mshv_vfio_device_ops be declared as const?

The structure consists entirely of function pointers and is never modified
after initialization. Making it const would place it in read-only memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=5

^ permalink raw reply

* Re: [PATCH net-next v8] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-05-13  4:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, dipayanroy,
	shirazsaleem, yury.norov, kees, linux-hyperv, netdev,
	linux-kernel, linux-rdma
In-Reply-To: <20260508220114.3342627-1-kuba@kernel.org>

On Fri, May 08, 2026 at 03:01:14PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Expose hardware diagnostic info via debugfs
> 
> Add debugfs entries to expose hardware configuration and diagnostic
> information that aids in debugging driver initialization and runtime
> operations without adding noise to dmesg.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> [ ... ]
> > @@ -2231,13 +2239,18 @@ int mana_gd_resume(struct pci_dev *pdev)
> >  
> >  	err = mana_probe(&gc->mana, true);
> >  	if (err)
> > -		return err;
> > +		goto cleanup_gd;
> >  
> >  	err = mana_rdma_probe(&gc->mana_ib);
> >  	if (err)
> > -		return err;
> > +		goto cleanup_mana;
> >  
> >  	return 0;
> > +cleanup_mana:
> > +	mana_remove(&gc->mana, true);
> > +cleanup_gd:
> > +	mana_gd_cleanup_device(pdev);
> > +	return err;
> >  }
> 
> Since mana_remove() called with true for suspending skips
> unregister_netdevice(), the network devices remain registered in the system.
> 
> If mana_gd_cleanup_device() is then called, which destroys the hardware
> channel and clears gc->hwc.driver_data, could this lead to a NULL pointer
> dereference if userspace later interacts with the still-registered devices?

Sorry for delayed response.

This is a valid concern. I will update it to mana_remove(&gc->mana, false) in
the next version.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [ ... ]
> > @@ -3161,6 +3169,23 @@ static int mana_init_port(struct net_device *ndev)
> >  	eth_hw_addr_set(ndev, apc->mac_addr);
> >  	sprintf(vport, "vport%d", port_idx);
> >  	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
> > +
> > +	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
> > +			   &apc->port_handle);
> 
> When the device suspends, mana_gd_suspend() calls mana_gd_cleanup_device(),
> which recursively removes the entire debugfs directory tree.
> 
> During resume, the parent debugfs directory is recreated, but mana_probe()
> skips calling mana_init_port() for existing ports.
> 
> Does this mean the per-vPort debugfs entries are permanently lost after a
> suspend and resume cycle?
> 
This is false positive.
The per-vPort debugfs entries are recreated during resume. The flow is:

Suspend: mana_gd_cleanup_device() removes the entire debugfs tree.

Resume: mana_gd_setup() recreates the top-level mana_pci_debugfs
directory. Then mana_probe(gd, true) takes the resuming=true path,
which calls mana_attach() for each port. mana_attach() calls
mana_init_port(), which recreates the vportN directory and all
per-vPort debugfs files (port_handle, max_sq, max_rq, steer_*,
current_speed).

So the per-vPort entries are not permanently lost, they survive the
suspend/resume cycle.
> [ ... ]
> > @@ -3800,6 +3830,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> >  
> >  	mana_gd_deregister_device(gd);
> >  
> > +	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
> > +	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
> > +
> >  	if (suspending)
> >  		return;
> 
> Earlier in mana_remove(), free_netdev() is called, which frees the port
> context structure (apc) allocated via netdev_priv(ndev).
> 
> Since the per-vPort debugfs files created in mana_init_port() directly
> reference memory within apc, and apc->mana_port_debugfs is not explicitly
> removed here, does this leave the debugfs files accessible but pointing to
> freed memory until the parent directory is finally removed in
> mana_gd_cleanup_device()?

This is a false positive.
The per-vPort debugfs files are removed before free_netdev() frees apc.

In mana_remove(gd, false), the sequence for each port is:

mana_detach(ndev, false) -> mana_cleanup_port_context(apc)
-> debugfs_remove(apc->mana_port_debugfs).
This removes the entire vportN directory and all its child files, since
debugfs_remove() is recursive (simple_recursive_removal()).
free_netdev(ndev) which frees apc.
Because step 1 removes all debugfs files referencing apc fields before
step 2 frees the memory, there is no window where the files point to
freed memory.

^ permalink raw reply

* Re: [PATCH V3 03/11] mshv: Provide a way to get partition ID if running in a VMM process
From: sashiko-bot @ 2026-05-13  3:47 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-4-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Tying VM partition IDs to process TGIDs (`current->tgid`) is an architectural flaw causing isolation breakage and security risks.
- [High] `mshv_current_partid()` returns a raw partition ID without taking a reference, creating a TOCTOU vulnerability.
- [High] Pre-existing Use-After-Free in `destroy_partition()` due to RCU-001 violation during VP teardown.
--

commit 528e38745b2ae16c0fd8ad4876d893d1b1fea027
Author: Mukesh R <mrathor@linux.microsoft.com>

mshv: Provide a way to get partition ID if running in a VMM process

This commit introduces mshv_current_partid() to allow retrieving the
partition id for the current vmm process. It adds a pt_vmm_tgid field
to the mshv_partition struct and sets it during partition creation.
The partition id is found by iterating the global hash table.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..02c107458be90 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1908,6 +1908,27 @@ mshv_partition_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> +/* Given a process tgid, return partition id if it is a VMM process */
> +u64 mshv_current_partid(void)
> +{
> +	struct mshv_partition *pt;
> +	int i;
> +	u64 ret_ptid = HV_PARTITION_ID_INVALID;
> +
> +	rcu_read_lock();
> +
> +	hash_for_each_rcu(mshv_root.pt_htable, i, pt, pt_hnode) {
> +		if (pt->pt_vmm_tgid == current->tgid) {

Is tying the partition id to the process tgid safe?

Since a single vmm process can create multiple partitions, this loop returns
the first partition matching the tgid. This could result in pci passthrough
operations being misrouted to the wrong partition if a process manages
multiple virtual machines.

Also, since the partition's lifecycle is tied to an anonymous file descriptor,
passing the fd to another process and exiting would break
mshv_current_partid() for the new process.

Additionally, if the operating system recycles the old tgid, an unrelated
process might gain access to the partition id.

Should explicit file descriptor tracking be used instead of global process
ids?

> +			ret_ptid = pt->pt_id;
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	return ret_ptid;

Does returning a raw partition id without taking a reference create a race
condition?

After rcu_read_unlock() is called and the pt_id is returned, a concurrent
thread in the vmm could close the partition's file descriptor, causing the
partition to be deleted in the hypervisor.

If the hypervisor recycles the pt_id for a newly created partition belonging
to a completely different process, the pending iommu attachment or pci
passthrough hypercall will use the stale pt_id.

Can this allow the vmm to inadvertently attach its devices to an unrelated
partition?

> +}

This isn't a regression in this patch, but looking at the partition lifecycle
code, does destroy_partition() have a use-after-free?

destroy_partition() frees the virtual processor structures (kfree(vp))
before calling remove_partition(partition), which removes the partition
from the rcu-protected global hash table (pt_htable) and calls
synchronize_rcu().

Because the partition remains visible to rcu readers during the teardown loop,
can an interrupt handler (e.g., in mshv_synic.c) call mshv_partition_find(),
successfully retrieve the partition, and read
vp = partition->pt_vp_array[vp_index]?

If this happens concurrently with or immediately after kfree(vp), the
interrupt handler would access freed memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=3

^ permalink raw reply

* Re: [PATCH V3 02/11] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: sashiko-bot @ 2026-05-13  3:26 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-3-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Resource leak in `hv_msi_domain_free()` when freeing multiple IRQs.
- [High] Uninitialized memory write to hardware MSI registers in `hv_irq_compose_msi_msg()` error paths.
- [High] Invalid hardware state read in `hv_build_devid_type_pci()` for Type 0 PCI-X leaf devices.
--

commit 8c84576bf69ec6b4d58407408095bf71a9940bf2
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Cosmetic changes in irqdomain.c for readability

Make cosmetic changes such as renaming variables for consistency, fixing
checkpatch warnings, updating indentation, and replacing spaces with tabs.

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
[ ... ]
> @@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
> +static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
[ ... ]
> -	if (data.bridge) {
> -		int pos;
> +	if (data.bridge == NULL)
> +		goto out;
>  
> -		/*
> -		 * Microsoft Hypervisor requires a bus range when the bridge is
> -		 * running in PCI-X mode.
[ ... ]
> -		pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> -		if (pos) {
> -			u16 status;
> +	pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> +	if (pos) {
> +		u16 status;

If the leaf device is on the root bus, data.bridge will point to the leaf
device itself (a Type 0 device) after pci_for_each_dma_alias() completes.
Does this code incorrectly assume data.bridge is always a Type 1 bridge?

If it evaluates a Type 0 device here, reading PCI_X_BRIDGE_SSTATUS aliases
the Type 0 Command register. This could cause the code to read standard
Type 1 PCI_SECONDARY_BUS and PCI_SUBORDINATE_BUS offsets (which fall into
the middle of BAR2 for a Type 0 device), sending arbitrary BAR values as
shadow bus ranges to the hypervisor.

[ ... ]
> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	}
>  
>  	stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
> -	if (!stored_entry) {
> -		pr_debug("%s: failed to allocate chip data\n", __func__);
> +	if (!stored_entry)
>  		return;
> -	}

If kzalloc_obj() or hv_map_msi_interrupt() fails, this returns early
without populating the struct msi_msg *msg parameter.

Since callers (like msi_domain_activate() or irq_chip_compose_msi_msg())
allocate msg on the stack and cannot check for errors due to the void
return type, does this result in uninitialized stack memory being blindly
written to the hardware MSI registers?

[ ... ]
> @@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
>  		return ret;
>  
>  	for (int i = 0; i < nr_irqs; ++i) {
> -		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
> -				    handle_edge_irq, NULL, "edge");
> +		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
> +				    NULL, handle_edge_irq, NULL, "edge");
>  	}
> +
>  	return 0;
>  }
>  
> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> +			       unsigned int nr_irqs)
>  {
>  	for (int i = 0; i < nr_irqs; ++i) {
>  		struct irq_data *irqd = irq_domain_get_irq_data(d, virq);

Does this loop leak chip_data and leave IRQs mapped in the hypervisor for
the remaining nr_irqs - 1 IRQs?

It appears to call irq_domain_get_irq_data() using the base virq on every
iteration instead of virq + i, causing hv_teardown_msi_irq() to be
repeatedly invoked on the first IRQ only.

> @@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
>  
>  		hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
>  	}
> +
>  	irq_domain_free_irqs_top(d, virq, nr_irqs);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=2

^ permalink raw reply

* RE: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Michael Kelley @ 2026-05-13  3:15 UTC (permalink / raw)
  To: Jacob Pan, Mukesh R
  Cc: hpa@zytor.com, robin.murphy@arm.com, robh@kernel.org,
	wei.liu@kernel.org, Michael Kelley, muislam@microsoft.com,
	namjain@linux.microsoft.com, magnuskulke@linux.microsoft.com,
	anbelski@linux.microsoft.com, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	longli@microsoft.com, tglx@kernel.org, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	joro@8bytes.org, will@kernel.org, lpieralisi@kernel.org,
	kwilczynski@kernel.org, bhelgaas@google.com, arnd@arndb.de
In-Reply-To: <20260512164623.0000273f@linux.microsoft.com>

From: Jacob Pan <jacob.pan@linux.microsoft.com> Sent: Tuesday, May 12, 2026 4:46 PM
> 
> Hi Mukesh,
> 
> On Mon, 11 May 2026 19:02:49 -0700
> Mukesh R <mrathor@linux.microsoft.com> wrote:
> 
> > This file actually implements irq remapping, so rename to more
> > appropriate hyperv-irq.c. A new file to implement hyperv iommu will
> > be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
> > but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
> > CONFIG_IRQ_REMAP.
> >
> > Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> > ---
> >  MAINTAINERS                                    | 2 +-
> >  drivers/iommu/Makefile                         | 2 +-
> >  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
>
> Given that we have multiple Hyper-V IOMMU-related files — this renamed
> hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
> series) and the recently posted guest pvIOMMU driver — should we create
> a drivers/iommu/hyperv/ directory to consolidate them?

Patch 1/4 in the guest pvIOMMU driver [1] that was recently posted by
Yu Zhang does as you suggest.

Michael

[1] https://lore.kernel.org/linux-hyperv/20260511162408.1180069-1-zhangyu1@linux.microsoft.com/

> 
> >  drivers/iommu/irq_remapping.c                  | 2 +-
> >  4 files changed, 6 insertions(+), 6 deletions(-)
> >  rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index d1cc0e12fe1f..f803a6a38fee 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
> >  F:	drivers/hid/hid-hyperv.c
> >  F:	drivers/hv/
> >  F:	drivers/input/serio/hyperv-keyboard.c
> > -F:	drivers/iommu/hyperv-iommu.c
> > +F:	drivers/iommu/hyperv-irq.c
> >  F:	drivers/net/ethernet/microsoft/
> >  F:	drivers/net/hyperv/
> >  F:	drivers/pci/controller/pci-hyperv-intf.c
> > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > index 0275821f4ef9..335ea77cced6 100644
> > --- a/drivers/iommu/Makefile
> > +++ b/drivers/iommu/Makefile
> > @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
> >  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> >  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> >  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> > -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> > +obj-$(CONFIG_HYPERV) += hyperv-irq.o
> >  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> >  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> >  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> > diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
> > similarity index 99%
> > rename from drivers/iommu/hyperv-iommu.c
> > rename to drivers/iommu/hyperv-irq.c
> > index 479103261ae6..d11076f906fb 100644
> > --- a/drivers/iommu/hyperv-iommu.c
> > +++ b/drivers/iommu/hyperv-irq.c
> > @@ -8,6 +8,8 @@
> >   * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
> >   */
> >
> > +#ifdef CONFIG_IRQ_REMAP
> > +
> >  #include <linux/types.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/irq.h>
> > @@ -24,8 +26,6 @@
> >
> >  #include "irq_remapping.h"
> >
> > -#ifdef CONFIG_IRQ_REMAP
> > -
> >  /*
> >   * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
> >   * Redirection Table. Hyper-V exposes one single IO-APIC and so
> > define @@ -331,4 +331,4 @@ static const struct irq_domain_ops
> > hyperv_root_ir_domain_ops = { .free = hyperv_root_irq_remapping_free,
> >  };
> >
> > -#endif
> > +#endif  /* CONFIG_IRQ_REMAP */
> > diff --git a/drivers/iommu/irq_remapping.c
> > b/drivers/iommu/irq_remapping.c index c2443659812a..41bf65e4ea88
> > 100644 --- a/drivers/iommu/irq_remapping.c
> > +++ b/drivers/iommu/irq_remapping.c
> > @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
> >  	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
> >  		 amd_iommu_irq_ops.prepare() == 0)
> >  		remap_ops = &amd_iommu_irq_ops;
> > -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
> > +	else if (IS_ENABLED(CONFIG_HYPERV) &&
> >  		 hyperv_irq_remap_ops.prepare() == 0)
> >  		remap_ops = &hyperv_irq_remap_ops;
> >  	else
> 


^ permalink raw reply

* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: sashiko-bot @ 2026-05-13  3:12 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-hyperv
In-Reply-To: <2-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
- [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
- [High] Locking imbalance and freeing memory while locked.
- [High] Inconsistent state and Use-After-Free on error path.
- [Low] Uninitialized variable compiler warning for `offset_addr`.
--

commit dac7699d9e58c179b3a8859ac64da79e5fd21f46
Author: Jason Gunthorpe <jgg@ziepe.ca>
Subject: IB/rdmavt: Don't abuse udata and ib_respond_udata()

This commit modifies rvt_modify_srq to use copy_to_user directly instead
of ib_copy_to_udata for returning the offset address.

> diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
> index fe125bf85b272..d022aa56c5bfd 100644
> --- a/drivers/infiniband/sw/rdmavt/srq.c
> +++ b/drivers/infiniband/sw/rdmavt/srq.c
> @@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>  	struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
>  	struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
>  	struct rvt_rq tmp_rq = {};
> +	__u64 offset_addr;
>  	int ret = 0;
>  
>  	if (attr_mask & IB_SRQ_MAX_WR) {

Could hoisting offset_addr lead to a compiler warning for being potentially
uninitialized?

The compiler might assume udata->inlen could change concurrently across the
spin_lock_irq() and spin_unlock_irq() calls. If the first check fails
but the second one passes:

if (udata && udata->inlen >= sizeof(__u64))

Does this allow offset_addr to be used without initialization?

[ ... ]

> @@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>  			return -ENOMEM;
>  		/* Check that we can write the offset to mmap. */
>  		if (udata && udata->inlen >= sizeof(__u64)) {
> -			__u64 offset_addr;
>  			__u64 offset = 0;
>  
>  			ret = ib_copy_from_udata(&offset_addr, udata,
>  						 sizeof(offset_addr));
>  			if (ret)
>  				goto bail_free;
> -			udata->outbuf = (void __user *)
> -					(unsigned long)offset_addr;
> -			ret = ib_copy_to_udata(udata, &offset,
> -					       sizeof(offset));
> -			if (ret)
> +			if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
> +					 sizeof(offset))) {
> +				ret = -EFAULT;
>  				goto bail_free;
> +			}
>  		}
>  
>  		spin_lock_irq(&srq->rq.kwq->c_lock);

Can this code leak the kernel queue structure?

When resizing a user-backed SRQ, the code fetches elements from the old
user-mapped queue. In the udata path, okwq is left as NULL:

if (udata) {
        owq = srq->rq.wq;
        head = RDMA_READ_UAPI_ATOMIC(owq->head);
        tail = RDMA_READ_UAPI_ATOMIC(owq->tail);
} else {
        okwq = srq->rq.kwq;
        ...

When it later calls kvfree(okwq), it seems nothing is freed, which could
permanently leak the old kwq structure.

Is there a lock imbalance here when the queue is replaced?

The original queue is locked with:

spin_lock_irq(&srq->rq.kwq->c_lock);

But then the queue pointer is replaced:

srq->rq.kwq = tmp_rq.kwq;

When the lock is released:

spin_unlock_irq(&srq->rq.kwq->c_lock);

Does this unlock the new queue instead of the old one, and then free the old
queue via kvfree(okwq) while its lock is still held?

Can the loop migrating WQEs overflow the destination buffer?

The number of scatter-gather elements is read directly from user-shared
memory:

wqe = rvt_get_rwqe_ptr(&srq->rq, tail);
p->wr_id = wqe->wr_id;
p->num_sge = wqe->num_sge;
for (i = 0; i < wqe->num_sge; i++)
        p->sg_list[i] = wqe->sg_list[i];

If a userspace program concurrently modifies num_sge to a large value, could
this loop exceed the bounds of tmp_rq.kwq->curr_wq, since the destination
size was calculated based on srq->rq.max_sge?

[ ... ]

> @@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>  			 * See rvt_mmap() for details.
>  			 */
>  			if (udata && udata->inlen >= sizeof(__u64)) {
> -				ret = ib_copy_to_udata(udata, &ip->offset,
> -						       sizeof(ip->offset));
> -				if (ret)
> -					return ret;
> +				if (copy_to_user(u64_to_user_ptr(offset_addr),
> +						 &ip->offset,
> +						 sizeof(ip->offset)))
> +					return -EFAULT;
>  			}
>  

Does returning -EFAULT here leave the kernel and userspace in an inconsistent
state?

At this point, the kernel has already applied the SRQ size change, replaced
the queue, and permanently freed the old memory:

vfree(owq);
kvfree(okwq);

If copy_to_user fails and returns -EFAULT, userspace might interpret this as
a failed resize and continue using its existing memory mapping of the old
queue. Could this lead to a use-after-free since the pages have already
been freed?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com?part=2

^ permalink raw reply

* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: Stanislav Kinsburskii @ 2026-05-13  2:42 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
	longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
	linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <f073a8d7-5761-4f7b-a5e5-c6aeae5fdc72@kernel.org>

On Tue, May 12, 2026 at 09:18:11PM +0200, David Hildenbrand (Arm) wrote:
> On 5/12/26 18:18, Stanislav Kinsburskii wrote:
> > On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >>> +	for (; addr < end; addr += PAGE_SIZE) {
> >>> +		vm_fault_t ret;
> >>> +
> >>> +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> >>> +
> >>> +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> >>> +			/*
> >>> +			 * The mmap lock has been dropped by the fault handler.
> >>> +			 * Record the failing address and signal lock-drop to
> >>> +			 * the caller.
> >>> +			 */
> >>> +			*hmm_vma_walk->locked = 0;
> >>> +			hmm_vma_walk->last = addr;
> >>> +			return -EAGAIN;
> >>
> >>
> >> Okay, so we'll return straight from hmm_vma_fault() to
> >> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
> >>
> >> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
> >> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
> >> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
> >> requires the vma in there.
> >>
> > 
> > It looks like a caller can provide a post_vma callback in mm_walk_ops. I
> > missed that case here. This callback cannot be supported by this change.
> > I will update the patch.
> > 
> >>
> >> Note: am I wrong, or is hmm_vma_fault() really always called with
> >> required_fault=true?
> >>
> > 
> > No, hmm_pte_need_fault can return false.
> 
> That's not what I mean. Looks like all paths leading to hmm_vma_fault() have
> required_fault = true;
> 
> IOW, there is always a "if (required_fault)" before it one way or the other.
> 
> Ah, and there even is a "WARN_ON_ONCE(!required_fault)" in the function. What an
> odd thing to do :)
> 
> > 
> >>> +		}
> >>> +
> >>> +		if (ret & VM_FAULT_ERROR)
> >>>  			return -EFAULT;
> >>> +	}
> >>>  	return -EBUSY;
> >>>  }
> >>>  
> >>> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> >>>  	if (required_fault) {
> >>>  		int ret;
> >>>  
> >>> +		/*
> >>> +		 * Faulting hugetlb pages on the unlockable path is not
> >>> +		 * supported. The walk framework holds hugetlb_vma_lock_read
> >>> +		 * which must be dropped before handle_mm_fault, but if the
> >>> +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> >>> +		 * be freed and the walk framework's unconditional unlock
> >>> +		 * becomes a use-after-free.
> >>> +		 */
> >>> +		if (hmm_vma_walk->locked)
> >>> +			return -EFAULT;
> >>
> >> Just because it's unlockable doesn't mean that you must unlock. Can't this be
> >> kept working as is, just simulating here as if it would not be unlockable?
> >>
> > 
> > I’m not sure how to implement this. The walk_page_range code expects the
> > hugetlb VMA to still be read-locked when we return from
> > hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
> > be gone?
> > 
> > I added a note in the docs. Whoever tackles this will likely need to
> > either rework `walk_page_range` to handle the case where the VMA is
> > gone, or use a different approach.
> > 
> > Do you have any other suggestions on how to implement it?
> 
> You just want hmm_vma_fault() to not set
> "FAULT_FLAG_ALLOW_RETRY·|·FAULT_FLAG_KILLABLE".
> 
> The hacky way could be:
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5955f2f0c83d..83dba990e10a 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -564,6 +564,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned
> long hmask,
>         required_fault =
>                 hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
>         if (required_fault) {
> +               int *saved_locked = hmm_vma_walk->locked;
>                 int ret;
> 
>                 spin_unlock(ptl);
> @@ -576,7 +577,9 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned
> long hmask,
>                  * use here of either pte or ptl after dropping the vma
>                  * lock.
>                  */
> +               hmm_vma_walk->locked = NULL;
>                 ret = hmm_vma_fault(addr, end, required_fault, walk);
> +               hmm_vma_walk->locked = saved_locked;
>                 hugetlb_vma_lock_read(vma);
>                 return ret;
>         }
> 

I see. AFAIU the outcome would be the same.

> But really, I think we should just try to get uffd support working properly, not
> excluding hugetlb.
> 
> GUP achieves it properly by performing the fault handling outside of page table
> walking context ... essentially what I described in my first comment above:
> return the information to the caller and let it just trigger the fault.
> 
> The issue here is that we trigger a fault out of walk_hugetlb_range() where we
> still hold locks, resulting in this questionable hugetlb_vma_unlock_read +
> hugetlb_vma_lock_read pattern.
> 

Fair enough.

> The fault should just be triggered from a place where we don't have to play with
> hugetlb vma locks or be afraid that dropping the mmap lock causes other problems.
> 

I reworked this part. Please take a look at v2.

Thanks,
Stanislav

> 
> -- 
> Cheers,
> 
> David

^ permalink raw reply

* Re: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Mukesh R @ 2026-05-13  1:31 UTC (permalink / raw)
  To: Jacob Pan
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260512164623.0000273f@linux.microsoft.com>

On 5/12/26 16:46, Jacob Pan wrote:
> Hi Mukesh,
> 
> On Mon, 11 May 2026 19:02:49 -0700
> Mukesh R <mrathor@linux.microsoft.com> wrote:
> 
>> This file actually implements irq remapping, so rename to more
>> appropriate hyperv-irq.c. A new file to implement hyperv iommu will
>> be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
>> but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
>> CONFIG_IRQ_REMAP.
>>
>> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
>> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
>> ---
>>   MAINTAINERS                                    | 2 +-
>>   drivers/iommu/Makefile                         | 2 +-
>>   drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
> Given that we have multiple Hyper-V IOMMU-related files ? this renamed
> hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
> series) and the recently posted guest pvIOMMU driver ? should we create
> a drivers/iommu/hyperv/ directory to consolidate them?

This came up briefly during synup with arm64 devs whether to split
the file into arch specific and common. We decided to wait till arm is
working so we can tell how intrusive the #ifdefs are. We can decide as
part of the arm port patches. I am ok also if you want to do it as part
of the pv-iommu patches as follow up once this is merged.

Thanks,
-Mukesh

>>   drivers/iommu/irq_remapping.c                  | 2 +-
>>   4 files changed, 6 insertions(+), 6 deletions(-)
>>   rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d1cc0e12fe1f..f803a6a38fee 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
>>   F:	drivers/hid/hid-hyperv.c
>>   F:	drivers/hv/
>>   F:	drivers/input/serio/hyperv-keyboard.c
>> -F:	drivers/iommu/hyperv-iommu.c
>> +F:	drivers/iommu/hyperv-irq.c
>>   F:	drivers/net/ethernet/microsoft/
>>   F:	drivers/net/hyperv/
>>   F:	drivers/pci/controller/pci-hyperv-intf.c
>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> index 0275821f4ef9..335ea77cced6 100644
>> --- a/drivers/iommu/Makefile
>> +++ b/drivers/iommu/Makefile
>> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>>   obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>>   obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>>   obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>> +obj-$(CONFIG_HYPERV) += hyperv-irq.o
>>   obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>>   obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>>   obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
>> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
>> similarity index 99%
>> rename from drivers/iommu/hyperv-iommu.c
>> rename to drivers/iommu/hyperv-irq.c
>> index 479103261ae6..d11076f906fb 100644
>> --- a/drivers/iommu/hyperv-iommu.c
>> +++ b/drivers/iommu/hyperv-irq.c
>> @@ -8,6 +8,8 @@
>>    * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
>>    */
>>   
>> +#ifdef CONFIG_IRQ_REMAP
>> +
>>   #include <linux/types.h>
>>   #include <linux/interrupt.h>
>>   #include <linux/irq.h>
>> @@ -24,8 +26,6 @@
>>   
>>   #include "irq_remapping.h"
>>   
>> -#ifdef CONFIG_IRQ_REMAP
>> -
>>   /*
>>    * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
>>    * Redirection Table. Hyper-V exposes one single IO-APIC and so
>> define @@ -331,4 +331,4 @@ static const struct irq_domain_ops
>> hyperv_root_ir_domain_ops = { .free = hyperv_root_irq_remapping_free,
>>   };
>>   
>> -#endif
>> +#endif  /* CONFIG_IRQ_REMAP */
>> diff --git a/drivers/iommu/irq_remapping.c
>> b/drivers/iommu/irq_remapping.c index c2443659812a..41bf65e4ea88
>> 100644 --- a/drivers/iommu/irq_remapping.c
>> +++ b/drivers/iommu/irq_remapping.c
>> @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
>>   	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
>>   		 amd_iommu_irq_ops.prepare() == 0)
>>   		remap_ops = &amd_iommu_irq_ops;
>> -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
>> +	else if (IS_ENABLED(CONFIG_HYPERV) &&
>>   		 hyperv_irq_remap_ops.prepare() == 0)
>>   		remap_ops = &hyperv_irq_remap_ops;
>>   	else


^ permalink raw reply

* Re: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Jacob Pan @ 2026-05-12 23:46 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-2-mrathor@linux.microsoft.com>

Hi Mukesh,

On Mon, 11 May 2026 19:02:49 -0700
Mukesh R <mrathor@linux.microsoft.com> wrote:

> This file actually implements irq remapping, so rename to more
> appropriate hyperv-irq.c. A new file to implement hyperv iommu will
> be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
> but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
> CONFIG_IRQ_REMAP.
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  MAINTAINERS                                    | 2 +-
>  drivers/iommu/Makefile                         | 2 +-
>  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
Given that we have multiple Hyper-V IOMMU-related files — this renamed
hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
series) and the recently posted guest pvIOMMU driver — should we create
a drivers/iommu/hyperv/ directory to consolidate them?

>  drivers/iommu/irq_remapping.c                  | 2 +-
>  4 files changed, 6 insertions(+), 6 deletions(-)
>  rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1cc0e12fe1f..f803a6a38fee 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
>  F:	drivers/hid/hid-hyperv.c
>  F:	drivers/hv/
>  F:	drivers/input/serio/hyperv-keyboard.c
> -F:	drivers/iommu/hyperv-iommu.c
> +F:	drivers/iommu/hyperv-irq.c
>  F:	drivers/net/ethernet/microsoft/
>  F:	drivers/net/hyperv/
>  F:	drivers/pci/controller/pci-hyperv-intf.c
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 0275821f4ef9..335ea77cced6 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> +obj-$(CONFIG_HYPERV) += hyperv-irq.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
> similarity index 99%
> rename from drivers/iommu/hyperv-iommu.c
> rename to drivers/iommu/hyperv-irq.c
> index 479103261ae6..d11076f906fb 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-irq.c
> @@ -8,6 +8,8 @@
>   * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
>   */
>  
> +#ifdef CONFIG_IRQ_REMAP
> +
>  #include <linux/types.h>
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
> @@ -24,8 +26,6 @@
>  
>  #include "irq_remapping.h"
>  
> -#ifdef CONFIG_IRQ_REMAP
> -
>  /*
>   * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
>   * Redirection Table. Hyper-V exposes one single IO-APIC and so
> define @@ -331,4 +331,4 @@ static const struct irq_domain_ops
> hyperv_root_ir_domain_ops = { .free = hyperv_root_irq_remapping_free,
>  };
>  
> -#endif
> +#endif  /* CONFIG_IRQ_REMAP */
> diff --git a/drivers/iommu/irq_remapping.c
> b/drivers/iommu/irq_remapping.c index c2443659812a..41bf65e4ea88
> 100644 --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
>  	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
>  		 amd_iommu_irq_ops.prepare() == 0)
>  		remap_ops = &amd_iommu_irq_ops;
> -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
> +	else if (IS_ENABLED(CONFIG_HYPERV) &&
>  		 hyperv_irq_remap_ops.prepare() == 0)
>  		remap_ops = &hyperv_irq_remap_ops;
>  	else


^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: sashiko-bot @ 2026-05-12 23:45 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260511162408.1180069-5-zhangyu1@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] hv_iommu_fill_iova_list() incorrectly uses guest PAGE_SHIFT instead of hypervisor HV_HYP_PAGE_SHIFT, causing incomplete and incorrect IOTLB flushes on architectures with non-4KB pages.
- [High] Unconditional use of the new HVCALL_FLUSH_DEVICE_DOMAIN_LIST hypercall without a capability check or fallback.
- [Medium] Bitfield assignment on uninitialized flexible array member causes KMSAN use-of-uninitialized-value warnings.
--

commit 5fd02fd07f9fbc4f1734f312b8a33dfd79a38961
Author: Yu Zhang <zhangyu1@linux.microsoft.com>

iommu/hyperv: Add page-selective IOTLB flush support

This patch adds page-selective IOTLB flush support using the
HVCALL_FLUSH_DEVICE_DOMAIN_LIST hypercall. It decomposes IOVA ranges into
aligned power-of-two regions to fit in a single hypercall page, falling
back to a full domain flush if the range exceeds page capacity.

> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> index e5fc625314b55..3bca362b78150 100644
> --- a/drivers/iommu/hyperv/iommu.c
> +++ b/drivers/iommu/hyperv/iommu.c

[ ... ]

> +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> +					  unsigned long start,
> +					  unsigned long end)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;

Should this use HV_HYP_PAGE_SHIFT instead of PAGE_SHIFT?

If the kernel is compiled with a larger page size (like 64KB on ARM64),
start >> PAGE_SHIFT produces a 64KB guest PFN.

Since Hyper-V hypercalls expect 4KB PFNs (HV_HYP_PAGE_SHIFT = 12), this could
result in flushing the wrong pages and leaving stale DMA mappings active.

> +	unsigned long nr_pages = end_pfn - start_pfn;
> +	u16 count = 0;
> +
> +	while (nr_pages > 0) {
> +		unsigned long flush_pages;
> +		int order;
> +		unsigned long pfn_align;
> +		unsigned long size_align;
> +
> +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> +			break;
> +		}
> +
> +		if (start_pfn)
> +			pfn_align = __ffs(start_pfn);
> +		else
> +			pfn_align = BITS_PER_LONG - 1;
> +
> +		size_align = __fls(nr_pages);
> +		order = min(pfn_align, size_align);
> +		iova_list[count].page_mask_shift = order;
> +		iova_list[count].page_number = start_pfn;

Could this assignment trigger a KMSAN use-of-uninitialized-value warning?

The input buffer is from a per-CPU page and memset in
hv_flush_device_domain_list() only zeroes the fixed-size header, leaving the
flexible array uninitialized.

Assigning to bitfields forces a read-modify-write on the uninitialized 64-bit
union hv_iommu_flush_va.

> +
> +		flush_pages = 1UL << order;
> +		start_pfn += flush_pages;
> +		nr_pages -= flush_pages;
> +		count++;
> +	}
> +
> +	return count;
> +}
> +
> +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> +					struct iommu_iotlb_gather *iotlb_gather)
> +{
> +	u64 status;
> +	u16 count;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain_list *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = hv_domain->device_domain;
> +	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> +	count = hv_iommu_fill_iova_list(input->iova_list,
> +					iotlb_gather->start,
> +					iotlb_gather->end);
> +	if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> +		/*
> +		 * Range exceeds hypercall page capacity. Fall back to a full
> +		 * domain flush.
> +		 */
> +		struct hv_input_flush_device_domain *flush_all = (void *)input;
> +
> +		memset(flush_all, 0, sizeof(*flush_all));
> +		flush_all->device_domain = hv_domain->device_domain;
> +		status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> +					flush_all, NULL);
> +	} else {
> +		status = hv_do_rep_hypercall(
> +				HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> +				count, 0, input, NULL);
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);

Does this code need a capability check or fallback for older hosts?

If the hypervisor doesn't support the new HVCALL_FLUSH_DEVICE_DOMAIN_LIST
hypercall, hv_do_rep_hypercall() might return an error status.

The code prints an error but doesn't fall back to HVCALL_FLUSH_DEVICE_DOMAIN,
potentially leaving the IOTLB unflushed.

> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260511162408.1180069-1-zhangyu1@linux.microsoft.com?part=4

^ permalink raw reply

* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: David Hildenbrand (Arm) @ 2026-05-12 19:18 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
	longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
	linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <agNS4llNtAHBkMA2@skinsburskii.localdomain>

On 5/12/26 18:18, Stanislav Kinsburskii wrote:
> On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
>>
>>> +	for (; addr < end; addr += PAGE_SIZE) {
>>> +		vm_fault_t ret;
>>> +
>>> +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
>>> +
>>> +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
>>> +			/*
>>> +			 * The mmap lock has been dropped by the fault handler.
>>> +			 * Record the failing address and signal lock-drop to
>>> +			 * the caller.
>>> +			 */
>>> +			*hmm_vma_walk->locked = 0;
>>> +			hmm_vma_walk->last = addr;
>>> +			return -EAGAIN;
>>
>>
>> Okay, so we'll return straight from hmm_vma_fault() to
>> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
>>
>> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
>> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
>> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
>> requires the vma in there.
>>
> 
> It looks like a caller can provide a post_vma callback in mm_walk_ops. I
> missed that case here. This callback cannot be supported by this change.
> I will update the patch.
> 
>>
>> Note: am I wrong, or is hmm_vma_fault() really always called with
>> required_fault=true?
>>
> 
> No, hmm_pte_need_fault can return false.

That's not what I mean. Looks like all paths leading to hmm_vma_fault() have
required_fault = true;

IOW, there is always a "if (required_fault)" before it one way or the other.

Ah, and there even is a "WARN_ON_ONCE(!required_fault)" in the function. What an
odd thing to do :)

> 
>>> +		}
>>> +
>>> +		if (ret & VM_FAULT_ERROR)
>>>  			return -EFAULT;
>>> +	}
>>>  	return -EBUSY;
>>>  }
>>>  
>>> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>>>  	if (required_fault) {
>>>  		int ret;
>>>  
>>> +		/*
>>> +		 * Faulting hugetlb pages on the unlockable path is not
>>> +		 * supported. The walk framework holds hugetlb_vma_lock_read
>>> +		 * which must be dropped before handle_mm_fault, but if the
>>> +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
>>> +		 * be freed and the walk framework's unconditional unlock
>>> +		 * becomes a use-after-free.
>>> +		 */
>>> +		if (hmm_vma_walk->locked)
>>> +			return -EFAULT;
>>
>> Just because it's unlockable doesn't mean that you must unlock. Can't this be
>> kept working as is, just simulating here as if it would not be unlockable?
>>
> 
> I’m not sure how to implement this. The walk_page_range code expects the
> hugetlb VMA to still be read-locked when we return from
> hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
> be gone?
> 
> I added a note in the docs. Whoever tackles this will likely need to
> either rework `walk_page_range` to handle the case where the VMA is
> gone, or use a different approach.
> 
> Do you have any other suggestions on how to implement it?

You just want hmm_vma_fault() to not set
"FAULT_FLAG_ALLOW_RETRY·|·FAULT_FLAG_KILLABLE".

The hacky way could be:

diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83d..83dba990e10a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -564,6 +564,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned
long hmask,
        required_fault =
                hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
        if (required_fault) {
+               int *saved_locked = hmm_vma_walk->locked;
                int ret;

                spin_unlock(ptl);
@@ -576,7 +577,9 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned
long hmask,
                 * use here of either pte or ptl after dropping the vma
                 * lock.
                 */
+               hmm_vma_walk->locked = NULL;
                ret = hmm_vma_fault(addr, end, required_fault, walk);
+               hmm_vma_walk->locked = saved_locked;
                hugetlb_vma_lock_read(vma);
                return ret;
        }

But really, I think we should just try to get uffd support working properly, not
excluding hugetlb.

GUP achieves it properly by performing the fault handling outside of page table
walking context ... essentially what I described in my first comment above:
return the information to the caller and let it just trigger the fault.

The issue here is that we trigger a fault out of walk_hugetlb_range() where we
still hold locks, resulting in this questionable hugetlb_vma_unlock_read +
hugetlb_vma_lock_read pattern.

The fault should just be triggered from a place where we don't have to play with
hugetlb vma locks or be afraid that dropping the mmap lock causes other problems.


-- 
Cheers,

David

^ permalink raw reply related

* [PATCH 7.0 142/307] hv: Select CONFIG_SYSFB only for CONFIG_HYPERV_VMBUS
From: Greg Kroah-Hartman @ 2026-05-12 17:38 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Thomas Zimmermann, Michael Kelley,
	Saurabh Sengar, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Dexuan Cui, Long Li, linux-hyperv
In-Reply-To: <20260512173940.117428952@linuxfoundation.org>

7.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Thomas Zimmermann <tzimmermann@suse.de>

commit d33db956c9618e7cb08c2520ce708437914214ec upstream.

Hyperv's sysfb access only exists in the VMBUS support. Therefore
only select CONFIG_SYSFB for CONFIG_HYPERV_VMBUS. Avoids sysfb code
on systems that don't need it.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests")
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Saurabh Sengar <ssengar@linux.microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-hyperv@vger.kernel.org
Cc: <stable@vger.kernel.org> # v6.16+
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Link: https://patch.msgid.link/20260402092305.208728-2-tzimmermann@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/hv/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -9,7 +9,6 @@ config HYPERV
 	select PARAVIRT
 	select X86_HV_CALLBACK_VECTOR if X86
 	select OF_EARLY_FLATTREE if OF
-	select SYSFB if EFI && !HYPERV_VTL_MODE
 	select IRQ_MSI_LIB if X86
 	help
 	  Select this option to run Linux as a Hyper-V client operating
@@ -62,6 +61,7 @@ config HYPERV_VMBUS
 	tristate "Microsoft Hyper-V VMBus driver"
 	depends on HYPERV
 	default HYPERV
+	select SYSFB if EFI && !HYPERV_VTL_MODE
 	help
 	  Select this option to enable Hyper-V Vmbus driver.
 



^ permalink raw reply

* [PATCH 6.18 121/270] hv: Select CONFIG_SYSFB only for CONFIG_HYPERV_VMBUS
From: Greg Kroah-Hartman @ 2026-05-12 17:38 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Thomas Zimmermann, Michael Kelley,
	Saurabh Sengar, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Dexuan Cui, Long Li, linux-hyperv
In-Reply-To: <20260512173938.452574370@linuxfoundation.org>

6.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Thomas Zimmermann <tzimmermann@suse.de>

commit d33db956c9618e7cb08c2520ce708437914214ec upstream.

Hyperv's sysfb access only exists in the VMBUS support. Therefore
only select CONFIG_SYSFB for CONFIG_HYPERV_VMBUS. Avoids sysfb code
on systems that don't need it.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests")
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Saurabh Sengar <ssengar@linux.microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-hyperv@vger.kernel.org
Cc: <stable@vger.kernel.org> # v6.16+
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Link: https://patch.msgid.link/20260402092305.208728-2-tzimmermann@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/hv/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -9,7 +9,6 @@ config HYPERV
 	select PARAVIRT
 	select X86_HV_CALLBACK_VECTOR if X86
 	select OF_EARLY_FLATTREE if OF
-	select SYSFB if EFI && !HYPERV_VTL_MODE
 	select IRQ_MSI_LIB if X86
 	help
 	  Select this option to run Linux as a Hyper-V client operating
@@ -61,6 +60,7 @@ config HYPERV_VMBUS
 	tristate "Microsoft Hyper-V VMBus driver"
 	depends on HYPERV
 	default HYPERV
+	select SYSFB if EFI && !HYPERV_VTL_MODE
 	help
 	  Select this option to enable Hyper-V Vmbus driver.
 



^ permalink raw reply

* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Bjorn Helgaas @ 2026-05-12 17:41 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device ID as a parameter. This device ID refers
> to that specific device during the lifetime of passthru.
> 
> An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> device is slightly different in that it uses the hv_pcibus_device info
> for building it to make sure it matches exactly what the hypervisor
> expects. This VMBus based device ID is needed when attaching devices in
> an L1VH based guest VM. Add a function to build and export it. Before
> building it, a check is done to make sure the device is a valid VMBus
> device.
> 
> In remaining cases, PCI device ID is used. So, also make PCI device ID
> build function hv_build_devid_type_pci() public.

The subject line should start with a verb to help say what this patch
does.  I guess it's something about adding a function to build the
device IDs expected by the hypervisor?  Would be good to include the
function name too, e.g., something like this:

  PCI: hv: Add foo() to build hypervisor device IDs

^ permalink raw reply

* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: Stanislav Kinsburskii @ 2026-05-12 16:18 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
	longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
	linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <563bb216-c270-4711-adda-b91484af40dc@kernel.org>

On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
> 
> > +	for (; addr < end; addr += PAGE_SIZE) {
> > +		vm_fault_t ret;
> > +
> > +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> > +
> > +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> > +			/*
> > +			 * The mmap lock has been dropped by the fault handler.
> > +			 * Record the failing address and signal lock-drop to
> > +			 * the caller.
> > +			 */
> > +			*hmm_vma_walk->locked = 0;
> > +			hmm_vma_walk->last = addr;
> > +			return -EAGAIN;
> 
> 
> Okay, so we'll return straight from hmm_vma_fault() to
> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
> 
> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
> requires the vma in there.
> 

It looks like a caller can provide a post_vma callback in mm_walk_ops. I
missed that case here. This callback cannot be supported by this change.
I will update the patch.

> 
> Note: am I wrong, or is hmm_vma_fault() really always called with
> required_fault=true?
> 

No, hmm_pte_need_fault can return false.

> > +		}
> > +
> > +		if (ret & VM_FAULT_ERROR)
> >  			return -EFAULT;
> > +	}
> >  	return -EBUSY;
> >  }
> >  
> > @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> >  	if (required_fault) {
> >  		int ret;
> >  
> > +		/*
> > +		 * Faulting hugetlb pages on the unlockable path is not
> > +		 * supported. The walk framework holds hugetlb_vma_lock_read
> > +		 * which must be dropped before handle_mm_fault, but if the
> > +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> > +		 * be freed and the walk framework's unconditional unlock
> > +		 * becomes a use-after-free.
> > +		 */
> > +		if (hmm_vma_walk->locked)
> > +			return -EFAULT;
> 
> Just because it's unlockable doesn't mean that you must unlock. Can't this be
> kept working as is, just simulating here as if it would not be unlockable?
> 

I’m not sure how to implement this. The walk_page_range code expects the
hugetlb VMA to still be read-locked when we return from
hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
be gone?

I added a note in the docs. Whoever tackles this will likely need to
either rework `walk_page_range` to handle the case where the VMA is
gone, or use a different approach.

Do you have any other suggestions on how to implement it?

Thanks,
Stanislav

> 
> -- 
> Cheers,
> 
> David

^ permalink raw reply

* Re: [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor
From: Anirudh Rayabharam @ 2026-05-12 12:46 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <agH17u_h2ljX5sUb@skinsburskii.localdomain>

On Mon, May 11, 2026 at 08:29:50AM -0700, Stanislav Kinsburskii wrote:
> On Mon, May 11, 2026 at 02:26:45PM +0000, Anirudh Rayabharam wrote:
> > On Thu, May 07, 2026 at 03:44:32PM +0000, Stanislav Kinsburskii wrote:
> > > mshv_partition_ioctl_create_vp() called anon_inode_getfd() before
> > > publishing the new VP into partition->pt_vp_array.  anon_inode_getfd()
> > > includes fd_install(), so the fd was live in current->files before the
> > > publish ran.
> > > 
> > > A concurrent MSHV_RUN_VP ioctl on that fd does not serialise against the
> > > in-progress MSHV_CREATE_VP — it takes vp->vp_mutex, not the partition
> > > mutex.  Once the VP starts running and traps, mshv_intercept_isr() can look
> > > up partition->pt_vp_array[vp_index] and observe NULL, silently dropping the
> > > intercept message.
> > > 
> > > Split the fd creation: reserve an fd with get_unused_fd_flags(), create the
> > > file with anon_inode_getfile(), publish the VP via smp_store_release(), and
> > > finally call fd_install() as the userspace-visibility commit point.
> > > 
> > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   29 ++++++++++++++++++++++-------
> > >  1 file changed, 22 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index e32f6e0f9f637..1c18d1c1f7947 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1142,6 +1142,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> > >  	struct mshv_vp *vp;
> > >  	struct page *intercept_msg_page, *register_page, *ghcb_page;
> > >  	struct hv_stats_page *stats_pages[2];
> > > +	struct file *file;
> > > +	int fd;
> > >  	long ret;
> > >  
> > >  	if (copy_from_user(&args, arg, sizeof(args)))
> > > @@ -1214,14 +1216,18 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> > >  	if (ret)
> > >  		goto put_partition;
> > >  
> > > -	/*
> > > -	 * Keep anon_inode_getfd last: it installs fd in the file struct and
> > > -	 * thus makes the state accessible in user space.
> > > -	 */
> > > -	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
> > > -			       O_RDWR | O_CLOEXEC);
> > 
> > Why not just move this anon_inode_getfd() after the smp_store_release()
> > call?
> > 
> 
> Because anon_inode_getfd() can still fail at the fd-table step after the
> VP is already visible to lockless ISR readers, and unpublishing it would
> require a grace-period wait under pt_mutex while ISRs may have already
> observed and acted on the doomed VP. The split form keeps every failable
> step before the publish, so failure paths stay simple and the publish
> itself cannot fail.

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH net-next v8 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
From: Paolo Abeni @ 2026-05-12 11:36 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Leon Romanovsky,
	Haiyang Zhang, K . Y . Srinivasan, Wei Liu, Dexuan Cui,
	shradhagupta
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260508221202.15725-4-longli@microsoft.com>

On 5/9/26 12:11 AM, Long Li wrote:
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 4673ff62e6d9..78cb89c46ff3 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1618,6 +1618,164 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
>  	return IRQ_HANDLED;
>  }
>  
> +void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
> +{
> +	struct pci_dev *dev = to_pci_dev(gc->dev);
> +	struct msi_map irq_map;
> +	struct gdma_irq_context *gic;
> +	int irq;

Since a new revision is needed, please fix the reverse christmas tree
above and elsewhere, thanks!

/P


^ permalink raw reply

* Re: [PATCH net-next v8 1/6] net: mana: Create separate EQs for each vPort
From: Paolo Abeni @ 2026-05-12 11:34 UTC (permalink / raw)
  To: longli
  Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
	leon, haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260508221202.15725-2-longli@microsoft.com>

From: AI Reviewer <ai@example.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
Subject: net: mana: Create separate EQs for each vPort

This patch moves EQ ownership from struct mana_context to
struct mana_port_context, creates per-vPort EQs in mana_alloc_queues()
and mana_ib_cfg_vport(), and exports mana_create_eq()/mana_destroy_eq()
for use by the MANA RDMA driver.

A few questions below about the new lifetime and ordering rules for
mpc->eqs.

> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..8000ab6e8beb 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
>  	pd->vport_use_count--;
>  	WARN_ON(pd->vport_use_count < 0);
>  
> -	if (!pd->vport_use_count)
> +	if (!pd->vport_use_count) {
> +		mana_destroy_eq(mpc);
>  		mana_uncfg_vport(mpc);
> +	}
>  
>  	mutex_unlock(&pd->vport_mutex);
>  }

[High]
After this change, mpc->eqs lifetime is governed by pd->vport_use_count,
and only raw QPs call mana_ib_cfg_vport()/mana_ib_uncfg_vport().  RSS
QPs in mana_ib_create_qp_rss() read mpc->eqs and latch eq->eq->id into
the HW CQ via cq_spec.attached_eq, but they do not bump
pd->vport_use_count.

So if a user creates a raw QP on a PD (use_count=1, mpc->eqs allocated),
then creates an RSS QP on the same PD whose CQs now reference those EQ
ids, and finally destroys the raw QP first, this path runs with
use_count dropping to 0 and calls mana_destroy_eq(mpc) -> kfree(apc->eqs)
and mana_gd_destroy_queue() for each EQ while the surviving RSS QP's HW
CQs are still attached to those EQs.

Is there an intended rule that enforces destruction ordering between raw
and RSS QPs on the same PD, or should RSS QPs also participate in the
refcount that guards mpc->eqs?

> @@ -55,15 +57,22 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  		return err;
>  	}
>  
> -	mutex_unlock(&pd->vport_mutex);
>  
>  	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
>  	pd->tx_vp_offset = mpc->tx_vp_offset;
> +	err = mana_create_eq(mpc);
> +	if (err) {
> +		mana_uncfg_vport(mpc);
> +		pd->vport_use_count--;
> +	}
>  
> -	ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> -		  mpc->port_handle, pd->pdn, doorbell_id);
> +	mutex_unlock(&pd->vport_mutex);
>  
> -	return 0;
> +	if (!err)
> +		ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> +			  mpc->port_handle, pd->pdn, doorbell_id);
> +
> +	return err;
>  }

[Low]
pd->tx_shortform_allowed and pd->tx_vp_offset are written before
mana_create_eq() is attempted.  On the mana_create_eq() failure branch,
mana_uncfg_vport(mpc) is called and pd->vport_use_count is decremented
back to 0, but pd->tx_shortform_allowed/pd->tx_vp_offset are left set to
the values derived from a vport whose HW-side configuration has been
reverted.  Should these fields only be committed to pd after
mana_create_eq() succeeds?

> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134..6167742df9d2 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -171,7 +171,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  		cq_spec.gdma_region = cq->queue.gdma_region;
>  		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
>  		cq_spec.modr_ctx_id = 0;
> -		eq = &mpc->ac->eqs[cq->comp_vector];
> +		/* EQs are created when a raw QP configures the vport.
> +		 * A raw QP must be created before creating rwq_ind_tbl.
> +		 */
> +		if (!mpc->eqs) {
> +			ret = -EINVAL;
> +			i--;
> +			goto fail;
> +		}
> +		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>  		cq_spec.attached_eq = eq->eq->id;

[Critical]
mana_ib_create_qp_rss() does not acquire pd->vport_mutex and does not
touch pd->vport_use_count.  The NULL check above reads mpc->eqs without
synchronization; immediately after it, the same mpc->eqs is dereferenced
and eq->eq->id is latched into cq_spec.attached_eq.

Can the following sequence occur?

  thread A (creating RSS QP on pd)        thread B (destroying last raw QP on pd)
  mana_ib_create_qp_rss()
    if (!mpc->eqs)  /* non-NULL */
    ...                                   mana_ib_uncfg_vport()
                                            mutex_lock(&pd->vport_mutex);
                                            if (!--pd->vport_use_count)
                                              mana_destroy_eq(mpc);
                                                kfree(apc->eqs);
    eq = &mpc->eqs[... % mpc->num_queues];
    cq_spec.attached_eq = eq->eq->id;     /* reads freed memory */

Is there something else serializing these paths that this reviewer is
missing?

[Medium]
The off-by-one on the error path looks inconsistent with the other fail
sites in this function.  Every other goto fail in this loop leaves i
alone and relies on:

  while (i-- > 0) {
      ibwq = ind_tbl->ind_tbl[i];
      ...
      mana_ib_remove_cq_cb(mdev, cq);
      mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
  }

Here the new branch does ret = -EINVAL; i--; goto fail, which drops one
extra iteration of cleanup.  If this check ever fires at i > 0 (for
example if the lifetime issue above is fixed by making mpc->eqs
observable as NULL mid-loop), the WQ object and installed CQ callback
created in the previous iteration are not torn down.  Should this match
the other fail sites and omit the i--?

[Medium]
Before this patch ac->eqs was populated unconditionally in mana_probe()
and existed for the lifetime of the device, so mana_ib_create_qp_rss()
and mana_ib_create_qp_raw() always had EQs available regardless of
whether any netdev was up or any raw QP had been created.

After this patch, mpc->eqs is populated only by mana_alloc_queues()
(netdev open) or by mana_ib_cfg_vport() (which is only invoked from
mana_ib_create_qp_raw()).  An RDMA user who creates an RSS QP on a PD
while the netdev is down and without a prior raw QP on the same PD will
now see -EINVAL from this new check, where the same call succeeded
before.

The commit message describes this work as preparatory and does not
mention this user-visible ordering requirement.  Should this semantic
change be called out in the commit log, and should the implicit "a raw
QP must be created first" rule be documented somewhere user-facing?

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index b2faa7cf398f..131a997e1083 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1615,78 +1615,83 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
>  }
>  EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
>  
> -static void mana_destroy_eq(struct mana_context *ac)
> +void mana_destroy_eq(struct mana_port_context *apc)
>  {
> +	struct mana_context *ac = apc->ac;
>  	struct gdma_context *gc = ac->gdma_dev->gdma_context;
>  	struct gdma_queue *eq;
>  	int i;
>  
> -	if (!ac->eqs)
> +	if (!apc->eqs)
>  		return;
>  
> -	debugfs_remove_recursive(ac->mana_eqs_debugfs);
> -	ac->mana_eqs_debugfs = NULL;
> +	debugfs_remove_recursive(apc->mana_eqs_debugfs);
> +	apc->mana_eqs_debugfs = NULL;
>  
> -	for (i = 0; i < gc->max_num_queues; i++) {
> -		eq = ac->eqs[i].eq;
> +	for (i = 0; i < apc->num_queues; i++) {
> +		eq = apc->eqs[i].eq;
>  		if (!eq)
>  			continue;
>  
>  		mana_gd_destroy_queue(gc, eq);
>  	}

[High]
The loop bound is now apc->num_queues rather than the immutable
gc->max_num_queues.  mana_create_eq() allocates apc->eqs sized to
apc->num_queues, and mana_destroy_eq() walks the same count.

When RDMA has called mana_ib_cfg_vport() -> mana_create_eq() while the
netdev is down, apc->num_queues can still be mutated underneath it.
mana_detach(false) skips mana_dealloc_queues() when port_st_save is
false, and then mana_set_channels() (ethtool -L) writes
apc->num_queues = new_count while mpc->eqs is already held by the RDMA
side.  mana_attach() then also skips mana_alloc_queues().

When the RDMA user later destroys the last raw QP and this function
runs, i iterates 0..new_count-1.  If new_count is larger than the count
used at kzalloc_objs() time, apc->eqs is read out of bounds and
potentially passed to mana_gd_destroy_queue().  If new_count is smaller,
the tail EQs (real gdma_queue objects with MSI-X and DMA regions) are
leaked.

Before this patch both sides used gc->max_num_queues so no mismatch was
possible.  Should apc->num_queues be snapshotted into apc at
mana_create_eq() time (or should changing num_queues be forbidden while
mpc->eqs is populated by the RDMA path)?

> -static int mana_create_eq(struct mana_context *ac)
> +int mana_create_eq(struct mana_port_context *apc)
>  {
> -	struct gdma_dev *gd = ac->gdma_dev;
> +	struct gdma_dev *gd = apc->ac->gdma_dev;
>  	struct gdma_context *gc = gd->gdma_context;
>  	struct gdma_queue_spec spec = {};
>  	int err;
>  	int i;
>  
> -	ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
> -	if (!ac->eqs)
> +	WARN_ON(apc->eqs);
> +	apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
> +	if (!apc->eqs)
>  		return -ENOMEM;

[Low]
WARN_ON(apc->eqs) here is advisory only; execution continues into
apc->eqs = kzalloc_objs(...), which unconditionally overwrites any
non-NULL pointer.  If the invariant is ever violated by a future caller
or error path, the prior EQ array and its gdma_queue objects are
orphaned (kernel memory and firmware EQ state leaked) with only a WARN
splat as the signal.

Would "if (WARN_ON(apc->eqs)) return -EEXIST;" express the intent more
safely?
-- 
This is an AI-generated review.

^ permalink raw reply

* Re: [PATCH V3 02/11] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: Souradeep Chakrabarti @ 2026-05-12 10:27 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-3-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:02:50PM -0700, Mukesh R wrote:
> Make cosmetic changes:
>  o Rename struct pci_dev *dev to *pdev since there are cases of
>    struct device *dev in the file and all over the kernel
>  o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
>    of building different types of device IDs
>  o Fix checkpatch.pl issues with return and extraneous printk
>  o Replace spaces with tabs
>  o Rename struct hv_devid *xxx to struct hv_devid *hv_devid given code
>    paths involve many types of device IDs
>  o Fix indentation in a large if block by using goto.
> 
> There are no functional changes.
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c | 198 +++++++++++++++++++-----------------
>  1 file changed, 104 insertions(+), 94 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 365e364268d9..b3ad50a874dc 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0
> -
>  /*
>   * Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
>   *
> @@ -14,8 +13,8 @@
>  #include <linux/irqchip/irq-msi-lib.h>
>  #include <asm/mshyperv.h>
>  
> -static int hv_map_interrupt(union hv_device_id device_id, bool level,
> -		int cpu, int vector, struct hv_interrupt_entry *entry)
> +static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
> +		int cpu, int vector, struct hv_interrupt_entry *ret_entry)
>  {
>  	struct hv_input_map_device_interrupt *input;
>  	struct hv_output_map_device_interrupt *output;
> @@ -32,7 +31,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
>  	intr_desc = &input->interrupt_descriptor;
>  	memset(input, 0, sizeof(*input));
>  	input->partition_id = hv_current_partition_id;
> -	input->device_id = device_id.as_uint64;
> +	input->device_id = hv_devid.as_uint64;
>  	intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
>  	intr_desc->vector_count = 1;
>  	intr_desc->target.vector = vector;
> @@ -44,7 +43,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
>  
>  	intr_desc->target.vp_set.valid_bank_mask = 0;
>  	intr_desc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
> -	nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
> +	nr_bank = cpumask_to_vpset(&intr_desc->target.vp_set, cpumask_of(cpu));
>  	if (nr_bank < 0) {
>  		local_irq_restore(flags);
>  		pr_err("%s: unable to generate VP set\n", __func__);
> @@ -61,7 +60,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
>  
>  	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
>  			input, output);
> -	*entry = output->interrupt_entry;
> +	*ret_entry = output->interrupt_entry;
>  
>  	local_irq_restore(flags);
>  
> @@ -71,21 +70,19 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
>  	return hv_result_to_errno(status);
>  }
>  
> -static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
> +static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
>  {
>  	unsigned long flags;
>  	struct hv_input_unmap_device_interrupt *input;
> -	struct hv_interrupt_entry *intr_entry;
>  	u64 status;
>  
>  	local_irq_save(flags);
>  	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>  
>  	memset(input, 0, sizeof(*input));
> -	intr_entry = &input->interrupt_entry;
>  	input->partition_id = hv_current_partition_id;
>  	input->device_id = id;
> -	*intr_entry = *old_entry;
> +	input->interrupt_entry = *irq_entry;
>  
>  	status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
>  	local_irq_restore(flags);
> @@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
> +static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
> -	union hv_device_id dev_id;
> +	int pos;
> +	union hv_device_id hv_devid;
>  	struct rid_data data = {
>  		.bridge = NULL,
> -		.rid = PCI_DEVID(dev->bus->number, dev->devfn)
> +		.rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
>  	};
>  
> -	pci_for_each_dma_alias(dev, get_rid_cb, &data);
> +	pci_for_each_dma_alias(pdev, get_rid_cb, &data);
>  
> -	dev_id.as_uint64 = 0;
> -	dev_id.device_type = HV_DEVICE_TYPE_PCI;
> -	dev_id.pci.segment = pci_domain_nr(dev->bus);
> +	hv_devid.as_uint64 = 0;
> +	hv_devid.device_type = HV_DEVICE_TYPE_PCI;
> +	hv_devid.pci.segment = pci_domain_nr(pdev->bus);
>  
> -	dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
> -	dev_id.pci.bdf.device = PCI_SLOT(data.rid);
> -	dev_id.pci.bdf.function = PCI_FUNC(data.rid);
> -	dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
> +	hv_devid.pci.bdf.bus = PCI_BUS_NUM(data.rid);
> +	hv_devid.pci.bdf.device = PCI_SLOT(data.rid);
> +	hv_devid.pci.bdf.function = PCI_FUNC(data.rid);
> +	hv_devid.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
>  
> -	if (data.bridge) {
> -		int pos;
> +	if (data.bridge == NULL)
> +		goto out;
>  
> -		/*
> -		 * Microsoft Hypervisor requires a bus range when the bridge is
> -		 * running in PCI-X mode.
> -		 *
> -		 * To distinguish conventional vs PCI-X bridge, we can check
> -		 * the bridge's PCI-X Secondary Status Register, Secondary Bus
> -		 * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
> -		 * Specification Revision 1.0 5.2.2.1.3.
> -		 *
> -		 * Value zero means it is in conventional mode, otherwise it is
> -		 * in PCI-X mode.
> -		 */
> +	/*
> +	 * Microsoft Hypervisor requires a bus range when the bridge is
> +	 * running in PCI-X mode.
> +	 *
> +	 * To distinguish conventional vs PCI-X bridge, we can check
> +	 * the bridge's PCI-X Secondary Status Register, Secondary Bus
> +	 * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
> +	 * Specification Revision 1.0 5.2.2.1.3.
> +	 *
> +	 * Value zero means it is in conventional mode, otherwise it is
> +	 * in PCI-X mode.
> +	 */
>  
> -		pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> -		if (pos) {
> -			u16 status;
> +	pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> +	if (pos) {
> +		u16 status;
>  
> -			pci_read_config_word(data.bridge, pos +
> -					PCI_X_BRIDGE_SSTATUS, &status);
> +		pci_read_config_word(data.bridge, pos + PCI_X_BRIDGE_SSTATUS,
> +				     &status);
>  
> -			if (status & PCI_X_SSTATUS_FREQ) {
> -				/* Non-zero, PCI-X mode */
> -				u8 sec_bus, sub_bus;
> +		if (status & PCI_X_SSTATUS_FREQ) {
> +			/* Non-zero, PCI-X mode */
> +			u8 sec_bus, sub_bus;
>  
> -				dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
> +			hv_devid.pci.source_shadow =
> +					     HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
>  
> -				pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
> -				dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
> -				pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
> -				dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
> -			}
> +			pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS,
> +					     &sec_bus);
> +			hv_devid.pci.shadow_bus_range.secondary_bus = sec_bus;
> +			pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS,
> +					     &sub_bus);
> +			hv_devid.pci.shadow_bus_range.subordinate_bus = sub_bus;
>  		}
>  	}
>  
> -	return dev_id;
> +out:
> +	return hv_devid;
>  }
>  
> -/**
> - * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> +/*
> + * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
>   * @data:      Describes the IRQ
>   * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
>   *
> @@ -188,22 +189,23 @@ int hv_map_msi_interrupt(struct irq_data *data,
>  {
>  	struct irq_cfg *cfg = irqd_cfg(data);
>  	struct hv_interrupt_entry dummy;
> -	union hv_device_id device_id;
> +	union hv_device_id hv_devid;
>  	struct msi_desc *msidesc;
> -	struct pci_dev *dev;
> +	struct pci_dev *pdev;
>  	int cpu;
>  
>  	msidesc = irq_data_get_msi_desc(data);
> -	dev = msi_desc_to_pci_dev(msidesc);
> -	device_id = hv_build_pci_dev_id(dev);
> +	pdev = msi_desc_to_pci_dev(msidesc);
> +	hv_devid = hv_build_devid_type_pci(pdev);
>  	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>  
> -	return hv_map_interrupt(device_id, false, cpu, cfg->vector,
> +	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
>  				out_entry ? out_entry : &dummy);
>  }
>  EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
>  
> -static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
> +static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
> +			     struct msi_msg *msg)
>  {
>  	/* High address is always 0 */
>  	msg->address_hi = 0;
> @@ -211,17 +213,19 @@ static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi
>  	msg->data = entry->msi_entry.data.as_uint32;
>  }
>  
> -static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
> +static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> +				  struct hv_interrupt_entry *irq_entry);
> +
>  static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  {
>  	struct hv_interrupt_entry *stored_entry;
>  	struct irq_cfg *cfg = irqd_cfg(data);
>  	struct msi_desc *msidesc;
> -	struct pci_dev *dev;
> +	struct pci_dev *pdev;
>  	int ret;
>  
>  	msidesc = irq_data_get_msi_desc(data);
> -	dev = msi_desc_to_pci_dev(msidesc);
> +	pdev = msi_desc_to_pci_dev(msidesc);
>  
>  	if (!cfg) {
>  		pr_debug("%s: cfg is NULL", __func__);
> @@ -240,7 +244,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  		stored_entry = data->chip_data;
>  		data->chip_data = NULL;
>  
> -		ret = hv_unmap_msi_interrupt(dev, stored_entry);
> +		ret = hv_unmap_msi_interrupt(pdev, stored_entry);
>  
>  		kfree(stored_entry);
>  
> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	}
>  
>  	stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
> -	if (!stored_entry) {
> -		pr_debug("%s: failed to allocate chip data\n", __func__);
> +	if (!stored_entry)
>  		return;
> -	}
>  
>  	ret = hv_map_msi_interrupt(data, stored_entry);
>  	if (ret) {
> @@ -262,18 +264,21 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  
>  	data->chip_data = stored_entry;
>  	entry_to_msi_msg(data->chip_data, msg);
> -
> -	return;
>  }
>  
> -static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
> +static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> +				  struct hv_interrupt_entry *irq_entry)
>  {
> -	return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry);
> +	union hv_device_id hv_devid;
> +
> +	hv_devid = hv_build_devid_type_pci(pdev);
> +	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
>  }
>  
> -static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
> +/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
> +static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
>  {
> -	struct hv_interrupt_entry old_entry;
> +	struct hv_interrupt_entry irq_entry;
>  	struct msi_msg msg;
>  
>  	if (!irqd->chip_data) {
> @@ -281,13 +286,13 @@ static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
>  		return;
>  	}
>  
> -	old_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
> -	entry_to_msi_msg(&old_entry, &msg);
> +	irq_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
> +	entry_to_msi_msg(&irq_entry, &msg);
>  
>  	kfree(irqd->chip_data);
>  	irqd->chip_data = NULL;
>  
> -	(void)hv_unmap_msi_interrupt(dev, &old_entry);
> +	(void)hv_unmap_msi_interrupt(pdev, &irq_entry);
>  }
>  
>  /*
> @@ -302,7 +307,8 @@ static struct irq_chip hv_pci_msi_controller = {
>  };
>  
>  static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> -				 struct irq_domain *real_parent, struct msi_domain_info *info)
> +				 struct irq_domain *real_parent,
> +				 struct msi_domain_info *info)
>  {
>  	struct irq_chip *chip = info->chip;
>  
> @@ -317,7 +323,8 @@ static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
>  }
>  
>  #define HV_MSI_FLAGS_SUPPORTED	(MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
> -#define HV_MSI_FLAGS_REQUIRED	(MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
> +#define HV_MSI_FLAGS_REQUIRED	(MSI_FLAG_USE_DEF_DOM_OPS |	\
> +				 MSI_FLAG_USE_DEF_CHIP_OPS)
>  
>  static struct msi_parent_ops hv_msi_parent_ops = {
>  	.supported_flags	= HV_MSI_FLAGS_SUPPORTED,
> @@ -329,14 +336,14 @@ static struct msi_parent_ops hv_msi_parent_ops = {
>  	.init_dev_msi_info	= hv_init_dev_msi_info,
>  };
>  
> -static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs,
> -			       void *arg)
> +/* Allocate nr_irqs IRQs for the given irq domain */
> +static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq,
> +			       unsigned int nr_irqs, void *arg)
>  {
>  	/*
> -	 * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e. everything except
> -	 * entry_to_msi_msg() should be in here.
> +	 * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e.
> +	 *	 everything except entry_to_msi_msg() should be in here.
>  	 */
> -
>  	int ret;
>  
>  	ret = irq_domain_alloc_irqs_parent(d, virq, nr_irqs, arg);
> @@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
>  		return ret;
>  
>  	for (int i = 0; i < nr_irqs; ++i) {
> -		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
> -				    handle_edge_irq, NULL, "edge");
> +		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
> +				    NULL, handle_edge_irq, NULL, "edge");
>  	}
> +
>  	return 0;
>  }
>  
> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> +			       unsigned int nr_irqs)
>  {
>  	for (int i = 0; i < nr_irqs; ++i) {
>  		struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
> @@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
>  
>  		hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
>  	}
> +
>  	irq_domain_free_irqs_top(d, virq, nr_irqs);
>  }
>  
> @@ -394,25 +404,25 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
>  
>  int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
>  {
> -	union hv_device_id device_id;
> +	union hv_device_id hv_devid;
>  
> -	device_id.as_uint64 = 0;
> -	device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
> -	device_id.ioapic.ioapic_id = (u8)ioapic_id;
> +	hv_devid.as_uint64 = 0;
> +	hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
> +	hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
>  
> -	return hv_unmap_interrupt(device_id.as_uint64, entry);
> +	return hv_unmap_interrupt(hv_devid.as_uint64, entry);
>  }
>  EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
>  
>  int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
>  		struct hv_interrupt_entry *entry)
>  {
> -	union hv_device_id device_id;
> +	union hv_device_id hv_devid;
>  
> -	device_id.as_uint64 = 0;
> -	device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
> -	device_id.ioapic.ioapic_id = (u8)ioapic_id;
> +	hv_devid.as_uint64 = 0;
> +	hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
> +	hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
>  
> -	return hv_map_interrupt(device_id, level, cpu, vector, entry);
> +	return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
>  }
>  EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* Re: [PATCH V3 04/11] mshv: Declarations and definitions for VFIO-MSHV bridge device
From: Souradeep Chakrabarti @ 2026-05-12 10:26 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-5-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:02:52PM -0700, Mukesh R wrote:
> Add data structs needed by the subsequent patch that introduces a new
> module to implement VFIO-MSHV pseudo device.
> 
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h    | 19 +++++++++++++++++++
>  include/uapi/linux/mshv.h | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index a85c24dcc701..b9880d0bdc4d 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -227,6 +227,25 @@ struct port_table_info {
>  	};
>  };
>  
> +struct mshv_device {
> +	const struct mshv_device_ops *device_ops;
> +	struct mshv_partition *device_pt;
> +	void *device_private;
> +	struct hlist_node device_ptnode;
> +};
> +
> +struct mshv_device_ops {
> +	const char *device_name;
> +	long (*device_create)(struct mshv_device *dev);
> +	void (*device_release)(struct mshv_device *dev);
> +	long (*device_set_attr)(struct mshv_device *dev,
> +				struct mshv_device_attr *attr);
> +	long (*device_has_attr)(struct mshv_device *dev,
> +				struct mshv_device_attr *attr);
> +};
> +
> +extern struct mshv_device_ops mshv_vfio_device_ops;
> +
>  int mshv_update_routing_table(struct mshv_partition *partition,
>  			      const struct mshv_user_irq_entry *entries,
>  			      unsigned int numents);
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 32ff92b6342b..be6fe3ee8707 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -404,4 +404,34 @@ struct mshv_sint_mask {
>  /* hv_hvcall device */
>  #define MSHV_HVCALL_SETUP        _IOW(MSHV_IOCTL, 0x1E, struct mshv_vtl_hvcall_setup)
>  #define MSHV_HVCALL              _IOWR(MSHV_IOCTL, 0x1F, struct mshv_vtl_hvcall)
> +
> +/* Device passhthru */
> +#define MSHV_CREATE_DEVICE_TEST		1
> +
> +enum {
> +	MSHV_DEV_TYPE_VFIO,
> +	MSHV_DEV_TYPE_MAX,
> +};
> +
> +struct mshv_create_device {
> +	__u32	type;	     /* in: MSHV_DEV_TYPE_xxx */
> +	__u32	fd;	     /* out: device handle */
> +	__u32	flags;	     /* in: MSHV_CREATE_DEVICE_xxx */
> +};
> +
> +#define MSHV_DEV_VFIO_FILE      1
> +#define MSHV_DEV_VFIO_FILE_ADD	1
> +#define MSHV_DEV_VFIO_FILE_DEL	2
> +
> +struct mshv_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +/* Device fds created with MSHV_CREATE_DEVICE */
> +#define MSHV_SET_DEVICE_ATTR	_IOW(MSHV_IOCTL, 0x00, struct mshv_device_attr)
> +#define MSHV_HAS_DEVICE_ATTR	_IOW(MSHV_IOCTL, 0x01, struct mshv_device_attr)
> +
>  #endif
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* Re: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Souradeep Chakrabarti @ 2026-05-12 10:26 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-2-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:02:49PM -0700, Mukesh R wrote:
> This file actually implements irq remapping, so rename to more appropriate
> hyperv-irq.c. A new file to implement hyperv iommu will be introduced
> later.  Also, it should not be tied to HYPERV_IOMMU, but to CONFIG_HYPERV
> and IRQ_REMAP. The file already has #ifdef CONFIG_IRQ_REMAP.
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  MAINTAINERS                                    | 2 +-
>  drivers/iommu/Makefile                         | 2 +-
>  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
>  drivers/iommu/irq_remapping.c                  | 2 +-
>  4 files changed, 6 insertions(+), 6 deletions(-)
>  rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1cc0e12fe1f..f803a6a38fee 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
>  F:	drivers/hid/hid-hyperv.c
>  F:	drivers/hv/
>  F:	drivers/input/serio/hyperv-keyboard.c
> -F:	drivers/iommu/hyperv-iommu.c
> +F:	drivers/iommu/hyperv-irq.c
>  F:	drivers/net/ethernet/microsoft/
>  F:	drivers/net/hyperv/
>  F:	drivers/pci/controller/pci-hyperv-intf.c
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 0275821f4ef9..335ea77cced6 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> +obj-$(CONFIG_HYPERV) += hyperv-irq.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
> similarity index 99%
> rename from drivers/iommu/hyperv-iommu.c
> rename to drivers/iommu/hyperv-irq.c
> index 479103261ae6..d11076f906fb 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-irq.c
> @@ -8,6 +8,8 @@
>   * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
>   */
>  
> +#ifdef CONFIG_IRQ_REMAP
> +
>  #include <linux/types.h>
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
> @@ -24,8 +26,6 @@
>  
>  #include "irq_remapping.h"
>  
> -#ifdef CONFIG_IRQ_REMAP
> -
>  /*
>   * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
>   * Redirection Table. Hyper-V exposes one single IO-APIC and so define
> @@ -331,4 +331,4 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>  	.free = hyperv_root_irq_remapping_free,
>  };
>  
> -#endif
> +#endif  /* CONFIG_IRQ_REMAP */
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index c2443659812a..41bf65e4ea88 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
>  	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
>  		 amd_iommu_irq_ops.prepare() == 0)
>  		remap_ops = &amd_iommu_irq_ops;
> -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
> +	else if (IS_ENABLED(CONFIG_HYPERV) &&
>  		 hyperv_irq_remap_ops.prepare() == 0)
>  		remap_ops = &hyperv_irq_remap_ops;
>  	else
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* Re: [PATCH V1 2/3] hyperv: Implement irq remap for passthru devices
From: Souradeep Chakrabarti @ 2026-05-12  9:29 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, linux-hyperv, linux-kernel, iommu,
	linux-pci, linux-arch
In-Reply-To: <20260512021242.1679786-3-mrathor@linux.microsoft.com>

On Mon, May 11, 2026 at 07:12:41PM -0700, Mukesh R wrote:
> Implement interrupt remapping for direct attached and domain attached
> devices on Hyper-V.
> 
> Please note there are few constraints when it comes to mapping device
> interrupts on Hyper-V. For example, the hypervisor will not allow mapping
> device interrupts to root if the device is a direct attached device. Since
> the target guest cpu and vector info is not available during the initial
> VFIO irq setup, we work around by skipping this initial map. Then later
> during irqbypass trigger, when both guest target cpu vector are available,
> we do the map in the hypervisor, update the device, and enable the
> interrupt vector on the device. Rather than special case direct attached,
> we do same for domain attached also. This implies irqbypass is required
> for MSHV pci device passthru. Also noteworthy is that the hypervisor
> will automatically setup any direct hw injection like posted interrupts.
> 
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c         |  18 +-
>  drivers/hv/mshv_eventfd.c           | 423 +++++++++++++++++++++++++++-
>  drivers/iommu/hyperv-iommu-root.c   |  14 +
>  drivers/pci/controller/pci-hyperv.c |  10 +
>  include/asm-generic/mshyperv.h      |   3 +
>  5 files changed, 464 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 8780573a4332..02f9a889c014 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -197,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
>  
>  	msidesc = irq_data_get_msi_desc(data);
>  	pdev = msi_desc_to_pci_dev(msidesc);
> -	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> +	hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
>  	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>  
>  	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> @@ -233,6 +233,20 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  		return;
>  	}
>  
> +	/*
> +	 * For direct attached devices, we cannot map interrupts in the
> +	 * hypervisor because it will not allow it until we have guest target
> +	 * vcpu and vector. So defer it until irqbypass. Also, do the same
> +	 * for domain attached devices for simplicity.
> +	 */
> +	if (hv_pcidev_is_pthru_dev(pdev)) {
> +		if (data->chip_data)
> +			entry_to_msi_msg(data->chip_data, msg);
> +		else
> +			memset(msg, 0, sizeof(struct msi_msg));
> +		return;
> +	}
> +
>  	if (data->chip_data) {
>  		/*
>  		 * This interrupt is already mapped. Let's unmap first.
> @@ -272,7 +286,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>  {
>  	union hv_device_id hv_devid;
>  
> -	hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> +	hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
>  	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
>  }
>  
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 90959f639dc3..1f5c1e9ee9b7 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -7,7 +7,6 @@
>   *
>   * All credits to kvm developers.
>   */
> -
>  #include <linux/syscalls.h>
>  #include <linux/wait.h>
>  #include <linux/poll.h>
> @@ -15,7 +14,8 @@
>  #include <linux/list.h>
>  #include <linux/workqueue.h>
>  #include <linux/eventfd.h>
> -
> +#include <linux/pci.h>
> +#include <linux/vfio_pci_core.h>
>  #if IS_ENABLED(CONFIG_X86_64)
>  #include <asm/apic.h>
>  #endif
> @@ -27,6 +27,377 @@
>  
>  static struct workqueue_struct *irqfd_cleanup_wq;
>  
> +#if IS_ENABLED(CONFIG_X86_64)
> +
> +static int mshv_parse_mshv_irqfd(struct mshv_irqfd *irqfd,
> +				 struct pci_dev **out_pdev,
> +				 struct irq_data **out_irqdata)
> +{
> +	struct irq_bypass_producer *prod;
> +	struct msi_desc *msidesc;
> +	struct irq_data *irqdata;
> +
> +	if (irqfd == NULL || irqfd->irqfd_bypass_prod == NULL)
> +		return -ENODEV;
> +
> +	prod = irqfd->irqfd_bypass_prod;
> +
> +	irqdata = irq_get_irq_data(prod->irq);
> +	if (irqdata == NULL) {
> +		pr_err("Hyper-V: irqbypass fail, no irqdata. irq:0x%x\n",
> +		       prod->irq);
> +		return -EINVAL;
> +	}
> +	*out_irqdata = irqdata;
> +
> +	msidesc = irq_data_get_msi_desc(irqdata);
> +	if (msidesc == NULL) {
> +		pr_err("Hyper-V: irqbypass msi fail. irq:0x%x\n", prod->irq);
> +		return -EINVAL;
> +	}
> +
> +	*out_pdev = msi_desc_to_pci_dev(msidesc);
> +	if (*out_pdev == NULL) {
> +		pr_err("Hyper-V: mshv_irqfd parse fail. irq:0x%x\n", prod->irq);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Must be called with interrupts disabled */
> +static int hv_vpset_from_hyp_disabled(
> +			struct hv_input_get_vp_set_from_mda *input,
> +			union hv_output_get_vp_set_from_mda *output,
> +			struct mshv_lapic_irq *lapic_irq, u64 partid)
> +{
> +	u64 status;
> +
> +	memset(input, 0, sizeof(*input));
> +	input->target_partid = partid;
> +	input->dest_address = lapic_irq->lapic_apic_id;
> +	input->input_vtl = 0;
> +	input->destmode_logical = lapic_irq->lapic_control.logical_dest_mode;
> +
> +	status = hv_do_hypercall(HVCALL_GET_VPSET_FROM_MDA, input, output);
> +	if (!hv_result_success(status)) {
> +		hv_status_err(status, "apicid:0x%llx dest:0x%x\n",
> +			      lapic_irq->lapic_apic_id,
> +			      lapic_irq->lapic_control.logical_dest_mode);
> +	}
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* Returns number of banks copied, -errno in case of error */
> +static int hv_copy_vpset(struct hv_vpset *dest, struct hv_vpset *src)
> +{
> +	u64 bank_mask;
> +	int banks, tot_banks = hv_max_vp_index / HV_VCPUS_PER_SPARSE_BANK;
> +
> +	if (tot_banks >= HV_MAX_SPARSE_VCPU_BANKS)
> +		return -EINVAL;
> +
> +	dest->format = src->format;
> +	dest->valid_bank_mask = src->valid_bank_mask;
> +	bank_mask = src->valid_bank_mask;
> +	for (banks = 0; banks <= tot_banks; banks++) {
> +		if (bank_mask == 0)
> +			break;
> +
> +		if (bank_mask & 1)
> +			dest->bank_contents[banks] = src->bank_contents[banks];
> +		bank_mask = bank_mask >> 1;
> +	}
> +
> +	return banks;
> +}
> +
> +static int mshv_map_device_interrupt(u64 ptid, union hv_device_id hv_devid,
> +				     struct mshv_lapic_irq *ginfo,
> +				     struct hv_interrupt_entry *ret_entry,
> +				     u64 *ret_status)
> +{
> +	struct hv_input_map_device_interrupt *irq_input;
> +	struct hv_output_map_device_interrupt *irq_output;
> +	struct hv_device_interrupt_descriptor *intdesc;
> +	struct hv_input_get_vp_set_from_mda *mda_input;
> +	union hv_output_get_vp_set_from_mda *mda_output;
> +	ulong flags;
> +	u64 status;
> +	int rc, var_size;
> +
> +	*ret_status = U64_MAX;
> +	local_irq_save(flags);
> +
> +	mda_input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	mda_output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	/*
> +	 * Map Device Interrupt hcall needs vp set based on vp indexes used
> +	 * during vp creation. Here we have lapic-id of the vp only. Easiest
> +	 * is to just ask the hypervisor for the vp set matching the lapic-id.
> +	 */
> +	rc = hv_vpset_from_hyp_disabled(mda_input, mda_output, ginfo, ptid);
> +	if (rc)
> +		goto out;	/* error already printed */
> +
> +	irq_input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	irq_output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +	memset(irq_input, 0, sizeof(*irq_input));
> +
> +	irq_input->partition_id = ptid;
> +	irq_input->device_id = hv_devid.as_uint64;
> +
> +	intdesc = &irq_input->interrupt_descriptor;
> +	intdesc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
> +	intdesc->vector_count = 1;
> +	intdesc->target.vector = ginfo->lapic_vector;
> +	intdesc->trigger_mode = HV_INTERRUPT_TRIGGER_MODE_EDGE;
> +
> +	intdesc->target.vp_set.valid_bank_mask = 0;
> +	intdesc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
> +	intdesc->target.flags = HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> +	rc = hv_copy_vpset(&intdesc->target.vp_set, &mda_output->target_vpset);
> +	if (rc <= 0) {
> +		pr_err("Hyper-V: ptid %lld - (irq)vpset copy failed (%d)\n",
> +		       ptid, rc);
> +		goto out;
> +	}
> +
> +	/*
> +	 * var-sized hcall: var-size starts after vp_mask (thus vp_set.format
> +	 * does not count, but vp_set.valid_bank_mask does).
> +	 */
> +	var_size = rc + 1;
> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
> +				     irq_input, irq_output);
> +	*ret_entry = irq_output->interrupt_entry;
> +	local_irq_restore(flags);
> +
> +	rc = 0;
> +	if (!hv_result_success(status)) {
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
> +			hv_status_err(status, "pt:%lld vec:%d lapic-id:%lld\n",
> +			      ptid, ginfo->lapic_vector, ginfo->lapic_apic_id);
> +		*ret_status = status;
> +		rc = hv_result_to_errno(status);
> +	}
> +
> +	return rc;
> +
> +out:
> +	local_irq_restore(flags);
> +	return rc;
> +
> +}
> +
> +static int mshv_unmap_device_interrupt(union hv_device_id hv_devid,
> +				       struct hv_interrupt_entry *irq_entry)
> +{
> +	unsigned long flags;
> +	struct hv_input_unmap_device_interrupt *input;
> +	u64 status;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL)
> +		input->partition_id = hv_get_current_partid();
> +	else
> +		input->partition_id = hv_current_partition_id;
> +
> +	input->device_id = hv_devid.as_uint64;
> +	input->interrupt_entry = *irq_entry;
> +
> +	status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int mshv_chk_unmap_irq(union hv_device_id hv_devid,
> +			      struct irq_data *irqdata)
> +{
> +	int rc;
> +
> +	if (irqdata->chip_data == NULL)
> +		return 0;
> +
> +	rc = mshv_unmap_device_interrupt(hv_devid, irqdata->chip_data);
> +	if (rc)
> +		return rc;
> +
> +	kfree(irqdata->chip_data);
> +	irqdata->chip_data = NULL;
> +
> +	return 0;
> +}
> +
> +/*
> + * Synchronize device update with VFIO.
> + *    See: vfio_pci_memory_lock_and_enable()
> + */
> +static u16 mshv_pci_memory_lock_and_enable(struct vfio_pci_core_device *cdev)
> +{
> +	u16 cmd;
> +
> +	down_write(&cdev->memory_lock);
> +	pci_read_config_word(cdev->pdev, PCI_COMMAND, &cmd);
> +	if (!(cmd & PCI_COMMAND_MEMORY))
> +		pci_write_config_word(cdev->pdev, PCI_COMMAND,
> +				      cmd | PCI_COMMAND_MEMORY);
> +	return cmd;
> +}
> +
> +static void mshv_pci_memory_unlock_and_restore(
> +					struct vfio_pci_core_device *cdev,
> +					u16 cmd)
> +{
> +	pci_write_config_word(cdev->pdev, PCI_COMMAND, cmd);
> +	up_write(&cdev->memory_lock);
> +}
> +
> +static void mshv_make_device_usable(struct pci_dev *pdev, int vector,
> +				    struct hv_interrupt_entry *hv_entry)
> +{
> +	int lirq;
> +	struct msi_msg msimsg;
> +	struct irq_data *irqdata, *parent;
> +	u16 pcicmd;
> +	struct vfio_pci_core_device *coredev = dev_get_drvdata(&pdev->dev);
> +
> +	if (pdev->dev.driver == NULL ||
> +	    strcmp(pdev->dev.driver->name, "vfio-pci") != 0) {
> +		pr_err("Hyper-V: irqbypass: non vfio device %s\n",
> +		       pci_name(pdev));
> +		return;
> +	}
> +	if (coredev == NULL) {
> +		pr_err("Hyper-V: irqbypass: null vfio device for %s\n",
> +		       pci_name(pdev));
> +		return;
> +	}
> +
> +	if (hv_entry->source != HV_INTERRUPT_SOURCE_MSI) {
> +		pr_err("Hyper-V: %s irq source not msi\n", pci_name(pdev));
> +		return;
> +	}
> +
> +	lirq = pci_irq_vector(pdev, vector);
> +	irqdata = irq_get_irq_data(lirq);
> +	if (irqdata == NULL) {
> +		pr_err("Hyper-V: null irq_data for write msimsg. lirq:0x%x\n",
> +		       lirq);
> +		return;
> +	}
> +
> +	msimsg.address_hi = 0;
> +	msimsg.address_lo = hv_entry->msi_entry.address.as_uint32;
> +	msimsg.data =  hv_entry->msi_entry.data.as_uint32;
> +
> +	pcicmd = mshv_pci_memory_lock_and_enable(coredev);
> +	pci_write_msi_msg(lirq, &msimsg);
> +	mshv_pci_memory_unlock_and_restore(coredev, pcicmd);
> +
> +	pci_msi_unmask_irq(irqdata);
> +
> +	parent = irqdata->parent_data;
> +	if (parent && parent->chip && parent->chip->irq_unmask)
> +		irq_chip_unmask_parent(irqdata);
> +}
> +
> +/*
> + * This guest has a device passthru'd to it. VFIO did the initial setup of
> + * the device interrupts, but we left them unmapped in the hypervisor
> + * because we didn't have the guest target cpu and vector (required by
> + * hypervisor). We have them now, so do the map hypercall.
> + * Also, when here, it is expected that the device global mask is unset
> + * but individual MSI/x masks are set. Goal here is to map the interrupt in
> + * the hypervisor, update the corresponding device MSI/x entry, and enable it.
> + */
> +static void mshv_pthru_dev_irq_remap(struct mshv_irqfd *irqfd)
> +{
> +	u64 ptid, status;
> +	struct pci_dev *pdev;
> +	int rc, deposit_pgs = 16;
> +	struct mshv_lapic_irq *ginfo = &irqfd->irqfd_lapic_irq;
> +	union hv_device_id hv_devid;
> +	struct hv_interrupt_entry *new_entry;
> +	struct irq_data *irqdata;
> +
> +	if (!irqfd->irqfd_girq_ent.girq_entry_valid ||
> +	    irqfd->irqfd_bypass_prod == NULL)
> +		return;
> +
> +	rc = mshv_parse_mshv_irqfd(irqfd, &pdev, &irqdata);
> +	if (rc)
> +		return;
> +
> +	hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
> +
> +	rc = mshv_chk_unmap_irq(hv_devid, irqdata);
> +	if (rc)
> +		return;
> +
> +	new_entry = kmalloc(sizeof(*new_entry), GFP_ATOMIC);
> +	if (new_entry == NULL)
> +		return;
> +
> +	ptid = irqfd->irqfd_partn->pt_id;
> +
> +	while (deposit_pgs--) {
> +		rc = mshv_map_device_interrupt(ptid, hv_devid, ginfo, new_entry,
> +					       &status);
> +		if (rc == 0)
> +			break;
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
> +			break;
> +
> +		rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
> +		if (rc)
> +			break;
> +	}
> +	if (rc) {
> +		kfree(new_entry);
> +		return;
> +	}
> +
> +	irqdata->chip_data = new_entry;
> +
> +	mshv_make_device_usable(pdev, irqdata->hwirq, new_entry);
> +}
> +
> +static void mshv_pthru_dev_irq_undo(struct mshv_irqfd *irqfd)
> +{
> +	struct pci_dev *pdev;
> +	union hv_device_id hv_devid;
> +	struct irq_data *irqdata;
> +	int rc;
> +
> +	if (!irqfd->irqfd_girq_ent.girq_entry_valid ||
> +	    irqfd->irqfd_bypass_prod == NULL)
> +		return;
> +
> +	rc = mshv_parse_mshv_irqfd(irqfd, &pdev, &irqdata);
> +	if (rc)
> +		return;
> +
> +	hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
> +	mshv_chk_unmap_irq(hv_devid, irqdata);
> +}
> +
> +#else /* IS_ENABLED(CONFIG_X86_64) */
> +
> +static void mshv_pthru_dev_irq_remap(struct mshv_irqfd *irqfd) { }
> +static void mshv_pthru_dev_irq_undo(struct mshv_irqfd *irqfd) { }
> +
> +#endif /* IS_ENABLED(CONFIG_X86_64) */
> +
>  void mshv_register_irq_ack_notifier(struct mshv_partition *partition,
>  				    struct mshv_irq_ack_notifier *mian)
>  {
> @@ -264,6 +635,7 @@ static void mshv_irqfd_shutdown(struct work_struct *work)
>  	/*
>  	 * It is now safe to release the object's resources
>  	 */
> +	irq_bypass_unregister_consumer(&irqfd->irqfd_bypass_cons);
>  	eventfd_ctx_put(irqfd->irqfd_eventfd_ctx);
>  	kfree(irqfd);
>  }
> @@ -286,6 +658,12 @@ static void mshv_irqfd_deactivate(struct mshv_irqfd *irqfd)
>  
>  	hlist_del(&irqfd->irqfd_hnode);
>  
> +	/*
> +	 * Cleanup interrupt map (kfree chip_data) while in a VMM thread as
> +	 * unmap needs partition id. mshv_irqfd_shutdown() runs in a kthread.
> +	 */
> +	mshv_pthru_dev_irq_undo(irqfd);
> +
>  	queue_work(irqfd_cleanup_wq, &irqfd->irqfd_shutdown);
>  }
>  
> @@ -383,6 +761,45 @@ static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
>  	add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
>  }
>  
> +static int mshv_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
> +					struct irq_bypass_producer *prod)
> +{
> +	struct mshv_irqfd *irqfd;
> +
> +	irqfd = container_of(cons, struct mshv_irqfd, irqfd_bypass_cons);
> +	irqfd->irqfd_bypass_prod = prod;
> +
> +	mshv_pthru_dev_irq_remap(irqfd);
> +
> +	return 0;
> +}
> +
> +static void mshv_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
> +					 struct irq_bypass_producer *prod)
> +{
> +	struct mshv_irqfd *irqfd;
> +
> +	irqfd = container_of(cons, struct mshv_irqfd, irqfd_bypass_cons);
> +
> +	WARN_ON(irqfd->irqfd_bypass_prod != prod);
> +	irqfd->irqfd_bypass_prod = NULL;
> +
> +}
> +
> +static void mshv_setup_irq_bypass(struct mshv_irqfd *irqfd,
> +				  struct eventfd_ctx *eventfd)
> +{
> +	struct irq_bypass_consumer *consumer = &irqfd->irqfd_bypass_cons;
> +	int rc;
> +
> +	consumer->add_producer = mshv_irq_bypass_add_producer;
> +	consumer->del_producer = mshv_irq_bypass_del_producer;
> +	rc = irq_bypass_register_consumer(&irqfd->irqfd_bypass_cons, eventfd);
> +	if (rc)
> +		pr_err("Hyper-V: irq bypass consumer registration failed: %d\n",
> +		       rc);
> +}
> +
>  static int mshv_irqfd_assign(struct mshv_partition *pt,
>  			     struct mshv_user_irqfd *args)
>  {
> @@ -509,6 +926,8 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
>  	if (events & EPOLLIN)
>  		mshv_assert_irq_slow(irqfd);
>  
> +	mshv_setup_irq_bypass(irqfd, eventfd);
> +
>  	srcu_read_unlock(&pt->pt_irq_srcu, idx);
>  	return 0;
>  
> diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
> index a2e0f6cc78e6..dc270b0a80d9 100644
> --- a/drivers/iommu/hyperv-iommu-root.c
> +++ b/drivers/iommu/hyperv-iommu-root.c
> @@ -217,6 +217,20 @@ u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type)
>  }
>  EXPORT_SYMBOL_GPL(hv_build_devid_oftype);
>  
> +/* Build device id for the interrupt path */
> +u64 hv_devid_from_pdev(struct pci_dev *pdev)
> +{
> +	enum hv_device_type dev_type;
> +
> +	if (hv_pcidev_is_attached_dev(pdev))
> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> +	else
> +		dev_type = HV_DEVICE_TYPE_PCI;
> +
> +	return hv_build_devid_oftype(pdev, dev_type);
> +}
> +EXPORT_SYMBOL_GPL(hv_devid_from_pdev);
> +
>  /* Create a new device domain in the hypervisor */
>  static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
>  {
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 50d793ca8f31..702a8005651b 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1744,6 +1744,16 @@ static void hv_irq_mask(struct irq_data *data)
>  
>  static void hv_irq_unmask(struct irq_data *data)
>  {
> +	struct pci_dev *pdev;
> +	struct msi_desc *msi_desc;
> +
> +	msi_desc = irq_data_get_msi_desc(data);
> +	pdev = msi_desc_to_pci_dev(msi_desc);
> +
> +	/* Done during bypass setup in mshv_eventfd.c: mshv_irqfd_assign() */
> +	if (hv_pcidev_is_pthru_dev(pdev))
> +		return;
> +
>  	hv_arch_irq_unmask(data);
>  
>  	if (data->parent_data->chip->irq_unmask)
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 8d5c610da99a..88b3aba6691c 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -332,6 +332,7 @@ u64 hv_get_current_partid(void);
>  bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
>  bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev);
>  u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
> +u64 hv_devid_from_pdev(struct pci_dev *pdev);
>  #else
>  static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>  { return false; }
> @@ -340,6 +341,8 @@ static inline bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev)
>  static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
>  					enum hv_device_type type)
>  { return 0; }
> +static inline u64 hv_devid_from_pdev(struct pci_dev *pdev)
> +{ return 0; }
>  static inline u64 hv_get_current_partid(void)
>  { return HV_PARTITION_ID_INVALID; }
>  #endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> -- 
> 2.51.2.vfs.0.1
> 

^ permalink raw reply

* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: David Hildenbrand (Arm) @ 2026-05-12  8:42 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, Liam.Howlett, akpm, decui, haiyangz,
	jgg, corbet, leon, longli, ljs, mhocko, rppt, shuah, skhan,
	surenb, vbabka, wei.liu
  Cc: linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <177759840859.221039.13065406062747296947.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>


> +	for (; addr < end; addr += PAGE_SIZE) {
> +		vm_fault_t ret;
> +
> +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> +
> +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> +			/*
> +			 * The mmap lock has been dropped by the fault handler.
> +			 * Record the failing address and signal lock-drop to
> +			 * the caller.
> +			 */
> +			*hmm_vma_walk->locked = 0;
> +			hmm_vma_walk->last = addr;
> +			return -EAGAIN;


Okay, so we'll return straight from hmm_vma_fault() to
hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.

Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
the hmm_vma_fault() could be called by the caller of walk_page_range(), but
that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
requires the vma in there.


Note: am I wrong, or is hmm_vma_fault() really always called with
required_fault=true?

> +		}
> +
> +		if (ret & VM_FAULT_ERROR)
>  			return -EFAULT;
> +	}
>  	return -EBUSY;
>  }
>  
> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>  	if (required_fault) {
>  		int ret;
>  
> +		/*
> +		 * Faulting hugetlb pages on the unlockable path is not
> +		 * supported. The walk framework holds hugetlb_vma_lock_read
> +		 * which must be dropped before handle_mm_fault, but if the
> +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> +		 * be freed and the walk framework's unconditional unlock
> +		 * becomes a use-after-free.
> +		 */
> +		if (hmm_vma_walk->locked)
> +			return -EFAULT;

Just because it's unlockable doesn't mean that you must unlock. Can't this be
kept working as is, just simulating here as if it would not be unlockable?


-- 
Cheers,

David

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox