* Re: [PATCH V3 06/11] mshv: Add ioctl support for MSHV-VFIO bridge device
From: sashiko-bot @ 2026-05-13 5:27 UTC (permalink / raw)
To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-7-mrathor@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] FD leak and potential deadlock on copy_to_user() failure after anon_inode_getfd().
- [Medium] Asymmetric memory lifecycle and inconsistent device_release handling.
- [Medium] Missing validation of unknown flags in mshv_partition_ioctl_create_device().
--
commit a0a0db24ca9d194302f458a7afa8a424cc357990
Author: Mukesh R <mrathor@linux.microsoft.com>
mshv: Add ioctl support for MSHV-VFIO bridge device
This commit adds an ioctl for creating MSHV devices for a partition.
Currently only VFIO device types are supported, doing setup for the bridge
code in mshv_vfio.c.
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 02c107458be90..6ceb5f608589f 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
[ ... ]
> @@ -1625,6 +1625,119 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
[ ... ]
> +static int mshv_device_fop_release(struct inode *inode, struct file *filp)
> +{
> + struct mshv_device *mshv_dev = filp->private_data;
> + struct mshv_partition *partition = mshv_dev->device_pt;
> +
> + if (mshv_dev->device_ops->device_release) {
> + mutex_lock(&partition->pt_mutex);
> + hlist_del(&mshv_dev->device_ptnode);
> + mshv_dev->device_ops->device_release(mshv_dev);
> + mutex_unlock(&partition->pt_mutex);
> + }
What happens if a future device type does not implement the device_release
callback? It looks like the device would remain indefinitely linked in
partition->pt_devices while the file descriptor is closed, leading to
list corruption and a memory leak.
Since the device structure is allocated by the core module via kzalloc(),
would it be safer to handle the list removal and generic kfree()
unconditionally in this function?
> +
> + mshv_partition_put(partition);
> + return 0;
> +}
> +
> +static const struct file_operations mshv_device_fops = {
> + .owner = THIS_MODULE,
> + .unlocked_ioctl = mshv_device_fop_ioctl,
> + .release = mshv_device_fop_release,
> +};
> +
> +static long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> + void __user *uarg)
> +{
> + long rc;
> + struct mshv_create_device devargk;
> + struct mshv_device *mshv_dev;
> + const struct mshv_device_ops *vfio_ops;
> +
> + if (copy_from_user(&devargk, uarg, sizeof(devargk)))
> + return -EFAULT;
> +
> + /* At present, only VFIO is supported */
> + if (devargk.type != MSHV_DEV_TYPE_VFIO)
> + return -ENODEV;
> +
> + if (devargk.flags & MSHV_CREATE_DEVICE_TEST)
> + return 0;
Should we explicitly validate that no undefined flags are set here?
If a future kernel version introduces new flags for device creation, older
kernels will silently ignore them and return success instead of -EINVAL,
breaking forward compatibility.
> +
> + /* This is freed later by mshv_vfio_release_device() */
> + mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
> + if (mshv_dev == NULL)
> + return -ENOMEM;
> +
> + vfio_ops = &mshv_vfio_device_ops;
> + mshv_dev->device_ops = vfio_ops;
> + mshv_dev->device_pt = partition;
> +
> + rc = vfio_ops->device_create(mshv_dev);
> + if (rc < 0) {
> + kfree(mshv_dev);
> + return rc;
> + }
> +
> + hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
> +
> + mshv_partition_get(partition);
> + rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
> + mshv_dev, O_RDWR | O_CLOEXEC);
> + if (rc < 0)
> + goto undo_out;
> +
> + devargk.fd = rc;
> + if (copy_to_user(uarg, &devargk, sizeof(devargk)))
> + return -EFAULT; /* cleanup in mshv_device_fop_release() */
Does this sequence leak the file descriptor if copy_to_user() fails?
The anon_inode_getfd() call installs the file descriptor into the process's
fd table. If the subsequent copy_to_user() fails, the function returns
-EFAULT, but the user space application never receives the fd number to
explicitly close it, resulting in a leaked file descriptor.
Additionally, can this trigger a deadlock?
The ioctl is called with partition->pt_mutex held. If a malicious user
provides a buffer backed by userfaultfd, the copy_to_user() call will
block, holding the mutex. Concurrently, another thread could find the
new fd in /proc/self/fd and call close(), which invokes
mshv_device_fop_release(). That release function unconditionally attempts
to acquire partition->pt_mutex, resulting in a deadlock.
Would it be better to separate file allocation and installation by using
get_unused_fd_flags(), anon_inode_getfile(), and only calling fd_install()
after copy_to_user() succeeds?
> +
> + return 0;
> +
> +undo_out:
> + hlist_del(&mshv_dev->device_ptnode);
> + vfio_ops->device_release(mshv_dev); /* will kfree(mshv_dev) */
If a future device type does not implement the device_release callback,
could this result in a NULL pointer dereference?
> + mshv_partition_put(partition);
> + return rc;
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=6
^ permalink raw reply
* Re: [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR
From: Anirudh Rayabharam @ 2026-05-13 5:32 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865065.21765.5112039734725197893.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:44:10PM +0000, Stanislav Kinsburskii wrote:
> mshv_intercept_isr() reads vp_index from the SynIC intercept message
> payload and uses it directly to index into partition->pt_vp_array without
> validating that vp_index < MSHV_MAX_VPS.
>
> Mshv treats the Microsoft Hypervisor as trusted, so a malformed vp_index is
> not a security concern; the threat model does not include a malicious
> hypervisor. A hypervisor bug that placed an out-of-range value here would,
> however, cause an out-of-bounds read of pt_vp_array in hardirq context,
> manifesting as random memory corruption or a host crash with no clear
> signal pointing back to the hypervisor as the source.
>
> handle_bitset_message() and handle_pair_message() already perform this
> defensive check on hypervisor-supplied vp_index values, with an explicit
> "This shouldn't happen, but just in case" comment Add the same check to
> mshv_intercept_isr() for consistency, turning a potential silent corruption
> into a debuggable pr_debug message.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_synic.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index bac890cd2b468..89207aad7cf1f 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -387,6 +387,11 @@ mshv_intercept_isr(struct hv_message *msg)
> */
> vp_index =
> ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> + /* This shouldn't happen, but just in case. */
> + if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> + pr_debug("VP index %u out of bounds\n", vp_index);
> + goto unlock_out;
> + }
> vp = partition->pt_vp_array[vp_index];
> if (unlikely(!vp)) {
> pr_debug("failed to find VP %u\n", vp_index);
>
>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: sashiko-bot @ 2026-05-13 6:43 UTC (permalink / raw)
To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Conflicting declarations and missing `#ifdef` guards for `hv_build_devid_type_pci()` cause build failures under certain Kconfig combinations.
--
commit 086da02227f217e959d3022c592e1b8698814d5b
Author: Mukesh R <mrathor@linux.microsoft.com>
PCI: hv: VMBus and PCI device IDs for PCI passthru
This commit introduces functions to build and export VMBus and PCI device
IDs for use with Hyper-V PCI passthru. It exposes hv_build_devid_type_pci()
and adds hv_pci_vmbus_device_id() to generate the device IDs expected by
the hypervisor.
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dca..8780573a43326 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> return 0;
> }
>
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
> {
> int pos;
[ ... ]
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee3..2ef34001f8d31 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
[ ... ]
> @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
> static inline void hv_root_crash_init(void) {}
> #endif /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
>
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
Will this cause build failures under certain Kconfig combinations?
If CONFIG_PCI_MSI=y and CONFIG_HYPERV_IOMMU=n, arch/x86/hyperv/irqdomain.c
provides a non-static definition of hv_build_devid_type_pci(), but this
header provides a static inline definition, which leads to a redefinition
error.
Conversely, if CONFIG_PCI_MSI=n and CONFIG_HYPERV_IOMMU=y, this header
declares the function, but it is never defined because irqdomain.c is not
compiled, resulting in linker errors.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=8
^ permalink raw reply
* Re: [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path
From: Anirudh Rayabharam @ 2026-05-13 9:57 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865606.21765.9555294064589953949.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:44:16PM +0000, Stanislav Kinsburskii wrote:
> mshv_partition_ioctl_create_vp() initialises a VP struct (allocations,
> mutex_init, init_waitqueue_head, page mappings) and then publishes the
> pointer into partition->pt_vp_array. Several ISR paths read this array
> locklessly: the intercept ISR, the two scheduler ISRs, and
> mshv_try_assert_irq_fast() on the irqfd fast path.
>
> Of these, only mshv_try_assert_irq_fast() can structurally race the
> publish. It runs from an eventfd waker without holding pt_mutex, and
> MSHV_IRQFD does not require the target lapic_apic_id (== vp_index) to
> refer to an existing VP at registration time. A user can therefore
> register an irqfd targeting a yet-to-be-created VP, then trigger
> mshv_try_assert_irq_fast() concurrently with MSHV_CREATE_VP for the
> same index. On weakly-ordered architectures the reader can observe a
> non-NULL pointer in pt_vp_array before the initialising stores to the
> VP struct become visible, leading to use of partially-initialised
> fields (e.g. vp_register_page).
>
> The other ISR readers cannot reach this race: the hypervisor will not
> generate intercept or scheduler messages for a VP that has never been
> told to run, and the user can only call MSHV_RUN_VP on the VP fd
> returned by MSHV_CREATE_VP, which by construction is returned after
> the publish. Leave those readers as plain loads.
>
> Use smp_store_release() in mshv_partition_ioctl_create_vp() to publish
> the pointer, and pair it with smp_load_acquire() in
> mshv_try_assert_irq_fast(). On x86 these compile to plain accesses
> under TSO; on ARM64 they emit one-instruction acquire/release barriers,
> acceptable on this fast path.
>
> The destroy-side path (destroy_partition() clearing pt_vp_array[i] to
> NULL after kfree(vp)) has a separate ordering and lifetime concern
> that is out of scope here.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_eventfd.c | 9 ++++++++-
> drivers/hv/mshv_root_main.c | 8 +++++++-
> 2 files changed, 15 insertions(+), 2 deletions(-)
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period
From: Anirudh Rayabharam @ 2026-05-13 10:11 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816866152.21765.16203922564983237274.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:44:21PM +0000, Stanislav Kinsburskii wrote:
> destroy_partition() frees mshv_vp with plain kfree() while ISR readers
> walk pt_vp_array[] under rcu_read_lock(). On non-root schedulers,
> where drain_all_vps() does not run, an in-flight intercept ISR can
> observe a non-NULL pt_vp_array slot and dereference freed memory in
> kick_vp(). On the root scheduler the same race exists in a narrower
> form: drain_vp_signals() synchronises on kick_vp()'s kicked_by_hv flag
> but not on its wake_up() tail, so the wait-queue lock embedded in vp
> can still be held when destroy_partition() reaches kfree(vp).
>
> Add struct rcu_head vp_rcu to struct mshv_vp, clear the pt_vp_array
> slot before the free, and use kfree_rcu() so the actual kfree happens
> after a grace period. drain_all_vps() is retained because it serves a
> separate purpose (telling the hypervisor to stop signalling and
> reconciling signal counts) that kfree_rcu() does not address.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_root.h | 1 +
> drivers/hv/mshv_root_main.c | 5 +++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Anirudh Rayabharam @ 2026-05-13 11:12 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816866691.21765.15605640837157423543.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> handle_pair_message() iterates up to msg->vp_count without verifying it
> against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> from untrusted hypervisor data, a malformed message with a large value
> would cause out-of-bounds reads from the partition_ids and vp_indexes
> arrays.
>
> handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> 64) and advances bank_contents for each one. However, the payload buffer
> only has space for 16 bank entries. A valid_bank_mask with more than 16
> bits set causes bank_contents to read beyond the message buffer.
>
> Fix both by adding bounds validation:
> - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> - Track banks consumed and stop before exceeding buffer capacity
>
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_synic.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 89207aad7cf1f..5d509299f14d7 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
> static void
> handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> {
> - int bank_idx, vps_signaled = 0, bank_mask_size;
> + int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> + const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> + sizeof(u64) - 2; /* subtract format + mask */
Could this be a constant in the header?
Thanks,
Anirudh.
^ permalink raw reply
* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Anirudh Rayabharam @ 2026-05-13 11:15 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <agHwhJrqEL3IewHz@skinsburskii.localdomain>
On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > fails on an unencrypted partition. The condition on the error path:
> > >
> > > if (ret && mshv_partition_encrypted(partition))
> > >
> > > only handles map failures for encrypted partitions — if the partition is
> > > not encrypted and the map fails, execution falls through to 'return 0',
> > > silently ignoring the error.
> > >
> > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > zeroes the mreg_pages array before the caller's cleanup path
> > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > mappings for partially-mapped pages.
> > >
> > > Fix by returning immediately on success and falling through to error
> > > return on failure. For unencrypted partitions, the caller's
> > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > correct order. For encrypted partitions where re-sharing fails, zero
> > > the page array without unpinning — the pages are inaccessible to the
> > > host and must not be unpinned, but zeroing prevents
> > > mshv_region_destroy() from attempting to unpin them.
> >
> > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > mshv_region_invalidate_pages() which just does:
> >
> > static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > u64 page_offset, u64 page_count)
> > {
> > if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > unpin_user_pages(region->mreg_pages + page_offset, page_count);
> >
> > memset(region->mreg_pages + page_offset, 0,
> > page_count * sizeof(struct page *));
> > }
> >
> > It doesn't checked for zeroed pages before unpinning.
> >
> > >
> > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > > drivers/hv/mshv_root_main.c | 26 ++++++++++++++++----------
> > > 1 file changed, 16 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 665d565899c15..7e4252b6bc65c 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > > pt_err(partition,
> > > "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > > region->start_gfn, ret);
> > > - goto invalidate_region;
> > > + goto err_out;
> > > }
> > > }
> > >
> > > ret = mshv_region_map(region);
> > > - if (ret && mshv_partition_encrypted(partition)) {
> > > + if (ret)
> > > + goto share_region;
> > > +
> > > + return 0;
> > > +
> > > +share_region:
> > > + if (mshv_partition_encrypted(partition)) {
> > > int shrc;
> > >
> > > shrc = mshv_region_share(region);
> > > if (!shrc)
> > > - goto invalidate_region;
> > > + goto err_out;
> > >
> > > pt_err(partition,
> > > "Failed to share memory region (guest_pfn: %llu): %d\n",
> > > region->start_gfn, shrc);
> > > /*
> > > - * Don't unpin if marking shared failed because pages are no
> > > - * longer mapped in the host, ie root, anymore.
> > > + * Re-sharing failed — the pages remain inaccessible to the
> > > + * host. Zero the page array so that mshv_region_destroy()
> > > + * won't attempt to unpin them (leaking the page references
> > > + * is intentional; unpinning host-inaccessible pages would be
> > > + * unsafe).
> > > */
> >
> > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > can remove it from here altogether. Either way, should this zeroing
> > logic be added there too so as not to crash the host by unpinning pages
> > that weren't marked shared?
> >
>
> Indeed.
> The issue with all this code is that we are leaking state in
> mshv_region_pin if we wimply return from it: we don't know if the pages
> are pinned or unshared (or mapped) in the destruction callback.
> We should either introduce a state variable to track this or just don't
> call the generic mshv_region_put on case of region creation error.
> The latter approch make mshv_map_user_memory idempotent and thus looks
> like a better way forward.
> What do you think?
I'm not sure I follow the latter approach. Omitting mshv_region_put()
would cause a dangling reference to the mshv_region right?
Thanks,
Anirudh.
^ permalink raw reply
* Re: [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
From: Anirudh Rayabharam @ 2026-05-13 11:20 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816863447.21765.7284842709694944084.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:43:54PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_lookup() previously took rcu_read_lock() internally, ran
> idr_find(), released the read lock, and copied the struct contents
> into a caller-supplied buffer. This had two problems.
>
> 1. The struct copy ran outside the read section, racing with
> mshv_portid_free() which does idr_remove + synchronize_rcu + kfree.
> A copy that started just before synchronize_rcu() observed the read
> section as already drained and was free to read freed memory while
> the writer was kfree()'ing the entry.
>
> 2. The only consumer, mshv_doorbell_isr(), then dispatched a callback
> using fields of the snapshot — entirely outside any RCU read
> section. The callback's data argument and any field it touches
> were therefore safe only because mshv_isr() runs from
> sysvec_hyperv_callback, a non-threaded system vector that
> synchronize_rcu() implicitly waits for via the hardirq quiescent-
> state coupling. That protection is real today but undocumented and
> fragile: a future move of mshv_isr() to a threaded context, or a
> future caller that registers a doorbell with a shorter-lived data
> pointer, would silently expose a use-after-free.
>
> Make the contract explicit instead of implicit. mshv_portid_lookup()
> now returns a pointer to the table entry and requires the caller to
> hold rcu_read_lock for the entire lifetime of that pointer. The
> contract is annotated with __must_hold(RCU) so sparse flags any
> direct caller that forgets it. The sole caller, mshv_doorbell_isr(),
> takes rcu_read_lock around the whole drain loop, so the lookup, the
> field reads, and the doorbell_cb dispatch all run inside one
> read-side critical section. synchronize_rcu() in mshv_portid_free()
> now genuinely waits for any in-flight callback before kfree() runs,
> without relying on hardirq context for correctness.
>
> This also drops the by-value struct copy: entries are publish-once
> (populated before idr_alloc) and free-once (after synchronize_rcu),
> so a pointer dereferenced inside the read section gives a stable
> view of the contents without copying.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_portid_table.c | 22 +++++++---------------
> drivers/hv/mshv_root.h | 2 +-
> drivers/hv/mshv_synic.c | 15 +++++++++------
> 3 files changed, 17 insertions(+), 22 deletions(-)
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free
From: Anirudh Rayabharam @ 2026-05-13 11:22 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816864533.21765.15114003791446487007.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:44:05PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_free() uses synchronize_rcu() followed by kfree() to
> reclaim port table entries. This blocks the caller until a full RCU
> grace period elapses, which is unnecessary since the same module already
> uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().
>
> Replace with kfree_rcu() to avoid the blocking wait and keep the
> reclamation strategy consistent across the file.
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_portid_table.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
> index d87a82e399e96..42d21b92b88fd 100644
> --- a/drivers/hv/mshv_portid_table.c
> +++ b/drivers/hv/mshv_portid_table.c
> @@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
> WARN_ON(!info);
> idr_unlock(&port_table_idr);
>
> - synchronize_rcu();
> - kfree(info);
> + kfree_rcu(info, portbl_rcu);
> }
>
> /*
>
>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0
From: Anirudh Rayabharam @ 2026-05-13 11:36 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862911.21765.306085307721937662.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:43:49PM +0000, Stanislav Kinsburskii wrote:
> The duplicate routing entry check in mshv_update_routing_table() uses
> guest_irq_num != 0 to detect whether a GSI slot is already occupied.
> This fails for GSI 0 because its guest_irq_num is 0 both when the slot
> is unused (zero-initialized) and when legitimately assigned. As a
> result, duplicate entries for GSI 0 are silently accepted, with the
> second entry overwriting the first — corrupting the routing table
> without any error reported to userspace.
>
> While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
> in practice, the check is semantically wrong — it conflates
> "uninitialized" with "GSI number 0." Use girq_entry_valid instead,
> which is explicitly set to true when an entry is populated and remains
> zero for unused slots regardless of the GSI number.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_irq.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
> index 59584a132ca9f..db05512db5548 100644
> --- a/drivers/hv/mshv_irq.c
> +++ b/drivers/hv/mshv_irq.c
> @@ -88,7 +88,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
> /*
> * Allow only one to one mapping between GSI and MSI routing.
> */
> - if (girq->guest_irq_num != 0) {
> + if (girq->girq_entry_valid) {
> r = -EINVAL;
> goto out;
> }
>
>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-13 11:38 UTC (permalink / raw)
To: sashiko-reviews, Dennis Dalessandro; +Cc: linux-hyperv
In-Reply-To: <20260513031237.4280DC2BCB0@smtp.kernel.org>
On Wed, May 13, 2026 at 03:12:36AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> - [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
> - [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
> - [High] Locking imbalance and freeing memory while locked.
> - [High] Inconsistent state and Use-After-Free on error path.
> - [Low] Uninitialized variable compiler warning for `offset_addr`.
> --
These are all pre-existing and rvt is too confusing to try to fix be
me. Dennis will have to handle them
> commit dac7699d9e58c179b3a8859ac64da79e5fd21f46
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Subject: IB/rdmavt: Don't abuse udata and ib_respond_udata()
>
> This commit modifies rvt_modify_srq to use copy_to_user directly instead
> of ib_copy_to_udata for returning the offset address.
>
> > diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
> > index fe125bf85b272..d022aa56c5bfd 100644
> > --- a/drivers/infiniband/sw/rdmavt/srq.c
> > +++ b/drivers/infiniband/sw/rdmavt/srq.c
> > @@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> > struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
> > struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
> > struct rvt_rq tmp_rq = {};
> > + __u64 offset_addr;
> > int ret = 0;
> >
> > if (attr_mask & IB_SRQ_MAX_WR) {
>
> Could hoisting offset_addr lead to a compiler warning for being potentially
> uninitialized?
>
> The compiler might assume udata->inlen could change concurrently across the
> spin_lock_irq() and spin_unlock_irq() calls. If the first check fails
> but the second one passes:
>
> if (udata && udata->inlen >= sizeof(__u64))
>
> Does this allow offset_addr to be used without initialization?
>
> [ ... ]
>
> > @@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> > return -ENOMEM;
> > /* Check that we can write the offset to mmap. */
> > if (udata && udata->inlen >= sizeof(__u64)) {
> > - __u64 offset_addr;
> > __u64 offset = 0;
> >
> > ret = ib_copy_from_udata(&offset_addr, udata,
> > sizeof(offset_addr));
> > if (ret)
> > goto bail_free;
> > - udata->outbuf = (void __user *)
> > - (unsigned long)offset_addr;
> > - ret = ib_copy_to_udata(udata, &offset,
> > - sizeof(offset));
> > - if (ret)
> > + if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
> > + sizeof(offset))) {
> > + ret = -EFAULT;
> > goto bail_free;
> > + }
> > }
> >
> > spin_lock_irq(&srq->rq.kwq->c_lock);
>
> Can this code leak the kernel queue structure?
>
> When resizing a user-backed SRQ, the code fetches elements from the old
> user-mapped queue. In the udata path, okwq is left as NULL:
>
> if (udata) {
> owq = srq->rq.wq;
> head = RDMA_READ_UAPI_ATOMIC(owq->head);
> tail = RDMA_READ_UAPI_ATOMIC(owq->tail);
> } else {
> okwq = srq->rq.kwq;
> ...
>
> When it later calls kvfree(okwq), it seems nothing is freed, which could
> permanently leak the old kwq structure.
>
> Is there a lock imbalance here when the queue is replaced?
>
> The original queue is locked with:
>
> spin_lock_irq(&srq->rq.kwq->c_lock);
>
> But then the queue pointer is replaced:
>
> srq->rq.kwq = tmp_rq.kwq;
>
> When the lock is released:
>
> spin_unlock_irq(&srq->rq.kwq->c_lock);
>
> Does this unlock the new queue instead of the old one, and then free the old
> queue via kvfree(okwq) while its lock is still held?
>
> Can the loop migrating WQEs overflow the destination buffer?
>
> The number of scatter-gather elements is read directly from user-shared
> memory:
>
> wqe = rvt_get_rwqe_ptr(&srq->rq, tail);
> p->wr_id = wqe->wr_id;
> p->num_sge = wqe->num_sge;
> for (i = 0; i < wqe->num_sge; i++)
> p->sg_list[i] = wqe->sg_list[i];
>
> If a userspace program concurrently modifies num_sge to a large value, could
> this loop exceed the bounds of tmp_rq.kwq->curr_wq, since the destination
> size was calculated based on srq->rq.max_sge?
>
> [ ... ]
>
> > @@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> > * See rvt_mmap() for details.
> > */
> > if (udata && udata->inlen >= sizeof(__u64)) {
> > - ret = ib_copy_to_udata(udata, &ip->offset,
> > - sizeof(ip->offset));
> > - if (ret)
> > - return ret;
> > + if (copy_to_user(u64_to_user_ptr(offset_addr),
> > + &ip->offset,
> > + sizeof(ip->offset)))
> > + return -EFAULT;
> > }
> >
>
> Does returning -EFAULT here leave the kernel and userspace in an inconsistent
> state?
>
> At this point, the kernel has already applied the SRQ size change, replaced
> the queue, and permanently freed the old memory:
>
> vfree(owq);
> kvfree(okwq);
>
> If copy_to_user fails and returns -EFAULT, userspace might interpret this as
> a failed resize and continue using its existing memory mapping of the old
> queue. Could this lead to a use-after-free since the pages have already
> been freed?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com?part=2
^ permalink raw reply
* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Anirudh Rayabharam @ 2026-05-13 12:14 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862362.21765.11809618639989414561.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> In mshv_irqfd_assign(), the level-triggered validation for resample
> irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> mshv_irqfd_update() has populated the field. Since the irqfd struct is
> zero-allocated, level_triggered is always 0 at that point, causing the
> check to always reject resample irqfds with -EINVAL. This makes
> level-triggered interrupt resampling — used to avoid interrupt storms
> with assigned devices — completely non-functional.
What bugs would this manifest as? Why haven't we seen any such bugs so
far?
Thanks,
Anirudh.
^ permalink raw reply
* Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: sashiko-bot @ 2026-05-13 12:41 UTC (permalink / raw)
To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-10-mrathor@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [Critical] `hv_iommu_unmap_pages` makes a rep hypercall but ignores the completion count, leaving pages permanently mapped.
- [Critical] Integer truncation in `hv_iommu_unmap_pages` causes incomplete hardware unmapping.
- [Critical] Global `unique_id` data race leads to IOMMU domain collisions and cross-VM DMA sharing.
- [High] The `sscanf` return value check in `hv_iommu_probe_device` breaks the loop on success instead of failure.
- [High] Partial unmaps corrupt the interval tree state, desynchronizing hardware and software.
- [High] Unsafe interval tree iteration leads to skipped nodes during removal.
- [High] Memory leak of interval tree nodes on domain free.
- [High] Ignoring `gfp` flags causes premature OOM and risks sleeping in atomic context.
- [Medium] Missing NULL check on `mapped` pointer in fast path.
- [Low] The `__setup` handlers violate the API by returning 0.
--
commit 4a3c258bcace1805e7afe537c6eb3e451a6fd030
Author: Mukesh R <mrathor@linux.microsoft.com>
x86/hyperv: Implement Hyper-V virtual IOMMU
Add a new file to implement management of device domains, mapping and
unmapping of IOMMU memory, and other iommu_ops to fit within the VFIO
framework for PCI passthru on Hyper-V running Linux as baremetal root
or L1VH root.
> diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
> new file mode 100644
> index 0000000000000..a2e0f6cc78e66
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu-root.c
[ ... ]
> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> + pci_devs_to_skip = str;
> +
> + return 0;
> +}
Should this return 1 instead of 0? Returning 0 causes the kernel to print an
unknown command line parameters warning and passes the parameters to the init
process.
> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev; /* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> + hv_no_attdev = true;
> + return 0;
> +}
Should this also return 1 instead of 0 to indicate the parameter was handled?
[ ... ]
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> + struct hv_domain *hvdom;
> + int rc;
[ ... ]
> + /* Called under iommu group mutex, so single threaded */
> + if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_NULL) /* ie, UINTMAX */
> + goto out_err;
Can this increment of unique_id race if called concurrently?
iommu_domain_alloc() can be called by subsystems without holding global or
group locks. If multiple allocations race, it could assign identical domid_num
values to different domains, which might place them in the same S2 domain.
[ ... ]
> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + unsigned long flags;
> + u64 status;
> + struct hv_input_delete_device_domain *input;
> +
> + if (hv_special_domain(hvdom))
> + return;
> +
> + if (!hv_dom_owner_is_vmm(hvdom) || hv_no_attdev) {
[ ... ]
> + }
> +
> + kfree(hvdom);
> +}
Does this code leak the interval tree nodes?
It looks like the remaining struct hv_iommu_mapping nodes in
hvdom->mappings_tree are not freed before kfree(hvdom).
[ ... ]
> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> + unsigned long iova, phys_addr_t paddr,
> + size_t size, u32 flags)
> +{
> + unsigned long irqflags;
> + struct hv_iommu_mapping *mapping;
> +
> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
Should this use the gfp flag passed into hv_iommu_map_pages() instead of
GFP_ATOMIC? It might avoid failing prematurely under memory pressure when the
caller can sleep.
[ ... ]
> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> + unsigned long iova, size_t size)
> +{
> + unsigned long flags;
> + size_t unmapped = 0;
> + unsigned long last = iova + size - 1;
> + struct hv_iommu_mapping *mapping = NULL;
> + struct interval_tree_node *node, *next;
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> + while (next) {
> + node = next;
> + mapping = container_of(node, struct hv_iommu_mapping, iova);
> + next = interval_tree_iter_next(node, iova, last);
> +
> + /* Trying to split a mapping? Not supported for now. */
> + if (mapping->iova.start < iova)
> + break;
Does breaking out early here leave stale mappings in the tree?
If the target region is unmapped in hardware, the tracking node would remain,
causing stale lookups. Also, if unmapping the head of a mapping, is the
entire node deleted even though the tail is still mapped?
> +
> + unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> + interval_tree_remove(node, &hvdom->mappings_tree);
Can erasing the node corrupt the iteration?
Since interval_tree_remove() rebalances the tree and alters parent pointers,
calling interval_tree_iter_next(next) on the next loop iteration might skip
overlapping nodes.
[ ... ]
> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> + phys_addr_t paddr, size_t pgsize, size_t pgcount,
> + int prot, gfp_t gfp, size_t *mapped)
> +{
> + u32 map_flags;
> + int ret;
> + u64 status;
> + unsigned long npages, done = 0;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + size_t size = pgsize * pgcount;
> +
> + map_flags = HV_MAP_GPA_READABLE; /* required */
> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> + if (ret)
> + return ret;
> +
> + if (hvdom->attached_dom) {
> + *mapped = size;
> + return 0;
> + }
Is it possible for mapped to be NULL here?
The fast path assigns to *mapped without validating the pointer, while later
in the function there are defensive checks like if (mapped).
[ ... ]
> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> + ret = hv_call_deposit_pages(NUMA_NO_NODE,
> + hv_current_partition_id,
> + 256);
Could this trigger a sleeping in atomic context bug?
hv_call_deposit_pages() typically performs sleepable allocations, but the
caller of hv_iommu_map_pages() might have provided a non-sleeping gfp flag
which was not forwarded.
[ ... ]
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> + size_t pgsize, size_t pgcount,
> + struct iommu_iotlb_gather *gather)
> +{
> + unsigned long flags, npages;
> + struct hv_input_unmap_device_gpa_pages *input;
> + u64 status;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + size_t unmapped, size = pgsize * pgcount;
[ ... ]
> + npages = size >> HV_HYP_PAGE_SHIFT;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> + input->target_device_va_base = iova;
> +
> + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> + 0, input, NULL);
Can npages be truncated here?
The hypercall parameter rep_count in hv_do_rep_hypercall() expects a u16.
If npages is >= 65536, it would be silently truncated, causing incomplete
unmapping in the hardware.
Does this also need a loop to handle partial completions?
hv_do_rep_hypercall() might only partially complete. Since the completion count
from hv_repcomp(status) is ignored, remaining pages could be left fully
mapped in the hardware.
[ ... ]
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> + if (!dev_is_pci(dev))
> + return ERR_PTR(-ENODEV);
> +
> + if (pci_devs_to_skip && *pci_devs_to_skip) {
> + int rc, pos = 0;
> + int parsed;
> + int segment, bus, slot, func;
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + do {
> + parsed = 0;
> +
> + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> + &segment, &bus, &slot, &func, &parsed);
> + if (rc)
> + break;
Should this check for failure instead of success?
sscanf() returns the number of successfully matched items. If it succeeds, rc
will be 4, causing the loop to break prematurely without checking the parsed
values against the device.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=9
^ permalink raw reply
* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
From: Frederic Weisbecker @ 2026-05-13 13:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Guenter Roeck, Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Anna-Maria Behnsen,
Ingo Molnar, Chen Ridong, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
Qiliang Yuan
In-Reply-To: <87340od7ev.ffs@tglx>
Le Tue, Apr 21, 2026 at 10:50:00AM +0200, Thomas Gleixner a écrit :
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> > + /*
> > + * To properly enable/disable nohz_full dynticks for the affected CPUs,
> > + * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
> > + * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
> > + * for those CPUs that have their states changed. Those CPUs should be
> > + * in an offline state.
> > + */
> > + for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
> > + WARN_ON_ONCE(cpu_online(cpu));
> > + ct_cpu_track_user(cpu);
> > + cpumask_set_cpu(cpu, tick_nohz_full_mask);
> > + }
> > +
> > + for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
> > + WARN_ON_ONCE(cpu_online(cpu));
> > + ct_cpu_untrack_user(cpu);
> > + cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> > + }
> > +}
>
> So this writes to tick_nohz_full_mask while other CPUs can access
> it. That's just wrong and I'm not at all interested in the resulting
> KCSAN warnings.
>
> tick_nohz_full_mask needs to become a RCU protected pointer, which is
> updated once the new mask is established in a separately allocated one.
How about just dropping tick_nohz_full_mask that is just
~housekeeping_cpumask(HK_TYPE_KERNEL_NOISE) which itself is becoming RCU
protected in this patchset?
Thanks.
>
> Thanks,
>
> tglx
>
>
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply
* [PATCH v4] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Anirudh Rayabharam (Microsoft) @ 2026-05-13 13:25 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: linux-hyperv, linux-kernel, Anirudh Rayabharam (Microsoft),
Stanislav Kinsburskii
The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.
Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
hypercall has no 1G stride, so 1G folios are processed as a
sequence of 2M chunks. Folios whose order is less than PMD_ORDER
(e.g. mTHP) fall back to single-page stride; mapping them as 2M
would fail in the hypervisor anyway.
Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
Changes in v4:
- Changed the check to page_order < PMD_ORDER for using page stride of 1
- Also updated the commit message.
- Pick up Acked-by:
- Link to v3: https://lore.kernel.org/r/20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com
Changes in v3:
- Fixed various corner cases reported by Sashiko.
- Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com
Changes in v2:
- Handled the case where we can have 2M aligned pages in the middle of a
1G page
- Brought back the page order check but expanded it to include 1G
- Clamp stride to requested page count in mshv_region_process_chunk
- Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
---
drivers/hv/mshv_regions.c | 29 +++++++++++++----------------
1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6..6d65e5b42152 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -29,29 +29,27 @@
* Uses huge page stride if the backing page is huge and the guest mapping
* is properly aligned; otherwise falls back to single page stride.
*
- * Return: Stride in pages, or -EINVAL if page order is unsupported.
+ * Return: Stride in pages.
*/
-static int mshv_chunk_stride(struct page *page,
- u64 gfn, u64 page_count)
+static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
+ u64 page_count)
{
- unsigned int page_order;
+ unsigned int page_order = folio_order(page_folio(page));
/*
* Use single page stride by default. For huge page stride, the
- * page must be compound and point to the head of the compound
- * page, and both gfn and page_count must be huge-page aligned.
+ * folio order must be at least PMD_ORDER, the page's PFN must be
+ * 2M-aligned (so that a 2M-aligned tail page of a larger folio is
+ * acceptable), and both gfn and page_count must be 2M-aligned.
*/
- if (!PageCompound(page) || !PageHead(page) ||
+ if (page_order < PMD_ORDER ||
+ !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
!IS_ALIGNED(gfn, PTRS_PER_PMD) ||
!IS_ALIGNED(page_count, PTRS_PER_PMD))
return 1;
- page_order = folio_order(page_folio(page));
- /* The hypervisor only supports 2M huge page */
- if (page_order != PMD_ORDER)
- return -EINVAL;
-
- return 1 << page_order;
+ /* Use 2M stride always i.e. process 1G folios as 2M chunks */
+ return 1 << PMD_ORDER;
}
/**
@@ -86,15 +84,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
u64 gfn = region->start_gfn + page_offset;
u64 count;
struct page *page;
- int stride, ret;
+ unsigned int stride;
+ int ret;
page = region->mreg_pages[page_offset];
if (!page)
return -EINVAL;
stride = mshv_chunk_stride(page, gfn, page_count);
- if (stride < 0)
- return stride;
/* Start at stride since the first stride is validated */
for (count = stride; count < page_count; count += stride) {
---
base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
change-id: 20260416-huge_1g-e44461393c8f
Best regards,
--
Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply related
* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Dennis Dalessandro @ 2026-05-13 13:45 UTC (permalink / raw)
To: Jason Gunthorpe, sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260513113835.GE7655@nvidia.com>
On 5/13/26 7:38 AM, Jason Gunthorpe wrote:
> On Wed, May 13, 2026 at 03:12:36AM +0000, sashiko-bot@kernel.org wrote:
>> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
>> - [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
>> - [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
>> - [High] Locking imbalance and freeing memory while locked.
>> - [High] Inconsistent state and Use-After-Free on error path.
>> - [Low] Uninitialized variable compiler warning for `offset_addr`.
>> --
>
> These are all pre-existing and rvt is too confusing to try to fix be
> me. Dennis will have to handle them
Will take care of it.
-Denny
^ permalink raw reply
* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Souradeep Chakrabarti @ 2026-05-13 15:08 UTC (permalink / raw)
To: Mukesh R
Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>
On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device ID as a parameter. This device ID refers
> to that specific device during the lifetime of passthru.
>
> An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> device is slightly different in that it uses the hv_pcibus_device info
> for building it to make sure it matches exactly what the hypervisor
> expects. This VMBus based device ID is needed when attaching devices in
> an L1VH based guest VM. Add a function to build and export it. Before
> building it, a check is done to make sure the device is a valid VMBus
> device.
>
> In remaining cases, PCI device ID is used. So, also make PCI device ID
> build function hv_build_devid_type_pci() public.
>
Reviewed-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
> arch/x86/hyperv/irqdomain.c | 9 +++++----
> arch/x86/include/asm/mshyperv.h | 6 ++++++
> drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
> include/asm-generic/mshyperv.h | 11 +++++++++++
> 4 files changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dc..8780573a4332 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> return 0;
> }
>
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
> {
> int pos;
> union hv_device_id hv_devid;
> @@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> }
>
> out:
> - return hv_devid;
> + return hv_devid.as_uint64;
> }
> +EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
>
> /*
> * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> @@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
>
> msidesc = irq_data_get_msi_desc(data);
> pdev = msi_desc_to_pci_dev(msidesc);
> - hv_devid = hv_build_devid_type_pci(pdev);
> + hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>
> return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> @@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> {
> union hv_device_id hv_devid;
>
> - hv_devid = hv_build_devid_type_pci(pdev);
> + hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> }
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee..2ef34001f8d3 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
> static inline void hv_root_crash_init(void) {}
> #endif /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
>
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> +
> #else /* CONFIG_HYPERV */
> static inline void hyperv_init(void) {}
> static inline void hyperv_setup_mmu_ops(void) {}
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad..50d793ca8f31 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -573,6 +573,7 @@ struct hv_pci_compl {
> };
>
> static void hv_pci_onchannelcallback(void *context);
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
>
> #ifdef CONFIG_X86
> #define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
> @@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
> static void hv_arch_irq_unmask(struct irq_data *data) { }
> #endif /* CONFIG_ARM64 */
>
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{
> + struct hv_pcibus_device *hbus;
> + struct pci_bus *pbus = pdev->bus;
> +
> + if (!hv_vmbus_pci_device(pbus))
> + return 0;
> +
> + hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> +
> + return (hbus->hdev->dev_instance.b[5] << 24) |
> + (hbus->hdev->dev_instance.b[4] << 16) |
> + (hbus->hdev->dev_instance.b[7] << 8) |
> + (hbus->hdev->dev_instance.b[6] & 0xf8) |
> + PCI_FUNC(pdev->devfn);
> +}
> +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> +
> /**
> * hv_pci_generic_compl() - Invoked for a completion packet
> * @context: Set up by the sender of the packet.
> @@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
> .write = hv_pcifront_write_config,
> };
>
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> +{
> + return pbus->ops == &hv_pcifront_ops;
> +}
> +
> /*
> * Paravirtual backchannel
> *
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index e8cbc4e3f7ad..25ac7ca0fd8b 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
> /* Sentinel value for an uninitialized entry in hv_vp_index array */
> #define VP_INVAL U32_MAX
>
> +/* Forward declarations */
> +struct pci_dev;
> +
> int __init hv_common_init(void);
> void __init hv_get_partition_id(void);
> void __init hv_common_free(void);
> @@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
> void hyperv_cleanup(void);
> bool hv_query_ext_cap(u64 cap_query);
> void hv_setup_dma_ops(struct device *dev, bool coherent);
> +
> +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{ return 0; }
> +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> +
> #else /* CONFIG_HYPERV */
> static inline void hv_identify_partition_type(void) {}
> static inline bool hv_is_hyperv_initialized(void) { return false; }
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply
* RE: [PATCH v4] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Michael Kelley @ 2026-05-13 15:13 UTC (permalink / raw)
To: Anirudh Rayabharam (Microsoft), K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
Stanislav Kinsburskii
In-Reply-To: <20260513-huge_1g-v4-1-33cda59e4a70@anirudhrb.com>
From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Sent: Wednesday, May 13, 2026 6:26 AM
>
> The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
> chunks into 1G mappings when alignment permits, so the driver can
> support 1G hugepages by feeding them in as 2M chunks. Note that this
> is the only way to make 1G mappings; there is no way to directly map
> a 1G hugepage using the hypercall.
>
> Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
> hypercall has no 1G stride, so 1G folios are processed as a
> sequence of 2M chunks. Folios whose order is less than PMD_ORDER
> (e.g. mTHP) fall back to single-page stride; mapping them as 2M
> would fail in the hypervisor anyway.
>
> Assisted-by: Copilot-CLI:claude-opus-4.7
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> ---
> Changes in v4:
> - Changed the check to page_order < PMD_ORDER for using page stride of 1
> - Also updated the commit message.
> - Pick up Acked-by:
> - Link to v3: https://lore.kernel.org/r/20260506-huge_1g-v3-1-26e1e4c439e4@anirudhrb.com
>
> Changes in v3:
> - Fixed various corner cases reported by Sashiko.
> - Link to v2: https://lore.kernel.org/r/20260505-huge_1g-v2-1-b6a91327a88d@anirudhrb.com
>
> Changes in v2:
> - Handled the case where we can have 2M aligned pages in the middle of a
> 1G page
> - Brought back the page order check but expanded it to include 1G
> - Clamp stride to requested page count in mshv_region_process_chunk
> - Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
> ---
> drivers/hv/mshv_regions.c | 29 +++++++++++++----------------
> 1 file changed, 13 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..6d65e5b42152 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -29,29 +29,27 @@
> * Uses huge page stride if the backing page is huge and the guest mapping
> * is properly aligned; otherwise falls back to single page stride.
> *
> - * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + * Return: Stride in pages.
> */
> -static int mshv_chunk_stride(struct page *page,
> - u64 gfn, u64 page_count)
> +static unsigned int mshv_chunk_stride(struct page *page, u64 gfn,
> + u64 page_count)
> {
> - unsigned int page_order;
> + unsigned int page_order = folio_order(page_folio(page));
>
> /*
> * Use single page stride by default. For huge page stride, the
> - * page must be compound and point to the head of the compound
> - * page, and both gfn and page_count must be huge-page aligned.
> + * folio order must be at least PMD_ORDER, the page's PFN must be
> + * 2M-aligned (so that a 2M-aligned tail page of a larger folio is
> + * acceptable), and both gfn and page_count must be 2M-aligned.
> */
> - if (!PageCompound(page) || !PageHead(page) ||
> + if (page_order < PMD_ORDER ||
> + !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
> !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> !IS_ALIGNED(page_count, PTRS_PER_PMD))
> return 1;
>
> - page_order = folio_order(page_folio(page));
> - /* The hypervisor only supports 2M huge page */
> - if (page_order != PMD_ORDER)
> - return -EINVAL;
> -
> - return 1 << page_order;
> + /* Use 2M stride always i.e. process 1G folios as 2M chunks */
> + return 1 << PMD_ORDER;
> }
>
> /**
> @@ -86,15 +84,14 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> u64 gfn = region->start_gfn + page_offset;
> u64 count;
> struct page *page;
> - int stride, ret;
> + unsigned int stride;
> + int ret;
>
> page = region->mreg_pages[page_offset];
> if (!page)
> return -EINVAL;
>
> stride = mshv_chunk_stride(page, gfn, page_count);
> - if (stride < 0)
> - return stride;
>
> /* Start at stride since the first stride is validated */
> for (count = stride; count < page_count; count += stride) {
>
> ---
> base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
> change-id: 20260416-huge_1g-e44461393c8f
>
> Best regards,
> --
> Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
>
^ permalink raw reply
* Re: [PATCH V1 3/3] mshv: Implement guest irq migration for passthru devices
From: Souradeep Chakrabarti @ 2026-05-13 15:15 UTC (permalink / raw)
To: Mukesh R
Cc: hpa, robin.murphy, robh, linux-hyperv, linux-kernel, iommu,
linux-pci, linux-arch
In-Reply-To: <20260512021242.1679786-4-mrathor@linux.microsoft.com>
On Mon, May 11, 2026 at 07:12:42PM -0700, Mukesh R wrote:
> Ask the hypervisor to retarget interrupts to new guest cpu or vector
> upon guest irq migration. This happens in the irqfd update path.
>
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_eventfd.c | 78 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 76 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 1f5c1e9ee9b7..c05201d857fd 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -192,6 +192,77 @@ static int mshv_map_device_interrupt(u64 ptid, union hv_device_id hv_devid,
>
> }
>
> +/* NOTE: caller does spin_lock_irq on pt_irqfds_lock, hence no disable here */
> +static void mshv_do_guest_irq_retarget(u64 partid, struct mshv_irqfd *irqfd)
> +{
> + int rc, var_size;
> + u64 status;
> + union hv_device_id hv_devid;
> + struct hv_input_get_vp_set_from_mda *mda_input;
> + union hv_output_get_vp_set_from_mda *mda_output;
> + struct hv_retarget_device_interrupt *remap_inp;
> + struct pci_dev *pdev;
> + struct irq_data *irqdata;
> + struct mshv_lapic_irq *lapic_irq = &irqfd->irqfd_lapic_irq;
> + struct hv_interrupt_entry *inte = NULL;
> +
> + if (!irqfd->irqfd_girq_ent.girq_entry_valid ||
> + irqfd->irqfd_bypass_prod == NULL)
> + return;
> +
> + rc = mshv_parse_mshv_irqfd(irqfd, &pdev, &irqdata);
> + if (rc)
> + return;
> +
> + inte = irqdata->chip_data;
> + if (inte == NULL)
> + return;
> +
> + hv_devid.as_uint64 = hv_devid_from_pdev(pdev);
> +
> +
> + mda_input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + mda_output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> + rc = hv_vpset_from_hyp_disabled(mda_input, mda_output, lapic_irq,
> + partid);
> + if (rc)
> + return;
> +
> + remap_inp = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(remap_inp, 0, sizeof(*remap_inp));
> +
> + rc = hv_copy_vpset(&remap_inp->int_target.vp_set,
> + &mda_output->target_vpset);
> + if (rc <= 0) {
> + pr_err("Hyper-V: ptid %lld - vpset copy failed (%d)\n",
> + partid, rc);
> + return;
> + }
> +
> + /*
> + * var-sized hcall: var-size starts after vp_mask (thus vp_set.format
> + * does not count, but vp_set.valid_bank_mask does).
> + */
> + var_size = rc + 1;
> +
> + remap_inp->partition_id = partid;
> + remap_inp->device_id = hv_devid.as_uint64;
> + remap_inp->int_target.vector = lapic_irq->lapic_vector;
> + remap_inp->int_target.flags = HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET;
> +
> + remap_inp->int_entry.source = inte->source;
> + remap_inp->int_entry.msi_entry.as_uint64 = inte->msi_entry.as_uint64;
> +
> + status = hv_do_rep_hypercall(HVCALL_RETARGET_INTERRUPT, 0, var_size,
> + remap_inp, NULL);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "pt:%lld vec:%d lapic-id:%lld\n",
> + partid, lapic_irq->lapic_vector,
> + lapic_irq->lapic_apic_id);
> +}
> +
> static int mshv_unmap_device_interrupt(union hv_device_id hv_devid,
> struct hv_interrupt_entry *irq_entry)
> {
> @@ -729,9 +800,12 @@ static void mshv_irqfd_update(struct mshv_partition *pt,
> struct mshv_irqfd *irqfd)
> {
> write_seqcount_begin(&irqfd->irqfd_irqe_sc);
> - irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt,
> - irqfd->irqfd_irqnum);
> + irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt, irqfd->irqfd_irqnum);
> mshv_copy_girq_info(&irqfd->irqfd_girq_ent, &irqfd->irqfd_lapic_irq);
> +
> +#if IS_ENABLED(CONFIG_X86_64)
> + mshv_do_guest_irq_retarget(pt->pt_id, irqfd);
> +#endif
> write_seqcount_end(&irqfd->irqfd_irqe_sc);
> }
>
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply
* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Souradeep Chakrabarti @ 2026-05-13 15:17 UTC (permalink / raw)
To: Mukesh R
Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
bhelgaas, arnd, jacob.pan
In-Reply-To: <agST9ZRXNWhbUGMY@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Wed, May 13, 2026 at 08:08:37AM -0700, Souradeep Chakrabarti wrote:
> On Mon, May 11, 2026 at 07:02:56PM -0700, Mukesh R wrote:
> > On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> > interrupts, etc need a device ID as a parameter. This device ID refers
> > to that specific device during the lifetime of passthru.
> >
> > An L1VH VM only contains VMBus based devices. A device ID for a VMBus
> > device is slightly different in that it uses the hv_pcibus_device info
> > for building it to make sure it matches exactly what the hypervisor
> > expects. This VMBus based device ID is needed when attaching devices in
> > an L1VH based guest VM. Add a function to build and export it. Before
> > building it, a check is done to make sure the device is a valid VMBus
> > device.
> >
> > In remaining cases, PCI device ID is used. So, also make PCI device ID
> > build function hv_build_devid_type_pci() public.
> >
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Reviewed-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
> > Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> > ---
> > arch/x86/hyperv/irqdomain.c | 9 +++++----
> > arch/x86/include/asm/mshyperv.h | 6 ++++++
> > drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
> > include/asm-generic/mshyperv.h | 11 +++++++++++
> > 4 files changed, 46 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> > index b3ad50a874dc..8780573a4332 100644
> > --- a/arch/x86/hyperv/irqdomain.c
> > +++ b/arch/x86/hyperv/irqdomain.c
> > @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> > return 0;
> > }
> >
> > -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> > +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
> > {
> > int pos;
> > union hv_device_id hv_devid;
> > @@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> > }
> >
> > out:
> > - return hv_devid;
> > + return hv_devid.as_uint64;
> > }
> > +EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
> >
> > /*
> > * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> > @@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
> >
> > msidesc = irq_data_get_msi_desc(data);
> > pdev = msi_desc_to_pci_dev(msidesc);
> > - hv_devid = hv_build_devid_type_pci(pdev);
> > + hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> > cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
> >
> > return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> > @@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> > {
> > union hv_device_id hv_devid;
> >
> > - hv_devid = hv_build_devid_type_pci(pdev);
> > + hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
> > return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> > }
> >
> > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > index f64393e853ee..2ef34001f8d3 100644
> > --- a/arch/x86/include/asm/mshyperv.h
> > +++ b/arch/x86/include/asm/mshyperv.h
> > @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
> > static inline void hv_root_crash_init(void) {}
> > #endif /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
> >
> > +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> > +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> > +#else
> > +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> > +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
> > +
> > #else /* CONFIG_HYPERV */
> > static inline void hyperv_init(void) {}
> > static inline void hyperv_setup_mmu_ops(void) {}
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index cfc8fa403dad..50d793ca8f31 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -573,6 +573,7 @@ struct hv_pci_compl {
> > };
> >
> > static void hv_pci_onchannelcallback(void *context);
> > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> >
> > #ifdef CONFIG_X86
> > #define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
> > @@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
> > static void hv_arch_irq_unmask(struct irq_data *data) { }
> > #endif /* CONFIG_ARM64 */
> >
> > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > +{
> > + struct hv_pcibus_device *hbus;
> > + struct pci_bus *pbus = pdev->bus;
> > +
> > + if (!hv_vmbus_pci_device(pbus))
> > + return 0;
> > +
> > + hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > +
> > + return (hbus->hdev->dev_instance.b[5] << 24) |
> > + (hbus->hdev->dev_instance.b[4] << 16) |
> > + (hbus->hdev->dev_instance.b[7] << 8) |
> > + (hbus->hdev->dev_instance.b[6] & 0xf8) |
> > + PCI_FUNC(pdev->devfn);
> > +}
> > +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> > +
> > /**
> > * hv_pci_generic_compl() - Invoked for a completion packet
> > * @context: Set up by the sender of the packet.
> > @@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
> > .write = hv_pcifront_write_config,
> > };
> >
> > +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> > +{
> > + return pbus->ops == &hv_pcifront_ops;
> > +}
> > +
> > /*
> > * Paravirtual backchannel
> > *
> > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > index e8cbc4e3f7ad..25ac7ca0fd8b 100644
> > --- a/include/asm-generic/mshyperv.h
> > +++ b/include/asm-generic/mshyperv.h
> > @@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
> > /* Sentinel value for an uninitialized entry in hv_vp_index array */
> > #define VP_INVAL U32_MAX
> >
> > +/* Forward declarations */
> > +struct pci_dev;
> > +
> > int __init hv_common_init(void);
> > void __init hv_get_partition_id(void);
> > void __init hv_common_free(void);
> > @@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
> > void hyperv_cleanup(void);
> > bool hv_query_ext_cap(u64 cap_query);
> > void hv_setup_dma_ops(struct device *dev, bool coherent);
> > +
> > +#if IS_ENABLED(CONFIG_PCI_HYPERV)
> > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > +#else
> > +static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > +{ return 0; }
> > +#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
> > +
> > #else /* CONFIG_HYPERV */
> > static inline void hv_identify_partition_type(void) {}
> > static inline bool hv_is_hyperv_initialized(void) { return false; }
> > --
> > 2.51.2.vfs.0.1
> >
^ permalink raw reply
* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
From: Frederic Weisbecker @ 2026-05-13 16:19 UTC (permalink / raw)
To: Waiman Long
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
Thomas Gleixner, Chen Ridong, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
Qiliang Yuan
In-Reply-To: <20260421030351.281436-9-longman@redhat.com>
Le Mon, Apr 20, 2026 at 11:03:36PM -0400, Waiman Long a écrit :
> As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
> need to use RCU to protect access to the cpumask to prevent it from
> going away in the middle of the operation.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> arch/arm64/kernel/topology.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb..48f150801689 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
> if (!amu_fie_cpu_supported(cpu))
> return;
>
> + guard(rcu)();
> /* Kick in AMU update but only if one has not happened already */
> if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
> time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update,
> cpu)))
This is called with IRQs disabled in the current CPU that is online so it's
already guaranteed to be stable.
> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
> unsigned int start_cpu = cpu;
> unsigned long last_update;
> unsigned int freq = 0;
> + bool hk_cpu;
> u64 scale;
>
> if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
> return -EOPNOTSUPP;
>
> + scoped_guard(rcu) {
> + hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> + }
> +
> while (1) {
>
> amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
> * (and thus freq scale), if available, for given policy: this boils
> * down to identifying an active cpu within the same freq domain, if any.
> */
> - if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> + if (!hk_cpu ||
> time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> + bool hk_intersects;
> int ref_cpu;
>
> if (!policy)
> return -EINVAL;
>
> - if (!cpumask_intersects(policy->related_cpus,
> - housekeeping_cpumask(HK_TYPE_TICK))) {
> + scoped_guard(rcu) {
> + hk_intersects = cpumask_intersects(policy->related_cpus,
> + housekeeping_cpumask(HK_TYPE_TICK));
> + }
> +
> + if (!hk_intersects) {
> cpufreq_cpu_put(policy);
> return -EOPNOTSUPP;
> }
Ok so this is racy but it's fine because:
This function is only used by cpufreq with either cpufreq_policy_write or
cpufreq_policy_read held (that is, struct cpufreq_policy::rwsem).
And that rwsem is write held on cpufreq_online() -> cpufreq_policy_online() and
also offline to guarantee the policy->cpus and policy->cpu stability.
Therefore housekeeping_cpumask() should only deal with stable online CPUs here. So
even if the housekeeping mask can be changed concurrently, those CPUs can't
appear or disappear from it.
Would be worth adding a comment about that.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply
* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Stanislav Kinsburskii @ 2026-05-13 17:31 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-roaring-gentle-kiwi-21bccd@anirudhrb>
On Wed, May 13, 2026 at 11:15:37AM +0000, Anirudh Rayabharam wrote:
> On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> > On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > > fails on an unencrypted partition. The condition on the error path:
> > > >
> > > > if (ret && mshv_partition_encrypted(partition))
> > > >
> > > > only handles map failures for encrypted partitions — if the partition is
> > > > not encrypted and the map fails, execution falls through to 'return 0',
> > > > silently ignoring the error.
> > > >
> > > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > > zeroes the mreg_pages array before the caller's cleanup path
> > > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > > mappings for partially-mapped pages.
> > > >
> > > > Fix by returning immediately on success and falling through to error
> > > > return on failure. For unencrypted partitions, the caller's
> > > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > > correct order. For encrypted partitions where re-sharing fails, zero
> > > > the page array without unpinning — the pages are inaccessible to the
> > > > host and must not be unpinned, but zeroing prevents
> > > > mshv_region_destroy() from attempting to unpin them.
> > >
> > > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > > mshv_region_invalidate_pages() which just does:
> > >
> > > static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > > u64 page_offset, u64 page_count)
> > > {
> > > if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > > unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > >
> > > memset(region->mreg_pages + page_offset, 0,
> > > page_count * sizeof(struct page *));
> > > }
> > >
> > > It doesn't checked for zeroed pages before unpinning.
> > >
> > > >
> > > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > > drivers/hv/mshv_root_main.c | 26 ++++++++++++++++----------
> > > > 1 file changed, 16 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index 665d565899c15..7e4252b6bc65c 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > > > pt_err(partition,
> > > > "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > > > region->start_gfn, ret);
> > > > - goto invalidate_region;
> > > > + goto err_out;
> > > > }
> > > > }
> > > >
> > > > ret = mshv_region_map(region);
> > > > - if (ret && mshv_partition_encrypted(partition)) {
> > > > + if (ret)
> > > > + goto share_region;
> > > > +
> > > > + return 0;
> > > > +
> > > > +share_region:
> > > > + if (mshv_partition_encrypted(partition)) {
> > > > int shrc;
> > > >
> > > > shrc = mshv_region_share(region);
> > > > if (!shrc)
> > > > - goto invalidate_region;
> > > > + goto err_out;
> > > >
> > > > pt_err(partition,
> > > > "Failed to share memory region (guest_pfn: %llu): %d\n",
> > > > region->start_gfn, shrc);
> > > > /*
> > > > - * Don't unpin if marking shared failed because pages are no
> > > > - * longer mapped in the host, ie root, anymore.
> > > > + * Re-sharing failed — the pages remain inaccessible to the
> > > > + * host. Zero the page array so that mshv_region_destroy()
> > > > + * won't attempt to unpin them (leaking the page references
> > > > + * is intentional; unpinning host-inaccessible pages would be
> > > > + * unsafe).
> > > > */
> > >
> > > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > > can remove it from here altogether. Either way, should this zeroing
> > > logic be added there too so as not to crash the host by unpinning pages
> > > that weren't marked shared?
> > >
> >
> > Indeed.
> > The issue with all this code is that we are leaking state in
> > mshv_region_pin if we wimply return from it: we don't know if the pages
> > are pinned or unshared (or mapped) in the destruction callback.
> > We should either introduce a state variable to track this or just don't
> > call the generic mshv_region_put on case of region creation error.
> > The latter approch make mshv_map_user_memory idempotent and thus looks
> > like a better way forward.
> > What do you think?
>
I tried to say, that there can be two apporaches to solving this issue:
1. Either make sure mshv_region_pus() can destroy a partially created
region (tha'ts what this patch is doing) or
2. Make mshv_map_user_region be idempotent by not calling mshv_region_put() if the region was not
fully created and destroy the partioally created parts of it on
error paths. This would require preserving some state in the region
struct to know if the region was pinned or shared or mapped.
This looks like a leaner apporach, but it requires additional state
tracking and more code changes.
Thanks,
Stanislav
> I'm not sure I follow the latter approach. Omitting mshv_region_put()
> would cause a dangling reference to the mshv_region right?
>
> Thanks,
> Anirudh.
^ permalink raw reply
* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Stanislav Kinsburskii @ 2026-05-13 17:38 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-omniscient-enchanted-otter-dfd602@anirudhrb>
On Wed, May 13, 2026 at 12:14:49PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> > In mshv_irqfd_assign(), the level-triggered validation for resample
> > irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> > mshv_irqfd_update() has populated the field. Since the irqfd struct is
> > zero-allocated, level_triggered is always 0 at that point, causing the
> > check to always reject resample irqfds with -EINVAL. This makes
> > level-triggered interrupt resampling — used to avoid interrupt storms
> > with assigned devices — completely non-functional.
>
> What bugs would this manifest as? Why haven't we seen any such bugs so
> far?
>
This patch fixes a logical error.
Whtout the change this hunk always fails:
if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
!irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
and the reason we never seen it as that we never used
register_irqfd_with_resample() function of the mshv crate.
Thanks,
Stanislav
> Thanks,
> Anirudh.
^ permalink raw reply
* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Stanislav Kinsburskii @ 2026-05-13 17:39 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260513-grinning-firefly-of-refinement-ed06cd@anirudhrb>
On Wed, May 13, 2026 at 11:12:05AM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> > handle_pair_message() iterates up to msg->vp_count without verifying it
> > against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> > from untrusted hypervisor data, a malformed message with a large value
> > would cause out-of-bounds reads from the partition_ids and vp_indexes
> > arrays.
> >
> > handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> > 64) and advances bank_contents for each one. However, the payload buffer
> > only has space for 16 bank entries. A valid_bank_mask with more than 16
> > bits set causes bank_contents to read beyond the message buffer.
> >
> > Fix both by adding bounds validation:
> > - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> > - Track banks consumed and stop before exceeding buffer capacity
> >
> > Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_synic.c | 20 ++++++++++++++++++--
> > 1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index 89207aad7cf1f..5d509299f14d7 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
> > static void
> > handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> > {
> > - int bank_idx, vps_signaled = 0, bank_mask_size;
> > + int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> > + const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> > + sizeof(u64) - 2; /* subtract format + mask */
>
> Could this be a constant in the header?
>
Yes, it could. But it the only place it's used and it's pretty
self-explanatory, so I don't think it needs to be.
Thanks,
Stanislav
> Thanks,
> Anirudh.
>
^ permalink raw reply
* RE: [EXTERNAL] [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Long Li @ 2026-05-13 18:30 UTC (permalink / raw)
To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, James.Bottomley@HansenPartnership.com,
martin.petersen@oracle.com, Allen Pais
Cc: Tianyu Lan, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org,
vdso@hexbites.dev, mhklinux@outlook.com
In-Reply-To: <20260408073105.272255-1-tiala@microsoft.com>
> Hyper-V provides Confidential VMBus to communicate between device model
> and device guest driver via encrypted/private memory in Confidential VM. The
> device model is in OpenHCL
> (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fopenvm
> m.dev%2Fguide%2Fuser_guide%2Fopenhcl.html&data=05%7C02%7Clongli%40mi
> crosoft.com%7C0ccfea7cda8e4500ae9808de9540d01e%7C72f988bf86f141af91a
> b2d7cd011db47%7C1%7C0%7C639112302777934798%7CUnknown%7CTWFpbG
> Zsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIk
> FOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=5Uc%2FM4ZVgJT1
> NAq08cIlNtfF5oW4n%2FTj%2Bqg3YqBUeZg%3D&reserved=0) that plays the
> paravisor role.
>
> For a VMBus device, there are two communication methods to talk with
> Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic DMA transfer.
>
> The Confidential VMBus Ring buffer has been upstreamed by Roman Kisel(commit
> 6802d8af47d1).
>
> The dynamic DMA transition of VMBus device normally goes through DMA core
> and it uses SWIOTLB as bounce buffer in a CoCo VM.
>
> The Confidential VMBus device can do DMA directly to private/encrypted
> memory. Because the swiotlb is decrypted memory, the DMA transfer must not
> be bounced through the swiotlb, so as to preserve confidentiality. This is different
> from the default for Linux CoCo VMs, so not use DMA(SWIOTLB) API in VMBus
> driver when confidential dynamic DMA transfers capability is present.
>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
> drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
> include/linux/hyperv.h | 1 +
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index
> ae1abab97835..79b7611518b7 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1316,7 +1316,8 @@ static void storvsc_on_channel_callback(void *context)
> continue;
> }
> request = (struct storvsc_cmd_request
> *)scsi_cmd_priv(scmnd);
> - scsi_dma_unmap(scmnd);
> + if (!device->co_external_memory)
> + scsi_dma_unmap(scmnd);
> }
>
> storvsc_on_receive(stor_device, packet, request); @@ -
> 1339,6 +1340,8 @@ static int storvsc_connect_to_vsp(struct hv_device *device,
> u32 ring_size,
>
> device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
> device->channel->next_request_id_callback = storvsc_next_request_id;
> + if (device->channel->co_external_memory)
> + device->co_external_memory = true;
>
> ret = vmbus_open(device->channel,
> ring_size,
> @@ -1805,7 +1808,7 @@ static enum scsi_qc_status
> storvsc_queuecommand(struct Scsi_Host *host,
> unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
> unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
> struct scatterlist *sg;
> - unsigned long hvpfn, hvpfns_to_add;
> + unsigned long hvpfn, hvpfns_to_add, hvpgoff;
> int j, i = 0, sg_count;
>
> payload_sz = (hvpg_count * sizeof(u64) + @@ -1821,7 +1824,11
> @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
> payload->range.len = length;
> payload->range.offset = offset_in_hvpg;
>
> - sg_count = scsi_dma_map(scmnd);
> + if (dev->co_external_memory)
> + sg_count = scsi_sg_count(scmnd);
scsi_sg_count() returns unsigned int, sg_count can't be negative. The check for sg_count < 0 below becomes dead code. Add a comment to say this is expected behavior.
> + else
> + sg_count = scsi_dma_map(scmnd);
> +
> if (sg_count < 0) {
> ret = SCSI_MLQUEUE_DEVICE_BUSY;
> goto err_free_payload;
> @@ -1836,9 +1843,16 @@ static enum scsi_qc_status
> storvsc_queuecommand(struct Scsi_Host *host,
> * Such offsets are handled even on other than the first
> * sgl entry, provided they are a multiple of PAGE_SIZE.
> */
> - hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> - hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
> - sg_dma_len(sg)) - hvpfn;
> + if (dev->co_external_memory) {
> + hvpgoff = HVPFN_DOWN(sg->offset);
> + hvpfn = page_to_hvpfn(sg_page(sg)) + hvpgoff;
> + hvpfns_to_add = HVPFN_UP(sg->offset
> + sg->length) -
> + hvpgoff;
> + } else {
> + hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> + hvpfns_to_add =
> HVPFN_UP(sg_dma_address(sg) +
> + sg_dma_len(sg)) -
> hvpfn;
> + }
>
> /*
> * Fill the next portion of the PFN array with @@ -1860,7
> +1874,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct
> Scsi_Host *host,
> ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
> migrate_enable();
>
> - if (ret)
> + if (ret && (!dev->co_external_memory))
> scsi_dma_unmap(scmnd);
>
> if (ret == -EAGAIN) {
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index
> dfc516c1c719..bcb143766d6e 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1285,6 +1285,7 @@ struct hv_device {
>
> /* place holder to keep track of the dir for hv device in debugfs */
> struct dentry *debug_dir;
> + bool co_external_memory;
You don't need to introduce co_external_memory in hv_device, vmbus_channel already has co_external_memory. Is it possible that you can check the vmbus_channel->co_external_memory directly? If you can remove this, you can reword this patch to " scsi: storvsc: Confidential VMBus for dynamic DMA transfers".
Thanks,
Long
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox