Linux-HyperV List


* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
From: Anirudh Rayabharam @ 2026-05-13 12:14 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862362.21765.11809618639989414561.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> In mshv_irqfd_assign(), the level-triggered validation for resample
> irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> mshv_irqfd_update() has populated the field. Since the irqfd struct is
> zero-allocated, level_triggered is always 0 at that point, causing the
> check to always reject resample irqfds with -EINVAL. This makes
> level-triggered interrupt resampling — used to avoid interrupt storms
> with assigned devices — completely non-functional.

What bugs would this manifest as? Why haven't we seen any such bugs so
far?

Thanks,
Anirudh.
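
For readers following the thread, the ordering problem described in the quoted commit message can be sketched in user-space C. This is an illustration only: `fake_irqfd`, `fake_update()` and both assign functions are hypothetical stand-ins, not the mshv code, and the reordered variant is just one possible shape of a fix, not necessarily what the patch does.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the irqfd state; not the real mshv struct. */
struct fake_irqfd {
	bool resample;
	bool level_triggered;	/* populated only by fake_update() */
};

/* Stand-in for the step that fills in the routing-derived fields. */
static void fake_update(struct fake_irqfd *fd, bool routed_level)
{
	fd->level_triggered = routed_level;
}

/* Check-before-populate: the field is still zero from calloc(). */
static int assign_check_first(bool resample, bool routed_level)
{
	struct fake_irqfd *fd = calloc(1, sizeof(*fd));
	int ret = 0;

	if (!fd)
		return -ENOMEM;
	fd->resample = resample;
	/* level_triggered is still 0 here, so resample is always rejected. */
	if (fd->resample && !fd->level_triggered)
		ret = -EINVAL;
	else
		fake_update(fd, routed_level);
	free(fd);
	return ret;
}

/* Populate-before-check: the validation sees the real value. */
static int assign_update_first(bool resample, bool routed_level)
{
	struct fake_irqfd *fd = calloc(1, sizeof(*fd));
	int ret = 0;

	if (!fd)
		return -ENOMEM;
	fd->resample = resample;
	fake_update(fd, routed_level);
	if (fd->resample && !fd->level_triggered)
		ret = -EINVAL;
	free(fd);
	return ret;
}
```

In this sketch the first variant rejects every resample request, even one whose routing is level-triggered, which matches the "always -EINVAL" behaviour the commit message describes.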


* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-13 11:38 UTC
  To: sashiko-reviews, Dennis Dalessandro; +Cc: linux-hyperv
In-Reply-To: <20260513031237.4280DC2BCB0@smtp.kernel.org>

On Wed, May 13, 2026 at 03:12:36AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> - [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
> - [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
> - [High] Locking imbalance and freeing memory while locked.
> - [High] Inconsistent state and Use-After-Free on error path.
> - [Low] Uninitialized variable compiler warning for `offset_addr`.
> --

These are all pre-existing, and rvt is too confusing for me to try to fix.
Dennis will have to handle them.
 
> commit dac7699d9e58c179b3a8859ac64da79e5fd21f46
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Subject: IB/rdmavt: Don't abuse udata and ib_respond_udata()
> 
> This commit modifies rvt_modify_srq to use copy_to_user directly instead
> of ib_copy_to_udata for returning the offset address.
> 
> > diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
> > index fe125bf85b272..d022aa56c5bfd 100644
> > --- a/drivers/infiniband/sw/rdmavt/srq.c
> > +++ b/drivers/infiniband/sw/rdmavt/srq.c
> > @@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> >  	struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
> >  	struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
> >  	struct rvt_rq tmp_rq = {};
> > +	__u64 offset_addr;
> >  	int ret = 0;
> >  
> >  	if (attr_mask & IB_SRQ_MAX_WR) {
> 
> Could hoisting offset_addr lead to a compiler warning for being potentially
> uninitialized?
> 
> The compiler might assume udata->inlen could change concurrently across the
> spin_lock_irq() and spin_unlock_irq() calls. If the first check fails
> but the second one passes:
> 
> if (udata && udata->inlen >= sizeof(__u64))
> 
> Does this allow offset_addr to be used without initialization?
> 
> [ ... ]
> 
> > @@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> >  			return -ENOMEM;
> >  		/* Check that we can write the offset to mmap. */
> >  		if (udata && udata->inlen >= sizeof(__u64)) {
> > -			__u64 offset_addr;
> >  			__u64 offset = 0;
> >  
> >  			ret = ib_copy_from_udata(&offset_addr, udata,
> >  						 sizeof(offset_addr));
> >  			if (ret)
> >  				goto bail_free;
> > -			udata->outbuf = (void __user *)
> > -					(unsigned long)offset_addr;
> > -			ret = ib_copy_to_udata(udata, &offset,
> > -					       sizeof(offset));
> > -			if (ret)
> > +			if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
> > +					 sizeof(offset))) {
> > +				ret = -EFAULT;
> >  				goto bail_free;
> > +			}
> >  		}
> >  
> >  		spin_lock_irq(&srq->rq.kwq->c_lock);
> 
> Can this code leak the kernel queue structure?
> 
> When resizing a user-backed SRQ, the code fetches elements from the old
> user-mapped queue. In the udata path, okwq is left as NULL:
> 
> if (udata) {
>         owq = srq->rq.wq;
>         head = RDMA_READ_UAPI_ATOMIC(owq->head);
>         tail = RDMA_READ_UAPI_ATOMIC(owq->tail);
> } else {
>         okwq = srq->rq.kwq;
>         ...
> 
> When it later calls kvfree(okwq), it seems nothing is freed, which could
> permanently leak the old kwq structure.
> 
> Is there a lock imbalance here when the queue is replaced?
> 
> The original queue is locked with:
> 
> spin_lock_irq(&srq->rq.kwq->c_lock);
> 
> But then the queue pointer is replaced:
> 
> srq->rq.kwq = tmp_rq.kwq;
> 
> When the lock is released:
> 
> spin_unlock_irq(&srq->rq.kwq->c_lock);
> 
> Does this unlock the new queue instead of the old one, and then free the old
> queue via kvfree(okwq) while its lock is still held?
> 
> Can the loop migrating WQEs overflow the destination buffer?
> 
> The number of scatter-gather elements is read directly from user-shared
> memory:
> 
> wqe = rvt_get_rwqe_ptr(&srq->rq, tail);
> p->wr_id = wqe->wr_id;
> p->num_sge = wqe->num_sge;
> for (i = 0; i < wqe->num_sge; i++)
>         p->sg_list[i] = wqe->sg_list[i];
> 
> If a userspace program concurrently modifies num_sge to a large value, could
> this loop exceed the bounds of tmp_rq.kwq->curr_wq, since the destination
> size was calculated based on srq->rq.max_sge?
> 
> [ ... ]
> 
> > @@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> >  			 * See rvt_mmap() for details.
> >  			 */
> >  			if (udata && udata->inlen >= sizeof(__u64)) {
> > -				ret = ib_copy_to_udata(udata, &ip->offset,
> > -						       sizeof(ip->offset));
> > -				if (ret)
> > -					return ret;
> > +				if (copy_to_user(u64_to_user_ptr(offset_addr),
> > +						 &ip->offset,
> > +						 sizeof(ip->offset)))
> > +					return -EFAULT;
> >  			}
> >  
> 
> Does returning -EFAULT here leave the kernel and userspace in an inconsistent
> state?
> 
> At this point, the kernel has already applied the SRQ size change, replaced
> the queue, and permanently freed the old memory:
> 
> vfree(owq);
> kvfree(okwq);
> 
> If copy_to_user fails and returns -EFAULT, userspace might interpret this as
> a failed resize and continue using its existing memory mapping of the old
> queue. Could this lead to a use-after-free since the pages have already
> been freed?
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com?part=2


* Re: [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0
From: Anirudh Rayabharam @ 2026-05-13 11:36 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816862911.21765.306085307721937662.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:49PM +0000, Stanislav Kinsburskii wrote:
> The duplicate routing entry check in mshv_update_routing_table() uses
> guest_irq_num != 0 to detect whether a GSI slot is already occupied.
> This fails for GSI 0 because its guest_irq_num is 0 both when the slot
> is unused (zero-initialized) and when legitimately assigned. As a
> result, duplicate entries for GSI 0 are silently accepted, with the
> second entry overwriting the first — corrupting the routing table
> without any error reported to userspace.
> 
> While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
> in practice, the check is semantically wrong — it conflates
> "uninitialized" with "GSI number 0." Use girq_entry_valid instead,
> which is explicitly set to true when an entry is populated and remains
> zero for unused slots regardless of the GSI number.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_irq.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
> index 59584a132ca9f..db05512db5548 100644
> --- a/drivers/hv/mshv_irq.c
> +++ b/drivers/hv/mshv_irq.c
> @@ -88,7 +88,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
>  		/*
>  		 * Allow only one to one mapping between GSI and MSI routing.
>  		 */
> -		if (girq->guest_irq_num != 0) {
> +		if (girq->girq_entry_valid) {
>  			r = -EINVAL;
>  			goto out;
>  		}
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
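
The difference between the two occupancy checks in the quoted patch can be reproduced in a small user-space sketch. Everything here is assumed for illustration: `demo_girq`, `DEMO_MAX_GSI` and both helper functions are hypothetical and only mirror the idea of a sentinel value versus an explicit validity flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_GSI 4	/* hypothetical table size */

/* Hypothetical routing slot; not the real mshv structure. */
struct demo_girq {
	uint32_t guest_irq_num;
	bool entry_valid;
};

/* Sentinel check: treats guest_irq_num == 0 as "slot unused". */
static int demo_add_sentinel(struct demo_girq *tbl, uint32_t gsi)
{
	if (gsi >= DEMO_MAX_GSI)
		return -EINVAL;
	if (tbl[gsi].guest_irq_num != 0)
		return -EINVAL;	/* never fires for GSI 0 */
	tbl[gsi].guest_irq_num = gsi;
	tbl[gsi].entry_valid = true;
	return 0;
}

/* Flag check: occupancy is tracked independently of the GSI value. */
static int demo_add_flag(struct demo_girq *tbl, uint32_t gsi)
{
	if (gsi >= DEMO_MAX_GSI)
		return -EINVAL;
	if (tbl[gsi].entry_valid)
		return -EINVAL;
	tbl[gsi].guest_irq_num = gsi;
	tbl[gsi].entry_valid = true;
	return 0;
}
```

With a zeroed table, adding GSI 0 twice succeeds both times under the sentinel check but is rejected the second time under the flag check; for any non-zero GSI the two behave the same.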



* Re: [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free
From: Anirudh Rayabharam @ 2026-05-13 11:22 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816864533.21765.15114003791446487007.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:05PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_free() uses synchronize_rcu() followed by kfree() to
> reclaim port table entries. This blocks the caller until a full RCU
> grace period elapses, which is unnecessary since the same module already
> uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().
> 
> Replace with kfree_rcu() to avoid the blocking wait and keep the
> reclamation strategy consistent across the file.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
> index d87a82e399e96..42d21b92b88fd 100644
> --- a/drivers/hv/mshv_portid_table.c
> +++ b/drivers/hv/mshv_portid_table.c
> @@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
>  	WARN_ON(!info);
>  	idr_unlock(&port_table_idr);
>  
> -	synchronize_rcu();
> -	kfree(info);
> +	kfree_rcu(info, portbl_rcu);
>  }
>  
>  /*
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>



* Re: [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
From: Anirudh Rayabharam @ 2026-05-13 11:20 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816863447.21765.7284842709694944084.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:43:54PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_lookup() previously took rcu_read_lock() internally, ran
> idr_find(), released the read lock, and copied the struct contents
> into a caller-supplied buffer.  This had two problems.
> 
> 1. The struct copy ran outside the read section, racing with
>    mshv_portid_free() which does idr_remove + synchronize_rcu + kfree.
>    A copy that started just before synchronize_rcu() observed the read
>    section as already drained and was free to read freed memory while
>    the writer was kfree()'ing the entry.
> 
> 2. The only consumer, mshv_doorbell_isr(), then dispatched a callback
>    using fields of the snapshot — entirely outside any RCU read
>    section.  The callback's data argument and any field it touches
>    were therefore safe only because mshv_isr() runs from
>    sysvec_hyperv_callback, a non-threaded system vector that
>    synchronize_rcu() implicitly waits for via the hardirq quiescent-
>    state coupling.  That protection is real today but undocumented and
>    fragile: a future move of mshv_isr() to a threaded context, or a
>    future caller that registers a doorbell with a shorter-lived data
>    pointer, would silently expose a use-after-free.
> 
> Make the contract explicit instead of implicit.  mshv_portid_lookup()
> now returns a pointer to the table entry and requires the caller to
> hold rcu_read_lock for the entire lifetime of that pointer.  The
> contract is annotated with __must_hold(RCU) so sparse flags any
> direct caller that forgets it.  The sole caller, mshv_doorbell_isr(),
> takes rcu_read_lock around the whole drain loop, so the lookup, the
> field reads, and the doorbell_cb dispatch all run inside one
> read-side critical section.  synchronize_rcu() in mshv_portid_free()
> now genuinely waits for any in-flight callback before kfree() runs,
> without relying on hardirq context for correctness.
> 
> This also drops the by-value struct copy: entries are publish-once
> (populated before idr_alloc) and free-once (after synchronize_rcu),
> so a pointer dereferenced inside the read section gives a stable
> view of the contents without copying.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |   22 +++++++---------------
>  drivers/hv/mshv_root.h         |    2 +-
>  drivers/hv/mshv_synic.c        |   15 +++++++++------
>  3 files changed, 17 insertions(+), 22 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>



* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Anirudh Rayabharam @ 2026-05-13 11:15 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <agHwhJrqEL3IewHz@skinsburskii.localdomain>

On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > fails on an unencrypted partition. The condition on the error path:
> > > 
> > >     if (ret && mshv_partition_encrypted(partition))
> > > 
> > > only handles map failures for encrypted partitions — if the partition is
> > > not encrypted and the map fails, execution falls through to 'return 0',
> > > silently ignoring the error.
> > > 
> > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > zeroes the mreg_pages array before the caller's cleanup path
> > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > mappings for partially-mapped pages.
> > > 
> > > Fix by returning immediately on success and falling through to error
> > > return on failure. For unencrypted partitions, the caller's
> > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > correct order. For encrypted partitions where re-sharing fails, zero
> > > the page array without unpinning — the pages are inaccessible to the
> > > host and must not be unpinned, but zeroing prevents
> > > mshv_region_destroy() from attempting to unpin them.
> > 
> > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > mshv_region_invalidate_pages() which just does:
> > 
> > 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > 						 u64 page_offset, u64 page_count)
> > 	{
> > 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > 
> > 		memset(region->mreg_pages + page_offset, 0,
> > 		       page_count * sizeof(struct page *));
> > 	}
> > 
> > > It doesn't check for zeroed pages before unpinning.
> > 
> > > 
> > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> > >  1 file changed, 16 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 665d565899c15..7e4252b6bc65c 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > >  			pt_err(partition,
> > >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > >  			       region->start_gfn, ret);
> > > -			goto invalidate_region;
> > > +			goto err_out;
> > >  		}
> > >  	}
> > >  
> > >  	ret = mshv_region_map(region);
> > > -	if (ret && mshv_partition_encrypted(partition)) {
> > > +	if (ret)
> > > +		goto share_region;
> > > +
> > > +	return 0;
> > > +
> > > +share_region:
> > > +	if (mshv_partition_encrypted(partition)) {
> > >  		int shrc;
> > >  
> > >  		shrc = mshv_region_share(region);
> > >  		if (!shrc)
> > > -			goto invalidate_region;
> > > +			goto err_out;
> > >  
> > >  		pt_err(partition,
> > >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> > >  		       region->start_gfn, shrc);
> > >  		/*
> > > -		 * Don't unpin if marking shared failed because pages are no
> > > -		 * longer mapped in the host, ie root, anymore.
> > > +		 * Re-sharing failed — the pages remain inaccessible to the
> > > +		 * host.  Zero the page array so that mshv_region_destroy()
> > > +		 * won't attempt to unpin them (leaking the page references
> > > +		 * is intentional; unpinning host-inaccessible pages would be
> > > +		 * unsafe).
> > >  		 */
> > 
> > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > can remove it from here altogether. Either way, should this zeroing
> > logic be added there too so as not to crash the host by unpinning pages
> > that weren't marked shared?
> > 
> 
> Indeed.
> The issue with all this code is that we are leaking state in
> mshv_region_pin if we simply return from it: we don't know if the pages
> are pinned or unshared (or mapped) in the destruction callback.
> We should either introduce a state variable to track this or just not
> call the generic mshv_region_put in case of a region creation error.
> The latter approach makes mshv_map_user_memory idempotent and thus looks
> like a better way forward.
> What do you think?

I'm not sure I follow the latter approach. Omitting mshv_region_put()
would cause a dangling reference to the mshv_region, right?

Thanks,
Anirudh.
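
The first option raised in the quoted reply, a state variable that records which setup steps completed, could look roughly like the following user-space sketch. All names (`demo_region`, the `DEMO_REGION_*` bits, `demo_prepare()`, `demo_destroy()`) are hypothetical, the real pin/unshare/map calls are replaced by bookkeeping, and the case where re-sharing fails and pages must stay pinned is deliberately not modelled.

```c
#include <stddef.h>

/* Hypothetical state bits, one per setup step that may need undoing. */
enum demo_region_state {
	DEMO_REGION_PINNED   = 1u << 0,
	DEMO_REGION_UNSHARED = 1u << 1,
	DEMO_REGION_MAPPED   = 1u << 2,
};

struct demo_region {
	unsigned int state;	/* which steps have completed so far */
};

/*
 * Run the setup steps in order. fail_at names the step (as a state bit)
 * that should fail, or 0 for none. Completed steps are recorded in
 * region->state so teardown knows exactly what to undo.
 */
static int demo_prepare(struct demo_region *region, unsigned int fail_at)
{
	static const unsigned int steps[] = {
		DEMO_REGION_PINNED, DEMO_REGION_UNSHARED, DEMO_REGION_MAPPED,
	};

	for (size_t i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		if (steps[i] == fail_at)
			return -1;
		region->state |= steps[i];
	}
	return 0;
}

/* Undo only the recorded steps, in reverse order; return what was undone. */
static unsigned int demo_destroy(struct demo_region *region)
{
	unsigned int undone = 0;

	if (region->state & DEMO_REGION_MAPPED)
		undone |= DEMO_REGION_MAPPED;	/* would unmap here */
	if (region->state & DEMO_REGION_UNSHARED)
		undone |= DEMO_REGION_UNSHARED;	/* would re-share here */
	if (region->state & DEMO_REGION_PINNED)
		undone |= DEMO_REGION_PINNED;	/* would unpin here */
	region->state = 0;
	return undone;
}
```

The point of the sketch is only that the destruction path no longer has to guess: a map failure leaves the pinned and unshared bits set and the mapped bit clear.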



* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Anirudh Rayabharam @ 2026-05-13 11:12 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816866691.21765.15605640837157423543.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> handle_pair_message() iterates up to msg->vp_count without verifying it
> against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> from untrusted hypervisor data, a malformed message with a large value
> would cause out-of-bounds reads from the partition_ids and vp_indexes
> arrays.
> 
> handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> 64) and advances bank_contents for each one. However, the payload buffer
> only has space for 16 bank entries. A valid_bank_mask with more than 16
> bits set causes bank_contents to read beyond the message buffer.
> 
> Fix both by adding bounds validation:
> - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> - Track banks consumed and stop before exceeding buffer capacity
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 89207aad7cf1f..5d509299f14d7 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
>  static void
>  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>  {
> -	int bank_idx, vps_signaled = 0, bank_mask_size;
> +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> +			      sizeof(u64) - 2; /* subtract format + mask */

Could this be a constant in the header?

Thanks,
Anirudh.
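
As a rough illustration of both the bounds check and the named-constant suggestion, here is a user-space sketch of a bank walk that stops at the buffer capacity. The value 16 comes from the quoted commit message; `DEMO_MAX_BANKS` and `demo_count_signalled()` are hypothetical names, and the real message layout is not reproduced.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed payload capacity (16 bank entries, per the commit message). */
#define DEMO_MAX_BANKS 16

/*
 * Walk the set bits of valid_bank_mask and count the VP bits in each
 * corresponding bank word, reading at most DEMO_MAX_BANKS words from
 * bank_contents. The number of words consumed is stored in *banks_used_out.
 */
static unsigned int demo_count_signalled(uint64_t valid_bank_mask,
					 const uint64_t *bank_contents,
					 size_t *banks_used_out)
{
	unsigned int signalled = 0;
	size_t banks_used = 0;

	for (int bank = 0; bank < 64; bank++) {
		uint64_t word;

		if (!(valid_bank_mask & (UINT64_C(1) << bank)))
			continue;
		if (banks_used >= DEMO_MAX_BANKS)
			break;	/* mask claims more banks than the buffer holds */
		word = bank_contents[banks_used++];
		while (word) {
			word &= word - 1;	/* clear the lowest set bit */
			signalled++;
		}
	}
	*banks_used_out = banks_used;
	return signalled;
}
```

Keeping the limit as a single macro next to the message definition means the capacity is stated once rather than recomputed in each handler.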



* Re: [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period
From: Anirudh Rayabharam @ 2026-05-13 10:11 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816866152.21765.16203922564983237274.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:21PM +0000, Stanislav Kinsburskii wrote:
> destroy_partition() frees mshv_vp with plain kfree() while ISR readers
> walk pt_vp_array[] under rcu_read_lock().  On non-root schedulers,
> where drain_all_vps() does not run, an in-flight intercept ISR can
> observe a non-NULL pt_vp_array slot and dereference freed memory in
> kick_vp().  On the root scheduler the same race exists in a narrower
> form: drain_vp_signals() synchronises on kick_vp()'s kicked_by_hv flag
> but not on its wake_up() tail, so the wait-queue lock embedded in vp
> can still be held when destroy_partition() reaches kfree(vp).
> 
> Add struct rcu_head vp_rcu to struct mshv_vp, clear the pt_vp_array
> slot before the free, and use kfree_rcu() so the actual kfree happens
> after a grace period.  drain_all_vps() is retained because it serves a
> separate purpose (telling the hypervisor to stop signalling and
> reconciling signal counts) that kfree_rcu() does not address.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h      |    1 +
>  drivers/hv/mshv_root_main.c |    5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>



* Re: [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path
From: Anirudh Rayabharam @ 2026-05-13  9:57 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865606.21765.9555294064589953949.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:16PM +0000, Stanislav Kinsburskii wrote:
> mshv_partition_ioctl_create_vp() initialises a VP struct (allocations,
> mutex_init, init_waitqueue_head, page mappings) and then publishes the
> pointer into partition->pt_vp_array.  Several ISR paths read this array
> locklessly: the intercept ISR, the two scheduler ISRs, and
> mshv_try_assert_irq_fast() on the irqfd fast path.
> 
> Of these, only mshv_try_assert_irq_fast() can structurally race the
> publish.  It runs from an eventfd waker without holding pt_mutex, and
> MSHV_IRQFD does not require the target lapic_apic_id (== vp_index) to
> refer to an existing VP at registration time.  A user can therefore
> register an irqfd targeting a yet-to-be-created VP, then trigger
> mshv_try_assert_irq_fast() concurrently with MSHV_CREATE_VP for the
> same index.  On weakly-ordered architectures the reader can observe a
> non-NULL pointer in pt_vp_array before the initialising stores to the
> VP struct become visible, leading to use of partially-initialised
> fields (e.g. vp_register_page).
> 
> The other ISR readers cannot reach this race: the hypervisor will not
> generate intercept or scheduler messages for a VP that has never been
> told to run, and the user can only call MSHV_RUN_VP on the VP fd
> returned by MSHV_CREATE_VP, which by construction is returned after
> the publish.  Leave those readers as plain loads.
> 
> Use smp_store_release() in mshv_partition_ioctl_create_vp() to publish
> the pointer, and pair it with smp_load_acquire() in
> mshv_try_assert_irq_fast().  On x86 these compile to plain accesses
> under TSO; on ARM64 they emit one-instruction acquire/release barriers,
> acceptable on this fast path.
> 
> The destroy-side path (destroy_partition() clearing pt_vp_array[i] to
> NULL after kfree(vp)) has a separate ordering and lifetime concern
> that is out of scope here.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c   |    9 ++++++++-
>  drivers/hv/mshv_root_main.c |    8 +++++++-
>  2 files changed, 15 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
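
The publish/consume pairing described in the quoted commit message has a close user-space analogue in C11 atomics. The sketch below uses `atomic_store_explicit()` with `memory_order_release` and `atomic_load_explicit()` with `memory_order_acquire` as stand-ins for the kernel's `smp_store_release()` and `smp_load_acquire()`; `demo_vp`, `demo_vp_array` and both functions are hypothetical and not taken from the driver.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_MAX_VPS 8	/* hypothetical array size */

/* Hypothetical VP object with fields that must be visible before use. */
struct demo_vp {
	int index;
	void *register_page;
};

static _Atomic(struct demo_vp *) demo_vp_array[DEMO_MAX_VPS];

/* Writer: finish initialising *vp, then publish it with release semantics. */
static void demo_publish_vp(struct demo_vp *vp, int index, void *page)
{
	vp->index = index;
	vp->register_page = page;
	atomic_store_explicit(&demo_vp_array[index], vp, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store above, so a
 * non-NULL result implies the initialising stores are visible too.
 */
static struct demo_vp *demo_lookup_vp(int index)
{
	if (index < 0 || index >= DEMO_MAX_VPS)
		return NULL;
	return atomic_load_explicit(&demo_vp_array[index],
				    memory_order_acquire);
}
```

A reader that used a relaxed or plain load instead would be allowed, on a weakly ordered machine, to see the pointer before the field stores, which is the partially initialised case the commit message describes.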



* Re: [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: sashiko-bot @ 2026-05-13  6:43 UTC
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-9-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Conflicting declarations and missing `#ifdef` guards for `hv_build_devid_type_pci()` cause build failures under certain Kconfig combinations.
--

commit 086da02227f217e959d3022c592e1b8698814d5b
Author: Mukesh R <mrathor@linux.microsoft.com>

PCI: hv: VMBus and PCI device IDs for PCI passthru

This commit introduces functions to build and export VMBus and PCI device
IDs for use with Hyper-V PCI passthru. It exposes hv_build_devid_type_pci()
and adds hv_pci_vmbus_device_id() to generate the device IDs expected by
the hypervisor.

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dca..8780573a43326 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
>  	int pos;
[ ... ]
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index f64393e853ee3..2ef34001f8d31 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
[ ... ]
> @@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
>  static inline void hv_root_crash_init(void) {}
>  #endif  /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
>  
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +u64 hv_build_devid_type_pci(struct pci_dev *pdev);
> +#else
> +static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
> +#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */

Will this cause build failures under certain Kconfig combinations?

If CONFIG_PCI_MSI=y and CONFIG_HYPERV_IOMMU=n, arch/x86/hyperv/irqdomain.c
provides a non-static definition of hv_build_devid_type_pci(), but this
header provides a static inline definition, which leads to a redefinition
error.

Conversely, if CONFIG_PCI_MSI=n and CONFIG_HYPERV_IOMMU=y, this header
declares the function, but it is never defined because irqdomain.c is not
compiled, resulting in linker errors.
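
A generic way to avoid this class of mismatch is to key the header stub on the same switch that controls whether the definition is compiled. The sketch below shows only that pattern in a single file: `DEMO_DEFINITION_BUILT` and `demo_build_devid()` are hypothetical, and which Kconfig symbol would be the right guard for the real function is not something this sketch claims to know.

```c
#include <stdint.h>

/* Hypothetical switch standing in for "the defining .c file is built". */
#define DEMO_DEFINITION_BUILT 1

/* Header side: declaration or stub, keyed on the definition's switch. */
#if DEMO_DEFINITION_BUILT
uint64_t demo_build_devid(uint16_t rid);
#else
static inline uint64_t demo_build_devid(uint16_t rid)
{
	(void)rid;
	return 0;
}
#endif

/* Source side: compiled only when the same switch is on. */
#if DEMO_DEFINITION_BUILT
uint64_t demo_build_devid(uint16_t rid)
{
	/* Arbitrary demo encoding: a type tag in bit 32 plus the ID. */
	return (UINT64_C(1) << 32) | rid;
}
#endif
```

Because both sides test one condition, there is no configuration in which the stub and the real definition coexist, and none in which the declaration exists without a definition.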

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=8


* Re: [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR
From: Anirudh Rayabharam @ 2026-05-13  5:32 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177816865065.21765.5112039734725197893.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Thu, May 07, 2026 at 03:44:10PM +0000, Stanislav Kinsburskii wrote:
> mshv_intercept_isr() reads vp_index from the SynIC intercept message
> payload and uses it directly to index into partition->pt_vp_array without
> validating that vp_index < MSHV_MAX_VPS.
> 
> Mshv treats the Microsoft Hypervisor as trusted, so a malformed vp_index is
> not a security concern; the threat model does not include a malicious
> hypervisor. A hypervisor bug that placed an out-of-range value here would,
> however, cause an out-of-bounds read of pt_vp_array in hardirq context,
> manifesting as random memory corruption or a host crash with no clear
> signal pointing back to the hypervisor as the source.
> 
> handle_bitset_message() and handle_pair_message() already perform this
> defensive check on hypervisor-supplied vp_index values, with an explicit
> "This shouldn't happen, but just in case" comment. Add the same check to
> mshv_intercept_isr() for consistency, turning a potential silent corruption
> into a debuggable pr_debug message.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index bac890cd2b468..89207aad7cf1f 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -387,6 +387,11 @@ mshv_intercept_isr(struct hv_message *msg)
>  	 */
>  	vp_index =
>  	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> +	/* This shouldn't happen, but just in case. */
> +	if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> +		pr_debug("VP index %u out of bounds\n", vp_index);
> +		goto unlock_out;
> +	}
>  	vp = partition->pt_vp_array[vp_index];
>  	if (unlikely(!vp)) {
>  		pr_debug("failed to find VP %u\n", vp_index);
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>



* Re: [PATCH V3 06/11] mshv: Add ioctl support for MSHV-VFIO bridge device
From: sashiko-bot @ 2026-05-13  5:27 UTC
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-7-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] FD leak and potential deadlock on copy_to_user() failure after anon_inode_getfd().
- [Medium] Asymmetric memory lifecycle and inconsistent device_release handling.
- [Medium] Missing validation of unknown flags in mshv_partition_ioctl_create_device().
--

commit a0a0db24ca9d194302f458a7afa8a424cc357990
Author: Mukesh R <mrathor@linux.microsoft.com>

mshv: Add ioctl support for MSHV-VFIO bridge device

This commit adds an ioctl for creating MSHV devices for a partition.
Currently only VFIO device types are supported, doing setup for the bridge
code in mshv_vfio.c.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 02c107458be90..6ceb5f608589f 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c

[ ... ]

> @@ -1625,6 +1625,119 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)

[ ... ]

> +static int mshv_device_fop_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_device *mshv_dev = filp->private_data;
> +	struct mshv_partition *partition = mshv_dev->device_pt;
> +
> +	if (mshv_dev->device_ops->device_release) {
> +		mutex_lock(&partition->pt_mutex);
> +		hlist_del(&mshv_dev->device_ptnode);
> +		mshv_dev->device_ops->device_release(mshv_dev);
> +		mutex_unlock(&partition->pt_mutex);
> +	}

What happens if a future device type does not implement the device_release
callback? It looks like the device would remain indefinitely linked in 
partition->pt_devices while the file descriptor is closed, leading to
list corruption and a memory leak.

Since the device structure is allocated by the core module via kzalloc(),
would it be safer to handle the list removal and generic kfree()
unconditionally in this function?

> +
> +	mshv_partition_put(partition);
> +	return 0;
> +}
> +
> +static const struct file_operations mshv_device_fops = {
> +	.owner = THIS_MODULE,
> +	.unlocked_ioctl = mshv_device_fop_ioctl,
> +	.release = mshv_device_fop_release,
> +};
> +
> +static long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> +					       void __user *uarg)
> +{
> +	long rc;
> +	struct mshv_create_device devargk;
> +	struct mshv_device *mshv_dev;
> +	const struct mshv_device_ops *vfio_ops;
> +
> +	if (copy_from_user(&devargk, uarg, sizeof(devargk)))
> +		return -EFAULT;
> +
> +	/* At present, only VFIO is supported */
> +	if (devargk.type != MSHV_DEV_TYPE_VFIO)
> +		return -ENODEV;
> +
> +	if (devargk.flags & MSHV_CREATE_DEVICE_TEST)
> +		return 0;

Should we explicitly validate that no undefined flags are set here? 
If a future kernel version introduces new flags for device creation, older 
kernels will silently ignore them and return success instead of -EINVAL, 
breaking forward compatibility.
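
A small sketch of such a check, assuming a mask of all currently defined
flags. The MODEL_* values are placeholders, not the real uapi constants:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Placeholder flag values for illustration; the real uapi values may differ. */
#define MODEL_CREATE_DEVICE_TEST	(1u << 0)
#define MODEL_CREATE_DEVICE_VALID	(MODEL_CREATE_DEVICE_TEST)

/* Returns 0 if every set bit is a known flag, -EINVAL otherwise. */
static int model_check_create_flags(uint32_t flags)
{
	if (flags & ~MODEL_CREATE_DEVICE_VALID)
		return -EINVAL;
	return 0;
}
```

Running the check before the TEST short-circuit would also let userspace
probe for flag support with the test flag itself.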

> +
> +	/* This is freed later by mshv_vfio_release_device() */
> +	mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
> +	if (mshv_dev == NULL)
> +		return -ENOMEM;
> +
> +	vfio_ops = &mshv_vfio_device_ops;
> +	mshv_dev->device_ops = vfio_ops;
> +	mshv_dev->device_pt = partition;
> +
> +	rc = vfio_ops->device_create(mshv_dev);
> +	if (rc < 0) {
> +		kfree(mshv_dev);
> +		return rc;
> +	}
> +
> +	hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
> +
> +	mshv_partition_get(partition);
> +	rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
> +			      mshv_dev, O_RDWR | O_CLOEXEC);
> +	if (rc < 0)
> +		goto undo_out;
> +
> +	devargk.fd = rc;
> +	if (copy_to_user(uarg, &devargk, sizeof(devargk)))
> +		return -EFAULT;    /* cleanup in mshv_device_fop_release() */

Does this sequence leak the file descriptor if copy_to_user() fails?

The anon_inode_getfd() call installs the file descriptor into the process's 
fd table. If the subsequent copy_to_user() fails, the function returns 
-EFAULT, but the user space application never receives the fd number to 
explicitly close it, resulting in a leaked file descriptor.

Additionally, can this trigger a deadlock?

The ioctl is called with partition->pt_mutex held. If a malicious user 
provides a buffer backed by userfaultfd, the copy_to_user() call will 
block, holding the mutex. Concurrently, another thread could find the 
new fd in /proc/self/fd and call close(), which invokes 
mshv_device_fop_release(). That release function unconditionally attempts 
to acquire partition->pt_mutex, resulting in a deadlock.

Would it be better to separate file allocation and installation by using 
get_unused_fd_flags(), anon_inode_getfile(), and only calling fd_install() 
after copy_to_user() succeeds?
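
The ordering can be modelled in userspace as follows. This is only a sketch:
the model_* helpers are hypothetical stand-ins for get_unused_fd_flags(),
put_unused_fd() and fd_install(), and the copy_ok parameter stands in for the
result of copy_to_user(). A real error path would presumably also need to
fput() the file obtained from anon_inode_getfile().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_FDS 8

/* Hypothetical fd table: a slot is free, reserved, or installed. */
enum model_slot { SLOT_FREE, SLOT_RESERVED, SLOT_INSTALLED };

struct model_fdtable {
	enum model_slot slot[MODEL_MAX_FDS];
};

/* Reserve a number without making it visible to the process. */
static int model_reserve_fd(struct model_fdtable *t)
{
	for (int i = 0; i < MODEL_MAX_FDS; i++) {
		if (t->slot[i] == SLOT_FREE) {
			t->slot[i] = SLOT_RESERVED;
			return i;
		}
	}
	return -EMFILE;
}

/* Give a reserved number back. */
static void model_unreserve_fd(struct model_fdtable *t, int fd)
{
	assert(t->slot[fd] == SLOT_RESERVED);
	t->slot[fd] = SLOT_FREE;
}

/* Publish the number; after this the process can use or close it. */
static void model_install_fd(struct model_fdtable *t, int fd)
{
	assert(t->slot[fd] == SLOT_RESERVED);
	t->slot[fd] = SLOT_INSTALLED;
}

/*
 * Create path: the copy to the caller happens while the number is only
 * reserved, so a failed copy can be fully undone.
 */
static int model_create_device(struct model_fdtable *t, bool copy_ok,
			       int *out_fd)
{
	int fd = model_reserve_fd(t);

	if (fd < 0)
		return fd;
	if (!copy_ok) {
		model_unreserve_fd(t, fd);
		return -EFAULT;
	}
	*out_fd = fd;
	model_install_fd(t, fd);
	return 0;
}
```

Because nothing is installed until the copy has succeeded, no other thread
can close the descriptor while the copy is blocked, which also sidesteps the
release-under-mutex scenario described above.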

> +
> +	return 0;
> +
> +undo_out:
> +	hlist_del(&mshv_dev->device_ptnode);
> +	vfio_ops->device_release(mshv_dev);    /* will kfree(mshv_dev) */

If a future device type does not implement the device_release callback,
could this result in a NULL pointer dereference?

> +	mshv_partition_put(partition);
> +	return rc;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=6

^ permalink raw reply

* Re: [PATCH V3 05/11] mshv: Implement mshv bridge device for VFIO
From: sashiko-bot @ 2026-05-13  5:09 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-6-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Unprotected access to `pt_devices` list and missing lockdep assertion.
- [Low] The `mshv_vfio_device_ops` structure is missing the `const` qualifier.
--

commit 96ab470ecc51b095ed8b1538e4abdf37d030b039
Author: Mukesh R <mrathor@linux.microsoft.com>
Subject: mshv: Implement mshv bridge device for VFIO

This commit introduces a new VFIO-MSHV bridge pseudo device file, adapted
from KVM's VFIO bridge. It provides the necessary device operations to link
VFIO files with an MSHV partition.

> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
> new file mode 100644
> index 0000000000000..00a97920e25ba
> --- /dev/null
> +++ b/drivers/hv/mshv_vfio.c
[ ... ]
> +static long mshv_vfio_create_device(struct mshv_device *mshvdev)
> +{
> +	struct mshv_device *tmp;
> +	struct mshv_vfio *mshv_vfio;
> +
> +	/* Only one VFIO "device" per VM */
> +	hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
> +			     device_ptnode)
> +		if (tmp->device_ops == &mshv_vfio_device_ops)
> +			return -EBUSY;

Are we missing a lockdep assertion here to ensure the partition lock is held?

Since this iterates over the pt_devices list to enforce a singleton device,
could concurrent ioctl calls bypass this check and corrupt the hlist if the
caller doesn't acquire the lock?

[ ... ]

> +struct mshv_device_ops mshv_vfio_device_ops = {
> +	.device_name = "mshv-vfio",
> +	.device_create = mshv_vfio_create_device,
> +	.device_release = mshv_vfio_release_device,
> +	.device_set_attr = mshv_vfio_set_attr,
> +	.device_has_attr = mshv_vfio_has_attr,
> +};

Should mshv_vfio_device_ops be declared as const?

The structure consists entirely of function pointers and is never modified
after initialization. Making it const would place it in read-only memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=5

^ permalink raw reply

* Re: [PATCH net-next v8] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-05-13  4:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, dipayanroy,
	shirazsaleem, yury.norov, kees, linux-hyperv, netdev,
	linux-kernel, linux-rdma
In-Reply-To: <20260508220114.3342627-1-kuba@kernel.org>

On Fri, May 08, 2026 at 03:01:14PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Expose hardware diagnostic info via debugfs
> 
> Add debugfs entries to expose hardware configuration and diagnostic
> information that aids in debugging driver initialization and runtime
> operations without adding noise to dmesg.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> [ ... ]
> > @@ -2231,13 +2239,18 @@ int mana_gd_resume(struct pci_dev *pdev)
> >  
> >  	err = mana_probe(&gc->mana, true);
> >  	if (err)
> > -		return err;
> > +		goto cleanup_gd;
> >  
> >  	err = mana_rdma_probe(&gc->mana_ib);
> >  	if (err)
> > -		return err;
> > +		goto cleanup_mana;
> >  
> >  	return 0;
> > +cleanup_mana:
> > +	mana_remove(&gc->mana, true);
> > +cleanup_gd:
> > +	mana_gd_cleanup_device(pdev);
> > +	return err;
> >  }
> 
> Since mana_remove() called with true for suspending skips
> unregister_netdevice(), the network devices remain registered in the system.
> 
> If mana_gd_cleanup_device() is then called, which destroys the hardware
> channel and clears gc->hwc.driver_data, could this lead to a NULL pointer
> dereference if userspace later interacts with the still-registered devices?

Sorry for the delayed response.

This is a valid concern. I will update it to mana_remove(&gc->mana, false) in
the next version.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [ ... ]
> > @@ -3161,6 +3169,23 @@ static int mana_init_port(struct net_device *ndev)
> >  	eth_hw_addr_set(ndev, apc->mac_addr);
> >  	sprintf(vport, "vport%d", port_idx);
> >  	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
> > +
> > +	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
> > +			   &apc->port_handle);
> 
> When the device suspends, mana_gd_suspend() calls mana_gd_cleanup_device(),
> which recursively removes the entire debugfs directory tree.
> 
> During resume, the parent debugfs directory is recreated, but mana_probe()
> skips calling mana_init_port() for existing ports.
> 
> Does this mean the per-vPort debugfs entries are permanently lost after a
> suspend and resume cycle?
> 
This is a false positive.
The per-vPort debugfs entries are recreated during resume. The flow is:

Suspend: mana_gd_cleanup_device() removes the entire debugfs tree.

Resume: mana_gd_setup() recreates the top-level mana_pci_debugfs
directory. Then mana_probe(gd, true) takes the resuming=true path,
which calls mana_attach() for each port. mana_attach() calls
mana_init_port(), which recreates the vportN directory and all
per-vPort debugfs files (port_handle, max_sq, max_rq, steer_*,
current_speed).

So the per-vPort entries are not permanently lost; they survive the
suspend/resume cycle.
> [ ... ]
> > @@ -3800,6 +3830,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> >  
> >  	mana_gd_deregister_device(gd);
> >  
> > +	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
> > +	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
> > +
> >  	if (suspending)
> >  		return;
> 
> Earlier in mana_remove(), free_netdev() is called, which frees the port
> context structure (apc) allocated via netdev_priv(ndev).
> 
> Since the per-vPort debugfs files created in mana_init_port() directly
> reference memory within apc, and apc->mana_port_debugfs is not explicitly
> removed here, does this leave the debugfs files accessible but pointing to
> freed memory until the parent directory is finally removed in
> mana_gd_cleanup_device()?

This is a false positive.
The per-vPort debugfs files are removed before free_netdev() frees apc.

In mana_remove(gd, false), the sequence for each port is:

1. mana_detach(ndev, false) -> mana_cleanup_port_context(apc)
   -> debugfs_remove(apc->mana_port_debugfs).
   This removes the entire vportN directory and all its child files, since
   debugfs_remove() is recursive (simple_recursive_removal()).
2. free_netdev(ndev), which frees apc.

Because step 1 removes all debugfs files referencing apc fields before
step 2 frees the memory, there is no window where the files point to
freed memory.

^ permalink raw reply

* Re: [PATCH V3 03/11] mshv: Provide a way to get partition ID if running in a VMM process
From: sashiko-bot @ 2026-05-13  3:47 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260512020259.1678627-4-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Tying VM partition IDs to process TGIDs (`current->tgid`) is an architectural flaw causing isolation breakage and security risks.
- [High] `mshv_current_partid()` returns a raw partition ID without taking a reference, creating a TOCTOU vulnerability.
- [High] Pre-existing Use-After-Free in `destroy_partition()` due to RCU-001 violation during VP teardown.
--

commit 528e38745b2ae16c0fd8ad4876d893d1b1fea027
Author: Mukesh R <mrathor@linux.microsoft.com>

mshv: Provide a way to get partition ID if running in a VMM process

This commit introduces mshv_current_partid() to allow retrieving the
partition id for the current vmm process. It adds a pt_vmm_tgid field
to the mshv_partition struct and sets it during partition creation.
The partition id is found by iterating the global hash table.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..02c107458be90 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1908,6 +1908,27 @@ mshv_partition_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> +/* Given a process tgid, return partition id if it is a VMM process */
> +u64 mshv_current_partid(void)
> +{
> +	struct mshv_partition *pt;
> +	int i;
> +	u64 ret_ptid = HV_PARTITION_ID_INVALID;
> +
> +	rcu_read_lock();
> +
> +	hash_for_each_rcu(mshv_root.pt_htable, i, pt, pt_hnode) {
> +		if (pt->pt_vmm_tgid == current->tgid) {

Is tying the partition id to the process tgid safe?

Since a single vmm process can create multiple partitions, this loop returns
the first partition matching the tgid. This could result in pci passthrough
operations being misrouted to the wrong partition if a process manages
multiple virtual machines.

Also, since the partition's lifecycle is tied to an anonymous file descriptor,
passing the fd to another process and exiting would break
mshv_current_partid() for the new process.

Additionally, if the operating system recycles the old tgid, an unrelated
process might gain access to the partition id.

Should explicit file descriptor tracking be used instead of global process
ids?
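
To make the first-match concern concrete, here is a userspace sketch with a
flat array standing in for the hash table. The struct and helper names are
hypothetical; only the "first tgid match wins" behaviour mirrors the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PARTID_INVALID UINT64_MAX	/* stand-in for HV_PARTITION_ID_INVALID */

/* Hypothetical, flattened stand-in for the partition hash table. */
struct model_partition {
	uint64_t id;
	int vmm_tgid;
};

/* Mirrors the lookup in the patch: the first tgid match wins. */
static uint64_t model_partid_by_tgid(const struct model_partition *tbl,
				     size_t n, int tgid)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].vmm_tgid == tgid)
			return tbl[i].id;
	return MODEL_PARTID_INVALID;
}

/* Counts partitions owned by a tgid; more than one means ambiguity. */
static size_t model_count_by_tgid(const struct model_partition *tbl,
				  size_t n, int tgid)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (tbl[i].vmm_tgid == tgid)
			count++;
	return count;
}
```

With two partitions created by the same process, the lookup can only ever
return one of them, and which one depends on table order rather than on the
caller's intent.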

> +			ret_ptid = pt->pt_id;
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	return ret_ptid;

Does returning a raw partition id without taking a reference create a race
condition?

After rcu_read_unlock() is called and the pt_id is returned, a concurrent
thread in the vmm could close the partition's file descriptor, causing the
partition to be deleted in the hypervisor.

If the hypervisor recycles the pt_id for a newly created partition belonging
to a completely different process, the pending iommu attachment or pci
passthrough hypercall will use the stale pt_id.

Can this allow the vmm to inadvertently attach its devices to an unrelated
partition?

> +}

This isn't a regression in this patch, but looking at the partition lifecycle
code, does destroy_partition() have a use-after-free?

destroy_partition() frees the virtual processor structures (kfree(vp))
before calling remove_partition(partition), which removes the partition
from the rcu-protected global hash table (pt_htable) and calls
synchronize_rcu().

Because the partition remains visible to rcu readers during the teardown loop,
can an interrupt handler (e.g., in mshv_synic.c) call mshv_partition_find(),
successfully retrieve the partition, and read
vp = partition->pt_vp_array[vp_index]?

If this happens concurrently with or immediately after kfree(vp), the
interrupt handler would access freed memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=3

^ permalink raw reply

* Re: [PATCH V3 02/11] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: sashiko-bot @ 2026-05-13  3:26 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv, linux-pci
In-Reply-To: <20260512020259.1678627-3-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Resource leak in `hv_msi_domain_free()` when freeing multiple IRQs.
- [High] Uninitialized memory write to hardware MSI registers in `hv_irq_compose_msi_msg()` error paths.
- [High] Invalid hardware state read in `hv_build_devid_type_pci()` for Type 0 PCI-X leaf devices.
--

commit 8c84576bf69ec6b4d58407408095bf71a9940bf2
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Cosmetic changes in irqdomain.c for readability

Make cosmetic changes such as renaming variables for consistency, fixing
checkpatch warnings, updating indentation, and replacing spaces with tabs.

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
[ ... ]
> @@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>  	return 0;
>  }
>  
> -static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
> +static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>  {
[ ... ]
> -	if (data.bridge) {
> -		int pos;
> +	if (data.bridge == NULL)
> +		goto out;
>  
> -		/*
> -		 * Microsoft Hypervisor requires a bus range when the bridge is
> -		 * running in PCI-X mode.
[ ... ]
> -		pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> -		if (pos) {
> -			u16 status;
> +	pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> +	if (pos) {
> +		u16 status;

If the leaf device is on the root bus, data.bridge will point to the leaf
device itself (a Type 0 device) after pci_for_each_dma_alias() completes.
Does this code incorrectly assume data.bridge is always a Type 1 bridge?

If it evaluates a Type 0 device here, reading PCI_X_BRIDGE_SSTATUS aliases
the Type 0 Command register. This could cause the code to read standard
Type 1 PCI_SECONDARY_BUS and PCI_SUBORDINATE_BUS offsets (which fall into
the middle of BAR2 for a Type 0 device), sending arbitrary BAR values as
shadow bus ranges to the hypervisor.
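
A sketch of the kind of guard this suggests, operating on a raw copy of
configuration space. The offsets are the standard PCI header ones; the helper
name is hypothetical, and in the kernel a check along the lines of
pci_is_bridge() might serve the same purpose:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI configuration space offsets. */
#define CFG_HEADER_TYPE		0x0e
#define CFG_HEADER_TYPE_MASK	0x7f	/* bit 7 is the multi-function flag */
#define CFG_HEADER_TYPE_BRIDGE	0x01
#define CFG_SECONDARY_BUS	0x19	/* inside BAR2 (0x18) on a Type 0 header */
#define CFG_SUBORDINATE_BUS	0x1a	/* inside BAR2 (0x18) on a Type 0 header */

/* Hypothetical helper: fills the bus range only for a Type 1 header. */
static bool model_bridge_bus_range(const uint8_t cfg[64], uint8_t *secondary,
				   uint8_t *subordinate)
{
	assert(cfg && secondary && subordinate);
	if ((cfg[CFG_HEADER_TYPE] & CFG_HEADER_TYPE_MASK) !=
	    CFG_HEADER_TYPE_BRIDGE)
		return false;

	*secondary = cfg[CFG_SECONDARY_BUS];
	*subordinate = cfg[CFG_SUBORDINATE_BUS];
	return true;
}
```

For a Type 0 device the helper refuses to interpret bytes 0x19 and 0x1a, so
BAR contents are never reported as a bus range.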

[ ... ]
> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	}
>  
>  	stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
> -	if (!stored_entry) {
> -		pr_debug("%s: failed to allocate chip data\n", __func__);
> +	if (!stored_entry)
>  		return;
> -	}

If kzalloc_obj() or hv_map_msi_interrupt() fails, this returns early
without populating the struct msi_msg *msg parameter.

Since callers (like msi_domain_activate() or irq_chip_compose_msi_msg())
allocate msg on the stack and cannot check for errors due to the void
return type, does this result in uninitialized stack memory being blindly
written to the hardware MSI registers?
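
One possible mitigation, sketched in userspace: zero the message on entry so
that every early return leaves a defined value. The struct is a hypothetical
stand-in for struct msi_msg, map_ok stands in for the allocation and the
hypercall succeeding, and whether an all-zero message is acceptable to the
hardware is a separate question:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct msi_msg. */
struct model_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/*
 * Void "compose" callback: the caller cannot see failures, so the
 * message is zeroed before any early return.
 */
static void model_compose_msi_msg(bool map_ok, struct model_msi_msg *msg)
{
	assert(msg != NULL);
	memset(msg, 0, sizeof(*msg));
	if (!map_ok)
		return;		/* msg is all zeroes, not stack garbage */

	msg->address_lo = 0xfee00000u;	/* illustrative values only */
	msg->data = 0x41;
}
```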

[ ... ]
> @@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
>  		return ret;
>  
>  	for (int i = 0; i < nr_irqs; ++i) {
> -		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
> -				    handle_edge_irq, NULL, "edge");
> +		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
> +				    NULL, handle_edge_irq, NULL, "edge");
>  	}
> +
>  	return 0;
>  }
>  
> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> +			       unsigned int nr_irqs)
>  {
>  	for (int i = 0; i < nr_irqs; ++i) {
>  		struct irq_data *irqd = irq_domain_get_irq_data(d, virq);

Does this loop leak chip_data and leave IRQs mapped in the hypervisor for
the remaining nr_irqs - 1 IRQs?

It appears to call irq_domain_get_irq_data() using the base virq on every
iteration instead of virq + i, causing hv_teardown_msi_irq() to be
repeatedly invoked on the first IRQ only.
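
The difference between the two indexing choices can be shown with a small
userspace model. The model_* names are hypothetical; a bool array stands in
for the per-IRQ mapping state:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_IRQS 16

/* Hypothetical per-IRQ state: true while the IRQ is still mapped. */
static bool model_mapped[MODEL_NR_IRQS];

static void model_teardown(unsigned int virq)
{
	assert(virq < MODEL_NR_IRQS);
	model_mapped[virq] = false;
}

/* The pattern questioned above: every iteration uses the base virq. */
static void model_domain_free_base_only(unsigned int virq,
					unsigned int nr_irqs)
{
	for (unsigned int i = 0; i < nr_irqs; i++)
		model_teardown(virq);
}

/* Free loop that indexes each IRQ in the range, i.e. virq + i. */
static void model_domain_free(unsigned int virq, unsigned int nr_irqs)
{
	for (unsigned int i = 0; i < nr_irqs; i++)
		model_teardown(virq + i);
}

/* Returns how many IRQs in [virq, virq + nr_irqs) are still mapped. */
static unsigned int model_count_mapped(unsigned int virq,
				       unsigned int nr_irqs)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i < nr_irqs; i++)
		if (model_mapped[virq + i])
			count++;
	return count;
}
```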

> @@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
>  
>  		hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
>  	}
> +
>  	irq_domain_free_irqs_top(d, virq, nr_irqs);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=2

^ permalink raw reply

* RE: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Michael Kelley @ 2026-05-13  3:15 UTC (permalink / raw)
  To: Jacob Pan, Mukesh R
  Cc: hpa@zytor.com, robin.murphy@arm.com, robh@kernel.org,
	wei.liu@kernel.org, Michael Kelley, muislam@microsoft.com,
	namjain@linux.microsoft.com, magnuskulke@linux.microsoft.com,
	anbelski@linux.microsoft.com, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	longli@microsoft.com, tglx@kernel.org, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	joro@8bytes.org, will@kernel.org, lpieralisi@kernel.org,
	kwilczynski@kernel.org, bhelgaas@google.com, arnd@arndb.de
In-Reply-To: <20260512164623.0000273f@linux.microsoft.com>

From: Jacob Pan <jacob.pan@linux.microsoft.com> Sent: Tuesday, May 12, 2026 4:46 PM
> 
> Hi Mukesh,
> 
> On Mon, 11 May 2026 19:02:49 -0700
> Mukesh R <mrathor@linux.microsoft.com> wrote:
> 
> > This file actually implements irq remapping, so rename to more
> > appropriate hyperv-irq.c. A new file to implement hyperv iommu will
> > be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
> > but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
> > CONFIG_IRQ_REMAP.
> >
> > Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> > ---
> >  MAINTAINERS                                    | 2 +-
> >  drivers/iommu/Makefile                         | 2 +-
> >  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
>
> Given that we have multiple Hyper-V IOMMU-related files — this renamed
> hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
> series) and the recently posted guest pvIOMMU driver — should we create
> a drivers/iommu/hyperv/ directory to consolidate them?

Patch 1/4 in the guest pvIOMMU driver [1] that was recently posted by
Yu Zhang does as you suggest.

Michael

[1] https://lore.kernel.org/linux-hyperv/20260511162408.1180069-1-zhangyu1@linux.microsoft.com/

> 
> >  drivers/iommu/irq_remapping.c                  | 2 +-
> >  4 files changed, 6 insertions(+), 6 deletions(-)
> >  rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index d1cc0e12fe1f..f803a6a38fee 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
> >  F:	drivers/hid/hid-hyperv.c
> >  F:	drivers/hv/
> >  F:	drivers/input/serio/hyperv-keyboard.c
> > -F:	drivers/iommu/hyperv-iommu.c
> > +F:	drivers/iommu/hyperv-irq.c
> >  F:	drivers/net/ethernet/microsoft/
> >  F:	drivers/net/hyperv/
> >  F:	drivers/pci/controller/pci-hyperv-intf.c
> > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > index 0275821f4ef9..335ea77cced6 100644
> > --- a/drivers/iommu/Makefile
> > +++ b/drivers/iommu/Makefile
> > @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
> >  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> >  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> >  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> > -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> > +obj-$(CONFIG_HYPERV) += hyperv-irq.o
> >  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> >  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> >  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> > diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
> > similarity index 99%
> > rename from drivers/iommu/hyperv-iommu.c
> > rename to drivers/iommu/hyperv-irq.c
> > index 479103261ae6..d11076f906fb 100644
> > --- a/drivers/iommu/hyperv-iommu.c
> > +++ b/drivers/iommu/hyperv-irq.c
> > @@ -8,6 +8,8 @@
> >   * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
> >   */
> >
> > +#ifdef CONFIG_IRQ_REMAP
> > +
> >  #include <linux/types.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/irq.h>
> > @@ -24,8 +26,6 @@
> >
> >  #include "irq_remapping.h"
> >
> > -#ifdef CONFIG_IRQ_REMAP
> > -
> >  /*
> >   * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
> >   * Redirection Table. Hyper-V exposes one single IO-APIC and so define
> > @@ -331,4 +331,4 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
> >  	.free = hyperv_root_irq_remapping_free,
> >  };
> >
> > -#endif
> > +#endif  /* CONFIG_IRQ_REMAP */
> > diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> > index c2443659812a..41bf65e4ea88 100644
> > --- a/drivers/iommu/irq_remapping.c
> > +++ b/drivers/iommu/irq_remapping.c
> > @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
> >  	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
> >  		 amd_iommu_irq_ops.prepare() == 0)
> >  		remap_ops = &amd_iommu_irq_ops;
> > -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
> > +	else if (IS_ENABLED(CONFIG_HYPERV) &&
> >  		 hyperv_irq_remap_ops.prepare() == 0)
> >  		remap_ops = &hyperv_irq_remap_ops;
> >  	else
> 


^ permalink raw reply

* Re: [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: sashiko-bot @ 2026-05-13  3:12 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-hyperv
In-Reply-To: <2-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] TOCTOU heap buffer overflow due to unvalidated `num_sge` from user-shared memory.
- [High] Memory leak of the kernel queue structure (`srq->rq.kwq`) on user-backed SRQ modifications.
- [High] Locking imbalance and freeing memory while locked.
- [High] Inconsistent state and Use-After-Free on error path.
- [Low] Uninitialized variable compiler warning for `offset_addr`.
--

commit dac7699d9e58c179b3a8859ac64da79e5fd21f46
Author: Jason Gunthorpe <jgg@ziepe.ca>
Subject: IB/rdmavt: Don't abuse udata and ib_respond_udata()

This commit modifies rvt_modify_srq to use copy_to_user directly instead
of ib_copy_to_udata for returning the offset address.

> diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
> index fe125bf85b272..d022aa56c5bfd 100644
> --- a/drivers/infiniband/sw/rdmavt/srq.c
> +++ b/drivers/infiniband/sw/rdmavt/srq.c
> @@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>  	struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
>  	struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
>  	struct rvt_rq tmp_rq = {};
> +	__u64 offset_addr;
>  	int ret = 0;
>  
>  	if (attr_mask & IB_SRQ_MAX_WR) {

Could hoisting offset_addr lead to a compiler warning for being potentially
uninitialized?

The compiler might assume udata->inlen could change concurrently across the
spin_lock_irq() and spin_unlock_irq() calls. If the first check fails
but the second one passes:

if (udata && udata->inlen >= sizeof(__u64))

Does this allow offset_addr to be used without initialization?

[ ... ]

> @@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>  			return -ENOMEM;
>  		/* Check that we can write the offset to mmap. */
>  		if (udata && udata->inlen >= sizeof(__u64)) {
> -			__u64 offset_addr;
>  			__u64 offset = 0;
>  
>  			ret = ib_copy_from_udata(&offset_addr, udata,
>  						 sizeof(offset_addr));
>  			if (ret)
>  				goto bail_free;
> -			udata->outbuf = (void __user *)
> -					(unsigned long)offset_addr;
> -			ret = ib_copy_to_udata(udata, &offset,
> -					       sizeof(offset));
> -			if (ret)
> +			if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
> +					 sizeof(offset))) {
> +				ret = -EFAULT;
>  				goto bail_free;
> +			}
>  		}
>  
>  		spin_lock_irq(&srq->rq.kwq->c_lock);

Can this code leak the kernel queue structure?

When resizing a user-backed SRQ, the code fetches elements from the old
user-mapped queue. In the udata path, okwq is left as NULL:

if (udata) {
        owq = srq->rq.wq;
        head = RDMA_READ_UAPI_ATOMIC(owq->head);
        tail = RDMA_READ_UAPI_ATOMIC(owq->tail);
} else {
        okwq = srq->rq.kwq;
        ...

When it later calls kvfree(okwq), it seems nothing is freed, which could
permanently leak the old kwq structure.

Is there a lock imbalance here when the queue is replaced?

The original queue is locked with:

spin_lock_irq(&srq->rq.kwq->c_lock);

But then the queue pointer is replaced:

srq->rq.kwq = tmp_rq.kwq;

When the lock is released:

spin_unlock_irq(&srq->rq.kwq->c_lock);

Does this unlock the new queue instead of the old one, and then free the old
queue via kvfree(okwq) while its lock is still held?
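
A userspace sketch of pairing the unlock with the object that was locked.
The types are hypothetical and a flag stands in for the spinlock; it only
illustrates the pairing, not whether a lock embedded in a replaceable object
is a sound design:

```c
#include <assert.h>

/* Hypothetical queue; a flag stands in for the embedded spinlock. */
struct model_kwq {
	int locked;
};

struct model_rq {
	struct model_kwq *kwq;
};

static void model_lock(struct model_kwq *q)
{
	assert(!q->locked);
	q->locked = 1;
}

static void model_unlock(struct model_kwq *q)
{
	assert(q->locked);
	q->locked = 0;
}

/*
 * Swap in new_kwq. The old pointer is saved before the swap so the
 * unlock targets the same object that was locked. Returns the old
 * queue, unlocked, for the caller to free.
 */
static struct model_kwq *model_replace_kwq(struct model_rq *rq,
					   struct model_kwq *new_kwq)
{
	struct model_kwq *old = rq->kwq;

	model_lock(old);
	rq->kwq = new_kwq;
	model_unlock(old);
	return old;
}
```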

Can the loop migrating WQEs overflow the destination buffer?

The number of scatter-gather elements is read directly from user-shared
memory:

wqe = rvt_get_rwqe_ptr(&srq->rq, tail);
p->wr_id = wqe->wr_id;
p->num_sge = wqe->num_sge;
for (i = 0; i < wqe->num_sge; i++)
        p->sg_list[i] = wqe->sg_list[i];

If a userspace program concurrently modifies num_sge to a large value, could
this loop exceed the bounds of tmp_rq.kwq->curr_wq, since the destination
size was calculated based on srq->rq.max_sge?

[ ... ]

> @@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>  			 * See rvt_mmap() for details.
>  			 */
>  			if (udata && udata->inlen >= sizeof(__u64)) {
> -				ret = ib_copy_to_udata(udata, &ip->offset,
> -						       sizeof(ip->offset));
> -				if (ret)
> -					return ret;
> +				if (copy_to_user(u64_to_user_ptr(offset_addr),
> +						 &ip->offset,
> +						 sizeof(ip->offset)))
> +					return -EFAULT;
>  			}
>  

Does returning -EFAULT here leave the kernel and userspace in an inconsistent
state?

At this point, the kernel has already applied the SRQ size change, replaced
the queue, and permanently freed the old memory:

vfree(owq);
kvfree(okwq);

If copy_to_user fails and returns -EFAULT, userspace might interpret this as
a failed resize and continue using its existing memory mapping of the old
queue. Could this lead to a use-after-free since the pages have already
been freed?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com?part=2

^ permalink raw reply

* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: Stanislav Kinsburskii @ 2026-05-13  2:42 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
	longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
	linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <f073a8d7-5761-4f7b-a5e5-c6aeae5fdc72@kernel.org>

On Tue, May 12, 2026 at 09:18:11PM +0200, David Hildenbrand (Arm) wrote:
> On 5/12/26 18:18, Stanislav Kinsburskii wrote:
> > On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >>> +	for (; addr < end; addr += PAGE_SIZE) {
> >>> +		vm_fault_t ret;
> >>> +
> >>> +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> >>> +
> >>> +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
> >>> +			/*
> >>> +			 * The mmap lock has been dropped by the fault handler.
> >>> +			 * Record the failing address and signal lock-drop to
> >>> +			 * the caller.
> >>> +			 */
> >>> +			*hmm_vma_walk->locked = 0;
> >>> +			hmm_vma_walk->last = addr;
> >>> +			return -EAGAIN;
> >>
> >>
> >> Okay, so we'll return straight from hmm_vma_fault() to
> >> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
> >>
> >> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
> >> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
> >> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
> >> requires the vma in there.
> >>
> > 
> > It looks like a caller can provide a post_vma callback in mm_walk_ops. I
> > missed that case here. This callback cannot be supported by this change.
> > I will update the patch.
> > 
> >>
> >> Note: am I wrong, or is hmm_vma_fault() really always called with
> >> required_fault=true?
> >>
> > 
> > No, hmm_pte_need_fault can return false.
> 
> That's not what I mean. Looks like all paths leading to hmm_vma_fault() have
> required_fault = true;
> 
> IOW, there is always a "if (required_fault)" before it one way or the other.
> 
> Ah, and there even is a "WARN_ON_ONCE(!required_fault)" in the function. What an
> odd thing to do :)
> 
> > 
> >>> +		}
> >>> +
> >>> +		if (ret & VM_FAULT_ERROR)
> >>>  			return -EFAULT;
> >>> +	}
> >>>  	return -EBUSY;
> >>>  }
> >>>  
> >>> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> >>>  	if (required_fault) {
> >>>  		int ret;
> >>>  
> >>> +		/*
> >>> +		 * Faulting hugetlb pages on the unlockable path is not
> >>> +		 * supported. The walk framework holds hugetlb_vma_lock_read
> >>> +		 * which must be dropped before handle_mm_fault, but if the
> >>> +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
> >>> +		 * be freed and the walk framework's unconditional unlock
> >>> +		 * becomes a use-after-free.
> >>> +		 */
> >>> +		if (hmm_vma_walk->locked)
> >>> +			return -EFAULT;
> >>
> >> Just because it's unlockable doesn't mean that you must unlock. Can't this be
> >> kept working as is, just simulating here as if it would not be unlockable?
> >>
> > 
> > I’m not sure how to implement this. The walk_page_range code expects the
> > hugetlb VMA to still be read-locked when we return from
> > hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
> > be gone?
> > 
> > I added a note in the docs. Whoever tackles this will likely need to
> > either rework `walk_page_range` to handle the case where the VMA is
> > gone, or use a different approach.
> > 
> > Do you have any other suggestions on how to implement it?
> 
> You just want hmm_vma_fault() to not set
> "FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE".
> 
> The hacky way could be:
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5955f2f0c83d..83dba990e10a 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -564,6 +564,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>         required_fault =
>                 hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
>         if (required_fault) {
> +               int *saved_locked = hmm_vma_walk->locked;
>                 int ret;
> 
>                 spin_unlock(ptl);
> @@ -576,7 +577,9 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>                  * use here of either pte or ptl after dropping the vma
>                  * lock.
>                  */
> +               hmm_vma_walk->locked = NULL;
>                 ret = hmm_vma_fault(addr, end, required_fault, walk);
> +               hmm_vma_walk->locked = saved_locked;
>                 hugetlb_vma_lock_read(vma);
>                 return ret;
>         }
> 

I see. AFAIU the outcome would be the same.

> But really, I think we should just try to get uffd support working properly, not
> excluding hugetlb.
> 
> GUP achieves it properly by performing the fault handling outside of page table
> walking context ... essentially what I described in my first comment above:
> return the information to the caller and let it just trigger the fault.
> 
> The issue here is that we trigger a fault out of walk_hugetlb_range() where we
> still hold locks, resulting in this questionable hugetlb_vma_unlock_read +
> hugetlb_vma_lock_read pattern.
> 

Fair enough.

> The fault should just be triggered from a place where we don't have to play with
> hugetlb vma locks or be afraid that dropping the mmap lock causes other problems.
> 

I reworked this part. Please take a look at v2.

Thanks,
Stanislav

> 
> -- 
> Cheers,
> 
> David

^ permalink raw reply

* Re: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Mukesh R @ 2026-05-13  1:31 UTC (permalink / raw)
  To: Jacob Pan
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260512164623.0000273f@linux.microsoft.com>

On 5/12/26 16:46, Jacob Pan wrote:
> Hi Mukesh,
> 
> On Mon, 11 May 2026 19:02:49 -0700
> Mukesh R <mrathor@linux.microsoft.com> wrote:
> 
>> This file actually implements irq remapping, so rename to more
>> appropriate hyperv-irq.c. A new file to implement hyperv iommu will
>> be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
>> but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
>> CONFIG_IRQ_REMAP.
>>
>> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
>> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
>> ---
>>   MAINTAINERS                                    | 2 +-
>>   drivers/iommu/Makefile                         | 2 +-
>>   drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
> Given that we have multiple Hyper-V IOMMU-related files -- this renamed
> hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
> series) and the recently posted guest pvIOMMU driver -- should we create
> a drivers/iommu/hyperv/ directory to consolidate them?

This came up briefly during a sync-up with arm64 devs: whether to split
the file into arch-specific and common parts. We decided to wait till arm is
working so we can tell how intrusive the #ifdefs are. We can decide as
part of the arm port patches. I am ok also if you want to do it as part
of the pv-iommu patches as follow up once this is merged.

Thanks,
-Mukesh

>>   drivers/iommu/irq_remapping.c                  | 2 +-
>>   4 files changed, 6 insertions(+), 6 deletions(-)
>>   rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d1cc0e12fe1f..f803a6a38fee 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
>>   F:	drivers/hid/hid-hyperv.c
>>   F:	drivers/hv/
>>   F:	drivers/input/serio/hyperv-keyboard.c
>> -F:	drivers/iommu/hyperv-iommu.c
>> +F:	drivers/iommu/hyperv-irq.c
>>   F:	drivers/net/ethernet/microsoft/
>>   F:	drivers/net/hyperv/
>>   F:	drivers/pci/controller/pci-hyperv-intf.c
>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> index 0275821f4ef9..335ea77cced6 100644
>> --- a/drivers/iommu/Makefile
>> +++ b/drivers/iommu/Makefile
>> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>>   obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>>   obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>>   obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>> +obj-$(CONFIG_HYPERV) += hyperv-irq.o
>>   obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>>   obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>>   obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
>> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
>> similarity index 99%
>> rename from drivers/iommu/hyperv-iommu.c
>> rename to drivers/iommu/hyperv-irq.c
>> index 479103261ae6..d11076f906fb 100644
>> --- a/drivers/iommu/hyperv-iommu.c
>> +++ b/drivers/iommu/hyperv-irq.c
>> @@ -8,6 +8,8 @@
>>    * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
>>    */
>>   
>> +#ifdef CONFIG_IRQ_REMAP
>> +
>>   #include <linux/types.h>
>>   #include <linux/interrupt.h>
>>   #include <linux/irq.h>
>> @@ -24,8 +26,6 @@
>>   
>>   #include "irq_remapping.h"
>>   
>> -#ifdef CONFIG_IRQ_REMAP
>> -
>>   /*
>>    * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
>>    * Redirection Table. Hyper-V exposes one single IO-APIC and so define
>> @@ -331,4 +331,4 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>>   	.free = hyperv_root_irq_remapping_free,
>>   };
>>   
>> -#endif
>> +#endif  /* CONFIG_IRQ_REMAP */
>> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
>> index c2443659812a..41bf65e4ea88 100644
>> --- a/drivers/iommu/irq_remapping.c
>> +++ b/drivers/iommu/irq_remapping.c
>> @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
>>   	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
>>   		 amd_iommu_irq_ops.prepare() == 0)
>>   		remap_ops = &amd_iommu_irq_ops;
>> -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
>> +	else if (IS_ENABLED(CONFIG_HYPERV) &&
>>   		 hyperv_irq_remap_ops.prepare() == 0)
>>   		remap_ops = &hyperv_irq_remap_ops;
>>   	else


^ permalink raw reply

* Re: [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Jacob Pan @ 2026-05-12 23:46 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-2-mrathor@linux.microsoft.com>

Hi Mukesh,

On Mon, 11 May 2026 19:02:49 -0700
Mukesh R <mrathor@linux.microsoft.com> wrote:

> This file actually implements irq remapping, so rename to more
> appropriate hyperv-irq.c. A new file to implement hyperv iommu will
> be introduced later.  Also, it should not be tied to HYPERV_IOMMU,
> but to CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef
> CONFIG_IRQ_REMAP.
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  MAINTAINERS                                    | 2 +-
>  drivers/iommu/Makefile                         | 2 +-
>  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
Given that we have multiple Hyper-V IOMMU-related files — this renamed
hyperv-irq.c, the existing hyperv-iommu code, iommu-root (this
series) and the recently posted guest pvIOMMU driver — should we create
a drivers/iommu/hyperv/ directory to consolidate them?

>  drivers/iommu/irq_remapping.c                  | 2 +-
>  4 files changed, 6 insertions(+), 6 deletions(-)
>  rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1cc0e12fe1f..f803a6a38fee 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11914,7 +11914,7 @@ F:	drivers/clocksource/hyperv_timer.c
>  F:	drivers/hid/hid-hyperv.c
>  F:	drivers/hv/
>  F:	drivers/input/serio/hyperv-keyboard.c
> -F:	drivers/iommu/hyperv-iommu.c
> +F:	drivers/iommu/hyperv-irq.c
>  F:	drivers/net/ethernet/microsoft/
>  F:	drivers/net/hyperv/
>  F:	drivers/pci/controller/pci-hyperv-intf.c
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 0275821f4ef9..335ea77cced6 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> +obj-$(CONFIG_HYPERV) += hyperv-irq.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
> similarity index 99%
> rename from drivers/iommu/hyperv-iommu.c
> rename to drivers/iommu/hyperv-irq.c
> index 479103261ae6..d11076f906fb 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-irq.c
> @@ -8,6 +8,8 @@
>   * Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
>   */
>  
> +#ifdef CONFIG_IRQ_REMAP
> +
>  #include <linux/types.h>
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
> @@ -24,8 +26,6 @@
>  
>  #include "irq_remapping.h"
>  
> -#ifdef CONFIG_IRQ_REMAP
> -
>  /*
>   * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
>   * Redirection Table. Hyper-V exposes one single IO-APIC and so define
> @@ -331,4 +331,4 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>  	.free = hyperv_root_irq_remapping_free,
>  };
>  
> -#endif
> +#endif  /* CONFIG_IRQ_REMAP */
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index c2443659812a..41bf65e4ea88 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
>  	else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
>  		 amd_iommu_irq_ops.prepare() == 0)
>  		remap_ops = &amd_iommu_irq_ops;
> -	else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
> +	else if (IS_ENABLED(CONFIG_HYPERV) &&
>  		 hyperv_irq_remap_ops.prepare() == 0)
>  		remap_ops = &hyperv_irq_remap_ops;
>  	else


^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: sashiko-bot @ 2026-05-12 23:45 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260511162408.1180069-5-zhangyu1@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] hv_iommu_fill_iova_list() incorrectly uses guest PAGE_SHIFT instead of hypervisor HV_HYP_PAGE_SHIFT, causing incomplete and incorrect IOTLB flushes on architectures with non-4KB pages.
- [High] Unconditional use of the new HVCALL_FLUSH_DEVICE_DOMAIN_LIST hypercall without a capability check or fallback.
- [Medium] Bitfield assignment on uninitialized flexible array member causes KMSAN use-of-uninitialized-value warnings.
--

commit 5fd02fd07f9fbc4f1734f312b8a33dfd79a38961
Author: Yu Zhang <zhangyu1@linux.microsoft.com>

iommu/hyperv: Add page-selective IOTLB flush support

This patch adds page-selective IOTLB flush support using the
HVCALL_FLUSH_DEVICE_DOMAIN_LIST hypercall. It decomposes IOVA ranges into
aligned power-of-two regions to fit in a single hypercall page, falling
back to a full domain flush if the range exceeds page capacity.
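
For readers following the decomposition, here is a minimal user-space sketch of
the idea, assuming GCC/Clang builtins. The names `struct flush_region` and
`decompose_range` and the -1 overflow return are illustrative only; they are
not the driver's API.

```c
#include <stddef.h>

/* Hypothetical region descriptor: 2^order pages starting at pfn. */
struct flush_region {
	unsigned long pfn;
	unsigned int order;
};

/*
 * Split [start_pfn, start_pfn + nr_pages) into naturally aligned
 * power-of-two regions. Returns the number of regions written, or -1
 * if more than max regions would be needed.
 */
static int decompose_range(unsigned long start_pfn, unsigned long nr_pages,
			   struct flush_region *out, size_t max)
{
	const unsigned int bits = 8 * sizeof(unsigned long);
	size_t count = 0;

	while (nr_pages > 0) {
		/* Largest order the start alignment allows (pfn 0 is aligned to anything). */
		unsigned int pfn_align = start_pfn ?
			(unsigned int)__builtin_ctzl(start_pfn) : bits - 1;
		/* Largest order that still fits in the remaining length. */
		unsigned int size_align =
			bits - 1 - (unsigned int)__builtin_clzl(nr_pages);
		unsigned int order = pfn_align < size_align ? pfn_align : size_align;

		if (count >= max)
			return -1;

		out[count].pfn = start_pfn;
		out[count].order = order;
		count++;

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
	return (int)count;
}
```

Taking the smaller of the two orders keeps every region both aligned to its own
size and inside the requested range, which is why an unaligned range needs more
entries than an aligned one of the same length.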

> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> index e5fc625314b55..3bca362b78150 100644
> --- a/drivers/iommu/hyperv/iommu.c
> +++ b/drivers/iommu/hyperv/iommu.c

[ ... ]

> +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> +					  unsigned long start,
> +					  unsigned long end)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;

Should this use HV_HYP_PAGE_SHIFT instead of PAGE_SHIFT?

If the kernel is compiled with a larger page size (like 64KB on ARM64),
start >> PAGE_SHIFT produces a 64KB guest PFN.

Since Hyper-V hypercalls expect 4KB PFNs (HV_HYP_PAGE_SHIFT = 12), this could
result in flushing the wrong pages and leaving stale DMA mappings active.
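
A small user-space sketch of the difference, under two assumptions that are
not taken from the patch: the hypervisor page size is fixed at 4 KiB (named
`HYP_PAGE_SHIFT` here), and the range end is the inclusive last byte. The
helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed for illustration: the hypervisor always uses 4 KiB pages. */
#define HYP_PAGE_SHIFT 12

/* First hypervisor page frame touched by the byte range. */
static uint64_t hyp_first_pfn(uint64_t start)
{
	return start >> HYP_PAGE_SHIFT;
}

/* Number of hypervisor pages covering [start, last], with last inclusive. */
static uint64_t hyp_nr_pages(uint64_t start, uint64_t last)
{
	return (last >> HYP_PAGE_SHIFT) - (start >> HYP_PAGE_SHIFT) + 1;
}
```

For a single 64 KiB guest page at 0x10000, shifting by a guest page shift of 16
yields frame 1, while the 4 KiB view is frame 16 spanning 16 pages, so the two
numbering schemes name different memory.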

> +	unsigned long nr_pages = end_pfn - start_pfn;
> +	u16 count = 0;
> +
> +	while (nr_pages > 0) {
> +		unsigned long flush_pages;
> +		int order;
> +		unsigned long pfn_align;
> +		unsigned long size_align;
> +
> +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> +			break;
> +		}
> +
> +		if (start_pfn)
> +			pfn_align = __ffs(start_pfn);
> +		else
> +			pfn_align = BITS_PER_LONG - 1;
> +
> +		size_align = __fls(nr_pages);
> +		order = min(pfn_align, size_align);
> +		iova_list[count].page_mask_shift = order;
> +		iova_list[count].page_number = start_pfn;

Could this assignment trigger a KMSAN use-of-uninitialized-value warning?

The input buffer is from a per-CPU page and memset in
hv_flush_device_domain_list() only zeroes the fixed-size header, leaving the
flexible array uninitialized.

Assigning to bitfields forces a read-modify-write on the uninitialized 64-bit
union hv_iommu_flush_va.
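
One possible way to avoid the read-modify-write, sketched in user space: build
the entry in a fully initialized local and store it as a whole word. The union
layout below is hypothetical (the real `union hv_iommu_flush_va` is not shown
in this hunk and may differ), and `set_flush_va` is an illustrative name.

```c
#include <stdint.h>

/* Hypothetical layout; the real union hv_iommu_flush_va may differ. */
union flush_va {
	uint64_t as_u64;
	struct {
		uint64_t page_mask_shift : 12;
		uint64_t page_number : 52;
	};
};

static void set_flush_va(union flush_va *slot, uint64_t pfn, unsigned int order)
{
	union flush_va e = { .as_u64 = 0 };	/* fully initialized local */

	e.page_mask_shift = order;
	e.page_number = pfn;
	*slot = e;				/* one whole-word store */
}
```

Zeroing the used part of the array before filling it would be another option;
which is preferable depends on how many entries are typically written.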

> +
> +		flush_pages = 1UL << order;
> +		start_pfn += flush_pages;
> +		nr_pages -= flush_pages;
> +		count++;
> +	}
> +
> +	return count;
> +}
> +
> +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> +					struct iommu_iotlb_gather *iotlb_gather)
> +{
> +	u64 status;
> +	u16 count;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain_list *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = hv_domain->device_domain;
> +	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> +	count = hv_iommu_fill_iova_list(input->iova_list,
> +					iotlb_gather->start,
> +					iotlb_gather->end);
> +	if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> +		/*
> +		 * Range exceeds hypercall page capacity. Fall back to a full
> +		 * domain flush.
> +		 */
> +		struct hv_input_flush_device_domain *flush_all = (void *)input;
> +
> +		memset(flush_all, 0, sizeof(*flush_all));
> +		flush_all->device_domain = hv_domain->device_domain;
> +		status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> +					flush_all, NULL);
> +	} else {
> +		status = hv_do_rep_hypercall(
> +				HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> +				count, 0, input, NULL);
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);

Does this code need a capability check or fallback for older hosts?

If the hypervisor doesn't support the new HVCALL_FLUSH_DEVICE_DOMAIN_LIST
hypercall, hv_do_rep_hypercall() might return an error status.

The code prints an error but doesn't fall back to HVCALL_FLUSH_DEVICE_DOMAIN,
potentially leaving the IOTLB unflushed.
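
A minimal sketch of one fallback policy, written as plain C so it can be read
in isolation. Everything here is hypothetical: `struct flush_caps`,
`pick_flush` and `list_flush_failed` are illustrative names, not existing
driver or hypervisor interfaces.

```c
#include <stdbool.h>

enum flush_kind { FLUSH_LIST, FLUSH_ALL };

/* Hypothetical per-IOMMU state; not a field of any real driver struct. */
struct flush_caps {
	bool list_supported;
};

/* Decide which flush to issue for this request. */
static enum flush_kind pick_flush(const struct flush_caps *caps, bool overflow)
{
	if (!caps->list_supported || overflow)
		return FLUSH_ALL;
	return FLUSH_LIST;
}

/*
 * Record the outcome of a list flush. Returns true if the caller must
 * now issue a full-domain flush so the IOTLB is not left stale.
 */
static bool list_flush_failed(struct flush_caps *caps, bool ok)
{
	if (ok)
		return false;
	caps->list_supported = false;	/* stop trying the list hypercall */
	return true;
}
```

The caller would issue the list flush when `pick_flush()` returns `FLUSH_LIST`
and follow up with a full-domain flush whenever `list_flush_failed()` returns
true. Disabling the list path after any failure is a deliberate simplification;
a real implementation would probably only do so for an "unsupported" status.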

> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260511162408.1180069-1-zhangyu1@linux.microsoft.com?part=4

^ permalink raw reply

* Re: [PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support
From: David Hildenbrand (Arm) @ 2026-05-12 19:18 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, Liam.Howlett, akpm, decui, haiyangz, jgg, corbet, leon,
	longli, ljs, mhocko, rppt, shuah, skhan, surenb, vbabka, wei.liu,
	linux-doc, linux-hyperv, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <agNS4llNtAHBkMA2@skinsburskii.localdomain>

On 5/12/26 18:18, Stanislav Kinsburskii wrote:
> On Tue, May 12, 2026 at 10:42:14AM +0200, David Hildenbrand (Arm) wrote:
>>
>>> +	for (; addr < end; addr += PAGE_SIZE) {
>>> +		vm_fault_t ret;
>>> +
>>> +		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
>>> +
>>> +		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
>>> +			/*
>>> +			 * The mmap lock has been dropped by the fault handler.
>>> +			 * Record the failing address and signal lock-drop to
>>> +			 * the caller.
>>> +			 */
>>> +			*hmm_vma_walk->locked = 0;
>>> +			hmm_vma_walk->last = addr;
>>> +			return -EAGAIN;
>>
>>
>> Okay, so we'll return straight from hmm_vma_fault() to
>> hmm_vma_handle_pte()/hmm_vma_walk_pmd() -> walk_page_range() machinery.
>>
>> Hopefully we don't refer to the MM/VMA on any path there? It would be nicer if
>> the hmm_vma_fault() could be called by the caller of walk_page_range(), but
>> that's tricky I guess, as hmm_vma_fault() consumes the walk structure and
>> requires the vma in there.
>>
> 
> It looks like a caller can provide a post_vma callback in mm_walk_ops. I
> missed that case here. This callback cannot be supported by this change.
> I will update the patch.
> 
>>
>> Note: am I wrong, or is hmm_vma_fault() really always called with
>> required_fault=true?
>>
> 
> No, hmm_pte_need_fault can return false.

That's not what I mean. Looks like all paths leading to hmm_vma_fault() have
required_fault = true;

IOW, there is always a "if (required_fault)" before it one way or the other.

Ah, and there even is a "WARN_ON_ONCE(!required_fault)" in the function. What an
odd thing to do :)

> 
>>> +		}
>>> +
>>> +		if (ret & VM_FAULT_ERROR)
>>>  			return -EFAULT;
>>> +	}
>>>  	return -EBUSY;
>>>  }
>>>  
>>> @@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>>>  	if (required_fault) {
>>>  		int ret;
>>>  
>>> +		/*
>>> +		 * Faulting hugetlb pages on the unlockable path is not
>>> +		 * supported. The walk framework holds hugetlb_vma_lock_read
>>> +		 * which must be dropped before handle_mm_fault, but if the
>>> +		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
>>> +		 * be freed and the walk framework's unconditional unlock
>>> +		 * becomes a use-after-free.
>>> +		 */
>>> +		if (hmm_vma_walk->locked)
>>> +			return -EFAULT;
>>
>> Just because it's unlockable doesn't mean that you must unlock. Can't this be
>> kept working as is, just simulating here as if it would not be unlockable?
>>
> 
> I’m not sure how to implement this. The walk_page_range code expects the
> hugetlb VMA to still be read-locked when we return from
> hmm_vma_walk_hugetlb_entry. How can we guarantee that if the VMA might
> be gone?
> 
> I added a note in the docs. Whoever tackles this will likely need to
> either rework `walk_page_range` to handle the case where the VMA is
> gone, or use a different approach.
> 
> Do you have any other suggestions on how to implement it?

You just want hmm_vma_fault() to not set
"FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE".

The hacky way could be:

diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83d..83dba990e10a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -564,6 +564,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
        required_fault =
                hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
        if (required_fault) {
+               int *saved_locked = hmm_vma_walk->locked;
                int ret;

                spin_unlock(ptl);
@@ -576,7 +577,9 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
                 * use here of either pte or ptl after dropping the vma
                 * lock.
                 */
+               hmm_vma_walk->locked = NULL;
                ret = hmm_vma_fault(addr, end, required_fault, walk);
+               hmm_vma_walk->locked = saved_locked;
                hugetlb_vma_lock_read(vma);
                return ret;
        }

But really, I think we should just try to get uffd support working properly, not
excluding hugetlb.

GUP achieves it properly by performing the fault handling outside of page table
walking context ... essentially what I described in my first comment above:
return the information to the caller and let it just trigger the fault.

The issue here is that we trigger a fault out of walk_hugetlb_range() where we
still hold locks, resulting in this questionable hugetlb_vma_unlock_read +
hugetlb_vma_lock_read pattern.

The fault should just be triggered from a place where we don't have to play with
hugetlb vma locks or be afraid that dropping the mmap lock causes other problems.
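
As a purely illustrative user-space model of that structure (all names are made
up and nothing here is mm/ code): the walk only inspects state and reports the
first missing page, and the caller triggers the fault after the walk has
returned, then resumes.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy address space: present[i] says whether page i is mapped. */
struct toy_mm {
	bool present[NPAGES];
	int faults;
};

/* Walk step: read-only; reports the first missing page, never faults. */
static int toy_walk(const struct toy_mm *mm, size_t start, size_t end,
		    size_t *missing)
{
	for (size_t i = start; i < end; i++) {
		if (!mm->present[i]) {
			*missing = i;
			return -1;
		}
	}
	return 0;
}

/* Fault step: runs after the walk returned, so no walk-time locks are held. */
static void toy_fault(struct toy_mm *mm, size_t idx)
{
	mm->present[idx] = true;
	mm->faults++;
}

/* Caller loop: walk, fault outside the walk, resume from the faulted page. */
static void toy_range_fault(struct toy_mm *mm, size_t start, size_t end)
{
	size_t missing;

	while (toy_walk(mm, start, end, &missing) != 0) {
		toy_fault(mm, missing);
		start = missing;
	}
}
```

The point of the split is that the fault step may drop and retake whatever
outer lock it likes, because the walk has already unwound by then.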


-- 
Cheers,

David

^ permalink raw reply related

* [PATCH 7.0 142/307] hv: Select CONFIG_SYSFB only for CONFIG_HYPERV_VMBUS
From: Greg Kroah-Hartman @ 2026-05-12 17:38 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Thomas Zimmermann, Michael Kelley,
	Saurabh Sengar, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Dexuan Cui, Long Li, linux-hyperv
In-Reply-To: <20260512173940.117428952@linuxfoundation.org>

7.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Thomas Zimmermann <tzimmermann@suse.de>

commit d33db956c9618e7cb08c2520ce708437914214ec upstream.

Hyperv's sysfb access only exists in the VMBUS support. Therefore
only select CONFIG_SYSFB for CONFIG_HYPERV_VMBUS. Avoids sysfb code
on systems that don't need it.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests")
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Saurabh Sengar <ssengar@linux.microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-hyperv@vger.kernel.org
Cc: <stable@vger.kernel.org> # v6.16+
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Link: https://patch.msgid.link/20260402092305.208728-2-tzimmermann@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/hv/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -9,7 +9,6 @@ config HYPERV
 	select PARAVIRT
 	select X86_HV_CALLBACK_VECTOR if X86
 	select OF_EARLY_FLATTREE if OF
-	select SYSFB if EFI && !HYPERV_VTL_MODE
 	select IRQ_MSI_LIB if X86
 	help
 	  Select this option to run Linux as a Hyper-V client operating
@@ -62,6 +61,7 @@ config HYPERV_VMBUS
 	tristate "Microsoft Hyper-V VMBus driver"
 	depends on HYPERV
 	default HYPERV
+	select SYSFB if EFI && !HYPERV_VTL_MODE
 	help
 	  Select this option to enable Hyper-V Vmbus driver.
 



^ permalink raw reply

* [PATCH 6.18 121/270] hv: Select CONFIG_SYSFB only for CONFIG_HYPERV_VMBUS
From: Greg Kroah-Hartman @ 2026-05-12 17:38 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Thomas Zimmermann, Michael Kelley,
	Saurabh Sengar, Wei Liu, K. Y. Srinivasan, Haiyang Zhang,
	Dexuan Cui, Long Li, linux-hyperv
In-Reply-To: <20260512173938.452574370@linuxfoundation.org>

6.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Thomas Zimmermann <tzimmermann@suse.de>

commit d33db956c9618e7cb08c2520ce708437914214ec upstream.

Hyperv's sysfb access only exists in the VMBUS support. Therefore
only select CONFIG_SYSFB for CONFIG_HYPERV_VMBUS. Avoids sysfb code
on systems that don't need it.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests")
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Saurabh Sengar <ssengar@linux.microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-hyperv@vger.kernel.org
Cc: <stable@vger.kernel.org> # v6.16+
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Link: https://patch.msgid.link/20260402092305.208728-2-tzimmermann@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/hv/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -9,7 +9,6 @@ config HYPERV
 	select PARAVIRT
 	select X86_HV_CALLBACK_VECTOR if X86
 	select OF_EARLY_FLATTREE if OF
-	select SYSFB if EFI && !HYPERV_VTL_MODE
 	select IRQ_MSI_LIB if X86
 	help
 	  Select this option to run Linux as a Hyper-V client operating
@@ -61,6 +60,7 @@ config HYPERV_VMBUS
 	tristate "Microsoft Hyper-V VMBus driver"
 	depends on HYPERV
 	default HYPERV
+	select SYSFB if EFI && !HYPERV_VTL_MODE
 	help
 	  Select this option to enable Hyper-V Vmbus driver.
 



^ permalink raw reply
