The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module
@ 2026-05-07 15:42 Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access Stanislav Kinsburskii
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:42 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series addresses bugs found during a continued review of the
mshv_root module mostly introduced by commit 621191d709b14 ("Drivers: hv:
Introduce mshv_root module to expose /dev/mshv to VMMs").

Changes in v4:
- Dropped the following patches as the issues they fix don't happen in
  practice:
    - mshv: Fix potential integer overflow in mshv_region_create
    - mshv: Fix potential u64 overflow in region overlap check
    - mshv: Add defensive synchronize_srcu in irqfd shutdown

- Added new fixes:
    - mshv: irqfd: Reject routing updates that invalidate resampler binding
    - mshv: Fix sleeping under spinlock in mshv_portid_alloc
    - mshv: Order pt_vp_array publish against irqfd assertion path
    - mshv: Defer mshv_vp free to an RCU grace period
    - mshv: Publish VP to pt_vp_array before installing the file descriptor

- Replaced:
    - mshv: Fix use-after-RCU in mshv_portid_lookup 
      by
      mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
    - mshv: Add store/load ordering for VP array publish
      by
      mshv: Order pt_vp_array publish against irqfd assertion path

Changes in v3:
- "Fix mshv_prepare_pinned_region error path for unencrypted
  partitions": removed inline mshv_region_invalidate() to prevent
  zeroing mreg_pages before mshv_region_destroy() can unmap partial
  SLAT mappings; for encrypted share-failure, memset the page array
  without unpinning (pages are host-inaccessible).
- "Consolidate irqfd interrupt injection paths": fixed data race in
  mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
  the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
  concurrent routing update from injecting vector 0 to VP 0.
- "Add missing vp_index bounds check in intercept ISR": added
  array_index_nospec() after the bounds check to prevent speculative
  out-of-bounds array access.
- "Add store/load ordering for VP array publish": added missing
  smp_load_acquire in mshv_try_assert_irq_fast.

Changes in v2:
- Added 8 new patches addressing issues found by Sashiko (automated
  review) covering the irqfd, portid, scheduler message, and VP
  lifecycle paths.
- Consolidated the irqfd fast/slow injection paths to eliminate
  duplicated seqcount reads and fix the GSI 0 validity bypass.
- Added memory ordering for the lockless VP array.

The fixes range from data corruption and use-after-free to silent
functional failures and sleeping-while-atomic:

 Memory region management:
  - Integer overflow on userspace-controlled allocation size
    (mshv_region_create)
  - Silent success on map failure for unencrypted partitions
    (mshv_prepare_pinned_region)
  - u64 overflow in region overlap check allowing overlapping mappings

 IRQ/eventfd path:
  - IRQ state leak and type truncation in hypercall helpers
  - Missing locking and hlist_del vs hlist_del_init race in irqfd
    deassign
  - Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
  - NULL pointer dereference on spurious interrupt to non-existent VP
    (mshv_try_assert_irq_fast)
  - Broken seqcount read protection — torn reads of interrupt routing
  - Duplicated and inconsistent validity checks between fast/slow
    injection paths; fast path could inject vector 0 spuriously
  - Level-triggered check on uninitialized data making interrupt
    resampling completely non-functional
  - Duplicate GSI 0 detection using the wrong predicate

 Port ID table:
  - Use-after-RCU in mshv_portid_lookup (dereference outside read-side
    critical section)
  - Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
    idr_lock)
  - Use kfree_rcu for deferred free without blocking

 SynIC / ISR paths:
  - Missing VP index bounds check in intercept ISR (OOB in interrupt
    context from untrusted hypervisor data)
  - Missing store/load ordering for VP array publish — lockless ISR
    readers could observe partially-initialized VP
  - Missing bounds validation in scheduler messages
    (handle_pair_message vp_count, handle_bitset_message bank_mask)

 Miscellaneous:
  - Missing error code on VP allocation failure (silent success to
    userspace)

Kudos to Claude and Sashiko for assisting with analysis and
implementation.

---

Stanislav Kinsburskii (18):
      mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
      mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
      mshv: Fix race in mshv_irqfd_deassign
      mshv: Add NULL check for vp in mshv_try_assert_irq_fast
      mshv: irqfd: Reject routing updates that invalidate resampler binding
      mshv: Fix broken seqcount read protection
      mshv: Consolidate irqfd interrupt injection paths
      mshv: Fix level-triggered check on uninitialized data
      mshv: Fix duplicate GSI detection for GSI 0
      mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
      mshv: Fix sleeping under spinlock in mshv_portid_alloc
      mshv: Use kfree_rcu in mshv_portid_free
      mshv: Add missing vp_index bounds check in intercept ISR
      mshv: Order pt_vp_array publish against irqfd assertion path
      mshv: Defer mshv_vp free to an RCU grace period
      mshv: Validate scheduler message bounds from hypervisor
      mshv: Publish VP to pt_vp_array before installing the file descriptor
      mshv: Fix missing error code on VP allocation failure


 drivers/hv/mshv_eventfd.c      |  136 +++++++++++++++++++++++++---------------
 drivers/hv/mshv_irq.c          |   46 +++++++++++++-
 drivers/hv/mshv_portid_table.c |   31 ++++-----
 drivers/hv/mshv_root.h         |    3 +
 drivers/hv/mshv_root_hv_call.c |   18 ++---
 drivers/hv/mshv_root_main.c    |   72 +++++++++++++++------
 drivers/hv/mshv_synic.c        |   40 +++++++++---
 7 files changed, 233 insertions(+), 113 deletions(-)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-11  3:46   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions Stanislav Kinsburskii
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.

Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.

While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 129456bd72aba..cc580225e9e45 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
-	int done = 0;
+	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
 	u64 page_count = page_struct_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain,
-				    HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain,
+				      HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->flags = flags;
 		input_page->host_access = host_access;
 
-		for (i = 0; i < rep_count; i++) {
-			u64 index = (done + i) << large_shift;
-
-			if (index >= page_struct_count)
-				return -EINVAL;
-
+		for (i = 0; i < rep_count; i++)
 			input_page->spa_page_list[i] =
-						page_to_pfn(pages[index]);
-		}
+				page_to_pfn(pages[(done + i) << large_shift]);
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-11 13:48   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 03/18] mshv: Fix race in mshv_irqfd_deassign Stanislav Kinsburskii
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
fails on an unencrypted partition. The condition on the error path:

    if (ret && mshv_partition_encrypted(partition))

only handles map failures for encrypted partitions — if the partition is
not encrypted and the map fails, execution falls through to 'return 0',
silently ignoring the error.

Additionally, calling mshv_region_invalidate() inline on map failure
zeroes the mreg_pages array before the caller's cleanup path
(mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
pages where mreg_pages[offset] is NULL, this can leave stale SLAT
mappings for partially-mapped pages.

Fix by returning immediately on success and falling through to error
return on failure. For unencrypted partitions, the caller's
mshv_region_destroy() handles unmap followed by invalidate in the
correct order. For encrypted partitions where re-sharing fails, zero
the page array without unpinning — the pages are inaccessible to the
host and must not be unpinned, but zeroing prevents
mshv_region_destroy() from attempting to unpin them.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 665d565899c15..7e4252b6bc65c 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 			pt_err(partition,
 			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
 			       region->start_gfn, ret);
-			goto invalidate_region;
+			goto err_out;
 		}
 	}
 
 	ret = mshv_region_map(region);
-	if (ret && mshv_partition_encrypted(partition)) {
+	if (ret)
+		goto share_region;
+
+	return 0;
+
+share_region:
+	if (mshv_partition_encrypted(partition)) {
 		int shrc;
 
 		shrc = mshv_region_share(region);
 		if (!shrc)
-			goto invalidate_region;
+			goto err_out;
 
 		pt_err(partition,
 		       "Failed to share memory region (guest_pfn: %llu): %d\n",
 		       region->start_gfn, shrc);
 		/*
-		 * Don't unpin if marking shared failed because pages are no
-		 * longer mapped in the host, ie root, anymore.
+		 * Re-sharing failed — the pages remain inaccessible to the
+		 * host.  Zero the page array so that mshv_region_destroy()
+		 * won't attempt to unpin them (leaking the page references
+		 * is intentional; unpinning host-inaccessible pages would be
+		 * unsafe).
 		 */
+		memset(region->mreg_pages, 0,
+		       region->nr_pages * sizeof(region->mreg_pages[0]));
 		goto err_out;
 	}
-
-	return 0;
-
-invalidate_region:
-	mshv_region_invalidate(region);
 err_out:
 	return ret;
 }



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 03/18] mshv: Fix race in mshv_irqfd_deassign
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-11 13:57   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast Stanislav Kinsburskii
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_irqfd_deactivate() and the hlist traversal of pt_irqfds_list
require pt->pt_irqfds_lock to be held, but mshv_irqfd_deassign()
omits it. This races with the EPOLLHUP path in mshv_irqfd_wakeup(),
which does take the lock before calling mshv_irqfd_deactivate().

Additionally, mshv_irqfd_deactivate() uses hlist_del() which poisons
the node pointers rather than resetting them. Since
mshv_irqfd_is_active() relies on hlist_unhashed() (checks pprev ==
NULL), a poisoned node still appears active. If a concurrent path calls
mshv_irqfd_deactivate() again on the same irqfd, the guard fails to
prevent a double hlist_del() on poisoned pointers.

Fix both issues:
- Add the missing spin_lock_irq/spin_unlock_irq around the list
  traversal in mshv_irqfd_deassign(), matching mshv_irqfd_release().
- Use hlist_del_init() instead of hlist_del() so the node is properly
  marked as unhashed after removal, making the is_active guard reliable.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 90959f639dc32..5995a62aff8d8 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -284,7 +284,7 @@ static void mshv_irqfd_deactivate(struct mshv_irqfd *irqfd)
 	if (!mshv_irqfd_is_active(irqfd))
 		return;
 
-	hlist_del(&irqfd->irqfd_hnode);
+	hlist_del_init(&irqfd->irqfd_hnode);
 
 	queue_work(irqfd_cleanup_wq, &irqfd->irqfd_shutdown);
 }
@@ -541,13 +541,14 @@ static int mshv_irqfd_deassign(struct mshv_partition *pt,
 	if (IS_ERR(eventfd))
 		return PTR_ERR(eventfd);
 
+	spin_lock_irq(&pt->pt_irqfds_lock);
 	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list,
 				  irqfd_hnode) {
 		if (irqfd->irqfd_eventfd_ctx == eventfd &&
 		    irqfd->irqfd_irqnum == args->gsi)
-
 			mshv_irqfd_deactivate(irqfd);
 	}
+	spin_unlock_irq(&pt->pt_irqfds_lock);
 
 	eventfd_ctx_put(eventfd);
 



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (2 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 03/18] mshv: Fix race in mshv_irqfd_deassign Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-11  3:24   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 05/18] mshv: irqfd: Reject routing updates that invalidate resampler binding Stanislav Kinsburskii
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_try_assert_irq_fast() dereferences the vp pointer obtained from
pt_vp_array[lapic_apic_id] without checking for NULL or validating that
lapic_apic_id is within bounds. A spurious interrupt from the hypervisor
targeting a non-existent VP (or one not yet created) causes a NULL
pointer dereference and crashes the host.

Add a bounds check on lapic_apic_id against MSHV_MAX_VPS and a NULL
check on the vp pointer before dereferencing.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 5995a62aff8d8..b398e58411dd7 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -169,7 +169,12 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 		return -EOPNOTSUPP;
 #endif
 
+	if (irq->lapic_apic_id >= MSHV_MAX_VPS)
+		return -EINVAL;
+
 	vp = partition->pt_vp_array[irq->lapic_apic_id];
+	if (!vp)
+		return -EINVAL;
 
 	if (!vp->vp_register_page)
 		return -EOPNOTSUPP;



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 05/18] mshv: irqfd: Reject routing updates that invalidate resampler binding
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (3 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 06/18] mshv: Fix broken seqcount read protection Stanislav Kinsburskii
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

A resampler-bound irqfd is only meaningful for level-triggered interrupts,
because the resampler signals userspace on guest EOI and edge-triggered
interrupts have no EOI to react to.  mshv_irqfd_assign() already rejects
the combination at registration time, but nothing prevented a subsequent
MSHV_SET_MSI_ROUTING ioctl from flipping the cached
lapic_control.level_triggered bit to edge-triggered while the resampler was
still attached. Once that happened, the WARN_ON() in mshv_assert_irq_slow()
was the only signal of an inconsistency the kernel had no way to recover
from: the resampler stayed wired in but no EOI ever arrived, so userspace
never saw a resample signal and the guest interrupt could become stuck.

Add mshv_irqfd_validate_routing() and call it from
mshv_update_routing_table() before publishing the new table. It walks
pt_irqfds_list looking for resampler-bound irqfds whose new routing entry
would be valid but not level-triggered, and returns -EINVAL to reject the
routing change atomically. The check is x86-only because
lapic_control.level_triggered is only populated on x86; on ARM64
mshv_copy_girq_info() unconditionally sets asserted = 1 and the invariant
does not apply.

The validator briefly takes pt_irqfds_lock for the list walk and drops it
before mshv_irqfd_routing_update() reacquires it.  This is safe because the
partition ioctl dispatcher holds pt_mutex for the duration of every
partition ioctl, so MSHV_IRQFD (which calls mshv_irqfd_assign()) cannot run
concurrently with MSHV_SET_MSI_ROUTING; no new resampler-bound irqfd can be
inserted between validation and refresh.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_irq.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index b3142c84dcbc2..59584a132ca9f 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -14,7 +14,44 @@
 #include "mshv.h"
 #include "mshv_root.h"
 
-/* called from the ioctl code, user wants to update the guest irq table */
+static int mshv_irqfd_validate_routing(struct mshv_partition *pt,
+				       struct mshv_girq_routing_table *new)
+{
+	int r = 0;
+#if IS_ENABLED(CONFIG_X86)
+	struct mshv_irqfd *irqfd;
+
+	if (!new)
+		return 0;
+
+	spin_lock_irq(&pt->pt_irqfds_lock);
+	hlist_for_each_entry(irqfd, &pt->pt_irqfds_list, irqfd_hnode) {
+		struct mshv_guest_irq_ent ent = {};
+		struct mshv_lapic_irq lirq;
+
+		if (!irqfd->irqfd_resampler)
+			continue;
+
+		if (irqfd->irqfd_irqnum < new->num_rt_entries)
+			ent = new->mshv_girq_info_tbl[irqfd->irqfd_irqnum];
+
+		mshv_copy_girq_info(&ent, &lirq);
+
+		if (ent.girq_entry_valid &&
+		    !lirq.lapic_control.level_triggered) {
+			r = -EINVAL;
+			break;
+		}
+	}
+	spin_unlock_irq(&pt->pt_irqfds_lock);
+#endif
+	return r;
+}
+
+/*
+ * Called from the ioctl code, user wants to update the guest irq table.
+ * Serialized with mshv_irqfd_assign by partition mutex.
+ */
 int mshv_update_routing_table(struct mshv_partition *partition,
 			      const struct mshv_user_irq_entry *ue,
 			      unsigned int numents)
@@ -65,6 +102,11 @@ int mshv_update_routing_table(struct mshv_partition *partition,
 
 swap_routes:
 	mutex_lock(&partition->pt_irq_lock);
+	r = mshv_irqfd_validate_routing(partition, new);
+	if (r) {
+		mutex_unlock(&partition->pt_irq_lock);
+		goto out;
+	}
 	old = rcu_dereference_protected(partition->pt_girq_tbl, 1);
 	rcu_assign_pointer(partition->pt_girq_tbl, new);
 	mshv_irqfd_routing_update(partition);



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 06/18] mshv: Fix broken seqcount read protection
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (4 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 05/18] mshv: irqfd: Reject routing updates that invalidate resampler binding Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 07/18] mshv: Consolidate irqfd interrupt injection paths Stanislav Kinsburskii
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_irqfd_update() writes both irqfd_girq_ent and irqfd_lapic_irq as a
logical unit under seqcount write protection.  Readers must snapshot these
fields inside the seqcount begin/retry loop to obtain a consistent
point-in-time view; otherwise a concurrent update can produce a torn read
where one field comes from the old state and the other from the new.

Both mshv_assert_irq_slow() and mshv_irqfd_wakeup() got this wrong: the
seqcount loop bodies were empty (just spinning until a stable sequence was
observed), and all reads of the protected fields happened after the loop
with no protection from concurrent writes. If mshv_irqfd_update() races
with interrupt assertion, the caller may use a stale or mixed
vector/apic_id/control combination, delivering an interrupt to the wrong
vCPU, with the wrong vector, or with the wrong trigger mode.  This can
cause spurious or lost interrupts in the guest, or a stuck interrupt line
in the level-triggered case.

Fix mshv_assert_irq_slow() by snapshotting both irqfd_girq_ent and
irqfd_lapic_irq into local variables inside the seqcount loop, then using
those locals for the validity check, the resampler WARN_ON() and the
hypercall.  Reorder the function so the seqcount loop runs first and every
subsequent read of the protected fields is satisfied from the snapshot.

Fix mshv_irqfd_wakeup() by snapshotting irqfd_lapic_irq inside its seqcount
loop and passing the snapshot to mshv_try_assert_irq_fast(), so the fast
path operates on the consistent copy rather than reading the field directly
outside seqcount protection.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index b398e58411dd7..25bdc5e678849 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -151,10 +151,10 @@ static int mshv_vp_irq_set_vector(struct mshv_vp *vp, u32 vector)
  * Try to raise irq for guest via shared vector array. hyp does the actual
  * inject of the interrupt.
  */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+				    const struct mshv_lapic_irq *irq)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
 	struct mshv_vp *vp;
 
 	if (!(ms_hyperv.ext_features &
@@ -191,7 +191,8 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 	return 0;
 }
 #else /* CONFIG_X86_64 */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+				    const struct mshv_lapic_irq *irq)
 {
 	return -EOPNOTSUPP;
 }
@@ -200,30 +201,33 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
+	struct mshv_guest_irq_ent girq_ent;
+	struct mshv_lapic_irq irq;
 	unsigned int seq;
 	int idx;
 
+	idx = srcu_read_lock(&partition->pt_irq_srcu);
+
+	do {
+		seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+		girq_ent = irqfd->irqfd_girq_ent;
+		irq = irqfd->irqfd_lapic_irq;
+	} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
 #if IS_ENABLED(CONFIG_X86)
 	WARN_ON(irqfd->irqfd_resampler &&
-		!irq->lapic_control.level_triggered);
+		!irq.lapic_control.level_triggered);
 #endif
 
-	idx = srcu_read_lock(&partition->pt_irq_srcu);
-	if (irqfd->irqfd_girq_ent.guest_irq_num) {
-		if (!irqfd->irqfd_girq_ent.girq_entry_valid) {
-			srcu_read_unlock(&partition->pt_irq_srcu, idx);
-			return;
-		}
-
-		do {
-			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
-		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+	if (girq_ent.guest_irq_num && !girq_ent.girq_entry_valid) {
+		srcu_read_unlock(&partition->pt_irq_srcu, idx);
+		return;
 	}
 
-	hv_call_assert_virtual_interrupt(irqfd->irqfd_partn->pt_id,
-					 irq->lapic_vector, irq->lapic_apic_id,
-					 irq->lapic_control);
+	hv_call_assert_virtual_interrupt(partition->pt_id,
+					 irq.lapic_vector, irq.lapic_apic_id,
+					 irq.lapic_control);
+
 	srcu_read_unlock(&partition->pt_irq_srcu, idx);
 }
 
@@ -309,16 +313,18 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
 	int ret = 0;
 
 	if (flags & EPOLLIN) {
+		struct mshv_lapic_irq irq;
 		u64 cnt;
 
 		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
 		idx = srcu_read_lock(&pt->pt_irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+			irq = irqfd->irqfd_lapic_irq;
 		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
 
 		/* An event has been signaled, raise an interrupt */
-		ret = mshv_try_assert_irq_fast(irqfd);
+		ret = mshv_try_assert_irq_fast(irqfd, &irq);
 		if (ret)
 			mshv_assert_irq_slow(irqfd);
 



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 07/18] mshv: Consolidate irqfd interrupt injection paths
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (5 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 06/18] mshv: Fix broken seqcount read protection Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-07 15:43 ` [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data Stanislav Kinsburskii
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

irqfd interrupt injection had divergent seqcount snapshot scaffolding in
three places, and inconsistent validity checks between the fast and slow
assert paths:

1. mshv_irqfd_wakeup() snapshotted irqfd_lapic_irq under seqcount, then on
fast-path failure called mshv_assert_irq_slow(), which re-snapshotted both
irqfd_girq_ent and irqfd_lapic_irq under seqcount again — wasteful and
duplicative.

2. The girq_entry_valid check only existed in the slow path.  The fast path
would happily accept a zero-initialised mshv_lapic_irq when routing was not
yet configured for the GSI, potentially injecting vector 0 to VP 0.

3. The slow path's validity predicate was 'guest_irq_num &&
!girq_entry_valid', which short-circuits for GSI 0 (guest_irq_num == 0) and
so bypasses the validity check entirely for that GSI.

4. mshv_irqfd_resampler_ack() read irqfd_lapic_irq.lapic_control without
seqcount protection, which could observe a stale or transient value while
mshv_irqfd_update() was concurrently rewriting irqfd_lapic_irq via
mshv_copy_girq_info()'s memset-and-fill sequence.

Introduce mshv_irqfd_snapshot() that takes a consistent snapshot of both
irqfd_girq_ent and irqfd_lapic_irq inside the seqcount loop; girq_ent is
optional so the resampler ack path can snapshot only the LAPIC IRQ.

Use the helper from mshv_irqfd_resampler_ack() (closes 4),
mshv_irqfd_wakeup() and mshv_irqfd_assign()'s EPOLLIN replay path,
replacing the three ad-hoc seqcount loops.

Move the validity check into mshv_irqfd_wakeup() before either injection
path runs, so the fast path no longer accepts an unrouted irqfd (closes
2).  Use !girq_entry_valid as the condition (closes 3).  Change
mshv_assert_irq_slow() to take a pre-snapshotted const struct
mshv_lapic_irq pointer, eliminating its internal seqcount and SRCU
scaffolding (closes 1).

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   90 ++++++++++++++++++++++++++-------------------
 1 file changed, 52 insertions(+), 38 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 25bdc5e678849..c24069dff9702 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -74,6 +74,27 @@ static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
 }
 #endif
 
+/*
+ * Snapshot per-irqfd routing state under seqcount protection so callers
+ * see a consistent point-in-time view of irqfd_girq_ent and
+ * irqfd_lapic_irq even if mshv_irqfd_update() runs concurrently.
+ *
+ * @girq_ent may be NULL when the caller only needs the LAPIC IRQ.
+ */
+static void mshv_irqfd_snapshot(struct mshv_irqfd *irqfd,
+				struct mshv_guest_irq_ent *girq_ent,
+				struct mshv_lapic_irq *irq)
+{
+	unsigned int seq;
+
+	do {
+		seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+		if (girq_ent)
+			*girq_ent = irqfd->irqfd_girq_ent;
+		*irq = irqfd->irqfd_lapic_irq;
+	} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+}
+
 static void mshv_irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
 {
 	struct mshv_irqfd_resampler *resampler;
@@ -90,7 +111,11 @@ static void mshv_irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
 	hlist_for_each_entry_srcu(irqfd, &resampler->rsmplr_irqfd_list,
 				 irqfd_resampler_hnode,
 				 srcu_read_lock_held(&partition->pt_irq_srcu)) {
-		if (hv_should_clear_interrupt(irqfd->irqfd_lapic_irq.lapic_control.interrupt_type))
+		struct mshv_lapic_irq irq;
+
+		mshv_irqfd_snapshot(irqfd, NULL, &irq);
+
+		if (hv_should_clear_interrupt(irq.lapic_control.interrupt_type))
 			hv_call_clear_virtual_interrupt(partition->pt_id);
 
 		eventfd_signal(irqfd->irqfd_resamplefd);
@@ -198,37 +223,14 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
 }
 #endif
 
-static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
+static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd,
+				 const struct mshv_lapic_irq *irq)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_guest_irq_ent girq_ent;
-	struct mshv_lapic_irq irq;
-	unsigned int seq;
-	int idx;
-
-	idx = srcu_read_lock(&partition->pt_irq_srcu);
-
-	do {
-		seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
-		girq_ent = irqfd->irqfd_girq_ent;
-		irq = irqfd->irqfd_lapic_irq;
-	} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
-
-#if IS_ENABLED(CONFIG_X86)
-	WARN_ON(irqfd->irqfd_resampler &&
-		!irq.lapic_control.level_triggered);
-#endif
-
-	if (girq_ent.guest_irq_num && !girq_ent.girq_entry_valid) {
-		srcu_read_unlock(&partition->pt_irq_srcu, idx);
-		return;
-	}
 
 	hv_call_assert_virtual_interrupt(partition->pt_id,
-					 irq.lapic_vector, irq.lapic_apic_id,
-					 irq.lapic_control);
-
-	srcu_read_unlock(&partition->pt_irq_srcu, idx);
+					 irq->lapic_vector, irq->lapic_apic_id,
+					 irq->lapic_control);
 }
 
 static void mshv_irqfd_resampler_shutdown(struct mshv_irqfd *irqfd)
@@ -308,26 +310,31 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
 						irqfd_wait);
 	__poll_t flags = key_to_poll(key);
 	int idx;
-	unsigned int seq;
 	struct mshv_partition *pt = irqfd->irqfd_partn;
 	int ret = 0;
 
 	if (flags & EPOLLIN) {
+		struct mshv_guest_irq_ent girq_ent;
 		struct mshv_lapic_irq irq;
 		u64 cnt;
 
 		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
 		idx = srcu_read_lock(&pt->pt_irq_srcu);
-		do {
-			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
-			irq = irqfd->irqfd_lapic_irq;
-		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+		mshv_irqfd_snapshot(irqfd, &girq_ent, &irq);
+
+		if (!girq_ent.girq_entry_valid)
+			goto out_unlock;
+
+#if IS_ENABLED(CONFIG_X86)
+		WARN_ON(irqfd->irqfd_resampler &&
+			!irq.lapic_control.level_triggered);
+#endif
 
 		/* An event has been signaled, raise an interrupt */
-		ret = mshv_try_assert_irq_fast(irqfd, &irq);
-		if (ret)
-			mshv_assert_irq_slow(irqfd);
+		if (mshv_try_assert_irq_fast(irqfd, &irq))
+			mshv_assert_irq_slow(irqfd, &irq);
 
+out_unlock:
 		srcu_read_unlock(&pt->pt_irq_srcu, idx);
 
 		ret = 1;
@@ -517,8 +524,15 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 	 */
 	events = vfs_poll(fd_file(f), &irqfd->irqfd_polltbl);
 
-	if (events & EPOLLIN)
-		mshv_assert_irq_slow(irqfd);
+	if (events & EPOLLIN) {
+		struct mshv_guest_irq_ent girq_ent;
+		struct mshv_lapic_irq irq;
+
+		mshv_irqfd_snapshot(irqfd, &girq_ent, &irq);
+
+		if (girq_ent.girq_entry_valid)
+			mshv_assert_irq_slow(irqfd, &irq);
+	}
 
 	srcu_read_unlock(&pt->pt_irq_srcu, idx);
 	return 0;



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (6 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 07/18] mshv: Consolidate irqfd interrupt injection paths Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-13 12:14   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0 Stanislav Kinsburskii
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

In mshv_irqfd_assign(), the level-triggered validation for resample
irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
mshv_irqfd_update() has populated the field. Since the irqfd struct is
zero-allocated, level_triggered is always 0 at that point, causing the
check to always reject resample irqfds with -EINVAL. This makes
level-triggered interrupt resampling — used to avoid interrupt storms
with assigned devices — completely non-functional.

Move the check after the mshv_irqfd_update() call, which resolves the
IRQ routing entry and populates irqfd_lapic_irq with the actual trigger
mode.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index c24069dff9702..11a6006f80194 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -491,6 +491,19 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 	init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc);
 
 	spin_lock_irq(&pt->pt_irqfds_lock);
+	ret = 0;
+	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
+		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
+			continue;
+		/* This fd is used for another irq already. */
+		ret = -EBUSY;
+		spin_unlock_irq(&pt->pt_irqfds_lock);
+		goto fail;
+	}
+
+	idx = srcu_read_lock(&pt->pt_irq_srcu);
+	mshv_irqfd_update(pt, irqfd);
+
 #if IS_ENABLED(CONFIG_X86)
 	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
 	    !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
@@ -499,22 +512,12 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 		 * Otherwise return with failure
 		 */
 		spin_unlock_irq(&pt->pt_irqfds_lock);
+		srcu_read_unlock(&pt->pt_irq_srcu, idx);
 		ret = -EINVAL;
 		goto fail;
 	}
 #endif
-	ret = 0;
-	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
-		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
-			continue;
-		/* This fd is used for another irq already. */
-		ret = -EBUSY;
-		spin_unlock_irq(&pt->pt_irqfds_lock);
-		goto fail;
-	}
 
-	idx = srcu_read_lock(&pt->pt_irq_srcu);
-	mshv_irqfd_update(pt, irqfd);
 	hlist_add_head(&irqfd->irqfd_hnode, &pt->pt_irqfds_list);
 	spin_unlock_irq(&pt->pt_irqfds_lock);
 



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (7 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-13 11:36   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract Stanislav Kinsburskii
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

The duplicate routing entry check in mshv_update_routing_table() uses
guest_irq_num != 0 to detect whether a GSI slot is already occupied.
This fails for GSI 0 because its guest_irq_num is 0 both when the slot
is unused (zero-initialized) and when legitimately assigned. As a
result, duplicate entries for GSI 0 are silently accepted, with the
second entry overwriting the first — corrupting the routing table
without any error reported to userspace.

While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
in practice, the check is semantically wrong — it conflates
"uninitialized" with "GSI number 0." Use girq_entry_valid instead,
which is explicitly set to true when an entry is populated and remains
zero for unused slots regardless of the GSI number.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_irq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index 59584a132ca9f..db05512db5548 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -88,7 +88,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
 		/*
 		 * Allow only one to one mapping between GSI and MSI routing.
 		 */
-		if (girq->guest_irq_num != 0) {
+		if (girq->girq_entry_valid) {
 			r = -EINVAL;
 			goto out;
 		}



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (8 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0 Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-13 11:20   ` Anirudh Rayabharam
  2026-05-07 15:43 ` [PATCH v4 11/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc Stanislav Kinsburskii
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_portid_lookup() previously took rcu_read_lock() internally, ran
idr_find(), released the read lock, and copied the struct contents
into a caller-supplied buffer.  This had two problems.

1. The struct copy ran outside the read section, racing with
   mshv_portid_free() which does idr_remove + synchronize_rcu + kfree.
   A copy that started just before synchronize_rcu() observed the read
   section as already drained and was free to read freed memory while
   the writer was kfree()'ing the entry.

2. The only consumer, mshv_doorbell_isr(), then dispatched a callback
   using fields of the snapshot — entirely outside any RCU read
   section.  The callback's data argument and any field it touches
   were therefore safe only because mshv_isr() runs from
   sysvec_hyperv_callback, a non-threaded system vector that
   synchronize_rcu() implicitly waits for via the hardirq quiescent-
   state coupling.  That protection is real today but undocumented and
   fragile: a future move of mshv_isr() to a threaded context, or a
   future caller that registers a doorbell with a shorter-lived data
   pointer, would silently expose a use-after-free.

Make the contract explicit instead of implicit.  mshv_portid_lookup()
now returns a pointer to the table entry and requires the caller to
hold rcu_read_lock for the entire lifetime of that pointer.  The
contract is annotated with __must_hold(RCU) so sparse flags any
direct caller that forgets it.  The sole caller, mshv_doorbell_isr(),
takes rcu_read_lock around the whole drain loop, so the lookup, the
field reads, and the doorbell_cb dispatch all run inside one
read-side critical section.  synchronize_rcu() in mshv_portid_free()
now genuinely waits for any in-flight callback before kfree() runs,
without relying on hardirq context for correctness.

This also drops the by-value struct copy: entries are publish-once
(populated before idr_alloc) and free-once (after synchronize_rcu),
so a pointer dereferenced inside the read section gives a stable
view of the contents without copying.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_portid_table.c |   22 +++++++---------------
 drivers/hv/mshv_root.h         |    2 +-
 drivers/hv/mshv_synic.c        |   15 +++++++++------
 3 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index c349af1f0aaac..4cdf8e9575390 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -64,20 +64,12 @@ mshv_portid_free(int port_id)
 	kfree(info);
 }
 
-int
-mshv_portid_lookup(int port_id, struct port_table_info *info)
+/*
+ * Caller must hold rcu_read_lock for the entire lifetime of the
+ * returned pointer.  Returns NULL if @port_id is not in the table.
+ */
+struct port_table_info *mshv_portid_lookup(int port_id)
+	__must_hold(RCU)
 {
-	struct port_table_info *_info;
-	int ret = -ENOENT;
-
-	rcu_read_lock();
-	_info = idr_find(&port_table_idr, port_id);
-	rcu_read_unlock();
-
-	if (_info) {
-		*info = *_info;
-		ret = 0;
-	}
-
-	return ret;
+	return idr_find(&port_table_idr, port_id);
 }
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 2e6c4414740cc..b6961a6d9a98b 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -241,7 +241,7 @@ void mshv_irqfd_routing_update(struct mshv_partition *partition);
 
 void mshv_port_table_fini(void);
 int mshv_portid_alloc(struct port_table_info *info);
-int mshv_portid_lookup(int port_id, struct port_table_info *info);
+struct port_table_info *mshv_portid_lookup(int port_id) __must_hold(RCU);
 void mshv_portid_free(int port_id);
 
 int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 43f1bcbbf2d34..bac890cd2b468 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -114,24 +114,27 @@ mshv_doorbell_isr(struct hv_message *msg)
 	if (notification->sint_index != HV_SYNIC_DOORBELL_SINT_INDEX)
 		return false;
 
+	rcu_read_lock();
 	while ((port = synic_event_ring_get_queued_port(HV_SYNIC_DOORBELL_SINT_INDEX))) {
-		struct port_table_info ptinfo = { 0 };
+		struct port_table_info *ptinfo;
 
-		if (mshv_portid_lookup(port, &ptinfo)) {
+		ptinfo = mshv_portid_lookup(port);
+		if (!ptinfo) {
 			pr_debug("Failed to get port info from port_table!\n");
 			continue;
 		}
 
-		if (ptinfo.hv_port_type != HV_PORT_TYPE_DOORBELL) {
+		if (ptinfo->hv_port_type != HV_PORT_TYPE_DOORBELL) {
 			pr_debug("Not a doorbell port!, port: %d, port_type: %d\n",
-				 port, ptinfo.hv_port_type);
+				 port, ptinfo->hv_port_type);
 			continue;
 		}
 
 		/* Invoke the callback */
-		ptinfo.hv_port_doorbell.doorbell_cb(port,
-						 ptinfo.hv_port_doorbell.data);
+		ptinfo->hv_port_doorbell.doorbell_cb(port,
+						 ptinfo->hv_port_doorbell.data);
 	}
+	rcu_read_unlock();
 
 	return true;
 }



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 11/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (9 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract Stanislav Kinsburskii
@ 2026-05-07 15:43 ` Stanislav Kinsburskii
  2026-05-11  3:33   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free Stanislav Kinsburskii
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

idr_alloc() is called with GFP_KERNEL inside idr_lock(), which holds a
spinlock. GFP_KERNEL allows the allocator to sleep, triggering a
sleeping-while-atomic bug.

Fix by using idr_preload(GFP_KERNEL) before taking the lock to
pre-allocate memory in a sleepable context, then idr_alloc() with
GFP_NOWAIT inside the spinlock-protected section.

Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_portid_table.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index 4cdf8e9575390..d87a82e399e96 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -40,12 +40,14 @@ mshv_port_table_fini(void)
 int
 mshv_portid_alloc(struct port_table_info *info)
 {
-	int ret = 0;
+	int ret;
 
+	idr_preload(GFP_KERNEL);
 	idr_lock(&port_table_idr);
 	ret = idr_alloc(&port_table_idr, info, PORTID_MIN,
-			PORTID_MAX, GFP_KERNEL);
+			PORTID_MAX, GFP_NOWAIT);
 	idr_unlock(&port_table_idr);
+	idr_preload_end();
 
 	return ret;
 }



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (10 preceding siblings ...)
  2026-05-07 15:43 ` [PATCH v4 11/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-13 11:22   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR Stanislav Kinsburskii
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_portid_free() uses synchronize_rcu() followed by kfree() to
reclaim port table entries. This blocks the caller until a full RCU
grace period elapses, which is unnecessary since the same module already
uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().

Replace with kfree_rcu() to avoid the blocking wait and keep the
reclamation strategy consistent across the file.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_portid_table.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index d87a82e399e96..42d21b92b88fd 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
 	WARN_ON(!info);
 	idr_unlock(&port_table_idr);
 
-	synchronize_rcu();
-	kfree(info);
+	kfree_rcu(info, portbl_rcu);
 }
 
 /*



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (11 preceding siblings ...)
  2026-05-07 15:44 ` [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-13  5:32   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path Stanislav Kinsburskii
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_intercept_isr() reads vp_index from the SynIC intercept message
payload and uses it directly to index into partition->pt_vp_array without
validating that vp_index < MSHV_MAX_VPS.

Mshv treats the Microsoft Hypervisor as trusted, so a malformed vp_index is
not a security concern; the threat model does not include a malicious
hypervisor. A hypervisor bug that placed an out-of-range value here would,
however, cause an out-of-bounds read of pt_vp_array in hardirq context,
manifesting as random memory corruption or a host crash with no clear
signal pointing back to the hypervisor as the source.

handle_bitset_message() and handle_pair_message() already perform this
defensive check on hypervisor-supplied vp_index values, with an explicit
"This shouldn't happen, but just in case" comment  Add the same check to
mshv_intercept_isr() for consistency, turning a potential silent corruption
into a debuggable pr_debug message.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index bac890cd2b468..89207aad7cf1f 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -387,6 +387,11 @@ mshv_intercept_isr(struct hv_message *msg)
 	 */
 	vp_index =
 	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
+	/* This shouldn't happen, but just in case. */
+	if (unlikely(vp_index >= MSHV_MAX_VPS)) {
+		pr_debug("VP index %u out of bounds\n", vp_index);
+		goto unlock_out;
+	}
 	vp = partition->pt_vp_array[vp_index];
 	if (unlikely(!vp)) {
 		pr_debug("failed to find VP %u\n", vp_index);



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (12 preceding siblings ...)
  2026-05-07 15:44 ` [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-13  9:57   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period Stanislav Kinsburskii
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_partition_ioctl_create_vp() initialises a VP struct (allocations,
mutex_init, init_waitqueue_head, page mappings) and then publishes the
pointer into partition->pt_vp_array.  Several ISR paths read this array
locklessly: the intercept ISR, the two scheduler ISRs, and
mshv_try_assert_irq_fast() on the irqfd fast path.

Of these, only mshv_try_assert_irq_fast() can structurally race the
publish.  It runs from an eventfd waker without holding pt_mutex, and
MSHV_IRQFD does not require the target lapic_apic_id (== vp_index) to
refer to an existing VP at registration time.  A user can therefore
register an irqfd targeting a yet-to-be-created VP, then trigger
mshv_try_assert_irq_fast() concurrently with MSHV_CREATE_VP for the
same index.  On weakly-ordered architectures the reader can observe a
non-NULL pointer in pt_vp_array before the initialising stores to the
VP struct become visible, leading to use of partially-initialised
fields (e.g. vp_register_page).

The other ISR readers cannot reach this race: the hypervisor will not
generate intercept or scheduler messages for a VP that has never been
told to run, and the user can only call MSHV_RUN_VP on the VP fd
returned by MSHV_CREATE_VP, which by construction is returned after
the publish.  Leave those readers as plain loads.

Use smp_store_release() in mshv_partition_ioctl_create_vp() to publish
the pointer, and pair it with smp_load_acquire() in
mshv_try_assert_irq_fast().  On x86 these compile to plain accesses
under TSO; on ARM64 they emit one-instruction acquire/release barriers,
acceptable on this fast path.

The destroy-side path (destroy_partition() clearing pt_vp_array[i] to
NULL after kfree(vp)) has a separate ordering and lifetime concern
that is out of scope here.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c   |    9 ++++++++-
 drivers/hv/mshv_root_main.c |    8 +++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 11a6006f80194..5f0dd243e1445 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -197,7 +197,14 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
 	if (irq->lapic_apic_id >= MSHV_MAX_VPS)
 		return -EINVAL;
 
-	vp = partition->pt_vp_array[irq->lapic_apic_id];
+	/*
+	 * Pairs with smp_store_release() in mshv_partition_ioctl_create_vp().
+	 * MSHV_IRQFD does not require the target lapic_apic_id to refer to an
+	 * existing VP, so this read can race a concurrent VP creation; the
+	 * acquire ensures that a non-NULL pointer implies the VP's
+	 * initialising stores are visible.
+	 */
+	vp = smp_load_acquire(&partition->pt_vp_array[irq->lapic_apic_id]);
 	if (!vp)
 		return -EINVAL;
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 7e4252b6bc65c..381aa86c5b90e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1224,7 +1224,13 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 
 	/* already exclusive with the partition mutex for all ioctls */
 	partition->pt_vp_count++;
-	partition->pt_vp_array[args.vp_index] = vp;
+	/*
+	 * Pairs with smp_load_acquire() in mshv_try_assert_irq_fast(), which
+	 * can run concurrently from an irqfd waker without holding pt_mutex.
+	 * The release ensures the VP's initialising stores are visible to any
+	 * reader that observes a non-NULL pointer in pt_vp_array.
+	 */
+	smp_store_release(&partition->pt_vp_array[args.vp_index], vp);
 
 	goto out;
 



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (13 preceding siblings ...)
  2026-05-07 15:44 ` [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-13 10:11   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor Stanislav Kinsburskii
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

destroy_partition() frees mshv_vp with plain kfree() while ISR readers
walk pt_vp_array[] under rcu_read_lock().  On non-root schedulers,
where drain_all_vps() does not run, an in-flight intercept ISR can
observe a non-NULL pt_vp_array slot and dereference freed memory in
kick_vp().  On the root scheduler the same race exists in a narrower
form: drain_vp_signals() synchronises on kick_vp()'s kicked_by_hv flag
but not on its wake_up() tail, so the wait-queue lock embedded in vp
can still be held when destroy_partition() reaches kfree(vp).

Add struct rcu_head vp_rcu to struct mshv_vp, clear the pt_vp_array
slot before the free, and use kfree_rcu() so the actual kfree happens
after a grace period.  drain_all_vps() is retained because it serves a
separate purpose (telling the hypervisor to stop signalling and
reconciling signal counts) that kfree_rcu() does not address.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root.h      |    1 +
 drivers/hv/mshv_root_main.c |    5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index b6961a6d9a98b..e19a84ea07905 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -35,6 +35,7 @@ static_assert(HV_HYP_PAGE_SIZE == MSHV_HV_PAGE_SIZE);
 #define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
 
 struct mshv_vp {
+	struct rcu_head vp_rcu;
 	u32 vp_index;
 	struct mshv_partition *vp_partition;
 	struct mutex vp_mutex;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 381aa86c5b90e..e32f6e0f9f637 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -30,6 +30,7 @@
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
 #include <linux/rseq.h>
+#include <linux/rcupdate.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
@@ -1915,9 +1916,9 @@ static void destroy_partition(struct mshv_partition *partition)
 				vp->vp_ghcb_page = NULL;
 			}
 
-			kfree(vp);
-
 			partition->pt_vp_array[i] = NULL;
+
+			kfree_rcu(vp, vp_rcu);
 		}
 
 		mshv_debugfs_partition_remove(partition);



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (14 preceding siblings ...)
  2026-05-07 15:44 ` [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-13 11:12   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor Stanislav Kinsburskii
  2026-05-07 15:44 ` [PATCH v4 18/18] mshv: Fix missing error code on VP allocation failure Stanislav Kinsburskii
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

handle_pair_message() iterates up to msg->vp_count without verifying it
against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
from untrusted hypervisor data, a malformed message with a large value
would cause out-of-bounds reads from the partition_ids and vp_indexes
arrays.

handle_bitset_message() iterates over set bits in valid_bank_mask (up to
64) and advances bank_contents for each one. However, the payload buffer
only has space for 16 bank entries. A valid_bank_mask with more than 16
bits set causes bank_contents to read beyond the message buffer.

Fix both by adding bounds validation:
- Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
- Track banks consumed and stop before exceeding buffer capacity

Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 89207aad7cf1f..5d509299f14d7 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
 static void
 handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
 {
-	int bank_idx, vps_signaled = 0, bank_mask_size;
+	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
+	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
+			      sizeof(u64) - 2; /* subtract format + mask */
 	struct mshv_partition *partition;
 	const struct hv_vpset *vpset;
 	const u64 *bank_contents;
@@ -230,6 +232,11 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
 		if (bank_idx == bank_mask_size)
 			break;
 
+		if (unlikely(banks_used >= max_banks)) {
+			pr_debug("valid_bank_mask exceeds buffer capacity\n");
+			goto unlock_out;
+		}
+
 		while (true) {
 			struct mshv_vp *vp;
 
@@ -258,6 +265,7 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
 		}
 
 		bank_contents++;
+		banks_used++;
 	}
 
 unlock_out:
@@ -274,10 +282,18 @@ handle_pair_message(const struct hv_vp_signal_pair_scheduler_message *msg)
 	struct mshv_partition *partition = NULL;
 	struct mshv_vp *vp;
 	int idx;
+	u8 vp_count = msg->vp_count;
+
+	if (unlikely(vp_count > HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT)) {
+		pr_debug("pair message vp_count %u exceeds max %lu\n",
+			 vp_count,
+			 (unsigned long)HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT);
+		return;
+	}
 
 	rcu_read_lock();
 
-	for (idx = 0; idx < msg->vp_count; idx++) {
+	for (idx = 0; idx < vp_count; idx++) {
 		u64 partition_id = msg->partition_ids[idx];
 		u32 vp_index = msg->vp_indexes[idx];
 



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (15 preceding siblings ...)
  2026-05-07 15:44 ` [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-11 14:26   ` Anirudh Rayabharam
  2026-05-07 15:44 ` [PATCH v4 18/18] mshv: Fix missing error code on VP allocation failure Stanislav Kinsburskii
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

mshv_partition_ioctl_create_vp() called anon_inode_getfd() before
publishing the new VP into partition->pt_vp_array.  anon_inode_getfd()
includes fd_install(), so the fd was live in current->files before the
publish ran.

A concurrent MSHV_RUN_VP ioctl on that fd does not serialise against the
in-progress MSHV_CREATE_VP — it takes vp->vp_mutex, not the partition
mutex.  Once the VP starts running and traps, mshv_intercept_isr() can look
up partition->pt_vp_array[vp_index] and observe NULL, silently dropping the
intercept message.

Split the fd creation: reserve an fd with get_unused_fd_flags(), create the
file with anon_inode_getfile(), publish the VP via smp_store_release(), and
finally call fd_install() as the userspace-visibility commit point.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e32f6e0f9f637..1c18d1c1f7947 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1142,6 +1142,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	struct mshv_vp *vp;
 	struct page *intercept_msg_page, *register_page, *ghcb_page;
 	struct hv_stats_page *stats_pages[2];
+	struct file *file;
+	int fd;
 	long ret;
 
 	if (copy_from_user(&args, arg, sizeof(args)))
@@ -1214,14 +1216,18 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	if (ret)
 		goto put_partition;
 
-	/*
-	 * Keep anon_inode_getfd last: it installs fd in the file struct and
-	 * thus makes the state accessible in user space.
-	 */
-	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
-			       O_RDWR | O_CLOEXEC);
-	if (ret < 0)
+	fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
 		goto remove_debugfs_vp;
+	}
+
+	file = anon_inode_getfile("mshv_vp", &mshv_vp_fops, vp,
+				  O_RDWR | O_CLOEXEC);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto put_unused_vp_fd;
+	}
 
 	/* already exclusive with the partition mutex for all ioctls */
 	partition->pt_vp_count++;
@@ -1233,8 +1239,17 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	 */
 	smp_store_release(&partition->pt_vp_array[args.vp_index], vp);
 
+	/*
+	 * fd_install() is the userspace-visibility commit point.  Must be the
+	 * last operation that can fail or be observed.
+	 */
+	fd_install(fd, file);
+	ret = fd;
+
 	goto out;
 
+put_unused_vp_fd:
+	put_unused_fd(fd);
 remove_debugfs_vp:
 	mshv_debugfs_vp_remove(vp);
 put_partition:



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 18/18] mshv: Fix missing error code on VP allocation failure
  2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
                   ` (16 preceding siblings ...)
  2026-05-07 15:44 ` [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor Stanislav Kinsburskii
@ 2026-05-07 15:44 ` Stanislav Kinsburskii
  2026-05-11  3:35   ` Anirudh Rayabharam
  17 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-07 15:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

In mshv_partition_ioctl_create_vp(), when kzalloc for the VP struct
fails, the code jumps to the cleanup path without setting ret. At that
point ret is 0 from the preceding successful mshv_vp_stats_map() call,
so the function returns success to userspace despite having failed to
create the VP. No fd is installed and no VP is registered in pt_vp_array,
but userspace has no way to know the operation failed.

Set ret to -ENOMEM before jumping to the cleanup path.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1c18d1c1f7947..03c65ff6a7397 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1189,8 +1189,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 		goto unmap_ghcb_page;
 
 	vp = kzalloc_obj(*vp);
-	if (!vp)
+	if (!vp) {
+		ret = -ENOMEM;
 		goto unmap_stats_pages;
+	}
 
 	vp->vp_partition = mshv_partition_get(partition);
 	if (!vp->vp_partition) {



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast
  2026-05-07 15:43 ` [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast Stanislav Kinsburskii
@ 2026-05-11  3:24   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11  3:24 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:21PM +0000, Stanislav Kinsburskii wrote:
> mshv_try_assert_irq_fast() dereferences the vp pointer obtained from
> pt_vp_array[lapic_apic_id] without checking for NULL or validating that
> lapic_apic_id is within bounds. A spurious interrupt from the hypervisor
> targeting a non-existent VP (or one not yet created) causes a NULL
> pointer dereference and crashes the host.
> 
> Add a bounds check on lapic_apic_id against MSHV_MAX_VPS and a NULL
> check on the vp pointer before dereferencing.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 5995a62aff8d8..b398e58411dd7 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -169,7 +169,12 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
>  		return -EOPNOTSUPP;
>  #endif
>  
> +	if (irq->lapic_apic_id >= MSHV_MAX_VPS)
> +		return -EINVAL;
> +
>  	vp = partition->pt_vp_array[irq->lapic_apic_id];
> +	if (!vp)
> +		return -EINVAL;
>  
>  	if (!vp->vp_register_page)
>  		return -EOPNOTSUPP;
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 11/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc
  2026-05-07 15:43 ` [PATCH v4 11/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc Stanislav Kinsburskii
@ 2026-05-11  3:33   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11  3:33 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:59PM +0000, Stanislav Kinsburskii wrote:
> idr_alloc() is called with GFP_KERNEL inside idr_lock(), which holds a
> spinlock. GFP_KERNEL allows the allocator to sleep, triggering a
> sleeping-while-atomic bug.
> 
> Fix by using idr_preload(GFP_KERNEL) before taking the lock to
> pre-allocate memory in a sleepable context, then idr_alloc() with
> GFP_NOWAIT inside the spinlock-protected section.
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
> index 4cdf8e9575390..d87a82e399e96 100644
> --- a/drivers/hv/mshv_portid_table.c
> +++ b/drivers/hv/mshv_portid_table.c
> @@ -40,12 +40,14 @@ mshv_port_table_fini(void)
>  int
>  mshv_portid_alloc(struct port_table_info *info)
>  {
> -	int ret = 0;
> +	int ret;
>  
> +	idr_preload(GFP_KERNEL);
>  	idr_lock(&port_table_idr);
>  	ret = idr_alloc(&port_table_idr, info, PORTID_MIN,
> -			PORTID_MAX, GFP_KERNEL);
> +			PORTID_MAX, GFP_NOWAIT);
>  	idr_unlock(&port_table_idr);
> +	idr_preload_end();
>  
>  	return ret;
>  }
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 18/18] mshv: Fix missing error code on VP allocation failure
  2026-05-07 15:44 ` [PATCH v4 18/18] mshv: Fix missing error code on VP allocation failure Stanislav Kinsburskii
@ 2026-05-11  3:35   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11  3:35 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:37PM +0000, Stanislav Kinsburskii wrote:
> In mshv_partition_ioctl_create_vp(), when kzalloc for the VP struct
> fails, the code jumps to the cleanup path without setting ret. At that
> point ret is 0 from the preceding successful mshv_vp_stats_map() call,
> so the function returns success to userspace despite having failed to
> create the VP. No fd is installed and no VP is registered in pt_vp_array,
> but userspace has no way to know the operation failed.
> 
> Set ret to -ENOMEM before jumping to the cleanup path.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 1c18d1c1f7947..03c65ff6a7397 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1189,8 +1189,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
>  		goto unmap_ghcb_page;
>  
>  	vp = kzalloc_obj(*vp);
> -	if (!vp)
> +	if (!vp) {
> +		ret = -ENOMEM;
>  		goto unmap_stats_pages;
> +	}
>  
>  	vp->vp_partition = mshv_partition_get(partition);
>  	if (!vp->vp_partition) {
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
  2026-05-07 15:43 ` [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access Stanislav Kinsburskii
@ 2026-05-11  3:46   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11  3:46 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:04PM +0000, Stanislav Kinsburskii wrote:
> The bounds check inside the PFN-filling loop can return -EINVAL while
> interrupts are disabled via local_irq_save(), leaking IRQ state.
> 
> Remove the check — it is redundant because the loop invariant
> (done + i < page_count == page_struct_count >> large_shift) guarantees
> (done + i) << large_shift < page_struct_count always holds.
> 
> While here, fix type mismatches: change 'int done' to 'u64 done' and
> use u64 for loop and batch-size variables so they match the u64
> page_count they are compared against.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
  2026-05-07 15:43 ` [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions Stanislav Kinsburskii
@ 2026-05-11 13:48   ` Anirudh Rayabharam
  2026-05-11 15:06     ` Stanislav Kinsburskii
  2026-05-11 15:12     ` Stanislav Kinsburskii
  0 siblings, 2 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11 13:48 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> fails on an unencrypted partition. The condition on the error path:
> 
>     if (ret && mshv_partition_encrypted(partition))
> 
> only handles map failures for encrypted partitions — if the partition is
> not encrypted and the map fails, execution falls through to 'return 0',
> silently ignoring the error.
> 
> Additionally, calling mshv_region_invalidate() inline on map failure
> zeroes the mreg_pages array before the caller's cleanup path
> (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> mappings for partially-mapped pages.
> 
> Fix by returning immediately on success and falling through to error
> return on failure. For unencrypted partitions, the caller's
> mshv_region_destroy() handles unmap followed by invalidate in the
> correct order. For encrypted partitions where re-sharing fails, zero
> the page array without unpinning — the pages are inaccessible to the
> host and must not be unpinned, but zeroing prevents
> mshv_region_destroy() from attempting to unpin them.

mshv_region_destroy() calls mshv_region_invalidate() which calls
mshv_region_invalidate_pages() which just does:

	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
						 u64 page_offset, u64 page_count)
	{
		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
			unpin_user_pages(region->mreg_pages + page_offset, page_count);

		memset(region->mreg_pages + page_offset, 0,
		       page_count * sizeof(struct page *));
	}

It doesn't checked for zeroed pages before unpinning.

> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 665d565899c15..7e4252b6bc65c 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
>  			pt_err(partition,
>  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
>  			       region->start_gfn, ret);
> -			goto invalidate_region;
> +			goto err_out;
>  		}
>  	}
>  
>  	ret = mshv_region_map(region);
> -	if (ret && mshv_partition_encrypted(partition)) {
> +	if (ret)
> +		goto share_region;
> +
> +	return 0;
> +
> +share_region:
> +	if (mshv_partition_encrypted(partition)) {
>  		int shrc;
>  
>  		shrc = mshv_region_share(region);
>  		if (!shrc)
> -			goto invalidate_region;
> +			goto err_out;
>  
>  		pt_err(partition,
>  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
>  		       region->start_gfn, shrc);
>  		/*
> -		 * Don't unpin if marking shared failed because pages are no
> -		 * longer mapped in the host, ie root, anymore.
> +		 * Re-sharing failed — the pages remain inaccessible to the
> +		 * host.  Zero the page array so that mshv_region_destroy()
> +		 * won't attempt to unpin them (leaking the page references
> +		 * is intentional; unpinning host-inaccessible pages would be
> +		 * unsafe).
>  		 */

Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
can remove it from here altogether. Either way, should this zeroing
logic be added there too so as not to crash the host by unpinning pages
that weren't marked shared?

From mshv_region_destroy():

	if (mshv_partition_encrypted(partition)) {
		ret = mshv_region_share(region);
		if (ret) {
			pt_err(partition,
			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
			       ret);
			return;
		}
	}

Thanks,
Anirudh.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 03/18] mshv: Fix race in mshv_irqfd_deassign
  2026-05-07 15:43 ` [PATCH v4 03/18] mshv: Fix race in mshv_irqfd_deassign Stanislav Kinsburskii
@ 2026-05-11 13:57   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11 13:57 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:15PM +0000, Stanislav Kinsburskii wrote:
> mshv_irqfd_deactivate() and the hlist traversal of pt_irqfds_list
> require pt->pt_irqfds_lock to be held, but mshv_irqfd_deassign()
> omits it. This races with the EPOLLHUP path in mshv_irqfd_wakeup(),
> which does take the lock before calling mshv_irqfd_deactivate().
> 
> Additionally, mshv_irqfd_deactivate() uses hlist_del() which poisons
> the node pointers rather than resetting them. Since
> mshv_irqfd_is_active() relies on hlist_unhashed() (checks pprev ==
> NULL), a poisoned node still appears active. If a concurrent path calls
> mshv_irqfd_deactivate() again on the same irqfd, the guard fails to
> prevent a double hlist_del() on poisoned pointers.
> 
> Fix both issues:
> - Add the missing spin_lock_irq/spin_unlock_irq around the list
>   traversal in mshv_irqfd_deassign(), matching mshv_irqfd_release().
> - Use hlist_del_init() instead of hlist_del() so the node is properly
>   marked as unhashed after removal, making the is_active guard reliable.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor
  2026-05-07 15:44 ` [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor Stanislav Kinsburskii
@ 2026-05-11 14:26   ` Anirudh Rayabharam
  2026-05-11 15:29     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-11 14:26 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:32PM +0000, Stanislav Kinsburskii wrote:
> mshv_partition_ioctl_create_vp() called anon_inode_getfd() before
> publishing the new VP into partition->pt_vp_array.  anon_inode_getfd()
> includes fd_install(), so the fd was live in current->files before the
> publish ran.
> 
> A concurrent MSHV_RUN_VP ioctl on that fd does not serialise against the
> in-progress MSHV_CREATE_VP — it takes vp->vp_mutex, not the partition
> mutex.  Once the VP starts running and traps, mshv_intercept_isr() can look
> up partition->pt_vp_array[vp_index] and observe NULL, silently dropping the
> intercept message.
> 
> Split the fd creation: reserve an fd with get_unused_fd_flags(), create the
> file with anon_inode_getfile(), publish the VP via smp_store_release(), and
> finally call fd_install() as the userspace-visibility commit point.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |   29 ++++++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index e32f6e0f9f637..1c18d1c1f7947 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1142,6 +1142,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
>  	struct mshv_vp *vp;
>  	struct page *intercept_msg_page, *register_page, *ghcb_page;
>  	struct hv_stats_page *stats_pages[2];
> +	struct file *file;
> +	int fd;
>  	long ret;
>  
>  	if (copy_from_user(&args, arg, sizeof(args)))
> @@ -1214,14 +1216,18 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
>  	if (ret)
>  		goto put_partition;
>  
> -	/*
> -	 * Keep anon_inode_getfd last: it installs fd in the file struct and
> -	 * thus makes the state accessible in user space.
> -	 */
> -	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
> -			       O_RDWR | O_CLOEXEC);

Why not just move this anon_inode_getfd() after the smp_store_release()
call?

Thanks,
Anirudh.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
  2026-05-11 13:48   ` Anirudh Rayabharam
@ 2026-05-11 15:06     ` Stanislav Kinsburskii
  2026-05-13 11:15       ` Anirudh Rayabharam
  2026-05-11 15:12     ` Stanislav Kinsburskii
  1 sibling, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-11 15:06 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > fails on an unencrypted partition. The condition on the error path:
> > 
> >     if (ret && mshv_partition_encrypted(partition))
> > 
> > only handles map failures for encrypted partitions — if the partition is
> > not encrypted and the map fails, execution falls through to 'return 0',
> > silently ignoring the error.
> > 
> > Additionally, calling mshv_region_invalidate() inline on map failure
> > zeroes the mreg_pages array before the caller's cleanup path
> > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > mappings for partially-mapped pages.
> > 
> > Fix by returning immediately on success and falling through to error
> > return on failure. For unencrypted partitions, the caller's
> > mshv_region_destroy() handles unmap followed by invalidate in the
> > correct order. For encrypted partitions where re-sharing fails, zero
> > the page array without unpinning — the pages are inaccessible to the
> > host and must not be unpinned, but zeroing prevents
> > mshv_region_destroy() from attempting to unpin them.
> 
> mshv_region_destroy() calls mshv_region_invalidate() which calls
> mshv_region_invalidate_pages() which just does:
> 
> 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> 						 u64 page_offset, u64 page_count)
> 	{
> 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> 
> 		memset(region->mreg_pages + page_offset, 0,
> 		       page_count * sizeof(struct page *));
> 	}
> 
> It doesn't checked for zeroed pages before unpinning.
> 
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> >  1 file changed, 16 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 665d565899c15..7e4252b6bc65c 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> >  			pt_err(partition,
> >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> >  			       region->start_gfn, ret);
> > -			goto invalidate_region;
> > +			goto err_out;
> >  		}
> >  	}
> >  
> >  	ret = mshv_region_map(region);
> > -	if (ret && mshv_partition_encrypted(partition)) {
> > +	if (ret)
> > +		goto share_region;
> > +
> > +	return 0;
> > +
> > +share_region:
> > +	if (mshv_partition_encrypted(partition)) {
> >  		int shrc;
> >  
> >  		shrc = mshv_region_share(region);
> >  		if (!shrc)
> > -			goto invalidate_region;
> > +			goto err_out;
> >  
> >  		pt_err(partition,
> >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> >  		       region->start_gfn, shrc);
> >  		/*
> > -		 * Don't unpin if marking shared failed because pages are no
> > -		 * longer mapped in the host, ie root, anymore.
> > +		 * Re-sharing failed — the pages remain inaccessible to the
> > +		 * host.  Zero the page array so that mshv_region_destroy()
> > +		 * won't attempt to unpin them (leaking the page references
> > +		 * is intentional; unpinning host-inaccessible pages would be
> > +		 * unsafe).
> >  		 */
> 
> Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> can remove it from here altogether. Either way, should this zeroing
> logic be added there too so as not to crash the host by unpinning pages
> that weren't marked shared?
> 

Indeed.
The issue with all this code is that we are leaking state in
mshv_region_pin if we wimply return from it: we don't know if the pages
are pinned or unshared (or mapped) in the destruction callback.
We should either introduce a state variable to track this or just don't
call the generic mshv_region_put on case of region creation error.
The latter approch make mshv_map_user_memory idempotent and thus looks
like a better way forward.
What do you think?

Thanks,
Stanislav

> >From mshv_region_destroy():
> 
> 	if (mshv_partition_encrypted(partition)) {
> 		ret = mshv_region_share(region);
> 		if (ret) {
> 			pt_err(partition,
> 			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
> 			       ret);
> 			return;
> 		}
> 	}
> 
> Thanks,
> Anirudh.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
  2026-05-11 13:48   ` Anirudh Rayabharam
  2026-05-11 15:06     ` Stanislav Kinsburskii
@ 2026-05-11 15:12     ` Stanislav Kinsburskii
  1 sibling, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-11 15:12 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > fails on an unencrypted partition. The condition on the error path:
> > 
> >     if (ret && mshv_partition_encrypted(partition))
> > 
> > only handles map failures for encrypted partitions — if the partition is
> > not encrypted and the map fails, execution falls through to 'return 0',
> > silently ignoring the error.
> > 
> > Additionally, calling mshv_region_invalidate() inline on map failure
> > zeroes the mreg_pages array before the caller's cleanup path
> > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > mappings for partially-mapped pages.
> > 
> > Fix by returning immediately on success and falling through to error
> > return on failure. For unencrypted partitions, the caller's
> > mshv_region_destroy() handles unmap followed by invalidate in the
> > correct order. For encrypted partitions where re-sharing fails, zero
> > the page array without unpinning — the pages are inaccessible to the
> > host and must not be unpinned, but zeroing prevents
> > mshv_region_destroy() from attempting to unpin them.
> 
> mshv_region_destroy() calls mshv_region_invalidate() which calls
> mshv_region_invalidate_pages() which just does:
> 
> 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> 						 u64 page_offset, u64 page_count)
> 	{
> 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> 
> 		memset(region->mreg_pages + page_offset, 0,
> 		       page_count * sizeof(struct page *));
> 	}
> 
> It doesn't checked for zeroed pages before unpinning.
> 

unpin_user_pages skips NULL pages.

Thanks,
Stanislav

> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> >  1 file changed, 16 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 665d565899c15..7e4252b6bc65c 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> >  			pt_err(partition,
> >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> >  			       region->start_gfn, ret);
> > -			goto invalidate_region;
> > +			goto err_out;
> >  		}
> >  	}
> >  
> >  	ret = mshv_region_map(region);
> > -	if (ret && mshv_partition_encrypted(partition)) {
> > +	if (ret)
> > +		goto share_region;
> > +
> > +	return 0;
> > +
> > +share_region:
> > +	if (mshv_partition_encrypted(partition)) {
> >  		int shrc;
> >  
> >  		shrc = mshv_region_share(region);
> >  		if (!shrc)
> > -			goto invalidate_region;
> > +			goto err_out;
> >  
> >  		pt_err(partition,
> >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> >  		       region->start_gfn, shrc);
> >  		/*
> > -		 * Don't unpin if marking shared failed because pages are no
> > -		 * longer mapped in the host, ie root, anymore.
> > +		 * Re-sharing failed — the pages remain inaccessible to the
> > +		 * host.  Zero the page array so that mshv_region_destroy()
> > +		 * won't attempt to unpin them (leaking the page references
> > +		 * is intentional; unpinning host-inaccessible pages would be
> > +		 * unsafe).
> >  		 */
> 
> Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> can remove it from here altogether. Either way, should this zeroing
> logic be added there too so as not to crash the host by unpinning pages
> that weren't marked shared?
> 
> >From mshv_region_destroy():
> 
> 	if (mshv_partition_encrypted(partition)) {
> 		ret = mshv_region_share(region);
> 		if (ret) {
> 			pt_err(partition,
> 			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
> 			       ret);
> 			return;
> 		}
> 	}
> 
> Thanks,
> Anirudh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor
  2026-05-11 14:26   ` Anirudh Rayabharam
@ 2026-05-11 15:29     ` Stanislav Kinsburskii
  2026-05-12 12:46       ` Anirudh Rayabharam
  0 siblings, 1 reply; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-11 15:29 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Mon, May 11, 2026 at 02:26:45PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:44:32PM +0000, Stanislav Kinsburskii wrote:
> > mshv_partition_ioctl_create_vp() called anon_inode_getfd() before
> > publishing the new VP into partition->pt_vp_array.  anon_inode_getfd()
> > includes fd_install(), so the fd was live in current->files before the
> > publish ran.
> > 
> > A concurrent MSHV_RUN_VP ioctl on that fd does not serialise against the
> > in-progress MSHV_CREATE_VP — it takes vp->vp_mutex, not the partition
> > mutex.  Once the VP starts running and traps, mshv_intercept_isr() can look
> > up partition->pt_vp_array[vp_index] and observe NULL, silently dropping the
> > intercept message.
> > 
> > Split the fd creation: reserve an fd with get_unused_fd_flags(), create the
> > file with anon_inode_getfile(), publish the VP via smp_store_release(), and
> > finally call fd_install() as the userspace-visibility commit point.
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |   29 ++++++++++++++++++++++-------
> >  1 file changed, 22 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index e32f6e0f9f637..1c18d1c1f7947 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1142,6 +1142,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> >  	struct mshv_vp *vp;
> >  	struct page *intercept_msg_page, *register_page, *ghcb_page;
> >  	struct hv_stats_page *stats_pages[2];
> > +	struct file *file;
> > +	int fd;
> >  	long ret;
> >  
> >  	if (copy_from_user(&args, arg, sizeof(args)))
> > @@ -1214,14 +1216,18 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> >  	if (ret)
> >  		goto put_partition;
> >  
> > -	/*
> > -	 * Keep anon_inode_getfd last: it installs fd in the file struct and
> > -	 * thus makes the state accessible in user space.
> > -	 */
> > -	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
> > -			       O_RDWR | O_CLOEXEC);
> 
> Why not just move this anon_inode_getfd() after the smp_store_release()
> call?
> 

Because anon_inode_getfd() can still fail at the fd-table step after the
VP is already visible to lockless ISR readers, and unpublishing it would
require a grace-period wait under pt_mutex while ISRs may have already
observed and acted on the doomed VP. The split form keeps every failable
step before the publish, so failure paths stay simple and the publish
itself cannot fail.

Thanks,
Stanislav


> Thanks,
> Anirudh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor
  2026-05-11 15:29     ` Stanislav Kinsburskii
@ 2026-05-12 12:46       ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-12 12:46 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Mon, May 11, 2026 at 08:29:50AM -0700, Stanislav Kinsburskii wrote:
> On Mon, May 11, 2026 at 02:26:45PM +0000, Anirudh Rayabharam wrote:
> > On Thu, May 07, 2026 at 03:44:32PM +0000, Stanislav Kinsburskii wrote:
> > > mshv_partition_ioctl_create_vp() called anon_inode_getfd() before
> > > publishing the new VP into partition->pt_vp_array.  anon_inode_getfd()
> > > includes fd_install(), so the fd was live in current->files before the
> > > publish ran.
> > > 
> > > A concurrent MSHV_RUN_VP ioctl on that fd does not serialise against the
> > > in-progress MSHV_CREATE_VP — it takes vp->vp_mutex, not the partition
> > > mutex.  Once the VP starts running and traps, mshv_intercept_isr() can look
> > > up partition->pt_vp_array[vp_index] and observe NULL, silently dropping the
> > > intercept message.
> > > 
> > > Split the fd creation: reserve an fd with get_unused_fd_flags(), create the
> > > file with anon_inode_getfile(), publish the VP via smp_store_release(), and
> > > finally call fd_install() as the userspace-visibility commit point.
> > > 
> > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   29 ++++++++++++++++++++++-------
> > >  1 file changed, 22 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index e32f6e0f9f637..1c18d1c1f7947 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1142,6 +1142,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> > >  	struct mshv_vp *vp;
> > >  	struct page *intercept_msg_page, *register_page, *ghcb_page;
> > >  	struct hv_stats_page *stats_pages[2];
> > > +	struct file *file;
> > > +	int fd;
> > >  	long ret;
> > >  
> > >  	if (copy_from_user(&args, arg, sizeof(args)))
> > > @@ -1214,14 +1216,18 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> > >  	if (ret)
> > >  		goto put_partition;
> > >  
> > > -	/*
> > > -	 * Keep anon_inode_getfd last: it installs fd in the file struct and
> > > -	 * thus makes the state accessible in user space.
> > > -	 */
> > > -	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
> > > -			       O_RDWR | O_CLOEXEC);
> > 
> > Why not just move this anon_inode_getfd() after the smp_store_release()
> > call?
> > 
> 
> Because anon_inode_getfd() can still fail at the fd-table step after the
> VP is already visible to lockless ISR readers, and unpublishing it would
> require a grace-period wait under pt_mutex while ISRs may have already
> observed and acted on the doomed VP. The split form keeps every failable
> step before the publish, so failure paths stay simple and the publish
> itself cannot fail.

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR
  2026-05-07 15:44 ` [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR Stanislav Kinsburskii
@ 2026-05-13  5:32   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13  5:32 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:10PM +0000, Stanislav Kinsburskii wrote:
> mshv_intercept_isr() reads vp_index from the SynIC intercept message
> payload and uses it directly to index into partition->pt_vp_array without
> validating that vp_index < MSHV_MAX_VPS.
> 
> Mshv treats the Microsoft Hypervisor as trusted, so a malformed vp_index is
> not a security concern; the threat model does not include a malicious
> hypervisor. A hypervisor bug that placed an out-of-range value here would,
> however, cause an out-of-bounds read of pt_vp_array in hardirq context,
> manifesting as random memory corruption or a host crash with no clear
> signal pointing back to the hypervisor as the source.
> 
> handle_bitset_message() and handle_pair_message() already perform this
> defensive check on hypervisor-supplied vp_index values, with an explicit
> "This shouldn't happen, but just in case" comment  Add the same check to
> mshv_intercept_isr() for consistency, turning a potential silent corruption
> into a debuggable pr_debug message.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index bac890cd2b468..89207aad7cf1f 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -387,6 +387,11 @@ mshv_intercept_isr(struct hv_message *msg)
>  	 */
>  	vp_index =
>  	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> +	/* This shouldn't happen, but just in case. */
> +	if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> +		pr_debug("VP index %u out of bounds\n", vp_index);
> +		goto unlock_out;
> +	}
>  	vp = partition->pt_vp_array[vp_index];
>  	if (unlikely(!vp)) {
>  		pr_debug("failed to find VP %u\n", vp_index);
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path
  2026-05-07 15:44 ` [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path Stanislav Kinsburskii
@ 2026-05-13  9:57   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13  9:57 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:16PM +0000, Stanislav Kinsburskii wrote:
> mshv_partition_ioctl_create_vp() initialises a VP struct (allocations,
> mutex_init, init_waitqueue_head, page mappings) and then publishes the
> pointer into partition->pt_vp_array.  Several ISR paths read this array
> locklessly: the intercept ISR, the two scheduler ISRs, and
> mshv_try_assert_irq_fast() on the irqfd fast path.
> 
> Of these, only mshv_try_assert_irq_fast() can structurally race the
> publish.  It runs from an eventfd waker without holding pt_mutex, and
> MSHV_IRQFD does not require the target lapic_apic_id (== vp_index) to
> refer to an existing VP at registration time.  A user can therefore
> register an irqfd targeting a yet-to-be-created VP, then trigger
> mshv_try_assert_irq_fast() concurrently with MSHV_CREATE_VP for the
> same index.  On weakly-ordered architectures the reader can observe a
> non-NULL pointer in pt_vp_array before the initialising stores to the
> VP struct become visible, leading to use of partially-initialised
> fields (e.g. vp_register_page).
> 
> The other ISR readers cannot reach this race: the hypervisor will not
> generate intercept or scheduler messages for a VP that has never been
> told to run, and the user can only call MSHV_RUN_VP on the VP fd
> returned by MSHV_CREATE_VP, which by construction is returned after
> the publish.  Leave those readers as plain loads.
> 
> Use smp_store_release() in mshv_partition_ioctl_create_vp() to publish
> the pointer, and pair it with smp_load_acquire() in
> mshv_try_assert_irq_fast().  On x86 these compile to plain accesses
> under TSO; on ARM64 they emit one-instruction acquire/release barriers,
> acceptable on this fast path.
> 
> The destroy-side path (destroy_partition() clearing pt_vp_array[i] to
> NULL after kfree(vp)) has a separate ordering and lifetime concern
> that is out of scope here.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c   |    9 ++++++++-
>  drivers/hv/mshv_root_main.c |    8 +++++++-
>  2 files changed, 15 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period
  2026-05-07 15:44 ` [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period Stanislav Kinsburskii
@ 2026-05-13 10:11   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 10:11 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:21PM +0000, Stanislav Kinsburskii wrote:
> destroy_partition() frees mshv_vp with plain kfree() while ISR readers
> walk pt_vp_array[] under rcu_read_lock().  On non-root schedulers,
> where drain_all_vps() does not run, an in-flight intercept ISR can
> observe a non-NULL pt_vp_array slot and dereference freed memory in
> kick_vp().  On the root scheduler the same race exists in a narrower
> form: drain_vp_signals() synchronises on kick_vp()'s kicked_by_hv flag
> but not on its wake_up() tail, so the wait-queue lock embedded in vp
> can still be held when destroy_partition() reaches kfree(vp).
> 
> Add struct rcu_head vp_rcu to struct mshv_vp, clear the pt_vp_array
> slot before the free, and use kfree_rcu() so the actual kfree happens
> after a grace period.  drain_all_vps() is retained because it serves a
> separate purpose (telling the hypervisor to stop signalling and
> reconciling signal counts) that kfree_rcu() does not address.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h      |    1 +
>  drivers/hv/mshv_root_main.c |    5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
  2026-05-07 15:44 ` [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor Stanislav Kinsburskii
@ 2026-05-13 11:12   ` Anirudh Rayabharam
  2026-05-13 17:39     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 11:12 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> handle_pair_message() iterates up to msg->vp_count without verifying it
> against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> from untrusted hypervisor data, a malformed message with a large value
> would cause out-of-bounds reads from the partition_ids and vp_indexes
> arrays.
> 
> handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> 64) and advances bank_contents for each one. However, the payload buffer
> only has space for 16 bank entries. A valid_bank_mask with more than 16
> bits set causes bank_contents to read beyond the message buffer.
> 
> Fix both by adding bounds validation:
> - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> - Track banks consumed and stop before exceeding buffer capacity
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 89207aad7cf1f..5d509299f14d7 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
>  static void
>  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>  {
> -	int bank_idx, vps_signaled = 0, bank_mask_size;
> +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> +			      sizeof(u64) - 2; /* subtract format + mask */

Could this be a constant in the header?

Thanks,
Anirudh.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
  2026-05-11 15:06     ` Stanislav Kinsburskii
@ 2026-05-13 11:15       ` Anirudh Rayabharam
  2026-05-13 17:31         ` Stanislav Kinsburskii
  0 siblings, 1 reply; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 11:15 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > fails on an unencrypted partition. The condition on the error path:
> > > 
> > >     if (ret && mshv_partition_encrypted(partition))
> > > 
> > > only handles map failures for encrypted partitions — if the partition is
> > > not encrypted and the map fails, execution falls through to 'return 0',
> > > silently ignoring the error.
> > > 
> > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > zeroes the mreg_pages array before the caller's cleanup path
> > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > mappings for partially-mapped pages.
> > > 
> > > Fix by returning immediately on success and falling through to error
> > > return on failure. For unencrypted partitions, the caller's
> > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > correct order. For encrypted partitions where re-sharing fails, zero
> > > the page array without unpinning — the pages are inaccessible to the
> > > host and must not be unpinned, but zeroing prevents
> > > mshv_region_destroy() from attempting to unpin them.
> > 
> > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > mshv_region_invalidate_pages() which just does:
> > 
> > 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > 						 u64 page_offset, u64 page_count)
> > 	{
> > 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > 
> > 		memset(region->mreg_pages + page_offset, 0,
> > 		       page_count * sizeof(struct page *));
> > 	}
> > 
> > It doesn't checked for zeroed pages before unpinning.
> > 
> > > 
> > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> > >  1 file changed, 16 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 665d565899c15..7e4252b6bc65c 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > >  			pt_err(partition,
> > >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > >  			       region->start_gfn, ret);
> > > -			goto invalidate_region;
> > > +			goto err_out;
> > >  		}
> > >  	}
> > >  
> > >  	ret = mshv_region_map(region);
> > > -	if (ret && mshv_partition_encrypted(partition)) {
> > > +	if (ret)
> > > +		goto share_region;
> > > +
> > > +	return 0;
> > > +
> > > +share_region:
> > > +	if (mshv_partition_encrypted(partition)) {
> > >  		int shrc;
> > >  
> > >  		shrc = mshv_region_share(region);
> > >  		if (!shrc)
> > > -			goto invalidate_region;
> > > +			goto err_out;
> > >  
> > >  		pt_err(partition,
> > >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> > >  		       region->start_gfn, shrc);
> > >  		/*
> > > -		 * Don't unpin if marking shared failed because pages are no
> > > -		 * longer mapped in the host, ie root, anymore.
> > > +		 * Re-sharing failed — the pages remain inaccessible to the
> > > +		 * host.  Zero the page array so that mshv_region_destroy()
> > > +		 * won't attempt to unpin them (leaking the page references
> > > +		 * is intentional; unpinning host-inaccessible pages would be
> > > +		 * unsafe).
> > >  		 */
> > 
> > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > can remove it from here altogether. Either way, should this zeroing
> > logic be added there too so as not to crash the host by unpinning pages
> > that weren't marked shared?
> > 
> 
> Indeed.
> The issue with all this code is that we are leaking state in
> mshv_region_pin if we wimply return from it: we don't know if the pages
> are pinned or unshared (or mapped) in the destruction callback.
> We should either introduce a state variable to track this or just don't
> call the generic mshv_region_put on case of region creation error.
> The latter approch make mshv_map_user_memory idempotent and thus looks
> like a better way forward.
> What do you think?

I'm not sure I follow the latter approach. Omitting mshv_region_put()
would cause a dangling reference to the mshv_region right?

Thanks,
Anirudh.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract
  2026-05-07 15:43 ` [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract Stanislav Kinsburskii
@ 2026-05-13 11:20   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 11:20 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:54PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_lookup() previously took rcu_read_lock() internally, ran
> idr_find(), released the read lock, and copied the struct contents
> into a caller-supplied buffer.  This had two problems.
> 
> 1. The struct copy ran outside the read section, racing with
>    mshv_portid_free() which does idr_remove + synchronize_rcu + kfree.
>    A copy that started just before synchronize_rcu() observed the read
>    section as already drained and was free to read freed memory while
>    the writer was kfree()'ing the entry.
> 
> 2. The only consumer, mshv_doorbell_isr(), then dispatched a callback
>    using fields of the snapshot — entirely outside any RCU read
>    section.  The callback's data argument and any field it touches
>    were therefore safe only because mshv_isr() runs from
>    sysvec_hyperv_callback, a non-threaded system vector that
>    synchronize_rcu() implicitly waits for via the hardirq quiescent-
>    state coupling.  That protection is real today but undocumented and
>    fragile: a future move of mshv_isr() to a threaded context, or a
>    future caller that registers a doorbell with a shorter-lived data
>    pointer, would silently expose a use-after-free.
> 
> Make the contract explicit instead of implicit.  mshv_portid_lookup()
> now returns a pointer to the table entry and requires the caller to
> hold rcu_read_lock for the entire lifetime of that pointer.  The
> contract is annotated with __must_hold(RCU) so sparse flags any
> direct caller that forgets it.  The sole caller, mshv_doorbell_isr(),
> takes rcu_read_lock around the whole drain loop, so the lookup, the
> field reads, and the doorbell_cb dispatch all run inside one
> read-side critical section.  synchronize_rcu() in mshv_portid_free()
> now genuinely waits for any in-flight callback before kfree() runs,
> without relying on hardirq context for correctness.
> 
> This also drops the by-value struct copy: entries are publish-once
> (populated before idr_alloc) and free-once (after synchronize_rcu),
> so a pointer dereferenced inside the read section gives a stable
> view of the contents without copying.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |   22 +++++++---------------
>  drivers/hv/mshv_root.h         |    2 +-
>  drivers/hv/mshv_synic.c        |   15 +++++++++------
>  3 files changed, 17 insertions(+), 22 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free
  2026-05-07 15:44 ` [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free Stanislav Kinsburskii
@ 2026-05-13 11:22   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 11:22 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:44:05PM +0000, Stanislav Kinsburskii wrote:
> mshv_portid_free() uses synchronize_rcu() followed by kfree() to
> reclaim port table entries. This blocks the caller until a full RCU
> grace period elapses, which is unnecessary since the same module already
> uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().
> 
> Replace with kfree_rcu() to avoid the blocking wait and keep the
> reclamation strategy consistent across the file.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_portid_table.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
> index d87a82e399e96..42d21b92b88fd 100644
> --- a/drivers/hv/mshv_portid_table.c
> +++ b/drivers/hv/mshv_portid_table.c
> @@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
>  	WARN_ON(!info);
>  	idr_unlock(&port_table_idr);
>  
> -	synchronize_rcu();
> -	kfree(info);
> +	kfree_rcu(info, portbl_rcu);
>  }
>  
>  /*
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0
  2026-05-07 15:43 ` [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0 Stanislav Kinsburskii
@ 2026-05-13 11:36   ` Anirudh Rayabharam
  0 siblings, 0 replies; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 11:36 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:49PM +0000, Stanislav Kinsburskii wrote:
> The duplicate routing entry check in mshv_update_routing_table() uses
> guest_irq_num != 0 to detect whether a GSI slot is already occupied.
> This fails for GSI 0 because its guest_irq_num is 0 both when the slot
> is unused (zero-initialized) and when legitimately assigned. As a
> result, duplicate entries for GSI 0 are silently accepted, with the
> second entry overwriting the first — corrupting the routing table
> without any error reported to userspace.
> 
> While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
> in practice, the check is semantically wrong — it conflates
> "uninitialized" with "GSI number 0." Use girq_entry_valid instead,
> which is explicitly set to true when an entry is populated and remains
> zero for unused slots regardless of the GSI number.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_irq.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
> index 59584a132ca9f..db05512db5548 100644
> --- a/drivers/hv/mshv_irq.c
> +++ b/drivers/hv/mshv_irq.c
> @@ -88,7 +88,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
>  		/*
>  		 * Allow only one to one mapping between GSI and MSI routing.
>  		 */
> -		if (girq->guest_irq_num != 0) {
> +		if (girq->girq_entry_valid) {
>  			r = -EINVAL;
>  			goto out;
>  		}
> 
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
  2026-05-07 15:43 ` [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data Stanislav Kinsburskii
@ 2026-05-13 12:14   ` Anirudh Rayabharam
  2026-05-13 17:38     ` Stanislav Kinsburskii
  0 siblings, 1 reply; 42+ messages in thread
From: Anirudh Rayabharam @ 2026-05-13 12:14 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> In mshv_irqfd_assign(), the level-triggered validation for resample
> irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> mshv_irqfd_update() has populated the field. Since the irqfd struct is
> zero-allocated, level_triggered is always 0 at that point, causing the
> check to always reject resample irqfds with -EINVAL. This makes
> level-triggered interrupt resampling — used to avoid interrupt storms
> with assigned devices — completely non-functional.

What bugs would this manifest as? Why haven't we seen any such bugs so
far?

Thanks,
Anirudh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
  2026-05-13 11:15       ` Anirudh Rayabharam
@ 2026-05-13 17:31         ` Stanislav Kinsburskii
  0 siblings, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-13 17:31 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Wed, May 13, 2026 at 11:15:37AM +0000, Anirudh Rayabharam wrote:
> On Mon, May 11, 2026 at 08:06:44AM -0700, Stanislav Kinsburskii wrote:
> > On Mon, May 11, 2026 at 01:48:56PM +0000, Anirudh Rayabharam wrote:
> > > On Thu, May 07, 2026 at 03:43:09PM +0000, Stanislav Kinsburskii wrote:
> > > > mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
> > > > fails on an unencrypted partition. The condition on the error path:
> > > > 
> > > >     if (ret && mshv_partition_encrypted(partition))
> > > > 
> > > > only handles map failures for encrypted partitions — if the partition is
> > > > not encrypted and the map fails, execution falls through to 'return 0',
> > > > silently ignoring the error.
> > > > 
> > > > Additionally, calling mshv_region_invalidate() inline on map failure
> > > > zeroes the mreg_pages array before the caller's cleanup path
> > > > (mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
> > > > pages where mreg_pages[offset] is NULL, this can leave stale SLAT
> > > > mappings for partially-mapped pages.
> > > > 
> > > > Fix by returning immediately on success and falling through to error
> > > > return on failure. For unencrypted partitions, the caller's
> > > > mshv_region_destroy() handles unmap followed by invalidate in the
> > > > correct order. For encrypted partitions where re-sharing fails, zero
> > > > the page array without unpinning — the pages are inaccessible to the
> > > > host and must not be unpinned, but zeroing prevents
> > > > mshv_region_destroy() from attempting to unpin them.
> > > 
> > > mshv_region_destroy() calls mshv_region_invalidate() which calls
> > > mshv_region_invalidate_pages() which just does:
> > > 
> > > 	static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > > 						 u64 page_offset, u64 page_count)
> > > 	{
> > > 		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > > 			unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > > 
> > > 		memset(region->mreg_pages + page_offset, 0,
> > > 		       page_count * sizeof(struct page *));
> > > 	}
> > > 
> > > It doesn't checked for zeroed pages before unpinning.
> > > 
> > > > 
> > > > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >  drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
> > > >  1 file changed, 16 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index 665d565899c15..7e4252b6bc65c 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> > > >  			pt_err(partition,
> > > >  			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> > > >  			       region->start_gfn, ret);
> > > > -			goto invalidate_region;
> > > > +			goto err_out;
> > > >  		}
> > > >  	}
> > > >  
> > > >  	ret = mshv_region_map(region);
> > > > -	if (ret && mshv_partition_encrypted(partition)) {
> > > > +	if (ret)
> > > > +		goto share_region;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +share_region:
> > > > +	if (mshv_partition_encrypted(partition)) {
> > > >  		int shrc;
> > > >  
> > > >  		shrc = mshv_region_share(region);
> > > >  		if (!shrc)
> > > > -			goto invalidate_region;
> > > > +			goto err_out;
> > > >  
> > > >  		pt_err(partition,
> > > >  		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> > > >  		       region->start_gfn, shrc);
> > > >  		/*
> > > > -		 * Don't unpin if marking shared failed because pages are no
> > > > -		 * longer mapped in the host, ie root, anymore.
> > > > +		 * Re-sharing failed — the pages remain inaccessible to the
> > > > +		 * host.  Zero the page array so that mshv_region_destroy()
> > > > +		 * won't attempt to unpin them (leaking the page references
> > > > +		 * is intentional; unpinning host-inaccessible pages would be
> > > > +		 * unsafe).
> > > >  		 */
> > > 
> > > Actually, mshv_region_destroy() also does mshv_region_share(). Maybe we
> > > can remove it from here altogether. Either way, should this zeroing
> > > logic be added there too so as not to crash the host by unpinning pages
> > > that weren't marked shared?
> > > 
> > 
> > Indeed.
> > The issue with all this code is that we are leaking state in
> > mshv_region_pin if we wimply return from it: we don't know if the pages
> > are pinned or unshared (or mapped) in the destruction callback.
> > We should either introduce a state variable to track this or just don't
> > call the generic mshv_region_put on case of region creation error.
> > The latter approch make mshv_map_user_memory idempotent and thus looks
> > like a better way forward.
> > What do you think?
> 

I tried to say, that there can be two apporaches to solving this issue:
  1. Either make sure mshv_region_pus() can destroy a partially created
     region (tha'ts what this patch is doing) or
  2. Make mshv_map_user_region be idempotent by not calling mshv_region_put() if the region was not
     fully created and destroy the partioally created parts of it on
     error paths. This would require preserving some state in the region
     struct to know if the region was pinned or shared or mapped.
     This looks like a leaner apporach, but it requires additional state
     tracking and more code changes.

Thanks,
Stanislav


> I'm not sure I follow the latter approach. Omitting mshv_region_put()
> would cause a dangling reference to the mshv_region right?
> 
> Thanks,
> Anirudh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data
  2026-05-13 12:14   ` Anirudh Rayabharam
@ 2026-05-13 17:38     ` Stanislav Kinsburskii
  0 siblings, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-13 17:38 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Wed, May 13, 2026 at 12:14:49PM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:43:43PM +0000, Stanislav Kinsburskii wrote:
> > In mshv_irqfd_assign(), the level-triggered validation for resample
> > irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
> > mshv_irqfd_update() has populated the field. Since the irqfd struct is
> > zero-allocated, level_triggered is always 0 at that point, causing the
> > check to always reject resample irqfds with -EINVAL. This makes
> > level-triggered interrupt resampling — used to avoid interrupt storms
> > with assigned devices — completely non-functional.
> 
> What bugs would this manifest as? Why haven't we seen any such bugs so
> far?
> 

This patch fixes a logical error.
Whtout the change this hunk always fails:

        if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
            !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {

and the reason we never seen it as that we never used
register_irqfd_with_resample() function of the mshv crate.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
  2026-05-13 11:12   ` Anirudh Rayabharam
@ 2026-05-13 17:39     ` Stanislav Kinsburskii
  0 siblings, 0 replies; 42+ messages in thread
From: Stanislav Kinsburskii @ 2026-05-13 17:39 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel

On Wed, May 13, 2026 at 11:12:05AM +0000, Anirudh Rayabharam wrote:
> On Thu, May 07, 2026 at 03:44:26PM +0000, Stanislav Kinsburskii wrote:
> > handle_pair_message() iterates up to msg->vp_count without verifying it
> > against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> > from untrusted hypervisor data, a malformed message with a large value
> > would cause out-of-bounds reads from the partition_ids and vp_indexes
> > arrays.
> > 
> > handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> > 64) and advances bank_contents for each one. However, the payload buffer
> > only has space for 16 bank entries. A valid_bank_mask with more than 16
> > bits set causes bank_contents to read beyond the message buffer.
> > 
> > Fix both by adding bounds validation:
> > - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> > - Track banks consumed and stop before exceeding buffer capacity
> > 
> > Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index 89207aad7cf1f..5d509299f14d7 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
> >  static void
> >  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> >  {
> > -	int bank_idx, vps_signaled = 0, bank_mask_size;
> > +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> > +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> > +			      sizeof(u64) - 2; /* subtract format + mask */
> 
> Could this be a constant in the header?
> 

Yes, it could. But it the only place it's used and it's pretty
self-explanatory, so I don't think it needs to be.

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2026-05-13 17:39 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-07 15:42 [PATCH v4 00/18] mshv: Bug fixes across the mshv_root module Stanislav Kinsburskii
2026-05-07 15:43 ` [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access Stanislav Kinsburskii
2026-05-11  3:46   ` Anirudh Rayabharam
2026-05-07 15:43 ` [PATCH v4 02/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions Stanislav Kinsburskii
2026-05-11 13:48   ` Anirudh Rayabharam
2026-05-11 15:06     ` Stanislav Kinsburskii
2026-05-13 11:15       ` Anirudh Rayabharam
2026-05-13 17:31         ` Stanislav Kinsburskii
2026-05-11 15:12     ` Stanislav Kinsburskii
2026-05-07 15:43 ` [PATCH v4 03/18] mshv: Fix race in mshv_irqfd_deassign Stanislav Kinsburskii
2026-05-11 13:57   ` Anirudh Rayabharam
2026-05-07 15:43 ` [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast Stanislav Kinsburskii
2026-05-11  3:24   ` Anirudh Rayabharam
2026-05-07 15:43 ` [PATCH v4 05/18] mshv: irqfd: Reject routing updates that invalidate resampler binding Stanislav Kinsburskii
2026-05-07 15:43 ` [PATCH v4 06/18] mshv: Fix broken seqcount read protection Stanislav Kinsburskii
2026-05-07 15:43 ` [PATCH v4 07/18] mshv: Consolidate irqfd interrupt injection paths Stanislav Kinsburskii
2026-05-07 15:43 ` [PATCH v4 08/18] mshv: Fix level-triggered check on uninitialized data Stanislav Kinsburskii
2026-05-13 12:14   ` Anirudh Rayabharam
2026-05-13 17:38     ` Stanislav Kinsburskii
2026-05-07 15:43 ` [PATCH v4 09/18] mshv: Fix duplicate GSI detection for GSI 0 Stanislav Kinsburskii
2026-05-13 11:36   ` Anirudh Rayabharam
2026-05-07 15:43 ` [PATCH v4 10/18] mshv: portid_table: Make mshv_portid_lookup() RCU-aware by contract Stanislav Kinsburskii
2026-05-13 11:20   ` Anirudh Rayabharam
2026-05-07 15:43 ` [PATCH v4 11/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc Stanislav Kinsburskii
2026-05-11  3:33   ` Anirudh Rayabharam
2026-05-07 15:44 ` [PATCH v4 12/18] mshv: Use kfree_rcu in mshv_portid_free Stanislav Kinsburskii
2026-05-13 11:22   ` Anirudh Rayabharam
2026-05-07 15:44 ` [PATCH v4 13/18] mshv: Add missing vp_index bounds check in intercept ISR Stanislav Kinsburskii
2026-05-13  5:32   ` Anirudh Rayabharam
2026-05-07 15:44 ` [PATCH v4 14/18] mshv: Order pt_vp_array publish against irqfd assertion path Stanislav Kinsburskii
2026-05-13  9:57   ` Anirudh Rayabharam
2026-05-07 15:44 ` [PATCH v4 15/18] mshv: Defer mshv_vp free to an RCU grace period Stanislav Kinsburskii
2026-05-13 10:11   ` Anirudh Rayabharam
2026-05-07 15:44 ` [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor Stanislav Kinsburskii
2026-05-13 11:12   ` Anirudh Rayabharam
2026-05-13 17:39     ` Stanislav Kinsburskii
2026-05-07 15:44 ` [PATCH v4 17/18] mshv: Publish VP to pt_vp_array before installing the file descriptor Stanislav Kinsburskii
2026-05-11 14:26   ` Anirudh Rayabharam
2026-05-11 15:29     ` Stanislav Kinsburskii
2026-05-12 12:46       ` Anirudh Rayabharam
2026-05-07 15:44 ` [PATCH v4 18/18] mshv: Fix missing error code on VP allocation failure Stanislav Kinsburskii
2026-05-11  3:35   ` Anirudh Rayabharam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox