Linux-HyperV List
 help / color / mirror / Atom feed
* [PATCH v3 07/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_try_assert_irq_fast() dereferences the vp pointer obtained from
pt_vp_array[lapic_apic_id] without checking for NULL or validating that
lapic_apic_id is within bounds. A spurious interrupt from the hypervisor
targeting a non-existent VP (or one not yet created) causes a NULL
pointer dereference and crashes the host.

Add a bounds check on lapic_apic_id against MSHV_MAX_VPS and a NULL
check on the vp pointer before dereferencing.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 3ab6338064237..509911ffcbeee 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -169,7 +169,12 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 		return -EOPNOTSUPP;
 #endif
 
+	if (irq->lapic_apic_id >= MSHV_MAX_VPS)
+		return -EINVAL;
+
 	vp = partition->pt_vp_array[irq->lapic_apic_id];
+	if (!vp)
+		return -EINVAL;
 
 	if (!vp->vp_register_page)
 		return -EOPNOTSUPP;



^ permalink raw reply related

* [PATCH v3 06/18] mshv: Add defensive synchronize_srcu in irqfd shutdown
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_irqfd_assign() adds the irqfd to the partition's hlist and then
registers the wait entry on the eventfd waitqueue via vfs_poll(). A
narrow window exists between these two operations where the irqfd is
visible to deactivation paths but the wait entry is not yet initialized
on the waitqueue.

Currently this is not reachable because mshv_irqfd_assign() and
mshv_irqfd_deassign() are serialized by the partition mutex, and the
EPOLLHUP wakeup path can only fire after vfs_poll() has registered the
wait entry. However, if future refactoring removes or relaxes that
serialization, mshv_irqfd_shutdown() could call
eventfd_ctx_remove_wait_queue() before the wait entry is on the queue,
causing a NULL pointer dereference (the list_head is zeroed by kzalloc
and not initialized by init_waitqueue_func_entry()).

Add synchronize_srcu_expedited() at the start of mshv_irqfd_shutdown()
as a defensive measure, ensuring the assignment path's SRCU read-side
section (which covers vfs_poll() registration) has completed. This
follows the pattern established by KVM in irqfd_shutdown().

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 5995a62aff8d8..3ab6338064237 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -248,8 +248,12 @@ static void mshv_irqfd_shutdown(struct work_struct *work)
 {
 	struct mshv_irqfd *irqfd =
 			container_of(work, struct mshv_irqfd, irqfd_shutdown);
+	struct mshv_partition *pt = irqfd->irqfd_partn;
 	u64 cnt;
 
+	/* Make sure irqfd has been initialized in assign path. */
+	synchronize_srcu_expedited(&pt->pt_irq_srcu);
+
 	/*
 	 * Synchronize with the wait-queue and unhook ourselves to prevent
 	 * further events.



^ permalink raw reply related

* [PATCH v3 05/18] mshv: Fix race in mshv_irqfd_deassign
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_irqfd_deactivate() and the hlist traversal of pt_irqfds_list
require pt->pt_irqfds_lock to be held, but mshv_irqfd_deassign()
omits it. This races with the EPOLLHUP path in mshv_irqfd_wakeup(),
which does take the lock before calling mshv_irqfd_deactivate().

Additionally, mshv_irqfd_deactivate() uses hlist_del() which poisons
the node pointers rather than resetting them. Since
mshv_irqfd_is_active() relies on hlist_unhashed() (checks pprev ==
NULL), a poisoned node still appears active. If a concurrent path calls
mshv_irqfd_deactivate() again on the same irqfd, the guard fails to
prevent a double hlist_del() on poisoned pointers.

Fix both issues:
- Add the missing spin_lock_irq/spin_unlock_irq around the list
  traversal in mshv_irqfd_deassign(), matching mshv_irqfd_release().
- Use hlist_del_init() instead of hlist_del() so the node is properly
  marked as unhashed after removal, making the is_active guard reliable.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 90959f639dc32..5995a62aff8d8 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -284,7 +284,7 @@ static void mshv_irqfd_deactivate(struct mshv_irqfd *irqfd)
 	if (!mshv_irqfd_is_active(irqfd))
 		return;
 
-	hlist_del(&irqfd->irqfd_hnode);
+	hlist_del_init(&irqfd->irqfd_hnode);
 
 	queue_work(irqfd_cleanup_wq, &irqfd->irqfd_shutdown);
 }
@@ -541,13 +541,14 @@ static int mshv_irqfd_deassign(struct mshv_partition *pt,
 	if (IS_ERR(eventfd))
 		return PTR_ERR(eventfd);
 
+	spin_lock_irq(&pt->pt_irqfds_lock);
 	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list,
 				  irqfd_hnode) {
 		if (irqfd->irqfd_eventfd_ctx == eventfd &&
 		    irqfd->irqfd_irqnum == args->gsi)
-
 			mshv_irqfd_deactivate(irqfd);
 	}
+	spin_unlock_irq(&pt->pt_irqfds_lock);
 
 	eventfd_ctx_put(eventfd);
 



^ permalink raw reply related

* [PATCH v3 04/18] mshv: Fix potential u64 overflow in region overlap check
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_partition_create_region() checks for overlapping guest memory
regions using arithmetic that can overflow u64:

  mem->guest_pfn + nr_pages <= rg->start_gfn

If guest_pfn is near U64_MAX, the addition wraps around to a small
value, causing the overlap check to incorrectly pass. This could allow
creation of overlapping regions.

Fix by validating the sum with check_add_overflow() before the loop and
using the pre-computed end_gfn in the comparison.

Fixes: f91bc8f61abf ("mshv: Allow mappings that overlap in uaddr")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 7e4252b6bc65c..0e9b696853fcb 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1290,11 +1290,15 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 {
 	struct mshv_mem_region *rg;
 	u64 nr_pages = HVPFN_DOWN(mem->size);
+	u64 end_gfn;
+
+	if (check_add_overflow(mem->guest_pfn, nr_pages, &end_gfn))
+		return -EINVAL;
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
-		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
+		if (end_gfn <= rg->start_gfn ||
 		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
 			continue;
 		spin_unlock(&partition->pt_mem_regions_lock);



^ permalink raw reply related

* [PATCH v3 03/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
fails on an unencrypted partition. The condition on the error path:

    if (ret && mshv_partition_encrypted(partition))

only handles map failures for encrypted partitions — if the partition is
not encrypted and the map fails, execution falls through to 'return 0',
silently ignoring the error.

Additionally, calling mshv_region_invalidate() inline on map failure
zeroes the mreg_pages array before the caller's cleanup path
(mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
pages where mreg_pages[offset] is NULL, this can leave stale SLAT
mappings for partially-mapped pages.

Fix by returning immediately on success and falling through to error
return on failure. For unencrypted partitions, the caller's
mshv_region_destroy() handles unmap followed by invalidate in the
correct order. For encrypted partitions where re-sharing fails, zero
the page array without unpinning — the pages are inaccessible to the
host and must not be unpinned, but zeroing prevents
mshv_region_destroy() from attempting to unpin them.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 665d565899c15..7e4252b6bc65c 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
 			pt_err(partition,
 			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
 			       region->start_gfn, ret);
-			goto invalidate_region;
+			goto err_out;
 		}
 	}
 
 	ret = mshv_region_map(region);
-	if (ret && mshv_partition_encrypted(partition)) {
+	if (ret)
+		goto share_region;
+
+	return 0;
+
+share_region:
+	if (mshv_partition_encrypted(partition)) {
 		int shrc;
 
 		shrc = mshv_region_share(region);
 		if (!shrc)
-			goto invalidate_region;
+			goto err_out;
 
 		pt_err(partition,
 		       "Failed to share memory region (guest_pfn: %llu): %d\n",
 		       region->start_gfn, shrc);
 		/*
-		 * Don't unpin if marking shared failed because pages are no
-		 * longer mapped in the host, ie root, anymore.
+		 * Re-sharing failed — the pages remain inaccessible to the
+		 * host.  Zero the page array so that mshv_region_destroy()
+		 * won't attempt to unpin them (leaking the page references
+		 * is intentional; unpinning host-inaccessible pages would be
+		 * unsafe).
 		 */
+		memset(region->mreg_pages, 0,
+		       region->nr_pages * sizeof(region->mreg_pages[0]));
 		goto err_out;
 	}
-
-	return 0;
-
-invalidate_region:
-	mshv_region_invalidate(region);
 err_out:
 	return ret;
 }



^ permalink raw reply related

* [PATCH v3 02/18] mshv: Fix potential integer overflow in mshv_region_create
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The allocation size is computed as:

  sizeof(*region) + sizeof(struct page *) * nr_pages

where nr_pages is a u64 originating from userspace. A sufficiently
large nr_pages can overflow the multiplication, resulting in a small
allocation followed by out-of-bounds writes when populating mreg_pages.

Use struct_size() which returns SIZE_MAX on overflow, causing vzalloc
to safely return NULL — caught by the existing error check.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..1d04a97980b8b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -177,7 +177,7 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 {
 	struct mshv_mem_region *region;
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	region = vzalloc(struct_size(region, mreg_pages, nr_pages));
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 



^ permalink raw reply related

* [PATCH v3 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-05-04 19:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.

Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.

While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 129456bd72aba..cc580225e9e45 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
-	int done = 0;
+	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
 	u64 page_count = page_struct_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain,
-				    HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain,
+				      HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->flags = flags;
 		input_page->host_access = host_access;
 
-		for (i = 0; i < rep_count; i++) {
-			u64 index = (done + i) << large_shift;
-
-			if (index >= page_struct_count)
-				return -EINVAL;
-
+		for (i = 0; i < rep_count; i++)
 			input_page->spa_page_list[i] =
-						page_to_pfn(pages[index]);
-		}
+				page_to_pfn(pages[(done + i) << large_shift]);
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);



^ permalink raw reply related

* [PATCH v3 00/18] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-05-04 19:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series addresses bugs found during a continued review of the
mshv_root module mostly introduced by commit 621191d709b14 ("Drivers: hv:
Introduce mshv_root module to expose /dev/mshv to VMMs").

Changes since v2:
- "Fix mshv_prepare_pinned_region error path for unencrypted
  partitions": removed inline mshv_region_invalidate() to prevent
  zeroing mreg_pages before mshv_region_destroy() can unmap partial
  SLAT mappings; for encrypted share-failure, memset the page array
  without unpinning (pages are host-inaccessible).
- "Consolidate irqfd interrupt injection paths": fixed data race in
  mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
  the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
  concurrent routing update from injecting vector 0 to VP 0.
- "Add missing vp_index bounds check in intercept ISR": added
  array_index_nospec() after the bounds check to prevent speculative
  out-of-bounds array access.
- "Add store/load ordering for VP array publish": added missing
  smp_load_acquire in mshv_try_assert_irq_fast.

Changes since v1:
- Added 8 new patches addressing issues found by Sashiko (automated
  review) covering the irqfd, portid, scheduler message, and VP
  lifecycle paths.
- Consolidated the irqfd fast/slow injection paths to eliminate
  duplicated seqcount reads and fix the GSI 0 validity bypass.
- Added memory ordering for the lockless VP array.

The fixes range from data corruption and use-after-free to silent
functional failures and sleeping-while-atomic:

 Memory region management:
  - Integer overflow on userspace-controlled allocation size
    (mshv_region_create)
  - Silent success on map failure for unencrypted partitions
    (mshv_prepare_pinned_region)
  - u64 overflow in region overlap check allowing overlapping mappings

 IRQ/eventfd path:
  - IRQ state leak and type truncation in hypercall helpers
  - Missing locking and hlist_del vs hlist_del_init race in irqfd
    deassign
  - Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
  - NULL pointer dereference on spurious interrupt to non-existent VP
    (mshv_try_assert_irq_fast)
  - Broken seqcount read protection — torn reads of interrupt routing
  - Duplicated and inconsistent validity checks between fast/slow
    injection paths; fast path could inject vector 0 spuriously
  - Level-triggered check on uninitialized data making interrupt
    resampling completely non-functional
  - Duplicate GSI 0 detection using the wrong predicate

 Port ID table:
  - Use-after-RCU in mshv_portid_lookup (dereference outside read-side
    critical section)
  - Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
    idr_lock)
  - Use kfree_rcu for deferred free without blocking

 SynIC / ISR paths:
  - Missing VP index bounds check in intercept ISR (OOB in interrupt
    context from untrusted hypervisor data)
  - Missing store/load ordering for VP array publish — lockless ISR
    readers could observe partially-initialized VP
  - Missing bounds validation in scheduler messages
    (handle_pair_message vp_count, handle_bitset_message bank_mask)

 Miscellaneous:
  - Missing error code on VP allocation failure (silent success to
    userspace)

Kudos to Claude and Sashiko for assisting with analysis and
implementation.

---

Stanislav Kinsburskii (18):
      mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
      mshv: Fix potential integer overflow in mshv_region_create
      mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
      mshv: Fix potential u64 overflow in region overlap check
      mshv: Fix race in mshv_irqfd_deassign
      mshv: Add defensive synchronize_srcu in irqfd shutdown
      mshv: Add NULL check for vp in mshv_try_assert_irq_fast
      mshv: Fix broken seqcount read protection
      mshv: Consolidate irqfd interrupt injection paths
      mshv: Fix level-triggered check on uninitialized data
      mshv: Fix duplicate GSI detection for GSI 0
      mshv: Fix use-after-RCU in mshv_portid_lookup
      mshv: Fix sleeping under spinlock in mshv_portid_alloc
      mshv: Use kfree_rcu in mshv_portid_free
      mshv: Add missing vp_index bounds check in intercept ISR
      mshv: Add store/load ordering for VP array publish
      mshv: Validate scheduler message bounds from hypervisor
      mshv: Fix missing error code on VP allocation failure


 drivers/hv/mshv_eventfd.c      |  108 +++++++++++++++++++++++++---------------
 drivers/hv/mshv_irq.c          |    2 -
 drivers/hv/mshv_portid_table.c |   12 ++--
 drivers/hv/mshv_regions.c      |    2 -
 drivers/hv/mshv_root_hv_call.c |   18 ++-----
 drivers/hv/mshv_root_main.c    |   39 ++++++++++----
 drivers/hv/mshv_synic.c        |   36 +++++++++++--
 7 files changed, 136 insertions(+), 81 deletions(-)


^ permalink raw reply

* Re: [PATCH v3 00/18] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-05-04 19:06 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792097254.89142.47656055124497980.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Mon, May 04, 2026 at 06:59:01PM +0000, Stanislav Kinsburskii wrote:
> This series addresses bugs found during a continued review of the
> mshv_root module introduced by commit 621191d709b14 ("Drivers: hv:
> Introduce mshv_root module to expose /dev/mshv to VMMs").
> 

THis series is malformed.
Please disregard.

Thanks,
Stanislav

> Changes since v2:
> - "Fix mshv_prepare_pinned_region error path for unencrypted
>   partitions": removed inline mshv_region_invalidate() to prevent
>   zeroing mreg_pages before mshv_region_destroy() can unmap partial
>   SLAT mappings; for encrypted share-failure, memset the page array
>   without unpinning (pages are host-inaccessible).
> - "Consolidate irqfd interrupt injection paths": fixed data race in
>   mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
>   the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
>   concurrent routing update from injecting vector 0 to VP 0.
> - "Add missing vp_index bounds check in intercept ISR": added
>   array_index_nospec() after the bounds check to prevent speculative
>   out-of-bounds array access.
> - "Add store/load ordering for VP array publish": added missing
>   smp_load_acquire in mshv_try_assert_irq_fast.
> 
> Changes since v1:
> - Added 8 new patches addressing issues found by Sashiko (automated
>   review) covering the irqfd, portid, scheduler message, and VP
>   lifecycle paths.
> - Consolidated the irqfd fast/slow injection paths to eliminate
>   duplicated seqcount reads and fix the GSI 0 validity bypass.
> - Added memory ordering for the lockless VP array.
> 
> The fixes range from data corruption and use-after-free to silent
> functional failures and sleeping-while-atomic:
> 
>  Memory region management:
>   - Integer overflow on userspace-controlled allocation size
>     (mshv_region_create)
>   - Silent success on map failure for unencrypted partitions
>     (mshv_prepare_pinned_region)
>   - u64 overflow in region overlap check allowing overlapping mappings
> 
>  IRQ/eventfd path:
>   - IRQ state leak and type truncation in hypercall helpers
>   - Missing locking and hlist_del vs hlist_del_init race in irqfd
>     deassign
>   - Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
>   - NULL pointer dereference on spurious interrupt to non-existent VP
>     (mshv_try_assert_irq_fast)
>   - Broken seqcount read protection — torn reads of interrupt routing
>   - Duplicated and inconsistent validity checks between fast/slow
>     injection paths; fast path could inject vector 0 spuriously
>   - Level-triggered check on uninitialized data making interrupt
>     resampling completely non-functional
>   - Duplicate GSI 0 detection using the wrong predicate
> 
>  Port ID table:
>   - Use-after-RCU in mshv_portid_lookup (dereference outside read-side
>     critical section)
>   - Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
>     idr_lock)
>   - Use kfree_rcu for deferred free without blocking
> 
>  SynIC / ISR paths:
>   - Missing VP index bounds check in intercept ISR (OOB in interrupt
>     context from untrusted hypervisor data)
>   - Missing store/load ordering for VP array publish — lockless ISR
>     readers could observe partially-initialized VP
>   - Missing bounds validation in scheduler messages
>     (handle_pair_message vp_count, handle_bitset_message bank_mask)
> 
>  Miscellaneous:
>   - Missing error code on VP allocation failure (silent success to
>     userspace)
> 
> Kudos to Claude and Sashiko for assisting with analysis and
> implementation.
> 
> 
> ---
> 
> Stanislav Kinsburskii (18):
>       mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
>       mshv: Fix potential integer overflow in mshv_region_create
>       mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
>       mshv: Fix potential u64 overflow in region overlap check
>       mshv: Fix race in mshv_irqfd_deassign
>       mshv: Add defensive synchronize_srcu in irqfd shutdown
>       mshv: Add NULL check for vp in mshv_try_assert_irq_fast
>       mshv: Fix broken seqcount read protection
>       mshv: Consolidate irqfd interrupt injection paths
>       mshv: Fix level-triggered check on uninitialized data
>       mshv: Fix duplicate GSI detection for GSI 0
>       mshv: Fix use-after-RCU in mshv_portid_lookup
>       mshv: Fix sleeping under spinlock in mshv_portid_alloc
>       mshv: Use kfree_rcu in mshv_portid_free
>       mshv: Add missing vp_index bounds check in intercept ISR
>       mshv: Add store/load ordering for VP array publish
>       mshv: Validate scheduler message bounds from hypervisor
>       mshv: Fix missing error code on VP allocation failure
> 
> 
>  drivers/hv/mshv_eventfd.c      |  108 +++++++++++++++++++++++++---------------
>  drivers/hv/mshv_irq.c          |    2 -
>  drivers/hv/mshv_portid_table.c |   12 ++--
>  drivers/hv/mshv_regions.c      |    2 -
>  drivers/hv/mshv_root_hv_call.c |   18 ++-----
>  drivers/hv/mshv_root_main.c    |   39 ++++++++++----
>  drivers/hv/mshv_synic.c        |   36 +++++++++++--
>  7 files changed, 136 insertions(+), 81 deletions(-)
> 

^ permalink raw reply

* [PATCH v3] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-05-04 19:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792097254.89142.47656055124497980.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.

Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.

While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 129456bd72aba..cc580225e9e45 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
-	int done = 0;
+	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
 	u64 page_count = page_struct_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain,
-				    HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain,
+				      HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->flags = flags;
 		input_page->host_access = host_access;
 
-		for (i = 0; i < rep_count; i++) {
-			u64 index = (done + i) << large_shift;
-
-			if (index >= page_struct_count)
-				return -EINVAL;
-
+		for (i = 0; i < rep_count; i++)
 			input_page->spa_page_list[i] =
-						page_to_pfn(pages[index]);
-		}
+				page_to_pfn(pages[(done + i) << large_shift]);
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);



^ permalink raw reply related

* [PATCH v3 00/18] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-05-04 18:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series addresses bugs found during a continued review of the
mshv_root module introduced by commit 621191d709b14 ("Drivers: hv:
Introduce mshv_root module to expose /dev/mshv to VMMs").

Changes since v2:
- "Fix mshv_prepare_pinned_region error path for unencrypted
  partitions": removed inline mshv_region_invalidate() to prevent
  zeroing mreg_pages before mshv_region_destroy() can unmap partial
  SLAT mappings; for encrypted share-failure, memset the page array
  without unpinning (pages are host-inaccessible).
- "Consolidate irqfd interrupt injection paths": fixed data race in
  mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
  the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
  concurrent routing update from injecting vector 0 to VP 0.
- "Add missing vp_index bounds check in intercept ISR": added
  array_index_nospec() after the bounds check to prevent speculative
  out-of-bounds array access.
- "Add store/load ordering for VP array publish": added missing
  smp_load_acquire in mshv_try_assert_irq_fast.

Changes since v1:
- Added 8 new patches addressing issues found by Sashiko (automated
  review) covering the irqfd, portid, scheduler message, and VP
  lifecycle paths.
- Consolidated the irqfd fast/slow injection paths to eliminate
  duplicated seqcount reads and fix the GSI 0 validity bypass.
- Added memory ordering for the lockless VP array.

The fixes range from data corruption and use-after-free to silent
functional failures and sleeping-while-atomic:

 Memory region management:
  - Integer overflow on userspace-controlled allocation size
    (mshv_region_create)
  - Silent success on map failure for unencrypted partitions
    (mshv_prepare_pinned_region)
  - u64 overflow in region overlap check allowing overlapping mappings

 IRQ/eventfd path:
  - IRQ state leak and type truncation in hypercall helpers
  - Missing locking and hlist_del vs hlist_del_init race in irqfd
    deassign
  - Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
  - NULL pointer dereference on spurious interrupt to non-existent VP
    (mshv_try_assert_irq_fast)
  - Broken seqcount read protection — torn reads of interrupt routing
  - Duplicated and inconsistent validity checks between fast/slow
    injection paths; fast path could inject vector 0 spuriously
  - Level-triggered check on uninitialized data making interrupt
    resampling completely non-functional
  - Duplicate GSI 0 detection using the wrong predicate

 Port ID table:
  - Use-after-RCU in mshv_portid_lookup (dereference outside read-side
    critical section)
  - Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
    idr_lock)
  - Use kfree_rcu for deferred free without blocking

 SynIC / ISR paths:
  - Missing VP index bounds check in intercept ISR (OOB in interrupt
    context from untrusted hypervisor data)
  - Missing store/load ordering for VP array publish — lockless ISR
    readers could observe partially-initialized VP
  - Missing bounds validation in scheduler messages
    (handle_pair_message vp_count, handle_bitset_message bank_mask)

 Miscellaneous:
  - Missing error code on VP allocation failure (silent success to
    userspace)

Kudos to Claude and Sashiko for assisting with analysis and
implementation.


---

Stanislav Kinsburskii (18):
      mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
      mshv: Fix potential integer overflow in mshv_region_create
      mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
      mshv: Fix potential u64 overflow in region overlap check
      mshv: Fix race in mshv_irqfd_deassign
      mshv: Add defensive synchronize_srcu in irqfd shutdown
      mshv: Add NULL check for vp in mshv_try_assert_irq_fast
      mshv: Fix broken seqcount read protection
      mshv: Consolidate irqfd interrupt injection paths
      mshv: Fix level-triggered check on uninitialized data
      mshv: Fix duplicate GSI detection for GSI 0
      mshv: Fix use-after-RCU in mshv_portid_lookup
      mshv: Fix sleeping under spinlock in mshv_portid_alloc
      mshv: Use kfree_rcu in mshv_portid_free
      mshv: Add missing vp_index bounds check in intercept ISR
      mshv: Add store/load ordering for VP array publish
      mshv: Validate scheduler message bounds from hypervisor
      mshv: Fix missing error code on VP allocation failure


 drivers/hv/mshv_eventfd.c      |  108 +++++++++++++++++++++++++---------------
 drivers/hv/mshv_irq.c          |    2 -
 drivers/hv/mshv_portid_table.c |   12 ++--
 drivers/hv/mshv_regions.c      |    2 -
 drivers/hv/mshv_root_hv_call.c |   18 ++-----
 drivers/hv/mshv_root_main.c    |   39 ++++++++++----
 drivers/hv/mshv_synic.c        |   36 +++++++++++--
 7 files changed, 136 insertions(+), 81 deletions(-)


^ permalink raw reply

* RE: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Michael Kelley @ 2026-05-04 16:06 UTC (permalink / raw)
  To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <516a4e65-3a4d-4e09-b445-28cb740b5653@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:57 AM
> 
> On 4/27/2026 11:08 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> >>

[snip]

> >> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> >> index 08278547b84c..b4d80c9a673a 100644
> >> --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
> >>   #ifdef CONFIG_HYPERV_VTL_MODE
> >>   void __init hv_vtl_init_platform(void);
> >>   int __init hv_vtl_early_init(void);
> >> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >>   void mshv_vtl_return_call_init(u64 vtl_return_offset);
> >>   void mshv_vtl_return_hypercall(void);
> >>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
> >>   #else
> >>   static inline void __init hv_vtl_init_platform(void) {}
> >>   static inline int __init hv_vtl_early_init(void) { return 0; }
> >> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >>   static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
> >>   static inline void mshv_vtl_return_hypercall(void) {}
> >>   static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
> >> index a6eea52f7aa2..103f07371f3f 100644
> >> --- a/drivers/hv/mshv_vtl.h
> >> +++ b/drivers/hv/mshv_vtl.h
> >> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
> >>   	char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
> >>   };
> >>
> >> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
> >> +	      "struct mshv_vtl_cpu_context exceeds reserved space in struct mshv_vtl_run");
> >> +
> >>   #endif /* _MSHV_VTL_H */
> >> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> >> index db183c8cfb95..8cdf2a9fbdfb 100644
> >> --- a/include/asm-generic/mshyperv.h
> >> +++ b/include/asm-generic/mshyperv.h
> >> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
> >>
> >>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> >>   u8 __init get_vtl(void);
> >> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >>   #else
> >>   static inline u8 get_vtl(void) { return 0; }
> >> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >
> > Is this stub needed? Maybe I missed something, but it looks to me like none
> > of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
> > See further comments about stubs in Patch 8 of this series.
> >
> 
> Config dependencies would handle such cases, and this is not required. I
> saw similar stubs added in the code, so I thought this is a norm that
> should be followed, and not rely on config dependencies.
> I can remove it.
> 

Others might disagree with me, but I don't think it's the norm to add
stubs when they aren't truly needed. As you can see from some of my
other comments, I look for ways to eliminate stubs. Stubs are indicative
of a boundary between separately built components, and I generally
try to minimize the surface area of such boundaries. A large surface area
often means that the overall design could be improved by re-thinking
which code goes with which component.

Michael

^ permalink raw reply

* RE: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Michael Kelley @ 2026-05-04 16:06 UTC (permalink / raw)
  To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <567cf73b-6a48-4fc3-b312-9be6d71e2f22@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:58 AM
> 
> On 4/27/2026 11:10 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> >>
> >> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
> >> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
> >> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
> >> in architecture-specific code.
> >>
> >> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
> >> include/asm-generic/mshyperv.h so they are visible to both arch and
> >> driver code.
> >>
> >> Change the return type from void to bool so the caller can determine
> >> whether the register page was successfully configured and set
> >> mshv_has_reg_page accordingly.
> >>
> >> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> >> ---
> >>   arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
> >>   drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
> >>   include/asm-generic/mshyperv.h | 17 ++++++++++++
> >>   3 files changed, 53 insertions(+), 45 deletions(-)
> >>
> 
> <snip>
> 
> >>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> >> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> >
> > This comment pre-dates your patch, but I don't understand the point
> > it is trying to make. The comment is factually true, but I don't know
> > why calling that out is relevant. The REG_PAGE MSR seems to be
> > conceptually separate and distinct from the SIMP MSR, so the fact
> > that the layouts are the same is just a coincidence. Or is there some
> > relationship between the two MSRs that I'm not aware of, and the
> > comment is trying (and failing?) to point out?
> 
> This was added as per suggestion from Nuno in my initial series for
> MSHV_VTL. If the reference in "identical to" is misleading, I should
> remove it.
> 
> https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/
> 
> Quoting:
> """
> it is a generic structure that
> appears to be used for several overlay page MSRs (SIMP, SIEF, etc).
> 
> But, the type doesn't appear in the hv*dk headers explicitly; it's just
> used internally by the hypervisor.
> 
> I think it should be renamed with a hv_ prefix to indicate it's part of
> the hypervisor ABI, and a brief comment with the provenance:
> 
> /* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> union hv_synic_overlay_page_msr {
> 	/* <snip> */
> };

OK, so this union is not associated *only* with the REG_PAGE MSR
(though that MSR is the only current user). Instead, it is intended to
be a more generic description of MSRs that set up overlay pages. I
don't think I had previously noticed Nuno's comment on the topic.

Looking through hvgdk_mini.h and hvhdk.h, I see 6 definitions that
are exactly the same:

* union hv_reference_tsc_msr
* union hv_x64_msr_hypercall_contents
* union hv_vp_assist_msr_contents
* union hv_synic_simp
* union hv_synic_siefp
* union hv_synic_sirbp

There's an argument to be made for removing these 6 unique definitions
and using union hv_synic_overlay_page_msr instead (though "synic"
would need to be removed from the name).  I would not object to such
an approach. It's a small extra layer of conceptual indirection, but saves
some lines of code for duplicative definitions. The alternative is to drop
the idea of a generic overlay page MSR layout, and replace union
hv_synic_overlay_page_msr with a definition that is specific to the
REG_PAGE MSR, like the other six above. 

I could go either way. If we want to use a generic overlay page definition,
then that approach should be applied everywhere. With the current
state of your patch set, we're halfway in between -- the generic definition
is used one place, but duplicative specific MSR definitions are used other
places. That's probably the least desirable approach.

Michael

^ permalink raw reply

* RE: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Michael Kelley @ 2026-05-04 15:09 UTC (permalink / raw)
  To: Jork Loeser, linux-hyperv@vger.kernel.org
  Cc: x86@kernel.org, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, Anirudh Rayabharam, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org
In-Reply-To: <20260427213855.1675044-2-jloeser@linux.microsoft.com>

From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Monday, April 27, 2026 2:39 PM
> 
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> 
> Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
> 
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
>   and nested root partition); on a non-nested root partition VMBus
>   does not run, so MSHV must enable/disable them
> 
> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
> produces wrong addresses on ARM64 with 64K pages.

I agree that this is a good change. But any kernel image built with
CONFIG_MSHV_ROOT set must use only 4KiB pages, as enforced
by the dependency in drivers/hv/Kconfig. The change makes the
code explicitly match the SynIC register layout, which is good,
but it doesn't actually fix a problem since root MSHV code can't
run on ARM64 with 64KiB pages. My only concern is that this
commit message should not imply that an ARM64/64KiB
configuration is possible for the root.

Michael

> 
> Note that initialization order matters - VMBUS first, MSHV second,
> and the reverse on de-init. Ideally, we would want a dedicated SYNIC
> driver that replaces the cross-dependencies with a clear API and
> dynamic tracking. Such refactor should go into its own dedicated
> series, outside of this kexec fix series.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/hv.c         |   3 +
>  drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
>  2 files changed, 103 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index ae60fd542292..ef4b1b03395d 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -272,6 +272,9 @@ void hv_synic_free(void)
>  /*
>   * hv_hyp_synic_enable_regs - Initialize the Synthetic Interrupt Controller
>   * with the hypervisor.
> + *
> + * Note: When MSHV is present, mshv_synic_cpu_init() intializes further
> + * registers later.
>   */
>  void hv_hyp_synic_enable_regs(unsigned int cpu)
>  {
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index e2288a726fec..2db3b0192eac 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -13,6 +13,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/hyperv.h>
>  #include <linux/reboot.h>
>  #include <asm/mshyperv.h>
>  #include <linux/acpi.h>
> @@ -456,46 +457,75 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
>  	union hv_synic_sint sint;
> -	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  			&spages->synic_event_flags_page;
>  	struct hv_synic_event_ring_page **event_ring_page =
>  			&spages->synic_event_ring_page;
> +	/*
> +	 * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
> +	 * See hv_hyp_synic_enable_regs() for that initialization.
> +	 */
> +	bool vmbus_active = hv_vmbus_exists();
> 
> -	/* Setup the Synic's message page */
> +	/*
> +	 * Map the SYNIC message page. When VMBus is not active the
> +	 * hypervisor pre-provisions the SIMP GPA but may not set
> +	 * simp_enabled - enable it here.
> +	 */
>  	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> -	simp.simp_enabled = true;
> +	if (!vmbus_active) {
> +		simp.simp_enabled = true;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	}
>  	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
>  			     HV_HYP_PAGE_SIZE,
>  			     MEMREMAP_WB);
> 
>  	if (!(*msg_page))
> -		return -EFAULT;
> -
> -	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +		goto cleanup_simp;
> 
> -	/* Setup the Synic's event flags page */
> +	/*
> +	 * Map the event flags page. Same as SIMP: enable when
> +	 * VMBus is not active, already enabled by VMBus otherwise.
> +	 */
>  	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> -	siefp.siefp_enabled = true;
> -	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
> -				     PAGE_SIZE, MEMREMAP_WB);
> +	if (!vmbus_active) {
> +		siefp.siefp_enabled = true;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	}
> +	*event_flags_page = memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
> +				     HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> 
>  	if (!(*event_flags_page))
> -		goto cleanup;
> -
> -	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +		goto cleanup_siefp;
> 
>  	/* Setup the Synic's event ring page */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> -	sirbp.sirbp_enabled = true;
> -	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> -				    PAGE_SIZE, MEMREMAP_WB);
> 
> -	if (!(*event_ring_page))
> -		goto cleanup;
> +	if (hv_root_partition()) {
> +		*event_ring_page = memremap(sirbp.base_sirbp_gpa <<
> HV_HYP_PAGE_SHIFT,
> +					    HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> 
> +		if (!(*event_ring_page))
> +			goto cleanup_siefp;
> +	} else {
> +		/*
> +		 * On L1VH the hypervisor does not provide a SIRBP page.
> +		 * Allocate one and program its GPA into the MSR.
> +		 */
> +		*event_ring_page = (struct hv_synic_event_ring_page *)
> +			get_zeroed_page(GFP_KERNEL);
> +
> +		if (!(*event_ring_page))
> +			goto cleanup_siefp;
> +
> +		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
> +				>> HV_HYP_PAGE_SHIFT;
> +	}
> +
> +	sirbp.sirbp_enabled = true;
>  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> 
>  	if (mshv_sint_irq != -1)
> @@ -518,28 +548,30 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
> 
> -	/* Enable global synic bit */
> -	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> -	sctrl.enable = 1;
> -	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	/* When VMBus is active it already enabled SCONTROL. */
> +	if (!vmbus_active) {
> +		union hv_synic_scontrol sctrl;
> +
> +		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +		sctrl.enable = 1;
> +		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	}
> 
>  	return 0;
> 
> -cleanup:
> -	if (*event_ring_page) {
> -		sirbp.sirbp_enabled = false;
> -		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> -		memunmap(*event_ring_page);
> -	}
> -	if (*event_flags_page) {
> +cleanup_siefp:
> +	if (*event_flags_page)
> +		memunmap(*event_flags_page);
> +	if (!vmbus_active) {
>  		siefp.siefp_enabled = false;
>  		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> -		memunmap(*event_flags_page);
>  	}
> -	if (*msg_page) {
> +cleanup_simp:
> +	if (*msg_page)
> +		memunmap(*msg_page);
> +	if (!vmbus_active) {
>  		simp.simp_enabled = false;
>  		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> -		memunmap(*msg_page);
>  	}
> 
>  	return -EFAULT;
> @@ -548,16 +580,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  static int mshv_synic_cpu_exit(unsigned int cpu)
>  {
>  	union hv_synic_sint sint;
> -	union hv_synic_simp simp;
> -	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
> -	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  		&spages->synic_event_flags_page;
>  	struct hv_synic_event_ring_page **event_ring_page =
>  		&spages->synic_event_ring_page;
> +	/* VMBus owns SIMP/SIEFP/SCONTROL when it is active */
> +	bool vmbus_active = hv_vmbus_exists();
> 
>  	/* Disable the interrupt */
>  	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 +
> HV_SYNIC_INTERCEPTION_SINT_INDEX);
> @@ -574,28 +605,47 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
>  	if (mshv_sint_irq != -1)
>  		disable_percpu_irq(mshv_sint_irq);
> 
> -	/* Disable Synic's event ring page */
> +	/* Disable SYNIC event ring page owned by MSHV */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
>  	sirbp.sirbp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> -	memunmap(*event_ring_page);
> 
> -	/* Disable Synic's event flags page */
> -	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> -	siefp.siefp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	if (hv_root_partition()) {
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		memunmap(*event_ring_page);
> +	} else {
> +		sirbp.base_sirbp_gpa = 0;
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		free_page((unsigned long)*event_ring_page);
> +	}
> +
> +	/*
> +	 * Release our mappings of the message and event flags pages.
> +	 * When VMBus is not active, we enabled SIMP/SIEFP - disable
> +	 * them. Otherwise VMBus owns the MSRs - leave them.
> +	 */
>  	memunmap(*event_flags_page);
> +	if (!vmbus_active) {
> +		union hv_synic_simp simp;
> +		union hv_synic_siefp siefp;
> 
> -	/* Disable Synic's message page */
> -	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> -	simp.simp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +		siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> +		siefp.siefp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +
> +		simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> +		simp.simp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	}
>  	memunmap(*msg_page);
> 
> -	/* Disable global synic bit */
> -	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> -	sctrl.enable = 0;
> -	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	/* When VMBus is active it owns SCONTROL - leave it. */
> +	if (!vmbus_active) {
> +		union hv_synic_scontrol sctrl;
> +
> +		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +		sctrl.enable = 0;
> +		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	}
> 
>  	return 0;
>  }
> --
> 2.43.0
> 


^ permalink raw reply

* RE: [PATCH v3 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Michael Kelley @ 2026-05-04 15:09 UTC (permalink / raw)
  To: Jork Loeser, linux-hyperv@vger.kernel.org
  Cc: x86@kernel.org, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, Anirudh Rayabharam,
	Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260408013645.286723-4-jloeser@linux.microsoft.com>

From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Tuesday, April 7, 2026 6:37 PM
> 
> After a kexec the logical processors and virtual processors already
> exist in the hypervisor because they were created by the previous
> kernel. Attempting to add them again causes either a BUG_ON or
> corrupted VP state leading to MCEs in the new kernel.
> 
> Add hv_lp_exists() to probe whether an LP is already present by
> calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
> LP exists and we skip the add-LP and create-VP loops entirely.
> 
> Also add hv_call_notify_all_processors_started() which informs the
> hypervisor that all processors are online. This is required after
> adding LPs (fresh boot) and is a no-op on kexec since we skip that
> path.

Adding hv_call_notify_all_processors_started() seems like it should be
a separate patch. And this paragraph in the commit message leaves me
with questions:  Is it really "required"?  If it is, how does the existing
upstream code ever work? Does the change need to be backported
to stable kernels? If it isn't *really* required, what are the implications
of not doing it?

> 
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
> Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c |  7 +++++
>  drivers/hv/hv_proc.c           | 47 ++++++++++++++++++++++++++++++++++
>  include/asm-generic/mshyperv.h | 10 ++++++++
>  include/hyperv/hvgdk_mini.h    |  1 +
>  include/hyperv/hvhdk_mini.h    | 12 +++++++++
>  5 files changed, 77 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index e498b6b2ef19..b5b6a58b67b0 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -431,6 +431,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
>  	}
> 
>  #ifdef CONFIG_X86_64
> +	/* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
> +	if (num_present_cpus() == 1 || hv_lp_exists(1))
> +		return;
> +
>  	for_each_present_cpu(i) {
>  		if (i == 0)
>  			continue;
> @@ -438,6 +442,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
>  		BUG_ON(ret);
>  	}
> 
> +	ret = hv_call_notify_all_processors_started();
> +	WARN_ON(ret);
> +
>  	for_each_present_cpu(i) {
>  		if (i == 0)
>  			continue;

An observation:  hv_smp_prepare_cpus() is getting to be a bit of a mess.
It handles both the SNP case and the root case, which aren't really related.
I could envision having hv_smp_prepare_cpus_for_snp() and
hv_smp_prepare_cpus_for_root() in order to separate the two cases
cleanly.

Then hv_smp_prepare_cpus_for_root() calls four functions in hv_proc.c,
all of which require stubs for the case where MSHV root isn't being built.
Better would be to move the root version of prepare CPUs functionality
into a new function in hv_proc.c, and only have a stub for that single
function. Three of the other four called functions could then become static.
The #ifdef CONFIG_X86_64 could also go away since hv_proc.c is only
built for x64.

I'll probably submit a separate patch to implement these suggested
cleanups, unless someone else wants to do it first.

Michael

> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 3cb4b2a3035c..57b2c64197cb 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(hv_call_create_vp);
> +
> +int hv_call_notify_all_processors_started(void)
> +{
> +	struct hv_input_notify_partition_event *input;
> +	u64 status;
> +	unsigned long irq_flags;
> +	int ret = 0;
> +
> +	local_irq_save(irq_flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
> +				 input, NULL);
> +	local_irq_restore(irq_flags);
> +
> +	if (!hv_result_success(status)) {
> +		hv_status_err(status, "\n");
> +		ret = hv_result_to_errno(status);
> +	}
> +	return ret;
> +}
> +
> +bool hv_lp_exists(u32 lp_index)
> +{
> +	struct hv_input_get_logical_processor_run_time *input;
> +	struct hv_output_get_logical_processor_run_time *output;
> +	unsigned long flags;
> +	u64 status;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	input->lp_index = lp_index;
> +	status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
> +				 input, output);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status) &&
> +	    hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
> +		hv_status_err(status, "\n");
> +		BUG();
> +	}
> +
> +	return hv_result_success(status);
> +}
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index d37b68238c97..bf601d67cecb 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
>  int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
>  int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> +int hv_call_notify_all_processors_started(void);
> +bool hv_lp_exists(u32 lp_index);
>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> 
>  #else /* CONFIG_MSHV_ROOT */
> @@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
>  {
>  	return -EOPNOTSUPP;
>  }
> +static inline int hv_call_notify_all_processors_started(void)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline bool hv_lp_exists(u32 lp_index)
> +{
> +	return false;
> +}
>  static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>  {
>  	return -EOPNOTSUPP;
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index f9600f87186a..6a4e8b9d570f 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents {	 /*
> HV_REGISTER_VP_ASSIST_PAGE */
>  /* HV_CALL_CODE */
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE		0x0002
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST		0x0003
> +#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME		0x0004
>  #define HVCALL_NOTIFY_LONG_SPIN_WAIT			0x0008
>  #define HVCALL_SEND_IPI					0x000b
>  #define HVCALL_ENABLE_VP_VTL				0x000f
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 091c03e26046..b4cb2fa26e9b 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -362,6 +362,7 @@ union hv_partition_event_input {
> 
>  enum hv_partition_event {
>  	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
> +	HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
>  };
> 
>  struct hv_input_notify_partition_event {
> @@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
>  	union hv_partition_event_input input;
>  } __packed;
> 
> +struct hv_input_get_logical_processor_run_time {
> +	u32 lp_index;
> +} __packed;
> +
> +struct hv_output_get_logical_processor_run_time {
> +	u64 global_time;
> +	u64 local_run_time;
> +	u64 rsvdz0;
> +	u64 hypervisor_time;
> +} __packed;
> +
>  struct hv_lp_startup_status {
>  	u64 hv_status;
>  	u64 substatus1;
> --
> 2.43.0
> 


^ permalink raw reply

* RE: [PATCH v3 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Michael Kelley @ 2026-05-04 15:08 UTC (permalink / raw)
  To: Jork Loeser, linux-hyperv@vger.kernel.org
  Cc: x86@kernel.org, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, Anirudh Rayabharam
In-Reply-To: <20260408013645.286723-3-jloeser@linux.microsoft.com>

From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Tuesday, April 7, 2026 6:37 PM
> 
> Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
> hv_machine_shutdown() in the platform code. This ensures stimer cleanup
> happens before the vmbus unload, which is required for root partition
> kexec to work correctly.

I know I'm late to the party in commenting on this patch set. But I'll add
my $.02 anyway on this patch and a couple others in the set.

I had difficulty understanding this patch's intent. The commit message
says the intent is to make sure stimer cleanup happens before VMBus
unload. But at first glance, the code in this patch doesn't change the
ordering. It appears to be immaterial whether hv_stimer_global_cleanup()
is called just before invoking hv_kexec_handler(), or as the first step of
hv_kexec_handler().

But then I realized that hv_kexec_handler isn't set unless the VMBus driver
is loaded and initialized, and that doesn't happen in the root partition.
In the old code, stimer cleanup is dependent on the VMBus driver being
loaded, and that's a wrong dependency, as stimer operation is independent
of VMBus (almost -- there's the old non-direct mode stimer path that
depends on the VMBus interrupt handler, which muddies the conceptual
separation). So the patch does the right thing, but the commit message
doesn't make the real reasoning clear. My feedback would be to improve
the commit message.

Michael

> 
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
>  drivers/hv/vmbus_drv.c         | 1 -
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index a7dfc29d3470..e498b6b2ef19 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -237,8 +237,12 @@ void hv_remove_crash_handler(void)
>  #ifdef CONFIG_KEXEC_CORE
>  static void hv_machine_shutdown(void)
>  {
> -	if (kexec_in_progress && hv_kexec_handler)
> -		hv_kexec_handler();
> +	if (kexec_in_progress) {
> +		hv_stimer_global_cleanup();
> +
> +		if (hv_kexec_handler)
> +			hv_kexec_handler();
> +	}

Immediately following your code change is this code:

        if (kexec_in_progress)
                cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);

So with this change there are two "if (kexec_in_progress)"
statements in a row. It's not wrong, but it looks like the
patch wasn't fully integrated into the existing code. The
cpuhp_remove_state() and the comment should be moved
under the newly added "if (kexec_in_progress)" and the
duplicate "if (kexec_in_progress)" can be dropped.

Michael

> 
>  	/*
>  	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 5e7a6839c933..c5dfe9f3b206 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2891,7 +2891,6 @@ static struct platform_driver vmbus_platform_driver = {
> 
>  static void hv_kexec_handler(void)
>  {
> -	hv_stimer_global_cleanup();
>  	vmbus_initiate_unload(false);
>  	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
>  	mb();
> --
> 2.43.0
> 


^ permalink raw reply

* Re: [PATCH 0/3] net: mana: Fix mana_destroy_rxq() cleanup for partial RXQ init
From: Dipayaan Roy @ 2026-05-03  3:38 UTC (permalink / raw)
  To: Simon Horman
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260502165258.GM15617@horms.kernel.org>

On Sat, May 02, 2026 at 05:52:58PM +0100, Simon Horman wrote:
> On Wed, Apr 29, 2026 at 08:57:51PM -0700, Dipayaan Roy wrote:
> > When mana_create_rxq() fails partway through initialization (e.g. the
> > hardware rejects the WQ object creation), the error path calls
> > mana_destroy_rxq() to tear down a partially-initialized RXQ.
> > This exposed multiple issues in mana_destroy_rxq() path, as it assumed
> > the RXQ was always fully initialized, leading to multiple issues:
> > 
> > 1. xdp_rxq_info_unreg() was called on an unregistered xdp_rxq,
> >    triggering a WARN_ON ("Driver BUG") in net/core/xdp.c.
> > 
> > 2. mana_destroy_wq_obj() was called with INVALID_MANA_HANDLE,
> >    sending a bogus destroy command to the hardware.
> > 
> > 3. mana_deinit_cq() was called twice — once inside mana_destroy_rxq()
> >    and again in mana_create_rxq()'s error path — causing a
> >    use-after-free since mana_destroy_rxq() frees the rxq first.
> > 
> > This was observed during ethtool ring parameter changes when the
> > hardware returned an error creating the RXQ. This series makes
> > mana_destroy_rxq() safe to call at any stage of RXQ initialization
> > by guarding each teardown step, and removes the redundant cleanup
> > in mana_create_rxq().
> 
> For the series:
> 
> Reviewed-by: Simon Horman <horms@kernel.org>
> 
> I don't think that you need to repost for this. But please keep in mind for
> future submissions that fixes for code present in the net tree should be
> targeted at that tree, like this:
> 
> Subject: [PATCH net vN/M] ...
Thanks Simon,
it was a typo from my end.

> 
> Also, FTR, there is an AI generated review of this patch-set available
> on sashiko.dev. It seems to me that the issues flagged there pre-date
> this patch-set and should not block progress of it. But you may wish
> to use that review as the basis of some follow-up.
Agreed, Sashiko flagged a pre-exisitng issue in the tx path.
I will send that as a separate patch set.

Thanks and Regards
Dipayaan Roy



^ permalink raw reply

* Re: [PATCH net, v2] net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
From: Dipayaan Roy @ 2026-05-03  3:03 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260501185324.0f02dc72@kernel.org>

On Fri, May 01, 2026 at 06:53:24PM -0700, Jakub Kicinski wrote:
> On Wed, 29 Apr 2026 11:57:55 -0700 Dipayaan Roy wrote:
> > During Function Level Reset recovery, the MANA driver reads
> > hardware BAR0 registers that may temporarily contain garbage values.
> > The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
> > to compute gc->shm_base, which is later dereferenced via readl() in
> > mana_smc_poll_register(). If the hardware returns an unaligned or
> > out-of-range value, the driver must not blindly use it, as this would
> > propagate the hardware error into a kernel crash.
> > 
> > The following crash was observed on an arm64 Hyper-V guest running
> > kernel 6.17.0-3013-azure during VF reset recovery triggered by HWC
> > timeout.
> > 
> > [13291.785274] Unable to handle kernel paging request at virtual address ffff8000a200001b
> > [13291.785311] Mem abort info:
> > [13291.785332]   ESR = 0x0000000096000021
> > [13291.785343]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [13291.785355]   SET = 0, FnV = 0
> > [13291.785363]   EA = 0, S1PTW = 0
> > [13291.785372]   FSC = 0x21: alignment fault
> > [13291.785382] Data abort info:
> > [13291.785391]   ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
> > [13291.785404]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [13291.785412]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [13291.785421] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000014df3a1000
> > [13291.785432] [ffff8000a200001b] pgd=1000000100438403, p4d=1000000100438403, pud=1000000100439403, pmd=0068000fc2000711
> > [13291.785703] Internal error: Oops: 0000000096000021 [#1]  SMP
> > [13291.830975] Modules linked in: tls qrtr mana_ib ib_uverbs ib_core xt_owner xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables cfg80211 8021q garp mrp stp llc binfmt_misc joydev serio_raw nls_iso8859_1 hid_generic aes_ce_blk aes_ce_cipher polyval_ce ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher hid_hyperv sm4 sm3_ce sha3_ce hv_netvsc hid vmgenid hyperv_keyboard hyperv_drm sch_fq_codel nvme_fabrics efi_pstore dm_multipath nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vmw_vmci vsock dmi_sysfs ip_tables x_tables autofs4
> > [13291.862630] CPU: 122 UID: 0 PID: 61796 Comm: kworker/122:2 Tainted: G        W           6.17.0-3013-azure #13-Ubuntu VOLUNTARY
> > [13291.869902] Tainted: [W]=WARN
> > [13291.871901] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 01/08/2026
> > [13291.878086] Workqueue: events mana_serv_func
> > [13291.880718] pstate: 62400005 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
> > [13291.884835] pc : mana_smc_poll_register+0x48/0xb0
> > [13291.887902] lr : mana_smc_setup_hwc+0x70/0x1c0
> > [13291.890493] sp : ffff8000ab79bbb0
> > [13291.892364] x29: ffff8000ab79bbb0 x28: ffff00410c8b5900 x27: ffff00410d630680
> > [13291.896252] x26: ffff004171f9fd80 x25: 000000016ed55000 x24: 000000017f37e000
> > [13291.899990] x23: 0000000000000000 x22: 000000016ed55000 x21: 0000000000000000
> > [13291.904497] x20: ffff8000a200001b x19: 0000000000004e20 x18: ffff8000a6183050
> > [13291.908308] x17: 0000000000000000 x16: 0000000000000000 x15: 000000000000000a
> > [13291.912542] x14: 0000000000000004 x13: 0000000000000000 x12: 0000000000000000
> > [13291.916298] x11: 0000000000000000 x10: 0000000000000001 x9 : ffffc45006af1bd8
> > [13291.920945] x8 : ffff000151129000 x7 : 0000000000000000 x6 : 0000000000000000
> > [13291.925293] x5 : 000000015f214000 x4 : 000000017217a000 x3 : 000000016ed50000
> > [13291.930436] x2 : 000000016ed55000 x1 : 0000000000000000 x0 : ffff8000a1ffffff
> > [13291.934342] Call trace:
> > [13291.935736]  mana_smc_poll_register+0x48/0xb0 (P)
> > [13291.938611]  mana_smc_setup_hwc+0x70/0x1c0
> > [13291.941113]  mana_hwc_create_channel+0x1a0/0x3a0
> > [13291.944283]  mana_gd_setup+0x16c/0x398
> > [13291.946584]  mana_gd_resume+0x24/0x70
> > [13291.948917]  mana_do_service+0x13c/0x1d0
> > [13291.951583]  mana_serv_func+0x34/0x68
> > [13291.953732]  process_one_work+0x168/0x3d0
> > [13291.956745]  worker_thread+0x2ac/0x480
> > [13291.959104]  kthread+0xf8/0x110
> > [13291.961026]  ret_from_fork+0x10/0x20
> > [13291.963560] Code: d2807d00 9417c551 71000673 54000220 (b9400281)
> > [13291.967299] ---[ end trace 0000000000000000 ]---
> > 
> > Disassembly of mana_smc_poll_register() around the crash site:
> > 
> > Disassembly of section .text:
> > 
> > 00000000000047c8 <mana_smc_poll_register>:
> >     47c8: d503201f        nop
> >     47cc: d503201f        nop
> >     47d0: d503233f        paciasp
> >     47d4: f800865e        str     x30, [x18], #8
> >     47d8: a9bd7bfd        stp     x29, x30, [sp, #-48]!
> >     47dc: 910003fd        mov     x29, sp
> >     47e0: a90153f3        stp     x19, x20, [sp, #16]
> >     47e4: 91007014        add     x20, x0, #0x1c
> >     47e8: 5289c413        mov     w19, #0x4e20
> >     47ec: f90013f5        str     x21, [sp, #32]
> >     47f0: 12001c35        and     w21, w1, #0xff
> >     47f4: 14000008        b       4814 <mana_smc_poll_register+0x4c>
> >     47f8: 36f801e1  tbz  w1, #31, 4834 <mana_smc_poll_register+0x6c>
> >     47fc: 52800042        mov     w2, #0x2
> >     4800: d280fa01        mov     x1, #0x7d0
> >     4804: d2807d00        mov     x0, #0x3e8
> >     4808: 94000000        bl      0 <usleep_range_state>
> >     480c: 71000673        subs    w19, w19, #0x1
> >     4810: 54000200        b.eq    4850 <mana_smc_poll_register+0x88>
> >     4814: b9400281      ldr   w1, [x20] <-- **** CRASHED HERE *****
> >     4818: d50331bf        dmb     oshld
> >     481c: 2a0103e2        mov     w2, w1
> >     ...
> > 
> > From the crash signature x20 = ffff8000a200001b, this address
> > ends in 0x1b which is not 4-byte aligned, so the 'ldr w1, [x20]'
> > instruction (readl) triggers the arm64 alignment fault (FSC = 0x21).
> > 
> > The root cause is in mana_gd_init_vf_regs(), which computes:
> > 
> >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > 
> > The offset is used without any validation.  The same problem exists
> > in mana_gd_init_pf_regs() for sriov_base_off and sriov_shm_off.
> > 
> > Fix this by validating all offsets before use:
> > 
> > - VF: check shm_off is within BAR0, properly aligned to 4 bytes
> >   (readl requirement), and leaves room for the full 256-bit
> >   (32-byte) SMC aperture.
> > 
> > - PF: check sriov_base_off is within BAR0, aligned to 8 bytes
> >   (readq requirement), and leaves room to safely read the
> >   sriov_shm_off register at sriov_base_off + GDMA_PF_REG_SHM_OFF.
> >   Then check sriov_shm_off leaves room for the full SMC aperture.
> >   All arithmetic uses subtraction rather than addition to avoid
> >   integer overflow on garbage firmware values.
> > 
> > without validating the offset read from hardware. If the register
> > returns a garbage value that is neither within bar 0 bounds nor aligned
> > to the 4-byte granularity, thus causing the alignment fault.
> > 
> > Define SMC_APERTURE_SIZE (32 bytes, derived from the 256-bit aperture
> > width)
> > 
> > Return -EPROTO on invalid values.  The existing recovery path in
> > mana_serv_reset() already handles -EPROTO by falling through to PCI
> > device rescan, giving the hardware another chance to present valid
> > register values after reset.
> > 
> > Fixes: 9bf66036d686 ("net: mana: Handle hardware recovery events when probing the device")
> > Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> > 
> > ---
> > Changes in v2:
> > - Fix sriov_base_off alignment check: sizeof(u32) to sizeof(u64), since
> >   mana_gd_r64() (readq) requires 8-byte alignment on arm64.
> > - Fix sriov_base_off bounds: also verify enough space remains in BAR0
> >   to safely read sriov_shm_off at offset GDMA_PF_REG_SHM_OFF + 8 bytes.
> > - Fix integer overflow: rewrite bounds checks using subtraction
> >   (remaining = bar0_size - base) instead of addition.
> > - Fix SMC aperture size: add gc->bar0_size - shm_off < SMC_APERTURE_SIZE
> >   checks in both VF and PF paths; previously only the start address was
> >   validated, but mana_smc_poll_register() accesses up to shm_base + 0x1c
> >   (28 bytes from base, 32 bytes total).
> > - Export SMC_APERTURE_SIZE to shm_channel.h.
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 40 ++++++++++++++++---
> >  include/net/mana/shm_channel.h                |  6 +++
> >  2 files changed, 41 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..d8e816882f02 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -43,8 +43,9 @@ static u64 mana_gd_r64(struct gdma_context *g, u64 offset)
> >  static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > -	void __iomem *sriov_base_va;
> > +	u64 remaining_barsize;
> >  	u64 sriov_base_off;
> > +	u64 sriov_shm_off;
> >  
> >  	gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
> >  
> > @@ -73,10 +74,28 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> >  	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >  
> >  	sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> > +	if (sriov_base_off >= gc->bar0_size ||
> > +	    gc->bar0_size - sriov_base_off <
> > +		GDMA_PF_REG_SHM_OFF + sizeof(u64) ||
> 
> nit: fits on a single line, I think?
>
It goes beyond the limit of 80, hence did it this way.
 
> > +	    !IS_ALIGNED(sriov_base_off, sizeof(u64))) {
> > +		dev_err(gc->dev,
> > +			"SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > +			sriov_base_off, (u64)gc->bar0_size);
> > +		return -EPROTO;
> > +	}
> >  
> > -	sriov_base_va = gc->bar0_va + sriov_base_off;
> > -	gc->shm_base = sriov_base_va +
> > -			mana_gd_r64(gc, sriov_base_off + GDMA_PF_REG_SHM_OFF);
> > +	remaining_barsize = gc->bar0_size - sriov_base_off;
> > +	sriov_shm_off = mana_gd_r64(gc, sriov_base_off + GDMA_PF_REG_SHM_OFF);
> > +	if (sriov_shm_off >= remaining_barsize ||
> > +	    remaining_barsize - sriov_shm_off < SMC_APERTURE_SIZE ||
> > +	    !IS_ALIGNED(sriov_shm_off, sizeof(u32))) {
> > +		dev_err(gc->dev,
> > +			"SRIOV SHM offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > +			sriov_shm_off, (u64)gc->bar0_size);
> > +		return -EPROTO;
> > +	}
> > +
> > +	gc->shm_base = gc->bar0_va + sriov_base_off + sriov_shm_off;
> >  
> >  	return 0;
> >  }
> > @@ -84,6 +103,7 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> >  static int mana_gd_init_vf_regs(struct pci_dev *pdev)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > +	u64 shm_off;
> >  
> >  	gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
> >  
> > @@ -111,7 +131,17 @@ static int mana_gd_init_vf_regs(struct pci_dev *pdev)
> >  	gc->db_page_base = gc->bar0_va + gc->db_page_off;
> >  	gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >  
> > -	gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > +	shm_off = mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > +	if (shm_off >= gc->bar0_size ||
> > +	    gc->bar0_size - shm_off < SMC_APERTURE_SIZE ||
> > +	    !IS_ALIGNED(shm_off, sizeof(u32))) {
> > +		dev_err(gc->dev,
> > +			"SHM offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > +			shm_off, (u64)gc->bar0_size);
> > +		return -EPROTO;
> > +	}
> > +
> > +	gc->shm_base = gc->bar0_va + shm_off;
> >  
> >  	return 0;
> >  }
> > diff --git a/include/net/mana/shm_channel.h b/include/net/mana/shm_channel.h
> > index 5199b41497ff..dbabcfb95daf 100644
> > --- a/include/net/mana/shm_channel.h
> > +++ b/include/net/mana/shm_channel.h
> > @@ -4,6 +4,12 @@
> >  #ifndef _SHM_CHANNEL_H
> >  #define _SHM_CHANNEL_H
> >  
> > +#define SMC_APERTURE_BITS 256
> > +#define SMC_BASIC_UNIT (sizeof(u32))
> > +#define SMC_APERTURE_DWORDS (SMC_APERTURE_BITS / (SMC_BASIC_UNIT * 8))
> > +#define SMC_LAST_DWORD (SMC_APERTURE_DWORDS - 1)
> > +#define SMC_APERTURE_SIZE  (SMC_APERTURE_BITS / 8)
> 
> AI bots complain that we're redefining this.
> Since it's a fix I think it's better to remove the existing definition
> even if it lives in a driver that goes via a different tree.
> 
Ack, removed this in the next version.
> >  struct shm_channel {
> >  	struct device *dev;
> >  	void __iomem *base;
> -- 
> pw-bot: cr

Hi Jakub,

Thanks for the comments, I have shared v3 addressing it.

Regards
Dipayaan Roy


^ permalink raw reply

* Re: [PATCH rc 00/15] Various bug fixes for RDMA drivers in the uapi functions
From: Jason Gunthorpe @ 2026-05-02 18:39 UTC (permalink / raw)
  To: Andrew Lunn, Broadcom internal kernel review list, Bryan Tan,
	Eric Dumazet, Junxian Huang, Konstantin Taranov, Jakub Kicinski,
	Leon Romanovsky, linux-hyperv, linux-rdma, netdev, Paolo Abeni,
	Selvin Xavier, Chengchang Tang, Tariq Toukan, Vishnu Dasa,
	Yishai Hadas
  Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
	Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
	Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
	Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Long Li,
	Lijun Ou, Parav Pandit, patches, Roland Dreier, Roland Dreier,
	Sagi Grimberg, Ajay Sharma, stable, Tariq Toukan, Wei Hu (Xavier),
	Shaobo Xu, Nenglong Zhao
In-Reply-To: <0-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>

On Tue, Apr 28, 2026 at 01:17:33PM -0300, Jason Gunthorpe wrote:
> All were found by Sashiko or Claude AI tools. They vary in severity, but
> are all things that shouldn't be present.
> 
> Jason Gunthorpe (15):
>   RDMA/ionic: Fix typo in format string
>   RDMA/mlx5: Restore zero-init to mlx5_ib_modify_qp() ucmd
>   RDMA/mlx5: Add missing store/release for lock elision pattern
>   RDMA/mana: Validate rx_hash_key_len
>   RDMA/mana: Remove user triggerable WARN_ON() in
>     mana_ib_create_qp_rss()
>   RDMA/mana: Fix mana_destroy_wq_obj() cleanup in
>     mana_ib_create_qp_rss()
>   RDMA/mana: Fix error unwind in mana_ib_create_qp_rss()
>   RDMA/ocrdma: Clarify the mm_head searching
>   RDMA/ocrdma: Don't NULL deref uctx on errors in ocrdma_copy_pd_uresp()
>   RDMA/vmw_pvrdma: Fix double free on pvrdma_alloc_ucontext() error path
>   RDMA/mlx4: Fix resource leak on error in mlx4_ib_create_srq()
>   RDMA/mlx4: Fix mis-use of RCU in mlx4_srq_event()
>   RDMA/hns: Fix xarray race in hns_roce_create_srq()
>   RDMA/hns: Fix xarray race in hns_roce_create_qp_common()
>   RDMA/hns: Fix unlocked call to hns_roce_qp_remove()
> 
>  drivers/infiniband/hw/hns/hns_roce_qp.c         | 13 ++++++++++---
>  drivers/infiniband/hw/hns/hns_roce_srq.c        | 12 ++++++------
>  drivers/infiniband/hw/ionic/ionic_ibdev.c       |  2 +-
>  drivers/infiniband/hw/mana/cq.c                 |  5 +++--
>  drivers/infiniband/hw/mana/qp.c                 | 16 ++++++++++------
>  drivers/infiniband/hw/mlx4/srq.c                |  4 +++-
>  drivers/infiniband/hw/mlx5/main.c               |  8 ++++----
>  drivers/infiniband/hw/mlx5/qp.c                 |  2 +-
>  drivers/infiniband/hw/mlx5/umr.c                |  4 ++--
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |  8 ++++----
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c |  2 +-
>  drivers/net/ethernet/mellanox/mlx4/srq.c        | 13 +++++++------
>  12 files changed, 52 insertions(+), 37 deletions(-)

Applied to for-rc

Jason

^ permalink raw reply

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-05-02 17:15 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <afYMN6vbiX7Rzss+@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > In mana driver, the number of IRQs allocated is capped by the
> > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > their NUMA/core bindings.
> > > 
> > > This is important, especially in the envs where number of vCPUs are so
> > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > much more than their overheads if they were spread across sibling vCPUs.
> > > 
> > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > IRQs are assigned at a later stage compared to static allocation, other
> > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > same vCPU, while some vCPUs have none.
> > > 
> > > In such cases when many parallel TCP connections are tested, the
> > > throughput drops significantly.
> > > 
> > > Test envs:
> > > =======================================================
> > > Case 1: without this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > > 	TYPE		effective vCPU aff
> > > =======================================================
> > > IRQ0:	HWC		0
> > > IRQ1:	mana_q1		0
> > > IRQ2:	mana_q2		2
> > > IRQ3:	mana_q3		0
> > > IRQ4:	mana_q4		3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU		0	1	2	3
> > > =======================================================
> > > pass 1:		38.85	0.03	24.89	24.65
> > > pass 2:		39.15	0.03	24.57	25.28
> > > pass 3:		40.36	0.03	23.20	23.17
> > > 
> > > =======================================================
> > > Case 2: with this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > 
> > >         TYPE            effective vCPU aff
> > > =======================================================
> > > IRQ0:   HWC             0
> > > IRQ1:   mana_q1         0
> > > IRQ2:   mana_q2         1
> > > IRQ3:   mana_q3         2
> > > IRQ4:   mana_q4         3
> > > 
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU            0       1       2       3
> > > =======================================================
> > > pass 1:         15.42	15.85	14.99	14.51
> > > pass 2:         15.53	15.94	15.81	15.93
> > > pass 3:         16.41	16.35	16.40	16.36
> > > 
> > > =======================================================
> > > Throughput Impact(in Gbps, same env)
> > > =======================================================
> > > TCP conn	with patch	w/o patch
> > > 20480		15.65		7.73
> > > 10240		15.63		8.93
> > > 8192		15.64		9.69
> > > 6144		15.64		13.16
> > > 4096		15.69		15.75
> > > 2048		15.69		15.83
> > > 1024		15.71		15.28
> > > 
> > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > Cc: stable@vger.kernel.org
> > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > > Changes in v2
> > >  * Removed the unused skip_first_cpu variable
> > >  * fixed exit condition in irq_setup_linear() with len == 0
> > >  * changed return type of irq_setup_linear() as it will always be 0
> > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > >  * added appropriate comments to indicate expected behaviour when
> > >    IRQs are more than or equal to num_online_cpus()
> > > ---
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 098fbda0d128..d740d1dc43da 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > >  	} else {
> > >  		/* If dynamic allocation is enabled we have already allocated
> > >  		 * hwc msi
> > > +		 * Also, we make sure in this case the following is always true
> > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > >  		 */
> > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > >  	}
> > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > >  	return 0;
> > >  }
> > >  
> > > +/* should be called with cpus_read_lock() held */
> > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > +{
> > > +	int cpu;
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		if (len == 0)
> > > +			break;
> > > +
> > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > +		len--;
> > > +	}
> > > +}
> > > +
> > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  {
> > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > >  	struct gdma_irq_context *gic;
> > > -	bool skip_first_cpu = false;
> > >  	int *irqs, irq, err, i;
> > >  
> > >  	irqs = kmalloc_objs(int, nvec);
> > 
> > So what about WARN_ON() and nvec adjustment before kmalloc?
> Hey Yury,
> 
> I am still a bit unsure about the WARN_ON() before kmalloc, as after
> that also, in the same function till we take the cpus_read_lock() the
> num_online_cpus() can change(or reduce). That's why I introduced the
> dev_dbg() to capture hot-remove edge case.

OK.
 
> Do you still think it adds more value?

It's your driver, so you know better. I just wonder because you said
it's good to add WARN_ON(), and then didn't do that.

> > 
> > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > >  	 */
> > >  	cpus_read_lock();
> > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > -		skip_first_cpu = true;
> > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > +		if (err) {
> > > +			cpus_read_unlock();
> > > +			goto free_irq;
> > 
> > One thing puzzles me: if you skip first CPU with this 'true', and the
> > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > distribute. What do I miss?
> > 
> 
> Let me explain this case a bit better then,
> 
> - num_msix_usable = HWC IRQ + Queue IRQ
> - nvec in this functions is only Queue IRQ (HWC already setup)
> 
> When num_online_cpus == num_msix_usable:
> - nvec = num_online_cpus - 1
> - first CPU is already assigned to HWC IRQ, so skip it
> - Queue IRQs fit in the remaining CPUs
> 
> please let me know if I did not get your question right

Can you put that in a comment?

> > > +		}
> > > +	} else {
> > > +		/*
> > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > +		 * CPU core affinity does not matter.
> > 
> > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > In theory, the system would have most of flexibility to balance them.
> > 
> 
> Okay, let me fix the comment and elaborate on this. It doesn't matter
> because in such a case we want to anyway exhaust and distribute the
> Queue IRQs to all vCPUs.
> We don't want to rely on the system's balancer in this case as it could
> be skewed by other devices' IRQ weights

I don't understand this. If I want to reserve some CPUs to solely
handle IRQs from my high-priority hardware, then I configure my system
accordingly. For example, assign all non-networking IRQs on CPU0, and
all networking IRQs to all CPUs.

In your case, you distribute IRQs evenly, which means you've no
preferred CPUs. So, assuming the system is only running your IRQ
driver, it's at max is as good as all-CPU distribution. In case of
heavy loading some particular CPU, your scheme could cause
corresponding IRQs to starve.

I recall, when we was working on irq_setup(), the original idea was to
distribute IRQs one-to-one, but than I suggested the 

        irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));

and after experiments, you agreed on that.

Can you please run your throughput test for my suggested distribution
too? Would be also nice to see how each distribution works when some
CPUs are under stress.

Thanks,
Yury

^ permalink raw reply

* Re: [PATCH 0/3] net: mana: Fix mana_destroy_rxq() cleanup for partial RXQ init
From: Simon Horman @ 2026-05-02 16:52 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260430035935.1859220-1-dipayanroy@linux.microsoft.com>

On Wed, Apr 29, 2026 at 08:57:51PM -0700, Dipayaan Roy wrote:
> When mana_create_rxq() fails partway through initialization (e.g. the
> hardware rejects the WQ object creation), the error path calls
> mana_destroy_rxq() to tear down a partially-initialized RXQ.
> This exposed multiple issues in mana_destroy_rxq() path, as it assumed
> the RXQ was always fully initialized, leading to multiple issues:
> 
> 1. xdp_rxq_info_unreg() was called on an unregistered xdp_rxq,
>    triggering a WARN_ON ("Driver BUG") in net/core/xdp.c.
> 
> 2. mana_destroy_wq_obj() was called with INVALID_MANA_HANDLE,
>    sending a bogus destroy command to the hardware.
> 
> 3. mana_deinit_cq() was called twice — once inside mana_destroy_rxq()
>    and again in mana_create_rxq()'s error path — causing a
>    use-after-free since mana_destroy_rxq() frees the rxq first.
> 
> This was observed during ethtool ring parameter changes when the
> hardware returned an error creating the RXQ. This series makes
> mana_destroy_rxq() safe to call at any stage of RXQ initialization
> by guarding each teardown step, and removes the redundant cleanup
> in mana_create_rxq().

For the series:

Reviewed-by: Simon Horman <horms@kernel.org>

I don't think that you need to repost for this. But please keep in mind for
future submissions that fixes for code present in the net tree should be
targeted at that tree, like this:

Subject: [PATCH net vN/M] ...

Also, FTR, there is an AI generated review of this patch-set available
on sashiko.dev. It seems to me that the issues flagged there pre-date
this patch-set and should not block progress of it. But you may wish
to use that review as the basis of some follow-up.

^ permalink raw reply

* Re: [PATCH net-next v6 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Simon Horman @ 2026-05-02 15:26 UTC (permalink / raw)
  To: longli
  Cc: 'Simon Horman', kotaranov, kuba, davem, pabeni, edumazet,
	andrew+netdev, jgg, leon, haiyangz, kys, wei.liu, decui, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260429221625.1841150-3-longli@microsoft.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/

I apologise for any overlap with this and the review of this
patchset available at sashiko.dev. They both seem relevant and
I'm unsure how to cleanly pass them both on.

Both Sashiko instances also flagged concerns in the remaining
patches of this patchset. But I am not forwarding them as they
seem to be lower priority, and I think I've already created
enough noise here.

---
net: mana: Query device capabilities and configure MSI-X sharing for EQs

This patch queries device capabilities early in GDMA probe and chooses
between dedicated MSI-X per vPort or a shared-MSI-X mode based on the
available vector count, introducing gc->max_num_queues_vport,
gc->msi_sharing and gc->msi_bitmap, and renaming mana_query_device_cfg()
to mana_gd_query_device_cfg() so it can be called from the GDMA layer.

> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..b96859e0aec9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -149,6 +149,9 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_query_max_resources_resp resp = {};
>  	struct gdma_general_req req = {};
> +	unsigned int max_num_queues;
> +	u8 bm_hostmode;
> +	u16 num_ports;
>  	int err;
>  
>  	mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
> @@ -194,6 +197,40 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	if (gc->max_num_queues > gc->num_msix_usable - 1)
>  		gc->max_num_queues = gc->num_msix_usable - 1;
>  
> +	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
> +				       MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
> +	if (err)
> +		return err;

Is this the same MANA_QUERY_DEV_CONFIG that mana_probe() issues shortly
after?  mana_probe() still contains:

	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
				       MANA_MICRO_VERSION, &num_ports, &bm_hostmode);

so the same HWC round-trip runs twice on every probe with identical
arguments, both populating gc->adapter_mtu.

The commit message frames this rename as consolidation ("used at GDMA
device probe time"), but the mana_probe() call-site is not removed.  Was
one of the two call-sites meant to go away?

The relocation of debugfs_create_u16("adapter-MTU", ...) from
mana_gd_query_device_cfg() into mana_probe() only makes sense if the
function is now invoked more than once, which hints at the same
duplication.

> +
> +	if (!num_ports)
> +		return -EINVAL;
> +
> +	/*
> +	 * Adjust gc->max_num_queues returned from the SOC to allow dedicated
> +	 * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
> +	 */
> +	max_num_queues = (gc->num_msix_usable - 1) / num_ports;
> +	max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
> +	if (max_num_queues < MANA_DEF_NUM_QUEUES)
> +		max_num_queues = MANA_DEF_NUM_QUEUES;
> +
> +	/*
> +	 * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
> +	 * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
> +	 */
> +	max_num_queues = min(gc->max_num_queues, max_num_queues);
> +	if (max_num_queues * num_ports > gc->num_msix_usable - 1)
> +		gc->msi_sharing = true;

Is gc->msi_sharing ever reset to false?  The only two writers are this
line and mana_gd_setup_hwc_irqs() (the !pci_msix_can_alloc_dyn branch),
and both only set it to true.  mana_gd_remove_irqs() frees msi_bitmap
and zeros max_num_msix / num_msix_usable, but does not clear
msi_sharing, and the gdma_context survives mana_gd_suspend() /
mana_gd_resume().

Once true is latched in any setup cycle, mana_gd_setup() will always
take the sharing branch on resume even if the recomputed
max_num_queues * num_ports fits within num_msix_usable - 1.  Is that
consistent with the commit message saying sharing "is only enabled when
there are not enough MSI-X vectors for dedicated allocation"?

> +
> +	/* If MSI is shared, use max allowed value */
> +	if (gc->msi_sharing)
> +		gc->max_num_queues_vport = min(gc->num_msix_usable - 1, gc->max_num_queues);
> +	else
> +		gc->max_num_queues_vport = max_num_queues;
> +
> +	dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
> +		 gc->msi_sharing, gc->max_num_queues);

Should this print gc->max_num_queues_vport rather than
gc->max_num_queues?  The block immediately above computes
gc->max_num_queues_vport and leaves gc->max_num_queues unchanged, and
mana_probe_port() sizes alloc_etherdev_mq() and apc->max_queues from
gc->max_num_queues_vport:

	ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
				 gc->max_num_queues_vport);
	...
	apc->max_queues = gc->max_num_queues_vport;
	apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);

so the logged value does not match the per-vPort queue count the driver
actually exposes.

> +
>  	return 0;
>  }
>  
> @@ -1856,6 +1893,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)

[ ... ]

> @@ -1968,20 +2008,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
>  	if (err)
>  		goto destroy_hwc;
>  
> -	err = mana_gd_query_max_resources(pdev);
> +	err = mana_gd_detect_devices(pdev);
>  	if (err)
>  		goto destroy_hwc;
>  
> -	err = mana_gd_setup_remaining_irqs(pdev);
> -	if (err) {
> -		dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> -		goto destroy_hwc;
> -	}
> -
> -	err = mana_gd_detect_devices(pdev);
> +	err = mana_gd_query_max_resources(pdev);
>  	if (err)
>  		goto destroy_hwc;
>  
> +	if (!gc->msi_sharing) {
> +		gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
> +		if (!gc->msi_bitmap) {
> +			err = -ENOMEM;
> +			goto destroy_hwc;
> +		}
> +		/* Set bit for HWC */
> +		set_bit(0, gc->msi_bitmap);
> +	} else {
> +		err = mana_gd_setup_remaining_irqs(pdev);
> +		if (err) {
> +			dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> +			goto destroy_hwc;
> +		}
> +	}
> +

Can the driver bring up any vPort after this change when the !msi_sharing
branch is taken?

In the dedicated branch, only gc->msi_bitmap is allocated and bit 0 is
set for HWC.  mana_gd_setup_remaining_irqs() is skipped, so no
gdma_irq_context is inserted into gc->irq_contexts for indices 1..
num_msix_usable-1.

Later, mana_create_eq() still assigns

	spec.eq.msix_index = (i + 1) % gc->num_msix_usable;

and mana_gd_register_irq() does:

	gic = xa_load(&gc->irq_contexts, msi_index);
	if (WARN_ON(!gic))
		return -EINVAL;

On a typical cloud SKU with, for example, num_msix_usable=32,
num_ports=1 and num_online_cpus=16, the new math keeps msi_sharing=false
(16 * 1 <= 31), so every EQ-create goes down this path and hits the
WARN_ON.  Doesn't that make every vPort open and every resume fail for
the common dedicated-MSI-X case?

The msi_bitmap allocated here is not consumed anywhere in this commit;
the on-demand allocation via mana_gd_get_gic() appears in the later
commit "net: mana: Allocate interrupt context for each EQ when creating
vPort" (dbbdf40a8974).  Should the bitmap and the new branch be
introduced in the same commit that actually uses them, so each commit in
the series is independently bootable?

>  	dev_dbg(&pdev->dev, "mana gdma setup successful\n");
>  	return 0;
>

^ permalink raw reply

* Re: [PATCH net-next v6 1/6] net: mana: Create separate EQs for each vPort
From: Simon Horman @ 2026-05-02 15:29 UTC (permalink / raw)
  To: longli
  Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
	leon, haiyangz, kys, wei.liu, decui, netdev, linux-rdma,
	linux-hyperv, linux-kernel
In-Reply-To: <20260502152354.289044-2-horms@kernel.org>

On Sat, May 02, 2026 at 04:23:55PM +0100, Simon Horman wrote:
> From: 'Simon Horman' <horms@kernel.org>
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev

Sorry about this, there was supposed to be some different text here.

This review is available at https://netdev-ai.bots.linux.dev/sashiko/
And I apologise that it overlaps with the review from https://sashiko.dev
which I also posted.

^ permalink raw reply

* Re: [PATCH net-next v6 1/6] net: mana: Create separate EQs for each vPort
From: Simon Horman @ 2026-05-02 15:23 UTC (permalink / raw)
  To: longli
  Cc: 'Simon Horman', kotaranov, kuba, davem, pabeni, edumazet,
	andrew+netdev, jgg, leon, haiyangz, kys, wei.liu, decui, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260429221625.1841150-2-longli@microsoft.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: mana: Create separate EQs for each vPort

This change moves EQ ownership from struct mana_context to struct
mana_port_context so each vPort can have dedicated EQs, and exports
mana_create_eq()/mana_destroy_eq() for use by the MANA RDMA driver.

> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..60cc02e4ad10 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
>  	pd->vport_use_count--;
>  	WARN_ON(pd->vport_use_count < 0);
>  
> -	if (!pd->vport_use_count)
> +	if (!pd->vport_use_count) {
> +		mana_destroy_eq(mpc);
>  		mana_uncfg_vport(mpc);
> +	}
>  
>  	mutex_unlock(&pd->vport_mutex);
>  }
> @@ -55,15 +57,21 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  		return err;
>  	}
>  
> -	mutex_unlock(&pd->vport_mutex);
>  
>  	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
>  	pd->tx_vp_offset = mpc->tx_vp_offset;
> +	err = mana_create_eq(mpc);
> +	if (err) {
> +		mana_uncfg_vport(mpc);
> +		pd->vport_use_count--;
> +	}
> +
> +	mutex_unlock(&pd->vport_mutex);
>  
>  	ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
>  		  mpc->port_handle, pd->pdn, doorbell_id);

On the failure path, mana_uncfg_vport() is called and pd->vport_use_count
is decremented, but execution falls through to the ibdev_dbg() line that
prints the vport handle as if it were a success. Should this debug print
be skipped on the error path so logs are not ambiguous?

Also, pd->tx_shortform_allowed and pd->tx_vp_offset are assigned from mpc
before the mana_create_eq() attempt and are not cleared when that call
fails. Callers are expected to gate on pd->vport_use_count, but is it
worth resetting these fields on failure so the state matches the
pre-patch behavior?

The mutex_unlock(&pd->vport_mutex) has moved from before the field
assignments to after mana_create_eq(). That now holds pd->vport_mutex
across GFP_KERNEL allocations, debugfs directory creation, and multiple
mana_gd_create_mana_eq() firmware RPCs, and establishes the lock order
pd->vport_mutex -> apc->vport_mutex via mana_uncfg_vport() on the error
path. Is the extended critical section intentional?

>  
> -	return 0;
> +	return err;
>  }
>  
>  int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 645581359cee..6f1043383e8c 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -168,7 +168,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  		cq_spec.gdma_region = cq->queue.gdma_region;
>  		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
>  		cq_spec.modr_ctx_id = 0;
> -		eq = &mpc->ac->eqs[cq->comp_vector];
> +		/* EQs are created when a raw QP configures the vport.
> +		 * A raw QP must be created before creating rwq_ind_tbl.
> +		 */
> +		if (!mpc->eqs) {
> +			ret = -EINVAL;
> +			i--;
> +			goto fail;
> +		}
> +		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];

Can the mpc->eqs read race with a concurrent free? mana_ib_create_qp_rss()
runs without pd->vport_mutex or RTNL, but mpc->eqs is freed by
mana_destroy_eq() from two paths:

  mana_ib_uncfg_vport()   (under pd->vport_mutex, on last raw-QP destroy)
  mana_dealloc_queues()   (under RTNL, on netdev down)

both of which do:

  kfree(apc->eqs);
  apc->eqs = NULL;

with no RCU grace period or reader-visible synchronization. If CPU-A
passes the !mpc->eqs check after CPU-B begins ip link set dev X down,
does CPU-A then dereference freed memory via mpc->eqs[...].eq->id?

Separately, what populates mpc->eqs for an RDMA-only RSS QP user on a
probed-but-idle Ethernet port? Pre-patch mana_probe() called
mana_create_eq(ac) unconditionally, so ac->eqs existed for the device
lifetime. After this patch the only creators are mana_alloc_queues()
(Ethernet up) and mana_ib_cfg_vport() (raw QP). An RDMA application that
uses only RSS QPs and never creates a raw QP will now get -EINVAL here
where it used to succeed. Is this intended, and should the commit log
mention it?

The adjacent comment:

   /* EQs are created when a raw QP configures the vport.
    * A raw QP must be created before creating rwq_ind_tbl.
    */

omits the Ethernet-up path that also populates mpc->eqs. Would it be
clearer to describe both creators?

There is also a behavior change in the index expression:

   eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];

Pre-patch this was ac->eqs[cq->comp_vector] sized by gc->max_num_queues.
Now comp_vector is folded modulo mpc->num_queues, which is tunable via
ethtool -L. Userspace that used distinct comp_vector values to hit
distinct EQs will silently alias when comp_vector >= num_queues. Should
this be documented or rejected with -EINVAL rather than silently
wrapped?

Can mpc->num_queues be 0 at this point? mana_set_channels() does not
reject new_count==0, and kzalloc_objs(struct mana_eq, 0) returns
ZERO_SIZE_PTR, which passes the !mpc->eqs check. During the window
between mana_create_eq() and the subsequent netif_set_real_num_tx_queues()
failing, a concurrent RDMA QP create would compute
cq->comp_vector % 0 here. Should mpc->num_queues be validated alongside
mpc->eqs?

The placement of the !mpc->eqs check is inside the per-iteration loop
over ind_tbl_size, but mpc->eqs cannot change across iterations, so the
check is only meaningful at i==0. It works today because i-- then makes
i = -1 and the cleanup while (i-- > 0) skips. Would hoisting the check
above the loop be clearer and less fragile against refactoring?

>  		cq_spec.attached_eq = eq->eq->id;
>  
>  		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
> @@ -317,7 +325,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
>  	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
>  	cq_spec.modr_ctx_id = 0;
>  	eq_vec = send_cq->comp_vector;
> -	eq = &mpc->ac->eqs[eq_vec];
> +	if (!mpc->eqs) {
> +		err = -EINVAL;
> +		goto err_destroy_queue;
> +	}
> +	eq = &mpc->eqs[eq_vec % mpc->num_queues];

The same mpc->eqs TOCTOU concern applies here: the check and subsequent
dereference occur without pd->vport_mutex or RTNL held, so a concurrent
mana_dealloc_queues() on the Ethernet side can free mpc->eqs between the
check and the index access. Is there synchronization that prevents this?

The same comp_vector % num_queues aliasing and num_queues==0 divide
concerns apply here as well.

>  	cq_spec.attached_eq = eq->eq->id;
>  
>  	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index a654b3699c4c..6c709f8b875d 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c

[ ... ]

> -static int mana_create_eq(struct mana_context *ac)
> +int mana_create_eq(struct mana_port_context *apc)
>  {
> -	struct gdma_dev *gd = ac->gdma_dev;
> +	struct gdma_dev *gd = apc->ac->gdma_dev;
>  	struct gdma_context *gc = gd->gdma_context;
>  	struct gdma_queue_spec spec = {};
>  	int err;
>  	int i;
>  
> -	ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
> -	if (!ac->eqs)
> +	WARN_ON(apc->eqs);
> +	apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
> +	if (!apc->eqs)
>  		return -ENOMEM;

WARN_ON(apc->eqs) is used as a defensive assertion but execution
proceeds to overwrite apc->eqs unconditionally. If the invariant is
violated, the previous array and all HW EQs it referenced are leaked.
Should this instead bail out with an error or call mana_destroy_eq()
first to keep the function idempotent against future callers?

[ ... ]

> @@ -3326,6 +3349,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	mana_fence_rqs(apc);
>  
>  	/* Even in err case, still need to cleanup the vPort */
> +	mana_destroy_rxqs(apc);
> +	mana_destroy_txq(apc);
> +	mana_destroy_eq(apc);
>  	mana_destroy_vport(apc);

What is the intended interaction with outstanding RDMA QPs here? An
RDMA user that created an RSS QP while the Ethernet netdev was up sees
mpc->eqs populated by mana_alloc_queues(). When the admin issues
ip link set dev N down, mana_dealloc_queues() -> mana_destroy_eq()
destroys those EQs and frees mpc->eqs while the RDMA QPs are still
alive, leaving the QPs with dangling attached_eq IDs at the hardware
level and stale kernel references.

Pre-patch ac->eqs lived for the full mana_context lifetime and was torn
down only in mana_remove(). Is unconditionally destroying the EQs on
netdev-down the intended new behavior, and if so how are concurrent
RDMA consumers expected to recover?

>  	return 0;
>  }

^ permalink raw reply

* Re: [PATCH net-next v6 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Simon Horman @ 2026-05-02 15:08 UTC (permalink / raw)
  To: longli
  Cc: 'Simon Horman', kotaranov, kuba, davem, pabeni, edumazet,
	andrew+netdev, jgg, leon, haiyangz, kys, wei.liu, decui, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260429221625.1841150-3-longli@microsoft.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: mana: Query device capabilities and configure MSI-X sharing for EQs

When querying the device, this adjusts the max number of queues to allow
dedicated MSI-X vectors for each vPort, clamping to no less than
MANA_DEF_NUM_QUEUES. MSI-X sharing among vPorts is disabled by default
and enabled only when there are not enough vectors. It also renames
mana_query_device_cfg() to mana_gd_query_device_cfg().

> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c

[ ... ]

> +	/*
> +	 * Adjust gc->max_num_queues returned from the SOC to allow dedicated
> +	 * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
> +	 */
> +	max_num_queues = (gc->num_msix_usable - 1) / num_ports;
> +	max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));

Does rounding up here inflate max_num_queues and unnecessarily force MSI-X
sharing?

For example, if there are 63 usable MSIs and 2 ports, the division yields 31.
Rounding up gives 32. The subsequent check below:

> +	/*
> +	 * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
> +	 * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
> +	 */
> +	max_num_queues = min(gc->max_num_queues, max_num_queues);
> +	if (max_num_queues * num_ports > gc->num_msix_usable - 1)
> +		gc->msi_sharing = true;

would then evaluate to true (32 * 2 > 63) and force the driver into MSI-X
shared mode. This seems to contradict the intent to use dedicated MSI-X
whenever possible.

Would it be better to use rounddown_pow_of_two() instead to ensure the
calculated queues fit within the available dedicated vectors?

[ ... ]

> @@ -1968,20 +2008,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
>  	if (err)
>  		goto destroy_hwc;
>  
> -	err = mana_gd_query_max_resources(pdev);
> +	err = mana_gd_detect_devices(pdev);
>  	if (err)
>  		goto destroy_hwc;
>  
> -	err = mana_gd_setup_remaining_irqs(pdev);
> -	if (err) {
> -		dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> -		goto destroy_hwc;
> -	}
> -
> -	err = mana_gd_detect_devices(pdev);
> +	err = mana_gd_query_max_resources(pdev);
>  	if (err)
>  		goto destroy_hwc;
>  
> +	if (!gc->msi_sharing) {
> +		gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
> +		if (!gc->msi_bitmap) {
> +			err = -ENOMEM;
> +			goto destroy_hwc;
> +		}
> +		/* Set bit for HWC */
> +		set_bit(0, gc->msi_bitmap);
> +	} else {
> +		err = mana_gd_setup_remaining_irqs(pdev);
> +		if (err) {
> +			dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> +			goto destroy_hwc;
> +		}
> +	}

If msi_sharing is disabled, we allocate the msi_bitmap but skip calling
mana_gd_setup_remaining_irqs(). 

Since mana_gd_setup_hwc_irqs() only allocates a single vector for the hardware
channel when dynamic allocation is supported, does this leave the device
without interrupts for its Ethernet queues?

If so, it seems this could lead to queue creation failures when the driver
attempts to map uninitialized vectors. I notice this is fixed in a later patch
in the series ("net: mana: Allocate interrupt context for each EQ when
creating vPort"), but does leaving it out here break bisectability?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox