* [PATCH v3 14/18] mshv: Use kfree_rcu in mshv_portid_free
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_portid_free() uses synchronize_rcu() followed by kfree() to
reclaim port table entries. This blocks the caller until a full RCU
grace period elapses, which is unnecessary since the same module already
uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().
Replace with kfree_rcu() to avoid the blocking wait and keep the
reclamation strategy consistent across the file.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_portid_table.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index d6884c601b298..1ccbafe7aa596 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
WARN_ON(!info);
idr_unlock(&port_table_idr);
- synchronize_rcu();
- kfree(info);
+ kfree_rcu(info, portbl_rcu);
}
int
^ permalink raw reply related
* [PATCH v3 13/18] mshv: Fix sleeping under spinlock in mshv_portid_alloc
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
idr_alloc() is called with GFP_KERNEL inside idr_lock(), which holds a
spinlock. GFP_KERNEL allows the allocator to sleep, triggering a
sleeping-while-atomic bug.
Fix by using idr_preload(GFP_KERNEL) before taking the lock to
pre-allocate memory in a sleepable context, then idr_alloc() with
GFP_NOWAIT inside the spinlock-protected section.
Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_portid_table.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index f1aaef69eb9b7..d6884c601b298 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -40,12 +40,14 @@ mshv_port_table_fini(void)
int
mshv_portid_alloc(struct port_table_info *info)
{
- int ret = 0;
+ int ret;
+ idr_preload(GFP_KERNEL);
idr_lock(&port_table_idr);
ret = idr_alloc(&port_table_idr, info, PORTID_MIN,
- PORTID_MAX, GFP_KERNEL);
+ PORTID_MAX, GFP_NOWAIT);
idr_unlock(&port_table_idr);
+ idr_preload_end();
return ret;
}
^ permalink raw reply related
* [PATCH v3 12/18] mshv: Fix use-after-RCU in mshv_portid_lookup
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_portid_lookup() drops the RCU read lock before copying the
port_table_info struct found by idr_find(). If mshv_portid_free() runs
concurrently on another CPU, it can remove the entry and free it (via
synchronize_rcu + kfree) before the copy at line *info = *_info
completes — resulting in a use-after-free.
Move rcu_read_unlock() after the struct copy so the object remains
protected for the entire duration of the read-side access.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_portid_table.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index c349af1f0aaac..f1aaef69eb9b7 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -72,12 +72,11 @@ mshv_portid_lookup(int port_id, struct port_table_info *info)
rcu_read_lock();
_info = idr_find(&port_table_idr, port_id);
- rcu_read_unlock();
-
if (_info) {
*info = *_info;
ret = 0;
}
+ rcu_read_unlock();
return ret;
}
^ permalink raw reply related
* [PATCH v3 11/18] mshv: Fix duplicate GSI detection for GSI 0
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The duplicate routing entry check in mshv_update_routing_table() uses
guest_irq_num != 0 to detect whether a GSI slot is already occupied.
This fails for GSI 0 because its guest_irq_num is 0 both when the slot
is unused (zero-initialized) and when legitimately assigned. As a
result, duplicate entries for GSI 0 are silently accepted, with the
second entry overwriting the first — corrupting the routing table
without any error reported to userspace.
While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
in practice, the check is semantically wrong — it conflates
"uninitialized" with "GSI number 0." Use girq_entry_valid instead,
which is explicitly set to true when an entry is populated and remains
zero for unused slots regardless of the GSI number.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_irq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index b3142c84dcbc2..65a4ffc82d566 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -51,7 +51,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
/*
* Allow only one to one mapping between GSI and MSI routing.
*/
- if (girq->guest_irq_num != 0) {
+ if (girq->girq_entry_valid) {
r = -EINVAL;
goto out;
}
^ permalink raw reply related
* [PATCH v3 10/18] mshv: Fix level-triggered check on uninitialized data
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
In mshv_irqfd_assign(), the level-triggered validation for resample
irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
mshv_irqfd_update() has populated the field. Since the irqfd struct is
zero-allocated, level_triggered is always 0 at that point, causing the
check to always reject resample irqfds with -EINVAL. This makes
level-triggered interrupt resampling — used to avoid interrupt storms
with assigned devices — completely non-functional.
Move the check after the mshv_irqfd_update() call, which resolves the
IRQ routing entry and populates irqfd_lapic_irq with the actual trigger
mode.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 8a5a1b517ad4b..c99ca5770d3d0 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -483,6 +483,19 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc);
spin_lock_irq(&pt->pt_irqfds_lock);
+ ret = 0;
+ hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
+ if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
+ continue;
+ /* This fd is used for another irq already. */
+ ret = -EBUSY;
+ spin_unlock_irq(&pt->pt_irqfds_lock);
+ goto fail;
+ }
+
+ idx = srcu_read_lock(&pt->pt_irq_srcu);
+ mshv_irqfd_update(pt, irqfd);
+
#if IS_ENABLED(CONFIG_X86)
if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
!irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
@@ -491,22 +504,12 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
* Otherwise return with failure
*/
spin_unlock_irq(&pt->pt_irqfds_lock);
+ srcu_read_unlock(&pt->pt_irq_srcu, idx);
ret = -EINVAL;
goto fail;
}
#endif
- ret = 0;
- hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
- if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
- continue;
- /* This fd is used for another irq already. */
- ret = -EBUSY;
- spin_unlock_irq(&pt->pt_irqfds_lock);
- goto fail;
- }
- idx = srcu_read_lock(&pt->pt_irq_srcu);
- mshv_irqfd_update(pt, irqfd);
hlist_add_head(&irqfd->irqfd_hnode, &pt->pt_irqfds_list);
spin_unlock_irq(&pt->pt_irqfds_lock);
^ permalink raw reply related
* [PATCH v3 09/18] mshv: Consolidate irqfd interrupt injection paths
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The irqfd interrupt injection had duplicated seqcount reads and
inconsistent validity checks between the fast and slow paths:
1. The wakeup handler snapshots irqfd_lapic_irq under seqcount, then on
fast-path failure calls mshv_assert_irq_slow() which re-reads both
girq_ent and lapic_irq under seqcount again — wasteful and confusing.
2. The validity check (girq_entry_valid) only existed in the slow path.
The fast path would blindly accept a zero-initialized structure when
routing was not configured, potentially injecting vector 0 to VP 0.
3. The condition 'girq_ent.guest_irq_num && !girq_ent.girq_entry_valid'
short-circuits for GSI 0 (guest_irq_num == 0), bypassing the
validity check entirely.
4. mshv_irqfd_resampler_ack() reads irqfd_lapic_irq.lapic_control
without seqcount protection, allowing torn reads when racing with
mshv_irqfd_update().
Consolidate by:
- Moving the seqcount snapshot and validity check into the wakeup
handler, performed once before either injection path.
- Changing mshv_assert_irq_slow() to accept a pre-snapshotted
const struct mshv_lapic_irq pointer, eliminating its internal
seqcount read and SRCU lock/unlock.
- Using !girq_entry_valid alone as the validity condition, fixing the
GSI 0 bypass.
- Adding seqcount protection in mshv_irqfd_resampler_ack() to prevent
torn reads of interrupt_type.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 64 +++++++++++++++++++++++++--------------------
1 file changed, 36 insertions(+), 28 deletions(-)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 7275b9eaa7541..8a5a1b517ad4b 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -90,7 +90,15 @@ static void mshv_irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
hlist_for_each_entry_srcu(irqfd, &resampler->rsmplr_irqfd_list,
irqfd_resampler_hnode,
srcu_read_lock_held(&partition->pt_irq_srcu)) {
- if (hv_should_clear_interrupt(irqfd->irqfd_lapic_irq.lapic_control.interrupt_type))
+ struct mshv_lapic_irq irq;
+ unsigned int seq;
+
+ do {
+ seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+ irq = irqfd->irqfd_lapic_irq;
+ } while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
+ if (hv_should_clear_interrupt(irq.lapic_control.interrupt_type))
hv_call_clear_virtual_interrupt(partition->pt_id);
eventfd_signal(irqfd->irqfd_resamplefd);
@@ -198,36 +206,19 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
}
#endif
-static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
+static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd,
+ const struct mshv_lapic_irq *irq)
{
struct mshv_partition *partition = irqfd->irqfd_partn;
- struct mshv_guest_irq_ent girq_ent;
- struct mshv_lapic_irq irq;
- unsigned int seq;
- int idx;
-
- idx = srcu_read_lock(&partition->pt_irq_srcu);
-
- do {
- seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
- girq_ent = irqfd->irqfd_girq_ent;
- irq = irqfd->irqfd_lapic_irq;
- } while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
-
- if (girq_ent.guest_irq_num && !girq_ent.girq_entry_valid) {
- srcu_read_unlock(&partition->pt_irq_srcu, idx);
- return;
- }
#if IS_ENABLED(CONFIG_X86)
WARN_ON(irqfd->irqfd_resampler &&
- !irq.lapic_control.level_triggered);
+ !irq->lapic_control.level_triggered);
#endif
hv_call_assert_virtual_interrupt(partition->pt_id,
- irq.lapic_vector, irq.lapic_apic_id,
- irq.lapic_control);
- srcu_read_unlock(&partition->pt_irq_srcu, idx);
+ irq->lapic_vector, irq->lapic_apic_id,
+ irq->lapic_control);
}
static void mshv_irqfd_resampler_shutdown(struct mshv_irqfd *irqfd)
@@ -316,6 +307,7 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
int ret = 0;
if (flags & EPOLLIN) {
+ struct mshv_guest_irq_ent girq_ent;
struct mshv_lapic_irq irq;
u64 cnt;
@@ -323,14 +315,18 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
idx = srcu_read_lock(&pt->pt_irq_srcu);
do {
seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+ girq_ent = irqfd->irqfd_girq_ent;
irq = irqfd->irqfd_lapic_irq;
} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+ if (!girq_ent.girq_entry_valid)
+ goto out_unlock;
+
/* An event has been signaled, raise an interrupt */
- ret = mshv_try_assert_irq_fast(irqfd, &irq);
- if (ret)
- mshv_assert_irq_slow(irqfd);
+ if (mshv_try_assert_irq_fast(irqfd, &irq))
+ mshv_assert_irq_slow(irqfd, &irq);
+out_unlock:
srcu_read_unlock(&pt->pt_irq_srcu, idx);
ret = 1;
@@ -520,8 +516,20 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
*/
events = vfs_poll(fd_file(f), &irqfd->irqfd_polltbl);
- if (events & EPOLLIN)
- mshv_assert_irq_slow(irqfd);
+ if (events & EPOLLIN) {
+ struct mshv_guest_irq_ent girq_ent;
+ struct mshv_lapic_irq irq;
+ unsigned int seq;
+
+ do {
+ seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+ girq_ent = irqfd->irqfd_girq_ent;
+ irq = irqfd->irqfd_lapic_irq;
+ } while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
+ if (girq_ent.girq_entry_valid)
+ mshv_assert_irq_slow(irqfd, &irq);
+ }
srcu_read_unlock(&pt->pt_irq_srcu, idx);
return 0;
^ permalink raw reply related
* [PATCH v3 08/18] mshv: Fix broken seqcount read protection
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_irqfd_update() writes both irqfd_girq_ent and irqfd_lapic_irq as a
logical unit under seqcount write protection. Readers must snapshot these
fields inside the seqcount begin/retry loop to obtain a consistent
point-in-time view — otherwise a concurrent update can produce a torn
read where one field comes from the old state and the other from the new.
Both mshv_assert_irq_slow() and mshv_irqfd_wakeup() get this wrong: the
seqcount loop bodies are empty (just spinning until a stable sequence is
observed), and all reads of the protected fields happen after the loop
with no protection from concurrent writes. If mshv_irqfd_update() races
with interrupt assertion, the caller may use a stale or mixed
vector/apic_id/control combination — delivering an interrupt to the
wrong vCPU, with the wrong vector, or with the wrong trigger mode. This
can cause spurious or lost interrupts in the guest, or a stuck interrupt
line in the level-triggered case.
Fix mshv_assert_irq_slow() by snapshotting both irqfd_girq_ent and
irqfd_lapic_irq into local variables inside the seqcount loop, then
using those locals for the validity check and the hypercall.
Fix mshv_irqfd_wakeup() by snapshotting irqfd_lapic_irq inside its
seqcount loop and passing the snapshot to mshv_try_assert_irq_fast(),
so the fast path operates on the consistent copy rather than reading
the field directly outside seqcount protection.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 47 +++++++++++++++++++++++++--------------------
1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 509911ffcbeee..7275b9eaa7541 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -151,10 +151,10 @@ static int mshv_vp_irq_set_vector(struct mshv_vp *vp, u32 vector)
* Try to raise irq for guest via shared vector array. hyp does the actual
* inject of the interrupt.
*/
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+ const struct mshv_lapic_irq *irq)
{
struct mshv_partition *partition = irqfd->irqfd_partn;
- struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
struct mshv_vp *vp;
if (!(ms_hyperv.ext_features &
@@ -191,7 +191,8 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
return 0;
}
#else /* CONFIG_X86_64 */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+ const struct mshv_lapic_irq *irq)
{
return -EOPNOTSUPP;
}
@@ -200,30 +201,32 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
{
struct mshv_partition *partition = irqfd->irqfd_partn;
- struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
+ struct mshv_guest_irq_ent girq_ent;
+ struct mshv_lapic_irq irq;
unsigned int seq;
int idx;
-#if IS_ENABLED(CONFIG_X86)
- WARN_ON(irqfd->irqfd_resampler &&
- !irq->lapic_control.level_triggered);
-#endif
-
idx = srcu_read_lock(&partition->pt_irq_srcu);
- if (irqfd->irqfd_girq_ent.guest_irq_num) {
- if (!irqfd->irqfd_girq_ent.girq_entry_valid) {
- srcu_read_unlock(&partition->pt_irq_srcu, idx);
- return;
- }
- do {
- seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
- } while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+ do {
+ seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+ girq_ent = irqfd->irqfd_girq_ent;
+ irq = irqfd->irqfd_lapic_irq;
+ } while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
+ if (girq_ent.guest_irq_num && !girq_ent.girq_entry_valid) {
+ srcu_read_unlock(&partition->pt_irq_srcu, idx);
+ return;
}
- hv_call_assert_virtual_interrupt(irqfd->irqfd_partn->pt_id,
- irq->lapic_vector, irq->lapic_apic_id,
- irq->lapic_control);
+#if IS_ENABLED(CONFIG_X86)
+ WARN_ON(irqfd->irqfd_resampler &&
+ !irq.lapic_control.level_triggered);
+#endif
+
+ hv_call_assert_virtual_interrupt(partition->pt_id,
+ irq.lapic_vector, irq.lapic_apic_id,
+ irq.lapic_control);
srcu_read_unlock(&partition->pt_irq_srcu, idx);
}
@@ -313,16 +316,18 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
int ret = 0;
if (flags & EPOLLIN) {
+ struct mshv_lapic_irq irq;
u64 cnt;
eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
idx = srcu_read_lock(&pt->pt_irq_srcu);
do {
seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+ irq = irqfd->irqfd_lapic_irq;
} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
/* An event has been signaled, raise an interrupt */
- ret = mshv_try_assert_irq_fast(irqfd);
+ ret = mshv_try_assert_irq_fast(irqfd, &irq);
if (ret)
mshv_assert_irq_slow(irqfd);
^ permalink raw reply related
* [PATCH v3 07/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_try_assert_irq_fast() dereferences the vp pointer obtained from
pt_vp_array[lapic_apic_id] without checking for NULL or validating that
lapic_apic_id is within bounds. A spurious interrupt from the hypervisor
targeting a non-existent VP (or one not yet created) causes a NULL
pointer dereference and crashes the host.
Add a bounds check on lapic_apic_id against MSHV_MAX_VPS and a NULL
check on the vp pointer before dereferencing.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 3ab6338064237..509911ffcbeee 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -169,7 +169,12 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
return -EOPNOTSUPP;
#endif
+ if (irq->lapic_apic_id >= MSHV_MAX_VPS)
+ return -EINVAL;
+
vp = partition->pt_vp_array[irq->lapic_apic_id];
+ if (!vp)
+ return -EINVAL;
if (!vp->vp_register_page)
return -EOPNOTSUPP;
^ permalink raw reply related
* [PATCH v3 06/18] mshv: Add defensive synchronize_srcu in irqfd shutdown
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_irqfd_assign() adds the irqfd to the partition's hlist and then
registers the wait entry on the eventfd waitqueue via vfs_poll(). A
narrow window exists between these two operations where the irqfd is
visible to deactivation paths but the wait entry is not yet initialized
on the waitqueue.
Currently this is not reachable because mshv_irqfd_assign() and
mshv_irqfd_deassign() are serialized by the partition mutex, and the
EPOLLHUP wakeup path can only fire after vfs_poll() has registered the
wait entry. However, if future refactoring removes or relaxes that
serialization, mshv_irqfd_shutdown() could call
eventfd_ctx_remove_wait_queue() before the wait entry is on the queue,
causing a NULL pointer dereference (the list_head is zeroed by kzalloc
and not initialized by init_waitqueue_func_entry()).
Add synchronize_srcu_expedited() at the start of mshv_irqfd_shutdown()
as a defensive measure, ensuring the assignment path's SRCU read-side
section (which covers vfs_poll() registration) has completed. This
follows the pattern established by KVM in irqfd_shutdown().
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 5995a62aff8d8..3ab6338064237 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -248,8 +248,12 @@ static void mshv_irqfd_shutdown(struct work_struct *work)
{
struct mshv_irqfd *irqfd =
container_of(work, struct mshv_irqfd, irqfd_shutdown);
+ struct mshv_partition *pt = irqfd->irqfd_partn;
u64 cnt;
+ /* Make sure irqfd has been initialized in assign path. */
+ synchronize_srcu_expedited(&pt->pt_irq_srcu);
+
/*
* Synchronize with the wait-queue and unhook ourselves to prevent
* further events.
^ permalink raw reply related
* [PATCH v3 05/18] mshv: Fix race in mshv_irqfd_deassign
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_irqfd_deactivate() and the hlist traversal of pt_irqfds_list
require pt->pt_irqfds_lock to be held, but mshv_irqfd_deassign()
omits it. This races with the EPOLLHUP path in mshv_irqfd_wakeup(),
which does take the lock before calling mshv_irqfd_deactivate().
Additionally, mshv_irqfd_deactivate() uses hlist_del() which poisons
the node pointers rather than resetting them. Since
mshv_irqfd_is_active() relies on hlist_unhashed() (checks pprev ==
NULL), a poisoned node still appears active. If a concurrent path calls
mshv_irqfd_deactivate() again on the same irqfd, the guard fails to
prevent a double hlist_del() on poisoned pointers.
Fix both issues:
- Add the missing spin_lock_irq/spin_unlock_irq around the list
traversal in mshv_irqfd_deassign(), matching mshv_irqfd_release().
- Use hlist_del_init() instead of hlist_del() so the node is properly
marked as unhashed after removal, making the is_active guard reliable.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 90959f639dc32..5995a62aff8d8 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -284,7 +284,7 @@ static void mshv_irqfd_deactivate(struct mshv_irqfd *irqfd)
if (!mshv_irqfd_is_active(irqfd))
return;
- hlist_del(&irqfd->irqfd_hnode);
+ hlist_del_init(&irqfd->irqfd_hnode);
queue_work(irqfd_cleanup_wq, &irqfd->irqfd_shutdown);
}
@@ -541,13 +541,14 @@ static int mshv_irqfd_deassign(struct mshv_partition *pt,
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
+ spin_lock_irq(&pt->pt_irqfds_lock);
hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list,
irqfd_hnode) {
if (irqfd->irqfd_eventfd_ctx == eventfd &&
irqfd->irqfd_irqnum == args->gsi)
-
mshv_irqfd_deactivate(irqfd);
}
+ spin_unlock_irq(&pt->pt_irqfds_lock);
eventfd_ctx_put(eventfd);
^ permalink raw reply related
* [PATCH v3 04/18] mshv: Fix potential u64 overflow in region overlap check
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_partition_create_region() checks for overlapping guest memory
regions using arithmetic that can overflow u64:
mem->guest_pfn + nr_pages <= rg->start_gfn
If guest_pfn is near U64_MAX, the addition wraps around to a small
value, causing the overlap check to incorrectly pass. This could allow
creation of overlapping regions.
Fix by validating the sum with check_add_overflow() before the loop and
using the pre-computed end_gfn in the comparison.
Fixes: f91bc8f61abf ("mshv: Allow mappings that overlap in uaddr")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 7e4252b6bc65c..0e9b696853fcb 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1290,11 +1290,15 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
{
struct mshv_mem_region *rg;
u64 nr_pages = HVPFN_DOWN(mem->size);
+ u64 end_gfn;
+
+ if (check_add_overflow(mem->guest_pfn, nr_pages, &end_gfn))
+ return -EINVAL;
/* Reject overlapping regions */
spin_lock(&partition->pt_mem_regions_lock);
hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
- if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
+ if (end_gfn <= rg->start_gfn ||
rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
continue;
spin_unlock(&partition->pt_mem_regions_lock);
^ permalink raw reply related
* [PATCH v3 03/18] mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_prepare_pinned_region() returns 0 (success) when mshv_region_map()
fails on an unencrypted partition. The condition on the error path:
if (ret && mshv_partition_encrypted(partition))
only handles map failures for encrypted partitions — if the partition is
not encrypted and the map fails, execution falls through to 'return 0',
silently ignoring the error.
Additionally, calling mshv_region_invalidate() inline on map failure
zeroes the mreg_pages array before the caller's cleanup path
(mshv_region_destroy) can call mshv_region_unmap(). Since unmap skips
pages where mreg_pages[offset] is NULL, this can leave stale SLAT
mappings for partially-mapped pages.
Fix by returning immediately on success and falling through to error
return on failure. For unencrypted partitions, the caller's
mshv_region_destroy() handles unmap followed by invalidate in the
correct order. For encrypted partitions where re-sharing fails, zero
the page array without unpinning — the pages are inaccessible to the
host and must not be unpinned, but zeroing prevents
mshv_region_destroy() from attempting to unpin them.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 665d565899c15..7e4252b6bc65c 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1360,32 +1360,38 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
pt_err(partition,
"Failed to unshare memory region (guest_pfn: %llu): %d\n",
region->start_gfn, ret);
- goto invalidate_region;
+ goto err_out;
}
}
ret = mshv_region_map(region);
- if (ret && mshv_partition_encrypted(partition)) {
+ if (ret)
+ goto share_region;
+
+ return 0;
+
+share_region:
+ if (mshv_partition_encrypted(partition)) {
int shrc;
shrc = mshv_region_share(region);
if (!shrc)
- goto invalidate_region;
+ goto err_out;
pt_err(partition,
"Failed to share memory region (guest_pfn: %llu): %d\n",
region->start_gfn, shrc);
/*
- * Don't unpin if marking shared failed because pages are no
- * longer mapped in the host, ie root, anymore.
+ * Re-sharing failed — the pages remain inaccessible to the
+ * host. Zero the page array so that mshv_region_destroy()
+ * won't attempt to unpin them (leaking the page references
+ * is intentional; unpinning host-inaccessible pages would be
+ * unsafe).
*/
+ memset(region->mreg_pages, 0,
+ region->nr_pages * sizeof(region->mreg_pages[0]));
goto err_out;
}
-
- return 0;
-
-invalidate_region:
- mshv_region_invalidate(region);
err_out:
return ret;
}
^ permalink raw reply related
* [PATCH v3 02/18] mshv: Fix potential integer overflow in mshv_region_create
From: Stanislav Kinsburskii @ 2026-05-04 19:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The allocation size is computed as:
sizeof(*region) + sizeof(struct page *) * nr_pages
where nr_pages is a u64 originating from userspace. A sufficiently
large nr_pages can overflow the multiplication, resulting in a small
allocation followed by out-of-bounds writes when populating mreg_pages.
Use struct_size() which returns SIZE_MAX on overflow, causing vzalloc
to safely return NULL — caught by the existing error check.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_regions.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..1d04a97980b8b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -177,7 +177,7 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
{
struct mshv_mem_region *region;
- region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+ region = vzalloc(struct_size(region, mreg_pages, nr_pages));
if (!region)
return ERR_PTR(-ENOMEM);
^ permalink raw reply related
* [PATCH v3 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-05-04 19:08 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.
Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.
While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_hv_call.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 129456bd72aba..cc580225e9e45 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
{
struct hv_input_modify_sparse_spa_page_host_access *input_page;
u64 status;
- int done = 0;
+ u64 done = 0;
unsigned long irq_flags, large_shift = 0;
u64 page_count = page_struct_count;
u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
}
while (done < page_count) {
- ulong i, completed, remain = page_count - done;
- int rep_count = min(remain,
- HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+ u64 i, completed, remain = page_count - done;
+ u64 rep_count = min_t(u64, remain,
+ HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
local_irq_save(irq_flags);
input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
input_page->flags = flags;
input_page->host_access = host_access;
- for (i = 0; i < rep_count; i++) {
- u64 index = (done + i) << large_shift;
-
- if (index >= page_struct_count)
- return -EINVAL;
-
+ for (i = 0; i < rep_count; i++)
input_page->spa_page_list[i] =
- page_to_pfn(pages[index]);
- }
+ page_to_pfn(pages[(done + i) << large_shift]);
status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
NULL);
^ permalink raw reply related
* [PATCH v3 00/18] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-05-04 19:08 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
This series addresses bugs found during a continued review of the
mshv_root module mostly introduced by commit 621191d709b14 ("Drivers: hv:
Introduce mshv_root module to expose /dev/mshv to VMMs").
Changes since v2:
- "Fix mshv_prepare_pinned_region error path for unencrypted
partitions": removed inline mshv_region_invalidate() to prevent
zeroing mreg_pages before mshv_region_destroy() can unmap partial
SLAT mappings; for encrypted share-failure, memset the page array
without unpinning (pages are host-inaccessible).
- "Consolidate irqfd interrupt injection paths": fixed data race in
mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
concurrent routing update from injecting vector 0 to VP 0.
- "Add missing vp_index bounds check in intercept ISR": added
array_index_nospec() after the bounds check to prevent speculative
out-of-bounds array access.
- "Add store/load ordering for VP array publish": added missing
smp_load_acquire in mshv_try_assert_irq_fast.
Changes since v1:
- Added 8 new patches addressing issues found by Sashiko (automated
review) covering the irqfd, portid, scheduler message, and VP
lifecycle paths.
- Consolidated the irqfd fast/slow injection paths to eliminate
duplicated seqcount reads and fix the GSI 0 validity bypass.
- Added memory ordering for the lockless VP array.
The fixes range from data corruption and use-after-free to silent
functional failures and sleeping-while-atomic:
Memory region management:
- Integer overflow on userspace-controlled allocation size
(mshv_region_create)
- Silent success on map failure for unencrypted partitions
(mshv_prepare_pinned_region)
- u64 overflow in region overlap check allowing overlapping mappings
IRQ/eventfd path:
- IRQ state leak and type truncation in hypercall helpers
- Missing locking and hlist_del vs hlist_del_init race in irqfd
deassign
- Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
- NULL pointer dereference on spurious interrupt to non-existent VP
(mshv_try_assert_irq_fast)
- Broken seqcount read protection — torn reads of interrupt routing
- Duplicated and inconsistent validity checks between fast/slow
injection paths; fast path could inject vector 0 spuriously
- Level-triggered check on uninitialized data making interrupt
resampling completely non-functional
- Duplicate GSI 0 detection using the wrong predicate
Port ID table:
- Use-after-RCU in mshv_portid_lookup (dereference outside read-side
critical section)
- Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
idr_lock)
- Use kfree_rcu for deferred free without blocking
SynIC / ISR paths:
- Missing VP index bounds check in intercept ISR (OOB in interrupt
context from untrusted hypervisor data)
- Missing store/load ordering for VP array publish — lockless ISR
readers could observe partially-initialized VP
- Missing bounds validation in scheduler messages
(handle_pair_message vp_count, handle_bitset_message bank_mask)
Miscellaneous:
- Missing error code on VP allocation failure (silent success to
userspace)
Kudos to Claude and Sashiko for assisting with analysis and
implementation.
---
Stanislav Kinsburskii (18):
mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
mshv: Fix potential integer overflow in mshv_region_create
mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
mshv: Fix potential u64 overflow in region overlap check
mshv: Fix race in mshv_irqfd_deassign
mshv: Add defensive synchronize_srcu in irqfd shutdown
mshv: Add NULL check for vp in mshv_try_assert_irq_fast
mshv: Fix broken seqcount read protection
mshv: Consolidate irqfd interrupt injection paths
mshv: Fix level-triggered check on uninitialized data
mshv: Fix duplicate GSI detection for GSI 0
mshv: Fix use-after-RCU in mshv_portid_lookup
mshv: Fix sleeping under spinlock in mshv_portid_alloc
mshv: Use kfree_rcu in mshv_portid_free
mshv: Add missing vp_index bounds check in intercept ISR
mshv: Add store/load ordering for VP array publish
mshv: Validate scheduler message bounds from hypervisor
mshv: Fix missing error code on VP allocation failure
drivers/hv/mshv_eventfd.c | 108 +++++++++++++++++++++++++---------------
drivers/hv/mshv_irq.c | 2 -
drivers/hv/mshv_portid_table.c | 12 ++--
drivers/hv/mshv_regions.c | 2 -
drivers/hv/mshv_root_hv_call.c | 18 ++-----
drivers/hv/mshv_root_main.c | 39 ++++++++++----
drivers/hv/mshv_synic.c | 36 +++++++++++--
7 files changed, 136 insertions(+), 81 deletions(-)
^ permalink raw reply
* Re: [PATCH v3 00/18] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-05-04 19:06 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792097254.89142.47656055124497980.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Mon, May 04, 2026 at 06:59:01PM +0000, Stanislav Kinsburskii wrote:
> This series addresses bugs found during a continued review of the
> mshv_root module introduced by commit 621191d709b14 ("Drivers: hv:
> Introduce mshv_root module to expose /dev/mshv to VMMs").
>
THis series is malformed.
Please disregard.
Thanks,
Stanislav
> Changes since v2:
> - "Fix mshv_prepare_pinned_region error path for unencrypted
> partitions": removed inline mshv_region_invalidate() to prevent
> zeroing mreg_pages before mshv_region_destroy() can unmap partial
> SLAT mappings; for encrypted share-failure, memset the page array
> without unpinning (pages are host-inaccessible).
> - "Consolidate irqfd interrupt injection paths": fixed data race in
> mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
> the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
> concurrent routing update from injecting vector 0 to VP 0.
> - "Add missing vp_index bounds check in intercept ISR": added
> array_index_nospec() after the bounds check to prevent speculative
> out-of-bounds array access.
> - "Add store/load ordering for VP array publish": added missing
> smp_load_acquire in mshv_try_assert_irq_fast.
>
> Changes since v1:
> - Added 8 new patches addressing issues found by Sashiko (automated
> review) covering the irqfd, portid, scheduler message, and VP
> lifecycle paths.
> - Consolidated the irqfd fast/slow injection paths to eliminate
> duplicated seqcount reads and fix the GSI 0 validity bypass.
> - Added memory ordering for the lockless VP array.
>
> The fixes range from data corruption and use-after-free to silent
> functional failures and sleeping-while-atomic:
>
> Memory region management:
> - Integer overflow on userspace-controlled allocation size
> (mshv_region_create)
> - Silent success on map failure for unencrypted partitions
> (mshv_prepare_pinned_region)
> - u64 overflow in region overlap check allowing overlapping mappings
>
> IRQ/eventfd path:
> - IRQ state leak and type truncation in hypercall helpers
> - Missing locking and hlist_del vs hlist_del_init race in irqfd
> deassign
> - Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
> - NULL pointer dereference on spurious interrupt to non-existent VP
> (mshv_try_assert_irq_fast)
> - Broken seqcount read protection — torn reads of interrupt routing
> - Duplicated and inconsistent validity checks between fast/slow
> injection paths; fast path could inject vector 0 spuriously
> - Level-triggered check on uninitialized data making interrupt
> resampling completely non-functional
> - Duplicate GSI 0 detection using the wrong predicate
>
> Port ID table:
> - Use-after-RCU in mshv_portid_lookup (dereference outside read-side
> critical section)
> - Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
> idr_lock)
> - Use kfree_rcu for deferred free without blocking
>
> SynIC / ISR paths:
> - Missing VP index bounds check in intercept ISR (OOB in interrupt
> context from untrusted hypervisor data)
> - Missing store/load ordering for VP array publish — lockless ISR
> readers could observe partially-initialized VP
> - Missing bounds validation in scheduler messages
> (handle_pair_message vp_count, handle_bitset_message bank_mask)
>
> Miscellaneous:
> - Missing error code on VP allocation failure (silent success to
> userspace)
>
> Kudos to Claude and Sashiko for assisting with analysis and
> implementation.
>
>
> ---
>
> Stanislav Kinsburskii (18):
> mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
> mshv: Fix potential integer overflow in mshv_region_create
> mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
> mshv: Fix potential u64 overflow in region overlap check
> mshv: Fix race in mshv_irqfd_deassign
> mshv: Add defensive synchronize_srcu in irqfd shutdown
> mshv: Add NULL check for vp in mshv_try_assert_irq_fast
> mshv: Fix broken seqcount read protection
> mshv: Consolidate irqfd interrupt injection paths
> mshv: Fix level-triggered check on uninitialized data
> mshv: Fix duplicate GSI detection for GSI 0
> mshv: Fix use-after-RCU in mshv_portid_lookup
> mshv: Fix sleeping under spinlock in mshv_portid_alloc
> mshv: Use kfree_rcu in mshv_portid_free
> mshv: Add missing vp_index bounds check in intercept ISR
> mshv: Add store/load ordering for VP array publish
> mshv: Validate scheduler message bounds from hypervisor
> mshv: Fix missing error code on VP allocation failure
>
>
> drivers/hv/mshv_eventfd.c | 108 +++++++++++++++++++++++++---------------
> drivers/hv/mshv_irq.c | 2 -
> drivers/hv/mshv_portid_table.c | 12 ++--
> drivers/hv/mshv_regions.c | 2 -
> drivers/hv/mshv_root_hv_call.c | 18 ++-----
> drivers/hv/mshv_root_main.c | 39 ++++++++++----
> drivers/hv/mshv_synic.c | 36 +++++++++++--
> 7 files changed, 136 insertions(+), 81 deletions(-)
>
^ permalink raw reply
* [PATCH v3] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-05-04 19:04 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792097254.89142.47656055124497980.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.
Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.
While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_hv_call.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 129456bd72aba..cc580225e9e45 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
{
struct hv_input_modify_sparse_spa_page_host_access *input_page;
u64 status;
- int done = 0;
+ u64 done = 0;
unsigned long irq_flags, large_shift = 0;
u64 page_count = page_struct_count;
u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
}
while (done < page_count) {
- ulong i, completed, remain = page_count - done;
- int rep_count = min(remain,
- HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+ u64 i, completed, remain = page_count - done;
+ u64 rep_count = min_t(u64, remain,
+ HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
local_irq_save(irq_flags);
input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
input_page->flags = flags;
input_page->host_access = host_access;
- for (i = 0; i < rep_count; i++) {
- u64 index = (done + i) << large_shift;
-
- if (index >= page_struct_count)
- return -EINVAL;
-
+ for (i = 0; i < rep_count; i++)
input_page->spa_page_list[i] =
- page_to_pfn(pages[index]);
- }
+ page_to_pfn(pages[(done + i) << large_shift]);
status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
NULL);
^ permalink raw reply related
* [PATCH v3 00/18] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-05-04 18:59 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
This series addresses bugs found during a continued review of the
mshv_root module introduced by commit 621191d709b14 ("Drivers: hv:
Introduce mshv_root module to expose /dev/mshv to VMMs").
Changes since v2:
- "Fix mshv_prepare_pinned_region error path for unencrypted
partitions": removed inline mshv_region_invalidate() to prevent
zeroing mreg_pages before mshv_region_destroy() can unmap partial
SLAT mappings; for encrypted share-failure, memset the page array
without unpinning (pages are host-inaccessible).
- "Consolidate irqfd interrupt injection paths": fixed data race in
mshv_irqfd_assign EPOLLIN path — girq_ent is now snapshotted inside
the seqcount loop (matching mshv_irqfd_wakeup) to prevent a
concurrent routing update from injecting vector 0 to VP 0.
- "Add missing vp_index bounds check in intercept ISR": added
array_index_nospec() after the bounds check to prevent speculative
out-of-bounds array access.
- "Add store/load ordering for VP array publish": added missing
smp_load_acquire in mshv_try_assert_irq_fast.
Changes since v1:
- Added 8 new patches addressing issues found by Sashiko (automated
review) covering the irqfd, portid, scheduler message, and VP
lifecycle paths.
- Consolidated the irqfd fast/slow injection paths to eliminate
duplicated seqcount reads and fix the GSI 0 validity bypass.
- Added memory ordering for the lockless VP array.
The fixes range from data corruption and use-after-free to silent
functional failures and sleeping-while-atomic:
Memory region management:
- Integer overflow on userspace-controlled allocation size
(mshv_region_create)
- Silent success on map failure for unencrypted partitions
(mshv_prepare_pinned_region)
- u64 overflow in region overlap check allowing overlapping mappings
IRQ/eventfd path:
- IRQ state leak and type truncation in hypercall helpers
- Missing locking and hlist_del vs hlist_del_init race in irqfd
deassign
- Defensive synchronize_srcu in irqfd shutdown (follows KVM pattern)
- NULL pointer dereference on spurious interrupt to non-existent VP
(mshv_try_assert_irq_fast)
- Broken seqcount read protection — torn reads of interrupt routing
- Duplicated and inconsistent validity checks between fast/slow
injection paths; fast path could inject vector 0 spuriously
- Level-triggered check on uninitialized data making interrupt
resampling completely non-functional
- Duplicate GSI 0 detection using the wrong predicate
Port ID table:
- Use-after-RCU in mshv_portid_lookup (dereference outside read-side
critical section)
- Sleeping under spinlock in mshv_portid_alloc (GFP_KERNEL inside
idr_lock)
- Use kfree_rcu for deferred free without blocking
SynIC / ISR paths:
- Missing VP index bounds check in intercept ISR (OOB in interrupt
context from untrusted hypervisor data)
- Missing store/load ordering for VP array publish — lockless ISR
readers could observe partially-initialized VP
- Missing bounds validation in scheduler messages
(handle_pair_message vp_count, handle_bitset_message bank_mask)
Miscellaneous:
- Missing error code on VP allocation failure (silent success to
userspace)
Kudos to Claude and Sashiko for assisting with analysis and
implementation.
---
Stanislav Kinsburskii (18):
mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
mshv: Fix potential integer overflow in mshv_region_create
mshv: Fix mshv_prepare_pinned_region error path for unencrypted partitions
mshv: Fix potential u64 overflow in region overlap check
mshv: Fix race in mshv_irqfd_deassign
mshv: Add defensive synchronize_srcu in irqfd shutdown
mshv: Add NULL check for vp in mshv_try_assert_irq_fast
mshv: Fix broken seqcount read protection
mshv: Consolidate irqfd interrupt injection paths
mshv: Fix level-triggered check on uninitialized data
mshv: Fix duplicate GSI detection for GSI 0
mshv: Fix use-after-RCU in mshv_portid_lookup
mshv: Fix sleeping under spinlock in mshv_portid_alloc
mshv: Use kfree_rcu in mshv_portid_free
mshv: Add missing vp_index bounds check in intercept ISR
mshv: Add store/load ordering for VP array publish
mshv: Validate scheduler message bounds from hypervisor
mshv: Fix missing error code on VP allocation failure
drivers/hv/mshv_eventfd.c | 108 +++++++++++++++++++++++++---------------
drivers/hv/mshv_irq.c | 2 -
drivers/hv/mshv_portid_table.c | 12 ++--
drivers/hv/mshv_regions.c | 2 -
drivers/hv/mshv_root_hv_call.c | 18 ++-----
drivers/hv/mshv_root_main.c | 39 ++++++++++----
drivers/hv/mshv_synic.c | 36 +++++++++++--
7 files changed, 136 insertions(+), 81 deletions(-)
^ permalink raw reply
* RE: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Michael Kelley @ 2026-05-04 16:06 UTC (permalink / raw)
To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <516a4e65-3a4d-4e09-b445-28cb740b5653@linux.microsoft.com>
From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:57 AM
>
> On 4/27/2026 11:08 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> >>
[snip]
> >> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> >> index 08278547b84c..b4d80c9a673a 100644
> >> --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
> >> #ifdef CONFIG_HYPERV_VTL_MODE
> >> void __init hv_vtl_init_platform(void);
> >> int __init hv_vtl_early_init(void);
> >> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >> void mshv_vtl_return_call_init(u64 vtl_return_offset);
> >> void mshv_vtl_return_hypercall(void);
> >> void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
> >> #else
> >> static inline void __init hv_vtl_init_platform(void) {}
> >> static inline int __init hv_vtl_early_init(void) { return 0; }
> >> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >> static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
> >> static inline void mshv_vtl_return_hypercall(void) {}
> >> static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
> >> index a6eea52f7aa2..103f07371f3f 100644
> >> --- a/drivers/hv/mshv_vtl.h
> >> +++ b/drivers/hv/mshv_vtl.h
> >> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
> >> char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
> >> };
> >>
> >> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
> >> + "struct mshv_vtl_cpu_context exceeds reserved space in struct mshv_vtl_run");
> >> +
> >> #endif /* _MSHV_VTL_H */
> >> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> >> index db183c8cfb95..8cdf2a9fbdfb 100644
> >> --- a/include/asm-generic/mshyperv.h
> >> +++ b/include/asm-generic/mshyperv.h
> >> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
> >>
> >> #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> >> u8 __init get_vtl(void);
> >> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >> #else
> >> static inline u8 get_vtl(void) { return 0; }
> >> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> >
> > Is this stub needed? Maybe I missed something, but it looks to me like none
> > of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
> > See further comments about stubs in Patch 8 of this series.
> >
>
> Config dependencies would handle such cases, and this is not required. I
> saw similar stubs added in the code, so I thought this is a norm that
> should be followed, and not rely on config dependencies.
> I can remove it.
>
Others might disagree with me, but I don't think it's the norm to add
stubs when they aren't truly needed. As you can see from some of my
other comments, I look for ways to eliminate stubs. Stubs are indicative
of a boundary between separately built components, and I generally
try to minimize the surface area of such boundaries. A large surface area
often means that the overall design could be improved by re-thinking
which code goes with which component.
Michael
^ permalink raw reply
* RE: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Michael Kelley @ 2026-05-04 16:06 UTC (permalink / raw)
To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <567cf73b-6a48-4fc3-b312-9be6d71e2f22@linux.microsoft.com>
From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, April 29, 2026 2:58 AM
>
> On 4/27/2026 11:10 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
> >>
> >> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
> >> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
> >> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
> >> in architecture-specific code.
> >>
> >> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
> >> include/asm-generic/mshyperv.h so they are visible to both arch and
> >> driver code.
> >>
> >> Change the return type from void to bool so the caller can determine
> >> whether the register page was successfully configured and set
> >> mshv_has_reg_page accordingly.
> >>
> >> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> >> ---
> >> arch/x86/hyperv/hv_vtl.c | 32 ++++++++++++++++++++++
> >> drivers/hv/mshv_vtl_main.c | 49 +++-------------------------------
> >> include/asm-generic/mshyperv.h | 17 ++++++++++++
> >> 3 files changed, 53 insertions(+), 45 deletions(-)
> >>
>
> <snip>
>
> >> #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> >> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> >
> > This comment pre-dates your patch, but I don't understand the point
> > it is trying to make. The comment is factually true, but I don't know
> > why calling that out is relevant. The REG_PAGE MSR seems to be
> > conceptually separate and distinct from the SIMP MSR, so the fact
> > that the layouts are the same is just a coincidence. Or is there some
> > relationship between the two MSRs that I'm not aware of, and the
> > comment is trying (and failing?) to point out?
>
> This was added as per suggestion from Nuno in my initial series for
> MSHV_VTL. If the reference in "identical to" is misleading, I should
> remove it.
>
> https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/
>
> Quoting:
> """
> it is a generic structure that
> appears to be used for several overlay page MSRs (SIMP, SIEF, etc).
>
> But, the type doesn't appear in the hv*dk headers explicitly; it's just
> used internally by the hypervisor.
>
> I think it should be renamed with a hv_ prefix to indicate it's part of
> the hypervisor ABI, and a brief comment with the provenance:
>
> /* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> union hv_synic_overlay_page_msr {
> /* <snip> */
> };
OK, so this union is not associated *only* with the REG_PAGE MSR
(though that MSR is the only current user). Instead, it is intended to
be a more generic description of MSRs that set up overlay pages. I
don't think I had previously noticed Nuno's comment on the topic.
Looking through hvgdk_mini.h and hvhdk.h, I see 6 definitions that
are exactly the same:
* union hv_reference_tsc_msr
* union hv_x64_msr_hypercall_contents
* union hv_vp_assist_msr_contents
* union hv_synic_simp
* union hv_synic_siefp
* union hv_synic_sirbp
There's an argument to be made for removing these 6 unique definitions
and using union hv_synic_overlay_page_msr instead (though "synic"
would need to be removed from the name). I would not object to such
an approach. It's a small extra layer of conceptual indirection, but saves
some lines of code for duplicative definitions. The alternative is to drop
the idea of a generic overlay page MSR layout, and replace union
hv_synic_overlay_page_msr with a definition that is specific to the
REG_PAGE MSR, like the other six above.
I could go either way. If we want to use a generic overlay page definition,
then that approach should be applied everywhere. With the current
state of your patch set, we're halfway in between -- the generic definition
is used one place, but duplicative specific MSR definitions are used other
places. That's probably the least desirable approach.
Michael
^ permalink raw reply
* RE: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Michael Kelley @ 2026-05-04 15:09 UTC (permalink / raw)
To: Jork Loeser, linux-hyperv@vger.kernel.org
Cc: x86@kernel.org, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, Anirudh Rayabharam, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org
In-Reply-To: <20260427213855.1675044-2-jloeser@linux.microsoft.com>
From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Monday, April 27, 2026 2:39 PM
>
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
>
> Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
>
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
> and nested root partition); on a non-nested root partition VMBus
> does not run, so MSHV must enable/disable them
>
> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
> produces wrong addresses on ARM64 with 64K pages.
I agree that this is a good change. But any kernel image built with
CONFIG_MSHV_ROOT set must use only 4KiB pages, as enforced
by the dependency in drivers/hv/Kconfig. The change makes the
code explicitly match the SynIC register layout, which is good,
but it doesn't actually fix a problem since root MSHV code can't
run on ARM64 with 64KiB pages. My only concern is that this
commit message should not imply that an ARM64/64KiB
configuration is possible for the root.
Michael
>
> Note that initialization order matters - VMBUS first, MSHV second,
> and the reverse on de-init. Ideally, we would want a dedicated SYNIC
> driver that replaces the cross-dependencies with a clear API and
> dynamic tracking. Such refactor should go into its own dedicated
> series, outside of this kexec fix series.
>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/hv.c | 3 +
> drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
> 2 files changed, 103 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index ae60fd542292..ef4b1b03395d 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -272,6 +272,9 @@ void hv_synic_free(void)
> /*
> * hv_hyp_synic_enable_regs - Initialize the Synthetic Interrupt Controller
> * with the hypervisor.
> + *
> + * Note: When MSHV is present, mshv_synic_cpu_init() intializes further
> + * registers later.
> */
> void hv_hyp_synic_enable_regs(unsigned int cpu)
> {
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index e2288a726fec..2db3b0192eac 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -13,6 +13,7 @@
> #include <linux/interrupt.h>
> #include <linux/io.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/hyperv.h>
> #include <linux/reboot.h>
> #include <asm/mshyperv.h>
> #include <linux/acpi.h>
> @@ -456,46 +457,75 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> union hv_synic_siefp siefp;
> union hv_synic_sirbp sirbp;
> union hv_synic_sint sint;
> - union hv_synic_scontrol sctrl;
> struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> struct hv_synic_event_flags_page **event_flags_page =
> &spages->synic_event_flags_page;
> struct hv_synic_event_ring_page **event_ring_page =
> &spages->synic_event_ring_page;
> + /*
> + * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
> + * See hv_hyp_synic_enable_regs() for that initialization.
> + */
> + bool vmbus_active = hv_vmbus_exists();
>
> - /* Setup the Synic's message page */
> + /*
> + * Map the SYNIC message page. When VMBus is not active the
> + * hypervisor pre-provisions the SIMP GPA but may not set
> + * simp_enabled - enable it here.
> + */
> simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> - simp.simp_enabled = true;
> + if (!vmbus_active) {
> + simp.simp_enabled = true;
> + hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + }
> *msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
> HV_HYP_PAGE_SIZE,
> MEMREMAP_WB);
>
> if (!(*msg_page))
> - return -EFAULT;
> -
> - hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + goto cleanup_simp;
>
> - /* Setup the Synic's event flags page */
> + /*
> + * Map the event flags page. Same as SIMP: enable when
> + * VMBus is not active, already enabled by VMBus otherwise.
> + */
> siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> - siefp.siefp_enabled = true;
> - *event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
> - PAGE_SIZE, MEMREMAP_WB);
> + if (!vmbus_active) {
> + siefp.siefp_enabled = true;
> + hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> + }
> + *event_flags_page = memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
> + HV_HYP_PAGE_SIZE, MEMREMAP_WB);
>
> if (!(*event_flags_page))
> - goto cleanup;
> -
> - hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> + goto cleanup_siefp;
>
> /* Setup the Synic's event ring page */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> - sirbp.sirbp_enabled = true;
> - *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> - PAGE_SIZE, MEMREMAP_WB);
>
> - if (!(*event_ring_page))
> - goto cleanup;
> + if (hv_root_partition()) {
> + *event_ring_page = memremap(sirbp.base_sirbp_gpa <<
> HV_HYP_PAGE_SHIFT,
> + HV_HYP_PAGE_SIZE, MEMREMAP_WB);
>
> + if (!(*event_ring_page))
> + goto cleanup_siefp;
> + } else {
> + /*
> + * On L1VH the hypervisor does not provide a SIRBP page.
> + * Allocate one and program its GPA into the MSR.
> + */
> + *event_ring_page = (struct hv_synic_event_ring_page *)
> + get_zeroed_page(GFP_KERNEL);
> +
> + if (!(*event_ring_page))
> + goto cleanup_siefp;
> +
> + sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
> + >> HV_HYP_PAGE_SHIFT;
> + }
> +
> + sirbp.sirbp_enabled = true;
> hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>
> if (mshv_sint_irq != -1)
> @@ -518,28 +548,30 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> sint.as_uint64);
>
> - /* Enable global synic bit */
> - sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> - sctrl.enable = 1;
> - hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + /* When VMBus is active it already enabled SCONTROL. */
> + if (!vmbus_active) {
> + union hv_synic_scontrol sctrl;
> +
> + sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> + sctrl.enable = 1;
> + hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + }
>
> return 0;
>
> -cleanup:
> - if (*event_ring_page) {
> - sirbp.sirbp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> - memunmap(*event_ring_page);
> - }
> - if (*event_flags_page) {
> +cleanup_siefp:
> + if (*event_flags_page)
> + memunmap(*event_flags_page);
> + if (!vmbus_active) {
> siefp.siefp_enabled = false;
> hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> - memunmap(*event_flags_page);
> }
> - if (*msg_page) {
> +cleanup_simp:
> + if (*msg_page)
> + memunmap(*msg_page);
> + if (!vmbus_active) {
> simp.simp_enabled = false;
> hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> - memunmap(*msg_page);
> }
>
> return -EFAULT;
> @@ -548,16 +580,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> static int mshv_synic_cpu_exit(unsigned int cpu)
> {
> union hv_synic_sint sint;
> - union hv_synic_simp simp;
> - union hv_synic_siefp siefp;
> union hv_synic_sirbp sirbp;
> - union hv_synic_scontrol sctrl;
> struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> struct hv_synic_event_flags_page **event_flags_page =
> &spages->synic_event_flags_page;
> struct hv_synic_event_ring_page **event_ring_page =
> &spages->synic_event_ring_page;
> + /* VMBus owns SIMP/SIEFP/SCONTROL when it is active */
> + bool vmbus_active = hv_vmbus_exists();
>
> /* Disable the interrupt */
> sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 +
> HV_SYNIC_INTERCEPTION_SINT_INDEX);
> @@ -574,28 +605,47 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
> if (mshv_sint_irq != -1)
> disable_percpu_irq(mshv_sint_irq);
>
> - /* Disable Synic's event ring page */
> + /* Disable SYNIC event ring page owned by MSHV */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> sirbp.sirbp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> - memunmap(*event_ring_page);
>
> - /* Disable Synic's event flags page */
> - siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> - siefp.siefp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> + if (hv_root_partition()) {
> + hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> + memunmap(*event_ring_page);
> + } else {
> + sirbp.base_sirbp_gpa = 0;
> + hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> + free_page((unsigned long)*event_ring_page);
> + }
> +
> + /*
> + * Release our mappings of the message and event flags pages.
> + * When VMBus is not active, we enabled SIMP/SIEFP - disable
> + * them. Otherwise VMBus owns the MSRs - leave them.
> + */
> memunmap(*event_flags_page);
> + if (!vmbus_active) {
> + union hv_synic_simp simp;
> + union hv_synic_siefp siefp;
>
> - /* Disable Synic's message page */
> - simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> - simp.simp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> + siefp.siefp_enabled = false;
> + hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +
> + simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> + simp.simp_enabled = false;
> + hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + }
> memunmap(*msg_page);
>
> - /* Disable global synic bit */
> - sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> - sctrl.enable = 0;
> - hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + /* When VMBus is active it owns SCONTROL - leave it. */
> + if (!vmbus_active) {
> + union hv_synic_scontrol sctrl;
> +
> + sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> + sctrl.enable = 0;
> + hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + }
>
> return 0;
> }
> --
> 2.43.0
>
^ permalink raw reply
* RE: [PATCH v3 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Michael Kelley @ 2026-05-04 15:09 UTC (permalink / raw)
To: Jork Loeser, linux-hyperv@vger.kernel.org
Cc: x86@kernel.org, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org, Anirudh Rayabharam,
Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260408013645.286723-4-jloeser@linux.microsoft.com>
From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Tuesday, April 7, 2026 6:37 PM
>
> After a kexec the logical processors and virtual processors already
> exist in the hypervisor because they were created by the previous
> kernel. Attempting to add them again causes either a BUG_ON or
> corrupted VP state leading to MCEs in the new kernel.
>
> Add hv_lp_exists() to probe whether an LP is already present by
> calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
> LP exists and we skip the add-LP and create-VP loops entirely.
>
> Also add hv_call_notify_all_processors_started() which informs the
> hypervisor that all processors are online. This is required after
> adding LPs (fresh boot) and is a no-op on kexec since we skip that
> path.
Adding hv_call_notify_all_processors_started() seems like it should be
a separate patch. And this paragraph in the commit message leaves me
with questions: Is it really "required"? If it is, how does the existing
upstream code ever work? Does the change need to be backported
to stable kernels? If it isn't *really* required, what are the implications
of not doing it?
>
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
> Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> arch/x86/kernel/cpu/mshyperv.c | 7 +++++
> drivers/hv/hv_proc.c | 47 ++++++++++++++++++++++++++++++++++
> include/asm-generic/mshyperv.h | 10 ++++++++
> include/hyperv/hvgdk_mini.h | 1 +
> include/hyperv/hvhdk_mini.h | 12 +++++++++
> 5 files changed, 77 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index e498b6b2ef19..b5b6a58b67b0 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -431,6 +431,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
> }
>
> #ifdef CONFIG_X86_64
> + /* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
> + if (num_present_cpus() == 1 || hv_lp_exists(1))
> + return;
> +
> for_each_present_cpu(i) {
> if (i == 0)
> continue;
> @@ -438,6 +442,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
> BUG_ON(ret);
> }
>
> + ret = hv_call_notify_all_processors_started();
> + WARN_ON(ret);
> +
> for_each_present_cpu(i) {
> if (i == 0)
> continue;
An observation: hv_smp_prepare_cpus() is getting to be a bit of a mess.
It handles both the SNP case and the root case, which aren't really related.
I could envision having hv_smp_prepare_cpus_for_snp() and
hv_smp_prepare_cpus_for_root() in order to separate the two cases
cleanly.
Then hv_smp_prepare_cpus_for_root() calls four functions in hv_proc.c,
all of which require stubs for the case where MSHV root isn't being built.
Better would be to move the root version of prepare CPUs functionality
into a new function in hv_proc.c, and only have a stub for that single
function. Three of the other four called functions could then become static.
The #ifdef CONFIG_X86_64 could also go away since hv_proc.c is only
built for x64.
I'll probably submit a separate patch to implement these suggested
cleanups, unless someone else wants to do it first.
Michael
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 3cb4b2a3035c..57b2c64197cb 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> return ret;
> }
> EXPORT_SYMBOL_GPL(hv_call_create_vp);
> +
> +int hv_call_notify_all_processors_started(void)
> +{
> + struct hv_input_notify_partition_event *input;
> + u64 status;
> + unsigned long irq_flags;
> + int ret = 0;
> +
> + local_irq_save(irq_flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> + input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
> + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
> + input, NULL);
> + local_irq_restore(irq_flags);
> +
> + if (!hv_result_success(status)) {
> + hv_status_err(status, "\n");
> + ret = hv_result_to_errno(status);
> + }
> + return ret;
> +}
> +
> +bool hv_lp_exists(u32 lp_index)
> +{
> + struct hv_input_get_logical_processor_run_time *input;
> + struct hv_output_get_logical_processor_run_time *output;
> + unsigned long flags;
> + u64 status;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> + input->lp_index = lp_index;
> + status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
> + input, output);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status) &&
> + hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
> + hv_status_err(status, "\n");
> + BUG();
> + }
> +
> + return hv_result_success(status);
> +}
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index d37b68238c97..bf601d67cecb 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
> int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
> int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> +int hv_call_notify_all_processors_started(void);
> +bool hv_lp_exists(u32 lp_index);
> int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>
> #else /* CONFIG_MSHV_ROOT */
> @@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
> {
> return -EOPNOTSUPP;
> }
> +static inline int hv_call_notify_all_processors_started(void)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline bool hv_lp_exists(u32 lp_index)
> +{
> + return false;
> +}
> static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> {
> return -EOPNOTSUPP;
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index f9600f87186a..6a4e8b9d570f 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents { /*
> HV_REGISTER_VP_ASSIST_PAGE */
> /* HV_CALL_CODE */
> #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
> #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
> +#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME 0x0004
> #define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
> #define HVCALL_SEND_IPI 0x000b
> #define HVCALL_ENABLE_VP_VTL 0x000f
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 091c03e26046..b4cb2fa26e9b 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -362,6 +362,7 @@ union hv_partition_event_input {
>
> enum hv_partition_event {
> HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
> + HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
> };
>
> struct hv_input_notify_partition_event {
> @@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
> union hv_partition_event_input input;
> } __packed;
>
> +struct hv_input_get_logical_processor_run_time {
> + u32 lp_index;
> +} __packed;
> +
> +struct hv_output_get_logical_processor_run_time {
> + u64 global_time;
> + u64 local_run_time;
> + u64 rsvdz0;
> + u64 hypervisor_time;
> +} __packed;
> +
> struct hv_lp_startup_status {
> u64 hv_status;
> u64 substatus1;
> --
> 2.43.0
>
^ permalink raw reply
* RE: [PATCH v3 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Michael Kelley @ 2026-05-04 15:08 UTC (permalink / raw)
To: Jork Loeser, linux-hyperv@vger.kernel.org
Cc: x86@kernel.org, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org, Anirudh Rayabharam
In-Reply-To: <20260408013645.286723-3-jloeser@linux.microsoft.com>
From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Tuesday, April 7, 2026 6:37 PM
>
> Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
> hv_machine_shutdown() in the platform code. This ensures stimer cleanup
> happens before the vmbus unload, which is required for root partition
> kexec to work correctly.
I know I'm late to the party in commenting on this patch set. But I'll add
my $.02 anyway on this patch and a couple others in the set.
I had difficulty understanding this patch's intent. The commit message
says the intent is to make sure stimer cleanup happens before VMBus
unload. But at first glance, the code in this patch doesn't change the
ordering. It appears to be immaterial whether hv_stimer_global_cleanup()
is called just before invoking hv_kexec_handler(), or as the first step of
hv_kexec_handler().
But then I realized that hv_kexec_handler isn't set unless the VMBus driver
is loaded and initialized, and that doesn't happen in the root partition.
In the old code, stimer cleanup is dependent on the VMBus driver being
loaded, and that's a wrong dependency, as stimer operation is independent
of VMBus (almost -- there's the old non-direct mode stimer path that
depends on the VMBus interrupt handler, which muddies the conceptual
separation). So the patch does the right thing, but the commit message
doesn't make the real reasoning clear. My feedback would be to improve
the commit message.
Michael
>
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
> drivers/hv/vmbus_drv.c | 1 -
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index a7dfc29d3470..e498b6b2ef19 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -237,8 +237,12 @@ void hv_remove_crash_handler(void)
> #ifdef CONFIG_KEXEC_CORE
> static void hv_machine_shutdown(void)
> {
> - if (kexec_in_progress && hv_kexec_handler)
> - hv_kexec_handler();
> + if (kexec_in_progress) {
> + hv_stimer_global_cleanup();
> +
> + if (hv_kexec_handler)
> + hv_kexec_handler();
> + }
Immediately following your code change is this code:
if (kexec_in_progress)
cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
So with this change there are two "if (kexec_in_progress)"
statements in a row. It's not wrong, but it looks like the
patch wasn't fully integrated into the existing code. The
cpuhp_remove_state() and the comment should be moved
under the newly added "if (kexec_in_progress)" and the
duplicate "if (kexec_in_progress)" can be dropped.
Michael
>
> /*
> * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 5e7a6839c933..c5dfe9f3b206 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2891,7 +2891,6 @@ static struct platform_driver vmbus_platform_driver = {
>
> static void hv_kexec_handler(void)
> {
> - hv_stimer_global_cleanup();
> vmbus_initiate_unload(false);
> /* Make sure conn_state is set as hv_synic_cleanup checks for it */
> mb();
> --
> 2.43.0
>
^ permalink raw reply
* Re: [PATCH 0/3] net: mana: Fix mana_destroy_rxq() cleanup for partial RXQ init
From: Dipayaan Roy @ 2026-05-03 3:38 UTC (permalink / raw)
To: Simon Horman
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260502165258.GM15617@horms.kernel.org>
On Sat, May 02, 2026 at 05:52:58PM +0100, Simon Horman wrote:
> On Wed, Apr 29, 2026 at 08:57:51PM -0700, Dipayaan Roy wrote:
> > When mana_create_rxq() fails partway through initialization (e.g. the
> > hardware rejects the WQ object creation), the error path calls
> > mana_destroy_rxq() to tear down a partially-initialized RXQ.
> > This exposed multiple issues in mana_destroy_rxq() path, as it assumed
> > the RXQ was always fully initialized, leading to multiple issues:
> >
> > 1. xdp_rxq_info_unreg() was called on an unregistered xdp_rxq,
> > triggering a WARN_ON ("Driver BUG") in net/core/xdp.c.
> >
> > 2. mana_destroy_wq_obj() was called with INVALID_MANA_HANDLE,
> > sending a bogus destroy command to the hardware.
> >
> > 3. mana_deinit_cq() was called twice — once inside mana_destroy_rxq()
> > and again in mana_create_rxq()'s error path — causing a
> > use-after-free since mana_destroy_rxq() frees the rxq first.
> >
> > This was observed during ethtool ring parameter changes when the
> > hardware returned an error creating the RXQ. This series makes
> > mana_destroy_rxq() safe to call at any stage of RXQ initialization
> > by guarding each teardown step, and removes the redundant cleanup
> > in mana_create_rxq().
>
> For the series:
>
> Reviewed-by: Simon Horman <horms@kernel.org>
>
> I don't think that you need to repost for this. But please keep in mind for
> future submissions that fixes for code present in the net tree should be
> targeted at that tree, like this:
>
> Subject: [PATCH net vN/M] ...
Thanks Simon,
it was a typo from my end.
>
> Also, FTR, there is an AI generated review of this patch-set available
> on sashiko.dev. It seems to me that the issues flagged there pre-date
> this patch-set and should not block progress of it. But you may wish
> to use that review as the basis of some follow-up.
Agreed, Sashiko flagged a pre-exisitng issue in the tx path.
I will send that as a separate patch set.
Thanks and Regards
Dipayaan Roy
^ permalink raw reply
* Re: [PATCH net, v2] net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
From: Dipayaan Roy @ 2026-05-03 3:03 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260501185324.0f02dc72@kernel.org>
On Fri, May 01, 2026 at 06:53:24PM -0700, Jakub Kicinski wrote:
> On Wed, 29 Apr 2026 11:57:55 -0700 Dipayaan Roy wrote:
> > During Function Level Reset recovery, the MANA driver reads
> > hardware BAR0 registers that may temporarily contain garbage values.
> > The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
> > to compute gc->shm_base, which is later dereferenced via readl() in
> > mana_smc_poll_register(). If the hardware returns an unaligned or
> > out-of-range value, the driver must not blindly use it, as this would
> > propagate the hardware error into a kernel crash.
> >
> > The following crash was observed on an arm64 Hyper-V guest running
> > kernel 6.17.0-3013-azure during VF reset recovery triggered by HWC
> > timeout.
> >
> > [13291.785274] Unable to handle kernel paging request at virtual address ffff8000a200001b
> > [13291.785311] Mem abort info:
> > [13291.785332] ESR = 0x0000000096000021
> > [13291.785343] EC = 0x25: DABT (current EL), IL = 32 bits
> > [13291.785355] SET = 0, FnV = 0
> > [13291.785363] EA = 0, S1PTW = 0
> > [13291.785372] FSC = 0x21: alignment fault
> > [13291.785382] Data abort info:
> > [13291.785391] ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
> > [13291.785404] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [13291.785412] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [13291.785421] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000014df3a1000
> > [13291.785432] [ffff8000a200001b] pgd=1000000100438403, p4d=1000000100438403, pud=1000000100439403, pmd=0068000fc2000711
> > [13291.785703] Internal error: Oops: 0000000096000021 [#1] SMP
> > [13291.830975] Modules linked in: tls qrtr mana_ib ib_uverbs ib_core xt_owner xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables cfg80211 8021q garp mrp stp llc binfmt_misc joydev serio_raw nls_iso8859_1 hid_generic aes_ce_blk aes_ce_cipher polyval_ce ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher hid_hyperv sm4 sm3_ce sha3_ce hv_netvsc hid vmgenid hyperv_keyboard hyperv_drm sch_fq_codel nvme_fabrics efi_pstore dm_multipath nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vmw_vmci vsock dmi_sysfs ip_tables x_tables autofs4
> > [13291.862630] CPU: 122 UID: 0 PID: 61796 Comm: kworker/122:2 Tainted: G W 6.17.0-3013-azure #13-Ubuntu VOLUNTARY
> > [13291.869902] Tainted: [W]=WARN
> > [13291.871901] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 01/08/2026
> > [13291.878086] Workqueue: events mana_serv_func
> > [13291.880718] pstate: 62400005 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
> > [13291.884835] pc : mana_smc_poll_register+0x48/0xb0
> > [13291.887902] lr : mana_smc_setup_hwc+0x70/0x1c0
> > [13291.890493] sp : ffff8000ab79bbb0
> > [13291.892364] x29: ffff8000ab79bbb0 x28: ffff00410c8b5900 x27: ffff00410d630680
> > [13291.896252] x26: ffff004171f9fd80 x25: 000000016ed55000 x24: 000000017f37e000
> > [13291.899990] x23: 0000000000000000 x22: 000000016ed55000 x21: 0000000000000000
> > [13291.904497] x20: ffff8000a200001b x19: 0000000000004e20 x18: ffff8000a6183050
> > [13291.908308] x17: 0000000000000000 x16: 0000000000000000 x15: 000000000000000a
> > [13291.912542] x14: 0000000000000004 x13: 0000000000000000 x12: 0000000000000000
> > [13291.916298] x11: 0000000000000000 x10: 0000000000000001 x9 : ffffc45006af1bd8
> > [13291.920945] x8 : ffff000151129000 x7 : 0000000000000000 x6 : 0000000000000000
> > [13291.925293] x5 : 000000015f214000 x4 : 000000017217a000 x3 : 000000016ed50000
> > [13291.930436] x2 : 000000016ed55000 x1 : 0000000000000000 x0 : ffff8000a1ffffff
> > [13291.934342] Call trace:
> > [13291.935736] mana_smc_poll_register+0x48/0xb0 (P)
> > [13291.938611] mana_smc_setup_hwc+0x70/0x1c0
> > [13291.941113] mana_hwc_create_channel+0x1a0/0x3a0
> > [13291.944283] mana_gd_setup+0x16c/0x398
> > [13291.946584] mana_gd_resume+0x24/0x70
> > [13291.948917] mana_do_service+0x13c/0x1d0
> > [13291.951583] mana_serv_func+0x34/0x68
> > [13291.953732] process_one_work+0x168/0x3d0
> > [13291.956745] worker_thread+0x2ac/0x480
> > [13291.959104] kthread+0xf8/0x110
> > [13291.961026] ret_from_fork+0x10/0x20
> > [13291.963560] Code: d2807d00 9417c551 71000673 54000220 (b9400281)
> > [13291.967299] ---[ end trace 0000000000000000 ]---
> >
> > Disassembly of mana_smc_poll_register() around the crash site:
> >
> > Disassembly of section .text:
> >
> > 00000000000047c8 <mana_smc_poll_register>:
> > 47c8: d503201f nop
> > 47cc: d503201f nop
> > 47d0: d503233f paciasp
> > 47d4: f800865e str x30, [x18], #8
> > 47d8: a9bd7bfd stp x29, x30, [sp, #-48]!
> > 47dc: 910003fd mov x29, sp
> > 47e0: a90153f3 stp x19, x20, [sp, #16]
> > 47e4: 91007014 add x20, x0, #0x1c
> > 47e8: 5289c413 mov w19, #0x4e20
> > 47ec: f90013f5 str x21, [sp, #32]
> > 47f0: 12001c35 and w21, w1, #0xff
> > 47f4: 14000008 b 4814 <mana_smc_poll_register+0x4c>
> > 47f8: 36f801e1 tbz w1, #31, 4834 <mana_smc_poll_register+0x6c>
> > 47fc: 52800042 mov w2, #0x2
> > 4800: d280fa01 mov x1, #0x7d0
> > 4804: d2807d00 mov x0, #0x3e8
> > 4808: 94000000 bl 0 <usleep_range_state>
> > 480c: 71000673 subs w19, w19, #0x1
> > 4810: 54000200 b.eq 4850 <mana_smc_poll_register+0x88>
> > 4814: b9400281 ldr w1, [x20] <-- **** CRASHED HERE *****
> > 4818: d50331bf dmb oshld
> > 481c: 2a0103e2 mov w2, w1
> > ...
> >
> > From the crash signature x20 = ffff8000a200001b, this address
> > ends in 0x1b which is not 4-byte aligned, so the 'ldr w1, [x20]'
> > instruction (readl) triggers the arm64 alignment fault (FSC = 0x21).
> >
> > The root cause is in mana_gd_init_vf_regs(), which computes:
> >
> > gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> >
> > The offset is used without any validation. The same problem exists
> > in mana_gd_init_pf_regs() for sriov_base_off and sriov_shm_off.
> >
> > Fix this by validating all offsets before use:
> >
> > - VF: check shm_off is within BAR0, properly aligned to 4 bytes
> > (readl requirement), and leaves room for the full 256-bit
> > (32-byte) SMC aperture.
> >
> > - PF: check sriov_base_off is within BAR0, aligned to 8 bytes
> > (readq requirement), and leaves room to safely read the
> > sriov_shm_off register at sriov_base_off + GDMA_PF_REG_SHM_OFF.
> > Then check sriov_shm_off leaves room for the full SMC aperture.
> > All arithmetic uses subtraction rather than addition to avoid
> > integer overflow on garbage firmware values.
> >
> > without validating the offset read from hardware. If the register
> > returns a garbage value that is neither within bar 0 bounds nor aligned
> > to the 4-byte granularity, thus causing the alignment fault.
> >
> > Define SMC_APERTURE_SIZE (32 bytes, derived from the 256-bit aperture
> > width)
> >
> > Return -EPROTO on invalid values. The existing recovery path in
> > mana_serv_reset() already handles -EPROTO by falling through to PCI
> > device rescan, giving the hardware another chance to present valid
> > register values after reset.
> >
> > Fixes: 9bf66036d686 ("net: mana: Handle hardware recovery events when probing the device")
> > Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> >
> > ---
> > Changes in v2:
> > - Fix sriov_base_off alignment check: sizeof(u32) to sizeof(u64), since
> > mana_gd_r64() (readq) requires 8-byte alignment on arm64.
> > - Fix sriov_base_off bounds: also verify enough space remains in BAR0
> > to safely read sriov_shm_off at offset GDMA_PF_REG_SHM_OFF + 8 bytes.
> > - Fix integer overflow: rewrite bounds checks using subtraction
> > (remaining = bar0_size - base) instead of addition.
> > - Fix SMC aperture size: add gc->bar0_size - shm_off < SMC_APERTURE_SIZE
> > checks in both VF and PF paths; previously only the start address was
> > validated, but mana_smc_poll_register() accesses up to shm_base + 0x1c
> > (28 bytes from base, 32 bytes total).
> > - Export SMC_APERTURE_SIZE to shm_channel.h.
> > ---
> > .../net/ethernet/microsoft/mana/gdma_main.c | 40 ++++++++++++++++---
> > include/net/mana/shm_channel.h | 6 +++
> > 2 files changed, 41 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..d8e816882f02 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -43,8 +43,9 @@ static u64 mana_gd_r64(struct gdma_context *g, u64 offset)
> > static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> > {
> > struct gdma_context *gc = pci_get_drvdata(pdev);
> > - void __iomem *sriov_base_va;
> > + u64 remaining_barsize;
> > u64 sriov_base_off;
> > + u64 sriov_shm_off;
> >
> > gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
> >
> > @@ -73,10 +74,28 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> > gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >
> > sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> > + if (sriov_base_off >= gc->bar0_size ||
> > + gc->bar0_size - sriov_base_off <
> > + GDMA_PF_REG_SHM_OFF + sizeof(u64) ||
>
> nit: fits on a single line, I think?
>
It goes beyond the limit of 80, hence did it this way.
> > + !IS_ALIGNED(sriov_base_off, sizeof(u64))) {
> > + dev_err(gc->dev,
> > + "SRIOV base offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > + sriov_base_off, (u64)gc->bar0_size);
> > + return -EPROTO;
> > + }
> >
> > - sriov_base_va = gc->bar0_va + sriov_base_off;
> > - gc->shm_base = sriov_base_va +
> > - mana_gd_r64(gc, sriov_base_off + GDMA_PF_REG_SHM_OFF);
> > + remaining_barsize = gc->bar0_size - sriov_base_off;
> > + sriov_shm_off = mana_gd_r64(gc, sriov_base_off + GDMA_PF_REG_SHM_OFF);
> > + if (sriov_shm_off >= remaining_barsize ||
> > + remaining_barsize - sriov_shm_off < SMC_APERTURE_SIZE ||
> > + !IS_ALIGNED(sriov_shm_off, sizeof(u32))) {
> > + dev_err(gc->dev,
> > + "SRIOV SHM offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > + sriov_shm_off, (u64)gc->bar0_size);
> > + return -EPROTO;
> > + }
> > +
> > + gc->shm_base = gc->bar0_va + sriov_base_off + sriov_shm_off;
> >
> > return 0;
> > }
> > @@ -84,6 +103,7 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> > static int mana_gd_init_vf_regs(struct pci_dev *pdev)
> > {
> > struct gdma_context *gc = pci_get_drvdata(pdev);
> > + u64 shm_off;
> >
> > gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
> >
> > @@ -111,7 +131,17 @@ static int mana_gd_init_vf_regs(struct pci_dev *pdev)
> > gc->db_page_base = gc->bar0_va + gc->db_page_off;
> > gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >
> > - gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > + shm_off = mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > + if (shm_off >= gc->bar0_size ||
> > + gc->bar0_size - shm_off < SMC_APERTURE_SIZE ||
> > + !IS_ALIGNED(shm_off, sizeof(u32))) {
> > + dev_err(gc->dev,
> > + "SHM offset 0x%llx out of range or unaligned (BAR0 size 0x%llx)\n",
> > + shm_off, (u64)gc->bar0_size);
> > + return -EPROTO;
> > + }
> > +
> > + gc->shm_base = gc->bar0_va + shm_off;
> >
> > return 0;
> > }
> > diff --git a/include/net/mana/shm_channel.h b/include/net/mana/shm_channel.h
> > index 5199b41497ff..dbabcfb95daf 100644
> > --- a/include/net/mana/shm_channel.h
> > +++ b/include/net/mana/shm_channel.h
> > @@ -4,6 +4,12 @@
> > #ifndef _SHM_CHANNEL_H
> > #define _SHM_CHANNEL_H
> >
> > +#define SMC_APERTURE_BITS 256
> > +#define SMC_BASIC_UNIT (sizeof(u32))
> > +#define SMC_APERTURE_DWORDS (SMC_APERTURE_BITS / (SMC_BASIC_UNIT * 8))
> > +#define SMC_LAST_DWORD (SMC_APERTURE_DWORDS - 1)
> > +#define SMC_APERTURE_SIZE (SMC_APERTURE_BITS / 8)
>
> AI bots complain that we're redefining this.
> Since it's a fix I think it's better to remove the existing definition
> even if it lives in a driver that goes via a different tree.
>
Ack, removed this in the next version.
> > struct shm_channel {
> > struct device *dev;
> > void __iomem *base;
> --
> pw-bot: cr
Hi Jakub,
Thanks for the comments, I have shared v3 addressing it.
Regards
Dipayaan Roy
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox