Linux-HyperV List


* [PATCH 06/10] mshv: Fix duplicate GSI detection for GSI 0
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The duplicate routing entry check in mshv_update_routing_table() uses
guest_irq_num != 0 to detect whether a GSI slot is already occupied.
This fails for GSI 0 because its guest_irq_num is 0 both when the slot
is unused (zero-initialized) and when legitimately assigned. As a
result, duplicate entries for GSI 0 are silently accepted, with the
second entry overwriting the first — corrupting the routing table
without any error reported to userspace.

While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
in practice, the check is semantically wrong — it conflates
"uninitialized" with "GSI number 0." Use girq_entry_valid instead,
which is explicitly set to true when an entry is populated and remains
zero for unused slots regardless of the GSI number.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_irq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index b3142c84dcbc2..65a4ffc82d566 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -51,7 +51,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
 		/*
 		 * Allow only one to one mapping between GSI and MSI routing.
 		 */
-		if (girq->guest_irq_num != 0) {
+		if (girq->girq_entry_valid) {
 			r = -EINVAL;
 			goto out;
 		}
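
The conflation described in the commit message can be reproduced in a small userspace sketch. Everything below (demo_girq, demo_add_route, DEMO_MAX_GSI, the two predicates) is a hypothetical stand-in, not the kernel's data structures; it only assumes a zero-initialized table indexed by GSI number.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_GSI 8 /* hypothetical table size for this sketch */

/* Hypothetical stand-in for one routing slot; not the kernel's struct. */
struct demo_girq {
	uint32_t guest_irq_num;
	bool entry_valid;
};

/* Old-style predicate: a zeroed slot and an assigned GSI 0 look identical. */
static bool demo_slot_taken_by_num(const struct demo_girq *slot)
{
	return slot->guest_irq_num != 0;
}

/* New-style predicate: occupancy is tracked separately from the GSI number. */
static bool demo_slot_taken_by_flag(const struct demo_girq *slot)
{
	return slot->entry_valid;
}

/*
 * Add a route for gsi. Returns 0 on success, or -1 if gsi is out of range
 * or the supplied predicate reports the slot as already occupied.
 */
static int demo_add_route(struct demo_girq *table, uint32_t gsi,
			  bool (*taken)(const struct demo_girq *))
{
	if (gsi >= DEMO_MAX_GSI)
		return -1;
	if (taken(&table[gsi]))
		return -1;

	table[gsi].guest_irq_num = gsi;
	table[gsi].entry_valid = true;
	return 0;
}
```

With the number-based predicate, adding GSI 0 twice succeeds both times; with the flag-based predicate, the second attempt is rejected. Non-zero GSIs behave the same under either predicate.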


* [PATCH 05/10] mshv: Fix level-triggered check on uninitialized data
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

In mshv_irqfd_assign(), the level-triggered validation for resample
irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
mshv_irqfd_update() has populated the field. Since the irqfd struct is
zero-allocated, level_triggered is always 0 at that point, causing the
check to always reject resample irqfds with -EINVAL. This makes
level-triggered interrupt resampling — used to avoid interrupt storms
with assigned devices — completely non-functional.

Move the check after the mshv_irqfd_update() call, which resolves the
IRQ routing entry and populates irqfd_lapic_irq with the actual trigger
mode.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index d9491a14f30f1..fd594acce3235 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -478,6 +478,19 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 	init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc);
 
 	spin_lock_irq(&pt->pt_irqfds_lock);
+	ret = 0;
+	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
+		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
+			continue;
+		/* This fd is used for another irq already. */
+		ret = -EBUSY;
+		spin_unlock_irq(&pt->pt_irqfds_lock);
+		goto fail;
+	}
+
+	idx = srcu_read_lock(&pt->pt_irq_srcu);
+	mshv_irqfd_update(pt, irqfd);
+
 #if IS_ENABLED(CONFIG_X86)
 	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
 	    !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
@@ -486,22 +499,12 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 		 * Otherwise return with failure
 		 */
 		spin_unlock_irq(&pt->pt_irqfds_lock);
+		srcu_read_unlock(&pt->pt_irq_srcu, idx);
 		ret = -EINVAL;
 		goto fail;
 	}
 #endif
-	ret = 0;
-	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
-		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
-			continue;
-		/* This fd is used for another irq already. */
-		ret = -EBUSY;
-		spin_unlock_irq(&pt->pt_irqfds_lock);
-		goto fail;
-	}
 
-	idx = srcu_read_lock(&pt->pt_irq_srcu);
-	mshv_irqfd_update(pt, irqfd);
 	hlist_add_head(&irqfd->irqfd_hnode, &pt->pt_irqfds_list);
 	spin_unlock_irq(&pt->pt_irqfds_lock);
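
The ordering problem from the commit message, reduced to a userspace sketch. The names (demo_irqfd, demo_update, the two demo_assign_* functions) are hypothetical; calloc() stands in for the zero-allocation of the irqfd and demo_update() for the routing lookup that fills in the trigger mode.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the irqfd; only the field under discussion. */
struct demo_irqfd {
	bool level_triggered;
};

/* Stands in for the routing lookup that populates the trigger mode. */
static void demo_update(struct demo_irqfd *fd, bool routed_level_triggered)
{
	fd->level_triggered = routed_level_triggered;
}

/* Check before populate: the field still holds its zero-allocated value. */
static int demo_assign_check_first(bool want_resample, bool routed_level)
{
	struct demo_irqfd *fd = calloc(1, sizeof(*fd));
	int ret = 0;

	if (!fd)
		return -ENOMEM;
	if (want_resample && !fd->level_triggered)
		ret = -EINVAL; /* taken for every resample request */
	else
		demo_update(fd, routed_level);
	free(fd);
	return ret;
}

/* Populate before check: the test sees the routed trigger mode. */
static int demo_assign_update_first(bool want_resample, bool routed_level)
{
	struct demo_irqfd *fd = calloc(1, sizeof(*fd));
	int ret = 0;

	if (!fd)
		return -ENOMEM;
	demo_update(fd, routed_level);
	if (want_resample && !fd->level_triggered)
		ret = -EINVAL; /* only for edge-triggered routes */
	free(fd);
	return ret;
}
```

In this model demo_assign_check_first(true, true) returns -EINVAL even though the route is level-triggered, while demo_assign_update_first(true, true) returns 0 and still rejects the edge-triggered case.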
 




* [PATCH 04/10] mshv: Fix broken seqcount read protection
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_irqfd_update() writes both irqfd_girq_ent and irqfd_lapic_irq as a
logical unit under seqcount write protection. Readers must snapshot these
fields inside the seqcount begin/retry loop to obtain a consistent
point-in-time view — otherwise a concurrent update can produce a torn
read where one field comes from the old state and the other from the new.

Both mshv_assert_irq_slow() and mshv_irqfd_wakeup() get this wrong: the
seqcount loop bodies are empty (just spinning until a stable sequence is
observed), and all reads of the protected fields happen after the loop
with no protection from concurrent writes. If mshv_irqfd_update() races
with interrupt assertion, the caller may use a stale or mixed
vector/apic_id/control combination — delivering an interrupt to the
wrong vCPU, with the wrong vector, or with the wrong trigger mode. This
can cause spurious or lost interrupts in the guest, or a stuck interrupt
line in the level-triggered case.

Fix mshv_assert_irq_slow() by snapshotting both irqfd_girq_ent and
irqfd_lapic_irq into local variables inside the seqcount loop, then
using those locals for the validity check and the hypercall.

Fix mshv_irqfd_wakeup() by snapshotting irqfd_lapic_irq inside its
seqcount loop and passing the snapshot to mshv_try_assert_irq_fast(),
so the fast path operates on the consistent copy rather than reading
the field directly outside seqcount protection.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   47 +++++++++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 704c229ee3b19..d9491a14f30f1 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -151,10 +151,10 @@ static int mshv_vp_irq_set_vector(struct mshv_vp *vp, u32 vector)
  * Try to raise irq for guest via shared vector array. hyp does the actual
  * inject of the interrupt.
  */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+				    const struct mshv_lapic_irq *irq)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
 	struct mshv_vp *vp;
 
 	if (!(ms_hyperv.ext_features &
@@ -186,7 +186,8 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 	return 0;
 }
 #else /* CONFIG_X86_64 */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+				    const struct mshv_lapic_irq *irq)
 {
 	return -EOPNOTSUPP;
 }
@@ -195,30 +196,32 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
+	struct mshv_guest_irq_ent girq_ent;
+	struct mshv_lapic_irq irq;
 	unsigned int seq;
 	int idx;
 
-#if IS_ENABLED(CONFIG_X86)
-	WARN_ON(irqfd->irqfd_resampler &&
-		!irq->lapic_control.level_triggered);
-#endif
-
 	idx = srcu_read_lock(&partition->pt_irq_srcu);
-	if (irqfd->irqfd_girq_ent.guest_irq_num) {
-		if (!irqfd->irqfd_girq_ent.girq_entry_valid) {
-			srcu_read_unlock(&partition->pt_irq_srcu, idx);
-			return;
-		}
 
-		do {
-			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
-		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+	do {
+		seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+		girq_ent = irqfd->irqfd_girq_ent;
+		irq = irqfd->irqfd_lapic_irq;
+	} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
+	if (girq_ent.guest_irq_num && !girq_ent.girq_entry_valid) {
+		srcu_read_unlock(&partition->pt_irq_srcu, idx);
+		return;
 	}
 
-	hv_call_assert_virtual_interrupt(irqfd->irqfd_partn->pt_id,
-					 irq->lapic_vector, irq->lapic_apic_id,
-					 irq->lapic_control);
+#if IS_ENABLED(CONFIG_X86)
+	WARN_ON(irqfd->irqfd_resampler &&
+		!irq.lapic_control.level_triggered);
+#endif
+
+	hv_call_assert_virtual_interrupt(partition->pt_id,
+					 irq.lapic_vector, irq.lapic_apic_id,
+					 irq.lapic_control);
 	srcu_read_unlock(&partition->pt_irq_srcu, idx);
 }
 
@@ -304,16 +307,18 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
 	int ret = 0;
 
 	if (flags & EPOLLIN) {
+		struct mshv_lapic_irq irq;
 		u64 cnt;
 
 		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
 		idx = srcu_read_lock(&pt->pt_irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+			irq = irqfd->irqfd_lapic_irq;
 		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
 
 		/* An event has been signaled, raise an interrupt */
-		ret = mshv_try_assert_irq_fast(irqfd);
+		ret = mshv_try_assert_irq_fast(irqfd, &irq);
 		if (ret)
 			mshv_assert_irq_slow(irqfd);
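
For readers less familiar with the pattern, here is a userspace sketch of a sequence-counter read loop that snapshots the protected fields inside the loop, as the commit message requires. It uses C11 atomics rather than the kernel's seqcount API, assumes a single writer, and all names (demo_irqfd, demo_irq, demo_init, demo_write, demo_read) are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plain snapshot type handed to callers. */
struct demo_irq {
	uint32_t vector;
	uint32_t apic_id;
};

/* Hypothetical protected object: even seq = stable, odd = write in progress. */
struct demo_irqfd {
	atomic_uint seq;
	_Atomic uint32_t vector;
	_Atomic uint32_t apic_id;
};

static void demo_init(struct demo_irqfd *fd)
{
	atomic_init(&fd->seq, 0);
	atomic_init(&fd->vector, 0);
	atomic_init(&fd->apic_id, 0);
}

/* Single writer: bump seq to odd, update both fields, bump seq to even. */
static void demo_write(struct demo_irqfd *fd, struct demo_irq next)
{
	unsigned int s = atomic_load_explicit(&fd->seq, memory_order_relaxed);

	atomic_store_explicit(&fd->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&fd->vector, next.vector, memory_order_relaxed);
	atomic_store_explicit(&fd->apic_id, next.apic_id, memory_order_relaxed);
	atomic_store_explicit(&fd->seq, s + 2, memory_order_release);
}

/* Reader: copy the fields INSIDE the loop, retry if seq was odd or moved. */
static struct demo_irq demo_read(struct demo_irqfd *fd)
{
	struct demo_irq snap;
	unsigned int s;

	do {
		s = atomic_load_explicit(&fd->seq, memory_order_acquire);
		snap.vector = atomic_load_explicit(&fd->vector,
						   memory_order_relaxed);
		snap.apic_id = atomic_load_explicit(&fd->apic_id,
						    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while ((s & 1u) ||
		 atomic_load_explicit(&fd->seq, memory_order_relaxed) != s);

	return snap;
}
```

An empty loop body followed by reads of fd->vector and fd->apic_id after the loop would only prove that the counter was stable at some earlier moment; the values read afterwards could come from different writes. Returning the snapshot by value gives the caller the same guarantee the patch gets from its local copies.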
 




* [PATCH 03/10] mshv: Fix missing lock in mshv_irqfd_deassign
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_irqfd_deactivate() and the hlist traversal of pt_irqfds_list
require pt->pt_irqfds_lock to be held, but mshv_irqfd_deassign()
omits it. This races with the EPOLLHUP path in mshv_irqfd_wakeup(),
which does take the lock before calling mshv_irqfd_deactivate().

Add the missing spin_lock_irq/spin_unlock_irq around the list
traversal, matching the pattern in mshv_irqfd_release().

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 90959f639dc32..704c229ee3b19 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -541,13 +541,14 @@ static int mshv_irqfd_deassign(struct mshv_partition *pt,
 	if (IS_ERR(eventfd))
 		return PTR_ERR(eventfd);
 
+	spin_lock_irq(&pt->pt_irqfds_lock);
 	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list,
 				  irqfd_hnode) {
 		if (irqfd->irqfd_eventfd_ctx == eventfd &&
 		    irqfd->irqfd_irqnum == args->gsi)
-
 			mshv_irqfd_deactivate(irqfd);
 	}
+	spin_unlock_irq(&pt->pt_irqfds_lock);
 
 	eventfd_ctx_put(eventfd);
 




* [PATCH 02/10] mshv: Fix potential integer overflow in mshv_region_create
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The allocation size is computed as:

  sizeof(*region) + sizeof(struct page *) * nr_pages

where nr_pages is a u64 originating from userspace. A sufficiently
large nr_pages can overflow the multiplication, resulting in a small
allocation followed by out-of-bounds writes when populating mreg_pages.

Use struct_size() which returns SIZE_MAX on overflow, causing vzalloc
to safely return NULL — caught by the existing error check.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..1d04a97980b8b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -177,7 +177,7 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 {
 	struct mshv_mem_region *region;
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	region = vzalloc(struct_size(region, mreg_pages, nr_pages));
 	if (!region)
 		return ERR_PTR(-ENOMEM);
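
A userspace sketch of the saturating size computation described in the commit message. struct_size() is a kernel helper; the code below approximates its overflow behaviour with explicit checks and is not the kernel macro. The names demo_region, demo_region_size and demo_region_create are hypothetical. With 8-byte pointers, an nr_pages of 2^61 makes the naive product wrap to 0, so the unchecked expression would yield only sizeof(*region).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header followed by a flexible array of page pointers. */
struct demo_region {
	uint64_t nr_pages;
	void *pages[];
};

/*
 * Size of a demo_region holding nr_pages pointers, or SIZE_MAX if the
 * multiplication or the addition would overflow size_t.
 */
static size_t demo_region_size(uint64_t nr_pages)
{
	size_t bytes;

	if (nr_pages > SIZE_MAX / sizeof(void *))
		return SIZE_MAX;
	bytes = (size_t)nr_pages * sizeof(void *);
	if (bytes > SIZE_MAX - sizeof(struct demo_region))
		return SIZE_MAX;
	return sizeof(struct demo_region) + bytes;
}

/* Returns NULL on overflow or allocation failure; caller frees the result. */
static struct demo_region *demo_region_create(uint64_t nr_pages)
{
	size_t size = demo_region_size(nr_pages);
	struct demo_region *region;

	if (size == SIZE_MAX)
		return NULL;
	region = calloc(1, size);
	if (region)
		region->nr_pages = nr_pages;
	return region;
}
```

The sketch rejects SIZE_MAX before allocating, whereas the patch relies on vzalloc() failing for that size; either way an oversized request ends in the existing NULL check rather than in a short allocation.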
 




* [PATCH 01/10] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.

Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.

While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index f8c2341193da5..61871ad131b4b 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1041,7 +1041,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
-	int done = 0;
+	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
 	u64 page_count = page_struct_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1058,9 +1058,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain,
-				    HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain,
+				      HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1074,15 +1074,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->flags = flags;
 		input_page->host_access = host_access;
 
-		for (i = 0; i < rep_count; i++) {
-			u64 index = (done + i) << large_shift;
-
-			if (index >= page_struct_count)
-				return -EINVAL;
-
+		for (i = 0; i < rep_count; i++)
 			input_page->spa_page_list[i] =
-						page_to_pfn(pages[index]);
-		}
+				page_to_pfn(pages[(done + i) << large_shift]);
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);
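
The redundancy argument in the commit message can be checked mechanically for small inputs. The sketch below replays the batching loop in userspace and reports whether any index (done + i) << large_shift reaches page_struct_count. It assumes every batch completes in full, and demo_indices_in_bounds and its batch parameter (standing in for HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Replay the batch loop and return true if every index it would use to
 * address the pages array is below page_struct_count.
 */
static bool demo_indices_in_bounds(uint64_t page_struct_count,
				   unsigned int large_shift, uint64_t batch)
{
	uint64_t page_count = page_struct_count >> large_shift;
	uint64_t done = 0;

	if (batch == 0)
		return false; /* a zero batch would never make progress */

	while (done < page_count) {
		uint64_t remain = page_count - done;
		uint64_t rep_count = remain < batch ? remain : batch;
		uint64_t i;

		for (i = 0; i < rep_count; i++) {
			uint64_t index = (done + i) << large_shift;

			if (index >= page_struct_count)
				return false;
		}
		done += rep_count; /* assumption: the whole batch completes */
	}
	return true;
}
```

The reasoning behind the result: done + i is at most page_count - 1, so the shifted index is at most (page_count << large_shift) - (1 << large_shift), and page_count << large_shift can never exceed page_struct_count because it is page_struct_count with its low large_shift bits cleared.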




* [PATCH 00/10] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

 This series addresses bugs found during a review of the mshv_root module
 introduced by commit 621191d709b14 ("Drivers: hv: Introduce mshv_root
 module to expose /dev/mshv to VMMs").

 The fixes range from data corruption and use-after-free to silent
 functional failures:

  - IRQ state leak and type truncation in hypercall helpers
    (hv_call_modify_spa_host_access)
  - Integer overflow on userspace-controlled allocation size
    (mshv_region_create)
  - Missing locking, broken seqcount read protection, and a check on
    uninitialized data in the irqfd path — the latter makes
    level-triggered interrupt resampling completely non-functional
  - Duplicate GSI 0 detection using the wrong predicate
  - Use-after-RCU in port ID lookup
  - Missing VP index bounds check in intercept ISR (OOB in interrupt
    context)
  - Missing error code on VP allocation failure (silent success to
    userspace)

---

Stanislav Kinsburskii (10):
      mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
      mshv: Fix potential integer overflow in mshv_region_create
      mshv: Fix missing lock in mshv_irqfd_deassign
      mshv: Fix broken seqcount read protection
      mshv: Fix level-triggered check on uninitialized data
      mshv: Fix duplicate GSI detection for GSI 0
      mshv: Fix use-after-RCU in mshv_portid_lookup
      mshv: Use kfree_rcu in mshv_portid_free
      mshv: Add missing vp_index bounds check in intercept ISR
      mshv: Fix missing error code on VP allocation failure


 drivers/hv/mshv_eventfd.c      |   75 ++++++++++++++++++++++------------------
 drivers/hv/mshv_irq.c          |    2 +
 drivers/hv/mshv_portid_table.c |    6 +--
 drivers/hv/mshv_regions.c      |    2 +
 drivers/hv/mshv_root_hv_call.c |   18 +++-------
 drivers/hv/mshv_root_main.c    |    4 ++
 drivers/hv/mshv_synic.c        |    4 ++
 7 files changed, 59 insertions(+), 52 deletions(-)



* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-29 18:01 UTC
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Tuesday, April 28, 2026 6:58 PM
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, April 23, 2026 10:40 AM

[snip]

> 
> > Question about Gen 1 VMs: If the Linux frame buffer driver moves
> > the frame buffer somewhere other than the default location, and
> > then the VM does a kexec/kdump, what does the legacy PCI graphic
> > device BAR report as the frame buffer location? Does it *always*
> > report 4G-128MB, or does it report the new location? I can run
> 
> It always reports 4G-128MB.

OK, good to know. I was hoping it might report the new location. :-(

> BTW, I suspect a Gen2 VM may have the same issue, i.e.
> currently we only reserve 8MB below 4GB; if hyperv_drm uses
> high MMIO, I suspect the UEFI firmware would still report the
> same original low MMIO framebuffer base/size to the kdump kernel,
> but there is no easy way to verify this for Gen2 VMs...
> 

[snip]

> 
> However,  when the kdump kernel starts to run, and I print the
> pci_resource_start(pdev, 0) and pci_resource_len(pdev, 0)
> from vmbus_reserve_fb(), I still see 4G-128MB:
> [   12.506159] Gen1 VM: start=0xf8000000, size=0x4000000
> 
> In this case, we can't really fix the MMIO conflict, e.g.
> if both hv_pci and hyperv_drm are built as modules, then
> the order of loading them can be nondeterministic: if the order
> in the first kernel is different from the order in
> the kdump kernel, we run into trouble.

Yep.

> 
> If the order is deterministic (e.g. hv_pci is
> built-in, and hyperv_drm is built as a module),
> we should be good since both allocate MMIO from
> the high MMIO range in a deterministic way.
> 

Yep.

Thanks,

Michael


* RE: [PATCH] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Michael Kelley @ 2026-04-29 18:01 UTC
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, matthew.ruffell@canonical.com,
	johansen@templeofstupid.com
  Cc: stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69214DC322549834104D26E0BF342@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Tuesday, April 28, 2026 8:13 PM
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, April 23, 2026 10:40 AM

[snip]

> > > +	/* Hyper-V CoCo guests do not have a framebuffer device. */
> > > +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> > > +		return;
> >
> > This test is testing feature "A" (mem encryption) in order to determine
> > the presence of feature "B" (no framebuffer), because current
> > configurations happen to always have "A" and "B" at the same time. But
> > the linkage between the features is tenuous, and if configurations should
> > change in the future, testing this way could be bogus. It works now, but I'm
> > leery of depending on the linkage between "A" and "B".
> >
> > You could set up a "can_have_framebuffer" flag in ms_hyperv_init_platform()
> > if running in a CVM, and test that flag here. But I'd suggest just dropping
> > this optimization. CVMs are always Gen2 (and that's not going to change),
> > so they have plenty of low mmio space.
> 
> This is not true on a lab host, e.g. I have a TDX VM on a lab host created
> by these 2 commands (without the 2nd command, Hyper-V won't allow
> the TDX VM to start):
> 
>     New-VM -Generation 2 -GuestStateIsolationType Tdx -Name $vmName
>     Disable-VMConsoleSupport -VMName $vmName
> 
> The low_mmio_base is still 4GB-128MB. In this case, it's not a good idea
> to try to reserve the 128MB:
> 
> 1) the available low MMIO size is smaller than 128MB due to the vTPM
> MMIO range.
> 
> 2) even if we can reserve the 109.25 MB low MMIO range
> [0xf8000000-0xfed3ffff], we may not want to do that, just in case
> some assigned PCI device has 32-bit BARs.
> 
> So, IMO we need to keep the check:
>  +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>  +		return;
> 
> BTW, I think this may be a slightly better check here:
> +        if (hv_is_isolation_supported())
> +                return;

Agreed. Using hv_is_isolation_supported() seems better than
cc_platform_has() for this purpose.

> 
> A CVM on Hyper-V won't start without the command line
>     Disable-VMConsoleSupport -VMName $vmName

Unfortunately, on my laptop Hyper-V, a VM with VBS Isolation appears
to *not* require Disable-VMConsoleSupport. I can start the VM, and the
VM is offered the VMBus synthvid, mouse, and keyboard devices.

But what's weird in this case is that vmbus_reserve_fb() sees lfb_base
and lfb_start as 0. Furthermore, as a test, I changed the "allowed_in_isolated"
flag to true for the synthvid device, and the Hyper-V DRM driver loads and
initializes. In doing so, the vmconnect.exe window is resized larger, as is
done in a normal VM. /proc/iomem shows that the DRM driver claimed
the expected MMIO range at the start of low MMIO space. I can run a user
space program that mmaps /dev/fb0 and writes pixels to the mmap'ed
memory, and that succeeds as it would in a normal VM, but the
vmconnect.exe window doesn't show anything. It appears that the Hyper-V
host has allocated memory for the frame buffer, but is ignoring anything
that is written to it.

Running Disable-VMConsoleSupport works as expected -- the synthvid,
mouse, and keyboard devices are no longer offered to the VM.

> 
> IMO this is very unlikely to change in the future, because the Hyper-V
> synthetic framebuffer VMBus device is not a trusted device for a CVM,
> so there is no reason for Hyper-V to offer such a device to CVMs; even
> if the host offers it, currently the guest hv_vmbus driver ignores it.
> 

In the case of VBS Isolation, if such a VM also had a PCI pass-thru device,
the core problem could recur. I.e., not reserving space for the framebuffer
could allow the PCI device to try to use MMIO space that Hyper-V has
set up for the frame buffer, causing the PCI device to fail. And that's a
worse problem than just having the graphics console not function. I
can't actually try the failure case because I don't have an assignable PCI
device on my laptop, but it seems likely based on the evidence that
Hyper-V is setting up a framebuffer device.

So instead of not reserving any MMIO space for the framebuffer on
CVMs, the code you already have limits the reservation to half of the
MMIO space below 4 GB. Won't that work to avoid exhausting the low
MMIO space in a CVM that's running on a local Hyper-V with only 128
MiB of low MMIO space?

> When we assign a physical PCI GPU device to a CVM, I'm not sure if there
> is any framebuffer from the GPU or not. Even if there is, that's a completely
> different scenario and not reserving some low MMIO for "framebuffer"
> is unrelated: I think hyperv_drm (or the deprecated hyperv_fb) is the only
> driver that sets the fb_overlap_ok parameter of vmbus_allocate_mmio().
> 
> > And at the moment, CVMs don't
> > support PCI devices,
> 
> This is not true: recently I created a "Standard DC16eds v6" TDX CVM
> on Azure, and I did see two NVMe local temporary disks in "nvme list"
>  (here TDISP is not used). In 2023, we added the commit
> 2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")
> and I believe some users are running CVMs with GPUs.

Interesting! I worked on commit 2c6ba4216844, but had not noticed
that Azure now has offerings that make use of it. I'll take a look at
that TDX VM size.

Thanks,

Michael


* [PATCH v2] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-29 16:48 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
preceding bug-fix patches:

Move "done += completed" before the status checks so that pages mapped
by a partially-successful batch are included in the error cleanup unmap.
Previously these mappings were leaked on failure.

While here, improve type safety and readability:
 - Change "int done" to "u64 done" to match the u64 page_count it is
   compared against, avoiding signed/unsigned comparison hazards.
 - Use u64 for loop iteration and batch size variables consistently.
 - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
 - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
 - Simplify the error-path unmap to use "done << large_shift" directly
   instead of mutating done in place.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index e5992c324904a..1f19a4ca824f0 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	struct hv_input_map_gpa_pages *input_page;
 	u64 status, *pfnlist;
 	unsigned long irq_flags, large_shift = 0;
-	int ret = 0, done = 0;
-	u64 page_count = page_struct_count;
+	u64 done = 0, page_count = page_struct_count;
+	int ret = 0;
 
 	if (page_count == 0 || (pages && mmio_spa))
 		return -EINVAL;
@@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain, HV_MAP_GPA_BATCH_SIZE);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		input_page->map_flags = flags;
 		pfnlist = input_page->source_gpa_page_list;
 
-		for (i = 0; i < rep_count; i++)
-			if (flags & HV_MAP_GPA_NO_ACCESS) {
+		for (i = 0; i < rep_count; i++) {
+			if (flags & HV_MAP_GPA_NO_ACCESS)
 				pfnlist[i] = 0;
-			} else if (pages) {
-				u64 index = (done + i) << large_shift;
-
-				if (index >= page_struct_count) {
-					ret = -EINVAL;
-					break;
-				}
-				pfnlist[i] = page_to_pfn(pages[index]);
-			} else {
+			else if (pages)
+				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
+			else
 				pfnlist[i] = mmio_spa + done + i;
-			}
-		if (ret) {
-			local_irq_restore(irq_flags);
-			break;
 		}
 
 		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
@@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		local_irq_restore(irq_flags);
 
 		completed = hv_repcomp(status);
+		done += completed;
 
 		if (hv_result_needs_memory(status)) {
 			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
 						    HV_MAP_GPA_DEPOSIT_PAGES);
 			if (ret)
 				break;
-
 		} else if (!hv_result_success(status)) {
 			ret = hv_result_to_errno(status);
 			break;
 		}
-
-		done += completed;
 	}
 
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
-		if (flags & HV_MAP_GPA_LARGE_PAGE) {
+		if (flags & HV_MAP_GPA_LARGE_PAGE)
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-			done <<= large_shift;
-		}
-		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
+		hv_call_unmap_gpa_pages(partition_id, gfn,
+					done << large_shift, unmap_flags);
 	}
 
 	return ret;
@@ -305,7 +292,7 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 	struct hv_input_unmap_gpa_pages *input_page;
 	u64 status, page_count = page_count_4k;
 	unsigned long irq_flags, large_shift = 0;
-	int ret = 0, done = 0;
+	u64 done = 0;
 
 	if (page_count == 0)
 		return -EINVAL;
@@ -319,8 +306,8 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 	}
 
 	while (done < page_count) {
-		ulong completed, remain = page_count - done;
-		int rep_count = min(remain, HV_UMAP_GPA_PAGES);
+		u64 completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain, HV_UMAP_GPA_PAGES);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -333,15 +320,13 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 		local_irq_restore(irq_flags);
 
 		completed = hv_repcomp(status);
-		if (!hv_result_success(status)) {
-			ret = hv_result_to_errno(status);
-			break;
-		}
-
 		done += completed;
+
+		if (!hv_result_success(status))
+			return hv_result_to_errno(status);
 	}
 
-	return ret;
+	return 0;
 }
 
 int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
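
The effect of moving "done += completed" ahead of the status checks can be seen in a small userspace model. It assumes, as the commit message describes, that a failing batch can still report a nonzero completed count. demo_batch_result and demo_pages_to_unmap are hypothetical names; the return value models how many pages the error-path unmap would cover.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of one batched hypercall. */
struct demo_batch_result {
	uint64_t completed; /* reps finished before the call returned */
	int err;            /* 0 on success, negative errno on failure */
};

/*
 * Walk the batch results and return the running "done" count at the point
 * the loop stops. If account_first is true, completed reps are added before
 * the error check (new ordering); otherwise after it (old ordering).
 */
static uint64_t demo_pages_to_unmap(const struct demo_batch_result *results,
				    size_t nr_results, bool account_first,
				    int *err_out)
{
	uint64_t done = 0;
	size_t k;

	*err_out = 0;
	for (k = 0; k < nr_results; k++) {
		if (account_first)
			done += results[k].completed;
		if (results[k].err) {
			*err_out = results[k].err;
			break;
		}
		if (!account_first)
			done += results[k].completed;
	}
	return done;
}
```

For a sequence of one full batch of 512 followed by a batch that fails after completing 100 reps, the old ordering reports 512 and the new ordering reports 612, so only the new ordering lets the cleanup cover the 100 pages mapped by the partially successful batch.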




* Re: [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-29 15:15 UTC
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260429-orca-of-legal-symmetry-3c72bc@anirudhrb>

On Wed, Apr 29, 2026 at 11:02:37AM +0000, Anirudh Rayabharam wrote:
> On Tue, Apr 28, 2026 at 11:21:12PM +0000, Stanislav Kinsburskii wrote:
> > Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> > preceding bug-fix patches:
> > 
> > Move "done += completed" before the status checks so that pages mapped
> > by a partially-successful batch are included in the error cleanup unmap.
> > Previously these mappings were leaked on failure.
> > 
> > While here, improve type safety and readability:
> >  - Change "int done" to "u64 done" to match the u64 page_count it is
> >    compared against, avoiding signed/unsigned comparison hazards.
> >  - Use u64 for loop iteration and batch size variables consistently.
> >  - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
> >  - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
> >  - Simplify the error-path unmap to use "done << large_shift" directly
> >    instead of mutating done in place.
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
> >  1 file changed, 20 insertions(+), 35 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index e5992c324904a..f5f205a397834 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  	struct hv_input_map_gpa_pages *input_page;
> >  	u64 status, *pfnlist;
> >  	unsigned long irq_flags, large_shift = 0;
> > -	int ret = 0, done = 0;
> > -	u64 page_count = page_struct_count;
> > +	u64 done = 0, page_count = page_struct_count;
> > +	int ret = 0;
> >  
> >  	if (page_count == 0 || (pages && mmio_spa))
> >  		return -EINVAL;
> > @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  	}
> >  
> >  	while (done < page_count) {
> > -		ulong i, completed, remain = page_count - done;
> > -		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> > +		u64 i, completed, remain = page_count - done;
> > +		u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
> >  
> >  		local_irq_save(irq_flags);
> >  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > @@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		input_page->map_flags = flags;
> >  		pfnlist = input_page->source_gpa_page_list;
> >  
> > -		for (i = 0; i < rep_count; i++)
> > -			if (flags & HV_MAP_GPA_NO_ACCESS) {
> > +		for (i = 0; i < rep_count; i++) {
> > +			if (flags & HV_MAP_GPA_NO_ACCESS)
> >  				pfnlist[i] = 0;
> > -			} else if (pages) {
> > -				u64 index = (done + i) << large_shift;
> > -
> > -				if (index >= page_struct_count) {
> > -					ret = -EINVAL;
> > -					break;
> > -				}
> > -				pfnlist[i] = page_to_pfn(pages[index]);
> > -			} else {
> > +			else if (pages)
> > +				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
> > +			else
> >  				pfnlist[i] = mmio_spa + done + i;
> > -			}
> > -		if (ret) {
> > -			local_irq_restore(irq_flags);
> > -			break;
> >  		}
> >  
> >  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> > @@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		local_irq_restore(irq_flags);
> >  
> >  		completed = hv_repcomp(status);
> > +		done += completed;
> >  
> >  		if (hv_result_needs_memory(status)) {
> >  			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> >  						    HV_MAP_GPA_DEPOSIT_PAGES);
> >  			if (ret)
> >  				break;
> > -
> >  		} else if (!hv_result_success(status)) {
> >  			ret = hv_result_to_errno(status);
> >  			break;
> >  		}
> > -
> > -		done += completed;
> >  	}
> >  
> >  	if (ret && done) {
> >  		u32 unmap_flags = 0;
> >  
> > -		if (flags & HV_MAP_GPA_LARGE_PAGE) {
> > +		if (flags & HV_MAP_GPA_LARGE_PAGE)
> >  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > -			done <<= large_shift;
> > -		}
> > -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> > +		hv_call_unmap_gpa_pages(partition_id, gfn,
> > +					done << large_shift, unmap_flags);
> 
> How does this work? Earlier we were doing "done << large_shift" only if
> HV_MAP_GPA_LARGE_PAGE is set but now we always do it.
> 

It works because large_shift is initialized to 0 and stays 0 when
HV_MAP_GPA_LARGE_PAGE is not set, so the shift is a no-op in that case.
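
As a rough standalone illustration (not the kernel code itself), the
unconditional shift behaves as sketched below. The helper name, the flag
value, and the shift of 9 (2M pages versus 4K pages) are assumptions made
for this sketch only:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the map flag; not the kernel definition. */
#define SKETCH_MAP_LARGE_PAGE	(1u << 0)
/* Assumed shift between 4K and 2M pages (21 - 12). */
#define SKETCH_LARGE_SHIFT	9

/* Convert a count of completed reps into a count of 4K pages. */
static uint64_t sketch_done_to_4k(uint64_t done, uint32_t flags)
{
	unsigned long large_shift = 0;	/* stays 0 for 4K mappings */

	if (flags & SKETCH_MAP_LARGE_PAGE)
		large_shift = SKETCH_LARGE_SHIFT;

	assert(large_shift < 64);

	/* Shifting by 0 is a no-op, so no flag check is needed here. */
	return done << large_shift;
}
```

With the flag clear the helper returns done unchanged; with the flag set
it scales done by 512.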

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH] mshv: Add dedicated ioctl for GVA to GPA translation
From: Anirudh Rayabharam @ 2026-04-29 13:11 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177741648871.626779.11067281081219290277.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Tue, Apr 28, 2026 at 10:48:24PM +0000, Stanislav Kinsburskii wrote:
> Add an MSHV_TRANSLATE_GVA ioctl on the VP fd that wraps
> HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX with transparent fault-in handling for
> movable memory regions. The passthrough path for this hypercall is retained
> for backward compatibility.
> 
> When guest-backing pages reside in movable memory regions, the mmu_notifier
> invalidation path remaps them to NO_ACCESS in the hypervisor's second-level
> address translation tables. If the VMM issues a GVA translation (e.g.
> during MMIO emulation) while a page-table page is invalidated, the
> hypervisor returns HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS. The VMM cannot
> resolve this on its own.
> 
> The new ioctl detects this transient GPA access failure, faults the page
> back in via mshv_region_handle_gfn_fault(), and retries the translation
> until it succeeds or an unrecoverable error occurs.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h         |    3 ++
>  drivers/hv/mshv_root_hv_call.c |   37 +++++++++++++++++++++
>  drivers/hv/mshv_root_main.c    |   69 ++++++++++++++++++++++++++++++++++++++++
>  include/hyperv/hvgdk_mini.h    |    1 +
>  include/hyperv/hvhdk.h         |   41 ++++++++++++++++++++++++
>  include/uapi/linux/mshv.h      |   10 ++++++
>  6 files changed, 161 insertions(+)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 1f086dcb7aa1a..2e6c4414740cc 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -290,6 +290,9 @@ int hv_call_delete_vp(u64 partition_id, u32 vp_index);
>  int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
>  				     u64 dest_addr,
>  				     union hv_interrupt_control control);
> +int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
> +					 u64 flags, u64 gva, u64 *gfn,
> +					 struct hv_translate_gva_result_ex *result);
>  int hv_call_clear_virtual_interrupt(u64 partition_id);
>  int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
>  				  union hv_gpa_page_access_state_flags state_flags,
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..9ff4ba5373f59 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -692,6 +692,43 @@ int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code,
>  	return 0;
>  }
>  
> +int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
> +					 u64 flags, u64 gva, u64 *gfn,
> +					 struct hv_translate_gva_result_ex *result)
> +{
> +	struct hv_input_translate_virtual_address *input;
> +	struct hv_output_translate_virtual_address_ex *output;
> +	unsigned long irq_flags;
> +	u64 status;
> +
> +	local_irq_save(irq_flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = partition_id;
> +	input->vp_index = vp_index;
> +	input->control_flags = flags;
> +	input->gva_page = gva >> HV_HYP_PAGE_SHIFT;
> +
> +	status = hv_do_hypercall(HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX,
> +				 input, output);
> +
> +	if (!hv_result_success(status)) {
> +		local_irq_restore(irq_flags);
> +		pr_err("%s: %s\n", __func__, hv_result_to_string(status));
> +		return hv_result_to_errno(status);
> +	}
> +
> +	*result = output->translation_result;
> +	*gfn = output->gpa_page;
> +
> +	local_irq_restore(irq_flags);
> +
> +	return 0;
> +}
> +
>  int
>  hv_call_clear_virtual_interrupt(u64 partition_id)
>  {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..2d7b6923415a8 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -898,6 +898,72 @@ mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
>  	return 0;
>  }
>  
> +static bool mshv_gpa_fault_retryable(u32 result_code)
> +{
> +	/*
> +	 * Note: HV_TRANSLATE_GVA_GPA_UNMAPPED is intentionally not handled
> +	 * here. The guest page table cannot be unmapped under normal
> +	 * operation. It may be mapped with no access during page moves,
> +	 * but a truly unmapped state indicates a kernel driver bug.
> +	 * Retrying in this case would only mask the underlying problem of
> +	 * an unmapped guest page table.
> +	 */
> +	return result_code == HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS;
> +}
> +
> +static long
> +mshv_vp_ioctl_translate_gva(struct mshv_vp *vp, void __user *user_args)
> +{
> +	struct mshv_partition *partition = vp->vp_partition;
> +	struct mshv_translate_gva args;
> +	struct hv_translate_gva_result_ex result;
> +	u64 gfn, gpa;
> +	int ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	do {
> +		ret = hv_call_translate_virtual_address_ex(vp->vp_index,
> +							   partition->pt_id,
> +							   args.flags, args.gva,
> +							   &gfn, &result);
> +		if (ret)
> +			return ret;
> +
> +		if (mshv_gpa_fault_retryable(result.result_code)) {
> +			struct mshv_mem_region *region;
> +			bool faulted;
> +
> +			region = mshv_partition_region_by_gfn_get(partition,
> +								  gfn);
> +			if (!region)
> +				return -EFAULT;
> +
> +			faulted = false;
> +			if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> +				faulted = mshv_region_handle_gfn_fault(region,
> +								       gfn);
> +			mshv_region_put(region);
> +
> +			if (!faulted)
> +				return -EFAULT;
> +
> +			cond_resched();
> +		}
> +	} while (mshv_gpa_fault_retryable(result.result_code));
> +
> +	gpa = (gfn << PAGE_SHIFT) | (args.gva & ~PAGE_MASK);
> +
> +        if (copy_to_user(args.result, &result, sizeof(*args.result)))

Indentation is a bit off here.

With that fixed:

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
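
For readers following the quoted hunk, the GPA computation just above the
flagged line combines the translated frame number with the in-page offset
of the GVA. A minimal standalone sketch, assuming a 4K page size; the
SKETCH_* names and the helper are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	(UINT64_C(1) << SKETCH_PAGE_SHIFT)
/* Like the kernel's PAGE_MASK: clears the in-page offset bits. */
#define SKETCH_PAGE_MASK	(~(SKETCH_PAGE_SIZE - 1))

/* Build a guest physical address from a frame number and a GVA offset. */
static uint64_t sketch_compose_gpa(uint64_t gfn, uint64_t gva)
{
	/* The frame number must not overflow when shifted. */
	assert(gfn <= (UINT64_MAX >> SKETCH_PAGE_SHIFT));

	/* ~PAGE_MASK keeps only the low offset bits of the GVA. */
	return (gfn << SKETCH_PAGE_SHIFT) | (gva & ~SKETCH_PAGE_MASK);
}
```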


^ permalink raw reply

* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Olaf Hering @ 2026-04-29 12:44 UTC (permalink / raw)
  To: Thorsten Blum
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, stable, linux-hyperv, linux-kernel
In-Reply-To: <afH7VELGgh8eGBUC@linux.dev>

Wed, 29 Apr 2026 14:36:36 +0200 Thorsten Blum <thorsten.blum@linux.dev>:

> What makes you think this is just "cosmetics"?

It does fix an unlikely bug indeed, but it does not need to trigger the whole paperwork attached to a Fixes tag.


Olaf

^ permalink raw reply

* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Thorsten Blum @ 2026-04-29 12:36 UTC (permalink / raw)
  To: Olaf Hering
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, stable, Ky Srinivasan, linux-hyperv,
	linux-kernel
In-Reply-To: <20260429142724.4d74641a.olaf@aepfle.de>

On Wed, Apr 29, 2026 at 02:27:24PM +0200, Olaf Hering wrote:
> Tue, 14 Apr 2026 13:10:08 +0200 Thorsten Blum <thorsten.blum@linux.dev>:
> 
> > Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")
> 
> Please do not abuse the Fixes tag when in fact this change is "cosmetics".

What makes you think this is just "cosmetics"?

^ permalink raw reply

* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Olaf Hering @ 2026-04-29 12:27 UTC (permalink / raw)
  To: Thorsten Blum
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, stable, Ky Srinivasan, linux-hyperv,
	linux-kernel
In-Reply-To: <20260414111008.307220-2-thorsten.blum@linux.dev>

Tue, 14 Apr 2026 13:10:08 +0200 Thorsten Blum <thorsten.blum@linux.dev>:

> Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")

Please do not abuse the Fixes tag when in fact this change is "cosmetics".


Olaf

^ permalink raw reply

* Re: [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Anirudh Rayabharam @ 2026-04-29 11:02 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177741845948.632922.14128507833980339307.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Tue, Apr 28, 2026 at 11:21:12PM +0000, Stanislav Kinsburskii wrote:
> Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> preceding bug-fix patches:
> 
> Move "done += completed" before the status checks so that pages mapped
> by a partially-successful batch are included in the error cleanup unmap.
> Previously these mappings were leaked on failure.
> 
> While here, improve type safety and readability:
>  - Change "int done" to "u64 done" to match the u64 page_count it is
>    compared against, avoiding signed/unsigned comparison hazards.
>  - Use u64 for loop iteration and batch size variables consistently.
>  - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
>  - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
>  - Simplify the error-path unmap to use "done << large_shift" directly
>    instead of mutating done in place.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
>  1 file changed, 20 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..f5f205a397834 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  	struct hv_input_map_gpa_pages *input_page;
>  	u64 status, *pfnlist;
>  	unsigned long irq_flags, large_shift = 0;
> -	int ret = 0, done = 0;
> -	u64 page_count = page_struct_count;
> +	u64 done = 0, page_count = page_struct_count;
> +	int ret = 0;
>  
>  	if (page_count == 0 || (pages && mmio_spa))
>  		return -EINVAL;
> @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  	}
>  
>  	while (done < page_count) {
> -		ulong i, completed, remain = page_count - done;
> -		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> +		u64 i, completed, remain = page_count - done;
> +		u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
>  
>  		local_irq_save(irq_flags);
>  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  		input_page->map_flags = flags;
>  		pfnlist = input_page->source_gpa_page_list;
>  
> -		for (i = 0; i < rep_count; i++)
> -			if (flags & HV_MAP_GPA_NO_ACCESS) {
> +		for (i = 0; i < rep_count; i++) {
> +			if (flags & HV_MAP_GPA_NO_ACCESS)
>  				pfnlist[i] = 0;
> -			} else if (pages) {
> -				u64 index = (done + i) << large_shift;
> -
> -				if (index >= page_struct_count) {
> -					ret = -EINVAL;
> -					break;
> -				}
> -				pfnlist[i] = page_to_pfn(pages[index]);
> -			} else {
> +			else if (pages)
> +				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
> +			else
>  				pfnlist[i] = mmio_spa + done + i;
> -			}
> -		if (ret) {
> -			local_irq_restore(irq_flags);
> -			break;
>  		}
>  
>  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> @@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  		local_irq_restore(irq_flags);
>  
>  		completed = hv_repcomp(status);
> +		done += completed;
>  
>  		if (hv_result_needs_memory(status)) {
>  			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
>  						    HV_MAP_GPA_DEPOSIT_PAGES);
>  			if (ret)
>  				break;
> -
>  		} else if (!hv_result_success(status)) {
>  			ret = hv_result_to_errno(status);
>  			break;
>  		}
> -
> -		done += completed;
>  	}
>  
>  	if (ret && done) {
>  		u32 unmap_flags = 0;
>  
> -		if (flags & HV_MAP_GPA_LARGE_PAGE) {
> +		if (flags & HV_MAP_GPA_LARGE_PAGE)
>  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> -			done <<= large_shift;
> -		}
> -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> +		hv_call_unmap_gpa_pages(partition_id, gfn,
> +					done << large_shift, unmap_flags);

How does this work? Earlier we were doing "done << large_shift" only if
HV_MAP_GPA_LARGE_PAGE is set but now we always do it.

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH V1 04/13] mshv: Provide a way to get partition id if running in a VMM process
From: Anirudh Rayabharam @ 2026-04-29 10:35 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-5-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:30PM -0700, Mukesh R wrote:
> Many PCI passthru related hypercalls require the partition id of the
> target guest. Guests are managed by the MSHV driver, and the partition
> id is only maintained there. Add a field in the partition struct in the
> MSHV driver to save the tgid of the VMM process creating the partition,
> and add a function there to retrieve the partition id if the current
> process is a VMM process.
> 
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h         |  1 +
>  drivers/hv/mshv_root_main.c    | 22 ++++++++++++++++++++++
>  include/asm-generic/mshyperv.h |  5 +++++
>  3 files changed, 28 insertions(+)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 3/3] mshv: unmap debugfs stats pages on kexec
From: Anirudh Rayabharam @ 2026-04-29 10:10 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-4-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:54PM -0700, Jork Loeser wrote:
> On L1VH, debugfs stats pages are overlay pages: the kernel allocates
> them and registers the GPAs with the hypervisor via
> HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
> hypervisor across kexec. If the kexec'd kernel reuses those physical
> pages, the hypervisor's overlay semantics cause a machine check
> exception.
> 
> Fix this by calling mshv_debugfs_exit() from the reboot notifier,
> which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
> kexec. This releases the overlay bindings so the physical pages can be
> safely reused. Guard mshv_debugfs_exit() against being called when
> init failed.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/mshv_debugfs.c | 7 ++++++-
>  drivers/hv/mshv_synic.c   | 1 +
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> index 418b6dc8f3c2..3c3e02237ae9 100644
> --- a/drivers/hv/mshv_debugfs.c
> +++ b/drivers/hv/mshv_debugfs.c
> @@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
>  
>  	mshv_debugfs = debugfs_create_dir("mshv", NULL);
>  	if (IS_ERR(mshv_debugfs)) {
> +		err = PTR_ERR(mshv_debugfs);
> +		mshv_debugfs = NULL;
>  		pr_err("%s: failed to create debugfs directory\n", __func__);

Might as well print err here.

Nevertheless:

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH v2 12/15] mshv_vtl: Move VSM code page offset logic to x86 files
From: Naman Jain @ 2026-04-29 10:00 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157E0525DDDD153888F5AFBD4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> The VSM code page offset register (HV_REGISTER_VSM_CODE_PAGE_OFFSETS)
>> is x86 specific, its value configures the static call used to return
>> to VTL0 via the hypercall page. Move the register read from the common
>> mshv_vtl_get_vsm_regs() into the x86 mshv_vtl_return_call_init(),
>> which is the sole consumer of the offset.
>>
>> Change mshv_vtl_return_call_init() from taking a u64 parameter
>> to taking no arguments, and rename mshv_vtl_get_vsm_regs() to
>> mshv_vtl_get_vsm_cap_reg() since it now only fetches
>> HV_REGISTER_VSM_CAPABILITIES.
>>
>> No functional change on x86. This prepares the common driver code for
>> ARM64 where VSM code page offsets do not apply.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_vtl.c        | 19 +++++++++++++++++--
>>   arch/x86/include/asm/mshyperv.h |  4 ++--
>>   drivers/hv/mshv_vtl_main.c      | 24 +++++++++++++-----------
>>   3 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
>> index f3ffb6a7cb2d..7c10b34cf8a4 100644
>> --- a/arch/x86/hyperv/hv_vtl.c
>> +++ b/arch/x86/hyperv/hv_vtl.c
>> @@ -293,10 +293,25 @@ EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>>
>>   DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
>>
>> -void mshv_vtl_return_call_init(u64 vtl_return_offset)
>> +int mshv_vtl_return_call_init(void)
>>   {
>> +	struct hv_register_assoc vsm_pg_offset_reg;
>> +	union hv_register_vsm_page_offsets offsets;
>> +	int ret;
>> +
>> +	vsm_pg_offset_reg.name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> +
>> +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> +				       1, input_vtl_zero, &vsm_pg_offset_reg);
>> +	if (ret)
>> +		return ret;
>> +
>> +	offsets.as_uint64 = vsm_pg_offset_reg.value.reg64;
>> +
>>   	static_call_update(__mshv_vtl_return_hypercall,
>> -			   (void *)((u8 *)hv_hypercall_pg + vtl_return_offset));
>> +			   (void *)((u8 *)hv_hypercall_pg + offsets.vtl_return_offset));
>> +
>> +	return 0;
>>   }
>>   EXPORT_SYMBOL(mshv_vtl_return_call_init);
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index b4d80c9a673a..b48f115c1292 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -286,14 +286,14 @@ struct mshv_vtl_cpu_context {
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   void __init hv_vtl_init_platform(void);
>>   int __init hv_vtl_early_init(void);
>> -void mshv_vtl_return_call_init(u64 vtl_return_offset);
>> +int mshv_vtl_return_call_init(void);
>>   void mshv_vtl_return_hypercall(void);
>>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>>   #else
>>   static inline void __init hv_vtl_init_platform(void) {}
>>   static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> +static inline int mshv_vtl_return_call_init(void) { return 0; }
>>   static inline void mshv_vtl_return_hypercall(void) {}
>>   static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>   #endif
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 4c9ae65ad3e8..be498c9234fd 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -79,7 +79,6 @@ struct mshv_vtl {
>>   };
>>
>>   static struct mutex mshv_vtl_poll_file_lock;
>> -static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>>   static union hv_register_vsm_capabilities mshv_vsm_capabilities;
>>
>>   static DEFINE_PER_CPU(struct mshv_vtl_poll_file, mshv_vtl_poll_file);
>> @@ -203,21 +202,19 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>>   	/* VTL2 Host VSP SINT is (un)masked when the user mode requests that */
>>   }
>>
>> -static int mshv_vtl_get_vsm_regs(void)
>> +static int mshv_vtl_get_vsm_cap_reg(void)
>>   {
>> -	struct hv_register_assoc registers[2];
>> -	int ret, count = 2;
>> +	struct hv_register_assoc vsm_capability_reg;
>> +	int ret;
>>
>> -	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> -	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
>> +	vsm_capability_reg.name = HV_REGISTER_VSM_CAPABILITIES;
>>
>>   	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> -				       count, input_vtl_zero, registers);
>> +				       1, input_vtl_zero, &vsm_capability_reg);
>>   	if (ret)
>>   		return ret;
>>
>> -	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
>> -	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
>> +	mshv_vsm_capabilities.as_uint64 = vsm_capability_reg.value.reg64;
>>
>>   	return ret;
> 
> Nit: This could be just "return 0".

Acked.

> 
>>   }
>> @@ -1139,13 +1136,18 @@ static int __init mshv_vtl_init(void)
>>   	tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
>>   	init_waitqueue_head(&fd_wait_queue);
>>
>> -	if (mshv_vtl_get_vsm_regs()) {
>> +	if (mshv_vtl_get_vsm_cap_reg()) {
>>   		dev_emerg(dev, "Unable to get VSM capabilities !!\n");
> 
> Why is this failure an emergency message, while the other failures
> here in mshv_vtl_init() are just error messages? When there's lack
> of consistency, I always wonder if there is a reason ..... :-)

It might be because I didn’t pay enough attention to the old code :)
dev_err() should work just fine, I'll change it.


> 
>>   		ret = -ENODEV;
>>   		goto free_dev;
>>   	}
>>
>> -	mshv_vtl_return_call_init(mshv_vsm_page_offsets.vtl_return_offset);
>> +	ret = mshv_vtl_return_call_init();
>> +	if (ret) {
>> +		dev_err(dev, "mshv_vtl_return_call_init failed: %d\n", ret);
>> +		goto free_dev;
>> +	}
>> +
>>   	ret = hv_vtl_setup_synic();
>>   	if (ret)
>>   		goto free_dev;
>> --
>> 2.43.0
>>

Regards,
Naman

^ permalink raw reply

* Re: [PATCH v4 2/3] mshv: clean up SynIC state on kexec for L1VH
From: Anirudh Rayabharam @ 2026-04-29  9:58 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-3-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:53PM -0700, Jork Loeser wrote:
> The reboot notifier that tears down the SynIC cpuhp state guards the
> cleanup with hv_root_partition(), so on L1VH (where
> hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
> cleaned up before kexec. The kexec'd kernel then inherits stale
> unmasked SINTs and an enabled SIRBP pointing to freed memory.
> 
> Remove the hv_root_partition() guard so the cleanup runs for all
> parent partitions.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 2db3b0192eac..978a1cace341 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -723,9 +723,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
>  static int mshv_synic_reboot_notify(struct notifier_block *nb,
>  			      unsigned long code, void *unused)
>  {
> -	if (!hv_root_partition())
> -		return 0;
> -
>  	cpuhp_remove_state(synic_cpuhp_online);
>  	return 0;
>  }
> -- 
> 2.43.0
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Anirudh Rayabharam @ 2026-04-29  9:58 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-2-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:52PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> 
> Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
> 
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
>   and nested root partition); on a non-nested root partition VMBus
>   does not run, so MSHV must enable/disable them
> 
> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
> produces wrong addresses on ARM64 with 64K pages.
> 
> Note that initialization order matters: VMBus first, MSHV second,
> and the reverse on de-init. Ideally, we would want a dedicated SynIC
> driver that replaces the cross-dependencies with a clear API and
> dynamic tracking. Such a refactor should go into its own dedicated
> series, outside of this kexec fix series.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/hv.c         |   3 +
>  drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
>  2 files changed, 103 insertions(+), 50 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
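
To illustrate the HV_HYP_PAGE_SHIFT point from the commit message, here
is a small standalone sketch (not the driver code, which is not shown in
this mail). The constants and the helper name are assumptions for
illustration: a 4K hypervisor page (shift 12) and a 64K kernel page
(shift 16):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HV_HYP_PAGE_SHIFT	12	/* hypervisor page: 4K */
#define SKETCH_PAGE_SHIFT_64K		16	/* assumed ARM64 64K kernel page */

/* Rebuild a physical address from a page number stored in a register. */
static uint64_t sketch_pfn_to_phys(uint64_t pfn, unsigned int shift)
{
	assert(shift < 64);
	return pfn << shift;
}
```

Applying the 64K shift to a page number that the hypervisor expresses in
4K units yields an address 16 times too large, which matches the kind of
wrong address the commit message describes.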


^ permalink raw reply

* Re: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Naman Jain @ 2026-04-29  9:57 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157467FDBC0203C67A67042D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
>> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
>> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
>> in architecture-specific code.
>>
>> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
>> include/asm-generic/mshyperv.h so they are visible to both arch and
>> driver code.
>>
>> Change the return type from void to bool so the caller can determine
>> whether the register page was successfully configured and set
>> mshv_has_reg_page accordingly.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
>>   drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
>>   include/asm-generic/mshyperv.h | 17 ++++++++++++
>>   3 files changed, 53 insertions(+), 45 deletions(-)
>>

<snip>

>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> 
> This comment pre-dates your patch, but I don't understand the point
> it is trying to make. The comment is factually true, but I don't know
> why calling that out is relevant. The REG_PAGE MSR seems to be
> conceptually separate and distinct from the SIMP MSR, so the fact
> that the layouts are the same is just a coincidence. Or is there some
> relationship between the two MSRs that I'm not aware of, and the
> comment is trying (and failing?) to point out?

This was added at Nuno's suggestion in my initial series for
MSHV_VTL. If the "identical to" wording is misleading, I can
remove it.

https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/

Quoting:
"""
it is a generic structure that
appears to be used for several overlay page MSRs (SIMP, SIEF, etc).

But, the type doesn't appear in the hv*dk headers explicitly; it's just
used internally by the hypervisor.

I think it should be renamed with a hv_ prefix to indicate it's part of
the hypervisor ABI, and a brief comment with the provenance:

/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
union hv_synic_overlay_page_msr {
	/* <snip> */
};
"""

> 
>> +union hv_synic_overlay_page_msr {
>> +	u64 as_uint64;
>> +	struct {
>> +		u64 enabled: 1;
>> +		u64 reserved: 11;
>> +		u64 pfn: 52;
>> +	} __packed;
>> +};
>> +
>>   u8 __init get_vtl(void);
>>   void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>>   #else
>>   static inline u8 get_vtl(void) { return 0; }
>>   static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> +static inline bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu) { return false; }
> 
> As with Patch 8, if CONFIG_HYPERV_VTL_MODE caused mshv_common.o
> to be built, this stub wouldn't be needed.
> 

Acked.


>>   #endif
>>
>>   #endif
>> --
>> 2.43.0
>>

Regards,
Naman

^ permalink raw reply

* Re: [PATCH v2 08/15] Drivers: hv: Move hv_call_(get|set)_vp_registers() declarations
From: Naman Jain @ 2026-04-29  9:57 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157852404B5258EF13A5450D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:09 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
>> declarations from drivers/hv/mshv.h to include/asm-generic/mshyperv.h.
>>
>> These functions are defined in mshv_common.c and are going to be called
>> from both drivers/hv/ and arch/x86/hyperv/hv_vtl.c. The latter never
>> included mshv.h, relying on implicit declaration visibility. Moving the
>> declarations to the arch-generic Hyper-V header makes them properly
>> visible to all architecture-specific callers.
>>
>> Provide static inline stubs returning -EOPNOTSUPP when neither
>> CONFIG_MSHV_ROOT nor CONFIG_MSHV_VTL is enabled.
> 
> Looking at the drivers/hv/Kconfig, it's possible to build with
> CONFIG_HYPERV_VTL_MODE=y, but not CONFIG_MSHV_VTL. In such a
> case, mshv_common.o doesn't get built, which is why the stubs are
> needed. Is such a configuration desirable for some scenarios?
> 
> I wonder if having CONFIG_HYPERV_VTL_MODE force the building of
> mshv_common.o would be a better approach. Then the stubs wouldn't
> be needed. The "ifneq" statement in drivers/hv/Makefile could use
> CONFIG_HYPERV_VTL_MODE instead of CONFIG_MSHV_VTL, and
> everything would be good since CONFIG_MSHV_VTL depends on
> CONFIG_HYPERV_VTL_MODE.
> 

This looks good. I'll try this and make the changes. If I run into
problems with that approach, I'll get back to you.


Regards,
Naman

^ permalink raw reply

* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-29  9:56 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157C147A1B915F9B45D3B74D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:08 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
>> driver on arm64. This function enables the transition between Virtual
>> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/Makefile        |   1 +
>>   arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>>   arch/arm64/include/asm/mshyperv.h |  13 +++
>>   arch/x86/include/asm/mshyperv.h   |   2 -
>>   drivers/hv/mshv_vtl.h             |   3 +
>>   include/asm-generic/mshyperv.h    |   2 +
>>   6 files changed, 177 insertions(+), 2 deletions(-)
>>   create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
> 
> [snip]
> 
>> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
>> index 585b23a26f1b..9eb0e5999f29 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -60,6 +60,18 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>   				ARM_SMCCC_SMC_64,		\
>>   				ARM_SMCCC_OWNER_VENDOR_HYP,	\
>>   				HV_SMCCC_FUNC_NUMBER)
>> +
>> +struct mshv_vtl_cpu_context {
>> +/*
>> + * x18 is managed by the hypervisor. It won't be reloaded from this array.
>> + * It is included here for convenience in array indexing.
>> + * 'rsvd' field serves as alignment padding so q[] starts at offset 32*8=256.
>> + */
>> +	__u64 x[31];
>> +	__u64 rsvd;
>> +	__uint128_t q[32];
>> +};
>> +
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   /*
>>    * Get/Set the register. If the function returns `1`, that must be done via
>> @@ -69,6 +81,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, b
>>   {
>>   	return 1;
>>   }
>> +
> 
> This appears to be a spurious blank line being added since there
> are no other changes in the vicinity.

Acked.

> 
>>   #endif
>>
>>   #include <asm-generic/mshyperv.h>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 08278547b84c..b4d80c9a673a 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   void __init hv_vtl_init_platform(void);
>>   int __init hv_vtl_early_init(void);
>> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   void mshv_vtl_return_call_init(u64 vtl_return_offset);
>>   void mshv_vtl_return_hypercall(void);
>>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>>   #else
>>   static inline void __init hv_vtl_init_platform(void) {}
>>   static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>   static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>>   static inline void mshv_vtl_return_hypercall(void) {}
>>   static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
>> index a6eea52f7aa2..103f07371f3f 100644
>> --- a/drivers/hv/mshv_vtl.h
>> +++ b/drivers/hv/mshv_vtl.h
>> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
>>   	char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
>>   };
>>
>> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
>> +	      "struct mshv_vtl_cpu_context exceeds reserved space in struct mshv_vtl_run");
>> +
>>   #endif /* _MSHV_VTL_H */
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index db183c8cfb95..8cdf2a9fbdfb 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>
>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>   u8 __init get_vtl(void);
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   #else
>>   static inline u8 get_vtl(void) { return 0; }
>> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> 
> Is this stub needed? Maybe I missed something, but it looks to me like none
> of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
> See further comments about stubs in Patch 8 of this series.
> 

The config dependencies already handle this case, so the stub is not
required. I saw similar stubs elsewhere in the code and assumed the
convention was to add them rather than rely on config dependencies.
I can remove it.

Regards,
Naman


^ permalink raw reply

* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-29  9:56 UTC (permalink / raw)
  To: Mark Rutland, Marc Zyngier
  Cc: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Michael Kelley, Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi,
	Sascha Bischoff, mrigendrachaubey, linux-hyperv, linux-arm-kernel,
	linux-kernel, linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <aeolHwXHFH4AnX_n@J2N7QTR9R3.cambridge.arm.com>



On 4/23/2026 7:26 PM, Mark Rutland wrote:
> On Thu, Apr 23, 2026 at 12:41:57PM +0000, Naman Jain wrote:
>> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
>> driver on arm64. This function enables the transition between Virtual
>> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/Makefile        |   1 +
>>   arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>>   arch/arm64/include/asm/mshyperv.h |  13 +++
>>   arch/x86/include/asm/mshyperv.h   |   2 -
>>   drivers/hv/mshv_vtl.h             |   3 +
>>   include/asm-generic/mshyperv.h    |   2 +
>>   6 files changed, 177 insertions(+), 2 deletions(-)
>>   create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
>> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
>> index 87c31c001da9..9701a837a6e1 100644
>> --- a/arch/arm64/hyperv/Makefile
>> +++ b/arch/arm64/hyperv/Makefile
>> @@ -1,2 +1,3 @@
>>   # SPDX-License-Identifier: GPL-2.0
>>   obj-y		:= hv_core.o mshyperv.o
>> +obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> new file mode 100644
>> index 000000000000..59cbeb74e7b9
>> --- /dev/null
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -0,0 +1,158 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2026, Microsoft, Inc.
>> + *
>> + * Authors:
>> + *     Roman Kisel <romank@linux.microsoft.com>
>> + *     Naman Jain <namjain@linux.microsoft.com>
>> + */
>> +
>> +#include <asm/mshyperv.h>
>> +#include <asm/neon.h>
>> +#include <linux/export.h>
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> +{
>> +	struct user_fpsimd_state fpsimd_state;
>> +	u64 base_ptr = (u64)vtl0->x;
>> +
>> +	/*
>> +	 * Obtain the CPU FPSIMD registers for VTL context switch.
>> +	 * This saves the current task's FP/NEON state and allows us to
>> +	 * safely load VTL0's FP/NEON context for the hypercall.
>> +	 */
>> +	kernel_neon_begin(&fpsimd_state);
>> +
>> +	/*
>> +	 * VTL switch for ARM64 platform - managing VTL0's CPU context.
>> +	 * We explicitly use the stack to save the base pointer, and use x16
>> +	 * as our working register for accessing the context structure.
>> +	 *
>> +	 * Register Handling:
>> +	 * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
>> +	 * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
>> +	 * - X19-X30: Saved/restored (part of VTL0's execution context)
>> +	 * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
>> +	 * - SP: Not in structure, hypervisor-managed per-VTL
>> +	 *
>> +	 * X29 (FP) and X30 (LR) are in the structure and must be saved/restored
>> +	 * as part of VTL0's complete execution state.
>> +	 */
>> +	asm __volatile__ (
>> +		/* Save base pointer to stack explicitly, then load into x16 */
>> +		"str %0, [sp, #-16]!\n\t"     /* Push base pointer onto stack */
>> +		"mov x16, %0\n\t"             /* Load base pointer into x16 */
>> +		/* Volatile registers (Windows ARM64 ABI: x0-x17) */
>> +		"ldp x0, x1, [x16]\n\t"
>> +		"ldp x2, x3, [x16, #(2*8)]\n\t"
>> +		"ldp x4, x5, [x16, #(4*8)]\n\t"
>> +		"ldp x6, x7, [x16, #(6*8)]\n\t"
>> +		"ldp x8, x9, [x16, #(8*8)]\n\t"
>> +		"ldp x10, x11, [x16, #(10*8)]\n\t"
>> +		"ldp x12, x13, [x16, #(12*8)]\n\t"
>> +		"ldp x14, x15, [x16, #(14*8)]\n\t"
>> +		/* x16 will be loaded last, after saving base pointer */
>> +		"ldr x17, [x16, #(17*8)]\n\t"
>> +		/* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
>> +
>> +		/* General-purpose registers: x19-x30 */
>> +		"ldp x19, x20, [x16, #(19*8)]\n\t"
>> +		"ldp x21, x22, [x16, #(21*8)]\n\t"
>> +		"ldp x23, x24, [x16, #(23*8)]\n\t"
>> +		"ldp x25, x26, [x16, #(25*8)]\n\t"
>> +		"ldp x27, x28, [x16, #(27*8)]\n\t"
>> +
>> +		/* Frame pointer and link register */
>> +		"ldp x29, x30, [x16, #(29*8)]\n\t"
>> +
>> +		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> +		"ldp q0, q1, [x16, #(32*8)]\n\t"
>> +		"ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> +		"ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> +		"ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> +		"ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> +		"ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> +		"ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> +		"ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> +		"ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> +		"ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> +		"ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> +		"ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> +		"ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> +		"ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> +		"ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> +		"ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> +		/* Now load x16 itself */
>> +		"ldr x16, [x16, #(16*8)]\n\t"
>> +
>> +		/* Return to the lower VTL */
>> +		"hvc #3\n\t"
> 
> NAK to this.
> 
> * This is a non-SMCCC hypercall, which we have NAK'd in general in the
>    past for various reasons that I am not going to rehash here.
> 
> * It's not clear how this is going to be extended with necessary
>    architecture state in future (e.g. SVE, SME). This is not
>    future-proof, and I don't believe this is maintainable.
> 
> * This breaks general requirements for reliable stacktracing by
>    clobbering state (e.g. x29) that we depend upon being valid AT ALL
>    TIMES outside of entry code.
> 
> * IMO, if this needs to be saved/restored, that should happen in
>    whatever you are calling.
> 
> Mark.


I am merging the threads to address the comments from Mark Rutland and
Marc Zyngier on this patch.

Thanks for reviewing the changes. Please allow me to briefly explain the 
use case here and then address your comments.

Hyper-V's Virtual Trust Levels (VTLs) provide hardware-enforced 
isolation within a single VM, analogous to ARM TrustZone. The kernel 
runs in VTL2 (higher privilege) as a "paravisor", a security monitor 
that handles intercepts for the primary OS in VTL0 (lower privilege). 
The VTL switch (mshv_vtl_return_call) is functionally equivalent to 
KVM's guest enter/exit. It saves VTL2 state, loads VTL0's GPRs and other
registers from a shared context structure, issues hvc #3 to let VTL0
run, and on return saves VTL0's updated state back.

As for the problems with the code, I have identified a few ways to
address them.

I can put the assembly code in a separate .S file with
SYM_FUNC_START/SYM_FUNC_END and mark it as noinstr, to prevent
ftrace/kprobes from instrumenting between the GPR load and the hvc,
which could corrupt VTL0 register state. This should solve the x29
clobbering and stack tracing problems.

I should use kernel_neon_begin()/kernel_neon_end() to save/restore the
full extended FP state of the current task in VTL2. VTL0's Q0-Q31 can be
loaded/saved separately via fpsimd_load_state()/fpsimd_save_state().
This way, the assembly touches none of the SIMD registers. This is
SVE/SME-safe for VTL2's task state. VTL0 still only carries Q0-Q31 in
the context struct, and extending it to SVE or SME is a future context
struct change, which will need Hyper-V arm64 ABI support.

With these changes, VTL2's callee-saved regs (x19-x28, x29, x30) are
explicitly saved to the stack frame at the top and restored at the
bottom of the assembly code. The C caller (in hv_vtl.c) is a clean
function call.
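The context struct layout that the assembly offsets depend on can be
checked at compile time. The following is a user-space sketch of the
struct from the quoted patch, with uint64_t in place of __u64; it assumes
a 64-bit GCC or Clang target that provides __uint128_t. The helpers
vtl_ctx_x_offset() and vtl_ctx_q_offset() are hypothetical and exist only
to mirror the "#(N*8)" and "#(32*8 + N*16)" offsets used in the asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the arm64 context struct from the quoted patch. */
struct mshv_vtl_cpu_context {
	uint64_t x[31];		/* x0-x30; the x18 slot is never reloaded */
	uint64_t rsvd;		/* padding so that q[] starts at 32 * 8 */
	__uint128_t q[32];	/* Q0-Q31 */
};

static_assert(offsetof(struct mshv_vtl_cpu_context, q) == 32 * 8,
	      "q[] must start at byte offset 256");
static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
	      "context must fit the space reserved in struct mshv_vtl_run");

/* Hypothetical helper: byte offset of general-purpose register xN. */
static inline size_t vtl_ctx_x_offset(unsigned int n)
{
	return offsetof(struct mshv_vtl_cpu_context, x) + n * sizeof(uint64_t);
}

/* Hypothetical helper: byte offset of SIMD register qN. */
static inline size_t vtl_ctx_q_offset(unsigned int n)
{
	return offsetof(struct mshv_vtl_cpu_context, q) + n * sizeof(__uint128_t);
}
```

With this layout the struct is 768 bytes (256 for x[] plus rsvd, 512 for
q[]), so any SVE/SME extension would have to fit in the remaining space
of the 1024-byte limit or change the reserved area.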

Regarding the non-SMCCC "hvc #3" call, I am constrained by the ABI that
is defined by the Hyper-V hypervisor. Fixing this requires a
hypervisor-side change to support SMCCC-style dispatch for VTL return.
Until then, hvc #3 is the only working interface. Moreover, a new ABI
interface, if one is added at all, would raise backward compatibility
issues.

Link to TLFS: 
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm#on-arm64-platforms-3

Please correct me if any of the above is incorrect or if I should be 
looking at some other existing examples to solve these problems.

Regards,
Naman

^ permalink raw reply
