* [PATCH 02/10] mshv: Fix potential integer overflow in mshv_region_create
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The allocation size is computed as:
sizeof(*region) + sizeof(struct page *) * nr_pages
where nr_pages is a u64 originating from userspace. A sufficiently
large nr_pages can overflow the multiplication, resulting in a small
allocation followed by out-of-bounds writes when populating mreg_pages.
Use struct_size() which returns SIZE_MAX on overflow, causing vzalloc
to safely return NULL — caught by the existing error check.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_regions.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..1d04a97980b8b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -177,7 +177,7 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
{
struct mshv_mem_region *region;
- region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+ region = vzalloc(struct_size(region, mreg_pages, nr_pages));
if (!region)
return ERR_PTR(-ENOMEM);
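
For illustration only, here is a userspace sketch of the saturating size computation that struct_size() provides. It is not the kernel implementation: demo_region, demo_region_size() and demo_region_size_unchecked() are hypothetical names, and the GCC/Clang overflow builtins are assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mshv_mem_region; only the layout matters. */
struct demo_region {
	uint64_t start_gfn;
	uint64_t nr_pages;
	void *pages[];		/* flexible array, like mreg_pages */
};

/*
 * Saturating size computation in the spirit of struct_size():
 * returns SIZE_MAX if header + count * element does not fit in size_t.
 */
static size_t demo_region_size(uint64_t nr_pages)
{
	size_t array_bytes, total;

	if (nr_pages > SIZE_MAX)	/* only possible when size_t is 32-bit */
		return SIZE_MAX;
	if (__builtin_mul_overflow((size_t)nr_pages, sizeof(void *), &array_bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(sizeof(struct demo_region), array_bytes, &total))
		return SIZE_MAX;
	return total;
}

/* The unchecked form, shown for comparison; it wraps around silently. */
static size_t demo_region_size_unchecked(uint64_t nr_pages)
{
	return sizeof(struct demo_region) + sizeof(void *) * (size_t)nr_pages;
}
```

With a page count one past SIZE_MAX / sizeof(void *), the unchecked form wraps to just the header size, while the checked form saturates, so an allocator that rejects SIZE_MAX fails cleanly.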
* [PATCH 01/10] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.
Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.
While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_hv_call.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index f8c2341193da5..61871ad131b4b 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1041,7 +1041,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
{
struct hv_input_modify_sparse_spa_page_host_access *input_page;
u64 status;
- int done = 0;
+ u64 done = 0;
unsigned long irq_flags, large_shift = 0;
u64 page_count = page_struct_count;
u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1058,9 +1058,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
}
while (done < page_count) {
- ulong i, completed, remain = page_count - done;
- int rep_count = min(remain,
- HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+ u64 i, completed, remain = page_count - done;
+ u64 rep_count = min_t(u64, remain,
+ HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
local_irq_save(irq_flags);
input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1074,15 +1074,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
input_page->flags = flags;
input_page->host_access = host_access;
- for (i = 0; i < rep_count; i++) {
- u64 index = (done + i) << large_shift;
-
- if (index >= page_struct_count)
- return -EINVAL;
-
+ for (i = 0; i < rep_count; i++)
input_page->spa_page_list[i] =
- page_to_pfn(pages[index]);
- }
+ page_to_pfn(pages[(done + i) << large_shift]);
status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
NULL);
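
The loop-invariant argument in the commit message can be checked mechanically. The following userspace sketch is an illustration, not driver code; demo_index_always_in_bounds() is a hypothetical helper, and n plays the role of done + i.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * For a given page_struct_count and large_shift, check that every index the
 * PFN-filling loop can produce stays inside the pages[] array.
 */
static bool demo_index_always_in_bounds(uint64_t page_struct_count,
					unsigned int large_shift)
{
	uint64_t page_count = page_struct_count >> large_shift;
	uint64_t n;	/* plays the role of done + i */

	for (n = 0; n < page_count; n++) {
		if ((n << large_shift) >= page_struct_count)
			return false;
	}
	return true;
}
```

The check holds because n <= (count >> shift) - 1 implies n << shift <= count - (1 << shift), which is strictly below count.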
* [PATCH 00/10] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
This series addresses bugs found during a review of the mshv_root module
introduced by commit 621191d709b14 ("Drivers: hv: Introduce mshv_root
module to expose /dev/mshv to VMMs").
The fixes range from data corruption and use-after-free to silent
functional failures:
- IRQ state leak and type truncation in hypercall helpers
(hv_call_modify_spa_host_access)
- Integer overflow on userspace-controlled allocation size
(mshv_region_create)
- Missing locking, broken seqcount read protection, and a check on
uninitialized data in the irqfd path — the latter makes
level-triggered interrupt resampling completely non-functional
- Duplicate GSI 0 detection using the wrong predicate
- Use-after-RCU in port ID lookup
- Missing VP index bounds check in intercept ISR (OOB in interrupt
context)
- Missing error code on VP allocation failure (silent success to
userspace)
---
Stanislav Kinsburskii (10):
mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
mshv: Fix potential integer overflow in mshv_region_create
mshv: Fix missing lock in mshv_irqfd_deassign
mshv: Fix broken seqcount read protection
mshv: Fix level-triggered check on uninitialized data
mshv: Fix duplicate GSI detection for GSI 0
mshv: Fix use-after-RCU in mshv_portid_lookup
mshv: Use kfree_rcu in mshv_portid_free
mshv: Add missing vp_index bounds check in intercept ISR
mshv: Fix missing error code on VP allocation failure
drivers/hv/mshv_eventfd.c | 75 ++++++++++++++++++++++------------------
drivers/hv/mshv_irq.c | 2 +
drivers/hv/mshv_portid_table.c | 6 +--
drivers/hv/mshv_regions.c | 2 +
drivers/hv/mshv_root_hv_call.c | 18 +++-------
drivers/hv/mshv_root_main.c | 4 ++
drivers/hv/mshv_synic.c | 4 ++
7 files changed, 59 insertions(+), 52 deletions(-)
* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-29 18:01 UTC (permalink / raw)
To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
matthew.ruffell@canonical.com, kjlx@templeofstupid.com
Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com>
From: Dexuan Cui <DECUI@microsoft.com> Sent: Tuesday, April 28, 2026 6:58 PM
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, April 23, 2026 10:40 AM
[snip]
>
> > Question about Gen 1 VMs: If the Linux frame buffer driver moves
> > the frame buffer somewhere other than the default location, and
> > then the VM does a kexec/kdump, what does the legacy PCI graphic
> > device BAR report as the frame buffer location? Does it *always*
> > report 4G-128MB, or does it report the new location? I can run
>
> It always reports 4G-128MB.
OK, good to know. I was hoping it might report the new location. :-(
> BTW, I suspect a Gen2 VM may have the same issue, i.e.
> currently we only reserve 8MB below 4GB; if hyperv_drm uses
> high MMIO, I suspect the UEFI firmware would still report the
> same original low MMIO framebuffer base/size to the kdump kernel,
> but there is no easy way to verify this for Gen2 VMs...
>
[snip]
>
> However, when the kdump kernel starts to run, and I print the
> pci_resource_start(pdev, 0) and pci_resource_len(pdev, 0)
> from vmbus_reserve_fb(), I still see 4G-128MB:
> [ 12.506159] Gen1 VM: start=0xf8000000, size=0x4000000
>
> In this case, we can't really fix the MMIO conflict, e.g.
> if both hv_pci and hyperv_drm are built as modules, then
> the order of loading them can be nondeterministic: if the order
> in the first kernel is different from the order in
> the kdump kernel, we run into trouble.
Yep.
>
> If the order is deterministic (e.g. hv_pci is
> built-in, and hyperv_drm is built as a module),
> we should be good since both allocate MMIO from
> the high MMIO range in a deterministic way.
>
Yep.
Thanks,
Michael
* RE: [PATCH] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Michael Kelley @ 2026-04-29 18:01 UTC (permalink / raw)
To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, Long Li, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, matthew.ruffell@canonical.com,
johansen@templeofstupid.com
Cc: stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69214DC322549834104D26E0BF342@SA1PR21MB6921.namprd21.prod.outlook.com>
From: Dexuan Cui <DECUI@microsoft.com> Sent: Tuesday, April 28, 2026 8:13 PM
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, April 23, 2026 10:40 AM
[snip]
> > > + /* Hyper-V CoCo guests do not have a framebuffer device. */
> > > + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> > > + return;
> >
> > This test is testing feature "A" (mem encryption) in order to determine
> > the presence of feature "B" (no framebuffer), because current
> > configurations happen to always have "A" and "B" at the same time. But
> > the linkage between the features is tenuous, and if configurations should
> > change in the future, testing this way could be bogus. It works now, but I'm
> > leery of depending on the linkage between "A" and "B".
> >
> > You could set up a "can_have_framebuffer" flag in ms_hyperv_init_platform()
> > if running in a CVM, and test that flag here. But I'd suggest just dropping
> > this optimization. CVMs are always Gen2 (and that's not going to change),
> > so they have plenty of low mmio space.
>
> This is not true on a lab host, e.g. I have a TDX VM on a lab host created
> by these 2 commands (without the 2nd command, Hyper-V won't allow
> the TDX VM to start):
>
> New-VM -Generation 2 -GuestStateIsolationType Tdx -Name $vmName
> Disable-VMConsoleSupport -VMName $vmName
>
> The low_mmio_base is still 4GB-128MB. In this case, it's not a good idea
> to try to reserve the 128MB:
>
> 1) the available low MMIO size is smaller than 128MB due to the vTPM
> MMIO range.
>
> 2) even if we can reserve the 109.25MB low MMIO range
> [0xf8000000-0xfed3ffff], we may not want to do that, just in case
> some assigned PCI device has 32-bit BARs.
>
> So, IMO we need to keep the check:
> + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> + return;
>
> BTW, I think this may be a slightly better check here:
> + if (hv_is_isolation_supported())
> + return;
Agreed. Using hv_is_isolation_supported() seems better than
cc_platform_has() for this purpose.
>
> A CVM on Hyper-V won't start without the command line
> Disable-VMConsoleSupport -VMName $vmName
Unfortunately, on my laptop Hyper-V, a VM with VBS Isolation appears
to *not* require Disable-VMConsoleSupport. I can start the VM, and the
VM is offered the VMBus synthvid, mouse, and keyboard devices.
But what's weird in this case is that vmbus_reserve_fb() sees lfb_base
and lfb_start as 0. Furthermore, as a test, I changed the "allowed_in_isolated"
flag to true for the synthvid device, and the Hyper-V DRM driver loads and
initializes. In doing so, the vmconnect.exe window is resized larger, as is
done in a normal VM. /proc/iomem shows that the DRM driver claimed
the expected MMIO range at the start of low MMIO space. I can run a user
space program that mmaps /dev/fb0 and writes pixels to the mmap'ed
memory, and that succeeds as it would in a normal VM, but the
vmconnect.exe window doesn't show anything. It appears that the Hyper-V
host has allocated memory for the frame buffer, but is ignoring anything
that is written to it.
Running Disable-VMConsoleSupport works as expected -- the synthvid,
mouse, and keyboard devices are no longer offered to the VM.
>
> IMO this is very unlikely to change in the future, because the Hyper-V
> synthetic framebuffer VMBus device is not a trusted device for a CVM,
> so there is no reason for Hyper-V to offer such a device to CVMs; even
> if the host offers it, currently the guest hv_vmbus driver ignores it.
>
In the case of VBS Isolation, if such a VM also had a PCI pass-thru device,
the core problem could recur. I.e., not reserving space for the framebuffer
could allow the PCI device to try to use MMIO space that Hyper-V has
set up for the frame buffer, causing the PCI device to fail. And that's a
worse problem than just having the graphics console not function. I
can't actually try the failure case because I don't have an assignable PCI
device on my laptop, but it seems likely based on the evidence that
Hyper-V is setting up a framebuffer device.
So instead of not reserving any MMIO space for the framebuffer on
CVMs, the code you already have limits the reservation to half of the
MMIO space below 4 GB. Won't that work to avoid exhausting the low
MMIO space in a CVM that's running on a local Hyper-V with only 128
MiB of low MMIO space?
> When we assign a physical PCI GPU device to a CVM, I'm not sure if there
> is any framebuffer from the GPU or not. Even if there is, that's a completely
> different scenario and not reserving some low MMIO for "framebuffer"
> is unrelated: I think hyperv_drm (or the deprecated hyperv_fb) is the only
> driver that sets the fb_overlap_ok parameter of vmbus_allocate_mmio().
>
> > And at the moment, CVMs don't
> > support PCI devices,
>
> This is not true: recently I created a "Standard DC16eds v6" TDX CVM
> on Azure, and I did see two NVMe local temporary disks in "nvme list"
> (here TDISP is not used). In 2023, we added the commit
> 2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")
> and I believe some users are running CVMs with GPUs.
Interesting! I worked on commit 2c6ba4216844, but had not noticed
that Azure now has offerings that make use of it. I'll take a look at
that TDX VM size.
Thanks,
Michael
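
The address arithmetic quoted in this thread (4G-128MB as the low MMIO base, and the 109.25MB size of [0xf8000000-0xfed3ffff]) can be double-checked with a small userspace sketch. The DEMO_* macros and demo_range_size() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SZ_1M	(1ULL << 20)
#define DEMO_SZ_4G	(1ULL << 32)

/* Size in bytes of an inclusive [start, end] address range. */
static uint64_t demo_range_size(uint64_t start, uint64_t end)
{
	return end - start + 1;
}
```

4G-128MB works out to 0xf8000000, and the inclusive range up to 0xfed3ffff spans 0x6d40000 bytes, which is 437/4 = 109.25 MiB.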
* [PATCH v2] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-29 16:48 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
preceding bug-fix patches:
Move "done += completed" before the status checks so that pages mapped
by a partially-successful batch are included in the error cleanup unmap.
Previously these mappings were leaked on failure.
While here, improve type safety and readability:
- Change "int done" to "u64 done" to match the u64 page_count it is
compared against, avoiding signed/unsigned comparison hazards.
- Use u64 for loop iteration and batch size variables consistently.
- Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
- Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
- Simplify the error-path unmap to use "done << large_shift" directly
instead of mutating done in place.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_hv_call.c | 55 +++++++++++++++-------------------------
1 file changed, 20 insertions(+), 35 deletions(-)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index e5992c324904a..1f19a4ca824f0 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
struct hv_input_map_gpa_pages *input_page;
u64 status, *pfnlist;
unsigned long irq_flags, large_shift = 0;
- int ret = 0, done = 0;
- u64 page_count = page_struct_count;
+ u64 done = 0, page_count = page_struct_count;
+ int ret = 0;
if (page_count == 0 || (pages && mmio_spa))
return -EINVAL;
@@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
}
while (done < page_count) {
- ulong i, completed, remain = page_count - done;
- int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
+ u64 i, completed, remain = page_count - done;
+ u64 rep_count = min_t(u64, remain, HV_MAP_GPA_BATCH_SIZE);
local_irq_save(irq_flags);
input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
input_page->map_flags = flags;
pfnlist = input_page->source_gpa_page_list;
- for (i = 0; i < rep_count; i++)
- if (flags & HV_MAP_GPA_NO_ACCESS) {
+ for (i = 0; i < rep_count; i++) {
+ if (flags & HV_MAP_GPA_NO_ACCESS)
pfnlist[i] = 0;
- } else if (pages) {
- u64 index = (done + i) << large_shift;
-
- if (index >= page_struct_count) {
- ret = -EINVAL;
- break;
- }
- pfnlist[i] = page_to_pfn(pages[index]);
- } else {
+ else if (pages)
+ pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
+ else
pfnlist[i] = mmio_spa + done + i;
- }
- if (ret) {
- local_irq_restore(irq_flags);
- break;
}
status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
@@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
local_irq_restore(irq_flags);
completed = hv_repcomp(status);
+ done += completed;
if (hv_result_needs_memory(status)) {
ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
HV_MAP_GPA_DEPOSIT_PAGES);
if (ret)
break;
-
} else if (!hv_result_success(status)) {
ret = hv_result_to_errno(status);
break;
}
-
- done += completed;
}
if (ret && done) {
u32 unmap_flags = 0;
- if (flags & HV_MAP_GPA_LARGE_PAGE) {
+ if (flags & HV_MAP_GPA_LARGE_PAGE)
unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
- done <<= large_shift;
- }
- hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
+ hv_call_unmap_gpa_pages(partition_id, gfn,
+ done << large_shift, unmap_flags);
}
return ret;
@@ -305,7 +292,7 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
struct hv_input_unmap_gpa_pages *input_page;
u64 status, page_count = page_count_4k;
unsigned long irq_flags, large_shift = 0;
- int ret = 0, done = 0;
+ u64 done = 0;
if (page_count == 0)
return -EINVAL;
@@ -319,8 +306,8 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
}
while (done < page_count) {
- ulong completed, remain = page_count - done;
- int rep_count = min(remain, HV_UMAP_GPA_PAGES);
+ u64 completed, remain = page_count - done;
+ u64 rep_count = min_t(u64, remain, HV_UMAP_GPA_PAGES);
local_irq_save(irq_flags);
input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -333,15 +320,13 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
local_irq_restore(irq_flags);
completed = hv_repcomp(status);
- if (!hv_result_success(status)) {
- ret = hv_result_to_errno(status);
- break;
- }
-
done += completed;
+
+ if (!hv_result_success(status))
+ return hv_result_to_errno(status);
}
- return ret;
+ return 0;
}
int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
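
To illustrate why the position of "done += completed" matters, here is a userspace simulation of the batching loop. It is a sketch under stated assumptions: demo_hypercall() is a fake that stands in for the rep hypercall, DEMO_BATCH is an arbitrary batch size, and none of these names exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BATCH 4	/* hypothetical batch size */

/* Hypothetical result of one rep hypercall. */
struct demo_result {
	uint64_t completed;	/* reps that finished before any failure */
	bool failed;
};

/*
 * Fake hypercall: succeeds fully until fail_at pages in total have been
 * processed, then completes only the reps before fail_at and reports failure.
 */
static struct demo_result demo_hypercall(uint64_t done, uint64_t rep_count,
					 uint64_t fail_at)
{
	struct demo_result r = { rep_count, false };

	if (done + rep_count > fail_at) {
		r.completed = fail_at > done ? fail_at - done : 0;
		r.failed = true;
	}
	return r;
}

/* Returns the value of "done" that the error cleanup path would see. */
static uint64_t demo_map(uint64_t page_count, uint64_t fail_at,
			 bool count_before_check)
{
	uint64_t done = 0;

	while (done < page_count) {
		uint64_t remain = page_count - done;
		uint64_t rep_count = remain < DEMO_BATCH ? remain : DEMO_BATCH;
		struct demo_result r = demo_hypercall(done, rep_count, fail_at);

		if (count_before_check)
			done += r.completed;	/* new ordering */
		if (r.failed)
			break;
		if (!count_before_check)
			done += r.completed;	/* old ordering */
	}
	return done;
}
```

With 10 pages and a failure after 6, the new ordering reports 6 pages to the cleanup path, while the old ordering reports only 4 and leaves the 2 pages from the partial batch mapped.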
* Re: [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-29 15:15 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260429-orca-of-legal-symmetry-3c72bc@anirudhrb>
On Wed, Apr 29, 2026 at 11:02:37AM +0000, Anirudh Rayabharam wrote:
> On Tue, Apr 28, 2026 at 11:21:12PM +0000, Stanislav Kinsburskii wrote:
> > Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> > preceding bug-fix patches:
> >
> > Move "done += completed" before the status checks so that pages mapped
> > by a partially-successful batch are included in the error cleanup unmap.
> > Previously these mappings were leaked on failure.
> >
> > While here, improve type safety and readability:
> > - Change "int done" to "u64 done" to match the u64 page_count it is
> > compared against, avoiding signed/unsigned comparison hazards.
> > - Use u64 for loop iteration and batch size variables consistently.
> > - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
> > - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
> > - Simplify the error-path unmap to use "done << large_shift" directly
> > instead of mutating done in place.
> >
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_root_hv_call.c | 55 +++++++++++++++-------------------------
> > 1 file changed, 20 insertions(+), 35 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index e5992c324904a..f5f205a397834 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > struct hv_input_map_gpa_pages *input_page;
> > u64 status, *pfnlist;
> > unsigned long irq_flags, large_shift = 0;
> > - int ret = 0, done = 0;
> > - u64 page_count = page_struct_count;
> > + u64 done = 0, page_count = page_struct_count;
> > + int ret = 0;
> >
> > if (page_count == 0 || (pages && mmio_spa))
> > return -EINVAL;
> > @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > }
> >
> > while (done < page_count) {
> > - ulong i, completed, remain = page_count - done;
> > - int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> > + u64 i, completed, remain = page_count - done;
> > + u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
> >
> > local_irq_save(irq_flags);
> > input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > @@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > input_page->map_flags = flags;
> > pfnlist = input_page->source_gpa_page_list;
> >
> > - for (i = 0; i < rep_count; i++)
> > - if (flags & HV_MAP_GPA_NO_ACCESS) {
> > + for (i = 0; i < rep_count; i++) {
> > + if (flags & HV_MAP_GPA_NO_ACCESS)
> > pfnlist[i] = 0;
> > - } else if (pages) {
> > - u64 index = (done + i) << large_shift;
> > -
> > - if (index >= page_struct_count) {
> > - ret = -EINVAL;
> > - break;
> > - }
> > - pfnlist[i] = page_to_pfn(pages[index]);
> > - } else {
> > + else if (pages)
> > + pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
> > + else
> > pfnlist[i] = mmio_spa + done + i;
> > - }
> > - if (ret) {
> > - local_irq_restore(irq_flags);
> > - break;
> > }
> >
> > status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> > @@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > local_irq_restore(irq_flags);
> >
> > completed = hv_repcomp(status);
> > + done += completed;
> >
> > if (hv_result_needs_memory(status)) {
> > ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> > HV_MAP_GPA_DEPOSIT_PAGES);
> > if (ret)
> > break;
> > -
> > } else if (!hv_result_success(status)) {
> > ret = hv_result_to_errno(status);
> > break;
> > }
> > -
> > - done += completed;
> > }
> >
> > if (ret && done) {
> > u32 unmap_flags = 0;
> >
> > - if (flags & HV_MAP_GPA_LARGE_PAGE) {
> > + if (flags & HV_MAP_GPA_LARGE_PAGE)
> > unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > - done <<= large_shift;
> > - }
> > - hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> > + hv_call_unmap_gpa_pages(partition_id, gfn,
> > + done << large_shift, unmap_flags);
>
> How does this work? Earlier we were doing "done << large_shift" only if
> HV_MAP_GPA_LARGE_PAGE is set but now we always do it.
>
It works because large_shift is initialized to 0 when
HV_MAP_GPA_LARGE_PAGE is not set.
Thanks,
Stanislav
> Thanks,
> Anirudh.
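
The point made in the reply above, that an unconditional shift is harmless when large_shift is 0, can be sketched in userspace as follows. The value 9 for DEMO_LARGE_SHIFT is an assumption (2 MiB large pages over 4 KiB base pages), and demo_unmap_count() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: 2 MiB large pages over 4 KiB base pages, 21 - 12 = 9. */
#define DEMO_LARGE_SHIFT 9

/* Convert a count of mapped pages into a count of 4 KiB pages for the unmap call. */
static uint64_t demo_unmap_count(uint64_t done, bool large_page)
{
	unsigned int large_shift = 0;	/* stays 0 unless large pages are in use */

	if (large_page)
		large_shift = DEMO_LARGE_SHIFT;

	return done << large_shift;	/* a shift by 0 leaves done unchanged */
}
```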
* Re: [PATCH] mshv: Add dedicated ioctl for GVA to GPA translation
From: Anirudh Rayabharam @ 2026-04-29 13:11 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177741648871.626779.11067281081219290277.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Tue, Apr 28, 2026 at 10:48:24PM +0000, Stanislav Kinsburskii wrote:
> Add an MSHV_TRANSLATE_GVA ioctl on the VP fd that wraps
> HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX with transparent fault-in handling for
> movable memory regions. The passthrough path for this hypercall is retained
> for backward compatibility.
>
> When guest-backing pages reside in movable memory regions, the mmu_notifier
> invalidation path remaps them to NO_ACCESS in the hypervisor's second-level
> address translation tables. If the VMM issues a GVA translation (e.g.
> during MMIO emulation) while a page-table page is invalidated, the
> hypervisor returns HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS. The VMM cannot
> resolve this on its own.
>
> The new ioctl detects this transient GPA access failure, faults the page
> back in via mshv_region_handle_gfn_fault(), and retries the translation
> until it succeeds or an unrecoverable error occurs.
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_root.h | 3 ++
> drivers/hv/mshv_root_hv_call.c | 37 +++++++++++++++++++++
> drivers/hv/mshv_root_main.c | 69 ++++++++++++++++++++++++++++++++++++++++
> include/hyperv/hvgdk_mini.h | 1 +
> include/hyperv/hvhdk.h | 41 ++++++++++++++++++++++++
> include/uapi/linux/mshv.h | 10 ++++++
> 6 files changed, 161 insertions(+)
>
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 1f086dcb7aa1a..2e6c4414740cc 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -290,6 +290,9 @@ int hv_call_delete_vp(u64 partition_id, u32 vp_index);
> int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
> u64 dest_addr,
> union hv_interrupt_control control);
> +int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
> + u64 flags, u64 gva, u64 *gfn,
> + struct hv_translate_gva_result_ex *result);
> int hv_call_clear_virtual_interrupt(u64 partition_id);
> int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
> union hv_gpa_page_access_state_flags state_flags,
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..9ff4ba5373f59 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -692,6 +692,43 @@ int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code,
> return 0;
> }
>
> +int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
> + u64 flags, u64 gva, u64 *gfn,
> + struct hv_translate_gva_result_ex *result)
> +{
> + struct hv_input_translate_virtual_address *input;
> + struct hv_output_translate_virtual_address_ex *output;
> + unsigned long irq_flags;
> + u64 status;
> +
> + local_irq_save(irq_flags);
> +
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> + memset(input, 0, sizeof(*input));
> + input->partition_id = partition_id;
> + input->vp_index = vp_index;
> + input->control_flags = flags;
> + input->gva_page = gva >> HV_HYP_PAGE_SHIFT;
> +
> + status = hv_do_hypercall(HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX,
> + input, output);
> +
> + if (!hv_result_success(status)) {
> + local_irq_restore(irq_flags);
> + pr_err("%s: %s\n", __func__, hv_result_to_string(status));
> + return hv_result_to_errno(status);
> + }
> +
> + *result = output->translation_result;
> + *gfn = output->gpa_page;
> +
> + local_irq_restore(irq_flags);
> +
> + return 0;
> +}
> +
> int
> hv_call_clear_virtual_interrupt(u64 partition_id)
> {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..2d7b6923415a8 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -898,6 +898,72 @@ mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
> return 0;
> }
>
> +static bool mshv_gpa_fault_retryable(u32 result_code)
> +{
> + /*
> + * Note: HV_TRANSLATE_GVA_GPA_UNMAPPED is intentionally not handled
> + * here. The guest page table cannot be unmapped under normal
> + * operation. It may be mapped with no access during page moves,
> + * but a truly unmapped state indicates a kernel driver bug.
> + * Retrying in this case would only mask the underlying problem of
> + * an unmapped guest page table.
> + */
> + return result_code == HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS;
> +}
> +
> +static long
> +mshv_vp_ioctl_translate_gva(struct mshv_vp *vp, void __user *user_args)
> +{
> + struct mshv_partition *partition = vp->vp_partition;
> + struct mshv_translate_gva args;
> + struct hv_translate_gva_result_ex result;
> + u64 gfn, gpa;
> + int ret;
> +
> + if (copy_from_user(&args, user_args, sizeof(args)))
> + return -EFAULT;
> +
> + do {
> + ret = hv_call_translate_virtual_address_ex(vp->vp_index,
> + partition->pt_id,
> + args.flags, args.gva,
> + &gfn, &result);
> + if (ret)
> + return ret;
> +
> + if (mshv_gpa_fault_retryable(result.result_code)) {
> + struct mshv_mem_region *region;
> + bool faulted;
> +
> + region = mshv_partition_region_by_gfn_get(partition,
> + gfn);
> + if (!region)
> + return -EFAULT;
> +
> + faulted = false;
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> + faulted = mshv_region_handle_gfn_fault(region,
> + gfn);
> + mshv_region_put(region);
> +
> + if (!faulted)
> + return -EFAULT;
> +
> + cond_resched();
> + }
> + } while (mshv_gpa_fault_retryable(result.result_code));
> +
> + gpa = (gfn << PAGE_SHIFT) | (args.gva & ~PAGE_MASK);
> +
> + if (copy_to_user(args.result, &result, sizeof(*args.result)))
Indentation is a bit off here.
With that fixed:
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
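
As a side note on the quoted "gpa = (gfn << PAGE_SHIFT) | (args.gva & ~PAGE_MASK)" line, here is a userspace sketch of that composition. It assumes 4 KiB pages and follows the kernel convention that PAGE_MASK clears the in-page offset bits; the DEMO_* macros and demo_gpa_from_gfn() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1ULL << DEMO_PAGE_SHIFT)
/* Same convention as the kernel's PAGE_MASK: keeps the page number bits. */
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))

/* Combine the translated frame number with the offset taken from the GVA. */
static uint64_t demo_gpa_from_gfn(uint64_t gfn, uint64_t gva)
{
	return (gfn << DEMO_PAGE_SHIFT) | (gva & ~DEMO_PAGE_MASK);
}
```

For example, frame 0x1234 combined with a GVA whose low 12 bits are 0xdef gives GPA 0x1234def; the upper bits of the GVA do not contribute.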
* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Olaf Hering @ 2026-04-29 12:44 UTC (permalink / raw)
To: Thorsten Blum
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Greg Kroah-Hartman, stable, linux-hyperv, linux-kernel
In-Reply-To: <afH7VELGgh8eGBUC@linux.dev>
Wed, 29 Apr 2026 14:36:36 +0200 Thorsten Blum <thorsten.blum@linux.dev>:
> What makes you think this is just "cosmetics"?
It does fix an unlikely bug indeed, but it does not need to trigger the whole paperwork attached to a Fixes tag.
Olaf
* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Thorsten Blum @ 2026-04-29 12:36 UTC (permalink / raw)
To: Olaf Hering
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Greg Kroah-Hartman, stable, Ky Srinivasan, linux-hyperv,
linux-kernel
In-Reply-To: <20260429142724.4d74641a.olaf@aepfle.de>
On Wed, Apr 29, 2026 at 02:27:24PM +0200, Olaf Hering wrote:
> Tue, 14 Apr 2026 13:10:08 +0200 Thorsten Blum <thorsten.blum@linux.dev>:
>
> > Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")
>
> Please do not abuse the Fixes tag when in fact this change is "cosmetics".
What makes you think this is just "cosmetics"?
* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Olaf Hering @ 2026-04-29 12:27 UTC (permalink / raw)
To: Thorsten Blum
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Greg Kroah-Hartman, stable, Ky Srinivasan, linux-hyperv,
linux-kernel
In-Reply-To: <20260414111008.307220-2-thorsten.blum@linux.dev>
Tue, 14 Apr 2026 13:10:08 +0200 Thorsten Blum <thorsten.blum@linux.dev>:
> Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")
Please do not abuse the Fixes tag when in fact this change is "cosmetics".
Olaf
^ permalink raw reply
* Re: [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Anirudh Rayabharam @ 2026-04-29 11:02 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177741845948.632922.14128507833980339307.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Tue, Apr 28, 2026 at 11:21:12PM +0000, Stanislav Kinsburskii wrote:
> Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> preceding bug-fix patches:
>
> Move "done += completed" before the status checks so that pages mapped
> by a partially-successful batch are included in the error cleanup unmap.
> Previously these mappings were leaked on failure.
>
> While here, improve type safety and readability:
> - Change "int done" to "u64 done" to match the u64 page_count it is
> compared against, avoiding signed/unsigned comparison hazards.
> - Use u64 for loop iteration and batch size variables consistently.
> - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
> - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
> - Simplify the error-path unmap to use "done << large_shift" directly
> instead of mutating done in place.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_hv_call.c | 55 +++++++++++++++-------------------------
> 1 file changed, 20 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..f5f205a397834 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> struct hv_input_map_gpa_pages *input_page;
> u64 status, *pfnlist;
> unsigned long irq_flags, large_shift = 0;
> - int ret = 0, done = 0;
> - u64 page_count = page_struct_count;
> + u64 done = 0, page_count = page_struct_count;
> + int ret = 0;
>
> if (page_count == 0 || (pages && mmio_spa))
> return -EINVAL;
> @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> }
>
> while (done < page_count) {
> - ulong i, completed, remain = page_count - done;
> - int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> + u64 i, completed, remain = page_count - done;
> + u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
>
> local_irq_save(irq_flags);
> input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> input_page->map_flags = flags;
> pfnlist = input_page->source_gpa_page_list;
>
> - for (i = 0; i < rep_count; i++)
> - if (flags & HV_MAP_GPA_NO_ACCESS) {
> + for (i = 0; i < rep_count; i++) {
> + if (flags & HV_MAP_GPA_NO_ACCESS)
> pfnlist[i] = 0;
> - } else if (pages) {
> - u64 index = (done + i) << large_shift;
> -
> - if (index >= page_struct_count) {
> - ret = -EINVAL;
> - break;
> - }
> - pfnlist[i] = page_to_pfn(pages[index]);
> - } else {
> + else if (pages)
> + pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
> + else
> pfnlist[i] = mmio_spa + done + i;
> - }
> - if (ret) {
> - local_irq_restore(irq_flags);
> - break;
> }
>
> status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> @@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> local_irq_restore(irq_flags);
>
> completed = hv_repcomp(status);
> + done += completed;
>
> if (hv_result_needs_memory(status)) {
> ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> HV_MAP_GPA_DEPOSIT_PAGES);
> if (ret)
> break;
> -
> } else if (!hv_result_success(status)) {
> ret = hv_result_to_errno(status);
> break;
> }
> -
> - done += completed;
> }
>
> if (ret && done) {
> u32 unmap_flags = 0;
>
> - if (flags & HV_MAP_GPA_LARGE_PAGE) {
> + if (flags & HV_MAP_GPA_LARGE_PAGE)
> unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> - done <<= large_shift;
> - }
> - hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> + hv_call_unmap_gpa_pages(partition_id, gfn,
> + done << large_shift, unmap_flags);
How does this work? Earlier we were doing "done << large_shift" only if
HV_MAP_GPA_LARGE_PAGE is set but now we always do it.
Thanks,
Anirudh.
^ permalink raw reply
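One way the unconditional shift could be equivalent to the old code is sketched below. It assumes that large_shift stays at its initial value of 0 unless HV_MAP_GPA_LARGE_PAGE is set; the quoted hunks show the "large_shift = 0" initializer but not the assignment, so that is an assumption here, not something the patch confirms. The EX_ constants and all function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in the Hyper-V headers. */
#define EX_MAP_GPA_LARGE_PAGE 0x1u
#define EX_LARGE_SHIFT        9u /* assumed: 512 small pages per large page */

/* Assumed setup: large_shift is non-zero only for large-page mappings. */
static unsigned int ex_pick_large_shift(uint32_t flags)
{
	return (flags & EX_MAP_GPA_LARGE_PAGE) ? EX_LARGE_SHIFT : 0;
}

/* Old error path: scale "done" only when the large-page flag is set. */
static uint64_t ex_unmap_count_old(uint64_t done, uint32_t flags,
				   unsigned int large_shift)
{
	if (flags & EX_MAP_GPA_LARGE_PAGE)
		done <<= large_shift;
	return done;
}

/* New error path: scale unconditionally; a shift of 0 is a no-op. */
static uint64_t ex_unmap_count_new(uint64_t done, unsigned int large_shift)
{
	return done << large_shift;
}
```

Under that assumption both variants return the same count for either flag setting. If large_shift can be non-zero without the flag, the two paths diverge and the question above stands.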
* Re: [PATCH V1 04/13] mshv: Provide a way to get partition id if running in a VMM process
From: Anirudh Rayabharam @ 2026-04-29 10:35 UTC (permalink / raw)
To: Mukesh R
Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-5-mrathor@linux.microsoft.com>
On Tue, Apr 21, 2026 at 07:32:30PM -0700, Mukesh R wrote:
> Many PCI passthru related hypercalls require partition id of the target
> guest. Guests are actually managed by MSHV driver and the partition id
> is only maintained there. Add a field in the partition struct in MSHV
> driver to save the tgid of the VMM process creating the partition,
> and add a function there to retrieve partition id if current process
> is a VMM process.
>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root.h | 1 +
> drivers/hv/mshv_root_main.c | 22 ++++++++++++++++++++++
> include/asm-generic/mshyperv.h | 5 +++++
> 3 files changed, 28 insertions(+)
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v4 3/3] mshv: unmap debugfs stats pages on kexec
From: Anirudh Rayabharam @ 2026-04-29 10:10 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-4-jloeser@linux.microsoft.com>
On Mon, Apr 27, 2026 at 02:38:54PM -0700, Jork Loeser wrote:
> On L1VH, debugfs stats pages are overlay pages: the kernel allocates
> them and registers the GPAs with the hypervisor via
> HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
> hypervisor across kexec. If the kexec'd kernel reuses those physical
> pages, the hypervisor's overlay semantics cause a machine check
> exception.
>
> Fix this by calling mshv_debugfs_exit() from the reboot notifier,
> which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
> kexec. This releases the overlay bindings so the physical pages can be
> safely reused. Guard mshv_debugfs_exit() against being called when
> init failed.
>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/mshv_debugfs.c | 7 ++++++-
> drivers/hv/mshv_synic.c | 1 +
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> index 418b6dc8f3c2..3c3e02237ae9 100644
> --- a/drivers/hv/mshv_debugfs.c
> +++ b/drivers/hv/mshv_debugfs.c
> @@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
>
> mshv_debugfs = debugfs_create_dir("mshv", NULL);
> if (IS_ERR(mshv_debugfs)) {
> + err = PTR_ERR(mshv_debugfs);
> + mshv_debugfs = NULL;
> pr_err("%s: failed to create debugfs directory\n", __func__);
Might as well print err here.
Nevertheless:
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Thanks,
Anirudh.
^ permalink raw reply
* Re: [PATCH v2 12/15] mshv_vtl: Move VSM code page offset logic to x86 files
From: Naman Jain @ 2026-04-29 10:00 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157E0525DDDD153888F5AFBD4362@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> The VSM code page offset register (HV_REGISTER_VSM_CODE_PAGE_OFFSETS)
>> is x86 specific, its value configures the static call used to return
>> to VTL0 via the hypercall page. Move the register read from the common
>> mshv_vtl_get_vsm_regs() into the x86 mshv_vtl_return_call_init(),
>> which is the sole consumer of the offset.
>>
>> Change mshv_vtl_return_call_init() from taking a u64 parameter
>> to taking no arguments, and rename mshv_vtl_get_vsm_regs() to
>> mshv_vtl_get_vsm_cap_reg() since it now only fetches
>> HV_REGISTER_VSM_CAPABILITIES.
>>
>> No functional change on x86. This prepares the common driver code for
>> ARM64 where VSM code page offsets do not apply.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/x86/hyperv/hv_vtl.c | 19 +++++++++++++++++--
>> arch/x86/include/asm/mshyperv.h | 4 ++--
>> drivers/hv/mshv_vtl_main.c | 24 +++++++++++++-----------
>> 3 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
>> index f3ffb6a7cb2d..7c10b34cf8a4 100644
>> --- a/arch/x86/hyperv/hv_vtl.c
>> +++ b/arch/x86/hyperv/hv_vtl.c
>> @@ -293,10 +293,25 @@ EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>>
>> DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
>>
>> -void mshv_vtl_return_call_init(u64 vtl_return_offset)
>> +int mshv_vtl_return_call_init(void)
>> {
>> + struct hv_register_assoc vsm_pg_offset_reg;
>> + union hv_register_vsm_page_offsets offsets;
>> + int ret;
>> +
>> + vsm_pg_offset_reg.name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> +
>> + ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> + 1, input_vtl_zero, &vsm_pg_offset_reg);
>> + if (ret)
>> + return ret;
>> +
>> + offsets.as_uint64 = vsm_pg_offset_reg.value.reg64;
>> +
>> static_call_update(__mshv_vtl_return_hypercall,
>> - (void *)((u8 *)hv_hypercall_pg + vtl_return_offset));
>> + (void *)((u8 *)hv_hypercall_pg + offsets.vtl_return_offset));
>> +
>> + return 0;
>> }
>> EXPORT_SYMBOL(mshv_vtl_return_call_init);
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index b4d80c9a673a..b48f115c1292 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -286,14 +286,14 @@ struct mshv_vtl_cpu_context {
>> #ifdef CONFIG_HYPERV_VTL_MODE
>> void __init hv_vtl_init_platform(void);
>> int __init hv_vtl_early_init(void);
>> -void mshv_vtl_return_call_init(u64 vtl_return_offset);
>> +int mshv_vtl_return_call_init(void);
>> void mshv_vtl_return_hypercall(void);
>> void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>> #else
>> static inline void __init hv_vtl_init_platform(void) {}
>> static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> +static inline int mshv_vtl_return_call_init(void) { return 0; }
>> static inline void mshv_vtl_return_hypercall(void) {}
>> static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> #endif
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 4c9ae65ad3e8..be498c9234fd 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -79,7 +79,6 @@ struct mshv_vtl {
>> };
>>
>> static struct mutex mshv_vtl_poll_file_lock;
>> -static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>> static union hv_register_vsm_capabilities mshv_vsm_capabilities;
>>
>> static DEFINE_PER_CPU(struct mshv_vtl_poll_file, mshv_vtl_poll_file);
>> @@ -203,21 +202,19 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>> /* VTL2 Host VSP SINT is (un)masked when the user mode requests that */
>> }
>>
>> -static int mshv_vtl_get_vsm_regs(void)
>> +static int mshv_vtl_get_vsm_cap_reg(void)
>> {
>> - struct hv_register_assoc registers[2];
>> - int ret, count = 2;
>> + struct hv_register_assoc vsm_capability_reg;
>> + int ret;
>>
>> - registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> - registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
>> + vsm_capability_reg.name = HV_REGISTER_VSM_CAPABILITIES;
>>
>> ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> - count, input_vtl_zero, registers);
>> + 1, input_vtl_zero, &vsm_capability_reg);
>> if (ret)
>> return ret;
>>
>> - mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
>> - mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
>> + mshv_vsm_capabilities.as_uint64 = vsm_capability_reg.value.reg64;
>>
>> return ret;
>
> Nit: This could be just "return 0".
Acked.
>
>> }
>> @@ -1139,13 +1136,18 @@ static int __init mshv_vtl_init(void)
>> tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
>> init_waitqueue_head(&fd_wait_queue);
>>
>> - if (mshv_vtl_get_vsm_regs()) {
>> + if (mshv_vtl_get_vsm_cap_reg()) {
>> dev_emerg(dev, "Unable to get VSM capabilities !!\n");
>
> Why is this failure an emergency message, while the other failures
> here in mshv_vtl_init() are just error messages? When there's lack
> of consistency, I always wonder if there is a reason ..... :-)
It might be because I didn’t pay enough attention to the old code :)
dev_err() should work just fine, I'll change it.
>
>> ret = -ENODEV;
>> goto free_dev;
>> }
>>
>> - mshv_vtl_return_call_init(mshv_vsm_page_offsets.vtl_return_offset);
>> + ret = mshv_vtl_return_call_init();
>> + if (ret) {
>> + dev_err(dev, "mshv_vtl_return_call_init failed: %d\n", ret);
>> + goto free_dev;
>> + }
>> +
>> ret = hv_vtl_setup_synic();
>> if (ret)
>> goto free_dev;
>> --
>> 2.43.0
>>
Regards,
Naman
^ permalink raw reply
* Re: [PATCH v4 2/3] mshv: clean up SynIC state on kexec for L1VH
From: Anirudh Rayabharam @ 2026-04-29 9:58 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-3-jloeser@linux.microsoft.com>
On Mon, Apr 27, 2026 at 02:38:53PM -0700, Jork Loeser wrote:
> The reboot notifier that tears down the SynIC cpuhp state guards the
> cleanup with hv_root_partition(), so on L1VH (where
> hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
> cleaned up before kexec. The kexec'd kernel then inherits stale
> unmasked SINTs and an enabled SIRBP pointing to freed memory.
>
> Remove the hv_root_partition() guard so the cleanup runs for all
> parent partitions.
>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/mshv_synic.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 2db3b0192eac..978a1cace341 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -723,9 +723,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
> static int mshv_synic_reboot_notify(struct notifier_block *nb,
> unsigned long code, void *unused)
> {
> - if (!hv_root_partition())
> - return 0;
> -
> cpuhp_remove_state(synic_cpuhp_online);
> return 0;
> }
> --
> 2.43.0
>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
* Re: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Anirudh Rayabharam @ 2026-04-29 9:58 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-2-jloeser@linux.microsoft.com>
On Mon, Apr 27, 2026 at 02:38:52PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
>
> Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
>
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
> and nested root partition); on a non-nested root partition VMBus
> does not run, so MSHV must enable/disable them
>
> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
> produces wrong addresses on ARM64 with 64K pages.
>
> Note that initialization order matters - VMBUS first, MSHV second,
> and the reverse on de-init. Ideally, we would want a dedicated SYNIC
> driver that replaces the cross-dependencies with a clear API and
> dynamic tracking. Such refactor should go into its own dedicated
> series, outside of this kexec fix series.
>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/hv.c | 3 +
> drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
> 2 files changed, 103 insertions(+), 50 deletions(-)
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
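The page-size point in the commit message can be illustrated with plain arithmetic. This is a minimal userspace sketch, not the driver code: the EX_ constants and helper names are hypothetical, with 12 standing in for HV_HYP_PAGE_SHIFT (4K hypervisor pages) and 16 for PAGE_SHIFT on a 64K-page ARM64 kernel.

```c
#include <stdint.h>

#define EX_HV_HYP_PAGE_SHIFT 12u /* hypervisor pages are always 4K */
#define EX_PAGE_SHIFT_64K    16u /* kernel PAGE_SHIFT with 64K pages */

/* Convert a physical address to a page frame number for a given shift. */
static uint64_t ex_addr_to_pfn(uint64_t addr, unsigned int shift)
{
	return addr >> shift;
}

/* Convert a page frame number back to a physical address. */
static uint64_t ex_pfn_to_addr(uint64_t pfn, unsigned int shift)
{
	return pfn << shift;
}
```

For a page at 0x12345000, shifting by 12 gives the PFN 0x12345 that the hypervisor expects, while shifting by 16 gives 0x1234 and drops address bits. In the other direction, expanding a hypervisor PFN of 0x12345 with a 16-bit shift yields 0x123450000 instead of 0x12345000.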
* Re: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Naman Jain @ 2026-04-29 9:57 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157467FDBC0203C67A67042D4362@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
>> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
>> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
>> in architecture-specific code.
>>
>> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
>> include/asm-generic/mshyperv.h so they are visible to both arch and
>> driver code.
>>
>> Change the return type from void to bool so the caller can determine
>> whether the register page was successfully configured and set
>> mshv_has_reg_page accordingly.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/x86/hyperv/hv_vtl.c | 32 ++++++++++++++++++++++
>> drivers/hv/mshv_vtl_main.c | 49 +++-------------------------------
>> include/asm-generic/mshyperv.h | 17 ++++++++++++
>> 3 files changed, 53 insertions(+), 45 deletions(-)
>>
<snip>
>> #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>
> This comment pre-dates your patch, but I don't understand the point
> it is trying to make. The comment is factually true, but I don't know
> why calling that out is relevant. The REG_PAGE MSR seems to be
> conceptually separate and distinct from the SIMP MSR, so the fact
> that the layouts are the same is just a coincidence. Or is there some
> relationship between the two MSRs that I'm not aware of, and the
> comment is trying (and failing?) to point out?
This was added at Nuno's suggestion in my initial series for
MSHV_VTL. If the "identical to" reference is misleading, I should
remove it.
https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/
Quoting:
"""
it is a generic structure that
appears to be used for several overlay page MSRs (SIMP, SIEF, etc).
But, the type doesn't appear in the hv*dk headers explicitly; it's just
used internally by the hypervisor.
I think it should be renamed with a hv_ prefix to indicate it's part of
the hypervisor ABI, and a brief comment with the provenance:
/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
union hv_synic_overlay_page_msr {
/* <snip> */
};
"""
>
>> +union hv_synic_overlay_page_msr {
>> + u64 as_uint64;
>> + struct {
>> + u64 enabled: 1;
>> + u64 reserved: 11;
>> + u64 pfn: 52;
>> + } __packed;
>> +};
>> +
>> u8 __init get_vtl(void);
>> void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>> #else
>> static inline u8 get_vtl(void) { return 0; }
>> static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> +static inline bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu) { return false; }
>
> As with Patch 8, if CONFIG_HYPERV_VTL_MODE caused mshv_common.o
> to be built, this stub wouldn't be needed.
>
Acked.
>> #endif
>>
>> #endif
>> --
>> 2.43.0
>>
Regards,
Naman
^ permalink raw reply
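For readers following the layout discussion, the bitfield in union hv_synic_overlay_page_msr can be mirrored with explicit shifts and masks. The sketch assumes the usual LSB-first bitfield allocation used by GCC and Clang on x86 and arm64, so "enabled" is bit 0, "reserved" covers bits 1-11, and "pfn" covers bits 12-63. The helper names are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

#define EX_OVERLAY_ENABLED   0x1ULL /* bit 0 */
#define EX_OVERLAY_PFN_SHIFT 12u    /* bits 12..63 hold the PFN */

/* Build the 64-bit register value from its two meaningful fields. */
static uint64_t ex_overlay_msr_pack(int enabled, uint64_t pfn)
{
	return (pfn << EX_OVERLAY_PFN_SHIFT) |
	       (enabled ? EX_OVERLAY_ENABLED : 0);
}

/* Extract the page frame number (reserved bits are discarded). */
static uint64_t ex_overlay_msr_pfn(uint64_t value)
{
	return value >> EX_OVERLAY_PFN_SHIFT;
}

/* Report whether the overlay page is enabled. */
static int ex_overlay_msr_enabled(uint64_t value)
{
	return (value & EX_OVERLAY_ENABLED) != 0;
}
```

The same enabled/reserved/pfn split is what makes the layout match the SIMP register, which appears to be the observation the disputed comment records.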
* Re: [PATCH v2 08/15] Drivers: hv: Move hv_call_(get|set)_vp_registers() declarations
From: Naman Jain @ 2026-04-29 9:57 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157852404B5258EF13A5450D4362@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/27/2026 11:09 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
>> declarations from drivers/hv/mshv.h to include/asm-generic/mshyperv.h.
>>
>> These functions are defined in mshv_common.c and are going to be called
>> from both drivers/hv/ and arch/x86/hyperv/hv_vtl.c. The latter never
>> included mshv.h, relying on implicit declaration visibility. Moving the
>> declarations to the arch-generic Hyper-V header makes them properly
>> visible to all architecture-specific callers.
>>
>> Provide static inline stubs returning -EOPNOTSUPP when neither
>> CONFIG_MSHV_ROOT nor CONFIG_MSHV_VTL is enabled.
>
> Looking at the drivers/hv/Kconfig, it's possible to build with
> CONFIG_HYPERV_VTL_MODE=y, but not CONFIG_MSHV_VTL. In such a
> case, mshv_common.o doesn't get built, which is why the stubs are
> needed. Is such a configuration desirable for some scenarios?
>
> I wonder if having CONFIG_HYPERV_VTL_MODE force the building of
> mshv_common.o would be a better approach. Then the stubs wouldn't
> be needed. The "ifneq" statement in drivers/hv/Makefile could use
> CONFIG_HYPERV_VTL_MODE instead of CONFIG_MSHV_VTL, and
> everything would be good since CONFIG_MSHV_VTL depends on
> CONFIG_HYPERV_VTL_MODE.
>
This looks good. I'll try this and make the changes. In case there are
some challenges with that, I'll revert back.
Regards,
Naman
^ permalink raw reply
* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-29 9:56 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157C147A1B915F9B45D3B74D4362@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/27/2026 11:08 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
>> driver on arm64. This function enables the transition between Virtual
>> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/arm64/hyperv/Makefile | 1 +
>> arch/arm64/hyperv/hv_vtl.c | 158 ++++++++++++++++++++++++++++++
>> arch/arm64/include/asm/mshyperv.h | 13 +++
>> arch/x86/include/asm/mshyperv.h | 2 -
>> drivers/hv/mshv_vtl.h | 3 +
>> include/asm-generic/mshyperv.h | 2 +
>> 6 files changed, 177 insertions(+), 2 deletions(-)
>> create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
>
> [snip]
>
>> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
>> index 585b23a26f1b..9eb0e5999f29 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -60,6 +60,18 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>> ARM_SMCCC_SMC_64, \
>> ARM_SMCCC_OWNER_VENDOR_HYP, \
>> HV_SMCCC_FUNC_NUMBER)
>> +
>> +struct mshv_vtl_cpu_context {
>> +/*
>> + * x18 is managed by the hypervisor. It won't be reloaded from this array.
>> + * It is included here for convenience in array indexing.
>> + * 'rsvd' field serves as alignment padding so q[] starts at offset 32*8=256.
>> + */
>> + __u64 x[31];
>> + __u64 rsvd;
>> + __uint128_t q[32];
>> +};
>> +
>> #ifdef CONFIG_HYPERV_VTL_MODE
>> /*
>> * Get/Set the register. If the function returns `1`, that must be done via
>> @@ -69,6 +81,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, b
>> {
>> return 1;
>> }
>> +
>
> This appears to be a spurious blank line being added since there
> are no other changes in the vicinity.
Acked.
>
>> #endif
>>
>> #include <asm-generic/mshyperv.h>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 08278547b84c..b4d80c9a673a 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
>> #ifdef CONFIG_HYPERV_VTL_MODE
>> void __init hv_vtl_init_platform(void);
>> int __init hv_vtl_early_init(void);
>> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> void mshv_vtl_return_call_init(u64 vtl_return_offset);
>> void mshv_vtl_return_hypercall(void);
>> void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>> #else
>> static inline void __init hv_vtl_init_platform(void) {}
>> static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> static inline void mshv_vtl_return_hypercall(void) {}
>> static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
>> index a6eea52f7aa2..103f07371f3f 100644
>> --- a/drivers/hv/mshv_vtl.h
>> +++ b/drivers/hv/mshv_vtl.h
>> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
>> char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
>> };
>>
>> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
>> + "struct mshv_vtl_cpu_context exceeds reserved space in struct mshv_vtl_run");
>> +
>> #endif /* _MSHV_VTL_H */
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index db183c8cfb95..8cdf2a9fbdfb 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>
>> #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> u8 __init get_vtl(void);
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> #else
>> static inline u8 get_vtl(void) { return 0; }
>> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>
> Is this stub needed? Maybe I missed something, but it looks to me like none
> of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
> See further comments about stubs in Patch 8 of this series.
>
Config dependencies would handle such cases, so this stub is not required. I
saw similar stubs elsewhere in the code and assumed the norm was to add
them rather than rely on config dependencies.
I can remove it.
Regards,
Naman
^ permalink raw reply
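The layout claims in the arm64 struct comment and the 1024-byte static_assert can be checked in userspace. This is a sketch using a renamed copy of the struct (ex_vtl_cpu_context is hypothetical), with uint64_t in place of __u64, and it assumes a 64-bit GCC or Clang target where __uint128_t is available.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the arm64 struct mshv_vtl_cpu_context in the patch. */
struct ex_vtl_cpu_context {
	uint64_t x[31];    /* x0..x30; the x18 slot is kept for indexing only */
	uint64_t rsvd;     /* padding so that q[] starts at 32 * 8 = 256 */
	__uint128_t q[32]; /* q0..q31, 16 bytes each */
};

/* The comment in the patch states that q[] starts at offset 256. */
_Static_assert(offsetof(struct ex_vtl_cpu_context, q) == 256,
	       "q[] must start at offset 256");

/* Same bound as the static_assert added to drivers/hv/mshv_vtl.h. */
_Static_assert(sizeof(struct ex_vtl_cpu_context) <= 1024,
	       "context exceeds the reserved 1024 bytes");

/* Total size: 31 * 8 + 8 + 32 * 16 bytes. */
static size_t ex_ctx_size(void)
{
	return sizeof(struct ex_vtl_cpu_context);
}
```

With this layout the structure occupies 768 bytes, which leaves 256 bytes of headroom under the 1024-byte bound.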
* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-29 9:56 UTC (permalink / raw)
To: Mark Rutland, Marc Zyngier
Cc: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Michael Kelley, Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi,
Sascha Bischoff, mrigendrachaubey, linux-hyperv, linux-arm-kernel,
linux-kernel, linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <aeolHwXHFH4AnX_n@J2N7QTR9R3.cambridge.arm.com>
On 4/23/2026 7:26 PM, Mark Rutland wrote:
> On Thu, Apr 23, 2026 at 12:41:57PM +0000, Naman Jain wrote:
>> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
>> driver on arm64. This function enables the transition between Virtual
>> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/arm64/hyperv/Makefile | 1 +
>> arch/arm64/hyperv/hv_vtl.c | 158 ++++++++++++++++++++++++++++++
>> arch/arm64/include/asm/mshyperv.h | 13 +++
>> arch/x86/include/asm/mshyperv.h | 2 -
>> drivers/hv/mshv_vtl.h | 3 +
>> include/asm-generic/mshyperv.h | 2 +
>> 6 files changed, 177 insertions(+), 2 deletions(-)
>> create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
>> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
>> index 87c31c001da9..9701a837a6e1 100644
>> --- a/arch/arm64/hyperv/Makefile
>> +++ b/arch/arm64/hyperv/Makefile
>> @@ -1,2 +1,3 @@
>> # SPDX-License-Identifier: GPL-2.0
>> obj-y := hv_core.o mshyperv.o
>> +obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> new file mode 100644
>> index 000000000000..59cbeb74e7b9
>> --- /dev/null
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -0,0 +1,158 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2026, Microsoft, Inc.
>> + *
>> + * Authors:
>> + * Roman Kisel <romank@linux.microsoft.com>
>> + * Naman Jain <namjain@linux.microsoft.com>
>> + */
>> +
>> +#include <asm/mshyperv.h>
>> +#include <asm/neon.h>
>> +#include <linux/export.h>
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> +{
>> + struct user_fpsimd_state fpsimd_state;
>> + u64 base_ptr = (u64)vtl0->x;
>> +
>> + /*
>> + * Obtain the CPU FPSIMD registers for VTL context switch.
>> + * This saves the current task's FP/NEON state and allows us to
>> + * safely load VTL0's FP/NEON context for the hypercall.
>> + */
>> + kernel_neon_begin(&fpsimd_state);
>> +
>> + /*
>> + * VTL switch for ARM64 platform - managing VTL0's CPU context.
>> + * We explicitly use the stack to save the base pointer, and use x16
>> + * as our working register for accessing the context structure.
>> + *
>> + * Register Handling:
>> + * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
>> + * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
>> + * - X19-X30: Saved/restored (part of VTL0's execution context)
>> + * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
>> + * - SP: Not in structure, hypervisor-managed per-VTL
>> + *
>> + * X29 (FP) and X30 (LR) are in the structure and must be saved/restored
>> + * as part of VTL0's complete execution state.
>> + */
>> + asm __volatile__ (
>> + /* Save base pointer to stack explicitly, then load into x16 */
>> + "str %0, [sp, #-16]!\n\t" /* Push base pointer onto stack */
>> + "mov x16, %0\n\t" /* Load base pointer into x16 */
>> + /* Volatile registers (Windows ARM64 ABI: x0-x17) */
>> + "ldp x0, x1, [x16]\n\t"
>> + "ldp x2, x3, [x16, #(2*8)]\n\t"
>> + "ldp x4, x5, [x16, #(4*8)]\n\t"
>> + "ldp x6, x7, [x16, #(6*8)]\n\t"
>> + "ldp x8, x9, [x16, #(8*8)]\n\t"
>> + "ldp x10, x11, [x16, #(10*8)]\n\t"
>> + "ldp x12, x13, [x16, #(12*8)]\n\t"
>> + "ldp x14, x15, [x16, #(14*8)]\n\t"
>> + /* x16 will be loaded last, after saving base pointer */
>> + "ldr x17, [x16, #(17*8)]\n\t"
>> + /* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
>> +
>> + /* General-purpose registers: x19-x30 */
>> + "ldp x19, x20, [x16, #(19*8)]\n\t"
>> + "ldp x21, x22, [x16, #(21*8)]\n\t"
>> + "ldp x23, x24, [x16, #(23*8)]\n\t"
>> + "ldp x25, x26, [x16, #(25*8)]\n\t"
>> + "ldp x27, x28, [x16, #(27*8)]\n\t"
>> +
>> + /* Frame pointer and link register */
>> + "ldp x29, x30, [x16, #(29*8)]\n\t"
>> +
>> + /* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> + "ldp q0, q1, [x16, #(32*8)]\n\t"
>> + "ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> + "ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> + "ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> + "ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> + "ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> + "ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> + "ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> + "ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> + "ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> + "ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> + "ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> + "ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> + "ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> + "ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> + "ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> + /* Now load x16 itself */
>> + "ldr x16, [x16, #(16*8)]\n\t"
>> +
>> + /* Return to the lower VTL */
>> + "hvc #3\n\t"
>
> NAK to this.
>
> * This is a non-SMCCC hypercall, which we have NAK'd in general in the
> past for various reasons that I am not going to rehash here.
>
> * It's not clear how this is going to be extended with necessary
> architecture state in future (e.g. SVE, SME). This is not
> future-proof, and I don't believe this is maintainable.
>
> * This breaks general requirements for reliable stacktracing by
> clobbering state (e.g. x29) that we depend upon being valid AT ALL
> TIMES outside of entry code.
>
> * IMO, if this needs to be saved/restored, that should happen in
> whatever you are calling.
>
> Mark.
Merging threads for addressing comments from Mark Rutland and Marc
Zyngier on this patch.
Thanks for reviewing the changes. Please allow me to briefly explain the
use case here and then address your comments.
Hyper-V's Virtual Trust Levels (VTLs) provide hardware-enforced
isolation within a single VM, analogous to ARM TrustZone. The kernel
runs in VTL2 (higher privilege) as a "paravisor", a security monitor
that handles intercepts for the primary OS in VTL0 (lower privilege).
The VTL switch (mshv_vtl_return_call) is functionally equivalent to
KVM's guest enter/exit. It saves VTL2 state, loads VTL0's GPRs and other
registers from a shared context structure, issues hvc #3 to let VTL0
run, and on return saves VTL0's updated state back.
Coming to the problems with the code, I have identified a few ways to
address them.
I can move the assembly code into a separate .S file, wrapped in
SYM_FUNC_START/SYM_FUNC_END and marked as noinstr, so that
ftrace/kprobes cannot instrument the window between the GPR load and
the hvc, where instrumentation could corrupt VTL0 register state. This
should solve both the x29 clobbering and the stack tracing problems.
I should use kernel_neon_begin()/kernel_neon_end() to save/restore the
full extended FP state of the current task in VTL2. VTL0's Q0-Q31 can be
loaded/saved separately via fpsimd_load_state()/fpsimd_save_state().
This way, the assembly touches none of the SIMD registers. This is
SVE/SME-safe for VTL2's task state. VTL0 still only carries Q0-Q31 in
the context struct, and extending to SVE, SME is a future context struct
change, which will need Hyper-V arm64 ABI support.
This way, VTL2's callee-saved regs (x19-x28, x29, x30) are explicitly
saved to the stack frame at the top and restored at the bottom of
assembly code. The C caller (in hv_vtl.c) is a clean function call.
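For reference, the offsets used by the quoted assembly (x0-x30 at n * 8,
Q registers starting at 32 * 8) imply a context layout along the lines
of the sketch below. This is only a userspace illustration: the struct
name vtl0_ctx_sketch, its field names, and the padding slot are
assumptions, not the definition from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; names are illustrative, not taken from the patch. */
struct vtl0_ctx_sketch {
	uint64_t x[31];      /* x0..x30, register n at byte offset n * 8 */
	uint64_t rsvd;       /* assumed pad so the Q block starts at 32 * 8 */
	uint8_t  q[32][16];  /* q0..q31, 128 bits each */
};

/* Byte offset of GPR xn inside the context. */
static inline size_t vtl0_x_off_sketch(unsigned int n)
{
	return offsetof(struct vtl0_ctx_sketch, x) + n * sizeof(uint64_t);
}

/* Byte offset of SIMD register qn inside the context. */
static inline size_t vtl0_q_off_sketch(unsigned int n)
{
	return offsetof(struct vtl0_ctx_sketch, q) + n * 16u;
}

/* The quoted asm addresses q0 at #(32*8); check the sketch agrees. */
static_assert(offsetof(struct vtl0_ctx_sketch, q) == 32 * 8,
	      "Q registers must start right after the 32 GPR slots");
```

Generating such offsets from the C definition, rather than hard-coding
them in the .S file, would keep the assembly and the struct in sync if
the context is ever extended.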
Regarding the non-SMCCC "hvc #3" call, I am constrained here by the
ABI that the Hyper-V hypervisor defines. Fixing this requires a
hypervisor-side change to support SMCCC-style dispatch for VTL return.
Until then, hvc #3 is the only working interface. Moreover, even if
such a new ABI is added, there would be backward compatibility issues
with it.
Link to TLFS:
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm#on-arm64-platforms-3
Please correct me if any of the above is incorrect or if I should be
looking at some other existing examples to solve these problems.
Regards,
Naman
^ permalink raw reply
* Re: [PATCH v2 03/15] Drivers: hv: Move vmbus_handler to common code
From: Naman Jain @ 2026-04-29 9:55 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157E3B0A6F76E4686D8C3E4D4362@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/27/2026 11:08 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move the vmbus_handler global variable and hv_setup_vmbus_handler()/
>> hv_remove_vmbus_handler() from arch/x86 to drivers/hv/hv_common.c.
>>
>> hv_setup_vmbus_handler() is called unconditionally in vmbus_bus_init()
>> and works for both x86 (sysvec handler) and arm64 (vmbus_percpu_isr).
>>
>> This eliminates the need for separate percpu vmbus handler setup
>> functions and __weak stubs, that are needed for adding ARM64 support
>> in MSHV_VTL driver where we need to set a custom per-cpu vmbus handler.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/x86/kernel/cpu/mshyperv.c | 12 ------------
>> drivers/hv/hv_common.c | 9 +++++++--
>> drivers/hv/vmbus_drv.c | 17 +++++++++--------
>> include/asm-generic/mshyperv.h | 1 +
>> 4 files changed, 17 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 89a2eb8a0722..68706ff5880e 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -145,7 +145,6 @@ void hv_set_msr(unsigned int reg, u64 value)
>> EXPORT_SYMBOL_GPL(hv_set_msr);
>>
>> static void (*mshv_handler)(void);
>> -static void (*vmbus_handler)(void);
>> static void (*hv_stimer0_handler)(void);
>> static void (*hv_kexec_handler)(void);
>> static void (*hv_crash_handler)(struct pt_regs *regs);
>> @@ -172,17 +171,6 @@ void hv_setup_mshv_handler(void (*handler)(void))
>> mshv_handler = handler;
>> }
>>
>> -void hv_setup_vmbus_handler(void (*handler)(void))
>> -{
>> - vmbus_handler = handler;
>> -}
>> -
>> -void hv_remove_vmbus_handler(void)
>> -{
>> - /* We have no way to deallocate the interrupt gate */
>> - vmbus_handler = NULL;
>> -}
>> -
>> /*
>> * Routines to do per-architecture handling of stimer0
>> * interrupts when in Direct Mode
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index e8633bc51d56..eb7b0028b45d 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -758,13 +758,18 @@ bool __weak hv_isolation_type_tdx(void)
>> }
>> EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
>>
>> -void __weak hv_setup_vmbus_handler(void (*handler)(void))
>> +void (*vmbus_handler)(void);
>> +EXPORT_SYMBOL_GPL(vmbus_handler);
>> +
>> +void hv_setup_vmbus_handler(void (*handler)(void))
>> {
>> + vmbus_handler = handler;
>> }
>> EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
>>
>> -void __weak hv_remove_vmbus_handler(void)
>> +void hv_remove_vmbus_handler(void)
>> {
>> + vmbus_handler = NULL;
>> }
>> EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
>
> I'd suggest moving hv_setup_vmbus_handler() and
> hv_remove_vmbus_handler() above or below the group
> of __weak stubs in this source code file. There's a comment
> describing the purpose of these __weak functions, and
> intermixing these two functions that are no longer __weak
> produces something of a jumble.
>
Acked.
>>
>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>> index bc4fc1951ae1..052ca8b11cee 100644
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -1415,7 +1415,8 @@ EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
>>
>> static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
>> {
>> - vmbus_isr();
>> + if (vmbus_handler)
>> + vmbus_handler();
>
> Is it necessary to test vmbus_handler first? From what I can
> see, it is always set before the per-cpu interrupt is setup.
Once hv_remove_vmbus_handler() and the freeing of the IRQ are reordered,
the check can be safely removed. It was required only because I was
setting vmbus_handler to NULL first, before freeing the IRQ.
>
>> return IRQ_HANDLED;
>> }
>>
>> @@ -1517,8 +1518,10 @@ static int vmbus_bus_init(void)
>> vmbus_irq_initialized = true;
>> }
>>
>> + hv_setup_vmbus_handler(vmbus_isr);
>> +
>> if (vmbus_irq == -1) {
>> - hv_setup_vmbus_handler(vmbus_isr);
>> + /* x86: sysvec handler uses vmbus_handler directly */
>> } else {
>> ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr,
>> "Hyper-V VMbus", &vmbus_evt);
>> @@ -1553,9 +1556,8 @@ static int vmbus_bus_init(void)
>> return 0;
>>
>> err_connect:
>> - if (vmbus_irq == -1)
>> - hv_remove_vmbus_handler();
>> - else
>> + hv_remove_vmbus_handler();
>> + if (vmbus_irq != -1)
>> free_percpu_irq(vmbus_irq, &vmbus_evt);
>
> These operations should be reordered so they are the inverse
> of how they are setup. I.e., free_percpu_irq() first, then remove
> the VMBus handler. That's just good standard practice unless
> there's a specific reason to do the cleanup ordering differently. In
> fact, hv_remove_vmbus_handler() needs to be moved down
> to the err_setup label so it's done if request_percpu_irq()
> fails.
Acked. I will do the same for other hv_remove_vmbus_handler() as well.
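A small userspace model of the ordering discussed here, with teardown as
the exact inverse of setup. All names (handler_sketch, init_sketch, and
so on) are hypothetical stand-ins; nothing below is the actual
vmbus_drv.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static void (*handler_sketch)(void);  /* stands in for vmbus_handler */
static bool irq_requested_sketch;     /* stands in for the per-cpu IRQ */
static int isr_calls_sketch;

static void isr_sketch(void)
{
	isr_calls_sketch++;
}

/* Models vmbus_percpu_isr() without the NULL test. */
static void percpu_isr_sketch(void)
{
	/* Holds only because the IRQ is always freed before the handler. */
	assert(irq_requested_sketch && handler_sketch != NULL);
	handler_sketch();
}

static int init_sketch(bool fail_request)
{
	handler_sketch = isr_sketch;   /* step 1: install the handler */
	if (fail_request)
		goto err_setup;        /* step 2 failed: undo step 1 only */
	irq_requested_sketch = true;   /* step 2: request the IRQ */
	return 0;

err_setup:
	handler_sketch = NULL;
	return -1;
}

static void exit_sketch(void)
{
	irq_requested_sketch = false;  /* undo step 2 first */
	handler_sketch = NULL;         /* then undo step 1 */
}
```

With this ordering the handler pointer is never NULL while the IRQ can
still fire, which is why the NULL test in the ISR becomes unnecessary.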
>
>> err_setup:
>> if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
>> @@ -3026,9 +3028,8 @@ static void __exit vmbus_exit(void)
>> vmbus_connection.conn_state = DISCONNECTED;
>> hv_stimer_global_cleanup();
>> vmbus_disconnect();
>> - if (vmbus_irq == -1)
>> - hv_remove_vmbus_handler();
>> - else
>> + hv_remove_vmbus_handler();
>> + if (vmbus_irq != -1)
>> free_percpu_irq(vmbus_irq, &vmbus_evt);
>
> Ordering should be changed here as well so it is the inverse
> of how things are set up.
>
>> if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
>> smpboot_unregister_percpu_thread(&vmbus_irq_threads);
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index 2810aa05dc73..db183c8cfb95 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -179,6 +179,7 @@ static inline u64 hv_generate_guest_id(u64 kernel_version)
>>
>> int hv_get_hypervisor_version(union hv_hypervisor_version_info *info);
>>
>> +extern void (*vmbus_handler)(void);
>> void hv_setup_vmbus_handler(void (*handler)(void));
>> void hv_remove_vmbus_handler(void);
>> void hv_setup_stimer0_handler(void (*handler)(void));
>> --
>> 2.43.0
>>
Regards,
Naman
^ permalink raw reply
* Re: [PATCH v2 02/15] Drivers: hv: Move hv_vp_assist_page to common files
From: Naman Jain @ 2026-04-29 9:55 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
mrigendrachaubey, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org, vdso@mailbox.org,
ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157BEAF5480D931C3756B7DD4362@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/27/2026 11:07 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move the logic to initialize and export hv_vp_assist_page from x86
>> architecture code to Hyper-V common code to allow it to be used for
>> upcoming arm64 support in MSHV_VTL driver.
>> Note: This change also improves error handling - if VP assist page
>> allocation fails, hyperv_init() now returns early instead of
>> continuing with partial initialization.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/x86/hyperv/hv_init.c | 88 +-----------------------------
>> arch/x86/include/asm/mshyperv.h | 14 -----
>> drivers/hv/hv_common.c | 94 ++++++++++++++++++++++++++++++++-
>> include/asm-generic/mshyperv.h | 16 ++++++
>> include/hyperv/hvgdk_mini.h | 6 ++-
>> 5 files changed, 115 insertions(+), 103 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 323adc93f2dc..75a98b5e451b 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -81,9 +81,6 @@ union hv_ghcb * __percpu *hv_ghcb_pg;
>> /* Storage to save the hypercall page temporarily for hibernation */
>> static void *hv_hypercall_pg_saved;
>>
>> -struct hv_vp_assist_page **hv_vp_assist_page;
>> -EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>> -
>> static int hyperv_init_ghcb(void)
>> {
>> u64 ghcb_gpa;
>> @@ -117,59 +114,12 @@ static int hyperv_init_ghcb(void)
>>
>> static int hv_cpu_init(unsigned int cpu)
>> {
>> - union hv_vp_assist_msr_contents msr = { 0 };
>> - struct hv_vp_assist_page **hvp;
>> int ret;
>>
>> ret = hv_common_cpu_init(cpu);
>> if (ret)
>> return ret;
>>
>> - if (!hv_vp_assist_page)
>> - return 0;
>> -
>> - hvp = &hv_vp_assist_page[cpu];
>> - if (hv_root_partition()) {
>> - /*
>> - * For root partition we get the hypervisor provided VP assist
>> - * page, instead of allocating a new page.
>> - */
>> - rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> - *hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> - PAGE_SIZE, MEMREMAP_WB);
>> - } else {
>> - /*
>> - * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> - * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> - * out to make sure we always write the EOI MSR in
>> - * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> - * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> - * case of CPU offlining and the VM will hang.
>> - */
>> - if (!*hvp) {
>> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> -
>> - /*
>> - * Hyper-V should never specify a VM that is a Confidential
>> - * VM and also running in the root partition. Root partition
>> - * is blocked to run in Confidential VM. So only decrypt assist
>> - * page in non-root partition here.
>> - */
>> - if (*hvp && !ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>> - WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> - memset(*hvp, 0, PAGE_SIZE);
>> - }
>> - }
>> -
>> - if (*hvp)
>> - msr.pfn = vmalloc_to_pfn(*hvp);
>> -
>> - }
>> - if (!WARN_ON(!(*hvp))) {
>> - msr.enable = 1;
>> - wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> - }
>> -
>> /* Allow Hyper-V stimer vector to be injected from Hypervisor. */
>> if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
>> apic_update_vector(cpu, HYPERV_STIMER0_VECTOR, true);
>> @@ -286,23 +236,6 @@ static int hv_cpu_die(unsigned int cpu)
>>
>> hv_common_cpu_die(cpu);
>>
>> - if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> - union hv_vp_assist_msr_contents msr = { 0 };
>> - if (hv_root_partition()) {
>> - /*
>> - * For root partition the VP assist page is mapped to
>> - * hypervisor provided page, and thus we unmap the
>> - * page here and nullify it, so that in future we have
>> - * correct page address mapped in hv_cpu_init.
>> - */
>> - memunmap(hv_vp_assist_page[cpu]);
>> - hv_vp_assist_page[cpu] = NULL;
>> - rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> - msr.enable = 0;
>> - }
>> - wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> - }
>> -
>> if (hv_reenlightenment_cb == NULL)
>> return 0;
>>
>> @@ -460,21 +393,6 @@ void __init hyperv_init(void)
>> if (hv_common_init())
>> return;
>>
>> - /*
>> - * The VP assist page is useless to a TDX guest: the only use we
>> - * would have for it is lazy EOI, which can not be used with TDX.
>> - */
>> - if (hv_isolation_type_tdx())
>> - hv_vp_assist_page = NULL;
>> - else
>> - hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
>> - if (!hv_vp_assist_page) {
>> - ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> -
>> - if (!hv_isolation_type_tdx())
>> - goto common_free;
>> - }
>> -
>> if (ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>> /* Negotiate GHCB Version. */
>> if (!hv_ghcb_negotiate_protocol())
>> @@ -483,7 +401,7 @@ void __init hyperv_init(void)
>>
>> hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
>> if (!hv_ghcb_pg)
>> - goto free_vp_assist_page;
>> + goto free_ghcb_page;
>
> Seems like this should be "goto common_free". The allocation of
> hv_ghcb_pg has failed, so going to a label where hv_ghcb_pg is
> freed seems redundant. It works since free_percpu() checks for
> a NULL argument, but it's a bit unexpected since the common_free
> label is already there.
Thanks for catching this, I'll fix it.
>
>> }
>>
>> cpuhp = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "x86/hyperv_init:online",
>> @@ -613,10 +531,6 @@ void __init hyperv_init(void)
>> cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
>> free_ghcb_page:
>> free_percpu(hv_ghcb_pg);
>> -free_vp_assist_page:
>> - kfree(hv_vp_assist_page);
>> - hv_vp_assist_page = NULL;
>> -common_free:
>> hv_common_free();
>> }
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index f64393e853ee..95b452387969 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -155,16 +155,6 @@ static inline u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>> return _hv_do_fast_hypercall16(control, input1, input2);
>> }
>>
>> -extern struct hv_vp_assist_page **hv_vp_assist_page;
>> -
>> -static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> -{
>> - if (!hv_vp_assist_page)
>> - return NULL;
>> -
>> - return hv_vp_assist_page[cpu];
>> -}
>> -
>> void __init hyperv_init(void);
>> void hyperv_setup_mmu_ops(void);
>> void set_hv_tscchange_cb(void (*cb)(void));
>> @@ -254,10 +244,6 @@ static inline void hyperv_setup_mmu_ops(void) {}
>> static inline void set_hv_tscchange_cb(void (*cb)(void)) {}
>> static inline void clear_hv_tscchange_cb(void) {}
>> static inline void hyperv_stop_tsc_emulation(void) {};
>> -static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> -{
>> - return NULL;
>> -}
>> static inline int hyperv_flush_guest_mapping(u64 as) { return -1; }
>> static inline int hyperv_flush_guest_mapping_range(u64 as,
>> hyperv_fill_flush_list_func fill_func, void *data)
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 6b67ac616789..e8633bc51d56 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -28,7 +28,11 @@
>> #include <linux/slab.h>
>> #include <linux/dma-map-ops.h>
>> #include <linux/set_memory.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/io.h>
>> +#include <linux/hyperv.h>
>> #include <hyperv/hvhdk.h>
>> +#include <hyperv/hvgdk.h>
>> #include <asm/mshyperv.h>
>>
>> u64 hv_current_partition_id = HV_PARTITION_ID_SELF;
>> @@ -78,6 +82,8 @@ static struct ctl_table_header *hv_ctl_table_hdr;
>> u8 * __percpu *hv_synic_eventring_tail;
>> EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
>>
>> +struct hv_vp_assist_page **hv_vp_assist_page;
>> +EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>> /*
>> * Hyper-V specific initialization and shutdown code that is
>> * common across all architectures. Called from architecture
>> @@ -92,6 +98,9 @@ void __init hv_common_free(void)
>> if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE)
>> hv_kmsg_dump_unregister();
>>
>> + kfree(hv_vp_assist_page);
>> + hv_vp_assist_page = NULL;
>> +
>> kfree(hv_vp_index);
>> hv_vp_index = NULL;
>>
>> @@ -394,6 +403,23 @@ int __init hv_common_init(void)
>> for (i = 0; i < nr_cpu_ids; i++)
>> hv_vp_index[i] = VP_INVAL;
>>
>> + /*
>> + * The VP assist page is useless to a TDX guest: the only use we
>> + * would have for it is lazy EOI, which can not be used with TDX.
>> + */
>> + if (hv_isolation_type_tdx()) {
>> + hv_vp_assist_page = NULL;
>> +#ifdef CONFIG_X86_64
>> + ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> +#endif
>
> I realize that this #ifdef went away for the reason I flagged in v1 of
> this patch set, but it's back again for a different reason.
>
> Let me suggest another approach. hv_common_init() is called from
> both the x86/64 and arm64 hyperv_init() functions. Immediately after
> the call to hv_common_init() in the x86/64 hyperv_init(), test
> hv_vp_assist_page for NULL and clear
> HV_X64_ENLIGHTENED_VMCS_RECOMMENDED if it is. No #ifdef is
> needed, and x86/64 specific hackery stays under arch/x86 instead of
> being in common code.
Acked. Thanks.
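To make sure I have understood the suggestion, here is a minimal
userspace model of it. The names (common_init_sketch, x86_init_sketch,
VMCS_HINT_SKETCH, and so on) are stand-ins invented for illustration and
the real functions do much more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for HV_X64_ENLIGHTENED_VMCS_RECOMMENDED. */
#define VMCS_HINT_SKETCH (1u << 14)

static void **vp_assist_sketch;  /* stand-in for hv_vp_assist_page */
static uint32_t hints_sketch;    /* stand-in for ms_hyperv.hints */

/* Common code: no arch-specific hint handling and no #ifdef. */
static int common_init_sketch(bool is_tdx, size_t nr_cpus)
{
	assert(nr_cpus > 0);

	if (is_tdx) {
		/* The VP assist page is not used for TDX guests. */
		vp_assist_sketch = NULL;
		return 0;
	}

	vp_assist_sketch = calloc(nr_cpus, sizeof(*vp_assist_sketch));
	return vp_assist_sketch ? 0 : -1;
}

/* x86 side: test the pointer right after the common init call. */
static int x86_init_sketch(bool is_tdx, size_t nr_cpus)
{
	if (common_init_sketch(is_tdx, nr_cpus))
		return -1;

	if (!vp_assist_sketch)
		hints_sketch &= ~VMCS_HINT_SKETCH;

	return 0;
}
```

The arch-specific hint is then only ever touched under arch/x86, and the
common function stays identical for x86 and arm64.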
>
>> + } else {
>> + hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
>> + if (!hv_vp_assist_page) {
>> + hv_common_free();
>> + return -ENOMEM;
>> + }
>> + }
>> +
>> return 0;
>> }
>>
>> @@ -471,6 +497,8 @@ void __init ms_hyperv_late_init(void)
>>
>> int hv_common_cpu_init(unsigned int cpu)
>> {
>> + union hv_vp_assist_msr_contents msr = { 0 };
>> + struct hv_vp_assist_page **hvp;
>> void **inputarg, **outputarg;
>> u8 **synic_eventring_tail;
>> u64 msr_vp_index;
>> @@ -539,7 +567,53 @@ int hv_common_cpu_init(unsigned int cpu)
>> sizeof(u8), flags);
>> /* No need to unwind any of the above on failure here */
>> if (unlikely(!*synic_eventring_tail))
>> - ret = -ENOMEM;
>> + return -ENOMEM;
>> + }
>> +
>> + if (!hv_vp_assist_page)
>> + return ret;
>> +
>> + hvp = &hv_vp_assist_page[cpu];
>> + if (hv_root_partition()) {
>> + /*
>> + * For root partition we get the hypervisor provided VP assist
>> + * page, instead of allocating a new page.
>> + */
>> + msr.as_uint64 = hv_get_msr(HV_MSR_VP_ASSIST_PAGE);
>> + *hvp = memremap(msr.pfn << HV_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> + HV_HYP_PAGE_SIZE, MEMREMAP_WB);
>> + } else {
>> + /*
>> + * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> + * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> + * out to make sure that on x86/x64, we always write the EOI MSR in
>> + * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> + * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> + * case of CPU offlining and the VM will hang.
>> + */
>> + if (!*hvp) {
>> + *hvp = __vmalloc(HV_HYP_PAGE_SIZE, flags | __GFP_ZERO);
>> +
>> + /*
>> + * Hyper-V should never specify a VM that is a Confidential
>> + * VM and also running in the root partition. Root partition
>> + * is blocked to run in Confidential VM. So only decrypt assist
>> + * page in non-root partition here.
>> + */
>> + if (*hvp &&
>> + !ms_hyperv.paravisor_present &&
>> + hv_isolation_type_snp()) {
>> + WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> + memset(*hvp, 0, HV_HYP_PAGE_SIZE);
>> + }
>> + }
>> +
>> + if (*hvp)
>> + msr.pfn = page_to_hvpfn(vmalloc_to_page(*hvp));
>
> Your Patch 0 changelog mentions adding a comment about vmalloc_to_pfn(), which
> I didn't see anywhere. I'm not sure what that comment would say, so maybe it
> became unnecessary.
I think I mixed up two things. The changelog entry was about your
suggestion to add "x86/x64" to the above comment about GPA Overlay Pages.
Separately, I changed this call to page_to_hvpfn(vmalloc_to_page(*hvp)),
also as per your suggestion.
Apologies for the confusion.
>
>> + }
>> + if (!WARN_ON(!(*hvp))) {
>> + msr.enable = 1;
>> + hv_set_msr(HV_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> }
>>
>> return ret;
>> @@ -566,6 +640,24 @@ int hv_common_cpu_die(unsigned int cpu)
>> *synic_eventring_tail = NULL;
>> }
>>
>> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> + union hv_vp_assist_msr_contents msr = { 0 };
>> +
>> + if (hv_root_partition()) {
>> + /*
>> + * For root partition the VP assist page is mapped to
>> + * hypervisor provided page, and thus we unmap the
>> + * page here and nullify it, so that in future we have
>> + * correct page address mapped in hv_cpu_init.
>> + */
>> + memunmap(hv_vp_assist_page[cpu]);
>> + hv_vp_assist_page[cpu] = NULL;
>> + msr.as_uint64 = hv_get_msr(HV_MSR_VP_ASSIST_PAGE);
>> + msr.enable = 0;
>> + }
>> + hv_set_msr(HV_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> + }
>> +
>> return 0;
>> }
>>
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index d37b68238c97..2810aa05dc73 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -25,6 +25,7 @@
>> #include <linux/nmi.h>
>> #include <asm/ptrace.h>
>> #include <hyperv/hvhdk.h>
>> +#include <hyperv/hvgdk.h>
>>
>> #define VTPM_BASE_ADDRESS 0xfed40000
>>
>> @@ -299,6 +300,16 @@ do { \
>> #define hv_status_debug(status, fmt, ...) \
>> hv_status_printk(debug, status, fmt, ##__VA_ARGS__)
>>
>> +extern struct hv_vp_assist_page **hv_vp_assist_page;
>> +
>> +static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> +{
>> + if (!hv_vp_assist_page)
>> + return NULL;
>> +
>> + return hv_vp_assist_page[cpu];
>> +}
>> +
>> const char *hv_result_to_string(u64 hv_status);
>> int hv_result_to_errno(u64 status);
>> void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>> @@ -327,6 +338,11 @@ static inline enum hv_isolation_type hv_get_isolation_type(void)
>> {
>> return HV_ISOLATION_TYPE_NONE;
>> }
>> +
>> +static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> +{
>> + return NULL;
>> +}
>> #endif /* CONFIG_HYPERV */
>>
>> #if IS_ENABLED(CONFIG_MSHV_ROOT)
>> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
>> index 056ef7b6b360..c72d04cd5ae4 100644
>> --- a/include/hyperv/hvgdk_mini.h
>> +++ b/include/hyperv/hvgdk_mini.h
>> @@ -149,6 +149,7 @@ struct hv_u128 {
>> #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT 12
>
> Can this X64 specific definition of the shift be eliminated entirely,
> and a single common definition for x86/64 and arm64 be used?
> As I understand it, the MSR layout is the same on both architectures.
> The one gotcha is that kvm_hv_set_msr() would need to be updated.
>
> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK defined below isn't
> used anywhere, so it could go away too. (The KVM selftest usage has
> its own definition.)
>
> I realize these are changes to a source code file that is derived from
> Windows, and I'm not sure of the guidelines for such changes. So maybe
> these suggestions have to be ignored ....
The VP assist page definition is common to both x86 and arm64, so the
address mask and shift can be shared. Also, I don't see the shift and
mask definitions in the Hyper-V header, so they seem to be specific to
in-kernel usage.
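A userspace sketch of what a single shared definition could look like,
assuming the enable flag is bit 0 and the page number starts at bit 12,
as the shift of 12 and the msr.enable/msr.pfn fields in the quoted patch
suggest. The *_SKETCH macro names and helper names are made up for this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VP_ASSIST_ADDR_SHIFT_SKETCH 12
#define VP_ASSIST_ADDR_MASK_SKETCH \
	(~((1ull << VP_ASSIST_ADDR_SHIFT_SKETCH) - 1))
#define VP_ASSIST_ENABLE_SKETCH     1ull

static_assert(VP_ASSIST_ADDR_MASK_SKETCH == 0xfffffffffffff000ull,
	      "mask must clear exactly the low 12 bits");

/* Build the register value from a page frame number and enable flag. */
static inline uint64_t vp_assist_encode_sketch(uint64_t pfn, bool enable)
{
	return (pfn << VP_ASSIST_ADDR_SHIFT_SKETCH) |
	       (enable ? VP_ASSIST_ENABLE_SKETCH : 0);
}

/* Page frame number stored in the register value. */
static inline uint64_t vp_assist_pfn_sketch(uint64_t msr)
{
	return msr >> VP_ASSIST_ADDR_SHIFT_SKETCH;
}

/* Guest physical address of the page, via the shared mask. */
static inline uint64_t vp_assist_gpa_sketch(uint64_t msr)
{
	return msr & VP_ASSIST_ADDR_MASK_SKETCH;
}
```

Nothing here depends on the architecture, which is the point of having
one shift and one mask for both x86 and arm64.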
>
>> #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK \
>> (~((1ull << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
>> +#define HV_MSR_VP_ASSIST_PAGE (HV_X64_MSR_VP_ASSIST_PAGE)
>
> This is the correct file for this #define, but it should be placed down around
> line 1148 or so with the other HV_MSR_* definitions in terms of HV_X64_MSR_*
>
Acked.
>>
>> /* Hyper-V Enlightened VMCS version mask in nested features CPUID */
>> #define HV_X64_ENLIGHTENED_VMCS_VERSION 0xff
>> @@ -410,6 +411,7 @@ union hv_x64_msr_hypercall_contents {
>> #if defined(CONFIG_ARM64)
>> #define HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE BIT(8)
>> #define HV_STIMER_DIRECT_MODE_AVAILABLE BIT(13)
>> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT 12
>> #endif /* CONFIG_ARM64 */
>>
>> #if defined(CONFIG_X86)
>> @@ -1163,6 +1165,8 @@ enum hv_register_name {
>> #define HV_MSR_STIMER0_CONFIG (HV_X64_MSR_STIMER0_CONFIG)
>> #define HV_MSR_STIMER0_COUNT (HV_X64_MSR_STIMER0_COUNT)
>>
>> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT
>> +
>> #elif defined(CONFIG_ARM64) /* CONFIG_X86 */
>>
>> #define HV_MSR_CRASH_P0 (HV_REGISTER_GUEST_CRASH_P0)
>> @@ -1185,7 +1189,7 @@ enum hv_register_name {
>>
>> #define HV_MSR_STIMER0_CONFIG (HV_REGISTER_STIMER0_CONFIG)
>> #define HV_MSR_STIMER0_COUNT (HV_REGISTER_STIMER0_COUNT)
>> -
>> +#define HV_MSR_VP_ASSIST_PAGE (HV_REGISTER_VP_ASSIST_PAGE)
>
> Nit: This definition is slightly mis-aligned. It has spaces where there
> should be a tab to match the similar definitions above it.
>
Acked.
>> #endif /* CONFIG_ARM64 */
>>
>> union hv_explicit_suspend_register {
>> --
>> 2.43.0
>>
Regards,
Naman
^ permalink raw reply
* [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-29 9:06 UTC (permalink / raw)
To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
Shradha Gupta, Saurabh Singh Sengar, stable
In the MANA driver, the number of IRQs allocated is capped at
min(num_cpu + 1, queue count). When the IRQ count is greater than the
vCPU count, we want to utilize all the vCPUs, irrespective of their
NUMA/core bindings.
This matters most in environments with so few vCPUs that the softIRQ
handling overhead of two IRQs sharing one vCPU is much higher than it
would be if they were spread across sibling vCPUs.
This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU, while some vCPUs have none.
In such cases when many parallel TCP connections are tested, the
throughput drops significantly.
Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
        TYPE        effective vCPU aff
=======================================================
IRQ0:   HWC         0
IRQ1:   mana_q1     0
IRQ2:   mana_q2     2
IRQ3:   mana_q3     0
IRQ4:   mana_q4     3
%soft on each vCPU (mpstat -P ALL 1) on receiver
vCPU            0        1        2        3
=======================================================
pass 1:     38.85     0.03    24.89    24.65
pass 2:     39.15     0.03    24.57    25.28
pass 3:     40.36     0.03    23.20    23.17
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
        TYPE        effective vCPU aff
=======================================================
IRQ0:   HWC         0
IRQ1:   mana_q1     0
IRQ2:   mana_q2     1
IRQ3:   mana_q3     2
IRQ4:   mana_q4     3
%soft on each vCPU (mpstat -P ALL 1) on receiver
vCPU            0        1        2        3
=======================================================
pass 1:     15.42    15.85    14.99    14.51
pass 2:     15.53    15.94    15.81    15.93
pass 3:     16.41    16.35    16.40    16.36
=======================================================
Throughput impact (in Gbps, same env)
=======================================================
TCP conn    with patch    w/o patch
20480       15.65         7.73
10240       15.63         8.93
8192        15.64         9.69
6144        15.64         13.16
4096        15.69         15.75
2048        15.69         15.83
1024        15.71         15.28
Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
Changes in v2
* Removed the unused skip_first_cpu variable
* Fixed the exit condition in irq_setup_linear() for len == 0
* Changed irq_setup_linear() to return void, as it always returned 0
* Removed the unnecessary rcu_read_lock() in irq_setup_linear()
* Added comments describing the expected behaviour when the number of
IRQs is greater than or equal to num_online_cpus()
---
.../net/ethernet/microsoft/mana/gdma_main.c | 47 ++++++++++++++++---
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..d740d1dc43da 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
} else {
/* If dynamic allocation is enabled we have already allocated
* hwc msi
+ * Also, we make sure in this case the following is always true
+ * (num_msix_usable - 1 HWC) <= num_online_cpus()
*/
gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
}
@@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
return 0;
}
+/* should be called with cpus_read_lock() held */
+static void irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ if (len == 0)
+ break;
+
+ irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+ len--;
+ }
+}
+
static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_irq_context *gic;
- bool skip_first_cpu = false;
int *irqs, irq, err, i;
irqs = kmalloc_objs(int, nvec);
@@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
* first CPU sibling group since they are already affinitized to HWC IRQ
*/
cpus_read_lock();
- if (gc->num_msix_usable <= num_online_cpus())
- skip_first_cpu = true;
+ if (gc->num_msix_usable <= num_online_cpus()) {
+ err = irq_setup(irqs, nvec, gc->numa_node, true);
+ if (err) {
+ cpus_read_unlock();
+ goto free_irq;
+ }
+ } else {
+ /*
+ * When num_msix_usable are more than num_online_cpus, we try to
+ * make sure we are using all vcpus. In such a case NUMA or
+ * CPU core affinity does not matter.
+ * Note: in this case the total mana IRQ should always be
+ * num_online_cpus + 1. The first HWC IRQ is already handled
+ * in HWC setup calls
+ * However, if CPUs went offline since num_msix_usable was
+ * computed, nvec count will be more than num_online_cpus().
+ * In such cases remaining extra IRQs will retain their default
+ * affinity.
+ */
+ if (nvec > num_online_cpus())
+ dev_dbg(&pdev->dev,
+ "IRQ count %d exceeds online CPU count %d. Some IRQs will share CPU\n",
+ nvec, num_online_cpus());
- err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
- if (err) {
- cpus_read_unlock();
- goto free_irq;
+ irq_setup_linear(irqs, nvec);
}
cpus_read_unlock();
base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
--
2.34.1
^ permalink raw reply related
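The linear spreading done by irq_setup_linear() in the patch above can be modelled in userspace as a rough sketch. This is not the driver code: assign_linear(), NO_CPU and the bitmask representation of online CPUs are hypothetical names chosen for illustration, and the kernel version uses for_each_online_cpu() and irq_set_affinity_and_hint() instead. The sketch only shows the assignment order and what happens to IRQs left over when there are fewer online CPUs than IRQs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NO_CPU (-1)	/* hypothetical marker: IRQ was not pinned to any CPU */

/*
 * Userspace model of linear IRQ spreading (illustrative only).
 * Bit N set in online_mask means CPU N is online.
 * irq_cpu[i] receives the CPU picked for IRQ i, or NO_CPU if the online
 * CPUs ran out first (the patch leaves such IRQs at their default affinity).
 * Returns the number of IRQs that were pinned.
 */
static unsigned int assign_linear(unsigned long online_mask,
				  int *irq_cpu, unsigned int len)
{
	const unsigned int max_cpus = sizeof(online_mask) * CHAR_BIT;
	unsigned int assigned = 0;
	unsigned int cpu, i;

	assert(irq_cpu != NULL || len == 0);

	for (i = 0; i < len; i++)
		irq_cpu[i] = NO_CPU;

	/* Walk online CPUs in ascending order, one IRQ per CPU. */
	for (cpu = 0; cpu < max_cpus && assigned < len; cpu++) {
		if (!(online_mask & (1UL << cpu)))
			continue;
		irq_cpu[assigned++] = (int)cpu;
	}

	return assigned;
}
```

With four online CPUs (mask 0xF) and four queue IRQs, this model places the IRQs on CPUs 0, 1, 2 and 3 in order, which is the same layout as mana_q1..mana_q4 in the "with this patch" table.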
* Re: [PATCH rc 00/15] Various bug fixes for RDMA drivers in the uapi functions
From: Junxian Huang @ 2026-04-29 7:55 UTC (permalink / raw)
To: Jason Gunthorpe, Andrew Lunn,
Broadcom internal kernel review list, Bryan Tan, Eric Dumazet,
Konstantin Taranov, Jakub Kicinski, Leon Romanovsky, linux-hyperv,
linux-rdma, netdev, Paolo Abeni, Selvin Xavier, Chengchang Tang,
Tariq Toukan, Vishnu Dasa, Yishai Hadas
Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Long Li,
Lijun Ou, Parav Pandit, patches, Roland Dreier, Roland Dreier,
Sagi Grimberg, Ajay Sharma, stable, Tariq Toukan, Wei Hu (Xavier),
Shaobo Xu, Nenglong Zhao
In-Reply-To: <0-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>
On 2026/4/29 0:17, Jason Gunthorpe wrote:
> All were found by Sashiko or Claude AI tools. They vary in severity, but
> are all things that shouldn't be present.
>
> Jason Gunthorpe (15):
> RDMA/hns: Fix xarray race in hns_roce_create_srq()
> RDMA/hns: Fix xarray race in hns_roce_create_qp_common()
> RDMA/hns: Fix unlocked call to hns_roce_qp_remove()
For hns patches:
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>
Thanks,
Junxian
^ permalink raw reply