Linux-HyperV List

* [PATCH v5 00/21] paravirt: cleanup and reorg
From: Juergen Gross @ 2026-01-05 11:04 UTC (permalink / raw)
  To: linux-kernel, x86, linux-hyperv, virtualization, loongarch,
	linuxppc-dev, linux-riscv, kvm
  Cc: Juergen Gross, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Peter Zijlstra,
	Will Deacon, Boqun Feng, Waiman Long, Jiri Kosina, Josh Poimboeuf,
	Pawan Gupta, Boris Ostrovsky, xen-devel, Ajay Kaher,
	Alexey Makhalov, Broadcom internal kernel review list,
	Russell King, Catalin Marinas, Huacai Chen, WANG Xuerui,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-arm-kernel, Paolo Bonzini,
	Vitaly Kuznetsov, Stefano Stabellini, Oleksandr Tyshchenko,
	Daniel Lezcano, Oleg Nesterov

Some cleanups and reorg of paravirt code and headers:

- The first 2 patches should not be controversial at all, as they
  just remove some no-longer-needed #includes and struct forward
  declarations.

- The 3rd patch is removing CONFIG_PARAVIRT_DEBUG, which IMO has
  no real value, as it just changes a crash to a BUG() (the stack
  trace will basically be the same). As the maintainer of the main
  paravirt user (Xen) I have never seen this crash/BUG() happen.

- The 4th patch just moves code.

- I don't know for what reason asm/paravirt_api_clock.h was added,
  as all archs supporting it do it exactly in the same way. Patch
  5 is removing it.

- Patches 6-14 are streamlining the paravirt clock interfaces by
  using a common implementation across architectures where possible
  and by moving the related code into common sched code, as this is
  where it should live.

- Patches 15-20 are more like RFC material, preparing the paravirt
  infrastructure to support multiple pv_ops function arrays.
  As a prerequisite, patches 15-17 drop the Xen static initializers
  of the pv_ops sub-structures, which makes life in objtool much
  easier.
  Patches 18-20 are doing the real preparations for multiple pv_ops
  arrays and using those arrays in multiple headers.

- Patch 21 is an example of what the new scheme can look like,
  using the PV-spinlocks.

Changes in V2:
- new patches 13-18 and 20
- complete rework of patch 21

Changes in V3:
- fixed 2 issues detected by kernel test robot

Changes in V4:
- fixed one build issue

Changes in V5:
- fixed another build issue
- rebase

Juergen Gross (21):
  x86/paravirt: Remove not needed includes of paravirt.h
  x86/paravirt: Remove some unneeded struct declarations
  x86/paravirt: Remove PARAVIRT_DEBUG config option
  x86/paravirt: Move thunk macros to paravirt_types.h
  paravirt: Remove asm/paravirt_api_clock.h
  sched: Move clock related paravirt code to kernel/sched
  arm/paravirt: Use common code for paravirt_steal_clock()
  arm64/paravirt: Use common code for paravirt_steal_clock()
  loongarch/paravirt: Use common code for paravirt_steal_clock()
  riscv/paravirt: Use common code for paravirt_steal_clock()
  x86/paravirt: Use common code for paravirt_steal_clock()
  x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
  x86/paravirt: Introduce new paravirt-base.h header
  x86/paravirt: Move pv_native_*() prototypes to paravirt.c
  x86/xen: Drop xen_irq_ops
  x86/xen: Drop xen_cpu_ops
  x86/xen: Drop xen_mmu_ops
  objtool: Allow multiple pv_ops arrays
  x86/paravirt: Allow pv-calls outside paravirt.h
  x86/paravirt: Specify pv_ops array in paravirt macros
  x86/pvlocks: Move paravirt spinlock functions into own header

 arch/Kconfig                                  |   3 +
 arch/arm/Kconfig                              |   1 +
 arch/arm/include/asm/paravirt.h               |  22 --
 arch/arm/include/asm/paravirt_api_clock.h     |   1 -
 arch/arm/kernel/Makefile                      |   1 -
 arch/arm/kernel/paravirt.c                    |  23 --
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/include/asm/paravirt.h             |  14 -
 arch/arm64/include/asm/paravirt_api_clock.h   |   1 -
 arch/arm64/kernel/paravirt.c                  |  11 +-
 arch/loongarch/Kconfig                        |   1 +
 arch/loongarch/include/asm/paravirt.h         |  13 -
 .../include/asm/paravirt_api_clock.h          |   1 -
 arch/loongarch/kernel/paravirt.c              |  10 +-
 arch/powerpc/include/asm/paravirt.h           |   3 -
 arch/powerpc/include/asm/paravirt_api_clock.h |   2 -
 arch/powerpc/platforms/pseries/setup.c        |   4 +-
 arch/riscv/Kconfig                            |   1 +
 arch/riscv/include/asm/paravirt.h             |  14 -
 arch/riscv/include/asm/paravirt_api_clock.h   |   1 -
 arch/riscv/kernel/paravirt.c                  |  11 +-
 arch/x86/Kconfig                              |   8 +-
 arch/x86/entry/entry_64.S                     |   1 -
 arch/x86/entry/vsyscall/vsyscall_64.c         |   1 -
 arch/x86/hyperv/hv_spinlock.c                 |  11 +-
 arch/x86/include/asm/apic.h                   |   4 -
 arch/x86/include/asm/highmem.h                |   1 -
 arch/x86/include/asm/mshyperv.h               |   1 -
 arch/x86/include/asm/paravirt-base.h          |  35 ++
 arch/x86/include/asm/paravirt-spinlock.h      | 145 ++++++++
 arch/x86/include/asm/paravirt.h               | 331 +++++-------------
 arch/x86/include/asm/paravirt_api_clock.h     |   1 -
 arch/x86/include/asm/paravirt_types.h         | 269 +++++++-------
 arch/x86/include/asm/pgtable_32.h             |   1 -
 arch/x86/include/asm/ptrace.h                 |   2 +-
 arch/x86/include/asm/qspinlock.h              |  87 +----
 arch/x86/include/asm/spinlock.h               |   1 -
 arch/x86/include/asm/timer.h                  |   1 +
 arch/x86/include/asm/tlbflush.h               |   4 -
 arch/x86/kernel/Makefile                      |   2 +-
 arch/x86/kernel/apm_32.c                      |   1 -
 arch/x86/kernel/callthunks.c                  |   1 -
 arch/x86/kernel/cpu/bugs.c                    |   1 -
 arch/x86/kernel/cpu/vmware.c                  |   1 +
 arch/x86/kernel/kvm.c                         |  13 +-
 arch/x86/kernel/kvmclock.c                    |   1 +
 arch/x86/kernel/paravirt-spinlocks.c          |  26 +-
 arch/x86/kernel/paravirt.c                    |  42 +--
 arch/x86/kernel/tsc.c                         |  10 +-
 arch/x86/kernel/vsmp_64.c                     |   1 -
 arch/x86/lib/cache-smp.c                      |   1 -
 arch/x86/mm/init.c                            |   1 -
 arch/x86/xen/enlighten_pv.c                   |  82 ++---
 arch/x86/xen/irq.c                            |  20 +-
 arch/x86/xen/mmu_pv.c                         | 100 ++----
 arch/x86/xen/spinlock.c                       |  11 +-
 arch/x86/xen/time.c                           |   2 +
 drivers/clocksource/hyperv_timer.c            |   2 +
 drivers/xen/time.c                            |   2 +-
 include/linux/sched/cputime.h                 |  18 +
 kernel/sched/core.c                           |   5 +
 kernel/sched/cputime.c                        |  13 +
 kernel/sched/sched.h                          |   3 +-
 tools/objtool/arch/x86/decode.c               |   8 +-
 tools/objtool/check.c                         |  78 ++++-
 tools/objtool/include/objtool/check.h         |   1 +
 66 files changed, 662 insertions(+), 827 deletions(-)
 delete mode 100644 arch/arm/include/asm/paravirt.h
 delete mode 100644 arch/arm/include/asm/paravirt_api_clock.h
 delete mode 100644 arch/arm/kernel/paravirt.c
 delete mode 100644 arch/arm64/include/asm/paravirt_api_clock.h
 delete mode 100644 arch/loongarch/include/asm/paravirt_api_clock.h
 delete mode 100644 arch/powerpc/include/asm/paravirt_api_clock.h
 delete mode 100644 arch/riscv/include/asm/paravirt_api_clock.h
 create mode 100644 arch/x86/include/asm/paravirt-base.h
 create mode 100644 arch/x86/include/asm/paravirt-spinlock.h
 delete mode 100644 arch/x86/include/asm/paravirt_api_clock.h

-- 
2.51.0



* [PATCH RESEND v2 3/3] x86/hyperv: Remove ASM_CALL_CONSTRAINT with VMMCALL insn
From: Uros Bizjak @ 2026-01-05  9:02 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-kernel
  Cc: Uros Bizjak, Michael Kelley, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin
In-Reply-To: <20260105090422.6243-1-ubizjak@gmail.com>

Unlike the CALL instruction, VMMCALL does not push to the stack, so it's
OK to allow the compiler to insert it before the frame pointer gets
set up by the containing function. ASM_CALL_CONSTRAINT is for CALLs
that must be inserted after the frame pointer is set up, so it is
over-constraining here and can be removed.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
v2: Expand commit message and include ASM_CALL_CONSTRAINT explanation
---
 arch/x86/hyperv/ivm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 7365d8f43181..be7fad43a88d 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -392,7 +392,7 @@ u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2)
 
 	register u64 __r8 asm("r8") = param2;
 	asm volatile("vmmcall"
-		     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
+		     : "=a" (hv_status),
 		       "+c" (control), "+d" (param1), "+r" (__r8)
 		     : : "cc", "memory", "r9", "r10", "r11");
 
-- 
2.52.0



* [PATCH RESEND v2 2/3] x86/hyperv: Use savesegment() instead of inline asm() to save segment registers
From: Uros Bizjak @ 2026-01-05  9:02 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-kernel
  Cc: Uros Bizjak, Wei Liu, Michael Kelley, K. Y. Srinivasan,
	Haiyang Zhang, Dexuan Cui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin
In-Reply-To: <20260105090422.6243-1-ubizjak@gmail.com>

Use the standard savesegment() utility macro to save segment registers.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/hyperv/ivm.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 651771534cae..7365d8f43181 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -25,6 +25,7 @@
 #include <asm/e820/api.h>
 #include <asm/desc.h>
 #include <asm/msr.h>
+#include <asm/segment.h>
 #include <uapi/asm/vmx.h>
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
@@ -315,16 +316,16 @@ int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip, unsigned int cpu)
 	vmsa->gdtr.base = gdtr.address;
 	vmsa->gdtr.limit = gdtr.size;
 
-	asm volatile("movl %%es, %%eax;" : "=a" (vmsa->es.selector));
+	savesegment(es, vmsa->es.selector);
 	hv_populate_vmcb_seg(vmsa->es, vmsa->gdtr.base);
 
-	asm volatile("movl %%cs, %%eax;" : "=a" (vmsa->cs.selector));
+	savesegment(cs, vmsa->cs.selector);
 	hv_populate_vmcb_seg(vmsa->cs, vmsa->gdtr.base);
 
-	asm volatile("movl %%ss, %%eax;" : "=a" (vmsa->ss.selector));
+	savesegment(ss, vmsa->ss.selector);
 	hv_populate_vmcb_seg(vmsa->ss, vmsa->gdtr.base);
 
-	asm volatile("movl %%ds, %%eax;" : "=a" (vmsa->ds.selector));
+	savesegment(ds, vmsa->ds.selector);
 	hv_populate_vmcb_seg(vmsa->ds, vmsa->gdtr.base);
 
 	vmsa->efer = native_read_msr(MSR_EFER);
-- 
2.52.0



* [PATCH RESEND v2 1/3] x86: Use MOVL when reading segment registers
From: Uros Bizjak @ 2026-01-05  9:02 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-kernel
  Cc: Uros Bizjak, Michael Kelley, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin

Use MOVL when reading segment registers to avoid the 0x66 operand-size
override instruction prefix. The segment value is always 16-bit and gets
zero-extended to the full 32-bit size.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/segment.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index f59ae7186940..9f5be2bbd291 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -348,7 +348,7 @@ static inline void __loadsegment_fs(unsigned short value)
  * Save a segment register away:
  */
 #define savesegment(seg, value)				\
-	asm("mov %%" #seg ",%0":"=r" (value) : : "memory")
+	asm("movl %%" #seg ",%k0" : "=r" (value) : : "memory")
 
 #endif /* !__ASSEMBLER__ */
 #endif /* __KERNEL__ */
-- 
2.52.0



* RE: [RFC][PATCH v0] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Vitaly Kuznetsov @ 2026-01-05  9:00 UTC (permalink / raw)
  To: Michael Kelley, Mukesh Rathor, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com
In-Reply-To: <SN6PR02MB41575124C6D2CD7492899991D4BBA@SN6PR02MB4157.namprd02.prod.outlook.com>

Michael Kelley <mhklinux@outlook.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Friday, January 2, 2026 7:55 AM
>> 
>> Mukesh Rathor <mrathor@linux.microsoft.com> writes:
>> 
>> > The MSVC compiler currently used to compile the Microsoft Hyper-V
>> > hypervisor has an assert intrinsic that uses interrupt vector 0x29 to
>> > create an exception. This will cause the hypervisor to crash and collect
>> > a core. As such, if this interrupt number is assigned to a device by
>> > Linux and the device generates it, the hypervisor will crash. There are
>> > two other such vectors hard coded in the hypervisor, 0x2C and 0x2D.
>> >
>> > Fortunately, the three vectors are part of the kernel driver space, and
>> > that makes it feasible to reserve them early so they are not assigned
>> > later.
>> >
>> > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> > ---
>> >  arch/x86/kernel/cpu/mshyperv.c | 22 ++++++++++++++++++++++
>> >  1 file changed, 22 insertions(+)
>> >
>> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> > index 579fb2c64cfd..19d41f7434df 100644
>> > --- a/arch/x86/kernel/cpu/mshyperv.c
>> > +++ b/arch/x86/kernel/cpu/mshyperv.c
>> > @@ -478,6 +478,25 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>> >  }
>> >  EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>> >
>> > +/*
>> > + * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
>> > + * will crash or hang or break into debugger.
>> > + */
>> > +static void hv_reserve_irq_vectors(void)
>> > +{
>> > +	#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
>> > +	#define HYPERV_DBG_ASSERT_VECTOR	0x2C
>> > +	#define HYPERV_DBG_SERVICE_VECTOR	0x2D
>> > +
>> > +	if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
>> > +	    test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
>> > +	    test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
>> > +		BUG();
>> 
>> Would it be less hackish to use sysvec_install() with a dummy handler
>> for all three vectors instead?
>
> It would be, but unfortunately, it doesn't work. sysvec_install() requires
> that the vector be >= FIRST_SYSTEM_VECTOR, and these vectors are not.
>

True; then maybe introduce a new API like sysvec_reserve() without the
limitation? What I'm personally afraid of is that looking at
sysvec_install() it already has an additional fred_install_sysvec()
which operates over its own sysvec_table and only does
idt_install_sysvec() when !cpu_feature_enabled(X86_FEATURE_FRED) -- and
this patch just plays with system_vectors directly. Maybe this is even
correct for now, but I believe it can be fragile in the future.

Ultimately, I think it's up to x86 maintainers to say whether they think
that playing with system_vectors outside of the core is OK and expected
or if a new, explicit API is preferable.

-- 
Vitaly
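
To make the suggestion above concrete, here is a rough userspace sketch of what an explicit reservation helper could look like. Everything in it is an assumption for illustration: sysvec_reserve_sketch() is a hypothetical name, vector_map merely stands in for the kernel's system_vectors bitmap, and the test-and-set is not atomic. It does not touch FRED's sysvec_table or the IDT, which is exactly the part a real API would have to get right.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define NR_VECTORS	256
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Stand-in for the kernel's system_vectors bitmap (assumption). */
static unsigned long vector_map[NR_VECTORS / BITS_PER_WORD];

/* Non-atomic analogue of test_and_set_bit(): returns the old bit value. */
static bool vector_test_and_set(unsigned int vec)
{
	unsigned long mask = 1UL << (vec % BITS_PER_WORD);
	unsigned long *word = &vector_map[vec / BITS_PER_WORD];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/* Hypothetical helper: 0 on success, -EBUSY if the vector is already taken. */
static int sysvec_reserve_sketch(unsigned int vec)
{
	if (vec >= NR_VECTORS)
		return -EINVAL;
	return vector_test_and_set(vec) ? -EBUSY : 0;
}
```

Returning an error instead of calling BUG() leaves the policy decision to the caller, which is one possible design and not something the thread has settled.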



* RE: [EXTERNAL] Re: [PATCH net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-03 20:34 UTC (permalink / raw)
  To: Jakub Kicinski, Haiyang Zhang
  Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	KY Srinivasan, Wei Liu, Dexuan Cui, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Long Li, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Shradha Gupta,
	Saurabh Sengar, Aditya Garg, Dipayaan Roy, Shiraz Saleem,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	Paul Rosswurm
In-Reply-To: <20260102161147.1938b51d@kernel.org>



> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, January 2, 2026 7:12 PM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Long Li
> <longli@microsoft.com>; Konstantin Taranov <kotaranov@microsoft.com>;
> Simon Horman <horms@kernel.org>; Erni Sri Satya Vennela
> <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next, 1/2] net: mana: Add support for
> coalesced RX packets on CQE
> 
> On Fri,  2 Jan 2026 13:35:57 -0800 Haiyang Zhang wrote:
> > +		NL_SET_ERR_MSG_FMT(extack, "Set rx-frames to %u failed:%d\n",
> > +				   ec->rx_max_coalesced_frames, err);
> 
> No trailing new line in extack messages, please.
> Also please do not duplicate the err value in the message itself,
> it's already passed to user space. Well behaved user space will format
> this as eg:
> 
>   Set rx-frames to 123 failed:-11: Invalid argument

I will update the patch.

Thanks,
- Haiyang
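
The message shape requested in the review above can be illustrated with plain snprintf(). This is only a userspace sketch: format_rx_frames_error() is a hypothetical helper standing in for the NL_SET_ERR_MSG_FMT() call, and it shows nothing more than the two points raised, no trailing newline and no error code in the text.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the extack message. The error code is left
 * out because it is already reported to user space separately.
 */
static int format_rx_frames_error(char *buf, size_t len, unsigned int frames)
{
	/* No trailing newline and no errno in the message text. */
	return snprintf(buf, len, "Set rx-frames to %u failed", frames);
}
```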


* [PATCH net-next, v6] net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
From: Dipayaan Roy @ 2026-01-03  4:57 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, dipayanroy

Implement .ndo_tx_timeout for MANA so any stalled TX queue can be detected
and a device-controlled port reset for all queues can be scheduled on an
ordered workqueue. Resetting all queues on stall detection is
recommended by the hardware team.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
Changes in v6:
  - Rebased.
Changes in v5:
  -Fixed commit message, used 'create_singlethread_workqueue' and fixed
   cleanup part.
Changes in v4:
  -Fixed commit message, work initialization before registering netdev,
   fixed potential null pointer de-reference bug.
Changes in v3:
  -Fixed commit message, removed rtnl_trylock and added
   disable_work_sync, fixed mana_queue_reset_work, and few
   cosmetics.
Changes in v2:
  -Fixed cosmetic changes.
---
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 77 ++++++++++++++++++-
 include/net/mana/gdma.h                       |  7 +-
 include/net/mana/mana.h                       |  8 +-
 3 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1ad154f9db1a..d8451f550db4 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -299,6 +299,42 @@ static int mana_get_gso_hs(struct sk_buff *skb)
 	return gso_hs;
 }
 
+static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
+{
+	struct mana_queue_reset_work *reset_queue_work =
+			container_of(work, struct mana_queue_reset_work, work);
+
+	struct mana_port_context *apc = container_of(reset_queue_work,
+						     struct mana_port_context,
+						     queue_reset_work);
+	struct net_device *ndev = apc->ndev;
+	int err;
+
+	rtnl_lock();
+
+	/* Pre-allocate buffers to prevent failure in mana_attach later */
+	err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+	if (err) {
+		netdev_err(ndev, "Insufficient memory for reset post tx stall detection\n");
+		goto out;
+	}
+
+	err = mana_detach(ndev, false);
+	if (err) {
+		netdev_err(ndev, "mana_detach failed: %d\n", err);
+		goto dealloc_pre_rxbufs;
+	}
+
+	err = mana_attach(ndev);
+	if (err)
+		netdev_err(ndev, "mana_attach failed: %d\n", err);
+
+dealloc_pre_rxbufs:
+	mana_pre_dealloc_rxbufs(apc);
+out:
+	rtnl_unlock();
+}
+
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
 	enum mana_tx_pkt_format pkt_fmt = MANA_SHORT_PKT_FMT;
@@ -839,6 +875,23 @@ static int mana_change_mtu(struct net_device *ndev, int new_mtu)
 	return err;
 }
 
+static void mana_tx_timeout(struct net_device *netdev, unsigned int txqueue)
+{
+	struct mana_port_context *apc = netdev_priv(netdev);
+	struct mana_context *ac = apc->ac;
+	struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+	/* Already in service, hence tx queue reset is not required.*/
+	if (gc->in_service)
+		return;
+
+	/* Note: If there are pending queue reset work for this port(apc),
+	 * subsequent request queued up from here are ignored. This is because
+	 * we are using the same work instance per port(apc).
+	 */
+	queue_work(ac->per_port_queue_reset_wq, &apc->queue_reset_work.work);
+}
+
 static int mana_shaper_set(struct net_shaper_binding *binding,
 			   const struct net_shaper *shaper,
 			   struct netlink_ext_ack *extack)
@@ -924,6 +977,7 @@ static const struct net_device_ops mana_devops = {
 	.ndo_bpf		= mana_bpf,
 	.ndo_xdp_xmit		= mana_xdp_xmit,
 	.ndo_change_mtu		= mana_change_mtu,
+	.ndo_tx_timeout		= mana_tx_timeout,
 	.net_shaper_ops         = &mana_shaper_ops,
 };
 
@@ -3287,6 +3341,8 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	ndev->min_mtu = ETH_MIN_MTU;
 	ndev->needed_headroom = MANA_HEADROOM;
 	ndev->dev_port = port_idx;
+	/* Recommended timeout based on HW FPGA re-config scenario. */
+	ndev->watchdog_timeo = 15 * HZ;
 	SET_NETDEV_DEV(ndev, gc->dev);
 
 	netif_set_tso_max_size(ndev, GSO_MAX_SIZE);
@@ -3303,6 +3359,10 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	if (err)
 		goto reset_apc;
 
+	/* Initialize the per port queue reset work.*/
+	INIT_WORK(&apc->queue_reset_work.work,
+		  mana_per_port_queue_reset_work_handler);
+
 	netdev_lockdep_set_classes(ndev);
 
 	ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
@@ -3549,6 +3609,14 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (ac->num_ports > MAX_PORTS_IN_MANA_DEV)
 		ac->num_ports = MAX_PORTS_IN_MANA_DEV;
 
+	ac->per_port_queue_reset_wq =
+		create_singlethread_workqueue("mana_per_port_queue_reset_wq");
+	if (!ac->per_port_queue_reset_wq) {
+		dev_err(dev, "Failed to allocate per port queue reset workqueue\n");
+		err = -ENOMEM;
+		goto out;
+	}
+
 	if (!resuming) {
 		for (i = 0; i < ac->num_ports; i++) {
 			err = mana_probe_port(ac, i, &ac->ports[i]);
@@ -3616,13 +3684,15 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	for (i = 0; i < ac->num_ports; i++) {
 		ndev = ac->ports[i];
-		apc = netdev_priv(ndev);
 		if (!ndev) {
 			if (i == 0)
 				dev_err(dev, "No net device to remove\n");
 			goto out;
 		}
 
+		apc = netdev_priv(ndev);
+		disable_work_sync(&apc->queue_reset_work.work);
+
 		/* All cleanup actions should stay after rtnl_lock(), otherwise
 		 * other functions may access partially cleaned up data.
 		 */
@@ -3649,6 +3719,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	mana_destroy_eq(ac);
 out:
+	if (ac->per_port_queue_reset_wq) {
+		destroy_workqueue(ac->per_port_queue_reset_wq);
+		ac->per_port_queue_reset_wq = NULL;
+	}
+
 	mana_gd_deregister_device(gd);
 
 	if (suspending)
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index eaa27483f99b..a59bd4035a99 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -598,6 +598,10 @@ enum {
 
 /* Driver can self reset on FPGA Reconfig EQE notification */
 #define GDMA_DRV_CAP_FLAG_1_HANDLE_RECONFIG_EQE BIT(17)
+
+/* Driver detects stalled send queues and recovers them */
+#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
+
 #define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
 
 /* Driver supports linearizing the skb when num_sge exceeds hardware limit */
@@ -621,7 +625,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE | \
 	 GDMA_DRV_CAP_FLAG_1_PERIODIC_STATS_QUERY | \
 	 GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
-	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY)
+	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
+	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d7e089c6b694..cef78a871c7c 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,7 +480,7 @@ struct mana_context {
 	struct mana_ethtool_hc_stats hc_stats;
 	struct mana_eq *eqs;
 	struct dentry *mana_eqs_debugfs;
-
+	struct workqueue_struct *per_port_queue_reset_wq;
 	/* Workqueue for querying hardware stats */
 	struct delayed_work gf_stats_work;
 	bool hwc_timeout_occurred;
@@ -492,9 +492,15 @@ struct mana_context {
 	u32 link_event;
 };
 
+struct mana_queue_reset_work {
+	/* Work structure */
+	struct work_struct work;
+};
+
 struct mana_port_context {
 	struct mana_context *ac;
 	struct net_device *ndev;
+	struct mana_queue_reset_work queue_reset_work;
 
 	u8 mac_addr[ETH_ALEN];
 
-- 
2.43.0
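
The comment in mana_tx_timeout() relies on queue_work() returning false when the work item is already pending, so repeated timeouts on one port collapse into a single reset. The sketch below models that behavior in user space with C11 atomics. struct reset_work and both functions are hypothetical names for illustration; this is not the kernel workqueue implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a work item with a "pending" bit. */
struct reset_work {
	atomic_flag pending;
	unsigned int runs;	/* number of resets actually performed */
};

/* Like queue_work(): returns false if the item was already pending. */
static bool reset_work_queue(struct reset_work *w)
{
	return !atomic_flag_test_and_set(&w->pending);
}

/* Worker side: clear the pending bit, then perform the reset once. */
static void reset_work_run(struct reset_work *w)
{
	atomic_flag_clear(&w->pending);
	w->runs++;
}
```

Two timeouts that arrive before the worker runs therefore lead to one reset, while a timeout after the worker has started can queue a new one.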



* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2026-01-03  1:16 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aVhWMoH3GvpGAR0a@skinsburskii.localdomain>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 3:35 PM
> 
> On Fri, Jan 02, 2026 at 09:13:31PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 12:03 PM
> > >
> > > On Fri, Jan 02, 2026 at 06:04:56PM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 9:43 AM
> > > > >
> > > > > On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > > > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > > > > > >
> > > > > > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > > > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > > > > > >
> > > > > > > > [snip]
> > > > > > > > >
> > > > > > > > > Separately, in looking at this, I spotted another potential problem with
> > > > > > > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > > > > > > not clear on. To create a new region, the user space VMM issues the
> > > > > > > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > > > > > > size, and the guest PFN. The only requirement on these values is that the
> > > > > > > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > > > > > > specified where the userspace address and the guest PFN have different
> > > > > > > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > > > > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > > > > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > > > > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > > > > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > > > > > > > this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> > > > > > > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > > > > > > seem to be problematic for movable pages because the error wouldn't
> > > > > > > > > occur until the guest VM is running and takes a range fault on the region.
> > > > > > > > > Silently falling back to creating 4K mappings has performance implications,
> > > > > > > > > though I guess it would work. My question is whether the
> > > > > > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > > > > > > error immediately.
> > > > > > > > >
> > > > > > > >
> > > > > > > > In thinking about this more, I can answer my own question about the
> > > > > > > > hypervisor behavior. When HVCALL_MAP_GPA_PAGES is set, the full
> > > > > > > > list of 4K system PFNs is not provided as an input to the hypercall, so
> > > > > > > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > > > > > > sequential PFNs would be wrong, so it must return an error if the
> > > > > > > > alignment of a system PFN isn't on a 2 Meg boundary.
> > > > > > > >
> > > > > > > > For a pinned region, this error happens in mshv_region_map() as
> > > > > > > > > called from mshv_prepare_pinned_region(), so will propagate back
> > > > > > > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > > > > > > allocates one or more 2 Meg pages. So creating a pinned region
> > > > > > > > where the guest PFN and userspace address have different offsets
> > > > > > > > modulo 2 Meg might or might not succeed.
> > > > > > > >
> > > > > > > > For a movable region, the error probably can't occur.
> > > > > > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > > > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > > > > > determines the corresponding userspace addr, which won't be on
> > > > > > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > > > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > > > > > always do 4K mappings and will succeed. The downside is that a
> > > > > > > > movable region with a guest PFN and userspace address with
> > > > > > > > different offsets never gets any 2 Meg pages or mappings.
> > > > > > > >
> > > > > > > > My conclusion is the same -- such misalignment should not be
> > > > > > > > allowed when creating a region that has the potential to use 2 Meg
> > > > > > > > pages. Regions less than 2 Meg in size could be excluded from such
> > > > > > > > a requirement if there is benefit in doing so. It's possible to have
> > > > > > > > regions up to (but not including) 4 Meg where the alignment prevents
> > > > > > > > having a 2 Meg page, and those could also be excluded from the
> > > > > > > > requirement.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not sure I understand the problem.
> > > > > > > There are three cases to consider:
> > > > > > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > > > > > 2. Host mapping, where page sizes are controlled by the host.
> > > > > >
> > > > > > And by "host", you mean specifically the Linux instance running in the
> > > > > > root partition. It hosts the VMM processes and creates the memory
> > > > > > regions for each guest.
> > > > > >
> > > > > > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > > > > > >
> > > > > > > The first case is not relevant here and is included for completeness.
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > >
> > > > > > > The second and third cases (host and hypervisor) share the memory layout,
> > > > > >
> > > > > > Right. More specifically, they are both operating on the same set of physical
> > > > > > memory pages, and hence "share" a set of what I've referred to as
> > > > > > "system PFNs" (to distinguish from guest PFNs, or GFNs).
> > > > > >
> > > > > > > but it is up
> > > > > > > to each entity to decide which page sizes to use. For example, the host might map the
> > > > > > > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > > > > > > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> > > > > >
> > > > > > Yes, that's possible, but subject to significant requirements. A 2M page can be
> > > > > > used only if the underlying physical memory is a physically contiguous 2M chunk.
> > > > > > Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> > > > > > and the virtual address to which it is being mapped must be on a 2M boundary.
> > > > > > In the case of the host, that virtual address is the user space address in the
> > > > > > user space process. In the case of the hypervisor, that "virtual address" is
> > > > > > the location in guest physical address space; i.e., the guest PFN left-shifted 12
> > > > > > to be a guest physical address.
> > > > > >
> > > > > > These requirements are from the physical processor and its requirements on
> > > > > > page table formats as specified by the hardware architecture. Whereas the
> > > > > > page table entry for a 4K page contains the entire PFN, the page table entry
> > > > > > for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> > > > > > which is equivalent to requiring that the PFN be on a 2M boundary. These
> > > > > > requirements apply to both host and hypervisor mappings.
> > > > > >
> > > > > > When MSHV code in the host creates a new pinned region via the ioctl,
> > > > > > MSHV code first allocates memory for the region using pin_user_pages_fast(),
> > > > > > which returns the system PFN for each page of physical memory that is
> > > > > > allocated. If the host, at its discretion, allocates a 2M page, then a series
> > > > > > of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> > > > > > the 512 sequential PFNs must have its low order 9 bits be zero.
> > > > > >
> > > > > > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > > > > > the hypervisor to map the allocated memory into the guest physical
> > > > > > address space at a particular guest PFN. If the allocated memory contains
> > > > > > a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> > > > > > causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> > > > > > the hypervisor do that mapping as a 2M large page. The hypercall does not
> > > > > > have the option of dropping back to 4K page mappings in this case. If
> > > > > > the 2M alignment of the system PFN is different from the 2M alignment
> > > > > > of the target guest PFN, it's not possible to create the mapping and the
> > > > > > hypercall fails.
> > > > > >
> > > > > > The core problem is that the same 2M of physical memory wants to be
> > > > > > mapped by the host as a 2M page and by the hypervisor as a 2M page.
> > > > > > That can't be done unless the host alignment (in the VMM virtual address
> > > > > > space) and the guest physical address (i.e., the target guest PFN) alignment
> > > > > > match and are both on 2M boundaries.
> > > > > >
> > > > >
> > > > > But why is it a problem? If both the host and the hypervisor can map a
> > > > > huge page, but the guest can't, it's still a win, no?
> > > > > In other words, if the VMM passes a host huge page aligned region as guest
> > > > > unaligned, it's a VMM problem, not a hypervisor problem. And I don't
> > > > > understand why we would want to prevent such cases.
> > > > >
> > > >
> > > > Fair enough -- mostly. If you want to allow the misaligned case and live
> > > > with not getting the 2M mapping in the guest, that works except in the
> > > > situation that I described above, where the HVCALL_MAP_GPA_PAGES
> > > > hypercall fails when creating a pinned region.
> > > >
> > > > The failure is flakey in that if the Linux in the root partition does not
> > > > map any of the region as a 2M page, the hypercall succeeds and the
> > > > MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition
> > > > happens to map any of the region as a 2M page, the hypercall will fail,
> > > > and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such
> > > > flakey behavior is bad for the VMM.
> > > >
> > > > One solution is that mshv_chunk_stride() must return a stride > 1 only
> > > > if both the gfn (which it currently checks) AND the corresponding
> > > > userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
> > > > hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
> > > > misaligned case, and the failure won't occur.
> > > >
> > >
> > > I think I see your point, but I also think this issue doesn't exist,
> > > because mshv_chunk_stride() returns a huge page stride iff:
> > > 1. the folio order is PMD_ORDER and
> > > 2. GFN is huge page aligned and
> > > 3. number of 4K pages is huge page aligned.
> > >
> > > In other words, a host huge page won't be mapped as huge if the page
> > > can't be mapped as huge in the guest.
> >
> > OK, I'm missing how what you say is true. For pinned regions,
> > the memory is allocated and mapped into the host userspace address
> > first, as done by mshv_prepare_pinned_region() calling mshv_region_pin(),
> > which calls pin_user_pages_fast(). This is all done without considering
> > the GFN or GFN alignment. So one or more 2M pages might be allocated
> > and mapped in the host before any guest mapping is looked at. Agreed?
> >
> 
> Agreed.
> 
> > Then mshv_prepare_pinned_region() calls mshv_region_map() to do the
> > guest mapping. This eventually gets down to mshv_chunk_stride(). In
> > mshv_chunk_stride() when your conditions #2 and #3 are met, the
> > corresponding struct page argument to mshv_chunk_stride() may be a
> > 4K page that is in the middle of a 2M page instead of at the beginning
> > (if the region is mis-aligned). But the key point is that the 4K page in
> > the middle is part of a folio that will return a folio order of PMD_ORDER.
> > I.e., a folio order of PMD_ORDER is not sufficient to ensure that the
> > struct page arg is at the *start* of a 2M-aligned physical memory range
> > that can be mapped into the guest as a 2M page.
> >
> 
> I'm trying to understand how this can even happen, so please bear with
> me.
> In other words (and AFAIU), what you are saying is the following:
> 
> 1. VMM creates a mapping with a huge page(s) (this implies that virtual
>    address is huge page aligned, size is huge page aligned and physical
>    pages are consecutive).
> 2. VMM tries to create a region via ioctl, but instead of passing the
>    start of the region, it passes an offset into one of the region's
>    huge pages, while keeping the base GFN and the size huge
>    page aligned (to meet the #2 and #3 conditions).
> 3. mshv_chunk_stride() sees a folio order of PMD_ORDER, and tries to map
>    the corresponding pages as huge, which will be rejected by the
>    hypervisor.
> 
> Is this accurate?

Yes, pretty much. In Step 1, the VMM may just allocate some virtual
address space, and not do anything to populate it with physical pages.
So populating with any 2M pages may not happen until Step 2 when
the ioctl calls pin_user_pages_fast(). But *when* the virtual address
space gets populated with physical pages doesn't really matter. We
just know that it happens before the ioctl tries to map the memory
into the guest -- i.e., mshv_prepare_pinned_region() calls
mshv_region_map().

And yes, the problem is what you call out in Step 2: as input to the
ioctl, the fields "userspace_addr" and "guest_pfn" in struct
mshv_user_mem_region could have different alignments modulo 2M
boundaries. When they are different, that's what I'm calling a "mis-aligned
region" (referring to a struct mshv_mem_region that is created and
set up by the ioctl).
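
To make that definition concrete, here is a minimal userspace sketch of
the comparison. It assumes 4K base pages and 2M large pages, and the
helper name region_is_misaligned() is hypothetical (it is not part of
the MSHV driver); it only models the arithmetic on "userspace_addr" and
"guest_pfn".

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12		/* 4K base pages (assumed) */
#define HPAGE_SIZE	(1ULL << 21)	/* 2M large page (assumed) */

/*
 * Hypothetical helper: true when userspace_addr and the guest physical
 * address derived from guest_pfn have different offsets modulo 2M.
 */
static bool region_is_misaligned(uint64_t userspace_addr, uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << PAGE_SHIFT;

	return (userspace_addr & (HPAGE_SIZE - 1)) != (gpa & (HPAGE_SIZE - 1));
}
```

With this model, a region whose userspace address and guest PFN are both
4K past a 2M boundary is not mis-aligned, even though neither value is
itself 2M aligned.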

> A subsequent question: if it is accurate, why does the driver need to
> support this case? It looks like a VMM bug to me.

I don't know if the driver needs to support this case. That's a question
for the VMM people to answer. I wouldn't necessarily assume that the
VMM always allocates virtual address space with exactly the size and
alignment that matches the regions it creates with the ioctl. The
kernel ioctl doesn't care how the VMM allocates and manages its
virtual address space, so the VMM is free to do whatever it wants
in that regard, as long as it meets the requirements of the ioctl. So
the requirements of the ioctl in this case are something to be
negotiated with the VMM.

> Also, how should it support it? By rejecting such requests in the ioctl?

Rejecting requests to create a mis-aligned region is certainly one option
if the VMM agrees that's OK. The ioctl currently requires only that
"userspace_addr" and "size" be page aligned, so those requirements
could be tightened.

The other approach is to fix mshv_chunk_stride() to handle the
mis-aligned case. Doing so is even easier than I first envisioned.
I think this works:

@@ -49,7 +49,8 @@ static int mshv_chunk_stride(struct page *page,
         */
        if (page_order &&
            IS_ALIGNED(gfn, PTRS_PER_PMD) &&
-           IS_ALIGNED(page_count, PTRS_PER_PMD))
+           IS_ALIGNED(page_count, PTRS_PER_PMD) &&
+           IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD))
                return 1 << page_order;

        return 1;

But as we discussed earlier, this fix means never getting 2M mappings
in the guest for a region that is mis-aligned.
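
For anyone who wants to poke at the condition outside the kernel, here
is a rough userspace model of the check with the extra PFN test. This is
a sketch under assumptions, not the driver code: chunk_stride_model() is
a hypothetical name, "pfn" stands in for page_to_pfn(page), "page_order"
stands in for the folio order, and IS_ALIGNED() is a simplified local
version of the kernel macro.

```c
#include <stdint.h>

#define PTRS_PER_PMD	512	/* 4K pages per 2M page on x86-64 */

/* Simplified stand-in for the kernel's IS_ALIGNED(); a must be a power of 2. */
#define IS_ALIGNED(x, a)	(((x) & ((uint64_t)(a) - 1)) == 0)

/*
 * Userspace model of the stride decision: a large-page stride is
 * returned only when the GFN, the page count, and the system PFN are
 * all aligned to a 2M boundary.
 */
static unsigned int chunk_stride_model(uint64_t pfn, unsigned int page_order,
				       uint64_t gfn, uint64_t page_count)
{
	if (page_order &&
	    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
	    IS_ALIGNED(page_count, PTRS_PER_PMD) &&
	    IS_ALIGNED(pfn, PTRS_PER_PMD))
		return 1u << page_order;

	return 1;
}
```

In this model a struct page in the middle of a 2M folio (order 9, PFN
not a multiple of 512) yields a stride of 1, so the large page flag
would not be requested for it.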

Michael

> 
> Thanks,
> Stanislav
> 
> > The problem does *not* happen with a movable region, but the reasoning
> > is different. hmm_range_fault() is always called with a 2M range aligned
> > to the GFN, which in a mis-aligned region means that the host userspace
> > address is never 2M aligned. So hmm_range_fault() is never able to allocate
> > and map a 2M page. mshv_chunk_stride() will never get a folio order > 1,
> > and the hypercall is never asked to do a 2M mapping. Both host and guest
> > mappings will always be 4K and everything works.
> >
> > Michael
> >
> > > And this function is called for
> > > both movable and pinned regions, so the hypercall should never fail due to
> > > a huge page alignment issue.
> > >
> > > What am I missing here?
> > >
> > > Thanks,
> > > Stanislav
> > >
> > >
> > > > Michael
> > > >
> > > > >
> > > > > > Movable regions behave a bit differently because the memory for the
> > > > > > region is not allocated on the host "up front" when the region is created.
> > > > > > The memory is faulted in as the guest runs, and the vagaries of the current
> > > > > > MSHV in Linux code are such that 2M pages are never created on the host
> > > > > > if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> > > > > > to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > > > > > mappings, which works even with the misalignment.
> > > > > >
> > > > > > >
> > > > > > > This adjustment happens at runtime. Could this be the missing detail here?
> > > > > >
> > > > > > Adjustments at runtime are a different topic from the issue I'm raising,
> > > > > > though eventually there's some relationship. My issue occurs in the
> > > > > > creation of a new region, and the setting up of the initial hypervisor
> > > > > > mapping. I haven't thought through the details of adjustments at runtime.
> > > > > >
> > > > > > My usual caveats apply -- this is all "thought experiment". If I had the
> > > > > > > means to do some runtime testing to confirm, I would. It's possible the
> > > > > > hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> > > > > > that given the basics of how physical processors work with page tables.
> > > > > >
> > > > > > Michael

^ permalink raw reply

* Re: [PATCH net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Jakub Kicinski @ 2026-01-03  0:11 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Long Li, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Shradha Gupta, Saurabh Sengar,
	Aditya Garg, Dipayaan Roy, Shiraz Saleem, linux-kernel,
	linux-rdma, paulros
In-Reply-To: <1767389759-3460-2-git-send-email-haiyangz@linux.microsoft.com>

On Fri,  2 Jan 2026 13:35:57 -0800 Haiyang Zhang wrote:
> +		NL_SET_ERR_MSG_FMT(extack, "Set rx-frames to %u failed:%d\n",
> +				   ec->rx_max_coalesced_frames, err);

No trailing new line in extack messages, please.
Also please do not duplicate the err value in the message itself,
it's already passed to user space. Well behaved user space will format
this as eg:

  Set rx-frames to 123 failed:-22: Invalid argument
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2026-01-02 23:35 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415724D13B2751F8FAA1053BD4BBA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, Jan 02, 2026 at 09:13:31PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 12:03 PM
> > 
> > On Fri, Jan 02, 2026 at 06:04:56PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 9:43 AM
> > > >
> > > > On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > > > > >
> > > > > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > > > > >
> > > > > > > [snip]
> > > > > > > >
> > > > > > > > Separately, in looking at this, I spotted another potential problem with
> > > > > > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > > > > > not clear on. To create a new region, the user space VMM issues the
> > > > > > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > > > > > size, and the guest PFN. The only requirement on these values is that the
> > > > > > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > > > > > specified where the userspace address and the guest PFN have different
> > > > > > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > > > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > > > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > > > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > > > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > > > > > this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> > > > > > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > > > > > seem to be problematic for movable pages because the error wouldn't
> > > > > > > > occur until the guest VM is running and takes a range fault on the region.
> > > > > > > > Silently falling back to creating 4K mappings has performance implications,
> > > > > > > > though I guess it would work. My question is whether the
> > > > > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > > > > > error immediately.
> > > > > > > >
> > > > > > >
> > > > > > > In thinking about this more, I can answer my own question about the
> > > > > > > hypervisor behavior. When HVCALL_MAP_GPA_PAGES is set, the full
> > > > > > > list of 4K system PFNs is not provided as an input to the hypercall, so
> > > > > > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > > > > > sequential PFNs would be wrong, so it must return an error if the
> > > > > > > alignment of a system PFN isn't on a 2 Meg boundary.
> > > > > > >
> > > > > > > For a pinned region, this error happens in mshv_region_map() as
> > > > > > > called from  mshv_prepare_pinned_region(), so will propagate back
> > > > > > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > > > > > allocates one or more 2 Meg pages. So creating a pinned region
> > > > > > > where the guest PFN and userspace address have different offsets
> > > > > > > modulo 2 Meg might or might not succeed.
> > > > > > >
> > > > > > > For a movable region, the error probably can't occur.
> > > > > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > > > > determines the corresponding userspace addr, which won't be on
> > > > > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > > > > always do 4K mappings and will succeed. The downside is that a
> > > > > > > movable region with a guest PFN and userspace address with
> > > > > > > different offsets never gets any 2 Meg pages or mappings.
> > > > > > >
> > > > > > > My conclusion is the same -- such misalignment should not be
> > > > > > > allowed when creating a region that has the potential to use 2 Meg
> > > > > > > pages. Regions less than 2 Meg in size could be excluded from such
> > > > > > > a requirement if there is benefit in doing so. It's possible to have
> > > > > > > regions up to (but not including) 4 Meg where the alignment prevents
> > > > > > > having a 2 Meg page, and those could also be excluded from the
> > > > > > > requirement.
> > > > > > >
> > > > > >
> > > > > > I'm not sure I understand the problem.
> > > > > > There are three cases to consider:
> > > > > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > > > > 2. Host mapping, where page sizes are controlled by the host.
> > > > >
> > > > > And by "host", you mean specifically the Linux instance running in the
> > > > > root partition. It hosts the VMM processes and creates the memory
> > > > > regions for each guest.
> > > > >
> > > > > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > > > > >
> > > > > > The first case is not relevant here and is included for completeness.
> > > > >
> > > > > Agreed.
> > > > >
> > > > > >
> > > > > > The second and third cases (host and hypervisor) share the memory layout,
> > > > >
> > > > > Right. More specifically, they are both operating on the same set of physical
> > > > > memory pages, and hence "share" a set of what I've referred to as
> > > > > "system PFNs" (to distinguish from guest PFNs, or GFNs).
> > > > >
> > > > > > but it is up
> > > > > > to each entity to decide which page sizes to use. For example, the host might map the
> > > > > > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> > > > >
> > > > > Agreed.
> > > > >
> > > > > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > > > > > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> > > > >
> > > > > Yes, that's possible, but subject to significant requirements. A 2M page can be
> > > > > used only if the underlying physical memory is a physically contiguous 2M chunk.
> > > > > Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> > > > > and the virtual address to which it is being mapped must be on a 2M boundary.
> > > > > In the case of the host, that virtual address is the user space address in the
> > > > > user space process. In the case of the hypervisor, that "virtual address" is
> > > > > the location in guest physical address space; i.e., the guest PFN left-shifted 12
> > > > > to be a guest physical address.
> > > > >
> > > > > These requirements are from the physical processor and its requirements on
> > > > > page table formats as specified by the hardware architecture. Whereas the
> > > > > page table entry for a 4K page contains the entire PFN, the page table entry
> > > > > for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> > > > > which is equivalent to requiring that the PFN be on a 2M boundary. These
> > > > > requirements apply to both host and hypervisor mappings.
> > > > >
> > > > > When MSHV code in the host creates a new pinned region via the ioctl,
> > > > > MSHV code first allocates memory for the region using pin_user_pages_fast(),
> > > > > which returns the system PFN for each page of physical memory that is
> > > > > allocated. If the host, at its discretion, allocates a 2M page, then a series
> > > > > of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> > > > > the 512 sequential PFNs must have its low order 9 bits be zero.
> > > > >
> > > > > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > > > > the hypervisor to map the allocated memory into the guest physical
> > > > > address space at a particular guest PFN. If the allocated memory contains
> > > > > a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> > > > > causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> > > > > the hypervisor do that mapping as a 2M large page. The hypercall does not
> > > > > have the option of dropping back to 4K page mappings in this case. If
> > > > > the 2M alignment of the system PFN is different from the 2M alignment
> > > > > of the target guest PFN, it's not possible to create the mapping and the
> > > > > hypercall fails.
> > > > >
> > > > > The core problem is that the same 2M of physical memory wants to be
> > > > > mapped by the host as a 2M page and by the hypervisor as a 2M page.
> > > > > That can't be done unless the host alignment (in the VMM virtual address
> > > > > space) and the guest physical address (i.e., the target guest PFN) alignment
> > > > > match and are both on 2M boundaries.
> > > > >
> > > >
> > > > But why is it a problem? If both the host and the hypervisor can map a
> > > > huge page, but the guest can't, it's still a win, no?
> > > > In other words, if the VMM passes a host huge page aligned region as guest
> > > > unaligned, it's a VMM problem, not a hypervisor problem. And I don't
> > > > understand why we would want to prevent such cases.
> > > >
> > >
> > > Fair enough -- mostly. If you want to allow the misaligned case and live
> > > with not getting the 2M mapping in the guest, that works except in the
> > > situation that I described above, where the HVCALL_MAP_GPA_PAGES
> > > hypercall fails when creating a pinned region.
> > >
> > > The failure is flakey in that if the Linux in the root partition does not
> > > map any of the region as a 2M page, the hypercall succeeds and the
> > > MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition
> > > happens to map any of the region as a 2M page, the hypercall will fail,
> > > and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such
> > > flakey behavior is bad for the VMM.
> > >
> > > One solution is that mshv_chunk_stride() must return a stride > 1 only
> > > if both the gfn (which it currently checks) AND the corresponding
> > > userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
> > > hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
> > > misaligned case, and the failure won't occur.
> > >
> > 
> > I think I see your point, but I also think this issue doesn't exist,
> > because mshv_chunk_stride() returns a huge page stride iff:
> > 1. the folio order is PMD_ORDER and
> > 2. GFN is huge page aligned and
> > 3. number of 4K pages is huge page aligned.
> > 
> > In other words, a host huge page won't be mapped as huge if the page
> > can't be mapped as huge in the guest.
> 
> OK, I'm missing how what you say is true. For pinned regions,
> the memory is allocated and mapped into the host userspace address
> first, as done by mshv_prepare_pinned_region() calling mshv_region_pin(),
> which calls pin_user_pages_fast(). This is all done without considering
> the GFN or GFN alignment. So one or more 2M pages might be allocated
> and mapped in the host before any guest mapping is looked at. Agreed?
> 

Agreed.

> Then mshv_prepare_pinned_region() calls mshv_region_map() to do the
> guest mapping. This eventually gets down to mshv_chunk_stride(). In
> mshv_chunk_stride() when your conditions #2 and #3 are met, the
> corresponding struct page argument to mshv_chunk_stride() may be a
> 4K page that is in the middle of a 2M page instead of at the beginning
> (if the region is mis-aligned). But the key point is that the 4K page in
> the middle is part of a folio that will return a folio order of PMD_ORDER.
> I.e., a folio order of PMD_ORDER is not sufficient to ensure that the
> struct page arg is at the *start* of a 2M-aligned physical memory range
> that can be mapped into the guest as a 2M page.
> 

I'm trying to understand how this can even happen, so please bear with
me.
In other words (and AFAIU), what you are saying is the following:

1. VMM creates a mapping with a huge page(s) (this implies that virtual
   address is huge page aligned, size is huge page aligned and physical
   pages are consecutive).
2. VMM tries to create a region via ioctl, but instead of passing the
   start of the region, it passes an offset into one of the region's
   huge pages, while keeping the base GFN and the size huge
   page aligned (to meet the #2 and #3 conditions).
3. mshv_chunk_stride() sees a folio order of PMD_ORDER, and tries to map
   the corresponding pages as huge, which will be rejected by the
   hypervisor.

Is this accurate?
A subsequent question: if it is accurate, why does the driver need to
support this case? It looks like a VMM bug to me.
Also, how should it support it? By rejecting such requests in the ioctl?

Thanks,
Stanislav

> The problem does *not* happen with a movable region, but the reasoning
> is different. hmm_range_fault() is always called with a 2M range aligned
> to the GFN, which in a mis-aligned region means that the host userspace
> address is never 2M aligned. So hmm_range_fault() is never able to allocate
> and map a 2M page. mshv_chunk_stride() will never get a folio order > 1,
> and the hypercall is never asked to do a 2M mapping. Both host and guest
> mappings will always be 4K and everything works.
> 
> Michael
> 
> > And this function is called for
> > both movable and pinned regions, so the hypercall should never fail due to
> > a huge page alignment issue.
> > 
> > What am I missing here?
> > 
> > Thanks,
> > Stanislav
> > 
> > 
> > > Michael
> > >
> > > >
> > > > > Movable regions behave a bit differently because the memory for the
> > > > > region is not allocated on the host "up front" when the region is created.
> > > > > The memory is faulted in as the guest runs, and the vagaries of the current
> > > > > MSHV in Linux code are such that 2M pages are never created on the host
> > > > > if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> > > > > to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > > > > mappings, which works even with the misalignment.
> > > > >
> > > > > >
> > > > > > This adjustment happens at runtime. Could this be the missing detail here?
> > > > >
> > > > > Adjustments at runtime are a different topic from the issue I'm raising,
> > > > > though eventually there's some relationship. My issue occurs in the
> > > > > creation of a new region, and the setting up of the initial hypervisor
> > > > > mapping. I haven't thought through the details of adjustments at runtime.
> > > > >
> > > > > My usual caveats apply -- this is all "thought experiment". If I had the
> > > > > means to do some runtime testing to confirm, I would. It's possible the
> > > > > hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> > > > > that given the basics of how physical processors work with page tables.
> > > > >
> > > > > Michael

^ permalink raw reply

* [PATCH v1] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Mukesh Rathor @ 2026-01-02 22:02 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel
  Cc: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, x86, hpa

The MSVC compiler, currently used to compile the Microsoft Hyper-V
hypervisor, has an assert intrinsic that uses interrupt vector 0x29 to
raise an exception. The hypervisor then crashes and collects a core
dump. As such, if this vector is assigned to a device by Linux and the
device generates it, the hypervisor will crash. Two other such vectors,
0x2C and 0x2D, are hard coded in the hypervisor for debug purposes.
Fortunately, the three vectors are part of the kernel driver space,
which makes it feasible to reserve them early so they are not assigned
later.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---

v1: Add ifndef CONFIG_X86_FRED (thanks hpa)
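
Illustration only, not part of the patch: a small userspace model of
the reservation idea, for readers unfamiliar with the bitmap approach.
All names here are hypothetical, vectors_model is a local bitmap rather
than the kernel's system_vectors, and test_and_set_bit_model() is a
non-atomic stand-in for test_and_set_bit().

```c
#include <limits.h>
#include <stdbool.h>

#define NR_VECTORS	256
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Local model of a vector bitmap; one bit per interrupt vector. */
static unsigned long vectors_model[(NR_VECTORS + BITS_PER_LONG - 1) / BITS_PER_LONG];

/* Non-atomic stand-in for test_and_set_bit(): returns the previous bit value. */
static bool test_and_set_bit_model(unsigned int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long *word = &map[nr / BITS_PER_LONG];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/* Returns true if all three vectors were free and are now marked reserved. */
static bool reserve_hv_vectors_model(void)
{
	static const unsigned int vecs[] = { 0x29, 0x2C, 0x2D };
	bool clash = false;

	for (unsigned int i = 0; i < sizeof(vecs) / sizeof(vecs[0]); i++)
		clash |= test_and_set_bit_model(vecs[i], vectors_model);

	return !clash;
}
```

A second call to reserve_hv_vectors_model() returns false because the
bits are already set, which corresponds to the case where the patch
calls BUG().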

 arch/x86/kernel/cpu/mshyperv.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 579fb2c64cfd..8ef4ca6733ac 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -478,6 +478,27 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 }
 EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
 
+#ifndef CONFIG_X86_FRED
+/*
+ * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
+ * will crash or hang or break into debugger.
+ */
+static void hv_reserve_irq_vectors(void)
+{
+	#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
+	#define HYPERV_DBG_ASSERT_VECTOR	0x2C
+	#define HYPERV_DBG_SERVICE_VECTOR	0x2D
+
+	if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
+	    test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
+	    test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
+		BUG();
+
+	pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
+		HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
+}
+#endif          /* CONFIG_X86_FRED */
+
 static void __init ms_hyperv_init_platform(void)
 {
 	int hv_max_functions_eax, eax;
@@ -510,6 +531,11 @@ static void __init ms_hyperv_init_platform(void)
 
 	hv_identify_partition_type();
 
+#ifndef CONFIG_X86_FRED
+	if (hv_root_partition())
+		hv_reserve_irq_vectors();
+#endif  /* CONFIG_X86_FRED */
+
 	if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
 		ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
 
-- 
2.51.2.vfs.0.1
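
The reservation idea in this patch can be illustrated outside the kernel. The
following is only a userspace sketch, not the kernel code: vector_map,
test_and_set() and reserve_vectors() are hypothetical stand-ins for
system_vectors and test_and_set_bit(), and the helper here is not atomic.
Unlike the patch, which calls BUG() on a conflict, the sketch reports the
conflict to the caller.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_VECTORS	256
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the kernel's system_vectors bitmap. */
static unsigned long vector_map[(NR_VECTORS + BITS_PER_WORD - 1) / BITS_PER_WORD];

/* Set bit nr and return its previous value (non-atomic in this sketch). */
static bool test_and_set(unsigned int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % BITS_PER_WORD);
	unsigned long *word = &map[nr / BITS_PER_WORD];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/* Mark every listed vector as reserved; return false if any was already taken. */
static bool reserve_vectors(const unsigned int *vecs, size_t n)
{
	bool ok = true;

	for (size_t i = 0; i < n; i++) {
		assert(vecs[i] < NR_VECTORS);
		if (test_and_set(vecs[i], vector_map))
			ok = false;
	}
	return ok;
}
```

Calling reserve_vectors() with { 0x29, 0x2C, 0x2D } succeeds the first time
and reports a conflict on a second call, which is the condition the patch
treats as fatal.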


^ permalink raw reply related

* [PATCH net-next, 2/2] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Haiyang Zhang @ 2026-01-02 21:35 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Long Li, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Shradha Gupta,
	Saurabh Sengar, Aditya Garg, Dipayaan Roy, Shiraz Saleem,
	linux-kernel, linux-rdma
  Cc: paulros
In-Reply-To: <1767389759-3460-1-git-send-email-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

For RX CQEs of type CQE_RX_COALESCED_4, add counters for how many CQEs
contain 2, 3, or 4 packets, to measure the coalescing efficiency.
Also, add a counter for the error case where the first packet has length 0.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 25 +++++++++++++++++--
 .../ethernet/microsoft/mana/mana_ethtool.c    | 17 ++++++++++---
 include/net/mana/mana.h                       | 10 +++++---
 3 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a46a1adf83bc..78824567d80b 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2083,8 +2083,22 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 
 nextpkt:
 	pktlen = oob->ppi[i].pkt_len;
-	if (pktlen == 0)
+	if (pktlen == 0) {
+		/* Collect coalesced CQE count based on packets processed.
+		 * Coalesced CQEs have at least 2 packets, so index is i - 2.
+		 */
+		if (i > 1) {
+			u64_stats_update_begin(&rxq->stats.syncp);
+			rxq->stats.coalesced_cqe[i - 2]++;
+			u64_stats_update_end(&rxq->stats.syncp);
+		} else if (i == 0) {
+			/* Error case stat */
+			u64_stats_update_begin(&rxq->stats.syncp);
+			rxq->stats.pkt_len0_err++;
+			u64_stats_update_end(&rxq->stats.syncp);
+		}
 		return;
+	}
 
 	curr = rxq->buf_index;
 	rxbuf_oob = &rxq->rx_oobs[curr];
@@ -2102,8 +2116,15 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 
 	mana_post_pkt_rxq(rxq);
 
-	if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
+	if (!coalesced)
+		return;
+
+	if (++i < MANA_RXCOMP_OOB_NUM_PPI)
 		goto nextpkt;
+
+	u64_stats_update_begin(&rxq->stats.syncp);
+	rxq->stats.coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 2]++;
+	u64_stats_update_end(&rxq->stats.syncp);
 }
 
 static void mana_poll_rx_cq(struct mana_cq *cq)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 1b9ed5c9bbff..773f50b1a4f4 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] = {
 					tx_cqe_unknown_type)},
 	{"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
 				       tx_linear_pkt_cnt)},
-	{"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
-					rx_coalesced_err)},
 	{"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
 					rx_cqe_unknown_type)},
 };
@@ -151,7 +149,7 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
-	int i;
+	int i, j;
 
 	if (stringset != ETH_SS_STATS)
 		return;
@@ -170,6 +168,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
 		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
 		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
@@ -203,6 +204,8 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	u64 xdp_xmit;
 	u64 xdp_drop;
 	u64 xdp_tx;
+	u64 pkt_len0_err;
+	u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
 	u64 tso_packets;
 	u64 tso_bytes;
 	u64 tso_inner_packets;
@@ -211,7 +214,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	u64 short_pkt_fmt;
 	u64 csum_partial;
 	u64 mana_map_err;
-	int q, i = 0;
+	int q, i = 0, j;
 
 	if (!apc->port_is_up)
 		return;
@@ -241,6 +244,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 			xdp_drop = rx_stats->xdp_drop;
 			xdp_tx = rx_stats->xdp_tx;
 			xdp_redirect = rx_stats->xdp_redirect;
+			pkt_len0_err = rx_stats->pkt_len0_err;
+			for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+				coalesced_cqe[j] = rx_stats->coalesced_cqe[j];
 		} while (u64_stats_fetch_retry(&rx_stats->syncp, start));
 
 		data[i++] = packets;
@@ -248,6 +254,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 		data[i++] = xdp_drop;
 		data[i++] = xdp_tx;
 		data[i++] = xdp_redirect;
+		data[i++] = pkt_len0_err;
+		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+			data[i++] = coalesced_cqe[j];
 	}
 
 	for (q = 0; q < num_queues; q++) {
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 51d26ebeff6c..f8dd19860103 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -61,8 +61,11 @@ enum TRI_STATE {
 
 #define MAX_PORTS_IN_MANA_DEV 256
 
+/* Maximum number of packets per coalesced CQE */
+#define MANA_RXCOMP_OOB_NUM_PPI 4
+
 /* Update this count whenever the respective structures are changed */
-#define MANA_STATS_RX_COUNT 5
+#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
 #define MANA_STATS_TX_COUNT 11
 
 #define MANA_RX_FRAG_ALIGNMENT 64
@@ -73,6 +76,8 @@ struct mana_stats_rx {
 	u64 xdp_drop;
 	u64 xdp_tx;
 	u64 xdp_redirect;
+	u64 pkt_len0_err;
+	u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
 	struct u64_stats_sync syncp;
 };
 
@@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
 	u32 pkt_hash;
 }; /* HW DATA */
 
-#define MANA_RXCOMP_OOB_NUM_PPI 4
-
 /* Receive completion OOB */
 struct mana_rxcomp_oob {
 	struct mana_cqe_header cqe_hdr;
@@ -378,7 +381,6 @@ struct mana_ethtool_stats {
 	u64 tx_cqe_err;
 	u64 tx_cqe_unknown_type;
 	u64 tx_linear_pkt_cnt;
-	u64 rx_coalesced_err;
 	u64 rx_cqe_unknown_type;
 };
 
-- 
2.34.1
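
The counter indexing used above (a CQE carrying n packets is counted in slot
n - 2) can be checked with a small standalone sketch. This is not driver code:
rx_stats_sketch and account_coalesced_cqe() are hypothetical names, the
u64_stats synchronization is omitted, and the input is assumed to be the
per-packet length array of a single coalesced CQE, terminated by a zero length.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PPI 4	/* mirrors MANA_RXCOMP_OOB_NUM_PPI */

struct rx_stats_sketch {
	uint64_t pkt_len0_err;
	uint64_t coalesced_cqe[NUM_PPI - 1];	/* slots for 2, 3 and 4 packets */
};

/*
 * Account one coalesced CQE. A zero length ends the packet list.
 * Returns the number of packets found in the CQE.
 */
static int account_coalesced_cqe(struct rx_stats_sketch *st,
				 const uint32_t pkt_len[NUM_PPI])
{
	int n = 0;

	assert(st && pkt_len);

	while (n < NUM_PPI && pkt_len[n] != 0)
		n++;

	if (n == 0)
		st->pkt_len0_err++;		/* first packet has length 0 */
	else if (n >= 2)
		st->coalesced_cqe[n - 2]++;	/* 2 -> slot 0, 4 -> slot 2 */

	return n;
}
```

A CQE with exactly one packet touches neither counter, which matches the
patch: only CQEs that actually coalesced two or more packets are counted.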


^ permalink raw reply related

* [PATCH net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-02 21:35 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Long Li, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Shradha Gupta,
	Saurabh Sengar, Aditya Garg, Dipayaan Roy, Shiraz Saleem,
	linux-kernel, linux-rdma
  Cc: paulros
In-Reply-To: <1767389759-3460-1-git-send-email-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
check and process the type CQE_RX_COALESCED_4. The feature is disabled by
default, to avoid a possible latency regression.

Also add an ethtool handler to toggle this feature. To turn it on, run:
  ethtool -C <nic> rx-frames 4
To turn it off:
  ethtool -C <nic> rx-frames 1

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 32 ++++++-----
 .../ethernet/microsoft/mana/mana_ethtool.c    | 55 +++++++++++++++++++
 include/net/mana/mana.h                       |  2 +
 3 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1ad154f9db1a..a46a1adf83bc 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1330,7 +1330,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	req->update_hashkey = update_key;
 	req->update_indir_tab = update_tab;
 	req->default_rxobj = apc->default_rxobj;
-	req->cqe_coalescing_enable = 0;
+	req->cqe_coalescing_enable = apc->cqe_coalescing_enable;
 
 	if (update_key)
 		memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
@@ -1864,11 +1864,12 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 }
 
 static void mana_rx_skb(void *buf_va, bool from_pool,
-			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq,
+			int i)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
-	uint pkt_len = cqe->ppi[0].pkt_len;
+	uint pkt_len = cqe->ppi[i].pkt_len;
 	u16 rxq_idx = rxq->rxq_idx;
 	struct napi_struct *napi;
 	struct xdp_buff xdp = {};
@@ -1912,7 +1913,7 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 	}
 
 	if (cqe->rx_hashtype != 0 && (ndev->features & NETIF_F_RXHASH)) {
-		hash_value = cqe->ppi[0].pkt_hash;
+		hash_value = cqe->ppi[i].pkt_hash;
 
 		if (cqe->rx_hashtype & MANA_HASH_L4)
 			skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L4);
@@ -2047,9 +2048,11 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct mana_recv_buf_oob *rxbuf_oob;
 	struct mana_port_context *apc;
 	struct device *dev = gc->dev;
+	bool coalesced = false;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
 	bool old_fp;
+	int i = 0;
 
 	apc = netdev_priv(ndev);
 
@@ -2064,9 +2067,8 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		goto drop;
 
 	case CQE_RX_COALESCED_4:
-		netdev_err(ndev, "RX coalescing is unsupported\n");
-		apc->eth_stats.rx_coalesced_err++;
-		return;
+		coalesced = true;
+		break;
 
 	case CQE_RX_OBJECT_FENCE:
 		complete(&rxq->fence_event);
@@ -2079,14 +2081,10 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		return;
 	}
 
-	pktlen = oob->ppi[0].pkt_len;
-
-	if (pktlen == 0) {
-		/* data packets should never have packetlength of zero */
-		netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
-			   rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+nextpkt:
+	pktlen = oob->ppi[i].pkt_len;
+	if (pktlen == 0)
 		return;
-	}
 
 	curr = rxq->buf_index;
 	rxbuf_oob = &rxq->rx_oobs[curr];
@@ -2097,12 +2095,15 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	/* Unsuccessful refill will have old_buf == NULL.
 	 * In this case, mana_rx_skb() will drop the packet.
 	 */
-	mana_rx_skb(old_buf, old_fp, oob, rxq);
+	mana_rx_skb(old_buf, old_fp, oob, rxq, i);
 
 drop:
 	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
 
 	mana_post_pkt_rxq(rxq);
+
+	if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
+		goto nextpkt;
 }
 
 static void mana_poll_rx_cq(struct mana_cq *cq)
@@ -3276,6 +3277,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->port_handle = INVALID_MANA_HANDLE;
 	apc->pf_filter_handle = INVALID_MANA_HANDLE;
 	apc->port_idx = port_idx;
+	apc->cqe_coalescing_enable = 0;
 
 	mutex_init(&apc->vport_mutex);
 	apc->vport_use_count = 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 0e2f4343ac67..1b9ed5c9bbff 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -397,6 +397,58 @@ static void mana_get_channels(struct net_device *ndev,
 	channel->combined_count = apc->num_queues;
 }
 
+static int mana_get_coalesce(struct net_device *ndev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	ec->rx_max_coalesced_frames =
+		apc->cqe_coalescing_enable ? MANA_RXCOMP_OOB_NUM_PPI : 1;
+
+	return 0;
+}
+
+static int mana_set_coalesce(struct net_device *ndev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u8 saved_cqe_coalescing_enable;
+	int err;
+
+	if (ec->rx_max_coalesced_frames != 1 &&
+	    ec->rx_max_coalesced_frames != MANA_RXCOMP_OOB_NUM_PPI) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "rx-frames must be 1 or %u, got %u",
+				   MANA_RXCOMP_OOB_NUM_PPI,
+				   ec->rx_max_coalesced_frames);
+		return -EINVAL;
+	}
+
+	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+	apc->cqe_coalescing_enable =
+		ec->rx_max_coalesced_frames == MANA_RXCOMP_OOB_NUM_PPI;
+
+	if (!apc->port_is_up)
+		return 0;
+
+	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+
+	if (err) {
+		netdev_err(ndev, "Set rx-frames to %u failed:%d\n",
+			   ec->rx_max_coalesced_frames, err);
+		NL_SET_ERR_MSG_FMT(extack, "Set rx-frames to %u failed:%d",
+				   ec->rx_max_coalesced_frames, err);
+
+		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+	}
+
+	return err;
+}
+
 static int mana_set_channels(struct net_device *ndev,
 			     struct ethtool_channels *channels)
 {
@@ -517,6 +569,7 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 }
 
 const struct ethtool_ops mana_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_RX_MAX_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
 	.get_sset_count		= mana_get_sset_count,
 	.get_strings		= mana_get_strings,
@@ -527,6 +580,8 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_rxfh		= mana_set_rxfh,
 	.get_channels		= mana_get_channels,
 	.set_channels		= mana_set_channels,
+	.get_coalesce		= mana_get_coalesce,
+	.set_coalesce		= mana_set_coalesce,
 	.get_ringparam          = mana_get_ringparam,
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d7e089c6b694..51d26ebeff6c 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -556,6 +556,8 @@ struct mana_port_context {
 	bool port_is_up;
 	bool port_st_save; /* Saved port state */
 
+	u8 cqe_coalescing_enable;
+
 	struct mana_ethtool_stats eth_stats;
 
 	struct mana_ethtool_phy_stats phy_stats;
-- 
2.34.1
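
The validate, apply, and roll-back flow of the rx-frames handler can be
sketched in isolation. This is a simplified userspace model, not the driver:
port_sketch, apply_fn, set_rx_frames(), apply_ok() and apply_fail() are
hypothetical, and the callback stands in for the reconfiguration step that
the patch performs through mana_config_rss().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PPI 4	/* mirrors MANA_RXCOMP_OOB_NUM_PPI */

struct port_sketch {
	bool port_is_up;
	uint8_t cqe_coalescing_enable;
};

/* Hypothetical callback that pushes the setting to the device. */
typedef int (*apply_fn)(struct port_sketch *p);

static int set_rx_frames(struct port_sketch *p, uint32_t rx_frames,
			 apply_fn apply)
{
	uint8_t saved;
	int err;

	assert(p && apply);

	/* Only "off" (1) and "on" (NUM_PPI) are valid settings. */
	if (rx_frames != 1 && rx_frames != NUM_PPI)
		return -EINVAL;

	saved = p->cqe_coalescing_enable;
	p->cqe_coalescing_enable = (rx_frames == NUM_PPI);

	/* A port that is down picks up the new value when it comes up. */
	if (!p->port_is_up)
		return 0;

	err = apply(p);
	if (err)
		p->cqe_coalescing_enable = saved;	/* roll back on failure */

	return err;
}

/* Trivial callbacks used to exercise both outcomes. */
static int apply_ok(struct port_sketch *p) { (void)p; return 0; }
static int apply_fail(struct port_sketch *p) { (void)p; return -EIO; }
```

Saving the old value before the update keeps the cached setting consistent
with the device when the reconfiguration fails.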


^ permalink raw reply related

* [PATCH net-next, 0/2] net: mana: Add support for coalesced RX packets
From: Haiyang Zhang @ 2026-01-02 21:35 UTC (permalink / raw)
  To: linux-hyperv, netdev; +Cc: haiyangz, paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
update the RX code path, and ethtool handler. Also add counters for it.

Haiyang Zhang (2):
  net: mana: Add support for coalesced RX packets on CQE
  net: mana: Add ethtool counters for RX CQEs in coalesced type

 drivers/net/ethernet/microsoft/mana/mana_en.c | 49 +++++++++----
 .../ethernet/microsoft/mana/mana_ethtool.c    | 72 +++++++++++++++++--
 include/net/mana/mana.h                       | 12 ++--
 3 files changed, 112 insertions(+), 21 deletions(-)

-- 
2.34.1


^ permalink raw reply

* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2026-01-02 21:13 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aVgkj4V60kddKk4o@skinsburskii.localdomain>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 12:03 PM
> 
> On Fri, Jan 02, 2026 at 06:04:56PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 9:43 AM
> > >
> > > On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > > > >
> > > > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > > > >
> > > > > > [snip]
> > > > > > >
> > > > > > > Separately, in looking at this, I spotted another potential problem with
> > > > > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > > > > not clear on. To create a new region, the user space VMM issues the
> > > > > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > > > > size, and the guest PFN. The only requirement on these values is that the
> > > > > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > > > > specified where the userspace address and the guest PFN have different
> > > > > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > > > > this case? It can't create a 2 Meg mapping, right? So does it silently fallback
> > > > > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > > > > seem to be problematic for movable pages because the error wouldn't
> > > > > > > occur until the guest VM is running and takes a range fault on the region.
> > > > > > > Silently falling back to creating 4K mappings has performance implications,
> > > > > > > though I guess it would work. My question is whether the
> > > > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > > > > error immediately.
> > > > > > >
> > > > > >
> > > > > > In thinking about this more, I can answer my own question about the
> > > > > > hypervisor behavior. When HV_MAP_GPA_LARGE_PAGE is set, the full
> > > > > > list of 4K system PFNs is not provided as an input to the hypercall, so
> > > > > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > > > > sequential PFNs would be wrong, so it must return an error if the
> > > > > > alignment of a system PFN isn't on a 2 Meg boundary.
> > > > > >
> > > > > > For a pinned region, this error happens in mshv_region_map() as
> > > > > > called from  mshv_prepare_pinned_region(), so will propagate back
> > > > > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > > > > allocates one or more 2 Meg pages. So creating a pinned region
> > > > > > where the guest PFN and userspace address have different offsets
> > > > > > modulo 2 Meg might or might not succeed.
> > > > > >
> > > > > > For a movable region, the error probably can't occur.
> > > > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > > > determines the corresponding userspace addr, which won't be on
> > > > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > > > always do 4K mappings and will succeed. The downside is that a
> > > > > > movable region with a guest PFN and userspace address with
> > > > > > different offsets never gets any 2 Meg pages or mappings.
> > > > > >
> > > > > > My conclusion is the same -- such misalignment should not be
> > > > > > allowed when creating a region that has the potential to use 2 Meg
> > > > > > pages. Regions less than 2 Meg in size could be excluded from such
> > > > > > a requirement if there is benefit in doing so. It's possible to have
> > > > > > regions up to (but not including) 4 Meg where the alignment prevents
> > > > > > having a 2 Meg page, and those could also be excluded from the
> > > > > > requirement.
> > > > > >
> > > > >
> > > > > I'm not sure I understand the problem.
> > > > > There are three cases to consider:
> > > > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > > > 2. Host mapping, where page sizes are controlled by the host.
> > > >
> > > > And by "host", you mean specifically the Linux instance running in the
> > > > root partition. It hosts the VMM processes and creates the memory
> > > > regions for each guest.
> > > >
> > > > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > > > >
> > > > > The first case is not relevant here and is included for completeness.
> > > >
> > > > Agreed.
> > > >
> > > > >
> > > > > The second and third cases (host and hypervisor) share the memory layout,
> > > >
> > > > Right. More specifically, they are both operating on the same set of physical
> > > > memory pages, and hence "share" a set of what I've referred to as
> > > > "system PFNs" (to distinguish from guest PFNs, or GFNs).
> > > >
> > > > > but it is up
> > > > > to each entity to decide which page sizes to use. For example, the host might map the
> > > > > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> > > >
> > > > Agreed.
> > > >
> > > > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > > > > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> > > >
> > > > Yes, that's possible, but subject to significant requirements. A 2M page can be
> > > > used only if the underlying physical memory is a physically contiguous 2M chunk.
> > > > Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> > > > and the virtual address to which it is being mapped must be on a 2M boundary.
> > > > In the case of the host, that virtual address is the user space address in the
> > > > user space process. In the case of the hypervisor, that "virtual address" is the
> > > > location in guest physical address space; i.e., the guest PFN left-shifted 12
> > > > to be a guest physical address.
> > > >
> > > > These requirements are from the physical processor and its requirements on
> > > > page table formats as specified by the hardware architecture. Whereas the
> > > > page table entry for a 4K page contains the entire PFN, the page table entry
> > > > for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> > > > which is equivalent to requiring that the PFN be on a 2M boundary. These
> > > > requirements apply to both host and hypervisor mappings.
> > > >
> > > > When MSHV code in the host creates a new pinned region via the ioctl,
> > > > MSHV code first allocates memory for the region using pin_user_pages_fast(),
> > > > which returns the system PFN for each page of physical memory that is
> > > > allocated. If the host, at its discretion, allocates a 2M page, then a series
> > > > of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> > > > the 512 sequential PFNs must have its low order 9 bits be zero.
> > > >
> > > > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > > > the hypervisor to map the allocated memory into the guest physical
> > > > address space at a particular guest PFN. If the allocated memory contains
> > > > a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> > > > causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> > > > the hypervisor do that mapping as a 2M large page. The hypercall does not
> > > > have the option of dropping back to 4K page mappings in this case. If
> > > > the 2M alignment of the system PFN is different from the 2M alignment
> > > > of the target guest PFN, it's not possible to create the mapping and the
> > > > hypercall fails.
> > > >
> > > > The core problem is that the same 2M of physical memory wants to be
> > > > mapped by the host as a 2M page and by the hypervisor as a 2M page.
> > > > That can't be done unless the host alignment (in the VMM virtual address
> > > > space) and the guest physical address (i.e., the target guest PFN) alignment
> > > > match and are both on 2M boundaries.
> > > >
> > >
> > > But why is it a problem? If both the host and the hypervisor can map a
> > > huge page, but the guest can't, it's still a win, no?
> > > In other words, if the VMM passes a host huge page aligned region as guest
> > > unaligned, it's a VMM problem, not a hypervisor problem. And I don't
> > > understand why we would want to prevent such cases.
> > >
> >
> > Fair enough -- mostly. If you want to allow the misaligned case and live
> > with not getting the 2M mapping in the guest, that works except in the
> > situation that I described above, where the HVCALL_MAP_GPA_PAGES
> > hypercall fails when creating a pinned region.
> >
> > The failure is flakey in that if the Linux in the root partition does not
> > map any of the region as a 2M page, the hypercall succeeds and the
> > MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition
> > happens to map any of the region as a 2M page, the hypercall will fail,
> > and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such
> > flakey behavior is bad for the VMM.
> >
> > One solution is that mshv_chunk_stride() must return a stride > 1 only
> > if both the gfn (which it currently checks) AND the corresponding
> > userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
> > hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
> > misaligned case, and the failure won't occur.
> >
> 
> I think I see your point, but I also think this issue doesn't exist,
> because mshv_chunk_stride() returns a huge page stride iff:
> 1. the folio order is PMD_ORDER and
> 2. the GFN is huge page aligned and
> 3. the number of 4K pages is huge page aligned.
> 
> In other words, a host huge page won't be mapped as huge if the page
> can't be mapped as huge in the guest.

OK, I'm missing how what you say is true. For pinned regions,
the memory is allocated and mapped into the host userspace address
first, as done by mshv_prepare_pinned_region() calling mshv_region_pin(),
which calls pin_user_pages_fast(). This is all done without considering
the GFN or GFN alignment. So one or more 2M pages might be allocated
and mapped in the host before any guest mapping is looked at. Agreed?

Then mshv_prepare_pinned_region() calls mshv_region_map() to do the
guest mapping. This eventually gets down to mshv_chunk_stride(). In
mshv_chunk_stride() when your conditions #2 and #3 are met, the
corresponding struct page argument to mshv_chunk_stride() may be a
4K page that is in the middle of a 2M page instead of at the beginning
(if the region is mis-aligned). But the key point is that the 4K page in
the middle is part of a folio that will return a folio order of PMD_ORDER.
I.e., a folio order of PMD_ORDER is not sufficient to ensure that the
struct page arg is at the *start* of a 2M-aligned physical memory range
that can be mapped into the guest as a 2M page.

The problem does *not* happen with a movable region, but the reasoning
is different. hmm_range_fault() is always called with a 2M range aligned
to the GFN, which in a mis-aligned region means that the host userspace
address is never 2M aligned. So hmm_range_fault() is never able to allocate
and map a 2M page. mshv_chunk_stride() will never get a folio order > 1,
and the hypercall is never asked to do a 2M mapping. Both host and guest
mappings will always be 4K and everything works.

Michael
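
To make the alignment argument concrete, here is a hedged sketch of the
stricter stride check suggested earlier in the thread: return a 2M stride only
when both the GFN and the corresponding userspace address are 2M aligned. It
is a thought-experiment model, not the mshv code. chunk_stride_sketch(),
offsets_match_2m() and the SKETCH_* constants are hypothetical, and the
folio_is_2m flag stands in for a folio order check against PMD_ORDER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12		/* 4K base pages */
#define SKETCH_2M_PAGES		512ULL		/* 4K pages per 2M page */
#define SKETCH_2M_SIZE		(SKETCH_2M_PAGES << SKETCH_PAGE_SHIFT)

/*
 * True if the guest physical address and the userspace address have the
 * same offset modulo 2M, i.e. a 2M host page can line up with a 2M guest
 * mapping somewhere in the region.
 */
static bool offsets_match_2m(uint64_t gfn, uint64_t uaddr)
{
	return ((gfn << SKETCH_PAGE_SHIFT) % SKETCH_2M_SIZE) ==
	       (uaddr % SKETCH_2M_SIZE);
}

/*
 * Number of 4K pages to map in one step: 512 only when the backing folio
 * is a 2M folio AND the GFN, the userspace address and the page count are
 * all 2M aligned; otherwise fall back to a single 4K page.
 */
static uint64_t chunk_stride_sketch(bool folio_is_2m, uint64_t gfn,
				    uint64_t uaddr, uint64_t page_count)
{
	assert(page_count > 0);

	if (folio_is_2m &&
	    gfn % SKETCH_2M_PAGES == 0 &&
	    uaddr % SKETCH_2M_SIZE == 0 &&
	    page_count % SKETCH_2M_PAGES == 0)
		return SKETCH_2M_PAGES;

	return 1;
}
```

With the extra userspace address test, a 4K page in the middle of a 2M folio
no longer yields a 2M stride just because its folio order is PMD_ORDER, so the
large page flag would not be requested for a mis-aligned pinned region.
offsets_match_2m() shows the kind of check the ioctl could apply up front if
mis-aligned regions were to be rejected instead.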

> And this function is called for
> both movable and pinned regions, so the hypercall should never fail due to
> a huge page alignment issue.
> 
> What am I missing here?
> 
> Thanks,
> Stanislav
> 
> 
> > Michael
> >
> > >
> > > > Movable regions behave a bit differently because the memory for the
> > > > region is not allocated on the host "up front" when the region is created.
> > > > The memory is faulted in as the guest runs, and the vagaries of the current
> > > > MSHV in Linux code are such that 2M pages are never created on the host
> > > > if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> > > > to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > > > mappings, which works even with the misalignment.
> > > >
> > > > >
> > > > > This adjustment happens at runtime. Could this be the missing detail here?
> > > >
> > > > Adjustments at runtime are a different topic from the issue I'm raising,
> > > > though eventually there's some relationship. My issue occurs in the
> > > > creation of a new region, and the setting up of the initial hypervisor
> > > > mapping. I haven't thought through the details of adjustments at runtime.
> > > >
> > > > My usual caveats apply -- this is all "thought experiment". If I had the
> > > > means do some runtime testing to confirm, I would. It's possible the
> > > > hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> > > > that given the basics of how physical processors work with page tables.
> > > >
> > > > Michael

^ permalink raw reply

* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2026-01-02 20:03 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157288D26ECC9E69240CFECD4BBA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, Jan 02, 2026 at 06:04:56PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 9:43 AM
> > 
> > On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > > >
> > > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > > >
> > > > > [snip]
> > > > > >
> > > > > > Separately, in looking at this, I spotted another potential problem with
> > > > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > > > not clear on. To create a new region, the user space VMM issues the
> > > > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > > > size, and the guest PFN. The only requirement on these values is that the
> > > > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > > > specified where the userspace address and the guest PFN have different
> > > > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > > > this case? It can't create a 2 Meg mapping, right? So does it silently fallback
> > > > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > > > seem to be problematic for movable pages because the error wouldn't
> > > > > > occur until the guest VM is running and takes a range fault on the region.
> > > > > > Silently falling back to creating 4K mappings has performance implications,
> > > > > > though I guess it would work. My question is whether the
> > > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > > > error immediately.
> > > > > >
> > > > >
> > > > > In thinking about this more, I can answer my own question about the
> > > > > > hypervisor behavior. When HV_MAP_GPA_LARGE_PAGE is set, the full
> > > > > list of 4K system PFNs is not provided as an input to the hypercall, so
> > > > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > > > sequential PFNs would be wrong, so it must return an error if the
> > > > > alignment of a system PFN isn't on a 2 Meg boundary.
> > > > >
> > > > > For a pinned region, this error happens in mshv_region_map() as
> > > > > called from  mshv_prepare_pinned_region(), so will propagate back
> > > > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > > > allocates one or more 2 Meg pages. So creating a pinned region
> > > > > where the guest PFN and userspace address have different offsets
> > > > > modulo 2 Meg might or might not succeed.
> > > > >
> > > > > For a movable region, the error probably can't occur.
> > > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > > determines the corresponding userspace addr, which won't be on
> > > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > > always do 4K mappings and will succeed. The downside is that a
> > > > > movable region with a guest PFN and userspace address with
> > > > > different offsets never gets any 2 Meg pages or mappings.
> > > > >
> > > > > My conclusion is the same -- such misalignment should not be
> > > > > allowed when creating a region that has the potential to use 2 Meg
> > > > > pages. Regions less than 2 Meg in size could be excluded from such
> > > > > a requirement if there is benefit in doing so. It's possible to have
> > > > > regions up to (but not including) 4 Meg where the alignment prevents
> > > > > having a 2 Meg page, and those could also be excluded from the
> > > > > requirement.
> > > > >
> > > >
> > > > I'm not sure I understand the problem.
> > > > There are three cases to consider:
> > > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > > 2. Host mapping, where page sizes are controlled by the host.
> > >
> > > And by "host", you mean specifically the Linux instance running in the
> > > root partition. It hosts the VMM processes and creates the memory
> > > regions for each guest.
> > >
> > > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > > >
> > > > The first case is not relevant here and is included for completeness.
> > >
> > > Agreed.
> > >
> > > >
> > > > The second and third cases (host and hypervisor) share the memory layout,
> > >
> > > Right. More specifically, they are both operating on the same set of physical
> > > memory pages, and hence "share" a set of what I've referred to as
> > > "system PFNs" (to distinguish from guest PFNs, or GFNs).
> > >
> > > > but it is up
> > > > to each entity to decide which page sizes to use. For example, the host might map the
> > > > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> > >
> > > Agreed.
> > >
> > > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > > > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> > >
> > > Yes, that's possible, but subject to significant requirements. A 2M page can be
> > > used only if the underlying physical memory is a physically contiguous 2M chunk.
> > > Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> > > and the virtual address to which it is being mapped must be on a 2M boundary.
> > > In the case of the host, that virtual address is the user space address in the
> > > user space process. In the case of the hypervisor, that "virtual address" is the
> > > location in guest physical address space; i.e., the guest PFN left-shifted by 12
> > > to be a guest physical address.
> > >
> > > These requirements are from the physical processor and its requirements on
> > > page table formats as specified by the hardware architecture. Whereas the
> > > page table entry for a 4K page contains the entire PFN, the page table entry
> > > for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> > > which is equivalent to requiring that the PFN be on a 2M boundary. These
> > > requirements apply to both host and hypervisor mappings.
> > >
> > > When MSHV code in the host creates a new pinned region via the ioctl,
> > > MSHV code first allocates memory for the region using pin_user_pages_fast(),
> > > which returns the system PFN for each page of physical memory that is
> > > allocated. If the host, at its discretion, allocates a 2M page, then a series
> > > of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> > > the 512 sequential PFNs must have its low order 9 bits be zero.
> > >
> > > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > > the hypervisor to map the allocated memory into the guest physical
> > > address space at a particular guest PFN. If the allocated memory contains
> > > a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> > > causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> > > the hypervisor do that mapping as a 2M large page. The hypercall does not
> > > have the option of dropping back to 4K page mappings in this case. If
> > > the 2M alignment of the system PFN is different from the 2M alignment
> > > of the target guest PFN, it's not possible to create the mapping and the
> > > hypercall fails.
> > >
> > > The core problem is that the same 2M of physical memory wants to be
> > > mapped by the host as a 2M page and by the hypervisor as a 2M page.
> > > That can't be done unless the host alignment (in the VMM virtual address
> > > space) and the guest physical address (i.e., the target guest PFN) alignment
> > > match and are both on 2M boundaries.
> > >
> > 
> > But why is it a problem? If both the host and the hypervisor can map a
> > huge page, but the guest can't, it's still a win, no?
> > In other words, if VMM passes a host huge page aligned region as a guest
> > unaligned, it's a VMM problem, not a hypervisor problem. And I don't
> > understand why we would want to prevent such cases.
> > 
> 
> Fair enough -- mostly. If you want to allow the misaligned case and live
> with not getting the 2M mapping in the guest, that works except in the
> situation that I described above, where the HVCALL_MAP_GPA_PAGES
> hypercall fails when creating a pinned region.
> 
> The failure is flakey in that if the Linux in the root partition does not
> map any of the region as a 2M page, the hypercall succeeds and the
> MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition
> happens to map any of the region as a 2M page, the hypercall will fail,
> and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such
> flakey behavior is bad for the VMM.
> 
> One solution is that mshv_chunk_stride() must return a stride > 1 only
> if both the gfn (which it currently checks) AND the corresponding
> userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
> hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
> misaligned case, and the failure won't occur.
> 

I think I see your point, but I also think this issue doesn't exist,
because mshv_chunk_stride() returns a huge page stride iff:
1. the folio order is PMD_ORDER, and
2. the GFN is huge page aligned, and
3. the number of 4K pages is huge page aligned.

In other words, a host huge page won't be mapped as huge if the page
can't be mapped as huge in the guest. And this function is called for
both movable and pinned regions, so the hypercall should never fail due to
a huge page alignment issue.
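
The three conditions listed above can be sketched as a small standalone
function. This is only an illustration under stated assumptions, not the
kernel code: sketch_chunk_stride() and the SKETCH_* constants are
hypothetical names, and 4K base pages are assumed, so that a 2M huge page
covers 2^9 = 512 of them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K base pages, so a 2M huge page spans 2^9 = 512 of them. */
#define SKETCH_PMD_ORDER   9u
#define SKETCH_HUGE_PAGES  (UINT64_C(1) << SKETCH_PMD_ORDER)

/*
 * Hypothetical stand-in for the stride decision (not the kernel function).
 * Returns 512 only when all three listed conditions hold, otherwise 1.
 */
static uint64_t sketch_chunk_stride(unsigned int folio_order, uint64_t gfn,
				    uint64_t page_count)
{
	assert(page_count > 0);

	if (folio_order == SKETCH_PMD_ORDER &&		/* 1. folio is a 2M folio */
	    gfn % SKETCH_HUGE_PAGES == 0 &&		/* 2. GFN is 2M aligned */
	    page_count % SKETCH_HUGE_PAGES == 0)	/* 3. count is a multiple of 512 */
		return SKETCH_HUGE_PAGES;

	return 1;
}
```

In this sketch the only inputs are the folio order, the GFN, and the page
count; the userspace address and the system PFN do not appear in it.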

What am I missing here?

Thanks,
Stanislav


> Michael
> 
> > 
> > > Movable regions behave a bit differently because the memory for the
> > > region is not allocated on the host "up front" when the region is created.
> > > The memory is faulted in as the guest runs, and the vagaries of the current
> > > MSHV in Linux code are such that 2M pages are never created on the host
> > > if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> > > to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > > mappings, which works even with the misalignment.
> > >
> > > >
> > > > This adjustment happens at runtime. Could this be the missing detail here?
> > >
> > > Adjustments at runtime are a different topic from the issue I'm raising,
> > > though eventually there's some relationship. My issue occurs in the
> > > creation of a new region, and the setting up of the initial hypervisor
> > > mapping. I haven't thought through the details of adjustments at runtime.
> > >
> > > My usual caveats apply -- this is all "thought experiment". If I had the
> > > means to do some runtime testing to confirm, I would. It's possible the
> > > hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> > > that given the basics of how physical processors work with page tables.
> > >
> > > Michael


* RE: [PATCH 1/3] drivers: video: fbdev: Remove hyperv_fb driver
From: Michael Kelley @ 2026-01-02 19:23 UTC (permalink / raw)
  To: Helge Deller, Prasanna Kumar T S M, linux-fbdev@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	ssengar@linux.microsoft.com, wei.liu@kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	Thomas Zimmermann
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <e7360fcd-d507-4272-8215-89b15a797b41@gmx.de>

From: Helge Deller <deller@gmx.de> Sent: Friday, January 2, 2026 11:21 AM
> 
> On 1/2/26 20:17, Michael Kelley wrote:
> > From: Helge Deller <deller@gmx.de> Sent: Friday, January 2, 2026 11:11 AM
> >>
> >> On 1/2/26 18:45, Michael Kelley wrote:
> >>> From: Helge Deller <deller@gmx.de> Sent: Tuesday, December 30, 2025 1:07 AM
> >>>>
> >>>> On 12/27/25 05:24, Prasanna Kumar T S M wrote:
> >>>>> The HyperV DRM driver is available since 5.14. This makes the hyperv_fb
> >>>>> driver redundant, remove it.
> >>>>>
> >>>>> Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
> >>>>> ---
> >>>>>     MAINTAINERS                     |   10 -
> >>>>>     drivers/video/fbdev/Kconfig     |   11 -
> >>>>>     drivers/video/fbdev/Makefile    |    1 -
> >>>>>     drivers/video/fbdev/hyperv_fb.c | 1388 -------------------------------
> >>>>>     4 files changed, 1410 deletions(-)
> >>>>>     delete mode 100644 drivers/video/fbdev/hyperv_fb.c
> >>>>
> >>>> applied to fbdev git tree.
> >>>>
> >>>
> >>> Helge -- it looks like you picked up only this patch of the three-patch series.
> >>> The other two patches of the series are fixing up comments that reference
> >>> the hyperv_fb driver, and they affect the DRM and Hyper-V subsystems. Just
> >>> want to make sure those maintainers pick up the other two patches if that's
> >>> your intent.
> >>
> >> Since the patches #2 and #3 only fix comments, I've now applied both to
> >> the fbdev tree as well. If there will be conflicts (e.g. if maintainers pick up too),
> >> I can easily drop them again.
> >>
> >> Thanks!
> >> Helge
> >
> > Any chance you can fix the typo in the Subject line of the 3rd patch?
> > "drm/hyprev" should be "drm/hyperv".
> 
> Sure. Fixed now.
> 

All looks good! Appreciate it ...

Michael


* Re: [PATCH 1/3] drivers: video: fbdev: Remove hyperv_fb driver
From: Helge Deller @ 2026-01-02 19:21 UTC (permalink / raw)
  To: Michael Kelley, Prasanna Kumar T S M, linux-fbdev@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	ssengar@linux.microsoft.com, wei.liu@kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	Thomas Zimmermann
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415706E623885B4173D238AFD4BBA@SN6PR02MB4157.namprd02.prod.outlook.com>

On 1/2/26 20:17, Michael Kelley wrote:
> From: Helge Deller <deller@gmx.de> Sent: Friday, January 2, 2026 11:11 AM
>>
>> On 1/2/26 18:45, Michael Kelley wrote:
>>> From: Helge Deller <deller@gmx.de> Sent: Tuesday, December 30, 2025 1:07 AM
>>>>
>>>> On 12/27/25 05:24, Prasanna Kumar T S M wrote:
>>>>> The HyperV DRM driver is available since 5.14. This makes the hyperv_fb
>>>>> driver redundant, remove it.
>>>>>
>>>>> Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
>>>>> ---
>>>>>     MAINTAINERS                     |   10 -
>>>>>     drivers/video/fbdev/Kconfig     |   11 -
>>>>>     drivers/video/fbdev/Makefile    |    1 -
>>>>>     drivers/video/fbdev/hyperv_fb.c | 1388 -------------------------------
>>>>>     4 files changed, 1410 deletions(-)
>>>>>     delete mode 100644 drivers/video/fbdev/hyperv_fb.c
>>>>
>>>> applied to fbdev git tree.
>>>>
>>>
>>> Helge -- it looks like you picked up only this patch of the three-patch series.
>>> The other two patches of the series are fixing up comments that reference
>>> the hyperv_fb driver, and they affect the DRM and Hyper-V subsystems. Just
>>> want to make sure those maintainers pick up the other two patches if that's
>>> your intent.
>>
>> Since the patches #2 and #3 only fix comments, I've now applied both to
>> the fbdev tree as well. If there will be conflicts (e.g. if maintainers pick up too),
>> I can easily drop them again.
>>
>> Thanks!
>> Helge
> 
> Any chance you can fix the typo in the Subject line of the 3rd patch?
> "drm/hyprev" should be "drm/hyperv".

Sure. Fixed now.

Thanks!
Helge


* RE: [PATCH 1/3] drivers: video: fbdev: Remove hyperv_fb driver
From: Michael Kelley @ 2026-01-02 19:17 UTC (permalink / raw)
  To: Helge Deller, Prasanna Kumar T S M, linux-fbdev@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	ssengar@linux.microsoft.com, wei.liu@kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	Thomas Zimmermann
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <7d2fbfe3-eac9-421b-8e75-8d44b26fd2b3@gmx.de>

From: Helge Deller <deller@gmx.de> Sent: Friday, January 2, 2026 11:11 AM
> 
> On 1/2/26 18:45, Michael Kelley wrote:
> > From: Helge Deller <deller@gmx.de> Sent: Tuesday, December 30, 2025 1:07 AM
> >>
> >> On 12/27/25 05:24, Prasanna Kumar T S M wrote:
> >>> The HyperV DRM driver is available since 5.14. This makes the hyperv_fb
> >>> driver redundant, remove it.
> >>>
> >>> Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
> >>> ---
> >>>    MAINTAINERS                     |   10 -
> >>>    drivers/video/fbdev/Kconfig     |   11 -
> >>>    drivers/video/fbdev/Makefile    |    1 -
> >>>    drivers/video/fbdev/hyperv_fb.c | 1388 -------------------------------
> >>>    4 files changed, 1410 deletions(-)
> >>>    delete mode 100644 drivers/video/fbdev/hyperv_fb.c
> >>
> >> applied to fbdev git tree.
> >>
> >
> > Helge -- it looks like you picked up only this patch of the three-patch series.
> > The other two patches of the series are fixing up comments that reference
> > the hyperv_fb driver, and they affect the DRM and Hyper-V subsystems. Just
> > want to make sure those maintainers pick up the other two patches if that's
> > your intent.
> 
> Since the patches #2 and #3 only fix comments, I've now applied both to
> the fbdev tree as well. If there will be conflicts (e.g. if maintainers pick up too),
> I can easily drop them again.
> 
> Thanks!
> Helge

Any chance you can fix the typo in the Subject line of the 3rd patch?
"drm/hyprev" should be "drm/hyperv".

Thx ...

Michael



* Re: [PATCH 1/3] drivers: video: fbdev: Remove hyperv_fb driver
From: Helge Deller @ 2026-01-02 19:10 UTC (permalink / raw)
  To: Michael Kelley, Prasanna Kumar T S M, linux-fbdev@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	ssengar@linux.microsoft.com, wei.liu@kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	Thomas Zimmermann
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415700F34CA2A4296A542F73D4BBA@SN6PR02MB4157.namprd02.prod.outlook.com>

On 1/2/26 18:45, Michael Kelley wrote:
> From: Helge Deller <deller@gmx.de> Sent: Tuesday, December 30, 2025 1:07 AM
>>
>> On 12/27/25 05:24, Prasanna Kumar T S M wrote:
>>> The HyperV DRM driver is available since 5.14. This makes the hyperv_fb
>>> driver redundant, remove it.
>>>
>>> Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
>>> ---
>>>    MAINTAINERS                     |   10 -
>>>    drivers/video/fbdev/Kconfig     |   11 -
>>>    drivers/video/fbdev/Makefile    |    1 -
>>>    drivers/video/fbdev/hyperv_fb.c | 1388 -------------------------------
>>>    4 files changed, 1410 deletions(-)
>>>    delete mode 100644 drivers/video/fbdev/hyperv_fb.c
>>
>> applied to fbdev git tree.
>>
> 
> Helge -- it looks like you picked up only this patch of the three-patch series.
> The other two patches of the series are fixing up comments that reference
> the hyperv_fb driver, and they affect the DRM and Hyper-V subsystems. Just
> want to make sure those maintainers pick up the other two patches if that's
> your intent.

Since the patches #2 and #3 only fix comments, I've now applied both to
the fbdev tree as well. If there will be conflicts (e.g. if maintainers pick up too),
I can easily drop them again.

Thanks!
Helge


* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2026-01-02 18:04 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aVgDloDX9nMH6hZH@skinsburskii.localdomain>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 2, 2026 9:43 AM
> 
> On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > >
> > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > >
> > > > [snip]
> > > > >
> > > > > Separately, in looking at this, I spotted another potential problem with
> > > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > > not clear on. To create a new region, the user space VMM issues the
> > > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > > size, and the guest PFN. The only requirement on these values is that the
> > > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > > specified where the userspace address and the guest PFN have different
> > > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > > > this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> > > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > > seem to be problematic for movable pages because the error wouldn't
> > > > > occur until the guest VM is running and takes a range fault on the region.
> > > > > Silently falling back to creating 4K mappings has performance implications,
> > > > > though I guess it would work. My question is whether the
> > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > > error immediately.
> > > > >
> > > >
> > > > In thinking about this more, I can answer my own question about the
> > > > > hypervisor behavior. When HV_MAP_GPA_LARGE_PAGE is set, the full
> > > > list of 4K system PFNs is not provided as an input to the hypercall, so
> > > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > > sequential PFNs would be wrong, so it must return an error if the
> > > > alignment of a system PFN isn't on a 2 Meg boundary.
> > > >
> > > > For a pinned region, this error happens in mshv_region_map() as
> > > > called from  mshv_prepare_pinned_region(), so will propagate back
> > > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > > allocates one or more 2 Meg pages. So creating a pinned region
> > > > where the guest PFN and userspace address have different offsets
> > > > modulo 2 Meg might or might not succeed.
> > > >
> > > > For a movable region, the error probably can't occur.
> > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > determines the corresponding userspace addr, which won't be on
> > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > always do 4K mappings and will succeed. The downside is that a
> > > > movable region with a guest PFN and userspace address with
> > > > different offsets never gets any 2 Meg pages or mappings.
> > > >
> > > > My conclusion is the same -- such misalignment should not be
> > > > allowed when creating a region that has the potential to use 2 Meg
> > > > pages. Regions less than 2 Meg in size could be excluded from such
> > > > a requirement if there is benefit in doing so. It's possible to have
> > > > regions up to (but not including) 4 Meg where the alignment prevents
> > > > having a 2 Meg page, and those could also be excluded from the
> > > > requirement.
> > > >
> > >
> > > I'm not sure I understand the problem.
> > > There are three cases to consider:
> > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > 2. Host mapping, where page sizes are controlled by the host.
> >
> > And by "host", you mean specifically the Linux instance running in the
> > root partition. It hosts the VMM processes and creates the memory
> > regions for each guest.
> >
> > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > >
> > > The first case is not relevant here and is included for completeness.
> >
> > Agreed.
> >
> > >
> > > The second and third cases (host and hypervisor) share the memory layout,
> >
> > Right. More specifically, they are both operating on the same set of physical
> > memory pages, and hence "share" a set of what I've referred to as
> > "system PFNs" (to distinguish from guest PFNs, or GFNs).
> >
> > > but it is up
> > > to each entity to decide which page sizes to use. For example, the host might map the
> > > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> >
> > Agreed.
> >
> > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> >
> > Yes, that's possible, but subject to significant requirements. A 2M page can be
> > used only if the underlying physical memory is a physically contiguous 2M chunk.
> > Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> > and the virtual address to which it is being mapped must be on a 2M boundary.
> > In the case of the host, that virtual address is the user space address in the
> > user space process. In the case of the hypervisor, that "virtual address" is the
> > location in guest physical address space; i.e., the guest PFN left-shifted by 12
> > to be a guest physical address.
> >
> > These requirements are from the physical processor and its requirements on
> > page table formats as specified by the hardware architecture. Whereas the
> > page table entry for a 4K page contains the entire PFN, the page table entry
> > for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> > which is equivalent to requiring that the PFN be on a 2M boundary. These
> > requirements apply to both host and hypervisor mappings.
> >
> > When MSHV code in the host creates a new pinned region via the ioctl,
> > MSHV code first allocates memory for the region using pin_user_pages_fast(),
> > which returns the system PFN for each page of physical memory that is
> > allocated. If the host, at its discretion, allocates a 2M page, then a series
> > of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> > the 512 sequential PFNs must have its low order 9 bits be zero.
> >
> > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > the hypervisor to map the allocated memory into the guest physical
> > address space at a particular guest PFN. If the allocated memory contains
> > a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> > causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> > the hypervisor do that mapping as a 2M large page. The hypercall does not
> > have the option of dropping back to 4K page mappings in this case. If
> > the 2M alignment of the system PFN is different from the 2M alignment
> > of the target guest PFN, it's not possible to create the mapping and the
> > hypercall fails.
> >
> > The core problem is that the same 2M of physical memory wants to be
> > mapped by the host as a 2M page and by the hypervisor as a 2M page.
> > That can't be done unless the host alignment (in the VMM virtual address
> > space) and the guest physical address (i.e., the target guest PFN) alignment
> > match and are both on 2M boundaries.
> >
> 
> But why is it a problem? If both the host and the hypervisor can map a
> huge page, but the guest can't, it's still a win, no?
> In other words, if VMM passes a host huge page aligned region as a guest
> unaligned, it's a VMM problem, not a hypervisor problem. And I don't
> understand why we would want to prevent such cases.
> 

Fair enough -- mostly. If you want to allow the misaligned case and live
with not getting the 2M mapping in the guest, that works except in the
situation that I described above, where the HVCALL_MAP_GPA_PAGES
hypercall fails when creating a pinned region.

The failure is flakey in that if the Linux in the root partition does not
map any of the region as a 2M page, the hypercall succeeds and the
MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition
happens to map any of the region as a 2M page, the hypercall will fail,
and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such
flakey behavior is bad for the VMM.

One solution is that mshv_chunk_stride() must return a stride > 1 only
if both the gfn (which it currently checks) AND the corresponding
userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
misaligned case, and the failure won't occur.
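
The additional check proposed above can be sketched roughly as follows.
This is a standalone illustration, not a patch: the function name
sketch_stride_with_uaddr(), the SKETCH_* constants, and the choice of
passing userspace_addr as a parameter are all assumptions, and 4K base
pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 4K base pages (shift 12) and 2M huge pages (order 9). */
#define SKETCH_PAGE_SHIFT  12u
#define SKETCH_PMD_ORDER   9u
#define SKETCH_PAGE_SIZE   (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_HUGE_PAGES  (UINT64_C(1) << SKETCH_PMD_ORDER)
#define SKETCH_HUGE_SIZE   (SKETCH_PAGE_SIZE << SKETCH_PMD_ORDER)

/*
 * Hypothetical stride decision: return a stride > 1 only if the folio is a
 * 2M folio AND both the GFN and the corresponding userspace address are 2M
 * aligned. Any mismatch falls back to a stride of 1, i.e. a 4K mapping.
 */
static uint64_t sketch_stride_with_uaddr(unsigned int folio_order,
					 uint64_t gfn,
					 uint64_t userspace_addr)
{
	/* The ioctl already requires page alignment of the address. */
	assert(userspace_addr % SKETCH_PAGE_SIZE == 0);

	if (folio_order == SKETCH_PMD_ORDER &&
	    gfn % SKETCH_HUGE_PAGES == 0 &&
	    userspace_addr % SKETCH_HUGE_SIZE == 0)
		return SKETCH_HUGE_PAGES;

	return 1;
}
```

With a check of this shape, a region whose userspace address and guest PFN
have different offsets modulo 2M would always get a stride of 1, so the
large page flag would not be requested for it.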

Michael

> 
> > Movable regions behave a bit differently because the memory for the
> > region is not allocated on the host "up front" when the region is created.
> > The memory is faulted in as the guest runs, and the vagaries of the current
> > MSHV in Linux code are such that 2M pages are never created on the host
> > if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> > to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > mappings, which works even with the misalignment.
> >
> > >
> > > This adjustment happens at runtime. Could this be the missing detail here?
> >
> > Adjustments at runtime are a different topic from the issue I'm raising,
> > though eventually there's some relationship. My issue occurs in the
> > creation of a new region, and the setting up of the initial hypervisor
> > mapping. I haven't thought through the details of adjustments at runtime.
> >
> > My usual caveats apply -- this is all "thought experiment". If I had the
> > means to do some runtime testing to confirm, I would. It's possible the
> > hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> > that given the basics of how physical processors work with page tables.
> >
> > Michael


* RE: [PATCH 3/3] drm/hyprev: Remove reference to hyperv_fb driver
From: Michael Kelley @ 2026-01-02 17:45 UTC (permalink / raw)
  To: Prasanna Kumar T S M, linux-hyperv@vger.kernel.org,
	drawat.floss@gmail.com, tzimmermann@suse.de, Helge Deller
  Cc: linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	simona@ffwll.ch, airlied@gmail.com, mripard@kernel.org,
	maarten.lankhorst@linux.intel.com
In-Reply-To: <1766809906-26535-1-git-send-email-ptsm@linux.microsoft.com>

From: Prasanna Kumar T S M <ptsm@linux.microsoft.com> Sent: Friday, December 26, 2025 8:32 PM
> 

There's a typo in the "Subject:" line of this patch -- drm/hyprev should be
drm/hyperv.

Michael

> Remove hyperv_fb reference as the driver is removed.
> 
> Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
> ---
>  drivers/gpu/drm/Kconfig                   |  3 +--
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 15 +++++----------
>  2 files changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 7e6bc0b3a589..01a1438b35a0 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -407,8 +407,7 @@ config DRM_HYPERV
>  	help
>  	 This is a KMS driver for Hyper-V synthetic video device. Choose this
>  	 option if you would like to enable drm driver for Hyper-V virtual
> -	 machine. Unselect Hyper-V framebuffer driver (CONFIG_FB_HYPERV) so
> -	 that DRM driver is used by default.
> +	 machine.
> 
>  	 If M is selected the module will be called hyperv_drm.
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 013a7829182d..051ecc526832 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -1,8 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
>   * Copyright 2021 Microsoft
> - *
> - * Portions of this code is derived from hyperv_fb.c
>   */
> 
>  #include <linux/hyperv.h>
> @@ -304,16 +302,13 @@ int hyperv_update_situation(struct hv_device *hdev, u8
> active, u32 bpp,
>   * but the Hyper-V host still draws a point as an extra mouse pointer,
>   * which is unwanted, especially when Xorg is running.
>   *
> - * The hyperv_fb driver uses synthvid_send_ptr() to hide the unwanted
> - * pointer, by setting msg.ptr_pos.is_visible = 1 and setting the
> - * msg.ptr_shape.data. Note: setting msg.ptr_pos.is_visible to 0 doesn't
> + * Hide the unwanted pointer, by setting msg.ptr_pos.is_visible = 1 and setting
> + * the msg.ptr_shape.data. Note: setting msg.ptr_pos.is_visible to 0 doesn't
>   * work in tests.
>   *
> - * Copy synthvid_send_ptr() to hyperv_drm and rename it to
> - * hyperv_hide_hw_ptr(). Note: hyperv_hide_hw_ptr() is also called in the
> - * handler of the SYNTHVID_FEATURE_CHANGE event, otherwise the host still
> - * draws an extra unwanted mouse pointer after the VM Connection window is
> - * closed and reopened.
> + * The hyperv_hide_hw_ptr() is also called in the handler of the
> + * SYNTHVID_FEATURE_CHANGE event, otherwise the host still draws an extra
> + * unwanted mouse pointer after the VM Connection window is closed and reopened.
>   */
>  int hyperv_hide_hw_ptr(struct hv_device *hdev)
>  {
> --
> 2.49.0
> 



* RE: [PATCH 1/3] drivers: video: fbdev: Remove hyperv_fb driver
From: Michael Kelley @ 2026-01-02 17:45 UTC (permalink / raw)
  To: Helge Deller, Prasanna Kumar T S M, linux-fbdev@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	ssengar@linux.microsoft.com, wei.liu@kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	Thomas Zimmermann
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <e37ef037-fb4f-418c-937b-b3deb632d0ca@gmx.de>

From: Helge Deller <deller@gmx.de> Sent: Tuesday, December 30, 2025 1:07 AM
> 
> On 12/27/25 05:24, Prasanna Kumar T S M wrote:
> > The HyperV DRM driver is available since 5.14. This makes the hyperv_fb
> > driver redundant, remove it.
> >
> > Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
> > ---
> >   MAINTAINERS                     |   10 -
> >   drivers/video/fbdev/Kconfig     |   11 -
> >   drivers/video/fbdev/Makefile    |    1 -
> >   drivers/video/fbdev/hyperv_fb.c | 1388 -------------------------------
> >   4 files changed, 1410 deletions(-)
> >   delete mode 100644 drivers/video/fbdev/hyperv_fb.c
> 
> applied to fbdev git tree.
> 

Helge -- it looks like you picked up only this patch of the three-patch series.
The other two patches of the series are fixing up comments that reference
the hyperv_fb driver, and they affect the DRM and Hyper-V subsystems. Just
want to make sure those maintainers pick up the other two patches if that's
your intent.

Michael


* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2026-01-02 17:42 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41573BF52C6A4447C720CDD6D4B5A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > 
> > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > >
> > > [snip]
> > > >
> > > > Separately, in looking at this, I spotted another potential problem with
> > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > not clear on. To create a new region, the user space VMM issues the
> > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > size, and the guest PFN. The only requirement on these values is that the
> > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > specified where the userspace address and the guest PFN have different
> > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > seem to be problematic for movable pages because the error wouldn't
> > > > occur until the guest VM is running and takes a range fault on the region.
> > > > Silently falling back to creating 4K mappings has performance implications,
> > > > though I guess it would work. My question is whether the
> > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > error immediately.
> > > >
> > >
> > > In thinking about this more, I can answer my own question about the
> > > hypervisor behavior. When HVCALL_MAP_GPA_PAGES is set, the full
> > > list of 4K system PFNs is not provided as an input to the hypercall, so
> > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > sequential PFNs would be wrong, so it must return an error if the
> > > alignment of a system PFN isn't on a 2 Meg boundary.
> > >
> > > For a pinned region, this error happens in mshv_region_map() as
> > > called from mshv_prepare_pinned_region(), so will propagate back
> > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > allocates one or more 2 Meg pages. So creating a pinned region
> > > where the guest PFN and userspace address have different offsets
> > > modulo 2 Meg might or might not succeed.
> > >
> > > For a movable region, the error probably can't occur.
> > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > around the faulting guest PFN. mshv_region_range_fault() then
> > > determines the corresponding userspace addr, which won't be on
> > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > always do 4K mappings and will succeed. The downside is that a
> > > movable region with a guest PFN and userspace address with
> > > different offsets never gets any 2 Meg pages or mappings.
> > >
> > > My conclusion is the same -- such misalignment should not be
> > > allowed when creating a region that has the potential to use 2 Meg
> > > pages. Regions less than 2 Meg in size could be excluded from such
> > > a requirement if there is benefit in doing so. It's possible to have
> > > regions up to (but not including) 4 Meg where the alignment prevents
> > > having a 2 Meg page, and those could also be excluded from the
> > > requirement.
> > >
> > 
> > I'm not sure I understand the problem.
> > There are three cases to consider:
> > 1. Guest mapping, where page sizes are controlled by the guest.
> > 2. Host mapping, where page sizes are controlled by the host.
> 
> And by "host", you mean specifically the Linux instance running in the
> root partition. It hosts the VMM processes and creates the memory
> regions for each guest.
> 
> > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > 
> > The first case is not relevant here and is included for completeness.
> 
> Agreed.
> 
> > 
> > The second and third cases (host and hypervisor) share the memory layout, 
> 
> Right. More specifically, they are both operating on the same set of physical
> memory pages, and hence "share" a set of what I've referred to as
> "system PFNs" (to distinguish from guest PFNs, or GFNs).
> 
> > but it is up
> > to each entity to decide which page sizes to use. For example, the host might map the
> > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> 
> Agreed.
> 
> > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> 
> Yes, that's possible, but subject to significant requirements. A 2M page can be
> used only if the underlying physical memory is a physically contiguous 2M chunk.
> Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> and the virtual address to which it is being mapped must be on a 2M boundary.
> In the case of the host, that virtual address is the user space address in the
> user space process. In the case of the hypervisor, that "virtual address" is the
> location in guest physical address space; i.e., the guest PFN left-shifted 12
> to be a guest physical address.
> 
> These requirements are from the physical processor and its requirements on
> page table formats as specified by the hardware architecture. Whereas the
> page table entry for a 4K page contains the entire PFN, the page table entry
> for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> which is equivalent to requiring that the PFN be on a 2M boundary. These
> requirements apply to both host and hypervisor mappings.
> 
> When MSHV code in the host creates a new pinned region via the ioctl,
> MSHV code first allocates memory for the region using pin_user_pages_fast(),
> which returns the system PFN for each page of physical memory that is
> allocated. If the host, at its discretion, allocates a 2M page, then a series
> of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> the 512 sequential PFNs must have its low order 9 bits be zero.
> 
> Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> the hypervisor to map the allocated memory into the guest physical
> address space at a particular guest PFN. If the allocated memory contains
> a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> the hypervisor do that mapping as a 2M large page. The hypercall does not
> have the option of dropping back to 4K page mappings in this case. If
> the 2M alignment of the system PFN is different from the 2M alignment
> of the target guest PFN, it's not possible to create the mapping and the
> hypercall fails.
> 
> The core problem is that the same 2M of physical memory wants to be
> mapped by the host as a 2M page and by the hypervisor as a 2M page.
> That can't be done unless the host alignment (in the VMM virtual address
> space) and the guest physical address (i.e., the target guest PFN) alignment
> match and are both on 2M boundaries.
> 

But why is it a problem? If both the host and the hypervisor can map a
huge page, but the guest can't, it's still a win, no?
In other words, if VMM passes a host huge page aligned region as a guest
unaligned, it's a VMM problem, not a hypervisor problem. And I don't
understand why we would want to prevent such cases.

Thanks,
Stanislav

> Movable regions behave a bit differently because the memory for the
> region is not allocated on the host "up front" when the region is created.
> The memory is faulted in as the guest runs, and the vagaries of the current
> MSHV in Linux code are such that 2M pages are never created on the host
> if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> mappings, which works even with the misalignment.
> 
> > 
> > This adjustment happens at runtime. Could this be the missing detail here?
> 
> Adjustments at runtime are a different topic from the issue I'm raising,
> though eventually there's some relationship. My issue occurs in the
> creation of a new region, and the setting up of the initial hypervisor
> mapping. I haven't thought through the details of adjustments at runtime.
> 
> My usual caveats apply -- this is all "thought experiment". If I had the
> means to do some runtime testing to confirm, I would. It's possible the
> hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> that given the basics of how physical processors work with page tables.
> 
> Michael
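
The alignment condition debated in this thread can be written down as a
small user-space sketch. It is not MSHV code: the helper names and the
constants below are hypothetical, and it assumes 4K base pages with 512
of them per 2M page. It only shows what "the userspace address and the
guest PFN have the same offset modulo 2 Meg" means in arithmetic terms.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_4K	12			/* assumed 4K base page size */
#define PFNS_PER_2M	512ULL			/* 2M / 4K */
#define SIZE_2M		(PFNS_PER_2M << PAGE_SHIFT_4K)

/* Offset of a userspace address inside its 2M window, counted in 4K pages. */
static uint64_t uaddr_offset_in_2m(uint64_t uaddr)
{
	return (uaddr & (SIZE_2M - 1)) >> PAGE_SHIFT_4K;
}

/* Offset of a guest PFN inside its 2M window, counted in 4K pages. */
static uint64_t gfn_offset_in_2m(uint64_t gfn)
{
	return gfn & (PFNS_PER_2M - 1);
}

/*
 * Hypothetical check: true when a 2M page backing the userspace range
 * could also land on a 2M boundary in guest physical address space.
 */
static bool region_2m_offsets_match(uint64_t uaddr, uint64_t gfn)
{
	return uaddr_offset_in_2m(uaddr) == gfn_offset_in_2m(gfn);
}
```

With these helpers, a userspace address of 0x7f0000001000 paired with
guest PFN 0x201 matches (both are one 4K page past a 2M boundary), while
the same address paired with guest PFN 0x200 does not. Whether the ioctl
should reject the second case is exactly the open question above.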


* Re: [PATCH v2 1/8] KVM: SVM: Add a helper to detect VMRUN failures
From: Yosry Ahmed @ 2026-01-02 16:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, kvm, linux-hyperv, linux-kernel,
	Jim Mattson
In-Reply-To: <20251230211347.4099600-2-seanjc@google.com>

On Tue, Dec 30, 2025 at 01:13:40PM -0800, Sean Christopherson wrote:
> Add a helper to detect VMRUN failures so that KVM can guard against its
> own long-standing bug, where KVM neglects to set exitcode[63:32] when
> synthesizing a nested VMFAIL_INVALID VM-Exit.  This will allow fixing
> KVM's mess of treating exitcode as two separate 32-bit values without
> breaking KVM-on-KVM when running on an older, unfixed KVM.
> 
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>

> ---
>  arch/x86/kvm/svm/nested.c | 16 +++++++---------
>  arch/x86/kvm/svm/svm.c    |  4 ++--
>  arch/x86/kvm/svm/svm.h    |  5 +++++
>  3 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index ba0f11c68372..f5bde972a2b1 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1134,7 +1134,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>  	vmcb12->control.exit_info_1       = vmcb02->control.exit_info_1;
>  	vmcb12->control.exit_info_2       = vmcb02->control.exit_info_2;
>  
> -	if (vmcb12->control.exit_code != SVM_EXIT_ERR)
> +	if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
>  		nested_save_pending_event_to_vmcb12(svm, vmcb12);
>  
>  	if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> @@ -1425,6 +1425,9 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
>  	u32 exit_code = svm->vmcb->control.exit_code;
>  	int vmexit = NESTED_EXIT_HOST;
>  
> +	if (svm_is_vmrun_failure(exit_code))
> +		return NESTED_EXIT_DONE;
> +
>  	switch (exit_code) {
>  	case SVM_EXIT_MSR:
>  		vmexit = nested_svm_exit_handled_msr(svm);
> @@ -1432,7 +1435,7 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
>  	case SVM_EXIT_IOIO:
>  		vmexit = nested_svm_intercept_ioio(svm);
>  		break;
> -	case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
> +	case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f:
>  		/*
>  		 * Host-intercepted exceptions have been checked already in
>  		 * nested_svm_exit_special.  There is nothing to do here,
> @@ -1440,15 +1443,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
>  		 */
>  		vmexit = NESTED_EXIT_DONE;
>  		break;
> -	}
> -	case SVM_EXIT_ERR: {
> -		vmexit = NESTED_EXIT_DONE;
> -		break;
> -	}
> -	default: {
> +	default:
>  		if (vmcb12_is_intercept(&svm->nested.ctl, exit_code))
>  			vmexit = NESTED_EXIT_DONE;
> -	}
> +		break;
>  	}
>  
>  	return vmexit;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 24d59ccfa40d..c2ddf2e0aa1a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3540,7 +3540,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  			return 1;
>  	}
>  
> -	if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
> +	if (svm_is_vmrun_failure(svm->vmcb->control.exit_code)) {
>  		kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>  		kvm_run->fail_entry.hardware_entry_failure_reason
>  			= svm->vmcb->control.exit_code;
> @@ -4311,7 +4311,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>  
>  		/* Track VMRUNs that have made past consistency checking */
>  		if (svm->nested.nested_run_pending &&
> -		    svm->vmcb->control.exit_code != SVM_EXIT_ERR)
> +		    !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
>                          ++vcpu->stat.nested_run;
>  
>  		svm->nested.nested_run_pending = 0;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 01be93a53d07..0f006793f973 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -424,6 +424,11 @@ static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
>  	return container_of(vcpu, struct vcpu_svm, vcpu);
>  }
>  
> +static inline bool svm_is_vmrun_failure(u64 exit_code)
> +{
> +	return (u32)exit_code == (u32)SVM_EXIT_ERR;
> +}
> +
>  /*
>   * Only the PDPTRs are loaded on demand into the shadow MMU.  All other
>   * fields are synchronized on VM-Exit, because accessing the VMCB is cheap.
> -- 
> 2.52.0.351.gbe84eed79e-goog
> 
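
For readers who want to see why the helper compares only the low 32 bits,
here is a stand-alone user-space sketch. It is not the kernel code: the
names are hypothetical, and it assumes SVM_EXIT_ERR has the value -1. It
contrasts a strict 64-bit comparison with the truncated comparison used
by svm_is_vmrun_failure() in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: the VMRUN failure exit code is -1. */
#define DEMO_SVM_EXIT_ERR	(-1)

/* Strict comparison: requires all 64 bits of the exit code to be set. */
static bool demo_is_vmrun_failure_strict(uint64_t exit_code)
{
	return exit_code == (uint64_t)(int64_t)DEMO_SVM_EXIT_ERR;
}

/* Truncated comparison, mirroring the helper in the patch. */
static bool demo_is_vmrun_failure(uint64_t exit_code)
{
	return (uint32_t)exit_code == (uint32_t)DEMO_SVM_EXIT_ERR;
}
```

An exit code of 0x00000000ffffffff, as an unfixed KVM would synthesize
when it leaves exitcode[63:32] clear, is caught by the truncated version
but missed by the strict one. A fully populated 0xffffffffffffffff is
caught by both.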

